WebScraper

Idea

The idea for this project came from a classmate of mine, Simone Zheng.

The idea is to create a scraper that collects and “studies” the data on real estate sites, so that the various listings, selected by area, can be analyzed to work out which of the houses/apartments for sale are worth buying and which are better not to look at at all.

Currently, all the functions that calculate which of the listings is the best, based on average salary, IRPEF, etc., are still under development.

For now, the scraper only collects data from real estate sites and calculates the average cost of properties per area.
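
As a rough idea of what that calculation looks like, here is a minimal sketch, assuming the prices have been scraped as strings such as "€ 250.000" (the parse_price helper and the sample values are hypothetical, not part of the scraper):

def parse_price(text):
    # Keep only the digits of a scraped price string, e.g. "€ 250.000" -> 250000
    return int("".join(ch for ch in text if ch.isdigit()))

prices = ["€ 250.000", "€ 310.000", "€ 189.000"]  # hypothetical scraped values
average = sum(parse_price(p) for p in prices) / len(prices)
print(f"Average price: {average:.0f} €")  # -> Average price: 249667 €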

Code

The scraper is written in Python. The libraries used are:

  • BeautifulSoup (bs4)
  • pandas
  • requests
  • sys
  • getopt

Functions have also been implemented to read values from the terminal that set the area of interest and the type of property to search for.

This was implemented as follows:

arg_zone = ""
    arg_type = ""
    arg_help = "Help: -z <zone> -t <type>".format(argv[0])
    
    try:
        opts, args = getopt.getopt(argv[1:], "hz:t:", ["help", "zone=", 
        "type="])
    except:
        print(arg_help)
        sys.exit(2)
    
    for opt, arg in opts:
        if opt in ("-h", "--help"):
            print(arg_help)  # print the help message
            sys.exit(2)
        elif opt in ("-z", "--zone"):
            arg_zone = arg
        elif opt in ("-t", "--type"):
            arg_type = arg

    if arg_type == "all":
        arg_type = "case"

Inside, there is also a function that looks up how many results were found on the site, so that the various pages can be scrolled through with a for loop that increments the pag= parameter of the URL.
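
As a small illustration of that pagination logic (the result count and the zone in the URL below are hypothetical):

import math

results = 112                    # hypothetical number of results found
pages = math.ceil(results / 25)  # 25 listings per page -> 5 pages

for i in range(1, pages + 1):
    print(f"https://www.immobiliare.it/vendita-case/roma/centro/?criterio=rilevanza&pag={i}&noAste=1")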

The final code is the following:

# Import modules
import requests
import pandas as pd
from bs4 import BeautifulSoup
import sys
import getopt
import math

# Extract the digits from a string (e.g. "1.234" -> 1234)
def FindNumber(string):
    empty_string = ""
    for m in string:
        if m.isdigit():
            empty_string = empty_string + m
    return int(empty_string)

# Read the total number of results from the search page
def SearchNumberOfElements(prop_type, zone):
    url = f"https://www.immobiliare.it/vendita-{prop_type}/roma/{zone}/?criterio=rilevanza&noAste=1"
    response = requests.get(url)
    # Parse the HTML source with BeautifulSoup
    html = BeautifulSoup(response.text, 'html.parser')
    number_of_el = html.find('div', class_="in-searchList__title")
    value = number_of_el.text
    split = value.split()
    number = split[0]
    # FindNumber strips thousands separators such as "1.234"
    return FindNumber(number)

def search(argv):
    # Read the arguments from the terminal
    arg_zone = ""
    arg_type = ""
    arg_help = f"Help: {argv[0]} -z <zone> -t <type>"

    try:
        opts, args = getopt.getopt(argv[1:], "hz:t:", ["help", "zone=", "type="])
    except getopt.GetoptError:
        print(arg_help)
        sys.exit(2)

    for opt, arg in opts:
        if opt in ("-h", "--help"):
            print(arg_help)  # print the help message
            sys.exit(2)
        elif opt in ("-z", "--zone"):
            arg_zone = arg
        elif opt in ("-t", "--type"):
            arg_type = arg

    # "all" maps to the site's generic "case" (houses) search
    if arg_type == "all":
        arg_type = "case"

    locali_tot = []
    price_tot = []
    cycle = SearchNumberOfElements(arg_type, arg_zone)
    print(f"Number of elements found: {cycle}")
    # The site shows 25 listings per page
    pagine = math.ceil(cycle / 25)
    print("Starting scan and data collection")
    if pagine == 1:
        url = f"https://www.immobiliare.it/vendita-{arg_type}/roma/{arg_zone}/?criterio=rilevanza&noAste=1"
        print(url)
        # Perform the GET request
        response = requests.get(url)
        # Parse the HTML source with BeautifulSoup
        html = BeautifulSoup(response.text, 'html.parser')
        # Extract all listing titles and prices from the HTML document
        locali_html = html.find_all('a', class_="in-card__title")
        price_html = html.find_all(
            'li', class_="nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main")
        # Collect the titles in a list
        locali = []
        for locale in locali_html:
            locali.append(locale.text)
        # Collect the prices in a list
        prices = []
        for price in price_html:
            prices.append(price.text)

        locali_tot += locali
        price_tot += prices
    else:
        # Scroll through every result page by incrementing the pag= parameter
        for i in range(1, pagine + 1):
            url = f"https://www.immobiliare.it/vendita-{arg_type}/roma/{arg_zone}/?criterio=rilevanza&pag={i}&noAste=1"
            print(url)
            # Perform the GET request
            response = requests.get(url)
            # Parse the HTML source with BeautifulSoup
            html = BeautifulSoup(response.text, 'html.parser')
            # Extract all listing titles and prices from the HTML document
            locali_html = html.find_all('a', class_="in-card__title")
            price_html = html.find_all(
                'li', class_="nd-list__item in-feat__item in-feat__item--main in-realEstateListCard__features--main")
            locali = []
            for locale in locali_html:
                locali.append(locale.text)
            prices = []
            for price in price_html:
                prices.append(price.text)

            locali_tot += locali
            price_tot += prices

    return locali_tot, price_tot, arg_type


locali, prezzi, tipo_immobile = search(sys.argv)
# Placeholders for the average-price computation, still under development
somma_prezzi, count_locali, media_prezzo = 0, 0, 0

print("Generating custom csv")
df = pd.DataFrame({'Locali': locali, 'Prezzi': prezzi,
                   'Media prezzo:': media_prezzo})
df.to_csv(f'Immobiliare_{tipo_immobile}.csv', index=False, encoding='utf-8')
print("Processing finished.")

P.S.

Small reminder: scraping data online is completely LEGAL; the important thing is to stay within a certain number of requests per day. Otherwise, it is all legal, since once data is posted on a website, anyone can access it without any problem.

Thank you 😄

P.P.S.

The scraper is not even at its beta version; I dare say it’s still at the gamma version 😄, so any criticism/modification/advice is welcome.