Vic

Posted on Jul 23, 2023

Scraping a Real Estate Website

#scraping #python #tutorial

This Python script uses the Scrapy, requests, and price_parser libraries to scrape a website that lists properties for sale. For each property it extracts the price, title, address, number of bathrooms and rooms, area, owner info, owner URL, description, source id, and coordinates (latitude and longitude).

Libraries

Scrapy: An open-source web-crawling framework for Python. This script only uses its Selector class to query HTML with CSS selectors.
requests: A library to send all kinds of HTTP requests.
price_parser: A library to extract price and currency from raw text strings.

Let's dissect this script step-by-step:

Import Libraries

from scrapy import Selector
import requests
from urllib.parse import urljoin
from price_parser import Price

The above lines import the necessary Python libraries for the script.

Setting the Initial Variables

response = requests.get("https://www.pisos.com/venta/pisos-cedeira/")
sel = Selector(response)

home_url = "https://www.pisos.com"

The script sends a GET request to the listing page and wraps the response in Scrapy's Selector class, which gives an object for querying the HTML. home_url is the base address used later to turn relative ad links into absolute URLs.

Number Filtering Function

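def number_filtering(number):
    # Integers pass through unchanged
    if isinstance(number, int):
        return number
    # Floats are rounded to the nearest integer
    if isinstance(number, float):
        return round(number)
    if isinstance(number, str):
        # price_parser extracts the numeric amount as a Decimal (or None)
        amount = Price.fromstring(number).amount
        if amount is None:
            return None
        try:
            return int(amount)  # drops any fractional part
        except Exception:
            return float(amount)
    # Any other type is not handled
    return None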

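This function normalizes a value to a number. Integers are returned unchanged and floats are rounded to the nearest integer. Strings are parsed with price_parser, and the extracted amount is returned as an integer (int() drops any fractional part), falling back to a float only if that conversion fails. If no number is found in the string, the function returns None.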
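
To make the two conversions concrete, here is a small standard-library check. It assumes, as the price_parser documentation shows, that Price.amount is a decimal.Decimal. The sample values are made up for illustration.

```python
from decimal import Decimal

# Price.amount is a Decimal, so the string branch ends up calling int() on one.
whole = int(Decimal("129000"))
fractional = int(Decimal("85.75"))  # int() truncates, it does not round

# The float branch uses round(), which rounds halves to the nearest even integer.
rounded = [round(85.75), round(2.5), round(3.5)]

print(whole, fractional, rounded)
```

So a value with decimals loses its fractional part when it arrives as a string, while the same value passed as a float is rounded.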

Get Text Between Substrings Function

def get_text_between(full_string, start_substring, end_substring):
    start = full_string.find(start_substring)
    if start == -1:  # start marker not found
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)  # search after the start marker
    return "" if end == -1 else full_string[start:end]

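This function takes three arguments: the full string, a start substring, and an end substring. It returns the text between the first occurrence of the start substring and the next occurrence of the end substring, returning an empty string when a marker is not found.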
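
A quick usage sketch on a made-up script string. The marker names _Lat and _Long come from the next section; the coordinate values and the _Zoom marker are invented. The helper is repeated here, with an explicit check for a missing start marker, so the snippet runs on its own.

```python
def get_text_between(full_string, start_substring, end_substring):
    start = full_string.find(start_substring)
    if start == -1:  # start marker not found
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]


# Hypothetical inline JavaScript, similar in shape to what the page might contain
sample = "var _Lat = 43.6601; var _Long = -8.0567;"

lat = get_text_between(sample, "_Lat = ", ";")
lon = get_text_between(sample, "_Long = ", ";")
missing = get_text_between(sample, "_Zoom = ", ";")  # marker that is not present

print(repr(lat), repr(lon), repr(missing))
```

Note that the results are strings, and a missing marker yields an empty string rather than an error.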

Get Latitude and Longitude Function

def get_lat_lon(response):
    selector = Selector(response)
    # Query the script text once; fall back to "" if no matching script exists
    script_text = selector.css("script[type='text/javascript'] ::text").get() or ""
    lat = get_text_between(script_text, "_Lat = ", ";")
    lon = get_text_between(script_text, "_Long = ", ";")
    return lat, lon

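This function extracts the latitude and longitude from the JavaScript embedded in the page's HTML. It reads the first matching inline script and returns the text that follows "_Lat = " and "_Long = ", up to the next semicolon. Both values are returned as strings, not floats.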
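
If numeric coordinates are needed, the strings have to be converted. The sketch below is an alternative, not the method used in the script: it swaps get_text_between for a regular expression and converts to float, returning None for any value that is absent. The helper name extract_coordinates and the sample script text are hypothetical.

```python
import re


def extract_coordinates(script_text):
    """Return (lat, lon) as floats, or None for any value that is not found."""

    def find_number(name):
        # Match e.g. "_Lat = 43.6601;" and capture the signed decimal number
        pattern = rf"{re.escape(name)}\s*=\s*(-?\d+(?:\.\d+)?)\s*;"
        match = re.search(pattern, script_text)
        return float(match.group(1)) if match else None

    return find_number("_Lat"), find_number("_Long")


# Made-up script text for illustration
sample = "var _Lat = 43.6601; var _Long = -8.0567;"
coords = extract_coordinates(sample)
no_coords = extract_coordinates("var other = 1;")

print(coords, no_coords)
```

Returning None instead of an empty string makes a missing coordinate easy to tell apart from a valid one.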

Parse Ad Function

def parse_ad(ad_response):
    ...
    print(f"Price: {price}")
    print(f"Title: {title}")
    print(f"Address: {address}")
    print(f"N_baths: {n_baths}")
    print(f"N_rooms: {n_rooms}")
    print(f"Area: {area}")
    print(f"Owner info: {owner_info}")
    print(f"Owner url: {owner_url}")
    print(f"Description: {description}")
    print(f"Source id: {source_id}")
    print(f"Latitude: {lat}")
    print(f"Longitude: {lon}")
    print("=============================================================")

This function parses the HTML of a single ad and prints the data about the property: price, title, address, number of bathrooms and rooms, area, owner info, owner URL, description, source id, and coordinates (latitude and longitude). The extraction logic is abbreviated as ... here; it is part of the full code linked at the end of the post.

Parse All Ads

all_ads = sel.css("div.ad-preview")
for ad in all_ads:
    url = ad.css("a::attr(href)").get()
    ad_response = requests.get(urljoin(home_url, url))
    parse_ad(ad_response)

Finally, the script iterates over all ad preview divs, sends a request to each ad's URL, and then parses the response with the parse_ad() function.
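
The urljoin call is what makes relative links from the listing page usable. A short illustration with hypothetical paths (the actual href values on the site may look different):

```python
from urllib.parse import urljoin

home_url = "https://www.pisos.com"

# A root-relative href (hypothetical path) is resolved against the base URL
relative = urljoin(home_url, "/comprar/piso-example-123/")

# An href that is already absolute is returned unchanged
absolute = urljoin(home_url, "https://www.pisos.com/comprar/piso-example-456/")

print(relative)
print(absolute)
```

Because urljoin handles both cases, the loop does not need to check whether a link is relative or absolute before requesting it.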

Full code -> https://gist.github.com/VictorLG98/994874841e52213cf20e7c2a91ee781a

Video on my YouTube -> linktree
