This Python script uses the Scrapy, requests, and price_parser libraries to scrape a website that lists properties for sale. It extracts details about each property such as price, title, address, number of baths and rooms, area, owner info, owner url, and coordinates (latitude, longitude).
Libraries
- Scrapy: An open-source web-crawling framework for Python.
- requests: A library to send all kinds of HTTP requests.
- price_parser: A library to extract price and currency from raw text strings.
Let's dissect this script step-by-step:
Import Libraries
from scrapy import Selector
import requests
from urllib.parse import urljoin
from price_parser import Price
The above lines import the necessary Python libraries for the script.
Setting the Initial Variables
response = requests.get("https://www.pisos.com/venta/pisos-cedeira/")
sel = Selector(response)
home_url = "https://www.pisos.com"
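The script sends a GET request to the listing page and wraps the response in Scrapy's Selector class, which gives an object that can be queried with CSS selectors to parse the HTML. The home_url variable holds the site root and is used later to build absolute ad URLs.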
Number Filtering Function
def number_filtering(number):
    # Integers pass through unchanged
    if type(number) == int:
        return number
    # Floats are rounded to the nearest integer
    if type(number) == float:
        return round(number)
    # Strings are parsed by price_parser; amount is a Decimal or None
    if type(number) == str:
        number = Price.fromstring(number)
        number = number.amount
    if number is None:
        return None
    try:
        return int(number)
    except Exception:
        return float(number)
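This function normalizes a value into a number. An integer is returned unchanged and a float is rounded to the nearest integer. A string is parsed with price_parser, and the resulting amount is converted with int(), which drops any fractional part; the float fallback is only used if int() raises. If no amount can be extracted, the function returns None.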
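To see the final conversion step in isolation, here is a minimal sketch that skips price_parser and feeds in Decimal values directly, since Price.amount is a Decimal (or None when nothing could be parsed). The helper name amount_to_number is made up for this example.

```python
from decimal import Decimal


def amount_to_number(amount):
    """Mirror the tail of number_filtering: None stays None, else int, else float."""
    if amount is None:
        return None
    try:
        # int() on a Decimal truncates toward zero, it does not round
        return int(amount)
    except Exception:
        # Only reached when int() fails, e.g. for Decimal("NaN")
        return float(amount)


print(amount_to_number(Decimal("120000")))  # 120000
print(amount_to_number(Decimal("85.5")))    # 85
print(amount_to_number(None))               # None
```

Note that a parsed amount such as 85.5 comes back as 85, so any field where decimals matter would need a different conversion.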
Get Text Between Substrings Function
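def get_text_between(full_string, start_substring, end_substring):
    # Check find() before adding the offset, otherwise -1 is never detected
    start = full_string.find(start_substring)
    if start == -1:
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]
This function takes three arguments: the full string and two delimiter substrings. It returns the text between the first occurrence of the start substring and the next occurrence of the end substring, or an empty string if either one is not found.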
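A quick check on a made-up script snippet shows what the helper returns. The helper is restated so the snippet runs on its own, with each find() result checked before it is used; the sample string and its values are invented for illustration.

```python
def get_text_between(full_string, start_substring, end_substring):
    start = full_string.find(start_substring)
    if start == -1:
        return ""
    start += len(start_substring)
    end = full_string.find(end_substring, start)
    return "" if end == -1 else full_string[start:end]


# Invented sample resembling an inline script with coordinates
sample = "var _Lat = 43.6601; var _Long = -8.0563;"

print(get_text_between(sample, "_Lat = ", ";"))         # 43.6601
print(get_text_between(sample, "_Long = ", ";"))        # -8.0563
print(repr(get_text_between(sample, "_Zoom = ", ";")))  # ''
```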
Get Latitude and Longitude Function
def get_lat_lon(response):
    selector = Selector(response)
    # Read the inline script once; fall back to "" if no script matches
    script = selector.css("script[type='text/javascript'] ::text").get() or ""
    lat = get_text_between(script, "_Lat = ", ";")
    lon = get_text_between(script, "_Long = ", ";")
    return lat, lon
This function extracts the latitude and longitude from the JavaScript embedded in the page's HTML. It reads the text of the first matching script element and looks for the values assigned to _Lat and _Long. Both values are returned as strings.
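As an alternative sketch, the same two values can be pulled out of the raw HTML with a regular expression and converted to floats. This swaps the Selector and get_text_between approach for the standard library's re module. It assumes the page assigns the coordinates in the form `_Lat = <number>;` and `_Long = <number>;`, and the function name and sample HTML are invented for this example.

```python
import re

# Matches e.g. "_Lat = 43.6601;" and captures the number
COORD_PATTERN = r"{name}\s*=\s*(-?\d+(?:\.\d+)?)\s*;"


def extract_coordinates(html):
    """Return (lat, lon) as floats, with None for a value that is not found."""
    values = []
    for name in ("_Lat", "_Long"):
        match = re.search(COORD_PATTERN.format(name=name), html)
        values.append(float(match.group(1)) if match else None)
    return tuple(values)


# Invented sample HTML
sample_html = "<script type='text/javascript'>var _Lat = 43.6601; var _Long = -8.0563;</script>"

print(extract_coordinates(sample_html))      # (43.6601, -8.0563)
print(extract_coordinates("<p>no map</p>"))  # (None, None)
```

Searching the whole document avoids depending on which script element comes first, at the cost of being tied to the exact variable naming.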
Parse Ad Function
def parse_ad(ad_response):
    ...
    print(f"Price: {price}")
    print(f"Title: {title}")
    print(f"Address: {address}")
    print(f"N_baths: {n_baths}")
    print(f"N_rooms: {n_rooms}")
    print(f"Area: {area}")
    print(f"Owner info: {owner_info}")
    print(f"Owner url: {owner_url}")
    print(f"Description: {description}")
    print(f"Source id: {source_id}")
    print(f"Latitude: {lat}")
    print(f"Longitude: {lon}")
    print("=============================================================")
This function parses the HTML of an ad and prints out the data about the property. It extracts the price, title, address, number of baths and rooms, area, owner info, owner url, description, source id, and coordinates (latitude, longitude) from the ad's HTML.
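If the data is needed beyond the console, one option is to collect the fields into a record instead of printing them. The sketch below is an assumption about how that could look, not the script's own code: the PropertyAd class and the sample values are invented, and only the field names come from the print statements above.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class PropertyAd:
    """Hypothetical container for the fields that parse_ad prints."""
    price: Optional[int] = None
    title: Optional[str] = None
    address: Optional[str] = None
    n_baths: Optional[int] = None
    n_rooms: Optional[int] = None
    area: Optional[int] = None
    owner_info: Optional[str] = None
    owner_url: Optional[str] = None
    description: Optional[str] = None
    source_id: Optional[str] = None
    lat: Optional[str] = None
    lon: Optional[str] = None


# Invented values; fields that were not found simply stay None
ad = PropertyAd(price=120000, title="Sample flat", n_rooms=3, n_baths=1, area=85)
record = asdict(ad)
print(record["price"], record["n_rooms"], record["address"])  # 120000 3 None
```

A dictionary like record can then be written to CSV or JSON with the standard library.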
Parse All Ads
all_ads = sel.css("div.ad-preview")
for ad in all_ads:
    url = ad.css("a::attr(href)").get()
    ad_response = requests.get(urljoin(home_url, url))
    parse_ad(ad_response)
Finally, the script iterates over all ad preview divs, sends a request to each ad's URL, and then parses the response with the parse_ad() function.
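The URL joining in that loop can be tried without any network access. In this sketch the href values are made up; it shows how urljoin resolves a relative link against home_url, leaves an absolute link alone, and skips a missing href (the case where .get() returns None) so that no request is made for something that is not an ad.

```python
from urllib.parse import urljoin

home_url = "https://www.pisos.com"

# Invented hrefs: one relative, one absolute, one missing
hrefs = [
    "/comprar/piso-ejemplo-123/",
    "https://www.pisos.com/comprar/piso-456/",
    None,
]

ad_urls = []
for href in hrefs:
    if href is None:
        # No link found in this preview, so there is nothing to fetch
        continue
    ad_urls.append(urljoin(home_url, href))

for ad_url in ad_urls:
    print(ad_url)
```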
Full code -> https://gist.github.com/VictorLG98/994874841e52213cf20e7c2a91ee781a
Video on my Youtube -> linktree