Real estate market insights are invaluable, and Zillow, one of the largest real estate databases, is an excellent source of this data. Whether you’re analyzing market trends or exploring investment opportunities, scraping Zillow data using Python gives you an edge. To go from raw HTML to usable data, follow these steps.
Preparing What You’ll Need
Before we get started, ensure you have Python up and running on your machine. Then, grab the following libraries with just a couple of commands:
pip install requests
pip install lxml
Step 1: Get to Know Zillow’s HTML Structure
To scrape data from Zillow, understanding how the page is structured is key. Open up a property listing on Zillow and inspect the page. You're looking for key elements like:
Property title
Rent estimate price
Assessment price
By identifying these elements in the page’s HTML, you can tell your script exactly what to look for.
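To get a feel for what you're targeting, it helps to practice on a simplified snippet first. The markup and class names below are illustrative, not Zillow's real ones:

```python
from lxml.html import fromstring

# A simplified stand-in for a listing page; the class names here
# are made up for illustration, not Zillow's actual ones.
sample_html = """
<html><body>
  <h1 class="listing-title">1234 Main St, Some City, CA 90210</h1>
  <span class="price-value">Rent Zestimate: $3,200/mo</span>
  <span class="price-value">Assessment: $850,000</span>
</body></html>
"""

parser = fromstring(sample_html)

# Grab the title text and every price-like span
title = parser.xpath('//h1[@class="listing-title"]/text()')[0]
prices = parser.xpath('//span[@class="price-value"]/text()')

print(title)   # 1234 Main St, Some City, CA 90210
print(prices)
```

Once you can pull the right elements out of a toy page, the same approach applies to the real one — only the selectors change.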
Step 2: Make Your First HTTP Request
Now that we know what we're after, it's time to send a request to Zillow and grab the page. We'll use the requests library for this task. Here's how you do it:
import requests
url = "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/"
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # Raises an HTTPError if the request failed
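In practice, requests to a busy site can time out or get throttled, so a simple retry wrapper helps. This is a sketch with illustrative defaults; the function name, retry count, and backoff values are my own choices, not part of any library API:

```python
import time
import requests

def fetch(url, headers, retries=3, backoff=2.0, timeout=10):
    """Fetch a page with a timeout and simple exponential backoff.

    Retries on any requests-level failure (connection error, HTTP
    error status, timeout); re-raises after the last attempt.
    """
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries - 1:
                raise
            # Wait 2s, 4s, 8s, ... before the next attempt
            time.sleep(backoff * (2 ** attempt))
```

You would then call `fetch(url, headers)` in place of the bare `requests.get` call above.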
Step 3: Parse the HTML with lxml
Once you have the page, you need to extract the data. For this, we'll use lxml, a powerful HTML/XML parser, to turn the raw HTML into a tree Python can easily read and search through.
from lxml.html import fromstring
parser = fromstring(response.text)
Now the HTML content is in the parser variable, and we can start pulling out specific details.
Step 4: Extract Key Data Points Using XPath
XPath is a powerful query language that lets you search the HTML like a pro. Here’s how you can grab the property title, rent estimate, and assessment price:
# Extract the title. Note: the class names below are auto-generated
# by Zillow's styling framework and change frequently — always verify
# them against the current page before running the script.
title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))

# Extract the rent estimate price
rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]

# Extract the assessment price
assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]
Each XPath expression matches the tags and classes associated with a piece of data. The [-2] and [-1] indexes pick the rent estimate and assessment out of the matched text nodes, so verify that ordering against the live page.
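Because those hard-coded indexes raise an IndexError the moment the markup shifts, a small helper (my own addition, not part of lxml) lets the extraction fail soft instead of crashing:

```python
from lxml.html import fromstring

def first_or_none(nodes, index=0):
    """Return a stripped item from an XPath result list, or None when
    the list is too short — avoids IndexError if the layout changes."""
    try:
        return nodes[index].strip()
    except IndexError:
        return None

# Works on any parsed document; a tiny sample page for illustration
page = fromstring('<html><body><h1 class="t">123 Main St</h1></body></html>')
print(first_or_none(page.xpath('//h1[@class="t"]/text()')))       # 123 Main St
print(first_or_none(page.xpath('//span[@class="gone"]/text()')))  # None
```

Wrapping each `parser.xpath(...)` call this way turns a missing element into a `None` you can filter out later, rather than a mid-run crash.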
Step 5: Save the Data to JSON
Once you’ve got your data, it’s time to store it. The easiest way is by saving it as a JSON file, which is perfect for further analysis or storage.
import json
property_data = {
    'title': title,
    'Rent estimate price': rent_estimate_price,
    'Assessment price': assessment_price
}

# Save data to a JSON file
with open('zillow_property_data.json', 'w') as f:
    json.dump(property_data, f, indent=4)

print("Data saved to zillow_property_data.json")
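The scraped prices arrive as strings like "$3,200/mo", which are awkward to analyze later. If you want numeric values in your JSON, a small normalizer helps; this helper and its name are my own sketch, assuming US-style dollar formatting:

```python
import re

def parse_price(text):
    """Convert a price string like '$3,200/mo' or '$850,000' to a float.
    Returns None when the string contains no digits."""
    digits = re.sub(r'[^\d.]', '', text)
    return float(digits) if digits else None

print(parse_price('$3,200/mo'))  # 3200.0
print(parse_price('$850,000'))   # 850000.0
print(parse_price('N/A'))        # None
```

Applying this before `json.dump` means the saved file holds numbers you can sum and average directly.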
Step 6: Deal with Multiple Listings
Want to scrape data from multiple Zillow pages? It’s easy. You just need to loop through a list of URLs. Here's how:
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]

all_properties = []

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    parser = fromstring(response.text)

    # Extract data for each property
    title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
    rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
    assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]

    property_data = {
        'title': title,
        'Rent estimate price': rent_estimate_price,
        'Assessment price': assessment_price
    }
    all_properties.append(property_data)

# Save all property data to a JSON file
with open('all_zillow_properties.json', 'w') as f:
    json.dump(all_properties, f, indent=4)

print("All property data saved to all_zillow_properties.json")
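When looping over many URLs, pause between requests so you don't hammer the server. A jittered delay is a common pattern; the function below and its default values are my own illustrative choices:

```python
import time
import random

def polite_delay(base=1.0, jitter=2.0):
    """Sleep for base plus a random jitter (in seconds) between
    requests, and return the delay used for optional logging."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Calling `polite_delay()` at the end of each loop iteration spaces your requests 1–3 seconds apart, and the randomness makes the traffic look less machine-like than a fixed interval.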
Full Script for Scraping Multiple Listings
Here's the complete script, from sending requests to saving data:
import requests
from lxml.html import fromstring
import json

# Define URLs to scrape
urls = [
    "https://www.zillow.com/homedetails/1234-Main-St-Some-City-CA-90210/12345678_zpid/",
    "https://www.zillow.com/homedetails/5678-Another-St-Some-City-CA-90210/87654321_zpid/"
]

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
}

all_properties = []

for url in urls:
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    parser = fromstring(response.text)

    # Extract data for each property
    title = ' '.join(parser.xpath('//h1[@class="Text-c11n-8-99-3__sc-aiai24-0 dFxMdJ"]/text()'))
    rent_estimate_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-2]
    assessment_price = parser.xpath('//span[@class="Text-c11n-8-99-3__sc-aiai24-0 dFhjAe"]//text()')[-1]

    property_data = {
        'title': title,
        'Rent estimate price': rent_estimate_price,
        'Assessment price': assessment_price
    }
    all_properties.append(property_data)

# Save the data
with open('all_zillow_properties.json', 'w') as f:
    json.dump(all_properties, f, indent=4)

print("All property data saved successfully!")
Final Thoughts
By following these steps, you can extract property data from Zillow to fuel your real estate analysis. Be sure to respect the website's terms of use and robots.txt, and if you're scraping at scale, consider using proxies and rotating user agents to reduce the chance of being blocked.
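Rotating user agents can be as simple as picking a header at random for each request. The strings below are real-format examples that should be refreshed periodically; the helper name is my own:

```python
import random

# A small pool of example desktop user-agent strings
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0',
]

def random_headers():
    """Build request headers with a randomly chosen user agent."""
    return {'user-agent': random.choice(USER_AGENTS)}
```

Pass `random_headers()` in place of the fixed headers dict on each request so successive requests don't all present the identical fingerprint.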