Imagine having the ability to collect real-time hotel data from Booking.com with just a few lines of code. Whether you’re a data analyst, a developer, or a business owner looking to dive into competitive analysis, scraping data from a site like Booking.com can open up a treasure trove of insights. But how do you go about it? Here’s your guide to extracting crucial information using Python.
Setting Up the Necessary Tools
Before diving into the code, let's first get your environment ready. You’ll need a few Python libraries that will do the heavy lifting for you. Here’s what you’ll be using:
- Requests – Sends HTTP requests to fetch the HTML content.
- LXML – Parses the HTML and extracts the necessary data using XPath.
- JSON – Handles embedded JSON data.
- CSV – Saves the extracted data into a CSV file for easy analysis.
To get these libraries up and running, simply install them using pip:
pip install requests lxml
No need to worry about JSON and CSV—they come pre-installed with Python.
How Data is Structured on Booking.com
Now, let’s talk about how data is structured on Booking.com. Each hotel page embeds JSON-LD, a structured-data format that sits inside a script tag of type application/ld+json. It typically carries details such as the hotel's name, address, price range, and rating, so you can read those fields directly instead of picking them out of the surrounding HTML.
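To make that layout concrete, here is a small sketch of what such a block might look like once it has been pulled out of a page. The payload is invented for illustration (the hotel name, URL, and values are all hypothetical); only the key names mirror the fields read later in this guide, and a real page may carry more or fewer keys.

```python
import json

# Hypothetical JSON-LD payload shaped like the fields this guide reads later.
# Every value here is invented for illustration.
sample_json_ld = """
{
  "@context": "https://schema.org",
  "@type": "Hotel",
  "name": "Example Hotel",
  "hasMap": "https://example.com/map",
  "priceRange": "$$",
  "aggregateRating": {"@type": "AggregateRating", "ratingValue": 8.7, "reviewCount": 1234},
  "address": {"@type": "PostalAddress", "streetAddress": "1 Example Street"}
}
"""

hotel = json.loads(sample_json_ld)

# Nested JSON objects become nested dictionaries.
print(hotel["name"])
print(hotel["aggregateRating"]["ratingValue"])
print(sorted(hotel.keys()))
```

Once the text is parsed, every field is an ordinary dictionary lookup, which is why the rest of the guide works with plain Python dicts.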
How to Get the Data
Booking.com, like many websites, takes steps to prevent automated scraping. The two techniques below, browser-like headers and proxies, reduce the chance of your requests being blocked.
Mimic a Legitimate User with Custom HTTP Headers
To avoid triggering anti-scraping systems, we need to make our requests look like they’re coming from a regular user. This means adding custom headers to our HTTP requests. Here’s the code to send a request with the proper headers:
import requests
from lxml.html import fromstring

# List of hotel URLs to scrape
urls_list = ["https://example.com/hotel1", "https://example.com/hotel2"]

for url in urls_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    response = requests.get(url, headers=headers)
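A request can still fail or come back with a block page, so it may help to wrap the call in a small helper. The sketch below is one possible approach, not part of the original script: `fetch_html` and `DEFAULT_HEADERS` are hypothetical names, and the 10-second timeout is an arbitrary assumption. The helper accepts any object with a requests-style `get` method, such as `requests.Session()`.

```python
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def fetch_html(session, url, timeout=10):
    """Return the page HTML, or None if the request fails.

    `session` is any object with a requests-style `get` method,
    for example `requests.Session()`.
    """
    try:
        response = session.get(url, headers=DEFAULT_HEADERS, timeout=timeout)
    except OSError as exc:  # requests' exceptions derive from OSError
        print(f"Request to {url} failed: {exc}")
        return None
    if response.status_code != 200:
        print(f"Unexpected status {response.status_code} for {url}")
        return None
    return response.text


# Usage (requires the requests package):
#   import requests
#   html = fetch_html(requests.Session(), "https://example.com/hotel1")
```

Returning `None` instead of raising lets the scraping loop skip a bad URL and carry on with the next one.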
Using Proxies to Avoid Detection
Booking.com applies strict request rate limits and tracks IP addresses. To reduce the risk of blocks, we route requests through proxies, which spread traffic across multiple IPs so that no single address exceeds the limit. Here’s how you can pass proxies to Requests:
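# Keys are the scheme of the target URL; values are the proxy to route through.
# Many proxies are reached over plain http:// even for HTTPS targets, so check
# your provider's connection details before keeping the https:// prefix below.
proxies = {
    'http': 'http://your_proxy_here',
    'https': 'https://your_proxy_here'
}
response = requests.get(url, headers=headers, proxies=proxies)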
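A single proxy only moves the traffic to a different IP; spreading requests means rotating through several. The sketch below shows one simple way to do that, under some assumptions: the proxy addresses are placeholders, `proxy_cycle` and `polite_delay` are hypothetical helper names, the proxies are assumed to be plain HTTP proxies (so the same address serves both keys), and the delay values are arbitrary.

```python
import itertools
import random

# Placeholder proxy addresses; replace with ones from your provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_cycle(pool):
    """Yield requests-style proxies dicts, looping over the pool forever."""
    for address in itertools.cycle(pool):
        yield {"http": address, "https": address}


def polite_delay(base=2.0, jitter=1.0):
    """Return a pause in seconds between base and base + jitter."""
    return base + random.uniform(0, jitter)


proxies_iter = proxy_cycle(PROXY_POOL)
for _ in range(4):
    # The fourth pick wraps around to the first proxy again.
    print(next(proxies_iter)["https"], round(polite_delay(), 2))
```

In the scraping loop you would pass `next(proxies_iter)` as the `proxies=` argument and call `time.sleep(polite_delay())` between requests, so that each IP sees only a fraction of the traffic at an irregular pace.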
HTML Parsing and JSON Extraction
Once we’ve made a successful request, we need to parse the page and pull out the data. We’ll extract the JSON-LD data embedded within the HTML, which contains all the hotel details you’re after.
import json

parser = fromstring(response.text)
# Extract JSON data embedded in the page
embedded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
# Assumes the first JSON-LD block is the one describing the hotel
json_data = json.loads(embedded_jsons[0])
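Taking the first block works only when the hotel data happens to come first, and it raises an error if the page has no JSON-LD at all. A more defensive variant could scan every block and pick the one that describes a hotel. This is a sketch under assumptions: `find_hotel_block` is a hypothetical helper, and matching on an `@type` of "Hotel" is a guess about the payload that you should verify against a real page.

```python
import json


def find_hotel_block(raw_blocks, wanted_type="Hotel"):
    """Return the first JSON-LD object whose @type matches, else None."""
    for raw in raw_blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks
        # A script tag may hold one object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == wanted_type:
                return item
    return None


# Invented input standing in for the list returned by the XPath query.
blocks = [
    '{"@type": "BreadcrumbList", "itemListElement": []}',
    "not valid json",
    '{"@type": "Hotel", "name": "Example Hotel"}',
]
print(find_hotel_block(blocks))
```

With this helper, `json_data = find_hotel_block(embedded_jsons)` returns `None` for pages without usable data, which the loop can check before reading any fields.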
Pulling Out Specific Hotel Information
Now comes the fun part: extracting the data you need. Below is a snippet that pulls specific fields like the hotel name, rating, price range, and more:
name = json_data['name']
location = json_data['hasMap']  # URL of a map showing the hotel
price_range = json_data['priceRange']
rating_value = json_data['aggregateRating']['ratingValue']
review_count = json_data['aggregateRating']['reviewCount']
address = json_data['address']['streetAddress']
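Direct indexing raises a `KeyError` as soon as one field is absent, for example when a listing has no rating yet. The sketch below shows a more tolerant alternative using `dict.get`; `extract_hotel_fields` is a hypothetical helper name, and the sample input is invented.

```python
def extract_hotel_fields(json_data):
    """Map a JSON-LD dict to a flat row, using None for missing keys."""
    rating = json_data.get("aggregateRating")
    if not isinstance(rating, dict):
        rating = {}
    address = json_data.get("address")
    if not isinstance(address, dict):
        address = {}
    return {
        "Name": json_data.get("name"),
        "Location": json_data.get("hasMap"),
        "Price Range": json_data.get("priceRange"),
        "Rating": rating.get("ratingValue"),
        "Review Count": rating.get("reviewCount"),
        "Address": address.get("streetAddress"),
    }


# Invented example: a listing without rating or price information.
row = extract_hotel_fields({
    "name": "Example Hotel",
    "address": {"streetAddress": "1 Example Street"},
})
print(row)
```

The returned dictionary uses the same column names as the CSV step, so it can be appended to the results list as-is.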
Saving Data in a CSV File
Once you’ve extracted the data, it’s time to save it for analysis. We’ll store the results in a CSV file for easy access later:
import csv

all_data = []

# Append data for each hotel to the list
all_data.append({
    "Name": name,
    "Location": location,
    "Price Range": price_range,
    "Rating": rating_value,
    "Review Count": review_count,
    "Address": address
})

# Write data to a CSV
with open('hotel_data.csv', 'w', newline='') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_data)

print("Data saved to hotel_data.csv!")
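If some rows are missing fields or carry extra keys, `csv.DictWriter` can be told how to handle both cases. The sketch below is one way to set that up; `write_hotels` and `FIELDNAMES` are hypothetical names, and the sample row is invented. It writes to an in-memory buffer so the result can be inspected without touching the disk.

```python
import csv
import io

FIELDNAMES = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address"]


def write_hotels(rows, fileobj):
    """Write rows to an open text file object and return the row count."""
    writer = csv.DictWriter(
        fileobj,
        fieldnames=FIELDNAMES,
        restval="",             # fill missing columns with an empty string
        extrasaction="ignore",  # silently drop keys that are not columns
    )
    writer.writeheader()
    writer.writerows(rows)
    return len(rows)


buffer = io.StringIO()
count = write_hotels([{"Name": "Example Hotel", "Rating": 8.7, "Unused": "x"}], buffer)
print(buffer.getvalue())
```

To write a real file, open it with `open('hotel_data.csv', 'w', newline='', encoding='utf-8')` and pass the file object to `write_hotels` in place of the buffer; the explicit encoding helps with hotel names that contain non-ASCII characters.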
The Full Code
Here’s a quick recap of the entire scraping process from start to finish:
import requests
from lxml.html import fromstring
import json
import csv

urls_list = ["https://example.com/hotel1", "https://example.com/hotel2"]
all_data = []
proxies = {'http': 'http://your_proxy_here', 'https': 'https://your_proxy_here'}

for url in urls_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }
    response = requests.get(url, headers=headers, proxies=proxies)
    parser = fromstring(response.text)
    embedded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
    json_data = json.loads(embedded_jsons[0])

    name = json_data['name']
    location = json_data['hasMap']
    price_range = json_data['priceRange']
    rating_value = json_data['aggregateRating']['ratingValue']
    review_count = json_data['aggregateRating']['reviewCount']
    address = json_data['address']['streetAddress']

    all_data.append({
        "Name": name,
        "Location": location,
        "Price Range": price_range,
        "Rating": rating_value,
        "Review Count": review_count,
        "Address": address
    })

with open('hotel_data.csv', 'w', newline='') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_data)

print("Data saved to hotel_data.csv!")
Wrapping It Up
By using these tools to scrape Booking.com data, you can efficiently collect valuable information for analysis and decision-making. Just remember to always check a website's terms of service before scraping to ensure you're staying within the legal bounds.