DEV Community

Scrape Booking.com Data for Insights with Python

Imagine having the ability to collect real-time hotel data from Booking.com with just a few lines of code. Whether you’re a data analyst, a developer, or a business owner looking to dive into competitive analysis, scraping data from a site like Booking.com can open up a treasure trove of insights. But how do you go about it? Here’s your guide to extracting crucial information using Python.

Setting Up the Necessary Tools

Before diving into the code, let's first get your environment ready. You’ll need a few Python libraries that will do the heavy lifting for you. Here’s what you’ll be using:

  1. Requests – Sends HTTP requests to fetch the HTML content.
  2. LXML – Parses the HTML and extracts the necessary data using XPath.
  3. json – Parses the JSON data embedded in the page (standard library).
  4. csv – Saves the extracted data into a CSV file for easy analysis (standard library).

To get these libraries up and running, simply install them using pip:

pip install requests lxml

There is nothing to install for json and csv: both modules are part of Python's standard library.

How Data is Structured on Booking.com

Now, let’s talk about how data is structured on Booking.com. Each hotel page embeds JSON-LD, a structured-data format, inside a script tag of type application/ld+json. It carries fields such as the hotel's name, address, rating, and price range, so you can read them directly instead of picking values out of the page markup.
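To make the layout concrete, here is a minimal sketch of what such a JSON-LD block can look like and how Python reads it. The payload is invented for illustration: the hotel name, URL, and values are assumptions, not real Booking.com output. Only the key names mirror the fields used later in this guide.

```python
import json

# Hypothetical JSON-LD payload, shaped like the fields this guide reads later.
# Every value here is made up for illustration.
sample_json_ld = """
{
  "@context": "https://schema.org",
  "@type": "Hotel",
  "name": "Example Hotel",
  "hasMap": "https://example.com/map/example-hotel",
  "priceRange": "Prices start at $120 per night",
  "aggregateRating": {"ratingValue": 8.7, "reviewCount": 1543},
  "address": {"streetAddress": "1 Example Street, Exampleville"}
}
"""

# json.loads turns the JSON text into nested Python dicts.
hotel = json.loads(sample_json_ld)

print(hotel["name"])
print(hotel["aggregateRating"]["ratingValue"])
```

Nested objects such as aggregateRating become nested dicts, which is why the extraction code later uses two keys in a row.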

How to Get the Data

Booking.com, like many websites, takes steps to prevent automated scraping. But don't worry, we’ll handle it the right way.

Mimic a Legitimate User with Custom HTTP Headers
To avoid triggering anti-scraping systems, we need to make our requests look like they’re coming from a regular user. This means adding custom headers to our HTTP requests. Here’s the code to send a request with the proper headers:

import requests
from lxml.html import fromstring

# List of hotel URLs to scrape
urls_list = ["https://example.com/hotel1", "https://example.com/hotel2"]

for url in urls_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

    # A timeout keeps the script from hanging on a stalled connection
    response = requests.get(url, headers=headers, timeout=10)
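Headers alone do not guarantee a usable response, so it can help to wrap the request in a small helper that checks the status code and retries after a pause. The sketch below is one possible shape, not a fixed recipe: fetch_html and its parameters are hypothetical names, and the session argument is assumed to be anything with a requests-style get() method, such as requests.Session(). Network exceptions are not caught here.

```python
import time


def fetch_html(session, url, headers, timeout=10, retries=3, delay=2.0,
               sleep=time.sleep):
    """Return the page HTML, or None if every attempt fails.

    `session` is anything with a requests-style .get() method,
    for example requests.Session().
    """
    for attempt in range(retries):
        response = session.get(url, headers=headers, timeout=timeout)
        if response.status_code == 200:
            return response.text
        # Wait a little longer after each failed attempt before retrying.
        sleep(delay * (attempt + 1))
    return None
```

With requests installed, a call could look like html = fetch_html(requests.Session(), url, headers). The sleep parameter exists only so the waiting can be swapped out, for example in tests.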

Using Proxies to Avoid Detection
Booking.com applies strict request rate limits and tracks IP addresses. To lower the risk of blocks, we use proxies. These distribute requests across multiple IPs, so no single address sends enough traffic to stand out. Here’s how you can implement proxies:

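# Placeholders: replace with your proxy's address.
# Use the scheme your proxy actually listens on; many HTTP proxies
# expect an http:// URL for the 'https' entry as well.
proxies = {
    'http': 'http://your_proxy_here',
    'https': 'https://your_proxy_here'
}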
response = requests.get(url, headers=headers, proxies=proxies)
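A single proxy still sends every request from one IP. To actually spread requests across several addresses, one option is to cycle through a pool and hand a different proxies dict to each request. This is a minimal sketch: the proxy addresses are hypothetical placeholders, and proxy_rotation is an illustrative helper name, not part of requests.

```python
from itertools import cycle

# Hypothetical proxy endpoints; replace with addresses from your provider.
PROXY_POOL = [
    "http://proxy1.example.com:8080",
    "http://proxy2.example.com:8080",
    "http://proxy3.example.com:8080",
]


def proxy_rotation(pool):
    """Yield requests-style proxies dicts, cycling through the pool forever."""
    for address in cycle(pool):
        # The same proxy address is used for both plain and TLS traffic.
        yield {"http": address, "https": address}


rotation = proxy_rotation(PROXY_POOL)
first = next(rotation)   # uses proxy1
second = next(rotation)  # uses proxy2
```

Inside the scraping loop, each request would then take the next entry, for example requests.get(url, headers=headers, proxies=next(rotation)).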

HTML Parsing and JSON Extraction
Once we’ve made a successful request, we need to parse the page and pull out the data. We’ll extract the JSON-LD data embedded within the HTML, which contains all the hotel details you’re after.

import json

parser = fromstring(response.text)

# Extract JSON data embedded in the page
embedded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
if not embedded_jsons:
    raise ValueError("No JSON-LD block found in the page")
json_data = json.loads(embedded_jsons[0])
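Taking the first JSON-LD block works when the page has exactly one. A page can carry several such blocks, so a slightly more defensive sketch is to scan all of them and keep the one whose @type matches. The helper name find_json_ld and the default type "Hotel" are assumptions for illustration; check the @type value on the pages you actually scrape.

```python
import json


def find_json_ld(blocks, wanted_type="Hotel"):
    """Return the first parsed JSON-LD object whose @type matches, else None.

    `blocks` is a list of strings, such as the result of the XPath query.
    """
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip blocks that are not valid JSON
        # A block may hold one object or a list of objects.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == wanted_type:
                return item
    return None
```

It would be called as json_data = find_json_ld(embedded_jsons), followed by a check for None before reading any fields.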

Pulling Out Specific Hotel Information
Now comes the fun part: extracting the data you need. Below is a snippet that pulls specific fields like the hotel name, rating, price range, and more:

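# Each lookup raises KeyError if the page omits that field
name = json_data['name']
location = json_data['hasMap']  # link to a map of the hotel
price_range = json_data['priceRange']
# Rating details are nested one level down
rating_value = json_data['aggregateRating']['ratingValue']
review_count = json_data['aggregateRating']['reviewCount']
address = json_data['address']['streetAddress']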
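Direct indexing stops the whole run with a KeyError as soon as one hotel lacks a field, for instance a new listing with no reviews. A more forgiving sketch uses dict.get so missing values come back as None. The function name extract_hotel_fields is hypothetical, and the code assumes aggregateRating and address are objects (dicts) whenever they are present.

```python
def extract_hotel_fields(json_data):
    """Map a parsed JSON-LD dict to the flat record used in this guide."""
    # `or {}` covers both a missing key and an explicit null value.
    rating = json_data.get("aggregateRating") or {}
    address = json_data.get("address") or {}
    return {
        "Name": json_data.get("name"),
        "Location": json_data.get("hasMap"),
        "Price Range": json_data.get("priceRange"),
        "Rating": rating.get("ratingValue"),
        "Review Count": rating.get("reviewCount"),
        "Address": address.get("streetAddress"),
    }
```

The returned dict uses the same column names as the CSV step, so it can be appended to the results list as-is. csv.DictWriter writes None values as empty cells.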

Saving Data in a CSV File
Once you’ve extracted the data, it’s time to save it for analysis. We’ll store the results in a CSV file for easy access later:

import csv

all_data = []

# Append data for each hotel to the list
all_data.append({
    "Name": name,
    "Location": location,
    "Price Range": price_range,
    "Rating": rating_value,
    "Review Count": review_count,
    "Address": address
})

# Write data to a CSV
# utf-8 keeps hotel names with accented or non-Latin characters intact
with open('hotel_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_data)

print("Data saved to hotel_data.csv!")
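Once the file exists, reading it back is a quick way to confirm the export worked and to start the analysis. The sketch below loads the CSV and returns the highest-rated rows. The helper name top_rated is hypothetical, and it assumes the file is UTF-8 with a "Rating" column as written above; pass a different encoding if your file was saved another way.

```python
import csv


def top_rated(csv_path, limit=5, encoding="utf-8"):
    """Return up to `limit` rows from the CSV, highest rating first."""
    with open(csv_path, newline="", encoding=encoding) as csvfile:
        rows = list(csv.DictReader(csvfile))

    def rating_of(row):
        # CSV values are strings; treat blanks or non-numbers as 0.
        try:
            return float(row["Rating"])
        except (KeyError, TypeError, ValueError):
            return 0.0

    return sorted(rows, key=rating_of, reverse=True)[:limit]
```

For example, top_rated('hotel_data.csv', limit=3) would return the three best-rated hotels as dicts keyed by column name.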

The Full Code

Here’s a quick recap of the entire scraping process from start to finish:

import requests
from lxml.html import fromstring
import json
import csv

urls_list = ["https://example.com/hotel1", "https://example.com/hotel2"]
all_data = []

proxies = {'http': 'http://your_proxy_here', 'https': 'https://your_proxy_here'}

for url in urls_list:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.9',
    }

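    response = requests.get(url, headers=headers, proxies=proxies, timeout=10)
    response.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
    parser = fromstring(response.text)
    embedded_jsons = parser.xpath('//script[@type="application/ld+json"]/text()')
    if not embedded_jsons:
        print(f"No JSON-LD found for {url}, skipping")
        continue
    json_data = json.loads(embedded_jsons[0])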

    name = json_data['name']
    location = json_data['hasMap']
    price_range = json_data['priceRange']
    rating_value = json_data['aggregateRating']['ratingValue']
    review_count = json_data['aggregateRating']['reviewCount']
    address = json_data['address']['streetAddress']

    all_data.append({
        "Name": name,
        "Location": location,
        "Price Range": price_range,
        "Rating": rating_value,
        "Review Count": review_count,
        "Address": address
    })

with open('hotel_data.csv', 'w', newline='', encoding='utf-8') as csvfile:
    fieldnames = ["Name", "Location", "Price Range", "Rating", "Review Count", "Address"]
    writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(all_data)

print("Data saved to hotel_data.csv!")

Wrapping It Up

By using these tools to scrape Booking.com data, you can efficiently collect valuable information for analysis and decision-making. Just remember to always check a website's terms of service before scraping to ensure you're staying within the legal bounds.
