Ever wondered how to quickly gather data on local restaurants—names, ratings, cuisines, and URLs—from Yelp? With the right tools, you can scrape Yelp data efficiently and gain valuable insights into your target market. In this guide, we’ll show you how to scrape Yelp’s search results using Python, handling headers, proxies, and XPath along the way to extract data effectively.
Step 1: Get Your Environment Ready
Before we start scraping, you’ll need a Python environment with two essential libraries: requests and lxml.
To install them, simply run:
pip install requests
pip install lxml
These libraries allow us to send HTTP requests to Yelp, retrieve the HTML content, and parse it for the information we need.
Step 2: Send a Yelp Data Request
We’re scraping Yelp’s search results page, so we need to send a GET request to get the HTML. Here’s the code to do that:
import requests
# Yelp search URL (we're targeting restaurants in San Francisco)
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"
# Send GET request (a timeout prevents the script from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print("Page content fetched successfully!")
else:
    print(f"Failed to retrieve page content. Status code: {response.status_code}")
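As a side note, the query string doesn’t have to be hand-encoded (the `%2C` for the comma, `+` for spaces): the standard library can build it for you. A small sketch—the `build_search_url` helper name is my own:

```python
from urllib.parse import urlencode

def build_search_url(find_desc, find_loc):
    # urlencode handles the percent-escaping (spaces -> +, commas -> %2C)
    query = urlencode({'find_desc': find_desc, 'find_loc': find_loc})
    return f"https://www.yelp.com/search?{query}"

url = build_search_url("restaurants", "San Francisco, CA")
```

This produces exactly the URL used above, and makes it trivial to loop over different search terms or cities later.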
Step 3: Set Up Headers
Websites like Yelp might block your requests if they suspect you’re a bot. To avoid this, you need to send HTTP headers, specifically a User-Agent. This tells Yelp that the request is coming from a legitimate browser, not a scraper.
Here’s how to set up the headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

response = requests.get(url, headers=headers)
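Reusing one User-Agent for every request is itself a pattern sites can spot. A common refinement is to rotate through a small pool of browser strings—a sketch, where the pool contents are sample strings you should replace with current ones:

```python
import random

# A small pool of desktop User-Agent strings (samples; swap in current ones)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

def random_headers():
    # Pick a different User-Agent on each call so requests look less uniform
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.5',
    }
```

Then call `requests.get(url, headers=random_headers())` for each page fetch.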
Step 4: Proxy Rotation
When scraping large volumes of data, your IP might get blocked. The solution? Proxy rotation. By using a pool of rotating proxies, your IP address changes periodically, making it harder for Yelp to detect and block your requests.
Here’s how you can add a proxy setup:
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

response = requests.get(url, headers=headers, proxies=proxies)
Make sure you use proxies that rotate automatically, so you don’t have to configure them manually every time.
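If your provider doesn’t rotate for you, a simple round-robin over a static list works as a fallback. A minimal sketch—the proxy addresses below are placeholders, and `next_proxies` is a name I’ve made up:

```python
from itertools import cycle

# Placeholder proxy endpoints -- substitute your provider's real addresses
PROXY_LIST = [
    'http://username:password@proxy1.example.com:8000',
    'http://username:password@proxy2.example.com:8000',
]
_proxy_pool = cycle(PROXY_LIST)

def next_proxies():
    # Round-robin through the pool; requests expects a scheme -> proxy mapping
    proxy = next(_proxy_pool)
    return {'http': proxy, 'https': proxy}
```

Each call to `requests.get(url, headers=headers, proxies=next_proxies())` then goes out through the next proxy in the pool.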
Step 5: Parse HTML Content with lxml
Once we’ve successfully fetched the HTML, it’s time to extract the data. We’ll use lxml for parsing. Here's how to do that:
from lxml import html
# Parse HTML content
parser = html.fromstring(response.content)
Now we need to identify the specific HTML elements that contain the restaurant data. On Yelp’s search results page, each listing sits in a <div> marked with a data-testid="serp-ia-card" attribute, and we’ll target those with XPath.
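To see the mechanics without hitting Yelp at all, here is the same XPath pattern applied to a tiny hand-written snippet (this markup is deliberately simplified—it is not Yelp’s real HTML):

```python
from lxml import html

# Simplified stand-in for two search-result cards (not Yelp's real markup)
sample = '''
<div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/foo-cafe">Foo Cafe</a></h3></div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/bar-diner">Bar Diner</a></h3></div>
</div>
'''
tree = html.fromstring(sample)

# Same selector we use against the real page: match on the data-testid attribute
cards = tree.xpath('//div[@data-testid="serp-ia-card"]')
names = [card.xpath('.//a/text()')[0] for card in cards]
```

Attribute-based selectors like `data-testid` tend to be more stable than Yelp’s auto-generated CSS class names, which is why we anchor on them here.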
Step 6: Identify and Extract the Elements
Use XPath to pinpoint exactly what you want to extract: restaurant name, URL, cuisines, and rating.
Here’s how you’d extract the individual restaurant elements:
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]
This XPath expression grabs the result cards from the page. The [2:-1] slice trims the first two cards and the last one, which are typically sponsored or ad listings rather than organic results; adjust it to target more or fewer items depending on your needs.
Step 7: Extract Specific Data
Now, let’s dive into extracting the specific data points for each restaurant—name, URL, cuisines, and rating. Here’s how:
restaurants_data = []

for element in elements:
    # NOTE: Yelp's auto-generated class names change frequently --
    # re-check them in your browser's dev tools before running
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    biz_url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]

    restaurant_info = {
        "name": name,
        "url": biz_url,
        "cuisines": cuisines,
        "rating": rating
    }
    restaurants_data.append(restaurant_info)
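One caveat: each of those `[0]` indexes raises an IndexError the moment a card is missing a field, which aborts the whole run. A defensive wrapper is worth having—a sketch, where `first_or_none` is a helper name I’ve made up and the demo markup is simplified, not Yelp’s real HTML:

```python
from lxml import html

def first_or_none(node, expr):
    # xpath() returns a list; return the first hit or None so one malformed
    # card doesn't crash the scrape with an IndexError
    results = node.xpath(expr)
    return results[0] if results else None

# Demo on a simplified card (not Yelp's real markup)
card = html.fromstring('<div><h3><a href="/biz/foo">Foo Cafe</a></h3></div>')
name = first_or_none(card, './/a/text()')
missing = first_or_none(card, './/span[@class="no-such-class"]/text()')
```

In the loop above you would write `name = first_or_none(element, '...')` and skip or log entries where a required field comes back as None.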
Step 8: Output Data as JSON
Now that we’ve scraped the data, we want to save it in a structured format. JSON is a popular choice. Here's how to do it:
import json
# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)
print("Data extraction complete! Saved to yelp_restaurants.json")
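If a spreadsheet is the end goal instead, the same list of dicts can be written as CSV with the standard library—a sketch, where `save_as_csv` is a name I’ve made up and the cuisines list is flattened into one comma-separated cell:

```python
import csv

def save_as_csv(rows, path):
    # Write the scraped dicts as CSV; flatten the cuisines list into one cell
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'url', 'cuisines', 'rating'])
        writer.writeheader()
        for row in rows:
            flat = dict(row)
            flat['cuisines'] = ', '.join(flat['cuisines'])
            writer.writerow(flat)
```

Call it as `save_as_csv(restaurants_data, 'yelp_restaurants.csv')` right after (or instead of) the JSON dump.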
Full Code
Here’s the complete script that ties everything together:
import requests
from lxml import html
import json

url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

# Fetch the search results page (a timeout prevents indefinite hangs)
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)

if response.status_code == 200:
    print("Page content fetched successfully!")
else:
    print(f"Failed to retrieve page content. Status code: {response.status_code}")

parser = html.fromstring(response.content)

# Grab the result cards, trimming the typically-sponsored leading/trailing ones
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]

restaurants_data = []

for element in elements:
    # NOTE: Yelp's auto-generated class names change frequently --
    # re-check them in your browser's dev tools before running
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    biz_url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]

    restaurant_info = {
        "name": name,
        "url": biz_url,
        "cuisines": cuisines,
        "rating": rating
    }
    restaurants_data.append(restaurant_info)

with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete! Saved to yelp_restaurants.json")
Final Thoughts
Web scraping can be powerful, but don’t forget the importance of ethical practices and respecting website terms of service. Use proxies to minimize the risk of getting blocked, and make sure your scraping doesn’t overload the target site.
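One concrete way to keep your scraper polite is to pause between requests, with a little randomness so the timing isn’t robotic. A minimal sketch—`polite_pause` is a name I’ve made up, and the default delays are just a reasonable starting point:

```python
import random
import time

def polite_pause(base=2.0, jitter=1.0):
    # Sleep for the base interval plus random jitter between requests,
    # so the scraper neither hammers the site nor ticks like a metronome
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Call it once per page fetch when you extend this script to paginate through multiple result pages.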