Ever wondered how to quickly gather data on local restaurants—names, ratings, cuisines, and URLs—from Yelp? With the right tools, you can scrape Yelp data efficiently and gain valuable insights into your target market. In this guide, we’ll show you how to scrape Yelp’s search results using Python, handling headers, proxies, and XPath along the way to extract data effectively.
Step 1: Get Your Environment Ready
Before we start scraping, you’ll need a Python environment with two essential libraries: requests and lxml.
To install them, simply run:
pip install requests
pip install lxml
These libraries allow us to send HTTP requests to Yelp, retrieve the HTML content, and parse it for the information we need.
Step 2: Send a Yelp Data Request
We’re scraping Yelp’s search results page, so we need to send a GET request to get the HTML. Here’s the code to do that:
import requests
# Yelp search URL (we're targeting restaurants in San Francisco)
url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"
# Send GET request (a timeout prevents the script from hanging indefinitely)
response = requests.get(url, timeout=10)

# Check if the request was successful
if response.status_code == 200:
    print("Page content fetched successfully!")
else:
    print(f"Failed to retrieve page content. Status code: {response.status_code}")
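As a side note, the query string doesn’t have to be hand-encoded (the `%2C` for the comma, `+` for spaces): the standard library can build it for you. A small sketch—the `build_search_url` helper name is my own:

```python
from urllib.parse import urlencode

def build_search_url(find_desc, find_loc):
    # urlencode handles the percent-escaping (spaces -> +, commas -> %2C)
    query = urlencode({'find_desc': find_desc, 'find_loc': find_loc})
    return f"https://www.yelp.com/search?{query}"

url = build_search_url("restaurants", "San Francisco, CA")
```

This produces exactly the URL used above, and makes it trivial to loop over different search terms or cities later.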
Step 3: Set Up Headers
Websites like Yelp might block your requests if they suspect you’re a bot. To avoid this, you need to send HTTP headers, specifically a User-Agent. This tells Yelp that the request is coming from a legitimate browser, not a scraper.
Here’s how to set up the headers:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

response = requests.get(url, headers=headers)
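Reusing one User-Agent for every request is itself a pattern sites can spot. A common refinement is to rotate through a small pool of browser strings—a sketch, where the pool contents are sample strings you should replace with current ones:

```python
import random

# A small pool of desktop User-Agent strings (samples; swap in current ones)
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1 Safari/605.1.15',
]

def random_headers():
    # Pick a different User-Agent on each call so requests look less uniform
    return {
        'User-Agent': random.choice(USER_AGENTS),
        'Accept-Language': 'en-US,en;q=0.5',
    }
```

Then call `requests.get(url, headers=random_headers())` for each page fetch.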
Step 4: Proxy Rotation
When scraping large volumes of data, your IP might get blocked. The solution? Proxy rotation. By using a pool of rotating proxies, your IP address changes periodically, making it harder for Yelp to detect and block your requests.
Here’s how you can add a proxy setup:
proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

response = requests.get(url, headers=headers, proxies=proxies)
Make sure you use proxies that rotate automatically, so you don’t have to configure them manually every time.
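If your provider doesn’t rotate for you, a simple round-robin over a static list works as a fallback. A minimal sketch—the proxy addresses below are placeholders, and `next_proxies` is a name I’ve made up:

```python
from itertools import cycle

# Placeholder proxy endpoints -- substitute your provider's real addresses
PROXY_LIST = [
    'http://username:password@proxy1.example.com:8000',
    'http://username:password@proxy2.example.com:8000',
]
_proxy_pool = cycle(PROXY_LIST)

def next_proxies():
    # Round-robin through the pool; requests expects a scheme -> proxy mapping
    proxy = next(_proxy_pool)
    return {'http': proxy, 'https': proxy}
```

Each call to `requests.get(url, headers=headers, proxies=next_proxies())` then goes out through the next proxy in the pool.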
Step 5: Parse HTML Content with lxml
Once we’ve successfully fetched the HTML, it’s time to extract the data. We’ll use lxml for parsing. Here's how to do that:
from lxml import html
# Parse HTML content
parser = html.fromstring(response.content)
Now we need to identify the specific HTML elements that contain the restaurant data. On Yelp’s search results page, each listing sits in a <div> marked with a data-testid="serp-ia-card" attribute, and we’ll target those with XPath.
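To see the mechanics without hitting Yelp at all, here is the same XPath pattern applied to a tiny hand-written snippet (this markup is deliberately simplified—it is not Yelp’s real HTML):

```python
from lxml import html

# Simplified stand-in for two search-result cards (not Yelp's real markup)
sample = '''
<div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/foo-cafe">Foo Cafe</a></h3></div>
  <div data-testid="serp-ia-card"><h3><a href="/biz/bar-diner">Bar Diner</a></h3></div>
</div>
'''
tree = html.fromstring(sample)

# Same selector we use against the real page: match on the data-testid attribute
cards = tree.xpath('//div[@data-testid="serp-ia-card"]')
names = [card.xpath('.//a/text()')[0] for card in cards]
```

Attribute-based selectors like `data-testid` tend to be more stable than Yelp’s auto-generated CSS class names, which is why we anchor on them here.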
Step 6: Identify and Extract the Elements
Use XPath to pinpoint exactly what you want to extract: restaurant name, URL, cuisines, and rating.
Here’s how you’d extract the individual restaurant elements:
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]
This XPath expression grabs the result cards from the page. The [2:-1] slice trims the first two cards and the last one, which are typically sponsored or ad listings rather than organic results; adjust it to target more or fewer items depending on your needs.
Step 7: Extract Specific Data
Now, let’s dive into extracting the specific data points for each restaurant—name, URL, cuisines, and rating. Here’s how:
restaurants_data = []

for element in elements:
    # NOTE: Yelp's auto-generated class names change frequently --
    # re-check them in your browser's dev tools before running
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    biz_url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]

    restaurant_info = {
        "name": name,
        "url": biz_url,
        "cuisines": cuisines,
        "rating": rating
    }
    restaurants_data.append(restaurant_info)
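One caveat: each of those `[0]` indexes raises an IndexError the moment a card is missing a field, which aborts the whole run. A defensive wrapper is worth having—a sketch, where `first_or_none` is a helper name I’ve made up and the demo markup is simplified, not Yelp’s real HTML:

```python
from lxml import html

def first_or_none(node, expr):
    # xpath() returns a list; return the first hit or None so one malformed
    # card doesn't crash the scrape with an IndexError
    results = node.xpath(expr)
    return results[0] if results else None

# Demo on a simplified card (not Yelp's real markup)
card = html.fromstring('<div><h3><a href="/biz/foo">Foo Cafe</a></h3></div>')
name = first_or_none(card, './/a/text()')
missing = first_or_none(card, './/span[@class="no-such-class"]/text()')
```

In the loop above you would write `name = first_or_none(element, '...')` and skip or log entries where a required field comes back as None.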
Step 8: Output Data as JSON
Now that we’ve scraped the data, we want to save it in a structured format. JSON is a popular choice. Here's how to do it:
import json
# Save the data to a JSON file
with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)
print("Data extraction complete! Saved to yelp_restaurants.json")
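If a spreadsheet is the end goal instead, the same list of dicts can be written as CSV with the standard library—a sketch, where `save_as_csv` is a name I’ve made up and the cuisines list is flattened into one comma-separated cell:

```python
import csv

def save_as_csv(rows, path):
    # Write the scraped dicts as CSV; flatten the cuisines list into one cell
    with open(path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=['name', 'url', 'cuisines', 'rating'])
        writer.writeheader()
        for row in rows:
            flat = dict(row)
            flat['cuisines'] = ', '.join(flat['cuisines'])
            writer.writerow(flat)
```

Call it as `save_as_csv(restaurants_data, 'yelp_restaurants.csv')` right after (or instead of) the JSON dump.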
Full Code
Here’s the complete script that ties everything together:
import requests
from lxml import html
import json

url = "https://www.yelp.com/search?find_desc=restaurants&find_loc=San+Francisco%2C+CA"

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.5'
}

proxies = {
    'http': 'http://username:password@proxy-server:port',
    'https': 'https://username:password@proxy-server:port'
}

# Fetch the search results page (a timeout prevents indefinite hangs)
response = requests.get(url, headers=headers, proxies=proxies, timeout=10)

if response.status_code == 200:
    print("Page content fetched successfully!")
else:
    print(f"Failed to retrieve page content. Status code: {response.status_code}")

parser = html.fromstring(response.content)

# Grab the result cards, trimming the typically-sponsored leading/trailing ones
elements = parser.xpath('//div[@data-testid="serp-ia-card"]')[2:-1]

restaurants_data = []

for element in elements:
    # NOTE: Yelp's auto-generated class names change frequently --
    # re-check them in your browser's dev tools before running
    name = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/text()')[0]
    biz_url = element.xpath('.//div[@class="businessName__09f24__HG_pC y-css-ohs7lg"]/div/h3/a/@href')[0]
    cuisines = element.xpath('.//div[@class="priceCategory__09f24___4Wsg iaPriceCategory__09f24__x9YrM y-css-2hdccn"]/div/div/div/a/button/span/text()')
    rating = element.xpath('.//div[@class="y-css-9tnml4"]/@aria-label')[0]

    restaurant_info = {
        "name": name,
        "url": biz_url,
        "cuisines": cuisines,
        "rating": rating
    }
    restaurants_data.append(restaurant_info)

with open('yelp_restaurants.json', 'w') as f:
    json.dump(restaurants_data, f, indent=4)

print("Data extraction complete! Saved to yelp_restaurants.json")
Final Thoughts
Web scraping can be powerful, but don’t forget the importance of ethical practices and respecting website terms of service. Use proxies to minimize the risk of getting blocked, and make sure your scraping doesn’t overload the target site.
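One concrete way to keep your scraper polite is to pause between requests, with a little randomness so the timing isn’t robotic. A minimal sketch—`polite_pause` is a name I’ve made up, and the default delays are just a reasonable starting point:

```python
import random
import time

def polite_pause(base=2.0, jitter=1.0):
    # Sleep for the base interval plus random jitter between requests,
    # so the scraper neither hammers the site nor ticks like a metronome
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay
```

Call it once per page fetch when you extend this script to paginate through multiple result pages.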