# How to Scrape Yelp Data: Complete Guide for 2026
This guide covers extracting publicly accessible data. Always review a site's robots.txt and Terms of Service before scraping.
## TL;DR
To scrape Yelp with Python, use AlterLab’s API to render JavaScript, extract public business details via CSS selectors, and respect rate limits. A single request returns clean HTML you can parse with BeautifulSoup or lxml.
## Why collect local data from Yelp?
Yelp hosts a wealth of public business information useful for several engineering workflows:
- **Market research**: Track competitor listings, review counts, and rating trends across categories.
- **Price monitoring**: Extract menu items or service prices from restaurant and salon pages for dynamic pricing models.
- **Data enrichment**: Augment internal databases with business hours, location coordinates, and category tags for local search features.
These use cases rely solely on data visible on public pages; no login or private data is required.
## Technical challenges
Yelp’s modern site presents three core obstacles for scrapers:
- **JavaScript-heavy rendering**: Business details load client-side, so a plain `requests.get` returns an empty container.
- **Rate limiting & IP bans**: Exceeding a modest request threshold triggers temporary blocks or CAPTCHAs.
- **Bot detection**: The server checks for typical automation signatures, such as a missing or default user-agent header or a TLS fingerprint that does not match a real browser.
Raw HTTP clients fail because they cannot run the JavaScript that hydrates the page’s React components. AlterLab’s Smart Rendering API works around this by launching a headless browser, routing traffic through rotating proxies, and waiting for the network to go idle before returning the fully rendered DOM.
## Quick start with AlterLab API
First, install the official Python SDK (see the Getting started guide for full setup). Then authenticate and scrape a public Yelp page.
```python title="scrape_yelp-com.py" {3-5}
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# Target a public business page; no login required
response = client.scrape(
    url="https://www.yelp.com/biz/example-restaurant-san-francisco",
    params={"render": True, "wait_for": "networkidle"}
)

print(response.status_code)  # 200 if successful
html = response.text
```
The equivalent cURL request looks like this:
```bash title="Terminal"
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.yelp.com/biz/example-restaurant-san-francisco",
    "render": true,
    "wait_for": "networkidle"
  }'
```
Both examples ask AlterLab to render the page (`render: true`) and wait until network activity settles, so that the business name, rating, and address are present in the returned HTML.
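If you would rather not depend on the SDK, the same POST can be assembled with Python's standard library. The sketch below only mirrors the endpoint, header, and JSON fields from the cURL example; the function name `build_scrape_request` is illustrative, and the shape of the API's response is not covered here.

```python
import json
import urllib.request

# Endpoint and field names are copied from the cURL example above.
API_URL = "https://api.alterlab.io/v1/scrape"


def build_scrape_request(target_url, api_key):
    """Build (but do not send) a POST request asking for a rendered page."""
    payload = {
        "url": target_url,
        "render": True,
        "wait_for": "networkidle",
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


request = build_scrape_request(
    "https://www.yelp.com/biz/example-restaurant-san-francisco", "YOUR_KEY"
)
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would actually send it. Keeping construction and sending separate makes the payload easy to inspect or unit test without spending API credits.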
## Extracting structured data
Once you have the HTML, use a parser to pull the fields you need. Below are CSS selectors for common public data points on a Yelp business page (as of 2026). Adjust if the class names change.
```python title="parse_yelp.py" {4-10}
from bs4 import BeautifulSoup

# `html` is the rendered page returned by the scrape call above
soup = BeautifulSoup(html, "html.parser")

# Business name: typically an h1 with a specific data-testid attribute
name_tag = soup.select_one('h1[data-testid="business-name"]')
business_name = name_tag.get_text(strip=True) if name_tag else None

# Rating: often stored in a div's aria-label; keep the leading number
rating_tag = soup.select_one('div[role="img"][aria-label*="star rating"]')
rating = rating_tag["aria-label"].split()[0] if rating_tag else None

# Review count: adjacent to the rating
review_tag = soup.select_one('p[class*="review-count"]')
review_count = review_tag.get_text(strip=True).split()[0] if review_tag else None

# Address: first line of the address block
address_tag = soup.select_one('address p')
address = address_tag.get_text(strip=True) if address_tag else None

print({
    "business_name": business_name,
    "rating": rating,
    "review_count": review_count,
    "address": address
})
```
If you prefer JSON‑style extraction, AlterLab can return structured data directly via its Cortex AI add‑on, but the CSS approach works for pure HTML output.
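The rating and review count pulled out by CSS selectors arrive as strings, so a small normalization step is usually worth adding before storage. The helpers below are a minimal sketch that assumes the raw text looks something like `"4.5 star rating"` and `"1,234 reviews"`; both the function names and those text formats are assumptions, so adjust the patterns to whatever the page actually returns.

```python
import re


def parse_rating(label):
    """Return the first number in a label as a float, or None if absent."""
    if not label:
        return None
    match = re.search(r"\d+(?:\.\d+)?", label)
    return float(match.group()) if match else None


def parse_review_count(text):
    """Return the first integer in the text (commas allowed), or None."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


print(parse_rating("4.5 star rating"))
print(parse_review_count("1,234 reviews"))
```

Returning `None` instead of raising keeps one malformed page from aborting a whole batch; the missing values can be counted later as a signal that a selector has drifted.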
## Best practices
Scraping responsibly keeps your pipelines running smoothly and respects the target site:
- **Rate limit yourself**: Even with AlterLab’s proxy pool, send no more than 2–3 requests per second per IP to avoid triggering Yelp’s anti‑bot thresholds.
- **Honor robots.txt**: Check `https://www.yelp.com/robots.txt` for disallowed paths (e.g., `/ajax/*`, `/user/*`). Stick to `/biz/*` and `/search/*` for public data.
- **Handle dynamic content**: Use AlterLab’s `wait_for` parameter (`networkidle` or a specific selector) to ensure the DOM is ready before extracting.
- **Rotate user‑agents**: Though AlterLab does this automatically, if you build a custom scraper, rotate a list of realistic browser strings.
- **Log failures**: Capture HTTP 429 or 503 responses and implement exponential backoff.
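The robots.txt check from the list above can be automated with Python's built-in `urllib.robotparser`. The rules in this sketch are a made-up sample for illustration, not a copy of Yelp's real robots.txt, and the helper names are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; NOT a copy of Yelp's actual robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /ajax/
Disallow: /user/
"""


def build_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules permit fetching the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://www.yelp.com/biz/example-restaurant-san-francisco"))
print(is_allowed(parser, "https://www.yelp.com/user/some-profile"))
```

Against the live site, `RobotFileParser.set_url(...)` followed by `read()` would load the real file instead of the sample string. Note that the standard-library parser matches plain path prefixes, so wildcard rules may need separate handling.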
Following these rules reduces the chance of temporary bans and keeps your data fresh.
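The advice on logging failures and backing off can be sketched as a small retry wrapper. This is one possible shape, not AlterLab's built-in behavior: `fetch` is any callable that returns an object with a `status_code` attribute, and the attempt count and delays are placeholder values.

```python
import logging
import random
import time

RETRYABLE_STATUSES = {429, 503}


def fetch_with_backoff(fetch, url, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying throttled responses with exponential backoff."""
    for attempt in range(max_attempts):
        response = fetch(url)
        if response.status_code not in RETRYABLE_STATUSES:
            return response
        logging.warning(
            "Got %s for %s (attempt %d of %d)",
            response.status_code, url, attempt + 1, max_attempts,
        )
        if attempt < max_attempts - 1:
            # Delay doubles each attempt (base, 2x, 4x, ...) plus random jitter
            # so that parallel workers do not retry in lockstep.
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            sleep(delay)
    raise RuntimeError(f"Giving up on {url} after {max_attempts} attempts")
```

With the SDK from the quick start, a call might look like `fetch_with_backoff(lambda u: client.scrape(url=u, params={"render": True}), url)`, assuming the client returns a response object for 429 and 503 rather than raising an exception.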
## Scaling up
When you need to scrape hundreds or thousands of Yelp pages, consider these patterns:
- **Batch requests**: Send multiple URLs in a single API call using AlterLab’s `batch` endpoint (up to 20 URLs per request) to cut connection overhead.
- **Scheduling**: Use the platform’s cron feature to run a nightly scrape of a changing dataset (e.g., new restaurant openings).
- **Cost awareness**: Review the [pricing](/pricing) page to estimate monthly spend based on your request volume and rendering tier. AlterLab’s pay‑as‑you‑go model means you only pay for successful scrapes.
- **Storage**: Stream results directly to a data warehouse or object store; avoid holding large HTML strings in memory longer than necessary.
A typical scaling workflow might look like:
<div data-infographic="steps">
<div data-step data-number="1" data-title="Prepare URL list" data-description="Generate Yelp biz URLs from a CSV of categories or zip codes."/>
<div data-step data-number="2" data-title="Batch scrape" data-description="Send groups of 20 URLs to AlterLab with render:true."/>
<div data-step data-number="3" data-title="Parse & store" data-description="Extract name, rating, address; insert into Postgres or BigQuery."/>
<div data-step data-number="4" data-title="Handle errors" data-description="Retry failed items with a longer backoff; alert on persistent 429s."/>
</div>
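Step 2 of the workflow depends on splitting the URL list into groups of at most 20. A minimal sketch of that chunking step follows; the helper name `chunked` is illustrative, and the request format of the `batch` endpoint is deliberately left out because it is not described here.

```python
from itertools import islice


def chunked(urls, size=20):
    """Yield successive lists of at most `size` URLs from any iterable."""
    iterator = iter(urls)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


# Placeholder URLs standing in for a real list built from categories or zip codes.
urls = [f"https://www.yelp.com/biz/example-{i}" for i in range(45)]
for batch in chunked(urls):
    print(len(batch))
```

Because `chunked` is a generator, it also works on a URL list streamed from a file or database cursor without loading everything into memory first.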
## Key takeaways
- Use AlterLab’s headless browser rendering to handle Yelp’s JavaScript-rendered pages and anti-bot measures.
- Extract only publicly visible fields with reliable CSS selectors; avoid scraping behind login walls.
- Apply polite rate limits, respect robots.txt, and log errors to maintain a sustainable scraper.
- Leverage batching and scheduling to scale efficiently while monitoring cost via AlterLab’s pricing page.
Hit reply if you have questions.