agenthustler

Posted on Apr 9

Yelp Scraping: Extract Business Listings, Reviews and Local Data

#webdev #programming #javascript #webscraping

Yelp is one of the largest repositories of local business data on the internet. With over 200 million reviews and listings for restaurants, shops, services, and more across dozens of countries, it's a goldmine for market researchers, data analysts, and entrepreneurs.

In this guide, we'll explore how Yelp organizes its data, what you can extract, and how to build scrapers that collect business listings, reviews, and ratings reliably at scale.

Why Scrape Yelp?

Yelp data powers a wide range of use cases:

Market research: Understand the competitive landscape for any business category in any city
Lead generation: Build targeted lists of businesses by location, category, and rating
Sentiment analysis: Mine review text for customer pain points and trends
Location intelligence: Map business density, pricing trends, and service quality across neighborhoods
Academic research: Study consumer behavior, review authenticity, and platform dynamics
Competitor monitoring: Track how competing businesses' ratings and review counts change over time

While Yelp offers a Fusion API, it has strict rate limits (5,000 calls/day), doesn't include full review text, and restricts commercial use. For comprehensive data collection, scraping fills the gaps.

Understanding Yelp's Data Structure

Yelp organizes data around a few core entities. Understanding these will help you build effective scrapers.

Business Listings

Every business on Yelp has a profile page with a URL pattern like:

https://www.yelp.com/biz/business-name-city
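The last path segment of that URL is a convenient stable key for a business when you store or deduplicate records. Here is a minimal sketch that pulls it out; the helper name `extractBusinessAlias` is ours, not something Yelp defines.

```javascript
// Extract the business alias (the slug after /biz/) from a Yelp business URL.
// Returns null when the input is not a valid /biz/ page URL.
function extractBusinessAlias(yelpUrl) {
    try {
        const parsed = new URL(yelpUrl);
        const match = parsed.pathname.match(/^\/biz\/([^/]+)\/?$/);
        return match ? decodeURIComponent(match[1]) : null;
    } catch (e) {
        return null; // not an absolute URL, or a malformed percent-escape
    }
}

// Query parameters such as ?start=10 are ignored because only the path is matched
console.log(extractBusinessAlias('https://www.yelp.com/biz/joes-pizza-new-york?start=10'));
// joes-pizza-new-york
```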

Each business listing contains:

Basic info: Name, address, phone number, website, hours of operation
Categories: One or more business categories (e.g., "Italian Restaurant", "Pizza")
Rating: Overall star rating (1-5, in half-star increments)
Review count: Total number of reviews
Price range: Dollar signs ($, $$, $$$, $$$$)
Photos: User-uploaded and business photos
Attributes: Accepts credit cards, outdoor seating, delivery, etc.
Claimed status: Whether the business owner has claimed the listing

Search Results

Yelp search combines location and category/keyword queries:

https://www.yelp.com/search?find_desc=pizza&find_loc=New+York

Search results are paginated with 10 results per page and a start parameter for pagination. Yelp typically caps results at around 240 entries per search query (24 pages).
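To make the pagination concrete, here is a small sketch that generates the page URLs for one query. It assumes the 10-per-page size and the roughly 240-result cap described above; `buildSearchUrls` and the two constants are names we made up for illustration.

```javascript
const RESULTS_PER_PAGE = 10; // page size described above
const MAX_RESULTS = 240;     // approximate cap described above (24 pages)

// Build the list of search page URLs needed to cover `wanted` results
function buildSearchUrls(term, location, wanted = MAX_RESULTS) {
    const total = Math.min(wanted, MAX_RESULTS);
    const urls = [];
    for (let start = 0; start < total; start += RESULTS_PER_PAGE) {
        const url = new URL('https://www.yelp.com/search');
        url.searchParams.set('find_desc', term);
        url.searchParams.set('find_loc', location);
        if (start > 0) url.searchParams.set('start', String(start));
        urls.push(url.toString());
    }
    return urls;
}

// 25 results need three pages: start=0 (omitted), start=10, start=20
console.log(buildSearchUrls('pizza', 'New York', 25));
```

Asking for more than the cap simply yields 24 URLs, so the caller never requests pages Yelp is unlikely to serve.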

Reviews

Reviews are displayed on the business page and in a dedicated reviews section. Each review includes:

Reviewer name and profile link
Star rating (1-5)
Date posted
Review text (can be lengthy)
Photos attached to the review
Useful/Funny/Cool vote counts
Business owner response (if any)

Reviews are paginated with 10 per page and can be sorted by:

Yelp Sort (default algorithm)
Newest First
Oldest First
Highest Rated
Lowest Rated
Elites (from Yelp Elite reviewers)
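Once reviews are collected, four of these orderings can be reproduced locally. The sketch below assumes each review is an object with a numeric `rating` and a `date` in ISO `YYYY-MM-DD` form; "Yelp Sort" and "Elites" depend on data only Yelp has, so they are deliberately left out. The name `sortReviews` is hypothetical.

```javascript
// Comparators for the orderings that can be reproduced from scraped fields alone
const REVIEW_SORTERS = {
    newest: (a, b) => new Date(b.date) - new Date(a.date),
    oldest: (a, b) => new Date(a.date) - new Date(b.date),
    highest: (a, b) => b.rating - a.rating,
    lowest: (a, b) => a.rating - b.rating,
};

function sortReviews(reviews, order = 'newest') {
    const compare = REVIEW_SORTERS[order];
    if (!compare) throw new Error(`Unknown sort order: ${order}`);
    return [...reviews].sort(compare); // copy so the input array is not mutated
}

const exampleReviews = [
    { author: 'A', rating: 3, date: '2024-01-10' },
    { author: 'B', rating: 5, date: '2023-06-01' },
    { author: 'C', rating: 1, date: '2024-03-15' },
];
console.log(sortReviews(exampleReviews, 'newest').map(r => r.author)); // [ 'C', 'A', 'B' ]
```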

Using the Yelp Fusion API

The official API is a good starting point for moderate data needs:

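// require() works with node-fetch v2; v3 is ESM-only.
// On Node 18+ you can drop this line and use the built-in fetch instead.
const fetch = require('node-fetch');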

const YELP_API_KEY = 'YOUR_YELP_API_KEY';

async function searchBusinesses(term, location, limit = 20, offset = 0) {
    const url = new URL('https://api.yelp.com/v3/businesses/search');
    url.searchParams.set('term', term);
    url.searchParams.set('location', location);
    url.searchParams.set('limit', limit.toString());
    url.searchParams.set('offset', offset.toString());

    const response = await fetch(url.toString(), {
        headers: {
            'Authorization': `Bearer ${YELP_API_KEY}`,
        }
    });

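    // Fail loudly on auth errors (401) or rate limiting (429) instead of
    // crashing later on a missing `businesses` field
    if (!response.ok) {
        throw new Error(`Yelp API request failed: ${response.status} ${response.statusText}`);
    }

    const data = await response.json();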
    return data.businesses.map(biz => ({
        id: biz.id,
        name: biz.name,
        rating: biz.rating,
        reviewCount: biz.review_count,
        price: biz.price,
        phone: biz.phone,
        address: biz.location.display_address.join(', '),
        categories: biz.categories.map(c => c.title),
        coordinates: biz.coordinates,
        url: biz.url,
        isClosed: biz.is_closed,
        distance: biz.distance,
    }));
}

// Search for pizza places in New York
searchBusinesses('pizza', 'New York, NY').then(businesses => {
    businesses.forEach(biz => {
        console.log(`${biz.name} - ${biz.rating}⭐ (${biz.reviewCount} reviews) - ${biz.price || 'N/A'}`);
    });
});

Getting Business Details

async function getBusinessDetails(businessId) {
    const url = `https://api.yelp.com/v3/businesses/${businessId}`;

    const response = await fetch(url, {
        headers: {
            'Authorization': `Bearer ${YELP_API_KEY}`,
        }
    });

    return response.json();
}

async function getBusinessReviews(businessId) {
    const url = `https://api.yelp.com/v3/businesses/${businessId}/reviews?limit=3&sort_by=yelp_sort`;

    const response = await fetch(url, {
        headers: {
            'Authorization': `Bearer ${YELP_API_KEY}`,
        }
    });

    const data = await response.json();
    return data.reviews;
}

// Get details and reviews for a specific business
async function fullBusinessProfile(businessId) {
    const [details, reviews] = await Promise.all([
        getBusinessDetails(businessId),
        getBusinessReviews(businessId),
    ]);

    return {
        ...details,
        sampleReviews: reviews,
    };
}

API Limitations

The Yelp Fusion API has significant restrictions:

5,000 API calls per day — not enough for large-scale data collection
Reviews limited to 3 per business via the API
No full review text — reviews are truncated in API responses
No historical data — can't access past reviews or rating changes
Commercial use restrictions — you must display the Yelp logo and link back
Search capped at 1,000 results per query
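These numbers translate into a simple request budget. The sketch below works out how many calls one search query needs and how many days a batch of queries would take. The daily limit and the 1,000-result cap come from the list above; the page size of 50 is our assumption about the largest `limit` the search endpoint accepts, so check it against the current API documentation.

```javascript
const DAILY_CALL_LIMIT = 5000;   // from the list above
const MAX_SEARCH_RESULTS = 1000; // from the list above
const MAX_PAGE_SIZE = 50;        // assumption: largest `limit` per search call

// Offsets needed to page through one search query
function searchOffsets(wanted = MAX_SEARCH_RESULTS, pageSize = MAX_PAGE_SIZE) {
    const total = Math.min(wanted, MAX_SEARCH_RESULTS);
    const offsets = [];
    for (let offset = 0; offset < total; offset += pageSize) {
        offsets.push(offset);
    }
    return offsets;
}

// Whole days required to run `queryCount` queries within the daily limit
function daysNeeded(queryCount, callsPerQuery, dailyLimit = DAILY_CALL_LIMIT) {
    return Math.ceil((queryCount * callsPerQuery) / dailyLimit);
}

const callsPerQuery = searchOffsets().length; // 1000 / 50 = 20 calls
console.log(daysNeeded(500, callsPerQuery));  // 500 * 20 = 10,000 calls -> 2 days
```

Detail and review lookups cost one call per business on top of this, which is where the daily limit usually starts to bite.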

For comprehensive data, web scraping is the way forward.

Scraping Yelp Search Results

Here's how to extract business listings from Yelp search pages:

const cheerio = require('cheerio');
const fetch = require('node-fetch');

async function scrapeYelpSearch(term, location, maxPages = 5) {
    const businesses = [];

    for (let page = 0; page < maxPages; page++) {
        const start = page * 10;
        const url = `https://www.yelp.com/search?find_desc=${encodeURIComponent(term)}&find_loc=${encodeURIComponent(location)}&start=${start}`;

        const response = await fetch(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept': 'text/html,application/xhtml+xml',
            }
        });

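        // A non-2xx status usually means we were blocked; stop instead of parsing an error page
        if (!response.ok) {
            console.warn(`Request failed with status ${response.status}; stopping pagination`);
            break;
        }

        const html = await response.text();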
        const $ = cheerio.load(html);

        // Yelp embeds structured data in JSON-LD and script tags
        const scriptTags = $('script[type="application/ld+json"]');
        scriptTags.each((i, el) => {
            try {
                const data = JSON.parse($(el).html());
                if (data['@type'] === 'LocalBusiness' || Array.isArray(data)) {
                    const items = Array.isArray(data) ? data : [data];
                    items.forEach(item => {
                        if (item['@type'] && item.name) {
                            businesses.push({
                                name: item.name,
                                address: item.address ? item.address.streetAddress : '',
                                city: item.address ? item.address.addressLocality : '',
                                rating: item.aggregateRating ? item.aggregateRating.ratingValue : null,
                                reviewCount: item.aggregateRating ? item.aggregateRating.reviewCount : null,
                                phone: item.telephone || '',
                                priceRange: item.priceRange || '',
                            });
                        }
                    });
                }
            } catch (e) {
                // Skip malformed JSON
            }
        });

        console.log(`Page ${page + 1}: found ${businesses.length} total businesses`);

        // Rate limiting
        await new Promise(resolve => setTimeout(resolve, 2000));
    }

    return businesses;
}

// Scrape Italian restaurants in San Francisco
scrapeYelpSearch('Italian restaurants', 'San Francisco, CA', 3).then(results => {
    console.log(`\nTotal businesses: ${results.length}`);
    results.forEach(biz => {
        console.log(`${biz.name} - ${biz.rating}⭐ (${biz.reviewCount} reviews)`);
    });
});
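Because the scraper appends every page into one array, the same listing may appear more than once if results shift between requests. Here is a hedged sketch of a deduplication pass keyed on name plus street address; the key choice is an assumption, and two branches of a chain at the same address would collapse into one record.

```javascript
// Remove duplicate listings, keeping the first occurrence of each name/address pair
function dedupeBusinesses(businesses) {
    const seen = new Map();
    for (const biz of businesses) {
        const name = (biz.name || '').trim().toLowerCase();
        const address = (biz.address || '').trim().toLowerCase();
        const key = `${name}|${address}`;
        if (!seen.has(key)) seen.set(key, biz);
    }
    return [...seen.values()];
}

const unique = dedupeBusinesses([
    { name: "Tony's Pizza", address: '1570 Stockton St' },
    { name: "tony's pizza ", address: '1570 Stockton St' }, // same place, different casing
    { name: 'Golden Boy Pizza', address: '542 Green St' },
]);
console.log(unique.length); // 2
```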

Scraping Business Pages for Full Details

To get comprehensive data for a specific business:

async function scrapeBusinessPage(yelpUrl) {
    const response = await fetch(yelpUrl, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
        }
    });

    const html = await response.text();
    const $ = cheerio.load(html);

    // Extract from JSON-LD structured data
    let structuredData = {};
    $('script[type="application/ld+json"]').each((i, el) => {
        try {
            const data = JSON.parse($(el).html());
            if (data['@type'] === 'LocalBusiness' || data['@type'] === 'Restaurant') {
                structuredData = data;
            }
        } catch (e) {}
    });

    const business = {
        name: structuredData.name || $('h1').first().text().trim(),
        address: structuredData.address ? {
            street: structuredData.address.streetAddress,
            city: structuredData.address.addressLocality,
            state: structuredData.address.addressRegion,
            zip: structuredData.address.postalCode,
        } : {},
        phone: structuredData.telephone || '',
        website: '', // Usually found in the sidebar
        rating: structuredData.aggregateRating ? structuredData.aggregateRating.ratingValue : null,
        reviewCount: structuredData.aggregateRating ? structuredData.aggregateRating.reviewCount : null,
        priceRange: structuredData.priceRange || '',
        categories: [],
        hours: [],
        attributes: [],
        coordinates: structuredData.geo ? {
            lat: structuredData.geo.latitude,
            lng: structuredData.geo.longitude,
        } : null,
    };

    // Extract categories
    if (structuredData.servesCuisine) {
        business.categories = Array.isArray(structuredData.servesCuisine)
            ? structuredData.servesCuisine
            : [structuredData.servesCuisine];
    }

    // Extract hours from structured data
    if (structuredData.openingHoursSpecification) {
        business.hours = structuredData.openingHoursSpecification.map(h => ({
            days: h.dayOfWeek,
            opens: h.opens,
            closes: h.closes,
        }));
    }

    return business;
}

// Scrape a specific business
scrapeBusinessPage('https://www.yelp.com/biz/joes-pizza-new-york').then(biz => {
    console.log(JSON.stringify(biz, null, 2));
});
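The API wrapper and the page scraper above return slightly different shapes: `searchBusinesses` yields `price`, a string address, and the API's `latitude`/`longitude` coordinates, while `scrapeBusinessPage` yields `priceRange`, an address object, and `lat`/`lng`. A small normalizer keeps downstream analysis code simple. This is one possible mapping, and `normalizeBusiness` is a name we chose for illustration.

```javascript
// Map records from either source into one common shape
function normalizeBusiness(record) {
    const coords = record.coordinates || null;
    const address = record.address;

    let addressText = '';
    if (typeof address === 'string') {
        addressText = address; // API wrapper already joined display_address
    } else if (address) {
        // Scraper returns { street, city, state, zip }; skip missing parts
        addressText = [address.street, address.city, address.state, address.zip]
            .filter(Boolean)
            .join(', ');
    }

    return {
        name: record.name || '',
        rating: record.rating ?? null,
        reviewCount: record.reviewCount ?? record.review_count ?? null,
        priceRange: record.priceRange || record.price || '',
        address: addressText,
        coordinates: coords ? {
            lat: coords.lat ?? coords.latitude ?? null,
            lng: coords.lng ?? coords.longitude ?? null,
        } : null,
    };
}

console.log(normalizeBusiness({
    name: 'Example Pizza',
    rating: 4.5,
    reviewCount: 120,
    price: '$$',
    address: '1 Main St, New York, NY 10001',
    coordinates: { latitude: 40.73, longitude: -74.0 },
}));
```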

Extracting Reviews at Scale

Scraping Yelp reviews requires handling pagination and pulling review data out of the JSON that Yelp embeds in the page's script tags:

async function scrapeReviews(businessUrl, maxPages = 5) {
    const allReviews = [];

    for (let page = 0; page < maxPages; page++) {
        const start = page * 10;
        const url = start === 0 ? businessUrl : `${businessUrl}?start=${start}`;

        const response = await fetch(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
            }
        });

        const html = await response.text();
        const $ = cheerio.load(html);

        // Look for review data in embedded JSON
        const scripts = $('script').toArray();
        for (const script of scripts) {
            const content = $(script).html() || '';
            if (content.includes('reviewText') && content.includes('userDisplayName')) {
                try {
                    // Yelp embeds review data in various script formats
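                    // The reviewText pattern allows escaped quotes (\") inside the text.
                    // Fields are paired by index, which is a heuristic: other "rating"
                    // keys in the same script can shift the alignment.
                    // `|| []` avoids a TypeError when a pattern has no matches.
                    const reviewMatches = content.match(/"reviewText":"((?:[^"\\]|\\.)*)"/g) || [];
                    const ratingMatches = content.match(/"rating":(\d)/g) || [];
                    const dateMatches = content.match(/"localizedDate":"(.*?)"/g) || [];
                    const nameMatches = content.match(/"userDisplayName":"(.*?)"/g) || [];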

                    if (reviewMatches) {
                        for (let i = 0; i < reviewMatches.length; i++) {
                            allReviews.push({
                                text: reviewMatches[i] ? reviewMatches[i].replace(/"reviewText":"/, '').replace(/"$/, '') : '',
                                rating: ratingMatches[i] ? parseInt(ratingMatches[i].replace(/"rating":/, '')) : null,
                                date: dateMatches[i] ? dateMatches[i].replace(/"localizedDate":"/, '').replace(/"$/, '') : '',
                                author: nameMatches[i] ? nameMatches[i].replace(/"userDisplayName":"/, '').replace(/"$/, '') : '',
                            });
                        }
                    }
                } catch (e) {
                    // Continue to next script tag
                }
            }
        }

        console.log(`Page ${page + 1}: ${allReviews.length} total reviews`);

        // Respect rate limits
        await new Promise(resolve => setTimeout(resolve, 2500));
    }

    return allReviews;
}

// Extract reviews
scrapeReviews('https://www.yelp.com/biz/joes-pizza-new-york', 3).then(reviews => {
    console.log(`\nTotal reviews scraped: ${reviews.length}`);
    reviews.slice(0, 3).forEach(r => {
        console.log(`${r.author} - ${r.rating}⭐ - ${r.date}`);
        console.log(`  ${r.text.substring(0, 100)}...\n`);
    });
});
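Text pulled out of embedded JSON with a regex is still JSON-escaped, so line breaks show up as a literal `\n` and characters such as `&` may appear as `\u0026`. A minimal sketch for decoding it is to wrap the raw capture in quotes and let `JSON.parse` do the work. `decodeJsonString` is a hypothetical helper name.

```javascript
// Decode the body of a JSON string literal (the part between the quotes)
function decodeJsonString(raw) {
    try {
        return JSON.parse(`"${raw}"`);
    } catch (e) {
        return raw; // not a valid JSON string body; keep the raw text
    }
}

// The doubled backslashes here produce the single backslashes found in page source
const rawText = 'Great pizza.\\nFast service \\u0026 friendly staff.';
console.log(decodeJsonString(rawText));
```

Applying this to the `text` and `author` fields before storing them keeps escape sequences out of later keyword counts.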

Handling Yelp's Anti-Scraping Measures

Yelp has some of the most aggressive anti-scraping protections among review sites. Here's what to expect and how to handle it:

CAPTCHA Challenges

Yelp frequently serves CAPTCHAs to automated traffic. Common strategies include proxy rotation, randomized delays, and varied User-Agent headers:

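// Assumes the https-proxy-agent package is installed and that `fetch` is node-fetch,
// which accepts an `agent` option (Node's built-in fetch does not).
const { HttpsProxyAgent } = require('https-proxy-agent');

class YelpScraper {
    constructor(proxies = []) {
        this.proxies = proxies; // proxy URLs such as 'http://user:pass@host:port'
        this.proxyIndex = 0;
        this.requestCount = 0;
        this.maxRequestsPerProxy = 15;
    }

    // Return the proxy at the current index and advance the rotation
    getNextProxy() {
        if (this.proxies.length === 0) return null;
        const proxy = this.proxies[this.proxyIndex % this.proxies.length];
        this.proxyIndex++;
        return proxy;
    }

    // Return the proxy currently in use without advancing the rotation
    getCurrentProxy() {
        if (this.proxies.length === 0) return null;
        return this.proxies[this.proxyIndex % this.proxies.length];
    }

    async fetch(url) {
        this.requestCount++;

        // Rotate proxy every N requests
        if (this.requestCount % this.maxRequestsPerProxy === 0) {
            this.proxyIndex++;
        }

        const headers = {
            'User-Agent': this.getRandomUserAgent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
        };

        // Add random delay between 2-5 seconds
        const delay = 2000 + Math.random() * 3000;
        await new Promise(resolve => setTimeout(resolve, delay));

        // Route the request through the current proxy, if one is configured
        const proxy = this.getCurrentProxy();
        const response = await fetch(url, {
            headers,
            agent: proxy ? new HttpsProxyAgent(proxy) : undefined,
        });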

        // Check for CAPTCHA
        const text = await response.text();
        if (text.includes('distil_r_captcha') || text.includes('Are you a human')) {
            console.warn('CAPTCHA detected, rotating proxy...');
            this.proxyIndex++;
            return null; // Signal to retry with new proxy
        }

        return text;
    }

    getRandomUserAgent() {
        const agents = [
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
        ];
        return agents[Math.floor(Math.random() * agents.length)];
    }
}
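The `fetch` method above returns `null` when it sees a CAPTCHA, leaving the retry to the caller. Here is one way that retry could look: a sketch with exponential backoff, where the attempt count and base delay are arbitrary starting points rather than values known to work against Yelp. `fetchWithRetry` and its options are hypothetical names.

```javascript
const defaultSleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Call `fetchPage(url)` until it returns non-null HTML or attempts run out.
// `fetchPage` is any async function that resolves to HTML or null (blocked).
async function fetchWithRetry(fetchPage, url, options = {}) {
    const { maxAttempts = 3, baseDelayMs = 5000, sleep = defaultSleep } = options;

    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        const html = await fetchPage(url);
        if (html !== null) return html;

        if (attempt < maxAttempts) {
            // Back off: 5s, 10s, 20s, ... before the next attempt
            await sleep(baseDelayMs * 2 ** (attempt - 1));
        }
    }
    return null; // still blocked after all attempts
}

// Usage with the class above (not executed here):
// const scraper = new YelpScraper(['http://user:pass@host:port']);
// const html = await fetchWithRetry(u => scraper.fetch(u), 'https://www.yelp.com/biz/joes-pizza-new-york');
```

Passing `sleep` in as an option keeps the function easy to test without real waiting.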

Session Management

Yelp tracks sessions aggressively. Tips for avoiding detection:

Rotate cookies and sessions regularly
Don't scrape more than 50-100 pages per session
Vary your request patterns (don't hit pages in sequential order)
Use residential proxies rather than datacenter IPs
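Two of these tips are easy to express in code: shuffling the crawl order and capping the pages per session. The sketch below plans the batches only; how you reset cookies or switch proxies between batches depends on your HTTP client. The 50-page default comes from the lower bound above, and the function names are ours.

```javascript
// Fisher-Yates shuffle that returns a new array
function shuffle(items, random = Math.random) {
    const result = [...items];
    for (let i = result.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [result[i], result[j]] = [result[j], result[i]];
    }
    return result;
}

// Split a URL list into randomly ordered batches, one batch per session
function planSessions(urls, pagesPerSession = 50, random = Math.random) {
    const shuffled = shuffle(urls, random);
    const sessions = [];
    for (let i = 0; i < shuffled.length; i += pagesPerSession) {
        sessions.push(shuffled.slice(i, i + pagesPerSession));
    }
    return sessions;
}

// Placeholder URLs for illustration only
const exampleUrls = Array.from({ length: 120 }, (_, i) => `https://www.yelp.com/biz/example-${i}`);
console.log(planSessions(exampleUrls).map(batch => batch.length)); // [ 50, 50, 20 ]
```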

Scaling Yelp Scraping with Apify

Given Yelp's aggressive anti-scraping measures, running your own infrastructure for large-scale collection can be challenging. Apify provides managed infrastructure that handles many of these challenges automatically.

Using a Yelp Actor on Apify

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

async function scrapeYelpWithApify(searchTerms, locations) {
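    // NOTE: the actor ID and input fields below are illustrative. Pick a Yelp actor
    // from the Apify Store and match its documented input schema before running this.
    // Output field names such as `city` and `rating` also vary by actor.
    const run = await client.actor('apify/yelp-scraper').call({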
        searchTerms: searchTerms,
        locations: locations,
        maxResults: 100,
        includeReviews: true,
        maxReviews: 50,
        proxy: {
            useApifyProxy: true,
            apifyProxyGroups: ['RESIDENTIAL'],
        },
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    return items;
}

// Scrape coffee shops in multiple cities
scrapeYelpWithApify(
    ['coffee shops', 'specialty coffee'],
    ['San Francisco, CA', 'Portland, OR', 'Seattle, WA']
).then(results => {
    console.log(`Found ${results.length} businesses`);

    // Group by city
    const byCity = {};
    results.forEach(biz => {
        const city = biz.city || 'Unknown';
        if (!byCity[city]) byCity[city] = [];
        byCity[city].push(biz);
    });

    Object.entries(byCity).forEach(([city, businesses]) => {
        const avgRating = businesses.reduce((s, b) => s + b.rating, 0) / businesses.length;
        console.log(`${city}: ${businesses.length} shops, avg rating ${avgRating.toFixed(2)}`);
    });
});

Why Use Apify for Yelp Scraping

Residential proxy pools: Apify's residential proxies are far less likely to trigger Yelp's CAPTCHA than datacenter IPs
Automatic retries: Failed requests are retried with different proxies automatically
Browser fingerprinting: Headless browser actors can mimic real user behavior more effectively
Scheduling and monitoring: Set up daily scrapes with alerting on failures
Data export: Export to JSON, CSV, Excel, or push directly to databases and APIs
Compliance: Apify actors can be configured to respect robots.txt and rate limits

Data Analysis Patterns

Once you've collected Yelp data, here are some practical analysis examples:

Competitive Analysis

function competitiveAnalysis(businesses) {
    // Sort by rating (weighted by review count)
    const ranked = businesses
        .filter(b => b.reviewCount >= 10) // Minimum review threshold
        .map(b => ({
            ...b,
            weightedScore: b.rating * Math.log10(b.reviewCount + 1),
        }))
        .sort((a, b) => b.weightedScore - a.weightedScore);

    console.log('\nTop 10 businesses by weighted score:');
    ranked.slice(0, 10).forEach((biz, i) => {
        console.log(`${i + 1}. ${biz.name} - ${biz.rating}⭐ (${biz.reviewCount} reviews) Score: ${biz.weightedScore.toFixed(2)}`);
    });

    // Price range distribution
    const priceDistribution = {};
    businesses.forEach(b => {
        const price = b.priceRange || 'Unknown';
        priceDistribution[price] = (priceDistribution[price] || 0) + 1;
    });

    console.log('\nPrice distribution:');
    Object.entries(priceDistribution).sort().forEach(([price, count]) => {
        console.log(`  ${price}: ${count} businesses (${(count/businesses.length*100).toFixed(1)}%)`);
    });
}
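The weighted score multiplies the star rating by the base-10 logarithm of the review count, so each tenfold increase in reviews adds one "unit" of weight. A quick worked example with made-up numbers shows how a well-reviewed 4-star business can outrank a 5-star one with few reviews:

```javascript
// Same formula as in competitiveAnalysis above
const weightedScore = (rating, reviewCount) => rating * Math.log10(reviewCount + 1);

console.log(weightedScore(5.0, 10).toFixed(2));  // 5.21  (5.0 * log10(11))
console.log(weightedScore(4.5, 99).toFixed(2));  // 9.00  (4.5 * log10(100) = 4.5 * 2)
console.log(weightedScore(4.0, 999).toFixed(2)); // 12.00 (4.0 * log10(1000) = 4.0 * 3)
```

The logarithm grows slowly, so review volume matters a great deal at the low end and much less once a business has thousands of reviews.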

Review Sentiment Tracking

function reviewSentimentAnalysis(reviews) {
    // Group reviews by month
    const monthly = {};
    reviews.forEach(r => {
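        // Bucket by YYYY-MM. ISO dates are sliced directly; non-ISO formats
        // such as "1/5/2024" are parsed with Date first.
        let month = 'unknown';
        if (/^\d{4}-\d{2}/.test(r.date || '')) {
            month = r.date.substring(0, 7);
        } else if (r.date && !Number.isNaN(new Date(r.date).getTime())) {
            const d = new Date(r.date);
            month = `${d.getFullYear()}-${String(d.getMonth() + 1).padStart(2, '0')}`;
        }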
        if (!monthly[month]) monthly[month] = { ratings: [], count: 0 };
        monthly[month].ratings.push(r.rating);
        monthly[month].count++;
    });

    console.log('\nMonthly sentiment trend:');
    Object.entries(monthly).sort().forEach(([month, data]) => {
        const avg = data.ratings.reduce((s, r) => s + r, 0) / data.count;
        const bar = '█'.repeat(Math.round(avg * 4));
        console.log(`${month}: ${avg.toFixed(2)} ${bar} (${data.count} reviews)`);
    });

    // Find common keywords in negative reviews
    const negativeReviews = reviews.filter(r => r.rating <= 2);
    const wordFreq = {};
    const stopWords = new Set(['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'is', 'it', 'was', 'i', 'we', 'they']);

    negativeReviews.forEach(r => {
        const words = r.text.toLowerCase().split(/\s+/);
        words.forEach(w => {
            const clean = w.replace(/[^a-z]/g, '');
            if (clean.length > 3 && !stopWords.has(clean)) {
                wordFreq[clean] = (wordFreq[clean] || 0) + 1;
            }
        });
    });

    const topComplaints = Object.entries(wordFreq)
        .sort((a, b) => b[1] - a[1])
        .slice(0, 15);

    console.log('\nTop words in negative reviews:');
    topComplaints.forEach(([word, count]) => {
        console.log(`  "${word}": ${count} mentions`);
    });
}

Geographic Mapping

function geographicAnalysis(businesses) {
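    // Filter businesses with coordinates.
    // Expects { lat, lng } as produced by scrapeBusinessPage; Fusion API results
    // use { latitude, longitude } and need to be mapped first.
    const geoBusinesses = businesses.filter(b => b.coordinates);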

    if (geoBusinesses.length === 0) {
        console.log('No geographic data available');
        return;
    }

    // Calculate center point
    const avgLat = geoBusinesses.reduce((s, b) => s + b.coordinates.lat, 0) / geoBusinesses.length;
    const avgLng = geoBusinesses.reduce((s, b) => s + b.coordinates.lng, 0) / geoBusinesses.length;

    console.log(`Center point: ${avgLat.toFixed(4)}, ${avgLng.toFixed(4)}`);

    // Find highest rated neighborhoods (approximate by clustering)
    const gridSize = 0.01; // roughly 1km
    const grid = {};

    geoBusinesses.forEach(b => {
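        // toFixed(2) keeps floating-point noise (e.g. 37.77000000000001) out of the cell key
        const cellLat = (Math.round(b.coordinates.lat / gridSize) * gridSize).toFixed(2);
        const cellLng = (Math.round(b.coordinates.lng / gridSize) * gridSize).toFixed(2);
        const key = `${cellLat},${cellLng}`;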
        if (!grid[key]) grid[key] = { businesses: [], totalRating: 0 };
        grid[key].businesses.push(b);
        grid[key].totalRating += b.rating;
    });

    const topAreas = Object.entries(grid)
        .filter(([_, data]) => data.businesses.length >= 3)
        .map(([coords, data]) => ({
            coords,
            count: data.businesses.length,
            avgRating: data.totalRating / data.businesses.length,
        }))
        .sort((a, b) => b.avgRating - a.avgRating);

    console.log('\nHighest rated areas:');
    topAreas.slice(0, 5).forEach(area => {
        console.log(`  ${area.coords}: ${area.avgRating.toFixed(2)}⭐ avg (${area.count} businesses)`);
    });
}

Legal and Ethical Considerations

Yelp scraping requires careful attention to legal boundaries:

Yelp's Terms of Service: Yelp explicitly prohibits scraping in its ToS. Understand the legal implications in your jurisdiction before proceeding.
hiQ v. LinkedIn precedent: While not directly about Yelp, this case established that scraping publicly available data isn't necessarily a CFAA violation — but it's not a blanket permission either.
Rate limiting: Aggressive scraping can degrade Yelp's service for real users. Always use reasonable delays between requests.
Personal data: Review author names and other personal information may be subject to GDPR, CCPA, and other privacy regulations.
Data accuracy: Yelp's recommendation algorithm hides some reviews. Scraped data may not represent the full picture.

Best practices:

Scrape only what you need — don't collect data speculatively
Respect robots.txt directives
Use delays of 2+ seconds between requests
Don't republish raw review data without proper attribution
Consult a lawyer if using data commercially
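As a starting point for the robots.txt practice, here is a deliberately simplified checker. It only reads `User-agent: *` groups and treats `Disallow` values as plain path prefixes; it ignores wildcards, `Allow` rules, and crawl delays, so treat it as a sketch rather than a compliant parser. The robots.txt text in the example is made up and is not Yelp's actual file.

```javascript
// Return true if `path` matches a Disallow prefix in a "User-agent: *" group
function isDisallowed(robotsTxt, path) {
    let inWildcardGroup = false;
    let lastWasUserAgent = false;

    for (const rawLine of robotsTxt.split(/\r?\n/)) {
        const line = rawLine.split('#')[0].trim(); // strip comments
        const idx = line.indexOf(':');
        if (!line || idx === -1) continue;

        const field = line.slice(0, idx).trim().toLowerCase();
        const value = line.slice(idx + 1).trim();

        if (field === 'user-agent') {
            // Consecutive User-agent lines share one group; otherwise a new group starts
            if (!lastWasUserAgent) inWildcardGroup = false;
            if (value === '*') inWildcardGroup = true;
            lastWasUserAgent = true;
            continue;
        }

        lastWasUserAgent = false;
        if (inWildcardGroup && field === 'disallow' && value && path.startsWith(value)) {
            return true;
        }
    }
    return false;
}

// Made-up rules for illustration
const sampleRobots = [
    'User-agent: *',
    'Disallow: /search',
    '',
    'User-agent: otherbot',
    'Disallow: /biz/',
].join('\n');

console.log(isDisallowed(sampleRobots, '/search?find_desc=pizza'));  // true
console.log(isDisallowed(sampleRobots, '/biz/joes-pizza-new-york')); // false
```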

Conclusion

Yelp is a rich source of local business data, but extracting it at scale requires thoughtful engineering. Start with the Fusion API for small-scale needs, graduate to targeted scraping for specific data points the API doesn't cover, and leverage platforms like Apify when you need to collect data across hundreds of cities or categories without managing infrastructure.

The key to successful Yelp scraping is patience — slow, respectful scraping with proper proxy rotation and session management will get you far more data than aggressive approaches that trigger blocks. Whether you're doing competitive analysis, lead generation, or academic research, the combination of official APIs and carefully built scrapers gives you access to one of the most comprehensive local business datasets on the web.

Need reliable Yelp data without the infrastructure headaches? Browse the Apify Store for ready-made Yelp scrapers that handle proxy rotation, CAPTCHA solving, and data export automatically.

DEV Community