DEV Community

agenthustler

Yelp Scraping: Extract Business Listings, Reviews and Local Data

Yelp is one of the largest repositories of local business data on the internet. With over 200 million reviews and listings for restaurants, shops, services, and more across dozens of countries, it's a goldmine for market researchers, data analysts, and entrepreneurs.

In this guide, we'll explore how Yelp organizes its data, what you can extract, and how to build scrapers that collect business listings, reviews, and ratings reliably at scale.

Why Scrape Yelp?

Yelp data powers a wide range of use cases:

  • Market research: Understand the competitive landscape for any business category in any city
  • Lead generation: Build targeted lists of businesses by location, category, and rating
  • Sentiment analysis: Mine review text for customer pain points and trends
  • Location intelligence: Map business density, pricing trends, and service quality across neighborhoods
  • Academic research: Study consumer behavior, review authenticity, and platform dynamics
  • Competitor monitoring: Track how competing businesses' ratings and review counts change over time

While Yelp offers a Fusion API, it has strict rate limits (5,000 calls/day), doesn't include full review text, and restricts commercial use. For comprehensive data collection, scraping fills the gaps.

Understanding Yelp's Data Structure

Yelp organizes data around a few core entities. Understanding these will help you build effective scrapers.

Business Listings

Every business on Yelp has a profile page with a URL pattern like:

https://www.yelp.com/biz/business-name-city

Each business listing contains:

  • Basic info: Name, address, phone number, website, hours of operation
  • Categories: One or more business categories (e.g., "Italian Restaurant", "Pizza")
  • Rating: Overall star rating (1-5, in half-star increments)
  • Review count: Total number of reviews
  • Price range: Dollar signs ($, $$, $$$, $$$$)
  • Photos: User-uploaded and business photos
  • Attributes: Accepts credit cards, outdoor seating, delivery, etc.
  • Claimed status: Whether the business owner has claimed the listing
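Because the part of the URL after /biz/ identifies the business, it makes a convenient key for storing the fields listed above. Here is a minimal sketch, using only the standard URL class, that pulls that slug out of a link and rebuilds a canonical URL. The function names are illustrative and not part of any Yelp tooling.

```javascript
// Extract the business alias (the slug after /biz/) from a Yelp business URL.
// Returns null when the input is not a parseable business-page URL.
function parseBusinessAlias(rawUrl) {
    try {
        const url = new URL(rawUrl);
        const match = url.pathname.match(/^\/biz\/([^/]+)\/?$/);
        return match ? decodeURIComponent(match[1]) : null;
    } catch (e) {
        return null; // invalid URL or malformed percent-encoding
    }
}

// Drop the query string and fragment so the same business reached
// through different links maps to a single key.
function canonicalBusinessUrl(rawUrl) {
    const alias = parseBusinessAlias(rawUrl);
    return alias ? `https://www.yelp.com/biz/${encodeURIComponent(alias)}` : null;
}

console.log(parseBusinessAlias('https://www.yelp.com/biz/joes-pizza-new-york?osq=pizza'));
// → joes-pizza-new-york
console.log(canonicalBusinessUrl('https://www.yelp.com/biz/joes-pizza-new-york?start=10#reviews'));
// → https://www.yelp.com/biz/joes-pizza-new-york
```

Keying records on the canonical URL keeps one row per business even when links carry tracking or pagination parameters.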

Search Results

Yelp search combines location and category/keyword queries:

https://www.yelp.com/search?find_desc=pizza&find_loc=New+York

Search results are paginated with 10 results per page and a start parameter for pagination. Yelp typically caps results at around 240 entries per search query (24 pages).
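The pagination rule above is easy to turn into a list of page URLs up front. The sketch below assumes the page size of 10 and the approximate 240-result cap quoted in this section; both constants are taken from the text and may need adjusting if Yelp changes them.

```javascript
const RESULTS_PER_PAGE = 10; // page size described above
const MAX_RESULTS = 240;     // approximate per-query cap described above

// Build the search page URLs needed to cover `wanted` results.
function buildSearchUrls(term, location, wanted = MAX_RESULTS) {
    const total = Math.min(wanted, MAX_RESULTS);
    const urls = [];
    for (let start = 0; start < total; start += RESULTS_PER_PAGE) {
        const url = new URL('https://www.yelp.com/search');
        url.searchParams.set('find_desc', term);
        url.searchParams.set('find_loc', location);
        if (start > 0) url.searchParams.set('start', String(start));
        urls.push(url.toString());
    }
    return urls;
}

const searchUrls = buildSearchUrls('pizza', 'New York', 25);
console.log(searchUrls.length); // → 3
console.log(searchUrls[2]);
// → https://www.yelp.com/search?find_desc=pizza&find_loc=New+York&start=20
```

Since a single query stops at the cap, covering a large city usually means issuing several narrower queries (by neighborhood or category) and merging the results.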

Reviews

Reviews are displayed on the business page and in a dedicated reviews section. Each review includes:

  • Reviewer name and profile link
  • Star rating (1-5)
  • Date posted
  • Review text (can be lengthy)
  • Photos attached to the review
  • Useful/Funny/Cool vote counts
  • Business owner response (if any)

Reviews are paginated with 10 per page and can be sorted by:

  • Yelp Sort (default algorithm)
  • Newest First
  • Oldest First
  • Highest Rated
  • Lowest Rated
  • Elites (from Yelp Elite reviewers)
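Once reviews are collected, several of these orderings can be reproduced locally instead of re-fetching pages in a different sort. This is a rough sketch that assumes each scraped review has a numeric `rating` and an ISO 8601 `date` string such as "2024-01-15"; the field names and the `sortReviews` helper are illustrative. Yelp Sort and Elites depend on signals that are not in the scraped record, so they are left out.

```javascript
// Compare ISO 8601 date strings; they sort correctly as plain text.
const byDateAsc = (a, b) => (a.date < b.date ? -1 : a.date > b.date ? 1 : 0);

// Comparators mirroring four of the sort options listed above.
// Rating ties are broken by showing the newest review first.
const REVIEW_SORTS = {
    newest: (a, b) => byDateAsc(b, a),
    oldest: (a, b) => byDateAsc(a, b),
    highest: (a, b) => b.rating - a.rating || byDateAsc(b, a),
    lowest: (a, b) => a.rating - b.rating || byDateAsc(b, a),
};

function sortReviews(reviews, mode = 'newest') {
    const compare = REVIEW_SORTS[mode];
    if (!compare) throw new Error(`Unknown sort mode: ${mode}`);
    return [...reviews].sort(compare); // copy so the input is not mutated
}

const sampleReviews = [
    { id: 1, rating: 5, date: '2023-05-01' },
    { id: 2, rating: 2, date: '2024-01-15' },
    { id: 3, rating: 5, date: '2024-03-02' },
];
console.log(sortReviews(sampleReviews, 'highest').map(r => r.id)); // → [ 3, 1, 2 ]
```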

Using the Yelp Fusion API

The official API is a good starting point for moderate data needs:

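// Requires node-fetch v2 (v3 is ESM-only and cannot be loaded with require()).
// On Node 18+ this line can be dropped: fetch is available globally.
const fetch = require('node-fetch');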

const YELP_API_KEY = 'YOUR_YELP_API_KEY';

async function searchBusinesses(term, location, limit = 20, offset = 0) {
    const url = new URL('https://api.yelp.com/v3/businesses/search');
    url.searchParams.set('term', term);
    url.searchParams.set('location', location);
    url.searchParams.set('limit', limit.toString());
    url.searchParams.set('offset', offset.toString());

    const response = await fetch(url.toString(), {
        headers: {
            'Authorization': `Bearer ${YELP_API_KEY}`,
        }
    });

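    // Fail loudly on auth or rate-limit errors instead of crashing on a missing field
    if (!response.ok) {
        throw new Error(`Yelp API request failed: ${response.status} ${response.statusText}`);
    }

    const data = await response.json();
    return (data.businesses || []).map(biz => ({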
        id: biz.id,
        name: biz.name,
        rating: biz.rating,
        reviewCount: biz.review_count,
        price: biz.price,
        phone: biz.phone,
        address: biz.location.display_address.join(', '),
        categories: biz.categories.map(c => c.title),
        coordinates: biz.coordinates,
        url: biz.url,
        isClosed: biz.is_closed,
        distance: biz.distance,
    }));
}

// Search for pizza places in New York
searchBusinesses('pizza', 'New York, NY').then(businesses => {
    businesses.forEach(biz => {
        console.log(`${biz.name} - ${biz.rating}⭐ (${biz.reviewCount} reviews) - ${biz.price || 'N/A'}`);
    });
});

Getting Business Details

async function getBusinessDetails(businessId) {
    const url = `https://api.yelp.com/v3/businesses/${businessId}`;

    const response = await fetch(url, {
        headers: {
            'Authorization': `Bearer ${YELP_API_KEY}`,
        }
    });

    return response.json();
}

async function getBusinessReviews(businessId) {
    const url = `https://api.yelp.com/v3/businesses/${businessId}/reviews?limit=3&sort_by=yelp_sort`;

    const response = await fetch(url, {
        headers: {
            'Authorization': `Bearer ${YELP_API_KEY}`,
        }
    });

    const data = await response.json();
    // On failure the response carries no `reviews` array, so fall back to an empty list
    return data.reviews || [];
}

// Get details and reviews for a specific business
async function fullBusinessProfile(businessId) {
    const [details, reviews] = await Promise.all([
        getBusinessDetails(businessId),
        getBusinessReviews(businessId),
    ]);

    return {
        ...details,
        sampleReviews: reviews,
    };
}

API Limitations

The Yelp Fusion API has significant restrictions:

  • 5,000 API calls per day — not enough for large-scale data collection
  • Reviews limited to 3 per business via the API
  • No full review text — reviews are truncated in API responses
  • No historical data — can't access past reviews or rating changes
  • Commercial use restrictions — you must display the Yelp logo and link back
  • Search capped at 1,000 results per query
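To see how quickly these limits bite, it helps to estimate the call budget before starting a job. The sketch below uses the figures quoted in this section (5,000 calls per day, 1,000 results per query) and assumes a maximum page size of 50 for the search endpoint. The helper names are illustrative, and the numbers should be checked against your own API plan.

```javascript
// Limits as stated in this section; adjust if your API plan differs.
const DAILY_CALL_LIMIT = 5000;
const MAX_RESULTS_PER_QUERY = 1000;
const MAX_PAGE_SIZE = 50; // assumption: largest `limit` the search endpoint accepts

// Offsets needed to page through one search query.
function searchOffsets(totalWanted, pageSize = MAX_PAGE_SIZE) {
    const total = Math.min(totalWanted, MAX_RESULTS_PER_QUERY);
    const offsets = [];
    for (let offset = 0; offset < total; offset += pageSize) offsets.push(offset);
    return offsets;
}

// Estimate total calls and days for a job of `queries` searches.
function estimateApiCalls(queries, resultsPerQuery, { withDetails = false, withReviews = false } = {}) {
    const perQuery = Math.min(resultsPerQuery, MAX_RESULTS_PER_QUERY);
    const searchCalls = queries * searchOffsets(perQuery).length;
    const callsPerBusiness = (withDetails ? 1 : 0) + (withReviews ? 1 : 0);
    const calls = searchCalls + queries * perQuery * callsPerBusiness;
    return { calls, days: Math.ceil(calls / DAILY_CALL_LIMIT) };
}

console.log(searchOffsets(120)); // → [ 0, 50, 100 ]
console.log(estimateApiCalls(20, 1000, { withDetails: true, withReviews: true }));
// → { calls: 40400, days: 9 }
```

In this example, twenty full queries with details and reviews take nine days at the daily limit, and each business still yields only three truncated reviews.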

For comprehensive data, web scraping is the way forward.

Scraping Yelp Search Results

Here's how to extract business listings from Yelp search pages:

const cheerio = require('cheerio');
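// Requires node-fetch v2 (v3 is ESM-only and cannot be loaded with require()).
// On Node 18+ this line can be dropped: fetch is available globally.
const fetch = require('node-fetch');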

async function scrapeYelpSearch(term, location, maxPages = 5) {
    const businesses = [];

    for (let page = 0; page < maxPages; page++) {
        const start = page * 10;
        const url = `https://www.yelp.com/search?find_desc=${encodeURIComponent(term)}&find_loc=${encodeURIComponent(location)}&start=${start}`;

        const response = await fetch(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
                'Accept': 'text/html,application/xhtml+xml',
            }
        });

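        // A non-2xx status usually means a block or CAPTCHA page; stop rather than parse it
        if (!response.ok) {
            console.warn(`Request failed with status ${response.status}; stopping.`);
            break;
        }

        const html = await response.text();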
        const $ = cheerio.load(html);

        // Yelp embeds structured data in JSON-LD and script tags
        const scriptTags = $('script[type="application/ld+json"]');
        scriptTags.each((i, el) => {
            try {
                const data = JSON.parse($(el).html());
                if (data['@type'] === 'LocalBusiness' || Array.isArray(data)) {
                    const items = Array.isArray(data) ? data : [data];
                    items.forEach(item => {
                        if (item['@type'] && item.name) {
                            businesses.push({
                                name: item.name,
                                address: item.address ? item.address.streetAddress : '',
                                city: item.address ? item.address.addressLocality : '',
                                rating: item.aggregateRating ? item.aggregateRating.ratingValue : null,
                                reviewCount: item.aggregateRating ? item.aggregateRating.reviewCount : null,
                                phone: item.telephone || '',
                                priceRange: item.priceRange || '',
                            });
                        }
                    });
                }
            } catch (e) {
                // Skip malformed JSON
            }
        });

        console.log(`Page ${page + 1}: found ${businesses.length} total businesses`);

        // Rate limiting
        await new Promise(resolve => setTimeout(resolve, 2000));
    }

    return businesses;
}

// Scrape Italian restaurants in San Francisco
scrapeYelpSearch('Italian restaurants', 'San Francisco, CA', 3).then(results => {
    console.log(`\nTotal businesses: ${results.length}`);
    results.forEach(biz => {
        console.log(`${biz.name} - ${biz.rating}⭐ (${biz.reviewCount} reviews)`);
    });
});

Scraping Business Pages for Full Details

To get comprehensive data for a specific business:

async function scrapeBusinessPage(yelpUrl) {
    const response = await fetch(yelpUrl, {
        headers: {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
            'Accept-Language': 'en-US,en;q=0.9',
        }
    });

    const html = await response.text();
    const $ = cheerio.load(html);

    // Extract from JSON-LD structured data
    let structuredData = {};
    $('script[type="application/ld+json"]').each((i, el) => {
        try {
            const data = JSON.parse($(el).html());
            if (data['@type'] === 'LocalBusiness' || data['@type'] === 'Restaurant') {
                structuredData = data;
            }
        } catch (e) {}
    });

    const business = {
        name: structuredData.name || $('h1').first().text().trim(),
        address: structuredData.address ? {
            street: structuredData.address.streetAddress,
            city: structuredData.address.addressLocality,
            state: structuredData.address.addressRegion,
            zip: structuredData.address.postalCode,
        } : {},
        phone: structuredData.telephone || '',
        website: '', // Usually found in the sidebar
        rating: structuredData.aggregateRating ? structuredData.aggregateRating.ratingValue : null,
        reviewCount: structuredData.aggregateRating ? structuredData.aggregateRating.reviewCount : null,
        priceRange: structuredData.priceRange || '',
        categories: [],
        hours: [],
        attributes: [],
        coordinates: structuredData.geo ? {
            lat: structuredData.geo.latitude,
            lng: structuredData.geo.longitude,
        } : null,
    };

    // Extract categories
    if (structuredData.servesCuisine) {
        business.categories = Array.isArray(structuredData.servesCuisine)
            ? structuredData.servesCuisine
            : [structuredData.servesCuisine];
    }

    // Extract hours from structured data
    if (structuredData.openingHoursSpecification) {
        business.hours = structuredData.openingHoursSpecification.map(h => ({
            days: h.dayOfWeek,
            opens: h.opens,
            closes: h.closes,
        }));
    }

    return business;
}

// Scrape a specific business
scrapeBusinessPage('https://www.yelp.com/biz/joes-pizza-new-york').then(biz => {
    console.log(JSON.stringify(biz, null, 2));
});

Extracting Reviews at Scale

Scraping Yelp reviews requires handling pagination and dynamic content loading:

async function scrapeReviews(businessUrl, maxPages = 5) {
    const allReviews = [];

    for (let page = 0; page < maxPages; page++) {
        const start = page * 10;
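        // Build the page URL with URL so an existing query string on businessUrl is preserved
        const pageUrl = new URL(businessUrl);
        if (start > 0) pageUrl.searchParams.set('start', String(start));
        const url = pageUrl.toString();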

        const response = await fetch(url, {
            headers: {
                'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36',
                'Accept-Language': 'en-US,en;q=0.9',
            }
        });

        const html = await response.text();
        const $ = cheerio.load(html);

        // Look for review data in embedded JSON
        const scripts = $('script').toArray();
        for (const script of scripts) {
            const content = $(script).html() || '';
            if (content.includes('reviewText') && content.includes('userDisplayName')) {
                try {
                    // Yelp embeds review data in various script formats.
                    // The string patterns allow escaped quotes (\") inside a value, and
                    // grab() returns [] when a field is absent so indexing never throws.
                    // Values stay JSON-escaped. Pairing fields by index is a heuristic
                    // that assumes every field appears once per review, in order.
                    const grab = (re) => [...content.matchAll(re)].map(m => m[1]);
                    const reviewMatches = grab(/"reviewText":"((?:[^"\\]|\\.)*)"/g);
                    const ratingMatches = grab(/"rating":(\d)/g);
                    const dateMatches = grab(/"localizedDate":"((?:[^"\\]|\\.)*)"/g);
                    const nameMatches = grab(/"userDisplayName":"((?:[^"\\]|\\.)*)"/g);

                    for (let i = 0; i < reviewMatches.length; i++) {
                        allReviews.push({
                            text: reviewMatches[i] || '',
                            rating: ratingMatches[i] !== undefined ? parseInt(ratingMatches[i], 10) : null,
                            date: dateMatches[i] || '',
                            author: nameMatches[i] || '',
                        });
                    }
                } catch (e) {
                    // Continue to next script tag
                }
            }
        }

        console.log(`Page ${page + 1}: ${allReviews.length} total reviews`);

        // Respect rate limits
        await new Promise(resolve => setTimeout(resolve, 2500));
    }

    return allReviews;
}

// Extract reviews
scrapeReviews('https://www.yelp.com/biz/joes-pizza-new-york', 3).then(reviews => {
    console.log(`\nTotal reviews scraped: ${reviews.length}`);
    reviews.slice(0, 3).forEach(r => {
        console.log(`${r.author} - ${r.rating}⭐ - ${r.date}`);
        console.log(`  ${r.text.substring(0, 100)}...\n`);
    });
});

Handling Yelp's Anti-Scraping Measures

Yelp has some of the most aggressive anti-scraping protections among review sites. Here's what to expect and how to handle it:

CAPTCHA Challenges

Yelp frequently serves CAPTCHAs to automated traffic. Strategies:

class YelpScraper {
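    // `createAgent` is an optional caller-supplied function that turns a proxy URL
    // into an http(s).Agent (for example via a proxy-agent package). It is a
    // hypothetical hook, added so that the rotated proxy is actually applied.
    constructor(proxies = [], createAgent = null) {
        this.proxies = proxies;
        this.createAgent = createAgent;
        this.proxyIndex = 0;
        this.requestCount = 0;
        this.maxRequestsPerProxy = 15;
    }

    // Return the proxy currently in use; rotation happens by advancing proxyIndex.
    getNextProxy() {
        if (this.proxies.length === 0) return null;
        return this.proxies[this.proxyIndex % this.proxies.length];
    }

    async fetch(url) {
        this.requestCount++;

        // Rotate proxy every N requests
        if (this.requestCount % this.maxRequestsPerProxy === 0) {
            this.proxyIndex++;
        }

        const headers = {
            'User-Agent': this.getRandomUserAgent(),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
        };

        // Add random delay between 2-5 seconds
        const delay = 2000 + Math.random() * 3000;
        await new Promise(resolve => setTimeout(resolve, delay));

        // node-fetch v2 honors the `agent` option; Node's built-in fetch ignores it.
        const proxy = this.getNextProxy();
        const agent = proxy && this.createAgent ? this.createAgent(proxy) : undefined;
        const response = await fetch(url, { headers, agent });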

        // Check for CAPTCHA
        const text = await response.text();
        if (text.includes('distil_r_captcha') || text.includes('Are you a human')) {
            console.warn('CAPTCHA detected, rotating proxy...');
            this.proxyIndex++;
            return null; // Signal to retry with new proxy
        }

        return text;
    }

    getRandomUserAgent() {
        const agents = [
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15',
        ];
        return agents[Math.floor(Math.random() * agents.length)];
    }
}

Session Management

Yelp tracks sessions aggressively. Tips for avoiding detection:

  • Rotate cookies and sessions regularly
  • Don't scrape more than 50-100 pages per session
  • Vary your request patterns (don't hit pages in sequential order)
  • Use residential proxies rather than datacenter IPs
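Two of these tips, capping pages per session and avoiding sequential order, can be applied before any request is sent by planning the crawl up front. Here is a minimal sketch using a Fisher-Yates shuffle and fixed-size batches. The `planSessions` name, the cap of 50, and the example URLs are illustrative choices.

```javascript
// Fisher-Yates shuffle on a copy; `random` is injectable for reproducible runs.
function shuffle(items, random = Math.random) {
    const out = [...items];
    for (let i = out.length - 1; i > 0; i--) {
        const j = Math.floor(random() * (i + 1));
        [out[i], out[j]] = [out[j], out[i]];
    }
    return out;
}

// Randomize the URL order, then split it into batches of at most
// `maxPerSession` URLs, one batch per scraping session.
function planSessions(urls, maxPerSession = 50, random = Math.random) {
    const shuffled = shuffle(urls, random);
    const sessions = [];
    for (let i = 0; i < shuffled.length; i += maxPerSession) {
        sessions.push(shuffled.slice(i, i + maxPerSession));
    }
    return sessions;
}

// Hypothetical URLs, for illustration only.
const plannedUrls = Array.from({ length: 120 }, (_, i) => `https://www.yelp.com/biz/example-${i}`);
const sessions = planSessions(plannedUrls, 50);
console.log(sessions.map(s => s.length)); // → [ 50, 50, 20 ]
```

Each batch would then run with its own cookie jar and proxy, so no single session exceeds the page budget.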

Scaling Yelp Scraping with Apify

Given Yelp's aggressive anti-scraping measures, running your own infrastructure for large-scale collection can be challenging. Apify provides managed infrastructure that handles many of these challenges automatically.

Using a Yelp Actor on Apify

const { ApifyClient } = require('apify-client');

const client = new ApifyClient({
    token: 'YOUR_APIFY_TOKEN',
});

async function scrapeYelpWithApify(searchTerms, locations) {
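    // NOTE: the actor ID and input fields below are illustrative. Look up the Yelp
    // actor you intend to use in the Apify Store and follow its input schema.
    const run = await client.actor('apify/yelp-scraper').call({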
        searchTerms: searchTerms,
        locations: locations,
        maxResults: 100,
        includeReviews: true,
        maxReviews: 50,
        proxy: {
            useApifyProxy: true,
            apifyProxyGroups: ['RESIDENTIAL'],
        },
    });

    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    return items;
}

// Scrape coffee shops in multiple cities
scrapeYelpWithApify(
    ['coffee shops', 'specialty coffee'],
    ['San Francisco, CA', 'Portland, OR', 'Seattle, WA']
).then(results => {
    console.log(`Found ${results.length} businesses`);

    // Group by city
    const byCity = {};
    results.forEach(biz => {
        const city = biz.city || 'Unknown';
        if (!byCity[city]) byCity[city] = [];
        byCity[city].push(biz);
    });

    Object.entries(byCity).forEach(([city, businesses]) => {
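        // Skip records without a numeric rating so the average never becomes NaN
        const rated = businesses.filter(b => typeof b.rating === 'number');
        const avgRating = rated.length ? rated.reduce((s, b) => s + b.rating, 0) / rated.length : 0;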
        console.log(`${city}: ${businesses.length} shops, avg rating ${avgRating.toFixed(2)}`);
    });
});

Why Use Apify for Yelp Scraping

  1. Residential proxy pools: Apify's residential proxies are far less likely to trigger Yelp's CAPTCHA than datacenter IPs
  2. Automatic retries: Failed requests are retried with different proxies automatically
  3. Browser fingerprinting: Headless browser actors can mimic real user behavior more effectively
  4. Scheduling and monitoring: Set up daily scrapes with alerting on failures
  5. Data export: Export to JSON, CSV, Excel, or push directly to databases and APIs
  6. Compliance: Apify actors can be configured to respect robots.txt and rate limits
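Whichever route produces the records, overlapping search terms such as 'coffee shops' and 'specialty coffee' will return some businesses more than once, so it is worth de-duplicating before analysis. A minimal sketch follows, assuming each record carries either a `url` or a `name` plus `address`, and optionally a `reviewCount`. These field names are assumptions and may not match a given scraper's output.

```javascript
// Build a stable key for a business record: the Yelp URL without its query
// string when present, otherwise a normalized name + address pair.
function businessKey(biz) {
    if (biz.url) return biz.url.split('?')[0].toLowerCase();
    return `${biz.name || ''}|${biz.address || ''}`.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Keep one record per key, preferring the one with the larger review count
// on the assumption that it is the more recent snapshot.
function dedupeBusinesses(records) {
    const byKey = new Map();
    for (const biz of records) {
        const key = businessKey(biz);
        const existing = byKey.get(key);
        if (!existing || (biz.reviewCount || 0) > (existing.reviewCount || 0)) {
            byKey.set(key, biz);
        }
    }
    return [...byKey.values()];
}

// Hypothetical records, for illustration only.
const rawRecords = [
    { name: 'Example Coffee', url: 'https://www.yelp.com/biz/example-coffee-portland?osq=coffee', reviewCount: 120 },
    { name: 'Example Coffee', url: 'https://www.yelp.com/biz/example-coffee-portland', reviewCount: 124 },
    { name: 'Other Roasters', address: '1 Main St', reviewCount: 40 },
];
console.log(dedupeBusinesses(rawRecords).length); // → 2
```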

Data Analysis Patterns

Once you've collected Yelp data, here are some practical analysis examples:

Competitive Analysis

function competitiveAnalysis(businesses) {
    // Sort by rating (weighted by review count)
    const ranked = businesses
        .filter(b => b.reviewCount >= 10) // Minimum review threshold
        .map(b => ({
            ...b,
            weightedScore: b.rating * Math.log10(b.reviewCount + 1),
        }))
        .sort((a, b) => b.weightedScore - a.weightedScore);

    console.log('\nTop 10 businesses by weighted score:');
    ranked.slice(0, 10).forEach((biz, i) => {
        console.log(`${i + 1}. ${biz.name} - ${biz.rating}⭐ (${biz.reviewCount} reviews) Score: ${biz.weightedScore.toFixed(2)}`);
    });

    // Price range distribution
    const priceDistribution = {};
    businesses.forEach(b => {
        const price = b.priceRange || 'Unknown';
        priceDistribution[price] = (priceDistribution[price] || 0) + 1;
    });

    console.log('\nPrice distribution:');
    Object.entries(priceDistribution).sort().forEach(([price, count]) => {
        console.log(`  ${price}: ${count} businesses (${(count/businesses.length*100).toFixed(1)}%)`);
    });
}

Review Sentiment Tracking

function reviewSentimentAnalysis(reviews) {
    // Group reviews by month
    const monthly = {};
    reviews.forEach(r => {
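        // Bucket by YYYY-MM. Scraped dates may be localized (e.g. "1/15/2024"),
        // so only ISO-like strings are sliced directly; others are parsed first.
        let month = 'unknown';
        if (/^\d{4}-\d{2}/.test(r.date || '')) {
            month = r.date.substring(0, 7);
        } else if (r.date && !Number.isNaN(new Date(r.date).getTime())) {
            const d = new Date(r.date);
            month = `${d.getFullYear()}-${String(d.getMonth() + 1).padStart(2, '0')}`;
        }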
        if (!monthly[month]) monthly[month] = { ratings: [], count: 0 };
        monthly[month].ratings.push(r.rating);
        monthly[month].count++;
    });

    console.log('\nMonthly sentiment trend:');
    Object.entries(monthly).sort().forEach(([month, data]) => {
        const avg = data.ratings.reduce((s, r) => s + r, 0) / data.count;
        const bar = '█'.repeat(Math.round(avg * 4));
        console.log(`${month}: ${avg.toFixed(2)} ${bar} (${data.count} reviews)`);
    });

    // Find common keywords in negative reviews
    const negativeReviews = reviews.filter(r => r.rating <= 2);
    const wordFreq = {};
    const stopWords = new Set(['the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for', 'of', 'is', 'it', 'was', 'i', 'we', 'they']);

    negativeReviews.forEach(r => {
        const words = (r.text || '').toLowerCase().split(/\s+/);
        words.forEach(w => {
            const clean = w.replace(/[^a-z]/g, '');
            if (clean.length > 3 && !stopWords.has(clean)) {
                wordFreq[clean] = (wordFreq[clean] || 0) + 1;
            }
        });
    });

    const topComplaints = Object.entries(wordFreq)
        .sort((a, b) => b[1] - a[1])
        .slice(0, 15);

    console.log('\nTop words in negative reviews:');
    topComplaints.forEach(([word, count]) => {
        console.log(`  "${word}": ${count} mentions`);
    });
}

Geographic Mapping

function geographicAnalysis(businesses) {
    // Filter businesses with coordinates
    const geoBusinesses = businesses.filter(b => b.coordinates);

    if (geoBusinesses.length === 0) {
        console.log('No geographic data available');
        return;
    }

    // Calculate center point
    const avgLat = geoBusinesses.reduce((s, b) => s + b.coordinates.lat, 0) / geoBusinesses.length;
    const avgLng = geoBusinesses.reduce((s, b) => s + b.coordinates.lng, 0) / geoBusinesses.length;

    console.log(`Center point: ${avgLat.toFixed(4)}, ${avgLng.toFixed(4)}`);

    // Find highest rated neighborhoods (approximate by clustering)
    const gridSize = 0.01; // about 1.1 km of latitude; longitude cells get narrower away from the equator
    const grid = {};

    geoBusinesses.forEach(b => {
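        // toFixed(2) matches the 0.01 grid and avoids float artifacts such as 40.730000000000004
        const key = `${(Math.round(b.coordinates.lat / gridSize) * gridSize).toFixed(2)},${(Math.round(b.coordinates.lng / gridSize) * gridSize).toFixed(2)}`;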
        if (!grid[key]) grid[key] = { businesses: [], totalRating: 0 };
        grid[key].businesses.push(b);
        grid[key].totalRating += b.rating;
    });

    const topAreas = Object.entries(grid)
        .filter(([_, data]) => data.businesses.length >= 3)
        .map(([coords, data]) => ({
            coords,
            count: data.businesses.length,
            avgRating: data.totalRating / data.businesses.length,
        }))
        .sort((a, b) => b.avgRating - a.avgRating);

    console.log('\nHighest rated areas:');
    topAreas.slice(0, 5).forEach(area => {
        console.log(`  ${area.coords}: ${area.avgRating.toFixed(2)}⭐ avg (${area.count} businesses)`);
    });
}

Legal and Ethical Considerations

Yelp scraping requires careful attention to legal boundaries:

  • Yelp's Terms of Service: Yelp explicitly prohibits scraping in its ToS. Understand the legal implications in your jurisdiction before proceeding.
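  • hiQ v. LinkedIn precedent: While not directly about Yelp, the Ninth Circuit's ruling in this case indicated that scraping publicly available data isn't necessarily a CFAA violation. It is not blanket permission, though: claims based on a site's terms of service, such as breach of contract, can still apply.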
  • Rate limiting: Aggressive scraping can degrade Yelp's service for real users. Always use reasonable delays between requests.
  • Personal data: Review author names and other personal information may be subject to GDPR, CCPA, and other privacy regulations.
  • Data accuracy: Yelp's recommendation algorithm hides some reviews. Scraped data may not represent the full picture.

Best practices:

  • Scrape only what you need — don't collect data speculatively
  • Respect robots.txt directives
  • Use delays of 2+ seconds between requests
  • Don't republish raw review data without proper attribution
  • Consult a lawyer if using data commercially
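The 2-second delay is best treated as a floor that grows after a failed or blocked request. Below is a minimal sketch of exponential backoff with jitter, assuming a task that returns null when it hits a block, as the scraper class earlier in this guide does. The helper names and default values are illustrative.

```javascript
// Delay before retrying after attempt `attempt` (0-based): exponential growth
// from `baseMs`, capped at `maxMs`, plus up to 50% random jitter on top.
function backoffDelay(attempt, { baseMs = 2000, maxMs = 60000, random = Math.random } = {}) {
    const exp = Math.min(maxMs, baseMs * 2 ** attempt);
    return Math.round(exp + random() * exp * 0.5);
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` until it returns a non-null value or the retries are used up.
// `sleepFn` is injectable so the wait can be skipped in tests.
async function withRetries(task, { retries = 4, sleepFn = sleep, ...delayOpts } = {}) {
    for (let attempt = 0; attempt <= retries; attempt++) {
        const result = await task(attempt);
        if (result !== null && result !== undefined) return result;
        if (attempt < retries) await sleepFn(backoffDelay(attempt, delayOpts));
    }
    return null;
}

// Print the delay schedule without jitter to show the growth.
for (let attempt = 0; attempt < 4; attempt++) {
    console.log(attempt, backoffDelay(attempt, { random: () => 0 }));
}
```

A call such as `withRetries(() => scraper.fetch(url))` would then wait progressively longer after each blocked attempt instead of retrying immediately.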

Conclusion

Yelp is a rich source of local business data, but extracting it at scale requires thoughtful engineering. Start with the Fusion API for small-scale needs, graduate to targeted scraping for specific data points the API doesn't cover, and leverage platforms like Apify when you need to collect data across hundreds of cities or categories without managing infrastructure.

The key to successful Yelp scraping is patience — slow, respectful scraping with proper proxy rotation and session management will get you far more data than aggressive approaches that trigger blocks. Whether you're doing competitive analysis, lead generation, or academic research, the combination of official APIs and carefully built scrapers gives you access to one of the most comprehensive local business datasets on the web.


Need reliable Yelp data without the infrastructure headaches? Browse the Apify Store for ready-made Yelp scrapers that handle proxy rotation, CAPTCHA solving, and data export automatically.
