Ava Torres

How to Scrape YellowPages.com for Business Leads in 2026

YellowPages.com is one of the largest US business directories -- over 20 million listings with phone numbers, addresses, websites, emails, and ratings. If you sell to local businesses, this is one of the most valuable public data sources available.

The problem: YellowPages sits behind Cloudflare. Basic HTTP requests get blocked. Most scraping tutorials give you code that worked in 2022 and fails silently today.

I built a scraper that handles this reliably. Here's what I learned and how you can use it.

Why YellowPages is hard to scrape in 2026

Three things make YP harder than it looks:

  1. Cloudflare JS challenges. Every request goes through a JavaScript challenge. Simple HTTP clients (requests, axios, urllib) fail immediately. You need a real browser.

  2. IP blocking. Datacenter IPs are blocked outright. You need residential proxies, and specifically US-based ones -- non-US residential IPs also get blocked.

  3. Session detection. Even with a browser and proxies, Cloudflare tracks session behavior. Reusing the same browser page across too many navigations triggers detection. Fresh browser contexts are needed for reliable extraction.

What data is available?

Each YellowPages listing includes structured data embedded as schema.org JSON-LD:

  • Business name, phone, full address
  • Categories (e.g., "Plumbers", "Air Conditioning Contractors")
  • Rating and review count
  • Website URL
  • Opening hours and current status
  • Years in business

Detail pages add:

  • Email addresses (~22% of businesses have them)
  • Payment methods (Visa, Mastercard, cash, etc.)
  • Neighborhood
  • Amenities (Licensed, Insured, etc.)

The approach that works

After testing multiple approaches, here's what reliably bypasses Cloudflare on YellowPages:

Browser automation with stealth

Use a headless browser with anti-detection patches. The browser needs to:

  • Use the --headless=new flag (old headless mode has a detectable fingerprint)
  • Apply WebGL, canvas, and navigator spoofing
  • Run with SwiftShader for consistent GPU fingerprinting
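As a minimal sketch of that launch configuration, here is how the flag set might look with Playwright for Python. Python and Playwright are illustrative choices here (the scraper described later is written in Go), and the exact flag list is an assumption based on the requirements above, not the actor's actual configuration:

```python
# Illustrative stealth launch args for Chromium (assumed, not the actor's exact set).
STEALTH_ARGS = [
    "--headless=new",                # new headless mode: fingerprint matches headful Chrome
    "--use-gl=angle",
    "--use-angle=swiftshader",       # SwiftShader for a consistent GPU/WebGL fingerprint
    "--disable-blink-features=AutomationControlled",  # hide the automation hint
]

def launch_stealth_browser(playwright):
    """Launch Chromium with the anti-detection flags above.

    headless=False so Playwright doesn't inject its own headless flag;
    --headless=new in the args controls which headless mode actually runs.
    """
    return playwright.chromium.launch(headless=False, args=STEALTH_ARGS)
```

Navigator/canvas spoofing would be layered on top via an init script (e.g. `page.add_init_script`) before any navigation.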

Residential proxy rotation

Every browser session gets a unique US residential IP. If Cloudflare challenges the session, the browser closes and relaunches with a new IP. Most challenges resolve in 1-2 attempts.
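The close-and-relaunch loop can be sketched generically. This is an illustrative Python version under the assumption that each new session id maps to a fresh US residential exit IP at the proxy layer; `fetch` stands in for the browser launch-and-scrape step:

```python
import random

class ChallengeError(Exception):
    """Raised when Cloudflare serves a JS challenge instead of the page."""

def fetch_with_fresh_sessions(fetch, max_attempts=3):
    """Retry `fetch` with a brand-new session id on each attempt.

    A new session id gets a new residential IP from the proxy pool,
    so a challenged session is simply thrown away and replaced.
    """
    for attempt in range(1, max_attempts + 1):
        session_id = f"yp-{attempt}-{random.randint(0, 1_000_000)}"
        try:
            return fetch(session_id)
        except ChallengeError:
            continue  # relaunch with a fresh IP and try again
    raise ChallengeError(f"still challenged after {max_attempts} attempts")
```

Because most challenges clear within 1-2 attempts, a small `max_attempts` keeps proxy spend bounded.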

JSON-LD extraction

Instead of parsing brittle CSS selectors, extract data from the schema.org LocalBusiness JSON-LD embedded in each page. This is more stable across site redesigns and provides cleaner data.

```json
{
  "name": "Clarke Kent Plumbing",
  "phone": "(512) 766-0970",
  "streetAddress": "1408 W Ben White Blvd",
  "city": "Austin",
  "state": "TX",
  "postalCode": "78704",
  "categories": ["Plumbers", "Air Conditioning Contractors"],
  "rating": 2.87,
  "reviewCount": 15,
  "website": "http://www.clarkekentplumbing.com/"
}
```
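Pulling the JSON-LD out of a page needs nothing beyond the standard library. Here is a minimal Python sketch (illustrative, not the actor's Go implementation) that finds `application/ld+json` script tags and returns the first `LocalBusiness` object:

```python
import json
import re

# Matches <script type="application/ld+json"> ... </script> blocks.
JSONLD_RE = re.compile(
    r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>',
    re.DOTALL,
)

def extract_local_business(html):
    """Return the first schema.org LocalBusiness object embedded in the page."""
    for match in JSONLD_RE.finditer(html):
        try:
            data = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed or unrelated script blocks
        candidates = data if isinstance(data, list) else [data]
        for obj in candidates:
            if isinstance(obj, dict) and obj.get("@type") == "LocalBusiness":
                return obj
    return None
```

A regex is fine here because we only need the script bodies, not a full DOM; an HTML parser would work just as well.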

The easy way: use a pre-built scraper

If you don't want to build and maintain your own Cloudflare bypass, I published a ready-to-use scraper on Apify:

Yellow Pages Scraper - US Business Leads & Emails

It handles everything described above -- Cloudflare bypass, residential proxy rotation, JSON-LD extraction, and optional email scraping from detail pages. Written in Go for speed and efficiency.

Quick start

  1. Open the actor link above
  2. Enter search terms (e.g., plumbers) and locations (e.g., austin-tx)
  3. Set proxy to Residential + US
  4. Click Run

You get structured JSON, CSV, or Excel output with 15+ fields per business.

Pricing

Pay per result: $0.005 per business lead + $0.10 per run start. No monthly subscription. 1,000 leads in a single run comes to $5.10.
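The cost model is simple enough to express as a one-liner for budgeting (the rates are from the listing above; the function name is just for illustration):

```python
def run_cost(results, run_starts=1, per_result=0.005, per_start=0.10):
    """Total cost in USD at the pay-per-result rates: leads plus run starts."""
    return results * per_result + run_starts * per_start
```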

Performance

  • ~30 results per page, up to 100 pages per search
  • 90 results in under 15 seconds (3 pages)
  • Automatic retry on Cloudflare challenges
  • Deduplication removes sponsored listing repeats
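Sponsored listings repeat organic results, so deduplication matters for result counts. A simple first-seen-wins pass keyed on phone number (falling back to name) is enough; this Python sketch is illustrative, not the actor's Go code:

```python
def dedupe(listings):
    """Drop sponsored repeats: keep the first listing seen per phone number."""
    seen = set()
    unique = []
    for biz in listings:
        key = biz.get("phone") or biz.get("name")
        if key in seen:
            continue  # sponsored repeat of a listing we already have
        seen.add(key)
        unique.append(biz)
    return unique
```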

Tips for scaling

Batch multiple searches in one run. Pass arrays of search terms and locations; the scraper processes each combination in turn, each with a fresh browser session.

Format locations correctly. Use city-state format: new-york-ny, los-angeles-ca, chicago-il. Check YellowPages.com URLs for the exact format.
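A small helper can normalize free-form city names into that slug format. This is a sketch of the convention described above; always verify the result against an actual YellowPages.com URL:

```python
import re

def location_slug(city, state):
    """'New York', 'NY' -> 'new-york-ny' (the city-state format YP URLs use)."""
    city_part = re.sub(r"[^a-z0-9]+", "-", city.lower()).strip("-")
    return f"{city_part}-{state.lower()}"
```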

Skip detail pages unless you need emails. Detail scraping fetches each business page individually -- it's 10x more expensive but the only way to get email addresses.

Export to CSV for CRM import. Apify datasets export directly to CSV. Most CRMs (HubSpot, Salesforce, Pipedrive) import CSV natively.
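If you post-process the JSON before import, writing CRM-ready CSV takes a few lines of standard-library Python (illustrative; the Apify dataset export does this for you):

```python
import csv

def export_leads(leads, path):
    """Write a list of lead dicts to CSV; missing fields become empty cells."""
    fieldnames = sorted({key for lead in leads for key in lead})
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(leads)  # DictWriter fills absent keys with ""
```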

Common mistakes

  • Using datacenter proxies (instant block)
  • Using non-US proxies (also blocked)
  • Parsing CSS selectors instead of JSON-LD (breaks on site changes)
  • Reusing browser sessions across too many pages (triggers detection)
  • Not deduplicating sponsored listings (inflates results with repeats)

Conclusion

YellowPages is a goldmine for B2B lead generation if you can get past Cloudflare. The techniques above work reliably as of March 2026. If you want the data without building the infrastructure, try the pre-built scraper on Apify -- it's pay-per-result with no setup required.


Questions? Open an issue on the actor page or find me on the Apify Discord.
