I spent a week trying to scrape YellowPages.com with Python. Most tutorials online are from 2021-2023 and silently fail now. Here's what I found.
## The tutorials are all broken
If you Google "scrape yellowpages python", you'll find guides using requests + BeautifulSoup. They look clean, they make sense, and they don't work.
```python
import requests
from bs4 import BeautifulSoup

url = "https://www.yellowpages.com/search?search_terms=plumbers&geo_location_terms=Austin%2C+TX"
resp = requests.get(url)
# resp.status_code == 403. Every time.
```
YellowPages.com moved behind Cloudflare sometime in 2023. Every request now passes through a JavaScript challenge. requests can't execute JavaScript, so it gets a 403 or an empty challenge page. Same with httpx, urllib3, or any pure HTTP library.
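Before abandoning an HTTP-only approach, it helps to detect when you've hit the challenge wall rather than a real error. A minimal heuristic, assuming the common markers Cloudflare challenge pages tend to contain (the marker list is my own guess, not guaranteed to be exhaustive or stable):

```python
def looks_like_cloudflare_challenge(status_code: int, body: str) -> bool:
    """Heuristic check: does this response look like a Cloudflare JS
    challenge rather than real page content? The markers below are
    common on challenge pages but may change over time."""
    markers = ("cf-chl", "Just a moment", "challenge-platform")
    return status_code in (403, 503) or any(m in body for m in markers)
```

Logging which of your requests trip this check, instead of silently parsing an empty page, is usually the first clue that a tutorial's approach has stopped working.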
## What about Selenium/Playwright?
Headless browsers can execute the JS challenge:
```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.yellowpages.com/search?search_terms=plumbers&geo_location_terms=Austin%2C+TX")
    page.wait_for_selector(".search-results")
    html = page.content()
    browser.close()
```
This works... sometimes. Three problems kill it at scale:
1. IP blocks. Default Chromium uses your IP. After 10-20 requests, Cloudflare blocks you. You need residential proxies, and specifically US ones -- non-US IPs get rejected.
2. Browser fingerprinting. Headless Chrome has detectable fingerprint differences. Cloudflare catches navigator.webdriver=true, missing plugins, and other signals. You need stealth patches.
3. Session reuse kills extraction. If you navigate to 50 business detail pages in one browser session, Cloudflare starts returning challenge pages instead of data after the 8th or 9th page. You need fresh browser contexts per page.
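The third problem has a cheap mitigation: open a fresh, isolated browser context for every detail page instead of reusing one session. A sketch of that pattern (function name and the timeout value are my choices, not from any particular library's docs):

```python
def fetch_detail_pages(urls):
    """Fetch each URL in its own isolated browser context so cookies
    and challenge state don't accumulate across pages."""
    # Imported lazily so the sketch can be read/tested without a browser installed.
    from playwright.sync_api import sync_playwright

    html_by_url = {}
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        for url in urls:
            context = browser.new_context()  # fresh cookies/storage per page
            page = context.new_page()
            page.goto(url, timeout=30_000)
            html_by_url[url] = page.content()
            context.close()  # throw away the session state
        browser.close()
    return html_by_url
```

Contexts are much cheaper to create than full browser launches, so this costs little compared to the pages it saves.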
## The proxy problem
Even with Playwright + stealth, you need proxies. Not just any proxies -- residential US proxies specifically.
Free proxy lists? Dead within hours. Datacenter proxies? Blocked instantly. Shared residential proxies? Rate-limited because 100 other scrapers are on the same pool.
You'll spend $50-100/month on a residential proxy service just to keep the scraper running. Then you need rotation logic, error handling for burned IPs, and retry logic.
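The rotation/retry part is not much code, but you do have to write it. A minimal sketch, assuming you already have a pool of residential US proxy endpoints (the URLs below are placeholders, and `fetch` is whatever function actually makes the request):

```python
import itertools
import random
import time

# Placeholder endpoints -- substitute your provider's actual proxy URLs.
PROXIES = [
    "http://user:pass@us-res-1.example-proxy.com:8000",
    "http://user:pass@us-res-2.example-proxy.com:8000",
]

def fetch_with_rotation(fetch, url, max_attempts=4):
    """Try `fetch(url, proxy)` through rotating proxies. `fetch` should
    return HTML on success and raise on a block; proxies that fail are
    marked as burned and skipped for the rest of this call."""
    burned = set()
    pool = itertools.cycle(PROXIES)
    for _ in range(max_attempts):
        proxy = next(pool)
        if proxy in burned:
            continue
        try:
            return fetch(url, proxy)
        except Exception:
            burned.add(proxy)                 # treat the IP as burned
            time.sleep(1 + random.random())   # jittered backoff before retrying
    raise RuntimeError(f"all attempts blocked for {url}")
```

A real version would also persist the burned set across runs and distinguish "blocked by Cloudflare" from ordinary network errors, but the shape is the same.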
## What I actually use now
After fighting this for a week, I switched to using a pre-built scraper that handles all of this: Yellow Pages Scraper on Apify.
It handles:
- Cloudflare bypass with residential proxies (built in)
- Stealth browser automation (no fingerprint detection)
- Fresh browser sessions per detail page
- Automatic retries on blocked requests
- Structured output (JSON, CSV, Excel)
You put in search terms + locations and get back clean data:
```json
{
  "name": "Joe's Plumbing LLC",
  "phone": "(512) 555-0142",
  "email": "joe@joesplumbing.com",
  "website": "https://joesplumbing.com",
  "address": "1234 Main St, Austin, TX 78701",
  "rating": 4.5,
  "reviewCount": 47,
  "categories": ["Plumbers", "Water Heaters"],
  "yearsInBusiness": 12
}
```
Cost: roughly half a cent per result, so a run of 100 businesses comes out around $0.50-0.60. Cheaper than maintaining your own proxy subscription.
## When to roll your own
Building your own scraper makes sense if:
- You need custom post-processing that can't be done after export
- You're scraping at massive scale (10K+ results/day) and want to optimize costs
- You need real-time streaming rather than batch results
- You want to learn how browser automation and anti-bot bypass work
If you just need business leads from YellowPages, the pre-built tool saves you a week of proxy debugging.
## Key technical lessons if you do build it yourself
- Use Playwright, not Selenium. Selenium is slower and has a more detectable fingerprint.
- Residential US proxies are mandatory. Budget $50-100/month minimum.
- Fresh browser context per detail page. Session reuse drops email extraction from ~22% to ~3%.
- Parse JSON-LD, not HTML. YellowPages embeds structured data as application/ld+json. It's cleaner and more stable than CSS selectors.
- Respect rate limits. 1-3 seconds between requests. Faster than that triggers immediate blocks.
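The JSON-LD tip deserves a concrete sketch. Pulling `<script type="application/ld+json">` blocks needs only the stdlib parser; the HTML below is a made-up stand-in for a detail page, and the `LocalBusiness` fields in it are illustrative, not a guarantee of what YellowPages actually embeds:

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect and parse the contents of application/ld+json script tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True

    def handle_data(self, data):
        if self._in_jsonld:
            self._buf.append(data)  # script bodies can arrive in chunks

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append(json.loads("".join(self._buf)))
            self._buf = []
            self._in_jsonld = False

# Made-up stand-in for a business detail page.
html = """
<html><head>
<script type="application/ld+json">
{"@type": "LocalBusiness", "name": "Joe's Plumbing LLC", "telephone": "(512) 555-0142"}
</script>
</head><body>...</body></html>
"""

extractor = JsonLdExtractor()
extractor.feed(html)
business = extractor.blocks[0]
print(business["name"])  # -> Joe's Plumbing LLC
```

If you already have BeautifulSoup in the project, `soup.find_all("script", type="application/ld+json")` gets you the same tags with less code; either way you end up parsing JSON instead of chasing CSS classes that change.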
I'm a data engineer building automation tools for lead generation. If you have questions about scraping public business directories, drop a comment.