SIÁN Agency

Posted on • Originally published at apify.com

Why Your Requests + BeautifulSoup Stack Will Fail in Production

TL;DR: requests plus BeautifulSoup is the right tool for tutorials, side projects, and one-off audits. It is the wrong tool for any scraper that has to run unsupervised, longer than a quarter, against a site that has even basic bot defenses. I've watched a dozen teams discover this the expensive way. Here's the diagnosis and the replacement.

I'm not anti-requests. The library is fast, predictable, and elegant. For 30% of scraping tasks it's still what I reach for first. The problem is that the rest of the scraping pipeline — JavaScript-rendered content, fingerprinting checks, modern auth flows, lazy loading — silently breaks the assumptions requests is built on.

Most teams discover this in stages. Here's the timeline.

Month 1 — "It works"

You write the first version. requests.get(url) returns 200, BeautifulSoup parses the response, you find your selectors, you ship. Tests pass against the small URL set you tested with. Lunch.

Month 2 — "Some pages return empty"

You notice maybe 5% of pages return rows where half the fields are None. You add a check, log the URL, retry. The retry sometimes works.

What's actually happening: those pages render their data in JavaScript after the initial response. requests got the HTML skeleton. The data was never in it. The retries that "work" are coincidence — sometimes the cached page has stale rendered data; sometimes a CDN ships a different variant.
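A quick way to confirm this diagnosis is to count the data-bearing elements in the raw response before any JavaScript has run. The sketch below uses only the standard library; the `data-product-id` attribute and both sample documents are assumptions for illustration, so substitute whatever marks a data row on your target.

```python
from html.parser import HTMLParser


class AttrCounter(HTMLParser):
    """Counts start tags that carry a given attribute."""

    def __init__(self, attr_name):
        super().__init__()
        self.attr_name = attr_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if any(name == self.attr_name for name, _ in attrs):
            self.count += 1


def count_in_static_html(html, attr_name="data-product-id"):
    """Return how many elements in the raw HTML carry attr_name."""
    parser = AttrCounter(attr_name.lower())
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical responses: a JS shell and a server-rendered page.
shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
rendered = '<html><body><li data-product-id="1">A</li><li data-product-id="2">B</li></body></html>'

print(count_in_static_html(shell))     # 0
print(count_in_static_html(rendered))  # 2
```

A count of zero on a page that shows products in your browser means the data arrives after the initial response, and no number of retries will change that.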

Month 3 — "We're getting 403s"

The target site rolled out a fingerprinting check. requests sends a default User-Agent that screams python-requests/2.31.0. You add headers. It works for two days. Then they tighten the check: now they look at the TLS fingerprint, not just the User-Agent. requests negotiates TLS through Python's ssl module (OpenSSL underneath), and its handshake looks different from any real browser's. The block returns.

Month 4 — "We need a session, but it's stateful"

Login flow now requires a CSRF token, which is rendered in JavaScript, which requests can't run. You spend two days reverse-engineering the login flow, find the API endpoint behind it, hit that directly. Works for six weeks. They rotate the auth scheme.

Month 5 — "Let's just use Playwright"

You finally migrate. Most of the team is annoyed because the rewrite took longer than they wanted. The team that does it later is annoyed for the same reason.

The teardown

The fundamental issue: requests is an HTTP client. Modern websites are browser applications. The thing you're scraping is the output of running JavaScript, not a static document. You can fight that for a while — by reverse-engineering APIs, faking TLS fingerprints, hand-rolling JS interpreters — but you're paying interest on a debt you took on the day you reached for requests instead of a real browser.

Specific failure modes you're going to hit:

  • JavaScript-rendered content. The HTML you fetch contains <div id="root"></div> and not much else.
  • TLS fingerprinting. requests looks like Python; real browsers look like Chrome/Firefox. Block lists distinguish them easily.
  • Lazy-loading. Data appears in the DOM only after scroll, click, or visibility events. Static fetch never triggers them.
  • Modern auth. OAuth, CSRF tokens injected via JS, cookie-based session validation that requires running scripts.
  • Anti-automation challenges. Cloudflare, PerimeterX, DataDome — all rely on running JavaScript to validate the client.

requests answers none of these. Playwright (or Puppeteer) answers all of them, because Playwright drives a real browser.
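When a requests-based scraper starts failing, it helps to sort each response into one of these buckets before deciding what to fix. The following is a rough triage sketch, not a detection library: the bucket names are my own labels, and `data_marker` is a hypothetical string you expect to see whenever the data is present.

```python
def triage(status, body, data_marker):
    """Map one fetch result to the most likely failure mode.

    status: HTTP status code; body: response text;
    data_marker: a string that must appear when the data is present
    (hypothetical, e.g. an attribute name such as 'data-product-id').
    """
    if status in (401, 403):
        # Fingerprinting blocks and broken auth both surface here.
        return "blocked_or_auth"
    if status == 429:
        return "rate_limited"
    if status >= 500:
        return "server_error"
    if status == 200 and data_marker not in body:
        # JS-rendered content, lazy loading, or a challenge page
        # served with a 200 all end up in this bucket.
        return "data_not_in_static_html"
    if status == 200:
        return "ok"
    return "other"


print(triage(403, "", "data-product-id"))
print(triage(200, '<div id="root"></div>', "data-product-id"))
print(triage(200, '<li data-product-id="1">A</li>', "data-product-id"))
```

Only the rate-limit and server-error buckets are worth retrying with the same client. The block and missing-data buckets are the ones a browser-based fetch is meant to address.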

The replacement pattern

Skip the year of pain. Start with Playwright. Use requests only when you've measured that the data is in the static HTML and the site has no fingerprinting:

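from playwright.async_api import async_playwright


async def extract_fields(page):
    # Hypothetical helper, not part of the original post: return the
    # data-product-id value of every matching element.
    return await page.locator("[data-product-id]").evaluate_all(
        "els => els.map(el => el.getAttribute('data-product-id'))"
    )


async def scrape(url):
    # Call with asyncio.run(scrape(url)).
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        ctx = await browser.new_context(
            user_agent="Mozilla/5.0 (...)",  # placeholder: use a full browser UA string
            viewport={"width": 1920, "height": 1080},
        )
        page = await ctx.new_page()

        # Block heavy resources for speed.
        await page.route("**/*.{png,jpg,jpeg,gif,svg,woff,woff2}",
                         lambda route: route.abort())

        await page.goto(url, wait_until="domcontentloaded")
        # Wait for the *data* to appear, not just the document.
        await page.wait_for_selector("[data-product-id]", timeout=15_000)

        fields = await extract_fields(page)
        await browser.close()
        return fields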

Five things requests can't give you that Playwright does for free:

  1. JavaScript execution — your selectors target rendered DOM, not the source.
  2. Realistic TLS fingerprint — Chromium does this for you.
  3. Cookie/session handling that matches a real browser.
  4. wait_for_selector — semantic waits instead of time.sleep.
  5. Routing controls — block what you don't need, accelerate what you do.
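On the routing point, matching by file extension misses images and fonts served from URLs without one. A possible alternative is to decide on the request's resource type. The sketch below assumes Playwright's async API, where a handler receives a route whose `request.resource_type` is a string such as "image" or "font". `FakeRoute` and `decide` are stand-ins I've added so the logic can be run without launching a browser.

```python
import asyncio

# Resource types to drop; adjust to what the target page can live without.
BLOCKED_TYPES = {"image", "font", "media"}


async def block_heavy(route):
    """Route handler: abort heavy resource types, let the rest through."""
    if route.request.resource_type in BLOCKED_TYPES:
        await route.abort()
    else:
        await route.continue_()


class FakeRequest:
    """Stand-in for a Playwright request (illustration only)."""

    def __init__(self, resource_type):
        self.resource_type = resource_type


class FakeRoute:
    """Stand-in for a Playwright route that records the chosen action."""

    def __init__(self, resource_type):
        self.request = FakeRequest(resource_type)
        self.action = None

    async def abort(self):
        self.action = "abort"

    async def continue_(self):
        self.action = "continue"


def decide(resource_type):
    """Run the handler against a fake route and report what it did."""
    route = FakeRoute(resource_type)
    asyncio.run(block_heavy(route))
    return route.action


print(decide("image"))  # abort
print(decide("xhr"))    # continue
```

In a real scraper the handler would be registered once per page with `await page.route("**/*", block_heavy)`, in place of the extension glob.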

When requests is still right

Static documentation sites. Open RSS/Atom feeds. JSON APIs that don't require login. PDFs and CSVs hosted on S3. Anything where you've actually fetched the URL, looked at the response body, and confirmed your data is in it.

That's a real category. Just don't assume the next site you scrape will fall into it.

Fig. 1 — Failure modes by stack. requests+BS4 hits four walls a real browser doesn't.

Result

Across our actor portfolio, the migration ratio settled around 80/20 — Playwright for 80% of jobs, requests for the 20% where the data is genuinely static. The 80% includes our entire Sephora catalog pipeline, which spent its first version as a requests + BeautifulSoup script and never made it past month 2. The Playwright rewrite has been running unsupervised for 14 months.

If your scraper is currently 100% requests, your sample size isn't "this works fine." Your sample size is "the sites I've scraped so far happen to have static HTML."

Which of the five failure modes have you shipped to production? Drop the symptom in the comments — I'll point at the fix.


Written by Jonas Keller, Senior Automation Architect at SIÁN Agency. Find more from Jonas on dev.to. For custom scraping or automation work, hire SIÁN Agency.
