Vhub Systems
The Web Scraping Stack I Use After Building 35 Apify Actors (Tools, Patterns, Pitfalls)


Three years ago I spent four hours debugging why my scraper was returning empty results. Turned out the site had migrated to a React SPA six weeks earlier and nobody updated the docs I was referencing. The HTML I was parsing was a nearly empty shell: <div id="root"></div>. Beautiful.

That's the kind of thing that forces you to actually develop opinions about tooling rather than just reaching for whatever you used last time. After 35 actors, I've converged on a stack. Not because it's perfect — it absolutely isn't — but because I know exactly where each piece breaks. That's worth more than theoretical elegance.


The Tool Selection Decision Tree

I don't spend much time debating this anymore. It comes down to three questions.

Is the content in the initial HTML response? curl the URL, look at the raw source. If your data is there, you don't need a browser. Use requests + BeautifulSoup. It's faster, cheaper on resources, and dramatically easier to debug. I've run jobs scraping 50,000 product pages with this combo in under 40 minutes on a modest VPS. The moment you introduce a headless browser you're looking at 4-10x slower throughput and a lot more moving parts.
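That curl check can be automated. A minimal sketch using only the standard library — the regex stripping and the 200-character threshold are rough heuristics of mine that you'd tune per site, not a robust HTML parser:

```python
import re

def looks_like_spa_shell(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: if the raw HTML carries almost no visible text,
    the data is probably rendered client-side and you need a browser."""
    # Strip script/style blocks, then all remaining tags, then measure what's left
    stripped = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", " ", stripped)
    visible = re.sub(r"\s+", " ", text).strip()
    return len(visible) < min_text_chars

# An empty React shell triggers the check; a server-rendered page does not
shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
print(looks_like_spa_shell(shell))  # → True
```

Run it against the raw response body before committing to a toolchain; it also makes a decent canary for detecting when a previously static site goes SPA mid-project.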

Does the page render content via JavaScript, require scroll events, or sit behind a login with session cookies? This is Playwright territory. I use the Python bindings. Yes, the async model adds cognitive overhead. Yes, it's slower. But when you need to click "Load More" 47 times or intercept XHR responses to grab the actual API endpoint the frontend is calling — there's no clean alternative. One thing people miss: half the time you can use Playwright once to reverse-engineer the underlying API call, then switch back to plain requests for the actual scraping. Do that whenever you can.
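The reverse-engineering step is mostly filtering captured traffic for JSON endpoints. A sketch of that filter — the (url, content_type) log format is my assumption; with Playwright you would populate it from a response event handler rather than hand-write it:

```python
def json_api_candidates(requests_log):
    """Given (url, content_type) pairs captured from the network traffic,
    keep the ones that look like the frontend's data API."""
    candidates = []
    for url, content_type in requests_log:
        # JSON responses from fetch/XHR are the usual data carriers
        if "application/json" in content_type.lower():
            candidates.append(url)
    return candidates

log = [
    ("https://example.com/static/app.js", "application/javascript"),
    ("https://example.com/api/v2/products?page=1", "application/json; charset=utf-8"),
    ("https://example.com/logo.png", "image/png"),
]
print(json_api_candidates(log))  # → ['https://example.com/api/v2/products?page=1']
```

Once you have a candidate URL, hit it with plain requests and compare the payload against what the page displays — if it matches, the browser's job is done.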

Do you need queue management, auto-scaling, and retry logic baked in without writing it yourself? That's Crawlee. I reach for it when the crawl graph is complex — like spidering an entire e-commerce category tree — or when I'm handing the project off to someone else and I need the infrastructure to be self-explanatory. The opinionated structure is annoying until the day it saves you from a subtle queue duplication bug at 3am.


The Anti-Detection Checklist

I'm going to be direct: most "anti-bot" measures aren't catching sophisticated fingerprinting, they're catching lazy scrapers. Fix the lazy stuff first.

Randomize your User-Agent from a real pool. Not a list of five strings from a Stack Overflow answer from 2019. Maintain a pool of 10-15 actual current browser UAs — Chrome on Windows, Chrome on Mac, Firefox, Edge. Rotate them. A static UA hitting 10,000 requests is a blinking red light.

Set the supporting headers. A real Chrome request doesn't just send User-Agent. It sends Accept-Language, sec-ch-ua, sec-ch-ua-mobile, sec-ch-ua-platform, sec-fetch-dest, sec-fetch-mode, sec-fetch-site. If you're sending a modern Chrome UA but missing all the sec-* headers, you look like a bot wearing a Halloween costume. Match your UA to your sec-ch-ua values — they need to be internally consistent.
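The way I keep this consistent is to rotate whole profiles, never individual headers. A sketch — the version strings below are illustrative placeholders, not a maintained pool; refresh them from real browsers:

```python
import random

# Each profile keeps User-Agent and sec-ch-ua values internally consistent.
HEADER_PROFILES = [
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Chromium";v="125", "Google Chrome";v="125", "Not.A/Brand";v="24"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"Windows"',
    },
    {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36",
        "sec-ch-ua": '"Chromium";v="125", "Google Chrome";v="125", "Not.A/Brand";v="24"',
        "sec-ch-ua-mobile": "?0",
        "sec-ch-ua-platform": '"macOS"',
    },
]

COMMON = {
    "Accept-Language": "en-US,en;q=0.9",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
}

def build_headers():
    """Pick a whole profile at once so UA and sec-ch-ua never contradict each other."""
    return {**random.choice(HEADER_PROFILES), **COMMON}

headers = build_headers()
print(headers["sec-ch-ua-platform"])  # platform always matches the OS in the UA
```

Pass the result straight to requests via `session.get(url, headers=build_headers())` per session, not per request — swapping headers mid-session is its own tell.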

Add jitter to your timing. Not time.sleep(1). That's a metronome. Sites with basic traffic analysis can spot uniform request intervals. Use random.uniform(0.5, 3.0). For high-sensitivity targets I'll add a secondary jitter: time.sleep(random.uniform(0.5, 3.0) + random.gauss(0, 0.3)). It sounds paranoid until you watch your ban rate drop.
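One wrinkle with stacking random.gauss on top of random.uniform: the Gaussian term can go negative, and a negative sleep raises a ValueError. A small helper that clamps — the ranges mirror the numbers above but are tunables, not gospel:

```python
import random

def jittered_delay(low=0.5, high=3.0, gauss_sigma=0.3):
    """Base delay from a uniform range plus a small Gaussian wobble,
    clamped so a negative Gaussian sample can't produce a negative sleep."""
    delay = random.uniform(low, high) + random.gauss(0, gauss_sigma)
    return max(0.0, delay)

# Use time.sleep(jittered_delay()) between requests instead of time.sleep(1)
samples = [jittered_delay() for _ in range(1000)]
print(min(samples) >= 0.0)  # → True
```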

Rotate sessions, not just IPs. This is the one people get wrong. You swap the proxy IP but keep the same session cookie and accept-language header and TLS fingerprint. The site's fraud system doesn't just track IPs — it tracks behavioral sessions. When you rotate IP, rotate the full session context.
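The fix is to treat the session as one atomic bundle. A sketch of the idea — the proxy URLs and field set are hypothetical stand-ins for whatever your provider and target actually require:

```python
import random
import uuid
from dataclasses import dataclass, field

# Hypothetical proxy pool — substitute your provider's endpoints
PROXIES = ["http://proxy-a:8000", "http://proxy-b:8000", "http://proxy-c:8000"]
LANGS = ["en-US,en;q=0.9", "en-GB,en;q=0.8"]

@dataclass
class SessionContext:
    """Everything that should rotate together: swap one field alone
    and the mismatch itself becomes a fingerprint."""
    proxy: str
    accept_language: str
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)

def new_session_context():
    # Fresh IP, fresh headers, fresh cookie jar / session id — all at once
    return SessionContext(
        proxy=random.choice(PROXIES),
        accept_language=random.choice(LANGS),
    )

a, b = new_session_context(), new_session_context()
print(a.session_id != b.session_id)  # → True
```

In practice each SessionContext maps to a fresh requests.Session (new cookie jar) configured with that proxy and those headers; when you retire the IP, you retire the whole object.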


The Error Handling Pattern I Actually Use

Early on I had bespoke retry logic scattered across every scraper. Different timeout values, different backoff logic, inconsistent logging. After the fifth time I copied and modified the same block, I wrote this decorator and stopped thinking about it:

```python
import time
import random
import functools
import logging

def retry_with_backoff(max_retries=4, base_delay=1.0, max_delay=60.0):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    # Out of attempts: log and re-raise instead of swallowing the error
                    if attempt == max_retries - 1:
                        logging.error(f"Failed after {max_retries} attempts: {e}")
                        raise
                    # Exponential backoff, capped at max_delay
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    # Jitter desynchronises workers that all failed at the same moment
                    jitter = random.uniform(0, delay * 0.3)
                    logging.warning(f"Attempt {attempt + 1} failed: {e}. Retrying in {delay + jitter:.1f}s")
                    time.sleep(delay + jitter)
        return wrapper
    return decorator

@retry_with_backoff(max_retries=4, base_delay=2.0)
def fetch_page(url, session):
    response = session.get(url, timeout=15)
    response.raise_for_status()
    return response
```

The exponential backoff is table stakes. The jitter on the backoff delay itself is what matters — without it, if you have multiple workers all failing simultaneously, they all retry at the same time and hammer the target again. Real situation I hit on a 12-worker job: synchronised retry waves were triggering rate limits on their own. Adding jitter to the backoff solved it.

One thing this decorator doesn't do: distinguish between errors worth retrying and errors that aren't. A 404 isn't worth retrying. A 429 or a 503 is. I usually add a status code check inside fetch_page to raise a custom non-retriable exception for 404s, rather than baking it into the decorator.
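Here's the shape of that check. The exception name and status-code sets are mine, and note the caveat in the comment: because the decorator above catches bare Exception, it needs an `except NonRetriableError: raise` clause before the generic handler for this to actually short-circuit:

```python
class NonRetriableError(Exception):
    """Raised for responses that retrying will never fix (e.g. 404)."""

# Retrying a missing page is pointless; a rate limit or server hiccup is worth waiting out
NON_RETRIABLE = {404, 410}
RETRIABLE = {429, 500, 502, 503, 504}

def check_status(status_code):
    if status_code in NON_RETRIABLE:
        raise NonRetriableError(f"Got {status_code}, not retrying")
    if status_code in RETRIABLE:
        raise RuntimeError(f"Got {status_code}, retry with backoff")
    # 2xx falls through; the decorator needs an `except NonRetriableError: raise`
    # clause *before* its generic `except Exception` to honour the distinction.

try:
    check_status(404)
except NonRetriableError as e:
    print(e)  # → Got 404, not retrying
```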


Production Monitoring: What I Actually Watch

I've been burned enough times by "the scraper ran successfully but the data is garbage" that I now treat monitoring as non-negotiable.

Request success rate. Obvious, but define it precisely. I track: total requests, 2xx responses, 4xx responses, 5xx responses, and timeouts separately. A spike in 403s means something different than a spike in 503s. At the end of a run, if my success rate is below 95%, I'm not shipping that data.

Proxy pool health. Track which proxy IPs are returning non-2xx responses at above-average rates and rotate them out proactively. I've had proxy providers deliver pools where 20% of the IPs were already flagged on day one. You only discover this if you're tracking per-IP success rates.
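Both of those checks fall out of one counter structure. A stdlib-only sketch — the class name, bucket labels, and 80% health threshold are my choices, not a fixed convention:

```python
from collections import Counter, defaultdict

class RunMetrics:
    """Track outcomes by response class and by proxy, so a 403 spike
    and a dying IP both show up before the run finishes."""
    def __init__(self):
        self.by_class = Counter()            # "2xx", "4xx", "5xx", "timeout"
        self.by_proxy = defaultdict(Counter)

    def record(self, proxy, status=None, timeout=False):
        bucket = "timeout" if timeout else f"{status // 100}xx"
        self.by_class[bucket] += 1
        self.by_proxy[proxy][bucket] += 1

    def success_rate(self):
        total = sum(self.by_class.values())
        return self.by_class["2xx"] / total if total else 0.0

    def unhealthy_proxies(self, threshold=0.8):
        """Proxies whose own success rate falls below the threshold."""
        bad = []
        for proxy, counts in self.by_proxy.items():
            total = sum(counts.values())
            if total and counts["2xx"] / total < threshold:
                bad.append(proxy)
        return bad

m = RunMetrics()
for _ in range(9):
    m.record("proxy-a", status=200)
m.record("proxy-b", status=403)
print(round(m.success_rate(), 2), m.unhealthy_proxies())  # → 0.9 ['proxy-b']
```

Call record() from inside fetch_page (or the retry decorator) and check success_rate() at the end of the run before shipping the data.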

Output record count vs. expected range. This one catches the silent failures. The scraper runs, exits cleanly, zero errors logged — but instead of 4,800 product records you got 847. Something changed on the target site: pagination broke, a category was removed, a JavaScript render started gating content that was previously static. I set an expected range based on historical runs (with some tolerance for genuine inventory changes) and alert if the output falls outside it.
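The range check itself is a few lines. A sketch, assuming you persist per-run counts somewhere; the 25% tolerance is the knob I mean by "tolerance for genuine inventory changes":

```python
def record_count_ok(count, history, tolerance=0.25):
    """Flag a run whose output falls outside the historical range,
    widened by a tolerance for genuine inventory changes."""
    lo = min(history) * (1 - tolerance)
    hi = max(history) * (1 + tolerance)
    return lo <= count <= hi

history = [4650, 4800, 4920, 4750]  # record counts from previous runs
print(record_count_ok(4800, history))  # → True
print(record_count_ok(847, history))   # → False: something on the site changed
```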

This last check has caught actual site changes I would have shipped as "good data" at least six times.


The Honest Tradeoffs

Playwright adds real overhead. On one job I had to choose between Playwright for full fidelity and a requests-based approach targeting the mobile API endpoint I'd found in the network tab. The API approach was 12x faster and ran on a quarter of the memory. Always look for the underlying API first.

Crawlee is opinionated and that's genuinely annoying when your use case doesn't fit its model cleanly. I've had to work around its abstractions a few times. Still worth it for long-running production crawls.

Proxy costs will surprise you. Budget for them from the start.


Most of the actors I built using this stack are live at apify.com/lanky_quantifier — if you want to see how these patterns look in production without maintaining the infrastructure yourself.


Benchmarks: requests vs Playwright vs Crawlee

I get asked this constantly, so here are real numbers from production runs — not synthetic benchmarks:

| Scenario | Tool | Throughput | Memory | Relative Cost |
| --- | --- | --- | --- | --- |
| Static product pages (50k URLs) | requests + BS4 | ~1,200 pages/min | ~80 MB | 1x |
| SPA with JS rendering | Playwright | ~130 pages/min | ~620 MB | 9x |
| Multi-domain category spider | Crawlee | ~400 pages/min | ~210 MB | 3x |
| API endpoint (reverse-engineered) | requests | ~2,800 req/min | ~55 MB | 0.7x |

The last row is the one worth burning into memory. Finding the underlying API call that a React frontend makes to its backend is almost always worth the 30 minutes of network tab spelunking. You skip the browser entirely, get structured JSON instead of HTML, and run at 2-3x the speed of even plain requests on raw HTML (because JSON parsing is cheaper than DOM parsing).

How to find it: open DevTools → Network tab → filter by Fetch/XHR → reload the page → look for requests returning JSON with the data you need. Nine times out of ten it's right there.

Handling JavaScript-Heavy Pages Without Playwright

Playwright is the right answer when you genuinely need browser behaviour — clicks, scrolls, form fills. But there are three cheaper alternatives worth trying first:

1. Splash (lightweight JS rendering)

A Docker-based headless browser specifically for scraping. Far lower memory footprint than Playwright. Useful when you need basic JS execution but not full browser automation.

2. Pre-rendered HTML via Google Cache

For public, well-indexed pages: https://webcache.googleusercontent.com/search?q=cache:URL. Google's cached version is often fully rendered HTML. Obviously not reliable for fresh data — and Google has been winding this service down, so availability is spotty — but it can still work for historical snapshots.

3. Mobile endpoints

Many sites serve simpler, less JS-heavy markup to mobile user agents. Switch your UA to a mobile string and compare the response. I've had cases where the mobile version rendered static HTML where the desktop version was a full React SPA. Instant 8x throughput improvement.

My decision flow: try the API first → try mobile UA → try Splash → reach for Playwright as last resort.


What does your current scraping stack look like? Specifically curious whether anyone's found a cleaner solution for the "detect if the site went SPA on you" problem — I'm still doing the manual curl check and it's not elegant.


Need a Scraper Without the Stack Complexity?

If this post made you think "I just want the data, not the infrastructure" — fair.

👉 Ready-to-run scrapers on Apify — 30+ actors built with exactly this stack, handling proxies, retries, and monitoring so you don't have to. Pay per compute unit.

Or if you need something custom-built for your specific target site or data pipeline:

📩 vhubsystems@gmail.com | Hire on Upwork
