Over the past year, I've built 34 production web scrapers serving 300+ users with 4,200+ combined runs. Some of these scrapers have been running daily for months without breaking. Others? They died within a week.
Here's everything I learned about keeping scrapers alive in 2026.
## Most Scrapers Break Within Weeks
If you've built a scraper that worked perfectly on Monday and returned empty results by Friday, welcome to the club. The three killers:
1. Selector rot. Sites redesign, A/B test, or just shuffle class names. That `div.product-card-v2__title` you relied on? Gone.
2. Rate limiting. Hammer a site with 100 requests/second and you'll get blocked before your first dataset completes. Most developers underestimate how aggressive modern rate limiting is.
3. Anti-bot systems. Cloudflare, DataDome, PerimeterX, reCAPTCHA — these aren't just CAPTCHAs anymore. They fingerprint your browser, analyze mouse movements, and flag headless Chrome in milliseconds.
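One cheap defense against selector rot is to query a prioritized list of selectors instead of betting everything on one class name. Here's a minimal sketch; `firstMatch` and the stubbed `lookup` function are my own illustration (in a real scraper you'd pass `document.querySelector` as the lookup):

```javascript
// Try a list of selectors in priority order, so one renamed class
// doesn't zero out the whole field. `lookup` stands in for
// document.querySelector so the sketch runs outside a browser.
function firstMatch(lookup, selectors) {
  for (const sel of selectors) {
    const el = lookup(sel);
    if (el != null) return el;
  }
  return null;
}

// Usage with a stubbed DOM lookup:
const fakeDom = { '[data-testid="title"]': 'Widget' };
const title = firstMatch((sel) => fakeDom[sel] ?? null, [
  'div.product-card-v2__title', // brittle, version-suffixed class
  '[data-testid="title"]',      // more stable attribute hook
  'h1',                         // last-resort fallback
]);
```

When the brittle selector dies, the field degrades to the fallback instead of silently returning `undefined` for every item.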
## Why Crawlee + Puppeteer Is My Go-To Stack
After trying Scrapy, Playwright, raw Axios, and half a dozen other tools, I settled on Crawlee with Puppeteer. Here's why:
```javascript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
  maxRequestsPerCrawl: 500,
  maxConcurrency: 5,
  requestHandlerTimeoutSecs: 120,
  async requestHandler({ page, request, enqueueLinks }) {
    // Extract fields inside the page context; optional chaining keeps
    // a missing selector from throwing mid-crawl
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1')?.textContent?.trim(),
        price: document.querySelector('[data-price]')?.textContent?.trim(),
      };
    });
    await Dataset.pushData(data);
  },
});

await crawler.run(['https://example.com/products']);
```
Crawlee handles retries, session management, and request queuing out of the box. When a request fails, it automatically rotates sessions and retries with exponential backoff. That alone saves hundreds of lines of custom code.
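The backoff behavior is worth understanding even if Crawlee gives it to you for free. A sketch of full-jitter exponential backoff (the constants here are my own assumptions, not Crawlee's internals):

```javascript
// Full-jitter exponential backoff: the ceiling doubles each attempt,
// and the actual wait is a random value below it, so retries from
// many clients don't synchronize into bursts.
function backoffMs(attempt, baseMs = 1000, capMs = 60000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// attempt 0 → up to 1s, attempt 3 → up to 8s, attempt 10 → capped at 60s
```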
The real magic is the `RequestQueue`. Instead of managing URLs in arrays or databases, Crawlee persists the queue to disk. If your scraper crashes at item 847 of 2,000, it picks up right where it left off. I've had scrapers survive server restarts mid-crawl without losing a single result.
## Proxy Rotation Strategies That Actually Work
Forget free proxy lists. They're dead on arrival. Here's what actually works in production:
```javascript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://user:pass@residential.provider.com:8000',
  ],
});

const crawler = new PuppeteerCrawler({
  proxyConfiguration,
  sessionPoolOptions: {
    maxPoolSize: 50,
    sessionOptions: {
      maxUsageCount: 10, // retire a session after 10 requests
    },
  },
  preNavigationHooks: [
    async ({ page }) => {
      // Randomize the viewport slightly so every session doesn't
      // share one identical fingerprint
      const width = 1280 + Math.floor(Math.random() * 200);
      const height = 720 + Math.floor(Math.random() * 200);
      await page.setViewport({ width, height });
    },
  ],
});
```
The rules I follow:
- Residential proxies for tough targets. Datacenter IPs get flagged instantly on sites with Cloudflare Bot Management.
- Rotate per session, not per request. Changing IP every request is actually a red flag. Real users keep the same IP for a session.
- Geographic targeting matters. Scraping a US e-commerce site from a Romanian IP? Suspicious. Match your proxy location to the target audience.
- Keep session state. Cookies, localStorage, and fingerprint consistency across requests from the same "user."
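The "rotate per session, not per request" rule boils down to pinning each session to one proxy. A minimal sketch of that mapping (the proxy URLs are placeholders, and Crawlee's session pool handles this for you in practice):

```javascript
// Pin each session to one proxy so the IP stays stable for that
// "user" across requests, instead of rotating on every fetch.
const proxies = [
  'http://user:pass@res1.example.com:8000',
  'http://user:pass@res2.example.com:8000',
];
const sessionProxy = new Map();

function proxyForSession(sessionId) {
  if (!sessionProxy.has(sessionId)) {
    // Round-robin assignment on first sight; random also works
    const idx = sessionProxy.size % proxies.length;
    sessionProxy.set(sessionId, proxies[idx]);
  }
  return sessionProxy.get(sessionId);
}
```

Cookies and fingerprint state should follow the same lifetime: one bundle per session, retired together when the session is retired.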
## Handling Cloudflare, DataDome, and reCAPTCHA
This is where most scrapers die. Here's my playbook:
Cloudflare (Turnstile/Bot Management):
- Use a real browser with full JavaScript execution — no HTTP-only scrapers
- Maintain consistent TLS fingerprints across requests
- Don't strip cookies between requests
- Residential proxies are almost mandatory for sites with Bot Management enabled
- Add realistic delays between actions (1-3 seconds, randomized)
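The "realistic delays" point is the easiest to get wrong: a fixed `sleep(2000)` is itself a bot signal. A jittered delay in the 1-3 second range from the checklist above (helper names are my own):

```javascript
// A randomized delay so action timing never looks periodic.
function randomDelayMs(minMs = 1000, maxMs = 3000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Between page actions:
// await sleep(randomDelayMs());
```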
DataDome:
- This one's brutal. DataDome tracks canvas fingerprints, WebGL renders, and behavioral signals
- I've had success with slower request rates (2-3s delays) and realistic viewport sizes
- Sometimes the answer is: find an API endpoint the mobile app uses instead
reCAPTCHA v3:
- It scores your "humanness" silently. No clicking fire hydrants
- Real browser + realistic behavior patterns (scrolling, mouse movement) keep scores high
- For v2 challenges, third-party solving services work but eat into margins fast
The honest truth: Some sites are not worth scraping directly. My Crunchbase scraper still sits in private because their Cloudflare setup is that aggressive. Know when to pivot to an official API or alternative data source.
## Real Numbers From Production
My most popular scrapers on Apify Store:
| Scraper | Users | Total Runs |
|---|---|---|
| LinkedIn Employee Scraper | 91 | 623 |
| YouTube Transcript Extractor | 40 | 327 |
| TikTok Shop Scraper | — | 294 |
| CoinMarketCap Scraper | — | 239 |
| Google Scholar Scraper | Active | Growing |
| Telegram Channel Scraper | Active | Growing |
Across all 34 scrapers: 300+ active users, 4,200+ total runs, and zero tolerance for flaky results.
The LinkedIn scraper alone taught me more about anti-bot detection than any blog post ever could. LinkedIn rotates their DOM structure regularly, throttles based on account age, and will shadowban scraper accounts within hours if you're not careful.
## 5 Lessons I'd Tell My Past Self
1. **Build for resilience, not speed.** A scraper that runs 50% slower but never breaks is infinitely more valuable than a fast one that dies weekly.
2. **Monitor everything.** I get alerts when success rates drop below 95%. By the time a user reports a bug, I've usually already fixed it.
3. **Anti-detect isn't optional anymore.** In 2024 you could get away with basic Puppeteer. In 2026, every major site has some form of bot detection. Budget for proxies and browser fingerprinting from day one.
4. **Charge for value, not for compute.** My scrapers use pay-per-event pricing. Users pay for results (emails validated, transcripts extracted, profiles scraped), not for server time. This aligns incentives perfectly.
5. **Open source your knowledge, sell your infrastructure.** Sharing how I build scrapers hasn't hurt my business — it's the main reason people find my tools.
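The "monitor everything" lesson reduces to one small check. The 95% threshold is from my setup above; the counters and the alert transport (Slack, email, whatever) are left as assumptions:

```javascript
// Fire an alert when the rolling success rate dips below threshold.
function shouldAlert(succeeded, failed, threshold = 0.95) {
  const total = succeeded + failed;
  if (total === 0) return false; // no runs yet: nothing to alert on
  return succeeded / total < threshold;
}
```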
## Try Them Out
All 27 of my public scrapers are available on the Apify Store. Most have free tiers so you can test before committing.
If you're building scrapers and hitting walls, drop a comment — I've probably hit the same wall and found a way around it.
Building in public. Follow the journey on X/Twitter.