Over the past year, I've built 34 production web scrapers serving 300+ users with 4,200+ combined runs. Some of these scrapers have been running daily for months without breaking. Others? They died within a week.
Here's everything I learned about keeping scrapers alive in 2026.
## Most Scrapers Break Within Weeks
If you've built a scraper that worked perfectly on Monday and returned empty results by Friday, welcome to the club. The three killers:
1. Selector rot. Sites redesign, A/B test, or just shuffle class names. That `div.product-card-v2__title` you relied on? Gone.
2. Rate limiting. Hammer a site with 100 requests/second and you'll get blocked before your first dataset completes. Most developers underestimate how aggressive modern rate limiting is.
3. Anti-bot systems. Cloudflare, DataDome, PerimeterX, reCAPTCHA — these aren't just CAPTCHAs anymore. They fingerprint your browser, analyze mouse movements, and flag headless Chrome in milliseconds.
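One cheap defense against selector rot is to query a prioritized list of selectors instead of betting everything on one class name. Here's a minimal sketch; `firstMatch` and the stubbed `lookup` function are my own illustration (in a real scraper you'd pass `document.querySelector` as the lookup):

```javascript
// Try a list of selectors in priority order, so one renamed class
// doesn't zero out the whole field. `lookup` stands in for
// document.querySelector so the sketch runs outside a browser.
function firstMatch(lookup, selectors) {
  for (const sel of selectors) {
    const el = lookup(sel);
    if (el != null) return el;
  }
  return null;
}

// Usage with a stubbed DOM lookup:
const fakeDom = { '[data-testid="title"]': 'Widget' };
const title = firstMatch((sel) => fakeDom[sel] ?? null, [
  'div.product-card-v2__title', // brittle, version-suffixed class
  '[data-testid="title"]',      // more stable attribute hook
  'h1',                         // last-resort fallback
]);
```

When the brittle selector dies, the field degrades to the fallback instead of silently returning `undefined` for every item.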
## Why Crawlee + Puppeteer Is My Go-To Stack
After trying Scrapy, Playwright, raw Axios, and half a dozen other tools, I settled on Crawlee with Puppeteer. Here's why:
```javascript
import { PuppeteerCrawler, Dataset } from 'crawlee';

const crawler = new PuppeteerCrawler({
  maxRequestsPerCrawl: 500,
  maxConcurrency: 5,
  requestHandlerTimeoutSecs: 120,
  async requestHandler({ page, request, enqueueLinks }) {
    // Extract fields inside the page context; optional chaining keeps
    // a missing selector from throwing mid-crawl
    const data = await page.evaluate(() => {
      return {
        title: document.querySelector('h1')?.textContent?.trim(),
        price: document.querySelector('[data-price]')?.textContent?.trim(),
      };
    });
    await Dataset.pushData(data);
  },
});

await crawler.run(['https://example.com/products']);
```
Crawlee handles retries, session management, and request queuing out of the box. When a request fails, it automatically rotates sessions and retries with exponential backoff. That alone saves hundreds of lines of custom code.
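The backoff behavior is worth understanding even if Crawlee gives it to you for free. A sketch of full-jitter exponential backoff (the constants here are my own assumptions, not Crawlee's internals):

```javascript
// Full-jitter exponential backoff: the ceiling doubles each attempt,
// and the actual wait is a random value below it, so retries from
// many clients don't synchronize into bursts.
function backoffMs(attempt, baseMs = 1000, capMs = 60000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * ceiling);
}

// attempt 0 → up to 1s, attempt 3 → up to 8s, attempt 10 → capped at 60s
```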
The real magic is the `RequestQueue`. Instead of managing URLs in arrays or databases, Crawlee persists the queue to disk. If your scraper crashes at item 847 of 2,000, it picks up right where it left off. I've had scrapers survive server restarts mid-crawl without losing a single result.
## Proxy Rotation Strategies That Actually Work
Forget free proxy lists. They're dead on arrival. Here's what actually works in production:
```javascript
import { PuppeteerCrawler, ProxyConfiguration } from 'crawlee';

const proxyConfiguration = new ProxyConfiguration({
  proxyUrls: [
    'http://user:pass@residential.provider.com:8000',
  ],
});

const crawler = new PuppeteerCrawler({
  proxyConfiguration,
  sessionPoolOptions: {
    maxPoolSize: 50,
    sessionOptions: {
      maxUsageCount: 10, // retire a session after 10 requests
    },
  },
  preNavigationHooks: [
    async ({ page }) => {
      // Randomize the viewport slightly so every session doesn't
      // share one identical fingerprint
      const width = 1280 + Math.floor(Math.random() * 200);
      const height = 720 + Math.floor(Math.random() * 200);
      await page.setViewport({ width, height });
    },
  ],
});
```
The rules I follow:
- Residential proxies for tough targets. Datacenter IPs get flagged instantly on sites with Cloudflare Bot Management.
- Rotate per session, not per request. Changing IP every request is actually a red flag. Real users keep the same IP for a session.
- Geographic targeting matters. Scraping a US e-commerce site from a Romanian IP? Suspicious. Match your proxy location to the target audience.
- Keep session state. Cookies, localStorage, and fingerprint consistency across requests from the same "user."
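The "rotate per session, not per request" rule boils down to pinning each session to one proxy. A minimal sketch of that mapping (the proxy URLs are placeholders, and Crawlee's session pool handles this for you in practice):

```javascript
// Pin each session to one proxy so the IP stays stable for that
// "user" across requests, instead of rotating on every fetch.
const proxies = [
  'http://user:pass@res1.example.com:8000',
  'http://user:pass@res2.example.com:8000',
];
const sessionProxy = new Map();

function proxyForSession(sessionId) {
  if (!sessionProxy.has(sessionId)) {
    // Round-robin assignment on first sight; random also works
    const idx = sessionProxy.size % proxies.length;
    sessionProxy.set(sessionId, proxies[idx]);
  }
  return sessionProxy.get(sessionId);
}
```

Cookies and fingerprint state should follow the same lifetime: one bundle per session, retired together when the session is retired.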
## Handling Cloudflare, DataDome, and reCAPTCHA
This is where most scrapers die. Here's my playbook:
Cloudflare (Turnstile/Bot Management):
- Use a real browser with full JavaScript execution — no HTTP-only scrapers
- Maintain consistent TLS fingerprints across requests
- Don't strip cookies between requests
- Residential proxies are almost mandatory for sites with Bot Management enabled
- Add realistic delays between actions (1-3 seconds, randomized)
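The "realistic delays" point is the easiest to get wrong: a fixed `sleep(2000)` is itself a bot signal. A jittered delay in the 1-3 second range from the checklist above (helper names are my own):

```javascript
// A randomized delay so action timing never looks periodic.
function randomDelayMs(minMs = 1000, maxMs = 3000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs));
}

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Between page actions:
// await sleep(randomDelayMs());
```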
DataDome:
- This one's brutal. DataDome tracks canvas fingerprints, WebGL renders, and behavioral signals
- I've had success with slower request rates (2-3s delays) and realistic viewport sizes
- Sometimes the answer is: find an API endpoint the mobile app uses instead
reCAPTCHA v3:
- It scores your "humanness" silently. No clicking fire hydrants
- Real browser + realistic behavior patterns (scrolling, mouse movement) keep scores high
- For v2 challenges, third-party solving services work but eat into margins fast
The honest truth: Some sites are not worth scraping directly. My Crunchbase scraper still sits in private because their Cloudflare setup is that aggressive. Know when to pivot to an official API or alternative data source.
## Real Numbers From Production
My most popular scrapers on Apify Store:
| Scraper | Users | Total Runs |
|---|---|---|
| LinkedIn Employee Scraper | 91 | 623 |
| YouTube Transcript Extractor | 40 | 327 |
| TikTok Shop Scraper | — | 294 |
| CoinMarketCap Scraper | — | 239 |
| Google Scholar Scraper | Active | Growing |
| Telegram Channel Scraper | Active | Growing |
Across all 34 scrapers: 300+ active users, 4,200+ total runs, and zero tolerance for flaky results.
The LinkedIn scraper alone taught me more about anti-bot detection than any blog post ever could. LinkedIn rotates their DOM structure regularly, throttles based on account age, and will shadowban scraper accounts within hours if you're not careful.
## 5 Lessons I'd Tell My Past Self
1. **Build for resilience, not speed.** A scraper that runs 50% slower but never breaks is infinitely more valuable than a fast one that dies weekly.
2. **Monitor everything.** I get alerts when success rates drop below 95%. By the time a user reports a bug, I've usually already fixed it.
3. **Anti-detect isn't optional anymore.** In 2024 you could get away with basic Puppeteer. In 2026, every major site has some form of bot detection. Budget for proxies and browser fingerprinting from day one.
4. **Charge for value, not for compute.** My scrapers use pay-per-event pricing. Users pay for results (emails validated, transcripts extracted, profiles scraped), not for server time. This aligns incentives perfectly.
5. **Open source your knowledge, sell your infrastructure.** Sharing how I build scrapers hasn't hurt my business — it's the main reason people find my tools.
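The "monitor everything" lesson reduces to one small check. The 95% threshold is from my setup above; the counters and the alert transport (Slack, email, whatever) are left as assumptions:

```javascript
// Fire an alert when the rolling success rate dips below threshold.
function shouldAlert(succeeded, failed, threshold = 0.95) {
  const total = succeeded + failed;
  if (total === 0) return false; // no runs yet: nothing to alert on
  return succeeded / total < threshold;
}
```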
## Try Them Out
All 27 of my public scrapers are available on the Apify Store. Most have free tiers so you can test before committing.
If you're building scrapers and hitting walls, drop a comment — I've probably hit the same wall and found a way around it.
Building in public. Follow the journey on X/Twitter.