I spent 3 months scraping data for market research projects. My client wanted pricing data from 50 e-commerce sites — updated daily.
Bright Data quoted me $500/month for residential proxies.
I'm a solo developer. That's my entire monthly tool budget.
So I built the entire pipeline with free tools. Here's exactly what I used, with code you can copy.
1. Crawlee — The Swiss Army Knife
Crawlee is what I wish existed 5 years ago. It handles anti-bot detection, proxy rotation, and browser fingerprinting — all built-in.
```javascript
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 100,
    async requestHandler({ page, request, enqueueLinks }) {
        const title = await page.title();
        const price = await page.$eval('.price', (el) => el.textContent);
        console.log(`${request.url} [${title}]: ${price}`);
        await enqueueLinks({ globs: ['https://example.com/product/*'] });
    },
});

await crawler.run(['https://example.com/products']);
```
Why it replaces Bright Data: Crawlee's built-in SessionPool rotates browser fingerprints automatically. For most sites, you don't need residential proxies at all — the fingerprint rotation alone gets you past basic bot detection.
Best for: JavaScript-rendered pages, sites with moderate anti-bot protection.
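To make the session-rotation idea concrete: the core trick is retiring a session (and its fingerprint) after too many uses or errors, so blocked identities never get reused. Here's a toy Python sketch of that idea — my own simplification, not Crawlee's actual SessionPool code:

```python
import itertools
import random


class ToySessionPool:
    """Toy sketch of session/fingerprint rotation (not Crawlee's real API)."""

    def __init__(self, size=10, max_uses=30, max_errors=3):
        self._ids = itertools.count()
        self.max_uses = max_uses
        self.max_errors = max_errors
        self.sessions = [self._new_session() for _ in range(size)]

    def _new_session(self):
        # In Crawlee each session carries a generated browser fingerprint;
        # here we just fake one with a random number.
        return {"id": next(self._ids), "fingerprint": random.random(),
                "uses": 0, "errors": 0}

    def get(self):
        session = random.choice(self.sessions)
        session["uses"] += 1
        self._maybe_retire(session)
        return session

    def mark_error(self, session):
        session["errors"] += 1
        self._maybe_retire(session)

    def _maybe_retire(self, session):
        # Retire burned sessions so future requests get a fresh identity.
        if session["uses"] >= self.max_uses or session["errors"] >= self.max_errors:
            if session in self.sessions:
                self.sessions.remove(session)
                self.sessions.append(self._new_session())
```

The real SessionPool also persists cookies per session and scores sessions by success rate, but the retire-and-replace loop above is the part that defeats per-identity blocking.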
2. Scrapy + scrapy-playwright — The Production Workhorse
If you're scraping at scale (10K+ pages/day), Scrapy is still king. Add scrapy-playwright for JavaScript-heavy sites.
```python
import scrapy


class PriceSpider(scrapy.Spider):
    name = 'prices'

    custom_settings = {
        'DOWNLOAD_HANDLERS': {
            'https': 'scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler',
        },
        # scrapy-playwright requires the asyncio reactor
        'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor',
        'CONCURRENT_REQUESTS': 16,
        'AUTOTHROTTLE_ENABLED': True,
    }

    def start_requests(self):
        with open('urls.txt') as f:
            urls = f.read().splitlines()
        for url in urls:
            yield scrapy.Request(url, meta={'playwright': True})

    def parse(self, response):
        yield {
            'url': response.url,
            'price': response.css('.price::text').get(),
            'name': response.css('h1::text').get(),
        }
```
Why it replaces Bright Data: Scrapy's AutoThrottle + concurrent requests handle rate limiting intelligently. Combined with free proxy lists from free-proxy-list.net, I scrape 50K pages/day without paying a cent.
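The "free proxy list" part deserves a sketch, because free proxies die constantly — you need rotation plus failure tracking, not just a static list. A minimal helper (class name and thresholds are my own, not a Scrapy API):

```python
import random


class ProxyRotator:
    """Rotate through a proxy list, dropping proxies that keep failing."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures

    def get(self):
        if not self.proxies:
            raise RuntimeError("all proxies exhausted - refresh the list")
        return random.choice(self.proxies)

    def report_failure(self, proxy):
        self.failures[proxy] += 1
        # A proxy that fails repeatedly is dead or banned; stop using it.
        if self.failures[proxy] >= self.max_failures and proxy in self.proxies:
            self.proxies.remove(proxy)


# Wiring it into Scrapy is one line per request:
# yield scrapy.Request(url, meta={'proxy': rotator.get(), 'playwright': True})
```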
Best for: Large-scale scraping, data pipelines, production systems.
3. Playwright Stealth — When Sites Fight Back
Some sites (think Amazon, LinkedIn, Zillow) use advanced bot detection. playwright-stealth patches Playwright to look like a real browser.
```python
from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    stealth_sync(page)  # patch the fingerprint before navigating
    page.goto('https://example.com/protected-page')

    # Now you look like a real Chrome user
    for item in page.query_selector_all('.product-card'):
        print(item.inner_text())

    browser.close()
Why it replaces Bright Data: Bright Data's "unlocker" is essentially a managed version of stealth browsing + proxy rotation. With playwright-stealth, you get the stealth part for free. Combine with rotating free proxies and you cover 80% of use cases.
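What does stealth actually patch? Detectors look for classic headless tells: `navigator.webdriver` set to true, an empty plugin list, and a "HeadlessChrome" user agent string — playwright-stealth overrides exactly these signals. A toy illustration of the detection side (field names are my own simplification):

```python
def looks_like_bot(fingerprint: dict) -> bool:
    """Toy version of the checks a bot detector runs against a browser."""
    return (
        fingerprint.get("webdriver") is True           # automation flag set
        or fingerprint.get("plugin_count", 0) == 0     # headless ships no plugins
        or "HeadlessChrome" in fingerprint.get("user_agent", "")
    )
```

Real detectors (Cloudflare, DataDome and friends) check far more — canvas hashes, TLS fingerprints, timing — but a surprising number of sites stop at checks like these, which is why a stealth patch alone gets you in.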
Best for: Sites with fingerprinting detection, login-required scraping.
4. Apify Free Tier — Cloud Scraping Without Infrastructure
Apify gives you $5/month free credits — enough to run lightweight scrapers. I use it for scheduled daily scrapes that would otherwise need a VPS.
```javascript
import { Actor } from 'apify';
import { chromium } from 'playwright';

await Actor.init();

const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://example.com');

const results = await page.$$eval('.item', (items) =>
    items.map((item) => ({
        name: item.querySelector('.name')?.textContent,
        price: item.querySelector('.price')?.textContent,
    }))
);

await Actor.pushData(results);
await browser.close();
await Actor.exit();
```
Why it replaces Bright Data: For small-to-medium projects, Apify's free tier + their proxy infrastructure handles everything. You get managed browser instances, automatic retries, and result storage — no server setup.
Best for: Scheduled scrapes, small datasets (<10K items), MVPs.
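One practical note before you push data anywhere: scraped prices arrive as strings like `"$1,299.99"`, and storing them raw makes every downstream comparison painful. I normalize first — a small helper of my own (nothing Apify-specific), shown in Python since that's where my pipelines live:

```python
import re


def parse_price(text):
    """Extract a float from a scraped price string, or None if absent."""
    if not text:
        return None
    # Grab the first number, allowing thousands separators and decimals.
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None
```

This also gives you a clean place to flag pages where the price selector matched nothing — `None` in the output means the site changed its markup.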
5. curl-impersonate — The Lightweight Option
Sometimes you don't need a browser at all. curl-impersonate makes HTTP requests that look identical to Chrome or Firefox.
```bash
# Install
brew install curl-impersonate

# Scrape like Chrome
curl_chrome116 'https://example.com/api/products' \
  -H 'Accept: application/json' | jq '.products[].price'
```
Or in Python:
```python
from curl_cffi import requests

response = requests.get(
    'https://example.com/api/products',
    impersonate='chrome',
)

for product in response.json()['products']:
    print(f"{product['name']}: ${product['price']}")
```
Why it replaces Bright Data: Many APIs that block Python's `requests` or Node's `axios` respond normally to curl-impersonate, because its TLS and HTTP/2 fingerprints match a real browser's. It's also orders of magnitude faster than browser scraping and uses almost no resources. For API-based sites, this is all you need.
Best for: API scraping, speed-critical tasks, resource-constrained environments.
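Whatever transport you use, free-tier scraping means flaky responses, so I wrap every call in a retry helper with exponential backoff. A generic sketch of my own (not part of curl-impersonate or curl_cffi):

```python
import time


def with_retry(fetch, retries=3, backoff=0.5):
    """Call fetch(); on exception, retry with exponential backoff."""
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts, surface the error
            time.sleep(backoff * (2 ** attempt))


# Usage with curl_cffi (illustrative):
# data = with_retry(lambda: requests.get(url, impersonate='chrome').json())
```

Passing a zero-argument callable keeps the helper transport-agnostic — the same wrapper works around curl_cffi, a Playwright page load, or anything else that can raise.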
When You Actually Need Bright Data
Let me be honest — these free tools don't cover everything:
- Massive scale (100K+ requests/day) → you'll eventually need paid proxies
- Sites with CAPTCHA on every request → no free solution handles this well
- Real-time data from heavily protected sites → residential proxies are the only reliable option
But for 80% of scraping projects? These five tools are all you need.
My Stack (What I Actually Use Daily)
| Task | Tool | Cost |
|---|---|---|
| JS-heavy sites | Crawlee | Free |
| Large datasets | Scrapy + scrapy-playwright | Free |
| Anti-bot sites | playwright-stealth | Free |
| Scheduled scrapes | Apify free tier | $0-5/mo |
| API scraping | curl-impersonate | Free |
| **Total** | | **$0-5/mo** |
Compare that to Bright Data's $500/month minimum.
Want a complete list of 100+ web scraping tools? I maintain an open-source collection: awesome-web-scraping-2026 — frameworks, proxies, anti-detection, and cloud platforms. All free.
What tools do you use for web scraping? Have you found anything better? Drop a comment below 👇
If you're building scraping pipelines and need help, I'm available for consulting — check my profile for contact details.