Six months ago my Python scraper was consuming 800MB RAM to process 50k pages/day and timing out on large jobs. I rewrote the core in Rust. Here is what changed and whether it was worth it.
The Problem With Python Scrapers at Scale
Python web scraping works great until it does not:
- Memory: Python objects have 5-10x overhead vs raw data size. Parsing 1MB of HTML creates ~8MB of Python objects.
- GIL: The Global Interpreter Lock means CPU-bound parsing cannot use multiple cores effectively.
- Speed: Python HTML parsing (BeautifulSoup) is slow — typically 10-50ms per page depending on complexity.
- Concurrency: asyncio helps with I/O-bound waiting but does not help CPU-bound parsing.
For 50k pages/day on a cheap VPS, I was hitting OOM kills every few hours.
What Rust Gives You
The same scraper, rewritten with tokio, reqwest, and scraper:

```rust
use reqwest::Client;
use scraper::{Html, Selector};
use std::sync::Arc;
use tokio::sync::Semaphore;

// The fields we extract from each page.
#[derive(Debug)]
struct PageData {
    title: String,
    price: String,
}

// Errors crossing tokio::spawn must be Send + Sync.
type BoxError = Box<dyn std::error::Error + Send + Sync>;

#[tokio::main]
async fn main() -> Result<(), BoxError> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .build()?;

    // Semaphore limits concurrent requests (rate limiting)
    let semaphore = Arc::new(Semaphore::new(10));

    let urls = vec![
        "https://example.com/page1",
        "https://example.com/page2",
        // ... thousands more
    ];

    let mut handles = vec![];
    for url in urls {
        let client = client.clone();
        let sem = semaphore.clone();
        handles.push(tokio::spawn(async move {
            let _permit = sem.acquire().await.unwrap();
            let resp = client.get(url).send().await?;
            let html = resp.text().await?;
            parse_page(&html)
        }));
    }

    for handle in handles {
        if let Ok(Ok(data)) = handle.await {
            // process data
            let _ = data;
        }
    }
    Ok(())
}

fn parse_page(html: &str) -> Result<PageData, BoxError> {
    let document = Html::parse_document(html);
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();

    let title = document
        .select(&title_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();
    let price = document
        .select(&price_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    Ok(PageData { title, price })
}
```
Benchmark: Python vs Rust
Test: parse 10,000 product pages (HTML ~150KB each), extract title + price + description + 5 fields.
| Metric | Python (asyncio + BS4) | Rust (tokio + scraper) |
|---|---|---|
| Total time | 847 seconds | 71 seconds |
| Peak RAM | 780MB | 38MB |
| CPU usage | 95% (1 core) | 340% (3.4 cores) |
| Throughput | 11.8 pages/sec | 140 pages/sec |
| Parse time/page | 48ms avg | 4.2ms avg |
11.8x faster, 20x less RAM.
The Rust Web Scraping Ecosystem in 2026
HTTP: reqwest
The standard. Async, supports HTTP/2, connection pooling, cookie jars, proxy support. Mirrors Python requests API closely.
```toml
[dependencies]
reqwest = { version = "0.12", features = ["json", "cookies", "gzip"] }
tokio = { version = "1", features = ["full"] }
```
HTML Parsing: scraper
Built on html5ever (the HTML parser from the Servo project). Implements CSS selectors. Fast.
For JavaScript Sites: chromiumoxide
Controls a real Chrome instance from Rust via the Chrome DevTools Protocol, comparable to Puppeteer or Playwright. Slightly more verbose, but with no Node.js layer in between.
```rust
use chromiumoxide::{Browser, BrowserConfig};
use futures::StreamExt;

let (browser, mut handler) = Browser::launch(BrowserConfig::builder().build()?).await?;
// The handler drives the CDP connection and must be polled continuously.
tokio::spawn(async move { while handler.next().await.is_some() {} });

let page = browser.new_page("https://example.com").await?;
page.wait_for_navigation().await?;
let content = page.content().await?;
```
Proxy Rotation
A reqwest proxy is attached when the Client is built (via Proxy::all()), not per request, so rotation means picking a fresh proxy each time you build a client:

```rust
fn get_proxy(proxies: &[String]) -> reqwest::Proxy {
    // Pick a random proxy; assumes the pool is non-empty and the URLs are valid.
    let proxy_url = &proxies[rand::random::<usize>() % proxies.len()];
    reqwest::Proxy::all(proxy_url).unwrap()
}
```
Where Rust Wins and Where Python Still Wins
Rust wins:
- High-volume scraping (>10k pages/day)
- Memory-constrained environments (cheap VPS)
- CPU-bound parsing workloads
- Long-running background scrapers
Python still wins:
- Rapid prototyping (10x faster development)
- Dynamic site scraping (Playwright ecosystem is richer)
- One-off scripts and experiments
- When Playwright is needed (Rust's browser automation is less mature)
My Current Setup
I use both. Python for prototyping and dynamic-site scraping. Rust for the production data pipelines that run 24/7.
The pre-built scraper toolkit I use is Python-based (faster iteration, richer Playwright ecosystem): production-ready scrapers with async httpx, Playwright stealth patches, proxy rotation, and error handling. For when you need working scrapers today, not a Rust rewrite.
Running scrapers in production? What language/stack are you using and what is your throughput?