Six months ago my Python scraper was consuming 800MB RAM to process 50k pages/day and timing out on large jobs. I rewrote the core in Rust. Here is what changed and whether it was worth it.
The Problem With Python Scrapers at Scale
Python web scraping works great until it does not:
- Memory: Python objects have 5-10x overhead vs raw data size. Parsing 1MB of HTML creates ~8MB of Python objects.
- GIL: The Global Interpreter Lock means CPU-bound parsing cannot use multiple cores effectively.
- Speed: Python HTML parsing (BeautifulSoup) is slow — typically 10-50ms per page depending on complexity.
- Concurrency: asyncio helps with I/O-bound waiting but does not help CPU-bound parsing.
For 50k pages/day on a cheap VPS, I was hitting OOM kills every few hours.
What Rust Gives You
The same scraper, rewritten with tokio, reqwest, and scraper:

```rust
use reqwest::Client;
use scraper::{Html, Selector};
use std::sync::Arc;
use tokio::sync::Semaphore;

// The fields we extract from each page.
#[derive(Debug)]
struct PageData {
    title: String,
    price: String,
}

// Errors crossing tokio::spawn must be Send + Sync.
type BoxError = Box<dyn std::error::Error + Send + Sync>;

#[tokio::main]
async fn main() -> Result<(), BoxError> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36")
        .build()?;

    // Semaphore limits concurrent requests (rate limiting)
    let semaphore = Arc::new(Semaphore::new(10));

    let urls = vec![
        "https://example.com/page1",
        "https://example.com/page2",
        // ... thousands more
    ];

    let mut handles = vec![];
    for url in urls {
        let client = client.clone();
        let sem = semaphore.clone();
        handles.push(tokio::spawn(async move {
            let _permit = sem.acquire().await.unwrap();
            let resp = client.get(url).send().await?;
            let html = resp.text().await?;
            parse_page(&html)
        }));
    }

    for handle in handles {
        if let Ok(Ok(data)) = handle.await {
            // process data
            let _ = data;
        }
    }
    Ok(())
}

fn parse_page(html: &str) -> Result<PageData, BoxError> {
    let document = Html::parse_document(html);
    let title_sel = Selector::parse("h1.product-title").unwrap();
    let price_sel = Selector::parse("span.price").unwrap();

    let title = document
        .select(&title_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();
    let price = document
        .select(&price_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    Ok(PageData { title, price })
}
```
Benchmark: Python vs Rust
Test: parse 10,000 product pages (HTML ~150KB each), extract title + price + description + 5 fields.
| Metric | Python (asyncio + BS4) | Rust (tokio + scraper) |
|---|---|---|
| Total time | 847 seconds | 71 seconds |
| Peak RAM | 780MB | 38MB |
| CPU usage | 95% (1 core) | 340% (3.4 cores) |
| Throughput | 11.8 pages/sec | 140 pages/sec |
| Parse time/page | 48ms avg | 4.2ms avg |
11.8x faster, 20x less RAM.
The Rust Web Scraping Ecosystem in 2026
HTTP: reqwest
The standard. Async, supports HTTP/2, connection pooling, cookie jars, proxy support. Mirrors Python requests API closely.
```toml
[dependencies]
reqwest = { version = "0.12", features = ["json", "cookies", "gzip"] }
tokio = { version = "1", features = ["full"] }
```
HTML Parsing: scraper
Built on html5ever (the HTML parser from the Servo project). Implements CSS selectors. Fast.
For JavaScript Sites: chromiumoxide
Controls a real Chrome instance from Rust via the Chrome DevTools Protocol, comparable to Puppeteer or Playwright. Slightly more verbose, but with no Node.js layer in between.
```rust
use chromiumoxide::{Browser, BrowserConfig};
use futures::StreamExt;

let (browser, mut handler) = Browser::launch(BrowserConfig::builder().build()?).await?;
// The handler drives the CDP connection and must be polled continuously.
tokio::spawn(async move { while handler.next().await.is_some() {} });

let page = browser.new_page("https://example.com").await?;
page.wait_for_navigation().await?;
let content = page.content().await?;
```
Proxy Rotation
A reqwest proxy is attached when the Client is built (via Proxy::all()), not per request, so rotation means picking a fresh proxy each time you build a client:

```rust
fn get_proxy(proxies: &[String]) -> reqwest::Proxy {
    // Pick a random proxy; assumes the pool is non-empty and the URLs are valid.
    let proxy_url = &proxies[rand::random::<usize>() % proxies.len()];
    reqwest::Proxy::all(proxy_url).unwrap()
}
```
Where Rust Wins and Where Python Still Wins
Rust wins:
- High-volume scraping (>10k pages/day)
- Memory-constrained environments (cheap VPS)
- CPU-bound parsing workloads
- Long-running background scrapers
Python still wins:
- Rapid prototyping (10x faster development)
- Dynamic site scraping (Playwright ecosystem is richer)
- One-off scripts and experiments
- When Playwright is needed (Rust's browser automation is less mature)
My Current Setup
I use both. Python for prototyping and dynamic-site scraping. Rust for the production data pipelines that run 24/7.
The pre-built scraper toolkit I use is Python-based (faster iteration, richer Playwright ecosystem): production-ready scrapers with async httpx, Playwright stealth patches, proxy rotation, and error handling. For when you need working scrapers today, not a Rust rewrite.
Running scrapers in production? What language/stack are you using and what is your throughput?