DEV Community

agenthustler

How to Build a Web Scraper in Rust: Performance and Safety

While Python dominates web scraping, Rust offers compelling advantages: zero-cost abstractions, memory safety, and raw performance. Here's how to build a production-grade scraper in Rust.

Why Rust for Web Scraping?

  • 10-100x faster than Python for CPU-bound parsing
  • Memory safe without garbage collection
  • Concurrent by design with async/await and Tokio
  • Single binary deployment — no dependency hell
  • Low memory footprint — crucial for large-scale scraping

Setting Up

cargo new web_scraper && cd web_scraper

Add to Cargo.toml:

[dependencies]
reqwest = { version = "0.12", features = ["json"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.20"
futures = "0.3"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
csv = "1.3"

Basic HTTP Scraper

use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;
use std::error::Error;

#[derive(Debug, Serialize)]
struct Article {
    title: String,
    url: String,
    points: u32,
}

async fn scrape_hn(client: &Client) -> Result<Vec<Article>, Box<dyn Error>> {
    let response = client
        .get("https://news.ycombinator.com")
        .header("User-Agent", "RustScraper/1.0")
        .send().await?
        .text().await?;

    let document = Html::parse_document(&response);
    let title_selector = Selector::parse(".titleline > a").unwrap();
    let score_selector = Selector::parse(".score").unwrap();

    let mut articles = Vec::new();
    let titles: Vec<_> = document.select(&title_selector).collect();
    let scores: Vec<_> = document.select(&score_selector).collect();

    // Note: HN rows without a score (e.g. job posts) have no ".score" span,
    // so pairing titles and scores by index can drift on such pages.
    for (i, title_el) in titles.iter().enumerate() {
        let title = title_el.text().collect::<String>();
        let url = title_el.value().attr("href").unwrap_or("").to_string();
        let points = scores.get(i)
            .map(|s| s.text().collect::<String>()
                .replace(" points", "").trim()
                .parse::<u32>().unwrap_or(0))
            .unwrap_or(0);
        articles.push(Article { title, url, points });
    }
    Ok(articles)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = Client::new();
    let articles = scrape_hn(&client).await?;
    for a in &articles {
        println!("{} ({} pts) - {}", a.title, a.points, a.url);
    }
    Ok(())
}

Concurrent Scraping with Tokio

use tokio::time::{sleep, Duration};
use futures::future::join_all;

async fn scrape_page(client: &Client, url: &str) -> Result<String, Box<dyn Error + Send + Sync>> {
    // Polite delay so we don't hammer the target host.
    sleep(Duration::from_millis(500)).await;
    let response = client.get(url).send().await?.text().await?;
    Ok(response)
}

// Box<dyn Error> alone is not Send, so a future holding it can't cross
// tokio::spawn; the `+ Send + Sync` bounds fix that compile error.
async fn scrape_multiple(urls: Vec<String>) -> Vec<Result<String, Box<dyn Error + Send + Sync>>> {
    let client = Client::new();
    let sem = std::sync::Arc::new(tokio::sync::Semaphore::new(5));

    let tasks: Vec<_> = urls.into_iter().map(|url| {
        let client = client.clone();
        let sem = sem.clone();
        tokio::spawn(async move {
            // The permit is held until dropped, capping us at 5 concurrent fetches.
            let _permit = sem.acquire().await.unwrap();
            scrape_page(&client, &url).await
        })
    }).collect();

    // The outer unwrap only panics if a spawned task itself panicked.
    join_all(tasks).await.into_iter().map(|r| r.unwrap()).collect()
}
}

Error Handling and Retries

async fn fetch_with_retry(
    client: &Client, url: &str, max_retries: u32
) -> Result<String, Box<dyn Error>> {
    let mut retries = 0;
    loop {
        match client.get(url).send().await {
            Ok(r) if r.status().is_success() => return Ok(r.text().await?),
            Ok(r) if r.status() == 429 => {
                let wait = Duration::from_secs(2u64.pow(retries));
                eprintln!("Rate limited, waiting {:?}", wait);
                sleep(wait).await;
            }
            Ok(r) => eprintln!("HTTP {}", r.status()),
            Err(e) => eprintln!("Failed: {}", e),
        }
        retries += 1;
        if retries >= max_retries {
            return Err("Max retries exceeded".into());
        }
    }
}
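The `2^retries` backoff above can stampede: when many workers get rate limited at once, they all retry on the same schedule. A capped, jittered variant helps spread retries out. This is a plain-stdlib sketch (a real scraper would use the `rand` crate for jitter):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Exponential backoff capped at `max_secs`, plus up to 50% pseudo-random jitter.
fn backoff(retries: u32, max_secs: u64) -> Duration {
    // checked_pow avoids overflow for large retry counts; min applies the cap.
    let base = 2u64.checked_pow(retries).unwrap_or(u64::MAX).min(max_secs);
    // Cheap jitter source: sub-second nanos of the current clock reading.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    let jitter_ms = (nanos % 1000) * base / 2; // 0 .. base/2 seconds, in ms
    Duration::from_secs(base) + Duration::from_millis(jitter_ms)
}
```

Swapping `Duration::from_secs(2u64.pow(retries))` for `backoff(retries, 60)` in `fetch_with_retry` bounds the worst-case wait at 60-90 seconds instead of growing without limit.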

Rust vs Python: Performance

| Metric              | Python      | Rust             |
|---------------------|-------------|------------------|
| Parse 10K pages     | ~45s        | ~2s              |
| Memory usage        | ~500 MB     | ~30 MB           |
| Concurrent requests | GIL limited | True parallelism |
| Binary size         | N/A         | ~5 MB            |

When to Use Rust vs Python

Use Rust when: parsing millions of pages, memory is constrained, you need a single binary, CPU-bound parsing is the bottleneck.

Use Python when: rapid prototyping, complex browser automation needed, team familiarity matters, scraping logic changes often.

Scaling Rust Scrapers

Even Rust scrapers need proxy rotation. ScraperAPI provides a simple HTTP API that works with reqwest. For residential proxies, ThorData handles IP rotation. Monitor with ScrapeOps.

Conclusion

Rust is excellent for high-performance web scraping. The learning curve is steeper than Python, but the payoff in performance and reliability is substantial. Start simple, add concurrency with Tokio, and handle millions of pages without breaking a sweat.
