DEV Community

agenthustler

How to Build a Web Scraper in Rust: Performance and Safety

While Python dominates web scraping, Rust offers compelling advantages: zero-cost abstractions, memory safety, and raw performance. Here's how to build a production-grade scraper in Rust.

Why Rust for Web Scraping?

  • 10-100x faster than Python for CPU-bound parsing
  • Memory safe without garbage collection
  • Concurrent by design with async/await and Tokio
  • Single binary deployment — no dependency hell
  • Low memory footprint — crucial for large-scale scraping

Setting Up

cargo new web_scraper && cd web_scraper

Add to Cargo.toml:

[dependencies]
reqwest = { version = "0.12", features = ["json"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.20"
futures = "0.3"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
csv = "1.3"

Basic HTTP Scraper

use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;
use std::error::Error;

#[derive(Debug, Serialize)]
struct Article {
    title: String,
    url: String,
    points: u32,
}

async fn scrape_hn(client: &Client) -> Result<Vec<Article>, Box<dyn Error>> {
    let response = client
        .get("https://news.ycombinator.com")
        .header("User-Agent", "RustScraper/1.0")
        .send().await?
        .text().await?;

    let document = Html::parse_document(&response);
    let title_selector = Selector::parse(".titleline > a").unwrap();
    let score_selector = Selector::parse(".score").unwrap();

    let mut articles = Vec::new();
    let titles: Vec<_> = document.select(&title_selector).collect();
    let scores: Vec<_> = document.select(&score_selector).collect();

    // Note: HN rows without a score (e.g. job posts) have no ".score" span,
    // so pairing titles and scores by index can drift on such pages.
    for (i, title_el) in titles.iter().enumerate() {
        let title = title_el.text().collect::<String>();
        let url = title_el.value().attr("href").unwrap_or("").to_string();
        let points = scores.get(i)
            .map(|s| s.text().collect::<String>()
                .replace(" points", "").trim()
                .parse::<u32>().unwrap_or(0))
            .unwrap_or(0);
        articles.push(Article { title, url, points });
    }
    Ok(articles)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = Client::new();
    let articles = scrape_hn(&client).await?;
    for a in &articles {
        println!("{} ({} pts) - {}", a.title, a.points, a.url);
    }
    Ok(())
}

Concurrent Scraping with Tokio

use tokio::time::{sleep, Duration};
use futures::future::join_all;

async fn scrape_page(client: &Client, url: &str) -> Result<String, Box<dyn Error + Send + Sync>> {
    // Polite delay so we don't hammer the target host.
    sleep(Duration::from_millis(500)).await;
    let response = client.get(url).send().await?.text().await?;
    Ok(response)
}

// Box<dyn Error> alone is not Send, so a future holding it can't cross
// tokio::spawn; the `+ Send + Sync` bounds fix that compile error.
async fn scrape_multiple(urls: Vec<String>) -> Vec<Result<String, Box<dyn Error + Send + Sync>>> {
    let client = Client::new();
    let sem = std::sync::Arc::new(tokio::sync::Semaphore::new(5));

    let tasks: Vec<_> = urls.into_iter().map(|url| {
        let client = client.clone();
        let sem = sem.clone();
        tokio::spawn(async move {
            // The permit is held until dropped, capping us at 5 concurrent fetches.
            let _permit = sem.acquire().await.unwrap();
            scrape_page(&client, &url).await
        })
    }).collect();

    // The outer unwrap only panics if a spawned task itself panicked.
    join_all(tasks).await.into_iter().map(|r| r.unwrap()).collect()
}
}

Error Handling and Retries

async fn fetch_with_retry(
    client: &Client, url: &str, max_retries: u32
) -> Result<String, Box<dyn Error>> {
    let mut retries = 0;
    loop {
        match client.get(url).send().await {
            Ok(r) if r.status().is_success() => return Ok(r.text().await?),
            Ok(r) if r.status() == 429 => {
                let wait = Duration::from_secs(2u64.pow(retries));
                eprintln!("Rate limited, waiting {:?}", wait);
                sleep(wait).await;
            }
            Ok(r) => eprintln!("HTTP {}", r.status()),
            Err(e) => eprintln!("Failed: {}", e),
        }
        retries += 1;
        if retries >= max_retries {
            return Err("Max retries exceeded".into());
        }
    }
}
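The `2^retries` backoff above can stampede: when many workers get rate limited at once, they all retry on the same schedule. A capped, jittered variant helps spread retries out. This is a plain-stdlib sketch (a real scraper would use the `rand` crate for jitter):

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

// Exponential backoff capped at `max_secs`, plus up to 50% pseudo-random jitter.
fn backoff(retries: u32, max_secs: u64) -> Duration {
    // checked_pow avoids overflow for large retry counts; min applies the cap.
    let base = 2u64.checked_pow(retries).unwrap_or(u64::MAX).min(max_secs);
    // Cheap jitter source: sub-second nanos of the current clock reading.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    let jitter_ms = (nanos % 1000) * base / 2; // 0 .. base/2 seconds, in ms
    Duration::from_secs(base) + Duration::from_millis(jitter_ms)
}
```

Swapping `Duration::from_secs(2u64.pow(retries))` for `backoff(retries, 60)` in `fetch_with_retry` bounds the worst-case wait at 60-90 seconds instead of growing without limit.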

Rust vs Python: Performance

| Metric              | Python      | Rust             |
|---------------------|-------------|------------------|
| Parse 10K pages     | ~45s        | ~2s              |
| Memory usage        | ~500 MB     | ~30 MB           |
| Concurrent requests | GIL limited | True parallelism |
| Binary size         | N/A         | ~5 MB            |

When to Use Rust vs Python

Use Rust when: parsing millions of pages, memory is constrained, you need a single binary, CPU-bound parsing is the bottleneck.

Use Python when: rapid prototyping, complex browser automation needed, team familiarity matters, scraping logic changes often.

Scaling Rust Scrapers

Even Rust scrapers need proxy rotation. ScraperAPI provides a simple HTTP API that works with reqwest. For residential proxies, ThorData handles IP rotation. Monitor with ScrapeOps.

Conclusion

Rust is excellent for high-performance web scraping. The learning curve is steeper than Python, but the payoff in performance and reliability is substantial. Start simple, add concurrency with Tokio, and handle millions of pages without breaking a sweat.
