While Python dominates web scraping, Rust offers compelling advantages: zero-cost abstractions, memory safety, and raw performance. Here's how to build a production-grade scraper in Rust.
Why Rust for Web Scraping?
- Often 10-100x faster than Python for CPU-bound parsing
- Memory safe without garbage collection
- Concurrent by design with async/await and Tokio
- Single binary deployment — no dependency hell
- Low memory footprint — crucial for large-scale scraping
Setting Up
cargo new web_scraper && cd web_scraper
Add to Cargo.toml:
[dependencies]
reqwest = { version = "0.12", features = ["json"] }
tokio = { version = "1", features = ["full"] }
scraper = "0.20"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
csv = "1.3"
futures = "0.3"
Basic HTTP Scraper
use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;
use std::error::Error;

#[derive(Debug, Serialize)]
struct Article {
    title: String,
    url: String,
    points: u32,
}

async fn scrape_hn(client: &Client) -> Result<Vec<Article>, Box<dyn Error>> {
    let response = client
        .get("https://news.ycombinator.com")
        .header("User-Agent", "RustScraper/1.0")
        .send().await?
        .text().await?;

    let document = Html::parse_document(&response);
    // Selector::parse only fails on invalid CSS, so unwrap is safe for these literals.
    let title_selector = Selector::parse(".titleline > a").unwrap();
    let score_selector = Selector::parse(".score").unwrap();

    let mut articles = Vec::new();
    let titles: Vec<_> = document.select(&title_selector).collect();
    let scores: Vec<_> = document.select(&score_selector).collect();

    // Caveat: pairing titles and scores by index assumes every title has a score.
    // HN job postings have no score row, which can shift the alignment; a more
    // robust scraper would walk each story row and read its own score element.
    for (i, title_el) in titles.iter().enumerate() {
        let title = title_el.text().collect::<String>();
        let url = title_el.value().attr("href").unwrap_or("").to_string();
        let points = scores.get(i)
            .map(|s| s.text().collect::<String>()
                .replace(" points", "").trim()
                .parse::<u32>().unwrap_or(0))
            .unwrap_or(0);
        articles.push(Article { title, url, points });
    }
    Ok(articles)
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn Error>> {
    let client = Client::new();
    let articles = scrape_hn(&client).await?;
    for a in &articles {
        println!("{} ({} pts) - {}", a.title, a.points, a.url);
    }
    Ok(())
}
Concurrent Scraping with Tokio
use tokio::time::{sleep, Duration};
use futures::future::join_all;

// Box<dyn Error> is not Send, so errors crossing a tokio::spawn boundary
// need the Send + Sync bounds.
type ScrapeError = Box<dyn std::error::Error + Send + Sync>;

async fn scrape_page(client: &Client, url: &str) -> Result<String, ScrapeError> {
    // Polite fixed delay between requests.
    sleep(Duration::from_millis(500)).await;
    let response = client.get(url).send().await?.text().await?;
    Ok(response)
}

async fn scrape_multiple(urls: Vec<String>) -> Vec<Result<String, ScrapeError>> {
    let client = Client::new();
    // Cap concurrency at 5 in-flight requests.
    let sem = std::sync::Arc::new(tokio::sync::Semaphore::new(5));
    let tasks: Vec<_> = urls.into_iter().map(|url| {
        let client = client.clone(); // reqwest::Client is a cheap, shared handle
        let sem = sem.clone();
        tokio::spawn(async move {
            let _permit = sem.acquire().await.expect("semaphore closed");
            scrape_page(&client, &url).await
        })
    }).collect();
    // join_all awaits every task; unwrap propagates task panics as our own.
    join_all(tasks).await.into_iter().map(|r| r.unwrap()).collect()
}
Error Handling and Retries
async fn fetch_with_retry(
    client: &Client, url: &str, max_retries: u32
) -> Result<String, Box<dyn Error>> {
    let mut retries = 0;
    loop {
        match client.get(url).send().await {
            Ok(r) if r.status().is_success() => return Ok(r.text().await?),
            Ok(r) if r.status() == 429 => {
                // Exponential backoff: 1s, 2s, 4s, ...
                let wait = Duration::from_secs(2u64.pow(retries));
                eprintln!("Rate limited, waiting {:?}", wait);
                sleep(wait).await;
            }
            Ok(r) => eprintln!("HTTP {}", r.status()),
            Err(e) => eprintln!("Failed: {}", e),
        }
        retries += 1;
        if retries >= max_retries {
            return Err("Max retries exceeded".into());
        }
    }
}
Rust vs Python: Performance
| Metric | Python | Rust |
|---|---|---|
| Parse 10K pages | ~45s | ~2s |
| Memory usage | ~500 MB | ~30 MB |
| Concurrent requests | GIL limited | True parallelism |
| Binary size | N/A | ~5 MB |
When to Use Rust vs Python
Use Rust when: parsing millions of pages, memory is constrained, you need a single binary, CPU-bound parsing is the bottleneck.
Use Python when: rapid prototyping, complex browser automation needed, team familiarity matters, scraping logic changes often.
Scaling Rust Scrapers
Even Rust scrapers eventually need proxy rotation to avoid blocks. ScraperAPI provides a simple HTTP API that works directly with reqwest; for residential proxies, ThorData handles IP rotation; and ScrapeOps can monitor scraper health.
Conclusion
Rust is excellent for high-performance web scraping. The learning curve is steeper than Python, but the payoff in performance and reliability is substantial. Start simple, add concurrency with Tokio, and handle millions of pages without breaking a sweat.