Last year I was building an AI agent. Simple job: scrape web pages, convert to markdown, feed to an LLM. Classic RAG pipeline stuff.
I tried Firecrawl. Nice API, solid docs, everything looked great — until I put it in production.
4.6 seconds. For a single page. My agent was spending more time waiting than thinking.
Switched to Crawl4AI. Speed was okay-ish but I couldn't deploy the damn thing. Python, Playwright, Chromium, a mountain of dependencies. Docker image was 2GB. Running it on a simple VPS was an adventure in itself.
Looked at Spider.cloud. Fast, but closed-source and expensive. You don't own your infrastructure.
One night I thought "how hard can this be in Rust?"
8 months later, here we are.
What is CRW?
CRW is an open-source web scraping API written in Rust. It does everything Firecrawl does — scrape, crawl, map, LLM extraction — but as a single 8MB binary.
Here's where it stands right now:
| | CRW | Firecrawl | Crawl4AI |
|---|---|---|---|
| Avg latency | 833ms | 4,600ms | 3,200ms |
| Crawl coverage | 92% | 77.2% | ~80% |
| Memory usage | 6.6MB | 500MB+ | 300MB+ |
| Docker image | 8MB | 500MB+ | ~2GB |
(Scrapeway benchmark data, same 500-URL corpus)
I was honestly surprised by the first benchmark results myself. I knew Rust would be faster but didn't expect a 5.5x gap. The real surprise was coverage — the lol-html parser does seriously good work.
Okay but why "yet another scraper"?
Fair question. There are plenty of scrapers out there. But here's the thing:
Firecrawl's API is actually well-designed. /v1/scrape, /v1/crawl, /v1/map — clean, intuitive, useful. The problem isn't the API design, it's the engine underneath.
So I thought: let's take the same API and rewrite the engine from scratch in Rust. That way anyone using Firecrawl can switch by changing one line:
- const BASE_URL = "https://api.firecrawl.dev";
+ const BASE_URL = "https://api.fastcrw.com";
I mean it. That's literally it. Same endpoints, same request/response format.
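To make that concrete, here's a minimal client sketch in TypeScript. The `/v1/scrape` path and request body come straight from the examples in this post; the helper only builds the fetch arguments, so the same code works against the cloud endpoint or a self-hosted instance.

```typescript
// Minimal sketch of the drop-in swap. Only the endpoint path and request
// body are taken from the article; swap BASE_URL for your own deployment.
const BASE_URL = "https://api.fastcrw.com"; // or "http://localhost:3000" when self-hosting

interface ScrapePayload {
  url: string;
  formats: string[];
}

// Builds the request without sending it, so it is backend-agnostic.
function buildScrapeRequest(url: string, formats: string[] = ["markdown"]) {
  const payload: ScrapePayload = { url, formats };
  return {
    endpoint: `${BASE_URL}/v1/scrape`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

Pass the `endpoint` and the rest of the object to `fetch` (or any HTTP client) and you've migrated.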
How it works
Simplest possible usage:
docker run -p 3000:3000 ghcr.io/us/crw:latest
Done. You now have a web scraping API running on localhost:3000. Unlimited requests, zero cost.
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com", "formats": ["markdown"]}'
You get back clean markdown. No HTML tags, no ads, no navigation menus. Clean text you can feed directly to an LLM.
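Consuming that response in TypeScript could look like the sketch below. The envelope shape (a `success` flag plus a `data` object holding the requested formats) is my assumption based on Firecrawl's v1 format, which CRW mirrors; verify it against an actual payload.

```typescript
// Sketch of unwrapping a scrape response. The { success, data: { markdown } }
// envelope is assumed from the Firecrawl-compatible v1 API, not verified here.
interface ScrapeResponse {
  success: boolean;
  data: { markdown?: string };
}

function markdownOf(rawJson: string): string {
  const res: ScrapeResponse = JSON.parse(rawJson);
  if (!res.success || typeof res.data?.markdown !== "string") {
    throw new Error("scrape failed or markdown format missing");
  }
  return res.data.markdown;
}
```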
Using it with AI agents — my original motivation
The whole reason I built CRW was for AI agents, so an MCP server comes built-in.
If you're using Claude Desktop or Cursor, just add this to your claude_desktop_config.json:
{
  "mcpServers": {
    "crw": {
      "command": "crw",
      "args": ["mcp"]
    }
  }
}
Now Claude can run "scrape this page" or "crawl this site" commands directly. Your agent can freely browse the web.
You can do this with Firecrawl too, but their MCP server is a separate package, separate setup, separate config. With CRW it's all inside the same binary.
LLM extraction — my favorite feature
You give it a JSON schema, CRW extracts structured data from the page:
curl -X POST http://localhost:3000/v1/scrape \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://news.ycombinator.com",
    "formats": ["extract"],
    "extract": {
      "schema": {
        "type": "object",
        "properties": {
          "stories": {
            "type": "array",
            "items": {
              "type": "object",
              "properties": {
                "title": {"type": "string"},
                "points": {"type": "number"}
              }
            }
          }
        }
      }
    }
  }'
You get titles and points from the Hacker News front page as clean JSON. No regex, not even CSS selectors. The LLM understands the page and extracts what you need.
I use this constantly in RAG pipelines. Pulling product data from e-commerce sites, grabbing article metadata from blogs — always this endpoint.
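Typing that result on the client side keeps the rest of a pipeline honest. Here's a sketch, assuming the extracted object comes back under `data.extract` in the Firecrawl-style response format (the `Story` type and `topStories` helper are my own illustration, not part of CRW):

```typescript
// Hypothetical types for the HN extraction example; the data.extract
// envelope is assumed from the Firecrawl-compatible v1 format.
interface Story {
  title: string;
  points: number;
}

interface ExtractResponse {
  success: boolean;
  data: { extract: { stories: Story[] } };
}

// Example post-processing: keep only stories above a score threshold.
function topStories(res: ExtractResponse, minPoints: number): Story[] {
  return res.data.extract.stories.filter((s) => s.points >= minPoints);
}
```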
Self-host vs Cloud
Two options:
Self-host (free, forever): docker run and you're done. Runs on your server, your data stays with you, no request limits. AGPL-3.0 license.
It runs comfortably on a $5 DigitalOcean droplet because it idles at 6.6MB of RAM. Try self-hosting Firecrawl and you'll need at least 1GB of RAM.
Cloud (fastcrw.com): For when you don't want to manage servers. 50 free credits, no credit card. Proxy network, auto-scaling, Chromium fleet all included. I actually use the cloud version for my own agents because I don't want to deal with proxy management.
Why Rust?
I get this question a lot. "You could've written it in Go, you could've written it in Node" — yeah, I could've.
But web scraping is CPU-bound work. Parsing HTML, traversing the DOM, converting to markdown — these are all byte-level operations. Rust's zero-cost abstractions make a real difference here.
Then there's memory. Firecrawl's Node.js + Playwright stack eats 500MB+ at idle. At 10 concurrent requests it blows past 2GB. CRW handles the same load under 50MB. That means 10x more throughput on the same server.
And finally: single binary. You run cargo build --release, you get a single 8MB file, and the Alpine-based Docker image comes in at roughly the same size. Even CI/CD build times are fast.
Known gaps (let me be honest)
It's not perfect. Here's what I know is missing:
- No WebSocket streaming yet — you track crawl progress via polling. Coming soon.
- No screenshot/PDF capture — it's on the roadmap but I haven't implemented it yet.
- Docs are still growing — I add to them every week but they're not as comprehensive as Firecrawl's yet.
I'm telling you this because I don't trust projects that claim to do everything perfectly with zero issues. CRW is a fast and reliable scraper, but it's still evolving.
Try it out
Self-host in 30 seconds:
docker run -p 3000:3000 ghcr.io/us/crw:latest
Try the cloud:
fastcrw.com (50 free credits)
Source code:
github.com/us/crw
Full docs:
us.github.io/crw
If you find bugs or want features, open an issue. I look at every single one.
And honestly, if you star the repo I'd appreciate it. That's how open-source motivation works — you see a star come in and suddenly you're writing code at 2am again.