Last week I pointed my crawler at cloudflare.com, set the depth to 1, and walked away.
When I came back, this is what I saw:
```
✓ (layer1) https://cloudflare.com
✓ (layer1) https://cloudflare.com/en-gb
✓ (layer1) https://cloudflare.com/de-de
✓ (layer1) https://cloudflare.com/products
✓ (layer1) https://cloudflare.com/developer-platform
...
Total crawled : 100+
Failed        : 0
```
A website protected by Cloudflare, scraped past Cloudflare. 100+ pages. Zero blocks. All Layer 1, meaning not even a headless browser was involved.
That's PhantomCrawl. And this is how it works.
The actual scraped data is in the repo if you want to see it.
## Why Scrapers Keep Breaking
Most developers write a Python script with requests, run it against a real site, and get a Cloudflare challenge page or a 403 back. So they switch to Puppeteer or Playwright. That works for a while, then sites start detecting headless Chrome too.
The part nobody talks about is TLS fingerprinting.
When your Go or Python HTTP client connects to a server it sends a TLS handshake. That handshake has a unique fingerprint - specific cipher suites, extensions, and ordering that identify the library you're using. A Go net/http client looks nothing like Chrome at the TLS layer. Cloudflare checks this fingerprint before it even looks at your User-Agent or cookies.
Changing your User-Agent header does nothing if your TLS fingerprint is screaming "I am a Python script."
PhantomCrawl fixes this at the transport level using utls HelloChrome_120 - the exact same TLS fingerprint as a real Chrome 120 browser. To Cloudflare's infrastructure, every request looks like it came from a real person on Windows Chrome. Because at the cryptographic handshake level, it genuinely does.
## The Problem With Existing Tools
Before I built PhantomCrawl I looked at everything that existed. Here's the honest picture.
### Feature Comparison
| Tool | Self-Hosted | Anti-Bot | TLS Fingerprint | AI Cleaning | Binary | Free |
|---|---|---|---|---|---|---|
| PhantomCrawl | ✅ | ✅ | ✅ HelloChrome_120 | ✅ | ✅ | ✅ |
| Firecrawl | ❌ API only | ✅ | ❌ | ✅ | ❌ | 500 pages/mo |
| Scrapy | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Apify | ❌ Cloud only | ✅ | ❌ | ❌ | ❌ | Limited |
| BeautifulSoup | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| Puppeteer | ✅ | ⚠️ Detectable | ❌ | ❌ | ❌ | ✅ |
| ScrapingBee | ❌ API only | ✅ | ❌ | ❌ | ❌ | 1,000 req/mo |
### What It Actually Costs
This is where things get uncomfortable for the existing tools.
| Tool | Free Tier | Paid Entry | 100K pages/mo |
|---|---|---|---|
| PhantomCrawl | Unlimited | $0 | $0 |
| Firecrawl | 500 pages | $16/mo | $83/mo |
| Apify | ~$5 credit | $29/mo | ~$123/mo |
| ScrapingBee | 1,000 req | $49/mo | $249+/mo |
| Bright Data | 100 records | $500+/mo | $1,000+/mo |
| Browserless | 6hr/mo | $29/mo | Pay per hour |
PhantomCrawl is $0 because it runs on your machine. You are not paying for someone else's servers. The only optional cost is Groq for AI cleaning - which gives you 100,000 tokens free per day, enough for hundreds of pages at zero cost.
Scale to 1 million pages a month. Still $0. You just need a machine and an internet connection.
## How PhantomCrawl Works
PhantomCrawl uses a 4-layer escalation engine. Every URL starts at Layer 1 and only moves up if needed. Most sites never leave Layer 1.
### Layer 1 - Direct HTTP + TLS Fingerprinting
The fastest method. A direct HTTP request using utls HelloChrome_120 to disguise the TLS handshake as real Chrome. Includes human-like headers (Sec-Fetch-*, Sec-Ch-Ua-*), jittered timing, and user agent rotation.
This covers roughly 90% of the web: SSR sites, static pages, Next.js, WordPress, and yes, Cloudflare-protected sites.
### Layer 2 - Network Hijacking
If Layer 1 gets HTML but the content is not useful, Layer 2 inspects the raw HTML for embedded data. It scans for window.__NEXT_DATA__, window.__INITIAL_STATE__, window.__NUXT__, JSON-LD structured data, and API endpoint patterns.
Many modern SPAs ship their data pre-embedded in the HTML before JavaScript even runs. Layer 2 extracts it directly without a browser.
### Layer 2.5 - XHR/Fetch Interception
This is what makes PhantomCrawl different from anything else out there. Instead of scraping the rendered HTML from a headless browser, Layer 2.5 intercepts the actual API responses the browser receives during page load - the raw JSON from XHR and fetch calls.
The result is clean structured data with zero boilerplate. No parsing noise. No nav menus or footers. Just the data the page was going to display anyway, captured before it becomes HTML.
### Layer 3 - Full Headless Browser
Last resort. Launches a real browser - go-rod if Chrome is installed locally, or Browserless via API - and fully renders the page with JavaScript execution. Handles the most complex SPAs.
You never configure which layer to use. PhantomCrawl decides based on what each site actually returns.
## Every Feature
- 4-layer escalation engine with automatic fallback
- TLS fingerprinting via utls HelloChrome_120
- AI content cleaning via Groq or OpenAI with chunked processing
- Rate limit retry with automatic backoff and resume
- Proxy rotation tunneled at TCP level through the TLS transport
- Depth crawling with per-parent link limits
- Fragment URL deduplication (`/#about` and `/` are the same page)
- Asset filtering - PDFs, images, and zips skipped during depth crawling
- SQLite state tracking with full resume if interrupted
- `.env` key management with `$VAR_NAME` references
- Config generator UI - no terminal needed to configure
- Cross-platform binaries for Linux, Mac (Intel + Apple Silicon), Windows, ARM, and Termux
- Structured JSON output - `raw.json` and `cleaned.json` per page
- Absolute URL resolution on all extracted links
- Single binary under 20MB, no runtime required
## The Sleep and Scrape Workflow 😴
Here is something nobody talks about with web scraping at scale. It takes time. Sites throttle you, rate limits kick in, AI cleaning queues up. Trying to babysit this in real time is exhausting and pointless.
So don't. This is literally the workflow:
```bash
# 1. Put your URLs in urls.txt
# 2. Run it
phantomcrawl start
# 3. Go to sleep
```
Wake up to this:
```
Total crawled : 847
Failed        : 0
AI cleaned    : 847
Clean pending : 0
Output        : ~/phantomcrawl/scraped
```
PhantomCrawl batches requests with randomized delays so your timing is never predictable. It retries failures with exponential backoff. If the AI token quota resets overnight, it resumes exactly where it left off. Nothing gets crawled twice.
Put your URLs in. Go to sleep. Wake up to a folder full of clean JSON. ☕
## Getting Started
Download the binary for your platform from GitHub Releases:
| Platform | Binary |
|---|---|
| Linux 64-bit | phantomcrawl-linux-amd64 |
| Linux ARM / Termux | phantomcrawl-linux-arm64 |
| macOS Apple Silicon | phantomcrawl-darwin-arm64 |
| macOS Intel | phantomcrawl-darwin-amd64 |
| Windows | phantomcrawl-windows-amd64.exe |
```bash
# Linux/Mac
chmod +x phantomcrawl-linux-amd64
sudo mv phantomcrawl-linux-amd64 /usr/local/bin/phantomcrawl

# Three commands to your first crawl
phantomcrawl init
echo "https://example.com" > urls.txt
phantomcrawl start
```
Want AI cleaning? Get a free Groq key at console.groq.com, add it to `.env`, and set `"ai": { "enabled": true }` in `crawl.json`. That's it.
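For orientation, a minimal `crawl.json` might look something like this — only the `ai` block is confirmed above, and the other field names are illustrative guesses, so trust the file that `phantomcrawl init` generates over this sketch:

```json
{
  "depth": 1,
  "ai": {
    "enabled": true
  }
}
```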
Full docs at phantomcrawl.vercel.app
## Why It's Under 20MB
PhantomCrawl is written in Go and compiled to a single static binary. No runtime, no package manager, no node_modules folder that somehow weighs 300MB. Everything is included - the crawler, the AI pipeline, the API server, the SQLite driver, and the config system.
The binary is about 14MB. A fresh Next.js project's node_modules is 50MB before you've written a line of code. Go was the right choice for this.
Cross-compiling from a phone running Termux on Android to Linux amd64, macOS arm64, and Windows amd64 is one command per platform. Try doing that with Python.
## One More Thing
I'm Raphael, 18, from Lagos, Nigeria. I started coding at 12 on a phone with 1GB of RAM. No laptop, no bootcamp, no one teaching me. PhantomCrawl is my 7th shipped product.
I built it because I needed it and nothing else did the job without either costing money or breaking on any site worth scraping.
If it helps you, a star on the repo means a lot.
