DEV Community

phantomDev

I Built a Web Crawler That Scraped Cloudflare Past Their Own Protection

Last week I pointed my crawler at cloudflare.com, set the depth to 1, and walked away.

When I came back this is what I saw:

```
✓ (layer1) https://cloudflare.com
✓ (layer1) https://cloudflare.com/en-gb
✓ (layer1) https://cloudflare.com/de-de
✓ (layer1) https://cloudflare.com/products
✓ (layer1) https://cloudflare.com/developer-platform
...

Total crawled : 100+
Failed        : 0
```

Here's the proof:

CloudFlare Scraped

A website protected by Cloudflare, scraped past Cloudflare. 100+ pages. Zero blocks. All Layer 1, meaning not even a headless browser was involved.

That's PhantomCrawl. And this is how it works.

The actual scraped data is in the repo if you want to see it.


Why Scrapers Keep Breaking

Most developers write a Python script with requests, run it against a real site, and get a Cloudflare challenge page or a 403 back. So they switch to Puppeteer or Playwright. That works for a while, then sites start detecting headless Chrome too.

The part nobody talks about is TLS fingerprinting.

When your Go or Python HTTP client connects to a server it sends a TLS handshake. That handshake has a unique fingerprint - specific cipher suites, extensions, and ordering that identify the library you're using. A Go net/http client looks nothing like Chrome at the TLS layer. Cloudflare checks this fingerprint before it even looks at your User-Agent or cookies.

Changing your User-Agent header does nothing if your TLS fingerprint is screaming "I am a Python script."

PhantomCrawl fixes this at the transport level using utls HelloChrome_120 - the exact same TLS fingerprint as a real Chrome 120 browser. To Cloudflare's infrastructure, every request looks like it came from a real person on Windows Chrome. Because at the cryptographic handshake level, it genuinely does.


The Problem With Existing Tools

Before I built PhantomCrawl I looked at everything that existed. Here's the honest picture.

Feature Comparison

| Tool | Self-Hosted | Anti-Bot | TLS Fingerprint | AI Cleaning | Binary | Free |
|------|-------------|----------|-----------------|-------------|--------|------|
| PhantomCrawl | ✅ | | HelloChrome_120 | | | |
| Firecrawl | ❌ API only | | | | | 500 pages/mo |
| Scrapy | | | | | | |
| Apify | ❌ Cloud only | | | | | Limited |
| BeautifulSoup | | | | | | |
| Puppeteer | | ⚠️ Detectable | | | | |
| ScrapingBee | ❌ API only | | | | | 1,000 req/mo |

What It Actually Costs

This is where things get uncomfortable for the existing tools.

| Tool | Free Tier | Paid Entry | 100K pages/mo |
|------|-----------|------------|---------------|
| PhantomCrawl | Unlimited | $0 | $0 |
| Firecrawl | 500 pages | $16/mo | $83/mo |
| Apify | ~$5 credit | $29/mo | ~$123/mo |
| ScrapingBee | 1,000 req | $49/mo | $249+/mo |
| Bright Data | 100 records | $500+/mo | $1,000+/mo |
| Browserless | 6hr/mo | $29/mo | Pay per hour |

PhantomCrawl is $0 because it runs on your machine. You are not paying for someone else's servers. The only optional cost is Groq for AI cleaning - which gives you 100,000 tokens free per day, enough for hundreds of pages at zero cost.

Scale to 1 million pages a month. Still $0. You just need a machine and an internet connection.


How PhantomCrawl Works

PhantomCrawl uses a 4-layer escalation engine. Every URL starts at Layer 1 and only moves up if needed. Most sites never leave Layer 1.

Layer 1 - Direct HTTP + TLS Fingerprinting

The fastest method. A direct HTTP request using utls HelloChrome_120 to disguise the TLS handshake as real Chrome. Includes human-like headers (Sec-Fetch-*, Sec-Ch-Ua-*), jittered timing, and user agent rotation.

Covers roughly 90% of the web. SSR sites, static pages, Next.js, WordPress, and yes - Cloudflare-protected sites.

Layer 2 - Network Hijacking

If Layer 1 gets HTML but the content is not useful, Layer 2 inspects the raw HTML for embedded data. It scans for window.__NEXT_DATA__, window.__INITIAL_STATE__, window.__NUXT__, JSON-LD structured data, and API endpoint patterns.

Many modern SPAs ship their data pre-embedded in the HTML before JavaScript even runs. Layer 2 extracts it directly without a browser.

Layer 2.5 - XHR/Fetch Interception

This is what makes PhantomCrawl different from anything else out there. Instead of scraping the rendered HTML from a headless browser, Layer 2.5 intercepts the actual API responses the browser receives during page load - the raw JSON from XHR and fetch calls.

The result is clean structured data with zero boilerplate. No parsing noise. No nav menus or footers. Just the data the page was going to display anyway, captured before it becomes HTML.

Layer 3 - Full Headless Browser

Last resort. Launches a real browser - go-rod if Chrome is installed locally, or Browserless via API - and fully renders the page with JavaScript execution. Handles the most complex SPAs.

You never configure which layer to use. PhantomCrawl decides based on what each site actually returns.


Every Feature

  • 4-layer escalation engine with automatic fallback
  • TLS fingerprinting via utls HelloChrome_120
  • AI content cleaning via Groq or OpenAI with chunked processing
  • Rate limit retry with automatic backoff and resume
  • Proxy rotation tunneled at TCP level through the TLS transport
  • Depth crawling with per-parent link limits
  • Fragment URL deduplication (/#about and / are the same page)
  • Asset filtering - PDFs, images, and zips skipped during depth crawling
  • SQLite state tracking with full resume if interrupted
  • .env key management with $VAR_NAME references
  • Config generator UI - no terminal needed to configure
  • Cross-platform binaries for Linux, Mac (Intel + Apple Silicon), Windows, ARM, and Termux
  • Structured JSON output - raw.json and cleaned.json per page
  • Absolute URL resolution on all extracted links
  • Single binary under 20MB, no runtime required

The Sleep and Scrape Workflow 😴

Here is something nobody talks about with web scraping at scale: it takes time. Sites throttle you, rate limits kick in, AI cleaning queues up. Trying to babysit this in real time is exhausting and pointless.

So don't. This is literally the workflow:

```
# 1. Put your URLs in urls.txt
# 2. Run it
phantomcrawl start
# 3. Go to sleep
```

Wake up to this:

```
Total crawled : 847
Failed        : 0
AI cleaned    : 847
Clean pending : 0
Output        : ~/phantomcrawl/scraped
```

PhantomCrawl batches requests with randomized delays so your timing is never predictable. It retries failures with exponential backoff. If the AI token quota resets overnight, it resumes exactly where it left off. Nothing gets crawled twice.

Put your URLs in. Go to sleep. Wake up to a folder full of clean JSON. ☕


Getting Started

Download the binary for your platform from GitHub Releases:

| Platform | Binary |
|----------|--------|
| Linux 64-bit | phantomcrawl-linux-amd64 |
| Linux ARM / Termux | phantomcrawl-linux-arm64 |
| macOS Apple Silicon | phantomcrawl-darwin-arm64 |
| macOS Intel | phantomcrawl-darwin-amd64 |
| Windows | phantomcrawl-windows-amd64.exe |

```
# Linux/Mac
chmod +x phantomcrawl-linux-amd64
sudo mv phantomcrawl-linux-amd64 /usr/local/bin/phantomcrawl

# Three commands to your first crawl
phantomcrawl init
echo "https://example.com" > urls.txt
phantomcrawl start
```

Want AI cleaning? Get a free Groq key at console.groq.com, add it to .env, and set "ai": { "enabled": true } in crawl.json. That's it.
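Concretely, assuming the key is stored under a variable named GROQ_API_KEY (the variable name is my guess; only the "ai" block is documented above), the two files look like:

```
# .env
GROQ_API_KEY=your_key_here
```

```
// crawl.json
{
  "ai": { "enabled": true }
}
```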

Full docs at phantomcrawl.vercel.app


Why It's Under 20MB

PhantomCrawl is written in Go and compiled to a single static binary. No runtime, no package manager, no node_modules folder that somehow weighs 300MB. Everything is included - the crawler, the AI pipeline, the API server, the SQLite driver, and the config system.

The binary is about 14MB. A fresh Next.js project's node_modules is 50MB before you've written a line of code. Go was the right choice for this.

Cross-compiling from a phone running Termux on Android to Linux amd64, macOS arm64, and Windows amd64 is one command per platform. Try doing that with Python.


One More Thing

I'm Raphael, 18, from Lagos, Nigeria. I started coding at 12 on a phone with 1GB of RAM. No laptop, no bootcamp, no one teaching me. PhantomCrawl is my 7th shipped product.

I built it because I needed it and nothing else did the job without either costing money or breaking on any site worth scraping.

If it helps you, a star on the repo means a lot.

github.com/var-raphael/PhantomCrawl
