Rohith M

Posted on • Originally published at clura.ai

I Tested 15 LLMs for Web Scraping and Built Heuristics Instead

[Image: Clura AI web scraper Chrome extension detecting fields on a business directory website]

The problem nobody talks about: 600KB of DOM

When I started building a web scraper, the obvious move was to send the page to an LLM and ask it to extract the data. Simple, right?

Wrong. A typical product listing page is 500–700KB of raw DOM. Sending that to any model means you're paying for ~150,000 tokens per page, waiting 15–30 seconds per request, and hitting context limits on anything complex.

I hit this wall on page one.

Four months, 15 models, same result

I tested everything: GPT-4, GPT-4o, Gemini 1.5 Pro, Gemini Ultra, Claude 3 Opus, Claude 3.5 Sonnet, Mistral Large, Llama 3 70B, Cohere Command R+, and a handful of smaller fine-tuned models.

The results were consistent:

  • GPT-4 / Gemini Ultra: accurate, but 25–35 seconds per page
  • Claude 3.5 Sonnet: best accuracy-to-latency ratio, still 5–10 seconds
  • Smaller models: faster, but hallucinated field names constantly

No model solved the latency problem because I was asking them to solve the wrong problem.

The pre-processor breakthrough

The real issue wasn't the model — it was the input size.

I built a DOM pre-processor:

  1. Strip all <script>, <style>, and tracking pixels
  2. Remove navigation, footer, sidebar elements
  3. Collapse deeply nested wrappers that carry no semantic content
  4. Apply SimHash to deduplicate structurally identical subtrees

Result: 580KB → 4.2KB. A 99.3% reduction.
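
Here's a minimal sketch of that pre-processing pass in TypeScript. It's simplified: the noise selectors are illustrative, and a plain structural hash of tag names stands in for SimHash.

```typescript
// Simplified sketch of the pre-processing pass. The selectors are illustrative,
// and structuralKey() is a plain structural hash standing in for SimHash.
const NOISE_SELECTORS =
  "script, style, noscript, iframe, nav, footer, aside, [aria-hidden='true']";

function structuralKey(el: Element): string {
  // Tag structure only, no text -- structurally identical subtrees get the same key.
  return el.tagName + "(" + Array.from(el.children).map(structuralKey).join(",") + ")";
}

function preprocess(root: HTMLElement): HTMLElement {
  const clone = root.cloneNode(true) as HTMLElement;

  // Steps 1 & 2: strip scripts, styles, and page chrome (nav, footer, sidebar).
  clone.querySelectorAll(NOISE_SELECTORS).forEach((el) => el.remove());

  // Step 3: collapse wrappers that add nothing -- one element child, no text of their own.
  clone.querySelectorAll("div, span, section").forEach((el) => {
    const child = el.children[0];
    if (el.children.length === 1 && el.textContent?.trim() === child.textContent?.trim()) {
      el.replaceWith(child);
    }
  });

  // Step 4: deduplicate structurally identical siblings, keeping a few exemplars of each.
  clone.querySelectorAll("*").forEach((parent) => {
    const seen = new Map<string, number>();
    Array.from(parent.children).forEach((child) => {
      const key = structuralKey(child);
      const count = (seen.get(key) ?? 0) + 1;
      seen.set(key, count);
      if (count > 3) child.remove();
    });
  });

  return clone;
}
```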

With a 4KB input, every model became fast. But something more interesting happened: at that size, the repeating patterns became obvious. The same structure repeated 20, 50, 100 times — product cards, directory rows, search results.

[Image: How AI web scraping works step by step]

The architecture decision

If the pattern is already obvious from the structure alone, why am I paying a model to find it?

I wrote a heuristic detector:

  • Identify elements with 3+ structurally identical siblings
  • Score candidate lists by depth, child count uniformity, and text density
  • Return ranked list candidates in a fraction of a millisecond
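
A stripped-down version of that detector looks something like this. The scoring weights and thresholds here are illustrative, not the exact ones Clura ships:

```typescript
// Heuristic list detection: group siblings by structural signature, then score.
interface ListCandidate {
  container: Element;
  items: Element[];
  score: number;
}

function shallowSignature(el: Element): string {
  // Tag name plus the tags of direct children -- enough to group "identical" siblings.
  return el.tagName + ">" + Array.from(el.children).map((c) => c.tagName).join(",");
}

function detectLists(root: Element, minSiblings = 3): ListCandidate[] {
  const candidates: ListCandidate[] = [];

  root.querySelectorAll("*").forEach((container) => {
    // Group direct children by structural signature.
    const groups = new Map<string, Element[]>();
    Array.from(container.children).forEach((child) => {
      const sig = shallowSignature(child);
      groups.set(sig, [...(groups.get(sig) ?? []), child]);
    });

    groups.forEach((items) => {
      if (items.length < minSiblings) return;

      // Score: more items, uniform child counts, and real text density rank higher.
      const childCounts = items.map((i) => i.children.length);
      const uniformity = 1 / (1 + Math.max(...childCounts) - Math.min(...childCounts));
      const textDensity =
        items.reduce((sum, i) => sum + (i.textContent?.trim().length ?? 0), 0) / items.length;

      candidates.push({
        container,
        items,
        score: items.length * uniformity * Math.log1p(textDensity),
      });
    });
  });

  return candidates.sort((a, b) => b.score - a.score);
}
```

The key design choice is that the scoring is purely structural: no text understanding, no model call, just tag shapes and counts.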

Then AI enters after detection — not to identify the list, but to label the fields and structure the output. That's a 200-token job, not a 150,000-token job.
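
To make the "small input" concrete, here's a hypothetical shape for that labeling call: the model only sees one exemplar item from the detected list, never the full page. The function name, prompt wording, and the injected callLlm client are all illustrative stand-ins.

```typescript
// Hypothetical field-labeling call: the model sees one exemplar item (~200 tokens),
// not the 600KB page. callLlm is a stand-in for whichever provider client you use.
async function labelFields(
  exemplarHtml: string,
  callLlm: (prompt: string) => Promise<string>
): Promise<string[]> {
  const prompt =
    "Here is one item from a repeating list on a web page:\n\n" +
    exemplarHtml +
    "\n\nReturn a JSON array of short field names for the distinct pieces of " +
    'data in this item, e.g. ["title", "price", "rating"].';

  const response = await callLlm(prompt);
  return JSON.parse(response) as string[];
}
```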

Step             Approach            Latency
List detection   Heuristics          0.2ms
Field labeling   LLM (small input)   ~2s
Total                                ~2s

vs. naive LLM approach: 25–35 seconds.

What I actually shipped

This architecture became Clura — a heuristic-first AI web scraper Chrome extension.

Open any page and Clura automatically detects every list using the heuristic engine. You pick the list, pick the fields you want, and it extracts all records in seconds. No "describe what you want" prompts. No robot training. No 30-second waits. The heuristic layer handles detection; AI handles labeling.

The lesson

LLMs are exceptional at understanding what something means. They're terrible at scanning 600KB of HTML to find where something is. That's a structural pattern problem — and structural pattern problems are what algorithms are for.

The best AI product architecture I've found isn't "use the best model." It's "use heuristics to reduce the problem until the model only sees what it's actually good at."

If you're building anything with LLMs on messy real-world inputs, the DOM pre-processing step alone is worth stealing. It will make every model you use faster, cheaper, and more accurate — regardless of the underlying task.


If you want to see this in action, try Clura free — it runs entirely in your browser with no server round-trips for detection.
