Your AI Agent Is Paying for HTML It Never Reads — I Measured the 7x Token Tax

#ai #python #webscraping #llm

I gave an agent a fetch_page tool, asked it to read one Wikipedia article, and watched that single page cost 48,703 tokens before the model produced a word. The readable text on that page is about 7,300 tokens. I was paying for ~41,000 tokens of <div>, inline CSS, and analytics scripts that never help the model answer anything.

That's the token tax on agent web access, and almost nobody measures it. Here's the number, the 40-line fix, and the honest part — where it doesn't matter.

The short version

When your agent "reads a page", it usually gets raw HTML pasted into the prompt. On three pages I tested, 85–86% of those tokens were markup the model doesn't need to read for meaning. Strip the page to text first and the token bill drops ~7×. The fix is the standard library plus a tokenizer — no API, no paid service.

The measurement

I counted tokens with o200k_base (the tokenizer GPT-4o uses) on three live pages of different sizes, comparing raw HTML against text-only output. Measured 2026-06-09. These are live pages, so your exact numbers will differ:

Page	Raw HTML	Text for the agent	Reduction
Wikipedia: Web scraping (165 KB)	48,703 tok	7,280 tok	6.7× (85% less)
Wikipedia: Large language model (686 KB)	221,622 tok	30,988 tok	7.2× (86% less)
example.com (528 B, control)	152 tok	22 tok	6.9× (86% less)

Three pages is a small sample, not a benchmark. But the ratio barely moved between a 528-byte page and a 686 KB one, which is the interesting part: the markup overhead is roughly proportional to page size, so on the pages I tested the tax shows up everywhere, not just on the big ones.
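The ratio and percentage columns follow directly from the two token counts. A minimal sketch that recomputes them from the numbers in the table (the `reduction` helper name is only for illustration):

```python
# Token counts copied from the table above: (raw HTML, text-only).
rows = {
    "Wikipedia: Web scraping": (48_703, 7_280),
    "Wikipedia: Large language model": (221_622, 30_988),
    "example.com": (152, 22),
}

def reduction(raw_tok: int, clean_tok: int) -> tuple[float, float]:
    """Return (how many times smaller, percent of tokens saved)."""
    ratio = raw_tok / clean_tok
    saved_pct = 100 * (1 - clean_tok / raw_tok)
    return ratio, saved_pct

for name, (raw_tok, clean_tok) in rows.items():
    ratio, saved_pct = reduction(raw_tok, clean_tok)
    print(f"{name}: {ratio:.1f}x ({saved_pct:.0f}% less)")
```

Swapping in your own measurements is the quickest way to see whether the ratio holds on your targets.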

At GPT-4o input pricing ($2.50 / 1M tokens, OpenAI, checked June 2026), the Large language model article alone costs $0.55 raw vs about $0.077 clean per read. One read. An agent that crawls 200 pages in a loop turns that into real money. Worse, it fills the context window with noise that pushes out the tokens you actually want the model reasoning over.
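The cost arithmetic is just tokens times price. A small sketch, assuming the $2.50 per 1M input tokens quoted above; the 200-read projection is a hypothetical worst case in which every page is as large as the Large language model article, not a measurement:

```python
PRICE_PER_M_INPUT = 2.50  # USD per 1M input tokens (the GPT-4o figure quoted above)

def read_cost(tokens: int, price_per_m: float = PRICE_PER_M_INPUT) -> float:
    """Input cost in USD of putting `tokens` tokens into one prompt."""
    return tokens * price_per_m / 1_000_000

raw_cost = read_cost(221_622)    # raw HTML of the Large language model article
clean_cost = read_cost(30_988)   # text-only version of the same page
print(f"one read:  ${raw_cost:.3f} raw vs ${clean_cost:.3f} clean")

PAGES = 200  # hypothetical crawl size, assuming every page is this large
print(f"{PAGES} reads: ${PAGES * raw_cost:.2f} raw vs ${PAGES * clean_cost:.2f} clean")
```

Real crawls mix page sizes, so treat the projection as an upper-bound illustration and plug in your own token counts.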

The fix (stdlib + tiktoken)

import sys, ssl, urllib.request
from html.parser import HTMLParser
import tiktoken

UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/148.0.0.0 Safari/537.36"
SKIP = {"script", "style", "head", "noscript", "svg", "template"}

class TextOnly(HTMLParser):
    """Collect visible text, ignoring everything inside SKIP elements."""
    def __init__(self):
        super().__init__(); self.parts = []; self.skipping = 0
    def handle_starttag(self, tag, attrs):
        if tag in SKIP: self.skipping += 1        # depth counter, so nested SKIP tags work
    def handle_endtag(self, tag):
        if tag in SKIP and self.skipping: self.skipping -= 1
    def handle_data(self, data):
        # keep only non-blank text found outside SKIP elements
        if not self.skipping and data.strip(): self.parts.append(data.strip())

raw = urllib.request.urlopen(
    urllib.request.Request(sys.argv[1], headers={"User-Agent": UA}), timeout=30
).read().decode("utf-8", "replace")

p = TextOnly(); p.feed(raw); text = "\n".join(p.parts)
enc = tiktoken.encoding_for_model("gpt-4o")          # o200k_base
raw_tok, clean_tok = len(enc.encode(raw)), len(enc.encode(text))
assert clean_tok <= raw_tok and len(text) > 0        # prove the work was done
print(f"{raw_tok:,} -> {clean_tok:,} tokens  ({raw_tok/clean_tok:.1f}x less)")

This is a runnable local excerpt. Save it as clean.py, run pip install -U tiktoken (you need a recent version for o200k_base), then python clean.py https://en.wikipedia.org/wiki/Web_scraping. Output on my machine:

48,703 -> 7,280 tokens  (6.7x less)

The full script (with the TLS handling and sanity check below) is in the repo. It's the standard library doing the work HTMLParser was built for, plus tiktoken so you count in the model's units, not characters. No requests, no readability library, no service.

One honest detail from the sanity print: the extracted text starts with Jump to content Main menu Main menu ... Navigation. This is a sanitizer, not a main-content reader — it keeps nav and footer text (more on that below).

One gotcha that cost me ten minutes

I run behind a VPN, and the first fetch died with CERTIFICATE_VERIFY_FAILED. The VPN was intercepting TLS, so the system couldn't chain to a trusted root. urllib hides this: it wraps the ssl.SSLError inside a urllib.error.URLError, so a naive except ssl.SSLError never fires. You catch URLError and look at e.reason:

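try:
    req = urllib.request.Request(sys.argv[1], headers={"User-Agent": UA})
    raw = urllib.request.urlopen(req, timeout=30).read().decode("utf-8", "replace")
except urllib.error.URLError as e:
    if not isinstance(e.reason, ssl.SSLError):  # urllib keeps the wrapped error in e.reason
        raise                                    # not a TLS problem, so re-raise unchanged
    sys.exit("TLS failed. If you trust this proxy, re-run with --insecure.")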

The script fails closed: it won't silently disable verification. Measuring a page over an untrusted MITM proxy is meaningless (you'd be tokenizing whatever the proxy injected), so turning off certificate verification requires an explicit --insecure flag rather than a quiet fallback.
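One way the opt-in could be wired up, as a sketch rather than the repo's actual code: `make_context`, `fetch`, and `parse_args` are hypothetical names, and only the `--insecure` flag and the exit message come from the text above. Verification stays on unless the caller asks otherwise.

```python
import argparse
import ssl
import sys
import urllib.error
import urllib.request

def make_context(insecure: bool) -> ssl.SSLContext:
    """Verified by default; unverified only when the caller opts in."""
    ctx = ssl.create_default_context()
    if insecure:
        ctx.check_hostname = False        # must be cleared before setting CERT_NONE
        ctx.verify_mode = ssl.CERT_NONE
    return ctx

def fetch(url: str, insecure: bool = False) -> str:
    """Fetch a page as text; exit with a hint if TLS verification fails."""
    req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        with urllib.request.urlopen(req, timeout=30, context=make_context(insecure)) as resp:
            return resp.read().decode("utf-8", "replace")
    except urllib.error.URLError as e:
        if not isinstance(e.reason, ssl.SSLError):
            raise                         # HTTP errors, DNS failures, etc. propagate
        sys.exit("TLS failed. If you trust this proxy, re-run with --insecure.")

def parse_args(argv=None) -> argparse.Namespace:
    ap = argparse.ArgumentParser(description="Count raw vs text-only tokens for a URL")
    ap.add_argument("url")
    ap.add_argument("--insecure", action="store_true",
                    help="skip TLS certificate verification (only for proxies you trust)")
    return ap.parse_args(argv)
```

Calling `args = parse_args()` and then `fetch(args.url, insecure=args.insecure)` keeps the unsafe path reachable only through the explicit flag.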

When this is NOT worth it

I'd be lying if I said "always strip HTML." It isn't free:

You lose structure. Tables, link targets, alt/title, and <code> boundaries flatten into text. If the agent's job is "extract every row of this table" or "follow these links," text-only throws away the signal. Hand it Markdown for those.
html.parser is not a browser. JS-rendered pages return a near-empty shell — this strips what the server sent, not what a browser paints. SPA targets still need a headless browser first.
It's a sanitizer, not a reader. It keeps menu and footer text (that Jump to content / Main menu above). A readability pass cuts further, at the cost of a dependency and occasionally eating real content. For an agent, "too much text" is cheap; "silently dropped the answer" is expensive — so I over-keep.
The 7× is against raw HTML, the naive default — not against Markdown or a readability pass, which also cut tokens. If you already feed Markdown, your win is smaller.
Numbers depend on the live page, your User-Agent, and locale. Re-measure on your own targets.
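For the JS-rendered case in the list above, a cheap guard is to flag pages where there is a lot of markup but almost no extracted text. This is only a heuristic sketch: the function name and both thresholds are assumptions chosen for illustration, not measured values.

```python
def looks_like_js_shell(raw_html: str, text: str,
                        min_raw_chars: int = 2_000,
                        min_text_ratio: float = 0.01) -> bool:
    """Guess whether the server sent an empty shell that a browser would fill in.

    Both thresholds are illustrative assumptions; tune them on your own targets.
    """
    if len(raw_html) < min_raw_chars:
        return False   # tiny pages (like example.com) are legitimately sparse
    return len(text) / len(raw_html) < min_text_ratio
```

A page that trips the check is a candidate for a headless browser pass before stripping.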

So the honest rule: strip to text when the agent is reading for meaning (RAG ingestion, summarization, Q&A). Keep structure when it's extracting specific fields.
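For the extraction case, a middle ground is to keep a little structure while still dropping the markup. The sketch below is a hypothetical variant of the stripper, not the script from this post: it keeps link targets in Markdown form and puts each table row on its own line with | between cells. The `TextWithStructure` and `html_to_structured_text` names and the BLOCK set are assumptions.

```python
from html.parser import HTMLParser

SKIP = {"script", "style", "head", "noscript", "svg", "template"}
BLOCK = {"p", "li", "tr", "div", "h1", "h2", "h3", "h4"}  # tags that end a line

class TextWithStructure(HTMLParser):
    """Text extractor that keeps link targets and table rows (illustrative)."""

    def __init__(self):
        super().__init__()
        self.lines, self.buf = [], []    # finished lines / fragments of the current line
        self.skipping = 0                # depth inside SKIP elements
        self.href, self.link_text = None, []

    def _flush(self):
        if self.buf:
            self.lines.append(" ".join(self.buf))
            self.buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skipping += 1
        elif self.skipping:
            return
        elif tag == "a":
            self.href, self.link_text = dict(attrs).get("href"), []
        elif tag == "tr":
            self._flush()                # each table row starts on its own line

    def handle_endtag(self, tag):
        if tag in SKIP and self.skipping:
            self.skipping -= 1
        elif self.skipping:
            return
        elif tag == "a" and self.href is not None:
            self.buf.append(f"[{' '.join(self.link_text)}]({self.href})")
            self.href = None
        elif tag in ("td", "th"):
            self.buf.append("|")         # cell boundary
        elif tag in BLOCK:
            self._flush()

    def handle_data(self, data):
        text = data.strip()
        if self.skipping or not text:
            return
        # text inside a link is held back until the closing </a>
        (self.link_text if self.href is not None else self.buf).append(text)

def html_to_structured_text(html: str) -> str:
    parser = TextWithStructure()
    parser.feed(html)
    parser.close()
    parser._flush()                      # emit any trailing text
    return "\n".join(parser.lines)
```

Fed `<p>See <a href="/wiki/X">the X page</a> now</p>`, this returns `See [the X page](/wiki/X) now`. It costs more tokens than plain text, so re-measure before assuming the same ratio.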

Why I bothered

I run production scrapers, and the lesson that transfers to agents is the same one that bit me on data pipelines: the cost isn't in the request, it's in what you carry forward. An agent that pastes raw HTML into every step pays the tax on every step, and the context bloat quietly degrades the reasoning you're paying for, so you pay twice.

40 lines. One pip install. ~7× fewer tokens on the pages I tested.

How are you feeding pages to your agents right now — raw HTML, Markdown, or a readability pass? And has anyone measured the token difference on their own targets? Drop your numbers 👇