A response to Thariq Shihipar's "HTML is the new markdown" post — and a practical answer for anyone watching their per-request costs creep up.
The tweet that started it
On May 8, Thariq Shihipar — a member of the Claude Code team at Anthropic — posted what is now one of the most-discussed dev takes of the quarter:
"HTML is the new markdown. I've stopped writing markdown files for almost everything and switched to using Claude Code to generate HTML for me."
The case he made was genuinely compelling. HTML, he argued, lets Claude do things markdown simply can't:
- Inline SVG diagrams instead of fenced ASCII art
- Interactive widgets — sliders, toggles, collapsible sections — so a PR review or architectural doc becomes a navigable artifact instead of a wall of text
- In-page navigation, color-coded severity, real layout
- Semantic structure the model can hang richer reasoning on
Thariq's example prompt — "Help me review this PR by creating an HTML artifact… color-code findings by severity" — captures the appeal in one line. The output looks like a small internal tool, not a chat log. Once you see it, going back to plain markdown feels like trading a dashboard for a printout.
Simon Willison agreed enough to write it up the same week. Then the dissents arrived.
The counter-argument: this is expensive
The most-shared rebuttal came from Kurtis Redux's "The Unreasonable Ineffectiveness of HTML". Three points landed:
- HTML is 2–4× slower to generate than the markdown equivalent, and takes roughly 2–4× more output tokens to render the same information.
- Verbose markup dilutes model attention. The longer the output, the easier it is for the model to lose the plot, repeat itself, or hallucinate structure halfway down.
- Output tokens are the expensive tokens. Across every major API price card, output is 3–5× the cost of input. Recommending a format that produces 2–4× more output is a real line item.
Then came the uncomfortable subtext from a few corners of the timeline: Anthropic sells by token. Of course a Claude Code engineer wants you generating more of them.
I don't think that's fair to Thariq — he's been shipping thoughtful, often unflattering-to-Anthropic posts about prompt caching and token efficiency for a year. But the structural concern stands: the people best positioned to tell us how to use these tools also have the most to gain when we use them more.
The real question
The argument has been framed as a binary: rich-but-expensive HTML, or cheap-but-flat markdown. Pick a side.
That framing is wrong. The question that actually matters is:
Can I get the HTML output Thariq is excited about — without paying the HTML token bill?
The answer is yes, and most of it has nothing to do with which model you're using. It has to do with everything that happens around the model call. That's the slice I've been working on with Lynkr.
What's actually inflating your token count
Before talking about HTML output, it's worth being honest about where tokens actually go in a Claude Code-style session. In a typical agentic loop, the output HTML is the smallest chunk. The big spenders are:
- The system prompt and tool definitions — re-sent every single turn. Claude Code ships with 50+ tools, each with verbose JSON schemas. That's tens of thousands of input tokens per turn before the user has typed anything.
- Conversation history — every prior assistant message, every tool call, every tool result, replayed each step.
- Tool results — cargo build, pytest -v, eslint ., git diff. A failing test run can dump 40k tokens of stack trace.
- MCP tool catalogs — wire up half a dozen MCP servers and your tool schema can balloon past 100k tokens.
The HTML the model writes is real cost, but it's measured in hundreds-to-low-thousands of output tokens. The HTML-vs-markdown debate is fighting over the tip while the iceberg sits unchallenged.
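To make the tip-vs-iceberg point concrete, here is a back-of-envelope sketch of a single agentic turn. The numbers are illustrative, not measurements; they just track the orders of magnitude named above.

```typescript
// Illustrative per-turn input breakdown for a Claude Code-style agent loop.
// These are made-up round numbers consistent with the ranges in the text above.
const perTurnInputTokens = {
  systemPromptAndToolDefs: 25_000, // 50+ tools with verbose JSON schemas, resent every turn
  conversationHistory: 60_000,     // grows with every prior message and tool call
  toolResults: 40_000,             // one failing test run can account for all of this
  mcpCatalog: 100_000,             // half a dozen MCP servers' worth of schemas
};

const inputPerTurn = Object.values(perTunTokens(perTurnInputTokens)).reduce((a, b) => a + b, 0);
```

Wait, simpler:

```typescript
// Illustrative per-turn input breakdown for a Claude Code-style agent loop.
// These are made-up round numbers consistent with the ranges in the text above.
const perTurnInputTokens = {
  systemPromptAndToolDefs: 25_000, // 50+ tools with verbose JSON schemas, resent every turn
  conversationHistory: 60_000,     // grows with every prior message and tool call
  toolResults: 40_000,             // one failing test run can account for all of this
  mcpCatalog: 100_000,             // half a dozen MCP servers' worth of schemas
};

const inputPerTurn = Object.values(perTurnInputTokens).reduce((a, b) => a + b, 0);
const htmlOutputTokens = 2_000; // the final artifact, generated once

console.log(`input per turn: ~${inputPerTurn.toLocaleString()} tokens`);
console.log(`HTML as a share of one turn's input: ${((htmlOutputTokens / inputPerTurn) * 100).toFixed(1)}%`);
// ~225,000 input tokens per turn vs ~2,000 output tokens of HTML: the tip vs the iceberg.
```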
If you want HTML outputs without a wince at the invoice, the lever isn't "stop generating HTML." The lever is "stop sending the model a quarter-million tokens of repeated context to get to that HTML."
How Lynkr attacks the iceberg
Lynkr is a self-hosted proxy that sits between your coding tool (Claude Code, Cursor, Codex CLI, Cline, Continue) and any of 13+ LLM providers. It speaks Anthropic's /v1/messages and OpenAI's /v1/chat/completions, so the client doesn't know it's there. What it does in the middle is the interesting part. A few of the layers most relevant to the HTML-output discussion:
0. Preflight short-circuit (zero tokens)
The most extreme expression of the thesis. Before any model call happens, Lynkr can run a user-supplied shell precondition — pytest path/to/test.py, tsc --noEmit, lint --quiet, whatever proves the work is already done. If every command exits 0, Lynkr returns a synthetic "preflight satisfied" response without calling the LLM at all.
The use case sounds niche until you run a real agent loop: a CI-triggered "fix the failing test" request that arrives 90 seconds after the test was already fixed on another branch. An idempotent agent retry. A scheduled refactor job whose work was finished by a previous run. Today, every one of those wastes a full agentic loop — sometimes hundreds of thousands of tokens — to rediscover that the answer is "nothing to do."
The most expensive HTML output is the one you generate to explain a problem that no longer exists. Preflight is the layer that just doesn't.
Opt-in via preflight_commands in the request payload, gated on LYNKR_PREFLIGHT_ENABLED=true. Inspired by the CodexSaver routing patterns.
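A minimal sketch of what the opt-in could look like, assuming preflight_commands rides alongside an otherwise ordinary /v1/messages body. The field name and env gate come from above; the exact payload placement, model alias, and test path are illustrative, not copied from Lynkr's docs.

```typescript
// Hypothetical request body; only preflight_commands and the env gate are from the post.
const body = {
  model: "claude-sonnet",            // or a tier alias, depending on your routing config
  max_tokens: 4096,
  messages: [{ role: "user", content: "Fix the failing test in tests/test_auth.py" }],
  preflight_commands: [
    "pytest tests/test_auth.py -q",  // exits 0 if the test already passes
    "tsc --noEmit",
  ],
};

// With LYNKR_PREFLIGHT_ENABLED=true on the proxy side, every command exiting 0
// means Lynkr answers with a synthetic "preflight satisfied" response and the
// LLM is never called: zero input tokens, zero output tokens.
```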
1. Smart tool selection (~50–70% token reduction on tool defs)
Claude Code sends every registered tool definition on every turn. Lynkr classifies the incoming request — is this a code edit, a search, a refactor, a "generate an HTML artifact" task? — and ships only the tools that classification actually needs. The model doesn't see the 40 it won't use.
For an "HTML artifact" task specifically, you don't need Bash, Glob, Git, the test runners, or MCP tools. You need read, write, and maybe web search. The savings compound across every turn of the loop.
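The shape of the idea, sketched below (not Lynkr's internals): classify the request into a task type, then forward only the tool definitions that type needs. Tool names and the keyword-based classifier are stand-ins.

```typescript
type ToolDef = { name: string; description: string; input_schema: object };

// Hypothetical task-to-tool mapping; real tool names will differ.
const TOOLS_BY_TASK: Record<string, string[]> = {
  html_artifact: ["read_file", "write_file", "web_search"],
  code_edit: ["read_file", "write_file", "bash", "grep"],
  search: ["grep", "glob", "read_file"],
};

function classify(prompt: string): string {
  // Stand-in for the classifier; a real one looks at far more than keywords.
  if (/html (artifact|report)/i.test(prompt)) return "html_artifact";
  if (/\b(find|search|where is)\b/i.test(prompt)) return "search";
  return "code_edit";
}

function selectTools(prompt: string, allTools: ToolDef[]): ToolDef[] {
  const allowed = new Set(TOOLS_BY_TASK[classify(prompt)]);
  // The other 40+ definitions never reach the context window on this turn.
  return allTools.filter((t) => allowed.has(t.name));
}
```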
2. Tool result compression
When the model calls pytest and gets back 40k tokens of output, Lynkr's tool-result compressor detects the pattern (test runner, linter, build, git diff) and compresses it before the model sees it on the next turn — usually keeping the failing assertions and dropping the redundant traceback frames. A tee cache holds the full version if anything downstream needs it.
In an agentic HTML-generation flow ("read the codebase, then produce an HTML report"), tool results are typically the largest single token category. Compressing them shrinks every subsequent turn.
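A rough sketch of the transform for the pytest case, assuming a simple line filter rather than whatever heuristics Lynkr actually uses: keep the failure signals, cap the traceback frames, drop the rest.

```typescript
// Sketch only: recognise test-runner output and keep the lines the model needs.
function compressPytestOutput(raw: string, maxTracebackFrames = 5): string {
  const lines = raw.split("\n");
  const kept: string[] = [];
  let frameCount = 0;

  for (const line of lines) {
    const isFrame = /^\s+File ".+", line \d+/.test(line);
    const isSignal = /^(FAILED|ERROR|E\s|assert|={3,}.*(failed|passed).*={3,})/.test(line.trim());

    if (isSignal) {
      kept.push(line);       // failing assertions and the summary line always survive
      frameCount = 0;
    } else if (isFrame && frameCount < maxTracebackFrames) {
      kept.push(line);       // keep the first few frames for locality
      frameCount++;
    }
    // Everything else (repeated frames, captured stdout, warnings) is dropped here;
    // per the post, a tee cache keeps the full output if a later step needs it.
  }
  return kept.join("\n");
}
```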
3. History compression (Distill-style structural dedup)
Long sessions accumulate near-duplicate state: the model re-reads the same file four times, re-runs a slightly-different version of the same query, repeats its own reasoning. Lynkr's history compressor deduplicates older turns structurally while preserving the most recent N verbatim. The model still has the context; it doesn't pay for it twice.
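A simplified sketch of structural dedup, under the assumption that "duplicate" means byte-identical content in older turns: repeats collapse to a short reference, and the most recent N turns stay verbatim.

```typescript
import { createHash } from "node:crypto";

type Turn = { role: "user" | "assistant" | "tool"; content: string };

// Sketch, not Lynkr's algorithm: hash older turns and replace repeats with stubs.
function compressHistory(turns: Turn[], keepRecent = 6): Turn[] {
  const cutoff = Math.max(0, turns.length - keepRecent);
  const seen = new Map<string, number>(); // content hash -> index of first occurrence

  return turns.map((turn, i) => {
    if (i >= cutoff) return turn; // recent turns are preserved verbatim
    const key = createHash("sha256").update(turn.content).digest("hex");
    const first = seen.get(key);
    if (first === undefined) {
      seen.set(key, i);
      return turn;
    }
    // The model already has this content earlier in context; don't pay for it twice.
    return { ...turn, content: `[duplicate of turn ${first}, omitted]` };
  });
}
```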
4. Code Mode for MCP (~96% reduction)
If you've wired up MCP servers, this one's almost embarrassing. A typical MCP catalog — Linear, GitHub, Notion, Slack, a few internal ones — is 100+ tool definitions and tens of thousands of tokens every turn. Lynkr's Code Mode replaces all of them with four meta-tools (listResources, readResource, callTool, searchTools) and lets the model discover the rest at runtime. Same capability surface, ~96% fewer tokens spent on the catalog.
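The four meta-tool names come straight from the feature; the schemas below are my guess at a minimal shape, sketched as Anthropic-style tool definitions rather than Lynkr's actual ones.

```typescript
// Four definitions instead of 100+; the model searches the catalog at runtime.
const codeModeTools = [
  {
    name: "searchTools",
    description: "Search the MCP catalog for tools matching a query",
    input_schema: { type: "object", properties: { query: { type: "string" } }, required: ["query"] },
  },
  {
    name: "listResources",
    description: "List resources exposed by a connected MCP server",
    input_schema: { type: "object", properties: { server: { type: "string" } }, required: ["server"] },
  },
  {
    name: "readResource",
    description: "Read a single MCP resource by URI",
    input_schema: { type: "object", properties: { uri: { type: "string" } }, required: ["uri"] },
  },
  {
    name: "callTool",
    description: "Invoke a specific MCP tool with arguments",
    input_schema: {
      type: "object",
      properties: { server: { type: "string" }, tool: { type: "string" }, args: { type: "object" } },
      required: ["server", "tool"],
    },
  },
];
```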
5. Risk-aware smart routing across model tiers
This is the one most directly relevant to the HTML debate. Generating a polished HTML artifact really does want a strong model. But the fifteen prior turns that gathered the data, ran searches, and figured out what to put in that artifact often don't.
Lynkr routes on two orthogonal axes: a 15-dimension complexity analysis (how hard is the task) and an independent risk score (how dangerous is it if the model gets this wrong). A request that touches auth/*, payments/*, migrations/*, .env, or .github/workflows, or that names high-risk concepts like authentication or deployment, is forced to the COMPLEX tier regardless of complexity score. Cheap tier for the cheap turns; frontier tier whenever the blast radius warrants it.
The four configurable tiers — TIER_SIMPLE, TIER_MEDIUM, TIER_COMPLEX, TIER_REASONING — let you map each to whatever provider:model pair makes sense. A greeting goes to a local Ollama model and costs zero. A file-read summarization goes to Haiku. The final "now write the HTML artifact" call goes to Opus or Sonnet, where it belongs.
Pay frontier prices for frontier work and high-risk work. Pay nothing for the rest.
Optionally, LYNKR_VISIBLE_ROUTING=true attaches a human-readable interaction block to each response — tier, provider, model, risk level, estimated savings — so the routing decision shows up inside Claude Code instead of buried in headers no one reads.
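A condensed sketch of the two-axis decision. The protected paths, high-risk concepts, and tier names are from the description above; the scoring thresholds and the provider:model mapping are illustrative placeholders for whatever you configure.

```typescript
const PROTECTED = [/^auth\//, /^payments\//, /^migrations\//, /\.env$/, /^\.github\/workflows\//];
const RISKY_CONCEPTS = /\b(authentication|deployment)\b/i;

type Tier = "TIER_SIMPLE" | "TIER_MEDIUM" | "TIER_COMPLEX" | "TIER_REASONING";

// Example mapping only; each tier points at whatever provider:model pair you choose.
const TIER_MODELS: Record<Tier, string> = {
  TIER_SIMPLE: "ollama:local-coder-model",  // local, effectively free
  TIER_MEDIUM: "anthropic:claude-haiku",
  TIER_COMPLEX: "anthropic:claude-sonnet",
  TIER_REASONING: "anthropic:claude-opus",
};

function route(prompt: string, touchedPaths: string[], complexity: number): Tier {
  const highRisk =
    touchedPaths.some((p) => PROTECTED.some((re) => re.test(p))) || RISKY_CONCEPTS.test(prompt);
  if (highRisk) return "TIER_COMPLEX";      // risk overrides the complexity score
  if (complexity < 0.3) return "TIER_SIMPLE";
  if (complexity < 0.7) return "TIER_MEDIUM";
  return "TIER_REASONING";
}
```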
6. Semantic and prompt caching
Identical requests hit the prompt cache (SHA-256 keyed LRU, 5min TTL). Near-identical requests — same intent, different phrasing — hit the semantic cache (embedding cosine similarity at 0.95). For repeated "generate an HTML report of X" prompts within a working session, the second call can be effectively free.
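A bare-bones sketch of that two-level lookup, assuming exact match on a SHA-256 key first and an embedding-similarity fallback at the 0.95 threshold. The embed() function and the LRU bookkeeping are stand-ins, not Lynkr's implementation.

```typescript
import { createHash } from "node:crypto";

const TTL_MS = 5 * 60 * 1000; // 5-minute TTL from the post

type Entry = { response: unknown; embedding: number[]; expires: number };
const cache = new Map<string, Entry>(); // insertion-ordered; treat as a simple LRU

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] ** 2; nb += b[i] ** 2; }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function lookup(prompt: string, embed: (s: string) => Promise<number[]>) {
  const key = createHash("sha256").update(prompt).digest("hex");
  const now = Date.now();

  const exact = cache.get(key);
  if (exact && exact.expires > now) return exact.response;       // prompt cache hit

  const vec = await embed(prompt);
  for (const entry of cache.values()) {
    if (entry.expires > now && cosine(vec, entry.embedding) >= 0.95) {
      return entry.response;                                      // semantic cache hit
    }
  }
  return null; // miss: call the model, then store { response, embedding: vec, expires: now + TTL_MS }
}
```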
7. Headroom sidecar (optional ML compression)
For users who really want to push it, there's an optional Python sidecar running LLMLingua-style transforms (Smart Crusher, Continuous Context Reduction, Tool Crusher, Cache Aligner, Rolling Window). Pure context-side compression — the model output doesn't change, the input it operates on gets smaller.
8. Token budget enforcement
Each model's context window is tracked. At 85% utilization, Lynkr automatically applies adaptive compression rather than letting the request blow past the window mid-generation. You don't get the 400-error-then-retry burn that costs you the input tokens twice.
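The guardrail in sketch form, assuming a per-model table of context limits and stand-in estimateTokens/compress functions: if the estimated request crosses 85% of the window, compress before sending instead of letting the provider reject it mid-run.

```typescript
// Illustrative per-model limits; the proxy tracks these per configured model.
const CONTEXT_WINDOWS: Record<string, number> = {
  "claude-sonnet": 200_000,
  "claude-haiku": 200_000,
};

function enforceBudget(
  model: string,
  payload: string,
  estimateTokens: (s: string) => number,
  compress: (s: string) => string,
): string {
  const limit = CONTEXT_WINDOWS[model] ?? 128_000; // fallback guess for unknown models
  const used = estimateTokens(payload);
  if (used <= limit * 0.85) return payload;   // enough headroom, send as-is
  return compress(payload);                   // adaptive compression instead of a 400 + retry
}
```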
What this looks like in practice
A concrete scenario: you ask Claude Code to "review this PR and produce an HTML report grouped by severity" — exactly Thariq's example.
Without a proxy layer, that's 8–12 model turns. Each turn re-sends the full system prompt, 50+ tool definitions, the growing conversation history, and accumulated tool results. The final turn produces ~2,000 tokens of HTML. Total input cost across the run: somewhere between 400k and 1.2M tokens depending on repo size. The HTML output is maybe 0.5% of the bill.
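A quick back-of-envelope check on that "0.5%" figure, using the run-level numbers above and the 3–5× output price premium cited earlier. The arithmetic is the point; the exact percentages will move with your repo and your price card.

```typescript
const totalInputTokens = 400_000;  // low end of the 400k–1.2M range above
const htmlOutputTokens = 2_000;

const tokenShare = htmlOutputTokens / totalInputTokens; // 0.005
console.log(`HTML output share of total tokens: ${(tokenShare * 100).toFixed(1)}%`); // 0.5%

// Even weighting output at a 3–5× per-token premium, the HTML artifact is still
// only roughly 1.5–2.5% of the spend at the low end of the input range.
```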
With Lynkr in the middle:
- If the PR's findings are already merged or the failing check has self-resolved, preflight returns "satisfied" without a single model token spent.
- Smart tool selection trims the per-turn tool-def overhead by ~60%.
- Tool result compression keeps git diff and test output from dominating subsequent turns.
- History compression dedupes the redundant intermediate state.
- Risk-aware routing sends the search/read turns to a cheap tier; the final HTML generation — and anything touching protected paths — goes to your strongest model.
- The semantic cache catches the second time you ask for the same review.
You keep the rich HTML output. You stop paying for the parts of the run that don't earn their tokens.
The takeaway
Thariq is right. HTML output really is better for the kinds of artifacts coding agents produce — and "we should use markdown because tokens are expensive" is an argument from a 32k-context-window world we no longer live in.
The critics are also right. HTML is more expensive than markdown to generate, the cost matters, and "use the richer format" without a story for cost control is a recommendation that benefits the seller more than the buyer.
What's missing from both sides of the timeline is the recognition that the output format is not the cost lever. The cost lever is everything upstream of the final response: which tools you send, how you compress prior turns, which tier you route to, what you cache, whether the work even needs to run at all.
Generate the HTML. Then put a smarter proxy in front of the model so you can afford to do it again tomorrow — and so that sometimes, when the answer is already on disk, you don't have to.
Lynkr is open source (Apache 2.0) and self-hosted: https://github.com/Fast-Editor/Lynkr. Install with npm i -g lynkr, point your coding tool's base URL at localhost:8080, and you're routing.
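A quick smoke test once it's running, assuming the default port 8080 and the Anthropic-shaped /v1/messages endpoint mentioned above; any auth headers depend on how you've configured your upstream providers.

```typescript
// Hypothetical verification call against the local proxy.
const res = await fetch("http://localhost:8080/v1/messages", {
  method: "POST",
  headers: { "content-type": "application/json" },
  body: JSON.stringify({
    model: "claude-sonnet",   // or a tier alias, depending on your routing config
    max_tokens: 256,
    messages: [{ role: "user", content: "ping" }],
  }),
});
console.log(res.status, await res.json());
```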
If you've measured your own HTML-vs-markdown token deltas in real workflows, I'd love to compare numbers — reply with your results.