Rick Cogley

Originally published at cogley.jp

Cloudflare Workers HTML to Markdown on the Free Plan

This is a condensed version. The full article on cogley.jp has the complete code walkthrough, the known limits of the starter emitter, and the full reasoning for each alternative.

AI crawlers — Gemini, GPT, Claude, Perplexity — read your site constantly, and they'd rather parse markdown than HTML. Markdown means cleaner context, fewer tokens, cheaper inference.

If your content is already markdown (CMS, Git, database), you just negotiate the format with Accept: text/markdown. But if your content is HTML (you're proxying a third-party page, mirroring docs, building a reader-mode endpoint, feeding an LLM summarizer, or simply serving a static website), you have to convert it to markdown inside a Cloudflare Worker. On the free plan that means working within 10 ms of CPU time per request and a 1 MB compressed bundle. Which strategy survives those constraints?
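As a rough illustration of the negotiation step, the sketch below checks whether an Accept header asks for text/markdown. The helper name acceptsMarkdown and its q-value handling are assumptions for illustration, not code from the article, and it deliberately ignores wildcards such as */* so that ordinary browsers keep getting HTML.

```javascript
// Hypothetical helper (not from the article): returns true when the
// Accept header lists text/markdown with a non-zero quality value.
function acceptsMarkdown(acceptHeader) {
  if (!acceptHeader) return false;
  return acceptHeader.split(',').some((part) => {
    const [type, ...params] = part.trim().split(';').map((s) => s.trim());
    if (type.toLowerCase() !== 'text/markdown') return false;
    // An explicit q=0 means "not acceptable".
    const q = params.find((p) => p.toLowerCase().startsWith('q='));
    return q === undefined || parseFloat(q.slice(2)) > 0;
  });
}

// Possible wiring inside a Worker fetch handler (sketch only):
//   if (acceptsMarkdown(request.headers.get('Accept'))) { /* serve markdown */ }
```

A stricter implementation would rank every media type by q-value; for a markdown/HTML switch, a presence check like this is often enough.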

"Paid" is two different things

Worth clarifying upfront because Cloudflare uses "paid" for two separate products:

  • Workers Paid ($5/month + usage) — the Worker runtime upgrade. CPU jumps from 10 ms to 30 s, bundle from 1 MB to 10 MB. This plan is what changes the HTML-to-markdown calculus.
  • Cloudflare Pro ($20/month per domain) — a domain plan. Adds WAF, image optimization, page rules. Does not change any Worker limit.

Throughout this article, "paid" means Workers Paid.

The free-plan budgets

Limit                      Free    Paid
CPU time per request       10 ms   30 s (Standard)
Compressed Worker bundle   1 MB    10 MB

Ten milliseconds is plenty for routing or JSON work. HTML-to-markdown is different: you're parsing the document, walking every node, and emitting a transformed string. It's CPU-dense, and any strategy that ships its own DOM implementation tends to blow the bundle budget too.

The punchline: HTMLRewriter

HTMLRewriter is built into workerd — the open-source JavaScript/Wasm runtime that executes your Worker at the edge (and what wrangler dev runs locally). Zero npm dependencies. Used by Cloudflare themselves for response transformation.

Architectural distinction. HTMLRewriter is streaming and SAX-style: it fires <h1> / text / </h1> events as bytes arrive and never builds an in-memory tree. The turndown / Readability / cheerio family do the opposite — buffer the whole document, construct a DOM with every node and parent pointer allocated, then walk it. That construction pass is both a CPU tax (before you emit a single markdown character) and the reason those libraries ship their own DOM implementation (hundreds of KB of bundle).

On a sample 34 KB HTML article:

  • Bundle: 10.52 KiB uncompressed / 3.74 KiB gzipped (0.4% of the 1 MB budget)
  • CPU: 2 ms median over 50 runs (min 2, max 8) — 20% of the 10 ms budget
  • Output: 24.9 KB markdown

That's 5× CPU headroom and 250× bundle headroom on free. Nothing else I measured came close.

These numbers come from local workerd under wrangler dev. The edge runtime is typically 1.5–2× slower, so plan on a realistic median of 3–4 ms, still well inside 10 ms.
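The headroom figures are simple ratios, and the sketch below reproduces them from the measurements quoted above. The constant and function names are illustrative, and reading 1 MB as 1024 KiB is an assumption. The unrounded bundle ratio comes out slightly above the quoted 250×, which is what you get if the share is first rounded to 0.4%.

```javascript
// Figures quoted in the article; names are illustrative only.
const FREE_CPU_MS = 10;
const FREE_BUNDLE_KIB = 1024;  // assumption: 1 MB read as 1024 KiB
const measuredCpuMs = 2;       // local median over 50 runs
const gzippedBundleKiB = 3.74;

// Headroom = how many times the measured cost fits into the budget.
const headroom = (budget, used) => budget / used;

const cpuHeadroom = headroom(FREE_CPU_MS, measuredCpuMs);           // 5
const bundleHeadroom = headroom(FREE_BUNDLE_KIB, gzippedBundleKiB); // about 274

// Apply the quoted 1.5–2× edge slowdown to the local median.
const edgeCpuMs = [1.5, 2].map((factor) => measuredCpuMs * factor); // [3, 4]

console.log({ cpuHeadroom, bundleHeadroom: Math.round(bundleHeadroom), edgeCpuMs });
```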

Why the alternatives don't fit

Strategy                 Bundle       CPU estimate   Fits free budget?
HTMLRewriter             ~11 KB       2 ms           ✅ Huge headroom
node-html-parser         ~40 KB       Fast           ✅ Good fallback
cheerio + custom         100–150 KB   Moderate       △ Tight, no upside
turndown + domino shim   ~320 KB      15–30 ms       ❌ Busts CPU budget
Readability + turndown   ~400 KB      20–40 ms       ❌ Busts both
jsdom                    ~2 MB        -              ❌ Twice the bundle budget

The turndown / Readability stack is the canonical "HTML to markdown" setup, and it produces clean output. It just doesn't fit on the free plan: the 240 KB @mixmark-io/domino shim alone eats roughly a quarter of the 1 MB bundle budget before you write a single line, and the CPU estimates land above 10 ms. Use it on paid.

How to use HTMLRewriter for markdown

Insert markdown punctuation as text before and after each matched element, then strip the remaining tags, keeping their text, with a catch-all * selector:

// `html` is the source HTML string to convert.
const rewriter = new HTMLRewriter()
  // Drop non-content regions entirely, including their children.
  .on('head, nav, aside, footer, script, style, figure', {
    element(el) {
      el.remove();
    },
  })
  // Headings: prefix with #/## and surround with blank lines.
  .on('h1', {
    element(el) {
      el.before('\n\n# ', { html: false });
      el.after('\n\n', { html: false });
    },
  })
  .on('h2', {
    element(el) {
      el.before('\n\n## ', { html: false });
      el.after('\n\n', { html: false });
    },
  })
  // List items: every item gets a bullet, ordered or not.
  .on('li', {
    element(el) {
      el.before('\n- ', { html: false });
    },
  })
  // Links: wrap the text as [text](href), stripping whitespace from the URL.
  .on('a', {
    element(el) {
      const href = (el.getAttribute('href') || '').replace(/\s+/g, '');
      el.before('[', { html: false });
      el.after(`](${href})`, { html: false });
    },
  })
  // Catch-all: remove every tag but keep its text content.
  .on('*', {
    element(el) {
      el.removeAndKeepContent();
    },
  });

// Run the rewriter over the HTML and read the result back as a string.
const raw = await rewriter
  .transform(new Response(html, { headers: { 'content-type': 'text/html' } }))
  .text();

Then post-process the raw output to decode HTML entities and normalize whitespace (full code in the cogley.jp article).
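For a sense of what that post-processing step might involve, here is a minimal sketch. It is an assumption, not the article's implementation: the function names are made up, only a handful of named entities are handled, and non-breaking spaces are mapped to plain spaces.

```javascript
// Assumed minimal post-processor; not the article's full implementation.
const NAMED_ENTITIES = new Map([
  ['amp', '&'], ['lt', '<'], ['gt', '>'],
  ['quot', '"'], ['apos', "'"], ['nbsp', ' '],
]);

// Decode numeric entities and a small set of named ones in a single pass,
// so "&amp;lt;" becomes "&lt;" rather than being decoded twice into "<".
function decodeEntities(text) {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#\d+|[a-zA-Z]+);/g, (match, body) => {
    if (body[0] === '#') {
      const isHex = body[1] === 'x' || body[1] === 'X';
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code > 0 && code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return NAMED_ENTITIES.get(body) ?? match; // unknown names pass through
  });
}

// Strip trailing spaces and collapse runs of blank lines down to one.
function normalizeWhitespace(text) {
  return text
    .replace(/\r\n?/g, '\n')
    .replace(/[ \t]+\n/g, '\n')
    .replace(/\n{3,}/g, '\n\n')
    .trim() + '\n';
}

const postProcess = (raw) => normalizeWhitespace(decodeEntities(raw));
```

Calling postProcess(raw) on the rewriter output would give the final markdown. Stripping trailing spaces is safe here only because the emitter above never produces markdown hard line breaks.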

Known limits

HTMLRewriter selectors fire independently, so cross-element state is awkward:

  • Ordered lists come out unnumbered (no parent lookup for list index)
  • Inline <code> inside <pre> drops its fences (no parent-type check)
  • Link text spanning formatting tags loses emphasis when * strips <em>

These matter for round-tripping. For "give AI agents clean markdown" they don't.
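If list numbering does matter for your use case, one possible workaround is to keep the missing parent context yourself in a small object that the handlers share through a closure. The ListTracker class below is a hypothetical helper, not part of the article or of HTMLRewriter. The wiring shown in the trailing comment relies on the element's onEndTag hook and has not been tested against workerd.

```javascript
// Hypothetical helper: a stack of list contexts shared by handlers, so each
// <li> can ask which marker it should emit.
class ListTracker {
  constructor() {
    this.stack = [];
  }
  // Call when an <ol> or <ul> start tag is seen.
  enter(tagName) {
    this.stack.push({ ordered: tagName.toLowerCase() === 'ol', index: 0 });
  }
  // Call when the matching end tag is seen.
  exit() {
    this.stack.pop();
  }
  // Marker for the next <li>, indented four spaces per nesting level.
  nextMarker() {
    const current = this.stack[this.stack.length - 1];
    if (!current) return '\n- '; // stray <li> outside any list
    current.index += 1;
    const indent = '    '.repeat(this.stack.length - 1);
    return `\n${indent}${current.ordered ? `${current.index}. ` : '- '}`;
  }
}

// Assumed wiring (untested against workerd); create one tracker per response:
//   const lists = new ListTracker();
//   rewriter
//     .on('ol, ul', { element(el) { lists.enter(el.tagName); el.onEndTag(() => lists.exit()); } })
//     .on('li', { element(el) { el.before(lists.nextMarker(), { html: false }); } });
```

The same stack idea could in principle cover the <pre>/<code> case, at the cost of more state to keep consistent across handlers.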

Measure it yourself

The harness that produced these numbers is a separate public repo: cf-workers-html-to-markdown-harness. It's a measurement rig, not a library.

git clone https://github.com/RickCogley/cf-workers-html-to-markdown-harness
cd cf-workers-html-to-markdown-harness
npm install --ignore-scripts
npm run dev
curl 'http://127.0.0.1:8791/bench?strategy=htmlrewriter&runs=50'

Adding a new strategy (cheerio, node-html-parser, turndown+domino) is one handler file under src/handlers/ plus one line in src/index.ts — see ADD_A_STRATEGY.md for the full pattern.

If you measure something I didn't and get different numbers, open an issue on the harness repo and I'll update the article.

When to stop fighting and go paid

Need any of these? Stop trying to stay on free:

  • Round-trippable markdown → turndown
  • Article extraction (skip nav/sidebar/comments) → Readability
  • HTML tables converted to markdown tables → turndown or cheerio
  • Larger or more varied inputs → paid's 30 s budget removes the ceiling

Workers Paid is $5/month + usage. It's cheaper than an afternoon of engineering around the free-plan budgets if your use case needs a fuller converter.


Full version with complete code, reasoning, and the harness deep-dive: cogley.jp.


Rick Cogley is founder/CEO of eSolia Inc., which provides bilingual IT outsourcing and infrastructure services in Tokyo, Japan.