Nicolas Francisquelo Tacca
Your AI Agent Doesn't Need Firecrawl Anymore

And honestly? It's about time.

I've been building AI agents that consume web content for a while now. I've also been running sites behind Cloudflare for years. So when they announced Markdown for Agents this week, I didn't just read the blog post — I felt it. Because I've lived on both sides of this problem, and I know exactly how much unnecessary pain this eliminates.

Let me explain why this is a bigger deal than most people realize.


The Dirty Secret of Every AI Agent Pipeline

Here's what building an AI agent that browses the web actually looks like today:

  1. Your agent needs information from a webpage
  2. You fetch the HTML
  3. You stare at 900KB of <div> soup, inline styles, tracking scripts, and navigation menus
  4. You pipe it through Firecrawl, Crawl4AI, Jina Reader, or your own janky Playwright script
  5. You get markdown back
  6. You feed it to your LLM
  7. You pray the conversion didn't mangle anything important

Every. Single. Time.

And look — tools like Firecrawl and Crawl4AI are genuinely great. I've used them extensively and they solve real problems. But here's the thing that always nagged me: why are we doing this at all?

Think about it. The website already knows its own content. The server already has the structured data before it wraps it in HTML. We're asking a third-party tool to reverse-engineer structure that the origin already had. It's like translating a book from English to French and then paying someone else to translate it back to English. Something always gets lost.

What Cloudflare Actually Built

Markdown for Agents is deceptively simple. Your AI agent sends a request with Accept: text/markdown in the header. If the site uses Cloudflare (and has this enabled), you get clean markdown back instead of HTML. Same URL. No separate API. No scraping tool. No conversion step.

```shell
curl https://example.com/some-page -H "Accept: text/markdown"
```

That's it. You get markdown. Done.

The numbers are staggering:

  • 80% fewer tokens on average
  • One Amazon product page went from 896,000 tokens in HTML to 8,000 in markdown. That's a 99% reduction.
  • A typical blog post drops from ~16,000 to ~3,000 tokens

This isn't an incremental improvement. This is an entire category of tooling that just became optional.

Why This Hits Different If You Build Agents

If you're a full-stack dev exploring AI agents, you might not immediately feel why this matters. Let me put it in terms that hit your wallet and your architecture.

Token costs are real. Every token you feed to an LLM costs money. When an HTML page burns 16K tokens and the actual content is 3K tokens, you're paying 5x more than you need to. Across thousands of pages, across multiple agent runs per day — that adds up fast.
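To make that arithmetic concrete, here's a back-of-the-envelope sketch. The per-million-token price is a made-up placeholder, not any provider's actual rate:

```typescript
// Back-of-the-envelope cost math for the 16K-vs-3K example above.
// The per-million-token price is a hypothetical placeholder.
const PRICE_PER_MILLION_INPUT_TOKENS = 3; // USD, assumed

function dailyInputCost(tokensPerPage: number, pagesPerDay: number): number {
  return (tokensPerPage * pagesPerDay * PRICE_PER_MILLION_INPUT_TOKENS) / 1_000_000;
}

const htmlCost = dailyInputCost(16_000, 5_000);    // $240/day on raw HTML
const markdownCost = dailyInputCost(3_000, 5_000); // $45/day on markdown
console.log(`HTML: $${htmlCost}/day, markdown: $${markdownCost}/day`);
```

At 5,000 pages a day, that's the difference between a rounding error and a line item on your cloud bill.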

Context windows aren't infinite. Even with 128K or 200K context windows, you're constantly playing Tetris with how much information you can fit. Cut 80% of the noise and suddenly your agent can process 5x more pages in a single context. Your RAG pipeline gets dramatically better because you're embedding actual content, not CSS class names.

Your scraping pipeline is a liability. Every extra service in your pipeline is a point of failure. Rate limits, API changes, conversion bugs, timeout errors. I've debugged more Playwright timeout issues than I care to admit. Removing that entire layer from your architecture isn't just cleaner — it's more reliable.

The Bigger Picture Most People Are Missing

Here's my hot take: Cloudflare is positioning itself as the HTTP layer for the agent economy.

Think about what they've been building:

  • AI Gateway — unified access to 350+ models with routing, billing, monitoring
  • Agents SDK — state management, WebSocket support, tool calling for agents
  • Workers AI — inference at the edge
  • Browser Rendering — headless browser with a markdown API endpoint
  • Content Signals Policy — machine-readable permissions for AI access (ai-train=yes/no, ai-input=yes/no)

And now Markdown for Agents — making every Cloudflare site automatically agent-readable.

This isn't a feature announcement. This is infrastructure for a world where agents are first-class citizens of the web, right alongside browsers.

The Content-Signal header in the markdown response is particularly telling. It includes directives like ai-train=yes, search=yes, ai-input=yes. Cloudflare is building the consent layer for AI access to web content. In a world where publishers are suing AI companies and robots.txt compliance is dropping (13% of AI bot requests ignored robots.txt in mid-2025), having a standardized, opt-in mechanism for serving content to agents is going to matter a lot.
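If your agent wants to honor those directives, parsing the header is cheap. A minimal sketch, assuming the header is a comma-separated list of key=value pairs as in the examples above (the exact wire format here is my assumption, not a spec quote):

```typescript
// Sketch: parse Content-Signal directives before ingesting a page.
// Assumes a comma-separated "key=value" list, e.g. "ai-train=yes, search=yes".
function parseContentSignal(header: string): Record<string, string> {
  const signals: Record<string, string> = {};
  for (const part of header.split(",")) {
    const [key, value] = part.split("=").map((s) => s.trim());
    if (key && value) signals[key.toLowerCase()] = value.toLowerCase();
  }
  return signals;
}

const signals = parseContentSignal("ai-train=yes, search=yes, ai-input=yes");
// Check the publisher's intent before feeding the content to a model.
const okToIngest = signals["ai-input"] === "yes";
```

A well-behaved agent checks this before the content ever reaches a model, the same way a well-behaved crawler checks robots.txt.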

Content Negotiation: Old Protocol, New Purpose

What's elegant about this approach is that it uses HTTP content negotiation — a mechanism that's existed since HTTP/1.1. The Accept header has always been how clients tell servers what format they want. We just never had a use case this compelling for it.

Vercel is doing something similar with their pages, serving optimized formats to agents. The pattern is converging: same URL, different representation depending on who's asking. Humans get the full visual experience. Agents get clean, semantic text.

This is the right architectural pattern. No separate APIs to maintain. No duplicate content. No special endpoints. Just standard HTTP doing what it was designed to do.

Coding agents like Claude Code and OpenCode are already sending Accept: text/markdown headers. The demand side is moving. Cloudflare just gave the supply side a one-click way to respond.

What This Means for Your Stack

Let me be practical for a second. If you're building AI agents today, here's how this changes your thinking:

Before Markdown for Agents:

```text
fetch URL → receive HTML → send to Firecrawl/Jina → receive markdown → feed to LLM
```

After:

```text
fetch URL with Accept header → receive markdown → feed to LLM
```

You still need Firecrawl or Crawl4AI for:

  • Sites not on Cloudflare
  • Sites that haven't enabled the feature (requires Pro plan or above)
  • Heavy JavaScript-rendered content (use Cloudflare's Browser Rendering instead)
  • Structured data extraction with custom schemas

But for a huge chunk of the web — Cloudflare powers roughly 20% of all websites — you can skip the middleman entirely. And the percentage of sites supporting this will only grow as more platforms adopt content negotiation.

My recommendation: Update your agent's fetching logic to try Accept: text/markdown first. Fall back to your existing conversion pipeline if the response comes back as HTML. It's a progressive enhancement that costs you nothing to implement.

```typescript
async function fetchContent(url: string): Promise<string> {
  const response = await fetch(url, {
    headers: { 'Accept': 'text/markdown, text/html;q=0.9' }
  });

  const contentType = response.headers.get('content-type');

  if (contentType?.includes('text/markdown')) {
    // Direct markdown — no conversion needed
    const tokens = response.headers.get('x-markdown-tokens');
    console.log(`Got markdown directly. Estimated tokens: ${tokens}`);
    return response.text();
  }

  // Fallback to your existing conversion pipeline
  return convertHtmlToMarkdown(await response.text());
}
```

Bonus: the x-markdown-tokens header tells you the estimated token count before you even process the content. Smart agents can use this for chunking decisions or to decide if a page is worth ingesting at all.
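One way that gating could look in practice. This is a sketch: the budget number is arbitrary, and the fall-through behavior when the header is missing is my own choice, not anything Cloudflare prescribes:

```typescript
// Sketch: use the estimated token count to decide whether a page is worth
// ingesting. Budget and missing-header behavior are assumptions.
function shouldIngest(headers: Headers, remainingBudget: number): boolean {
  const raw = headers.get("x-markdown-tokens");
  if (raw === null) return true; // no estimate available: ingest and hope
  const estimated = Number.parseInt(raw, 10);
  if (Number.isNaN(estimated)) return true;
  return estimated <= remainingBudget;
}

const h = new Headers({ "x-markdown-tokens": "8000" });
const fits = shouldIngest(h, 10_000);   // true: page fits the remaining context
const tooBig = shouldIngest(h, 5_000);  // false: skip it, or chunk it instead
```

Skipping a page costs you one HEAD-sized decision; ingesting an oversized one costs you a blown context window.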

The Uncomfortable Question

There's a tension in Cloudflare's approach that the Hacker News crowd (rightfully) called out. Cloudflare simultaneously offers bot protection to block AI crawlers AND this feature to serve them optimized content. That seems contradictory until you realize the distinction: it's about consent and control.

Bot protection blocks unauthorized scraping. Markdown for Agents enables authorized consumption. The difference is that the site owner opts in, chooses what to serve, and can set content signal policies about how the content can be used. It's the difference between someone breaking into your house and you opening the front door.

Whether this framing holds up long-term — especially as AI-generated content, cloaking concerns, and copyright battles escalate — is an open question. But the intent is right: give publishers control over how agents access their content, and give agents a clean way to consume it when permission is granted.

Where This Is Heading

We're watching the web grow a new interface layer in real time. HTML for humans. Markdown for agents. Same content, different representations, same URLs.

Combine this with:

  • MCP (Model Context Protocol) giving agents standardized tool interfaces
  • WebMCP (W3C draft as of last week) making websites agent-ready from the browser
  • AGENTS.md emerging as a standard for telling AI tools how repositories work
  • Content Signals providing machine-readable usage permissions

And you start seeing a web that's natively bilingual — designed for both human and machine consumption from the ground up.

I think in 12 months, serving markdown to agents via content negotiation will be as standard as serving gzipped responses. It's too obvious not to become the default.

The Bottom Line

Cloudflare didn't invent HTML-to-markdown conversion. But they did something arguably more important: they made it unnecessary for a massive chunk of the web.

If you're building AI agents, this is a free performance upgrade. If you're running a website on Cloudflare, turning this on makes your content accessible to the growing wave of AI agents — on your terms.

The scraping pipeline you spent weeks building? It's still useful. But its surface area just got a lot smaller. And that's a good thing.


What's your take? Are you already sending Accept: text/markdown headers in your agents? I'd love to hear how this changes your stack - hit me up in the comments or on Twitter/X @nicoeft.

Top comments (2)

wfgsss

The token reduction numbers are wild — 99% on an Amazon page is game-changing for agent pipelines. But here's the catch nobody's talking about: this only works for sites behind Cloudflare that opt in.

For the sites I scrape most (Chinese wholesale platforms like Yiwugo, DHgate, 1688), they're behind their own CDNs or China-specific WAFs, not Cloudflare. So the Firecrawl/Crawl4AI layer isn't going away for those use cases anytime soon.

What I've found works better than HTML-to-markdown conversion for these sites: extracting the window.__INITIAL_STATE__ or __NEXT_DATA__ JSON that most React/Next.js e-commerce sites embed. You get structured data directly — no conversion step, no token waste, and it's already in the format your LLM can reason about. Basically the same idea as Cloudflare's approach (get the data before it becomes HTML), just done client-side.
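For anyone curious, the extraction is roughly this. A rough sketch assuming the standard server-rendered Next.js script tag; the regex and JSON shape are approximations, not a guarantee for every site:

```typescript
// Sketch: pull the embedded __NEXT_DATA__ JSON out of a server-rendered
// Next.js page. Script-tag attributes are assumed to match Next.js defaults.
function extractNextData(html: string): unknown | null {
  const match = html.match(
    /<script id="__NEXT_DATA__" type="application\/json">([\s\S]*?)<\/script>/
  );
  if (!match) return null;
  try {
    return JSON.parse(match[1]);
  } catch {
    return null; // malformed or truncated payload
  }
}

const page =
  '<html><script id="__NEXT_DATA__" type="application/json">' +
  '{"props":{"pageProps":{"price":12.5}}}</script></html>';
const data = extractNextData(page) as
  { props?: { pageProps?: { price?: number } } } | null;
// data?.props?.pageProps holds the structured page state
```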

The Content Signals Policy piece is what I'm most excited about though. Having machine-readable ai-train=yes/no at the infrastructure level would solve so many compliance headaches for scraper developers. Right now every platform has different ToS interpretations.

Great writeup — curious if you've seen any adoption numbers for the Accept: text/markdown header in the wild yet?

Nicolas Francisquelo Tacca • Edited

Yep, you're right: the CF coverage gap is real. My pipelines hit mostly US/EU sites, so I'm biased there. If you're scraping Chinese wholesale platforms, this doesn't change much for you yet.

As for adoption numbers, it's way too early; it's been live less than a week. I'll update the post if real data surfaces.