Ray

Posted on Jun 19

How Much of Your Blog Does AI Search Actually Grab? Breaking Down Claude's WebSearch and WebFetch

#seo #webdev #ai #claude

A while back I wrote Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search, and right after finishing it a question hit me: when AI does a web search, how much of my blog does it actually grab? The whole article verbatim? The first 500 characters? Or does it bail after seeing just the title? So I dug into it, and this post walks through Anthropic's official web_search and web_fetch tool specs, runs a quick test against my own blog, and ends with what all this concretely means for how you should write posts and copy.

"Search" and "fetch" are two different things

Before going further, the one thing worth being crystal clear on: when AI runs a query, "search" and "fetching the page body" are not the same operation. They're two separate stages.

Stage 1 (WebSearch): the AI takes your question and calls the WebSearch tool (which hits a search engine). What comes back is a list of search results — each entry has things like URL and title, but no page body.
Stage 2 (WebFetch): after looking at the search result list, the AI decides which entries are worth digging into, then fires a WebFetch request at each of those URLs, basically saying "give me the body of this page." That's when actual content gets pulled in.

Why doesn't it just grab the body during the search stage? Context window limits. If every search shoved 10 results' worth of full bodies in, your usable context would blow up fast — and then you'd start complaining the AI is dumb and forgets what you just asked it (because the context did overflow). So it's split into two stages: search first for a list, then decide which entries from that list to actually fetch.

Once that two-stage split makes sense, the rest of this post is about what each stage actually pulls in.

What does WebSearch pull in?

Going straight to Anthropic's official web_search tool docs — every search result entry has only four fields:

url: the page URL
title: the page title
page_age: when the page was last updated
encrypted_content: encrypted content, not for the AI to read the article — it's for multi-turn conversation citations

That's it. Four fields.

What the AI sees during the search stage is "URL, title, last updated" — three pieces of human-readable info. No body content at all.

What if the AI cites your content? There's a cap on that too:

Each web_search_result_location's cited_text is up to 150 characters of the cited content

In short: at most 150 characters of quoted text. And that's just the API-level spec.

Claude Code's built-in WebSearch shaves it down further. According to Mikhail Shilkov's breakdown of Claude Code's internal behavior, Claude Code even drops page_age and encrypted_content, keeping only title and url.

So basically — at the search stage, the AI sees nothing more than one title and one URL from your site. That's it.

What does WebFetch pull in?

Now for when the body actually gets pulled in — Stage 2, WebFetch.

Once the AI has the search results, if it decides to open up a few entries, it fires one WebFetch request per URL, and that's when the full body comes back. How much of it?

This needs to be split into two layers, because the API and Claude Code work differently.

Note
When I say "API" here, I mean the Anthropic API's web_fetch tool. "Claude Code" means the WebFetch feature built into Anthropic's own product. The two have different specs and flows.

API-level web_fetch

The Anthropic API's web_fetch tool has a parameter called max_content_tokens that developers can set themselves — though the official docs use 100,000 tokens in their examples.

The docs also give a reference conversion:

Content size	Estimated tokens
Average web page 10 KB	~2,500 tokens
Large doc page 100 KB	~25,000 tokens
Research paper PDF 500 KB	~125,000 tokens

So a medium-length blog post in plain text is usually 1–2,000 tokens, way below the 100K ceiling. Truncation basically isn't a concern unless you wrote a 50,000-character monster.

One thing to note: web_fetch's citation works differently from web_search. It uses start_char_index / end_char_index to pick out a specific position in the article (although the docs don't pin down a hard character limit).

Claude Code's built-in WebFetch

Claude Code's built-in WebFetch goes a different route.

Per Mikhail Shilkov's breakdown, the WebFetch flow is:

Convert HTML to Markdown using the Turndown package
Extract the first 100 KB of plain text
Pass that 100 KB to the Haiku 3.5 model
Haiku summarizes the answer based on your prompt and returns the summary to the main model

The real kicker is step 3. The main model — the Claude model you're actually using — never sees the page's original text. It only sees the version Haiku summarized. Which means what your writing turns into by the time it reaches the main model is decided by how Haiku reads it, not by how much you wrote.

The citation has a limit too. The rule Mikhail extracted from Claude Code's internal prompt is:

Enforce a strict 125-character maximum for quotes from any source document.

So quotes max out at 125 characters.

You're probably wondering — so how much is 100 KB of plain text? For Chinese, where each character is roughly 3 bytes, 100 KB fits 30,000+ characters; for English at 1 byte per character, that's well over 100,000 characters. No regular blog post is going to hit that ceiling. So the real concern should be "how do I get Haiku to extract the parts I want to be quoted," not "how much content am I feeding Haiku."

Running a real test against my own blog

Enough theory — let's actually test it.

I had Claude call the WebFetch tool against my blog's homepage (israynotarray.com) and asked it to grab the body.

To make this measurable, I leaned on Claude Code's Hook feature to log every stage's input and output. Here's the hook config — every WebSearch and WebFetch I/O gets logged:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "WebSearch|WebFetch",
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/log-web-tool.sh"
          }
        ]
      }
    ],
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/dump-transcript.sh"
          }
        ]
      }
    ]
  }
}

The prompt I used:

Please use the WebFetch tool to grab the content of israynotarray.com's homepage, then tell me what you extracted from that page.

The actual input passed to WebFetch:

{
  "url": "https://israynotarray.com",
  "prompt": "Tell me the main content of this homepage: site name, author bio, latest article list (titles and summaries), categories, plus any navbar or sidebar info."
}

After it ran, the WebFetch result — the summary that came back — looked roughly like this:

{
  "bytes": 1068,
  "code": 200,
  "codeText": "OK",
  "result": "# Page Content Analysis\n\nBased on the content snippet provided, I can share the following:\n\n## Site Structure\n- **Main category**: AI-related content, including Gemini and Gemini Omni products\n\n## Latest Article\n**Title**: \"Gemini Omni Video Generation: 30 Prompt Templates for Reels, Product Demos, and Logo Reveals\"\n\n**Summary**:\nThis post introduces Gemini Omni video generation, released by Google at I/O in May 2026, and explains how it replaces the earlier Veo 3.1. The article covers new features like '10-second clips, native synced audio, and Chinese text rendering,' and provides 30 practical prompt templates for use cases including Instagram Reels, product demos, logo animations, B-roll, transitions, quote cards, lifestyle, and food and travel.\n\n**Publish date**: 2026-06-06  \n**Reading time**: ~23 min\n\n## Limitations\nThe content snippet provided does not include author bio or full navbar info.",
  "durationMs": 4603,
  "url": "https://israynotarray.com"
}

See it? What the AI actually pulled from my blog is just this tiny under-1,000-character summary. A whole website obviously contains way more than this — the rest was never seen by the main model. The Haiku middle layer reads the page and only extracts what it judges relevant to my prompt. If I opened the same page in a browser, I'd see a full grid of article cards plus a sidebar — but Haiku doesn't ship the full grid back.

I also tried an older post with a deliberately broken URL path, and got this:

{
  "bytes": 0,
  "code": 404,
  "codeText": "Not Found",
  "result": "The server returned HTTP 404 Not Found.\n\nThe response body was not retrieved. If this URL requires authentication, use an authenticated tool (e.g. `gh` for GitHub, or an MCP-provided fetch tool) instead of WebFetch.",
  "durationMs": 588,
  "url": "https://israynotarray.com/dqwdqwdqwd"
}

Even the content of your 404 page is invisible to the AI — WebFetch just reports the 404 and the AI has no way to see what your 404 page says. Which means if your site has path issues, you've refactored URLs, or you only have frontend routing without real pages, the AI can't pull anything.

Side note — this lines up with a caveat in Claude's official docs:

The web fetch tool currently does not support websites dynamically rendered with JavaScript.

If your blog is a frontend SPA where content is entirely rendered by JavaScript at runtime, what the AI grabs might just be empty-shell HTML with no articles visible. Static generators (Hexo, Astro, Next.js in SSG mode) are relatively safe, since the build output is fully rendered HTML — the AI grabs and immediately sees content.

Don't forget the robots.txt layer

There's one more important piece — whether the AI can pull your site has a major prerequisite: robots.txt.

AI crawlers basically split into two types: search-style (cite and link back to your site) and training-style (eat content to feed the model, not necessarily linking back). The common mapping:

Crawler	Type	Behavior
Claude-SearchBot / Claude-User	Search	Real-time fetch when Claude answers, cites back
ClaudeBot	Training	Fetches content to feed Claude training
OAI-SearchBot / ChatGPT-User	Search	Real-time fetch when ChatGPT answers, cites back
GPTBot	Training	Fetches content to feed GPT training
PerplexityBot	Search	Used by Perplexity engine, cites back
Google-Extended	Training	For Gemini training
CCBot	Training	Common Crawl public dataset

If you want to be cited by AI but don't want your content used for training, the most common strategy is "allow search-style, block training-style."

Here's a robots.txt template you can copy-paste:

# Search-style AI crawlers: allow (they cite back)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-style AI crawlers: block (consume data without citing)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Content signal: searchable but not for training, not for direct AI input
Content-Signal: ai-train=no, search=yes, ai-input=no

For a full Agent Readiness setup to score 100, see From 3 to 100! How to Get Your Site to Pass isitagentready's AI Agent Readiness Check.

So what does this concretely mean for writing?

Once you understand all the constraints above, there are four things worth specifically working on.

Titles need to stand on their own

At the search stage, all the AI sees about your article is two fields — title and URL.

If your title needs a subtitle or context to make sense, when the AI lines it up against ten other results it'll get skipped.

A quick comparison:

Weaker: "Implementation Notes"
Stronger: "Complete Implementation Notes: Content Negotiation for HTML-to-Markdown in a Cloudflare Worker"

The stronger version packs in "topic, tool, what it does, and article type" — the AI doesn't even need to open the page to know whether it's worth fetching.

Lead with the answer

At the WebFetch stage, the Haiku middle layer reads top-down. The first 300–500 characters decide what it summarizes back. If your opening is "Before we get into X, let's recap a bit of history…", Haiku reads halfway through and discovers the intro is all background and no answer — so it just summarizes the background.

The right move is to make the first sentence of every H2 a direct conclusion, then add the context after. I covered this principle in Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search too — worth reading alongside this.

Design single sentences that can be quoted standalone

cited_text is 150 characters on web_search and 125 characters on Claude Code's built-in WebFetch. That means when the AI quotes you, the slot it has is one short sentence that "makes sense without context."

Consciously design sentences like that. For example:

Weaker: "This is a bit different from the method mentioned earlier — the main difference is…" (makes no sense without context)
Stronger: "llms.txt was proposed by Answer.AI co-founder Jeremy Howard in 2024, with the goal of proactively telling AI what important content a site has." (stands on its own without surrounding text)

After writing a paragraph, pick one sentence and ask yourself: if someone who hadn't read the rest of the post saw just this sentence, would they get it? If yes, it has a shot at being quoted.

Use structure as Haiku's navigation markers

H2, H3, - lists, tables, `js code blocks — these Markdown structures are especially useful for the middleware summary layer. When Haiku reads the Markdown converted from your HTML, it treats headings as "what this section is about" indexes, lists as "main points" signals, and tables as "supporting data" units.

If your whole article is pure prose paragraphs, Haiku has no markers and has to grind through it semantically — what comes out is scattered. If you have clear structure, Haiku can summarize along the markers, and the result lines up with the points you actually want quoted.

Wrap-up

So how much of your blog does AI search actually pull?

The answer breaks down into three layers:

Search stage: title + URL only
Body fetch stage: the API default can fit your whole article, but Claude Code goes through Haiku summarization with a 100 KB cut-off
Citation stage: web_search is 150 characters, Claude Code WebFetch is 125 characters

Writing for AI search means targeting those three gates — it's not about getting the AI to memorize your entire post.

If your blog hasn't set up AI bot routing yet, copy the robots.txt template above to get the basics in place — the rest is just content over time.

DEV Community