DEV Community: Ray

design.md: If Your AI-Generated UI Keeps Drifting, You're Missing Google's design.md

Ray — Sat, 27 Jun 2026 06:41:39 +0000

Have you ever done some Vibe Coding and noticed the AI's layout is always a little off? You keep saying "the spacing is wrong here, the colour is inconsistent there", and it still produces a slightly broken screen? If yes, the real reason is that you haven't given the AI the rules of your design system properly.

To fix exactly that, this post walks through design.md, an AI design system spec released by Google Labs. The goal is to let AI Coding tools like Claude Code, Cursor, and GitHub Copilot actually read your design system and produce consistent UI — so you don't have to keep repeating "wrong colour, wrong font size" forever.

If you're not familiar yet with rules files for AI like CLAUDE.md / AGENTS.md, take a look at CLAUDE.md and Rules for AI: Writing Tips That Make AI Understand Your Project and Agent Skills: More Token-Efficient Than MCP first — it'll make a lot more sense why design.md exists.

Where design.md came from

design.md actually comes out of Stitch, Google's AI design tool (first shown at Google I/O 2025). Stitch is a UI design tool powered by Gemini — you give it a Prompt or upload a sketch and it generates a UI screen, frontend code included.

Around 2026, Google released the design.md format itself under Apache 2.0 on GitHub (google-labs-code/design.md). What does that mean in practice? They're trying to "standardise" Stitch's design system rules.

The most common pain point for Vibe Coding developers is inconsistent UI and inconsistent style. But if you open-source this design spec, the business value gets pretty huge — whether you use Claude Code, Cursor, GitHub Copilot, or any other AI Coding tool, you can use this spec to produce consistent UI.

A few examples of what that gets Google:

Becoming the standard means owning the definition. Think OpenAPI or Kubernetes — looked harmless at launch, but once everyone adopted them, the contributor with the loudest voice on where the spec goes was the one who shipped it.
Other AI tools voluntarily become allies. Once Claude Code, Cursor, and Copilot start marketing "we support design.md", competitors are basically promoting Google's format for free.
The training data flywheel. People will put their project's design.md in public repos. Structured design tokens plus human-written rationale is honestly the perfect combo for AI training data. Future Gemini gets stronger.
Stitch's free marketing funnel. Once developers get used to design.md, when they need a "tool that speaks design.md", the first thing they think of is Stitch.

Big-company open source that looks like a loss is usually trading time for ecosystem. Chromium, Kubernetes, TensorFlow all played this game — on the surface it's a community contribution, in reality it's using "free" to lock in the standard early.

What problem is design.md actually solving?

Basically just this:

Making sure the UI that AI generates uses the same colours, font sizes, and corner radii every time.

Everyone Vibe Coding has hit this. You ask the AI to make a Button, and then:

The first time it gives you #3B82F6
The second time #2563EB
The third time straight up bg-blue-500

The screen looks fine, all blue. But they're not actually the same blue, so your project ends up with a whole watercolour palette of "blues". Font sizes have the same issue — sometimes 1.25rem, sometimes 20px, sometimes text-xl.

This kind of inconsistency makes the UI feel disjointed, which makes the user experience inconsistent, which hurts the brand.

Why does this happen? The main reason is you haven't given the AI a "design system rules" document.

"Can't I just put it in CLAUDE.md?" — you can. But CLAUDE.md (and AGENTS.md) are "natural-language rules" for the AI. There's no structured palette or token system. After reading it, the AI still has to guess "which hex is Primary" or "which corner radius should this button use", and it ends up guessing anyway.

So design.md is basically turning that "design system rules" document into a file that machines can parse precisely + humans can understand the reasoning behind.

What does design.md actually look like?

Pretty simple — design.md is just a Markdown file. The structure splits into two parts:

YAML Front Matter: the precise design tokens (colours, typography, spacing, corner radius, components).
Markdown body: the design rationale — why this colour, when to use it, when not to use it.

By splitting the file this way, the AI doesn't just know "Primary is #1A1C1E" — it also knows "Primary is the deep ink used for headlines and core text". So when it produces UI, it's a lot less likely to use the wrong colour for a button or the wrong size for a piece of text.

YAML token block

The YAML block looks roughly like this:

---
name: Heritage
colors:
  primary: "#1A1C1E"
  secondary: "#6C7278"
  tertiary: "#B8422E"
  neutral: "#F7F5F2"
typography:
  h1:
    fontFamily: Public Sans
    fontSize: 3rem
  body-md:
    fontFamily: Public Sans
    fontSize: 1rem
rounded:
  sm: 4px
  md: 8px
spacing:
  sm: 8px
  md: 16px
components:
  button-primary:
    backgroundColor: "{colors.tertiary}"
    textColor: "{colors.on-tertiary}"
    rounded: "{rounded.sm}"
---

A few key points:

colors, typography, rounded, spacing, components are the five main token categories.
Tokens can reference each other — button-primary's backgroundColor is "{colors.tertiary}", so when you change tertiary later, everything using it updates automatically.
Colours support any valid CSS colour string: hex, rgb, hsl, oklch — all good.

If you've used Tailwind's tailwind.config.js or the W3C Design Tokens Format Module (DTCG), this'll feel familiar.

Markdown body block

After the closing --- of the YAML, you get the design reasoning written for humans (and AI) to read.

Officially recommended section order:

Overview
Colors
Typography
Layout
Elevation & Depth
Shapes
Components
Do's and Don'ts

Only Colors (and the primary colour inside it) is required. The rest is optional — write them only if your design system has that piece.

A concrete example — the Colors block might look like this:

## Colors

Primary (#1A1C1E): Deep ink for headlines and core text.
Use on light surfaces to maximize legibility.

Tertiary (#B8422E): Warm rust accent reserved for primary actions
and brand moments. Avoid using as body text colour due to contrast.

That's basically telling the AI "this colour is called Tertiary, use it for primary action buttons, don't use it for body text".

Wait — there's a CLI?

Yep. design.md ships with a CLI that does four things: lint, diff, export, spec.

Install and use:

npm install @google/design.md
npx @google/design.md lint DESIGN.md

On Windows the .md extension can get eaten by file association, so use the designmd alias:

npx -p @google/design.md designmd lint DESIGN.md

lint: validate the DESIGN.md structure

Same vibe as ESLint, Stylelint, Prettier. lint checks:

Whether the YAML schema is correct
Whether token references are broken (e.g. {colors.tertiery}, missing an a?!)
Whether there's a primary colour
Whether WCAG AA contrast passes (this one's the killer feature)
Whether there are orphan tokens defined but never used
Whether the section order matches the recommendation

For the WCAG AA rule, it tells you straight up "on-secondary against secondary only hits X.X:1 contrast, doesn't pass AA". Accessibility checks baked right into the design system file.

diff: compare two versions

Like git diff for design systems:

npx @google/design.md diff old/DESIGN.md new/DESIGN.md

Tells you which tokens changed, which were deleted, which were added. Use case: bump primary from #1A1C1E to #0A0A0A and want to know "how many components are about to change colour because of this".

export: convert to other formats

This one's genuinely impressive, especially if you want design.md as the "upstream" source.

Format	Description
`json-tailwind`	Converts to a Tailwind v3 config object
`css-tailwind`	Converts to a Tailwind v4 `@theme` CSS block
`dtcg`	Converts to the W3C Design Tokens Format Module

Usage:

npx @google/design.md export DESIGN.md --format json-tailwind > tailwind.config.js
npx @google/design.md export DESIGN.md --format css-tailwind > tailwind.css
npx @google/design.md export DESIGN.md --format dtcg > design-tokens.json

You only have to focus on design.md, then export to Tailwind / CSS / DTCG. Frontend, designers, and AI Coding tools all use the same source of truth.

spec: hand the spec to the AI

npx @google/design.md spec

Outputs design.md's own format spec. Hand that to Claude Code / Cursor / Copilot and tell it "please follow this spec when reading my DESIGN.md". The AI then knows how to parse the YAML schema, that token references use {path.to.token} syntax, what's required, what's optional. Add --rules and it'll output the linting rules as well.

Using it with AI Coding tools

The whole design goal is "turn the UI design system into something that can be continuously understood by AI", so the usage is simple:

Put DESIGN.md in your project root.
In CLAUDE.md / AGENTS.md / .cursor/rules/*.mdc, add a line: "For UI-related tasks, please read DESIGN.md first and strictly follow the tokens and design rationale inside."

That's it.

One reminder at the end, kind of an old chestnut:

Don't stuff design.md with irrelevant junk.

The whole design philosophy is to give the AI a distilled version of your design knowledge — same direction as Agent Skills. Give the AI the key points, not everything.

Right now, design.md doesn't ship with official integration examples for Claude Code / Cursor / Copilot — only the CLI and TypeScript API. So you need to guide the AI to read design.md via your CLAUDE.md / AGENTS.md / rules file.

Wrap-up

So what is design.md, really?

A design system document for AI Coding tools, in a "YAML token + Markdown design rationale" format.
The pain point it solves is AI-generated UI having inconsistent colours, font sizes, and spacing.
The built-in CLI can lint (with WCAG contrast checks), diff, export to Tailwind / DTCG, and output a spec for the AI.

Hope this helps you fix the "AI generates UI with inconsistent colours, font sizes, and spacing" pain so your AI Coding workflow goes a lot more smoothly.

Which AI Coding Tool Should You Pick in 2026? Copilot vs Cursor vs Claude Code vs Kiro vs Antigravity

Ray — Tue, 23 Jun 2026 01:43:30 +0000

If you've been using AI to write code, you've probably noticed that in 2026 there are more tools out there than ever — GitHub Copilot, Cursor that came later, Anthropic's Claude Code, AWS's Kiro, and Google's Antigravity. Each one is positioned a little differently, but a lot of people get tripped up trying to figure out which one is which.

Personally I mostly use VSCode with GitHub Copilot and Claude Code, and I've poked around with Cursor, Kiro, and Antigravity. Friends keep asking me which one they should pick, so this is my attempt at walking through the differences across all five so you can make the call based on your own workflow.

A quick intro to all five tools

Before the comparison, a fast pass on what each one actually does.

GitHub Copilot

GitHub Copilot is the AI coding assistant from GitHub (Microsoft). It launched back in 2021, which makes it kind of the OG of AI coding tools. It's integrated into VSCode and most major IDEs, and the core feature is real-time code completion and suggestions.

After 2025 Copilot changed quite a bit. Inside VSCode it now has Agent Mode, which can search across your codebase, modify multiple files, and run terminal commands automatically.

On top of that, GitHub released Copilot Coding Agent, which is a separate thing from Agent Mode. Coding Agent runs in a cloud environment on GitHub Actions — assign an Issue to Copilot or click "Send to Copilot Coding Agent" in chat, and it goes off and finishes the task in the background, then opens a PR for you to review. It even runs its own code review and security scan first (including Secret Scanning and dependency vulnerability checks), and only hands the PR over once those pass.

Cursor

Cursor is a code editor built around AI from the ground up. It's a fork of VSCode under the hood, so if you're already a VSCode user the switch is basically zero learning curve.

Cursor understands your entire codebase, and its Agent Mode can search files, run terminal commands, and edit across multiple files automatically. In 2026 Cursor also rolled out Background Agents, which clone your repo to the cloud and work on a separate branch — and you can spin up multiple agents to work on different tasks in parallel.

Claude Code

Claude Code is Anthropic's CLI (command-line interface) AI dev tool. Unlike the previous two, Claude Code isn't an editor or an editor plugin — it's a standalone terminal tool.

Basically, you open a terminal, type claude, and it starts understanding your project structure and can read, modify, and create files for you, even running commands. What makes Claude Code special is its support for Hooks, Skills, MCP Server, Channels, and other advanced features that let you compose all kinds of automated workflows around your needs.

If you're new to Claude Code, you might want to start with my earlier post Want to develop with Claude Code? This guide takes you from beginner to advanced — it has a much more detailed walkthrough.

Kiro

Kiro is AWS's AI dev tool. It's also a VSCode fork and it runs on top of Amazon Bedrock. The thing that sets Kiro apart from the rest is that it's built around Spec-Driven Development.

Before Kiro writes a single line of code, it first generates a complete spec document — requirements, system design, data models, task breakdown — and only starts implementing once you've signed off on it. The upside is every piece of code can be traced back to a specific requirement, which is genuinely useful if your project needs proper documentation.

Kiro also has its own Agent Hooks that can trigger an AI Agent automatically on file save, create, or delete events — handy for things like auto-formatting, doc updates, or running git commits. And Kiro's default Auto Agent mixes different models with caching and other optimizations to balance performance against credit consumption.

Google Antigravity

Google Antigravity is the AI development platform Google launched in November 2025 alongside Gemini 3. The base is also a VSCode fork, but the changes are extensive enough that it feels pretty different from regular VSCode.

Antigravity goes hard on the Agent-First approach. It has two main surfaces: a traditional Editor View where you write code, and a Manager Surface, where you can dispatch multiple agents to work asynchronously across different workspaces, watching each agent's progress in real time.

Once an agent finishes a task, it produces an artifact report — task list, implementation plan, screenshots, even browser recordings — so you can see what it actually did. Model-wise, beyond Google's own Gemini 3.1 Pro and Gemini 3 Flash, it also supports Claude Sonnet 4.6, Claude Opus 4.6, and GPT-OSS-120B.

That said, Antigravity is still iterating fast right now. Rate limits and pricing are in flux, so check the official site for the latest before you commit.

Core feature comparison

Now that you know what each tool does at a high level, here's how they compare on core features.

Feature	GitHub Copilot	Cursor	Claude Code	Kiro	Antigravity
Code completion	Real-time inline	Real-time inline + Tab prediction	No real-time completion	Real-time inline	Real-time inline
Chat	Copilot Chat	Built-in chat	CLI chat	Built-in chat	Built-in chat
Multi-file editing	Yes (Agent Mode)	Yes (Composer)	Native	Yes (Agent Mode)	Yes (Multi-Agent)
Codebase understanding	Needs @workspace	Auto-indexed	Auto-scanned	Auto-indexed	Auto-indexed
Terminal integration	Inside VSCode only	Inside Cursor only	Native terminal	Inside Kiro only	Inside Antigravity only
MCP Server	Supported	Supported	Supported	Supported	Supported
Custom rules file	copilot-instructions.md	.cursor/rules/*.mdc	CLAUDE.md	Steering Files	Not specified
Background tasks	Yes (via GitHub Actions)	Yes (cloud async)	Yes (terminal background)	No	Yes (multi-agent dispatch)
Remote control	No	No	Yes (Channels / Remote Control)	No	No
Built-in spec generation	No	No	No	Yes (Spec-Driven)	No
Multi-model support	Yes (Claude, GPT, etc.)	Yes (Claude, GPT, Gemini, etc.)	Claude family only	Claude family only (via Bedrock)	Yes (Gemini, Claude, GPT-OSS)

What the table tells you is that the five tools are converging on features, but each has a clear strength.

In short — Copilot leans on inline completion and GitHub ecosystem integration, Cursor focuses on the editor-native AI experience, Claude Code goes the terminal automation route, Kiro centers on Spec-Driven Development, and Antigravity wraps everything around an Agent-First multi-model architecture. None of them is objectively "better" — it really depends on which one fits your workflow.

Pricing breakdown

Pricing matters to most people, so here's the snapshot for March 2026:

Plan	GitHub Copilot	Cursor	Claude Code	Kiro	Antigravity
Free	2,000 completions + 50 chat requests/month	2,000 completions + 50 slow requests/month	None	50 credits/month	Yes (stricter rate limit)
Personal entry	$10/mo (Pro, 300 premium requests)	$20/mo (Pro)	$20/mo (Claude Pro, limited usage)	$20/mo (Pro, 1,000 credits)	$20/mo (AI Pro)
Personal advanced	$39/mo (Pro+, 1,500 premium requests)	$60/mo (Pro+, 3x usage) / $200/mo (Ultra, 20x usage)	$100/mo (Max 5x) / $200/mo (Max 20x)	$40/mo (Pro+, 2,000 credits) / $200/mo (Power, 10,000 credits)	$249.99/mo (AI Ultra)
Team	$19/mo/user (Business)	$40/mo/user (Teams)	Per Team / Enterprise plan	Per enterprise plan	Per enterprise plan

Pricing follows each vendor's official site and can change at any time. GitHub Copilot also offers free Pro access for verified students and open source contributors.

And a quick word on each tool's billing model, because they all count things differently.

GitHub Copilot bills in Premium Requests — the number of times you hit the high-end models (Claude Opus 4.6, o3, etc.). Pro gets 300/month, Pro+ gets 1,500. Once you blow through the quota, Copilot doesn't cut you off — it falls back to the included base model (like GPT-4.1), but response quality drops noticeably. If you don't want the downgrade you can set an overage budget at $0.04 per request.

Cursor gives every paid plan a credit pool. Pro+ has 3x the usage of Pro, Ultra has 20x. If you let Cursor pick the model automatically (Auto mode), it doesn't burn credits — effectively unlimited. If you manually pick a specific model (say Claude Sonnet 4.6), that comes out of the credit pool. Once you're out of credits, you can still use Auto mode, you just lose manual model selection.

Claude Code is bundled into Claude's subscription. Max comes in 5x ($100/mo) and 20x ($200/mo) — the multiplier is relative to Pro, meaning how much more you can do before hitting limits. Max also covers the Claude desktop app, mobile app, and Claude Code together, which makes sense if you also use Claude for non-coding things (writing docs, research, etc.).

Kiro uses a unified credit system — every operation pulls from the same pool, billed by prompt complexity. A simple edit might be under 1 credit, while complex tasks cost more. Kiro's Auto Agent mixes models automatically and runs about 23% cheaper than manually picking Sonnet 4 for the same workload — that's why it's the default. Overage is $0.04 per credit, but it's off by default; you have to opt in via settings.

Antigravity is bundled into Google's AI subscription tiers (AI Pro and AI Ultra). You can also buy AI Credits separately at $25 per 2,500 credits. The pricing and quota strategy is still being adjusted, so double-check the official site before signing up.

So which one fits which scenario?

If you're still not sure which to pick after all that, let me sort them by use case.

You mainly care about the feel of writing code

If your day is mostly writing code and you want a fast, always-on assistant by your side, GitHub Copilot is probably the best fit. Its inline completion is still the smoothest of the bunch, the VSCode integration is the deepest, and it has matching AI features for PR review and Issue triage too — especially handy if your team is already deep into the GitHub ecosystem.

Plus, Coding Agent can handle background tasks and open PRs for you automatically, which is fantastic for team collaboration.

You want an AI experience inside the editor

If you do a lot of cross-file refactoring or you want the entire editor experience designed around AI, Cursor is the way. Composer handles multi-file edits really well, and it supports multiple AI models (Claude, GPT, Gemini, etc.) so you can pick the model that fits the task.

Frontend work is a great example — one feature often touches a bunch of files (component, styles, routes, tests), and Composer chews through all of it in one go. Background Agents also let you spin up multiple agents in parallel in the cloud.

You want deep automation and customization

If you live in the terminal or you need next-level automation and customization, Claude Code is the pick. The combination of Hooks, Skills, and Channels lets you compose all sorts of automated workflows.

For example, you can wire up Hooks to run security checks before every command Claude Code executes, or use Skills to package your team's code review conventions as a reusable skill set. Pair it with Git Worktree and you can even run multiple Claude Code agents on the same project, each handling a different task.

Channels in particular is something the other tools can't really match — you can send messages directly to a running Claude Code session over Telegram or Discord, let it work on the task in the background, and have the result delivered back to your phone.

If Channels sounds interesting, I wrote a full setup walkthrough here: Don't want to use the lobster (OpenClaw)? Try Claude Code Channels.

Your project needs strict documentation and spec management

If you're working on a project that requires complete documentation — client work, internal enterprise systems, anything that needs to pass an audit — Kiro's Spec-Driven Development fits perfectly.

Before any code gets written, Kiro produces requirements, system design, and a task breakdown, and every chunk of code traces back to a specific requirement. The upfront cost is higher, but in exchange your project has solid documentation from day one — maintenance and handoffs get a lot easier.

And since Kiro runs on AWS, if your team is already on Lambda, CDK, CloudFormation, or other AWS services, the integration is much smoother than the alternatives.

You want an Agent-First development experience

If you want to play with multiple AI agents working for you in parallel, Google Antigravity is worth a try. Manager Surface lets you dispatch multiple agents to different workspaces, and once they finish a task they produce an artifact report you can review.

The model lineup is also varied — beyond Google's own Gemini family, you can also use the Claude lineup and GPT-OSS series.

That said, like I mentioned earlier, even though Antigravity is a VSCode fork, the UI changes are extensive enough that it feels pretty different from regular VSCode, and it's still iterating fast — so factor that in.

The 2026 main event: AI Agent capability

In 2026 the competition has shifted from code completion to AI Agent capability. The basic idea is: AI shouldn't just answer your questions, it should be able to plan, execute, and even self-correct to finish an entire task on its own.

Here's how the five tools stack up on Agent capability:

Agent capability	GitHub Copilot	Cursor	Claude Code	Kiro	Antigravity
In-editor agent	Yes (Agent Mode)	Yes (Agent Mode)	Yes (VSCode plugin)	Yes (Agent Mode)	Yes (Editor View)
Background tasks	Yes (via GitHub Actions)	Yes (cloud async)	Yes (terminal background)	No	Yes (multi-agent dispatch)
Multi-agent parallel	Yes	Yes	Yes (with Git Worktree)	No	Yes (multi-workspace)
Event-driven automation	No	No	Yes (Hooks)	Yes (Agent Hooks)	No
Auto code review	Yes (built into Coding Agent)	No	No	No	Yes (auto-validation)
Auto security scan	Yes (Secret / Dependency)	No	No	No	No
Spec generation	No	No	No	Yes	No
Artifact report output	No	No	No	No	Yes (plan, screenshots, recording)
External chat platform integration	No	No	Yes (Channels)	No	No

The basic Agent features are now table stakes for everyone — the real differences are in the advanced features, and which direction each tool has taken. Pick whichever one matches your needs.

One thing worth calling out: Claude Code and Kiro both have the concept of event-driven automation, but they implement it differently. Claude Code Hooks fire before and after tool execution — useful for safety checks, formatting, that kind of thing. Kiro's Agent Hooks trigger on file events (save, create, delete) and you describe them in natural language, which is a lower bar to entry.

My own workflow

After all that, you might be wondering what I actually use day-to-day.

I mostly use VSCode with GitHub Copilot and Claude Code, and the workflow looks roughly like this:

Day-to-day coding — GitHub Copilot's inline completion handles the bulk of it
Feature work or refactoring — I open Claude Code and let it plan and handle the cross-file edits
Anything that needs automation — also Claude Code, paired with Hooks and Skills to keep output quality consistent

I've touched all five — GitHub Copilot, Claude Code, Cursor, Kiro, Antigravity. Cursor's editor experience is genuinely solid, but I'm too deep into the VSCode ecosystem at this point, and the Copilot + Claude Code combo covers most of what I need, so VSCode stays as my main driver.

If your budget only allows for one, here's my take:

New to AI coding → GitHub Copilot (the free plan is plenty, students get free Pro)
Want the best editor-native AI experience → Cursor (Composer is great for cross-file edits, Background Agents are powerful too)
Prefer terminal or need deep automation → Claude Code (also has a VSCode extension; Hooks + Skills + Channels is a flexible combo)
Project needs full spec docs → Kiro (Spec-Driven Development takes you from spec to implementation in one flow)
Curious about Agent-First development → Google Antigravity (free plan is enough to play with)

Wrapping up

Whichever one you end up picking, one thing stays true: A better tool is still just a tool. Your domain knowledge and judgment are what really matter — even with all the driver assists in the world, you still need to understand the car, the road, and the rules.

AI coding tools can help you write code faster, but they won't make decisions for you, and they won't take responsibility for the outcome.

So when you're using these tools, keep your critical thinking on. Review what they suggest, test it, and make sure the final code is safe and reliable.

Hopefully this helps you find the AI coding tool that fits your workflow best — see you in the next post.

How Much of Your Blog Does AI Search Actually Grab? Breaking Down Claude's WebSearch and WebFetch

Ray — Fri, 19 Jun 2026 14:23:46 +0000

A while back I wrote Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search, and right after finishing it a question hit me: when AI does a web search, how much of my blog does it actually grab? The whole article verbatim? The first 500 characters? Or does it bail after seeing just the title? So I dug into it, and this post walks through Anthropic's official web_search and web_fetch tool specs, runs a quick test against my own blog, and ends with what all this concretely means for how you should write posts and copy.

"Search" and "fetch" are two different things

Before going further, the one thing worth being crystal clear on: when AI runs a query, "search" and "fetching the page body" are not the same operation. They're two separate stages.

Stage 1 (WebSearch): the AI takes your question and calls the WebSearch tool (which hits a search engine). What comes back is a list of search results — each entry has things like URL and title, but no page body.
Stage 2 (WebFetch): after looking at the search result list, the AI decides which entries are worth digging into, then fires a WebFetch request at each of those URLs, basically saying "give me the body of this page." That's when actual content gets pulled in.

Why doesn't it just grab the body during the search stage? Context window limits. If every search shoved 10 results' worth of full bodies in, your usable context would blow up fast — and then you'd start complaining the AI is dumb and forgets what you just asked it (because the context did overflow). So it's split into two stages: search first for a list, then decide which entries from that list to actually fetch.

Once that two-stage split makes sense, the rest of this post is about what each stage actually pulls in.

What does WebSearch pull in?

Going straight to Anthropic's official web_search tool docs — every search result entry has only four fields:

url: the page URL
title: the page title
page_age: when the page was last updated
encrypted_content: encrypted content, not for the AI to read the article — it's for multi-turn conversation citations

That's it. Four fields.

What the AI sees during the search stage is "URL, title, last updated" — three pieces of human-readable info. No body content at all.

What if the AI cites your content? There's a cap on that too:

Each web_search_result_location's cited_text is up to 150 characters of the cited content

In short: at most 150 characters of quoted text. And that's just the API-level spec.

Claude Code's built-in WebSearch shaves it down further. According to Mikhail Shilkov's breakdown of Claude Code's internal behavior, Claude Code even drops page_age and encrypted_content, keeping only title and url.

So basically — at the search stage, the AI sees nothing more than one title and one URL from your site. That's it.

What does WebFetch pull in?

Now for when the body actually gets pulled in — Stage 2, WebFetch.

Once the AI has the search results, if it decides to open up a few entries, it fires one WebFetch request per URL, and that's when the full body comes back. How much of it?

This needs to be split into two layers, because the API and Claude Code work differently.

Note
When I say "API" here, I mean the Anthropic API's web_fetch tool. "Claude Code" means the WebFetch feature built into Anthropic's own product. The two have different specs and flows.

API-level web_fetch

The Anthropic API's web_fetch tool has a parameter called max_content_tokens that developers can set themselves — though the official docs use 100,000 tokens in their examples.

The docs also give a reference conversion:

Content size	Estimated tokens
Average web page 10 KB	~2,500 tokens
Large doc page 100 KB	~25,000 tokens
Research paper PDF 500 KB	~125,000 tokens

So a medium-length blog post in plain text is usually 1–2,000 tokens, way below the 100K ceiling. Truncation basically isn't a concern unless you wrote a 50,000-character monster.

One thing to note: web_fetch's citation works differently from web_search. It uses start_char_index / end_char_index to pick out a specific position in the article (although the docs don't pin down a hard character limit).

Claude Code's built-in WebFetch

Claude Code's built-in WebFetch goes a different route.

Per Mikhail Shilkov's breakdown, the WebFetch flow is:

Convert HTML to Markdown using the Turndown package
Extract the first 100 KB of plain text
Pass that 100 KB to the Haiku 3.5 model
Haiku summarizes the answer based on your prompt and returns the summary to the main model

The real kicker is step 3. The main model — the Claude model you're actually using — never sees the page's original text. It only sees the version Haiku summarized. Which means what your writing turns into by the time it reaches the main model is decided by how Haiku reads it, not by how much you wrote.

The citation has a limit too. The rule Mikhail extracted from Claude Code's internal prompt is:

Enforce a strict 125-character maximum for quotes from any source document.

So quotes max out at 125 characters.

You're probably wondering — so how much is 100 KB of plain text? For Chinese, where each character is roughly 3 bytes, 100 KB fits 30,000+ characters; for English at 1 byte per character, that's well over 100,000 characters. No regular blog post is going to hit that ceiling. So the real concern should be "how do I get Haiku to extract the parts I want to be quoted," not "how much content am I feeding Haiku."

Running a real test against my own blog

Enough theory — let's actually test it.

I had Claude call the WebFetch tool against my blog's homepage (israynotarray.com) and asked it to grab the body.

To make this measurable, I leaned on Claude Code's Hook feature to log every stage's input and output. Here's the hook config — every WebSearch and WebFetch I/O gets logged:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "WebSearch|WebFetch",
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/log-web-tool.sh"
          }
        ]
      }
    ],
    "SessionEnd": [
      {
        "hooks": [
          {
            "type": "command",
            "command": "/claude-tools/hooks/dump-transcript.sh"
          }
        ]
      }
    ]
  }
}

The prompt I used:

Please use the WebFetch tool to grab the content of israynotarray.com's homepage, then tell me what you extracted from that page.

The actual input passed to WebFetch:

{
  "url": "https://israynotarray.com",
  "prompt": "Tell me the main content of this homepage: site name, author bio, latest article list (titles and summaries), categories, plus any navbar or sidebar info."
}

After it ran, the WebFetch result — the summary that came back — looked roughly like this:

{
  "bytes": 1068,
  "code": 200,
  "codeText": "OK",
  "result": "# Page Content Analysis\n\nBased on the content snippet provided, I can share the following:\n\n## Site Structure\n- **Main category**: AI-related content, including Gemini and Gemini Omni products\n\n## Latest Article\n**Title**: \"Gemini Omni Video Generation: 30 Prompt Templates for Reels, Product Demos, and Logo Reveals\"\n\n**Summary**:\nThis post introduces Gemini Omni video generation, released by Google at I/O in May 2026, and explains how it replaces the earlier Veo 3.1. The article covers new features like '10-second clips, native synced audio, and Chinese text rendering,' and provides 30 practical prompt templates for use cases including Instagram Reels, product demos, logo animations, B-roll, transitions, quote cards, lifestyle, and food and travel.\n\n**Publish date**: 2026-06-06  \n**Reading time**: ~23 min\n\n## Limitations\nThe content snippet provided does not include author bio or full navbar info.",
  "durationMs": 4603,
  "url": "https://israynotarray.com"
}

See it? What the AI actually pulled from my blog is just this tiny under-1,000-character summary. A whole website obviously contains way more than this — the rest was never seen by the main model. The Haiku middle layer reads the page and only extracts what it judges relevant to my prompt. If I opened the same page in a browser, I'd see a full grid of article cards plus a sidebar — but Haiku doesn't ship the full grid back.

I also tried an older post with a deliberately broken URL path, and got this:

{
  "bytes": 0,
  "code": 404,
  "codeText": "Not Found",
  "result": "The server returned HTTP 404 Not Found.\n\nThe response body was not retrieved. If this URL requires authentication, use an authenticated tool (e.g. `gh` for GitHub, or an MCP-provided fetch tool) instead of WebFetch.",
  "durationMs": 588,
  "url": "https://israynotarray.com/dqwdqwdqwd"
}

Even the content of your 404 page is invisible to the AI — WebFetch just reports the 404 and the AI has no way to see what your 404 page says. Which means if your site has path issues, you've refactored URLs, or you only have frontend routing without real pages, the AI can't pull anything.

Side note — this lines up with a caveat in Claude's official docs:

The web fetch tool currently does not support websites dynamically rendered with JavaScript.

If your blog is a frontend SPA where content is entirely rendered by JavaScript at runtime, what the AI grabs might just be empty-shell HTML with no articles visible. Static generators (Hexo, Astro, Next.js in SSG mode) are relatively safe, since the build output is fully rendered HTML — the AI grabs and immediately sees content.

Don't forget the robots.txt layer

There's one more important piece — whether the AI can pull your site has a major prerequisite: robots.txt.

AI crawlers basically split into two types: search-style (cite and link back to your site) and training-style (eat content to feed the model, not necessarily linking back). The common mapping:

Crawler	Type	Behavior
Claude-SearchBot / Claude-User	Search	Real-time fetch when Claude answers, cites back
ClaudeBot	Training	Fetches content to feed Claude training
OAI-SearchBot / ChatGPT-User	Search	Real-time fetch when ChatGPT answers, cites back
GPTBot	Training	Fetches content to feed GPT training
PerplexityBot	Search	Used by Perplexity engine, cites back
Google-Extended	Training	For Gemini training
CCBot	Training	Common Crawl public dataset

If you want to be cited by AI but don't want your content used for training, the most common strategy is "allow search-style, block training-style."

Here's a robots.txt template you can copy-paste:

# Search-style AI crawlers: allow (they cite back)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

# Training-style AI crawlers: block (consume data without citing)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Bytespider
Disallow: /

# Content signal: searchable but not for training, not for direct AI input
Content-Signal: ai-train=no, search=yes, ai-input=no

For a full Agent Readiness setup to score 100, see From 3 to 100! How to Get Your Site to Pass isitagentready's AI Agent Readiness Check.

So what does this concretely mean for writing?

Once you understand all the constraints above, there are four things worth specifically working on.

Titles need to stand on their own

At the search stage, all the AI sees about your article is two fields — title and URL.

If your title needs a subtitle or context to make sense, when the AI lines it up against ten other results it'll get skipped.

A quick comparison:

Weaker: "Implementation Notes"
Stronger: "Complete Implementation Notes: Content Negotiation for HTML-to-Markdown in a Cloudflare Worker"

The stronger version packs in "topic, tool, what it does, and article type" — the AI doesn't even need to open the page to know whether it's worth fetching.

Lead with the answer

At the WebFetch stage, the Haiku middle layer reads top-down. The first 300–500 characters decide what it summarizes back. If your opening is "Before we get into X, let's recap a bit of history…", Haiku reads halfway through and discovers the intro is all background and no answer — so it just summarizes the background.

The right move is to make the first sentence of every H2 a direct conclusion, then add the context after. I covered this principle in Is SEO Not Enough? Meet AEO — Getting Your Site Found by AI Search too — worth reading alongside this.

Design single sentences that can be quoted standalone

cited_text is 150 characters on web_search and 125 characters on Claude Code's built-in WebFetch. That means when the AI quotes you, the slot it has is one short sentence that "makes sense without context."

Consciously design sentences like that. For example:

Weaker: "This is a bit different from the method mentioned earlier — the main difference is…" (makes no sense without context)
Stronger: "llms.txt was proposed by Answer.AI co-founder Jeremy Howard in 2024, with the goal of proactively telling AI what important content a site has." (stands on its own without surrounding text)

After writing a paragraph, pick one sentence and ask yourself: if someone who hadn't read the rest of the post saw just this sentence, would they get it? If yes, it has a shot at being quoted.

Use structure as Haiku's navigation markers

H2, H3, - lists, tables, `js code blocks — these Markdown structures are especially useful for the middleware summary layer. When Haiku reads the Markdown converted from your HTML, it treats headings as "what this section is about" indexes, lists as "main points" signals, and tables as "supporting data" units.

If your whole article is pure prose paragraphs, Haiku has no markers and has to grind through it semantically — what comes out is scattered. If you have clear structure, Haiku can summarize along the markers, and the result lines up with the points you actually want quoted.

Wrap-up

So how much of your blog does AI search actually pull?

The answer breaks down into three layers:

Search stage: title + URL only
Body fetch stage: the API default can fit your whole article, but Claude Code goes through Haiku summarization with a 100 KB cut-off
Citation stage: web_search is 150 characters, Claude Code WebFetch is 125 characters

Writing for AI search means targeting those three gates — it's not about getting the AI to memorize your entire post.

If your blog hasn't set up AI bot routing yet, copy the robots.txt template above to get the basics in place — the rest is just content over time.

Is AI Getting Quietly Dumber? A 24/7 Benchmark That Catches LLM Degradation

Ray — Thu, 18 Jun 2026 03:54:54 +0000

You've probably hit this before — yesterday the AI felt sharp, fixed your bug without you even asking, and threw in a few extra cleanups along the way. Then today, same kind of problem, and suddenly it refuses to touch anything you didn't explicitly point at, or starts going in circles. And you start wondering "wait, is AI getting quietly dumber?"

A lot of people have this feeling but it's hard to pin down. Is the provider messing with the model? Is today's problem genuinely harder? Or is it all in your head? Without a number to point at, it just stays an urban legend.

So this post is about a site called AIStupidLevel — the name pretty much spells it out. It runs round-the-clock tests against the major AI models and turns "is AI getting dumber" into an actual curve you can look at.

Why would an AI "get dumber" in the first place?

A reasonable pushback: doesn't the model get trained once and then frozen? A few things are actually going on behind the scenes:

Silent updates. If you've worked with the API, you've probably seen version strings like gpt-4.0-20240924 — that trailing date hints at a specific build. But when you pick a model inside Codex or similar tools, you don't see that level of detail. Some models don't even have versioned IDs, just a generic gpt-4. So you can't actually tell which build you're hitting, and a silent swap is really hard to notice.
Quantization. To handle global peak traffic and save compute, providers sometimes compress the model.
Throttled compute. Once usage crosses a threshold, providers may cap how much compute each user can pull, which makes outputs feel worse.
Compute migration. When a new model is about to launch, providers sometimes shift compute away from the older model. From my own time hitting the API, this is the one I see most — the same prompt suddenly degrades overnight, then a few days later a stronger version drops, and shortly after the old version goes back to normal.

I covered quantization more thoroughly in Want to Run an LLM Yourself? Understanding Model Parameters and Quantization So You Stop Picking the Wrong Model — short version, it's a compression technique that saves resources but loses some detail.

So an AI model isn't actually a frozen thing. Providers can silently update it, quantize it, throttle it, or migrate compute away from it — and any of those will change what you experience. The hard part is, you usually can't tell.

So what is AIStupidLevel exactly?

AIStupidLevel is a third-party benchmark platform (open source, MIT) that continuously monitors whether AI models are regressing. Think of it like a system health check, except it's checking the "health" of AI models. It runs 24/7 against 21 production models from 7 providers — OpenAI, Anthropic, Google, xAI, DeepSeek, Kimi, GLM — and turns each model's current performance into a score on a dashboard.

It's not run by any AI company, which matters here. You don't want the people grading the models to also be selling them.

How does it actually test?

The core idea is simple:

Fixed question bank, run repeatedly. It maintains a fixed set of tasks, throws them at each model on a schedule, and logs the scores.
Same task N times. Because model outputs are stochastic, it runs each task 5 times and takes the median, plus a 95% confidence interval.

Four suites on a rotation

It doesn't just run one kind of test — it has four suites taking turns, each watching for different things:

Test Suite	Frequency	What it tests
Speed (coding)	Every 4 hours	147 coding problems, overall coding ability
Deep reasoning	Daily	5–7 turn multi-turn dialogues, checks long-conversation logic
Tool calling	Daily	Spins up a real Docker sandbox so the AI can actually run multi-step `execute-command` / `read-file` / `write-file` flows
Drift detection (canary)	Hourly	12 lightweight quick checks, first line of defense

The hourly canary plays sentinel — if something starts looking off, it sounds the alarm. The daily deep reasoning and tool-calling runs are the heavier full-body checkup.

The tool-calling suite is the one I find especially interesting. It actually spins up a Docker sandbox (think of it as an isolated mini-computer) and has the AI run real commands inside it, instead of just "verbally" claiming it can use tools. The results end up much closer to what you actually feel when using AI to write code.

Scoring isn't just right vs wrong

A single coding task gets scored across 9 weighted dimensions:

Dimension	Weight
Correctness	40%
Complexity	20%
Code Quality	15%
Stability	10%
Efficiency	5%
Edge Cases	3%
Debugging	3%
Format	2%
Safety	2%

Correctness clearly dominates, but even if the answer runs, you'll still lose points if the code is a mess, misses edge cases, or spews garbage formatting.

How does it catch the moment a model starts getting dumber?

Just having a score isn't enough — the score naturally bounces around because AI is non-deterministic. The real question is: is this drop a real regression, or just noise?

This is where the project gets technically interesting. It uses an algorithm called CUSUM — short for Cumulative Sum Control Chart. CUSUM didn't come from AI research; it's an old quality-control method from manufacturing. The idea is to keep accumulating the gap between observed performance and the baseline. Once the accumulated gap crosses a threshold, you call it: this isn't noise, something actually changed.

On top of CUSUM, it also runs statistical significance testing (checking whether the difference is statistically meaningful, p-value below 0.05) as a second pass, to keep false alarms down.

The real win: with this statistical machinery, a degradation can be detected within hours of starting, rather than waiting until people are venting on social media that "AI got dumber." The system has 29 warning categories built in for different anomaly patterns.

Reading the Stupid Meter

Each model has a live 0–100 score next to it — higher is better. Next to the score is a status tag telling you which of four states the model is in:

STABLE — performing normally
VOLATILE — jittering
DEGRADED — already worse
RECOVERING — climbing back up

Beyond the current score, it lays out each model's historical curve, so you can compare scenarios like "this model was rock-solid last week, why is it jumping around this week."

In practice it feels more like a stock-trading dashboard — you're not looking at a static ranking, you're checking "right at this moment, which model is worth using." If you're still wrestling with which AI tool to pick in the first place, my earlier post Which AI Coding Tool Should You Pick in 2026? pairs well with this one.

Smart Router: route around degraded models automatically

Beyond monitoring, AIStupidLevel also ships a feature called Smart Router, which is a pretty interesting extension of the project.

It's an OpenAI-compatible API endpoint, meaning code you wrote against OpenAI barely needs to change. You drop each provider's API key into it (stored with AES-256 encryption), point your base URL at it, and it routes each request to whichever model is currently in the best shape based on the live monitoring.

It offers six routing strategies — just set the model field in your API call to one of these and it picks dynamically:

auto-best — overall pick, whichever model has the best combined score right now
auto-coding — best at coding right now
auto-reasoning — strongest at reasoning
auto-creative — leans toward creative output
auto-cheapest — cheapest model above the quality bar
auto-fastest — fastest model above the quality bar

If you just pass auto, it uses whichever strategy you've set as your default. And if you pass a specific model name (like claude-opus-4-7), it pins that model directly and skips the router. So when a model quietly starts slipping and Smart Router catches it, traffic auto-routes to a sibling that's still healthy. Pretty useful if you actually want to wire AI into a product.

Wrap-up

So what is AIStupidLevel?

An independent, open source (MIT) third-party benchmark site monitoring 21 production AI models across 7 providers (OpenAI, Anthropic, Google, xAI, DeepSeek, Kimi, GLM), 24 hours a day
Method: fixed question bank, run repeatedly — each task runs 5 times, median plus 95% confidence interval, across four test suites on a rotation
Uses CUSUM change-point detection plus statistical testing to catch a model quietly degrading within hours
Ships Smart Router that auto-routes API traffic to whichever model is in the best shape based on live monitoring

Next time you feel like the AI suddenly got dumber, don't jump straight to blaming yourself — pop the dashboard open and you might actually find the evidence.

Claude Code Auto Memory: Stop Re-Explaining Your Preferences Every Session

Ray — Wed, 17 Jun 2026 02:50:33 +0000

If you've spent any real time with Claude Code, you know this pain: you open a fresh session and it's forgotten everything you taught it yesterday. I'd tell it over and over not to dump a pile of emojis into my writing because it looks too AI — and the next morning, fresh session, emojis everywhere again.

Turns out there's a fix built for exactly this, called Auto Memory. I'd already written about CLAUDE.md for setting project rules by hand, but Claude Code's memory system is bigger than just CLAUDE.md. Auto Memory quietly records your corrections and preferences as you work, so next session you don't have to say them again.

Auto Memory needs Claude Code v2.1.59 or newer. Check with claude --version.

Auto Memory vs CLAUDE.md

A lot of people mix these two up, so let's get it straight first.

	CLAUDE.md	Auto Memory
Who writes it	You	Claude, automatically
What's in it	Instructions and rules	Learned lessons and patterns
Scope	Project / user / org	Per working tree
Loading	Fully loaded every session	First 200 lines / 25KB only
Best for	Code conventions, workflow, architecture	Build commands, debugging notes, preferences

In short: CLAUDE.md is the set of rules you write for Claude on purpose — "use 2-space indentation," "write commit messages in English." Auto Memory is what Claude picks up on its own while working with you. You don't write it; it does.

What Auto Memory actually does

Auto Memory is basically Claude taking its own notes. When you correct it, or tell it whether an approach worked, Claude decides whether that's worth remembering for later. If it is, it saves it.

So when I told it:

Don't use so many emojis, it looks too AI

it saved that as a memory file, and every new session loads it back. Now it remembers I don't like emoji soup.

It doesn't save everything, though — one-off discussions or things irrelevant to future work don't get written down.

Where it's stored (this is the confusing part)

Here's the thing that trips everyone up. Your project has a .claude/ folder, but Auto Memory files are not in there. The project .claude/ only holds settings (settings.json, settings.local.json) and rules.

The actual memory lives under your home directory:

/Users/ray/my-project/.claude/                              <- project settings, inside your repo
/Users/ray/.claude/projects/-Users-ray-my-project/memory/   <- Auto Memory, under your home dir

Both paths contain .claude, but one is in your project and one is in your home directory — don't mix them up. If you peek into your project's .claude/ and only find settings.local.json, that's normal. The memory was never there.

The memory path is derived from your git repo, so subdirectories and worktrees of the same repo share one memory. And it's machine-local — it does not sync to other machines or cloud environments.

If you want it somewhere else, set autoMemoryDirectory (Policy / Local / User level only — not project level, on purpose, so a shared repo can't redirect your memory somewhere shady):

{
  "autoMemoryDirectory": "~/my-custom-memory-dir"
}

The directory structure

~/.claude/projects/<project>/memory/
├── MEMORY.md          # index, loaded every session
├── debugging.md       # debugging notes
├── api-conventions.md # API design decisions
└── ...                # other topic files Claude creates

MEMORY.md is the index — Claude uses it to track which memories exist and which file each one lives in. Only the first 200 lines / 25KB of MEMORY.md loads at session start, so Claude keeps it lean and pushes detail into separate topic files. Those topic files aren't loaded on startup; Claude reads them on demand. So your memory can grow without bloating context every single time.

(That 200-line cap is only for MEMORY.md. CLAUDE.md always loads in full — though shorter still means better adherence.)

How do you know it's writing memory?

When you see "Writing memory" or "Recalled memory" in the Claude Code UI, that's it updating or reading memory files. You don't have to do anything; it handles it.

Managing it

/memory — the main entry point. See what's loaded, toggle Auto Memory on/off, open the memory folder, or open any file in your editor.
Just ask — "remember: always use pnpm, never npm" saves a memory. "forget the rule about pnpm" updates or deletes it.
Edit by hand — they're plain Markdown files under ~/.claude/projects/. Add, edit, or delete whatever you want.
Turn it off — toggle it in /memory, set "autoMemoryEnabled": false, or set CLAUDE_CODE_DISABLE_AUTO_MEMORY=1.

And in case you were wondering: /compact does not wipe it. The files are on disk, not in the conversation. Compacting only squeezes the chat history; MEMORY.md and CLAUDE.md get re-read from disk and injected back afterward.

That's the whole thing. It's a small feature, but once you stop repeating yourself every session it's hard to go back.