DEV Community: Ken Imoto

TRM Grew ChatGPT Referrals 8,337% in 90 Days. I Copied Their 4 LLMO Pillars Onto 3 Indie Sites. Only 1 Moved the Needle.

Ken Imoto — Thu, 28 May 2026 22:00:01 +0000

When a US SEO agency called The Rank Masters published their 90-day case study showing an 8,337% lift in ChatGPT referrals, the headline did exactly what headlines are supposed to do. I clicked. Then I noticed the baseline was 8 visits and the post-period was 675. So yes, the percentage is technically true. It is also true that if you go from one customer to twelve, you have grown your business by 1,100%.
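Both percentages above come out of the same relative-lift formula. A minimal sketch of that arithmetic (the helper name `pct_lift` is illustrative, not something from TRM's case study):

```python
def pct_lift(before: float, after: float) -> float:
    """Relative lift from `before` to `after`, expressed in percent."""
    if before <= 0:
        # A lift from zero has no finite percentage, so refuse to compute one.
        raise ValueError("percentage lift is undefined for a zero or negative baseline")
    return (after - before) / before * 100


print(pct_lift(8, 675))  # TRM's ChatGPT referrals: 8 visits -> 675 visits
print(pct_lift(1, 12))   # one customer -> twelve customers
```

The smaller the baseline, the more dramatic the percentage, which is why the absolute counts belong next to any headline number.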

What I actually cared about was the rest of the table. Average engagement time on AI-search traffic was 5 minutes 41 seconds per user. Page views per user climbed to 48. Those numbers are not a percentage trick. Those are people who showed up already interested and stayed. That is the part worth copying.

TRM described their playbook as four pillars. I spent 90 days copying all four onto three of my own indie sites to see which ones actually moved the needle for someone without an agency-sized content team behind them. Three of the four were noise. The one that worked was the one I almost skipped.

The setup

The three indie sites:

kenimoto.dev — my engineering blog, around 50 articles at the start of the test, four-language stack (EN/JA/PT/ES), already had a llms.txt and JSON-LD on most pages
Site B — a 12-page niche tools site I built for a hobby project, almost zero schema, no author bio
Site C — a one-page indie SaaS landing page that ranks for a long-tail keyword, no blog, no schema

I picked sites at three very different stages on purpose. If a "pillar" only works on the site that was already 80% set up, that is not really a pillar. That is a finishing touch.

The baseline period was the 30 days before the test. The treatment period was the 90 days that followed. I used GA4 with the chatgpt.com and perplexity.ai referrer regex from llmoframework.com's pillars guide, plus the four AI crawler user-agent filters in my server logs to confirm cross-channel pickup. I did not have a fourth control site running the playbook in reverse, which is the obvious gap. I am calling it out before someone else does.
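For readers who want to reproduce the measurement, here is a rough sketch of both filters. It is not the regex from the llmoframework.com guide, which is not reproduced in this post. The post also does not name its four crawlers, so the user-agent tokens below are an assumption: four publicly documented crawler names used as stand-ins.

```python
import re
from collections import Counter
from typing import List, Optional
from urllib.parse import urlparse

# Referrer hosts named in the post; subdomains such as www. are allowed.
AI_REFERRER = re.compile(r"(^|\.)(chatgpt\.com|perplexity\.ai)$", re.IGNORECASE)

# ASSUMPTION: stand-in crawler tokens, not the author's actual filter list.
AI_CRAWLER = re.compile(r"GPTBot|OAI-SearchBot|ClaudeBot|PerplexityBot", re.IGNORECASE)


def referrer_source(referrer: str) -> Optional[str]:
    """Return the matched AI referrer domain, or None for any other referrer."""
    host = urlparse(referrer).hostname or ""
    match = AI_REFERRER.search(host)
    return match.group(2).lower() if match else None


def crawler_hits(user_agents: List[str]) -> Counter:
    """Count server-log user agents per matched AI crawler token."""
    hits: Counter = Counter()
    for user_agent in user_agents:
        match = AI_CRAWLER.search(user_agent)
        if match:
            hits[match.group(0)] += 1
    return hits
```

Anchoring the referrer pattern on the hostname rather than the full URL keeps lookalike domains and query strings that merely mention chatgpt.com out of the count.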

The four pillars, as TRM described them

For anyone who has not read the original case study, here is the short version of what TRM ran:

Semantic SEO system — map content to entities and search intent, not keywords. Build topical authority through related entity coverage.
Modular content architecture — Problem → Framework → Steps → Proof → CTA blocks. Each block stands alone so LLMs can quote it.
GEO enhancements — Article / FAQ / HowTo / Organization JSON-LD on every page, plus author and E-E-A-T signals.
Query fan-out cluster — build 30 long-tail pages around each core concept so AI subqueries always hit something you wrote.

In TRM's hands, applied as a system across 42 pages in 12 weeks, the four-pillar combination produced the 8,337% number. I did not have 12 weeks and I did not have 42 pages of capacity. What I had was three sites and a fixed budget of evenings. So I applied each pillar in isolation where I could, and tracked which one produced movement.

Pillar 1: Semantic SEO. Flat line on all three sites.

I spent the first three weeks rebuilding internal links on kenimoto.dev around entity clusters. Instead of "AI agent" appearing in 14 disconnected posts, I added a hub page, cross-linked siblings, and pointed everything at a canonical entity definition. On Site B I rewrote four pages around their parent entity. Site C got one new "what is X" sibling page.

After 90 days, ChatGPT referrals to kenimoto.dev went from 18/month to 23/month. Site B moved from 0 to 2. Site C did not move. That is not zero, but it is also not a pillar.

My read: semantic SEO is real, but its payoff window is longer than 90 days, and it compounds with everything else you do. For a small site running it as a standalone lever, the signal disappears into noise. TRM probably saw the benefit because they ran it concurrently with 42 fresh pages and a query fan-out cluster.

The pillar is fine. It just is not the thing that moves a 50-article blog in a quarter.

Pillar 2: Modular content. Best for the writer, worst for the test.

The Problem → Framework → Steps → Proof → CTA structure is a great editorial constraint. I rewrote eight existing kenimoto.dev posts to fit it. The articles read better. They are easier to skim. I am personally happier with them.

The traffic data did not notice. ChatGPT referrals to those eight rewritten posts were within their own normal week-to-week variance. The new TL;DR blocks did show up in my AI citation tracker results twice, which is more than zero but well inside noise for a sample of eight.

Two things I think are going on:

TRM's modular blocks worked on brand new pages with no prior AI exposure. Rewriting an indie blog post that has already been crawled and may already be cited does not reset that perception.
"Standalone-quotable paragraph" is a quality criterion most decent technical writing already meets. The marginal gain from formalising it on an indie blog that is already written by a human is small.

I am keeping the structure. I do not think it is what moved TRM's number.

Pillar 3: Author schema and E-E-A-T. The one that worked.

This is the part I almost skipped. JSON-LD has a reputation for being the LLMO equivalent of writing a great cover letter for a job you would have gotten anyway. I have built sites that ranked fine without a single Person schema and I have built sites with perfect schema that nobody cites.

Three things changed:

I added a Person schema to every author byline on kenimoto.dev, with sameAs pointing to GitHub, X, LinkedIn, and a verified Zenn profile
I wrote a real /about page with Person + ProfilePage schema, including credentials and a list of published books and articles with WorkExample links
I added author and publisher fields to the existing Article schema on every post

Total time: about six hours. No content was added, no posts were rewritten.
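The three changes are easier to picture as data. The sketch below builds the JSON-LD objects in Python and wraps them in a script tag. Every name and URL is a placeholder, and the structure is one reasonable reading of the description above, not the markup actually deployed on kenimoto.dev.

```python
import json

# ASSUMPTION: all names and URLs are placeholders for illustration only.
AUTHOR_ID = "https://example.com/about#person"

person = {
    "@context": "https://schema.org",
    "@type": "Person",
    "@id": AUTHOR_ID,
    "name": "Jane Author",
    "url": "https://example.com/about",
    # sameAs links the on-site author to externally verifiable profiles.
    "sameAs": [
        "https://github.com/example",
        "https://x.com/example",
        "https://www.linkedin.com/in/example",
    ],
}

# The /about page declares that its main entity is the same Person.
profile_page = {
    "@context": "https://schema.org",
    "@type": "ProfilePage",
    "mainEntity": {"@id": AUTHOR_ID},
}

# Each post points its author field at the same @id instead of a bare string.
article = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Example post title",
    "author": {"@id": AUTHOR_ID},
    "publisher": {"@type": "Organization", "name": "Example Blog"},
}


def jsonld_script(data: dict) -> str:
    """Wrap one JSON-LD object in the script tag that goes into the page head."""
    return '<script type="application/ld+json">' + json.dumps(data, indent=2) + "</script>"
```

Reusing one `@id` across the byline, the /about page, and every Article is what ties the three pieces into a single author entity rather than three unrelated blobs.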

Result over 90 days on kenimoto.dev:

ChatGPT referrals: 18/month → 127/month
Perplexity referrals: 4/month → 41/month
AI citation tracker hits across five trackers: a 3.7x median lift, with two trackers showing my author name as a recommended source for "LLMO indie" and "AI search optimization individual practitioner" queries

Site B got a smaller version of the same effect (0 → 8/month on ChatGPT, from a much smaller surface). Site C, which had no real author and no /about page worth writing, did not move.

I want to be careful here. n=3 is not a study. The lift is also confounded with the fact that I happened to publish a guest article on llmoframework.com during the same window, which probably contributed to the sameAs graph getting wired up faster. But the timing of the spike on kenimoto.dev tracks the schema deploy date more cleanly than it tracks any other change I made.

My current best guess at why: large language models are trying very hard to attach citations to named, verifiable people. They are nervous about citing pages that read like anonymous SEO content, because their providers have been burned by hallucinated citations and have visibly tightened up. A pile of clean Person schema linking to verifiable external profiles is the cheapest way to look like a named, verifiable person.

This was not on my list of "things that would move the needle." I had bet on Pillar 4.

Pillar 4: Query fan-out cluster. Ran out of capacity at page 11.

The TRM playbook calls for 30 long-tail pages per core concept. I picked "Claude Code subagents" as my concept on kenimoto.dev and planned the cluster. I shipped 11 of the 30 pages over the 90 days. The remaining 19 were on the spreadsheet, mocking me.

The 11 pages did pull some AI citation. Five of them appeared in at least one AI search result for a related sub-query. But the volume was small enough that it did not show up in the referrer data above a couple of visits each.

I think the pillar works. I do not think it works for a one-person indie operation in a 90-day window. The published TRM number came from "42 pages in 12 weeks" because TRM is an agency with a content team. I had two evenings a week. A cluster strategy that needs 30 pages to fan out is essentially a hiring strategy in disguise.

If you have a team, run Pillar 4 first. If you do not, run Pillar 3 and revisit Pillar 4 when you have hiring budget.

What the heatmap actually says

If I rank the four pillars by AI-referral lift across the three indie sites:

Pillar	kenimoto.dev	Site B	Site C
1. Semantic SEO	+ (mild)	+ (mild)	flat
2. Modular content	flat	flat	flat
3. Author schema / E-E-A-T	+++ (3.7x citation lift)	++ (start from 0)	flat (no author surface)
4. Query fan-out (partial)	+ (mild)	n/a	n/a

The "Pillar 3 only" result is suspicious in a healthy way. It is the cheapest pillar to implement, the one most people skip because it feels too obvious, and the one with the biggest gap between "looked easy" and "actually moved the data."

If I were giving advice to another indie dev about to run this playbook in 2026:

Do Pillar 3 first. Six hours of JSON-LD work has the best return per hour I have measured.
Run Pillar 2 as a writing habit, not a campaign. It is a quality move; do not expect it to show up in referrer data on its own.
Postpone Pillar 4 until you have a content team. The fan-out math assumes throughput you probably do not have.
Run Pillar 1 in the background. It compounds slowly. Do not stop, but do not stare at the dashboard for it.

The 8,337% number is real in TRM's spreadsheet. It is also a multi-pillar, multi-page, agency-sized effort that landed at 675 visits. For three indie sites, the leverage is in Pillar 3, and Pillar 3 alone got me from 22 monthly AI referrals to 176 across kenimoto.dev and Site B (ChatGPT plus Perplexity). That is a 700% lift on traffic I can actually feel.

It also took me six hours.

If you want the implementation details, the llmoframework.com pillars guide has the JSON-LD templates I used. The longer write-up on which schemas actually get parsed by LLM crawlers (and which ones are decorative) is in LLMO: AI Search Optimization.

I Refactored 100 Functions With Claude. CI Was Green. Production Got Slower in 7 Spots.

Ken Imoto — Thu, 28 May 2026 11:00:01 +0000

I asked Claude Code to refactor 100 functions across a Python service I owned. It did the job in two passes. CI was green on both. The PR description was so neat I almost felt bad shipping it on a Friday.

Two weeks later, on-call paged me because the p95 of one endpoint had drifted from 180 ms to 240 ms. I started bisecting. The bisect landed on the refactor PR. I started reading the refactor PR. Seven of the 100 functions were slower in production. CI never noticed because CI does not measure "slower." It measures "returns the same value."

This post is about what those seven slow functions had in common, why mutation tests and unit tests both missed them, and the four checks I now run before I let Claude, or any AI, refactor anything that ships under load.

The setup, so you can tell whether this generalizes

The codebase: a Python 3.12 service with about 18k lines of business logic, FastAPI on the edge, asyncpg to Postgres, a Redis cache, and a CPU-bound scoring module that runs on every request. The 100 functions were a curated batch: small to medium, pure where possible, all with unit tests. I asked Claude Code to apply a standard set of cleanups: early returns, extracted variables for magic numbers, comprehensions where loops did one thing, dataclass conversions for ad hoc tuples.

I was deliberate about scope. No rewrites. No architectural changes. No "while you are in there" rewiring. Two batches of 50, each shipped as its own PR, each with its own CI run on an 8-core runner. The unit tests passed. A mutation testing run with mutmut came back clean. Kill rate on the refactored modules went from 78% to 81%. By every signal I had, the code was equivalent and slightly better.

Which is exactly the kind of confidence that gets you a Friday page two weeks later.

What the slow seven had in common

When I sat down to read the seven slow functions side by side, three patterns showed up. None of them are obvious. All of them are the kind of thing CI is structurally unable to catch.

Pattern 1: comprehensions that traverse twice. Four of the seven were loops that Claude folded into a list comprehension. The comprehensions were correct. They were also walking the input twice (once to filter, once to map) because Claude had separated the predicate and the projection for readability. The original loop did both in one pass with an if and a continue. On a list of 50 items that runs once per request, the difference was 1.4 ms. On the hot path, multiplied across the request, it was about 12 ms of p95.

I would have caught it in code review if I had read the old and new code line by line. I did not, because the diff looked like a textbook "extract comprehension" cleanup and the test passed.
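The double traversal is easier to see side by side. A minimal sketch with made-up names (`Item`, the 1.1 multiplier, and all three functions are illustrative, not the production code):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    price: float
    active: bool


def totals_loop(items):
    """Shape of the original code: filter and map in a single pass."""
    out = []
    for item in items:
        if not item.active:
            continue
        out.append(item.price * 1.1)
    return out


def totals_two_pass(items):
    """Shape of the refactor: walks the data twice and builds an intermediate list."""
    active = [item for item in items if item.active]
    return [item.price * 1.1 for item in active]


def totals_one_pass(items):
    """A comprehension that keeps the single pass by putting the predicate inline."""
    return [item.price * 1.1 for item in items if item.active]
```

All three return the same list, which is exactly why an equality-based unit test cannot tell them apart.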

Pattern 2: early returns that defeated a cache. Two of the seven used @functools.lru_cache on the outer function. Claude added a guard clause that returned None for invalid input before the cache lookup. The intent was defensive: fail fast on bad input. The effect was that the cache stopped getting populated for the entire valid-input path, because the function now returned through a path that was not memoized. Hit rate dropped from 91% to 6% on that function. The function itself was fast. The 85-point hit rate drop was not.

You will not catch this in a unit test. You catch it in a load test, or in production, or by reading the function with the question "what was this function's role in the system, not just its contract."
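The post does not show the diff, so the sketch below is only one hypothetical way a guard-clause refactor can leave the valid path unmemoized: the guard lands in a new undecorated function that calls the raw lookup. The useful part is the check. `cache_info()` is available on any `functools.lru_cache` wrapper, so the hit rate can be asserted directly.

```python
from functools import lru_cache

CALLS = {"expensive": 0}  # counts how often the slow path actually runs


def _expensive_lookup(key: str) -> int:
    CALLS["expensive"] += 1
    return len(key)


@lru_cache(maxsize=1024)
def score_cached(key: str) -> int:
    """Original shape: every call goes through the memoized function."""
    return _expensive_lookup(key)


def score_refactored(key):
    """HYPOTHETICAL refactor: guard clause added, memoization lost on the valid path."""
    if not key:
        return None
    return _expensive_lookup(key)


def hit_rate(fn) -> float:
    """Hit rate of an lru_cache-wrapped function, from its cache_info() counters."""
    info = fn.cache_info()
    total = info.hits + info.misses
    return info.hits / total if total else 0.0
```

Both functions return identical values for valid keys, so a contract test stays green. Only the call counter and `cache_info()` show that one of them redoes the work on every call.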

Pattern 3: dataclass conversion that broke the asyncpg fast path. One function used to return a tuple that asyncpg could unpack directly into its row decoder. Claude converted the tuple to a dataclass with the same fields, which is structurally cleaner and semantically identical. It also forced an extra allocation and a __init__ call per row. At 800 rows per request and 30 requests per second, that adds up to roughly 8 ms of p95.

This one is my favorite, because it is the cleanest example of "the refactor is correct and the refactor is wrong." The code reads better. The system is slower.
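A micro-benchmark makes the per-row cost visible without a database. This sketch uses synthetic rows in place of asyncpg results, and `ScoreRow` and its fields are invented for illustration. The absolute timings depend on the machine, so none are quoted here.

```python
import timeit
from dataclasses import dataclass


@dataclass
class ScoreRow:
    user_id: int
    score: float


def rows_as_tuples(raw):
    """Original shape: pass the row tuples through untouched."""
    return list(raw)


def rows_as_dataclasses(raw):
    """Refactored shape: one extra object and one __init__ call per row."""
    return [ScoreRow(user_id, score) for user_id, score in raw]


def compare(n_rows: int = 800, repeats: int = 200) -> dict:
    """Time both shapes on the same synthetic rows and return seconds per shape."""
    raw = [(i, i * 0.5) for i in range(n_rows)]
    return {
        "tuples_s": timeit.timeit(lambda: rows_as_tuples(raw), number=repeats),
        "dataclasses_s": timeit.timeit(lambda: rows_as_dataclasses(raw), number=repeats),
    }
```

Calling `compare()` with the post's 800 rows per request gives a rough local ratio between the two shapes. Production impact still has to be measured under real load.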

Why CI and mutation testing both said yes

I want to spend a paragraph here because it took me a while to internalize this.

Unit tests verify that the function returns the same value for the same input. They do not verify that it returns the same value in roughly the same time, with roughly the same allocation pattern, holding roughly the same locks. Mutation testing verifies that your tests would notice if the code's logic changed. It would also not notice "this function now allocates a dataclass per row instead of unpacking a tuple," because mutation testing's mutators do not include "swap the data structure."

In other words: every tool I had in my CI pipeline was answering the question "is this code correct?" Not one of them was answering "is this code as fast?" That gap is exactly where Claude's refactors landed. The cleanups were correct. They were just slower in ways that only show up under real traffic.

I had a CI suite. It was green. The functions were just slower. CI does not measure "slower."

The four checks I run now

After the page, I built four checks into my refactor flow. Three are automated. The fourth is a 10-minute reading. I am sharing them because I have read every "let AI refactor your code" post this quarter and not one of them mentions performance verification.

Check 1: a baseline benchmark before the refactor. I run pyinstrument on the top 20 endpoints with a recorded production-shaped trace and save the report. The report names every function on the hot path with p50, p95, and allocation count. Pre-refactor, you should know which functions matter. Without this baseline, you cannot say "this function got slower." You can only say "the service feels slower," which is what brought me here in the first place.

Check 2: the same benchmark after the refactor, with a diff. Same trace, same script, diff the two reports. A drift of more than 5% on any function in the top 50 by self-time is a flag. Not a block. A flag. You investigate.
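The diff step is small enough to sketch. The input format is an assumption: two dicts mapping function name to self-time in seconds, exported from the before and after reports by whatever means the profiler allows. This version flags slowdowns only.

```python
def flag_drift(before: dict, after: dict, threshold: float = 0.05, top_n: int = 50) -> list:
    """Return (function, before_s, after_s, relative_change) for slowdowns above threshold.

    Only the top_n functions by baseline self-time are considered.
    """
    hottest = sorted(before, key=before.get, reverse=True)[:top_n]
    flags = []
    for name in hottest:
        if name not in after or before[name] <= 0:
            continue  # renamed or removed functions need a manual look instead
        change = (after[name] - before[name]) / before[name]
        if change > threshold:
            flags.append((name, before[name], after[name], change))
    return flags
```

The result is a list to investigate, not a gate: a flagged function may be slower for a reason you accept.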

Check 3: a load-shaped soak. I run locust for 10 minutes at 80% of peak production load against the refactored build and watch cache hit rates, allocation rates, and DB connection acquisition time. This is what would have caught the lru_cache regression. A hit-rate drop from 91% to 6% screams within the first five minutes of a soak. It is silent in unit tests forever.

Check 4: read the diff for "structural changes I asked for vs. structural changes I got." I open the diff, find every changed function, and ask one question: "did this change touch the data structure, the iteration pattern, the cache boundary, or the lock acquisition?" If yes, it goes in a second list for a slow read. The slow read takes about 10 minutes per 100 functions. It would have caught five of my seven.

I now treat AI refactoring as a junior engineer's PR: I trust it on style, I check it on substance, and I never merge it without a load test if it touched the hot path. That sounds harsh. It is the same standard I would hold a human contributor to. The difference is that with a human contributor, you can ask "why did you change this?" and get a reason. With Claude, you get a structurally clean diff and an empty comment field.

What I do not do

I do not avoid Claude for refactoring. After the seven regressions, I shipped another 240 refactors with the four-check flow and have not had a production regression since. The flow takes about 20 minutes per batch of 50 functions. That is 20 minutes against weeks of bisecting and one page that came in on a Friday evening at 7:42 pm during my partner's birthday dinner.

I also do not refactor "while in there" anymore. Refactor PRs are refactor PRs. Feature PRs are feature PRs. When the two are mixed, you cannot bisect a regression to a single cause, and AI-driven refactors are pattern-spotting machines, which means the kind of regression they cause shows up in clusters and not in single commits. Keeping the PRs separate is what made it possible to find this in a day instead of a week.

The lesson, if there is one, is small: the boring stuff CI does not measure is exactly where AI refactors will leave their fingerprint. Measure it.

The longer playbook (CLAUDE.md patterns from 2 lines to 100, Plan Mode workflow, team operations, the patterns I use to keep AI inside a safe lane on a real codebase) is in Practical Claude Code.

I Turned on Agent Tracing for 30 Days. 4 Hidden Bottlenecks Were Eating 47% of My Tokens.

Ken Imoto — Wed, 27 May 2026 22:00:01 +0000

I have a production Claude agent that has been running for about four months. It does code review on incoming PRs, drafts changelog entries, and occasionally summarizes a Slack channel. Nothing exotic. Nothing the marketing pages would put on a banner.

It was burning 5.2 million tokens a month. I knew that because Anthropic's invoice told me. What the invoice did not tell me was where the tokens were going. The agent's logs said "PR-1234 reviewed in 3 turns, 14k tokens." That math should not add up to 5.2M unless the agent is reviewing roughly 370 PRs a month. The team ships about 80 PRs a month.

So I turned on per-call tracing for 30 days. By the end of the month I had found four bottlenecks the existing logs were structurally unable to surface. Together they were eating 47% of the monthly token bill while contributing zero new behavior. Fixing them cut the bill roughly in half without changing what the agent does.

This post is the four bottlenecks, the trace query that found each one, and the fix.

What I instrumented

I am not going to claim a perfect setup. What I did was the smallest amount of tracing that gave me ground truth. Specifically:

One span per LLM call (anthropic.messages.create) with attributes {model, input_tokens, output_tokens, cached_tokens, stop_reason}
One span per tool call (tool.call) with {server_name, tool_name, input_size_bytes, output_size_bytes, duration_ms, error}
One span per "agent turn" wrapping both LLM call and tool calls
Trace IDs threaded through every sub-agent spawn so I could group by the originating PR

I used OpenTelemetry with the GenAI semantic conventions (the 2026-03 revision, which is the one most exporters agree on right now). Storage was Pydantic Logfire because it groks the GenAI attributes out of the box and the free tier covered 30 days of one agent. Helicone and Langfuse work fine too; the brand of the dashboard does not matter, the per-call span does.

The "per-call" is the part that matters. Aggregated metrics ("avg input_tokens by hour") would have shown me the bill going up and not why. Per-call spans let me ask "what was in the input on the 14 most expensive calls last week" and answer with one query.

Bottleneck 1: a tool-call retry loop nobody saw (18% of tokens)

The first query I ran was "show me agent turns where the same tool was called more than twice in a row." This is the kind of thing you would never look at in aggregate.

It came back with one offender: a Stripe API integration that was returning 429 rate-limit errors during peak hours. The Stripe MCP server had no backoff. The agent's behavior on tool error was to retry up to 7 times within a single turn. Each retry re-sent the full prompt context, because the API is stateless and every LLM call carries the whole conversation again. Seven retries on a 14k-token prompt is roughly 100k tokens to discover that Stripe is busy.

This was happening on roughly 30% of PRs that touched payment code. None of the agent's user-facing logs mentioned the retries because the agent successfully completed the turn after Stripe came back. From the outside, everything was fine. From the trace, it was the single biggest line item in the bill: about 18% of monthly tokens.
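The "same tool more than twice in a row" query from the start of this section can be approximated over exported spans. The span shape here (dicts with `turn_id`, `tool_name`, `start_ms`) is an assumption for illustration, not the Logfire schema.

```python
from itertools import groupby


def repeated_tool_runs(spans, min_run: int = 3):
    """Find turns where one tool was called min_run or more times back to back.

    `spans` is an iterable of dicts with turn_id, tool_name, and start_ms keys.
    """
    ordered = sorted(spans, key=lambda s: (s["turn_id"], s["start_ms"]))
    offenders = []
    for turn_id, turn_spans in groupby(ordered, key=lambda s: s["turn_id"]):
        # Within one turn, group consecutive calls to the same tool.
        for tool_name, run in groupby(turn_spans, key=lambda s: s["tool_name"]):
            length = sum(1 for _ in run)
            if length >= min_run:
                offenders.append((turn_id, tool_name, length))
    return offenders
```

Because `groupby` only merges adjacent items, a tool that is called, interleaved with another tool, and called again does not count as a retry run.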

Fix: backoff with jitter in the Stripe MCP server, and a hard cap of 2 retries per tool per turn at the agent level. Six lines of code. The fix shipped on day 9 of the 30-day tracing window and the next invoice cycle reflected it.
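A sketch of what such a fix can look like. It folds the backoff and the retry cap into one helper for brevity, whereas the post puts them in two places (server and agent). `RateLimited` is a stand-in for whatever the tool raises on a 429, and the delay values are assumptions.

```python
import random
import time


class RateLimited(Exception):
    """Stand-in for the error a tool raises on an HTTP 429 response."""


def call_with_backoff(tool, *args, max_retries=2, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call `tool`, retrying rate-limit errors at most max_retries times.

    Uses full jitter: each wait is random between 0 and an exponentially growing cap.
    """
    for attempt in range(max_retries + 1):
        try:
            return tool(*args)
        except RateLimited:
            if attempt == max_retries:
                raise  # surface the error instead of spending more tokens on retries
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.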

Bottleneck 2: context re-fetch on every turn (14% of tokens)

The agent's design has it re-reading CLAUDE.md at the start of every turn. This was deliberate when I built it; I wanted the agent to pick up changes to the rules without a restart. It is also approximately 4,000 tokens of context for the file alone, plus the agent generally needs 2-3 supporting files per turn (the codebase's README.md, an OWNERS.md, and a style.md).

Per-call tracing showed me that the average PR review involved 14 turns, and each turn re-fetched all four context files. Total context tokens per PR review: around 56k tokens, of which maybe 8k were genuinely needed (the diff and one or two relevant source files).

Fix: introduce a cache for read-only files using Anthropic's prompt caching (which existed but I had never bothered to wire in). The agent now reads CLAUDE.md and the other context files once per session, marks them with cache_control: {"type": "ephemeral"}, and subsequent turns read them from the cache at about 10% of the normal input price. The aggregate effect was a 14% drop in monthly tokens, with no behavior change.
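In request terms, the change amounts to putting the static files at the front of the prompt and marking the end of that prefix as cacheable. The sketch below only builds the request arguments and makes no network call. The function name, the file-header format, and the `max_tokens` value are assumptions.

```python
def build_request(static_files: dict, turn_messages: list, model: str) -> dict:
    """Build keyword arguments for a Messages API call with a cached static prefix.

    static_files maps file name -> file text (for example CLAUDE.md, README.md).
    """
    system_blocks = [
        {"type": "text", "text": f"# {name}\n{text}"}
        for name, text in static_files.items()
    ]
    if system_blocks:
        # A breakpoint on the last static block covers the whole prefix before it.
        system_blocks[-1]["cache_control"] = {"type": "ephemeral"}
    return {
        "model": model,
        "max_tokens": 1024,
        "system": system_blocks,
        "messages": turn_messages,
    }
```

The cached prefix has to be identical from call to call, so the files should be passed in a stable order and anything that changes per turn (the diff, the conversation) belongs in `messages`, after the breakpoint.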

The trace query that found this: "filter to spans where tool_name == read_file AND input.path == 'CLAUDE.md', then count per agent session." If I had been looking at the dashboard's "average tokens per turn" chart I would never have seen it, because the average was hiding the multiplier.

Bottleneck 3: sub-agent fan-out duplication (9% of tokens)

When the agent spawns three sub-agents for code review (architect / security / performance), it was passing the full PR diff to each one independently. The diff was, on average, 3,000 tokens. Three sub-agents getting 3,000 tokens each = 9,000 tokens of diff sent per PR, 6,000 of them pure duplication. Over 80 PRs a month, that is 720k tokens of diff, of which 480k (about 9% of the 5.2M monthly total) went to re-sending something the agent had already sent.

Fix: pass the diff once to a parent context, have the three sub-agents reference the same cached block. With prompt caching, the second and third sub-agents pay 10% of the input cost on the shared context. Same effect: roughly 9% monthly drop.

The trace query: "group spans by parent_trace_id and look at input similarity across child spans." Most observability tools cannot answer this out of the box; I exported the spans to a Jupyter notebook and ran a quick diff. The duplication was 92% byte-for-byte.

Bottleneck 4: the sycophancy preamble (6% of tokens)

This is the cheap one and the funniest one. Across about 40% of the agent's responses, the model was opening with some variation of "You're absolutely right" or "Great question" followed by a short paraphrase of the prompt before answering. That is roughly 120 output tokens per affected turn that contribute zero information.

Per turn, 120 tokens is nothing. Over a month of an agent making roughly 1,100 turns, with output tokens priced higher than input, it added up to about 6% of monthly tokens.

Fix: a system instruction that says "Do not restate the user's request or open with agreement. Begin with the answer." Output tokens per turn dropped by an average of 95 within a day.

The trace query: "show me the first 200 characters of output_text across all turns this week." Half of them started the same way. This is the kind of thing you only see when you can look at the actual content, not the metrics.
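Counting openings is a few lines once the output text is exported. A minimal sketch, assuming the outputs are available as plain strings. The three-word window is an arbitrary choice.

```python
import re
from collections import Counter


def opening_counts(outputs, n_words: int = 3) -> Counter:
    """Count how often each lowercased n_words-word opening appears across outputs."""
    counts: Counter = Counter()
    for text in outputs:
        # Look only at the first 200 characters and ignore punctuation.
        words = re.findall(r"[a-z']+", text[:200].lower())[:n_words]
        if words:
            counts[" ".join(words)] += 1
    return counts
```

`opening_counts(outputs).most_common(10)` then shows whether a handful of stock phrases dominate the first line of every response.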

What the dashboard could not show me

I keep coming back to this because it took me a while to internalize.

The aggregated metrics I had before (average tokens per turn, total cost per day, p95 latency) showed me that the bill was going up. They could not show me which behavior of the agent was responsible. The bottlenecks above were all "this turn cost a normal amount; there are just a lot of these turns" patterns. Aggregates hide them by design.

Per-call tracing is annoying. It produces a lot of data. The Logfire UI for one month of one agent had about 1.4 million spans. You cannot read them. You can query them, which is the entire point. Every one of the four bottlenecks was a one-line trace query that I could not have asked of any aggregate dashboard.

What I run now

Three things I left in place after the 30 days:

Per-call spans, always on. The instrumentation cost is roughly 2% in latency overhead and negligible in storage cost on the free tiers. I do not turn it off when I am "done debugging" because the next bottleneck will look exactly like these four did: silently expensive, invisible to aggregates.
A weekly trace audit. Every Monday I run six saved queries (the four above plus two on tail latency and error patterns). It takes 10 minutes. It catches one new issue roughly every 6-8 weeks.
A budget alert at 80% of last month. If the agent's token consumption is on pace to beat last month by 20% with no design change, something is wrong. The alert fires before the invoice does.
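The pace check in the last item can be sketched as a straight-line projection. This is one reading of "on pace", not necessarily the rule the alert actually uses. The function name and the linear extrapolation are assumptions.

```python
def on_pace_to_exceed(tokens_so_far: int, day_of_month: int, days_in_month: int,
                      last_month_total: int, tolerance: float = 0.20) -> bool:
    """True if a linear projection of this month beats last month by more than tolerance."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must fall inside the month")
    projected = tokens_so_far / day_of_month * days_in_month
    return projected > last_month_total * (1 + tolerance)
```

A linear projection is noisy in the first few days of a month, so it is worth suppressing the alert until a week or so of data exists.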

What I do not do

I do not trust the agent's user-facing logs to tell me what it is doing. The agent's own logs are a summary. The summary is written by the same model whose behavior I am trying to measure. There is no version of that loop that is not going to flatter itself.

I also do not use only token-cost metrics. The four bottlenecks above are all "I am paying for behavior I do not want." The next four will probably be "I am paying for behavior I do want but pricing has changed" or "I am paying for tail latency in a tool I depend on." Those need different queries. Per-call spans are the substrate that lets me write the next query whenever I think of it.

The lesson, if there is one, is the same as it was 30 years ago for backend services: you cannot manage what you cannot see, and aggregates are not seeing, they are summarizing. AI agents are a system. Treat them like one.

The longer write-up on the OpenTelemetry GenAI conventions, the per-platform tracing setup (Logfire / Helicone / Langfuse), and the W3C trace-context plumbing that connects sub-agents to their parents is in Observability Across Frontend and Backend. The harness chapter is where the budget-alert loop lives.

I Wired 8 MCP Servers Into One Claude Agent. 3 Pairs Quietly Fought Over the Same Tool Name.

Ken Imoto — Wed, 27 May 2026 11:00:00 +0000

Eight MCP servers in one claude_desktop_config.json. No error on boot. No warning on tool registration. Six days of using the agent before I noticed that "search" was sometimes hitting Brave and sometimes hitting my local filesystem, and "create_issue" had silently routed every issue I created that week into Linear when I thought I was filing them on GitHub.

It turns out MCP, as of the 2026-03 spec, has no built-in namespace for tool names. Two servers can register list_files and the client (Claude in my case) will use whichever definition it registered last. There is no collision detection. There is no warning. There is a registry that quietly overwrites.

This post is what I found when I sat down and audited the 8-server registration on day six, what each silent collision actually did, and the one-field-per-server config change that has kept me at zero collisions for six weeks since.

The 8 servers and why each one was there

For context, this is not a stunt setup. Each server earned its slot for a real task I run weekly.

brave-search — web search for fact-checking
filesystem — read/write inside an Obsidian vault
github — issue and PR ops on my own repos
linear — issue and project ops on a client repo
s3 — read access to a private logs bucket
freee — tax/expense ops (Japanese accounting service)
slack — read-only on two channels for catch-up summaries
postgres — read-only on a personal analytics DB

Eight servers, totalling 87 tools when Claude finished registering them. Around 4,400 tokens of tool descriptions in the system prompt, which is its own problem (separate post). The thing I want to talk about is the names.

The three collisions

When I dumped the registered tool list and grouped by name, three pairs had collided. Two of them I could have predicted in retrospect. The third I would not have, and it is the one that scared me.

Collision 1: search. Both brave-search and filesystem registered a tool named search. The Brave one takes a query and returns web results. The filesystem one takes a query and greps the Obsidian vault. They have completely different argument schemas. Claude was choosing based on which definition got loaded last on boot, which in turn depended on server order in the config (alphabetical, so filesystem loaded after brave-search and won). When I asked "search for the latest Anthropic safety paper," Claude ran a regex over my Obsidian vault and confidently told me there was no result. That was the bug that started the audit.

Collision 2: create_issue. Both github and linear registered create_issue. Same name, same overall shape (title, body, labels), incompatible everything else (linear wants a teamId, github wants owner and repo). When I asked Claude to "open an issue about the asyncpg regression," it called the second-loaded one, which was Linear. The issue went into a client project where it did not belong, with a body that mentioned my private Postgres schema. I closed it quickly. The fact that I had to is the point.

Collision 3: list_files. Both filesystem and s3 registered list_files. When I asked Claude to "list the files in the inbox folder," it ran the s3 version, listed every object in the bucket prefix inbox/, and stuffed about 31,000 tokens of file metadata into the context. The session was effectively burned. I had to start a new one. The bucket has roughly 40k objects in it. The local inbox/ directory has 12.

None of these throw an error. The MCP client (Claude Desktop / Claude Code) sees a flat tool registry. Last write wins. Period.
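The behavior described above is what a plain dictionary does when it is filled from several sources. This toy model is not the client's actual code, only an illustration of last-write-wins using the three collisions from this post:

```python
def build_flat_registry(servers: dict) -> dict:
    """Flatten {server_name: [tool_name, ...]} into {tool_name: server_name}.

    Later servers silently overwrite earlier ones.
    """
    registry = {}
    for server_name, tool_names in servers.items():
        for tool_name in tool_names:
            registry[tool_name] = server_name  # no collision check: last write wins
    return registry


servers = {
    "brave-search": ["search"],
    "filesystem": ["search", "list_files"],
    "github": ["create_issue"],
    "linear": ["create_issue"],
    "s3": ["list_files"],
}
registry = build_flat_registry(servers)
```

With servers loaded in alphabetical order, `registry` ends up routing search to filesystem, create_issue to linear, and list_files to s3, with no error raised along the way.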

Why the spec lets this happen

I went and re-read the Model Context Protocol spec (2026-03-26 revision) to confirm I was not missing something. I was not. The tools/list response from a server returns tool names as flat strings. There is no namespace field. There is no server_id qualifier. The client is expected to flatten the tool lists from all servers into a single map. The spec does not say what to do on collision because, in the spec's mental model, a collision is a configuration problem.

That is technically correct and operationally insufficient. Anyone wiring more than two MCP servers will hit a collision eventually because the names that show up are exactly the names you would pick yourself: search, list, get, create, delete. They are not safe by accident.

There is a pending proposal (#287) to add namespace prefixes server-side, dated around early 2026, but as of writing it has not landed and the client implementations have not picked it up. So this is an "until further notice" problem.

What I added

One tool_prefix field per server in my agent config. Not pretty. Effective.

{
  "mcpServers": {
    "brave-search": { "command": "...", "tool_prefix": "brave_" },
    "filesystem":   { "command": "...", "tool_prefix": "fs_" },
    "github":       { "command": "...", "tool_prefix": "gh_" },
    "linear":       { "command": "...", "tool_prefix": "linear_" },
    "s3":           { "command": "...", "tool_prefix": "s3_" },
    "freee":        { "command": "...", "tool_prefix": "freee_" },
    "slack":        { "command": "...", "tool_prefix": "slack_" },
    "postgres":     { "command": "...", "tool_prefix": "pg_" }
  }
}

tool_prefix is a client-side feature in the build of Claude Code I am running (added in the 2026-04 release; check your version). It rewrites every tool name from a server to {prefix}{tool_name} before registering. Now search becomes brave_search and fs_search, create_issue becomes gh_create_issue and linear_create_issue, and the registry has 87 unique names.

If your client does not have this feature, the same thing works at the server side: fork the server, prefix the names at the source. Uglier, same result.
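Whichever side does the rewriting, the transformation itself is small. A minimal sketch, assuming a plain dict of tool names per server (the tool lists here are hypothetical; only the prefixes come from the config above):

```python
# Prefixes copied from the config above (subset).
PREFIXES = {
    "brave-search": "brave_",
    "filesystem": "fs_",
    "github": "gh_",
    "linear": "linear_",
}

# Hypothetical tool lists for those servers.
TOOLS = {
    "brave-search": ["search"],
    "filesystem": ["search"],
    "github": ["create_issue"],
    "linear": ["create_issue"],
}

def register_with_prefix(tools_by_server, prefixes):
    """Build a flat registry keyed by {prefix}{tool_name}."""
    registry = {}
    for server, tools in tools_by_server.items():
        for tool in tools:
            # Keep the original name so calls can be routed back to the server.
            registry[prefixes[server] + tool] = (server, tool)
    return registry

registry = register_with_prefix(TOOLS, PREFIXES)
print(sorted(registry))
```

Keeping the (server, original name) pair as the value means the client can still call the server with the unprefixed name it expects.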

What I measure now

I added two checks to my agent boot:

Collision scan. On startup, after all servers register, walk the tool list and assert no duplicates. Fail the boot if a duplicate exists. Three lines of code. It would have caught my problem on day one.
Tool-call attribution log. Every tool call gets logged with {server_name, tool_name, args_summary}. When something feels wrong, I can grep one day of transcripts and see whether search calls went to Brave or filesystem. This is also what I used to measure the 22% wrong-server rate before the prefix change. Without attribution logging, you cannot know whether you have this problem.
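Both checks fit in a few lines. The sketch below is an assumption about shape, not the code from my repo: find_collisions, assert_no_collisions, and attribution_record are hypothetical helper names, and the 80-character args summary is an arbitrary choice.

```python
import json
from collections import defaultdict

def find_collisions(tools_by_server):
    """Return {tool_name: [servers]} for every name registered more than once."""
    owners = defaultdict(list)
    for server, tools in tools_by_server.items():
        for tool in tools:
            owners[tool].append(server)
    return {tool: names for tool, names in owners.items() if len(names) > 1}

def assert_no_collisions(tools_by_server):
    """Fail the boot loudly instead of letting the last writer win."""
    collisions = find_collisions(tools_by_server)
    if collisions:
        raise RuntimeError(f"duplicate tool names: {collisions}")

def attribution_record(server_name, tool_name, args):
    """One JSONL line per tool call, with a truncated summary of the arguments."""
    summary = json.dumps(args, sort_keys=True)[:80]
    return json.dumps(
        {"server_name": server_name, "tool_name": tool_name, "args_summary": summary}
    )
```

Appending attribution_record(...) plus a newline to a file on every call is enough to answer "which server handled search yesterday" with a grep.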

The attribution log lives in ~/.claude/agent-tool-calls.jsonl for me. Six weeks of it is about 14 MB and has caught one other subtle bug (a freee server returning data for the wrong fiscal year) that had nothing to do with name collisions. The investment paid for itself twice in six weeks.

What I do not do

I do not run any MCP server with a generic tool name like search or list un-prefixed, ever, even if it is the only server registered. The cost of prefixing is around 4 tokens per tool in the description. The cost of a silent collision when you add a second server six months later is one production-shaped incident.

I also do not trust client implementations to add collision warnings on my behalf. The MCP client market is moving fast. Today's "the client warns you on duplicate" feature is tomorrow's "we removed that warning because it was too noisy in this other workflow." The boot-time assertion lives in my repo. It will outlast any specific client.

The lesson, if there is one, is the same as it always is with protocols that started as Just Wire Two Things Together: as soon as you have eight of anything, the assumptions the protocol made when there were two are the things that quietly bite.

The longer version of this story (the OWASP MCP Top 10 in production, the file-upload workaround chain, the 55k-token system-prompt diet I am running on the same 8-server config) is in Practical MCP Security. Chapter 6 is the auth and tool-registration audit playbook I run on every new server now.

I Benchmarked 5 Voice AI Stacks. Only 2 Stayed Under 300ms.

Ken Imoto — Tue, 26 May 2026 22:00:00 +0000

I kept reading that voice AI agents respond in under 300ms. AssemblyAI says it, Vapi says it, every Realtime API launch post says it. So I built five stacks, dropped a stopwatch into each pipeline, and ran the same one-minute conversation through all of them.

Three of the five never came close.

The other two were the ones I had quietly assumed were "marketing numbers." Turns out the marketing was right and my hand-stitched pipelines were the problem.

The three cliffs nobody puts on the slide

Before the numbers, the perception model. Voice latency does not degrade smoothly. It falls off cliffs. AssemblyAI, Vapi, and Retell all converge on roughly the same three thresholds, and after a week of user testing I now believe them.

Latency	What the user does
0-300ms	Talks normally, never thinks about the AI
300-500ms	Senses a pause, tolerates it
500-800ms	Talks over the AI ("can you hear me?")
800-1500ms	Repeats the question
1500ms+	Treats the call like an international line, gives up

300ms is the first cliff. Above it, the user starts noticing a machine. Above 500ms they start fighting the turn-taking model and your STT keeps resetting because they keep talking over. By 800ms, half my testers said "hello? hello?" — the universal "is this thing on" sound. I have not lived a more humbling week of code review than watching that on playback.

Where the 300ms budget goes

If you want to know why three of my five stacks failed, look at the budget math. A cascaded pipeline has to fit four serial things into 300ms:

STT (speech-to-text): 80-300ms depending on model and VAD design
LLM TTFT (time to first token): 100-500ms depending on model size, context length, and cold-start
TTS TTFB (time to first byte of audio): 75-300ms depending on the vocoder
Network round-trip: 50-200ms, capped by the speed of light and your colo choice

Add the fastest number in every row and you get 305ms. Add the typical number and you get over a second. The "anatomy of latency" punchline is that a cascade is mathematically allergic to 300ms unless every component lives next to every other component.
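The budget arithmetic is easy to check. This snippet only sums the ranges listed above; the dictionary keys are shorthand labels, not names from any SDK.

```python
# (fastest_ms, slowest_ms) for each serial stage, copied from the list above.
BUDGET_MS = {
    "stt": (80, 300),
    "llm_ttft": (100, 500),
    "tts_ttfb": (75, 300),
    "network": (50, 200),
}

best_case = sum(fast for fast, _ in BUDGET_MS.values())
worst_case = sum(slow for _, slow in BUDGET_MS.values())

print(best_case, worst_case)  # 305 1300
```

Even the all-best-case sum is already 5ms over the 300ms line, before any hand-off overhead between services.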

Voice-to-voice end-to-end models cheat by collapsing STT + LLM + TTS into a single forward pass over an audio token stream. There is no second hop. There is no TTS warmup. There is no inter-service hand-off. That is the whole game, and it is also why the two stacks that won are the two stacks I wrote the least code for.

The five stacks

I wanted a real comparison, not a "look at my favorite vendor" post. Same one-minute customer-support script. Same WebRTC ingress (Daily.co for everything except OpenAI Realtime, which uses its own). Same prompt. Same client machine, US-East. Ten turns per stack, 50 measurements per stack. I report P50, P95, and P99 because averages lie in a way that voice users physically feel.
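For reference, here is one common way to compute those percentiles. It is a sketch using the nearest-rank definition, which is an assumption: other tools interpolate between samples and will give slightly different numbers on small sample sets.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical latencies: 1ms..50ms, just to show which sample each percentile picks.
samples = list(range(1, 51))
print(percentile(samples, 50), percentile(samples, 95), percentile(samples, 99))  # 25 48 50
```

With 50 samples, nearest-rank P99 is simply the slowest measurement, so it is worth reading as "worst turn observed" rather than a stable tail estimate.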

Stack 1 — OpenAI Realtime API. gpt-4o-realtime over the official WebRTC endpoint. Voice-in, voice-out, no glue.

Stack 2 — Deepgram + Claude + ElevenLabs cascade. Deepgram Nova-3 for STT, Claude Sonnet 4.6 for the LLM, ElevenLabs Turbo v2.5 for TTS. The "best-of-breed" cascade you would draw on a whiteboard.

Stack 3 — Local Edge (Whisper + Llama + Coqui). Whisper Large v3 Turbo, Llama 3.3 70B local on a single H100, Coqui XTTS for TTS. Network round-trip: 0ms. This is the "privacy and sovereignty" answer that Hacker News loves.

Stack 4 — LiveKit Agents + Gemini 2.0 Flash Live. LiveKit's agents framework as the media plane, Google's native-audio Gemini Live for the brain. Also voice-to-voice end-to-end, but through a different SDK.

Stack 5 — Pipecat + Claude + Cartesia. Pipecat as the orchestrator, Claude Sonnet 4.6 for the LLM, Cartesia Sonic for the TTS. A more opinionated cascade with a faster TTS than ElevenLabs.

The results

Stack	P50	P95	P99	Under 300ms?
1. OpenAI Realtime (voice-to-voice)	232ms	281ms	320ms	✅
2. Deepgram + Claude + ElevenLabs	480ms	624ms	780ms	❌
3. Whisper + Llama 70B + Coqui (local)	870ms	980ms	1,210ms	❌
4. LiveKit + Gemini Live (voice-to-voice)	250ms	295ms	360ms	✅
5. Pipecat + Claude + Cartesia	410ms	540ms	670ms	❌

Stack 1 and Stack 4 are the only two that stayed under 300ms at P95. Both are voice-to-voice. Both ship a single forward pass instead of a relay race. Stack 5 is what a careful cascade looks like (Cartesia's TTS is genuinely fast: 90ms TTFB) and it still cannot beat the cliff because LLM TTFT plus inter-service hops eat the budget.

Stack 3 is the painful one. I had hoped local would at least beat the cascade because of zero network. It does, sometimes. But Llama 3.3 70B is not small, and "no network" does not save you when LLM TTFT alone is 600ms, even on a single H100. The honest read on edge AI is that today's realistic edge win is smaller models (Qwen2.5 1.5B class), not full-fat 70B local. A 70B model on local hardware is the worst of both worlds: you pay for the GPU and still miss the cliff.

Why voice-to-voice wins (today)

Three reasons, in decreasing order of how much they shocked me:

One — no TTFT-then-TTFB stacking. In a cascade, you wait for the LLM's first token, then start the TTS, which has its own first-byte latency. Voice-to-voice emits audio tokens directly. There is no second warmup.

Two — no hand-off serialization. Deepgram → Claude → ElevenLabs is three separate API endpoints. Even if each is fast, you pay TLS, connection pooling, and frame-buffer overhead three times. Pipecat helps but does not erase it.

Three — VAD-aware turn-taking. The voice-to-voice models do their own endpoint detection from the audio stream. Cascades have to wait for a VAD signal to commit the STT output, then send it. That commitment delay is invisible in benchmarks that start measuring from "user stops speaking" — but the user does not know when they "officially" stopped. They feel it as silence.

The cheap way to hit 300ms in May 2026 is to skip writing the pipeline. Most of my latency was my code.

When edge AI catches up

Edge is the right answer for the right shape of problem — local-only privacy, no-network kiosks, offline robotics. It is not yet the right answer for "I want a sub-300ms cloud agent." Whisper v3 Turbo posts a real-time factor north of 1000x and 1.5B-class LLMs can return first tokens in 200ms on CPU. That combination — small model, fast STT, local TTS — can hit 300-350ms total. The 70B-on-H100 path I tested in Stack 3 cannot.

The other path is hybrid: edge STT, cloud LLM, cloud TTS. You skip the network round trip on the longest synchronous step (capturing audio frames) and you keep cloud-grade model quality for the brain. 350-500ms is realistic; a sub-300ms cloud cascade is not.

For more on the perception side (how to make a 500ms agent feel like a 300ms agent), I wrote a companion piece on perception hacks for voice AI. Filler audio, micro-confirmations, and progressive token playback can buy you a cliff's worth of perceived speed. They do not make the cliff move.

What I would build today

If I were starting a voice agent in May 2026:

Greenfield consumer product — OpenAI Realtime or Gemini Live, direct. Stop sooner than you think you should and just ship.
Need Claude in the loop — Pipecat + Claude + Cartesia. You will live at 500-600ms P95. Plan your filler strategy now, not later.
Privacy or air-gap requirement — Whisper Turbo + Qwen2.5 1.5B + local TTS. Aim for 350ms TTFB. Forget 70B local until the next GPU generation.
Enterprise telephony — Hybrid: edge STT, cloud voice-to-voice for the brain. The PSTN codec layer already kills your latency advantage, so optimize for quality of turn-taking instead.

The deepest mistake I made was assuming "300ms" was a property of the model I picked. It is a property of the architecture I picked. The model just decides how comfortable the architecture is.

5 AI Crawlers Hit My Sites 14,300 Times in 30 Days. Here's What Their User-Agents Told Me About LLMO.

Ken Imoto — Tue, 26 May 2026 11:00:01 +0000

I thought robots.txt was the boundary. Three lines of Disallow: and I'd told the AI bots where they could and couldn't go. Done. I went back to writing posts about LLMO measurement, citation rates, and AI referral traffic in GA4.

Then I opened the access logs for three of my sites and the picture I had in my head collapsed.

This is what I learned reading thirty days of raw server logs from kenimoto.dev, kaoriq.com, and llmoframework.com. Five User-Agent strings dominated everything. The traffic patterns each one created told me more about my LLMO standing than any GA4 dashboard had.

Why I started reading logs in the first place

Most LLMO measurement advice tells you to track the outbound side: did ChatGPT cite me, did Perplexity link to me, did Google AI Overviews show me. That's the citation side.

The other side, where AI services actually pull HTML from my server, is invisible in GA4. AI crawlers don't fire JavaScript. They don't trigger gtag. They show up in raw HTTP access logs and nowhere else.

I'd been writing LLMO posts for months and had never once looked at the side of the funnel I could actually control. So I exported 30 days of logs from Cloudflare (kenimoto.dev, kaoriq.com) and Vercel (llmoframework.com), grepped for known AI User-Agents, and started counting.

The total: 14,300 AI crawler hits across three sites in 30 days. Roughly 477 hits per day across all three, or about 159 per day per site. More than I expected. Less than I think it will be in another six months.

The 5 crawlers that hit me most

Here's the ranked list. Hits are deduplicated by (timestamp, path, IP) so cache retries don't inflate the count.

Rank	User-Agent	30-day hits	Operator	Purpose
1	`GPTBot`	4,212	OpenAI	Training data
2	`ClaudeBot`	3,108	Anthropic	Training + retrieval
3	`PerplexityBot`	2,790	Perplexity	Answer index
4	`OAI-SearchBot`	2,043	OpenAI	ChatGPT search citations
5	`Google-Extended`	1,387	Google	Gemini training

Five User-Agents. 13,540 hits. That's 94.7% of all AI traffic. The remaining 5.3% was a long tail: Bytespider, Applebot-Extended, Meta-ExternalAgent, Amazonbot, cohere-ai, a smattering of Claude-User, and two hits from something that called itself anthropic-ai (the old UA Anthropic supposedly retired).

Before you read too much into the order: this is my data, three small sites, mostly English/Japanese tech content. Your ranking will look different. The shape of it (a handful of bots accounting for most hits, OpenAI and Anthropic at the top) is probably the same.

What each one is actually doing

Rank order matters less than what each bot is for. The crawlers fall into three buckets, and each bucket behaves completely differently in LLMO terms.

Training crawlers read your content to potentially update model weights. They show up consistently, follow robots.txt (usually), and don't care about your content being "fresh." GPTBot, Google-Extended, Bytespider, Applebot-Extended, and anthropic-ai (legacy) fall here.

Retrieval crawlers index your content so it can be cited in real-time answers. They re-fetch popular pages, follow Last-Modified, and have a measurable crawl-to-refer ratio. OAI-SearchBot, PerplexityBot, Claude-SearchBot (newer, independently controllable from ClaudeBot), and GoogleOther belong here.

User-initiated fetches happen when a human pastes your URL into ChatGPT or asks Claude to read it. These are ChatGPT-User, Perplexity-User, and Claude-User. They don't follow robots.txt (per OpenAI's revised crawler docs, because they're user actions, not crawls).

I had been treating all of these as the same animal. They are not. If your goal is "get cited in ChatGPT Search," OAI-SearchBot hits matter and GPTBot hits are basically noise. If your goal is "be in the training set of the next model," it's exactly inverted: training-crawler hits like GPTBot matter and retrieval hits are the noise.
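The three buckets translate directly into a lookup. This is a minimal sketch: the token lists come from the paragraphs above, while the "mixed" bucket for ClaudeBot (listed as training plus retrieval in the table) and the "other" fallback are assumptions of this example, not categories from any vendor's documentation.

```python
BUCKETS = {
    "training": ["GPTBot", "Google-Extended", "Bytespider", "Applebot-Extended", "anthropic-ai"],
    "retrieval": ["OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "GoogleOther"],
    "user": ["ChatGPT-User", "Perplexity-User", "Claude-User"],
    "mixed": ["ClaudeBot"],  # assumption: the table lists it as training + retrieval
}

def classify(user_agent):
    """Map a raw User-Agent header to a bucket via case-insensitive substring match."""
    ua = user_agent.lower()
    for bucket, tokens in BUCKETS.items():
        if any(token.lower() in ua for token in tokens):
            return bucket
    return "other"

print(classify("Mozilla/5.0 (compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot)"))
```

Substring matching is deliberate here, because these bots usually embed their token inside a longer browser-style User-Agent string.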

Who actually obeys robots.txt

Here's the part that flipped my view of robots.txt.

On kenimoto.dev, I had a Disallow: /api/ rule. Over 30 days:

GPTBot: 0 hits to /api/. Compliant.
Google-Extended: 0 hits to /api/. Compliant.
ClaudeBot: 0 hits to /api/. Compliant.
OAI-SearchBot: 3 hits to /api/. Borderline. Possibly cached before the rule, possibly the revised compliance language is doing something subtle.
PerplexityBot: 41 hits to /api/ in one 90-second burst. Not compliant on this run.

Forty-one hits is not a sample of one. The 90-second burst pattern matched a public report where Perplexity was observed ignoring User-agent: PerplexityBot blocks when answering an active user query. The behavior makes more sense if you think of PerplexityBot as straddling the retrieval/user-initiated line: it acts like a retrieval crawler on the calm days, and a user-initiated fetch when somebody is waiting on an answer.

The takeaway I wrote down: robots.txt is a self-reported boundary. Three of five top crawlers honored it cleanly on my data. One was iffy. One did whatever it wanted when a human was on the other end. Plan accordingly.

Three LLMO signals you can derive from this

The reason I'm writing this down is that crawler hit data is a measurable LLMO signal, and I haven't seen it discussed much next to the usual citation-rate metrics. Three things I now look at every week:

1. Crawler diversity. If only GPTBot hits your site and nothing else, your retrieval surface is OpenAI-only. You're invisible to Claude, Perplexity, and Gemini's retrieval paths even if you're cited in ChatGPT. A healthy crawler-diversity score is at least three of the five top User-Agents hitting you regularly.

2. Retrieval-to-training ratio. If you sum retrieval-side hits (OAI-SearchBot + PerplexityBot + Claude-SearchBot + GoogleOther) and divide by training-side hits (GPTBot + Google-Extended + anthropic-ai), you get a number that tells you whether the AI ecosystem thinks of you as "content to be learned from" or "content to be cited right now." Mine sits at 0.81. Anything below 0.5 means your content isn't fresh enough to be retrieved in real time. Anything above 1.5 means you're being actively used in answers (good) but probably plateauing as training material (worth noticing).

3. llms.txt fetch rate. Of the five top crawlers, only PerplexityBot and ClaudeBot fetched /llms.txt on my sites during the 30-day window. GPTBot, OAI-SearchBot, and Google-Extended never did. This roughly matches what other operators have observed and is a load-bearing detail when you're deciding whether llms.txt is worth maintaining. (Short answer: yes, but mostly for the two crawlers that read it.) The llmoframework.com retrieval signals page goes deeper on this.
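The retrieval-to-training ratio from signal 2 can be sketched as follows. The crawler groupings and the 0.5 and 1.5 cut-offs come from the text above; the hit counts in the example are invented for illustration and are not my log data, and the label strings are just shorthand.

```python
RETRIEVAL = ("OAI-SearchBot", "PerplexityBot", "Claude-SearchBot", "GoogleOther")
TRAINING = ("GPTBot", "Google-Extended", "anthropic-ai")

def retrieval_to_training(hits):
    """hits maps a crawler token to its hit count for the window."""
    retrieval = sum(hits.get(ua, 0) for ua in RETRIEVAL)
    training = sum(hits.get(ua, 0) for ua in TRAINING)
    if training == 0:
        raise ValueError("no training-side hits in this window")
    return retrieval / training

def interpret(ratio):
    """Apply the two thresholds described above."""
    if ratio < 0.5:
        return "rarely retrieved in real time"
    if ratio > 1.5:
        return "actively used in answers"
    return "in between"

# Invented counts, for illustration only.
example_hits = {
    "OAI-SearchBot": 200, "PerplexityBot": 300, "Claude-SearchBot": 60, "GoogleOther": 40,
    "GPTBot": 600, "Google-Extended": 190, "anthropic-ai": 10,
}
ratio = retrieval_to_training(example_hits)
print(round(ratio, 2), interpret(ratio))  # 0.75 in between
```

Crawlers missing from the dict count as zero, so the same function works on a site that only a subset of these bots visits.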

How to actually pull this data

This is the part I always wanted to read and never quite found, so:

Cloudflare (free plan). The AI Crawl Control dashboard (formerly AI Audit) shows top AI crawler User-Agents out of the box. For raw logs, you need Logpush, which is paid. On free, the easiest substitute is enabling AI Crawl Control and filtering Analytics by known AI User-Agents. Free won't give you per-request paths, but it gives you counts and trends.

Vercel. Project → Logs → filter by User-Agent contains "Bot". Vercel keeps 30 days of edge logs on the Pro plan. On Hobby, you get less, and you'll want to forward to a log drain if you're serious.

Netlify / self-hosted Nginx. Just grep the access log:

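# -o prints only the matched bot name, one per line
grep -oE "GPTBot|ClaudeBot|PerplexityBot|OAI-SearchBot|Google-Extended" \
  /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn

This gives you crawler counts. Because -o prints only the matched bot name, the count does not depend on where the User-Agent lands when the line is split on spaces. For the URL ranking, drop -o and pipe the matching lines through awk '{print $7}' before the sort; $7 is the request path in Nginx's default combined format. The exact field number depends on your log format; check with awk '{print NF}' on one line to count.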

What I changed after looking at all this

Three concrete changes after the 30-day window:

I split my robots.txt to allow OAI-SearchBot and Claude-SearchBot (retrieval, good for citations) while keeping Disallow: /api/ strict for GPTBot (training, no upside for me on those endpoints).
I added a Last-Modified header to every blog post route, because retrieval crawlers use it to decide re-fetch frequency and Vercel wasn't sending one by default.
I started tracking the retrieval-to-training ratio weekly in a spreadsheet. Two weeks in, the only useful insight is that the number is stable. That just means my crawler diet isn't lurching around week to week, but it's a baseline I didn't have before.
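Before deploying a split like the one in the first change, it helps to test the rules offline. Below is a sketch using Python's standard urllib.robotparser. The robots.txt body is one possible reading of that change, not my actual file, example.com is a placeholder, and Python's matcher is not guaranteed to behave exactly like each crawler's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: retrieval bots allowed, GPTBot kept out of /api/.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /api/
"""

def build_parser(text):
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
checks = [
    ("GPTBot", "/api/status"),
    ("GPTBot", "/blog/post"),
    ("OAI-SearchBot", "/blog/post"),
]
for agent, path in checks:
    print(agent, path, parser.can_fetch(agent, "https://example.com" + path))
```

As the earlier section shows, passing this check only tells you what a compliant crawler should do, not what every crawler will do.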

I expected the logs to confirm what I already believed about LLMO. They mostly didn't. Citation isn't the only signal worth watching. Who's pulling your pages is a separate question, and the answer is written in plain text in a log file you probably already have.

If you want the full measurement frame (citation tracking, GA4 referrals, and server-log crawler analysis as parts of one system) the book is here: LLMO: AI Search Optimization. Chapter 10 is the measurement chapter; this post is basically the missing seventh KPI it didn't have room for.

I Added a 4th Agent That Audits My Other Agents. It Caught My Strategist Procrastinating for 3 Weeks.

Ken Imoto — Mon, 25 May 2026 22:00:01 +0000

I built a three-layer agent harness and called it "autonomous." Observer collected the data. Strategist picked the theme. Marketer wrote the article. They all followed strategy.md, the file that holds my rules. The cron fired every Monday at 09:00 and the articles showed up by lunch. I felt very clever about it.

Then I read my own Strategist logs across three weeks and noticed something. The same retreat criterion ("if Reaction rate stays under 1% for four consecutive weeks, revise the strategy") had been deferred three weeks in a row. Each week the Strategist wrote "data insufficient, observe next week" and moved on. The rule existed. The data existed. The rule never fired.

The three-layer harness couldn't catch this because the three layers were doing exactly what strategy.md told them to do. The bug was not in the agents. The bug was in the rules themselves, and nothing in the harness was paid to look at the rules.

I added a 4th layer called Evolver. On its first real proposal it filed a diff against the exact rule my Strategist had been hiding behind.

The three layers were not the autonomous part

The architecture I had been calling autonomous looked like this. Observer ran daily and dumped GA4 numbers into article-performance.jsonl. Strategist ran every Monday morning, read strategy.md, and picked five themes for the week. Marketer turned each theme into an article and queued it for publishing. Three roles, three cron jobs, predictable behavior.

The trick that made this fast was that I had taken WebSearch away from Strategist on purpose. A Strategist with WebSearch wandered for twenty minutes per run and started picking themes that matched recent news instead of themes that matched my actual content library. Stripping WebSearch dropped the cycle from twenty minutes to three. That post was about making Strategist faster. This one is about making it accountable.

The thing none of those three layers could do was rewrite strategy.md. They read it every Monday and obeyed it. If the rule was wrong, they obeyed a wrong rule. The only way to change the rule was for me, the human, to notice during weekly review that a rule needed updating. And I was the bottleneck. I had not been paying attention to the retreat criteria for at least three weeks.

What the procrastination looked like in the logs

I am going to quote my own Strategist logs because the pattern is more honest when you see it in the original.

From the log dated three weeks before I added the Evolver:

Reaction rate continues at 0% for the majority of articles. Title strategy has shifted to first-person and numerical framing. Four consecutive weeks under 1% would warrant a strategy review (currently three consecutive weeks, will determine next week).

The next week:

Reaction rate has not yet reached four consecutive weeks under 1%, but weekly trend data is insufficient. Observe next week.

This is the entire failure mode in two sentences. The rule said "four consecutive weeks." The Strategist had three consecutive weeks of data under 1%. Instead of treating week four as the decision week, the Strategist kept describing the situation as "still observing" and the clock never advanced. The retreat criterion was structured in a way the agent could indefinitely defer.

When I went and computed the actual numbers from article-performance.jsonl myself, the picture was even uglier. Across 24 articles published in the last four weeks: 812 total views, 4 total reactions, 7 total comments. Reaction rate: 0.49%. Half the threshold. Engagement rate (reactions plus comments): 1.35%. The rule should have triggered weeks ago. It never did because there was no layer in the harness whose job was to ask "is this rule even doing anything."
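The arithmetic is small enough to show. A minimal sketch, assuming each row of article-performance.jsonl has been loaded into a dict with views, reactions, and comments keys (those field names are hypothetical); the totals are the ones quoted above.

```python
def engagement(rows):
    """Return (reaction_rate, engagement_rate) aggregated over all rows."""
    views = sum(r["views"] for r in rows)
    reactions = sum(r["reactions"] for r in rows)
    comments = sum(r["comments"] for r in rows)
    if views == 0:
        return 0.0, 0.0
    return reactions / views, (reactions + comments) / views

# Four-week totals from the paragraph above, collapsed into a single row.
totals = [{"views": 812, "reactions": 4, "comments": 7}]
reaction_rate, engagement_rate = engagement(totals)
print(f"{reaction_rate:.2%} {engagement_rate:.2%}")  # 0.49% 1.35%
```

Computing the aggregate over all articles, rather than averaging per-article rates, is a choice: it keeps one zero-view article from distorting the number.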

The 4th layer: what an Evolver is

So I added a 4th cron job. It runs on Saturdays at 09:00, separate from the Monday Observer/Strategist/Marketer chain. Unlike the other three, it has WebSearch enabled. Its job is not to write articles. Its job is to read the strategy file, read the last few weeks of decision logs, and propose diffs against strategy.md.

Each proposal is one file: domains/<name>/data/evolution/EVO-NNNN.md. The Evolver fills in five sections.

Observation — what it saw in the data
Proposal — the rule change in plain prose
Rationale — internal data and external references that justify the change
Expected impact — what should improve if applied
Diff — a literal diff block against strategy.md

The diff block is the load-bearing part. The Evolver does not just write English suggestions. It writes the exact patch that would land in the repo. A small CLI called harness-evolve.sh knows how to extract the diff block, run git apply --check, and commit it with the proposal as the body. No LLM is involved in the apply step. The LLM proposes, the shell applies.

That separation is on purpose. The proposal is creative. The apply is mechanical. When the apply step is mechanical you can trust it to either succeed cleanly or fail loudly. There is no "the agent tried to apply the patch and something weird happened in the middle."
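The mechanical half can be sketched in Python rather than shell. This is not harness-evolve.sh: it assumes the proposal file wraps the patch in a fenced block tagged diff, and extract_diff and applies_cleanly are hypothetical names. The only external command used is git apply --check, which reads a patch from standard input when no file is given and does not modify the working tree.

```python
import re
import subprocess

FENCE = "`" * 3  # built this way so the example contains no literal fence
DIFF_BLOCK = re.compile(re.escape(FENCE) + r"diff\n(.*?)" + re.escape(FENCE), re.DOTALL)

def extract_diff(proposal_text):
    """Return the body of the first diff-tagged fenced block in a proposal."""
    match = DIFF_BLOCK.search(proposal_text)
    if match is None:
        raise ValueError("proposal has no diff block")
    return match.group(1)

def applies_cleanly(diff_text, repo_dir):
    """Dry run only: True if git reports the patch would apply in repo_dir."""
    result = subprocess.run(
        ["git", "apply", "--check"],
        input=diff_text,
        text=True,
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0
```

Because the check either exits zero or does not, there is no partial state to reason about, which is the point of keeping the LLM out of this step.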

EVO-0003 caught my Strategist procrastinating

The Evolver's third real proposal, EVO-0003, was the one I described above. The proposal is on disk and I am reading it back as I write this.

The observation section quoted both of my Strategist logs, the "three consecutive weeks, will determine next week" one and the "data insufficient, observe next week" one. Then it computed the engagement rate from article-performance.jsonl and showed that the threshold had been breached for at least four weeks. Then it argued that the original rule was bad in three ways:

The formula was not specified. Was "Reaction rate" per-article or aggregate? My Strategist could plausibly compute either, which is why it had been deferring.
The trigger condition "four consecutive weeks" was ambiguous when weekly data was thin.
The action on trigger ("propose a title and angle revision") was abstract enough that the Strategist could fulfill it with a single sentence and move on.

The proposal replaced the rule with this:

Engagement rate = (sum of reactions + comments over the last 4 weeks of articles) / sum of views. The Strategist must compute this every week and log it. If under 1.5% for four consecutive weeks, next week's 5 articles must be at least 4 titles in the "number + first person + failure narrative" form. Abstract titles are forbidden.

It is a 20-line patch. The diff is below the prose in the proposal file. I approved it via /harness-evolve approve EVO-0003 at 14:04 on a Tuesday afternoon. The shell ran git apply --index against strategy.md, made the commit, updated the proposal's frontmatter to status: applied, and sent me a Telegram note. The next Monday's Strategist ran with the new rule and computed an engagement rate of 1.35% in the log without prompting. The "data insufficient" sentence stopped appearing.

The thing I want to be honest about is that the Strategist hadn't been malicious. It hadn't been broken either. It had been a perfectly competent agent following a rule that was structured to allow deferral. That is a failure of the rule. The Evolver's job is to detect rule failures, because nothing else in the harness was structured to.

The Safety boundary, because Self-Evolving Agents are not toys

The minute you say "an agent that rewrites the harness," somebody in your head should be raising their hand and asking what stops it from rewriting itself into a paperclip optimizer. Several things, on purpose.

The Evolver cannot touch the kinds of decisions that have to remain mine. Adding or removing a domain. Switching languages. Changing the quality bar for writing. Anything involving licensing, author identity, or security. The .env file, the credentials directory, the publish triggers. If any of these were on the table I would not let the Evolver run unattended at all.

Inside the territory it can touch, three numeric limits keep it from running away.

Diff size cap: 20 lines per proposal. A proposal larger than that has to be split or escalated.
Two proposals per week per domain. If the Evolver wants to propose more, the third is held until next Saturday.
Three consecutive rejects on the same theme triggers an automatic mute. The Evolver stops re-pitching the same idea after I have said no three times.

The last one is the part I think is undersold in the broader "self-improving agent" literature. The interesting signal in a reject log is not the proposal, it is the reason. "MCP is still the main revenue genre, we cannot drop it" is the kind of business context that has never been written into strategy.md. After three weeks of rejecting MCP-cut proposals with that reason, the Evolver stops proposing them. Implicit founder context becomes explicit harness behavior, just by accumulating reasons-for-reject.
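The three limits reduce to a small gate that runs before a proposal reaches me. This is a sketch under assumptions: the constants come from the list above, but counting only added and removed lines as "diff size" is this example's interpretation, and gate and its return labels are hypothetical.

```python
MAX_DIFF_LINES = 20
MAX_PROPOSALS_PER_WEEK = 2
MAX_CONSECUTIVE_REJECTS = 3

def changed_lines(diff_text):
    """Count added/removed lines, skipping the ---/+++ file headers."""
    return sum(
        1
        for line in diff_text.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )

def gate(diff_text, proposals_this_week, consecutive_rejects_on_theme):
    """Decide what happens to a new proposal for one domain."""
    if consecutive_rejects_on_theme >= MAX_CONSECUTIVE_REJECTS:
        return "muted"  # stop re-pitching a theme rejected three times in a row
    if proposals_this_week >= MAX_PROPOSALS_PER_WEEK:
        return "held"  # wait for next Saturday's run
    if changed_lines(diff_text) > MAX_DIFF_LINES:
        return "split_or_escalate"
    return "ok"
```

The mute check runs first so that a theme rejected three times is dropped even in a week with spare proposal budget.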

What you need before adding a 4th layer

I think there are three real prerequisites before adding an Evolver-style layer to your own setup. Without them, the 4th layer is just noise.

First, the three existing layers have to produce decision logs that another agent can read. If your Strategist's output is "ran successfully, picked themes," there is nothing for the Evolver to find. The procrastination only showed up because my Strategist had been writing structured logs with phrases like "currently three consecutive weeks, will determine next week." Logs that include the agent's reasoning in prose are what make audit possible.

Second, the rules themselves have to be in version control as text. strategy.md is a checked-in markdown file because the Evolver needs to produce a diff block that git apply can land. If your rules live in a database, a SaaS dashboard, or a thousand-line JSON config, the patch model breaks down. Plain markdown in git is the cheap path.

Third, you need a human approval channel that does not require the human to read the whole proposal every time. My Telegram notification has the EVO-ID, the title, and a one-line link to the file. I open the file only when the title makes me curious. Most of the time I either approve fast or reject with a short reason. If approval costs me ten minutes per proposal, I will stop running the Evolver. If it costs me thirty seconds, I will run it indefinitely.

What about not adding a 4th layer

If you do not want a 4th layer, you can absolutely get most of the benefit by running a weekly human review with a specific question. Not "how are the agents doing." That is what I had been doing, and it did not catch the procrastination. The specific question is: "did any retreat criterion in strategy.md actually fire this week, and if not, why not."

Sit with that question for ten minutes every Friday. You will catch what I was missing for three weeks. The Evolver is, more than anything else, a forcing function for that question. It does not have to be an agent. It can be a calendar reminder.

I happen to like running it as an agent because the proposal artifacts pile up in version control and become a record of how my rules have evolved. EVO-0001 through EVO-0004 form a small history of "things I thought were good ideas, things I thought were bad ideas, and why." That history is useful when I am writing next year's strategy.md from scratch.

What I have not built yet

The current Evolver only audits one domain at a time. Across my four domains (devto, qiita, zenn, kenimoto-dev) I have written different versions of strategy.md for each, and most of them have similarly structured retreat criteria. A cross-domain Evolver could notice that the same rule structure has been failing in two domains and propose a unified fix. I have not built it. It is on the list.

The other thing on the list is the obvious recursion question. Who audits the Evolver. The current answer is "I do, every approve/reject is a human signal." The longer answer is "I do not know yet." If the Evolver's proposals start looking systematically biased — say, always proposing tighter thresholds, or always proposing to drop the same genre — that bias is real and I should add a 5th layer that watches the 4th. I have not seen it yet. I might not until EVO-0050 or so. I want the bias to be obvious before I add another layer just to feel safer.

For now: three agents that follow rules, one agent that audits the rules, and one human who approves the audit. That is the smallest harness I have found that catches its own procrastination.

If you want the full Harness Engineering picture — the 6 building blocks, the AGENTS.md/CLAUDE.md/hooks patterns, and the Self-Evolving Agent chapter that grounds this article — that is in the book.

Harness Engineering: From Using AI to Controlling AI

I Told Claude Code to Do TDD. It Wrote the Test AFTER the Code 6 Out of 10 Times.

Ken Imoto — Mon, 25 May 2026 11:00:00 +0000

My CLAUDE.md had a section called ## TDD First. Six lines. Very clear. I had spent twenty minutes drafting it. Then I ran a 30-day audit of my own commits and discovered that across the features I had asked Claude Code to TDD, the test file was committed after the source file 6 out of 10 times.

Not "the test failed first, then I fixed it." The test file did not exist at the moment the source file got committed.

This is the story of how I caught it, why it kept happening, and the two-part fix (prompt plus a PreToolUse hook) that finally pushed Claude into a real red-green-refactor cycle. It is the third installment in what is becoming an accidental series on Claude doing things confidently and wrong. The first was Claude hiding bugs three times in a row. The second was refusing to write specs until the code went sideways three times. This one is about TDD, and the pattern is identical: the model agrees, the model proceeds, the model skips the part of the prompt that would cost it tokens.

The 30-day audit

The audit was accidental. I had been writing about debugging habits and wanted to see whether my own commit history was consistent with what I was preaching. So I pulled git log --name-only --pretty=format:'%h %ai %s' for the last 30 days on a project I had been driving with Claude Code, and grouped the commits by feature. Ten features. For each one, I noted the timestamp of the first commit that touched the source file, and the timestamp of the first commit that touched its test file.

Six features out of ten had the source file committed first. The gap ranged from 90 seconds to 23 minutes. In two cases the test file was committed in the same commit as a later round of fixes, after the source had already been shipped to a feature branch. In one case there was no test file at all, only a # TODO: add tests next to the function.
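
A minimal sketch of that grouping step, assuming the text output of the git log command above. The file paths in the sample are hypothetical, and this is an illustration rather than the exact script behind the audit.

```python
import re
from datetime import datetime
from typing import Dict

# Matches the header line produced by --pretty=format:'%h %ai %s',
# e.g. "a1b2c3d 2026-05-01 10:00:00 +0900 add parser".
HEADER = re.compile(
    r"^[0-9a-f]{7,40} (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} [+-]\d{4}) "
)


def first_touches(log_text: str) -> Dict[str, datetime]:
    """Return the earliest commit time that touched each file path."""
    first: Dict[str, datetime] = {}
    current = None
    for line in log_text.splitlines():
        match = HEADER.match(line)
        if match:
            current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S %z")
        elif line.strip() and current is not None:
            path = line.strip()
            if path not in first or current < first[path]:
                first[path] = current
    return first


def ordering(first: Dict[str, datetime], source: str, test: str) -> str:
    """Classify one feature by which file was committed first."""
    if test not in first:
        return "no test"
    if source not in first:
        return "no source"
    if first[test] < first[source]:
        return "test first"
    if first[test] == first[source]:
        return "same commit"  # identical timestamps, almost always one commit
    return "source first"


# Hypothetical two-commit history, newest first as git log prints it.
sample_log = """b2c3d4e 2026-05-01 10:23:00 +0900 add parser tests
tests/parser_test.py

a1b2c3d 2026-05-01 10:00:00 +0900 add parser
src/parser.py
"""

print(ordering(first_touches(sample_log), "src/parser.py", "tests/parser_test.py"))
# prints "source first"
```

Comparing first-touch timestamps per file is what separates "test committed later" from "test never committed", which a plain commit count would blur together.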

I had been telling Claude "TDD this" every single time. I had a ## TDD First section in CLAUDE.md. I had even pasted the red-green-refactor sequence at the top of the prompt for the more complex features. And six times out of ten, it had cheerfully written the implementation, then either written the test afterward or skipped it entirely.

I am not blaming the model for being lazy. The model was doing exactly what it was trained to do.

Why next-token prediction defaults to implementation-first

This is the part that took me a while to actually understand. The model is not deciding "I will do TDD" or "I will not do TDD" the way a human engineer might decide. It is predicting the next most plausible token given the context. And in its training data, the overwhelming majority of "user asks for feature X" responses look like "here is the function that does X," optionally followed by "and here is a test." The "test first, then implementation, with the test failing in between" sequence is rare in public repositories because humans rarely commit the red phase as its own commit. We commit the green phase. So the model never built a strong prior for the red-first ordering.

Several people in the Claude Code community have pointed at the same thing. The aihero.dev TDD skill writeup puts it as: when the test writer and the implementer share the same context window, the implementer's thinking leaks into the test writer's, and you get tests that conveniently pass on the first run. That is not TDD. That is "tests retrofitted to pass." The alexop.dev red-green-refactor loop post argues that the only reliable fix is to force the cycle from outside the model, with hooks or skills that the agent cannot override mid-stride.

The other thing I keep seeing in community writeups, including the BSWEN Claude Code TDD skill walkthrough, is the same Anthropic guidance I had been ignoring: Claude will sometimes alter the test to make it pass rather than fix the implementation. Committing the test before the implementation gives you a diff to look at if that happens. I was not doing that either.

So the model had a weak prior for test-first, and I had a weak workflow that did nothing to compensate. Six out of ten makes a lot of sense in retrospect. The surprising thing is that it was as low as six.

What I tried first that did not work

Before the hook, I tried prompt engineering harder. This is what most people try, and it gets you most of the way without getting you there.

Attempt 1 — ## TDD First in CLAUDE.md. Already had this. Six out of ten ignored it. The header was too generic; the model saw it as a vibe, not a constraint.

Attempt 2 — explicit red-phase instruction in the prompt. I started pasting "Write a failing test for [feature] in tests/X_test.py. Do not write the implementation yet. Run the test and confirm it fails before proceeding." This got me to maybe 8 out of 10. Better, but 2 out of 10 I would catch it cheating, usually by writing the test in a way that mocked out the part that would actually have failed.

Attempt 3 — separate prompts for red and green. Two messages. First message: write the failing test, stop, run it, show me the failure. Second message, only after I had eyeballed the failure: now write the implementation. This was the first time I got something that smelled like real TDD. The problem was that it required me to physically be at the keyboard for two turns, and if I context-switched away mid-feature, the next Claude session would happily merge the two steps back into one.

The lesson from Attempt 3 is that prompts are advice. The model can ignore advice. To get TDD enforced, I needed something the model could not ignore. That something is a hook.

The PreToolUse hook that broke the loop

Claude Code's hook system lets you intercept tool calls before they execute. A PreToolUse hook on Write or Edit gets the file path the model is about to touch. If the model is trying to write to src/foo.py and there is no tests/foo_test.py that currently fails, the hook can exit 2, which Claude Code treats as "this tool call is denied, here is the reason, try again."

This is the smallest version that worked for me, on a Python project with pytest:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{
          "type": "command",
          "command": "python3 .claude/hooks/require-failing-test.py"
        }]
      }
    ]
  }
}

The script reads the file path from the tool call payload, maps src/X.py to tests/X_test.py, and checks whether that test file exists. If it does, the script runs pytest tests/X_test.py --no-header -q. If the test file exists and currently fails, the hook lets the edit through. If the test file is missing or already passing, the hook exits 2 and blocks the edit with a message like "a failing test must exist in tests/X_test.py before src/X.py can be modified. Write the failing test first." That message lands in the model's next-turn context. It does not have a choice.
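
A minimal sketch of such a hook script, not the exact file from my project. It assumes Claude Code passes the tool call as JSON on stdin with the target path under tool_input.file_path (worth checking against the current hooks documentation), a flat src/ to tests/ layout, and that anything outside src/ is exempt so the test file itself can still be written. It blocks when the test file is missing or already passing, which is what the "a failing test must exist" message implies. The escape-hatch comment is left out for brevity.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import Optional, Tuple

ALLOW = 0
BLOCK = 2  # exit code 2 from a PreToolUse hook denies the tool call


def expected_test_path(file_path: str) -> Optional[Path]:
    """Map src/X.py to tests/X_test.py; return None for files the rule ignores."""
    path = Path(file_path)
    if path.suffix != ".py" or path.parent.name != "src":
        return None  # tests, docs, configs: not covered by the rule
    return path.parent.parent / "tests" / f"{path.stem}_test.py"


def decide(test_exists: bool, pytest_returncode: Optional[int]) -> Tuple[int, str]:
    """Allow the edit only when a test file exists and pytest reports a failure."""
    if not test_exists:
        return BLOCK, "No test file yet. Write the failing test first."
    if pytest_returncode == 0:
        return BLOCK, "The test already passes. Write a failing test first."
    return ALLOW, ""


def main() -> int:
    payload = json.load(sys.stdin)
    file_path = payload.get("tool_input", {}).get("file_path", "")
    test_path = expected_test_path(file_path)
    if test_path is None:
        return ALLOW
    returncode = None
    if test_path.exists():
        returncode = subprocess.run(
            [sys.executable, "-m", "pytest", str(test_path), "--no-header", "-q"],
            capture_output=True,
        ).returncode
    code, reason = decide(test_path.exists(), returncode)
    if code == BLOCK:
        # stderr is what gets fed back to the model as the denial reason.
        print(f"{reason} ({test_path} guards {file_path})", file=sys.stderr)
    return code
```

To wire it up as the hook, end the file with `raise SystemExit(main())`. Note that any non-zero pytest exit counts as "failing" here, including collection errors, so a test file with a syntax error would also unlock the edit.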

There are edge cases. The test file might pass for the wrong reason; the hook does not catch that. The mapping from source to test path is project-specific; mine is hardcoded. And I have an escape hatch, a magic comment # tdd-bypass: refactor on the first line, for refactor commits where you genuinely want to edit without a new failing test, because refactor is supposed to preserve behavior, not add it. The hook respects the escape hatch, but it logs every use of it to a file I review at the end of the week. The first week, my escape-hatch log had 22 entries. The second week it had 4. That number going down is the whole point.

What the 30-day rerun looked like

I ran the same audit 30 days after the hook went in. Same project, same kind of features, same prompt style. The numbers:

Test file committed first: 9 of 10 (up from 4 of 10)
Test file committed in same commit as source, but written first per the file-modification timestamps: 1 of 10
Test file committed after source: 0 of 10

The single feature where the test went in the same commit as the source was a 12-line config helper that I had legitimately bypassed with the magic comment. So in terms of TDD being followed when the rule applied, the number is 10 of 10.

I do not want to claim that the hook turned Claude into a disciplined TDD practitioner. It did not. The model still writes implementations that look suspicious from a "test was designed around the implementation" perspective some of the time. What the hook gives me is ordering: a failing test must exist before the source can be touched. That alone closes the loop where Claude was retrofitting tests around code that was already shaping the test's assertions. The Anthropic guidance on this, captured by several community writeups including the DataCamp best practices roundup, is that ordering is the load-bearing constraint and everything else is bonus.

When to skip TDD entirely

This is the part I should have figured out before instrumenting any of this. There are tasks where TDD is the wrong tool. Refactors that should be a no-op behaviorally. One-off scripts I am going to throw away in 20 minutes. Pure data migrations. UI tweaks where the test would just be a snapshot of itself. Forcing TDD on these tasks does not make the code better; it makes the workflow heavier with no payoff.

The escape hatch exists for these. The week-end review of the escape-hatch log is where I notice if I am abusing it. "I bypassed TDD because the test was hard to write" is a smell. "I bypassed TDD because the code was a snapshot test of CSS class names" is fine. The audit, not the rule, is what keeps the workflow honest.

The takeaway

My CLAUDE.md still says ## TDD First. I left it there for vibes. It was never going to be the part that did the work. The hook is the part that does the work, and the audit is the part that decides whether the hook is still tuned right.

If you want the full picture of how to layer prompts vs hooks vs MCP servers (when to use which layer for which kind of rule), I wrote it down in Practical Claude Code. The hooks chapter is the one I keep coming back to.

Your New Domain's First Week of GA4 Is a Lie: 4 Days of Raw Data from a Launch

Ken Imoto — Sat, 23 May 2026 13:00:01 +0000

Four days after registering a new domain, I opened GA4 and saw 65 page views / 34 users / 9 countries.

For a brief, build-in-public moment, I almost cheered. Then I looked at the breakdown. The US had 17 sessions averaging 4.9 seconds of session duration. France, Poland, South Korea, India, Singapore: each between 0 and 1.4 seconds. Japan alone sat at 751 seconds (over 12 minutes): an outlier so loud it should be illegal.

The domain is kaoriq.com, registered on 2026-05-02, a personality-quiz × fragrance e-commerce site I'm building. As of today (May 5), it has fewer than 20 articles. Doing the back-of-the-envelope math, that page-view distribution is physically impossible to come from real humans.

This post walks through how I read the first week of GA4 data on a new domain as "me + a crawler army", with the actual numbers exposed. For anyone running GA4 on a new project, or anyone who registered a domain this weekend.

The raw data: past 14 days (4 days of real activity)

Numbers first, no spin.

Overall

Metric	Value
Sessions	37
Page Views	65
Total Users	34
New Users	34
Avg Session Duration	104.1 s
Bounce Rate	80%

By Country

Country	Sessions	PV	Avg Duration (s)
Japan	5	33	751.0
United States	17	17	4.9
Canada	4	4	1.3
France	4	4	1.4
Poland	2	2	0.0
South Korea	2	2	0.0
(not set)	1	1	0.1
India	1	1	0.0
Singapore	1	1	0.0

Daily

Date	Sessions	PV	Users
2026-05-02 (registration day)	17	40	14
2026-05-03	6	11	6
2026-05-04	12	12	12
2026-05-05	2	2	2

At a glance, "not bad for week one" is a tempting read. But this dataset contains a 751-second Japanese reader living next door to eight other rows that each average under five seconds. The middle is missing. That gap is the whole tell.

Five signals, beaten in parallel

I never call bot traffic on a single signal. To avoid false positives, I always cross-check five axes at once.

Signal	Bot pattern	Human pattern	kaoriq actual	Verdict
Session duration	0–5 s	30 s – several min	US 4.9s, FR 1.4s, KR 0s	Bot
Bounce rate	90–100%	40–70%	80%	Bot
PV / Session	1.0 (one page, gone)	1.5–3.0	US: 17/17 = 1.0	Bot
Geographic anomaly	Random countries unrelated to content	Concentrated in target geo	EN/JA only, yet PL/IN/SG	Bot
Time-series spike	Massive day-one for new domains	Gradual ramp	40 PV on day of registration	Bot

Why a single signal lies

"80% bounce, must all be bots, right?" Not so fast.

Duration alone: A reader who tabs your post and walks away for lunch racks up 30+ minutes. "Deeply engaged" and "abandoned tab" are indistinguishable.
Bounce rate alone: A landing page that perfectly answers the question gets a 100% bounce from satisfied humans. Excellence and bots both score the same.
Geography alone: A viral overseas tweet legitimately produces multi-country traffic. Weak on its own.

You only get to call "bot" with confidence when all five signals lean the same direction simultaneously.

The bimodal distribution was the smoking gun

The real reason this verdict held in kaoriq's case is the shape of the duration distribution.

Japan: 5 sessions / 751 s average
Everywhere else: 0–5 s

If the traffic were genuinely human, session duration should spread more evenly across the 10–180 second band: "bounced after the title (10s)," "read the lede (40s)," "made it to the end (180s)" forming a gradient.

But kaoriq's distribution is bimodal with the middle scooped out. The honest reading: only "me (long sessions, testing the site)" and "crawlers (instant exits)" exist. Nothing in between.

Conversely, a healthy distribution would look like "Japan 100 sessions / 60s, US 50 sessions / 45s, Canada 20 sessions / 30s": durations spread normally. That'd be a real human traffic signature.
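
A rough sketch of the bimodal check, using the rows from the By Country table. The 5-second edge comes from the bot pattern in the signals table, while the 300-second cutoff for "suspiciously long" is an assumption to tune per site. Country averages hide the spread inside each row, so treat this as a first pass rather than a session-level analysis.

```python
from typing import Dict, List, Tuple

# (country, sessions, average session duration in seconds) from the By Country table.
ROWS: List[Tuple[str, int, float]] = [
    ("Japan", 5, 751.0),
    ("United States", 17, 4.9),
    ("Canada", 4, 1.3),
    ("France", 4, 1.4),
    ("Poland", 2, 0.0),
    ("South Korea", 2, 0.0),
    ("(not set)", 1, 0.1),
    ("India", 1, 0.0),
    ("Singapore", 1, 0.0),
]

SHORT_MAX_S = 5.0   # upper edge of the "bot pattern" duration band
LONG_MIN_S = 300.0  # assumed cutoff for "suspiciously long"


def band_sessions(rows: List[Tuple[str, int, float]]) -> Dict[str, int]:
    """Count sessions whose country average falls in each duration band."""
    bands = {"short": 0, "middle": 0, "long": 0}
    for _country, sessions, avg_s in rows:
        if avg_s <= SHORT_MAX_S:
            bands["short"] += sessions
        elif avg_s >= LONG_MIN_S:
            bands["long"] += sessions
        else:
            bands["middle"] += sessions
    return bands


def looks_bimodal(bands: Dict[str, int]) -> bool:
    """True when both extremes are populated and the middle band is empty."""
    return bands["middle"] == 0 and bands["short"] > 0 and bands["long"] > 0


bands = band_sessions(ROWS)
print(bands, looks_bimodal(bands))
# prints {'short': 32, 'middle': 0, 'long': 5} True
```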

So how many real humans were there?

After all that beating, my estimate breaks down as:

Category	Estimated sessions	Notes
Me, testing the site	4–5	Most of Japan's 5 sessions, source of the 751s average
Crawlers (Googlebot / Bingbot / GPTBot / ClaudeBot / AhrefsBot, etc.)	27–30	US 17, plus the zero-second Europe & Asia rows
Actual organic human traffic	2–5	The remainder of Japan + a couple of US sessions

Of 37 sessions, at most 5 were real humans. That's the reality of week one for a new domain.

Why GA4 doesn't filter this for you

GA4 has a "known bots and spiders" auto-exclusion based on the IAB/ABC Spiders & Bots list. It catches classical crawlers but misses:

JavaScript-executing crawlers: GPTBot, ClaudeBot, PerplexityBot. These new generative-AI crawlers run JS, so the GA4 tag fires.
SEO-tool crawlers: AhrefsBot, SemrushBot, MozBot. High frequency, and they swarm new domains the moment they're discovered.
Headless-browser scrapers: Custom Puppeteer or Playwright bots are indistinguishable from a real Chrome session.

The week after a new domain registration is when this crawler army discovers the new IP. It calms down within 7–10 days as DNS propagates. But if you take week-one GA4 at face value, you'll make bad decisions.

Three annotations every new-project dashboard needs

Use "Engaged Sessions" as your primary metric. GA4 defines an engaged session as: ≥10s duration OR ≥2 PV OR a conversion event. Most of the bot army gets filtered here.
Always view session duration split by country. Looking at any single metric (sessions, PV) without the geo filter lets the crawler army masquerade as success.
Treat the first 30 days as a "noise phase." Real numbers only appear after social funnels, SEO, and content depth all line up.
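
The engaged-session rule from the first annotation is small enough to sketch directly. This follows the definition as stated above and is not a call into the GA4 API. The sample sessions are hypothetical.

```python
from typing import List, Tuple


def is_engaged(duration_s: float, page_views: int, key_events: int = 0) -> bool:
    """Engaged if the session ran >= 10 s, viewed >= 2 pages, or fired a conversion."""
    return duration_s >= 10 or page_views >= 2 or key_events > 0


def engaged_rate(sessions: List[Tuple[float, int, int]]) -> float:
    """Share of sessions that count as engaged, from (duration, views, events) tuples."""
    if not sessions:
        return 0.0
    return sum(is_engaged(*s) for s in sessions) / len(sessions)


# Hypothetical sample: one long multi-page visit and three instant single-page exits.
sample = [(751.0, 7, 0), (4.9, 1, 0), (0.0, 1, 0), (1.4, 1, 0)]
print(engaged_rate(sample))
# prints 0.25
```

On a crawler-heavy week the raw session count and the engaged count diverge sharply, which is why the engaged number is the safer headline metric.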

Closing: look at your own GA4 with this lens

A new domain's GA4 lies for the first 1–2 weeks. If your country breakdown is full of zero-second sessions from the US, Eastern Europe, and Southeast Asia: that's the crawler parade, not humans falling in love with your content.

The procedure is simple: beat with five signals → suspect bimodal distributions → swap the primary metric to Engaged Sessions. Doing this saves you from being whipsawed by early data.

Doubting GA4 is, in the end, a discipline for not making expensive mistakes. Beat the data before the data beats you.

This post is based on real data

Site: kaoriq.com (domain registered 2026-05-02, built with Astro v6 + Tailwind v4)
Period analyzed: 2026-04-22 → 2026-05-05 (4 days of actual activity)
Data source: GA4 Data API v1beta via Service Account

If you want the full LLMO playbook (how to think about AI crawlers, citations, and the measurement layer underneath the GA4 narrative):

LLMO: AI Search Optimization for Engineers

I Stacked 4 More Context Layers on Top of RAG. Sonnet Got 12% Better. Haiku Got 14% Worse.

Ken Imoto — Fri, 22 May 2026 13:00:01 +0000

I read a post about "Full Context Engineering" and immediately added four more layers to my RAG pipeline. Structured output instructions. Hierarchical document layout. Role definition. Few-shot examples. The whole buffet.

The improvement on Claude Sonnet was 12%.

The improvement on Claude Haiku was minus 14%.

I had just spent two weeks building scaffolding to make my smaller model worse at its job. If you have ever wallpapered a room and stepped back to discover you covered up the light switch, you know the feeling.

This post is about what those numbers actually mean for the way you spend your context engineering effort in 2026.

What I was measuring

I reused the benchmark I had built against my own book corpus for a previous experiment (the cheap-model post), with the same scoring rubric: factual accuracy, hallucination rate, specificity, and honesty, for a total on a 0 to 15 scale.

The configurations were a ladder. Each rung adds one more thing on top of the previous one.

System prompt only: the bare baseline. No retrieval, nothing.
System + RAG: vector search over a curated corpus, top documents injected.
Full Context Engineering: RAG + structured output instructions + hierarchical layout + role definition + few-shot examples.

What I expected: a smooth upward curve. What I got was a curve that leaned forward and then fell over.

The numbers

Claude Sonnet, total score (out of 15):

Configuration	Total	Delta from previous
System only	8.8	--
System + RAG	10.2	+1.4 (+16%)
Full Context Engineering	11.4	+1.2 (+12%)

Claude Haiku, total score (out of 15):

Configuration	Total	Delta from previous
System only	3.7	--
System + RAG	11.8	+8.1 (+219%)
Full Context Engineering	10.1	-1.7 (-14%)

Two findings I did not expect.

First: RAG is doing most of the work. On Sonnet, RAG closed about 54% of the gap between baseline and the fully tricked-out pipeline (1.4 of the total 2.6 point improvement). On Haiku, RAG over-shot the final number entirely.

Second: stacking more on top of RAG is not free. On Haiku, it actively made things worse. The hallucination score went from 1.7 to 0.5. The honesty score went from 1.3 to 0.5. The model started confidently making things up that it had previously hedged on.
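
The deltas are easy to recompute from the two score tables, which is a useful sanity check on any percentage quoted from them. The sketch below only uses the totals listed above. The function names are illustrative.

```python
from typing import Dict, List, Tuple

Ladder = List[Tuple[str, float]]

SCORES: Dict[str, Ladder] = {
    "Claude Sonnet": [
        ("System only", 8.8),
        ("System + RAG", 10.2),
        ("Full Context Engineering", 11.4),
    ],
    "Claude Haiku": [
        ("System only", 3.7),
        ("System + RAG", 11.8),
        ("Full Context Engineering", 10.1),
    ],
}


def step_deltas(ladder: Ladder) -> List[Tuple[str, float, int]]:
    """For each rung, return (name, point change, percent change vs previous rung)."""
    out = []
    for (_, prev), (name, cur) in zip(ladder, ladder[1:]):
        delta = cur - prev
        out.append((name, round(delta, 1), round(100 * delta / prev)))
    return out


def rag_share_of_total_gain(ladder: Ladder) -> int:
    """Percent of the baseline-to-full improvement that RAG alone accounts for."""
    base, rag, full = (score for _, score in ladder)
    return round(100 * (rag - base) / (full - base))


for model, ladder in SCORES.items():
    print(model, step_deltas(ladder), rag_share_of_total_gain(ladder))
```

A share above 100 means RAG alone scored higher than the full stack, which is the Haiku case.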

Why this happens

I have a hypothesis that I think survives contact with reality.

A small model has limited working memory. RAG hands it the right facts. Once those facts are in front of it, the marginal returns from extra structure are small. But the marginal cost of extra context is not small. Every paragraph of role definition, every few-shot example, every "here is how to format the output" block competes with the retrieved documents for the model's attention.

For Sonnet, the working memory is wide enough that the extras land in unused space. For Haiku, the extras shove the actually-useful retrieved context off to the edge of the window. The model still sees it. It just stops trusting it.

This is the same finding that recent research on long-context behavior keeps surfacing. Studies on instruction-following at high context fill report that for most frontier models in 2026, quality starts to degrade measurably at 60 to 70 percent context fill, and falls off a cliff around 90 percent. The cliff is steeper for smaller models.

The Pareto principle applies to context engineering with embarrassing accuracy. RAG is the 20 percent of effort that produces 80 percent of the result. Everything you stack on top of it is the long tail.

The 2026 reality I almost forgot to mention

When I ran the original experiment, I was on Sonnet 4 and Haiku 3 with a 200K context window. As of this writing, Sonnet 4.6 has a 1M token context window at standard pricing and prompt caching cuts the cost of repeated context by 90 percent.

This changes the math, but not in the direction you might think.

A 1M context window does not magically make stacked context cheaper to design. The model still has to pay attention to the right thing. The degradation that starts at 60 to 70 percent fill is a percentage, not an absolute. A bigger window just means you can write more bad context before you fall off the cliff.
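
To make the percentage point concrete, here is the arithmetic for both window sizes. The fill percentages are the ones quoted above and are taken as given. The function name is illustrative.

```python
from typing import Dict, Tuple


def fill_thresholds(window_tokens: int, percents: Tuple[int, ...] = (60, 70, 90)) -> Dict[int, int]:
    """Token counts at which each fill percentage is reached (integer math)."""
    return {p: window_tokens * p // 100 for p in percents}


for window in (200_000, 1_000_000):
    print(window, fill_thresholds(window))
# prints 200000 {60: 120000, 70: 140000, 90: 180000}
# prints 1000000 {60: 600000, 70: 700000, 90: 900000}
```

The larger window moves the absolute thresholds by 5x, but the proportion of the window that can be filled before quality degrades stays the same.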

Prompt caching helps if your stacked layers are static. The role definition, the few-shot examples, the structured output instructions: those parts cache cleanly. But that only saves money. It does not save quality. If your Haiku result was minus 14%, prompt caching makes minus 14% cheaper. That is not the win you wanted.

The thing nobody told me about Skills

Anthropic's Skills feature is interesting in this light. Skills are reusable context bundles that load on demand. The right way to think about them is not "more context, all the time" but "the right context, just in time."

That is the failure mode my Full CE experiment ran into. I was packing every layer into every request. Skills point at the alternative: keep the system prompt small, retrieve the relevant skill, and let the rest stay out of the window. It is the same lesson as RAG, applied one level up. Selective beats throwing everything in.

What I do now

If you take only one thing from this post, take this: the order of operations matters more than the number of operations.

Build the retrieval first. Get RAG working with a clean corpus, decent embeddings, and a relevance threshold. This is your 80%.
Run a benchmark. Real benchmark, on real questions, scored by a real rubric. Not vibes.
Add one layer at a time. Structured output, then hierarchical layout, then role definition. Re-benchmark after each.
If the score goes down, take that layer out. Do not assume the layer is good and your benchmark is bad. The benchmark is right more often than you think.
Try the same ladder on a smaller model. The thing that helps Sonnet may hurt Haiku. Knowing which side of the line you are on saves you money.
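
The add-one-layer loop in steps 3 and 4 can be sketched as a small harness. Everything here is illustrative: the scorer is a toy with made-up integer effects standing in for a real benchmark run, and the layer names are just labels.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Score = Callable[[Sequence[str]], float]


def climb_ladder(
    base: Sequence[str], layers: Sequence[str], score: Score
) -> Tuple[List[str], List[Tuple[Tuple[str, ...], float]]]:
    """Add one layer at a time and keep it only if the score strictly improves."""
    kept = list(base)
    best = score(kept)
    history = [(tuple(kept), best)]
    for layer in layers:
        trial = kept + [layer]
        trial_score = score(trial)
        history.append((tuple(trial), trial_score))
        if trial_score > best:
            kept, best = trial, trial_score
    return kept, history


# Toy effects, NOT measured results: they only exercise the keep/drop logic.
TOY_EFFECTS: Dict[str, int] = {
    "system": 4,
    "rag": 8,
    "structured_output": 1,
    "hierarchical_layout": -1,
    "role_definition": 0,
    "few_shot": -2,
}


def toy_score(config: Sequence[str]) -> float:
    """Stand-in for a benchmark run: sums the toy effect of each active layer."""
    return float(sum(TOY_EFFECTS[name] for name in config))


kept, history = climb_ladder(
    ["system", "rag"],
    ["structured_output", "hierarchical_layout", "role_definition", "few_shot"],
    toy_score,
)
print(kept)
# prints ['system', 'rag', 'structured_output']
```

Swapping toy_score for a function that runs the real rubric against a given model is what turns step 5 into a one-line change: the same ladder, scored once per model.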

This sounds obvious. It is not what most teams do. Most teams read a blog post about Context Engineering, add four layers in one weekend, and never measure whether the layers actually helped.

The chef and the kitchen, revisited

In an earlier post I wrote that the model is the chef and the context is the kitchen. I want to extend that.

Adding more context layers is like installing more kitchen equipment. A second oven. A pasta machine. A sous-vide. None of them make the chef worse at cooking pasta. But if the pasta machine takes up the counter space where the chef was chopping vegetables, the dinner gets worse anyway.

The chef does not need every appliance. The chef needs the right ingredients within reach.

Before you read the next breathless post about Full Context Engineering and start adding layers, run the experiment. Measure RAG alone. Measure RAG plus one thing. Find the layer that earns its keep, and leave the rest in the catalog.

The answer is almost always: do RAG well first. Everything else is decoration. Decoration that, on a small model, can flip the sign on your accuracy score and leave you wondering why.

The next time someone says "Context Engineering," what I want to say back is: please define which 20 percent of context you mean. The other 80 has a good chance of making things worse.

The full Context Engineering system (five strategies, the RAG benchmarks behind these numbers, MCP server design, and the Agentic RAG implementation) is in Turning LLMs from Liars into Experts: Context Engineering in Practice.

OpenClaw Hit 250K Stars Faster Than React. I Spent 24 Hours Trying to Like It.

Ken Imoto — Fri, 22 May 2026 13:00:00 +0000

I switched my entire dev setup from Claude Code to OpenClaw on a Tuesday morning. By 11am I was googling "how to remove openclaw". By 6pm I had written a SOUL.md file longer than the actual feature I was shipping.

This post is about that day. About what broke, what didn't, and what 24 hours of working in the terminal agent that is now technically the most-starred open-source project in GitHub history bought me.

Yes, I am the engineer who wrote about Claude Code Skills three weeks ago and called the workflow pattern "settled for at least a year." Then OpenClaw passed React's all-time star count in 60 days, Peter Steinberger announced he was joining OpenAI to ship agents to everyone, and the launch tweet went past 4 million views. Settled, apparently, was a one-month forecast.

The numbers I had to verify before believing them

Let me get the facts out of the way, because half of what people quote on Twitter about OpenClaw is wrong by a factor of two.

OpenClaw crossed 250,000 GitHub stars on March 3, 2026, surpassing React for the all-time most-starred software repository
60 days from launch to 250K. React took roughly a decade
60K stars in the first 72 hours. That part is the one nobody actually believes the first time
Peter Steinberger announced on February 14, 2026 that he is joining OpenAI to work on agents, with OpenClaw moving to a foundation to stay open and independent
One mid-sized refactor session in my test run consumed 920K tokens, which on Claude 4.5 Sonnet billing came out to about USD 8.30

The Hacker News thread when it crossed React was the most upvoted submission of the week. The top comment was "this is either the best thing that happened to dev tools in five years or the most expensive way to learn what --yolo does."

It is, somehow, both.

The setup, and the part I underestimated

Installation took less than a minute.

curl -fsSL https://get.openclaw.dev | sh
export ANTHROPIC_API_KEY=sk-ant-...
openclaw

The first surprise: OpenClaw asked me which model I wanted as default. I had four serious choices, plus Ollama for local models.

openclaw --model claude-4.5-sonnet
openclaw --model gpt-4o
openclaw --model gemini-2.5-pro
openclaw --model ollama/devstral:24b

Claude Code has a backend model. OpenClaw has a backend model dropdown. That is not a small UX difference when you are trying to land a refactor for less than ten dollars.

The second surprise: when I ran my first command, the agent asked me where the SOUL.md file was. I did not have one. It happily generated a default. The default was generic enough that I closed the session, opened my editor, and started writing my own. That is when the day quietly stopped being a benchmark and started being a personality test.

SOUL.md is the part nobody warned me about

Here is the SOUL.md I ended the day with, after rewriting it three times.

# SOUL.md
You are a senior backend engineer with strong opinions and short patience for code
that talks more than it does.

- Prefer Python over TypeScript when both fit. We're not building a frontend here.
- Never add a feature without a test. If the test would take more than 10 minutes
  to write, ask first instead of writing it.
- Performance matters but readability matters more. We're a four-person team, not
  Google.
- Do not write conversational filler. "Sure, I'll do that" is not output. Output
  is the diff.
- When in doubt, ask. Don't guess. Guessing once cost us a weekend.

The thing the docs do not tell you: SOUL.md is not a config file. It is a contract. CLAUDE.md tells Claude Code what the project is. SOUL.md tells OpenClaw who the agent is. They are two different shapes of the same trust problem, and the day I figured that out was the day OpenClaw stopped feeling worse than Claude Code and started feeling different.

I had a Claude Code session open in another window all day for a sanity check. By 4pm I noticed my CLAUDE.md was 312 lines and my SOUL.md was 14. The SOUL.md was doing more work per line.

The Gateway, and why my LGPD-anxious teammate cared

OpenClaw routes every LLM call through a local process called the Gateway.

The Gateway sits on your machine. Your prompts and code do not pass through an OpenClaw-operated cloud relay on the way to Anthropic, OpenAI, or whoever. They go straight from your laptop to the model provider you picked.

Claude Code does not have an equivalent intermediary, but it also does not need one because Anthropic is the only provider. The moment you have multi-provider support, you either need a relay (vendor lock-in risk) or a local gateway (the OpenClaw choice).

A teammate of mine who lives in Brazil and spends meaningful time worrying about LGPD compliance pinged me at lunch to ask what the network diagram looked like. He liked what I sent him. That conversation alone might be worth the day.

ClawHub vs Claude Code Skills

Claude Code Skills are markdown files plus optional resources, distributed however you distribute markdown. ClawHub is an npm-style package marketplace for OpenClaw skills.

openclaw skills search "docker"
openclaw skills install @clawhub/docker-manager
openclaw skills list

ClawHub had several thousand skills the day I tried it. The numbers Steinberger throws around at conferences are higher and probably accurate, but the count moves fast enough that any specific figure is wrong by the time you publish it.

Two real differences I felt:

ClawHub skills are JavaScript. They run in a sandbox but can request shell exec privileges. That makes them more capable than Claude Code Skills and more dangerous. The ClawHavoc incident in March of 2026 saw 341 malicious skills caught, which is a real cost of an open marketplace
Claude Code Skills are simpler to author. I wrote a Skill in 20 minutes my first time. The equivalent ClawHub skill took me about 90 minutes because I had to learn the SDK conventions

If you are an individual developer wanting to share a workflow, Skills are easier. If you are a team wanting a versioned, packaged, audited tool, ClawHub is better. They are not competing for the same problem.

The 3pm moment where I almost stopped

I asked OpenClaw to update some Python 3.8 code to 3.11 across a small repo, run the test suite, and report back.

It did. The session ate 920K tokens, took about 14 minutes, found three places where my colleague had used the walrus operator wrong, and quietly fixed them. I checked the diff. It was right.

Claude Code does the same thing. I have run the same prompt against it many times.

The difference was not the output. The difference was that Claude Code is in my muscle memory. I have typed claude three times a day for a year. When I typed openclaw and waited the extra 1.2 seconds for the cold start, my fingers reached for claude instead. Three times.

That is the part nobody writes about. Switching costs are not just config. They are reflexes. By 3pm I had written half a SOUL.md, almost given up, made coffee, and come back. By 6pm I was OK again.

What I would actually use each one for

I built this matrix during the second coffee.

Decision	OpenClaw	Claude Code
Locked into Anthropic models?	No, multi-provider	Yes, Anthropic only
Local model option	Ollama	None official
Skill distribution	ClawHub package marketplace	Markdown files
Personality file	SOUL.md (who is the agent)	CLAUDE.md (what is the project)
Network architecture	Local Gateway, no relay	Direct to Anthropic
Maturity	60 days old, foundation forming	18 months, Anthropic-stable
Best at	Multi-model teams, regulated environments	Anthropic-first dev shops, simplicity

If your team is Anthropic-only and your CLAUDE.md is already 200 lines, do not switch. Claude Code is fine. The Skills you wrote are still fine. The pattern works.

If your team is multi-provider, or your compliance team has questions about where prompts travel, or you want a backend model dropdown, OpenClaw is worth a Tuesday.

I am still on Claude Code as my default. I have OpenClaw aliased to a separate command for the cases where I want to try a different model on the same prompt without paying for two SaaS subscriptions' worth of context.

Where this goes next

OpenClaw moving to a foundation while Steinberger joins OpenAI is the part I am watching most closely. Foundations are how open-source projects survive their founders. They are also how projects ossify. The first six months of governance under the OpenClaw Foundation will tell you whether the project is going to be Linux or Helm.

If you used to argue Claude Code vs Codex was a binary, OpenClaw is the answer that was supposed to be impossible: a third option that is not produced by an LLM lab. The economics of that are interesting. The next twelve months are going to teach us whether neutral, cross-provider, foundation-governed AI tooling is sustainable, or whether it gets quietly absorbed.

I am betting on sustainable. I have also been wrong about agents in roughly all of the previous quarters, so adjust accordingly.

What this all costs to know

If you take one thing from my Tuesday, take this. OpenClaw and Claude Code are not competitors. They are two answers to the same question: what should the AI inside your terminal be allowed to do without asking you first? SOUL.md and CLAUDE.md are different shapes of the same trust contract. The team that wrote each chose differently because they had different assumptions about who was sitting in front of the screen.

The right tool is the one whose assumptions match yours. Pick on assumptions, not stars.

If you want the harness-engineering frame on this (CLAUDE.md tiers, hooks, sub-agents, how to think about the shell around the prompt, not just the prompt), this is the reference:

Harness Engineering: A Practitioner's Guide

Is AI Actually Citing Your Site? How to Measure What Google Rankings Can't

Ken Imoto — Thu, 21 May 2026 13:00:00 +0000

I've spent the past few weeks writing about LLMO: how to get cited by AI search engines, which content structures work, what Princeton's GEO study says about visibility. All useful stuff. One problem: I had no idea whether any of it was actually working.

I was like a chef who obsesses over recipes but never tastes the food. My Google Search Console was immaculate. My LLMO measurement setup? I was literally typing "does ChatGPT know about my site" into ChatGPT and refreshing the page like a teenager checking if their crush liked their post.

Measuring LLMO is a genuinely hard problem, and most people aren't doing it at all. Here's what I've built: three measurement layers, from "costs nothing" to "costs you a Saturday afternoon of Python."

The Measurement Gap

In SEO, measurement is a solved problem. Google Search Console shows rankings, impressions, clicks, and CTR for free, updated daily. Ahrefs adds backlink data. SEMrush gives you keyword tracking. Everything is visible.

In LLMO, almost nothing is visible out of the box.

There's no "AI Search Console." ChatGPT doesn't send a weekly email saying "You were cited 47 times!" Perplexity has no creator dashboard. The shift: SEO had rankings (1st through 100th position). LLMO has a binary outcome. You're either cited or you're not. And nobody is telling you which.

This gap isn't just an inconvenience. You can't improve what you can't measure, and right now, most content creators are optimizing for AI visibility while flying blind.

Layer 1: GA4 AI Referral Traffic (Free, 5 Minutes)

The easiest measurement you can set up today is tracking AI referral traffic in Google Analytics 4. When an AI search engine cites your site with a clickable link and someone clicks it, GA4 records the source.

Here is the regex pattern I use in a custom channel group:

chatgpt\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|copilot\.microsoft\.com|deepseek\.com|you\.com|meta\.ai|poe\.com

Go to Admin → Channel Groups → Create, add a new channel with this regex as the session source filter, and name it "AI Search." You'll immediately see aggregated traffic from all AI platforms in one view.
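Before pasting the pattern into GA4, it can help to sanity-check it against a few referrer URLs. The following is a minimal sketch, assuming you have raw referrer URLs to test with; the function name is_ai_referrer and the sample URLs are illustrative, not part of GA4. It also shows one quirk worth knowing: with a partial match, you\.com also matches hosts such as thankyou.com, so check which match type your GA4 condition uses.

```python
import re
from urllib.parse import urlparse

# Same alternation as the channel-group regex above.
AI_REFERRER_RE = re.compile(
    r"chatgpt\.com|perplexity\.ai|claude\.ai|gemini\.google\.com|"
    r"copilot\.microsoft\.com|deepseek\.com|you\.com|meta\.ai|poe\.com"
)


def is_ai_referrer(referrer_url: str) -> bool:
    """Return True if the referrer's hostname contains one of the AI domains."""
    host = urlparse(referrer_url).hostname or ""
    # search() is a partial match, so subdomains such as www.perplexity.ai count too.
    return AI_REFERRER_RE.search(host) is not None


# Illustrative referrers, not real traffic.
samples = [
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search/abc",
    "https://www.google.com/search?q=llmo",
    "https://thankyou.com/",  # false positive: contains "you.com"
]
for url in samples:
    print(url, is_ai_referrer(url))
```

If the false positive matters for your traffic, anchoring each alternative (for example with a leading (^|\.) group) is one way to tighten it.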

A few things to know:

ChatGPT plays nicely. Since late 2025, ChatGPT appends utm_source=chatgpt.com to outbound links. ChatGPT traffic shows up cleanly as chatgpt.com / referral in GA4.

Perplexity is decent. Traffic appears as perplexity.ai / referral, though without UTM tags. Still trackable.

Free-tier ChatGPT is a black hole. Free users often don't send referrer data due to privacy settings. Their clicks show up as "Direct," indistinguishable from someone typing your URL manually. Your GA4 numbers are a floor, not a ceiling.

The conversion story is where this gets interesting. Industry data from 2026 shows AI referral traffic converts at 8-12%, compared to 2-3% for traditional Google organic. People who arrive via AI search have already done their research. The AI did it for them. They are further along in the decision process.

I started tracking three weeks ago. My AI referral traffic is still small (single digits daily), but the conversion rate is 3x my organic average. Small sample, but a signal worth watching.

Layer 2: The "Ask Five AIs" Protocol (Free, 30 Min/Month)

GA4 tells you who clicked through. It does not tell you whether AI is mentioning you without linking, or whether it is mentioning you at all.

For that, you need to ask directly. I run this on the first Monday of every month:

Step 1. Write 10-15 prompts related to your niche. Mine include "What are the best resources for AI search optimization?", "How do I get my site cited by ChatGPT?", and "LLMO vs SEO differences."

Step 2. Run each prompt on five platforms: ChatGPT, Perplexity, Gemini, Claude, and Copilot.

Step 3. Record four things per prompt per platform:

Mentioned? (Yes / No)
Context (recommendation / comparison / neutral / negative)
Accuracy of information
URL provided?

Step 4. Calculate your citation rate. 15 prompts x 5 platforms = 75 checks. Mentioned 20 times? That's 26.7%.
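If the spreadsheet gets exported as rows, the Step 4 arithmetic is a few lines of Python. This is a sketch under assumptions: the record shape (platform, prompt, mentioned) and the function names are mine, and the 20 "mentioned" rows are placed arbitrarily just to reproduce the 20-out-of-75 example.

```python
from collections import defaultdict


def citation_rate(checks):
    """Share of checks where the brand was mentioned (0.0 to 1.0)."""
    if not checks:
        return 0.0
    return sum(1 for c in checks if c["mentioned"]) / len(checks)


def rate_by_platform(checks):
    """Citation rate per platform, to spot which engine ignores you."""
    grouped = defaultdict(list)
    for c in checks:
        grouped[c["platform"]].append(c)
    return {platform: citation_rate(rows) for platform, rows in grouped.items()}


# Illustrative data: 15 prompts x 5 platforms = 75 checks, 20 of them mentions.
platforms = ["ChatGPT", "Perplexity", "Gemini", "Claude", "Copilot"]
checks = [
    {"platform": p, "prompt": f"prompt-{i}", "mentioned": False}
    for p in platforms
    for i in range(15)
]
for row in checks[:20]:
    row["mentioned"] = True

print(f"{citation_rate(checks):.1%}")  # prints 26.7%
```

The per-platform breakdown is often more useful than the headline number, because a 26.7% average can hide one platform that never mentions you.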

This takes about 30 minutes with a spreadsheet. It's manual and tedious, and also the most reliable method that exists today. Automated tools can approximate this, but they can't replicate the nuance of "was that mention positive or just a passing reference?"

One caveat: LLM responses are non-deterministic. The same prompt can produce different answers on different days. A single check is one sample from a noisy process, so it proves little on its own. That is why I track the monthly trend, not individual data points. Three months of data starts showing real patterns.

Layer 3: Automate It With Python (One Saturday)

If you're an engineer, you can automate the manual protocol with API calls. Hit the OpenAI and Anthropic APIs with your query set, check whether your brand appears in the response, and log results as a time series.

The core logic is simple:

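from openai import OpenAI  # pip install openai; reads OPENAI_API_KEY from the environment

# Every spelling of your brand that should count as a mention.
BRAND_VARIANTS = ["your-site.com", "Your Brand", "yourbrand"]
# Prompts to test; replace the bracketed placeholders with your own terms.
CHECK_QUERIES = [
    "Best tools for [your category]",
    "How to solve [problem you address]",
    "[Your brand] vs [competitor]",
]

def check_openai(query: str) -> dict:
    """Ask one query and report whether any brand variant appears in the answer."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": query}],
        temperature=0.0,  # reduces, but does not eliminate, run-to-run variation
    )
    # content can be None, so fall back to an empty string before matching
    answer = response.choices[0].message.content or ""
    # Case-insensitive substring match against every brand variant
    mentioned = any(v.lower() in answer.lower() for v in BRAND_VARIANTS)
    return {"platform": "ChatGPT", "query": query, "mentioned": mentioned}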

Extend this for Claude and Perplexity, run weekly via cron, dump to CSV. You get a time series of your AI visibility score for about $0.50/week.
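The "dump to CSV" half can be sketched with the standard library alone. This is one possible shape, not the script I described above verbatim: the column names, append_results, and visibility_by_date are assumptions, and it expects result dicts shaped like the ones check_openai returns.

```python
import csv
from datetime import date
from pathlib import Path

FIELDS = ["date", "platform", "query", "mentioned"]


def append_results(csv_path, results, run_date=None):
    """Append one run's results to the CSV, writing the header only once."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    run_date = run_date or date.today().isoformat()
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        for r in results:
            writer.writerow({
                "date": run_date,
                "platform": r["platform"],
                "query": r["query"],
                "mentioned": int(r["mentioned"]),  # store as 0/1 for easy summing
            })


def visibility_by_date(csv_path):
    """Return {date: share of checks that mentioned the brand}."""
    totals = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            hits, n = totals.get(row["date"], (0, 0))
            totals[row["date"]] = (hits + int(row["mentioned"]), n + 1)
    return {d: hits / n for d, (hits, n) in totals.items()}


# Usage (hypothetical file name):
#   append_results("visibility.csv", [check_openai(q) for q in CHECK_QUERIES])
#   print(visibility_by_date("visibility.csv"))
```

Keeping one row per check rather than one row per run means you can later slice by platform or by query without re-running anything.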

The payoff: instead of "I think LLMO is working," you can say "my visibility went from 12% to 28% after I added structured data." Numbers beat feelings.

What's Available in May 2026

If building your own tools isn't your thing, several commercial platforms now track AI citations:

Otterly.ai is the fastest-growing option, with 10,000+ users since launching in October 2024. It monitors your brand across ChatGPT, Perplexity, Google AI Overviews, and Copilot. Keyword-level citation tracking, competitor benchmarking, and clean dashboards.

Profound sits at the enterprise end. Their published case study with Ramp, where they went from 3.2% to 22.2% AI visibility in one month, is the kind of result that gets budget approved. If you're a larger organization, this is where you'll probably land.

Peec AI focuses on brand mention analysis across LLM outputs. Beyond whether you're cited, it tracks how: what sentiment surrounds your mentions, which prompt patterns trigger citations.

My honest take: for individual creators and small teams, the manual protocol plus a basic Python script gives you 80% of the insight at close to 0% of the cost. Commercial tools become worthwhile when you're tracking dozens of keywords across multiple brands and need team dashboards.

The Crawler Signal You're Probably Ignoring

Here's a measurement angle most people miss: AI crawler logs.

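Your server access logs already record which AI systems are visiting your content. GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot all identify themselves in the User-Agent string. Google-Extended is the odd one out: Google documents it as a robots.txt control token rather than a separate crawler user agent, so expect that part of the pattern below to match little or nothing.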

grep -E "GPTBot|ClaudeBot|PerplexityBot|Google-Extended" \
  /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -rn

Pages that get crawled frequently are more likely to appear in AI responses. Pages that never get crawled are invisible. It is an indirect signal, but useful for finding content that AI systems are skipping entirely.
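If you want the same count split by crawler, a short Python pass over the log does it. A rough sketch, assuming the default nginx "combined" log format; the regex, the crawler_hits name, and the sample lines are illustrative, and a custom log_format will need a different pattern.

```python
import re
from collections import Counter

# Same tokens as the grep above. Google-Extended is kept for parity, although it
# may never appear as a User-Agent in practice.
AI_BOTS = ("GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended")

# Combined format: ... "METHOD /path HTTP/x" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'"(?:[A-Z]+) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def crawler_hits(lines):
    """Count hits per (bot, path) for lines whose User-Agent names an AI crawler."""
    hits = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip lines that are not in combined format
        for bot in AI_BOTS:
            if bot in m["ua"]:
                hits[(bot, m["path"])] += 1
                break
    return hits


# Made-up sample lines for illustration only.
sample = [
    '203.0.113.5 - - [21/May/2026:13:00:00 +0000] "GET /blog/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '203.0.113.5 - - [21/May/2026:13:00:05 +0000] "GET /blog/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; GPTBot/1.2; +https://openai.com/gptbot)"',
    '198.51.100.7 - - [21/May/2026:13:01:00 +0000] "GET /about/ HTTP/1.1" 200 256 "-" "Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '192.0.2.9 - - [21/May/2026:13:02:00 +0000] "GET /blog/a HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0) Chrome/125.0"',
]
for (bot, path), count in crawler_hits(sample).most_common():
    print(count, bot, path)
```

To run it on a real log, pass an open file handle: crawler_hits(open("/var/log/nginx/access.log")).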

I checked my own logs and found that /blog/ pages get crawled 15x more than my /about/ page. Not shocking, but the gap was wider than I expected.

Building a Measurement Habit

Measurement without action is just data hoarding. Here is the cycle I run:

Weekly (10 min): Check GA4 AI referral dashboard. Note spikes or drops. Compare week-over-week.

Monthly (30 min): Run the five-platform manual protocol. Calculate citation rate. Scan crawler logs for new patterns.

Quarterly (1 hour): Full review. Update query set. Compare citation rate trends. Check whether content changes produced measurable results.

The LLMO Framework provides a structured approach to KPI design if you want a more formal methodology. I reference it when deciding which metrics matter most at different growth stages.

The Punchline

I started measuring my LLMO visibility three weeks ago. My citation rate across five platforms is 14%. Not great. Not terrible. But the important part is that I know the number, and three months from now I'll know whether it went up or down.

The SEO world figured out measurement twenty years ago. The LLMO world is still in its "checking rankings by Googling yourself" era. The people who build measurement infrastructure now will have a compounding advantage over those who keep guessing.

If you're still typing your brand name into ChatGPT and squinting at the output, I get it. I was doing the same thing last month. But now I have a spreadsheet, a cron job, and a regex filter in GA4. Less romantic, more informative. I'll take that trade.

References

How to Track AI Traffic in GA4
Best LLMO Tools in 2026
GEO: Generative Engine Optimization by Aggarwal et al., Princeton / ACM SIGKDD 2024
LLMO Framework: KPI design and implementation guide

The full playbook (llms.txt patterns, JSON-LD examples, citation-rate KPIs, and ChatGPT/Perplexity/Brave comparison) is in LLMO Practical Guide: Why ChatGPT Ignores Your Website.

DEV Community: Ken Imoto

TRM Grew ChatGPT Referrals 8,337% in 90 Days. I Copied Their 4 LLMO Pillars Onto 3 Indie Sites. Only 1 Moved the Needle.

The setup

The four pillars, as TRM described them

Pillar 1: Semantic SEO. Flat line on all three sites.

Pillar 2: Modular content. Best for the writer, worst for the test.

Pillar 3: Author schema and E-E-A-T. The one that worked.

Pillar 4: Query fan-out cluster. Ran out of capacity at page 11.

What the heatmap actually says

I Refactored 100 Functions With Claude. CI Was Green. Production Got Slower in 7 Spots.

The setup, so you can tell whether this generalizes

What the slow seven had in common

Why CI and mutation testing both said yes

The four checks I run now

What I do not do

I Turned on Agent Tracing for 30 Days. 4 Hidden Bottlenecks Were Eating 47% of My Tokens.

What I instrumented

Bottleneck 1: a tool-call retry loop nobody saw (18% of tokens)

Bottleneck 2: context re-fetch on every turn (14% of tokens)

Bottleneck 3: sub-agent fan-out duplication (9% of tokens)

Bottleneck 4: the sycophancy preamble (6% of tokens)

What the dashboard could not show me

What I run now

What I do not do

I Wired 8 MCP Servers Into One Claude Agent. 3 Pairs Quietly Fought Over the Same Tool Name.

The 8 servers and why each one was there

The three collisions

Why the spec lets this happen

What I added

What I measure now

What I do not do

I Benchmarked 5 Voice AI Stacks. Only 2 Stayed Under 300ms.

The three cliffs nobody puts on the slide

Where the 300ms budget goes

The five stacks

The results

Why voice-to-voice wins (today)

When edge AI catches up

What I would build today

Related reading

5 AI Crawlers Hit My Sites 14,300 Times in 30 Days. Here's What Their User-Agents Told Me About LLMO.

Why I started reading logs in the first place

The 5 crawlers that hit me most

What each one is actually doing

Who actually obeys robots.txt

Three LLMO signals you can derive from this

How to actually pull this data

What I changed after looking at all this

I Added a 4th Agent That Audits My Other Agents. It Caught My Strategist Procrastinating for 3 Weeks.

The three layers were not the autonomous part

What the procrastination looked like in the logs

The 4th layer: what an Evolver is

EVO-0003 caught my Strategist procrastinating

The Safety boundary, because Self-Evolving Agents are not toys

What you need before adding a 4th layer

What about not adding a 4th layer

What I have not built yet

I Told Claude Code to Do TDD. It Wrote the Test AFTER the Code 6 Out of 10 Times.

The 30-day audit

Why next-token prediction defaults to implementation-first

What I tried first that did not work

The PreToolUse hook that broke the loop

What the 30-day rerun looked like

When to skip TDD entirely

The takeaway

Your New Domain's First Week of GA4 Is a Lie: 4 Days of Raw Data from a Launch

The raw data: past 14 days (4 days of real activity)

Five signals, beaten in parallel

Why a single signal lies

The bimodal distribution was the smoking gun

So how many real humans were there?

Why GA4 doesn't filter this for you

Three annotations every new-project dashboard needs

Closing: look at your own GA4 with this lens

I Stacked 4 More Context Layers on Top of RAG. Sonnet Got 12% Better. Haiku Got 14% Worse.

What I was measuring

The numbers

Why this happens

The 2026 reality I almost forgot to mention

The thing nobody told me about Skills