DEV Community: Hiroshi Toyama

A Layered Evaluation Strategy for LLM Agents (Google ADK's 12 Criteria)

Hiroshi Toyama — Sat, 01 Aug 2026 04:17:14 +0000

Google's Agent Development Kit (ADK) ships 12 evaluation criteria for testing agent behavior: tool-call trajectories, final response quality, hallucination detection, safety, multi-turn task success, and more. The natural instinct is to wire all 12 into your PR pipeline and call it done. That instinct is wrong, and the reason why is worth understanding before you build an eval suite for any agent, ADK or not.

The 12 criteria, grouped by what they actually measure

ADK's criteria split cleanly along two axes: does it need a golden reference or rubric, and does it call an LLM-as-a-Judge.

Criteria	Golden Reference	Rubric	LLM Judge	User Simulation
`tool_trajectory_avg_score`	Yes	No	No	No
`response_match_score`	Yes	No	No	No
`final_response_match_v2`	Yes	No	Yes	No
`rubric_based_final_response_quality_v1`	No	Yes	Yes	Yes
`rubric_based_tool_use_quality_v1`	No	Yes	Yes	Yes
`rubric_based_multi_turn_trajectory_quality_v1`	No	Yes	Yes	Yes
`hallucinations_v1`	No	No	Yes	Yes
`safety_v1`	No	No	Yes	Yes
`per_turn_user_simulator_quality_v1`	No	No	Yes	Yes
`multi_turn_task_success_v1`	No	No	Yes	Yes
`multi_turn_trajectory_quality_v1`	No	No	Yes	Yes
`multi_turn_tool_use_quality_v1`	No	No	Yes	Yes

Ten out of twelve depend on an LLM Judge. That single fact is why "run everything on every PR" doesn't survive contact with a real pipeline.

Why full coverage in CI breaks down

Three structural constraints show up as soon as you try:

Latency. LLM-as-a-Judge and User Simulation cases take tens of seconds to minutes each. Multiply by a real test matrix and PR turnaround goes from minutes to an hour-plus.
Cost. For N test cases and T turns, a User Simulation + Judge suite costs roughly O(N × T) in tokens. Running that on every commit is not something FinOps will sign off on.
Flakiness. Even at temperature=0.0, Judge and Simulator outputs drift. A red CI run stops meaning "the code is broken" — it might just mean the Judge had an off moment. Once developers can't trust red, they start ignoring it.

There's also a subtler problem with the two deterministic criteria (tool_trajectory_avg_score, response_match_score). response_match_score uses ROUGE-1 — raw unigram overlap — which is brittle against formatting or phrasing changes and produces false negatives whenever you touch a prompt. And tool_trajectory_avg_score in EXACT or IN_ORDER mode punishes an agent for finding a more efficient (e.g., parallelized) tool-call order after a model upgrade. Strict matching modes actively discourage improvement.
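To see how brittle unigram overlap is, here is a minimal sketch of a ROUGE-1 F1 score. It assumes lowercase whitespace tokenization, which is a simplification; ADK's actual tokenization and scoring details may differ, and the example sentences are made up.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two lowercased, whitespace-tokenized strings."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Multiset intersection: each shared word counts at most as often as it
    # appears in both strings.
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "The order ships on Friday"
rephrased = "Your package will be dispatched Friday"  # same meaning, new wording

print(rouge1_f1(reference, reference))
print(rouge1_f1(reference, rephrased))
```

The rephrased answer shares only one word with the reference, so it scores 2/11 (about 0.18) despite being correct. That is the false-negative pattern a prompt tweak can trigger.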

The fix: layer by execution frequency and determinism

The resolution is a three-layer pyramid, ordered by how often it runs and how deterministic it is.

                 Layer 3: E2E / Scenario Eval        [weekly / pre-release]
                 - User Simulation                    high cost / non-deterministic
                 - multi_turn_task_success
            ------------------------------------
              Layer 2: Nightly / Quality & Safety     [daily]
              - hallucinations_v1 / safety_v1          medium cost / LLM Judge
              - rubric_based_final_response
            ------------------------------------
            Layer 1: PR / Pre-commit (deterministic CI) [per commit/PR]
            - tool_trajectory_avg_score (ANY_ORDER)     low cost / deterministic

Layer 1 — PR gate: no LLM calls at all

The only criterion here is tool_trajectory_avg_score, run in ANY_ORDER (not EXACT) mode — asserting only that the required tools fired, not the exact sequence. External APIs are mocked. Zero LLM calls, zero dollars, sub-minute runtime.

import pytest
from google.adk.evaluation import AgentEvaluator
from my_agent.agent import root_agent

@pytest.mark.asyncio
async def test_layer1_tool_trajectory():
    """PR/CI gate: deterministic tool-call check, no LLM involved."""
    evaluator = AgentEvaluator(
        agent=root_agent,
        config_path="tests/eval_cases/layer1_ci.json"
    )
    results = await evaluator.run_evaluation()
    for result in results.case_results:
        score = result.metrics["tool_trajectory_avg_score"].score
        assert score == 1.0, f"Failed case {result.case_name}: score={score}"

layer1_ci.json pins the config that makes this deterministic:

{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "ANY_ORDER"
    }
  }
}
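The difference between the match modes is easiest to see on tool names alone. The sketch below is an illustration under assumptions, not ADK's implementation: it ignores tool arguments, it assumes extra calls are tolerated in the two relaxed modes, and the tool names are hypothetical.

```python
from collections import Counter


def exact_match(expected: list, actual: list) -> bool:
    """Same calls, same order, nothing extra."""
    return expected == actual


def in_order_match(expected: list, actual: list) -> bool:
    """Expected calls appear in actual as a subsequence (extras allowed)."""
    remaining = iter(actual)
    # `name in remaining` consumes the iterator up to the first hit, so later
    # expected names can only match later actual calls.
    return all(name in remaining for name in expected)


def any_order_match(expected: list, actual: list) -> bool:
    """Every expected call appears in actual, order ignored (extras allowed)."""
    return not (Counter(expected) - Counter(actual))


expected = ["search_flights", "search_hotels", "book_trip"]
# After a model upgrade the agent issues the two searches in the other order.
actual = ["search_hotels", "search_flights", "book_trip"]

print(exact_match(expected, actual))
print(in_order_match(expected, actual))
print(any_order_match(expected, actual))
```

Only the last check passes for the reordered trajectory, which is why ANY_ORDER is the mode that survives harmless reordering.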

Layer 2 — Nightly: quality, hallucination, safety

This is where LLM-as-a-Judge enters, against a fixed, versioned dataset (hundreds of cases, not the full corpus). Two things matter for keeping this layer trustworthy: pin the Judge model and its temperature, and track score deltas over time rather than just absolute thresholds.

evaluator = AgentEvaluator(
    agent=root_agent,
    config_path="tests/eval_cases/layer2_nightly.json",
    judge_model="gemini-1.5-pro",
    judge_model_config={"temperature": 0.0},
)
results = await evaluator.run_evaluation()

for result in results.case_results:
    assert result.metrics["hallucinations_v1"].score >= 0.9
    assert result.metrics["safety_v1"].score == 1.0
    rubric = result.metrics["rubric_based_final_response_quality_v1"]
    assert rubric.score >= 0.8, rubric.reasoning
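Tracking deltas can be as simple as comparing tonight's mean score with a trailing baseline. The following is a sketch under assumptions: the function name, the 7-run window, and the 0.05 tolerance are all hypothetical choices, and where the history is stored is left open.

```python
from statistics import mean


def score_regressed(history: list, tonight: float,
                    window: int = 7, max_drop: float = 0.05) -> bool:
    """Return True when tonight's mean score is more than `max_drop`
    below the mean of the last `window` nightly runs."""
    if not history:
        return False  # no baseline yet, nothing to compare against
    baseline = mean(history[-window:])
    return baseline - tonight > max_drop


previous_runs = [0.92, 0.91, 0.93, 0.92]  # made-up nightly means
print(score_regressed(previous_runs, 0.85))
print(score_regressed(previous_runs, 0.90))
```

A run at 0.85 still clears an absolute threshold of 0.8, yet it is flagged here because it sits well below the recent baseline. That kind of slow slide is what a fixed threshold misses.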

A rubric criterion needs an explicit, weighted rubric — not a vague adjective. This is the part teams get wrong most often:

{
  "rubric_based_final_response_quality_v1": {
    "threshold": 0.8,
    "rubrics": [
      {"name": "conclusion_first", "description": "Does the first 1-2 sentences state a clear conclusion?", "weight": 0.3},
      {"name": "formatting_structure", "description": "Is the response structured with Markdown lists/headings?", "weight": 0.3},
      {"name": "technical_explanation", "description": "Are technical terms briefly defined in context?", "weight": 0.4}
    ]
  }
}

Write rubric items as objectively checkable conditions, not subjective adjectives:

Bad: "explained in an easy-to-understand way"
Good: "each technical term is followed by a one-sentence definition, either in parentheses or the next sentence"

The Judge LLM receives the query, the response, and the weighted rubric list, and returns per-item scores plus a weighted average — so a failing assertion comes with a reasoning string you can read directly, not just a number.
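As a rough illustration of the arithmetic, the sketch below combines per-item scores with the weights from the rubric above. The helper name and the per-item scores are hypothetical, and whether ADK normalizes weights exactly this way is an assumption.

```python
def weighted_rubric_score(item_scores: dict, weights: dict) -> float:
    """Weighted average of per-item scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    weighted_sum = sum(item_scores[name] * weight for name, weight in weights.items())
    return weighted_sum / total_weight


weights = {
    "conclusion_first": 0.3,
    "formatting_structure": 0.3,
    "technical_explanation": 0.4,
}
# Hypothetical Judge output: the response is well structured but defines no terms.
item_scores = {
    "conclusion_first": 1.0,
    "formatting_structure": 1.0,
    "technical_explanation": 0.0,
}

score = weighted_rubric_score(item_scores, weights)
print(score, score >= 0.8)
```

Failing the heaviest item alone pulls the score to about 0.6, below the 0.8 threshold, so the weights decide which rubric items can fail a case on their own.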

Layer 3 — pre-release E2E: multi-turn simulation

This layer runs a User Simulator LLM against the agent across several turns and checks goal completion (multi_turn_task_success_v1) plus process quality (multi_turn_trajectory_quality_v1, multi_turn_tool_use_quality_v1).

evaluator = AgentEvaluator(
    agent=root_agent,
    config_path="tests/eval_cases/layer3_e2e.json",
    user_simulator_model="gemini-1.5-flash",
    judge_model="gemini-1.5-pro",
    judge_model_config={"temperature": 0.0},
)
results = await evaluator.run_evaluation()

for result in results.case_results:
    sim_quality = result.metrics["per_turn_user_simulator_quality_v1"].score
    if sim_quality < 0.8:
        pytest.skip(f"Simulator degraded in {result.case_name}")
    assert result.metrics["multi_turn_task_success_v1"].score == 1.0
    assert result.metrics["multi_turn_trajectory_quality_v1"].score >= 0.8

Note the pytest.skip guard on per_turn_user_simulator_quality_v1. This layer has a structural gotcha worth calling out explicitly.

The gotcha: meta-evaluation error propagation

per_turn_user_simulator_quality_v1 exists because this layer has three LLMs talking past each other: a Simulator LLM playing the user, the Agent under test, and a Judge LLM scoring the outcome. That's a "Judge judging a Judge" structure — if the Simulator goes off-script (loops, ignores its own goal, hallucinates a constraint), the resulting task_success score reflects the Simulator's failure, not the agent's. Skipping low-quality-simulation sessions before asserting on task success is not optional defensive coding — without it, a bad night for the Simulator model silently fails your agent's release gate.
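One caveat with calling pytest.skip inside a loop: it raises, so the whole test ends at the first degraded session and later cases are never checked. A per-session filter avoids that. This is a sketch under assumptions: the Session type, the 0.8 quality floor, and the 20% exclusion cap are hypothetical, not part of ADK.

```python
from dataclasses import dataclass


@dataclass
class Session:
    case_name: str
    simulator_quality: float  # per_turn_user_simulator_quality_v1 score
    task_success: float       # multi_turn_task_success_v1 score


def gate_release(sessions: list, min_sim_quality: float = 0.8,
                 max_excluded_ratio: float = 0.2):
    """Split sessions into trusted and excluded ones, then judge only the trusted.

    Returns (failures, excluded, too_many_excluded)."""
    usable = [s for s in sessions if s.simulator_quality >= min_sim_quality]
    excluded = [s.case_name for s in sessions if s.simulator_quality < min_sim_quality]
    failures = [s.case_name for s in usable if s.task_success < 1.0]
    # If the Simulator degraded on too many sessions, the run says little
    # about the agent and should be repeated rather than trusted.
    too_many_excluded = len(excluded) > max_excluded_ratio * len(sessions)
    return failures, excluded, too_many_excluded


sessions = [
    Session("refund_flow", 0.9, 1.0),
    Session("address_change", 0.5, 0.0),   # Simulator went off-script
    Session("cancel_order", 0.95, 0.0),    # genuine agent failure
]
print(gate_release(sessions))
```

Reporting the excluded sessions separately keeps a bad Simulator night visible without letting it fail, or silently pass, the agent.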

Wiring it into CI/CD

name: Agent Evaluation Pipeline
on:
  pull_request:
    branches: [ main ]
  schedule:
    - cron: '0 18 * * *'  # nightly; GitHub Actions evaluates cron in UTC (18:00 UTC)

jobs:
  layer1-ci:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -r requirements.txt
      - run: pytest tests/test_layer1_ci.py

  layer2-layer3-nightly:
    if: github.event_name == 'schedule'
    needs: layer1-ci
    runs-on: ubuntu-latest
    env:
      GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: { python-version: '3.11' }
      - run: pip install -r requirements.txt
      - run: pytest tests/test_layer2_nightly.py
      - run: pytest tests/test_layer3_e2e.py

Only Layer 1 blocks the PR. Layers 2 and 3 run on schedule, against main, and their failures route to a human triage step rather than a red PR check.

Where to put the eval code

A recurring question: does eval code belong under tests/? The honest answer is that tests/ and evals/ are different disciplines and benefit from being separated at the top level:

.
├── src/
│   └── my_agent/
├── tests/                 # deterministic, mocked, no API calls
│   ├── test_tools.py
│   └── test_state_management.py
└── evals/                 # probabilistic, real API/Simulator calls
    ├── datasets/
    ├── metrics/            # custom rubric definitions
    └── run_evals.py

tests/ should stay fast and free so a developer can run it locally without worrying about API spend. If eval code does live under tests/, mark it explicitly (pytest.mark.eval) so pytest -m "not eval" stays the default local loop.

Summary

Only 1-2 of ADK's 12 criteria are realistic PR-blocking gates — tool_trajectory_avg_score in ANY_ORDER mode, backed by deterministic assertions and mocks. Everything that touches an LLM Judge or a User Simulator belongs in a nightly or pre-release layer, versioned against a fixed dataset, with the Judge model and temperature pinned for reproducibility. If your eval suite tries to be a single flat list run on every commit, expect flaky red builds and a PR queue nobody trusts — the fix isn't a better assertion, it's a different pyramid.

Editing an ext4 Partition Directly from macOS (No Linux VM Required)

Hiroshi Toyama — Thu, 30 Jul 2026 13:30:27 +0000

I lost the private key for a Raspberry Pi and needed to add a new public key to /root/.ssh/authorized_keys. The Pi's SD card was plugged into a Mac via a USB reader, so diskutil list could see the partitions fine — but the root filesystem was ext4, which macOS cannot mount. No spare Linux box, no VM set up, just Docker Desktop (which, on Apple Silicon, only gives you a Linux kernel inside its VM — it doesn't let a container touch a host block device like /dev/disk5s3). This is the path that worked without either of those.

The trick: e2fsprogs is userspace

e2fsck, debugfs, and friends from e2fsprogs don't rely on the kernel's ext4 driver at all — they implement the ext2/3/4 on-disk format entirely in userspace and just do read()/write()/lseek() on whatever path you give them. That means they work identically whether the target is a regular file, a disk image, or — critically — a raw device node on macOS.

brew install e2fsprogs
# keg-only, so call it by full path or add it to PATH
DEBUGFS=/opt/homebrew/opt/e2fsprogs/sbin/debugfs

diskutil list showed the target as disk5, with the Linux root filesystem on the third slice:

/dev/disk5 (external, physical):
   1: Windows_FAT_32   ...   disk5s1
   2: Linux_Swap       ...   disk5s2
   3: Linux            ...   disk5s3

Read-only exploration works immediately, no mount required:

$DEBUGFS -R "ls -l /root/.ssh" /dev/rdisk5s3
$DEBUGFS -R "dump /root/.ssh/authorized_keys ./authorized_keys.bak" /dev/rdisk5s3

ls -l, stat, and dump give you inode numbers, permissions, timestamps, and file contents without ever mounting the partition. For a simple edit — appending one line to an existing file — I dumped the file, merged in the new key locally, and wrote it back with debugfs -w:

$DEBUGFS -w -R "rm /root/.ssh/authorized_keys" \
            -R "write ./authorized_keys.merged /root/.ssh/authorized_keys" \
            -R "sif /root/.ssh/authorized_keys mode 0100600" \
            -R "sif /root/.ssh/authorized_keys uid 0" \
            -R "sif /root/.ssh/authorized_keys gid 0" \
            /dev/rdisk5s3

That's the whole mechanism. Two things went wrong on the way there, and both are worth knowing before you try this on a filesystem you actually care about.

Gotcha 1: chained `-R` commands don't share a working directory

debugfs -w accepts multiple -R flags to run several commands in one invocation. I assumed cd would set the working directory for the rest of the chain, the way it would in an interactive debugfs session:

$DEBUGFS -w -R "cd /root/.ssh" -R "write ./authorized_keys.merged authorized_keys" /dev/rdisk5s3

It silently wrote the file to /authorized_keys (filesystem root) instead of /root/.ssh/authorized_keys. The cd had no effect across the command boundary. Lesson: when scripting debugfs -w with multiple -R flags, always give write/rm/sif a full path. Don't rely on cd persisting — verify with ls -l after every mutating step instead of trusting the command succeeded as intended.

Gotcha 2: raw device + journal replay can undo your own edits

The bigger issue: after the edit, I ran a routine consistency check:

e2fsck -y -f /dev/rdisk5s3

This printed "recovering journal", then "unable to set superblock flags on /dev/rdisk5s3", and aborted with errors still present. A subsequent read of the file showed its inode had been reverted to an unrelated, long-deleted file's metadata: wrong owner, dtime set, zero links. My freshly written authorized_keys was gone, even though the data block itself was untouched.

Here's what actually happened. debugfs -w writes directly to blocks and bypasses the ext4 journal entirely. A live ext4 filesystem almost always has a non-empty journal; this is completely normal, not a sign of an unclean shutdown. As long as nothing tries to replay that journal, its stale contents are harmless. But e2fsck -y does replay it, and if a journaled transaction happens to touch the same inode table block or bitmap you just hand-edited, replay wins: it overwrites your change with whatever stale state was recorded in the journal.
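Whether a replay is pending can be checked before editing by reading two feature flags from the superblock. The sketch below is not part of e2fsprogs; it relies on the standard ext2/3/4 superblock layout (superblock at byte 1024, magic 0xEF53 at offset 0x38, feature words at 0x5C and 0x60), and the function names are my own. Reading a real device needs root, and an image file or the buffered device node is the safer target.

```python
import struct

SUPERBLOCK_OFFSET = 1024      # the primary superblock starts 1 KiB into the partition
EXT_MAGIC = 0xEF53
COMPAT_HAS_JOURNAL = 0x0004   # bit in s_feature_compat
INCOMPAT_RECOVER = 0x0004     # bit in s_feature_incompat ("needs recovery")


def parse_journal_flags(superblock: bytes) -> dict:
    """Extract journal-related flags from a raw 1024-byte superblock."""
    if len(superblock) < 1024:
        raise ValueError("short read: expected 1024 bytes of superblock")
    (magic,) = struct.unpack_from("<H", superblock, 0x38)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/3/4 superblock")
    compat, incompat = struct.unpack_from("<II", superblock, 0x5C)
    return {
        "has_journal": bool(compat & COMPAT_HAS_JOURNAL),
        "needs_recovery": bool(incompat & INCOMPAT_RECOVER),
    }


def journal_state(path: str) -> dict:
    """Read the superblock from an image file or device node."""
    with open(path, "rb") as f:
        f.seek(SUPERBLOCK_OFFSET)
        return parse_journal_flags(f.read(1024))
```

If needs_recovery comes back true, a later e2fsck -y will replay the journal, which is the situation where hand edits can be overwritten.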

The specific failure to write the superblock flags turned out to be a macOS quirk: /dev/rdiskN is the character (raw) device, and some of the small, oddly-sized writes e2fsprogs issues when finalizing (clearing the "needs recovery" flag, in this case) get rejected with EINVAL because macOS raw devices require I/O aligned to the device's sector size. Switching to the block device node fixed it — no code changes, just a different path:

e2fsck -y -f /dev/disk5s3   # buffered block device, not /dev/rdisk5s3

Through the buffered device, the kernel's buffer cache absorbs the unaligned writes, and both the journal replay and the superblock update completed cleanly.

The correct order of operations

Putting the two gotchas together, the safe procedure for hand-editing a live ext4 image via debugfs -w is:

Recover the journal first, before touching anything by hand: e2fsck -y -f /dev/diskNsM (buffered device). This flushes any pending journaled transactions so they can't later clobber your edits.
Make your edits with debugfs -w, using absolute paths for every command, verifying each mutation with a read-only ls -l / stat immediately after.
Verify with a read-only check: e2fsck -n -f /dev/diskNsM. If it comes back clean, you're done — resist the urge to run e2fsck -y again "just to be safe"; there's nothing left for it to fix, and re-running it adds risk for no benefit.

# 1. clean baseline
e2fsck -y -f /dev/disk5s3

# 2. edit (absolute paths only; remove the old file first, as in the earlier command)
debugfs -w -R "rm /root/.ssh/authorized_keys" \
            -R "write ./authorized_keys.merged /root/.ssh/authorized_keys" \
            -R "sif /root/.ssh/authorized_keys mode 0100600" \
            -R "sif /root/.ssh/authorized_keys uid 0" \
            -R "sif /root/.ssh/authorized_keys gid 0" \
            /dev/disk5s3

# 3. confirm, read-only, no further writes
e2fsck -n -f /dev/disk5s3

Summary

e2fsprogs (debugfs, e2fsck) implements ext2/3/4 entirely in userspace, so it works directly against a macOS device node with no kernel ext4 driver and no Linux VM.
debugfs -w's chained -R commands do not share a working directory — always use absolute paths and verify after each mutation.
debugfs -w bypasses the journal. Running e2fsck -y afterward replays it, and can silently revert your hand-edits if they share a block with a pending journaled transaction. Recover the journal before you edit, not after.
On macOS, prefer the buffered block device (/dev/diskN) over the raw character device (/dev/rdiskN) for e2fsck/debugfs writes — the raw device's alignment requirements can reject small metadata writes with EINVAL.

Stop Symlinking Your Cursor and Claude Code Rules — Generate Them Instead

Hiroshi Toyama — Sun, 12 Jul 2026 11:58:52 +0000

If you use both Cursor and Claude Code on the same repo, you have probably noticed both tools support a rules/ directory of persistent instructions. The obvious move is to make them share one directory with a symlink. Don't. The two formats overlap just enough to look compatible and just differently enough to fail silently. This post walks through the exact incompatibilities and a small code generator that keeps both formats in sync from a single source of truth.

The two rule systems

Both tools let you drop markdown files with YAML frontmatter into a project directory. Instructions apply either always or scoped to file globs.

Cursor — .cursor/rules/*.mdc:

---
description: "Run ESLint after changing TS/JS"
globs: extensions/app/**/*.{ts,tsx,js}
alwaysApply: false
---
# ESLint after edits
Run `npm run lint` before considering the task done.

Cursor derives the rule type from a combination of three fields:

`alwaysApply`	`description`	`globs`	Type
`true`	—	—	Always
`false`	—	provided	Auto Attached
`false`	provided	omitted	Agent Requested
`false`	omitted	omitted	Manual (@-only)
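The table reads as a short decision cascade. Here is a minimal sketch of it; the function name is hypothetical and this is only a restatement of the table, not Cursor's own code.

```python
from __future__ import annotations


def cursor_rule_type(always_apply: bool, description: str | None,
                     globs: str | None) -> str:
    """Classify a Cursor rule from its three frontmatter fields."""
    if always_apply:
        return "Always"           # the other two fields no longer matter
    if globs:
        return "Auto Attached"    # scoped by file pattern
    if description:
        return "Agent Requested"  # the agent decides from the description
    return "Manual"               # only applied when @-mentioned


print(cursor_rule_type(False, "Run ESLint after changing TS/JS",
                       "extensions/app/**/*.{ts,tsx,js}"))
```

The order of the checks matters: a rule with both a description and globs is Auto Attached, because globs are tested first.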

Claude Code — .claude/rules/*.md:

---
description: Run ESLint after changing TS/JS
paths:
  - "extensions/app/**/*.{ts,tsx,js}"
---
# ESLint after edits
Run `npm run lint` before considering the task done.

Claude Code's model is simpler: a rule with a paths: list loads only when Claude reads a matching file; a rule with no paths: loads unconditionally every session (same priority as .claude/CLAUDE.md). There is no "agent decides from description" mode.

Why one shared directory can't work

The tempting shortcut:

ln -s ../.claude/rules .cursor/rules

One physical directory, both tools point at it, done. Except it breaks on two independent axes:

1. File extension. Cursor's rule loader only reads .mdc and ignores plain .md. Claude Code discovers .md and ignores .mdc. A single file on disk cannot have both extensions, so whichever you pick, exactly one tool silently sees zero rules. No error, no warning — the rules just don't apply.

2. Frontmatter keys. Even if you dodged the extension problem, the scoping keys differ:

Cursor: globs: as a comma-separated string, plus alwaysApply:.
Claude Code: paths: as a YAML list, no alwaysApply.

Point Claude Code at a Cursor-format file and it sees no paths: key, so it treats every rule as always-on — your carefully scoped API rule now loads on every session. The failure is invisible because the file is being read; it's just being interpreted with the wrong schema.

This is the trap that motivated the whole exercise: I had .cursor/rules -> ../.claude/rules and "fixed" the files to Claude Code's format, which instantly broke Cursor without a single error message.

The fix: one source, generated outputs

Since the bodies are identical and only the frontmatter differs, treat one format as the source of truth and generate the other. I picked .claude/rules/*.md as the source (Claude Code reads it directly) and generate .cursor/rules/*.mdc.

The mapping is mechanical:

Source (`.md`)	Generated (`.mdc`)
`paths:` list present	`globs: <comma-joined>` + `alwaysApply: false`
`paths:` absent	`alwaysApply: true`
`description:`	carried over verbatim

Here is the core of the generator — plain Node, zero dependencies, so there is no package.json to maintain just for this:

import { readdirSync, readFileSync, writeFileSync, mkdirSync, rmSync } from 'node:fs';
import { join } from 'node:path';

const SRC_DIR = '.claude/rules';
const OUT_DIR = '.cursor/rules';

function splitFrontmatter(text) {
  const m = text.match(/^---\n([\s\S]*?)\n---\n?([\s\S]*)$/);
  return m ? { fm: m[1], body: m[2] } : { fm: '', body: text };
}

// Minimal parser for our known shape: `description:` scalar + `paths:` list.
function parseFrontmatter(fm) {
  const out = { description: undefined, paths: [] };
  const lines = fm.split('\n');
  for (let i = 0; i < lines.length; i++) {
    const desc = lines[i].match(/^description:\s*(.*)$/);
    if (desc) { out.description = desc[1].trim(); continue; }
    if (/^paths:\s*$/.test(lines[i])) {
      for (let j = i + 1; j < lines.length; j++) {
        const item = lines[j].match(/^\s*-\s*(.*)$/);
        if (!item) break;
        out.paths.push(item[1].trim().replace(/^["']|["']$/g, ''));
        i = j;
      }
    }
  }
  return out;
}

function toCursorFrontmatter({ description, paths }) {
  const lines = ['---'];
  if (description) lines.push(`description: ${description}`);
  if (paths.length) {
    lines.push(`globs: ${paths.join(', ')}`);
    lines.push('alwaysApply: false');
  } else {
    lines.push('alwaysApply: true');
  }
  lines.push('---');
  return lines.join('\n');
}

mkdirSync(OUT_DIR, { recursive: true });
const sources = readdirSync(SRC_DIR).filter((f) => f.endsWith('.md'));
const expected = new Set(sources.map((f) => f.replace(/\.md$/, '.mdc')));

for (const file of sources) {
  const { fm, body } = splitFrontmatter(readFileSync(join(SRC_DIR, file), 'utf8'));
  const outName = file.replace(/\.md$/, '.mdc');
  const banner = '<!-- AUTO-GENERATED from .claude/rules. Edit the .md source. -->';
  writeFileSync(
    join(OUT_DIR, outName),
    `${toCursorFrontmatter(parseFrontmatter(fm))}\n\n${banner}\n${body.replace(/^\n+/, '')}`,
  );
}

// Drop generated files whose source .md was deleted.
for (const f of readdirSync(OUT_DIR)) {
  if (f.endsWith('.mdc') && !expected.has(f)) rmSync(join(OUT_DIR, f));
}

Run node scripts/sync-cursor-rules.mjs after editing any rule. The stale-file sweep at the end means deleting a source .md also removes its generated .mdc, so the two directories never drift.

Gotchas

A hand-written YAML parser is fine because the shape is fixed. These frontmatter blocks are tiny and you control them. Don't pull in a YAML dependency for five files with two keys each — but do keep the parser honest about what it supports (scalar description, list paths) and nothing more.
alwaysApply: false + no globs has no Claude Code equivalent. Cursor's "Agent Requested" type (the agent decides relevance from the description) doesn't map. Going Claude → Cursor, a rule with no paths becomes alwaysApply: true. If you want scoping on the Cursor side, add explicit paths to the source.
Commit the generated .mdc or gitignore them — pick one and be explicit. Committing means Cursor users don't need Node to bootstrap; gitignoring keeps the diff clean. Either is fine; leaving it ambiguous means someone edits the generated file by hand and loses the change on the next run. The AUTO-GENERATED banner exists to catch exactly that.

Summary

Cursor and Claude Code rules look shareable but aren't: different extensions (.mdc vs .md) and different scoping schemas (globs string vs paths list) both fail silently when you force one directory to serve both. A ~60-line dependency-free generator turns one source of truth into the second format and includes a stale-file sweep so the outputs never drift. Edit one place, run one command, both tools stay correct.

How an Unbounded fastmcp Version Constraint Took Down Production with 421 Misdirected Request

Hiroshi Toyama — Wed, 08 Jul 2026 05:33:45 +0000

We run a Model Context Protocol server on Cloud Run, built with fastmcp, sitting behind a custom domain on an HTTPS load balancer. It hadn't been redeployed in three months. When we finally shipped a routine release, every single request started failing with 421 Misdirected Request. No code changes. No infra changes we were aware of. Just a version bump.

This is the story of tracking that down, and why an unbounded fastmcp>=3.4.2 in pyproject.toml was the real culprit.

The symptom

After redeploying, every request through the custom domain returned:

HTTP/1.1 421 Misdirected Request

with an empty or near-empty body. Curl reproduced it instantly (1ms latency), which ruled out anything resembling real request processing — this was a fast, early rejection.

The confusing part: the exact same image, hit directly via its own *.run.app URL with no custom domain, returned a normal 404 for an unmounted path instead of 421. Something was specifically rejecting the custom domain Host header, not /mcp requests in general.

Chasing the wrong SDK version

pyproject.toml had:

dependencies = [
    "fastmcp>=3.4.2",
]

uv.lock on disk was already pinned to 3.4.2. I installed that exact version into a scratch venv and read through fastmcp's HTTP transport code and the underlying mcp SDK's mcp/server/transport_security.py, which implements DNS-rebinding protection:

class TransportSecurityMiddleware:
    def __init__(self, settings: TransportSecuritySettings | None = None):
        # If not specified, disable DNS rebinding protection by default
        # for backwards compatibility
        self.settings = settings or TransportSecuritySettings(enable_dns_rebinding_protection=False)

In fastmcp==3.4.2, nothing in the codebase ever passed security_settings to this middleware — a global search for TransportSecurity, allowed_hosts, and dns_rebinding came back empty. So this middleware should have been a no-op. And yet production was failing.

The lesson here: >=X in a dependency spec describes what was true when the constraint was written, not what's actually running. pyproject.toml said >=3.4.2; the container that was actually deployed had resolved something newer.

Finding the real version

The Cloud Run startup logs told the real story:

│                                FastMCP 3.4.3                                 │

Not 3.4.2. The routine ci.sh release step that bumps the package's own version had also run uv lock, and because the constraint had no upper bound, the resolver happily picked up the newest compatible release — 3.4.3 — which had shipped days earlier.

Re-running my investigation against fastmcp==3.4.3 immediately explained everything:

# fastmcp/settings.py
http_host_origin_protection: bool = True
http_allowed_hosts: list[str] | None = None

3.4.3 added host-origin protection enabled by default, wired through to the same TransportSecurityMiddleware we'd already found:

# fastmcp/server/mixins/transport.py
allowed_hosts=(
    allowed_hosts
    if allowed_hosts is not None
    else fastmcp.settings.http_allowed_hosts
),

With http_allowed_hosts left at its default None, the effective allowlist is empty. Every incoming Host header — our custom domain, anything — fails the check:

# mcp/server/transport_security.py
async def validate_request(self, request, is_post=False):
    ...
    host = request.headers.get("host")
    if not self._validate_host(host):
        return Response("Invalid Host header", status_code=421)

That's the 421. It's a real security feature — DNS-rebinding protection is a legitimate concern for locally-hosted MCP servers — but it shipped as a default-on breaking change in what looked, from the version number, like a patch release.
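To make the failure mode concrete, here is a simplified sketch of allowlist-style Host validation. It is not the mcp SDK's implementation: the function name, the handling of a `:*` port wildcard, and the example hostnames are assumptions for illustration. It shows only the property that matters here, which is that nothing can match when no allowlist is configured.

```python
from __future__ import annotations


def host_allowed(host: str | None, allowed_hosts: list[str] | None) -> bool:
    """Return True only if the Host header matches an allowlist entry."""
    if not host:
        return False
    for pattern in allowed_hosts or []:   # None behaves like an empty allowlist
        if pattern == host:
            return True
        # "name:*" accepts the same hostname on any port.
        if pattern.endswith(":*") and host.startswith(pattern[:-1]):
            return True
    return False


print(host_allowed("mcp.example.com", None))
print(host_allowed("mcp.example.com", ["mcp.example.com"]))
print(host_allowed("localhost:8080", ["localhost:*"]))
```

With the allowlist left unset the loop never runs, so every Host header is rejected regardless of its value.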

A red herring along the way

Before landing on the host-allowlist explanation, we chased a different theory: an OAuth authorization request from an MCP client returned a plausible-looking Misdirected Request, and it was tempting to blame a redirect_uri/CIMD (Client ID Metadata Document) validation mismatch — that's a real, separate failure mode worth knowing about, but it wasn't the cause here. It was only after reproducing 421 on a plain, unauthenticated request to the base /mcp endpoint — nothing OAuth-related at all — that it became clear this was happening below the application logic, in transport-level middleware.

The fix

Immediate mitigation — restore the pre-3.4.3 behavior explicitly, rather than relying on accidental version pinning:

env {
  name  = "FASTMCP_HTTP_HOST_ORIGIN_PROTECTION"
  value = "false"
}

fastmcp's settings model uses extra="ignore", so this env var is a safe no-op on any version that doesn't understand it yet — meaning it's also cheap insurance to add to services still running an older fastmcp, before they ever hit this problem.

Then, cap the blast radius of future upgrades:

dependencies = [
    "fastmcp>=3.4.2,<3.5.0",
]

Re-locking against that constraint happened to resolve back down to 3.4.2 — a nice confirmation that the newer patch wasn't otherwise needed.

Takeaways

An unbounded >= constraint on a fast-moving dependency is a live production risk, not just a style nit. A "patch" version bump (3.4.2 → 3.4.3) shipped a default-on behavioral change.
Any CI step that re-runs your lockfile resolver (a version-bump script calling uv lock, poetry lock, etc.) can silently upgrade transitive-looking dependencies you never touched. Review the lockfile diff, not just your own changed files.
New security defaults deserve explicit configuration, not silent adoption. DNS-rebinding protection is the right default for a library that's often run as a local, loopback-only server. It's a much riskier default to ship without a loud changelog entry when the same library is also commonly deployed behind a custom domain, where the request's Host header will never be localhost in the first place.
When a 1ms-latency error shows up right after a deploy, suspect the deploy, not the traffic. The fast rejection time was the tell that this was a cheap, early-stage check — not an app bug deep in request handling.

If you're running fastmcp behind a custom domain, go check whether you're pinned tightly enough to know which http_host_origin_protection behavior you're actually getting, and configure FASTMCP_HTTP_ALLOWED_HOSTS (or explicitly opt out) rather than finding out in production.

Why TPUs Aren't Popular (Even Though They're Cheaper Per Token)

Hiroshi Toyama — Fri, 05 Jun 2026 08:35:50 +0000

If you only look at the spec sheet, the TPU story is overwhelming: lower cost-per-token, dramatically better watts-per-token, deterministic latency. Trainium tells the same story. And yet a large share of the industry — including most of the inference traffic behind consumer chat UIs like ChatGPT — still runs on NVIDIA. The gap between "cheaper on paper" and "what people actually deploy" is not a marketing failure. It's an architectural tax that systolic-array silicon charges you in code, pipelines, and org structure. This post is about where that tax comes from and why only a handful of companies can afford to pay it.

The one architectural fact that explains everything: static shapes

NVIDIA GPUs are SIMT (Single Instruction, Multiple Threads) processors. They schedule threads dynamically at runtime and page memory on demand. TPUs and AWS Trainium are not GPUs — they are systolic arrays: a grid of multiply-accumulate units wired directly to their neighbors, fed by an ahead-of-time compiler (XLA for TPU, the Neuron compiler for Trainium).

A systolic array hits peak utilization only when the shape of the data flowing through it is fixed at compile time. Weights are loaded once and stay stationary in the processing elements; activations slide through like a bucket brigade. Change the sequence length or batch size by even one token and the data routes and memory addresses have to be recomputed — which means the compiler has to generate a new binary.

That single constraint is the source of every downstream pain. Here's what it forces on you at inference time:

Runtime input	NVIDIA (dynamic)	TPU / Trainium (static)
Larger than the compiled bucket	Handled by dynamic allocation	Shape-mismatch crash
Smaller than the bucket	Handled with no waste	JIT recompile stall (minutes) or zero-pad waste
New, unseen length	Just runs	New binary must exist, or it stalls

So before any token reaches the chip, you need an answer to: "what shape is this, and which precompiled binary does it route to?" On NVIDIA you never ask that question.
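That routing question can be sketched as a lookup over the compiled bucket sizes. This is a minimal illustration, not any serving framework's API: `BUCKETS`, `route_to_bucket`, and the bucket sizes themselves are hypothetical.

```python
import bisect

# Hypothetical set of sequence lengths that have a precompiled binary.
BUCKETS = [1024, 2048, 4096, 8192]


def route_to_bucket(num_tokens: int, buckets=BUCKETS) -> tuple[int, int]:
    """Return (bucket_size, padding_tokens) for a request of num_tokens.

    Raises ValueError when the request is larger than every compiled bucket,
    which is the static-shape equivalent of the overflow crash.
    """
    if num_tokens <= 0:
        raise ValueError("request must contain at least one token")
    idx = bisect.bisect_left(buckets, num_tokens)  # smallest bucket >= num_tokens
    if idx == len(buckets):
        raise ValueError(f"{num_tokens} tokens exceeds largest bucket {buckets[-1]}")
    bucket = buckets[idx]
    return bucket, bucket - num_tokens
```

With these assumed buckets, a 100-token request routes to the 1,024 bucket with 924 tokens of padding, and anything over 8,192 is rejected before it reaches the chip.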

The dynamic vs. static analogy: Python vs. Java

The cleanest mental model: NVIDIA is Python, TPU/Trainium is Java.

NVIDIA = Python. Dynamic typing ≈ dynamic shapes. The runtime absorbs chaos. You throw a 100-token prompt or a 50,000-token prompt at the same forward and it just works, "good enough" fast, with no compile step in your face.
TPU/Trainium = Java. Static typing ≈ static shapes. Nothing runs until it's compiled to a fixed binary (NEFF for Neuron, an XLA executable for TPU). In exchange for boilerplate and rigid discipline, you get extreme execution efficiency — once everything fits the contract.

AMD's Instinct line (CDNA, ROCm) sits firmly on the NVIDIA/Python side: SIMT, dynamic shapes, PagedAttention support, and a HIPIFY toolchain whose entire purpose is to port your existing CUDA code to HIP with minimal changes. The static/dynamic split is the real fault line, not the vendor logos.

What "handle dynamic input on static hardware" actually costs you in code

Suppose three users hit your endpoint at once: 3,000 / 4,000 / 1,000 tokens. On NVIDIA you don't pad and you don't build a mask. You concatenate them into one flat 8,000-token buffer and hand FlashAttention a cu_seqlens index marking the boundaries:

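# NVIDIA: variable-length attention. No padding, no mask matrix.
# Just a flat buffer + cumulative sequence lengths [0, 3000, 7000, 8000].
from flash_attn import flash_attn_varlen_func

# q, k, v are flat [total_tokens, n_heads, head_dim] tensors (8,000 tokens here);
# cu_seqlens_* are int32 tensors holding the boundaries above.
outputs = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q, cu_seqlens_k,
    max_seqlen_q, max_seqlen_k,
)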

The kernel reads the boundary index and isolates each user's context in hardware. No wasted FLOPs on cross-user attention. The code is "just the model logic."
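The boundary index itself is just a running sum of the request lengths. A small pure-Python sketch of how it is built and read; `build_cu_seqlens` and `owner_of` are hypothetical helper names, and a real serving path would hold these values in an int32 tensor rather than a list.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens: list[int]) -> list[int]:
    """Cumulative sequence boundaries for a flat, unpadded token buffer."""
    return [0, *accumulate(seq_lens)]


def owner_of(token_index: int, cu_seqlens: list[int]) -> int:
    """Index of the sequence that a flat-buffer position belongs to."""
    for seq, (start, end) in enumerate(zip(cu_seqlens, cu_seqlens[1:])):
        if start <= token_index < end:
            return seq
    raise IndexError(token_index)


cu_seqlens = build_cu_seqlens([3000, 4000, 1000])
print(cu_seqlens)  # [0, 3000, 7000, 8000]
```

Position 2,999 belongs to the first user and position 3,000 to the second, which is all the kernel needs to keep the three contexts apart.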

On a TPU you can't reshape the systolic array, so you do the opposite: force everything into one fixed [batch, STATIC_SEQ_LEN] rectangle and use math to erase the parts you don't want computed.

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_xla.core.xla_model as xm

class StaticShapeAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads, self.d_k = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, attention_mask):
        # x is ALWAYS [batch, STATIC_SEQ_LEN, d_model]. The shape never varies.
        b, s, _ = x.size()
        q = self.q(x).view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        k = self.k(x).view(b, s, self.n_heads, self.d_k).transpose(1, 2)
        v = self.v(x).view(b, s, self.n_heads, self.d_k).transpose(1, 2)

        scores = torch.matmul(q, k.transpose(-2, -1)) / (self.d_k ** 0.5)

        # The systolic array DID compute every cell, including padding and
        # other users' regions. We retroactively delete them: e^(-1e9) -> 0.
        scores = scores.masked_fill(attention_mask == 0, -1e9)
        attn = F.softmax(scores, dim=-1)

        ctx = torch.matmul(attn, v).transpose(1, 2).contiguous().view(b, s, -1)
        return self.out(ctx)

Two things about running this on XLA are pure consequences of static silicon:

xm.mark_step() is the real execution trigger. That import torch_xla at the top isn't decoration. Unlike CUDA's eager mode, calling model(x) on XLA only accumulates a graph. Nothing runs on the chip until mark_step() — called in your serving loop, not inside forward — compiles the accumulated graph into one fixed binary and ships it. New shape → new compile. (Recent PyTorch/XLA adds an eager mode that hides this, but the underlying compile-per-shape model is unchanged.)
masked_fill(..., -1e9) is a hack, not an optimization. NVIDIA's varlen path skips the cross-user multiplications entirely. The systolic array can't skip — it must multiply every cell of the rectangle, including the zeros, and then you mathematically null them out in softmax afterward. You burn the watts, then throw the result away.
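The "compute it, then erase it" step is easy to see on a single row of scores. This is a pure-Python sketch of the same masking arithmetic, not the XLA code path; `masked_softmax` is a hypothetical helper and the scores are made up.

```python
import math


def masked_softmax(scores: list[float], mask: list[int], fill: float = -1e9) -> list[float]:
    """Softmax over one row of attention scores; positions with mask == 0 get `fill`."""
    filled = [s if m else fill for s, m in zip(scores, mask)]
    peak = max(filled)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in filled]
    total = sum(exps)
    return [e / total for e in exps]


# Four key positions; the last two are padding or another user's tokens.
# Their scores (3.0 and 0.5) were already computed before being discarded.
weights = masked_softmax([2.0, 1.0, 3.0, 0.5], mask=[1, 1, 0, 0])
```

The masked positions end up with a weight of exactly zero, but only after their scores were produced by the matmul, which is the wasted work described above.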

The "smallest input" trap

The crash-on-overflow case is intuitive: feed 1,025 tokens into a binary compiled for 1,024 and you get a shape mismatch. The nastier case is underflow — a 100-token request hitting a 1,024 system:

Let it through: XLA sees a new shape and triggers a JIT recompile. In production that's a multi-minute freeze. Stall.
Pad to 1,024: the array dutifully runs 0 × 0 + 0 across ~90% of its cells, consuming full power to compute nothing. Utilization collapses.

The escape hatch is packing: instead of one user per bucket, tile multiple users' requests into a fixed rectangle like Tetris, and generate a segment-ID mask so attention can't bleed across users.

Fixed bucket [ 8192 tokens ]
├─ User A query (3000)
├─ User B query (4000)
├─ User C query (1000)
└─ padding      (192)   <-- the only waste
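A segment-ID mask for that bucket can be sketched in a few lines. This is an illustrative pure-Python version, assuming ID 0 marks padding and ignoring causal masking; `segment_ids` and `may_attend` are hypothetical names, not a library API.

```python
def segment_ids(seq_lens: list[int], bucket: int) -> list[int]:
    """Per-token segment ID for a packed bucket: 1, 2, ... per user, 0 for padding."""
    used = sum(seq_lens)
    if used > bucket:
        raise ValueError(f"{used} tokens do not fit in a bucket of {bucket}")
    ids: list[int] = []
    for seg, length in enumerate(seq_lens, start=1):
        ids.extend([seg] * length)
    ids.extend([0] * (bucket - used))
    return ids


def may_attend(ids: list[int], q: int, k: int) -> bool:
    """True when query position q may attend to key position k."""
    return ids[q] != 0 and ids[q] == ids[k]


ids = segment_ids([3000, 4000, 1000], bucket=8192)
```

In the static forward pass, `may_attend` is what the attention mask encodes for every (query, key) pair in the rectangle.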

It helps to be concrete about what "the rectangle" physically is. When you compile with BATCH_SIZE = 4, STATIC_SEQ_LEN = 8192, XLA reserves one contiguous [4, 8192] static region in the TPU's HBM — not four independent "rooms," but one big sheet the compiler hard-wires the array routes for. A single user rarely fills even one 8,192 lane, so the serving layer packs multiple users across the four lanes at once:

[ One TPU processor: one static [4 x 8192] sheet ]

lane[0] (8192): [ A(2000) + B(5000) + C(1000) + pad(192) ]
lane[1] (8192): [ D(8000)                      + pad(192) ]
lane[2] (8192): [ E(3000) + F(3000) + G(2100)  + pad(92)  ]
lane[3] (8192): [ H(4000) + I(4000)            + pad(192) ]

Physically there are 4 lanes (32K of space); logically the proxy just crammed 9 ragged users (A–I) into them. From the application side it looks like one TPU is concurrently servicing many small requests in parallel — but underneath it's one rigid sheet with a segment mask drawn over it. The reason the hardware wants one fat sheet instead of pre-carved small rooms is pure systolic-array physics: the bigger the matrix, the higher the array's fill rate and the fewer idle cycles between feeds.
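The packing step is a bin-packing problem. A minimal sketch using greedy first-fit-decreasing, which is one common heuristic and not necessarily what any production proxy uses; `pack_first_fit` is a hypothetical name, and the layout it produces can differ from the diagram above while still fitting the same nine requests.

```python
def pack_first_fit(requests: dict[str, int], lanes: int, lane_len: int) -> list[list[str]]:
    """Greedy first-fit-decreasing packing of requests into fixed-length lanes."""
    used = [0] * lanes
    layout: list[list[str]] = [[] for _ in range(lanes)]
    # Place the longest requests first; they are the hardest to fit.
    for name, length in sorted(requests.items(), key=lambda kv: -kv[1]):
        for lane in range(lanes):
            if used[lane] + length <= lane_len:
                used[lane] += length
                layout[lane].append(name)
                break
        else:
            raise ValueError(f"request {name} ({length} tokens) does not fit")
    return layout


requests = {"A": 2000, "B": 5000, "C": 1000, "D": 8000, "E": 3000,
            "F": 3000, "G": 2100, "H": 4000, "I": 4000}
layout = pack_first_fit(requests, lanes=4, lane_len=8192)
padding = 4 * 8192 - sum(requests.values())
```

For these lengths all nine requests fit in the four lanes, leaving 668 padding tokens out of 32,768, about 2% of the sheet.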

Done right, MFU (Model FLOPs Utilization) climbs into the 50–60% band that well-tuned LLM serving actually achieves (PyTorch/XLA reports ~53% training MFU for Llama 2 70B on TPU) — versus the single digits a naive one-user-per-bucket scheme collapses to. 100% is a ceiling nobody touches; the point is that packing recovers most of the loss. But notice what you just built: a high-throughput Go/C++ proxy in front of the cluster whose only job is to catch ragged input and pack it into rectangles in real time. On NVIDIA, that entire layer does not exist.

It's not one function — the whole pipeline forks

People assume torch_xla abstracts the hardware away because xm.xla_device() transparently targets both TPU and Trainium (thanks to the shared OpenXLA/PJRT runtime — libtpu.so for TPU, libneuronpjrt.so for Neuron). That's true for model.to(device) and basic ops. It is emphatically not true for the parts that matter.

The forward signature itself diverges:

# NVIDIA forward: ragged data + boundary index. Length is arbitrary every call.
def forward(self, input_ids, cu_seqlens, max_seqlen):
    return self.flash_attn_func(input_ids, cu_seqlens, max_seqlen)

# Static forward: fixed rectangle + a mask matrix you must build yourself.
def forward(self, input_ids, attention_mask):  # input_ids is [batch, FixedSeqLen]
    return self.static_attn_func(input_ids, attention_mask)

And it cascades all the way down:

Component	NVIDIA pipeline	Trainium pipeline
Inference engine	`vLLM` (CUDA), `TensorRT-LLM`	`NxD` / `vllm-neuron`
Custom kernels	Triton, CUDA C++ (`FlashAttention`)	NKI (Neuron Kernel Interface), rewritten from scratch
Base image	`nvcr.io/nvidia/pytorch`	AWS Neuron DLC
CI build artifact	weights + CUDA/Triton binaries	weights + NEFF static binaries per bucket
Deploy target	`g5` / `p5` instances	`trn1` / `inf2` instances
Monitoring	`nvidia-smi`, DCGM exporter	`neuron-top`, Neuron exporter

Two completely parallel worlds. Your CUDA container, your eval scripts, your autoscaling triggers — none of it carries over. vLLM's hardware-plugin mechanism gives you "one skin" at the business-logic layer, but the engine underneath is 100% separate code with separate bugs.

Precision makes it worse

The data-type story isn't symmetric either. BF16 (which Google's TPU pioneered) is stable on both sides — its FP32-range exponent survives the -1e9 mask values without going NaN. But FP8, the current throughput play, favors NVIDIA: FP8 attention scores swing hard and need dynamic scaling at runtime to avoid clipping. A static compiler has to bake in a fixed scale factor at compile time, so on TPU/Trainium aggressive FP8 attention risks clipping that degrades model quality. "Just switch to FP8" is a one-liner on NVIDIA and a research project on static silicon.
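The clipping risk can be shown with range arithmetic alone. This is a deliberately simplified sketch: it models only saturation against the E4M3 maximum of 448 and ignores mantissa rounding, and the calibration and production values are invented for illustration.

```python
E4M3_MAX = 448.0  # largest finite value in the FP8 E4M3 format


def clip_fraction(values: list[float], scale: float) -> float:
    """Fraction of values that saturate when divided by `scale` and cast to E4M3.

    Only range clipping is modeled; mantissa rounding is ignored.
    """
    clipped = sum(1 for v in values if abs(v) / scale > E4M3_MAX)
    return clipped / len(values)


def dynamic_scale(values: list[float]) -> float:
    """Per-tensor scale computed at runtime so the largest value maps to E4M3_MAX."""
    return max(abs(v) for v in values) / E4M3_MAX


calibration = [10.0, -50.0, 200.0, 448.0]    # hypothetical calibration batch
production = [10.0, -50.0, 896.0, 1792.0]    # hypothetical outlier-heavy batch

static_scale = dynamic_scale(calibration)    # frozen at compile time
static_clipped = clip_fraction(production, static_scale)
dynamic_clipped = clip_fraction(production, dynamic_scale(production))
```

With the scale frozen from calibration, half of the production values saturate; recomputing the scale at runtime clips none of them.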

The hidden cost: your org chart breaks

This is the part that kills adoption and nobody puts on a slide. On NVIDIA there's a clean abstraction boundary:

[ AI engineer / data scientist ]
   architecture, hyperparams, eval
        │
        ▼  boundary: Hugging Face weights / standard PyTorch
        │
[ MLOps / LLMOps engineer ]
   drop into vLLM, configure PagedAttention, scale out

The data scientist never thinks about memory layout. The MLOps engineer never reads the attention math. They ship artifacts across a clean interface.

On TPU that wall disappears, because model structure is directly coupled to physical constraints:

The packing scheme (MLOps) and the segment-mask logic inside forward (AI engineer) are two halves of one design. Change the batching strategy and the math has to change in lockstep. You cannot split that across a spec doc.
An AI engineer casually adding an if branch or changing layer count alters the compiled graph topology — and triggers JIT stalls or OOM in production. Debugging that requires dumping the XLA HLO graph, which pulls the AI engineer into an "infra" incident.
"BF16 → FP8 for 2x throughput" (MLOps) collides head-on with "FP8 static scaling causes hallucinations on certain tasks" (data scientist). On NVIDIA the runtime negotiates this for you. On TPU the two humans have to negotiate it face to face.

The organizations furthest along on TPU/Trainium — Google's Gemini team (custom silicon end to end), Anthropic's Claude team, and increasingly Meta, which began renting Google TPUs in 2026 to test Llama on both training and inference — lean away from the horizontal "data science dept / infra dept" split entirely. They run a single vertically-integrated team of people fluent in both the attention math and the compiler internals. Most companies cannot staff that, and the projects that try to keep the old division of labor die in a pile of compile errors and OOMs.

So why does anyone use them? Because the input is locked

The whole calculus flips when you control the input channel so the shapes are predictable. Two clean examples:

Google / YouTube summaries. The exact internal pipeline isn't public, but the shape is forced by the constraints: Google doesn't re-watch the video. At upload time, an async batch job (on spare TPU cycles) runs ASR and stores timestamped text in storage like Bigtable. When you ask for a summary, the exact text length is already known down to the token — so the router picks a just-right bucket, packing waste is near zero, and a light model like Gemini Flash scans pre-computed text. The "summarize a 2-hour video instantly" magic is really "scan a tiny text index that was built months ago for nearly free."
Anthropic / Claude Code. A CLI coding agent has an almost fully determined input: repo structure, tool definitions, git diffs, system prompt. The first ~90% of the context is invariant, which is exactly what static compilation and prompt caching love. Anthropic in fact serves Claude across a mix of Trainium, TPU, and NVIDIA — matching workloads to the most suitable chip — and runs Trainium fleets at scale (neuronx-distributed); a high-throughput Go/C++ packing proxy is the natural front-end for the static path, though Anthropic hasn't published the exact per-product split. Claude Code is — read cynically — close to the perfect input-locking channel that makes a Java-style chip worth the pain. Long-context workloads help too: a 200K-token prefill packs many buckets back-to-back, so the relative padding waste shrinks toward zero — the static array's weakness fades exactly where Claude is strongest.
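The long-context point is plain arithmetic. A sketch assuming a request is split across as many fixed-size buckets as it needs; the 8,192 bucket size is borrowed from the earlier packing example and `padding_fraction` is a hypothetical helper.

```python
import math


def padding_fraction(num_tokens: int, bucket: int) -> float:
    """Share of compiled cells that hold padding when a request is split
    across as many fixed-size buckets as it needs."""
    buckets_needed = math.ceil(num_tokens / bucket)
    capacity = buckets_needed * bucket
    return (capacity - num_tokens) / capacity


short = padding_fraction(100, 1024)        # one mostly empty bucket
long = padding_fraction(200_000, 8192)     # 25 buckets, only the last one padded
```

The 100-token request wastes about 90% of its bucket, while the 200K-token prefill wastes about 2.3% of its 25 buckets.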

The inverse is just as logical, and it explains why the chat UIs lean hardest on dynamic SIMT hardware. ChatGPT and Claude.ai's web frontends accept arbitrary text, surprise image uploads, and topic switches mid-conversation. The system can't predict the shape until the user hits send. That chaos is precisely what dynamic SIMT + PagedAttention were built for.

Takeaways

TPUs aren't unpopular because they're slow or expensive — they're cheaper per token. They're unpopular because cheapness is conditional on a discipline most teams can't enforce: every tensor shape fixed at compile time.
The cost moved, it didn't vanish. Static silicon pushes all the uncertainty out of the hardware and onto your software (packing, masking, bucket routing) and your people (collapsed dev/ops boundary). You trade hardware spend (silicon, power) for people spend (elite engineers maintaining hack layers).
The decision rule is about the channel, not the chip. If you own the input (a CLI, a fixed business workflow, your own storage pipeline), TPU/Trainium are a weapon. If your input is a free-form chat box or a third-party API integration, NVIDIA (or AMD) is the only sane choice, and reaching for TPU or Trainium on sticker price alone is how MFU quietly collapses to single digits.

The spec sheet was never lying about cost-per-token. It just wasn't pricing in the engineers, the forked pipeline, and the org redesign you have to buy first.

Upgrading Google ADK to 2.0 on a Cloud SQL Postgres Backend: The Three Things That Bit Us

Hiroshi Toyama — Thu, 04 Jun 2026 12:21:39 +0000

We run an agent built on Google's Agent Development Kit (ADK), deployed on Cloud Run with a Cloud SQL (PostgreSQL) session store via ADK's DatabaseSessionService. Bumping google-adk from 1.x to >=2.0.0 looked like a one-line dependency change. It wasn't.

Three things bit us, in increasing order of subtlety:

ADK 2.0 talks to Postgres through asyncpg, which forces a connection-URL change — and that URL is shared with sync code.
The events table needs two new columns that ADK 2.0 reads unconditionally. Deploy without them and chat silently 500s.
The legacy v0 (Pickle) schema still works, but throws a deprecation warning. Migrating to v1 (JSON) is optional and cannot be done in place.

Here's the field report.

1. The async driver switch — and the URL you now share with sync code

ADK 2.0's session service is async and expects an async Postgres driver. In practice that means your DATABASE_URL changes scheme:

postgresql://appuser:...@host/db          # 1.x
postgresql+asyncpg://appuser:...@host/db   # 2.0

Easy enough — update the secret, redeploy. The catch is that the same URL is read by code that is not async. We have custom storage (token storage, pending-state storage) built on plain synchronous SQLAlchemy, and create_engine() does not understand +asyncpg. Feed it the 2.0 URL and it tries to import an async driver into a sync engine and falls over.

The fix is a tiny normalization layer: store the async URL (because ADK is the primary consumer), and strip the driver suffix at the point where sync engines are created.

from sqlalchemy import create_engine
from sqlalchemy.engine import Engine


def _sync_db_url(db_url: str) -> str:
    """Normalize an async-driver URL for use with a sync SQLAlchemy engine."""
    return db_url.replace("postgresql+asyncpg://", "postgresql://", 1)


def create_db_engine(db_url: str) -> Engine:
    return create_engine(
        _sync_db_url(db_url),
        pool_size=2,
        max_overflow=1,
        pool_pre_ping=True,
        pool_recycle=300,
    )

The design decision worth calling out: one URL, normalized at the edge rather than two secrets. ADK gets the +asyncpg form it wants; every sync consumer goes through create_db_engine() and gets the driver suffix stripped. The replace(..., 1) only touches the scheme, so passwords containing the literal substring are safe. If you have any synchronous DB access alongside ADK 2.0, you need a shim like this — otherwise the async URL leaks into create_engine() and you get an import error at startup that looks unrelated to the upgrade.
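The same normalization can be anchored on the URL prefix, which makes it an explicit no-op for URLs that are already in sync form. A minimal sketch; `sync_db_url_strict` is a hypothetical name, not part of ADK or SQLAlchemy.

```python
ASYNC_PREFIX = "postgresql+asyncpg://"
SYNC_PREFIX = "postgresql://"


def sync_db_url_strict(db_url: str) -> str:
    """Swap the async scheme for the sync one only when it is the URL prefix."""
    if db_url.startswith(ASYNC_PREFIX):
        return SYNC_PREFIX + db_url[len(ASYNC_PREFIX):]
    return db_url  # already sync (or another dialect): leave untouched
```

Either form works for the case in this post; the prefix check just makes the "only touches the scheme" guarantee visible in the code.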

2. The missing event columns — a silent 500 in production

This is the one that actually took the service down in our dev environment before we caught it.

ADK 2.0 added two columns to the events table:

input_transcription  jsonb
output_transcription  jsonb

ADK 2.0 reads these columns unconditionally on session GET and on the /run_sse streaming endpoint. If your database was created under 1.x, the columns don't exist, and the query fails with asyncpg's UndefinedColumnError (Postgres SQLSTATE 42703, undefined_column). The symptom is not a clear startup crash: the container boots fine and /health returns 200, but every chat turn 500s and session reads fail. We reproduced it in dev as exactly that: healthy container, dead chat.

The fix is a forward-compatible ALTER TABLE that you must run before deploying the 2.0 image:

ALTER TABLE events ADD COLUMN IF NOT EXISTS input_transcription jsonb;
ALTER TABLE events ADD COLUMN IF NOT EXISTS output_transcription jsonb;

IF NOT EXISTS makes it idempotent, and adding a nullable column with no default is a metadata-only change on Postgres: no table rewrite, only a brief lock on the table, so it is safe on a live DB. The ordering matters: patch the DB first, then deploy. Do it the other way and you have a window where the new image is live against the old schema and chat is down.

Connecting through the Cloud SQL Auth Proxy, the whole patch is:

cloud_sql_proxy -instances=PROJECT:asia-northeast1:INSTANCE=tcp:127.0.0.1:15433 &

PGPASSWORD="$DB_PASSWORD" psql -h 127.0.0.1 -p 15433 -U appuser -d appdb <<'SQL'
ALTER TABLE events ADD COLUMN IF NOT EXISTS input_transcription jsonb;
ALTER TABLE events ADD COLUMN IF NOT EXISTS output_transcription jsonb;
SELECT column_name FROM information_schema.columns
WHERE table_name = 'events'
  AND column_name IN ('input_transcription', 'output_transcription');
-- expect 2 rows
SQL

Good news for rollback: these columns are ignored by ADK 1.x, so adding them doesn't break the old version. You can patch ahead of time without committing to the upgrade.
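One way to make the patch-before-deploy ordering enforceable is a pre-deploy check that turns the current column list into the statements still needed. A sketch under assumptions: `missing_column_patches` is a hypothetical helper, and `existing_columns` is assumed to come from the information_schema query shown above.

```python
REQUIRED_EVENT_COLUMNS = {
    "input_transcription": "jsonb",
    "output_transcription": "jsonb",
}


def missing_column_patches(existing_columns: set[str]) -> list[str]:
    """ALTER TABLE statements for required events columns that are absent."""
    return [
        f"ALTER TABLE events ADD COLUMN IF NOT EXISTS {name} {sql_type};"
        for name, sql_type in REQUIRED_EVENT_COLUMNS.items()
        if name not in existing_columns
    ]
```

An empty list means the schema is ready for the 2.0 image; a non-empty list is a reason to stop the deploy.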

3. The v0 → v1 schema migration is optional (and you probably want to defer it)

On startup, ADK 2.0 logs this if your DB was created under 1.x:

The database is using the legacy v0 schema, which uses Pickle to serialize
event actions. The v0 schema will not be supported going forward and will be
deprecated in a few rollouts. Please migrate to the v1 schema which uses JSON
serialization for event data.

The key realization: ADK 2.0 reads and writes v0 fine. This is a deprecation warning, not a hard requirement. We chose to run 2.0 on the v0 schema and defer the migration — the upgrade and the migration are independent decisions, and decoupling them shrinks the risky deploy.

When you do migrate, the important constraint is that it cannot be done in place. The schemas are structurally different:

`events` column	v0	v1
`actions`	`bytea` (Pickle)	—
`event_data`	—	`jsonb` (all event data)
metadata table	none	`adk_internal_metadata`

v0 stores event actions as individual columns plus a pickled blob; v1 collapses everything into one event_data JSONB column. Because the column set changes, ADK ships a migration command that reads from one DB and writes to a freshly created one:

# CREATE DATABASE can't run inside a transaction — separate statement
psql ... -d postgres -c "CREATE DATABASE appdb_v1;"

SOURCE_URL="postgresql://appuser:${PW}@127.0.0.1:15433/appdb"
DEST_URL="postgresql://appuser:${PW}@127.0.0.1:15433/appdb_v1"

uv run adk migrate session \
  --source_db_url="${SOURCE_URL}" \
  --dest_db_url="${DEST_URL}"

adk migrate session covers ADK's own four tables: app_states, user_states, sessions, events. Anything you added yourself (OAuth tokens, app-specific state) is not touched and has to be copied separately — but that's outside ADK's scope and outside this post.

Verify the destination after migrating:

# 1 means v1
psql ... -d appdb_v1 -c \
  "SELECT value FROM adk_internal_metadata WHERE key='schema_version';"

# event_data present, actions gone
psql ... -d appdb_v1 -c \
  "SELECT column_name FROM information_schema.columns WHERE table_name='events';"

Cut over by repointing the connection secret at the new DB and redeploying. Because you migrated into a new database, the original is untouched — rollback is just repointing the secret back. No data loss, no destructive step until you're confident.

The deploy order that actually works

Pulling it together, the sequence is:

Patch the DB (ALTER TABLE events ...) — before anything else, to prevent the 500 window.
Switch the URL to postgresql+asyncpg:// (and make sure sync consumers normalize it back).
Deploy the 2.0 image.
Smoke test: /health → 200, an existing session GET → not 500, a new /run_sse chat → streams a response.
(Optional, later) migrate v0 → v1 into a new DB and cut over.

Gotchas worth pinning

pg_dump version skew. Don't reach for pg_dump to copy data if your local client is older than the Cloud SQL server (e.g. client 16 vs server 17) — it just refuses. Either match versions or copy via a script.
CREATE DATABASE outside a transaction. It can't run inside one, so it has to be its own statement — not bundled into a BEGIN ... COMMIT block with the grants.
Session compatibility across versions. Sessions written by 2.0 may not be readable by 1.x (especially older 1.x). Treat the version downgrade as lossy for any session created after cutover, and keep the old image only as a short-term escape hatch.
/health lies. A 200 from your health check says nothing about whether the schema matches. Smoke-test an actual session read and a real chat turn.

Summary

The google-adk 2.0 bump is small on paper and sharp in practice. The async driver switch ripples into any sync DB code sharing the URL; the new events columns turn a healthy-looking container into a chat outage if you deploy before patching; and the v0 deprecation warning is loud but not load-bearing — you can stay on v0 and migrate on your own schedule into a fresh DB. Patch first, normalize the URL at the edge, smoke-test the real path, and treat the schema migration as a separate project.

Chrome 126+ Broke My WXT Extension Dev Setup — Here's What Changed and How to Fix It

Hiroshi Toyama — Tue, 02 Jun 2026 12:12:47 +0000

I spent a weekend debugging a Chrome extension dev environment that stopped working after a Chrome update. No error messages. The extension loaded — I could open its options page — but content scripts never ran, service workers never started, and the UI stayed unchanged.

This post is about three separate failure modes I hit, why each happens, and the minimal fix for each.

The Setup

The project uses WXT (a Chrome extension framework built on Vite) with a custom dev.sh script that:

Starts the WXT dev server for hot reload
Launches a dedicated Chrome instance with the extension loaded
Exposes Chrome's remote debugging port for MCP/DevTools access

A clean dev-build → chrome-start → edit-reload loop. Or so it was.

Problem 1: `--load-extension` No Longer Starts Service Workers in Chrome 126+

What broke

Chrome has a flag --load-extension=/path/to/ext that loads an unpacked extension at startup. Before Chrome 126, this worked well for local development. After Chrome 126, the extension appears to load — chrome-extension://ID/options.html is accessible — but:

The background service worker never appears in Target.getTargets CDP results
Content scripts declared in manifest.content_scripts are not injected
The extension is not written to the Chrome profile's Secure Preferences

The last point is the key. With --load-extension, Chrome 126+ treats the extension as ephemeral. It's accessible as a filesystem resource but not actually "installed" in the profile, so Chrome's normal extension machinery (service worker lifecycle, content script injection) doesn't activate.

Why web-ext switched to `Extensions.loadUnpacked`

The web-ext project documented this in their issue tracker and switched away from --load-extension for Chrome 126+. Their new approach:

Start Chrome with --remote-debugging-pipe and --enable-unsafe-extension-debugging
Call the Extensions.loadUnpacked CDP command via the pipe

Extensions.loadUnpacked writes to Secure Preferences just like clicking "Load unpacked" in chrome://extensions/. Once written, the extension is a real installed extension.

Critical caveat: Extensions.loadUnpacked is only available via pipe-based CDP (--remote-debugging-pipe), not via WebSocket-based CDP (--remote-debugging-port). Connecting via port returns "Method not available." even with --enable-unsafe-extension-debugging.
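For reference, the pipe transport is simple to frame by hand. This sketch assumes the commonly documented behavior of --remote-debugging-pipe: Chrome reads commands on file descriptor 3, writes replies on file descriptor 4, and each JSON message is terminated by a NUL byte. Verify those details against your Chrome version; `encode_cdp` and `decode_cdp` are hypothetical helper names and the extension path is a placeholder.

```python
import json


def encode_cdp(message_id: int, method: str, params: dict) -> bytes:
    """Serialize one CDP command for the pipe transport (JSON + NUL terminator)."""
    payload = {"id": message_id, "method": method, "params": params}
    return json.dumps(payload).encode("utf-8") + b"\0"


def decode_cdp(buffer: bytes) -> tuple[list[dict], bytes]:
    """Split a read buffer into complete messages plus any trailing partial one."""
    *complete, rest = buffer.split(b"\0")
    return [json.loads(chunk) for chunk in complete if chunk], rest


# The command that registers an unpacked extension in the profile.
frame = encode_cdp(1, "Extensions.loadUnpacked", {"path": "/path/to/ext"})
```

The launcher writes `frame` to Chrome's command pipe and feeds whatever it reads back through `decode_cdp` until the reply with the matching `id` arrives.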

The simpler fix

After trying the pipe approach, I discovered something: if you already have the extension registered in Secure Preferences from a previous pipe-based install, you can start Chrome normally (no --load-extension) with --enable-unsafe-extension-debugging and it loads from the profile automatically.

But there's an even simpler path: --load-extension + --enable-unsafe-extension-debugging together. Testing showed that when --enable-unsafe-extension-debugging is present, Chrome treats --load-extension extensions as real installs and injects content scripts:

"${CHROME_BINARY}" \
  --user-data-dir="${USER_DATA_DIR}" \
  --remote-debugging-port=9222 \
  --load-extension="${EXTENSION_DIR}" \
  --enable-unsafe-extension-debugging

That's the entire fix for service worker / content script injection.

Problem 2: WXT Dev Server Exits Immediately When Backgrounded

What broke

The dev script ran WXT in the background:

npm run dev &
WXT_PID=$!
# ... wait for build, start Chrome ...
wait "$WXT_PID"

WXT would build the extension, print "Load manually", and exit. The wait returned. The dev server was gone.

Why

WXT uses Node.js readline for its interactive keyboard shortcuts:

rl ??= readline.createInterface({
    input: process.stdin,
    // ...
})

In a non-interactive shell script, job control is off, so a command started with & gets its stdin redirected from /dev/null. process.stdin therefore hits EOF immediately; readline emits close, and WXT exits.

This only manifests when WXT is backgrounded — running it in the foreground is fine. But backgrounding is necessary because you need to start Chrome after the build completes.

The fix

Feed a non-closing stream to the process's stdin:

npm run dev < <(tail -f /dev/null) &

tail -f /dev/null follows an empty file indefinitely, never sending data and never closing. WXT's stdin stays open. readline never gets EOF. WXT keeps running.
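The EOF half of this is easy to reproduce outside Node. A small Python sketch, unrelated to WXT's own code, that starts a child process with stdin attached to the null device and reports what the child could read:

```python
import subprocess
import sys

# Child process that reports how many bytes it could read from stdin.
CHILD = "import sys; print(len(sys.stdin.read()))"

# stdin attached to the null device: read() returns immediately with EOF.
result = subprocess.run(
    [sys.executable, "-c", CHILD],
    stdin=subprocess.DEVNULL,
    capture_output=True,
    text=True,
    timeout=30,
)
bytes_read = int(result.stdout.strip())
```

The child returns at once with zero bytes read. If stdin were instead a pipe that is held open and never written to, the same read would block, which is the state tail -f /dev/null keeps WXT in.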

Problem 3: web-ext Injects Flags That Break Google Login

What broke

WXT's runner (web-ext) was starting Chrome correctly — but Google account login was consistently logged out on every restart.

Why

web-ext uses chrome-launcher for Chrome startup. chrome-launcher's defaultFlags() includes:

'--disable-sync',
'--use-mock-keychain',

--use-mock-keychain is the destructive one. On macOS, Chrome encrypts cookies using the system keychain. --use-mock-keychain substitutes a fake keychain with a different encryption key. Cookies encrypted with the real keychain cannot be decrypted with the mock one and vice versa.

Once Chrome writes cookies with the mock keychain, subsequent Chrome starts (without the flag) cannot read them. Login state is destroyed.

web-ext excludes --disable-extensions, --mute-audio, and --disable-component-update from chrome-launcher's defaults — but not --disable-sync or --use-mock-keychain.

The fix

Disable WXT's web-ext runner entirely and launch Chrome manually:

// wxt.config.ts
import { defineConfig } from 'wxt';

export default defineConfig({
  webExt: {
    disabled: true,
  },
});

Then start Chrome from dev.sh with exactly the flags you need and nothing else.

Bonus: WXT Dev Mode Removes `content_scripts` from the Manifest

This one isn't a Chrome regression — it's WXT's intentional behavior that became a problem once service workers stopped starting.

In dev mode, WXT strips content_scripts from manifest.json and relies on the background service worker to register them dynamically via chrome.scripting.registerContentScripts(). The service worker connects to WXT's dev server and WXT sends reload commands.

When the service worker doesn't start (Problem 1), this entire chain breaks. Content scripts are never registered.

Fix: use WXT's build:manifestGenerated hook to add content_scripts back to the dev manifest:

hooks: {
  'build:manifestGenerated': (wxt, manifest) => {
    // Also strip the localhost CSP entry WXT adds for HMR — Chrome MV3
    // rejects http:// origins in extension_pages CSP.
    const csp = manifest.content_security_policy as Record<string, string> | undefined;
    if (csp?.extension_pages) {
      csp.extension_pages = csp.extension_pages
        .replace(/\s*http:\/\/localhost:[0-9]+/g, '')
        .trim();
    }

    if (wxt.config.command === 'serve' && !manifest.content_scripts?.length) {
      (manifest as Record<string, unknown>).content_scripts = [{
        matches: ['https://your-target-site.com/*'],
        run_at: 'document_end',
        js: ['content-scripts/content.js'],
      }];
    }
  },
},

This also strips http://localhost:3000 from the extension_pages CSP. Chrome MV3 forbids HTTP origins in that directive; WXT adds it for Vite HMR, but it may silently break extension loading.
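To see what that replace call does to a directive string, here is the same regular expression applied in Python. The sample CSP value is hypothetical; the real string depends on the WXT version and your config.

```python
import re

LOCALHOST_SOURCE = re.compile(r"\s*http://localhost:[0-9]+")


def strip_localhost(csp: str) -> str:
    """Remove http://localhost:<port> sources from a CSP directive string."""
    return LOCALHOST_SOURCE.sub("", csp).strip()


# Hypothetical dev-mode value; the real string depends on the WXT version.
dev_csp = "script-src 'self' 'wasm-unsafe-eval' http://localhost:3000; object-src 'self';"
cleaned = strip_localhost(dev_csp)
```

The leading `\s*` consumes the space before the origin, so the remaining sources stay correctly separated.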

The Full Picture

Three independent problems, plus one WXT behavior that the first one exposed:

Problem	Root Cause	Fix
Service worker / content scripts not running	Chrome 126+ changed `--load-extension` to not register extensions in profile	Add `--enable-unsafe-extension-debugging`
WXT dev server exits immediately	`readline` on backgrounded stdin gets EOF	`npm run dev < <(tail -f /dev/null) &`
Google login lost on restart	web-ext injects `--use-mock-keychain`	Disable WXT runner, launch Chrome manually
Content scripts not injected in dev	WXT removes `content_scripts` from dev manifest	Restore via `build:manifestGenerated` hook

None of these produce meaningful error messages. The extension "loads" in the sense that its static pages are accessible. Everything else silently fails. The debugging path was: CDP Target.getTargets to check for service workers, Secure Preferences inspection to check if the extension was actually installed, and process-level stdin inspection to find the WXT exit cause.

The startup order and flags are what matter:

Start the dev server in the background, feeding stdin from tail -f /dev/null so it stays alive
Wait for the build to finish
Wait for the dev server to initialize before launching Chrome (so the service worker can connect)
Launch Chrome with both --load-extension and --enable-unsafe-extension-debugging

npm run dev < <(tail -f /dev/null) &
WXT_PID=$!

# wait for build → wait for dev server → launch Chrome
"${CHROME_BINARY}" \
  --user-data-dir="${USER_DATA_DIR}" \
  --remote-debugging-port="${DEBUG_PORT}" \
  --load-extension="${EXTENSION_DIR}" \
  --enable-unsafe-extension-debugging &

wait "$WXT_PID"
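The "wait for dev server" step is left as a comment above. One way to implement it is to poll the dev server's TCP port before launching Chrome. A sketch in Python; `wait_for_port` is a hypothetical helper, and port 3000 is an assumption based on the localhost:3000 origin mentioned earlier.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.2) -> bool:
    """Poll until a TCP connection to host:port succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

Calling `wait_for_port("localhost", 3000)` between starting WXT and starting Chrome gives the service worker a dev server to connect to on its first run.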

Cursor vs Claude: The Business Models Behind the 10x Price Gap

Hiroshi Toyama — Thu, 07 May 2026 08:25:35 +0000

The previous post covered Composer 2's cache mechanics and the Standard/Fast split. This one goes one level deeper: why the price gap exists structurally, and what it predicts about where AI model markets are heading.

The Two Business Models

The $0.50 vs $5.00 price gap between Composer 2 Standard and Claude Opus isn't primarily about model size. It's about two fundamentally different business models.

Anthropic/OpenAI: Build the most capable general-purpose model possible. License it as an API to anyone who wants to use it—enterprises, startups, individual developers. The general-purpose nature requires maintaining capabilities across every domain: legal reasoning, creative writing, mathematics, programming, ethics, philosophy. Margin on each API call covers model development, infrastructure, and business overhead.

Cursor/Anysphere: Build a model only for one product—Cursor. No external API to sell. No licensing fees to pay. No reason to maintain capabilities outside software development. The specialized training means stripping out everything that isn't code, resulting in a dramatically smaller model that's cheaper to serve.

The math follows directly. Composer 2 is trained exclusively on coding data via continued pre-training and reinforcement learning. Claude Opus maintains the ability to pass bar exams, write poetry, explain quantum mechanics, and argue ethics. You're paying for all of that whether you use it or not.

The Cache Write Tax

This business model difference shows up most concretely in cache write pricing.

Claude's prompt caching has three cost components:

Cache write: 1.25× the base input price
Cache read: ~10% of the base input price
Normal input: base price

That cache write surcharge exists because Anthropic is taking on the cost and risk of maintaining cached data for an external customer. They don't know what you'll cache, how long it'll stay relevant, or whether you'll return in 5 minutes or 5 days. The 1.25× write rate is essentially an infrastructure risk premium embedded in the API pricing.

Composer 2's actual usage data tells a completely different story:

Column	Value
Input (w/ Cache Write)	0
Input (w/o Cache Write)	15,018
Cache Read	391,424
Cost	$0.20

The Input (w/ Cache Write) column is zero across every single Composer 2 request. Anysphere runs Composer 2 on their own servers, optimized for exactly one workload: Cursor's codebase-heavy sessions. There's no external API infrastructure risk to price in. The cache write surcharge simply doesn't exist.

For Claude Opus users on Cursor, the same column is non-zero. Even though Cursor proxies the request, it still hits Anthropic's API and incurs the write premium.

The practical effect: on a new session with a large codebase, Claude Opus users pay an entry fee (cache write at 1.25× rate) that Composer 2 users never encounter.
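
To put a number on that entry fee, here is a rough sketch of the input-side arithmetic using the multipliers listed above (1.25× write, about 10% read). The 400K-token prefix is a hypothetical size, the function names are mine, and the model assumes every later turn lands inside the cache TTL.

```python
def cached_prefix_cost(prefix_tokens: int, turns: int, base: float,
                       write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Input cost in USD of resending one prefix for `turns` turns with caching.

    Turn 1 pays the cache write rate; every later turn pays the cache read rate.
    `base` is the price per 1M input tokens.
    """
    per_million = prefix_tokens / 1_000_000
    write = per_million * base * write_mult
    reads = per_million * base * read_mult * (turns - 1)
    return write + reads


def uncached_prefix_cost(prefix_tokens: int, turns: int, base: float) -> float:
    """Input cost in USD of resending the same prefix with no caching at all."""
    return prefix_tokens / 1_000_000 * base * turns


# Claude Opus at $5.00 per 1M input tokens; 400K-token prefix is hypothetical.
one_turn_cached = cached_prefix_cost(400_000, 1, 5.00)
one_turn_plain = uncached_prefix_cost(400_000, 1, 5.00)
ten_turns_cached = cached_prefix_cost(400_000, 10, 5.00)
ten_turns_plain = uncached_prefix_cost(400_000, 10, 5.00)
print(f"1 turn:   cached ${one_turn_cached:.2f} vs uncached ${one_turn_plain:.2f}")
print(f"10 turns: cached ${ten_turns_cached:.2f} vs uncached ${ten_turns_plain:.2f}")
```

Under these assumptions the first turn costs $2.50 with caching against $2.00 without, so the write premium is a $0.50 entry fee. By ten turns caching is far ahead ($4.30 against $20.00), which is why the fee matters most on short sessions.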

Luxury Engineering

A useful framing emerges from analyzing actual usage patterns: Luxury Engineering.

Using Claude Opus for routine coding tasks is the AI equivalent of hiring a full professor to write unit tests. The professor is qualified—arguably overqualified. They could do it. But you're paying for decades of expertise in domains completely irrelevant to the task: literature, philosophy, ethics, history. That overhead is embedded in every token.

Composer 2 is more like a developer who has done nothing but code their entire career. No breadth, extraordinary depth in the one domain that matters. Because of that specialization, cost is 1/10th.

Full Model Landscape

Looking at the complete pricing picture (2026):

Model	Input	Cache Write	Cache Read	Output
Composer 2	$0.50	none	$0.20	$2.50
GPT-5.3 Codex	$1.75	none	$0.175	$14
Grok 4.20	$2.00	none	$0.20	$6
Gemini 3.1 Pro	$2.00	none	$0.20	$12
Claude 4.6 Sonnet	$3.00	$3.75 (1h: $6.00)	$0.30	$15
GPT-5.5	$5.00	none	$0.50	$30
Claude 4.7 Opus	$5.00	$6.25 (1h: $10.00)	$0.50	$25

GPT-5.3 Codex being significantly cheaper than GPT-5.5 follows the same logic: Codex uses continued pre-training on code data to reduce model weight, and the price difference is essentially "the cost of maintaining the ability to write poetry."

Two patterns stand out:

The Claude cache write anomaly. Only Claude models carry an explicit cache write surcharge. Every other model in this list (including Composer 2) absorbs the write cost into the base price or waives it entirely. This isn't a product limitation—it's a reflection of Claude's external API business model.

Composer 2's output price. $2.50/1M output is 10× cheaper than Claude Opus and 12× cheaper than GPT-5.5. Code generation produces significant output token volume, so Composer 2's low output price means that long agentic sessions (the exact workloads it's designed for) don't hit a cost ceiling.

Claude Code: The Name Is Misleading

The name "Claude Code" implies a coding-specialized model. It isn't. Claude Code is Claude 4.6 Opus or Sonnet—the same general-purpose models available in Cursor—packaged as a CLI tool. The underlying architecture hasn't been pruned for code; it retains the full weight of a general-purpose frontier model.

The cost implications are direct. Claude Code uses Anthropic's standard Prompt Caching, which means the cache write premium (1.25×) applies. The default cache TTL is 5 minutes, short enough to expire while you're running tests or reading docs between prompts. The ENABLE_PROMPT_CACHING_1H=1 flag extends it to one hour, but raises the write cost from 1.25× to 2× the base input price in exchange.

The "autonomous loop" (run tests → read failure → fix code → rerun) is frequently cited as a Claude Code advantage. It isn't unique to Claude Code. Cursor's agent mode executes the same loop via its sandboxed terminal integration. The practical difference is that Cursor's loop doesn't incur a cache write penalty on session start, and runs cache reads at $0.20/1M rather than Claude's ~$0.50/1M.

Where Claude Code has a genuine edge: terminal-native workflows for developers using Vim, JetBrains, or any editor outside the Cursor ecosystem. If you're not using Cursor, Claude Code is the most capable CLI agent available. Within Cursor, the economic case for Claude Code over Composer 2 is thin for standard coding tasks.

What This Means for the Future

This structure predicts where AI model markets go.

General-purpose frontier models have a structural cost floor. They have to maintain broad capabilities to justify API pricing across diverse customers. They have to earn margins on external licensing. They have to maintain the "impressive demo" factor that drives enterprise adoption.

Specialized models built for a specific product have none of those constraints. Strip capability, reduce model size, optimize serving infrastructure, eliminate external API margins. The only question is whether sufficient domain quality can be achieved.

Composer 2 answered that question for software development in March 2026. SWE-bench Multilingual score of 73.7, at 1/10th the cost of Claude Opus.

The same economics will play out in other domains: legal AI products trained exclusively on case law and contracts; medical AI running on clinical literature with zero consumer chat capability; financial models stripped of everything except numerical reasoning and accounting standards. None of them need to know how to write a sonnet.

The structural enabler in each case is the same: building a model for one product, not for external licensing. That eliminates the margin layer and enables the infrastructure optimizations that make 5-10× price reduction possible.

The Rational Selection Framework

Given this analysis:

Composer 2 Standard for any multi-turn session against a codebase. Cache compound interest works in your favor: higher turn count → higher cache read ratio → lower effective cost per token. No cache write entry fee on session start.
Composer 2 Fast for interactive sessions where latency matters more than per-token cost.
Claude Opus or Claude 4.7 when you genuinely need cross-domain reasoning—architecture decisions involving organizational and technical trade-offs simultaneously, debugging scenarios requiring external systems understanding outside your loaded context, or when Composer 2 hits an explicit capability ceiling.

From actual usage data: 88.3% cache read ratio on Composer 2 Standard, $0.19 average cost per request on ~390K token requests. The same request volume on Claude Opus: $0.90 average. The top Opus request cost $4.25, enough for about 22 average Composer 2 Standard requests.

The price gap isn't a temporary marketing discount. It's structural, rooted in business model differences that won't close without a fundamental change in how Anthropic operates. As long as Claude is an external API product, the cache write premium and the overhead of general-purpose training remain embedded in the price.

Using llms.txt with Cursor and Claude Code: a concrete playbook

Hiroshi Toyama — Sun, 03 May 2026 11:56:30 +0000

llms.txt is a small text file on a documentation site. It usually says what the product is and links to the important Markdown pages. For coding agents, treat it as the canonical URL to open first when upstream behavior is unclear. This post is mostly setup and workflow, not theory.

What goes where

Location	Put this there
Official doc server	`https://example.com/llms.txt` (maintained by the library/vendor)
Your repo	URLs only (and short protocols), in agent rules—not a copy of their docs
Your repo `.cursor/rules/`	Project map, conventions, your architecture—not Next.js’s full manual

If you paste thousands of tokens of upstream docs into rules, every chat pays for them. Keeping pointers in rules and loading docs on demand avoids that.

One-time setup: a dedicated rules file

Create something like .cursor/rules/external-llms-docs.md (name does not matter; keep it scoped). Paste a stable list of llms.txt URLs your stack actually uses, grouped so humans and agents scan quickly.

# External docs — fetch on demand

Use web fetch / browser / search tools to load these when implementing or debugging
third-party behavior. Do not paste full upstream docs into the chat.

## Index URLs (read these first)

| Area | llms.txt |
| --- | --- |
| Next.js | https://nextjs.org/llms.txt |
| Tailwind | https://tailwindcss.com/llms.txt |
| Lucide | https://lucide.dev/llms.txt |
| Google ADK | https://adk.dev/llms.txt |

## Read order

1. Fetch the **llms.txt** for the dependency that owns the question.
2. Follow **only** links from that file (or obvious `/docs/*.md` siblings) for depth.
3. Prefer Markdown sources over scraping marketing HTML.
4. If types exist locally (`node_modules`, stubs), use them **after** you know which API surface applies (avoids guessing wrong symbols).

## Scope

- Questions about **our** repo layout → use `repo-map` rule / codebase search, not llms.txt.
- Questions about **their** API/version/docs → use the table above.

Why a separate file: Cursor injects rules by context; a fat global rule file makes unrelated edits heavier. Split internal vs external pointers.

Agent protocol (copy into the same file or AGENTS.md)

Make the sequence explicit so the model does not default to “grep node_modules for an hour.”

## External SDK protocol

When the user asks for behavior that depends on an external library version or API:

1. Identify which dependency owns the feature (package.json / imports).
2. If this file lists an llms.txt for that dependency, **fetch it before** writing code.
3. Summarize in ≤10 lines: version assumptions, file names, and APIs you will use—then implement.
4. Do not quote entire upstream pages back to the user; cite chapter/section or URL path only.

Concrete workflows

Implement a feature (e.g. App Router auth middleware).

User: “Add middleware-based auth with Next.js App Router.”
Agent: fetch https://nextjs.org/llms.txt, open the linked page that describes middleware.ts / matcher patterns.
Implement using current filenames and signatures from that fetch—not memory.

Debug “works on my machine” / deprecation.

User: “Tailwind v4 class names stopped working after upgrade.”
Agent: fetch Tailwind’s llms.txt first; confirm breaking-change notes and config file names, then open repo tailwind.config.* / CSS entry.

SDK with tiered dumps (example pattern).

Some sites expose a short index and a long bundle (names vary). Rule of thumb: start short, upgrade to full only if the stub did not answer.

# hypothetical layout on a docs host
/llms.txt          → links + overview
/llms-small.txt    → minimal surface (cheap)
/llms-full.txt     → everything (expensive)

Point your rules at the entry (llms.txt); let the fetched content tell the agent whether *-full exists.

Prompts that reinforce good habits

You can nudge behavior per task without editing rules:

“Before editing: fetch Next.js llms.txt and confirm middleware filename and export shape.”
“Use ADK llms.txt; don’t rely on training cutoff for API names.”
“After fetching Tailwind llms.txt, list which doc URLs you used (paths only).”

Minimal internal llms.txt (optional)

If you ship an internal library or architecture handbook on HTTPS, you can publish your own index at https://internal-docs.example.com/llms.txt:

# Internal platform — LLM index

## Auth
- Overview: https://internal-docs.example.com/auth/overview.md
- Breaking changes 2026: https://internal-docs.example.com/auth/changelog.md

## Data layer
- API conventions: https://internal-docs.example.com/db/conventions.md

Then add one line to .cursor/rules/external-llms-docs.md: Internal platform | https://internal-docs.example.com/llms.txt. Same mechanics as vendor docs.
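
If you want to sanity-check an internal index before pointing agents at it, a small parser is enough. The sketch below handles only the `- Label: URL` layout used in the sample above. `parse_index` is a hypothetical helper, and llms.txt files in the wild often use Markdown link syntax instead, which this does not handle.

```python
def parse_index(text: str) -> dict[str, list[tuple[str, str]]]:
    """Group '- Label: URL' entries under their '## Section' heading."""
    sections: dict[str, list[tuple[str, str]]] = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            # A new section heading starts a new bucket.
            current = line[3:].strip()
            sections[current] = []
        elif line.startswith("- ") and current is not None:
            # Split on the first ': ' so the 'https://' colon stays in the URL.
            label, sep, url = line[2:].partition(": ")
            if sep and url.startswith("https://"):
                sections[current].append((label.strip(), url.strip()))
    return sections


# A trimmed copy of the sample index above.
SAMPLE = """\
## Auth
- Overview: https://internal-docs.example.com/auth/overview.md
- Breaking changes 2026: https://internal-docs.example.com/auth/changelog.md

## Data layer
- API conventions: https://internal-docs.example.com/db/conventions.md
"""

index = parse_index(SAMPLE)
for section, links in index.items():
    print(section, [label for label, _ in links])
```

A check like this can run in CI for the docs site, so a malformed entry is caught before an agent silently skips it.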

Tooling reality check

This pattern assumes the agent can retrieve HTTPS text (built-in fetch, browser tool, MCP fetch, etc.). Air-gapped machines need a fallback (mirror snippets in rules, local static server, or vendor tarball—but accept resident token cost).

Do not put authenticated URLs with secrets in rules; use public docs or internal SSO-aware tooling outside plain markdown.

Anti-patterns

Dumping full upstream Markdown into .cursorrules “so the agent always knows.”
Skipping llms.txt and crawling random marketing pages (noisy HTML, wasted tokens).
Duplicating vendor docs under docs/vendor/ and indexing everything, unless you truly need offline access.

SEO note (short)

Search-engine teams have questioned llms.txt as an SEO lever; that is largely orthogonal. For coding agents, the win is predictable Markdown entrypoints and smaller always-on context—not rankings.

Summary

Add .cursor/rules/external-llms-docs.md with a table of llms.txt URLs plus read order and scope (external vs internal repo map).
Teach agents: fetch index → follow linked Markdown → then local types.
Use tiered files shallow-first when the provider offers them.
Optionally host your own llms.txt for internal platforms; still keep rules as pointers only.

Cursor Composer 2: The Cache Economy Behind a 10x Cheaper Coding Agent

Hiroshi Toyama — Sat, 02 May 2026 12:53:01 +0000

Cursor's Composer 2 shipped in March 2026 as the centerpiece of the Cursor 2.0 overhaul. The headline numbers—$0.50/1M input tokens, outperforming frontier models on SWE-bench Multilingual—look like marketing. The cache read mechanism is where the real story is.

Why a Specialized Model at All

Prior Cursor versions proxied Claude or GPT-4. Composer 2 is trained exclusively on coding data via continued pre-training and reinforcement learning. The obvious question is: what's cut?

Everything that isn't code. Composer 2 has no meaningful capability for poetry, history, ethics debates, or anything outside software development. That constraint lets Anysphere run a model that:

Understands intra-repo dependency graphs (if you fix A, B also needs updating)
Navigates hundreds of files in a single long-horizon task
Runs natively in sandboxed terminals and a built-in browser loop
Costs a fraction of what a general-purpose frontier model costs to serve

The pricing reflects this. As of May 2026:

Model	Input (1M tokens)	Output (1M tokens)
Composer 2 Standard	$0.50	$2.50
Composer 2 Fast	$1.50	$7.50
Claude 4.6 Opus	$5.00	$25.00
GPT-5.4	$2.50	$15.00

Standard vs Fast: Same Weights, Different Queue

Anysphere's own language is unambiguous: "Same intelligence." The two variants share identical model weights and parameters. Fast gets priority queue on high-end GPUs (H800/B200 class); Standard runs on lower-priority compute with higher latency tolerance.

This is a deliberate architectural choice. Inference cost scales with compute priority, not model capability. If you can tolerate a 10–30 second response delay, you get the same output for 1/3 the price.

The practical split that Cursor power users have settled on:

Interactive sessions (Fast): You're watching the output in real time. Latency kills flow.
Fire-and-forget tasks (Standard): Refactor 100 test files, generate JSDoc across the repo, migrate an entire API surface. Start it, close the laptop, come back to results.

The Cache Read Economy

This is the mechanism that makes Standard compelling for large codebases.

Every request to Composer 2 sends context: directory structure, recently opened files, conversation history. On the second, fifth, tenth turn of the same session, the majority of that context is identical to what was already sent. That's the cache.

Cache read rates as of May 2026:

Tier	New input	Cache read
Standard	$0.50/1M	$0.20/1M
Fast	$1.50/1M	$0.35/1M

By turn 5 of a non-trivial session, 80%+ of your input tokens are cache reads, not fresh input. Standard's cache read rate ($0.20) is 43% cheaper than Fast's ($0.35), and 60% cheaper than Standard's own new input rate.
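
The blended input price at a given cache read ratio follows directly from the two rates in the table. A minimal sketch, covering input tokens only (output tokens are billed separately) and using a hypothetical helper name:

```python
def effective_input_rate(cache_ratio: float, new_rate: float, cache_rate: float) -> float:
    """Blended price per 1M input tokens for a cache read ratio between 0 and 1."""
    if not 0.0 <= cache_ratio <= 1.0:
        raise ValueError("cache_ratio must be between 0 and 1")
    return cache_ratio * cache_rate + (1.0 - cache_ratio) * new_rate


# Rates per 1M tokens from the table above, at an 80% cache read ratio.
standard = effective_input_rate(0.80, new_rate=0.50, cache_rate=0.20)
fast = effective_input_rate(0.80, new_rate=1.50, cache_rate=0.35)
print(f"Standard at 80% cache reads: ${standard:.2f} per 1M input tokens")
print(f"Fast at 80% cache reads:     ${fast:.2f} per 1M input tokens")
```

At an 80% ratio this works out to about $0.26 per 1M input tokens on Standard and about $0.58 on Fast, so the gap between the tiers narrows to a bit over 2× once the cache is warm.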

Concrete impact: A refactoring session with 10 back-and-forth turns on a large codebase might consume 10M tokens. With Standard and healthy cache hits, that lands around $1.50–$2.00. The same session on Fast: $4.00–$5.00. On Claude 4.6 Opus: potentially $20+.

The Cache Bug (March–April 2026)

The cache story has a footnote worth documenting.

From late March through early April 2026, a backend bug caused Composer 2 Standard to emit cache read counts of zero—every request treated as fresh input at $0.50/1M even when the context was identical to the previous turn. Users reported credit burn rates 10x higher than expected. The irony: switching to Fast (which costs 3x more per token) actually resulted in lower total cost because cache was functioning there.

Cursor's team (Dean and Mohit on the forum thread) acknowledged the bug and pushed a fix around April 7. As of v2.1.116+, the behavior appears stable.

The diagnostic check: open cursor.com/settings → Usage. If Cache Read tokens are consistently below 40% on a multi-turn session against the same codebase, something is wrong. Expected range is 40–90% depending on how varied your requests are.

If you hit zero cache read consistently, copy the Request ID from the chat header and contact support. Cursor has been issuing credit refunds for the overbilling period.

Comparing with Claude Code's Cache

Claude Code (Anthropic's CLI tool) has its own prompt caching via cache_control markers, but with a key structural difference: TTL.

Setting	Write cost	Read cost	TTL
Default	1.25× input	~10% of input	5 minutes
`ENABLE_PROMPT_CACHING_1H=1`	2.0× input	~10% of input	1 hour

The 5-minute default is brutal for any session where you read documentation, test code, or think between turns. The 1-hour option (available since Claude Code v2.1.108) adds to the write cost but eliminates repeated cache misses across the kind of natural pauses that happen in real work.

To enable it:

# ~/.zshrc or ~/.bashrc
export ENABLE_PROMPT_CACHING_1H=1

Verify with usage output during a session—look for ephemeral_1h_input_tokens in the log. If you only see ephemeral_5m_, the variable isn't being picked up.

Note: there were also TTL-related bugs in this period that forced resets to 5-minute behavior. Keep Claude Code at the latest version.

My Usage Data

I exported my own Cursor usage history and analyzed it. Here's what a month looks like across models (442 requests):

Model	Requests	Avg cost/request	Cache read ratio
Composer 2 Standard	73	$0.19	88.3%
Composer 2 Fast	25	$0.32	78.1%
Claude 4.6 Sonnet	212	$0.37	84.7%
Claude 4.6 Opus	93	$0.90	79.5%

The 88.3% cache read ratio on Standard is the headline. For an average request consuming ~390K tokens, 88% of those are cache reads at $0.20/1M rather than fresh input at $0.50/1M. Without that cache hit rate, the average cost per request would be ~$0.40 instead of $0.19.

The top Opus requests peaked at $4.25/request (3.9M total tokens, 3.8M of which were cache reads). Even with excellent cache ratios, Opus's higher base rates mean the same cache-heavy session costs 4–5× more than Composer 2 Standard.

The Actual Decision

Composer 2 is not "Claude but cheap." It's a purpose-built agent runtime that has traded general intelligence for deep coding capability and cost efficiency at the infrastructure level. The Standard/Fast split exists because long-horizon agentic tasks don't need millisecond response times—and charging for that latency premium on 10-turn refactoring sessions is wasteful.

The model choice that makes sense given this:

Default to Standard for any multi-file task where you'll have more than 3–4 turns
Switch to Fast for interactive chat where you're watching output incrementally
Use frontier models (Opus, Claude 4.7) only when Composer 2 hits a genuine capability ceiling—complex algorithmic reasoning, architecture decisions that span non-code domains

The cache makes Standard not just "slower Fast," but a qualitatively different operational mode: background processing with cost amortized over a long context window that grows cheaper the more you reuse it.

Two Nasty Gotchas When Building Multi-Agent Systems with Google ADK

Hiroshi Toyama — Tue, 28 Apr 2026 09:30:43 +0000

Google's Agent Development Kit (ADK) makes it straightforward to compose LlmAgent instances into multi-agent hierarchies. But two bugs bit me hard in production that aren't documented anywhere. Here's what happened and how to fix them.

The Setup

A root router LlmAgent with two sub-agents. Both sub-agents are module-level singletons — instantiated at import time, referenced from the root agent's constructor.

# Agents/my_app/root_agent.py
from google.adk.agents import LlmAgent

from Agents.my_app.sub_agent_a.agent import sub_agent_a
from Agents.my_app.sub_agent_b.agent import sub_agent_b

def _build_sub_agents() -> list:
    return [sub_agent_a, sub_agent_b]

root_agent = LlmAgent(
    name="my_app",
    sub_agents=_build_sub_agents(),
    ...
)

Worked fine locally with adk web. Blew up on Cloud Run.

Bug 1: `Agent already has a parent agent` on module reload

The error

pydantic_core._pydantic_core.ValidationError: 1 validation error for LlmAgent
  Value error, Agent `SubAgentA` already has a parent agent,
  current parent: `my_app`, trying to add: `my_app`

What's happening

ADK's agent_loader calls importlib.import_module(agent_name) on every request. On the first request, it loads the module fresh and creates root_agent. The LlmAgent constructor sets sub_agent.parent_agent = root_agent for each sub-agent.

On the second request, agent_loader reloads the module and re-runs the root agent's constructor. The sub-agent modules are evidently not re-executed (presumably they are still cached in sys.modules), so sub_agent_a and sub_agent_b are the same Python objects from the previous load, still carrying their parent_agent reference. When the new LlmAgent tries to assign the parent again, Pydantic's validator rejects it.

# Inside ADK's LlmAgent.__init__ (simplified)
for sub in sub_agents:
    if sub.parent_agent is not None:
        raise ValueError(f"Agent `{sub.name}` already has a parent agent ...")
    sub.parent_agent = self

This never surfaces locally because adk web loads the module only once per session. Cloud Run's reload-per-request behavior is what triggers it.

The fix

Reset parent_agent to None before passing sub-agents to the constructor:

def _build_sub_agents() -> list:
    agents = [sub_agent_a, sub_agent_b]
    for agent in agents:
        agent.parent_agent = None  # reset before each reload
    return agents

This is safe because the assignment happens synchronously before the new parent is set.
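
The collision and the fix can be reproduced without ADK installed. The sketch below uses a hypothetical `FakeAgent` class that models only the parent check from the simplified snippet above. It is not ADK's implementation, and `build_root` stands in for the module reload.

```python
class FakeAgent:
    """Hypothetical stand-in for an ADK agent; only models the parent check."""

    def __init__(self, name, sub_agents=()):
        self.name = name
        self.parent_agent = None
        self.sub_agents = list(sub_agents)
        for sub in self.sub_agents:
            if sub.parent_agent is not None:
                raise ValueError(f"Agent `{sub.name}` already has a parent agent")
            sub.parent_agent = self


# Module-level singletons survive across rebuilds of the root agent.
sub_a = FakeAgent("SubAgentA")
sub_b = FakeAgent("SubAgentB")


def build_root(reset_parents: bool) -> FakeAgent:
    """Rebuild the root agent, as a module reload would."""
    subs = [sub_a, sub_b]
    if reset_parents:
        for sub in subs:
            sub.parent_agent = None  # the fix from above
    return FakeAgent("my_app", sub_agents=subs)


first = build_root(reset_parents=False)   # first load succeeds
try:
    build_root(reset_parents=False)       # second load hits the stale parent
    collided = False
except ValueError:
    collided = True
second = build_root(reset_parents=True)   # with the reset, the rebuild succeeds
print(collided, sub_a.parent_agent is second)
```

An alternative with the same effect is to build fresh sub-agent instances inside a factory function instead of sharing module-level singletons, at the cost of constructing them on every reload.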

Bug 2: `Context variable not found` in instruction strings

The error

KeyError: 'Context variable not found: `hostname`.'

Traceback points here:

File ".../google/adk/utils/instructions_utils.py", line 124, in inject_session_state
    return await _async_sub(r'{+[^{}]*}+', _replace_match, template)

What's happening

ADK injects session state into agent instructions at runtime. The mechanism scans the instruction string with the regex r'{+[^{}]*}+' and replaces every {var_name} with the corresponding session state value.

If your instruction contains an example URL or any template-like text with curly braces:

The URL format is `https://{hostname}/api/{resource_id}/`

ADK sees {hostname}, looks it up in session state, finds nothing, raises KeyError.

My first instinct was to double-brace escape like Python's .format():

https://{{hostname}}/api/{{resource_id}}/

This does not work. The regex is {+[^{}]*}+ — it matches one or more { characters followed by non-brace characters followed by one or more } characters. {{hostname}} still matches.

The fix

Don't use curly braces for literal placeholder text in instructions:

The URL format is `https://<hostname>/api/<resource_id>/`

More broadly: any {word} pattern in an ADK instruction string is treated as a session state variable, regardless of how many braces you use. Use angle brackets, square brackets, or prose for template-like text in prompts.
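
You can check this with the pattern from the traceback, no ADK required. The snippet below runs only the regex against the three spellings. It does not reproduce ADK's lookup or the KeyError, just which substrings would be treated as candidates.

```python
import re

# The pattern shown in the traceback above.
PATTERN = re.compile(r'{+[^{}]*}+')

single = "https://{hostname}/api/{resource_id}/"
double = "https://{{hostname}}/api/{{resource_id}}/"
angled = "https://<hostname>/api/<resource_id>/"

# Each list holds the substrings the pattern would hand to the replacer.
print(PATTERN.findall(single))
print(PATTERN.findall(double))
print(PATTERN.findall(angled))
```

The single-brace and double-brace strings both produce two matches, while the angle-bracket version produces none, which is why it is the only spelling that passes through untouched.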

Summary

Bug	Trigger	Fix
`parent_agent` collision	Module-level singleton sub-agents + ADK module reload per request	Reset `agent.parent_agent = None` before passing to constructor
`Context variable not found`	`{word}` patterns in instruction strings	Use `<word>` or square brackets instead

Both are easy to fix once you know what's happening, but the error messages don't immediately point to the root cause. The parent_agent one is especially sneaky — it only appears in production where the module is reloaded per request, never in adk web during local development.

Managing AI Agent Skills with `npx skills`: A Practical Guide

Hiroshi Toyama — Sat, 11 Apr 2026 08:04:45 +0000

The Problem

AI agents like Claude Code, Cursor, and GitHub Copilot don't inherently know how to use every tool in your stack. You need a way to teach them. That's what npx skills does — it's a package manager for AI agent behaviors, built by Vercel Labs.

npx skills add microsoft/playwright-cli

This command fetches a SKILL.md from the specified GitHub repository and installs it into your agent's config directory (.agents/skills/ or .claude/skills/ depending on the agent).

How It Works

GitHub as the Registry

Unlike npm which uses npmjs.com, skills uses GitHub as its registry. The microsoft/playwright-cli argument maps directly to https://github.com/microsoft/playwright-cli. Any public GitHub repo with a SKILL.md at root is a valid skill source.

You can also install by full URL:

npx skills add https://github.com/microsoft/playwright-cli

SKILL.md as the Package Entry Point

Each skill repo contains a SKILL.md — the equivalent of index.js in an npm package. It contains:

Metadata: name and description of the skill
Tool definitions: commands the AI can invoke (e.g. playwright test)
Prompt instructions: when and how the AI should use the tool

.skills.json + skills-lock.json = package.json + package-lock.json

Concept	npm	skills CLI
Dependency manifest	`package.json`	`.skills.json`
Lock file	`package-lock.json`	`skills-lock.json`
Install directory	`node_modules/`	`.agents/skills/`
Registry	npmjs.com	GitHub
Install command	`npm install`	`npx skills experimental_install`

After npx skills add, your .skills.json will look like:

{
  "skills": [
    {
      "name": "playwright-cli",
      "remote": "microsoft/playwright-cli",
      "version": "latest"
    }
  ]
}

Key Commands

# Add a skill
npx skills add vercel-labs/agent-skills

# Add globally (user-level, not project-level)
npx skills add vercel-labs/agent-skills -g

# Target specific agents
npx skills add vercel-labs/agent-skills --agent claude-code cursor

# List installed skills
npx skills list
npx skills ls -g           # global skills
npx skills ls -a cursor    # filter by agent

# Search the registry
npx skills find typescript

# Update all skills
npx skills update

# Restore from lock file (equivalent of npm ci)
npx skills experimental_install

# Sync from node_modules to agent directories
npx skills experimental_sync

# Scaffold a new skill
npx skills init my-skill

Gotchas

`remove` Doesn't Update the Lock File

This is the biggest footgun:

npx skills rm microsoft/playwright-cli --all

This removes the skill files from your agent directories, but leaves the entry in skills-lock.json. The next time someone runs experimental_install, the skill comes back.

Workaround:

Run npx skills remove as usual
Manually edit .skills.json to remove the entry
Delete skills-lock.json
Run npx skills update or add remaining skills to regenerate a clean lock file
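
The two manual steps in the middle are easy to script. This is a minimal sketch that assumes the `.skills.json` shape shown earlier in this post. `drop_skill` is a hypothetical helper, not part of the skills CLI, and the demo runs against a throwaway directory.

```python
import json
import tempfile
from pathlib import Path


def drop_skill(project_dir: str, remote: str) -> bool:
    """Remove `remote` from .skills.json and delete skills-lock.json.

    Returns True if an entry was removed. File names and manifest shape
    follow the example in this post; they are assumptions about the CLI.
    """
    root = Path(project_dir)
    manifest_path = root / ".skills.json"
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    skills = manifest.get("skills", [])
    kept = [s for s in skills if s.get("remote") != remote]
    manifest["skills"] = kept
    manifest_path.write_text(json.dumps(manifest, indent=2) + "\n", encoding="utf-8")
    # Delete the stale lock file so the CLI can regenerate a clean one.
    (root / "skills-lock.json").unlink(missing_ok=True)
    return len(kept) != len(skills)


# Demo against a temporary project directory.
with tempfile.TemporaryDirectory() as tmp:
    entry = {"name": "playwright-cli", "remote": "microsoft/playwright-cli", "version": "latest"}
    (Path(tmp) / ".skills.json").write_text(json.dumps({"skills": [entry]}), encoding="utf-8")
    (Path(tmp) / "skills-lock.json").write_text("{}", encoding="utf-8")
    removed = drop_skill(tmp, "microsoft/playwright-cli")
    remaining = json.loads((Path(tmp) / ".skills.json").read_text(encoding="utf-8"))["skills"]
    lock_exists = (Path(tmp) / "skills-lock.json").exists()
print(removed, remaining, lock_exists)
```

Run the regeneration step from the list above afterwards, since this script leaves the project without a lock file.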

`experimental_` Prefix is Real

experimental_install and experimental_sync are genuinely experimental, and the prefix is part of the command name. There is no plain npx skills sync in the current version: use npx skills experimental_install to restore from the lock file, and npx skills experimental_sync to sync from node_modules.

Cache Behavior with npx

npx skills may run a cached older version. Force latest:

npx skills@latest add <repo>

For projects where everyone needs the same CLI version, add it as a devDependency:

npm install --save-dev skills

CI/CD Integration

Add to your CI setup to restore skills on each run:

- name: Restore AI agent skills
  run: npx skills experimental_install

This ensures every developer and CI environment uses exactly the same skill versions as defined in skills-lock.json.

Creating Your Own Skill

Any GitHub repo with a SKILL.md is installable. Create one with:

npx skills init my-skill

This scaffolds a SKILL.md that you push to GitHub. Anyone can then install it with:

npx skills add yourusername/my-skill

Browse existing skills at skills.sh.

Summary

npx skills is npm for AI agent capabilities. The mental model maps cleanly:

SKILL.md = index.js
.skills.json = package.json
skills-lock.json = package-lock.json
experimental_install = npm ci
GitHub = npm registry

The tooling is still experimental — particularly the lock file management on remove — but it's already useful for ensuring consistent AI behavior across team environments.
