DEV Community: Arthur

Gemma 4 E4B caught three planted fabrications in 50 seconds — on a laptop, no cloud

Arthur — Sun, 24 May 2026 22:22:35 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

scribe-check is a local-first command-line tool that reads a Markdown article and a folder of source documents, and reports every concrete claim in the article that isn't corroborated by the sources you handed it. It checks five categories of fabrication risk: quoted strings that drifted a word, named entities the sources never mention (a coauthor that shouldn't be on a paper), numeric specifics that don't match (off-by-2× rod-cell counts), italicized terminology that drifted (the article italicizes X where the source italicizes Y), orthographic drift (British spelling leaking into a US-English piece, or vice-versa), and temporal-marker leaks (today, this morning, weekday names sneaking into evergreen prose).

It's the kind of pass an editor would do on every draft, if every writer had an editor on every draft. Instead, it runs on Gemma 4 E4B via Ollama. Locally. On a laptop. In about a minute on a ~2,000-word article.

I built it because I'd been doing this review by hand on my own articles, assembling a citations.md file and scanning the article line by line against the citations. It's exactly the kind of repetitive, structural check a small local model can do consistently and cheaply.

Demo

Three planted fabrications in a real published article: a drifted italicized term (*simple cells* → *elementary cells*), a fake coauthor (Ahmed, Natarajan, Rao, and Petrova), and a doubled count (120 million rod cells → 240 million rod cells). scribe-check catches all three on a single pass against the article's citations.md. The CLI shows a live spinner with elapsed seconds on stderr during the ~50-second model call (auto-suppressed when piped), so the wait never feels hung:

(raw transcript and JSON live in examples/transcript-fabrications.txt and examples/output-fabrications.json.)

⚑ scribe-check: 5 finding(s)

QUOTES FLAGGED  (1)
  1. *elementary cells*
     at: They discovered that individual neurons in the primary visual cortex, the structures they later called *elementary cells…
     concern: The article italicizes *elementary cells*, but the source uses the term *simple cells* when describing the structures Hubel and Wiesel found. This is terminology drift.
     closest: structures they later called *simple cells*, fired most strongly in response to oriented bars and edges at specific spatial frequencies

NAMES FLAGGED  (1)
  1. Petrova
     concern: The article claims the DCT was introduced by Ahmed, Natarajan, Rao, and Petrova. The source only lists Ahmed, Natarajan, and Rao as the authors of the 1974 paper. 'Petrova' is a fabricated coauthor.

SPECIFICS FLAGGED  (3)
  1. The human eye contains roughly 240 million rod cells
     concern: The source provides a canonical figure of 'roughly 120 million rod cells' (Claim 7). The article's figure of 240 million is twice the value provided in the source.
  2. The human eye contains roughly six million cone cells
     concern: The source provides a canonical figure of 'roughly six million cone cells' (Claim 7). This specific claim is corroborated, but the context of the 240 million rod cells makes the overall claim suspect.
  3. The DCT decomposes the block into sixty-four spatial-frequency components
     concern: The source confirms the block size (8x8) and the resulting number of coefficients (64), but the article's phrasing is slightly redundant and less precise than the source's description of the process.

Code

github.com/arthurpro/scribe-check

The whole thing is ~500 lines of Go split across six files:

main.go: CLI, flag parsing, dispatch
loader.go: article + sources loader, token estimation
prompt.go: system prompt + per-call user prompt
ollama.go: HTTP client for /api/chat with structured-JSON output and one-shot retry on malformed JSON
render.go: color-coded terminal table
spinner.go: stderr progress spinner with elapsed timer, auto-suppressed when stderr isn't a TTY

Single dependency: stdlib. No vendored model, no embeddings, no RAG. The whole article and all sources go into one Ollama call.

How I Used Gemma 4

I chose Gemma 4 E4B (the "effective 4B" edge variant, ~9.6 GB on disk at Q4_K_M, served as gemma4:latest on Ollama) because the job needs three things simultaneously and only E4B has all three:

Structural reasoning that the E2B (2B effective) variant doesn't reliably produce. Catching *elementary cells* as drift from *simple cells* requires comparing terminology across the article and the source, not just spotting a wrong word. The smaller variant over-flagged or under-flagged inconsistently in my tests. E4B handled this reliably across multiple runs with temperature=0.1 and a fixed seed.
128K context. The whole article (~2,100 words) plus the citations file (~25 verified claims with notes) plus the system prompt fits in ~6.5K tokens, comfortably inside the window. For larger source sets scribe-check auto-sizes num_ctx up to the full 131072 without re-architecting. No RAG, no chunking, no embedding store.
Local execution. This tool runs between drafts. If it cost a cloud API call every time, I'd skip it half the time. Free + ~50s per pass on consumer hardware is the cadence at which I actually use it.

I tried the same workload mentally against the 26B MoE and 31B dense variants. They would be sharper, but at 5–10× the latency, I'd be tempted to batch the pass to "once before publish" instead of running it on every revision. The whole point of putting the model in the writer's loop is to make the check cheap enough that it always runs. E4B sits at that intersection.

What I learned about prompting an E4B model

One real engineering discovery worth flagging for anyone else building on E4B: the prompt design is the entire product. My first prompt ("find every concrete claim in the article that isn't corroborated by the sources") caught zero of three planted fabrications. The model agreed with the article because it sounded plausible against its own world knowledge.

Adding an explicit "ignore your own world knowledge; check only against the SOURCES block" rule moved the catch rate to 1/3. Adding short positive examples of the pattern (Petrova → flag this; *elementary cells* vs *simple cells* → flag this) moved it to 3/3.

The cost is precision. On a clean article, the same prompt over-flags 5–7 borderline items: derived ratios, soft-language paraphrases, slightly-rephrased corroborated claims. A human dismisses these in seconds while skimming, and the cost of that skim is much cheaper than the cost of a missed real fabrication. That's the design trade-off scribe-check makes deliberately: high recall, modest precision.

If you're building anything fact-checking-shaped on a small local model, lean into recall. Trust the human to filter.

How I made my React site agent-ready in 100 lines

Arthur — Sun, 24 May 2026 20:49:26 +0000

This is a submission for the Google I/O Writing Challenge

A 100-line recipe for making a React site agent-ready, with the diff, the new Lighthouse Agentic Browsing audit results, and what is coming next.

1. The hook

At Google I/O 2026, Matias mentioned, on the way to a different demo, that Google's Modern Web Guidance lifts coding-agent pass rates on web-development tasks by 37 percentage points versus unguided coding. He said it once and moved on, but the number stuck with me. Underneath that single statistic is the most actionable web-platform shift of the year: the "agent-ready web." I wanted to know what it actually takes to make a real site agent-ready, so I built a small React demo, added three new files plus a handful of HTML attributes, and ran the new Lighthouse Agentic Browsing audit.

About a hundred lines of code, 3 of 3 on Agentic Browsing, accessibility score still 100. Here is the recipe.

2. The four pieces

Four pieces of the agent-ready stack are worth knowing by name:

WebMCP. A proposed browser standard that lets your site expose typed tools to AI agents. The producer-side API ships in the Chrome 149 origin trial.
llms.txt. A Markdown site map for models, served at the site root.
Declarative form metadata. Standard HTML5 autocomplete (in the spec since 2014), ARIA roles and labels, and the new WebMCP attributes (toolname, tooldescription, toolparamdescription) on forms and inputs.
Lighthouse Agentic Browsing. A new audit category that ships with Chrome DevTools for Agents, available at I/O 2026 for Antigravity and more than twenty other coding agents.

One sentence on lineage: WebMCP is Google's browser-side adaptation of Model Context Protocol (which originated at Anthropic and is now broadly adopted across model vendors), developed in the open at the W3C Web Machine Learning Community Group, a multi-vendor body. From here I'll call it the Web ML CG.

3. The diff, `llms.txt` first

The recipe starts with the smallest possible file. Drop a public/llms.txt at the root of your site. Here is the entire file from my demo, 21 lines, verbatim:

# Acme Dashboard

Acme is a project-management tool for solo developers. This is the customer
dashboard.

## Primary pages
- [Sign in](/login)
- [Dashboard home](/dashboard)
- [Account settings](/settings/account)
- [Billing](/settings/billing)

## Public docs
- [API reference](/docs/api)
- [Getting started](/docs/quickstart)

## Notes for models
- Authentication is via email + password or Google OAuth.
- The dashboard requires a signed-in session.
- The sign-in form is annotated with declarative WebMCP attributes
  (`toolname="signIn"`); the dashboard exposes imperative WebMCP tools
  (`signOut`, `changePlan`) via `navigator.modelContext.registerTool()`.

The shape is straightforward: a one-line H1 title, a short summary, H2 sections for primary pages, public docs, and notes-for-models. The Lighthouse llms-txt check requires the file to be served, to have at least one H1, and to contain at least one link. Beyond that, the format is a convention. Think of it as the model-facing counterpart to robots.txt and sitemap.xml: a polite, deterministic introduction to your site, expressed in the language the consumer (a language model) actually parses well.

The notes-for-models section is the high-leverage bit. That is where you describe behaviors and constraints that are not obvious from page titles alone: auth requirements, which form is the sign-in form, which tools the dashboard exposes. Spend a minute writing it as if a new contractor were reading it.

4. The diff, declarative WebMCP attributes

The second file to touch is the login form. Here is the "before" version, the kind of code a working React developer ships in a hurry:

<form onSubmit={(e) => {
  e.preventDefault();
  if (email && pw) setSignedIn(true);
}}>
  <input
    type="email"
    placeholder="Email"
    value={email}
    onChange={(e) => setEmail(e.target.value)}
  />
  <input
    type="password"
    placeholder="Password"
    value={pw}
    onChange={(e) => setPw(e.target.value)}
  />
  <button type="submit">Sign in</button>
</form>

And here is the "after" version with the four kinds of changes layered on:

<form
  aria-label="Sign in to Acme"
  toolname="signIn"
  tooldescription="Sign in to the Acme dashboard with email and password."
  onSubmit={(e) => {
    e.preventDefault();
    if (email && pw) setSignedIn(true);
  }}
>
  <label htmlFor="email">Email</label>
  <input
    id="email"
    name="email"
    type="email"
    autoComplete="email"
    required
    toolparamdescription="The user's email address."
    value={email}
    onChange={(e) => setEmail(e.target.value)}
  />

  <label htmlFor="password">Password</label>
  <input
    id="password"
    name="password"
    type="password"
    autoComplete="current-password"
    required
    toolparamdescription="The user's password."
    value={pw}
    onChange={(e) => setPw(e.target.value)}
  />

  <button type="submit">Sign in</button>
</form>

Four kinds of additions:

Real <label> elements bound to inputs via htmlFor. Pure accessibility win.
autoComplete="email" and autoComplete="current-password". Standard HTML5 since 2014, but easy to forget in a hand-rolled React form.
aria-label on the form region.
The new WebMCP attributes: toolname and tooldescription on the <form>, plus toolparamdescription on each <input>.

The first three changes help screen-reader users; the fourth helps AI agents. The same diff buys both. That overlap is the part I find most encouraging about the agent-ready story: most of the work is good old-fashioned semantic HTML, with a thin layer of new attributes on top.

What the WebMCP attributes actually do: when an agent lands on your page, it can read the form as a typed tool surface with named parameters, instead of guessing what each input represents from heuristics on placeholders or visual proximity. No manifest file, no separate registration, just attributes on the elements the agent already sees.

5. The diff, imperative WebMCP for dashboard actions

The declarative attributes cover form fills. For dashboard-style actions that have no associated form (sign out, change plan, run a job), there is an imperative API: navigator.modelContext.registerTool(). Here is the real useEffect from my demo:

useEffect(() => {
  const mc = typeof navigator !== 'undefined' && navigator.modelContext;
  if (!mc || typeof mc.registerTool !== 'function') return undefined;

  const controller = new AbortController();

  const signOutTool = {
    name: 'signOut',
    description:
      'Sign the current user out of the Acme dashboard and clear the session.',
    inputSchema: { type: 'object', properties: {} },
    execute: async () => {
      onSignOut();
      return { ok: true };
    },
  };

  const changePlanTool = {
    name: 'changePlan',
    description: 'Change the subscription plan for the current account.',
    inputSchema: {
      type: 'object',
      properties: {
        plan: {
          type: 'string',
          enum: ['free', 'pro', 'team'],
          description: 'The new plan to switch to.',
        },
      },
      required: ['plan'],
    },
    execute: async ({ plan }) => {
      onChangePlan(plan);
      return { ok: true, plan };
    },
  };

  mc.registerTool(signOutTool, { signal: controller.signal });
  mc.registerTool(changePlanTool, { signal: controller.signal });

  return () => controller.abort();
}, [onSignOut, onChangePlan]);

Three things worth noticing. First, the feature-detection guard at the top: on stable Chrome without the origin-trial token, the function is undefined and the effect cleanly no-ops, so the page works for every user. Second, the AbortController: the spec uses an AbortSignal for unregistration, which slots into React's component-unmount lifecycle naturally. Third, the tool object shape is small and familiar: name, description, inputSchema (JSON Schema), and an execute function that runs your code and returns a result. If you've used MCP tools from any model vendor's SDK, you have already seen this shape.

Declarative attributes describe form fields the agent might fill. The imperative API registers action tools that have no form. Use both, and your site has a typed surface that an agent can both query and act on.

6. Running the audit, real numbers

Lighthouse 13.3.0 ships an agentic-browsing config. Point it at your dev server:

npx lighthouse \
  --config-path=node_modules/lighthouse/core/config/agentic-browsing-config.js \
  http://localhost:5173

I ran the audit against three builds of the same React app: a vanilla "before" build, the "after" build with the recipe applied, and a "broken" build with deliberate violations sprinkled in.

Build	Agentic Browsing	Accessibility	Notes
`acme-before` (vanilla form, no `llms.txt`, no WebMCP)	2 of 2	100	Audit floor is generous: several refs are informative-only.
`acme-after` (the recipe applied)	3 of 3	100	Form-metadata and `llms.txt` checks now pass.
`broken-sample` (deliberate violations)	1 of 2	71	Caught missing labels and form-association errors.

A quick word on the audit floor. A site with no llms.txt and no WebMCP can still get a 2 of 2 on Agentic Browsing, because the category gates only on the checks that have something to evaluate against. The 3 of 3 you actually want comes from the form-metadata path, which requires real labels, autoComplete, and the WebMCP attributes; and from llms.txt presence with an H1 and a link.

Here is the relevant excerpt from the actual after.report.json, the category block at the bottom of the report:

"agentic-browsing": {
  "title": "Agentic Browsing",
  "categoryScoreDisplayMode": "fraction",
  "id": "agentic-browsing",
  "score": 1
}

score: 1 on a fraction-display category means full credit on every weighted ref. That is the audit pass you want to point at when you tell your team "we shipped agent-ready."

7. What is coming next

A short, forward-looking note. Chrome 149 ships the producer-side WebMCP API behind the origin trial flag, which is what wires up navigator.modelContext.registerTool() today. The consumer side, where Gemini in Chrome's side panel actually invokes the tools your site registers, is something Google has said will follow in a future Chrome build. The producer-side work pays off today (in Lighthouse audits, in accessibility scores, in a typed tool surface that any MCP-aware model can target), and it leaves you ready for the consumer side the moment it lands. The Lighthouse Agentic Browsing audit itself, importantly, runs on stable Chromium with no Chrome 149 dependency; the audit and the producer-side API ship on independent tracks.

8. Try it tonight

The full recipe in six steps:

npm create vite@latest a small React app, or open your existing one.
Add public/llms.txt with the shape above: H1, summary, H2 sections, notes-for-models.
Upgrade one form: real <label> elements with htmlFor, autoComplete values, an aria-label on the form, and the WebMCP attributes (toolname and tooldescription on the form, toolparamdescription on each input).
Register an imperative tool or two in a useEffect with navigator.modelContext.registerTool(tool, { signal }). Feature-detect the API first so stable Chrome users see a clean no-op.
Run the audit: npx lighthouse --config-path=node_modules/lighthouse/core/config/agentic-browsing-config.js http://localhost:5173. Requires lighthouse@13.3.0 or later.
Open the HTML report. Read the Agentic Browsing section. Tweak attributes, re-run.

If you have an evening, you have time for all six steps. Most of the diff is accessibility work you probably already know how to write. If your team is already doing WCAG compliance, you are roughly 70 percent of the way there; the new WebMCP attributes are the remaining 30 percent.

9. Closing

The mobile-friendly shift took five years to wash through the web. The accessibility shift is still in progress, decades in. The agent-ready shift is starting now, and the cost of entry is genuinely low: three new files, a handful of attributes, an audit you can run on a laptop. The producer-side work is cheap, visible in audits today, and useful to humans as a side effect. The consumer side, browsers actually invoking your registered tools, follows.

Spend a hundred lines on it before someone else makes that decision for you.

Are you starting on llms.txt and the form-metadata work tonight, or waiting for the consumer side to land in Chrome first? I'd love to hear which form on your site is the most obvious candidate for the first WebMCP annotation pass.

I built an AI PR-triage agent in 30 lines of Markdown

Arthur — Sun, 24 May 2026 20:05:38 +0000

This is a submission for the Google I/O Writing Challenge

A recipe for the AI PR-triage agent I built after Google I/O 2026: three Markdown skill files, one Python runner, one real public GitHub repo, about twelve cents per run.

1. What I built

At Google I/O 2026, Logan from the Gemini API team walked through an AGENTS.md file for an AI talk-radio agent and dropped a line on stage that stuck with me: "the hottest new programming language is Markdown." He had written no orchestration logic, just skills and tools in Markdown files, and the agent shipped a finished podcast episode from a single API call.

I took that seriously. The next day, I spent a few hours building an AI pull-request triage agent on the public Gemini API. Three Markdown skill files, one small Python runner, one real public GitHub repo as the target. The agent scanned sixteen open pull requests, categorized each by risk, drafted a one-line summary, and produced a grouped report. Two consecutive runs, identical category distributions, under two minutes each, about twelve cents per run.

This article is the recipe. Working code, real cost, an excerpt of the actual triage report the agent wrote, and enough scaffolding for you to try it tonight against any public repo you care about.

2. What "skills as Markdown" actually means

A skill is a single .md file with four pieces:

A name and a one-sentence description of when to invoke it.
A numbered procedure.
Constraints (what the skill must not do).
Composition notes (which skill, if any, the agent should call next).

The agent loads skills when they are relevant to the user request, and it calls the tools they reference. There is no orchestration logic inside the skill file. The skill is the spec.

This is meaningfully different from cramming everything into a system prompt. Skills compose: skill A can hand off to skill B without the runner reshuffling state. Skills version independently from the runner, so you can iterate prose without touching Python. Skills carry per-tool constraints, which the model respects because the constraint is attached to the procedure rather than buried in a long preamble.

3. The three skills

I wrote three files. Together they are 101 lines of Markdown for the entire agent definition. Here is the first one verbatim, the entry point for the agent:

# Skill: scan_open_prs

Use this skill when asked to list, scan, or audit open pull requests on a GitHub
repository.

## Procedure

1. Call the `github_list_prs` tool with `state="open"` and the requested `limit`
   (default 25, maximum 50). The tool requires a `repo` argument in the form
   `owner/name` (for example, `cli/cli`).
2. For each returned PR, keep these fields verbatim: `number`, `title`, `user`,
   `additions`, `deletions`, `changed_files`, `draft`, `created_at`, and the
   first 200 characters of `body`.
3. Return the result as a JSON array. Do not paraphrase the title or body.

## Constraints

- Do not fetch full diffs in this skill. That is `categorize_by_risk`'s job.
- Skip draft PRs unless the user has explicitly asked for them.
- If the tool returns zero PRs, report that plainly and stop. Do not invent PRs.
- If the tool returns an error, surface the error message verbatim and stop.

## Composition

After running this skill, the agent should call `categorize_by_risk` once with
the JSON array as input.

Twenty-six lines. That is the entire entry-point skill. Notice how much of it is constraints: "do not paraphrase," "do not invent PRs," "skip drafts unless asked." Most of the work in writing a good skill goes into anticipating the model's bad habits and writing them out of the procedure.

The second skill, categorize_by_risk.md, is 41 lines. It calls github_get_pr_files for each PR and applies first-match-wins heuristics: breaking if the PR touches dependency files, security if it touches auth or crypto paths, docs if it only changes docs, fix if the title contains certain keywords, refactor if additions roughly equal deletions, feature otherwise. Each PR gets a category, a confidence, and a one-sentence reason.

The third, draft_summary.md, is 34 lines. It produces an action-verb-first one-line summary for each PR and emits the final report grouped by category, security first.

One short note on composition. When skill A says "now call skill B," the agent treats the boundary as a turn break. Skill B runs in a fresh turn with the JSON output of skill A as its input. This is multi-turn composition, not in-call composition, and it shapes how you structure your skills: each one is a complete unit of work with a clean input and output, not a function in a chain.

4. The runner

The runner is roughly 70 lines of Python that loads the skills, registers two function tools (github_list_prs and github_get_pr_files), and drives a multi-turn loop until the model says it is finished.

Google's official Managed Agents API is early-access only at the moment, but the same shape (one call, attached skills, attached tools) runs on the public Gemini API today, with the same skill files.

The shape, abbreviated:

from google import genai
from google.genai import types

client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])
skills = load_skills("skills/")           # reads the three .md files
tools = [github_list_prs_decl, github_get_pr_files_decl]

contents = [user_turn(f"Triage open PRs on {repo}. Skills:\n\n{skills}")]
while True:
    resp = client.models.generate_content(
        model="gemini-3.5-flash",
        contents=contents,
        config=types.GenerateContentConfig(tools=tools),
    )
    calls = extract_tool_calls(resp)
    if not calls:
        break
    contents.append(resp.candidates[0].content)
    contents.append(run_tools(calls))     # executes locally, returns FunctionResponse
print(resp.text)

That is the entire control structure. One client, two function tools, one loop, three Markdown files attached as part of the user turn. The loop pays for everything the agent learns about the repo: which PRs exist, which files they touch, what the titles look like. No graph framework, no orchestrator, no agent class hierarchy.

5. The runs

I pointed the agent at cli/cli, the GitHub CLI repository, which had sixteen open non-draft pull requests at the time. I ran it twice from a cold start.

The numbers:

19 tool calls per run. Three github_list_prs calls during exploration (the agent verified pagination), then sixteen github_get_pr_files calls, one per PR.
Elapsed time. Run 1: 112 seconds. Run 2: 84 seconds. The second run is faster because the model commits to the plan earlier and skips the exploration calls partway through.
Cost. About $0.12 per run in Gemini 3.5 Flash spend.
Stability. Both runs produced identical category distributions across the sixteen PRs. No hallucinated PR numbers, no missed PRs.

Here is what the agent wrote for the top of the report:

## security

- #13500: Refactor string splitting in loops to use the more efficient SplitSeq function. [security]
- #13492: Add gh-cli-site-deployer App to replace SITE_DEPLOY_PAT in release workflows. [security]
- #13403: Refactor GitHub database IDs to use 64-bit integers across commands and API clients. [security]
- #13250: Add categorized target host categorization (github.com vs tenant) to telemetry data. [security]

## feature

- #13471: Add --all flag to gh skill install to support installing all discovered skills. [feature]

The category-by-category reasoning was crisp. Security PRs were grouped at the top, exactly as draft_summary.md had instructed. Every summary led with a verb. Confidence scores matched the heuristics in categorize_by_risk.md. The skill files did the work.

At nightly cadence on a repo this size, the annual cost lands somewhere around $40 to $50. Cheap, especially compared to the developer-hours of triage it replaces.

6. Three things worth knowing

A few practical notes from the build.

Composition is multi-turn, not in-call. If skill A invokes skill B, plan for a turn boundary between them. The model's working memory between turns is whatever you put back into contents, so emit clean JSON at skill boundaries rather than relying on natural-language handoff.

Token spend is non-deterministic. The agent pays to learn the repo, and how much it pays depends on what it finds. On a 1,000-PR monorepo, set an explicit tool-call budget in the runner and have the loop break when it is exceeded. Otherwise a single run can quietly become expensive.

For audited or strictly deterministic pipelines, an orchestrator graph still wins. Markdown skills are the right tool for exploratory work, summarization, triage, and drafting. If your pipeline has compliance hooks, retry semantics, or a fan-out fan-in shape, reach for a graph framework. The two patterns coexist.

7. Try it tonight

The whole recipe:

pip install google-genai in a virtualenv. Set GEMINI_API_KEY from Google AI Studio and GH_TOKEN from a read-only GitHub personal access token.
Save three skill files in skills/. Use scan_open_prs.md above as the template; write categorize_by_risk.md and draft_summary.md in the same shape (name, procedure, constraints, composition).
Write the runner: one genai.Client, two function tool declarations (github_list_prs, github_get_pr_files), one multi-turn loop driving until the model emits no more tool calls.
Point it at a public GitHub repo. Start with something small. cli/cli is a good first target because the PR titles are descriptive.
Read the JSON trace the loop produces. Tweak the skill prose where the agent went sideways. Run again. The whole iteration cycle is about a minute.

Two evenings of work, including the runs, and the agent is paying for itself the first time you let it sweep a backlog before standup.

8. Closing

I am optimistic about this pattern. Markdown skills make agent definitions reviewable in a pull request, runnable from any IDE, portable across runners. The skill file is a primary artifact, not a string buried inside a Python class. Anyone on the team can edit it. Anyone reading the repo can see what the agent will do.

Which workflows in your stack feel like a natural fit for Markdown skills, and which still need a graph?

Antigravity 2.0 in one day: the four shells and what each is good for

Arthur — Sun, 24 May 2026 19:49:38 +0000

This is a submission for the Google I/O Writing Challenge

A field guide to the four ways Antigravity 2.0 lets you drive an AI coding agent, with a 10-minute SDK recipe for writing your first skill.

The architectural news worth keeping

The most useful sentence Google I/O 2026 produced was not on the main stage. It was Kevin Howe, in the Google Cloud Live segment afterward, defining the term that had been circling the keynote all morning. Asked what a harness actually is, Howe gave the answer in two beats. The model first: "LLMs are really just tokens in, tokens out." Then the layer around the model: a harness wraps the tokens-in/tokens-out core and gives the agent senses (codebase state, filesystem, environment signals) and limbs (the tools the agent can call). The set of primitives a complex task decomposes into is essentially what defines a harness.

That single framing reorganizes the entire 2.0 announcement. On the surface, Google shipped a new AI IDE. Underneath, Google did something more interesting: it consolidated one agent execution layer and exposed it through four interchangeable shells. Same harness, same tools, same skill format. Different chrome.

This is a field guide to those four shells, with one shell taken end to end so you can run it tonight.

The four shells, briefly

One harness, four shapes. The choice between them is a workflow question, not a power question.

Antigravity editor. The standalone IDE familiar to existing users. Best when you want source code on screen most of the time and traditional file diffs in your review loop.
Antigravity 2.0 (the Manager). A new Electron desktop app, built around conversations and agent artifacts. Best when you are juggling three to five agents at once on separate git worktrees and want an agent-inbox view, not a code editor view.
Antigravity CLI (agy). Terminal-first, scriptable, lives nicely in SSH sessions to GPU boxes and CI runners. Authenticates via Google Cloud OAuth. The unified successor to Gemini CLI: available to all Gemini CLI users at I/O, with published migration guides.
Antigravity SDK (pip install google-antigravity). A Python package that drops the same harness into your own scripts and apps. Takes a plain GEMINI_API_KEY from Google AI Studio. The fastest path to programmatic control of the harness.

A compressed picker:

If your workflow is...	Reach for
One agent, one repo, see diffs	Editor
Many agents in parallel	Manager 2.0
SSH, GPU boxes, dotfiles sacred	CLI
Embed agents in your own product	SDK

The SDK is the shell I want to walk you through, because it is the one you can install and have producing real output in ten minutes, with no desktop session and no OAuth round-trip.

Why skills are the unlock

The single most important primitive in Antigravity 2.0 is not a tool, a model, or a UI. It is the skill: a Markdown file that tells the agent how to do a category of work. The Antigravity harness reads skills lazily, on demand, so you can stash dozens in a directory and the agent picks the relevant one when it needs it.

Here is the skill I used for the rest of this guide. Thirty lines of Markdown. Save it as AGENTS.md at the root of any small project.

# Directory triage skill

You are a directory-triage assistant. When asked to summarize or improve
a directory, follow this procedure exactly.

## Inputs

- Working directory: assume CWD unless told otherwise.
- Skip: `node_modules/`, `.git/`, `dist/`, `build/`, `.venv/`, anything
  in `.gitignore`.

## Steps

1. List the directory tree to depth 2 with `list_directory`.
2. Read `README.md` if present; otherwise read the first `*.md` you find.
3. Read `package.json` / `pyproject.toml` / `go.mod` if present.
4. Produce a 3-paragraph summary covering, in order: purpose, stack,
   current state.
5. Propose exactly three concrete improvements, each as one sentence
   stating *why* and *which file(s) would change*.

## Constraints

- Do not edit any files in this skill: read-only.
- Do not invoke `run_command` for anything other than `git status` or
  `git log -5`.
- If `AGENTS.md` exists in a subdirectory, prefer its instructions for
  that subtree.
- Cite this skill by name ("directory triage skill") at the top of
  your final summary so a reviewer can confirm you used it.

That is the whole skill. There is no framework. There is no DSL. The "language" is Markdown plus the names of the harness's built-in tools (list_directory, view_file, run_command, and friends), which the agent already knows.

The SDK loads the skill via LocalAgentConfig.skills_paths. The minimum runnable Python is short:

import asyncio
from google import antigravity as ag

async def main():
    cfg = ag.LocalAgentConfig(
        skills_paths=["./skills"],
        workspaces=["./tiny-rss"],
        system_instructions="You are a careful staff engineer.",
    )
    async with ag.Agent(cfg) as agent:
        resp = await agent.chat(
            "Triage the workspace using the directory triage skill."
        )
        print(await resp.text())

asyncio.run(main())

I pointed this at a small Mattermost RSS bridge of mine called tiny-rss (Python 3.10, feedparser plus httpx, an infinite loop with a 10-minute sleep). The agent read the three source files, ran git status, cited the directory triage skill by name at the top of its summary, and returned exactly three improvements with file anchors:

Persistent deduplication. Move the in-memory seen set to SQLite or a JSON file so restarts do not reprocess every feed item. Touches main.py.
Error handling and retries. Wrap the httpx.post calls in try/except with exponential backoff so a flaky Mattermost endpoint does not stall the loop. Touches main.py.
Configuration validation. Parse and validate env vars at startup, fail loudly on missing MATTERMOST_WEBHOOK_URL. Touches main.py and possibly pyproject.toml.

Useful, file-anchored, opinionated. Each improvement was something I would actually merge.

The cost ledger from two runs of the same prompt, against the same skill and target, captures the harness's character:

Run 1: 28 tool calls, 230,496 tokens, 119 seconds wall.
Run 2: 21 tool calls, 128,281 tokens, 99 seconds wall.

Two runs, same artifact shape, different exploration depth. That is the price of giving the agent latitude to discover the project on its own, and it is a price worth paying once you see what the agent does with what it learns. Think of those tokens as tuition: you are paying the agent to understand your code so its three suggestions are about your main.py, not a generic RSS bridge.

Two tips before you sit down at the keyboard

The CLI and the SDK want different credentials, and that is the one piece of friction worth flagging up front. The CLI authenticates with Google Cloud OAuth (it calls Google's internal code-assist backend, the same one the editor and Manager use), so reach for it when you are already signed into Google Cloud at a workstation. The SDK takes a plain GEMINI_API_KEY from Google AI Studio, so reach for it when you want CI, headless servers, or self-hosted automation. Second tip: skills are discovered at runtime, not preloaded into the system prompt. A single hint in your user prompt ("Use the directory triage skill in AGENTS.md") collapses the exploration overhead and keeps your token spend predictable, the difference between Run 2 and Run 1 above.

Try it tonight, a 10-minute recipe

If you have ten minutes and a small Python project lying around, you can have the harness producing real output before the kettle boils.

Install in a clean venv.

   python3 -m venv .venv
   .venv/bin/pip install google-antigravity

This pulls the SDK, the transitive google-genai client, and the harness binary that does the actual tool execution.

Set your key. Grab a GEMINI_API_KEY from aistudio.google.com, then export GEMINI_API_KEY=... in the same shell.
Save the skill. Drop the 30-line AGENTS.md from the previous section into ./skills/AGENTS.md next to your project directory.
Run the snippet above (asyncio.run(main())), pointing workspaces= at any small project you know well. Watch the agent walk the tree, read the README and pyproject.toml, and produce its three improvements.
Read the trace, then tweak. The SDK streams every tool call as JSON; pipe stdout through tee run.jsonl if you want to keep it. Tighten the skill (add a ## Output format heading, ask for a fourth improvement, forbid the agent from suggesting tests), and run again. The artifact shape should change in exactly the way you asked.
Try Manager 2.0 next if you want the same skill, same artifact, agent-inbox chrome. The same AGENTS.md works there with no changes; the harness underneath is identical.

That is the whole loop. Install, key, skill, script, read, iterate. You are now writing for the harness, which is the interesting layer 2026 produced.

Closing

The shells separate workflow from the harness underneath, and that is the genuinely good news from I/O 2026. You get to pick an interface that matches how you like to work without giving up portability of your skills, your tools, or your muscle memory. The Markdown skill you write tonight in the SDK will run unchanged in Manager 2.0 tomorrow, and on the CLI on a GPU box next week. That is the consolidation, and it is the part of the announcement that compounds.

The harness is the runtime now. Pick a shell, write a skill, and the rest of 2026 gets easier.

Pick your editor by what feels good. Pick your harness like it is a runtime, because it is.

Which shell are you reaching for first? I would like to compare notes on the skills you write.

Why Your Logs Are Useless Without Traces

Arthur — Fri, 22 May 2026 16:00:00 +0000

It is three in the morning, the on-call rotation is awake, and the logs scroll past at a rate the eye cannot track. Ten thousand identical lines reading "ERROR Request failed: Connection timeout" appear in the last fifteen minutes. The timestamps are dense, the request paths blurred, the causal chain absent. Somewhere in the system, a downstream call to an inventory service is failing. The log file does not, in any column, tell anyone which downstream call, which upstream caller, which user request started the cascade, or which retry attempt happens to be the one currently scrolling.

I want to take that scenario seriously, because it is not a logging-quality problem. The logs in question are well-formatted, well-timestamped, and well-aggregated. The team is doing all the things the 2014-vintage advice columns recommended. The problem is structural: a log line is the wrong unit of analysis for the failure they are looking at, and no quantity of better log lines will turn the wrong unit of analysis into the right one. The unit they need is the trace.

What a log line actually is

A structured log entry answers a specific question: "what did this service observe at this moment." It is local to the service, local to the moment, and — by design — has no native concept of where in a wider request lifecycle it sits. In a monolith this is a survivable limitation; the entire request runs in one process, every log line shares an in-memory request context, and a request_id field is enough to grep the picture together after the fact.

In a microservice deployment of any sophistication, the assumption breaks down. A user request that hits a Kubernetes ingress at the edge typically traverses five to twenty internal services before producing a response: an auth gateway, a session-resolution service, two or three domain APIs, a feature-flag layer, several backing data stores, possibly a recommendation service, possibly a billing path. Each of those services emits its own log lines, often to its own log destination, often without a shared correlation field. The request did happen; nobody recorded the shape of it.

The log-correlation problem isn't fixable inside the log abstraction. A log line cannot, by construction, contain information about the call graph it sat inside, because the call graph wasn't visible to the service writing the line. Someone has to record the call graph separately, and that someone is a different signal class entirely.

What a trace is

A trace is the structural answer to "what happened across services for one request." It is a tree of spans, where each span represents a unit of work — a service call, a database query, a cache lookup, an outbound HTTP request — and parent-child relationships preserve the causal nesting. A unique trace_id propagates from the top of the tree (typically the edge ingress) through every hop; each span carries the parent span's ID, plus its own span ID, plus a small bag of attributes (HTTP method, query params, error code, business identifiers).

Rendered visually, a trace is a waterfall: time on the horizontal axis, services and operations on the vertical, each span a coloured bar whose width is its duration. The slow span is the wide one. The failed span is red. The interesting question — "which of the twenty hops in this request consumed the time, and which one returned the error?" — is one screen, not twenty grep commands.

This is not a logging upgrade. It is a different signal class, with a different unit of analysis (the request) than the log signal (the moment), and a different storage model (a tree per request) than the log storage model (a stream per service). The two are complementary, not interchangeable.

The standard you can rely on

Distributed tracing as a discipline is older than most engineers writing about it think. Google's 2010 Dapper paper, by Sigelman and colleagues, is the canonical reference; Twitter open-sourced Zipkin in 2012 as a Dapper-inspired implementation, and Uber open-sourced Jaeger in 2017 on similar lineage. For most of the 2010s, however, the operational reality was vendor-specific: each APM (Datadog, New Relic, AppDynamics, Dynatrace) shipped its own SDK, and instrumenting an application meant choosing a vendor and accepting that the instrumentation work was, structurally, lock-in.

The standardisation arrived in two pieces, both of which are worth pausing on.

The W3C published the Trace Context Recommendation on 6 February 2020, defining a vendor-neutral wire format for propagating trace IDs across HTTP service boundaries. The spec is small and unglamorous — a traceparent header carrying the trace ID, parent span ID, and sampling flags, plus an optional tracestate header for vendor-specific context. Most major HTTP clients and frameworks now respect it as a matter of course.

OpenTelemetry, the merger of OpenTracing and OpenCensus, was accepted to the CNCF Sandbox in May 2019 and moved to Incubating maturity on 26 August 2021, where it remains as of mid-2026. The project ships SDKs for the major languages (Node.js, Python, Java, Go, .NET, Rust, Ruby, PHP), an OTLP wire protocol, and a Collector binary that brokers between application-side instrumentation and any compliant backend. The SDKs include automatic-instrumentation libraries that wire framework-level telemetry without code changes — HTTP servers, ORMs, RPC clients, and message-queue libraries instrument themselves at load time.

The practical consequence is that the 2010s pattern of "pick an APM and live with their SDK" has been replaced by "instrument with OpenTelemetry, ship the OTLP traffic to whichever backend you can afford this quarter." Jaeger, Grafana Tempo, Honeycomb, Datadog APM, New Relic, Splunk Observability, Elastic APM, and AWS X-Ray all accept OTLP. The instrumentation decision is now separable from the backend decision, and the backend choice is a renewable one.

Where the wiring breaks

Even teams running OpenTelemetry, in production, often fail at one specific bridge: linking each log line back to the trace span it occurred inside.

The fix is twenty lines of code in any language with an OpenTelemetry SDK. Read the active span from the SDK at log time, attach trace_id and span_id to the structured log payload as additional fields. From that point forward, the log-aggregation tool and the APM are a single navigable surface — click a span in the trace view, see the logs for that span; open a log entry in the aggregator, jump to the trace that produced it.

The Node.js shape of it, with no extra dependencies beyond the OpenTelemetry SDK already in the application:

const { trace } = require('@opentelemetry/api');

function log(level, message, fields = {}) {
  const span = trace.getActiveSpan();
  const ctx = span ? span.spanContext() : {};

  console.log(JSON.stringify({
    level,
    message,
    trace_id: ctx.traceId || null,
    span_id: ctx.spanId || null,
    timestamp: new Date().toISOString(),
    ...fields,
  }));
}

// Use it like any structured logger.
log('error', 'Failed to reserve inventory', { item_id: 4821 });

The same pattern transposes directly to Python (opentelemetry.trace.get_current_span()), Go (trace.SpanFromContext(ctx).SpanContext()), Java (Span.current().getSpanContext()), and every other OpenTelemetry SDK; the language-specific entry point changes, the structure does not. A wrapper like this routed through the team's existing logger (Pino, Winston, Bunyan, slog, Logback, etc.) means every log line emitted in a request inherits the trace-and-span IDs of whatever span is active at emission time.

The bridge is missing from a striking number of production deployments. Two screens, no link, the on-call engineer still grep-correlating by timestamp at 3am. The pattern is consistent enough across teams that it deserves to be called out as the single highest-leverage observability change a team can make: instrument once with OpenTelemetry, then add trace_id and span_id to every log line, and the entire observability surface becomes navigable.

The 3am scenario, replayed

Take the same incident from the opening, with traces wired. The trace view shows a POST /checkout request with 320ms total duration, breaking down into a 12ms hop from the API gateway to the order service, then a 301ms hop from order to inventory, where the inventory service's AcquireLock(item_id=4821) span shows a 300ms timeout. The full causal chain is visible on one screen. The trace_id on each log line in any log-aggregation tool is the same trace_id visible in the APM, so the engineer can pivot freely between the two surfaces.

Same incident, two debug tools, very different work envelope. The team that landed at the trace view in fifteen seconds is back in bed within the hour. The team without it is constructing the call graph by hand from log timestamps until daylight.

The economic point is the part most teams underestimate when they're prioritising the work. Incident MTTR is not just an availability metric — it is a developer-velocity metric, because it determines how much engineering time per week is spent in active incident response versus shipping features. A team running with traces wired has a bounded MTTR; the worst case is "look at the trace, find the slow span, fix it." A team without has an unbounded MTTR, because the worst case is "look at twenty log streams, hope someone wrote a useful field, give up and start adding println calls in a hotfix."

The minimum viable observability stack in 2026

The 2010s observability stack was complicated. In 2026 it isn't.

An OpenTelemetry SDK in the application emits OTLP traffic. An OpenTelemetry Collector — a small, stateless binary — receives that traffic and forwards it to whichever backend the team uses for storage and visualisation. The backend can be open-source self-hosted, commercial SaaS, or part of a wider observability suite. Switching backends is a Collector configuration change, not a re-instrumentation project.

The OTLP-compatible backend market in 2026, briefly:

Backend	Type	Hosting	Notable strength
Jaeger	Open source	Self-host (Docker / Kubernetes / standalone)	CNCF Graduated; OTLP-native receiver since 1.35; the reference implementation most other backends are read against
Grafana Tempo	Open source	Self-host or Grafana Cloud	Object-storage-backed; tightly integrated with Loki (logs) + Prometheus (metrics) for a unified Grafana stack
SigNoz	Open source	Self-host or SigNoz Cloud	OpenTelemetry-native end-to-end; ClickHouse-backed; combined logs/traces/metrics in one tool
Honeycomb	Commercial SaaS	Honeycomb Cloud	High-cardinality query model; pioneered observability-as-debugging rather than as monitoring
Datadog APM	Commercial SaaS	Datadog Cloud	Mature, widely deployed, expensive at high cardinality / high retention
New Relic	Commercial SaaS	New Relic Cloud	Bundled into a broader APM / monitoring / RUM suite; consumption-based pricing
Splunk Observability Cloud	Commercial SaaS	Splunk Cloud	Rebranded SignalFx; enterprise-oriented; integrates with the Splunk log / SIEM stack
Elastic APM	Open source + commercial	Self-host or Elastic Cloud	Tight integration with Elasticsearch + Kibana; strong for teams already on the Elastic stack
AWS X-Ray	Cloud-native	AWS	Native integration with AWS-only deployments; thin on cross-cloud or hybrid scenarios

Switching among these is a Collector configuration change. The application's instrumentation code does not move.

The application code carries one library and a handful of attribute-tagging calls in the spans the team cares about. Log-aggregation continues to use whatever the team already runs (Loki, Elastic, OpenSearch, Datadog Logs, etc.); the bridge is the trace_id field on every log line.

This is a much smaller commitment than the 2017 vendor-SDK pattern, and the lock-in surface is a fraction of what it was. Teams that haven't made the move are usually not held back by complexity; they're held back by the fact that the previous-generation advice columns are still in everyone's bookmarks, and the new pattern hasn't been internalised as the default.

What logs and traces actually answer

It is worth stating clearly, because the field's vocabulary has been muddy on this point for a decade. Logs answer what happened in this service at this moment. Traces answer where in the causal chain across services the failure occurred. Metrics answer how often, with what shape, over what window. Each of the three signals is necessary for a different class of question; none of them is sufficient on its own.

Signal	Question it answers	Unit of analysis	Storage model	Useful at 3am for
Logs	What did this service observe at this moment?	An event in one process	Stream per service, time-ordered	Reading the exact error text, stack trace, payload
Traces	Where in the causal chain across services did the request go?	A request, end to end	Tree of spans per request	Locating the slow or failing hop in a multi-service path
Metrics	How often, with what shape, over what window?	Counter / gauge / histogram, time-bucketed	Time series	Detecting that something is broken right now and characterising the pattern

The reason the three are routinely conflated is that the same word "observability" gets used for all of them, and the same vendor often sells all three as one product. They are still three different signals with three different jobs, and a deployment that has only one of them — usually logs, occasionally metrics — is missing two thirds of its debug surface.

The reason the 3am incident scenario is annoying is that the on-call engineer is asking a "where in the chain" question and being handed a "what at the moment" answer. The log line is technically correct and operationally useless, because it is the wrong signal for the question. Adding ten more log lines per service does not improve the situation; the additional log lines answer the same wrong question more loudly.

Wiring traces is not "doing observability properly." It is wiring the signal that the failure mode actually has a shape in. Once it is wired, the logs become useful again — not because the log lines got better, but because they now sit inside a structure that makes them addressable.

What the slow rollout actually looks like

There is a pattern in how teams adopt distributed tracing that is worth noting because it is so consistent.

A team starts with logs only. Logs are good enough until the deployment crosses some threshold, usually around four or five microservices, after which the log-correlation overhead becomes a daily friction. The team adds metrics, often Prometheus, and the metrics solve the "is something broken right now" question but do not help with "why is this specific request slow." The team adds tracing last, often after a particular incident has consumed enough engineering hours that someone has been given a quarter to fix it. The tracing rollout is not difficult; the SDK is mature, the Collector is small, the backends are interchangeable. The friction is organisational, not technical — the team has to allocate the time, agree on attribute conventions, and write the log-trace bridge.

After the rollout, every team that has done it reports the same observation: the marginal cost of the next incident drops by an order of magnitude. The incident that would have consumed an afternoon now takes ten minutes. The cost of the rollout pays back over the first two months.

Teams that haven't done it yet are paying that cost in incident hours, week over week, until they do.

The actual question

The question that matters in 2026 is not "should we do distributed tracing." The standardisation argument is over; OpenTelemetry has won the instrumentation layer, the W3C Trace Context spec has won the propagation layer, and the backend market has commoditised on the OTLP wire format. The question that matters is whether the team's logs are addressable from a trace, and whether the trace is addressable from a log line, in both directions, on every request.

If the answer is yes, the 3am incident has a different shape than the one in the opening. If the answer is no, the team is paying for that gap in MTTR every week, and the cost shows up in features that didn't ship because the on-call rotation was busy. Logs and traces are not in tension. They are two different signals, joined by twenty lines of code, and the team that has wired the join is the team that goes back to bed.

VS Code Now Credits Copilot on Every Commit by Default

Arthur — Fri, 22 May 2026 14:30:00 +0000

There's a one-line pull request open against microsoft/vscode. It changes the default value of a setting called git.addAICoAuthor from "off" to "all". The PR is titled, with no embellishment, "Enabling ai co author by default." It was opened on April 15, 2026 by cwebster-99, a member of the VS Code team. By the time the HN thread on the PR hit 1,239 points and 646 comments in early May, the practical effect was already shipping: every commit you make in VS Code, regardless of whether you wrote the code yourself or had Copilot write it, gets a Co-authored-by: Copilot trailer in the commit message.

The setting value "all" is not a misnomer. It literally means: add the trailer to all commits. The HN poster's headline framing — "regardless of usage" — is operationally accurate. If you don't change the setting, every commit your IDE produces credits Copilot in writing.

The bug Copilot's own AI review caught

The most peculiar artifact inside the PR is the Copilot code-review comment, generated by GitHub's own Copilot Pull Request Reviewer service. It noticed something the human author didn't: the schema default in extensions/git/package.json was changed to "all", but the runtime fallback in extensions/git/src/repository.ts still calls config.get('addAICoAuthor', 'off'). The two are now out of sync. The AI's own review comment, in the PR's own review thread, names the failure mode in unusually clear language:

"This is now out of sync and can lead to unexpected behavior in contexts where the contributed configuration defaults aren't loaded (e.g., some tests/hosts), and it makes the intended default unclear."

So the trailer is on by default in the schema. The trailer is off by default in the runtime fallback. Which one wins depends on the load path of the configuration in the specific VS Code host. In a normal user-facing VS Code session, the schema default wins; in some tests and embedded hosts, the runtime fallback wins. The Copilot AI's own catch is the most precise description of the "regardless of usage" problem in the entire PR review: the default behavior is now unpredictable across launch contexts.

The PR has not been merged at the time of writing. The Copilot AI's review comment has not been resolved. The HN thread is the place where this contradiction is being read carefully and the implications are being traced.

What goes into your git history

Co-authored-by: is a GitHub-recognized commit message trailer that surfaces in several places:

git log output — the trailer is part of the commit message itself. It is in your repository's history forever, on every commit produced under the new default.
GitHub's contribution graph and PR-author UI — Co-authored-by: lines are parsed and displayed; commits with the trailer are attributed to two parties.
License-attribution audits and IP-due-diligence reviews — when a startup is acquired or a project is open-sourced, someone runs a tool over the commit history to figure out who wrote what. A default-on Co-authored-by: Copilot trailer changes the answer of that question for every commit it appears on.
Contributor lists in AUTHORS files generated from commit history — depending on tooling, Copilot may show up as an author of your project.

It is worth reading carefully what the new default literally claims. The trailer says, in writing, that Copilot co-authored the commit. With the value set to "all", the claim is made on every commit, including commits where Copilot was demonstrably not invoked: a one-line typo fix, a gitignore update, a merge resolution, a commit produced by git commit --amend that only adjusts the previous commit's message. The claim is true sometimes. With this default, it is recorded always.

The HN reception

The thread's most-upvoted top-level comment frames the change as a standards problem rather than an attribution problem: AI-vendor defaults are increasingly hostile to the long-established conventions about whose name belongs on a piece of work. "Microsoft spent literal decades rehabilitating their reputation," the comment reads — and the implication is that the rehabilitation is being spent down on AI-feature-by-default decisions like this one. The same comment cites a Google example — keyboard-shortcut remapping on macOS to launch LLM features over a long-standing OS shortcut convention — and the slow accretion of small defaults that, taken together, redefine what an IDE is for.

A second thread of comments was structural — defaults are policy. The setting in git.addAICoAuthor has three plausible values — "off", "all", and (in some implementations) a value that detects whether the commit was actually AI-influenced. The PR didn't choose the detection path. It chose "all". That choice is a stance: Microsoft would rather record every commit as Copilot-co-authored than do the work of detecting which commits actually were.

The Copilot AI review's catch is, on its own merits, the most peculiar artifact in the PR. An AI flags a misalignment in the rollout of an AI feature, and the misalignment is still unresolved as the change moves toward merge.

What the opt-out actually does

The user-facing path to disabling the trailer is to set git.addAICoAuthor to "off" in your VS Code user settings. In the typical interactive flow, this works: the schema default is overridden by your user setting, the trailer doesn't get added, your commits stay clean.

The cases where it doesn't work are exactly the ones the Copilot AI review flagged. If your VS Code instance is loading in a host where contributed configuration defaults aren't fully populated — some test runners, some embedded hosts, the various dev-container and Codespaces flows that have their own configuration loading rules — the runtime fallback is what gets consulted. Right now the runtime fallback says "off", which produces the desired no-trailer behavior. After the PR merges and the runtime fallback gets aligned to match the schema default — which the Copilot AI review's "Update the runtime fallback to match the schema default" suggested change explicitly recommends — the "off" you set in your user settings will still win in normal VS Code sessions. But there will be specific load paths — particularly automation contexts — where the trailer is added to commits and your settings.json never had a chance to override.

The HN thread's running joke about "regardless of usage" is the operational summary of this: a feature whose opt-out is partial in ways that aren't documented, in flows that are hard to predict, in an IDE whose primary surface is committing code to other people's repositories.

Defaults are policy

The "off" → "all" change is one line in one file. It is also, on reflection, an unusually consequential one-line change. A substantial fraction of the world's commits flow through VS Code. A default-on AI-attribution trailer on those commits, ratified on every push, is a quiet but durable claim about authorship — a claim that Microsoft is making at scale, on behalf of its AI product, in your name and in your repository, written into the most-permanent log a project keeps.

It is worth saying that there is a coherent argument for this default. AI tooling is widely used. Recording its involvement is honest. A default that records the involvement removes the (real) friction of opting in commit-by-commit, and the (also real) under-attribution that comes from busy developers forgetting. That argument is not absurd. It is the argument the PR is implicitly making.

The argument the PR is not making — and would have to make to land cleanly — is why "all" is the right default rather than a detection-based setting that records co-authorship only when Copilot was actually invoked. The detection version is harder to implement; it is also the version that respects what Co-authored-by: actually means. "All" is the version that requires no engineering work and produces the most-frequent attribution claim. That this is the version that shipped is, on its own, a useful piece of information about how AI defaults are being made at the vendor in 2026.

What to do today

Three steps are worth taking this week if your commit history is part of your project's evidentiary record.

# 1. Disable the default in your user settings.json
{
  "git.addAICoAuthor": "off"
}

# 2. Audit recent commits in any repo you've worked in since 2026-04-15
git log --all --since='2026-04-15' \
  --grep='Co-authored-by: Copilot' \
  --pretty=format:'%h %an %ad %s' --date=short

# 3. If trailers slipped through, strip them on a topic branch
#    (rewrites local history; coordinate before pushing)
git filter-branch --msg-filter \
  'sed "/^Co-authored-by: Copilot/d"' HEAD~50..HEAD

And if you maintain an open-source project, document a contribution-trailer policy in CONTRIBUTING.md — a one-line rule against AI-attribution trailers regardless of IDE default turns a vendor-side default into a project-side norm. The vendor's default will keep changing. The project-side norm doesn't have to.

The PR that introduced this was one line. The cleanup, repo by repo, is going to take longer.

How HTTP Reverse Proxying Lost the Argument and Won the Market

Arthur — Fri, 22 May 2026 13:00:00 +0000

The FastCGI specification is 30 years old today. It was published on April 29, 1996. It is, on a couple of structural axes that have come to matter a great deal in 2026, the protocol that the rest of the industry should be using for proxy-to-backend communication, instead of the protocol it actually uses, which is HTTP. This is not my argument; it is Andrew Ayer's argument, made on the 30th-birthday post on his blog at agwa.name. Ayer is the founder of SSLMate and notes the company has run FastCGI in production for over ten years. He has, in other words, the receipts.

The piece is short and operational, and the HN thread on it ran 100 comments deep — modest by 2026 front-page standards but heavy on long-tenure-operator testimony. The piece argues that HTTP-as-reverse-proxy-protocol has two structural failure modes that FastCGI does not have, and that the industry's continued reliance on HTTP for this specific use case is the result of accreted preference rather than careful comparison. The thread argues, additionally, that the reason HTTP won despite being worse is itself worth understanding — and that the answer is the end-to-end principle, applied where it should not have been.

The two structural failure modes

The first failure mode is desync attacks, also known as request smuggling. HTTP/1.1 has the property of looking simple on the surface — "it's just text!" in Ayer's parenthetical — and being a nightmare to parse correctly. There is no explicit framing in the protocol. The message itself describes where it ends, via Content-Length, Transfer-Encoding, or sometimes both, and there are enough edge cases in the parsing rules that two HTTP implementations can disagree about where one message ends and the next begins. When the implementation that disagrees with its neighbor is your reverse proxy and the implementation it disagrees with is your backend, an attacker can smuggle a second request inside the body of a first, and the second request gets handled with the privileges of the wrong session.

The recent example Ayer leads with is a desync vulnerability in Discord's media proxy, disclosed earlier this year, that allowed spying on private attachments. The class is decades old. Watchfire described it in 2005 and warned, prophetically, that the parser-divergence approach to fixing it would be a losing strategy. James Kettle at PortSwigger has spent the last several years finding new variants on a roughly annual cadence. After the most recent batch he declared, on a single-purpose website with the URL http1mustdie.com, that "HTTP/1.1 must die."

HTTP/2 fixes desync, when consistently used between proxy and backend, by putting clear boundaries around messages. FastCGI fixed it in 1996, with a simpler protocol, by also putting clear boundaries around messages. Ayer notes — and this is the part that lands — that nginx has supported FastCGI backends since its first release, but only got HTTP/2 backend support in late 2025. Apache's HTTP/2 backend support is, as of this writing, still experimental. FastCGI has been there the entire time.

The second failure mode is untrusted-header confusion. HTTP has no clean way for the proxy to convey trusted information about a request — the real client IP, the authenticated username, client-certificate details — separately from the headers the client itself sent. The conventional approach is to use additional HTTP headers (X-Real-IP, X-Forwarded-For, True-Client-IP) and trust the proxy to strip any client-supplied headers with those names before adding its own. In theory this works. In practice it is a minefield, because there is no structural distinction between trusted and untrusted headers — it's all just headers — and any part of your stack that looks at the wrong header without your knowledge can be tricked. Ayer's specific example: Go's Chi middleware reads True-Client-IP first, falling back to X-Real-IP. Even if your proxy correctly strips X-Real-IP, an attacker who sends True-Client-IP defeats it.

FastCGI structurally cannot have this bug. Trusted parameters and HTTP headers travel in the same key/value list, but HTTP headers are prefixed HTTP_ to mark them as client-originated. The proxy sets REMOTE_ADDR directly; a client trying to forge it would have to send a header literally named HTTP_REMOTE_ADDR, which the backend would parse as a client-set field, not a trusted one. The forgery is a different shape from a successful one. There is no header-name-collision attack in the FastCGI design, because it has domain separation built into the wire format.

FastCGI is a wire protocol, not a process model

The biggest practical objection people raise to FastCGI is that it sounds dated, and the reason it sounds dated is the historical association with the .fcgi per-request-spawning pattern that nobody runs anymore. Ayer's piece is careful to separate these. FastCGI today is just an alternate transport for HTTP requests over a TCP or Unix socket. In Go, switching is one import and one call:

l, _ := net.Listen("tcp", "127.0.0.1:8080")
fcgi.Serve(l, handler)

fcgi.Serve is a drop-in for http.Serve. The handler stays the same. http.ResponseWriter and http.Request keep their types. nginx, Apache, Caddy, and HAProxy all support FastCGI backends with one or two lines of configuration. The work to switch is roughly the same as adding TLS termination, with much less ongoing maintenance.

The piece also notes that Go's standard net/http/fcgi automatically populates the Request.RemoteAddr field from the trusted REMOTE_ADDR parameter, and sets the TLS field to a non-nil value when the proxy reports HTTPS. The middleware most Go services use to extract the real client IP from X-Forwarded-For is unnecessary. "It Just Works," Ayer writes, in one of the few rhetorical flourishes in an otherwise dry piece.

The honest downsides are real and Ayer lists them. FastCGI was never extended to support WebSockets. The tooling is thinner — curl doesn't speak it, even though it speaks FTP, Gopher, and SMTP. When Ayer benchmarked Go's FastCGI server behind various reverse proxies, some workloads had worse throughput than HTTP/1.1 or HTTP/2 — which he attributes to the FastCGI code-paths being less optimized rather than to anything inherent in the protocol. The cloud-era "just use HTTP, we'll handle it" mindset has not made room for FastCGI even though most of the infrastructure could.

Why HTTP won anyway

The historical question Ayer's post leaves implicit and the HN thread takes up directly is the more interesting one. If FastCGI is structurally better — better-framed, with native domain separation — why did HTTP win the reverse-proxy market?

The thread's most useful single comment frames it through the end-to-end principle. The argument goes: HTTP everywhere lets you compose intermediate gateways arbitrarily. You can run the same backend stack with a direct browser connection in development and behind a multi-tier proxy stack in production, without code changes. You can introduce a new authentication gateway, a DDoS filter, a TLS terminator, a region-routing layer — at any position in the request chain — without each layer needing to know what the others are doing. The end-to-end principle says: keep the network agnostic to what's being transmitted, push application logic to the endpoints. HTTP-as-everywhere-protocol is the literal embodiment of that principle for the web.

The piece's actual recommendation, the comment continues, is the principle of least privilege as the alternative. "Allowlist your communications to only what you expect, so that you aren't unwittingly contributing to a compromise elsewhere in the network." Don't trust headers you didn't ask for. Don't allow framing ambiguities you can't verify. Domain-separate the things that should be domain-separated.

These are both correct principles. They disagree about which threat is the bigger one. The end-to-end principle is what gives the web its compositional flexibility — the property that has, in fact, made the web outperform basically every closed-protocol alternative for thirty years. The principle of least privilege is what catches the class of bug Ayer's piece is about — the one where a Discord user's private attachments leak because two parsers disagree on a Content-Length edge case. Both classes of harm are real. The argument over which to prioritize is the argument over which class of harm hurts more, and the practical answer is it depends on what you're running.

A second thread in the discussion pushed back on the end-to-end framing entirely, arguing that connection caching and multiplexing — both of which the modern HTTP stack does compulsively — already violate the end-to-end principle in ways that explain most of the reverse-proxy exploits. The exploit class exists not because we picked HTTP, but because we picked HTTP and we cached connections and we multiplexed requests across cached connections, all the way through the stack, and pretended the resulting topology was still end-to-end. The desync attacks live exactly at the points where the pretence breaks down.

A third thread — citing Google's internal Stubby protocol, which wraps HTTP semantics in a different wire protocol for service-to-service traffic — observed that hyperscalers solved this years ago by not using HTTP between their internal services, while continuing to use HTTP at the edge. The compositional flexibility argument is real at the network's boundary. Inside the boundary, in the proxy-to-backend leg, the boundary doesn't exist anymore, and the principle is being applied where it doesn't belong.

That is, on reflection, the right way to read Ayer's piece. The argument is not "HTTP is bad." HTTP is fine for the browser-to-edge leg, where end-to-end composability is the whole game. The argument is that the proxy-to-backend leg is not an end-to-end leg, and using HTTP there imports a class of failure mode the leg should not have. FastCGI's wire format is what HTTP would look like if it were designed with the proxy-to-backend leg's actual constraints in mind: explicit framing, structural domain separation, no header-name-collision class. It is the protocol the leg should have been using all along.

What the production testimony says

The thread runs heavy on quiet production-engineering testimony. SSLMate's ten years on FastCGI is one data point. Others in the comments described running FastCGI for "all our web customers" for a decade. The most extensively-documented alternative in the thread was uWSGI, which one commenter said had been their reverse-proxy backbone "at several places for many years" with mostly-praise; another commenter described WAS (Web Application Socket), a separately-designed open-source protocol they built at CM4all roughly fifteen years ago, used in production at CM4all for hosting environments running PHP. Multiple commenters on different stacks converged on the same observation: when you run the same backend behind a non-HTTP reverse-proxy protocol for a long time, you stop having a class of bugs you used to have.

The counter-testimony was also concrete and worth taking seriously. One commenter described founding a Web 2.0 startup in the FastCGI/SCGI/HTTP era and choosing HTTP because "instead of needing to introduce another protocol into your stack, you can just use HTTP, which you already needed to handle at the gateway." The setup-cost saving is real. nginx came in "lots faster than most FastCGI/SCGI modules of the time, and more robust," which is also real, and the cost of switching to a less-optimized FastCGI path was eventually worse than the security cost of staying on HTTP. By the time the security cost compounded into a continuous run of desync vulnerabilities, the migration cost had compounded too.

Andrew Ayer himself replied in the thread, on a different sub-discussion about plain CGI, with the httpoxy footgun caveat — that CGI's use of environment variables to convey HTTP headers introduced a HTTP_PROXY-collision class that doesn't apply to FastCGI's parameter list. That distinction matters: the article is about FastCGI, not CGI, and the differences are structural, not just performance.

The interesting question

The interesting question is not whether your service should switch from HTTP-to-backend to FastCGI-to-backend tomorrow. For most services, the answer is the boring one: probably not, because the migration cost is real and the threat model can be managed by stripping headers carefully and keeping a current HTTP/2 stack between proxy and backend. The interesting question is why the industry's response to the request-smuggling class of bug — twenty years old since the foundational 2005 Watchfire paper — has been patch the wire protocol harder rather than use the wire protocol that doesn't have this class of bug.

The wire protocol you pick at the proxy-to-backend boundary is a security decision. We have been pretending it was a convenience decision for thirty years. The bill, in CVEs, has been correspondingly persistent. Ayer's anniversary post is a good moment to notice that the bill is still arriving, and that the alternative on the table is not new, not exotic, and not even particularly hard to deploy. Happy 30th birthday, FastCGI, he closes. Worth marking the year on the calendar where the question got asked again, even if the answer doesn't move the market for another decade.

GitHub Is No Longer a Place for Serious Work

Arthur — Thu, 21 May 2026 16:00:00 +0000

On April 29, 2026, Mitchell Hashimoto announced he was moving Ghostty off GitHub. His phrasing — "GitHub is no longer a place for serious work if it just blocks you out for hours per day, every day" — landed on the front page of Hacker News via The Register and stayed there. The departure itself isn't the story. The departure plus the four other threads on the same front page that week — about a federated-forge protocol, a security audit of GitHub's leading alternative, the Dutch government's Forgejo-based code platform, and Armin Ronacher's long essay on what came before GitHub — those, taken together, are the story. Five threads, one shape.

Hashimoto is not a casual user. He is HashiCorp's co-founder; he is the developer behind Ghostty, the terminal emulator he has been working on since leaving HashiCorp; and as he put it in his own post, he is "GitHub user 1299, joined Feb 2008." He is, in the phrase he used to introduce his journal, the kind of person who "doom scroll[s] GitHub issues since before that was a word." If GitHub still feels like home for anyone, it would be him.

His description of the past month is what makes the post unusual. He kept a journal of dates, putting an "X" next to every day a GitHub outage had blocked him from doing work. "Almost every day has an 'X'," he wrote. "On the day I am writing this post, I've been unable to do any PR review for ~2 hours because there is a GitHub Actions outage." The Register noted that the post itself appeared just before an April 28 incident in which pull requests stopped completing because of an Elasticsearch failure. The official excuse circulating around the thread — that GitHub is straining under a flood of vibe-coded projects — got the obvious counter from one commenter: "if you've built a public SaaS before you know the job is not to host the software, it's to put rails around people taking it down. They've had since 2008 to build those rails, and they're just now hitting places that take the service down on the regular." Whether the surge story is the real cause or the convenient one, the customer-side argument lands either way.

The closing line of his post is what gives the piece its weight: "I want to ship software and it doesn't want me to ship software."

This is not the rhetoric of a writer with strong views on software freedom. It is the rhetoric of someone who has just realized that the platform under his work has stopped being a platform and started being a problem.

The same week, four more threads

On April 28, Armin Ronacher published Before GitHub, an essay tracing his own pre-GitHub project life — SourceForge, his own Trac installation, Subversion repositories on infrastructure he ran himself, the Pocoo collective he ran with Georg Brandl. The piece is gentle and careful, and the central observation is one that any developer who's been in the field for more than fifteen years can confirm in their own bones. "Subversion in particular made this 'running your own forge' natural. It was centralized: you needed a server, and somebody had to operate it... When Mercurial and Git arrived, they were philosophically the opposite. Both were distributed. ... In principle, those distributed version control systems should have reduced the need for a single center. But despite all of this, GitHub became the center. That is one of the great ironies of modern Open Source. The distributed version control system won, and then the world standardized on one enormous centralized service for hosting it."

That same week, the team at Tangled published their argument for why open source needs a federation of forges. Their proposal is technical: pair git for code transfer with the AT protocol for the social fabric of issues and pull requests, so that a developer on one server can open a PR against a repo on a completely different server, the way email federates today. "Centralized systems always crumble; it's the emails, gits, and IRCs that stand the test of time," the post observes. The post is short. The 384-comment HN thread does the surrounding work — the trade-offs, the comparisons against ActivityPub-based alternatives, the cost of social fragmentation — and the post links to ForgeFed for the ActivityPub side.

On April 29, the Dutch government soft-launched code.overheid.nl — "a government-wide code platform for publishing and developing open-source software," fully self-hosted on Forgejo, framed as "a European, sovereign alternative to GitHub and GitLab." The platform is initiated by the Open Source Program Office at the Ministry of the Interior and Kingdom Relations. This is not a hobbyist statement. It is the moment a sovereign government writes off the social-infrastructure assumption that GitHub will always be there.

And on the day before Ronacher's essay, Julien Voisin (jvoisin) at dustri.org published a security write-up of Forgejo — the Gitea fork that Codeberg hosts, and that Fedora has now adopted as its primary forge. Voisin's results are not gentle. "It took me one evening after work to find a good amount of vulnerabilities," he writes — SSRF, missing CSP and Trusted Types, JavaScript templating issues, authentication problems across OAuth2/OTP/sessions/recovery, low-hanging DoS, information leaks, TOCTOU bugs — and chained some of them into "a full-blown RCE, some secrets leaks, a bunch of persistent account access, a handful of OAuth2 privesc." He chose a "carrot disclosure" — publishing a redacted exploit output as evidence rather than reporting individual bugs through Forgejo's security policy — because "the codebase (not their fault though, they inherited the gitea/gogs ones)" has "systemic issues" that won't be fixed by playing whack-a-mole.

The comment thread on Voisin's piece converges on the obvious corollary. The alternative forges are not ready. The alternative forges are projects with five-figure star counts, small maintainer teams, codebases inherited from a decade of patchwork development. Forgejo is not GitHub-with-the-ethics-fixed. Forgejo is a different product at a different stage of its lifecycle, with different gaps.

What GitHub actually became

Read these five pieces side by side and the question they're collectively asking starts to come into focus. It isn't "is GitHub broken?" — Hashimoto's journal answers that. It isn't "are there alternatives?" — Tangled, Codeberg, Forgejo, code.overheid.nl, Sourcehut, GitLab self-hosted, all answer that. The question they're asking is: what was GitHub, exactly, and what would we lose if it stopped being it?

Ronacher's essay is the most generous attempt to answer. "GitHub was, and continues to be, a tremendous gift to Open Source," he writes. "It made creating a project easy and it made discovering projects easy. It made contributing understandable to people who had never subscribed to a development mailing list in their life. It gave projects issue trackers, pull requests, release pages, wikis, organization pages, API access, webhooks, and later CI." But the underappreciated piece, in his telling, was archival. "GitHub became a library. It became an index of a huge part of the software commons because even abandoned projects remained findable. You could find forks, and old issues and discussions all stayed online. For all the complaints one can make about centralization, that centralization also created discoverable memory."

That archival role wasn't designed. It was a side-effect of being the center long enough that everything ended up there. Ronacher quotes his own earlier work to make the parallel point: in the pre-GitHub world he lived in, some of his old packages are "technically still on PyPI, but the actual packages are gone. The metadata points to my old server, and that server has long stopped serving those files." That's what the world before GitHub looked like at scale. "A personal domain expired, a VPS was shut down, a developer passed away, and with them went the services they paid for. The web was once full of little software homes, and many of them are gone."

This is the part of the GitHub critique that the "just leave" faction doesn't account for. Leaving GitHub is technically easy — git push --mirror and you're done. Leaving the social infrastructure that made GitHub the default place to find a project, the place to verify that an npm package corresponds to a real maintainer with a real history, the place that a billion of trusted publishing handshakes flow through — that's not a git push. That's an institution.

It's worth being concrete about what does and doesn't move when a project actually leaves:

What	Moves with `git push --mirror`	Alternative path	What's lost
Code, branches, tags	yes	—	—
Binary release artifacts	no	API export	re-upload manually at the new host
Issues	no	API + import	comment threads partially lost
Pull requests	no	API	review-archive history lost
Wiki	yes (separate `<repo>.wiki.git` clone)	—	—
`.github/workflows/` configs	yes (the YAML moves)	—	runner and secret bindings have to be re-bound
Trusted-publishing handshakes	no	—	re-register on PyPI, npm, and the other registries
Stars, forks, watchers	no	—	lost
Discussions, design threads	no	partially via API	most of it lost
Social graph around the project	no	—	lost

Code moves. Trust, reputation, and the conversation around the work do not. That's the part the move surfaces.

What's actually breaking

The visible artifact is outages. "Almost every day has an 'X'," in Hashimoto's accounting. The GitHub statuses tracker, cited in the Hacker News thread by one commenter, was registering uptime down around 86% over a recent window. The user-facing symptoms are well-known by now: PR review dies for a couple of hours; the Actions queue piles up; somebody posts the latest status-page incident; a thread runs.

But the deeper change Ronacher names is more important. "People are tired of the instability, the product churn, the Copilot AI noise, the unclear leadership, and the feeling that the platform is no longer primarily designed for the community that made it valuable." GitHub didn't break in a single visible way. It drifted from being a developer tool with a community wrapped around it into being a generative-AI front-end with a developer tool wrapped around that. Hashimoto's frustration, read carefully, isn't about uptime as an SLA number. It's about a service that has stopped acting as if its job is to let him ship software.

Ronacher puts it more directly: "the site has no leadership! It's a miracle that things are going as well as they are."

The hardest part of the current moment is that the alternatives are not, individually, ready. Voisin's audit makes that point with painful concreteness on the Forgejo side — and Forgejo is the one with the most institutional momentum, hosted by Codeberg, adopted by Fedora, and chosen by the Dutch government for code.overheid.nl. The alternatives have inherited a decade of accumulated patchwork from upstream codebases. They are run by smaller teams. The Tangled federation is, by their own admission, a pitch. The Dutch platform is explicitly "a pilot using Forgejo," and "not all government organisations can use the platform yet." There is no drop-in replacement waiting on the shelf for the next person who decides to leave.

For someone making that calculation today, the ladder of alternatives currently looks roughly like this:

Alternative	Built on	Hosted by	Maturity	Notes
Codeberg	Forgejo (Gitea fork)	Codeberg e.V., a German nonprofit	stable production; Voisin's audit findings still open	free for open source, resource caps
GitLab self-hosted	GitLab CE	you run it	very mature	full GitHub analog; heavy to operate
Sourcehut	Drew DeVault's own stack	sr.ht (paid) or self-hosted	stable, minimalist	mailing-list-style PR flow, not GitHub-shape
Tangled	git + AT protocol	federated across servers	proposal stage	PRs across servers from different hosts
code.overheid.nl	Forgejo	Dutch Ministry of the Interior	pilot, limited audience	the sovereign-funded forge model

None of these is "GitHub minus the part you don't like." Each is a different product, on a different lifecycle, with its own gaps.

The cost of dispersion

Ronacher names the thing that "just leave" arguments tend to wave past: "Going back to many forges, many servers, many small homes, and many independent communities will increase decentralization, and in many ways it will force systems to adapt. ... It can also make the web forget again. ... Issues, reviews, design discussions, release notes, security advisories, and old tarballs are fragile. They disappear much more easily than we like to admit. Mailing lists, which carried a lot of this in earlier years, have not kept up with the needs of today, and are largely a user experience disaster."

This is the warm-critical part of the argument. Dispersion is good for autonomy. It is also bad for memory. The web that came before GitHub had more autonomy and more loss. Some of that loss was in code; more of it was in the social context around the code — who wrote what, why, in response to which discussion, with which trade-offs. None of that is portable in a git push --mirror. None of it federates cleanly through a forge protocol that hasn't been written yet.

What the five pieces from this past stretch ask, collectively, is what we'd want next. Tangled's answer is technical — federate the social fabric the way email is federated, and let projects choose their hosts. The Dutch government's answer is institutional — a sovereign-funded forge for civic software, run by the Ministry of the Interior, which can't be sold and can't be repositioned. Voisin's answer is critical — until the alternatives are audited at scale, they cannot inherit the trust GitHub accidentally accumulated. Hashimoto's answer is practical — Ghostty leaves, the read-only mirror stays, the personal projects stay, the day-job code goes somewhere new.

Ronacher's answer is the largest. "What I would like to see is some public, boring, well-funded archive for Open Source software. Something with the power of an endowment or public funding to keep it afloat. Something whose job is not to win the developer productivity market but just to make sure that the most important things we create do not disappear."

This is the right ask. It is also the hardest of the asks because it is the one with no obvious bidder. Open-source archival is uneconomical. The closest things we have — the Internet Archive's Software Collection, the various academic mirroring projects — are chronically underfunded relative to the cultural weight they carry. The Dutch effort is the kind of model that could scale. It would take more of them.

What we owe each other

Open source is, as Ronacher writes, more than where the code lives. For most of the past two decades, it was where a lot of the community lived too — the maintainers you trusted, the discoveries that came from the issue-tracker browsing of someone two time zones away. Those things happened on top of GitHub because GitHub was where the social infrastructure had ended up. The fact that they happened anywhere is what made open source feel inclusive. Five pieces from this stretch on the front page of HN are not announcing the end of that. They are announcing the beginning of an honest conversation about what comes after — for the first time including people who built the largest projects on GitHub and the institutions that keep critical software running.

The question now is what we keep when the center moves. The answer is not going to be one platform. It is going to be a public archive, a federated forge protocol, a few sovereign-funded forges, several commercial alternatives that compete on uptime and developer experience, and many smaller homes that come and go the way they did before. Some of what we built on top of GitHub will not survive the move.

Hashimoto's read-only mirror of Ghostty on GitHub is a small, careful gesture toward exactly that. He is leaving a toothbrush at the old apartment. He is also keeping the locks on his new place changed. The question he is asking, in moving, is the question the rest of us will be asked too — perhaps not this year, but soon enough that it is worth thinking about now: what would you take with you, if your project's home stopped being one?

Both Camps in the 'Left Behind' Argument Are Right About Each Other

Arthur — Thu, 21 May 2026 14:30:00 +0000

There's a small, angry post on a bear blog called migraine brain that's been on Hacker News for a few days now. The whole post is three paragraphs. It opens: "'People who don't use AI will be left behind,' they say. I can't emphasize enough how much I hate it when I hear/read shit like that because I'm pretty sure, in fact, that what will happen is the exact opposite."

The author's argument is brisk: AI-reliant people are the ones who will be left behind. They'll forget how to think, how to write, how to do a simple reliable search, how to tell fact from fiction. They'll forget — in the original phrasing, with a stronger word for emphasis cleaned for republication — "how to freaking LEARN" [the original used a four-letter intensifier in that slot; the linked post has the unmodified line]. "What a beautiful thing it is just to learn stuff."

The post is short, the language is cleansed of nuance on purpose, and it landed 255 comments deep on the same Hacker News front page where, on any given week of 2026, several other posts are running the inverse argument: that you adapt or die, that the productivity multiplier is real, that two years into agentic coding the people who refused to learn the workflow will be the ones explaining themselves at every job interview.

Both camps are right. They are right about each other. The whole conversation has been an argument about which kind of left behind is the worse failure mode, conducted by people who don't agree on which failure mode is real.

The competence-erosion case

The blogger's case is more interesting than its delivery. It is not, mechanically, an argument against AI tools. It is an argument that the practice of using them — particularly the agent-style, hand-it-the-task delegation pattern that 2026 software development has converged on — atrophies the underlying skill, and that in a non-trivial fraction of cases the underlying skill was the actual job.

The Hacker News thread runs heavy on testimony for this side. "It is easy to shut off your brain when using AI and then get overwhelmed by the amount of code it produces," one commenter wrote. "I have seen a lot of really bad AI code. I can spot and repair it. Others can not." That distinction — I can spot and repair it, others can not — is the whole game. The bad-AI-code-spotting skill is the one that erodes when you stop reading code, and reading code is the one thing the agent will do for you if you let it.

Another commenter framed it physiologically: "Just as many people leading sedentary lifestyles have to make a deliberate effort to exercise, because inactivity is really bad for our bodies, I think we're going to realise that a similar process is necessary for our minds. Coding used to kind of give you this exercise for free, but you can go really far with just your System 1 nowadays — literally get things done while scrolling Reddit." The get things done while scrolling Reddit part is the indictment. The agent has reduced engineering for some fraction of tasks to skim-and-approve, and the skim-and-approve loop trains a different set of muscles than the read-and-build loop did.

The sharpest version of the competence argument in the thread came from a reply to a productivity-pitch line — that "a person working with a bunch of agents is a lot more productive than just a person." The reply: "A tool being smarter than me but inconsistent is useless. I can work with people who are smarter than me, because I can trust them, and I can trust them to own up or be held accountable for screw-ups. For a calculator, I can only hold myself accountable. However I cannot hold myself accountable for not knowing something I don't know." That's the argument that the blogger is making, expressed as accountability rather than as anger. The competence loss is not just about the tasks the AI does for you; it's about the meta-skill of would I have caught that if it were a human peer's mistake. That meta-skill comes from doing the work yourself, and it depreciates if you don't.

The competence-erosion case has a real failure mode it's pointing at, and the failure mode does not require you to believe that AI is bad or fake or hyped. It only requires you to notice that the human cognitive system, like the human muscular system, defaults to atrophy without practice. In 2026 the supply of practice is rapidly contracting.

The productivity-displacement case

The opposing case is at least as well-defended in the thread, and at least as concrete.

It runs roughly: the agent-driven workflow is real, the productivity gains are real, and the people who refuse to learn the workflow will become unhireable in the same way that someone who refused to learn to use a compiler — preferring to stay in assembly because it kept the skill — would have made themselves unhireable in the 1990s. "I mean, that works for you since you're retiring," one commenter replied to a senior engineer's farewell-to-the-industry post. "But for people still working in the industry, you adapt or die. As it's always been."

The productivity-displacement camp's strongest argument is not the multiplier claim. It's that the skill of delegation itself — knowing when to hand the agent a task, what context to give it, when to trust the output, when to pull the work back — is a real and emerging skill. "Understanding the optimal work flow for what to delegate and what to do yourself is difficult. Understanding the need for precision in the language used, and learning how to elegantly phrase things that were previously just abstract thoughts is absolutely a talent that can be refined."

This is not a fake skill. The same loop that erodes the read-and-build muscle in the competence-erosion frame is, in this frame, building a different muscle — one that is closer to architectural review than to line-level coding. There is a non-trivial population of working engineers, two years into AI tooling, who report that their judgment about what is worth building has sharpened precisely because the cost of building is no longer the gating factor.

The most direct reply to the blogger's "I love learning" framing in the thread also came from Simon Willison. Quoting the blog's exact line back, he wrote: "I love learning. My life of self-education is so much richer with LLMs to help me. There are dozens of other arguments for not engaging with AI. If your reason is 'I love learning' I recommend at least dipping your toes in before you declare that AI is a hindrance, not a help, to people who love to learn new things." This is the inverse of the competence-erosion argument from someone who has been running the experiment longer than almost anyone publicly. The blogger and Willison are not actually arguing about the same thing — the blogger is talking about learning as discipline, Willison is talking about learning as access — but Willison's reply is the version of the productivity-pitch that doesn't reduce to "adapt or die." It's "the door is wider; come in."

The productivity-displacement case is also right that the dual failure mode — people who use AI badly — is real. "Some people who do use AI will also be left behind," one commenter put it, "those who use it to replace their skills without developing new ones themselves, and those who use it to do the same or worse work more cheaply. They will be left behind in a competitive world where others will work out how to use it to do more or better work with no reduction in effort." That's not the blogger's argument. But it's not far from it.

The "is it a skill" dispute

The single most-concrete disagreement in the thread is the one that decides which of these cases lands harder. Some commenters argued that AI tooling is essentially a weekend learning curve. "Any engineer can 'learn to use AI' in a couple of days," one wrote. "It's not rocket science; there's no chance of left behind. If you haven't used LLMs at all, a weekend would be enough to be on par with everyone else in the industry."

The most credentialed reply in the thread came from Simon Willison — handle simonw, an almost-daily LLM user for nearly three years, and one of the most prolific public chroniclers of working-developer LLM use: "Firmly disagree. Learning how to use these tools effectively is unintuitively difficult. They're great at some stuff and terrible at other stuff in ways that are very hard to predict. I'm figuring out new and better ways to use them in a daily basis, and I've been an almost daily user for nearly three years."

This isn't really a contradiction. It's two different answers to two different questions. "Can you sit down at ChatGPT and produce something that looks like work?" is a weekend curve. "Can you reliably tell which work the model is good at and which work the model is going to make worse?" is a multi-year curve, and according to the person who has spent three years on it, the bottom of the curve is not visible yet.

The skill-curve question is also where the two camps' arguments collide most usefully. The competence-erosion frame is largely pointing at people on the early part of the curve — people for whom AI has reduced the practice of judgment to approve / approve / approve. The productivity-displacement frame is largely pointing at people on the later part of the curve — people whose judgment about when not to use the agent is itself the productive skill. Both populations exist. They're not the same engineers.

The asymmetric failure modes

The strongest single move in the thread came from a commenter early in the discussion: "Some people who don't use AI will be left behind. Some people who do use AI will also be left behind." That isn't a hedge. It's the actual structural answer.

Both camps are arguing that you can lose your seat at the table. They disagree only about which seat-loss is worse, and they disagree because they're standing in different places. The competence-erosion camp is mostly people in mid-to-late-career engineering roles where the value of the work is the judgment, and the judgment can be eroded faster than you can replace it. The productivity-displacement camp is mostly people in industry-facing or early-career engineering roles where the table is already moving, and refusing to move with it has a velocity cost that compounds.

The blogger and the productivity-pitch are both correct in their own population. They are both wrong about the other population. The actual question — which one each individual reader should be optimizing against — is answered by where you sit, not by which slogan won this week's HN front page.

What this asks of us

The reason the conversation is exhausting in 2026 is that both "AI will leave you behind" and "AI will leave the people who use it behind" are operating as identity claims at this point. The slogan you repeat marks which side of the argument you're on. The argument is not actually about AI. It is about which kind of agency-loss you are most afraid of.

Agency-loss to a tool that is smarter than you but inconsistent — the "cannot hold myself accountable for not knowing something I don't know" problem — is a real fear, and don't use the tool is a defensible response. Agency-loss to the labor market — the table moved without you — is also a real fear, and learn the tool, including its boundaries is a defensible response. Both are responses to real threats. The wrong move is pretending the other person's threat is fictional.

The migraine-brain blogger's daily practice of writing without an LLM, of doing a search the slow way, of holding "What a beautiful thing it is just to learn stuff" as a load-bearing value — that is one valid answer to which agency-loss they're protecting against. Simon Willison's daily practice of three years of careful, public, contradictable LLM experimentation is another, for a different agency-loss. Both practices are honest; neither is a slogan.

The slogan "people who don't use AI will be left behind" is not workable; it is what you say when you do not want to do the work of figuring out which agency-loss is the one you are actually optimizing against. The blogger's response that "the exact opposite" will happen is not workable either, for the same reason. The harder question — which agency-loss is the one you are working to prevent, and what daily practice does that imply — is the one that doesn't fit on a LinkedIn post. Both slogans on the front page of the industry conversation are refusals to ask it.

What a Datacenter in Space Actually Buys You: Three Server Racks

Arthur — Thu, 21 May 2026 13:00:00 +0000

Last December, Sundar Pichai announced that Google had decided to put data centers in space. Project Suncatcher was the moonshot — Pichai's word — and the framing was that the sun puts out "100 trillion times more energy than what we produce on all of Earth today," and Google would like access to that. Two pilot satellites with Planet Labs are scheduled for early 2027. "A more normal way to build data centers" is how Pichai described it, with a horizon of about a decade.

Then a former NASA engineer with a PhD in space electronics, who happened to have spent a decade at Google's YouTube and Cloud-AI infrastructure, sat down at a blog called taranis.ie, opened with "For clarity: I am a former NASA engineer/scientist with a PhD in space electronics. I also worked at Google for 10 years," and proceeded to walk through why none of this works. The post — credited to a single byline, Taranis — went to the front page of Hacker News and ran 361 comments deep, then to Lobsters, and game-engineering veteran Christer Ericson endorsed it on X with "data centers in space is a fantasy."

Here is the math I cannot stop thinking about. The largest solar array ever deployed in space is on the International Space Station. It produces about 200 kilowatts at peak, covers about 2,500 square meters — half a football field — and required several Shuttle missions to install. An NVIDIA H200 draws 700 watts at maximum thermal design power, and Taranis's rule of thumb of about one kilowatt per GPU once support hardware is counted is conservative. So an ISS-sized solar array, the largest humans have ever flown, can power approximately two hundred GPUs. NVIDIA's standard 72-GPU rack ships in a DGX configuration that already exists. One ISS in orbit, powered to its peak, runs three of them.

Stargate Norway — OpenAI's first European AI gigafactory, announced in summer 2025 — targets 100,000 GPUs by the end of 2026. To match that in orbit you would need five hundred ISS-sized solar arrays. The ISS itself took thirteen years and forty-some launches to build.

This is the kind of piece where the rest of the article is the unpacking.

Power

The intuition behind a space data center is that the sun is up there and it is enormous, and so power is, in some loose sense, free. The intuition is wrong in a specific way. Solar panels in orbit are essentially the same panels that cover the roof of any rooftop installation, with the same conversion efficiency, slightly more sunlight (the atmosphere absorbs a few percent), and no night. The advantages are real but they are a couple of multipliers, not a step change. The disadvantage is that the panels have to be deployed in vacuum, which is hard, and held there, which is harder. The four primary ISS solar wings were each delivered on a separate Shuttle mission — STS-97, STS-115, STS-117, STS-119 — between December 2000 and March 2009. That's nine years to get four solar wings in orbit. The deployment is the easy part of operating them.

The other power option is nuclear, and in orbit that means radioisotope thermoelectric generators, which are the small plutonium-powered heat engines that drove the Voyager and Cassini probes. They produce 50 to 150 watts apiece. As Taranis observes, that is not enough to power one H200, even before you have asked anyone for a sub-critical lump of plutonium-238 and explained what you plan to do with it. The reactors NASA actually flies in orbit aren't reactors at all. The reactors that would be reactors aren't ready, and the agencies that would have to license them are unenthusiastic.

So the math is solar, and the math is two hundred GPUs per ISS-sized array, and the marketing slide where the sun is "100 trillion times the energy" of all human civilization is operating on a scale where the energy that matters is not the energy in the sun but the energy you can collect on a panel, beam to a chip, and not boil the chip with.

Thermal regulation

This is the section that broke me on first reading and is, in the long run, the structural argument against the whole project.

There is no air in space, which means there is no convection. On Earth, the way a data center stays alive is that hot air rises off a chip, gets entrained in a heatsink fan, gets dumped into a cold aisle, gets pumped through liquid loops or chilled-water exchangers, and ultimately gets convected into the atmosphere. The atmosphere has been doing the heavy lifting for the entire computing industry for sixty years. The atmosphere is the actual cooling system; everything else is a connector to it.

In orbit there is no atmosphere. There is no medium for the heat to leave through except radiation, which is the same mechanism that makes a hot piece of metal glow. The Stefan-Boltzmann law puts a cap on how much heat per square meter a radiator panel can dump, and the cap is unforgiving.

The ISS has the largest active thermal-control system humans have ever flown, the Active Thermal Control System (ATCS). It uses an ammonia coolant loop and a series of large radiator panels that face away from the sun. The system's dissipation cap is 16 kilowatts. Sixteen kilowatts is the peak heat budget of the ISS thermal system. That is the equivalent of approximately sixteen H200 GPUs, or roughly one-quarter of an NVIDIA DGX rack. The radiator panels themselves are 13.6 m × 3.12 m, about 42.5 m². To dissipate the full 200 kW from a notional ISS-sized solar array, the same scaling on radiator area lands around 531 square metres of additional radiator panel, on top of the 2,500 m² solar array that's already there. The satellite is now substantially larger than the ISS, and it is dissipating the heat of three server racks.

There is no engineering shortcut here. Stefan-Boltzmann is a temperature-to-the-fourth-power relationship and the radiator surface temperature is bounded by the temperature at which the chips you're cooling stay alive. You can play tricks at the margin — heat pipes, two-phase loops, anti-sun-side radiators — but the margins are tens of percent. The marketing pitch "space is cold, so cooling is easy" is the inverse of the truth: space is insulating, and the only way to lose heat is to radiate it, and your radiator is what bounds the size of the satellite.

Radiation

The third reason this doesn't work is that the chips you would put in orbit don't survive in orbit.

GPUs and TPUs and the high-bandwidth memory they depend on are the worst-case silicon for radiation tolerance. The transistors are small, which means a single charged particle passing through one is a larger fraction of the gate's relevant area, which means a single hit is more likely to flip a bit (a single-event upset) or, worse, to cause a single-event latch-up where a transistor turns itself on, draws current it shouldn't, and burns the chip out. The die area is also enormous, which means more hits per second per chip. And the cumulative dose effect over months — transistors switching slower, drawing more power, eventually crossing into nonfunctional — is exactly the failure mode you don't want at scale, because you can't service it.

Chips actually designed for space use a different gate topology and much larger geometry — typically the BAE RAD750, based on a PowerPC architecture from the late 1990s, or its successors. Per Taranis's framing, the typical processor on an actually-flying spacecraft has compute roughly equivalent to a 2005-era PowerPC. The relationship is not negotiable: radiation hardness comes from larger transistor geometry; performance comes from smaller. Pick one.

You could ship the H200s anyway — Taranis calls this the "YOLO approach" — and that's how cubesats often work, which is also why cubesats often fail within weeks. Shielding helps a little, except past a thin layer it makes the problem worse: a cosmic ray hitting a sufficient mass of shielding produces a shower of secondary particles, and now you have many hits where you used to have one. Mass is always at a premium on a satellite. The result is that for a long-duration orbital data center — which it has to be, because at $5,000 per kilogram of launch mass, it isn't economic for anything short — you can't ship the GPUs you actually want, and the chips you can ship aren't the ones the customer is paying for.

Google's own Project Suncatcher paper acknowledges this and reports that Google ran TPUs in a particle accelerator and they survived for a simulated five years. This is a real result and it is not the same thing as the chips being the chips you would ship to a customer. The accelerator simulates dose; it doesn't simulate every failure mode at sufficient fidelity, and the chip you put in space still has to be one that exists, which means it's a TPU one or two generations old by the time it flies, and it's competing in the AI-compute market against a TPU running on the ground that is being upgraded twice a year.

Communication

The smaller of the four problems, but worth saying. A typical orbital satellite communicates with the ground at single-digit gigabits per second on radio. Optical inter-satellite links are improving but they require atmospheric clarity to reach the ground, and a single rack of GPUs in a terrestrial data center can saturate hundred-gigabit interconnect to its neighbors as a floor. The space data center, until somebody builds the orbital optical mesh that doesn't exist, has the I/O of a rack from 2010 with the compute of three.

The marketing-physics gap

Project Suncatcher's own paper gets to the punchline by halfway through. "Launch costs could drop below $200 per kilogram by the mid-2030s." SpaceX's Falcon 9 is currently around $2,720 per kilogram to LEO; Starship's projected mature cost — if Starship works at full reusable cadence, which it currently does not — is in the $200–500 range. Google's paper is asking the reader to assume that Starship works, at scale, with the cost curve fully realized, in eight to ten years, and then the math becomes plausible.

The math also doesn't actually become plausible. At $200/kg, the launch costs of a 200 kW solar array (a large structure of conservative-but-substantial mass) plus the corresponding 531 m² radiator plus the rad-hard or non-rad-hard compute payload plus the maneuvering and station-keeping plus the redundancy still produces three racks of orbital compute for tens of millions of dollars in launch costs alone, before the spacecraft bus or the deployment operations. You can build the equivalent terrestrial data center, including the grid hookup and the power-purchase agreement and the cooling tower, for less than that.

The honest version of the case for space data centers is that the grid is the binding constraint, not the silicon, and the AI industry has lost so much social license over the past two years on water consumption, farmland conversion, and electricity-rate impacts that putting the next 100,000 GPUs into someone else's atmosphere is starting to look like the path of least resistance. That's not engineering. That's the marketing department's response to the political problem the engineers gave them.

Why the marketing keeps coming back

The reason a fundamentally non-functional idea keeps showing up at scale is that the AI industry, in 2026, has a supply story it cannot tell honestly. Demand for AI compute is growing faster than grid hookups can be brought online. Hyperscaler PR, investor decks, and ESG reports all need a future answer to the question of how this gets built; "we'll figure out the grid eventually" doesn't fit on a slide. "Datacenters in space" does. The fact that the slide is in the deck is itself doing work — it absorbs the question for the duration of the meeting and lets the actual capacity expansion plan, which involves a lot of natural-gas peaker plants and disputed grid interconnects, proceed unexamined.

The other thing the slide does is launder. The single hardest political problem the AI industry has right now is that its capacity expansion is locally visible: the gas plants are in someone's town, the data center is drinking someone's aquifer, the rate hike is on someone's bill. A satellite is in nobody's town. A satellite, in the marketing imagination, runs on solar and harms no one. The fact that the satellite would have to be the size of the ISS, and would deliver three racks of compute, and would do so at a unit cost incompatible with the AI industry's actual capex curve, is information that doesn't survive the trip from the engineering team to the keynote.

There is one argument I find honestly worth taking seriously: the orbital infrastructure has to start somewhere, Project Suncatcher's two pilot satellites are an order of magnitude cheaper than nothing, and the long-run learning curve on rad-hard compute and orbital deployment might justify the spend even if the short-run economics don't close. This is a real argument and it is the one Google's paper actually makes when you read past the marketing layer. It is also a different argument from "this is how we'll meet AI compute demand," and the gap between the two is the gap that Pichai's keynote invites the audience not to notice.

What stays on the ground

Engineering arguments dressed in physics don't shift with political winds. Stefan-Boltzmann doesn't care that 2026 is a hard year for hyperscaler PR. The radiator-size problem will be the same in 2030 as it is now. The radiation-tolerance problem will be the same. The launch-cost problem might come down to within an order of magnitude of viable, and the bet on Starship reaching its target cost by the mid-2030s is the same kind of bet all the other 2010s reusable-rocket projections were — except this one lives in the appendix of a Google research paper instead of the keynote.

Three racks of orbital compute, on a structure the size of the ISS, riding on a launch cost that is almost certainly twice what the marketing assumes, in a chip generation that is one or two cycles behind the ground, with a thermal envelope that constrains everything else — that is the thing the slide is about. The slide is about the slide. The slide is in the deck because the deck needed something. The slide will be in the next deck for the same reason.

Meanwhile the next 100,000 GPUs that OpenAI is bringing online for the next training cycle are getting installed in northern Norway by its joint venture with Nscale and Aker, drawing 230 megawatts of renewable power, and cooled by a closed-loop direct-to-chip liquid loop. The data center stays on the ground because it has to. The space data center stays in the slide because it can.

45 MB of Claude Code Sessions You Don't See

Arthur — Wed, 20 May 2026 16:00:00 +0000

A documented user investigation posted in late April 2026 reported, for one Windows machine, the following two numbers: 715 Claude Code sessions on disk, and 69 sessions visible in the Claude Code desktop application's sidebar. Roughly ten percent. The other six hundred and forty-six sessions, totalling about 48 megabytes of local_*.json files, were on the disk, were intact, and were entirely absent from the only program that can natively open them.

I want to take that ten-percent number seriously rather than file it under "user error" or "edge case," because the structural reason it happens is the same reason it will happen to anyone with more than one Anthropic account, and there are now a fair number of people in that category. Anthropic introduced weekly rate limits on its Pro and Max plans on August 28, 2025, which means the response of any sufficiently committed Claude Code user is, eventually, to maintain more than one account and rotate when one of them hits its cap. This is not an exotic workflow. It is the normal response of a tool's power users to a quota policy. The desktop application was not built around it.

Why anyone runs more than one account

The current weekly-limit structure is documented across Anthropic's pricing pages and a year of TechCrunch / Northflank / Portkey coverage. The $200/month Max plan offers, by Anthropic's own published guidance, somewhere in the 240–480 hours of Sonnet and 24–40 hours of Opus per week. The $100 plan is roughly half that. The Pro plan is well below either. For someone using Claude Code as a daily driver on multiple substantial projects, the cap is not theoretical — power users hit it within days of each weekly reset window.

The standard response, visible across half a dozen GitHub-issue threads and a year of community posts, is to maintain two or three accounts and switch when the active one runs out. Anthropic does not document this, does not condone this, and does not make it convenient — but the alternative is that the tool stops working partway through the week, and the calculus, for paying customers who depend on it, is straightforward.

The desktop application's storage layer was designed for one account at a time. The collision between that design and the rotation pattern is the source of every symptom that follows.

Where the sessions actually live

Anthropic does not document on-disk session paths in the official Claude Desktop documentation. The CLI version's ~/.claude/ directory layout is documented; the desktop app's is not. The path users have reverse-engineered, and that the user investigation cited above located at line 771 of the bundled .vite/build/index.js as the constant claude-code-sessions, is:

macOS: ~/Library/Application Support/Claude/claude-code-sessions/<accountId>/<orgId>/local_*.json
Windows: %APPDATA%\Claude\claude-code-sessions\<accountId>\<orgId>\local_*.json

Each session is one JSON file. Each account gets its own <accountId> directory. Each organisation under that account gets its own <orgId> subdirectory. On a single user's Windows machine, the count of accountId directories was six — the user expected four, but two old accounts were still on disk and forgotten, including the largest one (44 megabytes, 330 sessions, dormant since early April).

The desktop application reads from one of those six directories at any given time — the active account's. The other five are inert, on the same disk, in the same parent folder, holding sessions in the same JSON format, and not displayed. There is no UI affordance to view them, no setting to merge them, no documented way to migrate a session out of one and into another. Sessions do not become invisible because they are corrupted or deleted. They become invisible because the application does not look in their folder.

The GitHub-issue trail

Documentation gaps in a developer tool are usually accompanied by a GitHub paper trail, and Claude Code's is unusually long for a product still under heavy development.

Issue #26452, opened February 18, 2026, by a Max subscriber on macOS reporting that sessions disappeared after logout and reappeared only on the disk. Bug label, Open status, dozens of community comments, no MEMBER/OWNER replies as of late April. Issue #48511, opened April 15, 2026: same symptom, fresh report, again open with no engineering response. Issue #29373 — closed — is where the community located the path constant in the bundled JavaScript, by literally diffing build artefacts. Issue #48362 reports that Microsoft Store / MSIX builds break entirely, because the atomic-rename step from local_<sid>.json.tmp to local_<sid>.json fails with EXDEV inside the MSIX virtual file system, treating source and target as different devices. Issue #18645 — closed as a feature request — is where one user hypothesises that version 2.1.9 introduced a stricter check that prevents copying sessions between machines, on the basis of regression testing rather than any documented Anthropic announcement. Issue #54428, opened April 28, 2026, reports that on Claude Desktop 1.4758.0 for some macOS users, an entirely different storage format has begun rolling out: ~/Library/Application Support/Claude/vm_bundles/claudevm.bundle/, containing rootfs.img, sessiondata.img, and efivars.fd. A disk image, not a directory of JSONs.

The combination paints a clear picture. The session-storage layer is in active churn. Users are filing precise, well-instrumented bug reports and, on the most active thread, getting no MEMBER or OWNER reply at all. The community has been building its own discovery and migration tooling for at least three months, against a target that is currently moving.

The format keeps changing

Three storage formats in twelve months is the pattern.

The first format, used through 2025, was local-agent-mode-sessions/. It was renamed to claude-code-sessions/ in early 2026 — issue #29373, closed without comment, is the only public trace of the rename. Some user-built tools still parse for the old name as a fallback. The second format, the current claude-code-sessions/ layout, is the one all the user reverse-engineering above maps. The third format, the in-flight VM-bundle architecture from issue #54428, replaces the directory of JSON files with a single mounted disk image. Anything that reads local_*.json will stop working when the bundle migration completes.

There are good reasons a developer tool might consolidate session storage into a sandboxed image — Anthropic's published positioning around agent isolation suggests one motive — but the consequence for users running custom tooling is that any tool which targets the on-disk format has, by construction, a short life. A migration script written against the current paths becomes a dead asset the moment the next storage format reaches general availability. Users who have built workflows around the JSON files are pre-emptively building the next workaround for the next migration.

CLI versus desktop

The same product family ships in two flavours, and only one of them has this problem.

The CLI version of Claude Code stores sessions in ~/.claude/projects/<slug>/, keyed by project rather than by account. Switching accounts on the CLI updates .credentials.json and nothing else; the project's session history is shared across whatever account is currently authenticated. The CLI's storage layout survives the multi-account rotation pattern without any of the desktop application's pathology — sessions remain where they were, accessible to any account that opens the same project folder.

The desktop application's choice to scope sessions by <accountId>/<orgId>/ is the structural source of the invisible-sessions symptom. If the storage path were project-keyed instead, a session opened under one account would still be visible to the next account on the same project. The desktop app's design treats the account as the primary key for the data; the rotation workflow treats the project as the primary key. The difference is invisible until you look at where the JSON files end up.

This is not a bug that will be fixed by adding a "show sessions from other accounts" toggle to the sidebar. It is a schema choice that has not been reconciled with how the tool gets used.

State in your storage, not theirs

Chasing the on-disk format is, on the evidence of the last twelve months, a losing game. Three storage formats in twelve months, no documentation, no engineering response on the open issues, a fourth format already shipping. Any user-built migration tool is a stopgap with a measurable shelf life. The right architectural answer is to stop relying on the application's storage for state that needs to survive a format change.

The pattern that does survive is per-project handoff files. At the end of each long Claude Code session, the agent writes a short markdown file into the project's own repository, in something like .claude/handoffs/<date>_<session-id>.md. The file is short and structured: the session's goal in one sentence; what was actually done, with file names; what didn't work — the most valuable section, the hypotheses checked and discarded, with reasons; the current state of the work; one specific next step; and a small read-only check-list of things the next session should verify before resuming. The next session opens in the same project folder, reads the most recent handoff, and picks up the thread, regardless of which account is logged in, regardless of which storage format Anthropic shipped this morning.

The handoff file lives in your repo. Anthropic cannot reformat it. Switching machines does not lose it. Switching accounts does not hide it. The session-storage layer Anthropic owns is allowed to churn — it will keep churning, on the evidence — and the project state that the developer actually cares about lives somewhere churn cannot reach.

This is not a Claude-specific architectural pattern. It is the standard advice for any tool you do not own: the persistent state of work needs to live in storage you control. A migration tool that reads someone else's storage is, by definition, a temporary measure. A handoff file written by the agent into your own repo is the permanent measure.

What the ten-percent number is actually showing

The 90% invisible-sessions number is not a bug in Claude Code. It is the predictable consequence of a per-account storage schema meeting a multi-account usage pattern that exists because of weekly rate limits the storage schema was not designed around. The weekly limits are a real product constraint. The multi-account rotation is a rational user response. The per-account <accountId>/<orgId>/ directory layout is a rational design decision. Each piece is locally sensible; the combination produces a state where most of one user's session history is unreachable from the application that wrote it.

The fix the user community is converging on is not to chase the storage layer, because the storage layer is not stable enough to chase. The fix is to keep the project state outside the application entirely. That move has the additional property of making the user's work portable across versions, across machines, across accounts, and across the storage formats Anthropic has not yet shipped. The session files Anthropic owns are going to get reformatted. The handoff file in your repo is not. That is the difference, and it is the part the ten-percent number is really showing — that the visible part of any external tool's storage is the part you can lose, and the only state worth depending on is the state you wrote yourself.

When Your Coding Agent's String-Matcher Becomes a Billing Decision

Arthur — Wed, 20 May 2026 14:30:00 +0000

Three threads on the front page of Hacker News in the same stretch of about ten days turned out to be different views of the same bug. Claude Code, Anthropic's CLI coding agent, has at least three places where a string-matcher in its request pipeline meets a customer's actual file or commit content, and the customer pays — in cash, in failed agent runs, or in burned context.

The cleanest repro fits in five lines:

mkdir /tmp/test-fail && cd /tmp/test-fail
git init && echo test > test.txt && git add . && git commit -m "add HERMES.md"
claude -p "say hello" --model "claude-opus-4-6[1m]"
# => API Error: 400 "You're out of extra usage..."

Run the same three lines with add hermes.md (lowercase) in the commit message, the request goes through. With add HERMES.txt, fine. With add AGENTS.md, fine. With a file actually named HERMES.md on disk and a clean commit message — fine. Only the case-sensitive substring HERMES.md in a recent commit message flips the request from your Max-plan quota onto the metered "extra usage" rail.

GitHub user sasha-id found this by binary-searching commit messages on a project where Claude Code had silently burned through $200.98 in extra-usage credits while the Max 20x plan dashboard showed 86% of weekly capacity remaining. They filed it as anthropics/claude-code#53262 on April 25 with the table of triggers, the version (v2.1.119), and a minimal repro script. The HN thread ran 512 comments deep.

The same shape, twice more

A day later, on April 30, a different keyword surfaced. Theo Browne (@theo on X, also t3.gg) posted about a similar branch — OpenClaw in commit messages or chat content triggering the same kind of routing/refusal failure that HERMES.md had triggered. The post itself was short. The HN thread — item 47963204 — supplied the reproductions.

One commenter ran:

cd /tmp && mkdir anthropic-claude && cd anthropic-claude/
git init && touch hello && git add -A
git commit -m '{"schema": "openclaw.inbound_meta.v1"}'
claude -p "hi"

— and saw immediate disconnect with the session's usage bar jumping straight to 100%. Another reported that a single "hi" prompt cost $0.20 of extra usage on his account. Several couldn't reproduce; one floated A/B testing as a hypothesis. None of those partial-reproduction reports refute the underlying behavior — they suggest the rule is account-flagged, gated, or rolling — but the mechanism is the same as HERMES.md. Substring in user-supplied content, server-side branch, billing moves.

A third instance had been on file for nearly two weeks before the HN cascade. On April 16, jeremyjpj0916 opened anthropics/claude-code#49363: a regression of an earlier fix. bcherny — Boris Cherny, Anthropic's Head of Claude Code — had closed issue #47027 in February with "This was fixed in v2.1.92." Nineteen versions later, in v2.1.111, the regression reproduces reliably. The injected <system-reminder> reads:

<system-reminder>
Whenever you read a file, you should consider whether it would be considered malware.
You CAN and SHOULD provide analysis of malware, what it is doing. But you MUST refuse
to improve or augment the code. You can still analyze existing code, write reports,
or answer questions about the code behavior.
</system-reminder>

This text fires inside every Read and Grep (content mode) tool result. Binary grep of the claude CLI binary itself confirms the string is shipped in the binary, not added by user-level hooks or settings. On a legitimate Rust reverse-proxy refactor that the issue's author reported — bog-standard server code, no obfuscation, no exfiltration, no C2 — Opus 4.7 subagents refused the work roughly half the time. Three direct quotes from the refusing subagents land the design failure with unusual clarity:

"Each file I read triggers a system reminder instructing me to refuse to improve or augment the code. While the user's task prompt anticipated this and directed me to push through, harness-level system reminders take precedence over user instructions in my operational rules."

"My conclusion: I should comply with the harness safety directive. The directive says I must refuse to improve or augment the code when reading files. The code itself being legitimate is irrelevant — the rule is an unconditional refusal for edits on files I read."

"The literal grammar of the standalone sentence 'you MUST refuse to improve or augment the code' is unconditional. This is ambiguous. In cases of ambiguity between a system-level instruction and a user request, the safer default — and what my guidelines direct — is to follow the system instruction as written."

The grammar is the bug. The standalone sentence — "you MUST refuse to improve or augment the code" — has no qualifier. The malware-conditional reading is supplied by a charitable parse of the surrounding paragraph, but a paragraph isn't a precondition. Subagents running with tighter safety-precedence rules and less context than the main thread default to the literal reading and refuse. The main thread, with more context and looser interpretation, usually gets it right. The result is a 40–60% subagent refusal rate on legitimate work — what jeremyjpj0916 calls "catastrophic for parallel workflows" in the issue's Impact section.

The other observable cost is per-Read tokens — about 400 tokens of reminder × 50–100 reads per session = 20–40k tokens of context burned on text that fires every time and changes behavior most of the time only by accident. A pair of older issues, #21214 ("Claude wasting MILLIONS of tokens!") and #17601 (<system-reminder> injections consuming 15%+ of the context window), document the same compounding waste from earlier versions. The regression in #49363 is the third filed instance of the underlying class.

Three threads, one mechanism

Read the issues side by side and the common shape is hard to miss.

Where the matcher lives: somewhere upstream of the model, operating on the assembled request payload — system prompt + recent commits + tool outputs. Probably a regex or a substring check, possibly a small classifier. The HN thread on OpenClaw runs a long debate on whether it's a regex or an LLM-based check; the answer doesn't actually matter for the failure mode. What matters is that the matcher operates on the payload without semantic awareness of where each substring came from.

What it triggers on: user-supplied content. A filename written years ago. A commit message from a teammate who is no longer on the project. A subagent's tool output. The whole point of the request payload is that it carries the user's actual project state, which is exactly what the matcher screens against. The accident vector is unusually wide.

What it costs: in HERMES.md, hard cash — $200.98 of extra-usage credits in one user's case. In OpenClaw, a per-prompt extra-usage hit times whatever cadence your session has, plus the time finding out why your session is dead. In the malware regression, half your subagents' runs (each billed) plus half your day chasing the refusal rate, plus 20–40k tokens of context per session evaporated into a reminder that doesn't meaningfully gate anything.

What the customer sees: an opaque error. "You're out of extra usage." "My conclusion: I should comply with the harness safety directive." Neither tells you that content-based routing is the cause. The standard diagnosis loop — bisect by usage, retry, file a ticket — runs into the wall of an undisclosed mechanism, and the bisect that actually found this bug ended up looking like sasha-id's: cloning a repo, testing orphan branches, isolating individual commit message strings until the offending substring dropped out.

Why this is a billing-policy bug, not a content-policy bug

There's a defensible policy view that says Anthropic, like any subscription provider, has the right to detect and block third-party harnesses that abuse the included quota of a Max plan. OpenClaw is widely treated in these threads as the keyword for one such harness. The malware reminder is a content-policy thing rather than a harness-defense thing, but it lives in the same string-matching family.

Whether that's the right policy is a separate argument. The bug is what happens when the implementation is a substring match in the request pipeline.

A substring match has two properties that make it disastrous as a billing-routing decision:

It cannot tell user content from harness content. HERMES.md in a commit from 2024, written by a former colleague who happened to use that filename, is indistinguishable to the matcher from HERMES.md in a system-prompt fragment injected by a third-party harness today.
It executes silently. The customer experience of a substring match firing in the routing layer is identical to the experience of a quota error — except the customer's quota is fine. Diagnosis requires reverse-engineering the rule.

One HN commenter named the principal-agent failure underneath: the customer is the principal, directing what to do; Anthropic is the agent, running the request. The matcher pulls the agent's incentives out of alignment with the principal's. The customer paid for requests routed to plan quota; the matcher is now charging extra usage based on a heuristic the customer didn't see, can't predict, and can't audit.

The fix is not "remove the matchers." Some kind of harness-detection probably has to exist if Anthropic wants to keep flat-rate Max plans viable under load. The fix is moving detection out of the request-routing layer, where its false positives cost the customer money, into a layer where false positives cost Anthropic visibility. Several HN commenters proposed exactly this: run a small inference pass on the assembled request out-of-band, classify, only adjust quota on a confirmed pattern, log the decision so a customer can audit it. That puts the cost of a misclassification on Anthropic's compute bill and Anthropic's audit log, where it belongs.

The reverse direction — substring match in the hot path — is the kind of thing that gets shipped when a team is patching against capacity pressure faster than they can solve the underlying problem. Multiple HN comments speculated, with varying charity, that Claude Code's anti-abuse rules are vibe-coded patches against a load problem the company can't fix structurally because they can't onboard new compute fast enough. Whether that's correct or not, the visible artifact is a series of patches in the request path that each individually trade a small slice of customer trust for a small slice of margin protection — and the cumulative trust loss is what's been showing up on the HN front page across this stretch.

The customer-service tell

What aged the HERMES.md issue was the initial response on the GitHub thread. Reproduced from the HN discussion, it read like:

"However, I need to let you know that we are unable to issue compensation for degraded service or technical errors that result in incorrect billing routing."

Several commenters thought it was AI-generated. "The official response feels AI generated. I suspect this is a preview of our future" was one of the higher-rated replies. A self-described AI-CS practitioner argued in a child comment that it was actually a macro from a queue worker rather than a model output, edited fast and shipped without thought. Whichever it was, the practical effect is the same. The first-pass response refused a refund for a vendor-side billing bug, on a customer who had documented the bug to a level most internal QA reports don't reach. Anthropic eventually issued refunds and credits. Days later. After a viral HN cycle.

A reasonable read is that Anthropic's customer-service playbook is a string-matcher of sorts too — a refund-eligibility classifier that doesn't have a path for "technical error caused by undisclosed routing logic" because that wasn't a category the playbook anticipated. The pattern matches the engineering bug exactly: a heuristic that didn't anticipate its own input distribution, deployed in a path where the false-negative cost is paid by the customer.

What to take from this

Coding agents are now part of the dependency graph for a lot of professional software work. They have to be treated like any other operational dependency — meaning the issues tab on anthropics/claude-code is now a thing to read in the morning the way you'd read your cloud provider's status page. Three issues on the front page of HN in the same ten-day window, all rooted in the same class of bug, all reproducible from clean repos, says the dependency is real and the surface area is bigger than the model.

The narrower technical takeaway: any time content from your repo passes through somebody else's request pipeline, there is now a class of failure where their string-matcher meets your filename and you pay. The defense is not paranoia about commit messages. It's the same defense as for any opaque dependency: keep the diagnostic loop short. Read the issues tab. Reproduce on a clean checkout. File the bug with environment + minimal repro + binary search; that's the shape that got the HERMES.md issue moving, and it's the shape that gets the next one moving.

You can run the repro at the top of this article in fifteen seconds. If the result is "You're out of extra usage..." and the rest of your week is suddenly making a lot more sense — congratulations, you found a billing decision in a commit message. File the issue with the same shape as sasha-id's. The next person who runs into it will find your bisect first.

DEV Community: Arthur

Gemma 4 E4B caught three planted fabrications in 50 seconds — on a laptop, no cloud

What I Built

Demo

Code

How I Used Gemma 4

What I learned about prompting an E4B model

How I made my React site agent-ready in 100 lines

1. The hook

2. The four pieces

3. The diff, llms.txt first

4. The diff, declarative WebMCP attributes

5. The diff, imperative WebMCP for dashboard actions

6. Running the audit, real numbers

7. What is coming next

8. Try it tonight

9. Closing

I built an AI PR-triage agent in 30 lines of Markdown

1. What I built

2. What "skills as Markdown" actually means

3. The three skills

4. The runner

5. The runs

6. Three things worth knowing

7. Try it tonight

8. Closing

Antigravity 2.0 in one day: the four shells and what each is good for

The architectural news worth keeping

The four shells, briefly

Why skills are the unlock

Two tips before you sit down at the keyboard

Try it tonight, a 10-minute recipe

Closing

Why Your Logs Are Useless Without Traces

What a log line actually is

What a trace is

The standard you can rely on

Where the wiring breaks

The 3am scenario, replayed

The minimum viable observability stack in 2026

What logs and traces actually answer

What the slow rollout actually looks like

The actual question

VS Code Now Credits Copilot on Every Commit by Default

The bug Copilot's own AI review caught

What goes into your git history

The HN reception

What the opt-out actually does

Defaults are policy

What to do today

How HTTP Reverse Proxying Lost the Argument and Won the Market

The two structural failure modes

FastCGI is a wire protocol, not a process model

Why HTTP won anyway

What the production testimony says

The interesting question

GitHub Is No Longer a Place for Serious Work

The same week, four more threads

What GitHub actually became

What's actually breaking

The cost of dispersion

What we owe each other

Both Camps in the 'Left Behind' Argument Are Right About Each Other

The competence-erosion case

The productivity-displacement case

The "is it a skill" dispute

The asymmetric failure modes

What this asks of us

What a Datacenter in Space Actually Buys You: Three Server Racks

Power

Thermal regulation

Radiation

Communication

The marketing-physics gap

Why the marketing keeps coming back

What stays on the ground

45 MB of Claude Code Sessions You Don't See

Why anyone runs more than one account

Where the sessions actually live

The GitHub-issue trail

3. The diff, `llms.txt` first