DEV Community: Rob

Five ways your AI coding agent wastes tokens (and how to fix each one)

Rob — Wed, 24 Jun 2026 14:32:54 +0000

On a metered API key, token waste shows up as a bill. On a subscription plan (Claude Code, Codex, the "5x" and "20x" tiers), it shows up as a lockout.

You are halfway through a real task. The agent has the repo in its head, a failing test in front of it, and enough context to make the next edit. Then it stops. No useful warning. No obvious culprit. Just a usage wall.

The frustrating part is that the thing that burned the window was probably not the work itself. It was repeated context, cache misses, tool schemas, oversized models, and long sessions carrying old state forward.

Most token waste in coding agents is mechanical. Fixing it starts with understanding how context gets sent, cached, repeated, and billed.

The mental model

Every turn, your agent sends a working context back to the model: system instructions, tool definitions, prior messages, file excerpts, tool results, and whatever else the agent believes is relevant.

The model does not continue from a private memory of the previous turn. It re-reads a prompt.

So the budget is not "how much did I ask it to do." The budget is:

tokens per turn * number of turns

Waste is the gap between that number and the amount of context the work actually needed.

Here are the leaks I would look for first.

Leak 1: Prompt cache misses

Prompt caching is the highest-leverage optimization because it applies to repeated context, which is exactly what agent sessions are full of.

Providers can cache the stable prefix of a prompt. If the next request starts with the same prefix, that cached content is billed at a much lower rate. On current Claude Opus pricing, standard input is $5 per million tokens and cache reads are $0.50 per million. Same text, 10x difference. OpenAI's cached-input pricing has the same shape: repeated prefix tokens are much cheaper than fresh input.

That makes cache hit rate one of the first numbers worth checking:

cache_hit_rate = cache_read / (cache_read + cache_creation + uncached_input)

In a long coding session, this should usually be high. The repo summary, system prompt, tool definitions, and earlier conversation are mostly repeated turn to turn. If the cache hit rate is low, you are paying full freight for context the model just saw.
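A minimal sketch of that formula in Python. The function name and the sample numbers are illustrative, not taken from any tool's API:

```python
def cache_hit_rate(cache_read: int, cache_creation: int, uncached_input: int) -> float:
    """Share of input tokens served from the prompt cache (0.0 to 1.0)."""
    total = cache_read + cache_creation + uncached_input
    # A turn with no input at all reports 0.0 instead of dividing by zero.
    return cache_read / total if total else 0.0


# Illustrative numbers only: a turn that was almost entirely cache reads.
print(round(cache_hit_rate(815_193, 5_521, 137), 4))
```

Summing the three counts over the last N turns before dividing gives a steadier number than averaging per-turn rates.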

Common ways teams break the cache:

Changing content near the front of the prompt.
Injecting volatile state, timestamps, or changing status text before stable context.
Reordering tool definitions or MCP tool schemas between turns.
Pasting large files into chat instead of letting the agent read them when needed.

The fix is not clever. Keep the front of the prompt stable. Put volatile material later. Do not reshuffle system context every turn. Treat tool definitions like part of your hot path, because they are.

Leak 2: Context bloat

Long context is useful, but it is not free. Every extra chunk you keep in the session gets carried into later turns unless the agent compacts or drops it.

The bad pattern is easy to recognize: one task turns into three tasks, debugging turns into implementation, implementation turns into release notes, and the session still contains the whole archaeology dig from two hours ago.

Compare the average input size of the first five turns with the last five. If the last five are more than 2x larger, and the session is already past roughly 30K tokens, you are probably paying a long-context tax on every remaining turn.
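That rule of thumb can be sketched in a few lines of Python. The function names are hypothetical, and reading "past roughly 30K tokens" as "the latest turn's input exceeds 30K" is an assumption:

```python
from statistics import mean


def context_growth(input_sizes: list[int], window: int = 5) -> float:
    """Ratio of mean input size over the last `window` turns to the first `window`."""
    if len(input_sizes) < 2 * window:
        raise ValueError("need at least two full windows of turns")
    first = mean(input_sizes[:window])
    if first == 0:
        raise ValueError("first window has no input tokens")
    return mean(input_sizes[-window:]) / first


def looks_bloated(input_sizes: list[int], ratio: float = 2.0, floor: int = 30_000) -> bool:
    """True when input grew by more than `ratio` and the latest turn is past `floor` tokens."""
    return context_growth(input_sizes) > ratio and input_sizes[-1] > floor
```

`input_sizes` is the total input per turn (cached plus uncached), in turn order.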

Operational fixes:

Start a new session when the task boundary changes.
Compact before the session is huge, not after the model is already struggling.
Ask the agent to summarize decisions and open questions, then restart from that summary.
Reference files by path. Do not paste thousands of lines into the conversation unless you want to pay for those lines repeatedly.

This is the same discipline as log handling. You do not hand someone a 10,000-line log when grep ERROR would do.

Leak 3: MCP and tool overhead

MCP servers are useful, but they are not invisible. Each server exposes tool definitions. Those definitions sit in context. They get sent again and again.

One or two focused servers is usually fine. A pile of always-on servers can become a fixed tax on every turn before the user has asked for anything. If you connect five MCP servers and each one exposes a dozen tools, you can burn a surprising amount of input budget on tools the agent never calls.

The check is simple:

tool_schema_tokens / total_input_tokens

If tool schemas are a big percentage of the session, especially for tools you did not use, disconnect them. Bring them back when the task needs them.

That is basic capacity management, not an argument against MCP.

Leak 4: Model and reasoning right-sizing

There are two separate problems here.

First, frontier models get used for low-variance work. Formatting JSON, renaming a variable, writing a commit message, and applying a mechanical refactor usually do not need the most expensive model in the stack.

Second, reasoning tokens are easy to ignore because you often do not see them. On models with explicit thinking or reasoning effort, those tokens are billed as output. Output is usually the expensive side of the request.

The answer is routing, but routing with guardrails:

Use the strongest model for ambiguous architecture, hard debugging, and edits where a bad answer is expensive.
Use cheaper models for formatting, extraction, summarization, and mechanical code changes.
Lower reasoning effort for simple tasks.
Track quality before and after the route change.

This is the one leak where the fix can hurt quality. Do not automate it blindly. Start with one workflow and measure whether the cheaper path holds up.

Leak 5: No burn-rate visibility

The last leak is operational, not model-specific.

Most people only learn they were burning too fast when the subscription window closes on them. That is a terrible alert. It fires after the user has already lost the session.

You want rate, not just total:

Tokens this session.
Tokens this hour.
Cache hit rate over the last N turns.
Context size trend.
Reasoning/output ratio.
Tool schema overhead.

A session running at 3x its normal burn rate should be visible while there is still time to change behavior. That is the difference between an alert and a postmortem.
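A rough sketch of a rate-based alert in Python. The event shape, the 10-minute window, and the function names are assumptions for illustration. The 3x factor comes from the paragraph above:

```python
def burn_rate(events: list[tuple[float, int]], now: float, window_s: float = 600.0) -> float:
    """Tokens per minute over the trailing window.

    `events` holds (unix_timestamp, tokens) pairs, one per turn.
    """
    recent = sum(tokens for ts, tokens in events if now - window_s <= ts <= now)
    return recent / (window_s / 60.0)


def should_alert(current: float, baseline: float, factor: float = 3.0) -> bool:
    """True when the current rate is at least `factor` times the usual rate."""
    return baseline > 0 and current >= factor * baseline
```

The baseline would come from your own history, for example the median per-minute rate across earlier sessions.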

How to see this

The useful data is already local. Claude Code and Codex write per-session token breakdowns to disk: uncached input, cache reads, cache writes, output, and reasoning. I wrote up exactly where those files live and how to read them here: Read your coding agent token usage.

You do not need prompt text to find these leaks. Model names, timestamps, tool usage, and token counts are enough.

If you do not want to write the parser, that is what I built ModelMeter for. The collector reads local agent logs (token counts only, never prompts or keys) and surfaces the metrics that actually change behavior:

Cache hit rate.
Context growth.
MCP/tool overhead.
Reasoning token burn.
Model routing opportunities.
Session burn rate.

npx modelmeter-collect init <your-token>
npx modelmeter-collect

One important caveat: on a subscription plan, you are not paying per token at the margin. Any dollar figure is an API list-price equivalent, not your actual bill. It is useful for scale, but it is not the thing to optimize directly.

On a flat plan, the numbers that matter are token volume, cache hit rate, context growth, and whether you are pacing the usage window. Be suspicious of any tool that tells subscription users, with fake precision, that they "spent $X this month."

The point

Token waste in coding agents is rarely mysterious once you can see it. The usual causes are cache misses, context bloat, idle tool overhead, oversized models, and no burn-rate visibility.

None of those require a new philosophy of prompting. They require instrumentation and a few boring habits.

Start with cache hit rate. If it is bad, fix that first. It is usually the cheapest win in the whole stack.

Free to try at modelmeter.dev.

Claude Code and Codex are logging your token usage locally. Here is how to read it.

Rob — Thu, 18 Jun 2026 02:40:21 +0000

Your AI coding agent's token data is already on your machine. You just haven't looked at it yet.

Claude Code and Codex both write local logs after every session. Those logs include detailed token breakdowns: uncached input, cache hits, cache writes, output. No API call needed, no provider dashboard, no guessing. The number that matters most, your prompt cache hit rate, has been sitting on your disk every time you wondered why you were burning through your weekly limit so fast.

The logs you already have

Claude Code writes a JSONL transcript for every session under ~/.claude/projects/. Each assistant message carries a usage block:

{
  "type": "assistant",
  "uuid": "f0c8...",
  "message": {
    "model": "claude-opus-4-...",
    "usage": {
      "input_tokens": 137,
      "cache_read_input_tokens": 815193,
      "cache_creation_input_tokens": 5521,
      "output_tokens": 4260
    }
  }
}

That split is the whole game. input_tokens is uncached input. cache_read_input_tokens is context served from the prompt cache. cache_creation_input_tokens is context written to cache. output_tokens is the response. On a long agent session, cache_read_input_tokens should dwarf input_tokens. If it does not, you are re-paying for the same context on every single turn.

Codex writes rollouts under ~/.codex/sessions/. It emits token_count events with a cumulative running total per session:

{ "type": "token_count", "info": { "total_token_usage": {
  "input_tokens": 0, "cached_input_tokens": 0,
  "output_tokens": 0, "reasoning_output_tokens": 0
}}}

Because Codex counts are cumulative, you take the delta between events rather than summing them.
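A minimal Python sketch of that delta step, assuming each JSONL line has exactly the shape shown above. Real rollout files may wrap the event differently, and `per_event_deltas` is a hypothetical helper name:

```python
import json

FIELDS = ("input_tokens", "cached_input_tokens", "output_tokens", "reasoning_output_tokens")


def per_event_deltas(lines):
    """Yield per-event token deltas from cumulative token_count events."""
    prev = dict.fromkeys(FIELDS, 0)
    for line in lines:
        if not line.strip():
            continue
        event = json.loads(line)
        if event.get("type") != "token_count":
            continue
        total = (event.get("info") or {}).get("total_token_usage")
        if not total:
            continue
        # Totals should only grow; clamp at zero in case a counter ever resets.
        yield {f: max(total.get(f, 0) - prev[f], 0) for f in FIELDS}
        prev = {f: total.get(f, 0) for f in FIELDS}
```

Summing the yielded deltas for a session gives back the final cumulative total, which is a quick way to check the parser.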

Reading the logs without reading your prompts

A few lines of Node walk the JSONL, sum usage per model per day, and dedupe by message uuid for Claude and by session delta for Codex, so you never double-count:

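import { readFileSync } from 'node:fs'

const file = process.argv[2]  // path to one Claude Code JSONL transcript
const seen = new Set()        // message uuids already counted
const totals = {}             // token sums keyed by model name

for (const line of readFileSync(file, 'utf8').split('\n')) {
  if (!line.trim()) continue
  const o = JSON.parse(line)
  const u = o.message?.usage
  if (!u || seen.has(o.uuid)) continue
  seen.add(o.uuid)
  // Accumulate the four usage fields by o.message.model.
  const t = (totals[o.message.model] ??= { input: 0, cacheRead: 0, cacheCreation: 0, output: 0 })
  t.input += u.input_tokens ?? 0
  t.cacheRead += u.cache_read_input_tokens ?? 0
  t.cacheCreation += u.cache_creation_input_tokens ?? 0
  t.output += u.output_tokens ?? 0
}
console.log(totals)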

Notice what you do not need: the prompt text, the response text, or any API key. Model names and token counts are enough to compute everything useful. A usage tool should never have to read what you typed, and this one does not.

The number that actually matters

Once you aggregate, one metric matters more than the rest: your prompt cache hit rate.

hit_rate = cache_read / (cache_read + cache_creation + uncached_input)

On a flat plan, this is your real efficiency lever. A high hit rate means you are reusing context instead of resending it. A low one means you are burning tokens, and your usage limit, on the same context over and over. The fix is usually structural: stabilize the front of your prompt so the cache prefix stays intact, keep tool definitions lean, and stop reshuffling system context between turns.

One honest caveat: on a subscription you do not pay per token, so any dollar figure is an API list-price equivalent, not your actual cost. It is a useful sense of scale, nothing more. The signals that genuinely matter are token volume and cache hit rate. Any tool that flashes a "you spent $X this month" number at a flat-plan user is being a little loose with what that number means.

Turning it into a live dashboard

I wrapped all of this into ModelMeter. A one-line collector reads those local logs and sends the token counts, and only the token counts, to a dashboard that shows your cache hit rate, ranks where your tokens are going, and labels every figure by how it was derived: computed from real tokens, a gated estimate, or "coming" when it needs request-level data the logs do not contain.

npx modelmeter-collect init <your-token>
npx modelmeter-collect

Add a Claude Code Stop hook or a 60-second cron job and it stays live, updating after each prompt. It works for Claude Code, Codex, or both. It also accepts usage from a metered API key via a copy-paste snippet, or from a CSV export if you would rather not run the collector at all.

Free to try at modelmeter.dev.

The point

Whether or not you use ModelMeter: you are not flying blind. Your subscription coding tool has been writing detailed usage data to your local disk after every session. Go read it. You will almost certainly find that your biggest efficiency lever is a single number you have never once looked at.

Can You Tell When an LLM API Swaps in a Cheaper Model?

Rob — Tue, 16 Jun 2026 15:33:10 +0000

If you call an open-weight model behind an API, whether that is your own box, a hosted endpoint, or a router, you are trusting that the thing answering is the model on the label. Providers have every incentive to serve a smaller or more aggressively quantized model under load. I wanted to know if you can catch that from the outside.

Short version: the obvious method fails, a less obvious one works, and it only works if you accumulate evidence.

Attempt 1: grade the output. Dead on arrival.

The intuitive idea is to send a prompt, look at the answer, and flag low-quality responses. I scored served outputs by perplexity under the model that was supposed to produce them. The result was backwards. A cheaper model (I used a 0.5B as the impostor) produces simpler, more generic, more predictable text, and predictable text has low perplexity under any model. The impostor's output scored better than the genuine model's own output, by about 0.65 bits per byte, on 9 of 10 prompts. So "flag the improbable answers" rewards the cheaper model. Scratch that.

Attempt 2: the scoring challenge.

Stop grading free-form answers. Fix a token sequence and ask the model to score it: the log-probability it assigns to that sequence, teacher-forced, one forward pass, no sampling. A model assigns higher probability to text it would itself produce, so for the same fixed sequence the genuine model is measurably more confident than a different one.

Here are the numbers, scored on my own machines with the Qwen2.5 family:

comparison                             mean gap (nats/token)    genuine wins
honest floor (same model, q4 vs q8)    about 0.00 (std 0.07)    n/a
1.5B impostor (2x cheaper)             +0.27                    8 of 10
0.5B impostor (6x cheaper)             +0.66                    10 of 10

The catch: one check is not enough.

That honest floor row is the important one. The same model at two quantizations drifts about 0.07 nats per token, centered on zero. The 2x-impostor signal of 0.27 is only about four times that, and on short, low-entropy outputs the two distributions overlap. A single scoring challenge cannot separate a 2x-cheaper impostor from an honest server running a different quant.

The means are clearly distinct though, so it works as an accumulating signal. With honest standard deviation around 0.07 and an impostor mean around 0.27, a running average over roughly 10 to 15 challenges separates them with confidence. So this is a slow background audit, not a one-shot test. Difficulty scales with how close the impostor is: a 6x downgrade falls out in a few checks, a 2x needs about a dozen, and a very close swap or a light quant downgrade may be impractical.
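One way to turn the running average into a decision rule, sketched in Python. This is an assumption about how to accumulate the evidence, not the procedure used for the numbers above: it flags a server only when the mean gap sits more than `z` standard errors above zero, and it never trusts a spread tighter than the honest floor.

```python
from math import sqrt
from statistics import mean, stdev


def audit(gaps: list[float], honest_std: float = 0.07, z: float = 3.0) -> str:
    """Classify per-challenge gaps (reference minus served log-prob, nats/token)."""
    n = len(gaps)
    if n < 2:
        return "no evidence"
    # A lucky run of near-identical gaps should not fake certainty,
    # so the spread is floored at the honest quantization noise.
    spread = max(stdev(gaps), honest_std)
    stderr = spread / sqrt(n)
    lower_bound = mean(gaps) - z * stderr
    return "suspect" if lower_bound > 0 else "no evidence"
```

Because the standard error shrinks with the square root of the number of challenges, a noisy 2x downgrade needs many more checks than a 6x one before the lower bound clears zero.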

A gotcha that cost me an hour.

I first got nonsense numbers, about -10 nats for tokens like "of" and "is", which is barely better than picking uniformly at random from a vocabulary of roughly 150,000 tokens. The cause was that in llama-cpp-python 0.3.23 the high-level create_completion logprobs are wrong. The fix is to read per-position logits straight from the context and compute the log-softmax yourself. Sanity-check any logprob pipeline against a known sentence first. English should land around 0.5 to 1.5 bits per byte under a decent model. If you see 5, your scorer is broken, not the model.
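A minimal sketch of the do-it-yourself scoring in plain Python. It takes logits as nested lists and deliberately avoids any particular runtime's API, so how you extract the per-position logits is left to your stack. It also assumes you have already aligned the rows so that row i is the distribution that predicts token i.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-softmax over one position's logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def sequence_logprob(logits_per_position: list[list[float]], token_ids: list[int]) -> float:
    """Total log-probability (nats) assigned to a fixed token sequence."""
    if len(logits_per_position) != len(token_ids):
        raise ValueError("need one logits row per scored token")
    return sum(log_softmax(row)[tok] for row, tok in zip(logits_per_position, token_ids))


def bits_per_byte(total_logprob_nats: float, text: str) -> float:
    """Convert a total log-probability in nats to bits per UTF-8 byte."""
    return -total_logprob_nats / math.log(2) / len(text.encode("utf-8"))
```

Running `bits_per_byte` on a plain English sentence is the sanity check described above: a result far outside the expected range points at the scorer, not the model.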

The honest limit.

This needs real logprob access to the model under test: open weights you serve, or a provider that exposes proper logprobs and lets you score an arbitrary sequence. Fully closed APIs that only return text are a harder problem, and I do not have a clean answer there yet. For open-weight serving, which covers most self-hosting and a good chunk of the hosted market, the scoring challenge is a usable audit.

The takeaway: you can verify an open-weight model is what it claims, but only statistically, over many checks, and the intuitive method does the opposite of what you want. I think that pattern, where the obvious metric is backwards and the real signal needs accumulation, shows up all over verification.

Local AI Needs Data-Plane Health Checks

Rob — Sun, 14 Jun 2026 20:59:02 +0000

The worst network bugs are the ones where every dashboard says green and the packet still dies.

That was my Sunday.

I have a Mac that I use as my daily machine and a Linux box called newtorob with a 2080 Ti in it. Potluck runs a local AI sidecar on each machine. The Mac can use its own model locally, or route a request to another machine in my household over a WireGuard mesh.

The product shape is simple:

Mac app -> local sidecar -> WireGuard mesh -> Linux sidecar -> model runtime -> streamed tokens back to the Mac.

This is the "my machines" path. No model API. No cloud inference. The coordinator handles roster and signaling metadata, but the prompt itself should go directly over the private mesh to my own hardware.

Everything looked connected. The Linux peer was enrolled. The coordinator knew about it. The mesh sidecar was running. The UI showed a peer. The machine had a model loaded.

Then I sent a prompt and got: no reachable peer.

The control plane was green. The data plane was dead.

The lie in "online"

Most peer health checks answer the wrong question.

A coordinator heartbeat proves the peer can talk to the coordinator.

A WebSocket connection proves the peer can keep one control connection open.

A WireGuard handshake proves two tunnel endpoints exchanged packets recently.

A capabilities response proves a process can report what it thinks it can serve.

None of those prove that an inference request can cross the exact path the product needs right now.

For a local AI mesh, the real question is not "is this peer online?"

The real question is:

Can this prompt reach that model and stream a token back right now?

That distinction matters because the failure modes sit between the layers. A peer can be present in the roster while the tunnel is broken. A tunnel can have a fresh handshake while HTTP over the tunnel fails. A model can be loaded while the process is unreachable from the other machine. A privacy VPN can silently drop traffic on an interface it does not recognize while every higher-level control check looks fine.

That last one was my bug.

The first wrong theory

My first theory was MTU.

That was not random. WireGuard-over-WireGuard paths are good at producing partial success. A small handshake packet can pass while larger data packets disappear. If path MTU discovery is broken, the tunnel looks alive and the application path dies. This is exactly the kind of problem where "connected" and "usable" diverge.

Tailscale and NetBird both default to conservative MTUs around 1280 for a reason. WireGuard adds overhead. Relays add overhead. Residential networks add weirdness. If you run a local mesh on top of another VPN, a 1420-byte default can turn into a packet shredder.
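The arithmetic behind that is worth seeing once. A sketch in Python, assuming standard WireGuard data-packet overhead (32 bytes of WireGuard framing plus the UDP and outer IP headers) and a 1500-byte physical link. Relay or PPPoE overhead would shrink the result further:

```python
# Per-packet WireGuard overhead: 32 bytes of WireGuard framing + 8 bytes UDP
# + the outer IP header (20 bytes for IPv4, 40 for IPv6).
WG_OVERHEAD = {"ipv4": 32 + 8 + 20, "ipv6": 32 + 8 + 40}


def inner_mtu(outer_mtu: int, layers: list[str]) -> int:
    """Largest packet that fits through nested WireGuard tunnels unfragmented.

    `layers` lists the outer IP version of each tunnel, outermost first.
    """
    mtu = outer_mtu
    for ip_version in layers:
        mtu -= WG_OVERHEAD[ip_version]
    return mtu


print(inner_mtu(1500, ["ipv6"]))          # 1420
print(inner_mtu(1500, ["ipv6", "ipv6"]))  # 1340
```

A tunnel configured for 1420 that actually rides inside a second tunnel only has 1340 bytes of room in the worst case, which is why small handshakes pass and full-size data packets vanish.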

So I checked the mesh MTU.

It was already 1280.

That was a useful dead end. It ruled out the cleanest explanation and left the uglier one: the packet was not too large. It was not allowed.

The real cause

The Linux box runs Mullvad. The Mac also has Tailscale. Potluck uses a potluck0 WireGuard interface and mesh IPs in the 100.64.0.0/10 range.

That combination has two separate traps.

Tailscale treats 100.64.0.0/10 as its space. Its nftables rules can drop packets from that range when they arrive on a non-tailscale0 interface.

Mullvad's killswitch is stricter. It installs nftables chains with default-drop policy and allows traffic only through interfaces it trusts. potluck0 is not one of them.

From Mullvad's perspective, this is correct behavior. A privacy VPN killswitch should not let random interfaces become escape hatches.

From Potluck's perspective, this means my own mesh interface is blocked unless I add a narrow exception.

The fix was three scoped accept rules:

nft insert rule ip filter ts-input iifname potluck0 accept
nft insert rule inet mullvad input iifname potluck0 accept
nft insert rule inet mullvad output oifname potluck0 accept

Not a flush. Not a policy change. Not disabling the VPN firewall. Just a hole for the Potluck mesh interface.

After that, the Mac could hit:

curl http://100.64.0.7:8321/health
curl http://100.64.0.7:8321/peer/capabilities

Both returned 200. The prompt routed to the Linux box. The footer showed it ran on newtorob-a16.

That fixed the immediate problem.

Then it broke again.

Reconnects erase one-shot fixes

VPN clients rebuild firewall rules.

That sentence is obvious after you have been bitten by it once. It is not obvious when you are staring at a mesh that worked five minutes ago.

Mullvad, Proton, Nord, and similar clients do not treat nftables as a stable place where your hand-inserted rule gets to live forever. Reconnect the VPN, switch servers, wake from sleep, change networks, and the client may recreate its ruleset. Your narrow exception disappears. The killswitch keeps doing its job. Your mesh goes dark again.

My first fix was a boot-time one-shot. It installed the three accept rules when the machine started. That survives reboots. It does not survive VPN reconnects.

The better fix was a watcher.

Every few seconds it checks whether the accept rules that should exist still exist. If Tailscale or Mullvad is not present, it owes nothing. If they are present and any of the three potluck0 rules are missing, it reruns the same idempotent insert path.

The loop is boring by design:

while true; do
    sleep "${POTLUCK_FW_WATCH_INTERVAL:-5}"
    if ! rules_intact; then
        log "accept rule(s) missing; reapplying"
        do_install || log "reapply hit an error; will retry on next tick"
    fi
done

The systemd unit is also boring:

Type=exec
ExecStart=/usr/local/lib/potluck/install-firewall-rules.sh --watch
ExecStop=/usr/local/lib/potluck/install-firewall-rules.sh --uninstall
Restart=always

I tested it the blunt way. Run --uninstall, confirm all three rules are missing, wait seven seconds, confirm they are back. The journal logged the reapply event. The mesh stayed usable after that.

That is not the whole product fix. It is only the repair for this Linux VPN coexistence case.

The product fix is diagnostics.

A health check needs to follow the work

The lesson is not "add firewall rules."

The lesson is that local AI needs data-plane health checks.

If a system routes inference across machines, it should have a check that uses the same route as inference. Not just the same peer. Not just the same coordinator. The same path.

For my setup, a real health check should answer separate questions:

Is the local sidecar running?
Is the coordinator reachable?
Is the peer present in the roster?
Is there a recent WireGuard handshake?
Can this machine make an HTTP request to the peer sidecar over the mesh IP?
Can the peer stream a small response from the model path?

Those are different failures with different owners and different fixes.

If the coordinator is down, restarting Mullvad will not help.

If the peer is powered off, reapplying nftables rules will not help.

If the WireGuard key in the coordinator is stale, reloading the model will not help.

If the model runtime is missing CUDA libraries, the tunnel can be perfect and inference will still fail.

If Mullvad dropped potluck0, the peer can look enrolled and still be unusable.

The UI should not compress all of that into "offline."

It should say "coordinator unreachable," "peer not present," "no tunnel," "relayed," "firewall likely blocking mesh traffic," or "model runtime not ready." The exact labels matter less than the principle: name the layer that failed.
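A small Python sketch of "name the layer that failed". The probes here are stand-in lambdas. In a real system each one would be an actual check (a roster lookup, an HTTP request over the mesh IP, a one-token generation), and none of these names come from Potluck itself:

```python
from typing import Callable

Probe = Callable[[], bool]


def diagnose(probes: list[tuple[str, Probe]]) -> str:
    """Run probes in path order and return the label of the first failing layer."""
    for failure_label, probe in probes:
        try:
            ok = probe()
        except Exception:
            ok = False  # a probe that crashes counts as a failure of its layer
        if not ok:
            return failure_label
    return "ok"


status = diagnose([
    ("coordinator unreachable", lambda: True),
    ("peer not present", lambda: True),
    ("no tunnel", lambda: True),
    ("firewall likely blocking mesh traffic", lambda: False),
    ("model runtime not ready", lambda: True),
])
print(status)  # firewall likely blocking mesh traffic
```

Stopping at the first failure keeps the report to one actionable layer instead of a cascade of downstream errors.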

Why this matters more for local AI than normal SaaS

In a normal SaaS product, most of the network path is owned by the operator. The user opens a browser. Your load balancer works or it does not. Your app servers work or they do not. There are still ugly edge cases, but the core path is under one operational umbrella.

Local AI is different.

The path crosses the user's laptop, their OS firewall, their VPN, their home router, a mesh tunnel, another machine's firewall, a model sidecar, a Python runtime, a GPU driver, and a model file on disk.

The product does not get to pretend that is one boolean.

This is especially true for "my machines" routing. The whole point is to make a user's idle hardware useful: Mac for the app, Linux box for GPU inference, Windows desktop for another model, maybe a mini PC in a closet. That is a better architecture for ownership and cost. It is also a worse architecture for lazy health checks.

The user should not need to know nftables to understand why their peer is unavailable.

The software should know enough to say: "Your peer is visible, but data-plane traffic over potluck0 is blocked. Reapply the scoped firewall rules or add a VPN killswitch exception for potluck0."

Even better, with consent, it should offer the fix.

What I changed

The immediate change was operational:

Add a narrow, reversible firewall helper for Linux systems using Tailscale and Mullvad.
Run it as a long-lived systemd watcher, not a boot-only one-shot.
Keep the scope to potluck0 accepts. Do not flush rulesets. Do not weaken the broader VPN policy.
Verify the real path with curl to /health and /peer/capabilities, then a prompt that actually runs on the remote machine.

The next change is product:

Replace the single peer-status badge with a small diagnostics model. Local host, coordinator, relay, tunnel, peer data plane, model runtime. Each layer gets a named failure and a concrete fix.

That is less elegant than a green dot.

It is also more honest.

The check I want

The check I want is not expensive.

Send a tiny HTTP probe to the peer over the mesh. Sometimes send a larger one to catch MTU and fragmentation problems. If the app is about to route inference, ask the peer for capabilities over the same path. If that passes, optionally send a tiny model-path probe before marking the peer usable for a real prompt.

Cache the answer briefly. Debounce flaps. Suppress downstream errors when an upstream layer is already broken.
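Debouncing can be as simple as requiring a few consecutive contrary results before the reported state flips. A hypothetical Python sketch (the class name and the default threshold are illustrative):

```python
class DebouncedHealth:
    """Flip the reported state only after `threshold` consecutive contrary probes."""

    def __init__(self, threshold: int = 3, initial: bool = False) -> None:
        self.threshold = threshold
        self.state = initial
        self._streak = 0

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the debounced state."""
        if probe_ok == self.state:
            self._streak = 0  # agreement resets the counter
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.state = probe_ok
                self._streak = 0
        return self.state
```

An asymmetric variant (slow to mark a peer usable, fast to mark it broken) may suit inference routing better, since sending a prompt to a dead path costs more than waiting one extra probe.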

But do not call the peer reachable just because a control-plane heartbeat exists.

That is how I lost an afternoon.

What I would tell anyone building this

If you are building local-first AI across machines, do not start with "peer online."

Start with the path:

Can the request leave this process?

Can it cross the mesh?

Can it reach the peer process?

Can the peer reach the model runtime?

Can one token come back?

Everything else is metadata.

The metadata is still useful. Heartbeats, handshakes, rosters, relay status, and capabilities all help narrow the search. But they are not proof that the system can do the work.

A local AI mesh should not ask "is the peer online?"

It should ask "can this prompt reach that model and stream a token back right now?"

That is the health check that matters.

Rob writes the Local AI Engineering Notes series on strake.dev. He's building Potluck AI, a local-first AI system that routes inference across your own machines and trusted peers, and Strake, a GitHub Action deploy gate.

Keep the Credit Ledger Off-Chain. Checkpoint It On-Chain.

Rob — Fri, 05 Jun 2026 20:42:12 +0000

I specced an on-chain credit ledger for a compute network, wrote the mint, and then the token program refused to initialize it. The two extensions I needed (non-transferable credits, plus a hook on every movement so the chain could run the accounting) are mutually exclusive on a single Solana mint. The runtime rejects the initialize instruction. That rejection sent me back to the design, and I came out the other side with the ledger off-chain and only a daily checkpoint on-chain.

This post is why I think that is the right architecture for a usage-accounting ledger in general, not just the workaround for one runtime constraint. It is specific to compute-metering credits (earned for serving inference, spent for consuming it), and most of it generalizes to any high-frequency internal accounting system that someone reaches for a blockchain to secure.

What a credit ledger actually is

Start with the workload, because the workload is what should pick the substrate.

A usage credit ledger is a meter. It has two write paths: a contributor serves a job and earns credits, a user consumes a job and spends them. There is a third, smaller path for verification adjustments and refunds. Writes are append-heavy. Reads are dominated by one question: what is this account's balance right now. Each event carries a tiny value, a fraction of a cent of compute, and the events arrive at a rate set by network traffic rather than by anything financial.
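As a data structure, that workload is an append-only event log plus a running balance per account. A toy in-memory sketch in Python, purely to show the shape. The names are hypothetical and a real ledger would live in a database, with overdraft rules and idempotency keys that are omitted here:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entry:
    account: str
    kind: str    # "earn", "spend", or "adjust"
    amount: int  # smallest credit unit; signed only for "adjust"


@dataclass
class Ledger:
    entries: list = field(default_factory=list)   # append-only event log
    balances: dict = field(default_factory=dict)  # running balance per account

    def append(self, account: str, kind: str, amount: int) -> None:
        if kind not in ("earn", "spend", "adjust"):
            raise ValueError(f"unknown kind: {kind}")
        if kind != "adjust" and amount <= 0:
            raise ValueError("earn and spend amounts must be positive")
        delta = -amount if kind == "spend" else amount
        self.entries.append(Entry(account, kind, amount))
        self.balances[account] = self.balances.get(account, 0) + delta

    def balance(self, account: str) -> int:
        """Answer the dominant read: this account's balance right now."""
        return self.balances.get(account, 0)
```

Keeping the running balance next to the log makes the common read a single lookup, while the log stays the source of truth for audits and replays.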

Put rough numbers on it. A network serving 10,000 inference requests a day generates at least 10,000 spend events, a comparable number of earn events, and some verification deltas on top. Call it 20,000 to 30,000 ledger writes a day at small scale. At real scale you multiply that by two or three orders of magnitude. This is the shape of a phone company's call-detail records or a cloud provider's billing meter. Neither of those runs on a blockchain, and the reasons they don't are the reasons a credit ledger shouldn't either.

The fee argument is the weakest one

People assume the problem with an on-chain ledger is gas, so I want to deal with that first and set it aside, because on a cheap chain it mostly doesn't hold.

Solana's base fee is 5,000 lamports per signature. Even with priority fees during contention you are paying fractions of a cent per transaction. Thirty thousand writes a day at those rates is a rounding error against any real infrastructure budget. If raw transaction cost were the only objection, I would have kept the ledger on-chain and moved on.
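The arithmetic, as a quick Python check. It assumes one single-signature transaction per ledger write and ignores priority fees. The dollar value depends on the SOL price, so the sketch stops at SOL:

```python
LAMPORTS_PER_SOL = 1_000_000_000
BASE_FEE_LAMPORTS = 5_000  # per signature, as stated above


def daily_base_fee_sol(writes_per_day: int, signatures_per_write: int = 1) -> float:
    """Base-fee cost in SOL per day, ignoring priority fees."""
    lamports = writes_per_day * signatures_per_write * BASE_FEE_LAMPORTS
    return lamports / LAMPORTS_PER_SOL


print(daily_base_fee_sol(30_000))  # 0.15
```

Thirty thousand writes a day comes to 0.15 SOL in base fees, which is the rounding error the paragraph describes.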

The real objections are three, and none of them is about fees: a protocol constraint, a migration trap, and a liveness coupling.

The protocol constraint

This is the one that physically stopped me, so it is worth being precise.

A usage credit should be non-transferable. You do not want an internal metering unit trading on a secondary market as a speculative asset, because the moment it does, the price of compute on your network starts moving with a token chart instead of with the cost of compute. Solana's Token-2022 program supports this directly through the NonTransferable extension.

An on-chain ledger also means custom logic runs on every credit movement. In Token-2022 that is the TransferHook extension: a program you control that fires on each transfer and does your accounting.

You cannot initialize a single mint with both. NonTransferable and TransferHook are mutually exclusive at the runtime level, and the initialize instruction fails if you try to combine them. The logic is consistent once you see it: a hook that fires on transfer is meaningless on a token that can never transfer. So you get one or the other. You can have non-transferable credits with no movement logic, or transferable credits with a hook. The combination an internal metering credit actually wants is the one the program will not let you build. That alone took the on-chain ledger off the table before I reached any economic argument.

The migration trap

A live balance system on-chain is the hardest component in the whole network to change.

Any program upgrade that touches how balances are represented is a migration of live, real-value state, with all the replay and ordering surface that implies. You cannot test it the way you test a service, because the thing you are mutating is everyone's money-equivalent at once. The scale precedent here is Helium's own L1-to-Solana move in April 2023: migrating a live token and balance system across chains is a multi-quarter, high-risk operation that a small team executes once and dreads.

If the ledger lives on-chain, every schema change inherits a slice of that risk. If the ledger lives off-chain, a schema change is a database migration you can stage, dry-run against a copy, and roll back. I would rather own that problem in Postgres than in a program upgrade.

The liveness coupling

If every debit is a transaction, the product's ability to settle a credit is coupled to the base layer's ability to land a transaction.

Solana has had multi-hour degradation events. During one, an on-chain ledger cannot record a spend. So either the network stops settling credits until the chain recovers, or it queues the events somewhere off-chain and replays them later, at which point you have reinvented an off-chain ledger under worse conditions than if you had designed one on purpose. The dependency runs backwards. A meter should keep metering through an outage in something it does not control.

What Helium actually did

The precedent worth copying is sitting right there, and most of the networks I look at miss it.

Helium does not put its high-frequency accounting on-chain. Proof-of-Coverage and data-usage accounting run off-chain through oracles. Solana is the settlement and state anchor. Data Credits, the unit users burn to move bytes, are non-transferable and priced at a fixed $0.00001 per packet, and the per-event question of who covered what and who used how many bytes does not hit the chain one transaction at a time. When Helium moved its L1 onto Solana, it kept that split intact.

The lesson I take from the closest structural analog to what I am building: the chain secures the anchor, and the accounting stays in a system built for accounting.

The design: hash chain plus daily checkpoint

Here is the architecture I landed on.

The ledger is an append-only log in an ordinary database. Each entry includes the hash of the previous entry, which makes the log a hash chain. You cannot alter a past entry without rewriting every entry after it, and any such rewrite is detectable by recomputing the chain from any earlier known-good hash forward. This is the same construction a blockchain uses internally, applied to a private log.
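
A minimal sketch of that construction, assuming SHA-256 and JSON-serialized entries. The field names, the genesis sentinel, and the in-memory list standing in for the database table are all illustrative, not Potluck's actual schema.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # hypothetical sentinel used as "previous hash" for entry 0


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash the previous entry's hash together with a canonical form of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append an entry that commits to everything before it."""
    prev = log[-1]["hash"] if log else GENESIS_HASH
    log.append({"prev": prev, "payload": payload, "hash": entry_hash(prev, payload)})


def first_broken_index(log: list) -> int:
    """Recompute the chain from the start; return the first bad index, or -1 if intact."""
    prev = GENESIS_HASH
    for i, entry in enumerate(log):
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["payload"]):
            return i
        prev = entry["hash"]
    return -1


log = []
append(log, {"account": "alice", "delta": 5})
append(log, {"account": "bob", "delta": -2})
append(log, {"account": "alice", "delta": -1})
print(first_broken_index(log))      # chain is intact

log[1]["payload"]["delta"] = -200   # tamper with a past entry in place
print(first_broken_index(log))      # recomputation flags the altered entry
```

An operator who rewrites every later hash as well would pass this internal check, which is why the head of the chain also has to be compared against a hash recorded somewhere the operator cannot edit.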

Once a day, you build a Merkle tree over the current balance set, take the single Merkle root, and publish that root on-chain through a tiny program whose only job is to store roots with timestamps. The program holds one 32-byte root, plus its timestamp, per checkpoint and does nothing else.

That on-chain root is the commitment. It records, with the chain's finality behind it, the exact state of the ledger at the checkpoint. Later, anyone can be handed an inclusion proof: here is your account entry, here is the Merkle path, here is the root the chain recorded that day. They verify the path against the on-chain root and confirm their balance without trusting the operator's word for it.
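
The checkpoint and the inclusion proof can be sketched in a few functions. This is an illustration under stated assumptions, not the deployed format: SHA-256, a leaf encoding of `account:balance`, leaves sorted by account name, and the last node duplicated on odd levels (a common simplification that a production tree should specify more carefully).

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf_hash(account: str, balance: int) -> bytes:
    # Prefix bytes keep leaf hashes and interior hashes in separate domains.
    return _h(b"\x00" + f"{account}:{balance}".encode("utf-8"))


def node_hash(left: bytes, right: bytes) -> bytes:
    return _h(b"\x01" + left + right)


def _padded(level: list) -> list:
    """Duplicate the last node when a level has an odd number of nodes."""
    return level + [level[-1]] if len(level) % 2 == 1 else level


def build_levels(leaves: list) -> list:
    """Return every level of the tree, leaves first, root level last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = _padded(levels[-1])
        levels.append([node_hash(cur[i], cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def inclusion_proof(levels: list, index: int) -> list:
    """Sibling hashes from leaf to root, each tagged with whether it sits on the left."""
    proof = []
    for level in levels[:-1]:
        cur = _padded(level)
        sibling = index ^ 1
        proof.append((cur[sibling], sibling < index))
        index //= 2
    return proof


def verify_proof(leaf: bytes, proof: list, root: bytes) -> bool:
    acc = leaf
    for sibling, sibling_is_left in proof:
        acc = node_hash(sibling, acc) if sibling_is_left else node_hash(acc, sibling)
    return acc == root


# Hypothetical balances; the 32-byte root is what the daily checkpoint would publish.
balances = {"alice": 120, "bob": 45, "carol": 9}
accounts = sorted(balances)
levels = build_levels([leaf_hash(a, balances[a]) for a in accounts])
root = levels[-1][0]
proof = inclusion_proof(levels, accounts.index("carol"))
print(verify_proof(leaf_hash("carol", 9), proof, root))
```

The member only needs their own leaf, the proof, and the root read from the chain. A proof for a three-account tree is two hashes long, and the length grows with the logarithm of the account count.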

What the checkpoint buys

Three things, mapping one-to-one onto the three objections.

Auditability without per-event cost. The public root is a commitment the operator cannot quietly walk back. A member who suspects their balance was altered requests an inclusion proof and checks it against the chain themselves. You get the property people actually wanted from an on-chain ledger, which is that nobody can rewrite history in the dark, without paying to write every line of history on-chain.

Lower migration risk. This is the one that changed my own roadmap. With an on-chain ledger, shipping the credit system meant launching a live balance migration, which made it the single highest-risk primitive in the build. With checkpoints, shipping means deploying a program that stores 32-byte roots. The risky work shrinks from migrating a live balance system to publishing a checkpoint. Those are different sizes of problem, and the second one I can ship in an afternoon and sleep after.

Liveness tolerance. The checkpoint cadence is daily, so a multi-hour base-layer degradation delays a root and changes nothing a user sees. The ledger keeps recording the whole time. The next checkpoint catches up. The settlement path never touched the chain to begin with, so an outage in the chain cannot stall it.

Honest limits

This design has real costs, and I would rather name them than have a reader find them.

Between checkpoints, you trust the operator. The hash chain makes tampering detectable after the fact, but a dishonest operator can still serve a wrong balance for up to one checkpoint interval before the next root exposes the divergence, and detection requires someone to actually request and verify a proof. Shorter intervals shrink the window and cost more transactions. I think daily is right for a metering credit whose balances move in small increments, and I would shorten the interval before it became a real exposure.

The hash chain proves integrity, not correctness. It proves the log was not altered. It says nothing about whether a given entry should have been written. Whether a contributor genuinely served the job they were credited for is a separate problem, solved by verification sampling, and no ledger structure substitutes for that.

Availability of the off-chain log is your responsibility. The on-chain root is permanent. The log it commits to lives in your database, and if you lose the log, the root commits to something you can no longer reconstruct. So the off-chain store needs the durability discipline of any system of record: replication, backups, tested restores, the boring parts that decide whether the clever parts survive.

And this approach assumes the credit never needs trustless real-time finality. If a unit genuinely has to be final on a public ledger the instant it moves, an exchange-traded asset for example, checkpoints are the wrong tool and you should pay the on-chain cost in full. The case where checkpoints fit is a usage credit that lives and dies inside the network and never trades.

What this means for what I am building

I am building Potluck, a peer-to-peer AI compute network, and the credit ledger is the layer that accounts for compute served and compute consumed. We run it off-chain as a hash chain and publish a daily Merkle root on-chain. I arrived here the long way, by speccing an on-chain ledger first, hitting the Token-2022 mutual-exclusivity wall, then the migration risk, then the liveness coupling. The off-chain-with-checkpoints design answered all three with one structure.

The general claim I will stand behind: a usage-accounting ledger is a meter, and a meter wants a database with a cryptographic commitment over it, not a blockchain carrying every event. Put the trust anchor on-chain. Keep the meter where meters run.

Rob writes the Local AI Engineering Notes series on strake.dev. He is also building Potluck AI, the peer-to-peer AI compute network referenced in this post, and Strake, a GitHub Action deploy gate.

Why Decentralized AI Compute Needs Two Assets, Not One

Rob — Thu, 04 Jun 2026 18:34:24 +0000

Bittensor pays roughly eight dollars in TAO token emissions for every dollar of real AI revenue that flows through the network. The exact ratio fluctuates by quarter, but the shape is durable. Q1 2026: about $328 million in annual emissions against $43 million in real AI revenue. That is 7.6 to 1. It is what the crypto-skeptical press has called "extractive by default." It is also what the crypto-friendly analysts call "the subsidy treadmill."

The Bittensor engineering team is sophisticated. The subnet validators run real ML evaluation. The miners serve real inference. The revenue is real. The emissions are also real.

The cause is the token model itself. One asset is asked to do two jobs that do not belong together. I want to be specific about this part, because every other decentralized AI compute network I have looked at has the same problem, and the fix is well-known.

What the token does

A token in a decentralized AI compute network does two structurally distinct things.

The first job is utility settlement. Contributors run inference, and someone has to pay them for the compute work they did. The payment medium has to scale with usage, has to be denominated in something the contributor can spend on the network or convert to fiat, and has to remain stable enough that contributors can plan around it. This is a billing system.

The second job is value capture. Early supporters, investors, and contributors take risk to bootstrap a network that does not yet exist. They have to be paid back for that risk in a way that scales with the eventual success of the network. The payment medium has to be a speculative asset that appreciates as the network grows. This is an equity instrument.

A billing system and an equity instrument want opposite things. A billing system that is also a speculative asset means that contributors who get paid in it cannot help but hold a speculative position. An equity instrument that is also a billing system means that token-price volatility shows up in the unit economics of the network itself.

The single-token model conflates these two jobs. Bittensor's TAO is both. Akash's AKT is both. Render's RNDR is both. The conflation is what produces the subsidy treadmill.

The mechanism

Walk through what happens in a single-token network at steady state.

Real AI revenue flows in at rate R. Token emissions flow out at rate E. Contributors decide whether to keep running nodes based on whether (R + E)/N (where N is the number of active nodes) clears their economic threshold.

In a healthy network, R rises as the network finds product-market fit, and E declines on a published glide-path. The two converge somewhere around year 5 or 7. At that convergence point, the network operates on real revenue and the equity holders capture the value that has accrued in the token.

The problem is what happens before that convergence. Contributors are sensitive to the dollar value of their pay, and the token-denominated component (E) is large compared to the real revenue component (R). When the token price falls for any reason (macro selloff, competing subnet, ETF rejection, founder error), contributors leave. When contributors leave, the network's quality of service falls. When quality of service falls, real revenue falls. When real revenue falls, the token price falls further. The feedback loop is right there in the mechanism.
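
To see how sensitive contributor pay is to the token price, here is a small worked calculation using the Q1 2026 figures quoted earlier. It assumes revenue is dollar-denominated and unaffected by the price move, and that emissions are fixed in token units, so their dollar value scales with the price. Both are simplifications. N cancels out of the relative change, so it does not appear.

```python
def payout_change(revenue_usd: float, emissions_usd: float, price_change: float) -> float:
    """Fractional change in total contributor pay (R + E) when the token price
    moves by price_change, e.g. -0.5 for a 50% fall."""
    before = revenue_usd + emissions_usd
    after = revenue_usd + emissions_usd * (1.0 + price_change)
    return after / before - 1.0


R = 43e6    # annual real AI revenue, from the figures above
E = 328e6   # annual emissions at the starting token price, from the figures above

for change in (-0.25, -0.50, -0.70):
    print(f"token {change:+.0%} -> contributor pay {payout_change(R, E, change):+.1%}")
```

With emissions at 7.6 times revenue, a 50% fall in the token price removes roughly 44% of contributor pay. If the same network earned as much as it emitted, the same fall would remove 25%.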

The way out of the loop is more emissions, faster. That is the subsidy treadmill: emit faster to retain contributors, which dilutes the token, which weakens the contributor payoff, which requires more emissions. Bittensor's 7.6 to 1 ratio is the ratio of "what we have to emit to keep contributors here" to "what the network actually does." Most defenders call it a bootstrap phase. The math says it's the equilibrium that one-asset mechanism design produces.

The two-asset alternative

Separate the assets. Use a stable utility credit for the billing job. Use a tradeable governance token for the equity job. Let each asset do the job it is good at.

The utility credit is denominated in compute time. One credit equals one normalized A100-minute equivalent. Contributors earn credits by serving inference. Users acquire credits by purchasing them with fiat at a rate set by the network. Users spend credits by consuming inference. The credit is non-transferable in v1, has no exchange listing, and has no speculative premium. It is a billing system and only a billing system.

The governance token is tradeable on a public market and exists for value capture. It is allocated to the team, early investors, the foundation treasury, the ecosystem fund, and a contributor airdrop based on cumulative credit earnings. It carries governance rights over protocol parameters, treasury allocation, and slashing policy. A percentage of public-pool fees buys back and burns the token, so token holders capture the upside of network adoption.

The two assets are kept apart on purpose: utility on one side, value capture on the other. A contributor who wants to participate in the value-capture side can earn credits and then convert their cumulative credit history into a governance-token airdrop at a Phase 2 milestone. A contributor who does not want speculation exposure can earn credits, redeem them for inference services, and never touch a tradeable asset.

Why credits are stable

The credit's value floor is the compute it represents. One credit can be redeemed for one normalized A100-minute of inference at any time. That redemption right is what makes the credit stable.

The redemption right is a use right, not an FX peg. A user who holds 1000 credits can run 1000 A100-minutes of inference. The credits do not need to trade against the dollar or against other crypto assets; they just need to clear inference requests at the rate the network publishes.

This is the same shape as Data Credits in Helium's redesign (one credit = one IoT data packet). Helium v1 used a single-token model and produced the same subsidy treadmill we see in Bittensor today. The v5 redesign split utility (Data Credits) from value capture (HNT), and the unit economics stabilized. It was the standard fix that mechanism design produces when the original mechanism fails. Nothing clever about it.

Why governance gets its own token

A governance token wants the opposite of stability. It wants to be a tradeable instrument that captures the value of the network as the network grows.

A token whose only economic function is governance plus fee accrual is what the post-2024 DeFi consensus has converged on. MKR is the canonical example: it pays no one for operations; it just captures value through buybacks funded by DAI fees and votes on protocol parameters. The market prices it on expected future fee accrual. When the network grows, fees grow, buybacks grow, the token appreciates.

That mechanism does not work if the same token is also paying contributors. Contributor payments dilute the token. Buybacks concentrate it. The two operations cancel out, and the token's price reflects the noise of the two flows rather than the signal of network growth.

When the two operations are split across two assets, both can do their job cleanly. The credit pays contributors and absorbs all the operational dilution. The governance token captures value and absorbs all the buyback concentration. The signals stop fighting each other.

The sequencing argument

The mesh ships before the token. There is no reason to launch a token into an empty network, because a token coordinates supply, and supply that does not yet exist cannot be coordinated.

The substantive engineering work that has to land before a token makes sense is the network itself. Cross-machine inference routing has to be measured. Quality-scoring has to be calibrated. The credit ledger has to be running at small scale among trusted contributors. The user-side product has to actually serve real workloads.

When those things are working, the credit ledger can scale to a wider contributor pool. When the contributor pool is real, the governance token has something to govern. When the governance token has something to govern, it can launch.

The Bittensor failure mode is launching the token first and trying to bootstrap supply through emissions. The Petals failure mode is having no token at all and hoping altruism scales. The two-asset model with mesh-first sequencing avoids both. It says: ship the boring part, prove the boring part works, then add the financial instrument that makes the network ownable.

What the math looks like

A two-asset network running at steady state has two ratios to track instead of one.

The credit ratio is real-revenue-per-credit-redeemed. A network is healthy if users redeem credits at a stable rate against fiat-denominated inference cost. If credits are inflating against compute hours (one credit redeems for less inference over time), the credit issuance mechanism is broken.

The governance ratio is fee-buyback per token emission. A healthy governance token has buybacks at or above token emissions over rolling 12-month windows. If buybacks fall meaningfully below emissions, the token is in dilution territory and the governance economics are not working.
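
Both ratios are simple enough to compute from monthly totals. The sketch below is one possible reading of the definitions above; the record fields and function names are hypothetical, and the credit check is expressed as normalized A100-minutes actually delivered per credit redeemed.

```python
from dataclasses import dataclass


@dataclass
class Month:
    buybacks_usd: float          # fees spent buying back the governance token
    emissions_usd: float         # governance tokens emitted, valued in dollars
    credits_redeemed: float      # credits spent on inference
    a100_minutes_served: float   # normalized compute delivered for those credits


def governance_ratio(months: list, window: int = 12) -> float:
    """Buybacks divided by emissions over the trailing window; >= 1.0 is healthy."""
    recent = months[-window:]
    emitted = sum(m.emissions_usd for m in recent)
    if emitted == 0:
        return float("inf")
    return sum(m.buybacks_usd for m in recent) / emitted


def minutes_per_credit(months: list, window: int = 12) -> float:
    """Compute delivered per credit redeemed; drifting below 1.0 means credits are inflating."""
    recent = months[-window:]
    redeemed = sum(m.credits_redeemed for m in recent)
    if redeemed == 0:
        return float("nan")
    return sum(m.a100_minutes_served for m in recent) / redeemed


# Hypothetical history: twelve identical months.
history = [Month(100.0, 80.0, 1000.0, 1000.0)] * 12
print(governance_ratio(history), minutes_per_credit(history))
```

A rolling window matters here because a single strong month of fees can mask a year of net dilution.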

Bittensor's published numbers do not allow a clean version of these two ratios because the two operations are conflated. But the closest analog (real revenue versus token emissions) is the 7.6 to 1 ratio. The same ratio for a two-asset network at steady state should be approximately 1 to 1 (fees fund both buybacks and contributor incentives at parity), with credits decoupled from the governance-token price.

A network designed this way can recover from a token-price drawdown without losing contributors. The contributors are paid in credits, and the credits redeem for the same compute services they always did, even if the governance token is down 70%. The governance token's price reflects market sentiment about the network's future fee accrual; it does not control the network's day-to-day unit economics.

Honest limits

The two-asset model is not free. It adds engineering complexity (two accounting systems instead of one), adds legal complexity (the credit may or may not be a security depending on jurisdiction, and the governance token almost certainly is one in the US until decentralization thresholds are met), and adds adoption friction (contributors have to learn the difference between earning credits and earning governance-token allocations).

The model also does not magically solve demand-side problems. If users do not want to buy credits, the credit is not a real billing system, regardless of how cleanly it is separated from the governance token. The two-asset model fixes the supply-side dynamics that the single-token model breaks. It does not fix the demand side. The demand side comes from building a product people actually want.

The Helium precedent is instructive on this point. The two-asset migration stabilized Helium's unit economics, but Helium still had to find a real use case (mobile carrier service) before the network became economically self-sustaining. The original IoT use case had not materialized at the supply density the network had bootstrapped. A clean mechanism cannot rescue a bad market thesis.

What this means for what I am building

I am the founder of Potluck, which is a peer-to-peer AI compute network that has been measuring the boring parts (memory mesh latency, cross-machine retrieval, inference verification) for six months. The two-asset model is the token design we have committed to. It is documented in detail at trypotluck.ai if you want the longer version.

I wrote this to make a specific argument I think is true about decentralized AI compute networks in general: the single-token model is a mechanism-design failure that no amount of execution can fix, and the fix has been understood since Helium v5 in 2023. Several networks are launching with the single-token model anyway.

If you are building one of those networks, separate the assets before you launch. The math gets better the moment you do.

Brute-Force Retrieval Holds Through 5,000 Memories. Then It Doesn't.

Rob — Wed, 03 Jun 2026 16:52:56 +0000

The last post I wrote measured how long it takes to query my Linux box's memory store from my Mac over a WireGuard mesh. The answer was about 20 milliseconds, plus or minus a few for Tailscale jitter, stable across store sizes from 10 to 500 entries. That post ended with a sentence I should not have left there without testing: "the linear scan over a few hundred vectors is negligible at this scale."

That is true at a few hundred. The honest follow-up question is the one I had skipped. At what scale does it stop being true.

I knew the rough shape of the answer. The retriever is a brute-force cosine scan over every embedding in the local store. No approximate-nearest-neighbor index. No HNSW, no IVF, no FAISS. Just a loop. At 500 entries the loop is essentially free; the embedder cost dominates the call. Somewhere above 500 the loop starts to cost real milliseconds. The question is where.
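
For readers who have not written one, a brute-force cosine scan is only a few lines. This is a standard-library sketch with hypothetical names, not the retriever in the Potluck codebase; a real implementation would vectorize the loop (for example with numpy) and usually pre-normalize the stored embeddings.

```python
import heapq
import math


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def brute_force_top_k(query, store, k=5):
    """Score every stored embedding against the query and keep the k best.

    store is a list of (memory_id, embedding) pairs.
    Returns (score, memory_id) tuples, best first.
    """
    scored = ((cosine(query, emb), memory_id) for memory_id, emb in store)
    return heapq.nlargest(k, scored)


# Tiny hypothetical store with 2-dimensional embeddings.
store = [("a", [1.0, 0.0]), ("b", [0.0, 1.0]), ("c", [1.0, 1.0])]
for score, memory_id in brute_force_top_k([1.0, 0.0], store, k=2):
    print(memory_id, round(score, 3))
```

The cost is one dot product per stored vector on every query, which is why the work grows linearly with store size.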

I ran the bench again with a --sizes flag I added that morning, and pointed it at 1,000, 5,000, and 10,000 synthetic memories on the Linux peer. Same script as before. Same Tailscale mesh. Warm embedder, 30 query samples per store size.

The measurement

I ran each cell once. Cleanup between cells so the next size starts from zero. The Mac-local p50 baseline (14.8 ms, an essentially empty embedding-pass-through) is shown for reference.

Store size on peer	p50	p95	min
Mac local (baseline)	14.8 ms	18.3 ms	7.6 ms
10 memories on peer	35.9 ms	40.7 ms	18.1 ms
100 memories on peer	36.1 ms	41.2 ms	19.1 ms
500 memories on peer	36.3 ms	46.0 ms	20.9 ms
1,000 memories on peer	42.0 ms	51.1 ms	21.3 ms
5,000 memories on peer	40.6 ms	116.5 ms	32.0 ms
10,000 memories on peer	57.6 ms	131.9 ms	47.2 ms

Three observations the table makes obvious.

First, p50 is remarkably flat through 5,000. Forty-two milliseconds at 1K, about forty-one at 5K. The store size grew five times and the median round-trip did not rise at all. The cosine scan is contributing essentially nothing to the median at these sizes; the embedder cost on the peer is still the dominant term in a typical query.

Second, p95 starts to widen at 1K and breaks at 5K. Fifty-one milliseconds at 1K, then 116 milliseconds at 5K, then 132 at 10K. The 5K p95 more than doubled vs the 1K p95 even though the median barely moved. That is the linear scan showing up in the tail. The median samples are landing in a regime where the embedder dominates, but the slow samples are catching the scan doing real work.

Third, p50 starts moving at 10K. Fifty-eight milliseconds, about sixteen above the 1K p50 and seventeen above the 5K p50. The scan is now contributing visibly to the median, not just the tail.

The threshold is somewhere between 5K and 10K. p95 has already broken by 5K; p50 breaks by 10K. The "fine" regime ends in that window.

Full bench data and the script that produced it: trypotluck.ai/benchmarks.
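
The way a p95 can break while the p50 stays put is easy to reproduce on paper. The sketch below uses a nearest-rank percentile over made-up samples; the numbers are hypothetical and are not the measured data, and the published bench script may compute its percentiles differently.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms, 30 samples per cell as in the bench.
steady = [40.0] * 30
with_stalls = [40.0] * 27 + [110.0, 115.0, 120.0]  # three slow samples out of thirty

for name, cell in (("steady", steady), ("with stalls", with_stalls)):
    print(name, "p50 =", percentile(cell, 50), "p95 =", percentile(cell, 95))
```

With 30 samples per cell, the p95 is decided by the two or three slowest queries, so a handful of stalls moves it a long way while the median does not move at all.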

What that means in practice

Most people who would use a local AI memory store do not have 10,000 entries in it. I have been dogfooding the system for a few months and my own store has 280 memories in it as of writing this. A heavy daily user storing every decision, preference, and project fact would maybe reach 2,000 in a year. The 5K threshold is something most users will never hit. The 10K threshold is firmly in power-user territory.

That is the result I wanted to be true, and it is. Brute-force cosine scan is the right default for personal AI memory at the scale most people will operate at. Adding an ANN index now would be a premature optimization that buys nothing for 95% of users and adds operational complexity for everyone.

It is also useful to know exactly when ANN starts paying off. If a user reaches around 5,000 memories, p95 starts wobbling. If they reach around 10,000, p50 starts wobbling. The right time to add HNSW or IVF-PQ to this codebase is when I see real users hitting those sizes, not before. Until then the scan is the right answer and the engineering effort is better spent elsewhere.

A note on what is in the embedder budget

p50 is forty milliseconds through 5K. About fifteen of those are the WireGuard round-trip plus FastAPI middleware. The rest is the embedder running on the peer. bge-small-en-v1.5 on a 2080 Ti via CUDA does a single query embedding in about twenty milliseconds when the model is warm. The cosine scan over 500 vectors of dimension 384 is about 0.2 milliseconds in numpy, which is below the precision of my measurement. The scan over 5,000 vectors is about 2 milliseconds, which is also below the noise floor of an HTTP-over-WireGuard probe. The scan over 10,000 vectors is about 4 milliseconds and that one is just barely visible in the median delta between 5K and 10K.

What is visible in the 5K p95 is not the average scan cost. It is the worst-case scan cost colliding with a worst-case scheduling stall, a worst-case GC pause, a worst-case context switch on the peer. The tail samples are catching the system at its slowest. As the store grows, more of the scan's wall time happens during a bad moment, so the tail widens disproportionately to the median.

The interesting engineering implication is that the first optimization that matters is not ANN. It is reducing the number of vectors the scan touches in the first place. Project-scoped filtering (only scan vectors tagged for the current project), confidence-threshold pruning (skip vectors below a minimum stored confidence), recency cutoff (skip vectors older than N days unless re-accessed) all reduce the scan size cheaply. Most realistic queries on a 10,000-vector store are not actually asking the retriever to consider all 10,000. They are asking it to consider the few hundred relevant to the current project, which puts the effective scan size right back in the "fine" regime.
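
The three filters compose into a single pass that runs before any cosine math. A sketch, assuming each memory carries a project tag, a stored confidence, and a last-accessed timestamp; the field names and default thresholds are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Memory:
    id: str
    project: str
    confidence: float
    last_accessed: datetime


def prefilter(memories, project, min_confidence=0.3, max_age_days=180, now=None):
    """Return only the memories worth scanning for this query."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        m for m in memories
        if m.project == project              # project-scoped filtering
        and m.confidence >= min_confidence   # confidence-threshold pruning
        and m.last_accessed >= cutoff        # recency cutoff
    ]


now = datetime(2026, 6, 3, tzinfo=timezone.utc)
memories = [
    Memory("m1", "potluck", 0.9, now - timedelta(days=2)),
    Memory("m2", "potluck", 0.1, now - timedelta(days=2)),    # too low confidence
    Memory("m3", "potluck", 0.9, now - timedelta(days=400)),  # too old
    Memory("m4", "strake", 0.9, now - timedelta(days=2)),     # other project
]
print([m.id for m in prefilter(memories, "potluck", now=now)])
```

Each predicate is a cheap scalar comparison, so filtering 10,000 records costs far less than 10,000 dot products over 384 dimensions.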

ANN indexing is the right move when even the project-scoped scan crosses 5K. That is a real product moment, but it is not now.

Honest limits of the measurement

This was Linux peer only, with CUDA-accelerated embedding and a 2080 Ti. Mac local would look slightly different at scale because Metal embedder throughput is slightly higher than CUDA on this generation of GPU. Windows peer would look worse because the embedder runs on CPU there. I expect the same shape, the same threshold around 5K to 10K, slightly different absolute numbers. I have not yet measured the Mac local or Windows peer cases at 1K plus and that is the next bench.

The synthetic memories the bench script populates are deliberately diverse but they are not real user memories. Real memories are more semantically clustered (you ask about kubernetes more than you ask about cassandra) which means the cosine scores will have a different distribution, which means the top-k selection will land in slightly different cache regimes. I would expect this to make the tail samples noisier in production than in the bench. The 5K and 10K p95s are probably underestimates of what a real user with a power-user store would see.

I ran each size once. A statistically defensible characterization would run each size five to ten times and report distributions, not point estimates. The p50 numbers are stable enough run-to-run that I am confident in the 5K threshold within a few hundred memories either way. The p95 numbers have wider run-to-run variance, especially at 5K and 10K, so the exact p95 values are less trustworthy than the trend.

These results are for retrieval latency only. They do not address what happens to retrieval quality as the store grows. A 10K-memory store has different recall characteristics than a 500-memory store because there is more competition for the top-k slots. That is a separate measurement and one I have not run yet. The right place for it is on the next pass.

What I would build next

In order of what actually pays off for most users:

1. Project-scoped pre-filtering in the retriever, so the cosine scan only touches vectors with a matching project tag. Cheap to implement, immediately reduces effective scan size for anyone with more than one project.
2. A tiny in-process LRU cache for embeddings of recently-asked queries. Same query within a session skips the embedder entirely. The embedder is the dominant cost at small sizes (about twenty milliseconds warm), so a 50% cache hit rate cuts mean query latency by about ten milliseconds.
3. Recency-based cold-storage tiering for memories older than some threshold and unaccessed for some other threshold. Keeps the hot scan small even as total store grows past 10K.
4. ANN indexing (probably HNSW via hnswlib for the Python bindings, or usearch for a smaller dependency) only when the hot scan size crosses 5K. That is the right moment, not before.

I will probably ship 1 this week, 2 next week, defer 3 until a user actually has the scale to need it, and defer 4 until item 3 is not enough. The order is from "fixes a thing I can measure today" to "fixes a thing I will not need to fix for months."

What this changed

The previous post about cross-machine memory query ended with a hand-wave. "Negligible at this scale" without an upper bound is not a measurement, it is a vibe. The fix was a bench script that already existed plus three CLI flags. The cost was thirty seconds of typing and fifteen minutes of waiting for the populate step to finish at 10K entries.

The result is the same shape I expected with a sharper number than I expected. The cosine scan is fine through about 5,000 entries. By 10,000 it is starting to be a real cost. Most users will live their whole Potluck life inside the "fine" regime. For the ones that don't, the right next optimization is not ANN. It is the cheaper pre-filter step that puts most queries back in the fine regime even on a large store.

The architecture was right. The threshold I was working off was wrong. The measurement tightened it from "negligible at this scale" to "negligible through five thousand, real cost past ten thousand," which is a more honest sentence to put in a benchmark caption.

Rob writes the Local AI Engineering Notes series on strake.dev. He's also building Potluck AI, the local-first AI memory system measured in this post, and Strake, a GitHub Action deploy gate.

Cross-Machine Memory Query: About 20 Milliseconds, Most Days

Rob — Wed, 03 Jun 2026 14:26:30 +0000

I wrote about hardware benchmarks twice this week. Different problem this time. Same machines.

I have a Mac for daily work, a Linux box that runs a few media services and a GPU, and a Windows desktop I keep for gaming and AMD testing. They are all on the same Tailscale-managed WireGuard mesh. Each one runs a local memory store I use with my AI coding tools.

The store is local-first by design. No vendor cloud. No memory-sharing API. When I move from the Mac to the Linux box for some weekend project, the context I built up on the Mac stays on the Mac, and the new context I build on Linux stays on Linux. That has always been a feature for me. Until last weekend, when I realized it was also a constraint I had built for myself.

I wanted to query the Linux box's memory store from my Mac.

The simplest version of the problem

The local sidecar that powers the memory store exposes an HTTP API. Locally, my agent hits POST /memory/search and gets back the relevant memories. The sidecar binds to 0.0.0.0:8321 so household-mesh peers can dial it. A middleware enforces that non-loopback callers can only reach /peer/* and /health. Everything else returns 403.
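
The rule that middleware enforces fits in one pure function. This is a sketch of the decision only, with hypothetical names, and it leaves out the framework wiring that would call it with the caller's address and the request path.

```python
import ipaddress

PEER_ALLOWED_PREFIXES = ("/peer/",)   # anything under /peer/ is open to mesh peers
PEER_ALLOWED_EXACT = ("/health",)


def is_loopback(host: str) -> bool:
    """True for 127.0.0.0/8 and ::1. Anything unparseable is treated as remote."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


def is_allowed(client_host: str, path: str) -> bool:
    """Loopback callers reach everything; other callers only /peer/* and /health."""
    if is_loopback(client_host):
        return True
    return path in PEER_ALLOWED_EXACT or path.startswith(PEER_ALLOWED_PREFIXES)


# A denied request would be answered with 403 by the middleware.
for host, path in [
    ("127.0.0.1", "/memory/search"),
    ("100.64.0.7", "/memory/search"),
    ("100.64.0.7", "/peer/memory/search"),
    ("100.64.0.7", "/health"),
]:
    print(host, path, "allow" if is_allowed(host, path) else "403")
```

Keeping the check as a pure function means the privacy boundary can be unit-tested without starting a server.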

This is the right default. The local user's memories, tasks, chat history, and observability data should not be reachable from any other machine, including machines I own, until I explicitly opt that in.

The result is a clean privacy boundary and exactly zero cross-machine memory features.

The design fork before any code

Before I wrote a line of code I had to settle one thing. Three real options.

A. Memory stays strictly local. The cross-machine claim was an aspiration that does not match the actual product. Remove the claim, do not build the feature.

B. Own-fleet memory aggregation gated behind a per-machine opt-in environment variable. When the user sets it, household-mesh peers can query that machine's full memory store. Trust is mesh-IP reachability.

C. Per-project sharing flags. Memories tagged for shared projects are queryable; everything else stays strictly local. Cleaner privacy model, more code.

A is the cleanest privacy story, but it walks back a published claim. C is the right long-run answer but it is at least three times the code to ship. B is the pragmatic prototype.

I picked B with an opt-in default of off. The privacy posture stays correct for users who never opt in. The architecture stays correct for users who do.

What the endpoint looks like

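# Peer-facing search route. Reached as /peer/memory/search, which assumes the
# router is mounted under a /peer prefix (the mount is not shown here).
@router.post("/memory/search", response_model=PeerMemorySearchResponse)
async def peer_memory_search(request: PeerMemorySearchRequest):
    # Opt-in gate, checked on every request.
    if not _peer_memory_enabled():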
        raise HTTPException(
            status_code=503,
            detail="Peer memory search is opt-in and disabled on this machine. "
                   "Set POTLUCK_PEER_MEMORY_ENABLED=1 to enable household-fleet "
                   "memory aggregation.",
        )
    if not _memory_active():
        raise HTTPException(status_code=403, detail=_DISABLED_DETAIL)

    pairs = await semantic_search(
        query=request.query,
        project=request.project,
        limit=request.limit,
        min_confidence=request.min_confidence,
        include_pinned=True,
    )
    return PeerMemorySearchResponse(
        memories=[m.model_dump() for m, _ in pairs],
        count=len(pairs),
    )

That is the whole endpoint. Same retriever as the local call. Read-only. The env var gate reads the variable per request so the user can toggle without restarting the sidecar.
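
For illustration, a gate that reads the variable per request could look roughly like this. The variable name comes from the error message above; the function body and the set of accepted spellings are assumptions, not the sidecar's actual helper.

```python
import os

ENV_VAR = "POTLUCK_PEER_MEMORY_ENABLED"  # name taken from the error message above
TRUTHY = {"1", "true", "yes", "on"}      # assumption: which spellings count as "on"


def peer_memory_enabled() -> bool:
    """Look the flag up in os.environ on every call instead of caching it at import."""
    return os.environ.get(ENV_VAR, "").strip().lower() in TRUTHY
```

The design point is only that nothing is cached: the lookup happens inside the function, so each request sees the process environment as it is at that moment.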

There is also a header comment block that documents the new trust model so the next person who reads the file does not have to guess at what is intentional.

The first measurement, and why I did not trust it

Three machines on Tailscale. Each runs the sidecar bound to 0.0.0.0:8321. Opt-in env var set on each one. I issued 30 warm-embedder queries from the Mac to each peer's /peer/memory/search and to my own local /memory/search for comparison.

Path	p50	p95
Mac local `/memory/search` (baseline)	14.8 ms	18.3 ms
Mac to Linux peer `/peer/memory/search`	29.0 ms	37.8 ms
Mac to Windows peer `/peer/memory/search`	40.8 ms	53.5 ms
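
For readers who want to reproduce columns like these, here is one way to turn raw timings into p50 and p95. It is a sketch, not the script I ran: `time_calls` and `p50_p95` are hypothetical names, and the interpolation method is one reasonable choice among several.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_calls(call: Callable[[], object], n: int = 30) -> List[float]:
    """Run `call` n times and return each wall-clock latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def p50_p95(samples_ms: List[float]) -> Tuple[float, float]:
    """Return (p50, p95) using inclusive linear interpolation between samples."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return cuts[49], cuts[94]  # 99 cut points: index 49 is p50, index 94 is p95
```

With only 30 samples the p95 rests on the slowest one or two calls, so it moves around much more than the p50 does.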

The Linux peer added 14 ms over local. The minimum cross-machine call was inside the local p95. The shape of the numbers was clean and the conclusion was easy. It feels instant.

I had a draft post written around that 14 ms. I did not publish it.

The numbers felt too generous. The probe was a single ad-hoc script against whatever store happened to be on the Linux box at the time, which was nearly empty. That is not a measurement. That is a vibe check.

I wrote a real bench.

The second measurement: the actual bench script

bench/run_peer_retrieval.py does two things.

First, it populates the peer's store with a known set of 10 synthetic memories, queries them with semantically distinct natural-language questions, and verifies recall@1, recall@3, and recall@10. This catches silent breakage in the wire path: truncation, reordering, dropped fields. All three recall numbers should be 1.00 by construction. The point of the probe is the negative result: confirming the architecture introduces no silent corruption.

Second, it populates the peer at three store sizes (10, 100, 500 memories), then issues 30 warm-embedder queries against each. The point is to characterize how latency scales with the size of the embedding scan, which is the part of the system most likely to degrade gracelessly as memory stores grow.

The populate step uses SSH local-forwarding from Mac to peer so the writes hit the peer's loopback-only /memory endpoint, satisfying the peer_access_middleware loopback check. Memory writes stay strictly local even during the bench. The query step uses the cross-machine /peer/memory/search directly.
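
The recall check in the first half of the bench reduces to a few lines. This is a sketch under assumptions: each query is assumed to have exactly one expected memory id, and `recall_at_k` is a hypothetical helper rather than the function in bench/run_peer_retrieval.py.

```python
from typing import Dict, List


def recall_at_k(results: Dict[str, List[str]], expected: Dict[str, str], k: int) -> float:
    """Fraction of queries whose expected memory id appears in the top k results."""
    if not expected:
        raise ValueError("expected must contain at least one query")
    hits = sum(
        1 for query, want in expected.items() if want in results.get(query, [])[:k]
    )
    return hits / len(expected)
```

A truncated or reordered response shows up here as a recall below 1.00 at small k, which is the kind of wire-path breakage the probe is meant to catch.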

I ran it the same evening as the first measurement. The numbers came back higher than the ad-hoc probe.

Store size on peer	p50	p95
10 memories	41.8 ms	52.6 ms
100 memories	39.5 ms	56.6 ms
500 memories	45.0 ms	85.2 ms

Overhead vs Mac local jumped from 14 ms to roughly 27 ms. Correctness was 1.00 across the board.

Two questions immediately. First: which number is right, 14 ms or 27 ms? Second: why did 500 memories show a p95 of 85 ms when 10 memories showed 52 ms? The linear-scan answer explains 500-vs-10 in principle, but only by a few milliseconds at this scale, not thirty.

The third measurement: the next morning

I ran the same bench script the next morning. Twice in a row, about ten minutes apart.

Run	10 mem p50	100 mem p50	500 mem p50	worst p95
First (evening)	41.8 ms	39.5 ms	45.0 ms	85.2 ms
Second (morning)	35.9 ms	36.1 ms	36.3 ms	46.0 ms
Third (morning, ten min later)	34.7 ms	34.9 ms	41.3 ms	50.4 ms

The two morning runs agree to within ~1.5 ms p50 at 10 and 100 memories; at 500 memories they are 5 ms apart (36.3 ms vs 41.3 ms). The evening run was 5 to 10 ms higher across the board. Same code. Same machines. Same Tailscale mesh.

The variance is the Tailscale path. Tailscale prefers direct UDP between peers when the network conditions allow; if that fails, traffic relays through Tailscale's DERP servers, which adds a hop and a few milliseconds of geographic latency. Whether a given session lands direct vs DERP can flip based on residential ISP behavior, NAT state, and time of day. The 5 to 10 ms band in my morning-vs-evening numbers is what that flip looks like from the Mac's HTTP stopwatch. The evening's worst p95 (85 ms at 500 memories) is the same flip plus the long tail of a worse direct path.

What I publish

For a stable headline: cross-machine memory query adds about 20 milliseconds of overhead over local on Mac, plus or minus 5 milliseconds of Tailscale jitter, plus a few more milliseconds of p95 widening as the peer's memory store grows past a few hundred entries. The 14 ms number from the ad-hoc probe was the low end of that band. The 27 ms from the first bench was the high end. About 20 ms is the honest middle.
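
The band comes straight out of subtracting the local baseline from each peer p50. The snippet below redoes that arithmetic with the numbers from the tables, assuming the 14.8 ms local baseline from the first measurement applies to every run.

```python
LOCAL_P50_MS = 14.8  # Mac local /memory/search baseline from the first table

# Peer p50 values quoted in the tables above (10-memory rows for the bench runs).
peer_p50_ms = {
    "ad-hoc probe (Linux peer)": 29.0,
    "bench, evening": 41.8,
    "bench, morning run 1": 35.9,
    "bench, morning run 2": 34.7,
}

overhead_ms = {name: round(p50 - LOCAL_P50_MS, 1) for name, p50 in peer_p50_ms.items()}

for name, extra in overhead_ms.items():
    print(f"{name}: +{extra} ms over local")
```

The two morning runs land at about 21 and 20 ms of overhead, between the 14 ms probe and the 27 ms evening run.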

The architectural conclusion does not depend on which number you pick. Twenty milliseconds is well inside the "feels instant" range for interactive coding-agent workflows. Even the worst measurement I have on file (85 ms p95 on a 500-memory store on a slow-Tailscale night) is shorter than what most users would perceive as a pause.

What that 20 ms is made of

Roughly: WireGuard tunnel round-trip plus HTTP request and response serialization plus FastAPI middleware plus the actual retriever call on the peer side. The WireGuard round-trip alone is around 9 to 12 ms when Tailscale lands a direct path, 14 to 18 ms when it relays through DERP. The retriever and serialization are the rest.

This is meaningfully lower overhead than I expected when I started. I had been mentally budgeting cross-machine as a 50 to 100 ms operation that I would have to design around. At 20 to 30 ms most days, it just works.

Things that surprised me along the way

I started a sidecar on Linux from an SSH session via nohup ... &. The process died as soon as the SSH session closed. SSH sessions over Tailscale's built-in SSH server do not behave like normal OpenSSH sessions. setsid works. nohup does not. Half an hour of debugging that I would rather have back.

On Windows I tried to launch the sidecar by passing set X=1 && set Y=path && python -m ... through a single cmd.exe /c chain via WMI Win32_Process.Create. The env vars did not survive. The fix was to write a .bat wrapper file and invoke that. Cleaner. Reliable. Should have been my first move.

The first peer query took 22 seconds. That was the sentence-transformers embedder lazy-loading on the peer the first time /peer/memory/search was called. Subsequent calls were ~36 ms. Worth pre-warming after a sidecar restart.

The privacy guard middleware works exactly as designed. It returned 403 from /memory/search to my Mac, returned 200 from /peer/memory/search after I set the env var, and stayed 503 from peers where I had not opted in. No accidental data leaks during all my probing.

And the headline number from my own first probe was 50% off from the rigorous measurement, which is why the rigorous measurement exists. The 14 ms number would have aged badly the first time a user with a real store ran the same probe and reported back the truth.

Honest limits of the prototype

Opt-in is all-or-nothing per machine. If I set the env var on Linux, every peer in the mesh that can dial my Linux box's IP can query the whole memory store. There is no per-project sharing flag. The cleaner project-scoped sharing model is the obvious next step.

Trust is mesh-IP reachability. Whoever can dial my mesh IP can call the peer endpoint if I have opted in. Signed-nonce challenge replacing mesh-IP-as-credential is the next hardening pass.

There is no graceful federation. If I query my Mac and it forwards to Linux, I get Linux's results back. If I want Mac's local results merged with Linux's, I do that in the client. A peer-aware retriever that automatically aggregates across all opted-in peers is the next product step.

The endpoint is read-only. Memory writes stay strictly local. That is deliberate; turning the cross-machine endpoint into a write path needs more careful trust modeling than I want to do in a prototype.
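
The client-side merge mentioned above can be small. This sketch assumes each memory is a dict with an `id` and a `confidence` field and that confidence scores are comparable across machines; both are assumptions about the payload, and `merge_results` is a hypothetical helper.

```python
from typing import Dict, List


def merge_results(local: List[Dict], peer: List[Dict], limit: int = 10) -> List[Dict]:
    """Merge two result lists, dedupe by id, and keep the highest-confidence hits."""
    best: Dict[str, Dict] = {}
    for source, memories in (("local", local), ("peer", peer)):
        for memory in memories:
            tagged = {**memory, "source": source}  # remember where the hit came from
            current = best.get(memory["id"])
            if current is None or tagged["confidence"] > current["confidence"]:
                best[memory["id"]] = tagged
    ranked = sorted(best.values(), key=lambda m: m["confidence"], reverse=True)
    return ranked[:limit]
```

Tagging each hit with its source keeps the merged list honest about which machine a memory came from.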

What this changed

Until last weekend, the Linux box was a peer that could serve me inference but not memory. The memory layer was strictly local. That was a clean privacy story but also a real product limit.

After two hours of work, the same architecture has a new opt-in endpoint that lets me query any of my machines' memory stores from any other. The default privacy posture is unchanged. The published architecture invariants still hold. The only change is that the people who want this can have it, by opting in on the machines they want to participate.

I measured it three times before I published a number because the first answer felt too clean. The truth turned out to be a band, not a point. About 20 ms most days. About 25 ms on slow-Tailscale nights. Scaling gracefully through 500 memories on the peer. That is a more honest claim than 14 ms, and it took two extra runs to learn it.

The architecture was right. The trust boundary was already where it needed to be. The thing I was missing was an env var, one new path on the peer surface, and the discipline to measure twice before publishing once.

Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.

An AMD GPU Beat My Mac on Llama 8B. The Same GPU Lost on Phi-3.

Rob — Tue, 02 Jun 2026 18:28:57 +0000

I wrote a post yesterday about why GPUs barely help small text embeddings at batch=1. Different workload, same machines. This time I ran a local LLM inference benchmark across the same three boxes. The result complicated my hardware mental model in a way I think is worth sharing.

The setup

Three machines.

A Mac M2 Pro with 16 GB of unified memory, running Metal through llama-cpp-python.

A Linux desktop with an Intel 13700K, 62 GB of RAM, and an RTX 2080 Ti with 11 GB of VRAM. CUDA 13.

A Windows desktop with an AMD 5800X, 64 GB of RAM, and an RX 6600 XT with 8 GB of VRAM. Vulkan through llama.cpp.

Four models, all Q4_K_M quantization except the last. Phi-3 mini 3.8B. Qwen 2.5 7B. Llama 3.1 8B. Llama 3.1 70B at the more aggressive Q3_K_S as a stretch test.

Ten-prompt suite, mixing short Q&A, code generation, summarization, and long context. Three runs per prompt. Median across the runs.
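
One plausible reading of that aggregation is a median of per-prompt medians. The sketch below shows that reading only; the exact aggregation in the harness may differ, and `suite_median` is a hypothetical name.

```python
import statistics
from typing import Dict, List


def suite_median(runs_by_prompt: Dict[str, List[float]]) -> float:
    """Median of per-prompt medians, in tokens per second."""
    per_prompt = [statistics.median(runs) for runs in runs_by_prompt.values()]
    return statistics.median(per_prompt)
```

Taking the median at both levels keeps a single slow run, or a single unusually long prompt, from dragging the headline number.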

The numbers

Generation tokens per second, overall median across the suite.

Model	Mac M2 Pro (Metal)	Linux 2080 Ti (CUDA)	Windows 6600 XT (Vulkan)
Phi-3 mini 3.8B	19.1	59.9	16.4
Qwen 2.5 7B	12.4	43.0	20.4
Llama 3.1 8B	11.6	40.1	20.9
Llama 3.1 70B Q3	won't fit	1.3	won't fit

The 2080 Ti winning everything makes intuitive sense. The Mac-versus-AMD comparison is the part that surprised me.

The anomaly

The RX 6600 XT is a roughly $200 used consumer GPU. It beats my Mac M2 Pro on Llama 3.1 8B by 80 percent. 20.9 tokens per second versus 11.6.

The same RX 6600 XT loses to my Mac on Phi-3 mini. 16.4 versus 19.1. A 14 percent loss.

Same hardware. Same benchmark harness. Same prompts. Opposite winner.
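
The two percentages use different baselines, which is easy to miss. A quick check with the table values (the helper name is hypothetical):

```python
def percent_faster(a: float, b: float) -> float:
    """How much faster a is than b, as a percentage of b."""
    return (a - b) / b * 100.0


llama_amd, llama_mac = 20.9, 11.6  # Llama 3.1 8B, tokens per second
phi_amd, phi_mac = 16.4, 19.1      # Phi-3 mini 3.8B, tokens per second

# AMD's lead on Llama, relative to the Mac's number.
amd_lead_on_llama = percent_faster(llama_amd, llama_mac)

# AMD's shortfall on Phi-3, also relative to the Mac's number.
amd_deficit_on_phi = (phi_mac - phi_amd) / phi_mac * 100.0

print(f"Llama 3.1 8B: AMD ahead by {amd_lead_on_llama:.0f}%")
print(f"Phi-3 mini: AMD behind by {amd_deficit_on_phi:.0f}%")
```

Both figures are measured against the Mac, so they line up with the 80 percent and 14 percent quoted above.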

The reflex answer is "noise." It is not noise. The numbers held up across three runs per cell and ten prompts per cell. They held up in the per-category breakdowns, the prompt-eval rates, and the time-to-first-token measurements. The Mac wins for small models and loses for medium models. That is the finding.

Why this happens

Phi-3 mini at Q4_K_M is about 2.2 GB of weights. That fits in the M2 Pro's cache hierarchy comfortably.

Apple Silicon's unified memory architecture means there is no host-to-device transfer. The CPU and GPU share the same physical memory pool with the same bandwidth. There is no PCIe bus to cross. Dispatch overhead is the only fixed cost.

The RX 6600 XT has more raw VRAM bandwidth than the M2 Pro's unified pool. About 256 GB/s versus 200. But for a 2.2 GB model running one token at a time, you cannot saturate that bandwidth. The compute work per dispatch is too small. The PCIe round-trip and the Vulkan driver overhead eat the win.

For Qwen 7B and Llama 8B at Q4, the model is around 5 GB. That exceeds the M2 Pro's cache. The Mac is now memory-bandwidth-bound at the SoC level, sharing 200 GB/s between CPU and GPU. The discrete card is bandwidth-bound at the VRAM level, with 256 GB/s dedicated to the GPU alone. The discrete card wins.

The threshold where this flips is roughly where the model exceeds the M2 Pro's effective cache. For Q4 quantization, that threshold lives somewhere between 3.8B and 7B parameters.
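
A back-of-envelope way to see what "bandwidth-bound" implies: if generating one token means reading every weight once, bandwidth divided by model size gives a rough ceiling on tokens per second. This estimate is an addition to the measurements, not part of the benchmark, and it ignores compute, KV cache traffic, and driver overhead. Bandwidth, model sizes, and measured rates are the figures quoted in this post.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough ceiling on tokens/second if each token reads every weight once."""
    return bandwidth_gb_s / weights_gb


# (bandwidth GB/s, weights GB, measured tokens/s), all quoted from the post.
cases = {
    "M2 Pro, Phi-3 mini": (200.0, 2.2, 19.1),
    "RX 6600 XT, Phi-3 mini": (256.0, 2.2, 16.4),
    "M2 Pro, Llama 3.1 8B": (200.0, 5.0, 11.6),
    "RX 6600 XT, Llama 3.1 8B": (256.0, 5.0, 20.9),
}

for name, (bandwidth, weights, measured) in cases.items():
    ceiling = decode_ceiling_tok_s(bandwidth, weights)
    share = measured / ceiling
    print(f"{name}: ceiling {ceiling:.0f} tok/s, measured {measured} ({share:.0%})")
```

By this estimate the 6600 XT reaches a smaller share of its ceiling on Phi-3 mini than on Llama 8B, which is at least consistent with fixed per-token overhead mattering more for the small model.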

What this means if you are buying hardware

The right question is not "which platform is faster for local AI." It is "which platform is faster for the model size I actually use."

If your loop is small specialized models. Routing classifiers. Lightweight rerankers. Sentence embedders. Mac wins. Buy more unified memory.

If your loop is 7B and 8B chat models. The midrange AMD card wins on price-per-token. Buy used.

If your loop is 13B and larger. NVIDIA's mature CUDA dispatch and the higher-end VRAM widen the gap, but the gap is still roughly proportional to the cost.

If your loop is 70B and above. None of this hardware is enough.

The honest answer to "what should I buy for local AI" is "what is the model going to be."

The hardware tier ceiling

The 70B result is the most useful data point in this benchmark, because it stops being about which platform wins.

Llama 3.1 70B at Q3 will not load on a 16 GB Mac. Will not load on an 8 GB AMD card. Runs at 1.3 tokens per second on the Linux box, with some layers offloaded to the 2080 Ti and the rest held in the 62 GB of system RAM. Time to first token is 3.8 seconds. Technically possible. Unusable for chat.

Above that tier, you need a Mac Studio M3 Ultra with 512 GB of unified memory, or a 192 GB DDR5 workstation with a 24 GB GPU, or a multi-GPU rig. Those exist. They are not most developers' desks.

DeepSeek V3 and R1 sit higher still. At Unsloth's most aggressive Q1.58 quant they need around 131 GB of unified memory or 192 GB of system RAM. People do run them on consumer hardware. Just not on the kind of consumer hardware most developers own.

The "you need pooled compute" argument used to feel abstract to me. It does not anymore. There is a specific tier of model your current desk cannot run. Whichever model that is, that is where pooled compute starts to matter.

The summary

Hardware-versus-model-size matters more than vendor for local model inference. The Mac M2 Pro wins for small models that fit in cache. The discrete GPUs win once the model exceeds cache. The cheap AMD card is competitive with the more expensive NVIDIA card on price-per-token. None of this hardware runs 70B usably, and the larger 671B-class models need a hardware tier above any of it.

There is no universal winner. The right hardware depends on which model is in your loop.

If you are about to spend money on hardware for local AI, run the benchmark on the model you actually use before you commit. The vendor wars are not the answer.

Your GPU Probably Isn't Helping Your Retrieval System

Rob — Tue, 02 Jun 2026 16:28:07 +0000

Most "just use a GPU" advice is wrong for how anyone actually runs small models.

I spent yesterday benchmarking a 33M parameter embedding model across five hardware backends. The results were not what I expected.

The setup

Model: BAAI/bge-small-en-v1.5. 33M params, 384-dim output. The workhorse small embedder a lot of retrieval systems use.

Workload: LongMemEval oracle split, 500 instances, batch=1, single-query retrieval per call. Published academic benchmark, not a synthetic microbench. The query distribution is realistic.

Three machines.

A Mac M2 Pro with 16 GB of unified memory, running Metal via PyTorch MPS.

A Linux desktop with an Intel 13700K, 62 GB of RAM, and an RTX 2080 Ti with 11 GB of VRAM. Running Ubuntu 22.04 and CUDA 13. I had to fight the driver to get there. More on that below.

A Windows desktop with an AMD 5800X, 64 GB of RAM, and an RX 6600 XT with 8 GB of VRAM. Running Windows 11 with DirectML on top.

Metric: p50 and p95 latency per embedding call, measured across 500 instances.

The numbers

Backend	Hardware	p50 (ms)	p95 (ms)
Metal	M2 Pro (unified memory)	10.6	35.0
CUDA 13	RTX 2080 Ti	17.8	20.8
CPU	Intel 13700K	22.2	34.6
CPU	AMD 5800X	18.9	21.7
DirectML	RX 6600 XT	17.9	22.0

What jumps out

DirectML on the 6600 XT is statistically break-even with the AMD CPU on the same machine.

The GPU "acceleration" did nothing.

CUDA on the 2080 Ti wins, but only by about 20 percent on p50 and 40 percent on p95. That is a GPU costing five times what the 6600 XT does. The win is real but modest.

Metal wins on p50 by a wide margin, though its 35.0 ms p95 is the widest tail in the table. Unified memory eliminates host-to-device transfer entirely. There is no copy to make. Only dispatch overhead.

Why

At batch=1 with a 33M parameter model, you are dispatch-bound. Not compute-bound.

The per-call cost of kernel dispatch plus host-to-device transfer roughly equals the cost of just running the forward pass on a modern CPU. The GPU never gets a large enough block of work to justify its setup overhead.

This is the part of the "use a GPU" advice nobody mentions.

It is correct for big models. It is correct for big batches. It is correct for long sequences.

For small models running one query at a time, the math goes the other way.

The threshold shifts based on three things.

Model size: more params, more compute, more amortization of dispatch.

Batch size: more parallel work, same overhead spread thinner.

Sequence length: longer prompts, more matmul, same logic.

If you are embedding 32 documents at once on a serious GPU, CUDA wins decisively. If you are embedding one query at a time on a midrange consumer card, you are paying GPU tax to do CPU-scale work.

A debugging story you will probably relate to

On the Linux box, torch.cuda.is_available() returned False on a working CUDA install.

My first move was to blame the driver. It was on 535. Surely too old.

I bumped it to 580 and the problem went away.

The fix was correct. The diagnosis was not.

The actual root cause: torch 2.12.0+cu130 ships the CUDA 13 runtime. The 535 driver only supported CUDA 12.2. PyTorch needs runtime and driver to be ABI compatible. They were not. CUDA reported unavailable.

The driver was not "too old" in some general sense. It was specifically mismatched with my torch build tag.

I spent an hour writing the wrong root cause into my notes before I checked torch.version.cuda and saw the actual story.

If you are debugging a "broken" CUDA install, here is the order to check.

First, what CUDA version was your torch wheel built against. torch.__version__ tells you. cu130 means CUDA 13.

Then, what does your driver expose. nvidia-smi shows the max CUDA version it supports.

Mismatched ABI is failure mode one. Outdated driver is failure mode two. They look identical until you check.
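
The first two checks can be turned into a small comparison. This sketch parses the build tag from a version string such as the one torch.__version__ returns and compares it with the maximum CUDA version read by hand from the nvidia-smi header. The helper names are hypothetical, and "driver maximum must be at least the wheel's version" is a conservative simplification: CUDA's minor-version compatibility may allow more combinations within one major version.

```python
import re
from typing import Optional, Tuple


def wheel_cuda_version(torch_version: str) -> Optional[Tuple[int, int]]:
    """Parse the CUDA tag from a string like '2.12.0+cu130' into (13, 0)."""
    match = re.search(r"\+cu(\d{2,})$", torch_version)
    if match is None:
        return None  # CPU-only or differently tagged build
    digits = match.group(1)
    # Tag convention: the last digit is the minor version, the rest is the major.
    return int(digits[:-1]), int(digits[-1])


def driver_supports(wheel: Tuple[int, int], driver_max: Tuple[int, int]) -> bool:
    """True when the driver's maximum CUDA version is at least the wheel's."""
    return driver_max >= wheel


wheel = wheel_cuda_version("2.12.0+cu130")
print(wheel, driver_supports(wheel, (12, 2)))  # the 535-driver situation from this post
```

Run against the two versions from this story, the comparison fails for a driver that tops out at CUDA 12.2 and passes once the driver reports 13.0.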

The benchmark methodology trap

One more result caught me cleanly. Worth flagging.

Total wall time across 500 instances went up when I enabled CUDA. From 35.6 seconds on CPU to 64.6 seconds on CUDA. Even though per-call latency dropped.

The reason: my benchmark was re-initializing the model for every instance. GPU context init dominated the runtime. Per-call latency improved. Throughput regressed. CUDA looked slower in aggregate.

In production, where the model is loaded once and serves many queries, this inverts entirely.

If your benchmark structure does not match your production structure, you will measure the wrong thing. Cold-start cost is real. It lives outside the fast path.
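
The structural difference is easy to show with a stand-in model. `FakeModel` below is hypothetical and does no real work; it only counts how many times it is constructed, which is where a real embedder would pay for model load and GPU context init.

```python
from typing import List


class FakeModel:
    """Stand-in for an embedder; counts how often it is constructed."""

    loads = 0

    def __init__(self) -> None:
        FakeModel.loads += 1  # a real model would pay its cold-start cost here

    def embed(self, text: str) -> int:
        return len(text)  # placeholder for a forward pass


def bench_reinit(queries: List[str]) -> List[int]:
    """The trap: build the model inside the loop, once per instance."""
    return [FakeModel().embed(q) for q in queries]


def bench_load_once(queries: List[str]) -> List[int]:
    """Production-shaped: build once, then serve every query."""
    model = FakeModel()
    return [model.embed(q) for q in queries]
```

Both functions return the same embeddings. Only the first pays the cold-start cost on every instance, so only the first lets that cost dominate total wall time.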

The takeaway

For small embedders at batch=1, Metal with unified memory is fastest at p50. CUDA is modestly better than CPU. DirectML is break-even with CPU.

The latency answer is honestly kind of boring.

The reason is the part that generalizes. Dispatch overhead dominates at this scale.

If you are reaching for a GPU because that is what you do with ML, measure first. There is a real chance your CPU is fine and you are optimizing the wrong thing.

Why Strake Is Free Right Now

Rob — Mon, 01 Jun 2026 13:47:19 +0000

Strake is free right now.

Not because deploy safety has no value. Not because pricing does not matter. And not because I think free users magically turn into a business.

It is free because I need real teams trying it on real pull requests.

That is the only feedback that matters at this stage.

I can make the site cleaner. I can rewrite the homepage. I can tune the scoring rules in a local demo until they feel right. None of that tells me what happens when Strake comments on an actual PR and an engineer has to decide whether the verdict helped.

That is what I want to learn.

Was the GO obvious?

Was the HOLD annoying?

Did the CRITICAL verdict catch something the team would have shipped anyway?

Did the PR comment give enough context, or did the engineer still have to open PagerDuty, Datadog, Slack, GitHub, and a wiki tab to understand what was going on?

Those are product questions, not pricing questions.

I made the product smaller

Earlier versions of Strake were trying to explain too much at once.

Incident workflow. Runbooks. Service health. Deploy history. Dependency changes. Operational memory.

All of those pieces still matter. They are part of where Strake is going. But they were too much to lead with.

The first useful thing is much simpler:

Strake is a GitHub Action that tells you whether a deploy is riskier than it looks.

It runs in the pull request. It reads production context. It posts a GO / HOLD / CRITICAL verdict where the team is already deciding whether to ship.

That is the part I want people to try first.

Not a dashboard.

Not a new ceremony.

Not another place engineers have to remember to check before a release.

Open a PR. Run the gate check. Read the verdict. Decide whether to ship.

Why I moved it to GitHub Actions

The shift to GitHub Actions was not just an integration choice. It changed how I think about the product.

Most deploy-safety workflows ask you to leave the place where the deploy decision is made in order to inspect the deploy risk.

That sounds small, but it is where the habit breaks.

If the answer is in another dashboard, someone has to remember the dashboard exists.

If the answer is in Slack, someone has to ask the right question in the right channel at the right time.

If the answer is in a wiki, someone has to know what to search for.

The PR is different.

The PR is already where the decision is happening.

That is where the gate belongs.

A green build tells you the code passed the checks it was given. It does not tell you production is already fragile. It does not tell you an incident is open on the same service. It does not tell you the last deploy failed. It does not tell you the dependency tree moved in a weird way. It does not tell you the runbook is missing.

Those are deploy-boundary questions.

They should show up before the deploy goes out.

GitHub Actions is boring in the right way. Teams already use it. It already runs on PRs. It already has a place to report status. Strake should meet the deploy decision there instead of asking teams to build a new habit from scratch.

Why free is the right first step

I do not want someone to ask, "Is Strake worth buying?" before they know whether Strake is worth using.

That is backwards.

The right first test is small:

Pick one repo.

Install the Action.

Open a pull request.

See whether the verdict is useful.

If the gate is noisy, I want to hear that. If it misses context your team cares about, I want to hear that too. If it catches a deploy risk you would have otherwise waved through, that is the signal I am looking for.

Charging too early would make the feedback worse.

People get polite when money is involved. They start evaluating plan limits, seat counts, procurement, and whether the pricing model maps to their org chart. I do not need that yet.

I need blunt product feedback from teams that actually ship production software.

So the first version is free to start.

What free does not mean

Free does not mean Strake is a toy.

It means the product is intentionally narrow right now.

The first question is:

Can Strake make one deploy decision better?

If the answer is no, a paid plan would not fix that.

If the answer is yes, the next questions get more interesting. More repos. Longer signal history. Team controls. Support. Security review. Better runbook workflows. Whatever a real production team needs before this becomes part of how they ship.

Those are good problems to earn later.

Right now, I would rather have five teams try Strake on one real repo and tell me exactly where the verdict is wrong than have fifty people admire a polished demo.

Who I want using it

Strake is not for teams with a mature release engineering group and years of internal deploy tooling.

It is for the teams in the middle.

You have customers in production. You ship through GitHub. You have PagerDuty or Datadog or Slack alerts or some mix of all three. You have runbooks somewhere, but they are not always where the on-call engineer needs them. You do not have a full SRE bench to build internal guardrails from scratch.

You still need to answer the same question before you push:

Is this deploy riskier than it looks?

That is what Strake is for.

Try it on one repo

The ask is intentionally small.

Try Strake on one repo.

Do not migrate your incident process. Do not rebuild your release workflow. Do not sit through a long demo if you would rather just see the thing work.

Install the GitHub Action, open a pull request, and judge the verdict.

If it is useful, keep going.

If it is too noisy or too quiet, tell me where.

That feedback is the whole reason Strake is free right now.

Try Strake on one repo.

You're Migrating Off Opsgenie. Here's What You Should Actually Fix.

Rob — Thu, 26 Mar 2026 14:35:11 +0000

Opsgenie's end-of-support is April 2027. If you're on a small engineering team, you're probably mid-migration right now — comparing PagerDuty pricing tiers, reading incident.io vs. BetterStack threads, maybe resigning yourself to Jira Service Management because you're already deep in the Atlassian ecosystem.

I want to suggest something uncomfortable before you pick your next tool: alerting was never your actual problem.

I managed Opsgenie rotations at three different companies over the past eight years. FreightWaves, TextNow, Pilot Flying J. Different industries, different stacks, different team sizes. The pattern was always the same.

Someone would deploy a change. Something would break. Opsgenie would page the on-call engineer. That engineer would open a Notion doc titled "Runbook — Service X" that hadn't been updated since 2022. They'd mostly ignore it and Slack the person who wrote the service. That person would fix it. Everyone would move on. Two weeks later, something similar would happen again.

Opsgenie did its job perfectly. It routed the alert to the right person. The problem was everything else.

The question nobody was asking

At none of those companies — not one — did anyone ask the obvious question before deploying: is it safe to push right now?

Not "did CI pass." Not "did someone approve the PR." I mean: is the system healthy enough to absorb a change right now? Are we burning through error budget? Is there already an active incident? Did someone just deploy 20 minutes ago and we haven't seen the impact yet?

Nobody asked because there was no way to answer it. The information existed — scattered across Datadog, PagerDuty, GitHub, Slack — but nobody had assembled it into a single decision. So engineers deployed based on gut feel. "Seems fine." "I don't see anything in Slack." "The dashboards look okay I guess."

43% of incidents are preceded by a recent deploy. That number didn't surprise me at all when I first saw it. It matched what I'd lived through.

The runbook problem is worse than you think

Here's the thing about the Opsgenie migration conversation that nobody is having: most teams using Opsgenie didn't just use it for alerting. It was their entire incident process. Alert comes in, Opsgenie pages someone, that person figures it out. There was no structure beyond that.

The runbooks — if they existed — lived in Confluence or Notion. I wrote about this in incident management without a dedicated SRE, and the core problem hasn't changed: a runbook that's three clicks away from the alert that triggered it is a runbook that doesn't get opened at 3am.

I've seen this enough times to have a visceral reaction to it. The on-call engineer gets paged, opens Slack, asks "has anyone seen this before?" and waits. Meanwhile the customer is staring at a broken login page. The runbook that would have told them to check the config deployment and roll back the last change is sitting in a Confluence space that the engineer didn't even know existed.

Teams that connect their runbooks directly to their alerts — so the runbook opens automatically when the relevant alert fires — cut their mean time to resolution from 67 minutes to 23. That's not a marginal improvement. That's the difference between an incident that costs you a customer and one that costs you 20 minutes.

What "migrating off Opsgenie" should actually mean

If you're going to rip out a core piece of your incident workflow, this is the moment to ask harder questions than "which alerting tool has the best Slack integration."

The questions I'd ask:

Do you know whether it's safe to deploy right now? Not in a gut-feel way. In a "here's your error budget status, here's your active incident count, here's your deploy velocity over the last 24 hours" way. If you don't have that, you're going to keep causing the incidents that your new shiny alerting tool routes to your team.

When someone gets paged, do they know what to do? Not "figure it out." Actually know — because the runbook showed up in front of them automatically, with the steps they need and the context about what changed. If your runbooks are still in a wiki, your migration isn't going to fix the thing that actually hurts.

Are you learning anything from your incidents? Not in a blameless-postmortem-Google-Doc way. I mean: does your system know that this service broke last Tuesday after a similar deploy? Does your deploy process incorporate the history of what's gone wrong before? Most teams I've worked with have zero institutional memory. Every incident is treated as a surprise, even when it's the third time it's happened.

Don't just swap your paging tool

The Opsgenie shutdown is a forcing function. Use it.

If you just swap Opsgenie for PagerDuty or BetterStack, you'll have the same problem in a different UI. Engineers deploying blind. Runbooks gathering dust. Your on-call rotation burning people out because every incident starts from scratch. I wrote about the monitoring version of this trap in your startup doesn't need better monitoring — the tooling isn't the bottleneck. The process is.

The actual fix is a layer that sits before your alerting tool. A deploy gate that tells your team whether it's safe to push. Connected runbooks that show up when things break. Incident data that compounds into institutional knowledge so your team stops relearning the same failure every quarter.

That's what I'm building at Strake. It's in private beta right now and it works alongside whatever alerting tool you pick — PagerDuty, BetterStack, Grafana OnCall, whatever. The deploy gate and the runbook layer are the parts that were always missing, regardless of who was routing the page.

If you're mid-migration and want to talk through how your team handles deploy safety, I'm happy to jump on a call. Not a pitch — I'm genuinely trying to learn from teams going through this right now.

Rob is building Strake — a deploy gate and incident workflow platform for engineering teams without dedicated SRE coverage. If your current incident process is "someone posts in Slack and we figure it out from there," come take a look at strake.dev.