<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Rob</title>
    <description>The latest articles on DEV Community by Rob (@newtorob).</description>
    <link>https://dev.to/newtorob</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F8356%2F59523132-eb85-458e-9db5-b57cb8ee59b2.jpeg</url>
      <title>DEV Community: Rob</title>
      <link>https://dev.to/newtorob</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/newtorob"/>
    <language>en</language>
    <item>
      <title>Local AI Needs Data-Plane Health Checks</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Sun, 14 Jun 2026 20:59:02 +0000</pubDate>
      <link>https://dev.to/newtorob/local-ai-needs-data-plane-health-checks-2ene</link>
      <guid>https://dev.to/newtorob/local-ai-needs-data-plane-health-checks-2ene</guid>
      <description>&lt;p&gt;The worst network bugs are the ones where every dashboard says green and the packet still dies.&lt;/p&gt;

&lt;p&gt;That was my Sunday.&lt;/p&gt;

&lt;p&gt;I have a Mac that I use as my daily machine and a Linux box called &lt;code&gt;newtorob&lt;/code&gt; with a 2080 Ti in it. Potluck runs a local AI sidecar on each machine. The Mac can use its own model locally, or route a request to another machine in my household over a WireGuard mesh.&lt;/p&gt;

&lt;p&gt;The product shape is simple:&lt;/p&gt;

&lt;p&gt;Mac app -&amp;gt; local sidecar -&amp;gt; WireGuard mesh -&amp;gt; Linux sidecar -&amp;gt; model runtime -&amp;gt; streamed tokens back to the Mac.&lt;/p&gt;

&lt;p&gt;This is the "my machines" path. No model API. No cloud inference. The coordinator handles roster and signaling metadata, but the prompt itself should go directly over the private mesh to my own hardware.&lt;/p&gt;

&lt;p&gt;Everything looked connected. The Linux peer was enrolled. The coordinator knew about it. The mesh sidecar was running. The UI showed a peer. The machine had a model loaded.&lt;/p&gt;

&lt;p&gt;Then I sent a prompt and got: no reachable peer.&lt;/p&gt;

&lt;p&gt;The control plane was green. The data plane was dead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The lie in "online"
&lt;/h2&gt;

&lt;p&gt;Most peer health checks answer the wrong question.&lt;/p&gt;

&lt;p&gt;A coordinator heartbeat proves the peer can talk to the coordinator.&lt;/p&gt;

&lt;p&gt;A WebSocket connection proves the peer can keep one control connection open.&lt;/p&gt;

&lt;p&gt;A WireGuard handshake proves two tunnel endpoints exchanged packets recently.&lt;/p&gt;

&lt;p&gt;A capabilities response proves a process can report what it thinks it can serve.&lt;/p&gt;

&lt;p&gt;None of those prove that an inference request can cross the exact path the product needs right now.&lt;/p&gt;

&lt;p&gt;For a local AI mesh, the real question is not "is this peer online?"&lt;/p&gt;

&lt;p&gt;The real question is:&lt;/p&gt;

&lt;p&gt;Can this prompt reach that model and stream a token back right now?&lt;/p&gt;

&lt;p&gt;That distinction matters because the failure modes sit between the layers. A peer can be present in the roster while the tunnel is broken. A tunnel can have a fresh handshake while HTTP over the tunnel fails. A model can be loaded while the process is unreachable from the other machine. A privacy VPN can silently drop traffic on an interface it does not recognize while every higher-level control check looks fine.&lt;/p&gt;

&lt;p&gt;That last one was my bug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first wrong theory
&lt;/h2&gt;

&lt;p&gt;My first theory was MTU.&lt;/p&gt;

&lt;p&gt;That was not random. WireGuard-over-WireGuard paths are good at producing partial success. A small handshake packet can pass while larger data packets disappear. If path MTU discovery is broken, the tunnel looks alive and the application path dies. This is exactly the kind of problem where "connected" and "usable" diverge.&lt;/p&gt;

&lt;p&gt;Tailscale and NetBird both default to conservative MTUs around 1280 for a reason. WireGuard adds overhead. Relays add overhead. Residential networks add weirdness. If you run a local mesh on top of another VPN, a 1420-byte default can turn into a packet shredder.&lt;/p&gt;
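
&lt;p&gt;The arithmetic behind that is short. A rough sketch, assuming a 1500-byte Ethernet path, an outer VPN that is itself WireGuard, and WireGuard's per-packet overhead of an outer IP header, an 8-byte UDP header, and 32 bytes of WireGuard framing. The function name is made up for illustration:&lt;/p&gt;

```python
# Per-packet overhead WireGuard adds on the outer path, in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WIREGUARD_HEADER = 32  # 4 type + 4 receiver index + 8 counter + 16 auth tag


def inner_mtu(outer_mtu: int, outer_is_ipv6: bool = True) -> int:
    """Largest inner packet that fits in one outer packet without fragmenting."""
    ip_header = IPV6_HEADER if outer_is_ipv6 else IPV4_HEADER
    return outer_mtu - ip_header - UDP_HEADER - WIREGUARD_HEADER


ethernet = 1500
single_tunnel = inner_mtu(ethernet)        # one WireGuard layer over Ethernet
nested_tunnel = inner_mtu(single_tunnel)   # a mesh tunnel riding inside that VPN

print(single_tunnel)          # 1420
print(nested_tunnel)          # 1340
print(nested_tunnel >= 1280)  # True: a 1280 mesh MTU still fits
print(nested_tunnel >= 1420)  # False: a 1420 mesh MTU does not
```

&lt;p&gt;With an IPv6 outer header, a second WireGuard layer leaves 1340 bytes. A mesh tunnel that keeps the 1420 default asks for more than that path can carry, while 1280 fits.&lt;/p&gt;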

&lt;p&gt;So I checked the mesh MTU.&lt;/p&gt;

&lt;p&gt;It was already 1280.&lt;/p&gt;

&lt;p&gt;That was a useful dead end. It ruled out the cleanest explanation and left the uglier one: the packet was not too large. It was not allowed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real cause
&lt;/h2&gt;

&lt;p&gt;The Linux box runs Mullvad. The Mac also has Tailscale. Potluck uses a &lt;code&gt;potluck0&lt;/code&gt; WireGuard interface and mesh IPs in the &lt;code&gt;100.64.0.0/10&lt;/code&gt; range.&lt;/p&gt;

&lt;p&gt;That combination has two separate traps.&lt;/p&gt;

&lt;p&gt;Tailscale treats &lt;code&gt;100.64.0.0/10&lt;/code&gt; as its space. Its nftables rules can drop packets from that range when they arrive on a non-&lt;code&gt;tailscale0&lt;/code&gt; interface.&lt;/p&gt;

&lt;p&gt;Mullvad's killswitch is stricter. It installs nftables chains with default-drop policy and allows traffic only through interfaces it trusts. &lt;code&gt;potluck0&lt;/code&gt; is not one of them.&lt;/p&gt;

&lt;p&gt;From Mullvad's perspective, this is correct behavior. A privacy VPN killswitch should not let random interfaces become escape hatches.&lt;/p&gt;

&lt;p&gt;From Potluck's perspective, this means my own mesh interface is blocked unless I add a narrow exception.&lt;/p&gt;

&lt;p&gt;The fix was three scoped accept rules:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nft insert rule ip filter ts-input iifname potluck0 accept
nft insert rule inet mullvad input iifname potluck0 accept
nft insert rule inet mullvad output oifname potluck0 accept
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not a flush. Not a policy change. Not disabling the VPN firewall. Just a hole for the Potluck mesh interface.&lt;/p&gt;

&lt;p&gt;After that, the Mac could hit:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://100.64.0.7:8321/health
curl http://100.64.0.7:8321/peer/capabilities
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both returned 200. The prompt routed to the Linux box. The footer showed it ran on &lt;code&gt;newtorob-a16&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That fixed the immediate problem.&lt;/p&gt;

&lt;p&gt;Then it broke again.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reconnects erase one-shot fixes
&lt;/h2&gt;

&lt;p&gt;VPN clients rebuild firewall rules.&lt;/p&gt;

&lt;p&gt;That sentence is obvious after you have been bitten by it once. It is not obvious when you are staring at a mesh that worked five minutes ago.&lt;/p&gt;

&lt;p&gt;Mullvad, Proton, Nord, and similar clients do not treat nftables as a stable place where your hand-inserted rule gets to live forever. Reconnect the VPN, switch servers, wake from sleep, change networks, and the client may recreate its ruleset. Your narrow exception disappears. The killswitch keeps doing its job. Your mesh goes dark again.&lt;/p&gt;

&lt;p&gt;My first fix was a boot-time one-shot. It installed the three accept rules when the machine started. That survives reboots. It does not survive VPN reconnects.&lt;/p&gt;

&lt;p&gt;The better fix was a watcher.&lt;/p&gt;

&lt;p&gt;Every few seconds it checks whether the accept rules that should exist still exist. If Tailscale or Mullvad is not present, it owes nothing for that client. If either is present and any of its &lt;code&gt;potluck0&lt;/code&gt; rules is missing, it reruns the same idempotent insert path.&lt;/p&gt;

&lt;p&gt;The loop is boring by design:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
    &lt;/span&gt;&lt;span class="nb"&gt;sleep&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;POTLUCK_FW_WATCH_INTERVAL&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;5&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt; rules_intact&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
        &lt;/span&gt;log &lt;span class="s2"&gt;"accept rule(s) missing; reapplying"&lt;/span&gt;
        do_install &lt;span class="o"&gt;||&lt;/span&gt; log &lt;span class="s2"&gt;"reapply hit an error; will retry on next tick"&lt;/span&gt;
    &lt;span class="k"&gt;fi
done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
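
&lt;p&gt;The loop above leans on a &lt;code&gt;rules_intact&lt;/code&gt; check. Here is one way that decision could look, sketched in Python rather than shell so it is easy to test. It is not the script's actual code. It assumes the caller has already collected each chain's listing as text (for example from &lt;code&gt;nft list chain&lt;/code&gt;), with one rule per line, and passes &lt;code&gt;None&lt;/code&gt; for a chain that does not exist:&lt;/p&gt;

```python
# The three accept rules from the fix above, keyed by the chain they live in.
REQUIRED = {
    "ip filter ts-input": ["iifname potluck0 accept"],
    "inet mullvad input": ["iifname potluck0 accept"],
    "inet mullvad output": ["oifname potluck0 accept"],
}


def normalize(line: str) -> str:
    """Drop quotes and extra whitespace so quoted interface names still match."""
    return " ".join(line.replace('"', "").split())


def missing_rules(chains: dict) -> list:
    """Return the required rules that are absent.

    chains maps a chain name to its listing text, or to None when the chain
    does not exist (that client is not installed, so nothing is owed for it).
    """
    missing = []
    for chain, rules in REQUIRED.items():
        listing = chains.get(chain)
        if listing is None:
            continue  # chain absent: no exception needed there
        present = {normalize(line) for line in listing.splitlines()}
        for rule in rules:
            if rule not in present:
                missing.append(f"{chain}: {rule}")
    return missing


def rules_intact(chains: dict) -> bool:
    return not missing_rules(chains)


# Illustrative listing text, not captured nft output: Mullvad has rebuilt
# its chains after a reconnect, Tailscale's chain still has the accept rule.
after_reconnect = {
    "ip filter ts-input": 'iifname "potluck0" accept',
    "inet mullvad input": "ct state established accept",
    "inet mullvad output": "ct state established accept",
}
print(rules_intact(after_reconnect))  # False
for rule in missing_rules(after_reconnect):
    print(rule)
```

&lt;p&gt;Treating an absent chain as "nothing owed" is what keeps the watcher quiet on machines that run neither client.&lt;/p&gt;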



&lt;p&gt;The systemd unit is also boring:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;exec&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/local/lib/potluck/install-firewall-rules.sh --watch&lt;/span&gt;
&lt;span class="py"&gt;ExecStop&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/local/lib/potluck/install-firewall-rules.sh --uninstall&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I tested it the blunt way. Run &lt;code&gt;--uninstall&lt;/code&gt;, confirm all three rules are missing, wait seven seconds, confirm they are back. The journal logged the reapply event. The mesh stayed usable after that.&lt;/p&gt;

&lt;p&gt;That is not the whole product fix. It is only the repair for this Linux VPN coexistence case.&lt;/p&gt;

&lt;p&gt;The product fix is diagnostics.&lt;/p&gt;

&lt;h2&gt;
  
  
  A health check needs to follow the work
&lt;/h2&gt;

&lt;p&gt;The lesson is not "add firewall rules."&lt;/p&gt;

&lt;p&gt;The lesson is that local AI needs data-plane health checks.&lt;/p&gt;

&lt;p&gt;If a system routes inference across machines, it should have a check that uses the same route as inference. Not just the same peer. Not just the same coordinator. The same path.&lt;/p&gt;

&lt;p&gt;For my setup, a real health check should answer separate questions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Is the local sidecar running?&lt;/li&gt;
&lt;li&gt;Is the coordinator reachable?&lt;/li&gt;
&lt;li&gt;Is the peer present in the roster?&lt;/li&gt;
&lt;li&gt;Is there a recent WireGuard handshake?&lt;/li&gt;
&lt;li&gt;Can this machine make an HTTP request to the peer sidecar over the mesh IP?&lt;/li&gt;
&lt;li&gt;Can the peer stream a small response from the model path?&lt;/li&gt;
&lt;/ol&gt;
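
&lt;p&gt;Those six questions can be walked in path order, stopping at the first one that fails. A minimal sketch, assuming each probe is a plain callable that returns true or false. The probes here are stand-ins, and the helper name is hypothetical:&lt;/p&gt;

```python
from typing import Callable, List, Optional, Tuple

# (failure label, check) pairs in the order a request crosses the layers.
Check = Tuple[str, Callable[[], bool]]


def first_failed_layer(checks: List[Check]) -> Optional[str]:
    """Run checks in path order and return the label of the first failure.

    Later checks are skipped once one fails, so a blocked tunnel is not
    also reported as a broken model runtime.
    """
    for label, check in checks:
        try:
            ok = check()
        except Exception:
            ok = False  # a probe that raises counts as a failure of its layer
        if not ok:
            return label
    return None  # every layer passed


# Stand-in probes; real ones would hit the sidecar, coordinator, and peer.
checks = [
    ("local sidecar not running", lambda: True),
    ("coordinator unreachable", lambda: True),
    ("peer not present", lambda: True),
    ("no tunnel", lambda: True),
    ("firewall likely blocking mesh traffic", lambda: False),
    ("model runtime not ready", lambda: True),
]

print(first_failed_layer(checks))  # firewall likely blocking mesh traffic
```

&lt;p&gt;The return value is a layer name, not a boolean, which is the whole point.&lt;/p&gt;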

&lt;p&gt;Those are different failures with different owners and different fixes.&lt;/p&gt;

&lt;p&gt;If the coordinator is down, restarting Mullvad will not help.&lt;/p&gt;

&lt;p&gt;If the peer is powered off, reapplying nftables rules will not help.&lt;/p&gt;

&lt;p&gt;If the WireGuard key in the coordinator is stale, reloading the model will not help.&lt;/p&gt;

&lt;p&gt;If the model runtime is missing CUDA libraries, the tunnel can be perfect and inference will still fail.&lt;/p&gt;

&lt;p&gt;If Mullvad dropped &lt;code&gt;potluck0&lt;/code&gt;, the peer can look enrolled and still be unusable.&lt;/p&gt;

&lt;p&gt;The UI should not compress all of that into "offline."&lt;/p&gt;

&lt;p&gt;It should say "coordinator unreachable," "peer not present," "no tunnel," "relayed," "firewall likely blocking mesh traffic," or "model runtime not ready." The exact labels matter less than the principle: name the layer that failed.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters more for local AI than normal SaaS
&lt;/h2&gt;

&lt;p&gt;In a normal SaaS product, most of the network path is owned by the operator. The user opens a browser. Your load balancer works or it does not. Your app servers work or they do not. There are still ugly edge cases, but the core path is under one operational umbrella.&lt;/p&gt;

&lt;p&gt;Local AI is different.&lt;/p&gt;

&lt;p&gt;The path crosses the user's laptop, their OS firewall, their VPN, their home router, a mesh tunnel, another machine's firewall, a model sidecar, a Python runtime, a GPU driver, and a model file on disk.&lt;/p&gt;

&lt;p&gt;The product does not get to pretend that is one boolean.&lt;/p&gt;

&lt;p&gt;This is especially true for "my machines" routing. The whole point is to make a user's idle hardware useful: Mac for the app, Linux box for GPU inference, Windows desktop for another model, maybe a mini PC in a closet. That is a better architecture for ownership and cost. It is also a worse architecture for lazy health checks.&lt;/p&gt;

&lt;p&gt;The user should not need to know nftables to understand why their peer is unavailable.&lt;/p&gt;

&lt;p&gt;The software should know enough to say: "Your peer is visible, but data-plane traffic over &lt;code&gt;potluck0&lt;/code&gt; is blocked. Reapply the scoped firewall rules or add a VPN killswitch exception for that interface."&lt;/p&gt;

&lt;p&gt;Even better, with consent, it should offer the fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I changed
&lt;/h2&gt;

&lt;p&gt;The immediate change was operational:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add a narrow, reversible firewall helper for Linux systems using Tailscale and Mullvad.&lt;/li&gt;
&lt;li&gt;Run it as a long-lived systemd watcher, not a boot-only one-shot.&lt;/li&gt;
&lt;li&gt;Keep the scope to &lt;code&gt;potluck0&lt;/code&gt; accepts. Do not flush rulesets. Do not weaken the broader VPN policy.&lt;/li&gt;
&lt;li&gt;Verify the real path with &lt;code&gt;curl&lt;/code&gt; to &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/peer/capabilities&lt;/code&gt;, then a prompt that actually runs on the remote machine.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The next change is product:&lt;/p&gt;

&lt;p&gt;Replace the single peer-status badge with a small diagnostics model. Local host, coordinator, relay, tunnel, peer data plane, model runtime. Each layer gets a named failure and a concrete fix.&lt;/p&gt;

&lt;p&gt;That is less elegant than a green dot.&lt;/p&gt;

&lt;p&gt;It is also more honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  The check I want
&lt;/h2&gt;

&lt;p&gt;The check I want is not expensive.&lt;/p&gt;

&lt;p&gt;Send a tiny HTTP probe to the peer over the mesh. Sometimes send a larger one to catch MTU and fragmentation problems. If the app is about to route inference, ask the peer for capabilities over the same path. If that passes, optionally send a tiny model-path probe before marking the peer usable for a real prompt.&lt;/p&gt;

&lt;p&gt;Cache the answer briefly. Debounce flaps. Suppress downstream errors when an upstream layer is already broken.&lt;/p&gt;
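
&lt;p&gt;Caching and debouncing fit in a few lines. This is a sketch under assumptions, not Potluck's implementation: the class name, the five-second cache window, and the three-failure threshold are all invented for illustration, and the probe is whatever data-plane check the caller supplies.&lt;/p&gt;

```python
import time
from typing import Callable, Optional


class DebouncedHealth:
    """Cache a probe result briefly; flip to unusable only after repeated failures."""

    def __init__(self, probe: Callable[[], bool], ttl: float = 5.0,
                 failures_to_trip: int = 3,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.probe = probe
        self.ttl = ttl
        self.failures_to_trip = failures_to_trip
        self.clock = clock
        self.usable = False  # a peer is not usable until a probe has passed
        self.failures = 0
        self.checked_at: Optional[float] = None

    def is_usable(self) -> bool:
        now = self.clock()
        if self.checked_at is not None and self.ttl > now - self.checked_at:
            return self.usable  # cached answer is still fresh
        self.checked_at = now
        if self.probe():
            self.failures = 0
            self.usable = True
        else:
            self.failures += 1
            if self.failures >= self.failures_to_trip:
                self.usable = False  # enough consecutive failures: stop routing here
        return self.usable


# Fake clock and scripted probe results, to show the behavior without a network.
now = [0.0]
results = iter([True, False, False, False])
health = DebouncedHealth(lambda: next(results), clock=lambda: now[0])

for t in (0, 1, 6, 12, 18):
    now[0] = float(t)
    print(t, health.is_usable())
# The peer only flips to unusable at t=18, after three failed probes in a row.
```

&lt;p&gt;Starting from "not usable" matters: a peer earns the green state by passing a data-plane probe, not by showing up in the roster.&lt;/p&gt;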

&lt;p&gt;But do not call the peer reachable just because a control-plane heartbeat exists.&lt;/p&gt;

&lt;p&gt;That is how I lost an afternoon.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would tell anyone building this
&lt;/h2&gt;

&lt;p&gt;If you are building local-first AI across machines, do not start with "peer online."&lt;/p&gt;

&lt;p&gt;Start with the path:&lt;/p&gt;

&lt;p&gt;Can the request leave this process?&lt;/p&gt;

&lt;p&gt;Can it cross the mesh?&lt;/p&gt;

&lt;p&gt;Can it reach the peer process?&lt;/p&gt;

&lt;p&gt;Can the peer reach the model runtime?&lt;/p&gt;

&lt;p&gt;Can one token come back?&lt;/p&gt;

&lt;p&gt;Everything else is metadata.&lt;/p&gt;

&lt;p&gt;The metadata is still useful. Heartbeats, handshakes, rosters, relay status, and capabilities all help narrow the search. But they are not proof that the system can do the work.&lt;/p&gt;

&lt;p&gt;A local AI mesh should not ask "is the peer online?"&lt;/p&gt;

&lt;p&gt;It should ask "can this prompt reach that model and stream a token back right now?"&lt;/p&gt;

&lt;p&gt;That is the health check that matters.&lt;/p&gt;




&lt;p&gt;Rob writes the &lt;em&gt;Local AI Engineering Notes&lt;/em&gt; series on strake.dev. He's building &lt;a href="https://trypotluck.ai" rel="noopener noreferrer"&gt;Potluck AI&lt;/a&gt;, a local-first AI system that routes inference across your own machines and trusted peers, and &lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake&lt;/a&gt;, a GitHub Action deploy gate.&lt;/p&gt;

</description>
      <category>localai</category>
      <category>wireguard</category>
      <category>networking</category>
      <category>debugging</category>
    </item>
    <item>
      <title>Keep the Credit Ledger Off-Chain. Checkpoint It On-Chain.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Fri, 05 Jun 2026 20:42:12 +0000</pubDate>
      <link>https://dev.to/newtorob/keep-the-credit-ledger-off-chain-checkpoint-it-on-chain-4acf</link>
      <guid>https://dev.to/newtorob/keep-the-credit-ledger-off-chain-checkpoint-it-on-chain-4acf</guid>
      <description>&lt;p&gt;I specced an on-chain credit ledger for a compute network, wrote the mint, and then the token program refused to initialize it. The two extensions I needed (non-transferable credits, plus a hook on every movement so the chain could run the accounting) are mutually exclusive on a single Solana mint. The runtime rejects the initialize instruction. That rejection sent me back to the design, and I came out the other side with the ledger off-chain and only a daily checkpoint on-chain.&lt;/p&gt;

&lt;p&gt;This post is why I think that is the right architecture for a usage-accounting ledger in general, not just the workaround for one runtime constraint. It is specific to compute-metering credits (earned for serving inference, spent for consuming it), and most of it generalizes to any high-frequency internal accounting system that someone reaches for a blockchain to secure.&lt;/p&gt;

&lt;h2&gt;
  
  
  What a credit ledger actually is
&lt;/h2&gt;

&lt;p&gt;Start with the workload, because the workload is what should pick the substrate.&lt;/p&gt;

&lt;p&gt;A usage credit ledger is a meter. It has two write paths: a contributor serves a job and earns credits, a user consumes a job and spends them. There is a third, smaller path for verification adjustments and refunds. Writes are append-heavy. Reads are dominated by one question: what is this account's balance right now. Each event carries a tiny value, a fraction of a cent of compute, and the events arrive at a rate set by network traffic rather than by anything financial.&lt;/p&gt;

&lt;p&gt;Put rough numbers on it. A network serving 10,000 inference requests a day generates at least 10,000 spend events, a comparable number of earn events, and some verification deltas on top. Call it 20,000 to 30,000 ledger writes a day at small scale. At real scale you multiply that by two or three orders of magnitude. This is the shape of a phone company's call-detail records or a cloud provider's billing meter. None of those run on a blockchain, and the reasons they don't are the reasons a credit ledger shouldn't either.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fee argument is the weakest one
&lt;/h2&gt;

&lt;p&gt;People assume the problem with an on-chain ledger is gas, so I want to deal with that first and set it aside, because on a cheap chain it mostly doesn't hold.&lt;/p&gt;

&lt;p&gt;Solana's base fee is 5,000 lamports per signature. Even with priority fees during contention you are paying fractions of a cent per transaction. Thirty thousand writes a day at those rates is a rounding error against any real infrastructure budget. If raw transaction cost were the only objection, I would have kept the ledger on-chain and moved on.&lt;/p&gt;
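
&lt;p&gt;To put a number on "rounding error", here is the base-fee arithmetic. It assumes one signature per ledger write and ignores priority fees. The SOL price is a placeholder, not a quote:&lt;/p&gt;

```python
LAMPORTS_PER_SOL = 1_000_000_000
BASE_FEE_LAMPORTS = 5_000  # per signature, as stated above


def daily_fee_sol(writes_per_day: int, signatures_per_write: int = 1) -> float:
    """Base-fee cost in SOL of putting every ledger write on-chain."""
    lamports = writes_per_day * signatures_per_write * BASE_FEE_LAMPORTS
    return lamports / LAMPORTS_PER_SOL


sol_per_day = daily_fee_sol(30_000)
print(sol_per_day)  # 0.15

# Hypothetical SOL price, only to turn the figure into dollars.
ASSUMED_SOL_USD = 150.0
print(round(sol_per_day * ASSUMED_SOL_USD, 2))  # 22.5
```

&lt;p&gt;Thirty thousand writes cost 0.15 SOL a day in base fees, whatever the exchange rate happens to be.&lt;/p&gt;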

&lt;p&gt;The real objections are three, and none of them is about fees: a protocol constraint, a migration trap, and a liveness coupling.&lt;/p&gt;

&lt;h2&gt;
  
  
  The protocol constraint
&lt;/h2&gt;

&lt;p&gt;This is the one that physically stopped me, so it is worth being precise.&lt;/p&gt;

&lt;p&gt;A usage credit should be non-transferable. You do not want an internal metering unit trading on a secondary market as a speculative asset, because the moment it does, the price of compute on your network starts moving with a token chart instead of with the cost of compute. Solana's Token-2022 program supports this directly through the NonTransferable extension.&lt;/p&gt;

&lt;p&gt;An on-chain ledger also means custom logic runs on every credit movement. In Token-2022 that is the TransferHook extension: a program you control that fires on each transfer and does your accounting.&lt;/p&gt;

&lt;p&gt;You cannot initialize a single mint with both. NonTransferable and TransferHook are mutually exclusive inside the Token-2022 program, and the initialize instruction fails if you try to combine them. The logic is consistent once you see it: a hook that fires on transfer is meaningless on a token that can never transfer. So you get one or the other. You can have non-transferable credits with no movement logic, or transferable credits with a hook. The combination an internal metering credit actually wants is the one the program will not let you build. That alone took the on-chain ledger off the table before I reached any economic argument.&lt;/p&gt;

&lt;h2&gt;
  
  
  The migration trap
&lt;/h2&gt;

&lt;p&gt;A live balance system on-chain is the hardest component in the whole network to change.&lt;/p&gt;

&lt;p&gt;Any program upgrade that touches how balances are represented is a migration of live, real-value state, with all the replay and ordering surface that implies. You cannot test it the way you test a service, because the thing you are mutating is everyone's money-equivalent at once. The scale precedent here is Helium's own L1-to-Solana move in April 2023: migrating a live token and balance system across chains is a multi-quarter, high-risk operation that a small team executes once and dreads.&lt;/p&gt;

&lt;p&gt;If the ledger lives on-chain, every schema change inherits a slice of that risk. If the ledger lives off-chain, a schema change is a database migration you can stage, dry-run against a copy, and roll back. I would rather own that problem in Postgres than in a program upgrade.&lt;/p&gt;

&lt;h2&gt;
  
  
  The liveness coupling
&lt;/h2&gt;

&lt;p&gt;If every debit is a transaction, the product's ability to settle a credit is coupled to the base layer's ability to land a transaction.&lt;/p&gt;

&lt;p&gt;Solana has had multi-hour degradation events. During one, an on-chain ledger cannot record a spend. So either the network stops settling credits until the chain recovers, or it queues the events somewhere off-chain and replays them later, at which point you have reinvented an off-chain ledger under worse conditions than if you had designed one on purpose. The dependency runs backwards. A meter should keep metering through an outage in something it does not control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Helium actually did
&lt;/h2&gt;

&lt;p&gt;The precedent worth copying is sitting right there, and most of the networks I look at miss it.&lt;/p&gt;

&lt;p&gt;Helium does not put its high-frequency accounting on-chain. Proof-of-Coverage and data-usage accounting run off-chain through oracles. Solana is the settlement and state anchor. Data Credits, the unit users burn to move bytes, are non-transferable and priced at a fixed $0.00001 each, and the per-event question of who covered what and who used how many bytes does not hit the chain one transaction at a time. When Helium moved its L1 onto Solana, it kept that split intact.&lt;/p&gt;

&lt;p&gt;The lesson I take from the closest structural analog to what I am building: the chain secures the anchor, and the accounting stays in a system built for accounting.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design: hash chain plus daily checkpoint
&lt;/h2&gt;

&lt;p&gt;Here is the architecture I landed on.&lt;/p&gt;

&lt;p&gt;The ledger is an append-only log in an ordinary database. Each entry includes the hash of the previous entry, which makes the log a hash chain. You cannot alter a past entry without rewriting every entry after it, and any such rewrite is detectable by recomputing the chain from any earlier known-good hash forward. This is the same construction a blockchain uses internally, applied to a private log.&lt;/p&gt;
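
&lt;p&gt;A hash chain over a log is small enough to show whole. This is a sketch, not the production ledger: it keeps entries in a Python list instead of a database, uses SHA-256 over canonical JSON, and the field names are invented for illustration.&lt;/p&gt;

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "payload": payload,
                "hash": entry_hash(prev_hash, payload)})


def first_broken_index(log: list):
    """Recompute the chain; return the index of the first bad entry, or None."""
    prev_hash = GENESIS
    for i, entry in enumerate(log):
        expected = entry_hash(prev_hash, entry["payload"])
        if entry["prev"] != prev_hash or entry["hash"] != expected:
            return i
        prev_hash = entry["hash"]
    return None


log = []
append(log, {"account": "alice", "delta": 5, "kind": "earn"})
append(log, {"account": "bob", "delta": -2, "kind": "spend"})
append(log, {"account": "alice", "delta": -1, "kind": "spend"})
print(first_broken_index(log))  # None

log[1]["payload"]["delta"] = -200  # tamper with a past entry
print(first_broken_index(log))  # 1
```

&lt;p&gt;To hide that edit, the tamperer would have to recompute the hash of entry 1 and of every entry after it, which is exactly the rewrite that a previously recorded hash exposes.&lt;/p&gt;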

&lt;p&gt;Once a day, you build a Merkle tree over the current balance set, take the single Merkle root, and publish that root on-chain through a tiny program whose only job is to store roots with timestamps. The program holds one 32-byte root per checkpoint, alongside its timestamp, and does nothing else.&lt;/p&gt;

&lt;p&gt;That on-chain root is the commitment. It records, with the chain's finality behind it, the exact state of the ledger at the checkpoint. Later, anyone can be handed an inclusion proof: here is your account entry, here is the Merkle path, here is the root the chain recorded that day. They verify the path against the on-chain root and confirm their balance without trusting the operator's word for it.&lt;/p&gt;
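
&lt;p&gt;The root and the inclusion proof can be sketched in plain Python. The leaf encoding, the domain-separation prefixes, and the rule of duplicating the last node on odd levels are assumptions chosen for the example, not a description of the deployed format:&lt;/p&gt;

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def leaf(account: str, balance: int) -> bytes:
    # 0x00 on leaves and 0x01 on inner nodes keeps the two from colliding.
    return h(b"\x00" + f"{account}:{balance}".encode("utf-8"))


def node(left: bytes, right: bytes) -> bytes:
    return h(b"\x01" + left + right)


def build_levels(leaves: list) -> list:
    """Return every level of the tree, leaves first, root level last."""
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        level = levels[-1]
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # duplicate the last node on odd levels
        levels.append([node(level[i], level[i + 1])
                       for i in range(0, len(level), 2)])
    return levels


def prove(levels: list, index: int) -> list:
    """Merkle path for one leaf: (sibling hash, sibling_is_right) pairs."""
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # odd level: the node was paired with itself
        path.append((level[sibling], index % 2 == 0))
        index //= 2
    return path


def verify(leaf_hash: bytes, path: list, root: bytes) -> bool:
    current = leaf_hash
    for sibling, sibling_is_right in path:
        current = node(current, sibling) if sibling_is_right else node(sibling, current)
    return current == root


balances = {"alice": 4, "bob": 12, "carol": 7}
accounts = sorted(balances)
levels = build_levels([leaf(a, balances[a]) for a in accounts])
root = levels[-1][0]  # the 32 bytes that would be published on-chain

path = prove(levels, accounts.index("bob"))
print(len(root))                              # 32
print(verify(leaf("bob", 12), path, root))    # True
print(verify(leaf("bob", 1200), path, root))  # False
```

&lt;p&gt;The verifier needs only their own entry, the path, and the published root. It never sees anyone else's balance.&lt;/p&gt;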

&lt;h2&gt;
  
  
  What the checkpoint buys
&lt;/h2&gt;

&lt;p&gt;Three things, mapping one-to-one onto the three objections.&lt;/p&gt;

&lt;p&gt;Auditability without per-event cost. The public root is a commitment the operator cannot quietly walk back. A member who suspects their balance was altered requests an inclusion proof and checks it against the chain themselves. You get the property people actually wanted from an on-chain ledger, which is that nobody can rewrite history in the dark, without paying to write every line of history on-chain.&lt;/p&gt;

&lt;p&gt;Lower migration risk. This is the one that changed my own roadmap. With an on-chain ledger, shipping the credit system meant launching a live balance migration, which made it the single highest-risk primitive in the build. With checkpoints, shipping means deploying a program that stores 32-byte roots. The risky work shrinks from migrating a live balance system to publishing a checkpoint. Those are different sizes of problem, and the second one I can ship in an afternoon and sleep after.&lt;/p&gt;

&lt;p&gt;Liveness tolerance. The checkpoint cadence is daily, so a multi-hour base-layer degradation delays a root and changes nothing a user sees. The ledger keeps recording the whole time. The next checkpoint catches up. The settlement path never touched the chain to begin with, so an outage in the chain cannot stall it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limits
&lt;/h2&gt;

&lt;p&gt;This design has real costs, and I would rather name them than have a reader find them.&lt;/p&gt;

&lt;p&gt;Between checkpoints, you trust the operator. The hash chain makes tampering detectable after the fact, but a dishonest operator can still serve a wrong balance for up to one checkpoint interval before the next root exposes the divergence, and detection requires someone to actually request and verify a proof. Shorter intervals shrink the window and cost more transactions. I think daily is right for a metering credit whose balances move in small increments, and I would shorten the interval before it became a real exposure.&lt;/p&gt;

&lt;p&gt;The hash chain proves integrity, not correctness. It proves the log was not altered. It says nothing about whether a given entry should have been written. Whether a contributor genuinely served the job they were credited for is a separate problem, solved by verification sampling, and no ledger structure substitutes for that.&lt;/p&gt;

&lt;p&gt;Availability of the off-chain log is your responsibility. The on-chain root is permanent. The log it commits to lives in your database, and if you lose the log, the root commits to something you can no longer reconstruct. So the off-chain store needs the durability discipline of any system of record: replication, backups, tested restores, the boring parts that decide whether the clever parts survive.&lt;/p&gt;

&lt;p&gt;And this approach assumes the credit never needs trustless real-time finality. If a unit genuinely has to be final on a public ledger the instant it moves, an exchange-traded asset for example, checkpoints are the wrong tool and you should pay the on-chain cost in full. The case where checkpoints fit is a usage credit that lives and dies inside the network and never trades.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for what I am building
&lt;/h2&gt;

&lt;p&gt;I am building Potluck, a peer-to-peer AI compute network, and the credit ledger is the layer that accounts for compute served and compute consumed. We run it off-chain as a hash chain and publish a daily Merkle root on-chain. I arrived here the long way, by speccing an on-chain ledger first, hitting the Token-2022 mutual-exclusivity wall, then the migration risk, then the liveness coupling. The off-chain-with-checkpoints design answered all three with one structure.&lt;/p&gt;

&lt;p&gt;The general claim I will stand behind: a usage-accounting ledger is a meter, and a meter wants a database with a cryptographic commitment over it, not a blockchain carrying every event. Put the trust anchor on-chain. Keep the meter where meters run.&lt;/p&gt;





</description>
      <category>depin</category>
      <category>solana</category>
      <category>crypto</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Why Decentralized AI Compute Needs Two Assets, Not One</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Thu, 04 Jun 2026 18:34:24 +0000</pubDate>
      <link>https://dev.to/newtorob/why-decentralized-ai-compute-needs-two-assets-not-one-2a5k</link>
      <guid>https://dev.to/newtorob/why-decentralized-ai-compute-needs-two-assets-not-one-2a5k</guid>
      <description>&lt;p&gt;Bittensor pays roughly eight dollars in TAO token emissions for every dollar of real AI revenue that flows through the network. The exact ratio fluctuates by quarter, but the shape is durable. Q1 2026: about $328 million in annual emissions against $43 million in real AI revenue. That is 7.6 to 1. It is what the crypto-skeptical press has called "extractive by default." It is also what the crypto-friendly analysts call "the subsidy treadmill."&lt;/p&gt;

&lt;p&gt;The Bittensor engineering team is sophisticated. The subnet validators run real ML evaluation. The miners serve real inference. The revenue is real. The emissions are also real.&lt;/p&gt;

&lt;p&gt;The cause is the token model itself. One asset is asked to do two jobs that do not belong together. I want to be specific about this part, because every other decentralized AI compute network I have looked at has the same problem, and the fix is well-known.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the token does
&lt;/h2&gt;

&lt;p&gt;A token in a decentralized AI compute network does two structurally distinct things.&lt;/p&gt;

&lt;p&gt;The first job is &lt;strong&gt;utility settlement&lt;/strong&gt;. Contributors run inference, and someone has to pay them for the compute work they did. The payment medium has to scale with usage, has to be denominated in something the contributor can spend on the network or convert to fiat, and has to remain stable enough that contributors can plan around it. This is a billing system.&lt;/p&gt;

&lt;p&gt;The second job is &lt;strong&gt;value capture&lt;/strong&gt;. Early supporters, investors, and contributors take risk to bootstrap a network that does not yet exist. They have to be paid back for that risk in a way that scales with the eventual success of the network. The payment medium has to be a speculative asset that appreciates as the network grows. This is an equity instrument.&lt;/p&gt;

&lt;p&gt;A billing system and an equity instrument want opposite things. A billing system that is also a speculative asset means that contributors who get paid in it cannot help but hold a speculative position. An equity instrument that is also a billing system means that token-price volatility shows up in the unit economics of the network itself.&lt;/p&gt;

&lt;p&gt;The single-token model conflates these two jobs. Bittensor's TAO is both. Akash's AKT is both. Render's RNDR is both. The conflation is what produces the subsidy treadmill.&lt;/p&gt;

&lt;h2&gt;
  
  
  The mechanism
&lt;/h2&gt;

&lt;p&gt;Walk through what happens in a single-token network at steady state.&lt;/p&gt;

&lt;p&gt;Real AI revenue flows in at rate R. Token emissions flow out at rate E. Contributors decide whether to keep running nodes based on whether (R + E)/N (where N is the number of active nodes) clears their economic threshold.&lt;/p&gt;

&lt;p&gt;In a healthy network, R rises as the network finds product-market fit, and E declines on a published glide-path. The two converge somewhere around year 5 or 7. At that convergence point, the network operates on real revenue and the equity holders capture the value that has accrued in the token.&lt;/p&gt;

&lt;p&gt;The problem is what happens before that convergence. Contributors are sensitive to the dollar value of their pay, and the token-denominated component (E) is large compared to the real revenue component (R). When the token price falls for any reason (macro selloff, competing subnet, ETF rejection, founder error), contributors leave. When contributors leave, the network's quality of service falls. When quality of service falls, real revenue falls. When real revenue falls, the token price falls further. The feedback loop is right there in the mechanism.&lt;/p&gt;
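
&lt;p&gt;To put rough numbers on that sensitivity, here is a minimal sketch using the Q1 2026 figures quoted at the top of this post. It assumes emissions are fixed in token terms, so their dollar value moves one-for-one with the token price, and that real revenue and node count do not move in the short run. The function names are illustrative.&lt;/p&gt;

```python
def emissions_to_revenue(revenue, emissions):
    """Dollars of token emissions paid per dollar of real revenue."""
    return emissions / revenue


def payout_drop(revenue, emissions, token_price_change):
    """Fractional fall in total contributor pay (R + E) after a token-price move.

    Assumes emissions are fixed in token terms, so their dollar value
    scales with the token price, while real revenue R is unchanged.
    With N held constant, the per-node payout (R + E)/N falls by the
    same fraction.
    """
    before = revenue + emissions
    after = revenue + emissions * (1 + token_price_change)
    return 1 - after / before


R = 43e6   # real AI revenue, USD per year (figure quoted in this post)
E = 328e6  # token emissions, USD per year (figure quoted in this post)

print(round(emissions_to_revenue(R, E), 1))
print(round(payout_drop(R, E, -0.50), 3))
```

&lt;p&gt;Under those assumptions, a 50% token-price drop removes about 44% of contributor pay, because emissions make up roughly 88% of what contributors receive.&lt;/p&gt;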

&lt;p&gt;The way out of the loop is more emissions, faster. That is the subsidy treadmill: emit faster to retain contributors, which dilutes the token, which weakens the contributor payoff, which requires more emissions. Bittensor's 7.6 to 1 ratio is the ratio of "what we have to emit to keep contributors here" to "what the network actually does." Most defenders call it a bootstrap phase. The math says it's the equilibrium that one-asset mechanism design produces.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two-asset alternative
&lt;/h2&gt;

&lt;p&gt;Separate the assets. Use a stable utility credit for the billing job. Use a tradeable governance token for the equity job. Let each asset do the job it is good at.&lt;/p&gt;

&lt;p&gt;The utility credit is denominated in compute time. One credit equals one normalized A100-minute equivalent. Contributors earn credits by serving inference. Users acquire credits by purchasing them with fiat at a rate set by the network, and spend them by consuming inference. The credit is non-transferable in v1, has no exchange listing, and has no speculative premium. It is a billing system and only a billing system.&lt;/p&gt;
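
&lt;p&gt;A toy version of that ledger, as a sketch rather than Potluck's implementation. The class name, method names, and the price per credit are all assumptions made for illustration. The one design point it tries to show is that there is no transfer operation at all.&lt;/p&gt;

```python
from collections import defaultdict


class CreditLedger:
    """Toy ledger: one credit = one normalized A100-minute."""

    def __init__(self, cents_per_credit):
        self.cents_per_credit = cents_per_credit  # rate set by the network
        self.balances = defaultdict(int)

    def earn(self, contributor, a100_minutes_served):
        """Contributors earn credits by serving inference."""
        self.balances[contributor] += a100_minutes_served

    def purchase(self, user, fiat_cents):
        """Users buy whole credits with fiat at the published rate."""
        credits = fiat_cents // self.cents_per_credit
        self.balances[user] += credits
        return credits

    def spend(self, account, a100_minutes):
        """Redeem credits for inference; refuse to overdraw."""
        if a100_minutes > self.balances[account]:
            raise ValueError("insufficient credits")
        self.balances[account] -= a100_minutes

    # Deliberately no transfer() method: credits are non-transferable in v1.


ledger = CreditLedger(cents_per_credit=5)  # hypothetical price
ledger.earn("gpu-node-1", 120)             # served 120 A100-minutes
ledger.purchase("alice", 1000)             # 1000 cents buys 200 credits
ledger.spend("alice", 50)                  # consumes 50 A100-minutes
print(ledger.balances["alice"], ledger.balances["gpu-node-1"])
```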

&lt;p&gt;The governance token is tradeable on a public market and exists for value capture. It is allocated to the team, early investors, the foundation treasury, the ecosystem fund, and a contributor airdrop based on cumulative credit earnings. It carries governance rights over protocol parameters, treasury allocation, and slashing policy. A percentage of public-pool fees buys back and burns the token, so token holders capture the upside of network adoption.&lt;/p&gt;

&lt;p&gt;The two assets are kept apart on purpose: utility on one side, value capture on the other. A contributor who wants to participate in the value-capture side can earn credits and then convert their cumulative credit history into a governance-token airdrop at a Phase 2 milestone. A contributor who does not want speculation exposure can earn credits, redeem them for inference services, and never touch a tradeable asset.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why credits are stable
&lt;/h2&gt;

&lt;p&gt;The credit's value floor is the compute it represents. One credit can be redeemed for one normalized A100-minute of inference at any time. That redemption right is what makes the credit stable.&lt;/p&gt;

&lt;p&gt;The redemption right is a use right, not an FX peg. A user who holds 1000 credits can run 1000 A100-minutes of inference. The credits do not need to trade against the dollar or against other crypto assets; they just need to clear inference requests at the rate the network publishes.&lt;/p&gt;

&lt;p&gt;This is the same shape as data credits in Helium's redesign (one credit = one IoT data packet). Helium v1 used a single-token model and produced the same subsidy treadmill we see in Bittensor today. The v5 redesign split utility (Data Credits) from value capture (HNT), and the unit economics stabilized. It was the standard fix that mechanism design produces when the original mechanism fails. Nothing clever about it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why governance gets its own token
&lt;/h2&gt;

&lt;p&gt;A governance token wants the opposite of stability. It wants to be a tradeable instrument that captures the value of the network as the network grows.&lt;/p&gt;

&lt;p&gt;A token whose only economic function is governance plus fee accrual is what the post-2024 DeFi consensus has converged on. MKR is the canonical example: it pays no one for operations; it just captures value through buybacks funded by DAI fees and votes on protocol parameters. The market prices it on expected future fee accrual. When the network grows, fees grow, buybacks grow, the token appreciates.&lt;/p&gt;

&lt;p&gt;That mechanism does not work if the same token is also paying contributors. Contributor payments dilute the token. Buybacks concentrate it. The two operations cancel out, and the token's price reflects the noise of the two flows rather than the signal of network growth.&lt;/p&gt;

&lt;p&gt;When the two operations are split across two assets, both can do their job cleanly. The credit pays contributors and absorbs all the operational dilution. The governance token captures value and absorbs all the buyback concentration. The signals stop fighting each other.&lt;/p&gt;

&lt;h2&gt;
  
  
  The sequencing argument
&lt;/h2&gt;

&lt;p&gt;The mesh ships before the token. There is no reason to launch a token into an empty network, because a token coordinates supply, and supply that does not yet exist cannot be coordinated.&lt;/p&gt;

&lt;p&gt;The substantive engineering work that has to land before a token makes sense is the network itself. Cross-machine inference routing has to be measured. Quality-scoring has to be calibrated. The credit ledger has to be running at small scale among trusted contributors. The user-side product has to actually serve real workloads.&lt;/p&gt;

&lt;p&gt;When those things are working, the credit ledger can scale to a wider contributor pool. When the contributor pool is real, the governance token has something to govern. When the governance token has something to govern, it can launch.&lt;/p&gt;

&lt;p&gt;The Bittensor failure mode is launching the token first and trying to bootstrap supply through emissions. The Petals failure mode is having no token at all and hoping altruism scales. The two-asset model with mesh-first sequencing avoids both. It says: ship the boring part, prove the boring part works, then add the financial instrument that makes the network ownable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the math looks like
&lt;/h2&gt;

&lt;p&gt;A two-asset network running at steady state has two ratios to track instead of one.&lt;/p&gt;

&lt;p&gt;The credit ratio is real-revenue-per-credit-redeemed. A network is healthy if users redeem credits at a stable rate against fiat-denominated inference cost. If credits are inflating against compute hours (one credit redeems for less inference over time), the credit issuance mechanism is broken.&lt;/p&gt;

&lt;p&gt;The governance ratio is fee-buyback per token emission. A healthy governance token has buybacks at or above token emissions over rolling 12-month windows. If buybacks fall meaningfully below emissions, the token is in dilution territory and the governance economics are not working.&lt;/p&gt;
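
&lt;p&gt;Both ratios are simple enough to compute from monthly totals. A minimal sketch, assuming buybacks and emissions are both reported in the same fiat unit; the function names and sample numbers are made up for illustration.&lt;/p&gt;

```python
def revenue_per_credit(fiat_revenue, credits_redeemed):
    """Credit ratio: real revenue per credit redeemed."""
    return fiat_revenue / credits_redeemed


def buyback_coverage(monthly_buybacks, monthly_emissions, window=12):
    """Governance ratio: buybacks over emissions for the trailing window."""
    return sum(monthly_buybacks[-window:]) / sum(monthly_emissions[-window:])


def governance_healthy(monthly_buybacks, monthly_emissions, window=12):
    """Healthy when trailing buybacks are at or above trailing emissions."""
    return buyback_coverage(monthly_buybacks, monthly_emissions, window) >= 1.0


# Illustrative numbers only, not data from any real network.
buybacks = [10] * 12
emissions = [8] * 12
print(buyback_coverage(buybacks, emissions))
print(governance_healthy(buybacks, emissions))
```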

&lt;p&gt;Bittensor's published numbers do not allow a clean version of these two ratios because the two operations are conflated. But the closest analog (token emissions versus real revenue) is the 7.6 to 1 ratio. The same ratio for a two-asset network at steady state should be approximately 1 to 1 (fees fund both buybacks and contributor incentives at parity), with credits decoupled from the governance-token price.&lt;/p&gt;

&lt;p&gt;A network designed this way can recover from a token-price drawdown without losing contributors. The contributors are paid in credits, and the credits redeem for the same compute services they always did, even if the governance token is down 70%. The governance token's price reflects market sentiment about the network's future fee accrual; it does not control the network's day-to-day unit economics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limits
&lt;/h2&gt;

&lt;p&gt;The two-asset model is not free. It adds engineering complexity (two accounting systems instead of one), adds legal complexity (the credit may or may not be a security depending on jurisdiction, and the governance token almost certainly is one in the US until decentralization thresholds are met), and adds adoption friction (contributors have to learn the difference between earning credits and earning governance-token allocations).&lt;/p&gt;

&lt;p&gt;The model also does not magically solve demand-side problems. If users do not want to buy credits, the credit is not a real billing system, regardless of how cleanly it is separated from the governance token. The two-asset model fixes the supply-side dynamics that the single-token model breaks. It does not fix the demand side. The demand side comes from building a product people actually want.&lt;/p&gt;

&lt;p&gt;The Helium precedent is instructive on this point. The two-asset migration stabilized Helium's unit economics, but Helium still had to find a real use case (mobile carrier service) before the network became economically self-sustaining. The original IoT use case had not materialized at the supply density the network had bootstrapped. A clean mechanism cannot rescue a bad market thesis.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for what I am building
&lt;/h2&gt;

&lt;p&gt;I am the founder of Potluck, which is a peer-to-peer AI compute network that has been measuring the boring parts (memory mesh latency, cross-machine retrieval, inference verification) for six months. The two-asset model is the token design we have committed to. It is documented in detail at trypotluck.ai if you want the longer version.&lt;/p&gt;

&lt;p&gt;I wrote this to make a specific argument I think is true about decentralized AI compute networks in general: the single-token model is a mechanism-design failure that no amount of execution can fix, and the fix has been understood since Helium v5 in 2023. Several networks are launching with the single-token model anyway.&lt;/p&gt;

&lt;p&gt;If you are building one of those networks, separate the assets before you launch. The math gets better the moment you do.&lt;/p&gt;




&lt;p&gt;Rob writes the &lt;em&gt;Local AI Engineering Notes&lt;/em&gt; series on strake.dev. He is also building &lt;a href="https://trypotluck.ai" rel="noopener noreferrer"&gt;Potluck AI&lt;/a&gt;, the peer-to-peer AI compute network referenced in this post, and &lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake&lt;/a&gt;, a GitHub Action deploy gate.&lt;/p&gt;

</description>
      <category>tokenomics</category>
      <category>depin</category>
      <category>crypto</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Brute-Force Retrieval Holds Through 5,000 Memories. Then It Doesn't.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 03 Jun 2026 16:52:56 +0000</pubDate>
      <link>https://dev.to/newtorob/brute-force-retrieval-holds-through-5000-memories-then-it-doesnt-5bm3</link>
      <guid>https://dev.to/newtorob/brute-force-retrieval-holds-through-5000-memories-then-it-doesnt-5bm3</guid>
      <description>&lt;p&gt;The last post I wrote measured how long it takes to query my Linux box's memory store from my Mac over a WireGuard mesh. The answer was about 20 milliseconds, plus or minus a few for Tailscale jitter, stable across store sizes from 10 to 500 entries. That ended with a sentence I should not have left there without testing: "the linear scan over a few hundred vectors is negligible at this scale."&lt;/p&gt;

&lt;p&gt;That is true at a few hundred. The honest follow-up question is the one I had skipped: at what scale does it stop being true?&lt;/p&gt;

&lt;p&gt;I knew the rough shape of the answer. The retriever is a brute-force cosine scan over every embedding in the local store. No approximate-nearest-neighbor index. No HNSW, no IVF, no FAISS. Just a loop. At 500 entries the loop is essentially free; the embedder cost dominates the call. Somewhere above 500 the loop starts to cost real milliseconds. The question is where.&lt;/p&gt;
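
&lt;p&gt;For readers who have not written one, this is roughly what a brute-force cosine scan looks like. It is a pure-Python sketch of the technique, not the retriever from the codebase; the function names and the tiny three-dimensional store are stand-ins, and it assumes no stored vector is all zeros.&lt;/p&gt;

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, store, k):
    """Score every stored vector against the query and keep the best k.

    `store` maps a memory id to its embedding. Cost is linear in
    len(store): there is no index, just a loop.
    """
    scored = [(cosine(query, vec), memory_id) for memory_id, vec in store.items()]
    scored.sort(reverse=True)
    return [memory_id for _, memory_id in scored[:k]]


store = {
    "m1": [1.0, 0.0, 0.0],
    "m2": [0.9, 0.1, 0.0],
    "m3": [0.0, 1.0, 0.0],
}
print(brute_force_top_k([1.0, 0.0, 0.0], store, k=2))
```

&lt;p&gt;Every query pays for one similarity computation per stored memory, which is why store size is the variable worth benchmarking.&lt;/p&gt;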

&lt;p&gt;I ran the bench again with a &lt;code&gt;--sizes&lt;/code&gt; flag I added that morning, and pointed it at 1,000, 5,000, and 10,000 synthetic memories on the Linux peer. Same script as before. Same Tailscale mesh. Warm embedder, 30 query samples per store size.&lt;/p&gt;

&lt;h2&gt;
  
  
  The measurement
&lt;/h2&gt;

&lt;p&gt;I ran each cell once. Cleanup between cells so the next size starts from zero. The Mac-local p50 baseline (14.8 ms, an essentially empty embedding-pass-through) is shown for reference.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Store size on peer&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;th&gt;min&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac local (baseline)&lt;/td&gt;
&lt;td&gt;14.8 ms&lt;/td&gt;
&lt;td&gt;18.3 ms&lt;/td&gt;
&lt;td&gt;7.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10 memories on peer&lt;/td&gt;
&lt;td&gt;35.9 ms&lt;/td&gt;
&lt;td&gt;40.7 ms&lt;/td&gt;
&lt;td&gt;18.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 memories on peer&lt;/td&gt;
&lt;td&gt;36.1 ms&lt;/td&gt;
&lt;td&gt;41.2 ms&lt;/td&gt;
&lt;td&gt;19.1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 memories on peer&lt;/td&gt;
&lt;td&gt;36.3 ms&lt;/td&gt;
&lt;td&gt;46.0 ms&lt;/td&gt;
&lt;td&gt;20.9 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1,000 memories on peer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;42.0 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;51.1 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;21.3 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5,000 memories on peer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40.6 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;116.5 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;32.0 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;10,000 memories on peer&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;57.6 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;131.9 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;47.2 ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three observations the table makes obvious.&lt;/p&gt;

&lt;p&gt;First, p50 is remarkably flat through 5,000. Forty-two milliseconds at 1K, forty-one milliseconds at 5K. The store size grew five times and the median round-trip did not rise. Even against the 100-entry store, fifty times smaller, the 5K median is only about four and a half milliseconds higher. The cosine scan is contributing essentially nothing to the median at these sizes; the embedder cost on the peer is still the dominant term in a typical query.&lt;/p&gt;

&lt;p&gt;Second, p95 starts to widen at 1K and breaks at 5K. Fifty-one milliseconds at 1K, then 116 milliseconds at 5K, then 132 at 10K. The 5K p95 more than doubled vs the 1K p95 even though the median barely moved. That is the linear scan showing up in the tail. The median samples are landing in a regime where the embedder dominates, but the slow samples are catching the scan doing real work.&lt;/p&gt;

&lt;p&gt;Third, p50 starts moving at 10K. Fifty-eight milliseconds, sixteen to seventeen above the 1K and 5K p50s. The scan is now contributing visibly to the median, not just the tail.&lt;/p&gt;

&lt;p&gt;The threshold is somewhere between 5K and 10K. p95 has already broken by 5K; p50 breaks by 10K. The "fine" regime ends in that window.&lt;/p&gt;
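
&lt;p&gt;A note on how thin a p95 is at 30 samples. The sketch below uses the nearest-rank percentile definition, which is an assumption; the bench script may interpolate instead. The latencies are placeholder values, not the bench data.&lt;/p&gt;

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# 30 stand-in latencies in ms (1, 2, ..., 30), not the bench data.
latencies = list(range(1, 31))
print(percentile(latencies, 50), percentile(latencies, 95))
```

&lt;p&gt;With 30 samples the nearest-rank p95 is simply the second-slowest sample, so one or two bad moments on the peer decide the whole tail number, while the p50 sits on the fifteenth sample and is far harder to move.&lt;/p&gt;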

&lt;p&gt;Full bench data and the script that produced it: &lt;a href="https://trypotluck.ai/benchmarks" rel="noopener noreferrer"&gt;trypotluck.ai/benchmarks&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What that means in practice
&lt;/h2&gt;

&lt;p&gt;Most people who would use a local AI memory store do not have 10,000 entries in it. I have been dogfooding the system for a few months and my own store has 280 memories in it as of writing this. A heavy daily user storing every decision, preference, and project fact would maybe reach 2,000 in a year. The 5K threshold is something most users will never hit. The 10K threshold is firmly in power-user territory.&lt;/p&gt;

&lt;p&gt;That is the result I wanted to be true, and it is. Brute-force cosine scan is the right default for personal AI memory at the scale most people will operate at. Adding an ANN index now would be a premature optimization that buys nothing for 95% of users and adds operational complexity for everyone.&lt;/p&gt;

&lt;p&gt;It is also useful to know exactly when ANN starts paying off. If a user reaches around 5,000 memories, p95 starts wobbling. If they reach around 10,000, p50 starts wobbling. The right time to add HNSW or IVF-PQ to this codebase is when I see real users hitting those sizes, not before. Until then the scan is the right answer and the engineering effort is better spent elsewhere.&lt;/p&gt;

&lt;h2&gt;
  
  
  A note on what is in the embedder budget
&lt;/h2&gt;

&lt;p&gt;p50 is roughly forty milliseconds through 5K. About fifteen of those are the WireGuard round-trip plus FastAPI middleware. Most of the rest is the embedder running on the peer. bge-small-en-v1.5 on a 2080 Ti via CUDA does a single query embedding in about twenty milliseconds when the model is warm. The cosine scan over 500 vectors of dimension 384 is about 0.2 milliseconds in numpy, which is below the precision of my measurement. The scan over 5,000 vectors is about 2 milliseconds, which is also below the noise floor of an HTTP-over-WireGuard probe. The scan over 10,000 vectors is about 4 milliseconds and that one is just barely visible in the median delta between 5K and 10K.&lt;/p&gt;

&lt;p&gt;What is visible in the 5K p95 is not the average scan cost. It is the worst-case scan cost colliding with a worst-case scheduling stall, a worst-case GC pause, a worst-case context switch on the peer. The tail samples are catching the system at its slowest. As the store grows, more of the scan's wall time happens during a bad moment, so the tail widens disproportionately to the median.&lt;/p&gt;

&lt;p&gt;The interesting engineering implication is that the first optimization that matters is not ANN. It is reducing the number of vectors the scan touches in the first place. Project-scoped filtering (only scan vectors tagged for the current project), confidence-threshold pruning (skip vectors below a minimum stored confidence), recency cutoff (skip vectors older than N days unless re-accessed) all reduce the scan size cheaply. Most realistic queries on a 10,000-vector store are not actually asking the retriever to consider all 10,000. They are asking it to consider the few hundred relevant to the current project, which puts the effective scan size right back in the "fine" regime.&lt;/p&gt;
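
&lt;p&gt;The three filters compose into a single pass that runs before any cosine math. A minimal sketch, assuming each memory carries a project tag, a stored confidence, and a last-accessed timestamp; the &lt;code&gt;Memory&lt;/code&gt; fields and the &lt;code&gt;prefilter&lt;/code&gt; signature are illustrative, not Potluck's schema.&lt;/p&gt;

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Memory:
    # Field names are illustrative, not Potluck's schema.
    memory_id: str
    project: str
    confidence: float
    last_accessed: datetime


def prefilter(memories, project, min_confidence, max_age, now):
    """Return only the memories the cosine scan should touch."""
    return [
        m for m in memories
        if m.project == project                 # project-scoped filtering
        and m.confidence >= min_confidence      # confidence-threshold pruning
        and max_age >= now - m.last_accessed    # recency cutoff
    ]


now = datetime(2026, 6, 3)
memories = [
    Memory("a", "potluck", 0.9, now - timedelta(days=2)),
    Memory("b", "potluck", 0.2, now - timedelta(days=2)),    # low confidence
    Memory("c", "potluck", 0.9, now - timedelta(days=400)),  # stale
    Memory("d", "strake", 0.9, now - timedelta(days=2)),     # other project
]
hot = prefilter(memories, "potluck", 0.5, timedelta(days=90), now)
print([m.memory_id for m in hot])
```

&lt;p&gt;Keying the recency check on last access rather than creation time is what lets an old memory stay in the hot set once it is re-accessed.&lt;/p&gt;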

&lt;p&gt;ANN indexing is the right move when even the project-scoped scan crosses 5K. That is a real product moment, but it is not now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limits of the measurement
&lt;/h2&gt;

&lt;p&gt;This was Linux peer only, with CUDA-accelerated embedding and a 2080 Ti. Mac local would look slightly different at scale because Metal embedder throughput is slightly higher than CUDA on this generation of GPU. Windows peer would look worse because the embedder runs on CPU there. I expect the same shape, the same threshold around 5K to 10K, slightly different absolute numbers. I have not yet measured the Mac local or Windows peer cases at 1K and above; that is the next bench.&lt;/p&gt;

&lt;p&gt;The synthetic memories the bench script populates are deliberately diverse but they are not real user memories. Real memories are more semantically clustered (you ask about kubernetes more than you ask about cassandra) which means the cosine scores will have a different distribution, which means the top-k selection will land in slightly different cache regimes. I would expect this to make the tail samples noisier in production than in the bench. The 5K and 10K p95s are probably underestimates of what a real user with a power-user store would see.&lt;/p&gt;

&lt;p&gt;I ran each size once. A statistically defensible characterization would run each size five to ten times and report distributions, not point estimates. The p50 numbers are stable enough run-to-run that I am confident in the 5K threshold within a few hundred memories either way. The p95 numbers have wider run-to-run variance, especially at 5K and 10K, so the exact p95 values are less trustworthy than the trend.&lt;/p&gt;

&lt;p&gt;These results are for retrieval latency only. They do not address what happens to retrieval quality as the store grows. A 10K-memory store has different recall characteristics than a 500-memory store because there is more competition for the top-k slots. That is a separate measurement and one I have not run yet. The right place for it is on the next pass.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would build next
&lt;/h2&gt;

&lt;p&gt;In order of what actually pays off for most users:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Project-scoped pre-filtering in the retriever, so the cosine scan only touches vectors with a matching project tag. Cheap to implement, immediately reduces effective scan size for anyone with more than one project.&lt;/li&gt;
&lt;li&gt;A tiny in-process LRU cache for embeddings of recently-asked queries. Same query within a session skips the embedder entirely. Embedder is the dominant cost at small sizes (about twenty milliseconds per query on the peer), so a 50% cache hit rate cuts mean latency by about ten milliseconds.&lt;/li&gt;
&lt;li&gt;Recency-based cold-storage tiering for memories older than some threshold and unaccessed for some other threshold. Keeps the hot scan small even as total store grows past 10K.&lt;/li&gt;
&lt;li&gt;ANN indexing (probably HNSW via hnswlib for the python bindings, or usearch for a smaller dependency) only when the hot scan size crosses 5K. That is the right moment, not before.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I will probably ship 1 this week, 2 next week, defer 3 until a user actually has the scale to need it, and defer 4 until item 3 is not enough. The order is from "fixes a thing I can measure today" to "fixes a thing I will not need to fix for months."&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changed
&lt;/h2&gt;

&lt;p&gt;The previous post about cross-machine memory query ended with a hand-wave. "Negligible at this scale" without an upper bound is not a measurement, it is a vibe. The fix was a bench script that already existed plus three CLI flags. The cost was thirty seconds of typing and fifteen minutes of waiting for the populate step to finish at 10K entries.&lt;/p&gt;

&lt;p&gt;The result is the same shape I expected with a sharper number than I expected. The cosine scan is fine through about 5,000 entries. By 10,000 it is starting to be a real cost. Most users will live their whole Potluck life inside the "fine" regime. For the ones that don't, the right next optimization is not ANN. It is the cheaper pre-filter step that puts most queries back in the fine regime even on a large store.&lt;/p&gt;

&lt;p&gt;The architecture was right. The threshold I was working off was wrong. The measurement tightened it from "negligible at this scale" to "negligible through five thousand, real cost past ten thousand," which is a more honest sentence to put in a benchmark caption.&lt;/p&gt;




&lt;p&gt;Rob writes the &lt;em&gt;Local AI Engineering Notes&lt;/em&gt; series on strake.dev. He's also building &lt;a href="https://trypotluck.ai" rel="noopener noreferrer"&gt;Potluck AI&lt;/a&gt;, the local-first AI memory system measured in this post, and &lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake&lt;/a&gt;, a GitHub Action deploy gate.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>benchmarks</category>
      <category>machinelearning</category>
      <category>retrieval</category>
    </item>
    <item>
      <title>Cross-Machine Memory Query: About 20 Milliseconds, Most Days</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 03 Jun 2026 14:26:30 +0000</pubDate>
      <link>https://dev.to/newtorob/cross-machine-memory-query-about-20-milliseconds-most-days-1a3d</link>
      <guid>https://dev.to/newtorob/cross-machine-memory-query-about-20-milliseconds-most-days-1a3d</guid>
      <description>&lt;p&gt;I wrote about hardware benchmarks twice this week. Different problem this time. Same machines.&lt;/p&gt;

&lt;p&gt;I have a Mac for daily work, a Linux box that runs a few media services and a GPU, and a Windows desktop I keep for gaming and AMD testing. They are all on the same Tailscale-managed WireGuard mesh. Each one runs a local memory store I use with my AI coding tools.&lt;/p&gt;

&lt;p&gt;The store is local-first by design. No vendor cloud. No memory-sharing API. When I move from the Mac to the Linux box for some weekend project, the context I built up on the Mac stays on the Mac, and the new context I build on Linux stays on Linux. That has always been a feature for me. Until last weekend, when I realized it was also a constraint I had built for myself.&lt;/p&gt;

&lt;p&gt;I wanted to query the Linux box's memory store from my Mac.&lt;/p&gt;

&lt;h2&gt;
  
  
  The simplest version of the problem
&lt;/h2&gt;

&lt;p&gt;The local sidecar that powers the memory store exposes an HTTP API. Locally, my agent hits &lt;code&gt;POST /memory/search&lt;/code&gt; and gets back the relevant memories. The sidecar binds to &lt;code&gt;0.0.0.0:8321&lt;/code&gt; so household-mesh peers can dial it. A middleware enforces that non-loopback callers can only reach &lt;code&gt;/peer/*&lt;/code&gt; and &lt;code&gt;/health&lt;/code&gt;. Everything else returns 403.&lt;/p&gt;
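
&lt;p&gt;The boundary rule reduces to a small pure function. This is a sketch of the decision only, not the actual middleware: the function names are made up, and &lt;code&gt;100.64.0.2&lt;/code&gt; is just an example mesh-style address.&lt;/p&gt;

```python
import ipaddress


def is_loopback(host):
    """True when the caller's address is a loopback IP such as 127.0.0.1 or ::1."""
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal: treat the caller as remote


def is_allowed(client_host, path):
    """Loopback callers reach everything; others only /peer/* and /health."""
    if is_loopback(client_host):
        return True
    return path == "/health" or path.startswith("/peer/")


for host, path in [
    ("127.0.0.1", "/memory/search"),
    ("100.64.0.2", "/memory/search"),
    ("100.64.0.2", "/peer/memory/search"),
    ("100.64.0.2", "/health"),
]:
    print(host, path, 200 if is_allowed(host, path) else 403)
```

&lt;p&gt;Only the second call is refused: a remote caller asking for a local-only route.&lt;/p&gt;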

&lt;p&gt;This is the right default. The local user's memories, tasks, chat history, and observability data should not be reachable from any other machine, including machines I own, until I explicitly opt that in.&lt;/p&gt;

&lt;p&gt;The result is a clean privacy boundary and exactly zero cross-machine memory features.&lt;/p&gt;

&lt;h2&gt;
  
  
  The design fork before any code
&lt;/h2&gt;

&lt;p&gt;Before I wrote a line of code I had to settle one thing. Three real options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A.&lt;/strong&gt; Memory stays strictly local. The cross-machine claim was an aspiration that does not match the actual product. Remove the claim, do not build the feature.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;B.&lt;/strong&gt; Own-fleet memory aggregation gated behind a per-machine opt-in environment variable. When the user sets it, household-mesh peers can query that machine's full memory store. Trust is mesh-IP reachability.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;C.&lt;/strong&gt; Per-project sharing flags. Memories tagged for shared projects are queryable; everything else stays strictly local. Cleaner privacy model, more code.&lt;/p&gt;

&lt;p&gt;A is the cleanest privacy story, but it walks back a published claim. C is the right long-run answer but it is at least three times the code to ship. B is the pragmatic prototype.&lt;/p&gt;

&lt;p&gt;I picked B with an opt-in default of off. The privacy posture stays correct for users who never opt in. The architecture stays correct for users who do.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the endpoint looks like
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/memory/search&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PeerMemorySearchResponse&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;peer_memory_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;PeerMemorySearchRequest&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;_peer_memory_enabled&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;503&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Peer memory search is opt-in and disabled on this machine. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Set POTLUCK_PEER_MEMORY_ENABLED=1 to enable household-fleet &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                   &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;memory aggregation.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;_memory_active&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;HTTPException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;403&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;detail&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;_DISABLED_DETAIL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;semantic_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;min_confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min_confidence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;include_pinned&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;PeerMemorySearchResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;memories&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;model_dump&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;count&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pairs&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the whole endpoint. Same retriever as the local call. Read-only. The env var gate reads the variable per request so the user can toggle without restarting the sidecar.&lt;/p&gt;
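
&lt;p&gt;For readers who want the gate spelled out, here is a minimal sketch of a per-request check. The helper name &lt;code&gt;peer_memory_enabled&lt;/code&gt; and the set of accepted truthy values are my assumptions for illustration; only the variable name comes from the error message above.&lt;/p&gt;

```python
import os

# Variable name taken from the error message above.
ENV_FLAG = "POTLUCK_PEER_MEMORY_ENABLED"


def peer_memory_enabled(environ=None):
    """Return True when the opt-in flag holds a truthy value.

    The mapping is read on every call instead of being cached at import
    time, so a change to the flag is seen by the next request.
    The accepted values ("1", "true", "yes") are an assumption.
    """
    env = os.environ if environ is None else environ
    return env.get(ENV_FLAG, "").strip().lower() in {"1", "true", "yes"}
```

&lt;p&gt;A handler would call this at the top of each request and raise its "disabled" error when it returns &lt;code&gt;False&lt;/code&gt;.&lt;/p&gt;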

&lt;p&gt;There is also a header comment block that documents the new trust model so the next person who reads the file does not have to guess at what is intentional.&lt;/p&gt;

&lt;h2&gt;
  
  
  The first measurement, and why I did not trust it
&lt;/h2&gt;

&lt;p&gt;Three machines on Tailscale. Each runs the sidecar bound to &lt;code&gt;0.0.0.0:8321&lt;/code&gt;. Opt-in env var set on each one. I issued 30 warm-embedder queries from the Mac to each peer's &lt;code&gt;/peer/memory/search&lt;/code&gt; and to my own local &lt;code&gt;/memory/search&lt;/code&gt; for comparison.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Mac local &lt;code&gt;/memory/search&lt;/code&gt; (baseline)&lt;/td&gt;
&lt;td&gt;14.8 ms&lt;/td&gt;
&lt;td&gt;18.3 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac to Linux peer &lt;code&gt;/peer/memory/search&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;29.0 ms&lt;/td&gt;
&lt;td&gt;37.8 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac to Windows peer &lt;code&gt;/peer/memory/search&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;40.8 ms&lt;/td&gt;
&lt;td&gt;53.5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Linux peer added 14 ms over local. The minimum cross-machine call was inside the local p95. The shape of the numbers was clean and the conclusion was easy. It feels instant.&lt;/p&gt;

&lt;p&gt;I had a draft post written around that 14 ms. I did not publish it.&lt;/p&gt;

&lt;p&gt;The numbers felt too generous. The probe was a single ad-hoc script against whatever store happened to be on the Linux box at the time, which was nearly empty. That is not a measurement. That is a vibe check.&lt;/p&gt;

&lt;p&gt;I wrote a real bench.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second measurement: the actual bench script
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;bench/run_peer_retrieval.py&lt;/code&gt; does two things.&lt;/p&gt;

&lt;p&gt;First, it populates the peer's store with a known set of 10 synthetic memories, queries them with semantically distinct natural-language questions, and verifies recall@1, recall@3, and recall@10. This catches silent breakage in the wire path: truncation, reordering, dropped fields. All three recall numbers should be 1.00 by construction. The point of the probe is the negative result: confirming the architecture introduces no silent corruption.&lt;/p&gt;
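
&lt;p&gt;The recall check itself is small. This is a sketch of the metric, not the bench script's actual code; the function names and the shape of &lt;code&gt;results&lt;/code&gt; (a list of ranked-id lists paired with the one expected id) are assumptions.&lt;/p&gt;

```python
def recall_at_k(ranked_ids, expected_id, k):
    """1.0 if the expected memory appears in the top k results, else 0.0."""
    return 1.0 if expected_id in ranked_ids[:k] else 0.0


def mean_recall(results, k):
    """Average recall@k over (ranked_ids, expected_id) pairs."""
    return sum(recall_at_k(ranked, expected, k) for ranked, expected in results) / len(results)


# Two queries: the first ranks its target first, the second ranks it second.
results = [(["m3", "m1"], "m3"), (["m2", "m7"], "m7")]
print(mean_recall(results, 1))  # 0.5
print(mean_recall(results, 3))  # 1.0
```

&lt;p&gt;A truncated or reordered response would show up as a recall below 1.00 at small k, which is exactly the failure the probe is meant to surface.&lt;/p&gt;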

&lt;p&gt;Second, it populates the peer at three store sizes (10, 100, 500 memories), then issues 30 warm-embedder queries against each. The point is to characterize how latency scales with the size of the embedding scan, which is the part of the system most likely to degrade gracelessly as memory stores grow.&lt;/p&gt;
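
&lt;p&gt;A sketch of the timing side, assuming a nearest-rank percentile. I am not claiming the bench script uses this exact definition, and &lt;code&gt;time_calls&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def time_calls(fn, n=30):
    """Call fn n times and return the per-call latency in milliseconds."""
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        fn()
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies
```

&lt;p&gt;Under this definition, with 30 samples the p95 is the 29th-smallest value, so it is set by the second-slowest request in the run. One or two slow round-trips are enough to move it a long way.&lt;/p&gt;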

&lt;p&gt;The populate step uses SSH local-forwarding from Mac to peer so the writes hit the peer's loopback-only &lt;code&gt;/memory&lt;/code&gt; endpoint, satisfying the &lt;code&gt;peer_access_middleware&lt;/code&gt; loopback check. Memory writes stay strictly local even during the bench. The query step uses the cross-machine &lt;code&gt;/peer/memory/search&lt;/code&gt; directly.&lt;/p&gt;

&lt;p&gt;I ran it the same evening as the first measurement. The numbers came back higher than the ad-hoc probe.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Store size on peer&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;10 memories&lt;/td&gt;
&lt;td&gt;41.8 ms&lt;/td&gt;
&lt;td&gt;52.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;100 memories&lt;/td&gt;
&lt;td&gt;39.5 ms&lt;/td&gt;
&lt;td&gt;56.6 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;500 memories&lt;/td&gt;
&lt;td&gt;45.0 ms&lt;/td&gt;
&lt;td&gt;85.2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Overhead vs Mac local jumped from 14 ms to roughly 27 ms. Correctness was 1.00 across the board.&lt;/p&gt;

&lt;p&gt;Two questions immediately. First: which number is right, 14 ms or 27 ms? Second: why did 500 memories show a p95 of 85 ms when 10 memories showed 52 ms? The linear-scan answer explains 500-vs-10 in principle, but only by a few milliseconds at this scale, not thirty.&lt;/p&gt;

&lt;h2&gt;
  
  
  The third measurement: the next morning
&lt;/h2&gt;

&lt;p&gt;I ran the same bench script the next morning. Twice in a row, about ten minutes apart.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;10 mem p50&lt;/th&gt;
&lt;th&gt;100 mem p50&lt;/th&gt;
&lt;th&gt;500 mem p50&lt;/th&gt;
&lt;th&gt;worst p95&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;First (evening)&lt;/td&gt;
&lt;td&gt;41.8 ms&lt;/td&gt;
&lt;td&gt;39.5 ms&lt;/td&gt;
&lt;td&gt;45.0 ms&lt;/td&gt;
&lt;td&gt;85.2 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Second (morning)&lt;/td&gt;
&lt;td&gt;35.9 ms&lt;/td&gt;
&lt;td&gt;36.1 ms&lt;/td&gt;
&lt;td&gt;36.3 ms&lt;/td&gt;
&lt;td&gt;46.0 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Third (morning, ten min later)&lt;/td&gt;
&lt;td&gt;34.7 ms&lt;/td&gt;
&lt;td&gt;34.9 ms&lt;/td&gt;
&lt;td&gt;41.3 ms&lt;/td&gt;
&lt;td&gt;50.4 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The two morning runs agree to within ~1.5 ms p50 at 10 and 100 memories and drift 5 ms apart at 500. The evening run was roughly 3 to 9 ms higher than either morning run at every store size. Same code. Same machines. Same Tailscale mesh.&lt;/p&gt;

&lt;p&gt;The variance is the Tailscale path. Tailscale prefers direct UDP between peers when the network conditions allow; if that fails, traffic relays through Tailscale's DERP servers, which adds a hop and a few milliseconds of geographic latency. Whether a given session lands direct vs DERP can flip based on residential ISP behavior, NAT state, and time of day. The 3 to 9 ms band in my morning-vs-evening numbers is what that flip looks like from the Mac's HTTP stopwatch. The evening's worst p95 (85 ms at 500 memories) is the same flip plus the long tail of a worse direct path.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I publish
&lt;/h2&gt;

&lt;p&gt;For a stable headline: cross-machine memory query adds about 20 milliseconds of overhead over local on Mac, plus or minus 5 milliseconds of Tailscale jitter, plus a few more milliseconds of p95 widening as the peer's memory store grows past a few hundred entries. The 14 ms number from the ad-hoc probe was the low end of that band. The 27 ms from the first bench was the high end. About 20 ms is the honest middle.&lt;/p&gt;

&lt;p&gt;The architectural conclusion does not depend on which number you pick. Twenty milliseconds is well inside the "feels instant" range for interactive coding-agent workflows. Even the worst measurement I have on file (85 ms p95 on a 500-memory store on a slow-Tailscale night) is shorter than most users would perceive as a pause.&lt;/p&gt;

&lt;h2&gt;
  
  
  What that 20 ms is made of
&lt;/h2&gt;

&lt;p&gt;Roughly: WireGuard tunnel round-trip plus HTTP request and response serialization plus FastAPI middleware plus the actual retriever call on the peer side. The WireGuard round-trip alone is around 9 to 12 ms when Tailscale lands a direct path, 14 to 18 ms when it relays through DERP. The retriever and serialization are the rest.&lt;/p&gt;

&lt;p&gt;This is meaningfully lower overhead than I expected when I started. I had been mentally budgeting cross-machine as a 50 to 100 ms operation that I would have to design around. At 20 to 30 ms most days, it just works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that surprised me along the way
&lt;/h2&gt;

&lt;p&gt;I started a sidecar on Linux from an SSH session via &lt;code&gt;nohup ... &amp;amp;&lt;/code&gt;. The process died as soon as the SSH session closed. SSH sessions over Tailscale's built-in SSH server do not behave like normal OpenSSH sessions. &lt;code&gt;setsid&lt;/code&gt; works. &lt;code&gt;nohup&lt;/code&gt; does not. Half an hour of debugging that I would rather have back.&lt;/p&gt;

&lt;p&gt;On Windows I tried to launch the sidecar by passing &lt;code&gt;set X=1 &amp;amp;&amp;amp; set Y=path &amp;amp;&amp;amp; python -m ...&lt;/code&gt; through a single &lt;code&gt;cmd.exe /c&lt;/code&gt; chain via WMI &lt;code&gt;Win32_Process.Create&lt;/code&gt;. The env vars did not survive. The fix was to write a &lt;code&gt;.bat&lt;/code&gt; wrapper file and invoke that. Cleaner. Reliable. Should have been my first move.&lt;/p&gt;

&lt;p&gt;The first peer query took 22 seconds. That was the sentence-transformers embedder lazy-loading on the peer the first time &lt;code&gt;/peer/memory/search&lt;/code&gt; was called. Subsequent calls were ~36 ms. Worth pre-warming after a sidecar restart.&lt;/p&gt;
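
&lt;p&gt;One way to pre-warm, sketched with a hypothetical &lt;code&gt;LazyEmbedder&lt;/code&gt; wrapper. The &lt;code&gt;loader&lt;/code&gt; argument stands in for whatever actually loads the sentence-transformers model; none of these names come from the sidecar.&lt;/p&gt;

```python
class LazyEmbedder:
    """Loads its model on first use; `loader` stands in for the slow load."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None

    def warm(self):
        """Force the load now so the first real query does not pay for it."""
        if self._model is None:
            self._model = self._loader()
        return self._model

    def embed(self, text):
        """Embed text, loading the model first if nobody warmed it."""
        return self.warm()(text)
```

&lt;p&gt;Calling &lt;code&gt;warm()&lt;/code&gt; from the sidecar's startup path moves the one-time load cost out of the first request.&lt;/p&gt;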

&lt;p&gt;The privacy guard middleware works exactly as designed. It returned 403 from &lt;code&gt;/memory/search&lt;/code&gt; to my Mac, returned 200 from &lt;code&gt;/peer/memory/search&lt;/code&gt; after I set the env var, and stayed 503 from peers where I had not opted in. No accidental data leaks during all my probing.&lt;/p&gt;

&lt;p&gt;And the headline number from my own first probe was 50% off from the rigorous measurement, which is why the rigorous measurement exists. The 14 ms number would have aged badly the first time a user with a real store ran the same probe and reported back the truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Honest limits of the prototype
&lt;/h2&gt;

&lt;p&gt;Opt-in is all-or-nothing per machine. If I set the env var on Linux, every peer in the mesh that can dial my Linux box's IP can query the whole memory store. There is no per-project sharing flag. The cleaner project-scoped sharing model is the obvious next step.&lt;/p&gt;

&lt;p&gt;Trust is mesh-IP reachability. Whoever can dial my mesh IP can call the peer endpoint if I have opted in. Signed-nonce challenge replacing mesh-IP-as-credential is the next hardening pass.&lt;/p&gt;

&lt;p&gt;There is no graceful federation. If I query my Mac and it forwards to Linux, I get Linux's results back. If I want Mac's local results merged with Linux's, I do that in the client. A peer-aware retriever that automatically aggregates across all opted-in peers is the next product step.&lt;/p&gt;
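
&lt;p&gt;For now the client-side merge can be as simple as interleaving the two ranked lists. This is a sketch under assumptions: each hit is a dict with an &lt;code&gt;"id"&lt;/code&gt; key (hypothetical; the real schema may differ), and because the endpoint shown earlier discards the second element of each pair, the merge relies on rank order alone.&lt;/p&gt;

```python
from itertools import zip_longest


def merge_ranked(local, peer, limit):
    """Interleave two ranked result lists, dropping duplicate ids.

    Takes the best remaining hit from each list in turn, so both machines
    are represented near the top. The "id" key is an assumed field name.
    """
    seen = set()
    merged = []
    for pair in zip_longest(local, peer):
        for hit in pair:
            if hit is None or hit["id"] in seen:
                continue
            seen.add(hit["id"])
            merged.append(hit)
    return merged[:limit]
```

&lt;p&gt;Rank interleaving is a deliberate simplification. A peer-aware retriever would more likely merge on a comparable similarity score.&lt;/p&gt;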

&lt;p&gt;The endpoint is read-only. Memory writes stay strictly local. That is deliberate; turning the cross-machine endpoint into a write path needs more careful trust modeling than I want to do in a prototype.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this changed
&lt;/h2&gt;

&lt;p&gt;Until last weekend, the Linux box was a peer that could serve me inference but not memory. The memory layer was strictly local. That was a clean privacy story but also a real product limit.&lt;/p&gt;

&lt;p&gt;After two hours of work, the same architecture has a new opt-in endpoint that lets me query any of my machines' memory stores from any other. The default privacy posture is unchanged. The published architecture invariants still hold. The only change is that the people who want this can have it, by opting in on the machines they want to participate.&lt;/p&gt;

&lt;p&gt;I measured it three times before I published a number because the first answer felt too clean. The truth turned out to be a band, not a point. About 20 ms most days. About 25 ms on slow-Tailscale nights. Scaling gracefully through 500 memories on the peer. That is a more honest claim than 14 ms, and it took two extra runs to learn it.&lt;/p&gt;

&lt;p&gt;The architecture was right. The trust boundary was already where it needed to be. The thing I was missing was an env var, one new path on the peer surface, and the discipline to measure twice before publishing once.&lt;/p&gt;




&lt;p&gt;Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>benchmarks</category>
      <category>machinelearning</category>
      <category>wireguard</category>
    </item>
    <item>
      <title>An AMD GPU Beat My Mac on Llama 8B. The Same GPU Lost on Phi-3.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Tue, 02 Jun 2026 18:28:57 +0000</pubDate>
      <link>https://dev.to/newtorob/an-amd-gpu-beat-my-mac-on-llama-8b-the-same-gpu-lost-on-phi-3-233c</link>
      <guid>https://dev.to/newtorob/an-amd-gpu-beat-my-mac-on-llama-8b-the-same-gpu-lost-on-phi-3-233c</guid>
      <description>&lt;p&gt;I wrote a post yesterday about why GPUs barely help small text embeddings at batch=1. Different workload, same machines. This time I ran a local LLM inference benchmark across the same three boxes. The result complicated my hardware mental model in a way I think is worth sharing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Three machines.&lt;/p&gt;

&lt;p&gt;A Mac M2 Pro with 16 GB of unified memory, running Metal through llama-cpp-python.&lt;/p&gt;

&lt;p&gt;A Linux desktop with an Intel 13700K, 62 GB of RAM, and an RTX 2080 Ti with 11 GB of VRAM. CUDA 13.&lt;/p&gt;

&lt;p&gt;A Windows desktop with an AMD 5800X, 64 GB of RAM, and an RX 6600 XT with 8 GB of VRAM. Vulkan through llama.cpp.&lt;/p&gt;

&lt;p&gt;Four models, all Q4_K_M quantization except the last. Phi-3 mini 3.8B. Qwen 2.5 7B. Llama 3.1 8B. Llama 3.1 70B at the more aggressive Q3_K_S as a stretch test.&lt;/p&gt;

&lt;p&gt;Ten-prompt suite, mixing short Q&amp;amp;A, code generation, summarization, and long context. Three runs per prompt. Median across the runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Generation tokens per second, overall median across the suite.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Mac M2 Pro (Metal)&lt;/th&gt;
&lt;th&gt;Linux 2080 Ti (CUDA)&lt;/th&gt;
&lt;th&gt;Windows 6600 XT (Vulkan)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3 mini 3.8B&lt;/td&gt;
&lt;td&gt;19.1&lt;/td&gt;
&lt;td&gt;59.9&lt;/td&gt;
&lt;td&gt;16.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 7B&lt;/td&gt;
&lt;td&gt;12.4&lt;/td&gt;
&lt;td&gt;43.0&lt;/td&gt;
&lt;td&gt;20.4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;11.6&lt;/td&gt;
&lt;td&gt;40.1&lt;/td&gt;
&lt;td&gt;20.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 70B Q3&lt;/td&gt;
&lt;td&gt;won't fit&lt;/td&gt;
&lt;td&gt;1.3&lt;/td&gt;
&lt;td&gt;won't fit&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 2080 Ti winning everything makes intuitive sense. The Mac-versus-AMD comparison is the part that surprised me.&lt;/p&gt;

&lt;h2&gt;
  
  
  The anomaly
&lt;/h2&gt;

&lt;p&gt;The RX 6600 XT is a roughly $200 used consumer GPU. It beats my Mac M2 Pro on Llama 3.1 8B by 80 percent. 20.9 tokens per second versus 11.6.&lt;/p&gt;

&lt;p&gt;The same RX 6600 XT loses to my Mac on Phi-3 mini. 16.4 versus 19.1. A 14 percent loss.&lt;/p&gt;

&lt;p&gt;Same hardware. Same benchmark harness. Same prompts. Opposite winner.&lt;/p&gt;

&lt;p&gt;The reflex answer is "noise." It is not noise. The numbers held up across three runs per cell and ten prompts per cell. They held up in the per-category breakdowns, the prompt-eval rates, and the time-to-first-token measurements. The Mac wins for small models and loses for medium models. That is the finding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this happens
&lt;/h2&gt;

&lt;p&gt;Phi-3 mini at Q4_K_M is about 2.2 GB of weights. That fits in the M2 Pro's cache hierarchy comfortably.&lt;/p&gt;

&lt;p&gt;Apple Silicon's unified memory architecture means there is no host-to-device transfer. The CPU and GPU share the same physical memory pool with the same bandwidth. There is no PCIe bus to cross. Dispatch overhead is the only fixed cost.&lt;/p&gt;

&lt;p&gt;The RX 6600 XT has more raw VRAM bandwidth than the M2 Pro's unified pool. About 256 GB/s versus 200. But for a 2.2 GB model running one token at a time, you cannot saturate that bandwidth. The compute work per dispatch is too small. The PCIe round-trip and the Vulkan driver overhead eat the win.&lt;/p&gt;

&lt;p&gt;For Qwen 7B and Llama 8B at Q4, the model is around 5 GB. That exceeds the M2 Pro's cache. The Mac is now memory-bandwidth-bound at the SoC level, sharing 200 GB/s between CPU and GPU. The discrete card is bandwidth-bound at the VRAM level, with 256 GB/s dedicated to the GPU alone. The discrete card wins.&lt;/p&gt;

&lt;p&gt;The threshold where this flips is roughly where the model exceeds the M2 Pro's effective cache. For Q4 quantization, that threshold lives somewhere between 3.8B and 7B parameters.&lt;/p&gt;
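
&lt;p&gt;The bandwidth argument can be put as a back-of-envelope ceiling: if every generated token has to stream all the weights once, tokens per second cannot exceed bandwidth divided by model size. The sketch below uses the bandwidth figures and the roughly 5 GB model size quoted above; the function name is mine, and the model ignores caches, KV reads, and compute.&lt;/p&gt;

```python
def max_tokens_per_second(bandwidth_gb_s, model_gb):
    """Upper bound when each token requires one full pass over the weights."""
    return bandwidth_gb_s / model_gb


# Bandwidth figures and the ~5 GB model size are the ones quoted in the text.
for name, bandwidth in (("M2 Pro", 200.0), ("RX 6600 XT", 256.0)):
    print(name, round(max_tokens_per_second(bandwidth, 5.0), 1))
# M2 Pro 40.0
# RX 6600 XT 51.2
```

&lt;p&gt;Both measured numbers (11.6 and 20.9 tokens per second) sit well under these ceilings, so the bound explains the direction of the result rather than its exact size.&lt;/p&gt;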

&lt;h2&gt;
  
  
  What this means if you are buying hardware
&lt;/h2&gt;

&lt;p&gt;The right question is not "which platform is faster for local AI." It is "which platform is faster for the model size I actually use."&lt;/p&gt;

&lt;p&gt;If your loop is small specialized models (routing classifiers, lightweight rerankers, sentence embedders), the Mac wins. Buy more unified memory.&lt;/p&gt;

&lt;p&gt;If your loop is 7B and 8B chat models, the midrange AMD card wins on price-per-token. Buy used.&lt;/p&gt;

&lt;p&gt;If your loop is 13B and larger, NVIDIA's mature CUDA dispatch and the higher-end VRAM widen the gap, but the gap is still roughly proportional to the cost.&lt;/p&gt;

&lt;p&gt;If your loop is 70B and above, none of this hardware is enough.&lt;/p&gt;

&lt;p&gt;The honest answer to "what should I buy for local AI" is "what is the model going to be."&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardware tier ceiling
&lt;/h2&gt;

&lt;p&gt;The 70B result is the most useful data point in this benchmark, because it stops being about which platform wins.&lt;/p&gt;

&lt;p&gt;Llama 3.1 70B at Q3 will not load on a 16 GB Mac. Will not load on an 8 GB AMD card. Runs at 1.3 tokens per second on the Linux box, using its 62 GB of system RAM with layers partially offloaded to the 2080 Ti. Time to first token is 3.8 seconds. Technically possible. Unusable for chat.&lt;/p&gt;

&lt;p&gt;Above that tier, you need a Mac Studio M3 Ultra with 512 GB of unified memory, or a 192 GB DDR5 workstation with a 24 GB GPU, or a multi-GPU rig. Those exist. They are not most developers' desks.&lt;/p&gt;

&lt;p&gt;DeepSeek V3 and R1 sit higher still. At Unsloth's most aggressive Q1.58 quant they need around 131 GB of unified memory or 192 GB of system RAM. People do run them on consumer hardware. Just not on the kind of consumer hardware most developers own.&lt;/p&gt;

&lt;p&gt;The "you need pooled compute" argument used to feel abstract to me. It does not anymore. There is a specific tier of model your current desk cannot run. Whichever model that is, that is where pooled compute starts to matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  The summary
&lt;/h2&gt;

&lt;p&gt;Hardware-versus-model-size matters more than vendor for local model inference. The Mac M2 Pro wins for small models that fit in cache. The discrete GPUs win once the model exceeds cache. The cheap AMD card is competitive with the more expensive NVIDIA card on price-per-token. None of this hardware runs 70B usably, and the larger 671B-class models need a hardware tier above any of it.&lt;/p&gt;

&lt;p&gt;There is no universal winner. The right hardware depends on which model is in your loop.&lt;/p&gt;

&lt;p&gt;If you are about to spend money on hardware for local AI, run the benchmark on the model you actually use before you commit. The vendor wars are not the answer.&lt;/p&gt;




&lt;p&gt;Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>benchmarks</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Your GPU Probably Isn't Helping Your Retrieval System</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Tue, 02 Jun 2026 16:28:07 +0000</pubDate>
      <link>https://dev.to/newtorob/your-gpu-probably-isnt-helping-your-retrieval-system-2c0n</link>
      <guid>https://dev.to/newtorob/your-gpu-probably-isnt-helping-your-retrieval-system-2c0n</guid>
      <description>&lt;p&gt;Most "just use a GPU" advice is wrong for how anyone actually runs small models.&lt;/p&gt;

&lt;p&gt;I spent yesterday benchmarking a 33M parameter embedding model across five hardware backends. The results were not what I expected.&lt;/p&gt;

&lt;h2&gt;
  
  
  The setup
&lt;/h2&gt;

&lt;p&gt;Model: BAAI/bge-small-en-v1.5. 33M params, 384-dim output. The workhorse small embedder a lot of retrieval systems use.&lt;/p&gt;

&lt;p&gt;Workload: LongMemEval oracle split, 500 instances, batch=1, single-query retrieval per call. Published academic benchmark, not a synthetic microbench. The query distribution is realistic.&lt;/p&gt;

&lt;p&gt;Three machines.&lt;/p&gt;

&lt;p&gt;A Mac M2 Pro with 16 GB of unified memory, running Metal via PyTorch MPS.&lt;/p&gt;

&lt;p&gt;A Linux desktop with an Intel 13700K, 62 GB of RAM, and an RTX 2080 Ti with 11 GB of VRAM. Running Ubuntu 22.04 and CUDA 13. I had to fight the driver to get there. More on that below.&lt;/p&gt;

&lt;p&gt;A Windows desktop with an AMD 5800X, 64 GB of RAM, and an RX 6600 XT with 8 GB of VRAM. Running Windows 11 with DirectML on top.&lt;/p&gt;

&lt;p&gt;Metric: p50 and p95 latency per embedding call, measured across 500 instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;p50 (ms)&lt;/th&gt;
&lt;th&gt;p95 (ms)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Metal&lt;/td&gt;
&lt;td&gt;M2 Pro (unified memory)&lt;/td&gt;
&lt;td&gt;10.6&lt;/td&gt;
&lt;td&gt;35.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CUDA 13&lt;/td&gt;
&lt;td&gt;RTX 2080 Ti&lt;/td&gt;
&lt;td&gt;17.8&lt;/td&gt;
&lt;td&gt;20.8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;Intel 13700K&lt;/td&gt;
&lt;td&gt;22.2&lt;/td&gt;
&lt;td&gt;34.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU&lt;/td&gt;
&lt;td&gt;AMD 5800X&lt;/td&gt;
&lt;td&gt;18.9&lt;/td&gt;
&lt;td&gt;21.7&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DirectML&lt;/td&gt;
&lt;td&gt;RX 6600 XT&lt;/td&gt;
&lt;td&gt;17.9&lt;/td&gt;
&lt;td&gt;22.0&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  What jumps out
&lt;/h2&gt;

&lt;p&gt;DirectML on the 6600 XT is statistically break-even with the AMD CPU on the same machine.&lt;/p&gt;

&lt;p&gt;The GPU "acceleration" did nothing.&lt;/p&gt;

&lt;p&gt;CUDA on the 2080 Ti wins, but only by about 20 percent on p50 and 40 percent on p95. That is a GPU costing five times what the 6600 XT does. The win is real but modest.&lt;/p&gt;

&lt;p&gt;Metal wins outright on p50, at 10.6 ms, although its 35.0 ms p95 is the widest tail in the table. Unified memory eliminates host-to-device transfer entirely. There is no copy to make. Only dispatch overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why
&lt;/h2&gt;

&lt;p&gt;At batch=1 with a 33M parameter model, you are dispatch-bound. Not compute-bound.&lt;/p&gt;

&lt;p&gt;The per-call cost of kernel dispatch plus host-to-device transfer roughly equals the cost of just running the forward pass on a modern CPU. The GPU never gets a large enough block of work to justify its setup overhead.&lt;/p&gt;

&lt;p&gt;This is the part of the "use a GPU" advice nobody mentions.&lt;/p&gt;

&lt;p&gt;It is correct for big models. It is correct for big batches. It is correct for long sequences.&lt;/p&gt;

&lt;p&gt;For small models running one query at a time, the math goes the other way.&lt;/p&gt;

&lt;p&gt;The threshold shifts based on three things.&lt;/p&gt;

&lt;p&gt;Model size: more params, more compute, more amortization of dispatch.&lt;/p&gt;

&lt;p&gt;Batch size: more parallel work, same overhead spread thinner.&lt;/p&gt;

&lt;p&gt;Sequence length: longer prompts, more matmul, same logic.&lt;/p&gt;

&lt;p&gt;If you are embedding 32 documents at once on a serious GPU, CUDA wins decisively. If you are embedding one query at a time on a midrange consumer card, you are paying GPU tax to do CPU-scale work.&lt;/p&gt;
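
&lt;p&gt;A toy model of the amortization, to make the batch-size effect concrete. The overhead and per-item numbers below are illustrative placeholders, not measurements from this benchmark, and &lt;code&gt;per_item_ms&lt;/code&gt; is a name I made up.&lt;/p&gt;

```python
def per_item_ms(overhead_ms, work_ms, batch):
    """Fixed per-call overhead spread across a batch, plus per-item work."""
    return overhead_ms / batch + work_ms


# Illustrative numbers only: 15 ms fixed cost per call, 0.5 ms of work per item.
print(round(per_item_ms(15.0, 0.5, 1), 2))   # 15.5
print(round(per_item_ms(15.0, 0.5, 32), 2))  # 0.97
```

&lt;p&gt;At batch=1 the fixed cost is nearly the whole bill. At batch=32 it is a fraction of it. The same shape applies to model size and sequence length, which grow the per-item work instead of shrinking the overhead share.&lt;/p&gt;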

&lt;h2&gt;
  
  
  A debugging story you will probably relate to
&lt;/h2&gt;

&lt;p&gt;On the Linux box, &lt;code&gt;torch.cuda.is_available()&lt;/code&gt; returned False on a working CUDA install.&lt;/p&gt;

&lt;p&gt;My first move was to blame the driver. It was on 535. Surely too old.&lt;/p&gt;

&lt;p&gt;I bumped it to 580 and the problem went away.&lt;/p&gt;

&lt;p&gt;The fix was correct. The diagnosis was not.&lt;/p&gt;

&lt;p&gt;The actual root cause: torch 2.12.0+cu130 ships the CUDA 13 runtime. The 535 driver only supported CUDA 12.2. PyTorch needs runtime and driver to be ABI compatible. They were not. CUDA reported unavailable.&lt;/p&gt;

&lt;p&gt;The driver was not "too old" in some general sense. It was specifically mismatched with my torch build tag.&lt;/p&gt;

&lt;p&gt;I spent an hour writing the wrong root cause into my notes before I checked &lt;code&gt;torch.version.cuda&lt;/code&gt; and saw the actual story.&lt;/p&gt;

&lt;p&gt;If you are debugging a "broken" CUDA install, here is the order to check.&lt;/p&gt;

&lt;p&gt;First, what CUDA version was your torch wheel built against? &lt;code&gt;torch.__version__&lt;/code&gt; tells you: a &lt;code&gt;cu130&lt;/code&gt; suffix means CUDA 13.&lt;/p&gt;

&lt;p&gt;Then, what does your driver expose? &lt;code&gt;nvidia-smi&lt;/code&gt; shows the max CUDA version it supports.&lt;/p&gt;

&lt;p&gt;Mismatched ABI is failure mode one. Outdated driver is failure mode two. They look identical until you check.&lt;/p&gt;
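
&lt;p&gt;The two checks can be compared mechanically. This sketch parses the build tag out of a version string and compares it with the driver's maximum CUDA version, which you would read off &lt;code&gt;nvidia-smi&lt;/code&gt; by hand. The helper names are hypothetical, and the full-version comparison is deliberately conservative: CUDA allows some minor-version compatibility within a major release, which this ignores.&lt;/p&gt;

```python
import re


def wheel_cuda_version(torch_version):
    """Parse the CUDA tag from a version string such as '2.12.0+cu130'."""
    match = re.search(r"\+cu(\d+)(\d)$", torch_version)
    if match is None:
        return None  # no CUDA tag, for example a CPU-only build
    return int(match.group(1)), int(match.group(2))


def driver_supports(wheel, driver_max):
    """True when the driver's max CUDA version is at least the wheel's.

    Both arguments are (major, minor) tuples; tuples compare element-wise,
    so the larger of the two equals driver_max exactly when it is sufficient.
    """
    return max(tuple(wheel), tuple(driver_max)) == tuple(driver_max)


wheel = wheel_cuda_version("2.12.0+cu130")
print(wheel)                            # (13, 0)
print(driver_supports(wheel, (12, 2)))  # False
```

&lt;p&gt;With the numbers from my case, a CUDA 13 wheel on a driver that tops out at CUDA 12.2, the check fails, which matches what &lt;code&gt;torch.cuda.is_available()&lt;/code&gt; was telling me.&lt;/p&gt;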

&lt;h2&gt;
  
  
  The benchmark methodology trap
&lt;/h2&gt;

&lt;p&gt;One more result caught me cleanly. Worth flagging.&lt;/p&gt;

&lt;p&gt;Total wall time across 500 instances went up when I enabled CUDA. From 35.6 seconds on CPU to 64.6 seconds on CUDA. Even though per-call latency dropped.&lt;/p&gt;

&lt;p&gt;The reason: my benchmark was re-initializing the model for every instance. GPU context init dominated the runtime. Per-call latency improved. Throughput regressed. CUDA looked slower in aggregate.&lt;/p&gt;

&lt;p&gt;In production, where the model is loaded once and serves many queries, this inverts entirely.&lt;/p&gt;

&lt;p&gt;If your benchmark structure does not match your production structure, you will measure the wrong thing. Cold-start cost is real. It lives outside the fast path.&lt;/p&gt;
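
&lt;p&gt;The structural difference, reduced to a sketch. &lt;code&gt;FakeModel&lt;/code&gt; is a stand-in that only counts how often setup runs; it is not my benchmark code, and the two function names are made up for the comparison.&lt;/p&gt;

```python
class FakeModel:
    """Stand-in for an embedder; counts how often setup runs."""

    loads = 0

    def __init__(self):
        # Represents the expensive part: context creation and weight load.
        FakeModel.loads += 1

    def embed(self, text):
        return [float(len(text))]


def bench_reload_each_time(queries):
    """The trap: a fresh model, and a fresh setup cost, for every query."""
    return [FakeModel().embed(q) for q in queries]


def bench_load_once(queries):
    """The production shape: load once, then serve every query."""
    model = FakeModel()
    return [model.embed(q) for q in queries]
```

&lt;p&gt;Both functions return the same embeddings. The first pays the setup cost once per query, the second once per run, and that is the difference that shows up in total wall time.&lt;/p&gt;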

&lt;h2&gt;
  
  
  The takeaway
&lt;/h2&gt;

&lt;p&gt;For small embedders at batch=1, Metal with unified memory is fastest at the median. CUDA is modestly better than CPU. DirectML is break-even with CPU.&lt;/p&gt;

&lt;p&gt;The throughput answer is honestly kind of boring.&lt;/p&gt;

&lt;p&gt;The reason is the part that generalizes. Dispatch overhead dominates at this scale.&lt;/p&gt;

&lt;p&gt;If you are reaching for a GPU because that is what you do with ML, measure first. There is a real chance your CPU is fine and you are optimizing the wrong thing.&lt;/p&gt;




&lt;p&gt;Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>benchmarks</category>
      <category>machinelearning</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Why Strake Is Free Right Now</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Mon, 01 Jun 2026 13:47:19 +0000</pubDate>
      <link>https://dev.to/newtorob/why-strake-is-free-right-now-1c1i</link>
      <guid>https://dev.to/newtorob/why-strake-is-free-right-now-1c1i</guid>
      <description>&lt;p&gt;Strake is free right now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xlirvk6xxw8af7s2p7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xlirvk6xxw8af7s2p7j.png" alt=" " width="800" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not because deploy safety has no value. Not because pricing does not matter. And not because I think free users magically turn into a business.&lt;/p&gt;

&lt;p&gt;It is free because I need real teams trying it on real pull requests.&lt;/p&gt;

&lt;p&gt;That is the only feedback that matters at this stage.&lt;/p&gt;

&lt;p&gt;I can make the site cleaner. I can rewrite the homepage. I can tune the scoring rules in a local demo until they feel right. None of that tells me what happens when Strake comments on an actual PR and an engineer has to decide whether the verdict helped.&lt;/p&gt;

&lt;p&gt;That is what I want to learn.&lt;/p&gt;

&lt;p&gt;Was the GO obvious?&lt;/p&gt;

&lt;p&gt;Was the HOLD annoying?&lt;/p&gt;

&lt;p&gt;Did the CRITICAL verdict catch something the team would have shipped anyway?&lt;/p&gt;

&lt;p&gt;Did the PR comment give enough context, or did the engineer still have to open PagerDuty, Datadog, Slack, GitHub, and a wiki tab to understand what was going on?&lt;/p&gt;

&lt;p&gt;Those are product questions, not pricing questions.&lt;/p&gt;

&lt;h2&gt;
  
  
  I made the product smaller
&lt;/h2&gt;

&lt;p&gt;Earlier versions of Strake were trying to explain too much at once.&lt;/p&gt;

&lt;p&gt;Incident workflow. Runbooks. Service health. Deploy history. Dependency changes. Operational memory.&lt;/p&gt;

&lt;p&gt;All of those pieces still matter. They are part of where Strake is going. But they were too much to lead with.&lt;/p&gt;

&lt;p&gt;The first useful thing is much simpler:&lt;/p&gt;

&lt;p&gt;Strake is a GitHub Action that tells you whether a deploy is riskier than it looks.&lt;/p&gt;

&lt;p&gt;It runs in the pull request. It reads production context. It posts a GO / HOLD / CRITICAL verdict where the team is already deciding whether to ship.&lt;/p&gt;

&lt;p&gt;That is the part I want people to try first.&lt;/p&gt;

&lt;p&gt;Not a dashboard.&lt;/p&gt;

&lt;p&gt;Not a new ceremony.&lt;/p&gt;

&lt;p&gt;Not another place engineers have to remember to check before a release.&lt;/p&gt;

&lt;p&gt;Open a PR. Run the gate check. Read the verdict. Decide whether to ship.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why I moved it to GitHub Actions
&lt;/h2&gt;

&lt;p&gt;The shift to GitHub Actions was not just an integration choice. It changed how I think about the product.&lt;/p&gt;

&lt;p&gt;Most deploy-safety workflows ask you to leave the place where you are making the deploy decision in order to inspect the deploy risk.&lt;/p&gt;

&lt;p&gt;That sounds small, but it is where the habit breaks.&lt;/p&gt;

&lt;p&gt;If the answer is in another dashboard, someone has to remember the dashboard exists.&lt;/p&gt;

&lt;p&gt;If the answer is in Slack, someone has to ask the right question in the right channel at the right time.&lt;/p&gt;

&lt;p&gt;If the answer is in a wiki, someone has to know what to search for.&lt;/p&gt;

&lt;p&gt;The PR is different.&lt;/p&gt;

&lt;p&gt;The PR is already where the decision is happening.&lt;/p&gt;

&lt;p&gt;That is where the gate belongs.&lt;/p&gt;

&lt;p&gt;A green build tells you the code passed the checks it was given. It does not tell you production is already fragile. It does not tell you an incident is open on the same service. It does not tell you the last deploy failed. It does not tell you the dependency tree moved in a weird way. It does not tell you the runbook is missing.&lt;/p&gt;

&lt;p&gt;Those are deploy-boundary questions.&lt;/p&gt;

&lt;p&gt;They should show up before the deploy goes out.&lt;/p&gt;
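
&lt;p&gt;To make those deploy-boundary questions concrete, here is a minimal Python sketch of how signals like these could roll up into a verdict. Everything in it is hypothetical: the &lt;code&gt;DeployContext&lt;/code&gt; fields, the &lt;code&gt;verdict&lt;/code&gt; function, and the rules themselves are illustrative assumptions, not a description of how Strake actually scores a deploy.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class DeployContext:
    """Hypothetical snapshot of production context at PR time."""
    open_incident_on_service: bool
    last_deploy_failed: bool
    unusual_dependency_change: bool
    runbook_missing: bool


def verdict(ctx: DeployContext) -> str:
    """Return GO, HOLD, or CRITICAL using made-up illustrative rules."""
    # Assumption: an open incident on the same service is the strongest signal.
    if ctx.open_incident_on_service:
        return "CRITICAL"
    # Assumption: any other single warning sign is enough to pause and look.
    warnings = [
        ctx.last_deploy_failed,
        ctx.unusual_dependency_change,
        ctx.runbook_missing,
    ]
    if any(warnings):
        return "HOLD"
    return "GO"
```

&lt;p&gt;The point of the sketch is the shape, not the thresholds: a green build is one input, and the questions above are the others.&lt;/p&gt;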

&lt;p&gt;GitHub Actions is boring in the right way. Teams already use it. It already runs on PRs. It already has a place to report status. Strake should meet the deploy decision there instead of asking teams to build a new habit from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why free is the right first step
&lt;/h2&gt;

&lt;p&gt;I do not want someone to ask, "Is Strake worth buying?" before they know whether Strake is worth using.&lt;/p&gt;

&lt;p&gt;That is backwards.&lt;/p&gt;

&lt;p&gt;The right first test is small:&lt;/p&gt;

&lt;p&gt;Pick one repo.&lt;/p&gt;

&lt;p&gt;Install the Action.&lt;/p&gt;

&lt;p&gt;Open a pull request.&lt;/p&gt;

&lt;p&gt;See whether the verdict is useful.&lt;/p&gt;

&lt;p&gt;If the gate is noisy, I want to hear that. If it misses context your team cares about, I want to hear that too. If it catches a deploy risk you would have otherwise waved through, that is the signal I am looking for.&lt;/p&gt;

&lt;p&gt;Charging too early would make the feedback worse.&lt;/p&gt;

&lt;p&gt;People get polite when money is involved. They start evaluating plan limits, seat counts, procurement, and whether the pricing model maps to their org chart. I do not need that yet.&lt;/p&gt;

&lt;p&gt;I need blunt product feedback from teams that actually ship production software.&lt;/p&gt;

&lt;p&gt;So the first version is free to start.&lt;/p&gt;

&lt;h2&gt;
  
  
  What free does not mean
&lt;/h2&gt;

&lt;p&gt;Free does not mean Strake is a toy.&lt;/p&gt;

&lt;p&gt;It means the product is intentionally narrow right now.&lt;/p&gt;

&lt;p&gt;The first question is:&lt;/p&gt;

&lt;p&gt;Can Strake make one deploy decision better?&lt;/p&gt;

&lt;p&gt;If the answer is no, a paid plan would not fix that.&lt;/p&gt;

&lt;p&gt;If the answer is yes, the next questions get more interesting. More repos. Longer signal history. Team controls. Support. Security review. Better runbook workflows. Whatever a real production team needs before this becomes part of how they ship.&lt;/p&gt;

&lt;p&gt;Those are good problems to earn later.&lt;/p&gt;

&lt;p&gt;Right now, I would rather have five teams try Strake on one real repo and tell me exactly where the verdict is wrong than have fifty people admire a polished demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  Who I want using it
&lt;/h2&gt;

&lt;p&gt;Strake is not for teams with a mature release engineering group and years of internal deploy tooling.&lt;/p&gt;

&lt;p&gt;It is for the teams in the middle.&lt;/p&gt;

&lt;p&gt;You have customers in production. You ship through GitHub. You have PagerDuty or Datadog or Slack alerts or some mix of all three. You have runbooks somewhere, but they are not always where the on-call engineer needs them. You do not have a full SRE bench to build internal guardrails from scratch.&lt;/p&gt;

&lt;p&gt;You still need to answer the same question before you push:&lt;/p&gt;

&lt;p&gt;Is this deploy riskier than it looks?&lt;/p&gt;

&lt;p&gt;That is what Strake is for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it on one repo
&lt;/h2&gt;

&lt;p&gt;The ask is intentionally small.&lt;/p&gt;

&lt;p&gt;Try Strake on one repo.&lt;/p&gt;

&lt;p&gt;Do not migrate your incident process. Do not rebuild your release workflow. Do not sit through a long demo if you would rather just see the thing work.&lt;/p&gt;

&lt;p&gt;Install the GitHub Action, open a pull request, and judge the verdict.&lt;/p&gt;

&lt;p&gt;If it is useful, keep going.&lt;/p&gt;

&lt;p&gt;If it is too noisy or too quiet, tell me where.&lt;/p&gt;

&lt;p&gt;That feedback is the whole reason Strake is free right now.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://strake.dev/signup" rel="noopener noreferrer"&gt;Try Strake on one repo.&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Rob is building Strake, a GitHub Action deploy gate for engineering teams that run production without a full SRE bench. Strake posts a GO / HOLD / CRITICAL verdict in the pull request using context from incidents, deploy history, dependency changes, service health, and runbooks.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>devops</category>
      <category>discuss</category>
    </item>
    <item>
      <title>You're Migrating Off Opsgenie. Here's What You Should Actually Fix.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Thu, 26 Mar 2026 14:35:11 +0000</pubDate>
      <link>https://dev.to/newtorob/youre-migrating-off-opsgenie-heres-what-you-should-actually-fix-5fln</link>
      <guid>https://dev.to/newtorob/youre-migrating-off-opsgenie-heres-what-you-should-actually-fix-5fln</guid>
      <description>&lt;h1&gt;
  
  
  You're Migrating Off Opsgenie. Here's What You Should Actually Fix.
&lt;/h1&gt;

&lt;p&gt;Opsgenie's end-of-support is April 2027. If you're on a small engineering team, you're probably mid-migration right now — comparing PagerDuty pricing tiers, reading incident.io vs. BetterStack threads, maybe resigning yourself to Jira Service Management because you're already deep in the Atlassian ecosystem.&lt;/p&gt;

&lt;p&gt;I want to suggest something uncomfortable before you pick your next tool: alerting was never your actual problem.&lt;/p&gt;

&lt;p&gt;I managed Opsgenie rotations at three different companies over the past eight years. FreightWaves, TextNow, Pilot Flying J. Different industries, different stacks, different team sizes. The pattern was always the same.&lt;/p&gt;

&lt;p&gt;Someone would deploy a change. Something would break. Opsgenie would page the on-call engineer. That engineer would open a Notion doc titled "Runbook — Service X" that hadn't been updated since 2022. They'd mostly ignore it and Slack the person who wrote the service. That person would fix it. Everyone would move on. Two weeks later, something similar would happen again.&lt;/p&gt;

&lt;p&gt;Opsgenie did its job perfectly. It routed the alert to the right person. The problem was everything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  The question nobody was asking
&lt;/h2&gt;

&lt;p&gt;At none of those companies — not one — did anyone ask the obvious question before deploying: is it safe to push right now?&lt;/p&gt;

&lt;p&gt;Not "did CI pass." Not "did someone approve the PR." I mean: is the system healthy enough to absorb a change right now? Are we burning through error budget? Is there already an active incident? Did someone just deploy 20 minutes ago and we haven't seen the impact yet?&lt;/p&gt;

&lt;p&gt;Nobody asked because there was no way to answer it. The information existed — scattered across Datadog, PagerDuty, GitHub, Slack — but nobody had assembled it into a single decision. So engineers deployed based on gut feel. "Seems fine." "I don't see anything in Slack." "The dashboards look okay I guess."&lt;/p&gt;

&lt;p&gt;43% of incidents are preceded by a recent deploy. That number didn't surprise me at all when I first saw it. It matched what I'd lived through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The runbook problem is worse than you think
&lt;/h2&gt;

&lt;p&gt;Here's the thing about the Opsgenie migration conversation that nobody is having: most teams using Opsgenie didn't just use it for alerting. It was their entire incident process. Alert comes in, Opsgenie pages someone, that person figures it out. There was no structure beyond that.&lt;/p&gt;

&lt;p&gt;The runbooks, if they existed, lived in Confluence or Notion. I wrote about this in &lt;a href="https://dev.to/newtorob/incident-management-for-teams-without-a-dedicated-sre-a-practical-guide-16cb"&gt;incident management without a dedicated SRE&lt;/a&gt;, and the core problem hasn't changed: a runbook that's three clicks away from the alert that triggered it is a runbook that doesn't get opened at 3am.&lt;/p&gt;

&lt;p&gt;I've seen this enough times to have a visceral reaction to it. The on-call engineer gets paged, opens Slack, asks "has anyone seen this before?" and waits. Meanwhile the customer is staring at a broken login page. The runbook that would have told them to check the config deployment and roll back the last change is sitting in a Confluence space that the engineer didn't even know existed.&lt;/p&gt;

&lt;p&gt;Teams that connect their runbooks directly to their alerts — so the runbook opens automatically when the relevant alert fires — cut their mean time to resolution from 67 minutes to 23. That's not a marginal improvement. That's the difference between an incident that costs you a customer and one that costs you 20 minutes.&lt;/p&gt;
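
&lt;p&gt;Connecting a runbook to an alert does not have to be elaborate. As a rough sketch, assuming alerts arrive as dictionaries with a &lt;code&gt;name&lt;/code&gt; field, a lookup table is enough to put the link in front of the on-call engineer. The &lt;code&gt;RUNBOOKS&lt;/code&gt; table, the alert names, and the &lt;code&gt;wiki.example.com&lt;/code&gt; URLs below are placeholders, not part of any real tool's API.&lt;/p&gt;

```python
# Hypothetical mapping from alert name to runbook link; names and URLs are placeholders.
RUNBOOKS = {
    "login-5xx": "https://wiki.example.com/runbooks/login-5xx",
    "db-connections": "https://wiki.example.com/runbooks/db-connections",
}


def attach_runbook(alert: dict) -> dict:
    """Return a copy of the alert with a `runbook_url` field added.

    The value is None when no runbook is registered, which is itself
    a useful signal that one still needs to be written.
    """
    enriched = dict(alert)
    enriched["runbook_url"] = RUNBOOKS.get(alert.get("name"))
    return enriched
```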

&lt;h2&gt;
  
  
  What "migrating off Opsgenie" should actually mean
&lt;/h2&gt;

&lt;p&gt;If you're going to rip out a core piece of your incident workflow, this is the moment to ask harder questions than "which alerting tool has the best Slack integration."&lt;/p&gt;

&lt;p&gt;The questions I'd ask:&lt;/p&gt;

&lt;p&gt;Do you know whether it's safe to deploy right now? Not in a gut-feel way. In a "here's your error budget status, here's your active incident count, here's your deploy velocity over the last 24 hours" way. If you don't have that, you're going to keep causing the incidents that your new shiny alerting tool routes to your team.&lt;/p&gt;

&lt;p&gt;When someone gets paged, do they know what to do? Not "figure it out." Actually know — because the runbook showed up in front of them automatically, with the steps they need and the context about what changed. If your runbooks are still in a wiki, your migration isn't going to fix the thing that actually hurts.&lt;/p&gt;

&lt;p&gt;Are you learning anything from your incidents? Not in a blameless-postmortem-Google-Doc way. I mean: does your system know that this service broke last Tuesday after a similar deploy? Does your deploy process incorporate the history of what's gone wrong before? Most teams I've worked with have zero institutional memory. Every incident is treated as a surprise, even when it's the third time it's happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't just swap your paging tool
&lt;/h2&gt;

&lt;p&gt;The Opsgenie shutdown is a forcing function. Use it.&lt;/p&gt;

&lt;p&gt;If you just swap Opsgenie for PagerDuty or BetterStack, you'll have the same problem in a different UI. Engineers deploying blind. Runbooks gathering dust. Your on-call rotation burning people out because every incident starts from scratch. I wrote about the monitoring version of this trap in &lt;a href="https://dev.to/newtorob/your-startup-doesnt-need-better-monitoring-it-needs-less-of-it-nmf"&gt;your startup doesn't need better monitoring&lt;/a&gt;: the tooling isn't the bottleneck. The process is.&lt;/p&gt;

&lt;p&gt;The actual fix is a layer that sits before your alerting tool. A deploy gate that tells your team whether it's safe to push. Connected runbooks that show up when things break. Incident data that compounds into institutional knowledge so your team stops relearning the same failure every quarter.&lt;/p&gt;

&lt;p&gt;That's what I'm building at Strake. It's in private beta right now and it works alongside whatever alerting tool you pick — PagerDuty, BetterStack, Grafana OnCall, whatever. The deploy gate and the runbook layer are the parts that were always missing, regardless of who was routing the page.&lt;/p&gt;

&lt;p&gt;If you're mid-migration and want to talk through how your team handles deploy safety, I'm happy to jump on a call. Not a pitch — I'm genuinely trying to learn from teams going through this right now.&lt;/p&gt;

&lt;p&gt;Rob is building Strake — a deploy gate and incident workflow platform for engineering teams without dedicated SRE coverage. If your current incident process is "someone posts in Slack and we figure it out from there," come take a look at strake.dev.&lt;/p&gt;

</description>
      <category>devops</category>
      <category>automation</category>
      <category>ai</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Incident Management for Teams Without a Dedicated SRE: A Practical Guide</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Mon, 23 Mar 2026 19:32:10 +0000</pubDate>
      <link>https://dev.to/newtorob/incident-management-for-teams-without-a-dedicated-sre-a-practical-guide-16cb</link>
      <guid>https://dev.to/newtorob/incident-management-for-teams-without-a-dedicated-sre-a-practical-guide-16cb</guid>
      <description>&lt;p&gt;Most incident management advice assumes you have a real SRE function already in place. Dedicated rotations, formal roles, long severity docs, postmortem templates with twelve sections. That advice is useful in the right environment. It just doesn't map especially well to a smaller team where the CTO, the senior backend engineer, and the person who shipped the last deploy are all effectively part of the incident process.&lt;/p&gt;

&lt;p&gt;If you're running with a lean engineering team and no dedicated SRE, the goal isn't sophistication. The goal is clarity. When something breaks, you want three things to be true: you notice quickly, the right person knows what to do next, and the team fixes the underlying issue often enough that the same incident doesn't keep resurfacing.&lt;/p&gt;

&lt;p&gt;That's the version of incident management that actually helps when your current process is still "someone posts in Slack and we figure it out from there."&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Actually Need (vs. What SRE Content Tells You You Need)
&lt;/h2&gt;

&lt;p&gt;For a small team, the list is shorter than people make it sound.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need to know something is broken before a customer tells you.&lt;/strong&gt; That means basic monitoring and alerting. Nothing fancy, just reliable enough that you are not learning about outages from support tickets. I wrote more about that in &lt;a href="https://strake.dev/blog/your-startup-doesnt-need-better-monitoring" rel="noopener noreferrer"&gt;your startup doesn't need better monitoring&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need a simple response path.&lt;/strong&gt; Who gets paged, what they check first, where the incident lives, and when they pull in help. That can fit on one page.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You need a lightweight habit of learning from incidents.&lt;/strong&gt; Not a heavy postmortem ceremony. Just enough follow-through that the same issue doesn't bite you for the fourth time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Everything else is secondary. SLOs, error budgets, review boards, chaos exercises, and the rest can be useful later. They are not the first thing standing between you and a workable incident process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Incident Response Process From Scratch
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Start with three severity levels, not five
&lt;/h3&gt;

&lt;p&gt;I've found that three severity levels are enough for most small teams. More than that usually creates debate without improving the response.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P1 — the product is down or a core workflow is broken for everyone.&lt;/strong&gt; Someone gets paged immediately. You stay on it until service is back, and if customers are clearly affected, you communicate early instead of waiting for a perfect explanation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P2 — something important is degraded, but the product still basically works.&lt;/strong&gt; Maybe a major feature is unstable, or a subset of users is having a bad time. This should get attention quickly, but it usually does not justify waking someone up overnight unless the business impact is unusually high.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P3 — something is wrong, but it can wait.&lt;/strong&gt; A background job is failing, a dashboard is stale, or a non-critical dependency is acting up. This becomes a ticket, not a page.&lt;/p&gt;

&lt;p&gt;The real value here is not the wording. It's the discipline behind it. A P1 should mean "wake someone up." A P3 should mean "nobody loses sleep." Once teams blur those lines, alert fatigue shows up fast.&lt;/p&gt;
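
&lt;p&gt;One way to keep those lines from blurring is to write the mapping down as code, so the decision is made once instead of at 3am. This is only a sketch: the &lt;code&gt;route&lt;/code&gt; function, the action names, and the &lt;code&gt;high_business_impact&lt;/code&gt; flag are assumptions made for illustration.&lt;/p&gt;

```python
from enum import Enum


class Severity(Enum):
    P1 = 1  # product down or a core workflow broken for everyone
    P2 = 2  # something important degraded, product still basically works
    P3 = 3  # something is wrong, but it can wait


def route(severity: Severity, high_business_impact: bool = False) -> str:
    """Map a severity to one action: 'page', 'notify', or 'ticket'."""
    if severity is Severity.P1:
        return "page"  # always wake someone up
    if severity is Severity.P2:
        # Assumption: a P2 pages only when the business impact is unusually high.
        return "page" if high_business_impact else "notify"
    return "ticket"  # a P3 never pages anyone
```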

&lt;h3&gt;
  
  
  Set up your on-call rotation
&lt;/h3&gt;

&lt;p&gt;Three engineers is the minimum rotation I've seen hold up for more than a few weeks. With two people, someone is on-call every other week and starts to dread the whole thing. With three, it's still not luxurious, but it's survivable.&lt;/p&gt;

&lt;p&gt;On tooling, this is one place where I would spend the money. Use PagerDuty or Opsgenie. Don't build a homemade paging system around Slack, calendars, and someone's phone settings. Alert routing at 3am is a solved problem, and solved problems are worth buying.&lt;/p&gt;

&lt;h3&gt;
  
  
  Create your war room protocol
&lt;/h3&gt;

&lt;p&gt;When a real incident starts, you need a predictable place for it to live.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The on-call engineer opens a Slack channel like &lt;code&gt;#inc-2026-03-23-api-errors&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;They post the current state right away, even if the update is just "seeing elevated 500s, investigating."&lt;/li&gt;
&lt;li&gt;If they are still stuck after 10-15 minutes, they pull in the person closest to the affected system.&lt;/li&gt;
&lt;li&gt;They keep posting short updates on a fixed rhythm. Fifteen minutes is usually enough.&lt;/li&gt;
&lt;li&gt;When the incident is over, they leave behind a short summary of what happened, what fixed it, and what follow-up work is needed.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That is usually enough. Small teams do not need to invent every formal incident role they have seen in enterprise playbooks. If three people are involved, one of them can keep the channel updated while working the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build your first 10 runbooks
&lt;/h3&gt;

&lt;p&gt;"Runbook" makes this sound heavier than it is. For a small team, a runbook is just a checklist for a known failure mode.&lt;/p&gt;

&lt;p&gt;Start with the ten things that have already hurt you. For most startups, the list looks roughly like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;API returning 5xx errors&lt;/li&gt;
&lt;li&gt;Database connection failures&lt;/li&gt;
&lt;li&gt;High response latency&lt;/li&gt;
&lt;li&gt;Background job queue backed up&lt;/li&gt;
&lt;li&gt;Third-party API dependency down&lt;/li&gt;
&lt;li&gt;SSL certificate expired (yes, this still happens)&lt;/li&gt;
&lt;li&gt;Disk full&lt;/li&gt;
&lt;li&gt;Deploy broke something&lt;/li&gt;
&lt;li&gt;DNS issues&lt;/li&gt;
&lt;li&gt;Authentication/login broken&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each runbook only needs three things: &lt;strong&gt;what to check first&lt;/strong&gt;, &lt;strong&gt;how to mitigate&lt;/strong&gt;, and &lt;strong&gt;who to pull in if the first pass doesn't work&lt;/strong&gt;. In practice that means a few dashboards, a few commands, a rollback or restart path, and a clear escalation point.&lt;/p&gt;

&lt;p&gt;The standard I like is simple: could a reasonably capable engineer follow this at 3am while half-awake and either stabilize the system or know exactly who to call next? If yes, the runbook is doing its job.&lt;/p&gt;
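
&lt;p&gt;If it helps to see the three parts as a structure, here is a small sketch. The &lt;code&gt;Runbook&lt;/code&gt; class and its field names are invented for this example; a wiki page with the same three sections works just as well.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class Runbook:
    """Hypothetical shape for a runbook: a checklist for one known failure mode."""
    failure_mode: str
    check_first: list[str] = field(default_factory=list)  # dashboards, commands
    mitigate: list[str] = field(default_factory=list)     # rollback or restart path
    escalate_to: str = ""                                  # who to pull in next

    def is_complete(self) -> bool:
        """True only when all three required parts are filled in."""
        return bool(self.check_first) and bool(self.mitigate) and bool(self.escalate_to)


disk_full = Runbook(
    failure_mode="Disk full",
    check_first=["Check disk usage on the affected host"],
    mitigate=["Clear old logs", "Restart the service if it crashed"],
    escalate_to="whoever owns the affected service",
)
```

&lt;p&gt;A check like &lt;code&gt;is_complete&lt;/code&gt; is a cheap way to spot runbooks that stop before the escalation step.&lt;/p&gt;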

&lt;h2&gt;
  
  
  The On-Call Rotation Reality for Small Teams
&lt;/h2&gt;

&lt;p&gt;On-call at a startup is never going to feel glamorous, and pretending otherwise usually makes it worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Compensation matters.&lt;/strong&gt; That can mean extra PTO, comp time, a monthly stipend, or some combination. The exact mechanism matters less than the signal that on-call work is real work. If someone gets dragged out of bed at 3am and is still expected to operate like nothing happened at 9am, resentment builds quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Set a sane page budget.&lt;/strong&gt; As a rule of thumb, outside-business-hours pages should be rare. If people are getting woken up multiple times a week, either the alerts are too noisy or the system is genuinely unstable. Both are fixable engineering problems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ease people into the rotation.&lt;/strong&gt; New engineers should shadow first, then serve as backup, then take primary. On-call is stressful enough without making someone learn your systems and your incident process at the same time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runbooks lower the emotional cost.&lt;/strong&gt; Most people can handle being paged occasionally. What really spikes the stress is waking up and feeling like there is no map. A decent runbook doesn't remove the pressure, but it changes the experience from "solve a mystery in the dark" to "work through a checklist and escalate if needed."&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Track and Why
&lt;/h2&gt;

&lt;p&gt;You do not need a massive reliability dashboard. Four numbers will tell you most of what you need to know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to detect (TTD).&lt;/strong&gt; How long does it take from breakage to awareness? If customers usually tell you first, your alerting is not doing its job.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time to resolve (TTR).&lt;/strong&gt; How long does it take from the first alert to a verified fix in production? This is the number that tells you whether incidents are annoying or truly expensive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incident frequency by service.&lt;/strong&gt; Which part of the system keeps paging you? That is where your reliability work should go first, even if another problem feels more interesting technically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeat incidents.&lt;/strong&gt; What keeps coming back? This one is painful, but useful. Recurring incidents usually mean you only treated the symptom last time.&lt;/p&gt;

&lt;p&gt;You can track all of this in a spreadsheet. The tool doesn't matter much. The habit does.&lt;/p&gt;
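
&lt;p&gt;If the spreadsheet ever becomes a script, the four numbers fall out of a handful of timestamps per incident. The sketch below assumes each incident records when it started, when it was detected, and when it was resolved. The &lt;code&gt;Incident&lt;/code&gt; fields and the &lt;code&gt;summarize&lt;/code&gt; function are made up for illustration, and "repeat" is defined here, as an assumption, as the same service and cause label appearing more than once.&lt;/p&gt;

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    service: str
    cause: str             # short label such as "bad-config"
    started_at: datetime   # when things actually broke
    detected_at: datetime  # first alert or first report
    resolved_at: datetime  # verified fix in production


def minutes(start: datetime, end: datetime) -> float:
    """Elapsed time between two timestamps, in minutes."""
    return (end - start).total_seconds() / 60


def summarize(incidents: list[Incident]) -> dict:
    """Compute the four numbers for a non-empty list of incidents."""
    causes = Counter((i.service, i.cause) for i in incidents)
    return {
        "mean_ttd_min": mean(minutes(i.started_at, i.detected_at) for i in incidents),
        "mean_ttr_min": mean(minutes(i.detected_at, i.resolved_at) for i in incidents),
        "by_service": dict(Counter(i.service for i in incidents)),
        "repeats": sorted(key for key, count in causes.items() if count > 1),
    }
```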

&lt;h2&gt;
  
  
  When to Hire a Dedicated SRE
&lt;/h2&gt;

&lt;p&gt;Usually later than you think.&lt;/p&gt;

&lt;p&gt;Here are the signals that it may actually be time to bring in dedicated SRE help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A big chunk of engineering time is disappearing into operational work.&lt;/strong&gt; If incident response, infra maintenance, deploy babysitting, and general firefighting are eating the team alive, the opportunity cost becomes real.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The on-call burden is consistently high.&lt;/strong&gt; If engineers are getting paged constantly and the causes are infra-heavy rather than straightforward application bugs, that is often a sign that reliability needs more dedicated ownership.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You now have contractual uptime expectations.&lt;/strong&gt; Once you are selling into larger customers with SLA language, uptime reporting, and incident expectations, someone needs to own that discipline full-time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The system has outgrown shared context.&lt;/strong&gt; When no one person can explain the major moving parts with confidence, the risk profile changes.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The team is large enough that coordination itself is becoming the problem.&lt;/strong&gt; At some point, the process needs an owner even if the tech stack is still manageable.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Until then, the more immediate win is usually better operational visibility. The on-call engineer should not have to bounce between PagerDuty, Slack, GitHub, cloud dashboards, and three monitoring tabs just to answer the basic question of "what changed, what is broken, and who owns it?"&lt;/p&gt;

&lt;p&gt;That's the problem we're focused on at Strake. Not replacing an SRE team, but giving smaller teams enough context to respond faster, understand what is failing, and stop relearning the same incident twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake is in beta and it's free to try.&lt;/a&gt;&lt;/strong&gt; If you're a small team managing incidents with Slack threads and tribal knowledge, come take a look.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob is building &lt;a href="https://strake.dev" rel="noopener noreferrer"&gt;Strake&lt;/a&gt; — an operational platform for startup founders that connects your tools, surfaces what needs your attention, and cuts the overhead of running a company before it buries you. Less time managing operations. More time building the thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If that's the problem you're living with, follow along or reach out on &lt;a href="https://x.com/strakedev" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>startup</category>
      <category>devops</category>
      <category>webdev</category>
      <category>programming</category>
    </item>
    <item>
      <title>Your Startup Doesn't Need Better Monitoring. It Needs Less of It.</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 18 Mar 2026 18:56:37 +0000</pubDate>
      <link>https://dev.to/newtorob/your-startup-doesnt-need-better-monitoring-it-needs-less-of-it-nmf</link>
      <guid>https://dev.to/newtorob/your-startup-doesnt-need-better-monitoring-it-needs-less-of-it-nmf</guid>
      <description>&lt;p&gt;I'm going to say something that will annoy every SRE who's ever given a conference talk: most of what they tell you about observability is wrong for your stage.&lt;/p&gt;

&lt;p&gt;Not wrong in general. Wrong for you. A founding team of six people shipping a B2B SaaS product does not have the same operational needs as Google. I know this sounds obvious written down. But I watch founders set up Datadog with 47 custom dashboards before they have 47 customers, and nobody's telling them to stop.&lt;/p&gt;

&lt;p&gt;I did exactly this. About two years into my first startup, I spent an entire weekend building what I genuinely believed was a world-class monitoring stack. Prometheus, Grafana, custom exporters, alert rules for CPU, memory, disk, network throughput, request latency at p50, p95, p99, p99.9 — the works. I felt like a real engineer. Professional. Prepared.&lt;/p&gt;

&lt;p&gt;Then I got paged at 3am on a Tuesday because CPU hit 80% on a box that was completely fine. The alert was technically correct. The threshold was just wrong. I silenced it, went back to sleep, got paged again at 4am for a memory warning that also didn't matter. By morning I'd silenced four alerts and missed the one email from a customer saying they couldn't log in.&lt;/p&gt;

&lt;p&gt;The login bug had nothing to do with CPU or memory. A config file got borked during a deploy. None of my beautiful dashboards caught it because I was monitoring infrastructure when I should have been monitoring the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Matters
&lt;/h2&gt;

&lt;p&gt;Here's what I think you need at the early stage. Not what the monitoring vendor's blog post says. Not what the "complete observability guide" on Medium recommends. What actually keeps your customers happy and lets you sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One: know if your thing is up.&lt;/strong&gt; That's it. A simple HTTP check against your most important endpoint, every 30 seconds. If it fails three times in a row, text yourself. I don't care if you use UptimeRobot, Pingdom, or a cron job that curls the endpoint; it doesn't matter. The fancy tool doesn't help if you're checking the wrong thing. Hit the endpoint your customers actually use. Not &lt;code&gt;/health&lt;/code&gt;. Not &lt;code&gt;/ping&lt;/code&gt;. The actual login page, or the API call that matters most. If that works, you're probably fine. If it doesn't, you need to know immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two: know if your customers are getting errors.&lt;/strong&gt; This means tracking your HTTP 5xx rate. You can do this in CloudWatch, in your application logs, in whatever. The point is: if more than, say, 1% of requests are returning server errors, something is wrong and you should look at it. During business hours. Not at 3am. Unless it's way above 1%, in which case yes, wake up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three: know if things are slow.&lt;/strong&gt; Response time matters, but you don't need seven percentile buckets. Track p95 latency. If your p95 is under 500ms for a typical API call, you're fine. If it's climbing, investigate when you're awake. If it suddenly spikes to 5 seconds, that's worth waking up for.&lt;/p&gt;

&lt;p&gt;That's the list. Three things. Everything else is noise at your stage.&lt;/p&gt;
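
&lt;p&gt;None of the three needs a vendor to compute. As a rough sketch, assuming you already collect check results, status codes, and latencies somewhere, the logic is a few lines each. The function names are invented for this example, and the p95 uses the simple nearest-rank method rather than whatever interpolation your monitoring tool applies.&lt;/p&gt;

```python
def should_alert_down(results: list[bool], threshold: int = 3) -> bool:
    """True when the most recent `threshold` checks all failed.

    `results` holds one boolean per check, oldest first (True = check passed).
    """
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)


def error_rate(status_codes: list[int]) -> float:
    """Fraction of responses that were HTTP 5xx (0.01 is the 1% rule of thumb)."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 600 > code >= 500)
    return errors / len(status_codes)


def p95(latencies_ms: list[float]) -> float:
    """95th percentile of a non-empty list, by the nearest-rank method."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]
```

&lt;p&gt;Feed these from whatever you already have, then compare against the rules of thumb above: three failed checks in a row, more than 1% server errors, and a p95 above roughly 500ms.&lt;/p&gt;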

&lt;h2&gt;
  
  
  The Alert Hygiene Nobody Talks About
&lt;/h2&gt;

&lt;p&gt;Here's a rule I wish someone had tattooed on my forearm before I started: every alert that wakes you up must require you to do something right now. Not "hmm, interesting." Not "I should look at this tomorrow." Right now, tonight, in your underwear, something needs to be done or it wasn't worth waking you up.&lt;/p&gt;

&lt;p&gt;If you get paged and the correct response is "I'll check this in the morning," that alert is broken. Downgrade it. Make it a Slack notification. Make it an email. Make it a dashboard you glance at with your coffee. But do not let it wake you up.&lt;/p&gt;

&lt;p&gt;I know this sounds aggressive. You're thinking "but what if I miss something?" You might. And that's okay. Because the alternative is alert fatigue, which is when you've been woken up by false alarms so many times that you start sleeping through the real ones. Alert fatigue has caused more outages than missing alerts ever have. I'd bet money on it.&lt;/p&gt;

&lt;p&gt;At one point I had 30+ alert rules configured. I was getting maybe 4-5 notifications a day. I started ignoring all of them. It took a customer emailing our support address (which was my personal Gmail) to tell me the payment flow had been broken for six hours. Six hours. While my monitoring stack was happily telling me that CPU utilization was nominal.&lt;/p&gt;

&lt;p&gt;I deleted 25 of those alerts in one commit. Kept five. Slept better. Caught more real problems. Go figure.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools Question
&lt;/h2&gt;

&lt;p&gt;People ask me what monitoring tools to use and I think the honest answer is: it barely matters, and spending a week evaluating tools is a week you didn't spend building your product.&lt;/p&gt;

&lt;p&gt;If you're on AWS, CloudWatch is already there and it's fine. The UI is ugly and the query language is annoying but it works. If you want something nicer, Grafana Cloud has a free tier that's generous enough for a small startup. If you have money to spend and want things to just work out of the box, Datadog is great — but you will be shocked by the bill once you grow. Their pricing model is designed to be cheap when you're small and extremely expensive when you're not. Just know what you're signing up for.&lt;/p&gt;

&lt;p&gt;The one tool I'd say is genuinely worth paying for early: an error tracking service. Sentry, Bugsnag, something like that. It catches unhandled exceptions in your application code, groups them, shows you the stack trace, tells you which deploy introduced it. This is the stuff that actually breaks your product for users, and application-level error tracking catches it way faster than infrastructure monitoring ever will.&lt;/p&gt;

&lt;h2&gt;
  
  
  What To Add Later (Not Now)
&lt;/h2&gt;

&lt;p&gt;When you have paying customers with SLAs, or when you've got 10+ services talking to each other, or when you're waking up more than twice a month for real incidents — that's when you start thinking about distributed tracing, log aggregation, SLOs, error budgets, and all the other stuff that makes the SRE Twitter crowd excited.&lt;/p&gt;

&lt;p&gt;Not before. I promise you, nobody churned because you didn't have distributed tracing. They churned because your app was down and you didn't notice for two hours because you were drowning in alerts about disk utilization.&lt;/p&gt;

&lt;p&gt;The bigger unlock at the early stage is getting all your operational context — deploys, errors, customer signals, team activity — in one place so you're not switching between eight tools to figure out what's happening. That's a different problem than monitoring, and it's the one that actually slows founders down.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Actual Hard Part
&lt;/h2&gt;

&lt;p&gt;The real operational skill at the early stage isn't monitoring. It's deploy discipline. Can you ship a change and roll it back in under five minutes if something goes wrong? Do you know what changed between "it was working" and "it's not working"? Can you look at your deploy history and your error rate on the same timeline?&lt;/p&gt;

&lt;p&gt;If you can do that, you can fix almost anything fast enough that your customers won't care. And at the early stage, fast recovery beats prevention every single time. You don't have the team or the time to prevent every problem. But you can damn sure get good at fixing them quickly.&lt;/p&gt;
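
&lt;p&gt;Putting deploys and errors on the same timeline doesn't need a product. As a rough sketch, assuming you keep a time-sorted list of deploys somewhere (the &lt;code&gt;last_deploy_before&lt;/code&gt; helper and the version labels are invented for the example), you can answer "what changed right before this broke?" with a binary search:&lt;/p&gt;

```python
from bisect import bisect_right
from datetime import datetime


def last_deploy_before(deploys, moment):
    """Return the most recent (time, label) deploy at or before moment.

    deploys must be sorted by time. Returns None if nothing was
    deployed before moment.
    """
    times = [deployed_at for deployed_at, _ in deploys]
    index = bisect_right(times, moment)
    if index == 0:
        return None
    return deploys[index - 1]


# Hypothetical deploy history.
deploys = [
    (datetime(2026, 6, 1, 9, 0), "v41"),
    (datetime(2026, 6, 1, 14, 30), "v42"),
    (datetime(2026, 6, 2, 11, 15), "v43"),
]

# Suppose the error rate started climbing at 14:40 on June 1.
spike = datetime(2026, 6, 1, 14, 40)
suspect = last_deploy_before(deploys, spike)
print(suspect[1] if suspect else "no deploy before the spike")
```

&lt;p&gt;The deploy it returns is your first suspect and your first rollback candidate. It won't always be the culprit, but it's the cheapest thing to rule out.&lt;/p&gt;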

&lt;p&gt;Build the smallest monitoring setup that tells you when customers are hurting. Delete everything else. Ship your product.&lt;/p&gt;

&lt;p&gt;The operational layer of a startup — knowing what's happening, what needs attention, what can wait — should take minutes a day, not hours. That's the problem worth solving.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Rob is building &lt;a href="https://strake.dev/" rel="noopener noreferrer"&gt;Strake&lt;/a&gt; — an operational platform for startup founders that connects your tools, surfaces what needs your attention, and cuts the overhead of running a company before it buries you. Less time managing operations. More time building the thing.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If that's the problem you're living with, follow along or reach out on &lt;a href="https://x.com/strakedev" rel="noopener noreferrer"&gt;X&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>startup</category>
      <category>ai</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>What are some of your favorite live coders?</title>
      <dc:creator>Rob</dc:creator>
      <pubDate>Wed, 23 Jan 2019 17:02:07 +0000</pubDate>
      <link>https://dev.to/newtorob/what-are-some-of-your-favorite-live-coders-3fm6</link>
      <guid>https://dev.to/newtorob/what-are-some-of-your-favorite-live-coders-3fm6</guid>
      <description>&lt;p&gt;Hey all,&lt;/p&gt;

&lt;p&gt;I am always looking for people to watch that code live. I love to see other people as they think through their problems and fix issues. Who are some of your favorite people to watch live code? &lt;/p&gt;

</description>
      <category>discuss</category>
    </item>
  </channel>
</rss>
