Rob

Posted on Jun 14 • Originally published at strake.dev

Local AI Needs Data-Plane Health Checks

#localai #wireguard #networking #debugging

The worst network bugs are the ones where every dashboard says green and the packet still dies.

That was my Sunday.

I have a Mac that I use as my daily machine and a Linux box called newtorob with a 2080 Ti in it. Potluck runs a local AI sidecar on each machine. The Mac can use its own model locally, or route a request to another machine in my household over a WireGuard mesh.

The product shape is simple:

Mac app -> local sidecar -> WireGuard mesh -> Linux sidecar -> model runtime -> streamed tokens back to the Mac.

This is the "my machines" path. No model API. No cloud inference. The coordinator handles roster and signaling metadata, but the prompt itself should go directly over the private mesh to my own hardware.

Everything looked connected. The Linux peer was enrolled. The coordinator knew about it. The mesh sidecar was running. The UI showed a peer. The machine had a model loaded.

Then I sent a prompt and got: no reachable peer.

The control plane was green. The data plane was dead.

The lie in "online"

Most peer health checks answer the wrong question.

A coordinator heartbeat proves the peer can talk to the coordinator.

A WebSocket connection proves the peer can keep one control connection open.

A WireGuard handshake proves two tunnel endpoints exchanged packets recently.

A capabilities response proves a process can report what it thinks it can serve.

None of those prove that an inference request can cross the exact path the product needs right now.

For a local AI mesh, the real question is not "is this peer online?"

The real question is:

Can this prompt reach that model and stream a token back right now?

That distinction matters because the failure modes sit between the layers. A peer can be present in the roster while the tunnel is broken. A tunnel can have a fresh handshake while HTTP over the tunnel fails. A model can be loaded while the process is unreachable from the other machine. A privacy VPN can silently drop traffic on an interface it does not recognize while every higher-level control check looks fine.

That last one was my bug.

The first wrong theory

My first theory was MTU.

That was not random. WireGuard-over-WireGuard paths are good at producing partial success. A small handshake packet can pass while larger data packets disappear. If path MTU discovery is broken, the tunnel looks alive and the application path dies. This is exactly the kind of problem where "connected" and "usable" diverge.

Tailscale and NetBird both default to conservative MTUs around 1280 for a reason. WireGuard adds overhead. Relays add overhead. Residential networks add weirdness. If you run a local mesh on top of another VPN, a 1420-byte default can turn into a packet shredder.

So I checked the mesh MTU.

It was already 1280.

That was a useful dead end. It ruled out the cleanest explanation and left the uglier one: the packet was not too large. It was not allowed.
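The overhead argument is easy to check with arithmetic. The sketch below assumes standard WireGuard encapsulation: an outer IP header, an 8-byte UDP header, and 32 bytes of WireGuard data header plus authentication tag. The function name and the outer MTU of 1420 are illustrative assumptions, not measurements from this network.

```python
# Per-packet cost of WireGuard encapsulation, in bytes.
WG_FRAMING = 32                      # 16-byte data header + 16-byte Poly1305 tag
UDP_HEADER = 8
IP_HEADER = {"ipv4": 20, "ipv6": 40}


def max_inner_mtu(outer_mtu: int, outer_ip: str = "ipv6") -> int:
    """Largest MTU a tunnel can use when its packets must fit in outer_mtu."""
    return outer_mtu - IP_HEADER[outer_ip] - UDP_HEADER - WG_FRAMING


# Plain 1500-byte underlay: this is where WireGuard's 1420 default comes from.
print(max_inner_mtu(1500))            # 1420

# A mesh tunnel nested inside another WireGuard VPN with MTU 1420 (assumed).
print(max_inner_mtu(1420))            # 1340
print(max_inner_mtu(1420, "ipv4"))    # 1360
```

Under those assumptions, a nested tunnel left at 1420 cannot fit inside a 1420-byte outer tunnel, while 1280 leaves headroom. That is why finding 1280 already configured was enough to move on from the MTU theory.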

The real cause

The Linux box runs Mullvad. The Mac also has Tailscale. Potluck uses a potluck0 WireGuard interface and mesh IPs in the 100.64.0.0/10 range.

That combination has two separate traps.

Tailscale treats 100.64.0.0/10 as its space. Its nftables rules can drop packets from that range when they arrive on a non-tailscale0 interface.

Mullvad's killswitch is stricter. It installs nftables chains with default-drop policy and allows traffic only through interfaces it trusts. potluck0 is not one of them.

From Mullvad's perspective, this is correct behavior. A privacy VPN killswitch should not let random interfaces become escape hatches.

From Potluck's perspective, this means my own mesh interface is blocked unless I add a narrow exception.

The fix was three scoped accept rules:

nft insert rule ip filter ts-input iifname potluck0 accept
nft insert rule inet mullvad input iifname potluck0 accept
nft insert rule inet mullvad output oifname potluck0 accept

Not a flush. Not a policy change. Not disabling the VPN firewall. Just a hole for the Potluck mesh interface.

After that, the Mac could hit:

curl http://100.64.0.7:8321/health
curl http://100.64.0.7:8321/peer/capabilities

Both returned 200. The prompt routed to the Linux box. The footer showed it ran on newtorob-a16.

That fixed the immediate problem.

Then it broke again.

Reconnects erase one-shot fixes

VPN clients rebuild firewall rules.

That sentence is obvious after you have been bitten by it once. It is not obvious when you are staring at a mesh that worked five minutes ago.

Mullvad, Proton, Nord, and similar clients do not treat nftables as a stable place where your hand-inserted rule gets to live forever. Reconnect the VPN, switch servers, wake from sleep, change networks, and the client may recreate its ruleset. Your narrow exception disappears. The killswitch keeps doing its job. Your mesh goes dark again.

My first fix was a boot-time one-shot. It installed the three accept rules when the machine started. That survives reboots. It does not survive VPN reconnects.

The better fix was a watcher.

Every few seconds it checks whether the accept rules that should exist still exist. If Tailscale or Mullvad is not present, the watcher owes that firewall nothing. If they are present and any of the three potluck0 rules is missing, it reruns the same idempotent insert path.

The loop is boring by design:

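# log, rules_intact, and do_install are helpers defined elsewhere in the script.
while true; do
    sleep "${POTLUCK_FW_WATCH_INTERVAL:-5}"   # poll interval in seconds, default 5
    if ! rules_intact; then
        log "accept rule(s) missing; reapplying"
        # do_install is idempotent; a failed reapply is retried on the next tick.
        do_install || log "reapply hit an error; will retry on next tick"
    fi
done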

The systemd unit is also boring:

[Service]
Type=exec
ExecStart=/usr/local/lib/potluck/install-firewall-rules.sh --watch
ExecStop=/usr/local/lib/potluck/install-firewall-rules.sh --uninstall
Restart=always

I tested it the blunt way. Run --uninstall, confirm all three rules are missing, wait seven seconds, confirm they are back. The journal logged the reapply event. The mesh stayed usable after that.
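The "are the rules still there?" check is the interesting part of the watcher. Here is a minimal Python sketch of that idea. It is not the shell script itself. It assumes that `nft list chain` prints each inserted rule as the literal text shown in EXPECTED (nft prints interface names in double quotes) and that a chain which fails to list simply is not installed. The names `EXPECTED`, `list_chain`, and `missing_rules` are made up for illustration.

```python
import subprocess

# (family, table, chain, rule text) for each accept rule the watcher maintains.
# The rule text is matched as a substring of the `nft list chain` output.
EXPECTED = [
    ("ip", "filter", "ts-input", 'iifname "potluck0" accept'),
    ("inet", "mullvad", "input", 'iifname "potluck0" accept'),
    ("inet", "mullvad", "output", 'oifname "potluck0" accept'),
]


def list_chain(family, table, chain):
    """Return the chain listing, or None if the chain does not exist."""
    result = subprocess.run(
        ["nft", "list", "chain", family, table, chain],
        capture_output=True, text=True,
    )
    return result.stdout if result.returncode == 0 else None


def missing_rules(lister=list_chain):
    """Rules that should exist but do not. Absent chains owe nothing."""
    missing = []
    for family, table, chain, rule in EXPECTED:
        listing = lister(family, table, chain)
        if listing is not None and rule not in listing:
            missing.append((family, table, chain, rule))
    return missing
```

Passing the lister in as a parameter keeps the comparison testable without root or a live firewall. An empty result means there is nothing to reapply.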

That is not the whole product fix. It is only the repair for this Linux VPN coexistence case.

The product fix is diagnostics.

A health check needs to follow the work

The lesson is not "add firewall rules."

The lesson is that local AI needs data-plane health checks.

If a system routes inference across machines, it should have a check that uses the same route as inference. Not just the same peer. Not just the same coordinator. The same path.

For my setup, a real health check should answer separate questions:

Is the local sidecar running?
Is the coordinator reachable?
Is the peer present in the roster?
Is there a recent WireGuard handshake?
Can this machine make an HTTP request to the peer sidecar over the mesh IP?
Can the peer stream a small response from the model path?

Those are different failures with different owners and different fixes.

If the coordinator is down, restarting Mullvad will not help.

If the peer is powered off, reapplying nftables rules will not help.

If the WireGuard key in the coordinator is stale, reloading the model will not help.

If the model runtime is missing CUDA libraries, the tunnel can be perfect and inference will still fail.

If Mullvad dropped potluck0, the peer can look enrolled and still be unusable.

The UI should not compress all of that into "offline."

It should say "coordinator unreachable," "peer not present," "no tunnel," "relayed," "firewall likely blocking mesh traffic," or "model runtime not ready." The exact labels matter less than the principle: name the layer that failed.
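One way to "name the layer that failed" is to run the checks in path order and stop at the first failure. The sketch below is a hedged illustration. The layer names and hints echo the labels above, the lambdas stand in for real probes, and `LayerResult`, `diagnose`, and `first_failure` are hypothetical names rather than Potluck's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class LayerResult:
    layer: str
    status: str                  # "ok", "failed", or "skipped"
    hint: Optional[str] = None   # shown to the user only for the failed layer


def diagnose(checks: List[Tuple[str, Callable[[], bool], str]]) -> List[LayerResult]:
    """Run checks in path order; after the first failure, skip the rest."""
    results = []
    broken = False
    for layer, check, hint in checks:
        if broken:
            results.append(LayerResult(layer, "skipped"))
            continue
        try:
            ok = bool(check())
        except Exception:
            ok = False           # a crashing probe counts as a failed layer
        if ok:
            results.append(LayerResult(layer, "ok"))
        else:
            results.append(LayerResult(layer, "failed", hint))
            broken = True
    return results


def first_failure(results: List[LayerResult]) -> Optional[LayerResult]:
    return next((r for r in results if r.status == "failed"), None)


# Stubbed probes reproducing the bug in this post: control plane green, data plane dead.
checks = [
    ("local sidecar", lambda: True, "local sidecar not running"),
    ("coordinator", lambda: True, "coordinator unreachable"),
    ("roster", lambda: True, "peer not present"),
    ("wireguard handshake", lambda: True, "no tunnel"),
    ("mesh http", lambda: False, "firewall likely blocking mesh traffic"),
    ("model path", lambda: True, "model runtime not ready"),
]
results = diagnose(checks)
failure = first_failure(results)
print(failure.layer, "->", failure.hint)   # mesh http -> firewall likely blocking mesh traffic
```

Skipping everything after the first failure keeps the report to one actionable line. The ordering is itself an assumption: a product that can still use an established tunnel while the coordinator is down would want a less strict dependency chain.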

Why this matters more for local AI than normal SaaS

In a normal SaaS product, most of the network path is owned by the operator. The user opens a browser. Your load balancer works or it does not. Your app servers work or they do not. There are still ugly edge cases, but the core path is under one operational umbrella.

Local AI is different.

The path crosses the user's laptop, their OS firewall, their VPN, their home router, a mesh tunnel, another machine's firewall, a model sidecar, a Python runtime, a GPU driver, and a model file on disk.

The product does not get to pretend that is one boolean.

This is especially true for "my machines" routing. The whole point is to make a user's idle hardware useful: Mac for the app, Linux box for GPU inference, Windows desktop for another model, maybe a mini PC in a closet. That is a better architecture for ownership and cost. It is also a worse architecture for lazy health checks.

The user should not need to know nftables to understand why their peer is unavailable.

The software should know enough to say: "Your peer is visible, but data-plane traffic over potluck0 is blocked. Reapply the scoped firewall rules or add a VPN killswitch exception for potluck0."

Even better, with consent, it should offer the fix.

What I changed

The immediate change was operational:

Add a narrow, reversible firewall helper for Linux systems using Tailscale and Mullvad.
Run it as a long-lived systemd watcher, not a boot-only one-shot.
Keep the scope to potluck0 accepts. Do not flush rulesets. Do not weaken the broader VPN policy.
Verify the real path with curl to /health and /peer/capabilities, then a prompt that actually runs on the remote machine.

The next change is product:

Replace the single peer-status badge with a small diagnostics model. Local host, coordinator, relay, tunnel, peer data plane, model runtime. Each layer gets a named failure and a concrete fix.

That is less elegant than a green dot.

It is also more honest.

The check I want

The check I want is not expensive.

Send a tiny HTTP probe to the peer over the mesh. Sometimes send a larger one to catch MTU and fragmentation problems. If the app is about to route inference, ask the peer for capabilities over the same path. If that passes, optionally send a tiny model-path probe before marking the peer usable for a real prompt.
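A probe like that fits in a few lines of standard-library Python. This is a sketch under stated assumptions, not the product's implementation. The small probe is a plain GET. The padded variant turns into a POST, which assumes the peer exposes some endpoint that tolerates a request body. The post does not name one, so the caller has to supply it.

```python
import http.client
import urllib.request

# Ignore any system HTTP proxy: the probe must travel the mesh path itself.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def http_probe(url: str, padding_bytes: int = 0, timeout: float = 2.0) -> bool:
    """True if url answers with a 2xx status within timeout seconds.

    With padding_bytes > 0 the probe becomes a POST carrying that many
    bytes, so the request no longer fits in a single small packet.
    """
    body = b"x" * padding_bytes if padding_bytes else None
    request = urllib.request.Request(url, data=body)
    try:
        with _OPENER.open(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # Refused, timed out, reset, or an HTTP error status
        # (urllib's HTTPError is an OSError subclass).
        return False
```

For the setup in this post, the calls would be http_probe("http://100.64.0.7:8321/health") followed by the same call against /peer/capabilities. Those are the two URLs that curl verified earlier. A padded call with a few thousand bytes is the occasional larger probe.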

Cache the answer briefly. Debounce flaps. Suppress downstream errors when an upstream layer is already broken.
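Caching and debouncing can share one small wrapper. The sketch below is illustrative only. The class name, the 5-second TTL, and the two-in-a-row threshold are assumptions picked for the example. It starts pessimistic, so a peer is not called healthy until real probes have succeeded.

```python
import time


class DebouncedHealth:
    """Cache a probe result briefly; flip state only after N agreeing results."""

    def __init__(self, probe, ttl=5.0, threshold=2, clock=time.monotonic):
        self.probe = probe            # callable returning True/False
        self.ttl = ttl                # seconds a cached answer stays fresh
        self.threshold = threshold    # consecutive contrary results needed to flip
        self.clock = clock
        self.healthy = False          # unproven peers are not reachable
        self._streak = 0              # consecutive results contradicting self.healthy
        self._checked_at = None

    def check(self) -> bool:
        now = self.clock()
        if self._checked_at is not None and now - self._checked_at < self.ttl:
            return self.healthy       # cached answer is still fresh
        self._checked_at = now
        result = bool(self.probe())
        if result == self.healthy:
            self._streak = 0          # a single blip is forgotten
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.healthy = result
                self._streak = 0
        return self.healthy
```

The probe argument can be any zero-argument callable, for example a lambda wrapping an HTTP request to the peer's mesh address. The injectable clock is there so the timing behavior can be tested without sleeping.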

But do not call the peer reachable just because a control-plane heartbeat exists.

That is how I lost an afternoon.

What I would tell anyone building this

If you are building local-first AI across machines, do not start with "peer online."

Start with the path:

Can the request leave this process?

Can it cross the mesh?

Can it reach the peer process?

Can the peer reach the model runtime?

Can one token come back?

Everything else is metadata.

The metadata is still useful. Heartbeats, handshakes, rosters, relay status, and capabilities all help narrow the search. But they are not proof that the system can do the work.

A local AI mesh should not ask "is the peer online?"

It should ask "can this prompt reach that model and stream a token back right now?"

That is the health check that matters.

Rob writes the Local AI Engineering Notes series on strake.dev. He's building Potluck AI, a local-first AI system that routes inference across your own machines and trusted peers, and Strake, a GitHub Action deploy gate.
