Making a fleet of self-hosted LLM agents trustworthy

#ai #kubernetes #llm #opensource

Originally published at llmkube.com/blog/making-self-hosted-llm-agents-trustworthy. Cross-posted here for the dev.to audience.

Running a single local LLM node is a solved problem. You write an InferenceService, the operator schedules it, llama.cpp or MLX serves it, and you get an OpenAI-compatible endpoint. We have been doing that for months.

Running a fleet of them is where it stops being easy. My fleet is heterogeneous on purpose: CUDA pods in the cluster, and Apple Silicon Macs sitting off-cluster on the homelab network, each one running two separate agents (one for inference, one for the agentic coding harness). The day I shipped 0.8.4 to that fleet, I learned exactly how it does not scale.

I updated each Mac by hand. The control plane had no idea what version any agent was running. And the launchd reload I used to restart an agent was a silent no-op on an already-loaded service, so the old binary kept running while I believed I had updated it. I found that out by hand-inspecting a process tree. Three machines made it annoying. Thirty would make it impossible, and the whole pitch for sovereign, on-prem AI is that you run a lot more than three.

So the last stretch of work on LLMKube was not about a faster runtime or a bigger model. It was about making the fleet trustworthy: able to update itself safely, and unable to lie to the control plane about its own state. Here is what that took.

Helm and brew for the edge

The fix is a new cluster-scoped CRD, AgentRelease, and a self-update path in the agents themselves. You describe the release you want once, the operator rolls it out, and the agents pull and apply it. The design borrows directly from prior art that already solved pieces of this for nodes and agents: Rancher's system-upgrade-controller, k0s autopilot's per-platform SHA-256 staging, and Teleport's outbound-only poll model.

The properties that make it safe to leave running:

Declarative and approved. An AgentRelease names the agent, the version, and the per-platform artifacts (URL plus SHA-256). Nothing moves until a human flips approved: true. The approved CR is the trust anchor.
Staged and health-gated. The operator updates one node at a time. A freshly updated node has to come back, register, and stay healthy past a soak window before the next node is touched.
Halt-on-failure. If a node does not reach the target version inside the timeout, the rollout stops cold. Blast radius is exactly one node, and you go look at it.
Verified and reversible. The agent downloads the artifact, checks the SHA-256 before it touches anything, stages the new binary beside the old one, flips a single current symlink atomically, and keeps a previous symlink for a one-command rollback. A bad checksum leaves the running version untouched.
Outbound-only. Edge agents are behind NAT and Tailscale. They poll out; nothing reaches in. The same shape that lets a laptop update itself lets a Mac in a closet three sites away update itself.
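To make the "verified and reversible" step concrete, here is a minimal sketch of the verify, stage, and flip sequence. It is written in Python for brevity and is not the agent's actual code. The install layout (versions/<version>/agent under a root directory), the function names, and the .tmp suffix are assumptions for illustration; only the current and previous symlinks and the SHA-256 check come from the design described above.

```python
import hashlib
import os
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream the file so a large binary does not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def atomic_symlink(target: str, link: Path) -> None:
    """Point `link` at `target` by renaming a temporary symlink over it."""
    tmp = link.with_name(link.name + ".tmp")  # hypothetical temp name
    if tmp.is_symlink() or tmp.exists():
        tmp.unlink()
    os.symlink(target, tmp)
    os.replace(tmp, link)  # rename(2) swaps the link itself in one step on POSIX


def apply_release(root: Path, version: str, artifact: Path, expected_sha256: str) -> None:
    """Verify first, stage beside the old version, then flip current/previous."""
    actual = sha256_of(artifact)
    if actual != expected_sha256:
        # Nothing has been staged or flipped yet, so the running version is untouched.
        raise ValueError(f"checksum mismatch: expected {expected_sha256}, got {actual}")

    staged = root / "versions" / version  # assumed layout, not from the post
    staged.mkdir(parents=True, exist_ok=True)
    binary = staged / "agent"
    shutil.copyfile(artifact, binary)
    binary.chmod(0o755)

    current = root / "current"
    if current.is_symlink():
        # Remember the outgoing version so rollback is one more symlink flip.
        atomic_symlink(os.readlink(current), root / "previous")
    atomic_symlink(str(Path("versions") / version), current)


def rollback(root: Path) -> None:
    """Point current back at whatever previous names."""
    atomic_symlink(os.readlink(root / "previous"), root / "current")
```

The ordering is the point: the checksum comparison happens before any directory is created, and the only step a running process can observe is the final rename of the symlink.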

The end state is that a release I cut becomes a one-line kubectl apply and an approval, instead of an afternoon of SSH. I proved the whole loop on a live node: publish a version, apply the AgentRelease, watch it sit at AwaitingApproval, approve it, then watch the node drain, download, verify, flip, restart onto the new binary, and report back, with the rollout closing out at Succeeded. The first update is still a manual hop (an agent on the old, unaware binary cannot update itself to the version that teaches it how), but every release after that is hands-off.
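The approval gate, one-node-at-a-time staging, and halt-on-failure behaviour can be sketched as a small loop. This is an illustration under stated assumptions, not the operator's reconcile code: the Rollout fields, the update_node callback, and the Progressing and Halted phase names are hypothetical. Only AwaitingApproval and Succeeded appear in the rollout described above.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rollout:
    target_version: str
    nodes: List[str]
    approved: bool = False
    phase: str = "AwaitingApproval"
    updated: List[str] = field(default_factory=list)
    failed_node: str = ""


def reconcile(rollout: Rollout, update_node: Callable[[str, str], bool]) -> Rollout:
    """Advance a rollout one node at a time and stop at the first failure.

    `update_node(node, version)` is a hypothetical callback. It should return
    True only if the node reached the target version and stayed healthy
    through the soak window before the timeout expired.
    """
    if not rollout.approved:
        # Nothing moves until a human approves the release.
        rollout.phase = "AwaitingApproval"
        return rollout

    rollout.phase = "Progressing"  # assumed phase name
    for node in rollout.nodes:
        if node in rollout.updated:
            continue  # already on the target version from an earlier pass
        if not update_node(node, rollout.target_version):
            # Halt-on-failure: later nodes are never touched.
            rollout.phase = "Halted"  # assumed phase name
            rollout.failed_node = node
            return rollout
        rollout.updated.append(node)

    rollout.phase = "Succeeded"
    return rollout
```

Because the loop returns as soon as one node fails, the blast radius stays at that single node, and every node after it keeps running the old version until someone investigates.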

Trustworthy is more than updatable

An auto-updating fleet that lies about its health is worse than a manual one. So a batch of less glamorous reliability work, the "trustworthy fleet" milestone, had to land alongside the update path.

Liveness, not optimism. A Metal node used to register an endpoint and then keep reporting one ready replica forever, even after the host went offline for weeks. Now agents heartbeat, and the controller expires a registration that goes stale. A dead backend stops counting as a live one.
Admission validation. A new validating webhook checks agent and task definitions at kubectl apply time, so an invalid spec is rejected at the door instead of failing confusingly three steps later when a task gets dispatched to it.
A real end-to-end test. Unit tests and envtest cover a lot, but nothing was exercising the full install path: helm install the chart, the operator comes up, an agent registers a node, the scheduler routes a task to it, and the task actually runs to completion. Now a kind-based CI job does exactly that.
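The liveness rule from the list above comes down to comparing a last-heartbeat timestamp against a time-to-live. A minimal sketch, assuming a hypothetical Registration record and a TTL chosen by the controller (neither the field names nor any particular TTL value come from LLMKube itself):

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Registration:
    node: str
    endpoint: str
    last_heartbeat: float  # seconds since the epoch


def live_registrations(regs: List[Registration], now: float, ttl_seconds: float) -> List[Registration]:
    """Keep only registrations whose last heartbeat is within the TTL."""
    return [r for r in regs if now - r.last_heartbeat <= ttl_seconds]


def ready_replicas(regs: List[Registration], now: float, ttl_seconds: float) -> int:
    """Count backends that are actually alive, not merely registered."""
    return len(live_registrations(regs, now, ttl_seconds))
```

Passing now in as an argument, rather than reading the clock inside the function, keeps the expiry rule easy to test with fixed timestamps.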

None of these are features you would put on a billboard. They are the difference between a demo and something you would leave pointed at production hardware.

The part where dogfooding earns its keep

Here is the honest build-in-public bit, and the reason I trust this work more than I would trust a green test suite alone.

When I ran the very first live self-update against a real node, it did not engage. The agent logged that self-update was disabled because it was "not running from a managed install root", which was wrong: it was running from exactly that root. The detection compared the running binary's resolved path against the literal current/ symlink path, but resolving the binary's path followed the symlink to the real versioned directory, so the two could never match. The unit test had passed for two reasons: it fed the check an unresolved path that never happens in production, and the check cached its answer once, forever, so the test could not have noticed anyway. The feature had quietly disabled itself on every real install, and only dogfooding the actual rollout surfaced it. The fix was small. Finding it required running the thing for real.
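The path mismatch is easy to reproduce in a few lines. The sketch below is a Python reconstruction of the shape of the bug, not the agent's code: the function names and the versions/1.2.3 layout are made up for the demo, and the "resolve both sides" variant is one plausible fix rather than necessarily the one that shipped. The caching half of the problem is not modelled here.

```python
import os
import tempfile


def is_managed_literal(binary_path: str, root: str) -> bool:
    """Buggy shape: compare against the literal current/ prefix."""
    return binary_path.startswith(os.path.join(root, "current") + os.sep)


def is_managed_resolved(binary_path: str, root: str) -> bool:
    """Resolve both sides, so the symlink is followed on each before comparing."""
    resolved = os.path.realpath(binary_path)
    current = os.path.realpath(os.path.join(root, "current"))
    return os.path.commonpath([resolved, current]) == current


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        root = os.path.realpath(tmp)
        versioned = os.path.join(root, "versions", "1.2.3")  # hypothetical layout
        os.makedirs(versioned)
        open(os.path.join(versioned, "agent"), "wb").close()
        os.symlink(versioned, os.path.join(root, "current"))

        via_symlink = os.path.join(root, "current", "agent")  # what the unit test fed in
        resolved = os.path.realpath(via_symlink)              # what production sees

        return (
            is_managed_literal(via_symlink, root),  # passes in the test
            is_managed_literal(resolved, root),     # fails on a real install
            is_managed_resolved(resolved, root),    # matches once both sides are resolved
        )
```

Calling demo() returns (True, False, True): the literal comparison only succeeds on the unresolved path that the unit test supplied, which is why a green test said nothing about real installs.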

Then there was the end-to-end test. I wrote it specifically to catch install-path bugs that unit tests cannot see, and it caught one on its first CI run: a task reached Scheduled and then stalled, because the agent was watching one namespace while the task lived in another. The scheduler assigned the work; the agent never saw it. That is exactly the class of bug a real apiserver surfaces and a mock does not. The test earned its place before it had even merged.

I am not going to pretend the rest of the cycle was clean either. Pinning a webhook's TLS certs the simple way tripped a CI script that had been quietly passing a giant blob through an environment variable, which works on macOS and dies on Linux. A glob model pattern that routed correctly one way compiled to a literal that matched nothing the other way, while reporting itself healthy. Every one of these passed review or local checks and got caught by the next layer: a full lint, a real cluster, an adversarial second look. That layering is the point. The goal was never zero bugs. It was no bug that survives to a node you cannot reach.

Why this is the hard part of sovereign AI

It is tempting to think the hard problem in self-hosted AI is the inference: the quantization, the GPU memory, the tokens per second. Those are hard, and we spend plenty of time there. But the thing that actually keeps people on a managed cloud is not raw capability. It is that someone else runs the fleet. Updates land, dead nodes get pulled, bad config gets rejected, and you do not think about any of it.

If sovereign AI is going to be a real alternative and not a hobby, it has to offer that same "do not think about it" property while keeping the data and the models on hardware you own. A fleet you have to babysit by hand is not sovereign in any way that matters; it is just someone else's operational burden moved onto you. The work in this post is the unglamorous half of closing that gap: a fleet that updates itself safely, tells the truth about its own health, and refuses to accept a configuration that would break it.

That is the control plane I want for local AI at scale. It is in LLMKube now, it is open source, and it caught its own bugs on the way in.

LLMKube is a Kubernetes operator for self-hosted LLM inference: CUDA, Apple Silicon Metal, multi-GPU, and a heterogeneous fleet under one control plane. Apache 2.0, github.com/defilantech/LLMKube.