I built an autonomous SRE that lets an LLM diagnose incidents — but never touch a shell unsupervised

#go #devops #distributedsystems #ai

I built an autonomous SRE system where a local LLM diagnoses production incidents, proposes a fix, and a deterministic engine decides whether that fix is ever allowed to run. The whole thing works inside your own network — zero data egress.

It's called Sentinel. It's a prototype and a learning project: I started from scratch on a Mac and used it to go deep on distributed systems, gRPC, local LLM inference, and safe automation. This post walks through why it's built the way it is — the design decisions are the interesting part.

Repo: https://github.com/Blazi2002/sentinel

The problem

Modern observability is passive. Prometheus and Grafana tell you that something is wrong, but a human still has to diagnose the cause and type the fix — often at 3 a.m.

LLM assistants could close that gap. But the capable hosted ones send your logs, configs, and environment data to a public cloud API. For regulated sectors such as defense, finance, healthcare, and utilities, that's a non-starter: it fails on legal grounds before any technical question comes up.

Sentinel closes the gap without breaking the constraint: detection, LLM inference, validation, and execution all happen inside the customer's network.

The core idea: separate the probabilistic from the deterministic

An LLM is probabilistic. Given the same input it can answer differently, and occasionally dangerously. Letting it run commands as root unsupervised would be reckless.

So Sentinel draws a hard line:

The LLM proposes. It diagnoses the anomaly and drafts a remediation plan.
A deterministic engine disposes. Every command is parsed and judged by a fixed, verifiable, repeatable rule set before it can run.

That engine — the policy engine — is the moat between an LLM's output and a privileged shell. It's the most distinctive part of the system.

How it works

The pipeline has eight stages:

detect → capture → transport → reason → validate → persist → approve → execute

Detect — a Go agent on each host spots a metric over threshold (e.g. memory at 85%).
Capture — it builds a telemetry event, enriched with context (e.g. the top memory-consuming processes).
Transport — the event travels to the hub over gRPC.
Reason — the hub prompts a locally-hosted LLM, which returns a structured JSON plan: root cause, risk level, confidence, ordered commands.
Validate — the policy engine judges each command: allow, review, or block.
Persist — the full incident is saved to PostgreSQL in a single transaction.
Approve — an operator reviews the incident on a dashboard and approves or rejects.
Execute — the node pulls the approved plan, runs only the allow commands (dry-run by default), and reports back.
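
To make the Reason stage concrete, here is a minimal sketch of decoding the LLM's structured plan strictly before it goes anywhere near validation. The JSON key names, the `Plan` type, the `ParsePlan` helper, and the 0 to 1 confidence range are all assumptions for illustration, not Sentinel's actual schema.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// Plan mirrors the fields the post lists for the LLM's answer.
// The JSON key names are assumptions, not Sentinel's real schema.
type Plan struct {
	RootCause  string   `json:"root_cause"`
	RiskLevel  string   `json:"risk_level"`
	Confidence float64  `json:"confidence"`
	Commands   []string `json:"commands"`
}

// ParsePlan decodes the model output strictly: unknown fields, invalid JSON,
// a missing root cause, or an out-of-range confidence all reject the plan.
func ParsePlan(raw string) (Plan, error) {
	var p Plan
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&p); err != nil {
		return Plan{}, fmt.Errorf("decode plan: %w", err)
	}
	if p.RootCause == "" {
		return Plan{}, errors.New("plan has no root cause")
	}
	// Assumption: confidence is expressed as a fraction between 0 and 1.
	if p.Confidence < 0 || p.Confidence > 1 {
		return Plan{}, errors.New("confidence must be in [0, 1]")
	}
	return p, nil
}

func main() {
	raw := `{"root_cause":"process X leaks memory","risk_level":"low","confidence":0.9,"commands":["systemctl restart x"]}`
	p, err := ParsePlan(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Printf("%d command(s), risk=%s\n", len(p.Commands), p.RiskLevel)
}
```

Rejecting a malformed reply at this point keeps the policy engine's input well-typed; the commands themselves are still untrusted until the Validate stage has judged them.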

The policy engine, in a little more detail

This is the part I'm most proud of.

Instead of naive string matching — which is trivially bypassed — the engine parses each command into an abstract syntax tree using a real shell grammar parser (mvdan/sh). That lets it catch dangerous commands even when they're hidden inside:

eval or sh -c (shell code run as a string)
&& / || chains
command substitutions $(...)
redirection targets (>, >>) — e.g. a write into a system file
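
To see why substring matching is so easy to bypass, consider a deliberately weak filter. This is a hypothetical counter-example using only the standard library, not Sentinel's engine and not mvdan/sh; the `naiveBlocked` function and its blocklist are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveBlocked is a deliberately weak, hypothetical filter: it flags a command
// only if the command contains one of the listed substrings verbatim.
func naiveBlocked(cmd string) bool {
	for _, bad := range []string{"rm -rf", "mkfs"} {
		if strings.Contains(cmd, bad) {
			return true
		}
	}
	return false
}

func main() {
	cmds := []string{
		"rm -rf /var/lib/app",   // caught: matches the pattern verbatim
		"rm  -rf /var/lib/app",  // missed: two spaces between rm and -rf
		"rm -r -f /var/lib/app", // missed: flags written separately
		`"rm" -rf /var/lib/app`, // missed: quoted command name
	}
	for _, c := range cmds {
		fmt.Printf("%q blocked=%v\n", c, naiveBlocked(c))
	}
}
```

A shell treats all four lines as a call to rm with recursive and force flags. A parser that builds a syntax tree exposes the command name and its arguments as separate nodes, which gives a rule something structural to reason about instead of raw text.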

Rules are evaluated most-dangerous-first, and a plan's verdict equals its worst command. One safety guarantee worth stating plainly:

The node executes only allow commands. review and block are never executed automatically — even after operator approval.

Defense in depth: the deterministic gate filters the commands, and the human approves the filtered set. Approval is necessary but never sufficient: the human can't override the technical verdict.
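
The two rules above (a plan is as bad as its worst command, and only allow commands ever run) can be sketched in a few lines. The type and function names here (`Verdict`, `Judged`, `PlanVerdict`, `Executable`) are hypothetical and chosen for illustration; the real engine may model this differently.

```go
package main

import "fmt"

// Verdict is ordered so that a larger value is more dangerous.
type Verdict int

const (
	Allow Verdict = iota
	Review
	Block
)

// String renders a verdict; anything unrecognized prints as "block".
func (v Verdict) String() string {
	switch v {
	case Allow:
		return "allow"
	case Review:
		return "review"
	default:
		return "block"
	}
}

// Judged pairs a command with the verdict the policy engine gave it.
type Judged struct {
	Command string
	Verdict Verdict
}

// PlanVerdict returns the worst verdict among the commands.
// Note that an empty plan yields Allow in this sketch.
func PlanVerdict(cmds []Judged) Verdict {
	worst := Allow
	for _, c := range cmds {
		if c.Verdict > worst {
			worst = c.Verdict
		}
	}
	return worst
}

// Executable returns the commands a node may run. Operator approval is
// required, but it never promotes a review or block command.
func Executable(cmds []Judged, approved bool) []string {
	if !approved {
		return nil
	}
	var out []string
	for _, c := range cmds {
		if c.Verdict == Allow {
			out = append(out, c.Command)
		}
	}
	return out
}

func main() {
	cmds := []Judged{
		{"df -h", Allow},
		{"systemctl restart app", Review},
		{"rm -rf /", Block},
	}
	fmt.Println("plan verdict:", PlanVerdict(cmds))
	fmt.Println("runs after approval:", Executable(cmds, true))
}
```

Keeping the approval flag and the verdict as two independent conditions is what makes the guarantee hold: there is no code path where approval changes a verdict.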

A finding worth sharing

Early on, given the same anomaly ("memory at 82%"), the LLM produced vague, shifting guesses — a memory leak, too much load, inefficient processes, take your pick.

Then I enriched the event with the actual top memory-consuming processes — carried in a free-form labels map, with no schema change — and told the prompt to ground its answer in that data. The diagnosis became precise and repeatable: it named the specific culprit process, with its memory percentage, every time.

The lesson generalizes:

The quality of an LLM's diagnosis depends on the quality of the context you give it, not just on the size of the model.
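
As a rough illustration of the enrichment described above, the sketch below carries context in a free-form labels map and renders it into the prompt. The `Event` type, the `BuildPrompt` function, the label keys, the prompt wording, and the process figures are all invented for the example; they are not taken from Sentinel.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Event is a hypothetical telemetry event. Labels carries free-form context,
// so adding a new kind of context needs no schema change.
type Event struct {
	Host   string
	Metric string
	Value  float64
	Labels map[string]string
}

// BuildPrompt renders the event and its labels into prompt text. Keys are
// sorted so the same event always produces exactly the same prompt.
func BuildPrompt(e Event) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Host %s reports %s at %.1f%%.\n", e.Host, e.Metric, e.Value)

	keys := make([]string, 0, len(e.Labels))
	for k := range e.Labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	if len(keys) > 0 {
		b.WriteString("Observed context:\n")
		for _, k := range keys {
			fmt.Fprintf(&b, "- %s: %s\n", k, e.Labels[k])
		}
		b.WriteString("Base the diagnosis only on the observed context above.\n")
	}
	return b.String()
}

func main() {
	e := Event{
		Host:   "web-1",
		Metric: "memory",
		Value:  82,
		Labels: map[string]string{ // made-up sample data
			"top_proc_1": "java 61.3%",
			"top_proc_2": "postgres 11.0%",
		},
	}
	fmt.Print(BuildPrompt(e))
}
```

Sorting the keys matters because Go randomizes map iteration order; without it, identical events could produce differently ordered prompts, which works against repeatable diagnoses.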

Tech stack

Go — node, hub, policy engine
gRPC + Protocol Buffers — typed transport, code generated from .proto
Ollama (dev) / vLLM (planned, prod) — local inference; model qwen2.5-coder:7b
PostgreSQL — persistence, with versioned migrations (golang-migrate) and type-safe queries (sqlc)
mvdan/sh — shell AST parsing in the policy engine

What's done, and what isn't

I want to be honest about the state.

Done and demonstrable end-to-end: detection, capture, gRPC transport, LLM reasoning, deterministic policy validation, transactional persistence, an operator dashboard with filtering and an approve/reject workflow, and node-side execution (dry-run) that reports back and closes the incident lifecycle. The policy engine and executor have tests covering their safety guarantees.

Planned / in progress (tracked, not hidden):

mTLS on the transport (currently plaintext)
Asynchronous pipeline (the node currently waits for the LLM synchronously)
vLLM on GPUs for production inference
Hash-chained audit log (the message exists in the data contract; not yet wired)
Resilient hub startup (degrade gracefully when the LLM is unreachable)
Packaging (RPM), k3s deployment, secret management via a vault
More anomaly types beyond memory and disk
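
For readers unfamiliar with the hash-chained audit log item, here is one way the idea could look. It is a generic sketch of hash chaining with SHA-256, assuming a simple in-memory slice; the `Entry` type and the `Append` and `Verify` functions are hypothetical and say nothing about how Sentinel's planned log will actually be built.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is one audit record. Hash covers the message and the previous hash,
// so changing any earlier record invalidates every later one.
type Entry struct {
	Message  string
	PrevHash string
	Hash     string
}

// hashEntry hashes the previous hash and the message. The previous hash is
// hex, so the newline separator cannot be confused with its contents.
func hashEntry(prev, msg string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + msg))
	return hex.EncodeToString(sum[:])
}

// Append adds a record linked to the current tail of the chain.
func Append(chain []Entry, msg string) []Entry {
	prev := ""
	if len(chain) > 0 {
		prev = chain[len(chain)-1].Hash
	}
	return append(chain, Entry{Message: msg, PrevHash: prev, Hash: hashEntry(prev, msg)})
}

// Verify recomputes every link and reports whether the chain is intact.
func Verify(chain []Entry) bool {
	prev := ""
	for _, e := range chain {
		if e.PrevHash != prev || e.Hash != hashEntry(prev, e.Message) {
			return false
		}
		prev = e.Hash
	}
	return true
}

func main() {
	var chain []Entry
	chain = Append(chain, "incident 1 approved")
	chain = Append(chain, "incident 1 executed (dry-run)")
	fmt.Println("intact:", Verify(chain))

	chain[0].Message = "incident 1 rejected" // simulate tampering
	fmt.Println("intact after tampering:", Verify(chain))
}
```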

Design principles

Clear component boundaries — implementations swap without rewrites (Ollama → vLLM, dry-run → live, in-memory → PostgreSQL).
Schema as the source of truth — both messages (Protobuf) and data access (sqlc) are generated from a formal definition.
Safe by default — the default state is always the prudent one (dry-run; unknown verdict treated as needs-review; the policy filter as the final technical gate).
Probabilistic proposes, deterministic disposes — the central safety idea.
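
The first and third principles combine naturally in Go: put the swappable part behind an interface, and make the zero value of the configuration pick the prudent implementation. This is a sketch under assumptions; `Executor`, `DryRun`, `Live`, `Config`, and `NewExecutor` are hypothetical names, and running live commands through `sh -c` is an illustrative choice rather than a description of Sentinel's node.

```go
package main

import (
	"fmt"
	"os/exec"
)

// Executor is the boundary the node depends on; implementations can be
// swapped without touching the caller.
type Executor interface {
	Run(command string) (string, error)
}

// DryRun only reports what it would do.
type DryRun struct{}

func (DryRun) Run(command string) (string, error) {
	return "dry-run: would execute " + command, nil
}

// Live actually runs the command. Assumption: commands are passed to sh -c.
type Live struct{}

func (Live) Run(command string) (string, error) {
	out, err := exec.Command("sh", "-c", command).CombinedOutput()
	return string(out), err
}

// Config's zero value selects dry-run, so an unset or forgotten
// configuration can never trigger live execution.
type Config struct {
	LiveExecution bool
}

// NewExecutor picks the implementation from the configuration.
func NewExecutor(c Config) Executor {
	if c.LiveExecution {
		return Live{}
	}
	return DryRun{}
}

func main() {
	ex := NewExecutor(Config{}) // nothing configured: dry-run
	out, err := ex.Run("systemctl restart app")
	fmt.Println(out, err)
}
```

Naming the flag after the dangerous mode, rather than after the safe one, is the detail that makes the default prudent: a boolean that nobody sets is false.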

If you want to dig into any of the design decisions, the code is here: https://github.com/Blazi2002/sentinel — the README is honest about what's done vs. planned. Happy to answer questions.