<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Blazi2002</title>
    <description>The latest articles on DEV Community by Blazi2002 (@blazi2002).</description>
    <link>https://dev.to/blazi2002</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3976675%2F5a69a755-ccf4-4aca-ad87-05a54fcc107b.png</url>
      <title>DEV Community: Blazi2002</title>
      <link>https://dev.to/blazi2002</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/blazi2002"/>
    <language>en</language>
    <item>
      <title>I built an autonomous SRE that lets an LLM diagnose incidents — but never touch a shell unsupervised</title>
      <dc:creator>Blazi2002</dc:creator>
      <pubDate>Tue, 09 Jun 2026 23:13:47 +0000</pubDate>
      <link>https://dev.to/blazi2002/i-built-an-autonomous-sre-that-lets-an-llm-diagnose-incidents-but-never-touch-a-shell-unsupervised-2mgh</link>
      <guid>https://dev.to/blazi2002/i-built-an-autonomous-sre-that-lets-an-llm-diagnose-incidents-but-never-touch-a-shell-unsupervised-2mgh</guid>
      <description>&lt;p&gt;I built an autonomous SRE system in which a local LLM diagnoses production incidents and proposes a fix, and a deterministic engine decides whether that fix is ever allowed to run. The whole thing runs inside your own network, with zero data egress.&lt;/p&gt;

&lt;p&gt;It's called &lt;strong&gt;Sentinel&lt;/strong&gt;. It's a prototype and a learning project: I started from scratch on a Mac and used it to go deep on distributed systems, gRPC, local LLM inference, and safe automation. This post walks through &lt;em&gt;why&lt;/em&gt; it's built the way it is — the design decisions are the interesting part.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Repo: &lt;a href="https://github.com/Blazi2002/sentinel" rel="noopener noreferrer"&gt;https://github.com/Blazi2002/sentinel&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;Modern observability is passive. Prometheus and Grafana tell you &lt;em&gt;that&lt;/em&gt; something is wrong, but a human still has to diagnose the cause and type the fix — often at 3 a.m.&lt;/p&gt;

&lt;p&gt;LLM assistants could close that gap, but the most capable ones sit behind public cloud APIs, so using them means sending your logs, configs, and environment data off-site. For regulated sectors such as defense, finance, healthcare, and utilities, that's a non-starter, legally before technically.&lt;/p&gt;

&lt;p&gt;Sentinel closes the gap &lt;strong&gt;without breaking the constraint&lt;/strong&gt;: detection, LLM inference, validation, and execution all happen inside the customer's network.&lt;/p&gt;

&lt;h2&gt;The core idea: separate the probabilistic from the deterministic&lt;/h2&gt;

&lt;p&gt;An LLM is probabilistic. Given the same input it can answer differently, and occasionally dangerously. Letting it run commands as root unsupervised would be reckless.&lt;/p&gt;

&lt;p&gt;So Sentinel draws a hard line:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The LLM proposes.&lt;/strong&gt; It diagnoses the anomaly and drafts a remediation plan.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A deterministic engine disposes.&lt;/strong&gt; Every command is parsed and judged by a fixed, verifiable, repeatable rule set &lt;em&gt;before&lt;/em&gt; it can run.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That engine — the &lt;strong&gt;policy engine&lt;/strong&gt; — is the moat between an LLM's output and a privileged shell. It's the most distinctive part of the system.&lt;/p&gt;

&lt;h2&gt;How it works&lt;/h2&gt;

&lt;p&gt;The pipeline has eight stages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;detect → capture → transport → reason → validate → persist → approve → execute
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Detect&lt;/strong&gt; — a Go agent on each host spots a metric over threshold (e.g. memory at 85%).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Capture&lt;/strong&gt; — it builds a telemetry event, enriched with context (e.g. the top memory-consuming processes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transport&lt;/strong&gt; — the event travels to the hub over gRPC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reason&lt;/strong&gt; — the hub prompts a locally-hosted LLM, which returns a structured JSON plan: root cause, risk level, confidence, ordered commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt; — the policy engine judges each command: &lt;code&gt;allow&lt;/code&gt;, &lt;code&gt;review&lt;/code&gt;, or &lt;code&gt;block&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persist&lt;/strong&gt; — the full incident is saved to PostgreSQL in a single transaction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approve&lt;/strong&gt; — an operator reviews the incident on a dashboard and approves or rejects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; — the node pulls the approved plan, runs only the &lt;code&gt;allow&lt;/code&gt; commands (dry-run by default), and reports back.&lt;/li&gt;
&lt;/ol&gt;
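
&lt;p&gt;To make the &lt;strong&gt;Reason&lt;/strong&gt; stage concrete, here is a minimal sketch of decoding the structured plan on the hub. It is not Sentinel's actual contract: the &lt;code&gt;Plan&lt;/code&gt; type, the JSON key names, and &lt;code&gt;decodePlan&lt;/code&gt; are assumptions made for illustration. The idea it shows is strict decoding, so an answer with unexpected keys or no commands is rejected before it reaches the policy engine.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Plan mirrors the four fields the post lists for the LLM's answer.
// The JSON key names are assumptions, not Sentinel's real schema.
type Plan struct {
	RootCause  string   `json:"root_cause"`
	RiskLevel  string   `json:"risk_level"`
	Confidence float64  `json:"confidence"`
	Commands   []string `json:"commands"`
}

// decodePlan parses the model's reply strictly: unknown keys are rejected,
// so a drifting or malformed answer fails loudly instead of being half-read.
func decodePlan(raw string) (*Plan, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	plan := new(Plan)
	if err := dec.Decode(plan); err != nil {
		return nil, err
	}
	if len(plan.Commands) == 0 {
		return nil, fmt.Errorf("plan has no commands")
	}
	return plan, nil
}

func main() {
	// Hypothetical model output, invented for this example.
	raw := `{"root_cause":"cache process holding most of the memory","risk_level":"low","confidence":0.8,"commands":["systemctl restart cache"]}`
	plan, err := decodePlan(raw)
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println(plan.RootCause, plan.Commands)
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Rejecting unknown keys is a design choice, not a requirement: it trades tolerance for predictability, which suits a pipeline whose next step is a safety gate.&lt;/p&gt;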

&lt;h2&gt;The policy engine, in a little more detail&lt;/h2&gt;

&lt;p&gt;This is the part I'm most proud of.&lt;/p&gt;

&lt;p&gt;Instead of naive string matching — which is trivially bypassed — the engine &lt;strong&gt;parses each command into an abstract syntax tree&lt;/strong&gt; using a real shell grammar parser (&lt;code&gt;mvdan/sh&lt;/code&gt;). That lets it catch dangerous commands even when they're hidden inside:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;eval&lt;/code&gt; or &lt;code&gt;sh -c&lt;/code&gt; (shell code run as a string)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;&amp;amp;&amp;amp;&lt;/code&gt; / &lt;code&gt;||&lt;/code&gt; chains&lt;/li&gt;
&lt;li&gt;command substitutions &lt;code&gt;$(...)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;redirection targets (&lt;code&gt;&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt;) — e.g. a write into a system file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Rules are evaluated most-dangerous-first, and a plan's verdict is that of its &lt;strong&gt;worst&lt;/strong&gt; command. One safety guarantee worth stating plainly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The node executes only &lt;code&gt;allow&lt;/code&gt; commands. &lt;code&gt;review&lt;/code&gt; and &lt;code&gt;block&lt;/code&gt; are never executed automatically — &lt;strong&gt;even after operator approval.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is defense in depth: the deterministic gate filters the commands and the human approves the resulting set, but the human can't override the technical verdict.&lt;/p&gt;
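
&lt;p&gt;The two rules above (worst command wins, and only &lt;code&gt;allow&lt;/code&gt; commands ever run) fit in a few lines of Go. This is a sketch under assumptions, not the project's code: the &lt;code&gt;Verdict&lt;/code&gt; and &lt;code&gt;Judged&lt;/code&gt; types and both function names are invented here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
```go
package main

import "fmt"

// Verdict is ordered by severity, so a larger value is a worse verdict.
type Verdict int

const (
	Allow Verdict = iota
	Review
	Block
)

func (v Verdict) String() string {
	switch v {
	case Allow:
		return "allow"
	case Block:
		return "block"
	default:
		return "review"
	}
}

// Judged pairs a command with the verdict the policy engine gave it.
type Judged struct {
	Command string
	Verdict Verdict
}

// planVerdict returns the most severe verdict found in the plan.
func planVerdict(cmds []Judged) Verdict {
	worst := Allow
	for _, c := range cmds {
		if c.Verdict > worst {
			worst = c.Verdict
		}
	}
	return worst
}

// executable keeps only allow commands. Operator approval is deliberately
// not a parameter, so it cannot promote a review or block command.
func executable(cmds []Judged) []string {
	var out []string
	for _, c := range cmds {
		if c.Verdict == Allow {
			out = append(out, c.Command)
		}
	}
	return out
}

func main() {
	plan := []Judged{
		{"df -h", Allow},
		{"systemctl restart app", Review},
		{"rm -rf /var/lib/app", Block},
	}
	fmt.Println(planVerdict(plan), executable(plan))
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping approval out of the signature of &lt;code&gt;executable&lt;/code&gt; is the point: the filter has no input through which a human decision could widen what runs.&lt;/p&gt;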

&lt;h2&gt;A finding worth sharing&lt;/h2&gt;

&lt;p&gt;Early on, given the same anomaly ("memory at 82%"), the LLM produced vague, shifting guesses — a memory leak, too much load, inefficient processes, take your pick.&lt;/p&gt;

&lt;p&gt;Then I enriched the event with the actual top memory-consuming processes — carried in a free-form labels map, &lt;strong&gt;with no schema change&lt;/strong&gt; — and told the prompt to ground its answer in that data. The diagnosis became precise and repeatable: it named the specific culprit process, with its memory percentage, every time.&lt;/p&gt;

&lt;p&gt;The lesson generalizes:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The quality of an LLM's diagnosis depends on the quality of the context you give it, not just on the size of the model.&lt;/p&gt;
&lt;/blockquote&gt;
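
&lt;p&gt;Here is a small sketch of what that enrichment might look like in code. The &lt;code&gt;Event&lt;/code&gt; type, the label keys, the sample values, and the prompt wording are all assumptions for illustration. The post only states that the top processes travel in a free-form labels map and that the prompt asks the model to ground its answer in them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Event is a hypothetical telemetry event. Labels is the free-form map:
// adding a new key needs no schema change.
type Event struct {
	Host   string
	Metric string
	Value  float64
	Labels map[string]string
}

// facts flattens the labels into sorted "key=value" lines, so the same
// event always produces the same prompt text.
func facts(e Event) []string {
	out := make([]string, 0, len(e.Labels))
	for k, v := range e.Labels {
		out = append(out, k+"="+v)
	}
	sort.Strings(out)
	return out
}

// buildPrompt states the anomaly, lists the observed facts, and tells the
// model to base its diagnosis on them.
func buildPrompt(e Event) string {
	b := new(strings.Builder)
	fmt.Fprintf(b, "Anomaly on %s: %s = %.0f\n", e.Host, e.Metric, e.Value)
	b.WriteString("Observed facts:\n")
	for _, f := range facts(e) {
		b.WriteString("- " + f + "\n")
	}
	b.WriteString("Base the diagnosis only on the facts above and name the responsible process.\n")
	return b.String()
}

func main() {
	e := Event{
		Host: "web-1", Metric: "memory_percent", Value: 82,
		Labels: map[string]string{
			"top_proc_1": "java 41.2%",
			"top_proc_2": "postgres 18.7%",
		},
	}
	fmt.Print(buildPrompt(e))
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Sorting the labels is a small detail that matters here: Go map iteration order is random, and a prompt that changes shape between runs works against repeatable diagnoses.&lt;/p&gt;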

&lt;h2&gt;Tech stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Go&lt;/strong&gt; — node, hub, policy engine&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gRPC + Protocol Buffers&lt;/strong&gt; — typed transport, code generated from &lt;code&gt;.proto&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (dev) / &lt;strong&gt;vLLM&lt;/strong&gt; (planned, prod) — local inference; model &lt;code&gt;qwen2.5-coder:7b&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt; — persistence, with versioned migrations (&lt;code&gt;golang-migrate&lt;/code&gt;) and type-safe queries (&lt;code&gt;sqlc&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;mvdan/sh&lt;/strong&gt; — shell AST parsing in the policy engine&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;What's done, and what isn't&lt;/h2&gt;

&lt;p&gt;I want to be honest about the project's current state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Done and demonstrable end-to-end:&lt;/strong&gt; detection, capture, gRPC transport, LLM reasoning, deterministic policy validation, transactional persistence, an operator dashboard with filtering and an approve/reject workflow, and node-side execution (dry-run) that reports back and closes the incident lifecycle. The policy engine and executor have tests covering their safety guarantees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Planned / in progress (tracked, not hidden):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;mTLS on the transport (currently plaintext)&lt;/li&gt;
&lt;li&gt;Asynchronous pipeline (the node currently waits for the LLM synchronously)&lt;/li&gt;
&lt;li&gt;vLLM on GPUs for production inference&lt;/li&gt;
&lt;li&gt;Hash-chained audit log (the message exists in the data contract; not yet wired)&lt;/li&gt;
&lt;li&gt;Resilient hub startup (degrade gracefully when the LLM is unreachable)&lt;/li&gt;
&lt;li&gt;Packaging (RPM), k3s deployment, secret management via a vault&lt;/li&gt;
&lt;li&gt;More anomaly types beyond memory and disk&lt;/li&gt;
&lt;/ul&gt;
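
&lt;p&gt;For readers unfamiliar with the hash-chained audit log on that list, the general technique looks roughly like this. Since the feature is not wired yet, nothing below reflects Sentinel's design: the &lt;code&gt;Entry&lt;/code&gt; type, the choice of SHA-256, and the function names are assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// Entry is one audit record. Hash covers both the payload and the hash of
// the previous entry, which is what links the records into a chain.
type Entry struct {
	Payload  string
	PrevHash string
	Hash     string
}

func hashOf(prev, payload string) string {
	sum := sha256.Sum256([]byte(prev + "\n" + payload))
	return hex.EncodeToString(sum[:])
}

// appendEntry links a new record to the current tail of the log.
func appendEntry(log []Entry, payload string) []Entry {
	prev := ""
	if n := len(log); n != 0 {
		prev = log[n-1].Hash
	}
	return append(log, Entry{Payload: payload, PrevHash: prev, Hash: hashOf(prev, payload)})
}

// verify recomputes every hash from the start. Editing or removing an
// earlier entry breaks the check at that point in the chain.
func verify(log []Entry) bool {
	prev := ""
	for _, e := range log {
		if e.PrevHash != prev || e.Hash != hashOf(prev, e.Payload) {
			return false
		}
		prev = e.Hash
	}
	return true
}

func main() {
	log := appendEntry(nil, "incident 1: plan approved")
	log = appendEntry(log, "incident 1: executed (dry-run)")
	fmt.Println(verify(log))

	log[0].Payload = "incident 1: plan rejected" // simulate tampering
	fmt.Println(verify(log))
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;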

&lt;h2&gt;Design principles&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Clear component boundaries&lt;/strong&gt; — implementations swap without rewrites (Ollama → vLLM, dry-run → live, in-memory → PostgreSQL).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schema as the source of truth&lt;/strong&gt; — both messages (Protobuf) and data access (sqlc) are generated from a formal definition.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safe by default&lt;/strong&gt; — the default state is always the prudent one (dry-run; unknown verdict treated as needs-review; the policy filter as the final technical gate).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probabilistic proposes, deterministic disposes&lt;/strong&gt; — the central safety idea.&lt;/li&gt;
&lt;/ul&gt;
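
&lt;p&gt;The "safe by default" principle maps well onto Go's zero values. The sketch below is illustrative only, with invented names (&lt;code&gt;normalize&lt;/code&gt;, &lt;code&gt;Executor&lt;/code&gt;, &lt;code&gt;Live&lt;/code&gt;): an unrecognized verdict string falls back to &lt;code&gt;review&lt;/code&gt;, and an executor that nobody configured stays in dry-run.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;
```go
package main

import (
	"fmt"
	"strings"
)

// normalize maps any verdict string onto the three known values.
// Anything unrecognized, including the empty string, becomes "review".
func normalize(raw string) string {
	switch v := strings.ToLower(strings.TrimSpace(raw)); v {
	case "allow", "review", "block":
		return v
	default:
		return "review"
	}
}

// Executor runs commands on a node. The zero value has Live == false,
// so an unconfigured executor can only perform dry-runs.
type Executor struct {
	Live bool
}

// Run reports what it would do unless live mode was switched on explicitly.
func (e Executor) Run(cmd string) string {
	if !e.Live {
		return "dry-run: would run " + cmd
	}
	// Real execution is intentionally left out of this sketch.
	return "live execution not implemented in this sketch"
}

func main() {
	var ex Executor // zero value: dry-run
	fmt.Println(normalize("ALLOW"), normalize("probably-fine"))
	fmt.Println(ex.Run("systemctl restart cache"))
}
```
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;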




&lt;p&gt;If you want to dig into any of the design decisions, the code is here: &lt;a href="https://github.com/Blazi2002/sentinel" rel="noopener noreferrer"&gt;https://github.com/Blazi2002/sentinel&lt;/a&gt; — the README is honest about what's done vs. planned. Happy to answer questions.&lt;/p&gt;

</description>
      <category>go</category>
      <category>devops</category>
      <category>distributedsystems</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
