<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: paul_h</title>
    <description>The latest articles on DEV Community by paul_h (@paul_knoxops).</description>
    <link>https://dev.to/paul_knoxops</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3955555%2F68e3b7cb-e74f-4ff9-9391-0af09081ef3a.jpg</url>
      <title>DEV Community: paul_h</title>
      <link>https://dev.to/paul_knoxops</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/paul_knoxops"/>
    <language>en</language>
    <item>
      <title>Loop Engineering: Building an Agent Loop with agent-runbook</title>
      <dc:creator>paul_h</dc:creator>
      <pubDate>Wed, 17 Jun 2026 09:02:40 +0000</pubDate>
      <link>https://dev.to/paul_knoxops/loop-engineering-building-an-agent-loop-with-agent-runbook-206</link>
      <guid>https://dev.to/paul_knoxops/loop-engineering-building-an-agent-loop-with-agent-runbook-206</guid>
      <description>&lt;p&gt;Recently, another interesting new term has appeared in the AI industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Loop Engineering&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you follow the AI space, you've probably seen it everywhere in the past couple of days. It's all over X and other social media, and quite a few people are discussing it in group chats too.&lt;/p&gt;

&lt;p&gt;Addy Osmani recently formalized this concept as Loop Engineering, the fourth "Engineering" after Prompt Engineering, Context Engineering, and Harness Engineering.&lt;/p&gt;

&lt;p&gt;What is a Loop? Here's a concrete scenario:&lt;/p&gt;

&lt;p&gt;You have a project with 16 failing tests. Previously you'd do this: run the tests, see what failed, tell Claude "fix this", it fixes it, you run the tests again, find new issues, say something again... back and forth, &lt;strong&gt;you are the person driving the loop&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The idea behind Loop Engineering is: you no longer manually drive it round by round. You define the goal (all tests pass), define what to do each round (run tests → fix code), define constraints (can't modify test files), then let go. The system runs on its own until the goal is met.&lt;/p&gt;

&lt;h2&gt;
  
  
  /goal Is Not Enough
&lt;/h2&gt;

&lt;p&gt;At this point you might say: doesn't Claude Code already have the &lt;code&gt;/goal&lt;/code&gt; command? Can't I just &lt;code&gt;/goal "all tests pass"&lt;/code&gt; and be done?&lt;/p&gt;

&lt;p&gt;On the surface, yes. &lt;code&gt;/goal&lt;/code&gt; gives you a completion condition, and Claude works on its own until it's satisfied. But after using it a few times you'll notice the problem: the goal is defined, yet the agent still doesn't work the way you want. You only told it "what counts as done"; you never told it "what to do each round".&lt;/p&gt;

&lt;p&gt;What &lt;code&gt;/goal "all tests pass"&lt;/code&gt; does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tells the agent "keep going until this condition is met"&lt;/li&gt;
&lt;li&gt;At the end of each round, an independent model judges whether the goal is satisfied&lt;/li&gt;
&lt;li&gt;The agent has complete freedom in what it does each round&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What it doesn't do:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Doesn't define the internal structure of each round.&lt;/strong&gt; In /goal the agent does whatever it wants each round. Maybe the first round it runs tests + fixes code, the second round it suddenly goes refactoring, the third round it modifies test files.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No iteration-level constraints.&lt;/strong&gt; /goal only has a termination condition. There's no guardrail like "only modify one file per round", and you can't control when the agent goes out of bounds.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not reusable.&lt;/strong&gt; /goal "all tests pass" is gone once you type it. Next time you switch repos or switch people, you have to type it all over again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Not auditable.&lt;/strong&gt; When your boss asks "what's the logic of this automated fix workflow", you can't show them /goal.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To summarize: &lt;code&gt;/goal&lt;/code&gt; solves "keeping the agent from stopping", but doesn't solve "making the agent follow the rules".&lt;/p&gt;

&lt;p&gt;What you need is a place to write down the loop's structure, constraints, and goals — not a one-time command typed into the terminal, but a file that can be committed to the repo, where anyone who gets it can run it and get the same behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  agent-runbook: The Contract Format for Loops
&lt;/h2&gt;

&lt;p&gt;This is what agent-runbook does.&lt;/p&gt;

&lt;p&gt;agent-runbook is an open source project (&lt;a href="https://github.com/KnoxOps/agent-runbook" rel="noopener noreferrer"&gt;github.com/KnoxOps/agent-runbook&lt;/a&gt;). It is not an execution engine for loops but a &lt;strong&gt;contract format&lt;/strong&gt; for them. You declare in YAML "what to iterate on, when to stop, what the constraints are for each round", and the compiler generates a SKILL.md for you. SKILL.md is the reusable instruction format for Claude Code and Codex; put it in your project and it can be invoked directly with &lt;code&gt;claude --skill&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;A loop step has three elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;body&lt;/strong&gt;: what to do each round (the rhythm of observe → act → verify)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;goal&lt;/strong&gt;: when to stop (must be a machine-verifiable condition)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;max_iterations&lt;/strong&gt;: safety boundary (hitting this limit means the design has a problem; it also keeps the loop from burning tokens)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There's one more key element: &lt;strong&gt;quality_check&lt;/strong&gt;. This is an iteration-level guardrail: after each round it checks whether the agent went out of bounds (e.g. modified files it shouldn't have). With &lt;code&gt;blocking: true&lt;/code&gt;, a round whose check fails doesn't count as complete.&lt;/p&gt;
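
&lt;p&gt;To make the interplay of body, goal, max_iterations, and quality_check concrete, here is a minimal Python sketch of the control flow those fields describe. It is not agent-runbook's implementation (the tool only generates instructions for the agent); &lt;code&gt;run_loop&lt;/code&gt;, &lt;code&gt;LoopResult&lt;/code&gt;, and the callables are hypothetical names. It also assumes that a round with a failed blocking check still uses up one iteration and skips goal evaluation.&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class LoopResult:
    status: str        # "goal_met" or "max_iterations_reached"
    iterations: int    # number of rounds that were started
    history: list = field(default_factory=list)


def run_loop(body, goal, quality_check, max_iterations, blocking=True):
    """Repeat body until goal() is true or max_iterations rounds have run.

    body, goal and quality_check are hypothetical callables standing in for
    the runbook's body steps, goal condition and guardrail rules.
    """
    history = []
    for round_no in range(1, max_iterations + 1):
        body()
        check_ok = quality_check()
        history.append({"round": round_no, "quality_check_passed": check_ok})
        if blocking and not check_ok:
            # Assumption: a blocked round does not count as complete,
            # so the goal is not evaluated for it.
            continue
        if goal():
            return LoopResult("goal_met", round_no, history)
    return LoopResult("max_iterations_reached", max_iterations, history)


# Toy usage: three "failures", one fixed per round.
remaining = {"count": 3}


def fix_one():
    remaining["count"] -= 1


result = run_loop(
    body=fix_one,
    goal=lambda: remaining["count"] == 0,
    quality_check=lambda: True,
    max_iterations=10,
)
print(result.status, result.iterations)  # goal_met 3
```

&lt;p&gt;The point of the sketch is the ordering: the guardrail runs before the goal check, so a round that broke the rules can never end the loop as a success.&lt;/p&gt;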

&lt;h2&gt;
  
  
  Hands-on: Building an Automated Test Fix Loop
&lt;/h2&gt;

&lt;p&gt;Here's a simple example to show you how we use agent-runbook to build an agent loop.&lt;/p&gt;

&lt;p&gt;We're going to build an &lt;strong&gt;automated test fix&lt;/strong&gt; Loop. The loop is simple: the goal is a 100% unit test pass rate. Each iteration has only two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;run_tests&lt;/strong&gt; - run the tests, see which ones are still failing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fix&lt;/strong&gt; - launch a clean context agent to fix the discovered issues&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Beyond that, we also need to define our safety boundary: &lt;strong&gt;max_iterations&lt;/strong&gt;. If you've ever burned through all your tokens with the /goal command, max_iterations is what prevents that.&lt;/p&gt;

&lt;p&gt;Here's the full runbook, defined in structured YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fix-failing-tests&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Iteratively fix all failing tests until the test suite is green&lt;/span&gt;

&lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fix_loop&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;loop&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Run&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tests,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;analyze&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failures,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;fix&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;code,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;repeat&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;until&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;green"&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pytest&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;exits&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;with&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;failures&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;(all&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tests&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pass)"&lt;/span&gt;
    &lt;span class="na"&gt;max_iterations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
    &lt;span class="na"&gt;body&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;run_tests&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;script&lt;/span&gt;
        &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cd&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;examples/fix-loop&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;amp;&amp;amp;&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;python3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-m&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;pytest&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tests/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;--tb=short&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;2&amp;gt;&amp;amp;1&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;tail&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;-60"&lt;/span&gt;
        &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fix&lt;/span&gt;
        &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent&lt;/span&gt;
        &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;Look at the pytest failures from run_tests.&lt;/span&gt;
          &lt;span class="s"&gt;Pick ONE source file that has failing tests and fix the bugs in that file.&lt;/span&gt;

          &lt;span class="s"&gt;Rules:&lt;/span&gt;
            &lt;span class="s"&gt;- Only modify files in src/, NEVER modify test files&lt;/span&gt;
            &lt;span class="s"&gt;- Fix exactly ONE file, then stop immediately&lt;/span&gt;
            &lt;span class="s"&gt;- Do NOT read or modify any other source files&lt;/span&gt;
        &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;run_tests&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
        &lt;span class="na"&gt;quality_check&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;blocking&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
          &lt;span class="na"&gt;rules&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Only&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;files&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;in&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;src/&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;were&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;modified,&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;not&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;files"&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exactly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;one&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;was&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;modified"&lt;/span&gt;

  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;present&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;inline&lt;/span&gt;
    &lt;span class="na"&gt;prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
      &lt;span class="s"&gt;Generate a markdown report summarizing the fix loop results.&lt;/span&gt;
      &lt;span class="s"&gt;Include:&lt;/span&gt;
        &lt;span class="s"&gt;- Total iterations taken&lt;/span&gt;
        &lt;span class="s"&gt;- What was fixed in each iteration (file + bug description)&lt;/span&gt;
        &lt;span class="s"&gt;- Final test results&lt;/span&gt;
        &lt;span class="s"&gt;- How cascading dependencies caused failures to clear automatically&lt;/span&gt;
      &lt;span class="s"&gt;Write the report to fix_report.md&lt;/span&gt;
    &lt;span class="na"&gt;depends_on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;fix_loop&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  From YAML to Executable SKILL.md
&lt;/h2&gt;

&lt;p&gt;Next we need to compile the YAML into a SKILL.md that Claude Code/Codex can directly execute. The generation command is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python3 &lt;span class="nt"&gt;-m&lt;/span&gt; agent_runbook generate runbook.yaml &lt;span class="nt"&gt;-o&lt;/span&gt; output/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated SKILL.md looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;fix-failing-tests&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;gt;-"&lt;/span&gt;
  &lt;span class="s"&gt;Iteratively fix all failing tests until the test suite is green&lt;/span&gt;
&lt;span class="na"&gt;user-invocable&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gu"&gt;## Execution Flow&lt;/span&gt;

&lt;span class="gu"&gt;### Task Context&lt;/span&gt;

Before starting execution, initialize &lt;span class="sb"&gt;`task_context.json`&lt;/span&gt;:

&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;json
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"task_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;task_id from input&amp;gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_step"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"current_step_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"running"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"fix_loop"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"present"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pending"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"updated_at"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;ISO timestamp&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;```&lt;/span&gt;

Update this file after each step completes. On error, set step status to &lt;span class="sb"&gt;`"failed"`&lt;/span&gt; and overall &lt;span class="sb"&gt;`status`&lt;/span&gt; to &lt;span class="sb"&gt;`"failed"`&lt;/span&gt;.

&lt;span class="gu"&gt;### Step 1: fix_loop&lt;/span&gt;

&lt;span class="gs"&gt;**Type:**&lt;/span&gt; loop
&lt;span class="gs"&gt;**Description:**&lt;/span&gt; Run tests, analyze failures, fix source code, repeat until green

&lt;span class="gu"&gt;## Iteration Loop&lt;/span&gt;

&lt;span class="gs"&gt;**Goal:**&lt;/span&gt; pytest exits with 0 failures (all tests pass)
&lt;span class="gs"&gt;**Max Iterations:**&lt;/span&gt; 10
&lt;span class="gt"&gt;
&amp;gt; This step executes as a loop. The body steps repeat until the goal is met or max iterations reached.&lt;/span&gt;

&lt;span class="gu"&gt;## Loop Body (repeats each iteration)&lt;/span&gt;

&lt;span class="gu"&gt;#### Body Step 1: run_tests&lt;/span&gt;

&lt;span class="gs"&gt;**Type:**&lt;/span&gt; script

&lt;span class="gs"&gt;**Execution:**&lt;/span&gt; Execute the following command:
&lt;span class="p"&gt;```&lt;/span&gt;&lt;span class="nl"&gt;bash
&lt;/span&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;examples/fix-loop &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; python3 &lt;span class="nt"&gt;-m&lt;/span&gt; pytest tests/ &lt;span class="nt"&gt;--tb&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;short 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;tail&lt;/span&gt; &lt;span class="nt"&gt;-60&lt;/span&gt;
&lt;span class="p"&gt;```&lt;/span&gt;

&lt;span class="gu"&gt;#### Body Step 2: fix&lt;/span&gt;

&lt;span class="gs"&gt;**Type:**&lt;/span&gt; agent

&lt;span class="gs"&gt;**Execution:**&lt;/span&gt; Launch an independent agent with the following prompt file:

Look at the pytest failures from run_tests.
Pick ONE source file that has failing tests and fix the bugs in that file.

Rules:
&lt;span class="p"&gt;  -&lt;/span&gt; Only modify files in src/, NEVER modify test files
&lt;span class="p"&gt;  -&lt;/span&gt; Fix exactly ONE file, then stop immediately
&lt;span class="p"&gt;  -&lt;/span&gt; Do NOT read or modify any other source files&lt;span class="sb"&gt;


&lt;/span&gt;&lt;span class="gu"&gt;## Goal Evaluation&lt;/span&gt;

After all body steps complete, evaluate:

&lt;span class="gs"&gt;**Goal:**&lt;/span&gt; pytest exits with 0 failures (all tests pass)
&lt;span class="p"&gt;
1.&lt;/span&gt; If goal IS met → mark this step completed, proceed to next step.
&lt;span class="p"&gt;2.&lt;/span&gt; If goal NOT met and iterations remain → reset body steps, start next iteration.
&lt;span class="p"&gt;3.&lt;/span&gt; If max iterations reached → mark step completed with status "max_iterations_reached", report what remains.

Append a summary to &lt;span class="sb"&gt;`iteration_history`&lt;/span&gt; after each iteration.

&lt;span class="gu"&gt;### Progress Tracking&lt;/span&gt;

After completing this step, update &lt;span class="sb"&gt;`task_context.json`&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Set &lt;span class="sb"&gt;`current_step_id`&lt;/span&gt; to &lt;span class="sb"&gt;`"fix_loop"`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Set &lt;span class="sb"&gt;`steps.fix_loop`&lt;/span&gt; to &lt;span class="sb"&gt;`"completed"`&lt;/span&gt;
&lt;span class="gu"&gt;### Step 2: present&lt;/span&gt;

&lt;span class="gs"&gt;**Type:**&lt;/span&gt; inline

&lt;span class="gu"&gt;## Execution&lt;/span&gt;
Follow these instructions:

Generate a markdown report summarizing the fix loop results.
Include:
&lt;span class="p"&gt;  -&lt;/span&gt; Total iterations taken
&lt;span class="p"&gt;  -&lt;/span&gt; What was fixed in each iteration (file + bug description)
&lt;span class="p"&gt;  -&lt;/span&gt; Final test results
&lt;span class="p"&gt;  -&lt;/span&gt; How cascading dependencies caused failures to clear automatically
Write the report to fix_report.md&lt;span class="sb"&gt;


&lt;/span&gt;&lt;span class="gu"&gt;### Progress Tracking&lt;/span&gt;

After completing this step, update &lt;span class="sb"&gt;`task_context.json`&lt;/span&gt;:
&lt;span class="p"&gt;-&lt;/span&gt; Set &lt;span class="sb"&gt;`current_step_id`&lt;/span&gt; to &lt;span class="sb"&gt;`"present"`&lt;/span&gt;
&lt;span class="p"&gt;-&lt;/span&gt; Set &lt;span class="sb"&gt;`steps.present`&lt;/span&gt; to &lt;span class="sb"&gt;`"completed"`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What does the generated SKILL.md contain? It translates the contracts you declared in YAML into execution instructions that the agent can understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;iteration_history&lt;/strong&gt;: requires the agent to record what was done each round and whether the goal was met, forming structured iteration memory&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;goal evaluation&lt;/strong&gt;: the judgment logic after each round — if met then stop, if not met then continue, if limit reached then report&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;progress tracking&lt;/strong&gt;: tracks overall progress through task_context.json, supports checkpoint resume&lt;/li&gt;
&lt;/ul&gt;
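
&lt;p&gt;As a rough illustration of why a status file makes checkpoint resume possible, the sketch below reads and updates a file shaped like the &lt;code&gt;task_context.json&lt;/code&gt; shown above. The helper names &lt;code&gt;next_pending_step&lt;/code&gt; and &lt;code&gt;mark_step&lt;/code&gt; are hypothetical, and treating the key order of &lt;code&gt;steps&lt;/code&gt; as execution order is an assumption; how agent-runbook itself resumes may differ.&lt;/p&gt;

```python
import json
import tempfile
from pathlib import Path


def next_pending_step(context_path):
    """Return the id of the first step that is not completed, or None."""
    context = json.loads(Path(context_path).read_text())
    # Assumption: the order of keys in "steps" is the execution order.
    for step_id, status in context["steps"].items():
        if status != "completed":
            return step_id
    return None


def mark_step(context_path, step_id, status):
    """Record a step's new status and write the file back to disk."""
    path = Path(context_path)
    context = json.loads(path.read_text())
    context["steps"][step_id] = status
    context["current_step_id"] = step_id
    if status == "failed":
        # Mirrors the generated instruction: a failed step fails the task.
        context["status"] = "failed"
    path.write_text(json.dumps(context, indent=2))


# Usage: after fix_loop completes, a restarted run would pick up at "present".
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "task_context.json"
    path.write_text(json.dumps({
        "status": "running",
        "current_step_id": None,
        "steps": {"fix_loop": "pending", "present": "pending"},
    }))
    mark_step(path, "fix_loop", "completed")
    resume_from = next_pending_step(path)

print(resume_from)  # present
```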

&lt;h2&gt;
  
  
  Running It: 3-Round Convergence
&lt;/h2&gt;

&lt;p&gt;Now we can trigger this skill to run in Claude Code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbf24jz6prqh3s9g2ct7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbf24jz6prqh3s9g2ct7.jpg" alt=" " width="798" height="63"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The run included three iterations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iteration 1: calculator fix → 6 failures disappeared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl30tsy6nq760kxxi0lir.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl30tsy6nq760kxxi0lir.jpg" alt=" " width="800" height="678"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iteration 2: validator fix → 5 failures disappeared&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17nybu6xwiyavmjr66gp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17nybu6xwiyavmjr66gp.jpg" alt=" " width="800" height="782"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Iteration 3: formatter fix → all green&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e66qs5t66wp2r6cwz8s.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1e66qs5t66wp2r6cwz8s.jpg" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finally, the run produced fix_report.md, the report we defined earlier in the runbook's &lt;code&gt;present&lt;/code&gt; step.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Key Points for Designing a Good Loop
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Choose the right task.&lt;/strong&gt; Not all tasks are suitable for loops. A good loop task has two characteristics: objective feedback signals (test results, lint output, whether compilation passes), and the ability to make incremental progress building on the previous round. Fixing tests, code migration, and performance optimization are all good candidates. Tasks requiring one-time creative decisions (architecture choices, naming) are not suitable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Write the goal as a decidable end state.&lt;/strong&gt; "pytest exit 0" is a good goal, "better code quality" is not. The agent must be able to determine true or false on its own through tool output, otherwise the loop never knows whether it should stop.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Keep the body in an "observe → act" rhythm.&lt;/strong&gt; First use script steps to see the current state clearly (run tests, run lint), then use agent steps to make decisions and modifications. Don't let a single step observe, act, and verify all at once. Split them up so each step has a clear responsibility; when something goes wrong, it's easier to locate.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Leave an exit for failure.&lt;/strong&gt; max_iterations is not the number of rounds you expect, but a safety valve for "exceeding this number means the approach has a problem". A normal loop should converge well below the upper limit. If it maxes out, it means the goal is too hard or the body design has flaws, and human intervention is needed.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
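
&lt;p&gt;A "decidable end state" can be as small as an exit-code check. The sketch below is one way to express it in Python; &lt;code&gt;goal_met&lt;/code&gt; is a hypothetical helper, and the two stand-in commands exist only so the example runs without a test suite.&lt;/p&gt;

```python
import subprocess
import sys


def goal_met(command):
    """Return True when the command exits with status 0."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode == 0


# Stand-in commands so the sketch runs anywhere. In the test-fix loop the
# command would be something like [sys.executable, "-m", "pytest", "tests/"].
passing = [sys.executable, "-c", "raise SystemExit(0)"]
failing = [sys.executable, "-c", "raise SystemExit(1)"]

print(goal_met(passing), goal_met(failing))  # True False
```

&lt;p&gt;One caveat if you check exit codes yourself: in a default shell, a pipeline such as &lt;code&gt;pytest ... | tail -60&lt;/code&gt; reports the exit status of &lt;code&gt;tail&lt;/code&gt;, not pytest, so the check should run pytest without the pipe.&lt;/p&gt;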

&lt;h2&gt;
  
  
  agent-runbook: More Than Just Loops
&lt;/h2&gt;

&lt;p&gt;For the AI product I'm developing, I frequently need to write long-running DevOps skills for SREs that are as error-free as possible.&lt;/p&gt;

&lt;p&gt;During debugging I often encounter two types of problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One is &lt;strong&gt;agents not following instructions&lt;/strong&gt; — you tell it to only restart the service, and it goes ahead and changes the configuration too.&lt;/li&gt;
&lt;li&gt;The other shows up in complex multi-step skills: &lt;strong&gt;agents not collaborating according to the established norms&lt;/strong&gt;, where the next step never reads the previous step's output, or reads it but the format is wrong.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on these problems, I developed agent-runbook: &lt;strong&gt;a contract-based skill generation tool, where the generated SKILL.md can be directly used as a skill integrated into the Claude Code/Codex ecosystem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Its core philosophy is: use contracts to constrain agent collaboration, instead of relying on prompts and hoping for the best.&lt;/p&gt;

&lt;p&gt;This table gives you a quick sense of how agent-runbook differs from /goal:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;/goal&lt;/th&gt;
&lt;th&gt;agent-runbook&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Per-round structure&lt;/td&gt;
&lt;td&gt;Agent does whatever it wants&lt;/td&gt;
&lt;td&gt;Body declaratively defines each round's steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Iteration constraints&lt;/td&gt;
&lt;td&gt;None, only a termination condition&lt;/td&gt;
&lt;td&gt;quality_check guardrails, out-of-bounds doesn't count as complete&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inter-step communication&lt;/td&gt;
&lt;td&gt;Relies on LLM context passing&lt;/td&gt;
&lt;td&gt;JSON Schema files, inspectable, parallel-readable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error recovery&lt;/td&gt;
&lt;td&gt;Start over&lt;/td&gt;
&lt;td&gt;Checkpoint &amp;amp; Resume, pick up from where it crashed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Build-time checks&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;DAG cycle detection, schema reference validation, contract closure checks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reusability&lt;/td&gt;
&lt;td&gt;Gone once you type it&lt;/td&gt;
&lt;td&gt;Commit to repo, anyone can run it with the same behavior&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Loop is a step type added on top of this foundation&lt;/strong&gt; — when your task requires iteration, use the same contract-based approach to define the loop's body, goal, and constraints.&lt;/p&gt;

&lt;p&gt;You don't have to start from scratch either. &lt;a href="https://github.com/KnoxOps/open-devops-skills" rel="noopener noreferrer"&gt;open-devops-skills&lt;/a&gt; is a production-grade DevOps skill library built on agent-runbook, currently featuring infrastructure/cloud resource cost optimization skills, with more DevOps scenarios to be expanded in the future. You can use them directly, or use them as reference for designing your own skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's also worth mentioning that agent-runbook itself is not limited to DevOps&lt;/strong&gt;. Any scenario requiring multi-step orchestration, inter-agent collaboration, and long-term reliable operation is suitable — code migration, security auditing, documentation generation, data pipeline validation. As long as your task can be broken down into "steps + contracts + dependencies", it can be expressed with a runbook.&lt;/p&gt;
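
&lt;p&gt;To make the "steps + dependencies" idea and the build-time DAG check concrete, here is a minimal Python sketch. The step names and the dictionary layout are invented for illustration and are not agent-runbook's actual file format; the ordering and cycle detection come from the standard library's &lt;code&gt;graphlib&lt;/code&gt; module (Python 3.9+).&lt;/p&gt;

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical runbook: each step maps to the set of steps it depends on.
runbook = {
    "collect_costs": set(),
    "analyze": {"collect_costs"},
    "report": {"analyze"},
}


def execution_order(steps):
    """Return a valid run order, or raise ValueError if the graph has a cycle."""
    try:
        return list(TopologicalSorter(steps).static_order())
    except CycleError as err:
        # err.args[1] holds the nodes that form the detected cycle.
        raise ValueError(f"dependency cycle: {err.args[1]}") from err


print(execution_order(runbook))
```

&lt;p&gt;Running a check like this before any agent starts is what lets a broken dependency graph fail at build time rather than halfway through a long run.&lt;/p&gt;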

&lt;p&gt;The repo is at &lt;a href="https://github.com/KnoxOps/agent-runbook" rel="noopener noreferrer"&gt;github.com/KnoxOps/agent-runbook&lt;/a&gt;; feel free to try it out and give feedback. If you have a workflow where you keep prompting agents by hand, try writing it as a runbook. Once it becomes a contract, you'll find that the cost of debugging and reuse drops significantly.&lt;/p&gt;

</description>
      <category>loopengineering</category>
      <category>claudecode</category>
      <category>ai</category>
      <category>aiskill</category>
    </item>
    <item>
      <title>I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.</title>
      <dc:creator>paul_h</dc:creator>
      <pubDate>Mon, 15 Jun 2026 07:13:09 +0000</pubDate>
      <link>https://dev.to/paul_knoxops/i-asked-claude-to-map-my-infrastructure-then-i-asked-a-purpose-built-tool-51jp</link>
      <guid>https://dev.to/paul_knoxops/i-asked-claude-to-map-my-infrastructure-then-i-asked-a-purpose-built-tool-51jp</guid>
      <description>&lt;p&gt;I manage a small stack. Three Linux VMs, one Kubernetes cluster, maybe 20-something services total. Not big. But underdocumented — the kind of environment where you SSH in and discover things you forgot were running.&lt;/p&gt;

&lt;p&gt;Last week I ran the same task through two different AI tools: "tell me what's running, how it connects, and what looks risky." One is a general-purpose LLM (Claude). The other is a purpose-built AI SRE tool. Same environment, same ask. The results were... instructive.&lt;/p&gt;

&lt;h2&gt;
  
  
  The task
&lt;/h2&gt;

&lt;p&gt;Simple brief: infrastructure discovery. I want a full picture — services, dependencies, topology, risks. The kind of thing a new hire would spend their first week piecing together from wikis that haven't been updated since 2023.&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude Code (Opus model)
&lt;/h2&gt;

&lt;p&gt;My prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"I manage a small infrastructure — 3 Linux VMs (172.30.0.41, 172.30.0.42, 172.30.0.43) and a Kubernetes cluster. SSH access is already configured. Help me understand what's running across this environment — I want a full picture of my services, dependencies, and topology."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I'm running Claude Code locally with the Opus model, Anthropic's flagship tier. Claude didn't ask questions. It just started SSH-ing in.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov7crtomi3yzwvxq501h.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fov7crtomi3yzwvxq501h.jpg" alt="Claude exploring hosts via SSH: ss, systemctl, kubectl across all three VMs" width="800" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Five minutes later it handed me a report. And honestly? It was better than I expected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe97qehhz67494t3vq7y.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxe97qehhz67494t3vq7y.jpg" alt="Claude's final output — ASCII topology plus service inventory" width="800" height="1991"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What Claude delivered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Identified all three VM roles correctly (API Gateway, Order Processing, Data Tier)&lt;/li&gt;
&lt;li&gt;Drew an ASCII topology showing Nginx routing to backend services with canary weights&lt;/li&gt;
&lt;li&gt;Built a full service table — host, port, tech stack, notes&lt;/li&gt;
&lt;li&gt;Mapped the Redis Sentinel cluster including a stale replica on a decommissioned node&lt;/li&gt;
&lt;li&gt;Enumerated every K8s namespace and workload&lt;/li&gt;
&lt;li&gt;Traced the observability pipeline (node_exporter → Prometheus, OTel → Jaeger, Datadog agents)&lt;/li&gt;
&lt;li&gt;Flagged four real issues: dead Redis replica, broken image pulls in aigc-app, active canary split, multiple knoxd versions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five minutes. No hand-holding. For a "quick, what's running here?" sweep, this is genuinely useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where it stops
&lt;/h2&gt;

&lt;p&gt;Here's what I noticed after the initial "wow, that was fast" wore off.&lt;/p&gt;

&lt;p&gt;The output is a wall of markdown. Accurate, mostly. But flat. Everything has the same weight — a critical single-point-of-failure sits next to a cosmetic naming inconsistency. No severity. No priority.&lt;/p&gt;

&lt;p&gt;More specifically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No topology visualization.&lt;/strong&gt; I got an ASCII diagram. It's readable for 6 machines. At 60 machines, it's unreadable. At 600, impossible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No business grouping.&lt;/strong&gt; Claude listed every service but couldn't tell me which ones form the e-commerce flow vs. the logistics flow vs. the platform layer. That requires domain context it doesn't have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No risk assessment.&lt;/strong&gt; Four issues found, but no severity classification. The dead Redis replica and the cosmetic knoxd naming thing are presented with equal weight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No quality gate.&lt;/strong&gt; Nobody verified whether Claude's topology was actually correct. It connected things confidently — but was the canary weight really 90/10? I'd need to go check.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No persistence.&lt;/strong&gt; Close the chat window. The report is gone. Tomorrow I'd run it again and get a slightly different exploration path, slightly different findings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No depth control.&lt;/strong&gt; I can't say "that Business Island looks risky, go deeper on it." It's all-or-nothing.&lt;/p&gt;

&lt;p&gt;This maps to a pattern I keep seeing across industries. In legal tech, people noticed the same thing — general LLMs are good at summarizing contracts but can't do precision clause verification. In finance, ChatGPT can describe how to post a journal entry but can't actually post one. The dividing line is consistent: general AI is a thinking tool; specialized AI is an acting tool.&lt;/p&gt;

&lt;p&gt;When the task is "reason about this data and explain it to me" — general tools are great. When the task shifts to "build a structured, persistent, verifiable model of my environment" — you've crossed into territory they weren't designed for.&lt;/p&gt;

&lt;h2&gt;
  
  
  Purpose-built tool, same task
&lt;/h2&gt;

&lt;p&gt;For comparison, here's what happens when I send one line to Knox (our purpose-built AI SRE tool — yes, this is our product, stating that upfront):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Run a full infrastructure discovery on our production environment."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Shorter prompt. No need to explain the environment — it already has connectors configured.&lt;/p&gt;

&lt;p&gt;Twenty minutes later:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkydq0x5to1p2nnax3i6n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkydq0x5to1p2nnax3i6n.png" alt="Knox service topology — interactive graph, not ASCII art" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvyibi1tbuuqf4ul13zr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftvyibi1tbuuqf4ul13zr.png" alt="Business Islands: services grouped by business function, with criticality" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym3v1qfy6rlt9cz6n131.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fym3v1qfy6rlt9cz6n131.png" alt="Knox configuration drift report with severity ranking" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The differences that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Visual topology&lt;/strong&gt; — not ASCII art, an interactive service relationship graph&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Business Islands&lt;/strong&gt; — services auto-grouped by business function with criticality labels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Risk Triage&lt;/strong&gt; — findings ranked by severity with a distribution chart&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt; — results stored in a graph database, queryable later&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Depth on demand&lt;/strong&gt; — "Deep Analysis Available" button for any Business Island&lt;/li&gt;
&lt;/ul&gt;
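
&lt;p&gt;To show what severity ranking adds over a flat list, here is a small Python sketch applied to the four issues Claude flagged. The severity labels are assigned by hand purely for illustration; they are not output from Knox or Claude, and the function is not Knox's implementation.&lt;/p&gt;

```python
from collections import Counter

# Severity scale, most urgent first (an assumed scale for this sketch).
SEVERITY_ORDER = ["critical", "high", "medium", "low"]

# The four flagged issues, with hypothetical hand-assigned severities.
findings = [
    {"issue": "multiple knoxd versions", "severity": "low"},
    {"issue": "dead Redis replica", "severity": "high"},
    {"issue": "broken image pulls in aigc-app", "severity": "high"},
    {"issue": "active canary split", "severity": "medium"},
]


def triage(findings):
    """Return findings sorted by severity plus a count per severity level."""
    ranked = sorted(findings, key=lambda f: SEVERITY_ORDER.index(f["severity"]))
    distribution = Counter(f["severity"] for f in findings)
    return ranked, distribution


ranked, distribution = triage(findings)
for finding in ranked:
    print(finding["severity"], finding["issue"])
print(dict(distribution))
```

&lt;p&gt;The sort itself is trivial. The hard part is assigning severities consistently, which is where a flat markdown report leaves the work to the reader.&lt;/p&gt;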

&lt;p&gt;How it got there — a team of agents, not a single model:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg9zcrhsmhuicz2dlfgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkg9zcrhsmhuicz2dlfgy.png" alt="Captain: confirms scope before dispatching specialists" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi9igjjx9e55dj2vng9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi9igjjx9e55dj2vng9w.png" alt="Specialists collaborating — Architect plans, Collector scans" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5huotd02a75z4cfd64r5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5huotd02a75z4cfd64r5.png" alt="Supervisor — independently cross-checks the findings" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb9s0zrgutwhi2hpfjho.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flb9s0zrgutwhi2hpfjho.png" alt="Final review — 12 verified, 9 uncertain items flagged for human review" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the work process, not a deliverable. Multiple specialized agents collaborated — one coordinated the task, one did the actual discovery, one quality-checked the findings — flagging 9 uncertain items for human review instead of presenting everything with equal confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scale question
&lt;/h2&gt;

&lt;p&gt;We ran this on 5-6 machines. The gap is already visible. But this is the minimum-gap scenario.&lt;/p&gt;

&lt;p&gt;At 60 servers across multiple environments, Claude's context window fills up. You'd need multiple sessions, manual stitching, and the "flat markdown" problem becomes unbearable. The gap doesn't grow linearly — it compounds.&lt;/p&gt;

&lt;p&gt;That's not a knock on Claude. A Swiss Army knife is great. But when you need surgery, you reach for a scalpel.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;What does your environment look like? At what scale did you find general AI tools hitting their ceiling for ops work?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you want to try the purpose-built approach: &lt;a href="https://knoxops.app/?invite_token=DEVTO26" rel="noopener noreferrer"&gt;knoxops.app&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>devops</category>
      <category>sre</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Agentic Ops: How I Shipped My Vibe-Coded Game to Production</title>
      <dc:creator>paul_h</dc:creator>
      <pubDate>Sat, 30 May 2026 07:31:08 +0000</pubDate>
      <link>https://dev.to/paul_knoxops/agentic-ops-how-i-shipped-my-vibe-coded-code-to-production-1mgk</link>
      <guid>https://dev.to/paul_knoxops/agentic-ops-how-i-shipped-my-vibe-coded-code-to-production-1mgk</guid>
      <description>&lt;p&gt;Over the weekend, I vibe coded a cooking game. You combine random ingredients, and the game generates a dish with a score and a snarky review — stuff like "This tastes like regret and too much butter." I'd wanted to build this for a while. Eventually I'll hook it up to an AI model to generate more combinations and even harsher critiques.&lt;/p&gt;

&lt;h2&gt;
  
  
  One Prompt, One Hour
&lt;/h2&gt;

&lt;p&gt;I opened Claude Code and typed a single prompt:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Create a cooking game where players combine ingredients to discover recipes..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;An hour of coding and debugging later, I had a working version running on localhost.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ybarf7apbxmfexzylul.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ybarf7apbxmfexzylul.jpg" alt=" " width="800" height="521"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wall
&lt;/h2&gt;

&lt;p&gt;Then came the real problem: deploying it so my friends could actually play.&lt;/p&gt;

&lt;p&gt;AI has collapsed the barrier to building software. But shipping is a different story: even the most seasoned SRE can't rattle off HTTPS configs, domain setups, and nginx routing rules from memory. As a vibe coder, what was I supposed to do next?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Plan
&lt;/h2&gt;

&lt;p&gt;I spun up an AWS VM, installed a Knox Daemon (Knox is an AIOps product), and connected it to my GitHub repo. Then I told it:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How I Shipped My Vibe-Coded Code to Production"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It started exploring my codebase. It discussed the task with me, asked clarifying questions, and came back with a full staged plan: pre-checks, building the game, requesting certificates, updating nginx routes, final verification, and documenting what it learned for next time. Nothing would execute until I approved it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzee0pkyc0l84dht2jg3x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzee0pkyc0l84dht2jg3x.jpg" alt=" " width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Execution
&lt;/h2&gt;

&lt;p&gt;I reviewed the plan and hit approve. The agents kicked off in parallel — one checking the environment, one executing changes, another validating the output of each stage. They ran efficiently, every step visible. It looked exactly like a human SRE team at work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w068oa6lzjst4qj3jz4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3w068oa6lzjst4qj3jz4.jpg" alt=" " width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When it was done, the agent handed me a report. I clicked the URL in the report and — there it was. My game. Live. Someone could play it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8wd4k6d29u6yqthc9av.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8wd4k6d29u6yqthc9av.jpg" alt=" " width="799" height="634"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ws6p98fs1gzg863m8e1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3ws6p98fs1gzg863m8e1.jpg" alt=" " width="800" height="439"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  30 Minutes
&lt;/h2&gt;

&lt;p&gt;I was doing other things throughout the deployment, so I wasn't always quick to respond when the agent needed input — requirement discussions, plan approval, execution confirmations on my AWS box. Total time from start to live: about an hour. If I'd been fully focused, probably 30 minutes.&lt;/p&gt;

&lt;p&gt;The whole experience was striking. More and more people are building things in the AI era. They think about product design and development, but then what? How do you deploy? How do you keep the service running?&lt;/p&gt;

&lt;p&gt;I think this is what agentic ops means.&lt;/p&gt;

&lt;p&gt;Agentic ops answers those questions the way vibe coding answers the build question: describe what you want, and an agent operates the server. Same loop as vibe coding. The output just isn't code anymore; it's a running service.&lt;/p&gt;

&lt;p&gt;The endpoint of vibe coding shouldn't be localhost:3000. It should be a link you can drop in a group chat.&lt;/p&gt;

</description>
      <category>vibecoding</category>
      <category>devops</category>
      <category>sre</category>
      <category>aiops</category>
    </item>
    <item>
      <title>AI Agents Mapped My Legacy Production Environment in One Hour.</title>
      <dc:creator>paul_h</dc:creator>
      <pubDate>Thu, 28 May 2026 03:55:11 +0000</pubDate>
      <link>https://dev.to/paul_knoxops/ai-agents-mapped-my-legacy-production-environment-in-one-hour-it-cost-0-2fnn</link>
      <guid>https://dev.to/paul_knoxops/ai-agents-mapped-my-legacy-production-environment-in-one-hour-it-cost-0-2fnn</guid>
      <description>&lt;p&gt;I inherited a black box.&lt;/p&gt;

&lt;p&gt;Three VMs. A hundred-something microservices. Redis, ClickHouse, MySQL, some homegrown database nobody could name. Kafka and Zookeeper thrown in because of course they were.&lt;/p&gt;

&lt;p&gt;Nobody knew how the services connected. The original team was gone. The architecture lived entirely in oral tradition, and the last person who could recite it had left six months ago.&lt;/p&gt;

&lt;p&gt;This is not a metaphor. This is Tuesday for anyone who's done SRE work long enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setup: 30 seconds, zero footprint
&lt;/h2&gt;

&lt;p&gt;I already had Teleport for daily ops. SSH access, session recording. It worked, I didn't want to break it.&lt;/p&gt;

&lt;p&gt;What I did:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Installed &lt;code&gt;knoxd&lt;/code&gt; on my Teleport proxy (not on the servers)&lt;/li&gt;
&lt;li&gt;Let the AI agent team auto-configure a Teleport connector&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That's it. Nothing new on my production machines. The agents ride the Teleport session I already had, with the permissions I'd already defined.&lt;/p&gt;

&lt;p&gt;Non-invasive — not in the "we promise it's lightweight" sense. In the "there is literally nothing new running on your production machines" sense.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gugbyk4g6kcqrbvyqqf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gugbyk4g6kcqrbvyqqf.jpg" alt="Available connectors, with more coming soon" width="800" height="872"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How it actually works
&lt;/h2&gt;

&lt;p&gt;The agents SSH in through Teleport. Plain SSH commands, same ones you'd type yourself.&lt;/p&gt;

&lt;p&gt;What makes this safe rather than terrifying:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Auto-run&lt;/th&gt;
&lt;th&gt;Requires human approval&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Read-only&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;ps&lt;/code&gt;, &lt;code&gt;ss&lt;/code&gt;, &lt;code&gt;cat /proc/net/tcp&lt;/code&gt;, &lt;code&gt;nginx -T&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mutating&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kill&lt;/code&gt;, &lt;code&gt;systemctl restart&lt;/code&gt;, &lt;code&gt;rm&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The sandbox: strict AST parsing + default-deny whitelist. The agents can look at everything but touch nothing without asking.&lt;/p&gt;
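
&lt;p&gt;To give a feel for default-deny command gating, here is a simplified Python sketch. It is not Knox's sandbox: it tokenizes with the standard library's &lt;code&gt;shlex&lt;/code&gt; as a stand-in for real shell AST parsing, and the allowlist contents and the &lt;code&gt;classify&lt;/code&gt; function are assumptions made up for this example.&lt;/p&gt;

```python
import posixpath
import shlex

# Hypothetical allowlist: command name mapped to permitted argument prefixes.
# An empty tuple means any arguments are accepted.
READ_ONLY_ALLOWLIST = {
    "ps": (),
    "ss": (),
    "cat": ("/proc/",),
    "nginx": ("-T",),
}


def classify(command):
    """Return 'auto-run' or 'needs-approval' for one shell command string."""
    # Substitution syntax can smuggle in arbitrary commands, so refuse it outright.
    if "$" in command or "`" in command:
        return "needs-approval"
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    try:
        tokens = list(lexer)
    except ValueError:  # for example, unbalanced quotes
        return "needs-approval"
    if not tokens:
        return "needs-approval"
    # Operator tokens (pipes, redirects, separators) mean this is not one simple command.
    for token in tokens:
        if all(ch in lexer.punctuation_chars for ch in token):
            return "needs-approval"
    name, args = tokens[0], tokens[1:]
    if name not in READ_ONLY_ALLOWLIST:
        return "needs-approval"  # default deny
    prefixes = READ_ONLY_ALLOWLIST[name]
    # Normalize paths so "/proc/../etc" cannot slip past the prefix check.
    if prefixes and not all(posixpath.normpath(arg).startswith(prefixes) for arg in args):
        return "needs-approval"
    return "auto-run"


print(classify("cat /proc/net/tcp"))
print(classify("systemctl restart nginx"))
```

&lt;p&gt;The design choice worth noting is the direction of the default: anything the checker does not positively recognize falls through to human approval, so a gap in the allowlist costs a prompt instead of an outage.&lt;/p&gt;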

&lt;h2&gt;
  
  
  What the agents discovered
&lt;/h2&gt;

&lt;p&gt;Step 1: OS inventory — kernel, distro, packages. All 3 VMs in parallel.&lt;/p&gt;

&lt;p&gt;Step 2: Process mapping — &lt;code&gt;ps aux&lt;/code&gt;, parsed. Hundreds of processes tagged with binary path, resource footprint, parent-child relationships.&lt;/p&gt;

&lt;p&gt;Step 3: Process → Service resolution&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check name service first&lt;/li&gt;
&lt;li&gt;If unregistered (as most were on this legacy system), infer from install path&lt;/li&gt;
&lt;li&gt;Flag for human confirmation before writing anything back&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The AI doesn't hallucinate service names into your architecture map. It asks.&lt;/p&gt;
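
&lt;p&gt;The resolution order in Step 3 can be sketched in a few lines of Python. The registry contents, the install-root convention, and the &lt;code&gt;resolve_service&lt;/code&gt; function are all hypothetical; the point is only that an inferred name is marked for confirmation and never treated as fact.&lt;/p&gt;

```python
# Hypothetical name-service registry: binary path mapped to a registered service name.
REGISTRY = {"/usr/sbin/nginx": "nginx"}

# Install roots whose first subdirectory usually names the service (an assumption).
INSTALL_ROOTS = ("opt", "srv")


def resolve_service(binary_path, registry):
    """Map a process binary to a service name, marking guesses for human review."""
    if binary_path in registry:
        return {"service": registry[binary_path], "source": "registry", "needs_confirmation": False}
    parts = [part for part in binary_path.split("/") if part]
    if parts[:1] and parts[0] in INSTALL_ROOTS and parts[1:2]:
        guess = parts[1]  # e.g. /opt/billing/bin/server suggests "billing"
    else:
        guess = parts[-1] if parts else "unknown"
    # Inferred names are flagged so nothing is written back without a human saying yes.
    return {"service": guess, "source": "install_path", "needs_confirmation": True}


print(resolve_service("/usr/sbin/nginx", REGISTRY))
print(resolve_service("/opt/billing/bin/server", REGISTRY))
```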

&lt;p&gt;Step 4: Service → Business Island grouping&lt;/p&gt;

&lt;p&gt;A business island = logical grouping by business function (billing, user auth, order processing). The thing that exists in every architect's head but never in any document.&lt;/p&gt;

&lt;p&gt;Step 5: Connection mapping — four evidence sources, cross-referenced:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Source&lt;/th&gt;
&lt;th&gt;What it reveals&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Network connections (&lt;code&gt;ss -tnp&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Live TCP dependencies&lt;/td&gt;
&lt;td&gt;Port 6379 → Redis, port 9092 → Kafka&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Config files&lt;/td&gt;
&lt;td&gt;Declared dependencies&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;kafka.brokers: kafka-01:9092&lt;/code&gt; in YAML&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Access logs&lt;/td&gt;
&lt;td&gt;Actual call patterns&lt;/td&gt;
&lt;td&gt;Who calls whom, how often&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LB configs (nginx)&lt;/td&gt;
&lt;td&gt;Ingress chain&lt;/td&gt;
&lt;td&gt;Domain → LB → real server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Cross-reference. Resolve conflicts. Draw edges.&lt;/p&gt;
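
&lt;p&gt;As a rough illustration of cross-referencing evidence sources, the Python sketch below merges observations into dependency edges. The service names, the observation tuples, and the rule that an edge seen by only one source stays "uncertain" are assumptions for this example, not a description of how Knox resolves conflicts.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical observations as (evidence source, caller, callee).
observations = [
    ("network", "order-api", "redis"),
    ("config", "order-api", "redis"),
    ("config", "order-api", "kafka"),
    ("access_log", "gateway", "order-api"),
    ("lb_config", "gateway", "order-api"),
]


def build_edges(observations):
    """Merge observations into edges; single-source edges stay uncertain."""
    evidence = defaultdict(set)
    for source, caller, callee in observations:
        evidence[(caller, callee)].add(source)
    edges = []
    for (caller, callee), sources in sorted(evidence.items()):
        status = "uncertain" if len(sources) == 1 else "verified"
        edges.append({"from": caller, "to": callee, "sources": sorted(sources), "status": status})
    return edges


for edge in build_edges(observations):
    print(edge)
```

&lt;p&gt;Here the Kafka dependency appears only in a config file, so it is kept but flagged: a declared dependency with no live connection or log traffic behind it may be stale.&lt;/p&gt;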

&lt;p&gt;One hour.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyw0q28c9kd6znd8dv3x.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftyw0q28c9kd6znd8dv3x.jpg" alt=" " width="800" height="769"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What I got
&lt;/h2&gt;

&lt;p&gt;Architecture diagrams — topology maps of each business island, services as nodes, dependencies as edges, data flows labeled. The kind of diagram you'd pay a consultant a week to produce.&lt;/p&gt;

&lt;p&gt;High-risk report:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Single points of failure&lt;/li&gt;
&lt;li&gt;Circular dependencies&lt;/li&gt;
&lt;li&gt;Kafka topics with no visible consumer group&lt;/li&gt;
&lt;li&gt;One Redis instance holding session state for 6 business islands, zero isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Things I needed to know. Things dashboards would never show me.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq40ekxe9nkheh2l3yfn.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frq40ekxe9nkheh2l3yfn.jpg" alt=" " width="800" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The cost
&lt;/h2&gt;

&lt;p&gt;Zero.&lt;/p&gt;

&lt;p&gt;Knox gives free credits on signup. Enough for a small cluster for a long time. No credit card. No trial-that-converts-to-paid. One binary on a jump host.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this matters
&lt;/h2&gt;

&lt;p&gt;Most AIOps tools treat metrics as the final answer. They're not. They're the starting point.&lt;/p&gt;

&lt;p&gt;Real outages hide in blind spots:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System logs nobody tails&lt;/li&gt;
&lt;li&gt;Manual changes nobody tracked&lt;/li&gt;
&lt;li&gt;Config drift APM tools don't see&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To find root cause, you have to log into machines and build an evidence chain. That's what humans do. That's what these agents do.&lt;/p&gt;

&lt;p&gt;Monitoring tells you a metric crossed a threshold. It doesn't tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Service X and Y form a circular dependency that will cascade&lt;/li&gt;
&lt;li&gt;Your session store is a single point of failure for half the platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those aren't metric problems. They're structure problems. LLMs are uniquely good at structure — if you give them a way to see it without breaking anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Safety model
&lt;/h2&gt;

&lt;p&gt;Letting AI touch production should sound terrifying. That's why the guardrails look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AST-parsed command validation — not string matching, actual syntax tree analysis&lt;/li&gt;
&lt;li&gt;Default-deny whitelist — everything blocked unless explicitly allowed&lt;/li&gt;
&lt;li&gt;Human-in-the-loop — any destructive action requires a plan + approval&lt;/li&gt;
&lt;li&gt;Connector model — agents use paths you already trust (Teleport, SSH, AWS, Prometheus)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agents never need their own access path. They never open a new hole in your security posture.&lt;/p&gt;

&lt;p&gt;That's the difference between an agent you'd let near production and one you wouldn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm building
&lt;/h2&gt;

&lt;p&gt;It's called KnoxOps. Core idea: infrastructure is an object graph, not a flat list of resources. Model it that way and LLMs can reason like a senior SRE — tracing dependencies, calculating blast radius, finding what dashboards miss.&lt;/p&gt;

&lt;p&gt;The goal: delegate routine SRE toil so developers can focus on building.&lt;/p&gt;

&lt;p&gt;More connectors coming. The principle stays the same: use the access paths you already trust.&lt;/p&gt;

&lt;p&gt;If you've inherited a system nobody understands — I'd like to hear from you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm the founder of &lt;a href="https://knoxops.app" rel="noopener noreferrer"&gt;KnoxOps&lt;/a&gt;. Currently in open beta — use code DEVTO26 for 10,000 free credits on signup.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devops</category>
      <category>aiops</category>
      <category>sre</category>
      <category>welcome</category>
    </item>
  </channel>
</rss>
