DEV Community: paul_h

Loop Engineering for Ops

paul_h — Fri, 26 Jun 2026 11:11:33 +0000

Loop Engineering is the phrase everyone's been throwing around lately. It comes after Prompt, Context, and Harness Engineering, and pushes agent coordination one step further.

Google's Addy Osmani published a long piece a while back that formalized the whole thing. Worth a read. But before him, Peter Steinberger, founder of OpenClaw, said something that nailed it even harder.

"You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

What he means is, you stop being the person manually driving the agent turn by turn. You build a system. The system runs, inspects, fixes, records. You go from the person turning the wrench to the person who designed the assembly line.

One thing to keep in mind: in operations, safety is everything. Without guardrails and human-in-the-loop, an autonomous loop can cause real damage. I'm not a fan of "fully automated" ops loops, and I'll show you why.

This article walks through building a safe ops loop using a host health check example. Most of the routine ops work we do every day can be packaged this way.

Loop Engineering, Broken Down

If you take Loop Engineering apart, you get six pieces.

Piece	What it does
Automations	Scheduled or conditional triggers. The loop runs itself.
Worktrees	Multiple agents working in parallel, isolated checkouts, no stepping on each other.
Skills	Project knowledge written down so you don't re-explain everything every session.
Connectors	Hooks into real systems — SSH, APIs, databases. Actually executes.
Sub-agents	Separate the builder from the reviewer. The one who wrote it is too nice grading it.
State	Remembers what happened across runs. The agent forgets. Files don't.

Claude Code and Codex ship most of these built-in now. You don't build from scratch.

But there's one piece the tools can't do for you.

Writing down the loop's structure and constraints.

Sure, you can type a one-off command (like Claude Code's /goal). Two problems with that.

Prompts leave gaps. How many steps per round? How do you know it's done? How do steps hand off to each other? What stops it from going off the rails? Nobody forces you to think about these things. You can skip them. With agent-runbook, the framework forces you to answer every one of them. Skip one, and the compiler yells at you.
Not reusable. You run it once, it's gone. Next cluster, next person, start over.

You need something that turns your loop's steps, end conditions, and guardrails into a file. Not a prompt you type and forget — a contract you commit to the repo. Next run, next person, same file, same result.

agent-runbook: Writing Loops as Contracts

agent-runbook is an open-source project on GitHub: github.com/KnoxOps/agent-runbook. Its entire philosophy fits in one sentence: constrain agents with contracts, not prompts and hope.

A few key design decisions.

Contract-first. You write a YAML file that declares your loop's steps, outputs, dependencies, and guardrails. Not a vague "go check the servers" prompt — every step's input/output format, execution order, and boundary conditions are locked down. The compiler validates before generating the skill file: missing schemas, circular dependencies, references to non-existent outputs — all caught at build time. Nothing blows up halfway through a run.
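
To make those build-time checks concrete, here is a minimal sketch of two of them: references to steps that don't exist, and circular dependencies. This is not agent-runbook's actual compiler code. `validate_steps` is a hypothetical helper, and it only assumes each step is a dict with `id` and `depends_on`, as in the YAML format.

```python
from graphlib import CycleError, TopologicalSorter


def validate_steps(steps: list[dict]) -> list[str]:
    """Return step ids in a valid execution order, or raise ValueError."""
    ids = [step["id"] for step in steps]
    known = set(ids)
    if len(known) != len(ids):
        raise ValueError("duplicate step id")

    graph = {}
    for step in steps:
        deps = step.get("depends_on", [])
        missing = [dep for dep in deps if dep not in known]
        if missing:
            raise ValueError(f"step {step['id']!r} depends on unknown step(s): {missing}")
        graph[step["id"]] = set(deps)

    try:
        # static_order() raises CycleError when the graph contains a cycle.
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc
```

The point is that both failures surface before any agent is launched, not in round three of a production run.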

Human in the loop. Ops isn't testing. In staging, the agent can auto-fix everything. In production? No. Write operations stay behind a human gate. The agent can SSH in and investigate. It can draft a plan. It can list the commands. But you decide whether those commands actually run. This isn't a suggestion tucked into a prompt. It's a hard node in the step structure. Not approved, not proceeding.

External memory. The agent's context resets after every turn. Expecting it to remember what happened last round is wishful thinking. The loop stores every round's output, iteration history, and overall progress in files. A status file tracks which steps are done. An iteration log records what got fixed each round. The agent forgets. The files don't.
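
A rough sketch of what that external memory can look like on disk. The `task_context.json` name and its `steps`, `current_step_id`, and `updated_at` fields come from the generated skill file. The helper names and the one-JSON-line-per-round log format are my own assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def mark_step(context_path: Path, step_id: str, status: str) -> dict:
    """Record a step's status in the status file and return the updated context."""
    if context_path.exists():
        context = json.loads(context_path.read_text())
    else:
        context = {"status": "running", "steps": {}}
    context["steps"][step_id] = status
    context["current_step_id"] = step_id
    context["updated_at"] = datetime.now(timezone.utc).isoformat()
    context_path.write_text(json.dumps(context, indent=2))
    return context


def append_iteration(log_path: Path, iteration: int, summary: str) -> None:
    """Append one JSON line per round so the history survives a context reset."""
    with log_path.open("a") as log:
        log.write(json.dumps({"iteration": iteration, "summary": summary}) + "\n")


def next_pending(context: dict, order: list[str]) -> Optional[str]:
    """Return the first step that is not completed, i.e. where to resume."""
    for step_id in order:
        if context["steps"].get(step_id) != "completed":
            return step_id
    return None
```

After a crash, `next_pending` reads the file, not the agent's memory, to decide where to pick up.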

Mandatory guardrails. Writing "don't touch the config" in a prompt is a request. The agent can ignore it. agent-runbook guardrails are enforced checks — after every round, an independent review verifies the agent didn't overstep. Modified files it shouldn't have? Ran commands outside the plan? That round doesn't count as complete.

Built-in brakes. Every loop must declare what "done" looks like, plus a hard iteration cap. The loop stops. It doesn't run forever.

Files over context for step communication. Don't count on the agent remembering what the previous step said. Every step's output follows a JSON Schema. The next step validates as it reads. A field mismatch fails immediately — no silent data corruption.
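
Here is a stripped-down illustration of "validate as it reads". The real contract uses JSON Schema files, which a library such as jsonschema would check. This stdlib-only stand-in just verifies required fields and types, and the field names in `REQUIRED_FIELDS` are hypothetical because the actual schema isn't reproduced in this post.

```python
import json
from pathlib import Path

# Hypothetical required fields; the real definition lives in
# schemas/selected_issue.schema.json.
REQUIRED_FIELDS = {"host": str, "severity": str, "type": str}


def read_step_output(path: Path, required: dict) -> dict:
    """Load a step's JSON output and fail fast on a missing or mistyped field."""
    data = json.loads(path.read_text())
    for field, expected_type in required.items():
        if field not in data:
            raise ValueError(f"{path.name}: missing field {field!r}")
        if not isinstance(data[field], expected_type):
            raise ValueError(
                f"{path.name}: field {field!r} should be {expected_type.__name__}"
            )
    return data
```

A step that reads its input this way stops at the first bad field and never acts on half-valid data.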

Hands-on: A Host Health Check Loop

Real scenario. You've got a fleet of production servers. Regular health checks: disk, memory, load, critical services. When something's wrong, you can't just let the agent fix it automatically — production write operations require human approval.

Each round of this loop runs five steps:

Pick the most critical issue from the inspection results
SSH in, investigate the root cause, draft a fix plan
Present the plan, wait for approval
Execute the approved fix
Re-inspect to confirm the issue is resolved

Here's what it looks like as an Agent Runbook YAML:

name: host-health-loop
description: Ansible health check discovers issues → fix one by one (write ops require human approval) → re-inspect → repeat until all healthy → generate HTML report

input_params:
  - name: inventory
    type: string
    required: false
    description: Path to Ansible inventory file, defaults to ansible/inventory.ini
  - name: playbook
    type: string
    required: false
    description: Path to health check playbook, defaults to ansible/health_check.yml

steps:
  - id: inspect
    type: script
    description: Run Ansible playbook to inspect all hosts and generate issue list
    command: |
      ansible-playbook -i {inventory} {playbook} 2>&1
      mv /tmp/host_issues.json host_issues.json
    output:
      - file: host_issues.json
        schema: schemas/host_issues.schema.json
    depends_on: []

  - id: fix_loop
    type: loop
    description: >
      Fix loop: each round picks the most critical issue, produces a plan,
      waits for human approval, executes the fix, then re-inspects all hosts.
      Repeats until no issues remain or max iterations reached.
    goal: "host_issues.json has total_issues equal to 0 (all hosts healthy)"
    max_iterations: 10
    depends_on: [inspect]
    body:
      - id: select_issue
        type: inline
        description: Select the single most critical issue from host_issues.json
        prompt: |
          Read host_issues.json and schemas/selected_issue.schema.json.

          Pick the SINGLE most critical issue using this priority order:
          1. critical disk (highest priority)
          2. critical service_down
          3. critical memory
          4. critical load
          5. warning disk
          6. warning service_down
          7. warning memory
          8. warning load (lowest priority)

          Write the selected issue to selected_issue.json following the schema.

          If host_issues.json has total_issues == 0, write {"done": true} instead.
        depends_on: []
        output:
          - file: selected_issue.json
            schema: schemas/selected_issue.schema.json

      - id: plan_action
        type: agent
        description: Analyze the selected issue and create a concrete remediation plan (no execution)
        prompt: |
          Read selected_issue.json and schemas/pending_action.schema.json.

          STEP 1 — Investigate: SSH to the target host and check the actual situation.
          Understand the root cause before writing any plan. Do NOT guess or write generic commands.

          STEP 2 — Plan: Based on your findings, write a concrete remediation plan
          to pending_action.json following the schema.

          Do NOT execute anything. Only investigate and write the JSON file.
          If selected_issue.json contains {"done": true}, write {"done": true} to pending_action.json.
        depends_on: [select_issue]
        output:
          - file: pending_action.json
            schema: schemas/pending_action.schema.json

      - id: approve
        type: inline
        description: Human approval gate — present the remediation plan and wait for confirmation
        prompt: |
          Read pending_action.json.

          If it contains {"done": true}, skip this step.

          Otherwise, PRESENT the remediation plan from pending_action.json to the human:

          ---
          ## Awaiting Approval
          Present the host, issue, risk level, and commands from pending_action.json.

          Type "approve" to execute, or "reject" to skip this issue.
          ---

          WAIT for the human to respond. Do NOT proceed without explicit approval.
          If approved, copy pending_action.json to approved_action.json.
          If rejected, write skip_action.json with {"status": "rejected", "reason": "rejected by human"}.
        depends_on: [plan_action]

      - id: execute
        type: agent
        description: Execute the approved remediation
        prompt: |
          Check for approved_action.json. If it exists, read it.

          If skip_action.json exists instead, do NOT execute — write execute_result.json
          following schemas/execute_result.schema.json with status "skipped".

          Execute the remediation: SSH into the target host and run the commands
          listed in approved_action.json.

          Rules:
          - Use the exact commands from the plan, do not improvise
          - If a command fails, do NOT retry
          - After execution, verify the result (e.g., check service status, check disk usage)
          - Write execute_result.json following schemas/execute_result.schema.json

          If pending_action.json had {"done": true}, write {"status": "all_done"} to execute_result.json.
        depends_on: [approve]
        output:
          - file: execute_result.json
            schema: schemas/execute_result.schema.json
        quality_check:
          blocking: true
          rules:
            - "Commands were executed exactly as specified in approved_action.json"
            - "No destructive commands were improvised beyond the plan"
            - "Verification was performed after execution"

      - id: re_inspect
        type: script
        description: Re-run inspection to refresh the issue list
        command: |
          ansible-playbook -i {inventory} {playbook} 2>&1
          mv /tmp/host_issues.json host_issues.json
        output:
          - file: host_issues.json
            schema: schemas/host_issues.schema.json
        depends_on: [execute]

  - id: generate_report
    type: agent
    description: Generate a polished HTML inspection and remediation report
    prompt: |
      Read the final host_issues.json to get the current health status.
      Also read all execute_result.json files from the workspace to understand what was fixed.

      Generate a single, self-contained, beautiful HTML report: health_report.html

      The report should include:
      - Overall health status (ALL CLEAR or ISSUES REMAINING)
      - Summary stats: hosts checked, total issues found and resolved
      - Remediation timeline: each iteration with host, issue, investigation findings,
        commands executed, and before/after comparison
      - Final per-host status

      Design requirements:
      - Dark theme, modern dashboard style
      - CSS grid/flexbox, no external dependencies
      - Mobile-responsive
      - Professional and visually impressive — this is a production report
      - Include all CSS inline in a <style> tag
    depends_on: [fix_loop]
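
In the runbook, select_issue is a prompt, so the agent applies the priority order itself. Written out as code, the same ordering is just a sort key. This is only a sketch of the rule: the `severity` and `type` field names are assumptions, since the schema for host_issues.json isn't shown here.

```python
# Lower rank means higher priority, mirroring the order in the select_issue prompt.
SEVERITY_RANK = {"critical": 0, "warning": 1}
TYPE_RANK = {"disk": 0, "service_down": 1, "memory": 2, "load": 3}


def select_issue(issues: list[dict]) -> dict:
    """Pick the single most critical issue, or signal that nothing is left."""
    if not issues:
        return {"done": True}
    return min(
        issues,
        key=lambda issue: (SEVERITY_RANK[issue["severity"]], TYPE_RANK[issue["type"]]),
    )
```

Any critical issue outranks any warning, and within a severity level disk comes first and load last.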

Write the YAML, compile it with one command:

python3 -m agent_runbook generate runbook.yaml -o output/

Drop it into Claude Code or Codex, and it runs.

The generated SKILL.md is about 250 lines. Here's a condensed version of the key sections (full version here):

---
name: host-health-loop
description: Ansible health check → fix → re-inspect → repeat until all healthy → HTML report
user-invocable: true
---

## Execution Flow

### Task Context
Initializes task_context.json to track every step's status. Update after each step.
Crash mid-run? Resume from the last completed step.

### Step 1: inspect
**Type:** script
Runs ansible-playbook to inspect all hosts, produces host_issues.json.

### Step 2: fix_loop
**Type:** loop
**Goal:** host_issues.json total_issues equals 0
**Max Iterations:** 10

Each round:
1. select_issue — pick the most critical issue by priority
2. plan_action — SSH investigation, draft a fix plan (no execution)
3. approve — present the plan, wait for human approval
4. execute — run approved commands, quality_check verifies boundaries
5. re_inspect — re-run inspection, refresh the issue list

## Goal Evaluation
1. Goal met → mark complete, proceed to next step
2. Goal not met, iterations remain → start next round
3. Max iterations reached → mark complete, report remaining issues

Append to iteration_history after each round.

### Step 3: generate_report
**Type:** agent
Reads final inspection results and fix records. Generates a dark-themed dashboard HTML report.

...

Here's the thing: that YAML you just read is literally declaring all six pieces of Loop Engineering. You declare it. Claude Code or Codex runs it.

What you declare in YAML	What Claude Code / Codex does automatically
Who runs each step, what they do, in what order	Spawns agents in declared order, each in an isolated worktree, cleaned up after
Output format requirements for each step	Validates outputs after each step. Wrong format? Immediate error.
How to connect external systems (SSH, Ansible, APIs)	Connects via MCP to real servers. Actually executes.
What "done" means, max rounds	Evaluates goal after every round. Hits the cap? Stops. No wasted tokens.
Approval required, guardrails enforced	Shows write commands. Waits for your approval. Checks boundaries after every round. Overstep? Round doesn't count.

Five Rounds, All Green

This example was tested on three bare-metal machines. The initial inspection found multiple issues spread across all three hosts.

Round one hit eval-bare-vm-3. Root disk at 93%, only 1.4G left. The agent SSH'd in and found the culprits: a Docker JSON log at 5.4G, 4.5G of temp files in /tmp, 215M of application logs, ~800M of container cache. After approval, it cleaned up — truncated the Docker log, cleared /tmp, cleaned apt cache, vacuumed journald to 50M. Freed up about 10G. Disk dropped from 93% to 38%.

The next few rounds were nginx and docker down across the three machines — services someone had manually stopped and never brought back. Each round: investigate → plan → approve → execute → verify. Five rounds total. All green. Loop stopped on its own.

The final step generated the dark-themed dashboard HTML report.

Notice every single write operation went through approval. No automatic execution. A human saw every command before it ran. In production, you don't skip this.

How to Design a Good Ops Loop

Pick the right task. Good ops loops have objective feedback signals — inspection results, metrics, health check endpoints. Each round can build on the last. Host health checks, certificate expiry scans, K8s pod restart loops, Prometheus alert storm classification, log anomaly pattern matching — all solid candidates. Capacity planning and architecture changes need global judgement. Don't shove those into a loop.

Write "done" so a machine can decide. "Issue list is empty" is a good end condition. "The cluster is healthy" is not. The agent has to read a file or run a command and get a true/false answer. When you're writing the condition, ask yourself: can a script determine this in one line? If not, your agent probably can't either.
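
For the host health loop, the one-line test looks roughly like this. A minimal sketch: `total_issues` is the field named in the runbook's goal, and `goal_met` is a hypothetical helper name.

```python
import json
from pathlib import Path


def goal_met(issues_path: Path) -> bool:
    """True when the latest inspection reports zero issues."""
    return json.loads(issues_path.read_text())["total_issues"] == 0
```

There is no judgement call in there, which is exactly what makes it a usable end condition.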

Pass data through files, not memory. A step's output is the next step's input, format-constrained. A mistake gets caught immediately. Agents have terrible memory — long context, they forget. Files don't.

Write operations need a human gate. Staging can be fully automated. Production cannot. The approval step is the safety valve for the entire loop. Write it into the contract. A hundred times more reliable than writing it in a prompt.
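
The file hand-off behind the gate can be sketched like this. The file names and the rejection payload match the runbook above. Two things are my own additions and not something the runbook states: anything other than an explicit "approve" counts as a rejection, and leftovers from earlier rounds are deleted first so a stale approval can't be reused.

```python
import json
import shutil
from pathlib import Path


def record_decision(workdir: Path, decision: str) -> str:
    """Turn the human's answer into the file the execute step looks for."""
    pending = workdir / "pending_action.json"
    approved = workdir / "approved_action.json"
    skipped = workdir / "skip_action.json"

    # Clear files from earlier rounds before recording this round's decision.
    approved.unlink(missing_ok=True)
    skipped.unlink(missing_ok=True)

    if decision.strip().lower() == "approve":
        shutil.copyfile(pending, approved)
        return "approved"

    # Fail closed: unclear or negative answers never lead to execution.
    skipped.write_text(json.dumps({"status": "rejected", "reason": "rejected by human"}))
    return "rejected"
```

The execute step then only has to check which of the two files exists.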

Set a ceiling. Max iterations isn't your target — it's the circuit breaker that says "something's wrong." A healthy loop converges well below the limit. If you hit the cap, some issue can't be fixed or the inspection keeps false-flagging. Time for a human.
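
The control flow is small enough to sketch. This is not how Claude Code or Codex drive the loop internally, just the shape of it: `run_round` and `goal_met` are hypothetical callables, and the "max_iterations_reached" status is the one the generated skill file uses.

```python
from typing import Callable


def run_loop(
    run_round: Callable[[int], None],
    goal_met: Callable[[], bool],
    max_iterations: int,
) -> dict:
    """Run rounds until the goal holds or the iteration cap trips."""
    for iteration in range(1, max_iterations + 1):
        if goal_met():
            return {"status": "completed", "iterations": iteration - 1}
        run_round(iteration)
    # Cap reached: check once more, then hand the problem to a human.
    status = "completed" if goal_met() else "max_iterations_reached"
    return {"status": status, "iterations": max_iterations}
```

A result of "max_iterations_reached" is the signal to stop and look, not to raise the cap.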

Wrapping Up

agent-runbook is intentionally lightweight. It's not a full Loop Engineering implementation. It does one thing: writes your loop structure as a declarative file. Claude Code or Codex handles the rest.

You don't need to start from scratch. The examples directory has the full host health check loop — runbook file, inspection scripts, actual fix history from real runs.

Not just host health checks. Certificate expiry scans, K8s node health checks, log archival cleanup, database backup verification, middleware config compliance checks — if your ops task breaks down into "steps + contracts + dependencies," it can be a loop.

Repo at github.com/KnoxOps/agent-runbook. If you've got routines where you SSH into servers, check for problems, and fix them by hand — try writing that flow as a declaration and let the tool run it. It'll be more disciplined than you are.

AI Scanned My Infra — 67% Were Dead Weight on My AWS Bill

paul_h — Tue, 23 Jun 2026 08:53:32 +0000

Last weekend I used AI to scan my AWS account for idle resources. Here's what I found:

Scanned 3 EC2 instances, flagged 2 as suspicious. One of them was running an entire microservice stack — with zero business traffic.

The whole thing was done by 10 AI agents working together. I wasn't typing commands in a terminal. I defined a process contract, then let go.

Let me start with what was discovered.

Discovery: 3 EC2 Instances Scanned, 2 Suspicious

The scope was small: 3 t3.xlarge EC2 instances in the us-east-1 region. No Prometheus or Datadog. ICO, the agent skill pipeline I used for this scan, relied on SSH to grab real-time snapshots and process details.

After the first round of scoring and screening, .42 was excluded — someone had logged in 26 days ago, within the 30-day active threshold. A live machine. Two remained:

Resource	zombie_score	Level	CPU	Network	Last Login
ec2-172.30.0.41	0.35	LOW	6.4%	1.82 GB/day	41 days ago
ec2-172.30.0.43	0.35	LOW	8.4%	1.18 GB/day	99 days ago

The three-signal filter is straightforward: CPU daily avg > 20% = active, network > 2 GB/day = active, human login within 30 days = active. A resource becomes a candidate only when all three signals are inactive. .42 was ruled out by the login signal. .41 and .43 tripped none of the three, but their network traffic was close to the threshold (1.82 and 1.18 GB/day), so they only scored 0.35.
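
The filter itself fits in a few lines. A minimal sketch using the thresholds quoted above: the `Instance` fields and function names are made up for illustration, treating a login exactly 30 days ago as active is my assumption, and it does not reproduce how ICO computes zombie_score, which this post doesn't spell out.

```python
from dataclasses import dataclass

CPU_ACTIVE_PCT = 20.0            # daily average CPU above this counts as active
NETWORK_ACTIVE_GB_PER_DAY = 2.0  # daily traffic above this counts as active
LOGIN_ACTIVE_DAYS = 30           # a human login this recent counts as active


@dataclass
class Instance:
    name: str
    cpu_daily_avg_pct: float
    network_gb_per_day: float
    days_since_login: int


def active_signals(inst: Instance) -> list[str]:
    """List which of the three signals mark the instance as active."""
    signals = []
    if inst.cpu_daily_avg_pct > CPU_ACTIVE_PCT:
        signals.append("cpu")
    if inst.network_gb_per_day > NETWORK_ACTIVE_GB_PER_DAY:
        signals.append("network")
    if inst.days_since_login <= LOGIN_ACTIVE_DAYS:
        signals.append("login")
    return signals


def is_candidate(inst: Instance) -> bool:
    """An instance is a zombie candidate only when no signal is active."""
    return not active_signals(inst)


# The two rows from the scorecard table.
fleet = [
    Instance("ec2-172.30.0.41", 6.4, 1.82, 41),
    Instance("ec2-172.30.0.43", 8.4, 1.18, 99),
]
candidates = [inst.name for inst in fleet if is_candidate(inst)]
```

Both table rows come out as candidates, while a machine with a login 26 days ago would be dropped by the login signal alone.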

My first instinct was to skip them. A score of 0.35, LOW label — not worth the time. But ICO's process doesn't let you draw conclusions at this stage. Scoring is just a coarse filter. The next phase is deep scanning, and it requires human confirmation to proceed.

I selected both. Deep scan it is.

Deep Scan: What Is a 0.35-Score Instance Actually Running?

The deep scan phase launched agents that SSH'd in concurrently. Each machine was checked across 14 signals, including the process table, listening ports, crontab, systemd timers, disk usage, external connections, and real-time traffic topology.

The deep scan result for ec2-172.30.0.41 came back as 317 lines of JSON. Here are the key findings:

What was running:

Nginx reverse proxy (:80), routing to multiple backends
Redis 7.0 MASTER (:6379), read-write mode, bound to 0.0.0.0
Redis Sentinel (:26379), in a cluster across three machines
Nacos standalone (:8848), Java process eating 512MB RAM
inventory-service (:8081) and warehouse-service (:8083), two Python HTTP services
Full Datadog agent stack (6 processes)
Docker installed but zero running containers

Traffic topology:

Redis Sentinel interconnection with ec2-172.30.0.43 (bidirectional, a few Kbps)
Redis client connections from 172.30.0.25 (8+ connections)
Datadog agent continuously sending metrics outbound

Looking at this report, you wouldn't think this machine is a zombie. Redis cluster, Nacos service registry, two Python services, Nginx reverse proxy — this looks like a full microservice setup.

But look closer at the traffic topology. Every connection is either the Datadog agent reporting out or Redis traffic between internal hosts. No real business traffic is coming in. All services are running, but nobody is using them.

That's the truth about this instance: an abandoned microservice setup. A zombie. Without the deep scan, with a 0.35 score on the scorecard, nobody would have given it a second look.

Scoring tells you "which ones might be idle." Deep scan tells you "what they're actually doing." Not the same question.

How It Works: 10 Agents, 4 Human Decision Gates

ICO covers compute instances, Kubernetes workloads, databases, object storage, and network resources — across AWS, GCP, Azure, or on-prem via SSH. This case focused on EC2, but the same pipeline handles all of them.

ICO is not "one AI that deletes your resources." It's 10 independent skill agents, each responsible for one link in the chain, passing data through structured files.

At the four BLOCKING checkpoints, the agent must stop and wait for a human. Nothing gets deleted until a human has signed off at every one of them:

Phase C — Review the scorecard, select which resources enter deep scan
Phase E — Review the deep scan report, select which enter isolation
Phase G — Approve the isolation plan (method, rollback script, observation period)
Phase J — Final deletion approval

This is not about writing "be careful" in a prompt. You can't get safety from prompts — the model might ignore what you said, or forget it once the context fills up. Safety must be hardcoded into the process: if a phase doesn't pass, the agent cannot jump to the next step on its own.

Agents don't pass data through context either. Scoring produces suspect_assessment.json, deep scan produces deep_scan_{id}.json, isolation produces isolation_plan_{id}.json — each constrained by a Schema. If the previous agent's output doesn't match the format, the next agent errors out. No improvising.

This is what agent-runbook is about: constrain agent collaboration with contracts. Relying on prompts is gambling.

Three Hard Lessons

1. Scoring can't see the service stack. Deep scan can.

ec2-172.30.0.41 scored 0.35, LOW. Based on the score alone, you'd skip it. But the deep scan found Redis MASTER, Nacos, two Python services, Nginx — an entire microservice infrastructure stack sitting there idling. The three coarse signals — CPU, network, login — completely fail to capture "what's actually running." Scoring points the way. Deep scan shows you what's actually there.

2. Without historical data, real-time snapshots have blind spots

This case had no Prometheus or Datadog. ICO relied on SSH real-time snapshots. Monthly jobs, quarterly reports, on-demand batch processing — snapshots will never see them. If a crontab has a "run at 1 AM on the 1st of every month" entry, scanning a hundred times in real time won't catch it. Historical monitoring data is the most reliable signal source. Without it, deep scanning carries double the weight.
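
One cheap way to narrow that blind spot is to read the schedule instead of waiting for the job to run. The sketch below flags crontab entries that fire less often than daily. It is an illustration, not part of ICO: `is_infrequent` is a hypothetical helper, and it assumes the user-crontab format printed by `crontab -l` (five time fields followed by the command).

```python
INFREQUENT_SHORTCUTS = {"@weekly", "@monthly", "@yearly", "@annually"}


def is_infrequent(cron_line: str) -> bool:
    """True when an entry is restricted by day-of-month, month, or day-of-week."""
    line = cron_line.strip()
    if not line or line.startswith("#"):
        return False
    fields = line.split()
    if fields[0].startswith("@"):
        return fields[0] in INFREQUENT_SHORTCUTS
    if len(fields) < 6:
        # Environment assignments such as MAILTO=... are not schedule entries.
        return False
    day_of_month, month, day_of_week = fields[2], fields[3], fields[4]
    return any(field != "*" for field in (day_of_month, month, day_of_week))
```

An entry like `0 1 1 * * /opt/report.sh` gets flagged, which tells you the machine has a monthly job that no real-time snapshot would ever catch.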

3. Internal traffic doesn't mean business usage

.41 and .43 had Redis Sentinel interconnections. .25 was connecting to .41's Redis. The topology graph had plenty of edges — it looked busy. But it was all infrastructure internal communication — services probing each other, syncing state, with not a single edge from an external user. The scorecard treats any network traffic as an active signal, but not all traffic is the same. Only the deep scan's traffic topology can distinguish "machines talking to each other" from "users making requests."
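
A first-cut version of that distinction: keep only inbound connections whose peer sits outside the internal network. This is a sketch under assumptions. The connection dict shape and the 172.30.0.0/16 range are guesses for illustration, and an internal peer can still be a real consumer, so an empty result is a hint, not proof.

```python
import ipaddress


def external_inbound(connections: list[dict], internal_cidr: str) -> list[dict]:
    """Return inbound connections that originate outside the internal network."""
    internal = ipaddress.ip_network(internal_cidr)
    return [
        conn
        for conn in connections
        if conn["direction"] == "inbound"
        and ipaddress.ip_address(conn["peer"]) not in internal
    ]
```

Fed the edges from .41's topology (Sentinel traffic with .43, Redis clients from .25, the Datadog agent sending outbound), it returns nothing, which matches what the deep scan showed.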

Conclusion

The value of AI agents in operations is not "they can delete resources automatically" — that's called danger.

The real value is: you codify a verified safety procedure into an auditable, reusable agent skill anyone can run, and get the same result every time. Not typing commands each time and hoping, but a contract file committed to a repo.

ICO applies this approach to cloud cost optimization. If it works in dev, it's only more valuable in production.

The code:

open-devops-skills — directly installable ICO skill library: github.com/KnoxOps/open-devops-skills

Install with one line:

claude plugin install ico@open-devops-skills

Then:

/ico:orchestrator Scan my cloud for idle resources

If you've got EC2 instances you're not sure are still in use, give it a try. Let the agents scan and analyze — you make the call.

Loop Engineering: Building an Agent Loop with agent-runbook

paul_h — Wed, 17 Jun 2026 09:02:40 +0000

Recently, another interesting new term has appeared in the AI industry.

Loop Engineering.

If you follow the AI space, you've probably seen it everywhere in the past couple of days. It's all over X and other social media, and plenty of people are discussing it in group chats too.

Addy Osmani recently formalized the concept as Loop Engineering, the fourth "Engineering" after Prompt Engineering, Context Engineering, and Harness Engineering.

What is a Loop? Here's a concrete scenario:

You have a project with 16 failing tests. Previously you'd do this: run the tests, see what failed, tell Claude "fix this", it fixes it, you run the tests again, find new issues, prompt again. Back and forth, with you as the person driving the loop.

The idea behind Loop Engineering is: you no longer manually drive it round by round. You define the goal (all tests pass), define what to do each round (run tests → fix code), define constraints (can't modify test files), then let go. The system runs on its own until the goal is met.

/goal Is Not Enough

At this point you might say: doesn't Claude Code already have the /goal command? Can't I just /goal "all tests pass" and be done?

On the surface, yes. /goal gives you a completion condition, and Claude works on its own until it's satisfied. But after using it a few times you'll notice the problem — the goal is defined, but the agent still won't work properly. Because you only told it "what counts as done", you didn't tell it "what to do each round".

What /goal "all tests pass" does:

Tells the agent "keep going until this condition is met"
At the end of each round, an independent model judges whether the goal is satisfied
The agent has complete freedom in what it does each round

What it doesn't do:

Doesn't define the internal structure of each round. In /goal the agent does whatever it wants each round. Maybe the first round it runs tests + fixes code, the second round it suddenly goes refactoring, the third round it modifies test files.
No iteration-level constraints. /goal only has a termination condition. There's no guardrail like "only modify one file per round", and you can't control when the agent goes out of bounds.
Not reusable. /goal "all tests pass" is gone once you type it. Next time you switch repos or switch people, you have to type it all over again.
Not auditable. When your boss asks "what's the logic of this automated fix workflow", you can't show them /goal.

To summarize: /goal solves "keeping the agent from stopping", but doesn't solve "making the agent follow the rules".

What you need is a place to write down the loop's structure, constraints, and goals — not a one-time command typed into the terminal, but a file that can be committed to the repo, where anyone who gets it can run it and get the same behavior.

agent-runbook: The Contract Format for Loops

This is what agent-runbook does.

agent-runbook is an open source project (github.com/KnoxOps/agent-runbook). It's not the execution engine for loops, but rather the contract format for loops. You use YAML to declare "what to iterate on, when to stop, what the constraints are for each round", and the compiler generates a SKILL.md for you. SKILL.md is the reusable instruction format for Claude Code and Codex: put it in your project and it can be directly invoked with claude --skill.

A loop step has three elements:

body: what to do each round (the rhythm of observe → act → verify)
goal: when to stop (must be a machine-verifiable condition)
max_iterations: safety boundary (hitting this number means the design has a problem; it also keeps the loop from burning tokens)

There's also one more key thing: quality_check. This is an iteration-level guardrail — after each round it checks whether the agent went out of bounds (e.g. modified files it shouldn't have). If blocking: true, the round doesn't count as complete if the check fails.
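
To show what such a check boils down to, here are the two rules from the test-fix example below written as plain code. This is a sketch only. In the runbook the rules are natural-language statements checked by a reviewer, and `check_round` is a hypothetical helper that expects the list of changed paths (for instance from `git diff --name-only`).

```python
from pathlib import PurePosixPath


def check_round(modified_files: list[str]) -> list[str]:
    """Return guardrail violations for one round; an empty list means it passes."""
    violations = []
    for name in modified_files:
        parts = PurePosixPath(name).parts
        # Reject anything not under src/, including paths that climb out via "..".
        if parts[:1] != ("src",) or ".." in parts:
            violations.append(f"modified outside src/: {name}")
    if len(modified_files) != 1:
        violations.append(f"expected exactly one modified file, got {len(modified_files)}")
    return violations
```

With blocking: true, any non-empty result means the round is not counted as complete.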

Hands-on: Building an Automated Test Fix Loop

Here's a simple example to show you how we use agent-runbook to build an agent loop.

We're going to build an automated test fix loop. The loop is simple: the goal is a 100% unit test pass rate. Each iteration has only two steps:

run_tests - run the tests, see which ones are still failing
fix - launch a clean context agent to fix the discovered issues

Beyond that, we also need to define our safety boundary: max_iterations. If you've ever burned through all your tokens with the /goal command, max_iterations is what prevents that.

Here's the full runbook, defined in structured YAML:

name: fix-failing-tests
description: Iteratively fix all failing tests until the test suite is green

steps:
  - id: fix_loop
    type: loop
    description: "Run tests, analyze failures, fix source code, repeat until green"
    goal: "pytest exits with 0 failures (all tests pass)"
    max_iterations: 10
    depends_on: []
    body:
      - id: run_tests
        type: script
        command: "cd examples/fix-loop && python3 -m pytest tests/ --tb=short 2>&1 | tail -60"
        depends_on: []
      - id: fix
        type: agent
        prompt: |
          Look at the pytest failures from run_tests.
          Pick ONE source file that has failing tests and fix the bugs in that file.

          Rules:
            - Only modify files in src/, NEVER modify test files
            - Fix exactly ONE file, then stop immediately
            - Do NOT read or modify any other source files
        depends_on: [run_tests]
        quality_check:
          blocking: true
          rules:
            - "Only files in src/ were modified, not test files"
            - "Exactly one source file was modified"

  - id: present
    type: inline
    prompt: |
      Generate a markdown report summarizing the fix loop results.
      Include:
        - Total iterations taken
        - What was fixed in each iteration (file + bug description)
        - Final test results
        - How cascading dependencies caused failures to clear automatically
      Write the report to fix_report.md
    depends_on: [fix_loop]

From YAML to Executable SKILL.md

Next we need to compile the YAML into a SKILL.md that Claude Code/Codex can directly execute. The generation command is simple:

python3 -m agent_runbook generate runbook.yaml -o output/

The generated SKILL.md looks like this:

---
name: fix-failing-tests
description: >-
  Iteratively fix all failing tests until the test suite is green
user-invocable: true
---

## Execution Flow

### Task Context

Before starting execution, initialize `task_context.json`:

```json
{
  "task_id": "<task_id from input>",
  "current_step": 0,
  "current_step_id": null,
  "status": "running",
  "steps": {
    "fix_loop": "pending",
    "present": "pending"
  },
  "updated_at": "<ISO timestamp>"
}
```

Update this file after each step completes. On error, set step status to `"failed"` and overall `status` to `"failed"`.

### Step 1: fix_loop

**Type:** loop
**Description:** Run tests, analyze failures, fix source code, repeat until green

## Iteration Loop

**Goal:** pytest exits with 0 failures (all tests pass)
**Max Iterations:** 10

> This step executes as a loop. The body steps repeat until the goal is met or max iterations reached.

## Loop Body (repeats each iteration)

#### Body Step 1: run_tests

**Type:** script

**Execution:** Execute the following command:
```bash
cd examples/fix-loop && python3 -m pytest tests/ --tb=short 2>&1 | tail -60
```

#### Body Step 2: fix

**Type:** agent

**Execution:** Launch an independent agent with the following prompt file:

Look at the pytest failures from run_tests.
Pick ONE source file that has failing tests and fix the bugs in that file.

Rules:
  - Only modify files in src/, NEVER modify test files
  - Fix exactly ONE file, then stop immediately
  - Do NOT read or modify any other source files


## Goal Evaluation

After all body steps complete, evaluate:

**Goal:** pytest exits with 0 failures (all tests pass)

1. If goal IS met → mark this step completed, proceed to next step.
2. If goal NOT met and iterations remain → reset body steps, start next iteration.
3. If max iterations reached → mark step completed with status "max_iterations_reached", report what remains.

Append a summary to `iteration_history` after each iteration.

### Progress Tracking

After completing this step, update `task_context.json`:
- Set `current_step_id` to `"fix_loop"`
- Set `steps.fix_loop` to `"completed"`

### Step 2: present

**Type:** inline

## Execution
Follow these instructions:

Generate a markdown report summarizing the fix loop results.
Include:
  - Total iterations taken
  - What was fixed in each iteration (file + bug description)
  - Final test results
  - How cascading dependencies caused failures to clear automatically
Write the report to fix_report.md


### Progress Tracking

After completing this step, update `task_context.json`:
- Set `current_step_id` to `"present"`
- Set `steps.present` to `"completed"`

What does the generated SKILL.md contain? It translates the contracts you declared in YAML into execution instructions that the agent can understand:

iteration_history: requires the agent to record what was done each round and whether the goal was met, forming structured iteration memory
goal evaluation: the judgment logic after each round — if met then stop, if not met then continue, if limit reached then report
progress tracking: tracks overall progress through task_context.json, supports checkpoint resume
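
The three contracts above boil down to a small control loop. Here is a minimal Python sketch of that logic, not agent-runbook's actual implementation: `run_loop`, `body`, and `goal_met` are hypothetical names, and the fake body only pretends to fix one file per round.

```python
from typing import Callable


def run_loop(body: Callable[[int], str],
             goal_met: Callable[[], bool],
             max_iterations: int = 10) -> dict:
    """Repeat the body until the goal is met or the iteration cap is hit."""
    iteration_history = []
    status = "max_iterations_reached"
    for i in range(1, max_iterations + 1):
        summary = body(i)   # stands in for the run_tests + fix body steps
        done = goal_met()   # goal evaluation happens after the body completes
        iteration_history.append(
            {"iteration": i, "summary": summary, "goal_met": done}
        )
        if done:
            status = "completed"
            break
    return {"status": status, "iteration_history": iteration_history}


# Hypothetical stand-in for the real work: one "file" gets fixed per round.
failing = ["calculator", "validator", "formatter"]


def fake_body(iteration: int) -> str:
    return f"fixed {failing.pop(0)}"


result = run_loop(fake_body, goal_met=lambda: not failing)
print(result["status"], len(result["iteration_history"]))
```

The history list is what gives the agent memory across rounds: each entry records what was done and whether the goal held afterwards.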

Running It: 3-Round Convergence

Now we can trigger this skill in Claude Code.

The run included three iterations:

Iteration 1: calculator fix → 6 failures disappeared

Iteration 2: validator fix → 5 failures disappeared

Iteration 3: formatter fix → all green

Finally, the run produces fix_report.md, the report we defined in the runbook's present step to be generated after the loop.

Key Points for Designing a Good Loop

Choose the right task. Not all tasks are suitable for loops. A good loop task has two characteristics: objective feedback signals (test results, lint output, whether compilation passes), and the ability to make incremental progress building on the previous round. Fixing tests, code migration, and performance optimization are all good candidates. Tasks requiring one-time creative decisions (architecture choices, naming) are not suitable.
Write the goal as a decidable end state. "pytest exit 0" is a good goal, "better code quality" is not. The agent must be able to determine true or false on its own through tool output, otherwise the loop never knows whether it should stop.
Keep the body in an "observe—act" rhythm. First use script steps to see the current state clearly (run tests, run lint), then use agent steps to make decisions and modifications. Don't let the agent observe, act, and verify all in one round — split them up, each step has clear responsibilities, and when something goes wrong it's easier to locate.
Leave an exit for failure. max_iterations is not the number of rounds you expect, but a safety valve for "exceeding this number means the approach has a problem". A normal loop should converge well below the upper limit. If it maxes out, it means the goal is too hard or the body design has flaws, and human intervention is needed.
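
To make "decidable end state" concrete, here is a small sketch of a goal check for the pytest example. It assumes pytest's usual final summary line (for example `2 failed, 5 passed in 0.12s`). Because the runbook's command pipes pytest through `tail`, the shell reports `tail`'s exit status rather than pytest's, so this sketch reads the summary text instead. The function name `goal_met` is hypothetical.

```python
import re


def goal_met(pytest_output: str) -> bool:
    """True only if the last line is a pytest summary with no failures or errors."""
    lines = [ln for ln in pytest_output.splitlines() if ln.strip()]
    if not lines:
        return False  # no output at all is never "green"
    summary = lines[-1]
    if not re.search(r"\d+ passed", summary):
        return False  # not a recognizable passing summary
    bad = re.findall(r"(\d+) (?:failed|errors?)\b", summary)
    return sum(int(n) for n in bad) == 0


print(goal_met("===== 2 failed, 5 passed in 0.12s ====="))
print(goal_met("===== 7 passed in 0.10s ====="))
```

The check fails closed: anything it cannot recognize counts as "not met", so a crash or truncated output cannot end the loop early.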

agent-runbook: More Than Just Loops

The AI product I'm building requires me to write a lot of long-running DevOps skills for SREs, and they need to be as error-free as possible.

During debugging I often encounter two types of problems:

One is agents not following instructions: you tell the agent to only restart the service, and it goes ahead and changes the configuration too.
The other is agents in a complex multi-step skill not collaborating according to the established conventions: the next step never reads the previous step's output, or it reads it but the format is wrong.

Based on these problems, I developed agent-runbook: a contract-based skill generation tool, where the generated SKILL.md can be directly used as a skill integrated into the Claude Code/Codex ecosystem.

Its core philosophy is: use contracts to constrain agent collaboration, instead of relying on prompts and hoping for the best.
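
As an illustration of a contract that can be checked mechanically, the two quality_check rules from the fix step earlier ("only files in src/ were modified" and "exactly one source file was modified") can be expressed as a plain function. This is a sketch, not how agent-runbook evaluates its rules. `check_fix_step` is a hypothetical name, and the list of changed paths is assumed to come from something like `git diff --name-only`.

```python
from pathlib import PurePosixPath


def check_fix_step(changed_files: list[str]) -> list[str]:
    """Return a list of rule violations; an empty list means the step passes."""
    violations = []
    in_src = [f for f in changed_files if PurePosixPath(f).parts[:1] == ("src",)]
    outside = [f for f in changed_files if f not in in_src]
    if outside:
        violations.append(f"modified files outside src/: {outside}")
    if len(in_src) != 1:
        violations.append(f"expected exactly one source file, got {len(in_src)}")
    return violations


print(check_fix_step(["src/calculator.py"]))
print(check_fix_step(["src/calculator.py", "tests/test_calculator.py"]))
```

With `blocking: true`, a non-empty result would mean the iteration does not count as complete.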

This table gives you a quick sense of how agent-runbook differs from /goal:

	/goal	agent-runbook
Per-round structure	Agent does whatever it wants	Body declaratively defines each round's steps
Iteration constraints	None, only a termination condition	quality_check guardrails, out-of-bounds doesn't count as complete
Inter-step communication	Relies on LLM context passing	JSON Schema files, inspectable, parallel-readable
Error recovery	Start over	Checkpoint & Resume, pick up from where it crashed
Build-time checks	None	DAG cycle detection, schema reference validation, contract closure checks
Reusability	Gone once you type it	Commit to repo, anyone can run it with the same behavior

Loop is a step type added on top of this foundation — when your task requires iteration, use the same contract-based approach to define the loop's body, goal, and constraints.
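
The build-time checks in the table are ordinary graph validation. Below is a minimal sketch of cycle detection and dependency reference validation over `depends_on`, using Kahn's algorithm. It is an illustration under assumptions, not agent-runbook's code: `validate_dag` is a hypothetical name, and steps are modeled as a plain dict of id to dependency list.

```python
from collections import deque


def validate_dag(steps: dict[str, list[str]]) -> list[str]:
    """Return step ids in a valid execution order, or raise ValueError."""
    for step, deps in steps.items():
        for dep in deps:
            if dep not in steps:
                raise ValueError(f"{step} depends on unknown step {dep}")

    indegree = {step: len(deps) for step, deps in steps.items()}
    dependents: dict[str, list[str]] = {step: [] for step in steps}
    for step, deps in steps.items():
        for dep in deps:
            dependents[dep].append(step)

    ready = deque(step for step, n in indegree.items() if n == 0)
    order = []
    while ready:
        step = ready.popleft()
        order.append(step)
        for nxt in dependents[step]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    if len(order) != len(steps):  # whatever never became ready sits on a cycle
        stuck = sorted(set(steps) - set(order))
        raise ValueError(f"dependency cycle among: {stuck}")
    return order


print(validate_dag({"fix_loop": [], "present": ["fix_loop"]}))
```

Running this before any agent starts is the point: a typo in `depends_on` fails at build time instead of halfway through a long run.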

You don't have to start from scratch either. open-devops-skills is a production-grade DevOps skill library built on agent-runbook. It currently covers infrastructure and cloud resource cost optimization, with more DevOps scenarios planned. You can use the skills directly or as a reference when designing your own.

It's also worth mentioning that agent-runbook itself is not limited to DevOps. Any scenario requiring multi-step orchestration, inter-agent collaboration, and long-term reliable operation is suitable — code migration, security auditing, documentation generation, data pipeline validation. As long as your task can be broken down into "steps + contracts + dependencies", it can be expressed with a runbook.

The repo is at github.com/KnoxOps/agent-runbook, feel free to try it out and give feedback. If you have a workflow where you're repeatedly prompting agents manually, try writing it as a runbook — you'll find that once it becomes a contract, the cost of debugging and reuse drops significantly.

I Asked Claude to Map My Infrastructure. Then I Asked a Purpose-Built Tool.

paul_h — Mon, 15 Jun 2026 07:13:09 +0000

I manage a small stack. Three Linux VMs, one Kubernetes cluster, maybe 20-something services total. Not big. But underdocumented — the kind of environment where you SSH in and discover things you forgot were running.

Last week I ran the same task through two different AI tools: "tell me what's running, how it connects, and what looks risky." One is a general-purpose LLM (Claude). The other is a purpose-built AI SRE tool. Same environment, same ask. The results were... instructive.

The task

Simple brief: infrastructure discovery. I want a full picture — services, dependencies, topology, risks. The kind of thing a new hire would spend their first week piecing together from wikis that haven't been updated since 2023.

Claude Code (Opus model)

My prompt:

"I manage a small infrastructure — 3 Linux VMs (172.30.0.41, 172.30.0.42, 172.30.0.43) and a Kubernetes cluster. SSH access is already configured. Help me understand what's running across this environment — I want a full picture of my services, dependencies, and topology."

I'm running Claude Code locally with the Opus model — their flagship tier. Claude didn't ask questions. It just started SSH-ing in.

Five minutes later it handed me a report. And honestly? It was better than I expected.

What Claude delivered:

Identified all three VM roles correctly (API Gateway, Order Processing, Data Tier)
Drew an ASCII topology showing Nginx routing to backend services with canary weights
Built a full service table — host, port, tech stack, notes
Mapped the Redis Sentinel cluster including a stale replica on a decommissioned node
Enumerated every K8s namespace and workload
Traced the observability pipeline (node_exporter → Prometheus, OTel → Jaeger, Datadog agents)
Flagged four real issues: dead Redis replica, broken image pulls in aigc-app, active canary split, multiple knoxd versions

Five minutes. No hand-holding. For a "quick, what's running here?" sweep, this is genuinely useful.

Where it stops

Here's what I noticed after the initial "wow, that was fast" wore off.

The output is a wall of markdown. Accurate, mostly. But flat. Everything has the same weight — a critical single-point-of-failure sits next to a cosmetic naming inconsistency. No severity. No priority.

More specifically:

No topology visualization. I got an ASCII diagram. It's readable for 6 machines. At 60 machines, it's unreadable. At 600, impossible.

No business grouping. Claude listed every service but couldn't tell me which ones form the e-commerce flow vs. the logistics flow vs. the platform layer. That requires domain context it doesn't have.

No risk assessment. Four issues found, but no severity classification. The dead Redis replica and the cosmetic knoxd naming thing are presented with equal weight.

No quality gate. Nobody verified whether Claude's topology was actually correct. It connected things confidently — but was the canary weight really 90/10? I'd need to go check.

No persistence. Close the chat window. The report is gone. Tomorrow I'd run it again and get a slightly different exploration path, slightly different findings.

No depth control. I can't say "that Business Island looks risky, go deeper on it." It's all-or-nothing.

This maps to a pattern I keep seeing across industries. In legal tech, people noticed the same thing — general LLMs are good at summarizing contracts but can't do precision clause verification. In finance, ChatGPT can describe how to post a journal entry but can't actually post one. The dividing line is consistent: general AI is a thinking tool; specialized AI is an acting tool.

When the task is "reason about this data and explain it to me" — general tools are great. When the task shifts to "build a structured, persistent, verifiable model of my environment" — you've crossed into territory they weren't designed for.

Purpose-built tool, same task

For comparison, here's what happens when I send one line to Knox (our purpose-built AI SRE tool — yes, this is our product, stating that upfront):

"Run a full infrastructure discovery on our production environment."

Shorter prompt. No need to explain the environment — it already has connectors configured.

Twenty minutes later:

The differences that matter:

Visual topology — not ASCII art, an interactive service relationship graph
Business Islands — services auto-grouped by business function with criticality labels
Risk Triage — findings ranked by severity with a distribution chart
Persistence — results stored in a graph database, queryable later
Depth on demand — "Deep Analysis Available" button for any Business Island

How it got there — a team of agents, not a single model:


This is the work process, not a deliverable. Multiple specialized agents collaborated — one coordinated the task, one did the actual discovery, one quality-checked the findings — flagging 9 uncertain items for human review instead of presenting everything with equal confidence.

The scale question

We ran this on 5-6 machines. The gap is already visible. But this is the minimum-gap scenario.

At 60 servers across multiple environments, Claude's context window fills up. You'd need multiple sessions, manual stitching, and the "flat markdown" problem becomes unbearable. The gap doesn't grow linearly — it compounds.

That's not a knock on Claude. A Swiss Army knife is great. But when you need surgery, you reach for a scalpel.

What's your environment look like? At what scale did you find general AI tools hitting their ceiling for ops work?

If you want to try the purpose-built approach: knoxops.app

Agentic Ops: How I Shipped My Vibe-Coded Game to Production

paul_h — Sat, 30 May 2026 07:31:08 +0000

Over the weekend, I vibe coded a cooking game. You combine random ingredients, and the game generates a dish with a score and a snarky review — stuff like "This tastes like regret and too much butter." I'd wanted to build this for a while. Eventually I'll hook it up to an AI model to generate more combinations and even harsher critiques.

One Prompt, One Hour

I opened Claude Code and typed a single prompt:

"Create a cooking game where players combine ingredients to discover recipes..."

An hour of coding and debugging later, I had a working version running on localhost.

The Wall

Then came the real problem: deploying it so my friends could actually play.

AI has collapsed the barrier to building software. But no matter how low the entry gets, even the most seasoned SRE can't rattle off HTTPS configs, domain setups, and nginx routing rules from memory. As a vibe coder, what was I supposed to do next?

The Plan

I spun up an AWS VM, installed a Knox Daemon (Knox is an AIOps product), and connected it to my GitHub repo. Then I told it:

"How I Shipped My Vibe-Coded Code to Production"

It started exploring my codebase. It discussed the task with me, asked clarifying questions, and came back with a full plan: five stages covering pre-checks, building the game, requesting certificates, updating nginx routes, and final verification, followed by documenting what it learned for next time. Nothing would execute until I approved it.

The Execution

I reviewed the plan and hit approve. The agents kicked off in parallel — one checking the environment, one executing changes, another validating the output of each stage. They ran efficiently, every step visible. It looked exactly like a human SRE team at work.

When it was done, the agent handed me a report. I clicked the URL in the report and — there it was. My game. Live. Someone could play it.

30 Minutes

I was doing other things throughout the deployment, so I wasn't always quick to respond when the agent needed input — requirement discussions, plan approval, execution confirmations on my AWS box. Total time from start to live: about an hour. If I'd been fully focused, probably 30 minutes.

The whole experience was striking. More and more people are building things in the AI era. They think about product design and development, but then what? How do you deploy? How do you keep the service running?

I think this is what agentic ops means.

Agentic ops gives you the same answer vibe coding does: describe what you want, and an agent does the work. Here the agent operates the server instead of writing code. Same loop. The output just isn't code anymore; it's a running service.

The endpoint of vibe coding shouldn't be localhost:3000. It should be a link you can drop in a group chat.

AI Agents Mapped My Legacy Production Environment in One Hour.

paul_h — Thu, 28 May 2026 03:55:11 +0000

I inherited a black box.

Three VMs. A hundred-something microservices. Redis, ClickHouse, MySQL, some homegrown database nobody could name. Kafka and Zookeeper thrown in because of course they were.

Nobody knew how the services connected. The original team was gone. The architecture lived entirely in oral tradition, and the last person who could recite it had left six months ago.

This is not a metaphor. This is Tuesday for anyone who's done SRE work long enough.

Setup: 30 seconds, zero footprint

I already had Teleport for daily ops. SSH access, session recording. It worked, I didn't want to break it.

What I did:

Installed knoxd on my Teleport proxy (not on the servers)
AI agent team auto-configured a Teleport connector

That's it. Nothing new on my production machines. The agents ride the Teleport session I already had, with the permissions I'd already defined.

Non-invasive — not in the "we promise it's lightweight" sense. In the "there is literally nothing new running on your production machines" sense.

How it actually works

The agents SSH in through Teleport. Plain SSH commands, same ones you'd type yourself.

What makes this safe rather than terrifying:

	Auto-run	Requires human approval
Read-only	`ps`, `ss`, `cat /proc/net/tcp`, `nginx -T`	—
Mutating	—	`kill`, `systemctl restart`, `rm`

The sandbox: strict AST parsing + default-deny whitelist. The agents can look at everything but touch nothing without asking.
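
To give a feel for the default-deny idea, here is a deliberately simplified sketch. It is not Knox's sandbox and not a security boundary: it tokenizes with Python's `shlex` instead of building a real shell syntax tree, and the allowlist contents, the `classify` name, and the two outcome labels are all assumptions for illustration.

```python
import shlex

READ_ONLY = {"ps", "ss", "cat"}      # illustrative; a real policy would also vet flags
READ_ONLY_EXACT = {("nginx", "-T")}  # only this exact invocation is allowed
SHELL_OPERATORS = set("();<>|&")


def classify(command: str) -> str:
    """Return 'auto-run' only for a single allowlisted command, else 'needs-approval'."""
    if "\n" in command or "\r" in command:
        return "needs-approval"      # multi-line input is never auto-run
    try:
        lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
        lexer.whitespace_split = True
        lexer.commenters = ""        # treat '#' as an ordinary character
        tokens = list(lexer)
    except ValueError:               # e.g. unbalanced quotes
        return "needs-approval"
    if not tokens:
        return "needs-approval"
    for tok in tokens:
        # Pipes, chaining, redirects, and substitutions all fall through to a human.
        if tok[:1] in SHELL_OPERATORS or "$" in tok or "`" in tok:
            return "needs-approval"
    argv = tuple(tokens)
    if argv[0] in READ_ONLY or argv in READ_ONLY_EXACT:
        return "auto-run"
    return "needs-approval"          # default deny: anything unknown asks first


for cmd in ["ps aux", "nginx -T", "systemctl restart nginx", "ss -tnp; rm -rf /tmp/x"]:
    print(f"{cmd!r}: {classify(cmd)}")
```

The design choice worth copying is the direction of the default: the function has exactly one path to `auto-run`, and every parse failure or unrecognized shape ends in a request for approval.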

What the agents discovered

Step 1: OS inventory — kernel, distro, packages. All 3 VMs in parallel.

Step 2: Process mapping — ps aux, parsed. Hundreds of processes tagged with binary path, resource footprint, parent-child relationships.

Step 3: Process → Service resolution

Check name service first
If unregistered (most weren't — legacy system), infer from install path
Flag for human confirmation before writing anything back

The AI doesn't hallucinate service names into your architecture map. It asks.
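
The resolution order in Step 3 can be sketched as a small function. Everything specific here is an assumption for illustration: the registry is modeled as a dict keyed by binary path, the `/opt/<service>/...` install layout is hypothetical, and `resolve_service` is not a Knox API.

```python
from pathlib import PurePosixPath


def resolve_service(binary_path: str, registry: dict[str, str]) -> dict:
    """Registry first, then install-path inference, always flagging guesses."""
    if binary_path in registry:
        return {"service": registry[binary_path],
                "source": "registry",
                "needs_confirmation": False}
    parts = PurePosixPath(binary_path).parts
    # Hypothetical convention: /opt/<service>/... names the service.
    if len(parts) >= 3 and parts[1] == "opt":
        return {"service": parts[2],
                "source": "install_path",
                "needs_confirmation": True}
    return {"service": None, "source": "unknown", "needs_confirmation": True}


registry = {"/usr/sbin/nginx": "nginx"}
print(resolve_service("/usr/sbin/nginx", registry))
print(resolve_service("/opt/billing-api/bin/server", registry))
```

Anything inferred rather than looked up carries `needs_confirmation`, which is where the human gets asked before a name is written back.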

Step 4: Service → Business Island grouping

A business island = logical grouping by business function (billing, user auth, order processing). The thing that exists in every architect's head but never in any document.

Step 5: Connection mapping — four evidence sources, cross-referenced:

Source	What it reveals	Example
Network connections (`ss -tnp`)	Live TCP dependencies	Port 6379 → Redis, port 9092 → Kafka
Config files	Declared dependencies	`kafka.brokers: kafka-01:9092` in YAML
Access logs	Actual call patterns	Who calls whom, how often
LB configs (nginx)	Ingress chain	Domain → LB → real server

Cross-reference. Resolve conflicts. Draw edges.
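
One way to picture the cross-referencing step: collect candidate edges per evidence source, then see which edges are corroborated. This is a sketch under assumptions, not the agents' actual logic. The service names and edges are made up, and the "two or more sources" threshold is an arbitrary choice for the example.

```python
from collections import defaultdict

Edge = tuple[str, str]  # (caller, dependency)


def cross_reference(evidence: dict[str, set[Edge]]) -> dict[Edge, set[str]]:
    """Map each observed edge to the set of sources that support it."""
    support: dict[Edge, set[str]] = defaultdict(set)
    for source, edges in evidence.items():
        for edge in edges:
            support[edge].add(source)
    return dict(support)


def split_by_support(support: dict[Edge, set[str]], min_sources: int = 2):
    """Separate corroborated edges from ones that need a human look."""
    confirmed = {e for e, srcs in support.items() if len(srcs) >= min_sources}
    needs_review = set(support) - confirmed
    return confirmed, needs_review


# Hypothetical evidence for illustration only.
evidence = {
    "network": {("order-api", "redis"), ("order-api", "kafka")},
    "config": {("order-api", "kafka"), ("order-api", "mysql")},
    "access_logs": {("gateway", "order-api")},
    "lb_config": {("gateway", "order-api")},
}
confirmed, needs_review = split_by_support(cross_reference(evidence))
print(sorted(confirmed))
print(sorted(needs_review))
```

An edge declared in config but never seen on the wire, or the reverse, is exactly the kind of conflict worth surfacing rather than silently drawing.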

One hour.

What I got

Architecture diagrams — topology maps of each business island, services as nodes, dependencies as edges, data flows labeled. The kind of diagram you'd pay a consultant a week to produce.

High-risk report:

Single points of failure
Circular dependencies
Kafka topics with no visible consumer group
One Redis instance holding session state for 6 business islands, zero isolation

Things I needed to know. Things dashboards would never show me.

The cost

Zero.

Knox gives free credits on signup. Enough for a small cluster for a long time. No credit card. No trial-that-converts-to-paid. One binary on a jump host.

Why this matters

Most AIOps tools treat metrics as the final answer. They're not. They're the starting point.

Real outages hide in blind spots:

System logs nobody tails
Manual changes nobody tracked
Config drift APM tools don't see

To find root cause, you have to log into machines and build an evidence chain. That's what humans do. That's what these agents do.

Monitoring tells you a metric crossed a threshold. It doesn't tell you:

Service X and Y form a circular dependency that will cascade
Your session store is a single point of failure for half the platform

Those aren't metric problems. They're structure problems. LLMs are uniquely good at structure — if you give them a way to see it without breaking anything.

Safety model

Letting AI touch production should sound terrifying. That's why the safety model has four parts:

AST-parsed command validation — not string matching, actual syntax tree analysis
Default-deny whitelist — everything blocked unless explicitly allowed
Human-in-the-loop — any destructive action requires a plan + approval
Connector model — agents use paths you already trust (Teleport, SSH, AWS, Prometheus)

The agents never need their own access path. They never open a new hole in your security posture.

That's the difference between an agent you'd let near production and one you wouldn't.

What I'm building

It's called KnoxOps. Core idea: infrastructure is an object graph, not a flat list of resources. Model it that way and LLMs can reason like a senior SRE — tracing dependencies, calculating blast radius, finding what dashboards miss.
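
As a rough illustration of what "calculating blast radius" means on a dependency graph, here is a minimal sketch. It is not KnoxOps code: the graph is a plain dict of service to dependencies, the service names are invented, and `blast_radius` is a hypothetical helper that walks the graph in reverse with a breadth-first search.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], failed: str) -> set[str]:
    """Every service that directly or transitively depends on `failed`."""
    dependents: dict[str, set[str]] = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    affected: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for upstream in dependents.get(node, ()):
            if upstream not in affected:
                affected.add(upstream)
                queue.append(upstream)
    affected.discard(failed)  # a cycle can lead back to the failed node itself
    return affected


# Hypothetical graph for illustration only.
graph = {
    "gateway": {"order-api", "auth"},
    "order-api": {"redis", "kafka"},
    "auth": {"redis"},
    "reporting": {"clickhouse"},
}
print(sorted(blast_radius(graph, "redis")))
```

A flat resource list cannot answer this question; once the edges exist, it is a few lines of traversal.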

The goal: delegate routine SRE toil so developers can focus on building.

More connectors coming. The principle stays the same: use the access paths you already trust.

If you've inherited a system nobody understands — I'd like to hear from you.

I'm the founder of KnoxOps. Currently in open beta — use code DEVTO26 for 10,000 free credits on signup.