paul_h

Posted on Jun 26

Loop Engineering for Ops

#devops #aiops #ai #loopengineering

Loop Engineering is the phrase everyone's been throwing around lately. It comes after prompt, context, and harness engineering, and pushes agent coordination one step further.

Google's Addy Osmani published a long piece a while back that formalized the whole thing. Worth a read. But before him, Peter Steinberger, founder of OpenClaw, said something that nailed it even harder.

"You shouldn't be prompting coding agents anymore. You should be designing loops that prompt your agents."

What he means is, you stop being the person manually driving the agent turn by turn. You build a system. The system runs, inspects, fixes, records. You go from the person turning the wrench to the person who designed the assembly line.

One thing to keep in mind: in operations, safety is everything. Without guardrails and human-in-the-loop, an autonomous loop can cause real damage. I'm not a fan of "fully automated" ops loops, and I'll show you why.

This article walks through building a safe ops loop using a host health check example. Most of the routine ops work we do every day can be packaged this way.

Loop Engineering, Broken Down

If you take Loop Engineering apart, you get six pieces.

Piece	What it does
Automations	Scheduled or conditional triggers. The loop runs itself.
Worktrees	Multiple agents working in parallel, isolated checkouts, no stepping on each other.
Skills	Project knowledge written down so you don't re-explain everything every session.
Connectors	Hooks into real systems — SSH, APIs, databases. Actually executes.
Sub-agents	Separate the builder from the reviewer. The one who wrote it is too nice grading it.
State	Remembers what happened across runs. The agent forgets. Files don't.

Claude Code and Codex ship most of these built-in now. You don't build from scratch.

But there's one piece the tools can't do for you.

Writing down the loop's structure and constraints.

Sure, you can type a one-off command (like Claude Code's /goal). Two problems with that.

1. Prompts leave gaps. How many steps per round? How do you know it's done? How do steps hand off to each other? What stops it from going off the rails? Nobody forces you to think about these things, so you can skip them. agent-runbook, introduced below, forces you to answer every one of them. Skip one, and the compiler yells at you.
2. Not reusable. You run it once, it's gone. Next cluster, next person, start over.

You need something that turns your loop's steps, end conditions, and guardrails into a file. Not a prompt you type and forget — a contract you commit to the repo. Next run, next person, same file, same result.

agent-runbook: Writing Loops as Contracts

agent-runbook is an open-source project on GitHub: github.com/KnoxOps/agent-runbook. Its entire philosophy fits in one sentence: constrain agents with contracts, not prompts and hope.

A few key design decisions.

Contract-first. You write a YAML file that declares your loop's steps, outputs, dependencies, and guardrails. Not a vague "go check the servers" prompt — every step's input/output format, execution order, and boundary conditions are locked down. The compiler validates before generating the skill file: missing schemas, circular dependencies, references to non-existent outputs — all caught at build time. Nothing blows up halfway through a run.
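
To make the build-time checks concrete, here is a minimal sketch of that kind of validation in Python. It is not agent-runbook's actual compiler: the function name `validate_steps` and the error wording are made up for illustration, and it only covers unknown dependencies, outputs without a schema, and circular dependencies.

```python
from typing import Dict, List


def validate_steps(steps: List[dict]) -> List[str]:
    """Return a list of contract errors; an empty list means the steps are valid."""
    errors: List[str] = []
    ids = {step["id"] for step in steps}

    # Every depends_on entry must name a step that exists.
    for step in steps:
        for dep in step.get("depends_on", []):
            if dep not in ids:
                errors.append(f"{step['id']}: depends on unknown step '{dep}'")

    # Every declared output must carry a schema.
    for step in steps:
        for out in step.get("output", []):
            if not out.get("schema"):
                errors.append(f"{step['id']}: output '{out.get('file')}' has no schema")

    # Depth-first search to detect circular dependencies.
    graph: Dict[str, List[str]] = {
        s["id"]: [d for d in s.get("depends_on", []) if d in ids] for s in steps
    }
    state: Dict[str, int] = {}  # absent = unvisited, 1 = in progress, 2 = finished

    def visit(node: str) -> bool:
        if state.get(node) == 1:
            return True  # reached a node we are still inside: that is a cycle
        if state.get(node) == 2:
            return False
        state[node] = 1
        found = any(visit(dep) for dep in graph[node])
        state[node] = 2
        return found

    for step_id in graph:
        if visit(step_id):
            errors.append(f"circular dependency involving '{step_id}'")
            break
    return errors
```

Run against a list of step dicts shaped like the YAML `steps` entries, an empty result means the contract passes these three checks.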

Human in the loop. Ops isn't testing. In staging, the agent can auto-fix everything. In production? No. Write operations stay behind a human gate. The agent can SSH in and investigate. It can draft a plan. It can list the commands. But you decide whether those commands actually run. This isn't a suggestion tucked into a prompt. It's a hard node in the step structure. Not approved, not proceeding.
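
A rough sketch of what a hard gate can look like, in Python. This is an illustration, not agent-runbook's implementation: the plan keys (`host`, `risk`, `commands`) and the idea of tying the approval to a hash of the plan are assumptions added here, so that an approval for one plan can't be reused for a different one.

```python
import hashlib
import json
from typing import Callable, Optional


def plan_fingerprint(plan: dict) -> str:
    """Stable hash of a plan, so an approval is tied to the exact commands."""
    canonical = json.dumps(plan, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def approval_gate(plan: dict, ask: Callable[[str], str] = input) -> Optional[dict]:
    """Show the plan and return an approval record, or None if not approved."""
    print(f"Host: {plan['host']}  Risk: {plan['risk']}")
    for command in plan["commands"]:
        print(f"  $ {command}")
    answer = ask('Type "approve" to execute, or "reject" to skip: ')
    if answer.strip().lower() != "approve":
        return None  # anything other than an explicit approval blocks execution
    return {"status": "approved", "plan_sha256": plan_fingerprint(plan)}


def may_execute(plan: dict, approval: Optional[dict]) -> bool:
    """Execution is allowed only when the approval matches this exact plan."""
    return approval is not None and approval.get("plan_sha256") == plan_fingerprint(plan)
```

The point of `may_execute` is that the executor checks the gate itself instead of trusting that the previous step behaved. No approval record, or a record for a different plan, means nothing runs.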

External memory. The agent's context doesn't reliably carry over from one run to the next, so expecting it to remember what happened last round is wishful thinking. The loop stores every round's output, iteration history, and overall progress in files. A status file tracks which steps are done. An iteration log records what got fixed each round. The agent forgets. The files don't.
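
Here is a small Python sketch of the status file and iteration log idea. The file layout is an assumption for illustration (a `steps` map in the status file, one JSON object per line in the log); agent-runbook's real files may look different.

```python
import json
from pathlib import Path
from typing import List, Optional


def mark_step(status_path: Path, step_id: str, status: str) -> dict:
    """Record one step's status in the status file and return the full state."""
    if status_path.exists():
        state = json.loads(status_path.read_text(encoding="utf-8"))
    else:
        state = {"steps": {}}
    state["steps"][step_id] = status
    status_path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


def next_step(status_path: Path, order: List[str]) -> Optional[str]:
    """Return the first step not marked done, which is where a resume starts."""
    done = {}
    if status_path.exists():
        done = json.loads(status_path.read_text(encoding="utf-8"))["steps"]
    for step_id in order:
        if done.get(step_id) != "done":
            return step_id
    return None


def append_iteration(log_path: Path, entry: dict) -> None:
    """Append one round's summary to the log as a single JSON line."""
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
```

After a crash, `next_step` answers "where was I?" from disk, without asking the agent to remember anything.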

Mandatory guardrails. Writing "don't touch the config" in a prompt is a request. The agent can ignore it. agent-runbook guardrails are enforced checks — after every round, an independent review verifies the agent didn't overstep. Modified files it shouldn't have? Ran commands outside the plan? That round doesn't count as complete.
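
As an illustration of an enforced check, here is a minimal Python sketch that compares what actually ran against what was approved. In agent-runbook the review is done by an independent reviewer against the declared rules; this hypothetical `review_round` function only shows the shape of the decision, assuming both sides are available as lists of command strings.

```python
from typing import List


def review_round(approved: List[str], executed: List[str]) -> dict:
    """The round counts as complete only if nothing ran outside the plan."""
    unplanned = [cmd for cmd in executed if cmd not in approved]
    not_run = [cmd for cmd in approved if cmd not in executed]
    return {
        "complete": not unplanned,  # any unplanned command fails the round
        "unplanned": unplanned,
        "not_run": not_run,  # reported for the human, e.g. after a failed command
    }
```

A command that was approved but never ran is reported without failing the round, since stopping after a failed command is allowed. A command that ran without approval always fails it.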

Built-in brakes. Every loop must declare what "done" looks like, plus a hard iteration cap. The loop stops. It doesn't run forever.
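
The two brakes fit in a few lines of Python. This is a sketch of the control flow only, with hypothetical names (`run_loop`, `goal_met`, `run_round`), not the code agent-runbook generates.

```python
from typing import Callable


def run_loop(goal_met: Callable[[], bool], run_round: Callable[[int], None],
             max_iterations: int) -> dict:
    """Run rounds until the goal holds or the cap is hit, whichever comes first."""
    for iteration in range(1, max_iterations + 1):
        if goal_met():
            return {"status": "goal_met", "rounds": iteration - 1}
        run_round(iteration)
    # One last check: the final round may have fixed the last issue.
    if goal_met():
        return {"status": "goal_met", "rounds": max_iterations}
    return {"status": "max_iterations_reached", "rounds": max_iterations}
```

Either way the function returns. The caller always gets a status it can act on, and the two outcomes stay distinguishable.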

Files over context for step communication. Don't count on the agent remembering what the previous step said. Every step's output follows a JSON Schema. The next step validates as it reads. A field mismatch fails immediately — no silent data corruption.
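
To show what "validates as it reads" means, here is a deliberately tiny stand-in in Python. It checks only `required` fields and basic `type` keywords, which is a small subset of JSON Schema; a real setup would use a full JSON Schema validator. The function names and the example schema shape are assumptions.

```python
import json

_TYPES = {"string": str, "integer": int, "number": (int, float),
          "boolean": bool, "array": list, "object": dict}


def _matches(value, type_name) -> bool:
    expected = _TYPES.get(type_name)
    if expected is None:
        return True  # unknown type keyword: this sketch does not judge it
    if isinstance(value, bool) and type_name in ("integer", "number"):
        return False  # bool is a subclass of int in Python, but not in JSON
    return isinstance(value, expected)


def check(document: dict, schema: dict) -> list:
    """Return mismatches against a tiny subset of JSON Schema."""
    problems = []
    for key in schema.get("required", []):
        if key not in document:
            problems.append(f"missing required field '{key}'")
    for key, rule in schema.get("properties", {}).items():
        if key in document and not _matches(document[key], rule.get("type")):
            problems.append(f"field '{key}' should be {rule['type']}")
    return problems


def read_step_input(raw: str, schema: dict) -> dict:
    """Parse a previous step's output and fail fast on any mismatch."""
    document = json.loads(raw)
    problems = check(document, schema)
    if problems:
        raise ValueError("; ".join(problems))
    return document
```

If the previous step wrote `"total_issues": "0"` as a string, the next step raises right away instead of carrying the wrong type forward.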

Hands-on: A Host Health Check Loop

Real scenario. You've got a fleet of production servers. Regular health checks: disk, memory, load, critical services. When something's wrong, you can't just let the agent fix it automatically — production write operations require human approval.

Each round of this loop runs five steps:

1. Pick the most critical issue from the inspection results
2. SSH in, investigate the root cause, draft a fix plan
3. Present the plan, wait for approval
4. Execute the approved fix
5. Re-inspect to confirm the issue is resolved

Here's what it looks like as an Agent Runbook YAML:

name: host-health-loop
description: Ansible health check discovers issues → fix one by one (write ops require human approval) → re-inspect → repeat until all healthy → generate HTML report

input_params:
  - name: inventory
    type: string
    required: false
    description: Path to Ansible inventory file, defaults to ansible/inventory.ini
  - name: playbook
    type: string
    required: false
    description: Path to health check playbook, defaults to ansible/health_check.yml

steps:
  - id: inspect
    type: script
    description: Run Ansible playbook to inspect all hosts and generate issue list
    command: |
      ansible-playbook -i {inventory} {playbook} 2>&1
      mv /tmp/host_issues.json host_issues.json
    output:
      - file: host_issues.json
        schema: schemas/host_issues.schema.json
    depends_on: []

  - id: fix_loop
    type: loop
    description: >
      Fix loop: each round picks the most critical issue, produces a plan,
      waits for human approval, executes the fix, then re-inspects all hosts.
      Repeats until no issues remain or max iterations reached.
    goal: "host_issues.json has total_issues equal to 0 (all hosts healthy)"
    max_iterations: 10
    depends_on: [inspect]
    body:
      - id: select_issue
        type: inline
        description: Select the single most critical issue from host_issues.json
        prompt: |
          Read host_issues.json and schemas/selected_issue.schema.json.

          Pick the SINGLE most critical issue using this priority order:
          1. critical disk (highest priority)
          2. critical service_down
          3. critical memory
          4. critical load
          5. warning disk
          6. warning service_down
          7. warning memory
          8. warning load (lowest priority)

          Write the selected issue to selected_issue.json following the schema.

          If host_issues.json has total_issues == 0, write {"done": true} instead.
        depends_on: []
        output:
          - file: selected_issue.json
            schema: schemas/selected_issue.schema.json

      - id: plan_action
        type: agent
        description: Analyze the selected issue and create a concrete remediation plan (no execution)
        prompt: |
          Read selected_issue.json and schemas/pending_action.schema.json.

          STEP 1 — Investigate: SSH to the target host and check the actual situation.
          Understand the root cause before writing any plan. Do NOT guess or write generic commands.

          STEP 2 — Plan: Based on your findings, write a concrete remediation plan
          to pending_action.json following the schema.

          Do NOT execute anything. Only investigate and write the JSON file.
          If selected_issue.json contains {"done": true}, write {"done": true} to pending_action.json.
        depends_on: [select_issue]
        output:
          - file: pending_action.json
            schema: schemas/pending_action.schema.json

      - id: approve
        type: inline
        description: Human approval gate — present the remediation plan and wait for confirmation
        prompt: |
          Read pending_action.json.

          If it contains {"done": true}, skip this step.

          Otherwise, PRESENT the remediation plan from pending_action.json to the human:

          ---
          ## Awaiting Approval
          Present the host, issue, risk level, and commands from pending_action.json.

          Type "approve" to execute, or "reject" to skip this issue.
          ---

          WAIT for the human to respond. Do NOT proceed without explicit approval.
          If approved, copy pending_action.json to approved_action.json.
          If rejected, write skip_action.json with {"status": "rejected", "reason": "rejected by human"}.
        depends_on: [plan_action]

      - id: execute
        type: agent
        description: Execute the approved remediation
        prompt: |
          Check for approved_action.json. If it exists, read it.

          If skip_action.json exists instead, do NOT execute — write execute_result.json
          following schemas/execute_result.schema.json with status "skipped".

          Execute the remediation: SSH into the target host and run the commands
          listed in approved_action.json.

          Rules:
          - Use the exact commands from the plan, do not improvise
          - If a command fails, do NOT retry
          - After execution, verify the result (e.g., check service status, check disk usage)
          - Write execute_result.json following schemas/execute_result.schema.json

          If pending_action.json had {"done": true}, write {"status": "all_done"} to execute_result.json.
        depends_on: [approve]
        output:
          - file: execute_result.json
            schema: schemas/execute_result.schema.json
        quality_check:
          blocking: true
          rules:
            - "Commands were executed exactly as specified in approved_action.json"
            - "No destructive commands were improvised beyond the plan"
            - "Verification was performed after execution"

      - id: re_inspect
        type: script
        description: Re-run inspection to refresh the issue list
        command: |
          ansible-playbook -i {inventory} {playbook} 2>&1
          mv /tmp/host_issues.json host_issues.json
        output:
          - file: host_issues.json
            schema: schemas/host_issues.schema.json
        depends_on: [execute]

  - id: generate_report
    type: agent
    description: Generate a polished HTML inspection and remediation report
    prompt: |
      Read the final host_issues.json to get the current health status.
      Also read all execute_result.json files from the workspace to understand what was fixed.

      Generate a single, self-contained, beautiful HTML report: health_report.html

      The report should include:
      - Overall health status (ALL CLEAR or ISSUES REMAINING)
      - Summary stats: hosts checked, total issues found and resolved
      - Remediation timeline: each iteration with host, issue, investigation findings,
        commands executed, and before/after comparison
      - Final per-host status

      Design requirements:
      - Dark theme, modern dashboard style
      - CSS grid/flexbox, no external dependencies
      - Mobile-responsive
      - Professional and visually impressive — this is a production report
      - Include all CSS inline in a <style> tag
    depends_on: [fix_loop]
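
The priority order in the select_issue prompt above is simple enough to express as a sort key. Here is a Python sketch of the same ordering. The issue keys (`severity`, `category`, `host`) are assumptions, since the contents of host_issues.schema.json aren't shown in this article.

```python
from typing import Optional

# Same order as the prompt: severity first, then category within a severity.
SEVERITY_ORDER = ["critical", "warning"]
CATEGORY_ORDER = ["disk", "service_down", "memory", "load"]


def select_issue(issues: list) -> Optional[dict]:
    """Return the single most critical issue, or None when the list is empty."""
    if not issues:
        return None
    return min(
        issues,
        key=lambda issue: (
            SEVERITY_ORDER.index(issue["severity"]),
            CATEGORY_ORDER.index(issue["category"]),
        ),
    )
```

Spelling the order out this way is also a handy cross-check on the agent: given the same issue list, its pick should match what this function returns.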

Write the YAML, compile it with one command:

python3 -m agent_runbook generate runbook.yaml -o output/

Drop the generated skill into Claude Code or Codex, and it runs.

The generated SKILL.md is about 250 lines. Here's a condensed version of the key sections (full version here):

---
name: host-health-loop
description: Ansible health check → fix → re-inspect → repeat until all healthy → HTML report
user-invocable: true
---

## Execution Flow

### Task Context
Initializes task_context.json to track every step's status. Update after each step.
Crash mid-run? Resume from the last completed step.

### Step 1: inspect
**Type:** script
Runs ansible-playbook to inspect all hosts, produces host_issues.json.

### Step 2: fix_loop
**Type:** loop
**Goal:** host_issues.json total_issues equals 0
**Max Iterations:** 10

Each round:
1. select_issue — pick the most critical issue by priority
2. plan_action — SSH investigation, draft a fix plan (no execution)
3. approve — present the plan, wait for human approval
4. execute — run approved commands, quality_check verifies boundaries
5. re_inspect — re-run inspection, refresh the issue list

## Goal Evaluation
1. Goal met → mark complete, proceed to next step
2. Goal not met, iterations remain → start next round
3. Max iterations reached → mark complete, report remaining issues

Append to iteration_history after each round.

### Step 3: generate_report
**Type:** agent
Reads final inspection results and fix records. Generates a dark-themed dashboard HTML report.

...

Here's the thing: that YAML you just read is literally declaring all six pieces of Loop Engineering. You declare it. Claude Code or Codex runs it.

What you declare in YAML	What Claude Code / Codex does automatically
Who runs each step, what they do, in what order	Spawns agents in declared order, each in an isolated worktree, cleaned up after
Output format requirements for each step	Validates outputs after each step. Wrong format? Immediate error.
How to connect external systems (SSH, Ansible, APIs)	Connects via MCP to real servers. Actually executes.
What "done" means, max rounds	Evaluates goal after every round. Hits the cap? Stops. No wasted tokens.
Approval required, guardrails enforced	Shows write commands. Waits for your approval. Checks boundaries after every round. Overstep? Round doesn't count.

Five Rounds, All Green

This example was tested on three bare-metal machines. The initial inspection found multiple issues spread across all three hosts.

Round one hit eval-bare-vm-3. Root disk at 93%, only 1.4G left. The agent SSH'd in and found the culprits: a Docker JSON log at 5.4G, 4.5G of temp files in /tmp, 215M of application logs, ~800M of container cache. After approval, it cleaned up — truncated the Docker log, cleared /tmp, cleaned apt cache, vacuumed journald to 50M. Freed up about 10G. Disk dropped from 93% to 38%.

The next few rounds were nginx and docker down across the three machines — services someone had manually stopped and never brought back. Each round: investigate → plan → approve → execute → verify. Five rounds total. All green. Loop stopped on its own.

The final step generated a dark-themed dashboard HTML report.

Notice every single write operation went through approval. No automatic execution. A human saw every command before it ran. In production, you don't skip this.

How to Design a Good Ops Loop

Pick the right task. Good ops loops have objective feedback signals — inspection results, metrics, health check endpoints. Each round can build on the last. Host health checks, certificate expiry scans, K8s pod restart loops, Prometheus alert storm classification, log anomaly pattern matching — all solid candidates. Capacity planning and architecture changes need global judgement. Don't shove those into a loop.

Write "done" so a machine can decide. "Issue list is empty" is a good end condition. "The cluster is healthy" is not. The agent has to read a file or run a command and get a true/false answer. When you're writing the condition, ask yourself: can a script determine this in one line? If not, your agent probably can't either.
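
For the health check loop, the machine-decidable version of "done" could look like this in Python. It is a sketch: `total_issues` comes from the runbook's goal, while the function name and the choice to treat an unreadable file as "not done" are assumptions.

```python
import json
from pathlib import Path


def loop_is_done(issues_file: Path) -> bool:
    """True only when the inspection output reports exactly zero issues."""
    try:
        report = json.loads(issues_file.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # a missing or unreadable file is not evidence of health
    if not isinstance(report, dict):
        return False
    value = report.get("total_issues")
    return type(value) is int and value == 0
```

Failing closed matters here. If the inspection crashed and wrote nothing, the loop should not conclude that everything is healthy.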

Pass data through files, not memory. Each step's output is the next step's input, and its format is constrained, so a mistake gets caught immediately. Agents have terrible memory: in a long context, they forget. Files don't.

Write operations need a human gate. Staging can be fully automated. Production cannot. The approval step is the safety valve for the entire loop. Write it into the contract; that's far more reliable than burying it in a prompt.

Set a ceiling. Max iterations isn't your target — it's the circuit breaker that says "something's wrong." A healthy loop converges well below the limit. If you hit the cap, some issue can't be fixed or the inspection keeps false-flagging. Time for a human.
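
When the cap does trip, the iteration history can tell the human where to look first. Here is a small Python sketch that flags issues the loop kept coming back to. The history entry keys (`host`, `issue`) and the threshold are assumptions for illustration.

```python
from collections import Counter


def repeated_issues(history: list, threshold: int = 2) -> list:
    """List (host, issue) pairs that were selected at least `threshold` times."""
    counts = Counter((entry["host"], entry["issue"]) for entry in history)
    return sorted(pair for pair, seen in counts.items() if seen >= threshold)
```

A pair that shows up repeatedly is either a fix that doesn't stick or an inspection that keeps false-flagging. Both are cases where another round won't help.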

Wrapping Up

agent-runbook is intentionally lightweight. It's not a full Loop Engineering implementation. It does one thing: writes your loop structure as a declarative file. Claude Code or Codex handles the rest.

You don't need to start from scratch. The examples directory has the full host health check loop — runbook file, inspection scripts, actual fix history from real runs.

Not just host health checks. Certificate expiry scans, K8s node health checks, log archival cleanup, database backup verification, middleware config compliance checks — if your ops task breaks down into "steps + contracts + dependencies," it can be a loop.

Repo at github.com/KnoxOps/agent-runbook. If you've got routines where you SSH into servers, check for problems, and fix them by hand — try writing that flow as a declaration and let the tool run it. It'll be more disciplined than you are.
