~K¹yle Million

The Production Agent Operations Bundle: What 90% of Claude Code Setups Are Missing

The Five Failure Modes That Hit Real Production Setups

1. Context collapse mid-task

Your agent is 35 steps into a 60-step task. It hits context limit. Compaction kicks in. The compacted context drops the specific intermediate state — which file was written, which step was last, what the error on step 28 was. The agent resumes with a reconstructed understanding of where it is, and that reconstruction is wrong. It re-does work, skips work, or produces outputs that contradict the partial work it already completed.

The compaction is not the problem. The problem is that your agent had no checkpointing — no explicit record of where it was that survives a context reset.

2. Infinite loops with no circuit breaker

The task fails. The agent retries. Same failure. Retry. Same failure. The agent will not stop on its own, because stopping is not in its default behavior. It will retry until context exhausts, then compact and retry again. A permission denied error on step 3 will get retried 80 times before the run terminates. You pay for all 80 retries.

3. Shell injection via unvalidated tool calls

Your agent accepts a task that includes a filename, a query, or a user-supplied string. It passes that string directly to a bash call: os.system(f"process_file.sh {filename}"). If filename is file.txt; rm -rf outputs/, your agent just destroyed your output directory. If the string is piped in from an external source, the attack surface is real.

Most Claude Code bash usage never validates inputs before shell execution. Most demos don't catch this because the inputs are controlled. Production inputs are not.

4. Concurrent agents corrupting shared state

You have two agents running in parallel. Both are writing to outputs/weekly_report.md. Agent A writes its section. Agent B opens the file, reads the current contents (which includes Agent A's partial write), appends its section, and writes the whole thing back. Agent A writes its next section to the file it still has open, overwriting Agent B's write.

Non-atomic writes with no locking produce corrupted output with no error. No exception is raised. The file exists. The contents are wrong.
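The torn-write half of this failure has a narrow, well-known mitigation: write to a temp file in the same directory, then atomically rename over the target. This is a minimal sketch (the function name is my own); it prevents readers from ever seeing a half-written file, though it does not prevent the lost-update race between two writers — that requires the isolation pattern described later.

```python
import os
import tempfile

def atomic_write(path: str, content: str):
    """Write to a temp file beside the target, then atomically replace it.
    Readers never observe a partially written file; last writer still wins."""
    dir_name = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(content)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on both POSIX and Windows
    except Exception:
        os.unlink(tmp_path)
        raise
```

The temp file must live in the same directory as the target — os.replace is only atomic within a single filesystem.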

5. Coordinator handoff losing task state

Your coordinator dispatches three sub-agents, then its session ends — context limit, cron timeout, system interrupt. A new coordinator starts on the next cron tick. It has no idea which sub-agents already completed. It re-dispatches all three. Sub-agent 1 runs again, producing duplicate output. Sub-agent 2 conflicts with its own still-running previous instance. Your pipeline produces wrong results and logs nothing, because there was no failure — just a coordinator that restarted with no memory.


What Doesn't Work and Why

The instinct when any of these hits is to add error handling. Wrap things in try/except, add a retry loop, restart on failure. These are patches, not fixes. Here's why each one falls short:

"Just add error handling" catches exceptions but doesn't solve loop termination. Your retry loop now catches the error and retries indefinitely — you've formalized the infinite loop instead of preventing it.

"Restart on failure" is the coordinator pattern that causes state loss. Each restart wipes context. Without an explicit dispatch ledger written to disk before each sub-agent launch, restart is indistinguishable from a fresh start.

"Check output file existence" to infer completion has multiple failure modes: partial writes leave valid-looking files, a previous interrupted run may have left a file from a different context, and the same task may need to run multiple times. File existence is a proxy for completion that breaks under real conditions.

"Sanitize inputs in the prompt" relies on the model to perform security validation. That's not the right layer. Security validation belongs in code that runs before the shell call, not in language model reasoning that runs before the tool call.

"Use a lock file" for concurrent writes is the right idea but is almost always implemented incorrectly — lock files that survive crashes leave all subsequent agents blocked, and there's no cleanup logic because the crash that created the problem also prevented the cleanup.
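To make the failure concrete: a lock that survives crashes needs a staleness rule baked in, not a cleanup step that the crash can skip. This is a sketch under stated assumptions (single host, no legitimate hold longer than five minutes; names and threshold are my own):

```python
import os
import time

LOCK_STALE_SECONDS = 300  # assumption: no legitimate hold exceeds 5 minutes

def acquire_lock(lock_path: str) -> bool:
    """Create the lock atomically; reclaim it if the holder looks dead."""
    try:
        # O_EXCL makes creation atomic — exactly one process wins
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        os.write(fd, str(os.getpid()).encode())
        os.close(fd)
        return True
    except FileExistsError:
        # Crash recovery: a lock older than the threshold is presumed abandoned
        try:
            if time.time() - os.path.getmtime(lock_path) > LOCK_STALE_SECONDS:
                os.unlink(lock_path)
                return acquire_lock(lock_path)
        except FileNotFoundError:
            return acquire_lock(lock_path)  # holder released it mid-check
        return False

def release_lock(lock_path: str):
    os.unlink(lock_path)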

The common thread: these fixes address symptoms at the wrong layer. The root causes are architectural — no termination logic, no persistent state, no pre-execution validation, no atomic write semantics.


The Five Architecture Patterns That Fix It

1. Loop Termination with Circuit Breakers

Every production agent needs termination logic at three levels: a hard step limit, an error accumulation counter, and a goal proximity check.

The hard limit is the blunt instrument that catches runaway loops:

class TerminationError(Exception):
    """Raised to stop the run cleanly instead of looping."""

MAX_STEPS = 50
step_count = 0

def execute_step(action):
    global step_count
    step_count += 1
    if step_count >= MAX_STEPS:
        write_state_checkpoint(reason=f"max steps ({MAX_STEPS}) reached")
        raise TerminationError("Hard limit reached")
    return perform_action(action)

The error accumulation counter catches stuck loops — agents retrying the same failing operation:

error_counts = {}
ERROR_THRESHOLD = 3

def handle_error(error_type: str, context: str):
    error_counts[error_type] = error_counts.get(error_type, 0) + 1
    if error_counts[error_type] >= ERROR_THRESHOLD:
        write_escalation(f"BLOCKED: {error_type} failed {error_counts[error_type]}x. Context: {context}")
        raise TerminationError(f"Repeated failure: {error_type}")
    return retry_with_backoff()

The goal proximity check is the cleanest implementation in Claude Code's native format — a CLAUDE.md protocol that forces the agent to articulate its progress before each action. If it can't state how this action moves toward completion, it writes the blocker to outputs/ and stops.

Clean termination writes current state, names the blocker, and exits 0 — stopped is not the same as failed.
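A clean-stop helper might look like the following sketch — the function name, output path, and record fields are illustrative, not part of any Claude Code API. The key detail is the exit code: zero, because the scheduler should treat this as a deliberate stop, not a crash to retry.

```python
import json
import os
import sys
from datetime import datetime, timezone

def stop_cleanly(task_id: str, state: dict, blocker: str):
    """Record where we are and why we stopped, then exit 0.
    Stopped is not the same as failed."""
    os.makedirs("outputs", exist_ok=True)
    record = {
        "task_id": task_id,
        "stopped_at": datetime.now(timezone.utc).isoformat(),
        "state": state,
        "blocker": blocker,
    }
    with open(f"outputs/stopped_{task_id}.json", "w") as f:
        json.dump(record, f, indent=2)
    sys.exit(0)  # exit 0: a clean stop, distinguishable from a crash
```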

2. Memory Isolation for Concurrent Agents

When multiple agents need to read and write shared state, the architecture needs to prevent reads of stale data and prevent concurrent writes from producing corrupted output.

The pattern is task-local working directories with a merge step, not shared output paths:

import os, uuid, shutil

def agent_working_dir(agent_id: str) -> str:
    """Each agent gets its own isolated scratch space."""
    base = os.path.expanduser("~/intuitek/coordination/scratch")
    path = os.path.join(base, agent_id)
    os.makedirs(path, exist_ok=True)
    return path

def merge_agent_outputs(agent_ids: list, output_path: str):
    """Coordinator merges after all agents complete — no concurrent writes."""
    sections = []
    for agent_id in agent_ids:
        scratch = agent_working_dir(agent_id)
        result_file = os.path.join(scratch, "result.md")
        if os.path.exists(result_file):
            with open(result_file) as f:
                sections.append(f.read())
    with open(output_path, "w") as f:
        f.write("\n\n---\n\n".join(sections))

Agents write to their scratch directory. The coordinator merges when all agents report completion. No two agents write to the same path. No locks needed.

For shared state that agents genuinely need to read and update concurrently, the pattern is append-only event logs with a read-once merge, not mutable shared files.
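A minimal sketch of that event-log pattern, assuming a simple key/value event schema of my own choosing: each agent appends one JSON line per update, and a single reader folds the log into current state. Small single-line appends interleave as whole lines rather than corrupting each other.

```python
import json
import os

def append_event(log_path: str, agent_id: str, event: dict):
    """Each update is one appended line — concurrent appenders
    interleave whole lines instead of overwriting each other."""
    line = json.dumps({"agent": agent_id, **event})
    with open(log_path, "a") as f:
        f.write(line + "\n")

def merge_events(log_path: str) -> dict:
    """Read-once merge: fold the full event stream into current state.
    Later events for the same key win."""
    state = {}
    if os.path.exists(log_path):
        with open(log_path) as f:
            for line in f:
                event = json.loads(line)
                state[event["key"]] = event["value"]
    return state
```

Nothing mutates in place, so there is no stale-read window — the merge always reflects every append that completed before it ran.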

3. Coordinator Resume Integrity

Coordinator state must be written to disk before every sub-agent dispatch. Not after — before. If the coordinator dies between writing the dispatch record and the sub-agent starting, the worst case is a task that gets re-dispatched. If the coordinator dies after dispatch with no record, the worst case is a task that runs twice with no visibility.

dispatch_task() {
    local TASK_ID="$1"
    local TASK_PROMPT="$2"

    # Write to ledger before dispatch — not after
    python3 -c "
import json, datetime
with open('$LEDGER') as f: ledger = json.load(f)
ledger['tasks'].append({
    'task_id': '$TASK_ID',
    'status': 'IN_PROGRESS',
    'dispatched_at': datetime.datetime.utcnow().isoformat() + 'Z',
    'completed_at': None
})
with open('$LEDGER', 'w') as f: json.dump(ledger, f, indent=2)
"
    bash ~/intuitek/run_task.sh "$TASK_PROMPT" &
}

startup_coordinator() {
    if [[ -f "$LEDGER" ]]; then
        # Skip tasks already marked COMPLETE
        PENDING=$(python3 -c "
import json
with open('$LEDGER') as f: ledger = json.load(f)
pending = [t['task_id'] for t in ledger['tasks'] if t['status'] != 'COMPLETE']
print('\n'.join(pending))
")
    fi
}

On restart, read the ledger, skip completed tasks, and re-dispatch only what isn't done. Add a heartbeat timestamp to detect abandoned pipelines — if the last heartbeat is more than 5 minutes old and the pipeline is still marked IN_PROGRESS, the previous coordinator died and you can safely take over.
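The heartbeat check can be sketched like this — ledger field names and the helper names are my own, but the rule matches the one above: a pipeline still marked IN_PROGRESS whose heartbeat went silent for more than five minutes is treated as abandoned.

```python
import json
import time

HEARTBEAT_TIMEOUT = 300  # 5 minutes, per the takeover rule above

def should_take_over(ledger_path: str) -> bool:
    """An IN_PROGRESS pipeline with a silent heartbeat is abandoned."""
    with open(ledger_path) as f:
        ledger = json.load(f)
    if ledger.get("status") != "IN_PROGRESS":
        return False
    return time.time() - ledger.get("last_heartbeat", 0) > HEARTBEAT_TIMEOUT

def beat(ledger_path: str):
    """The live coordinator calls this on every loop iteration."""
    with open(ledger_path) as f:
        ledger = json.load(f)
    ledger["last_heartbeat"] = time.time()
    with open(ledger_path, "w") as f:
        json.dump(ledger, f, indent=2)
```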

4. Bash Security Validation Before Shell Execution

Every string that comes from outside your agent's direct control — task inputs, file paths, query parameters, content extracted from external sources — must be validated before it touches a shell call.

The validation layer runs in Python before the subprocess call:

import re, subprocess, shlex

class SecurityError(Exception):
    """Raised when a value fails validation before reaching the shell."""

SAFE_FILENAME = re.compile(r'^[\w\-\.]+$')
SAFE_PATH_COMPONENT = re.compile(r'^[\w\-\./]+$')

def safe_shell_exec(command_template: str, **kwargs):
    """Validate all interpolated values before shell execution."""
    for key, value in kwargs.items():
        if 'path' in key or 'file' in key:
            if not SAFE_PATH_COMPONENT.match(str(value)):
                raise SecurityError(f"Unsafe path in {key}: {value!r}")
        elif 'name' in key:
            if not SAFE_FILENAME.match(str(value)):
                raise SecurityError(f"Unsafe filename in {key}: {value!r}")

    cmd = command_template.format(**kwargs)
    result = subprocess.run(
        shlex.split(cmd),
        capture_output=True, text=True, timeout=30
    )
    return result

The important detail is shlex.split() rather than passing the command string directly to shell=True. shell=True is the vector. shlex.split() with shell=False tokenizes the command safely and passes it as an argument list, which prevents shell metacharacter injection even if a value slips through validation.

For agent-facing tools that accept arbitrary inputs, add a denylist for shell metacharacters as a second layer: ;, |, &&, $(), backticks, and > in unexpected positions are all injection indicators.
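That second layer can be as small as a single regex over the raw value before it is interpolated anywhere — a sketch, with a denylist of my own choosing; a real deployment would tune it to the tool's inputs:

```python
import re

# Shell metacharacters that signal injection in values that should be plain data.
# The class also catches $( ), &&, and || since each contains one of these chars.
DENYLIST = re.compile(r'[;&|<>`$\n]')

def looks_like_injection(value: str) -> bool:
    """Second layer after allowlist validation: reject obvious shell syntax."""
    return bool(DENYLIST.search(value))
```

A denylist alone is not sufficient — it backstops the allowlist patterns above, it does not replace them.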

5. Context Compaction Checkpointing

When an agent runs a task that requires more steps than a single context window, it needs to write explicit checkpoints — structured state records that survive compaction and allow resumption at the right point.

The checkpoint is written before any operation that changes state, and read at session start:

import json, os
from datetime import datetime

CHECKPOINT_PATH = "outputs/checkpoint_{task_id}.json"

def write_checkpoint(task_id: str, state: dict):
    """Call before any state-changing operation."""
    checkpoint = {
        "task_id": task_id,
        "checkpoint_at": datetime.utcnow().isoformat() + "Z",
        "completed_steps": state.get("completed_steps", []),
        "current_step": state.get("current_step"),
        "outputs_written": state.get("outputs_written", []),
        "context_summary": state.get("context_summary", ""),
    }
    path = CHECKPOINT_PATH.format(task_id=task_id)
    with open(path, "w") as f:
        json.dump(checkpoint, f, indent=2)

def load_checkpoint(task_id: str) -> dict | None:
    path = CHECKPOINT_PATH.format(task_id=task_id)
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return None

In Claude Code's native CLAUDE.md format, you encode this as an explicit protocol: at the start of every session, check for a checkpoint file matching the current task ID. If found, read it, report where execution left off, and continue from current_step rather than from the beginning.

The context_summary field is the most important part. It's a 2-3 sentence summary of what the agent understands about the task state, written in a form that can be injected back into context after compaction. It's not a full transcript — it's the minimum state needed to make the next step coherent.
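The session-start side of that protocol might look like this sketch (the helper name is my own; the checkpoint path and fields match the format above): load the checkpoint if one exists, and hand back the resume step plus the summary to re-inject into context.

```python
import json
import os

def resume_point(task_id: str) -> tuple:
    """At session start: report where execution left off, or start fresh.
    Returns (current_step, context_summary); (None, "") means fresh start."""
    path = f"outputs/checkpoint_{task_id}.json"
    if not os.path.exists(path):
        return None, ""
    with open(path) as f:
        cp = json.load(f)
    # The summary is re-injected into context so the next step is coherent
    return cp.get("current_step"), cp.get("context_summary", "")
```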


When to Use the Bundle vs. Building From Scratch

Build from scratch if:

  • Your agent runs a single short task (under 20 steps) with no concurrent instances
  • All inputs are fully controlled — no external sources, no user-supplied strings reaching shell calls
  • The agent runs once and terminates; no scheduled re-runs, no coordinator/sub-agent pattern

Use the bundle if:

  • You're running agents on a cron schedule where each run may pick up from where the last one left off
  • You're running two or more agents in parallel that share any output paths or state
  • Any task input — including file paths, query parameters, or content the agent reads from external sources — reaches a bash or subprocess call
  • You're building a coordinator that dispatches sub-agents
  • You've already hit any of the five failure modes described above

The patterns aren't complicated individually. The difficulty is in the details: the exact order of operations for a write-before-dispatch ledger, the edge cases in lock file cleanup, the difference between shell=True and argument list subprocess calls that actually blocks injection. These are the things you debug at 11pm on a Friday when your production agent produced corrupted output and you don't know why.


The Honest Take

None of this is new architecture. Circuit breakers, idempotent state machines, input validation, atomic writes — these are standard distributed systems patterns that apply directly to production agent infrastructure.

The reason most Claude Code setups don't have them is not complexity. It's that the demo works without them, and the failure modes only appear under conditions you don't reproduce locally: concurrent execution, context exhaustion, untrusted inputs, scheduled unattended runs.

If you're at the point where Claude Code agents are part of your production infrastructure and not just experiments, these patterns are not optional. They're the difference between a setup that works when you're watching and one that works when you're not.


I packaged all five as a single ClawMart skill bundle — ready to drop into any Claude Code project: https://www.shopclawmart.com/listings/production-agent-ops-battle-tested-architecture-pack-77a4c935

$69. Instant download. One-time purchase.


Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Tags: claudecode, devtools, aiagents, productivity
