Manoranjan Rajguru

Posted on May 21

Harness Engineering: How to Build Production-Ready LLM Agents That Actually Work

#agents #llm #python #ai

Published: May 21, 2026 · 15 min read · Deep Dive

💡 TL;DR: An 8B model fails 47% of multi-step agentic tasks out of the box. Add a reliability harness (guardrails + context manager + step enforcer) and it succeeds 99% of the time. The bottleneck was never the model. This post teaches you to build that harness from scratch in Python.

1. The Benchmark That Should Embarrass Everyone
2. The Paradigm Shift: Everything Is Agentic Now
3. What Is a Harness? The Model Is Not the Agent
4. Anatomy of the Agent Loop
5. The Context Window: Your Scarcest Resource
6. Guardrails: The Reliability Stack That Changes Everything
7. Memory Across Sessions: CLAUDE.md, AGENTS.md, and Handoff Artifacts
8. Tool Permissions and Security Enforcement
9. Multi-Agent Orchestration with SlotWorker Patterns
10. Benchmarking Your Harness: Building a 26-Scenario Eval Suite
11. Conclusion: The Harness Is the Product

1. The Benchmark That Should Embarrass Everyone {#the-benchmark}

Here's a number worth sitting with: 53%.

That's the baseline task-completion rate of a state-of-the-art 8-billion-parameter model, Ministral-3 8B Instruct, on a 26-scenario multi-step agentic evaluation suite. Nearly half (47%) of all complex, multi-tool workflows fail outright. Not degrade gracefully. Fail.

Now here's the follow-up number: 99%.

That's what the same model achieves after adding a thin reliability layer — a harness — that handles rescue parsing, retry nudges, step enforcement, and context budget management. No model fine-tuning. No bigger GPU. No API call to GPT-5. Same weights, same hardware, a near-perfect success rate.

This is the finding dominating AI developer discussions this week, driven by the open-source Forge framework going viral on Hacker News (652 upvotes, 239 comments). It crystallises something the industry has been circling around for months: the bottleneck in agentic AI is not the model — it's the engineering around the model.

That engineering discipline has a name now: LLM agent harness engineering. And if you're building anything agentic in 2026, it is the most important skill you need to develop.

2. The Paradigm Shift: Everything Is Agentic Now {#paradigm-shift}

Before we go deep on harnesses, it's worth understanding why this matters right now and not two years ago.

In 2024, building an AI agent was an exceptional architectural choice — you'd reach for LangChain or AutoGPT when your use case genuinely required multi-step reasoning with tool use. The default was still prompt-in, response-out.

In 2026, agentic capability is baked into the models themselves. GPT-5.2 Thinking, Claude Opus 4.5, Gemini 3 Pro, and Qwen3.7-Max all generate structured plans in their reasoning traces, maintain goal states across turns, and autonomously select tools without being explicitly instructed to. Agentic behaviour is now the default mode of operation, even though (as the next section argues) the model alone is not the whole agent. What you're shipping when you deploy one of these models is, by default, an agentic system.

The proof is in this week's release of Qwen3.7-Max, which Alibaba explicitly frames as "The Agent Frontier" — not "the chat frontier," not "the coding frontier." The agent frontier. The framing is intentional: they're signalling that the primary consumption surface for frontier models is autonomous task execution, not conversational Q&A.

Meanwhile, DeepSeek just announced they're standing up a dedicated "Harness" team to build DeepSeek Code — a Claude Code / OpenAI Codex competitor. Their job listings explicitly require knowledge of "agent loops, MCP, multi-agent systems, and context engineering." The most formidable research lab in open-source AI is betting that the harness is where the value gets created, not the weights.

The implication for developers is clear: you are no longer building on top of AI, you are building the architecture around AI. The model handles intelligence; you handle everything else.

The shift isn't from dumb tools to smart tools. It's from tools that answer questions to tools that complete tasks. That distinction demands completely different engineering.

3. What Is a Harness? The Model Is Not the Agent {#what-is-harness}

Let's get precise about terminology, because sloppy language leads to sloppy architecture.

A model is a stateless function: tokens in, tokens out. It has no persistent memory, no ability to execute code, no awareness of time, no access to external systems. Left alone, it is an extraordinarily sophisticated text predictor.

An agent is a system that perceives its environment, plans steps toward a goal, uses tools to act on the world, and adapts based on feedback. An agent has agency.

The harness is everything that transforms the model into the agent:

Agent = Model + Harness

This deceptively simple equation — popularised by DeepSeek's Harness team and echoed throughout Anthropic's engineering blog — is the conceptual foundation of LLM agent harness engineering. The harness is the scaffolding layer consisting of:

Prompt Assembly Engine — constructs the prompt stack from system rules, conversation history, tool results, project instructions, and environmental context
Tool Execution Layer — maps model-generated tool calls to real function executions, captures results, handles errors
Context Manager — enforces token budgets, compacts history, caches reusable prompt segments
Guardrails Stack — validates model outputs, rescues malformed JSON, enforces required workflow steps, triggers retries
Memory System — persists state across sessions via files, databases, or structured handoff artifacts
Security Enforcer — manages permissions: which directories are writable, whether network access is allowed, when user approval is required

None of these components are provided by the model. All of them determine whether your agent succeeds or fails in production.

4. Anatomy of the Agent Loop {#agent-loop}

Every agentic system — from the simplest single-tool assistant to a complex multi-agent research pipeline — implements the same fundamental loop. Understanding this loop is prerequisite knowledge for everything that follows.

Here's the canonical loop in Python (requires Python 3.12+):

# agent_loop.py: the ReAct-style agent loop
# Requires Python 3.12+
from __future__ import annotations


class MaxIterationsExceeded(RuntimeError):
    """Raised when the loop hits its iteration cap without a final answer."""

async def agent_loop(
    harness: "Harness",
    workflow: "Workflow",
    user_input: str,
    max_iterations: int = 20,
) -> str:
    """
    Core agent loop: runs until the model produces a final answer
    or max_iterations is reached (a hard safety guardrail).
    """
    # Step 1: Initialise context with system prompt + user input
    context = harness.context_manager.initialize(
        system_prompt=workflow.system_prompt_template,
        user_message=user_input,
        tool_definitions=workflow.get_tool_definitions(),
    )

    for iteration in range(max_iterations):
        # Step 2: Assemble the full prompt from context layers
        prompt = harness.prompt_assembler.build(context)

        # Step 3: Call the model (stateless inference)
        raw_response = await harness.client.chat(prompt)

        # Step 4: Run guardrails on the raw response
        validated_response = await harness.guardrails.validate(
            raw_response,
            expected_step=workflow.get_expected_step(iteration),
        )

        # Step 5: Final answer or tool call?
        if validated_response.is_final_answer:
            return validated_response.content

        # Step 6: Execute the tool call via the harness (not the model)
        tool_result = await harness.tool_executor.execute(
            tool_name=validated_response.tool_name,
            tool_args=validated_response.tool_args,
            permissions=workflow.permissions,
        )

        # Step 7: Append result to context and loop again
        context = harness.context_manager.append_tool_result(
            context,
            tool_call=validated_response,
            tool_result=tool_result,
        )

    raise MaxIterationsExceeded(
        f"Agent did not complete within {max_iterations} iterations"
    )
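
To watch the same control flow run end to end without a real model, here is a minimal toy version. It is a sketch only: `ScriptedModel`, `TOOLS`, and `run_toy_loop` are hypothetical stand-ins (not Forge APIs), the "model" just replays canned JSON replies, and guardrails and permissions are left out entirely.

```python
import asyncio
import json


class ScriptedModel:
    """Hypothetical stand-in for a model client: replays canned replies."""

    def __init__(self, replies):
        self._replies = list(replies)

    async def chat(self, messages):
        return self._replies.pop(0)


# Tool registry owned by the harness, not the model
TOOLS = {"add": lambda a, b: a + b}


async def run_toy_loop(model, user_input, max_iterations=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_iterations):
        reply = json.loads(await model.chat(messages))
        if "final" in reply:
            return reply["final"], messages
        # The harness executes the tool and feeds the result back
        result = TOOLS[reply["name"]](**reply["parameters"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": json.dumps(result)})
    raise RuntimeError(f"no final answer within {max_iterations} iterations")


model = ScriptedModel([
    '{"name": "add", "parameters": {"a": 2, "b": 3}}',
    '{"final": "2 + 3 = 5"}',
])
answer, transcript = asyncio.run(run_toy_loop(model, "What is 2 + 3?"))
print(answer)
```

The model never touches the `add` function directly; it only emits a request, and the loop decides whether and how to run it.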

The loop looks deceptively simple. The subtlety lives inside every component it delegates to. Let's examine the critical ones.

The Prompt Stack

When harness.prompt_assembler.build(context) runs, it's not concatenating strings — it's assembling a layered prompt with strict ordering:

# prompt_assembler.py
from __future__ import annotations

class PromptAssembler:
    def build(self, context: "AgentContext") -> list[dict]:
        """
        Assembles the message list in the correct order.
        Order matters enormously for model attention and instruction following.
        System prompt MUST come first — it establishes the authority hierarchy.
        """
        messages = []

        # Layer 1: System rules (highest authority, set once)
        messages.append({
            "role": "system",
            "content": self._build_system_block(context),
        })

        # Layer 2: Conversation history (may be compressed by ContextManager)
        messages.extend(context.message_history)

        # Note: tool definitions are passed as the `tools` API parameter,
        # NOT injected into message content — keeps the prompt clean.
        return messages

    def _build_system_block(self, context: "AgentContext") -> str:
        """
        The system block itself is layered:
        global rules → project rules → session constraints → environment info.
        Each layer can override the previous; later layers are more specific.
        """
        parts = [
            context.global_system_prompt,     # "You are a helpful engineer..."
            context.project_instructions,     # Contents of AGENTS.md / CLAUDE.md
            context.session_constraints,      # Tool permissions, sandbox mode
            context.environment_info,         # CWD, open files, git branch
        ]
        return "\n\n---\n\n".join(p for p in parts if p)

This ordering isn't arbitrary. Models tend to follow instructions at the start of the context more reliably than instructions buried in the middle of a long one, and chat APIs generally treat the system role as the highest-authority message. System instructions placed at the top keep their authority as the conversation grows to tens of thousands of tokens; placed mid-conversation, they are far more likely to be ignored past a certain context depth.

5. The Context Window: Your Scarcest Resource {#context-window}

Every multi-step agentic workflow faces the same thermodynamic inevitability: the context window fills up.

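With each iteration of the agent loop, you're appending tool call results, model responses, and new observations. A 128K-token context window sounds enormous until you're running a 30-step research workflow where each web search returns 2,000 tokens of results: that is 60,000 tokens from search results alone, before counting the system prompt, tool definitions, and the model's own responses. On the 8K-token budgets common on local hardware, you hit the wall within the first four steps.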

Without countermeasures, the consequences are severe and often silent: the agent "forgets" constraints defined in the system prompt, contradicts earlier decisions, abandons half-completed implementations, or starts hallucinating state that was pushed out of the window. This is why production-grade LLM agent harness engineering treats context budgeting as a first-class concern — not an afterthought.
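
A quick back-of-the-envelope calculation makes the pressure concrete. This is a sketch with illustrative numbers: the 2,000 tokens per step comes from the example above, while the 1,500-token fixed overhead (system prompt plus tool definitions) and the 1,024-token output reserve are assumptions, and `steps_until_full` is a hypothetical helper.

```python
def steps_until_full(budget_tokens, reserve_tokens, fixed_tokens, tokens_per_step):
    """Number of whole loop iterations that fit before the budget is exhausted."""
    available = budget_tokens - reserve_tokens - fixed_tokens
    if available <= 0:
        return 0
    return available // tokens_per_step


# Small local budget vs. a 128K hosted window, same per-step cost
local = steps_until_full(8_192, 1_024, 1_500, 2_000)
hosted = steps_until_full(128_000, 1_024, 1_500, 2_000)
print(local, hosted)
```

Under these assumptions the small budget fits only a couple of tool results, which is why compaction has to run inside the loop rather than as an occasional cleanup.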

Tiered Compaction: Keep Recent, Summarise Old

The Forge framework's ContextManager implements a tiered compaction strategy:

# context_manager.py
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class TieredCompact:
    keep_recent: int = 4          # Last N tool-call/response pairs kept verbatim
    summary_model: str = "qwen3:1.7b"  # Lightweight model for summarisation
    max_summary_tokens: int = 512      # Hard cap on compressed history block


class ContextManager:
    def __init__(
        self,
        strategy: TieredCompact,
        budget_tokens: int = 8192,
        reserve_tokens: int = 1024,   # Reserved headroom for model output
    ):
        self.strategy = strategy
        self.budget_tokens = budget_tokens
        self.reserve_tokens = reserve_tokens

    async def maybe_compact(
        self, messages: list[dict]
    ) -> list[dict]:
        """
        Compacts message history if the token budget is exceeded.
        Called before every model invocation.
        """
        effective_budget = self.budget_tokens - self.reserve_tokens
        current_tokens = self._count_tokens(messages)

        if current_tokens <= effective_budget:
            return messages  # No compaction needed

        s = self.strategy
        # Split: keep recent N exchanges verbatim, compress the rest
        recent = messages[-(s.keep_recent * 2):]   # *2: user + assistant pairs
        to_compress = messages[:-(s.keep_recent * 2)]

        if not to_compress:
            # Even recent-only history exceeds budget — hard truncate oldest
            return self._truncate_to_budget(recent, effective_budget)

        # Summarise older history with the lightweight model
        summary_text = await self._summarise(to_compress)
        summary_message = {
            "role": "system",
            "content": (
                "[CONTEXT SUMMARY — earlier conversation compressed]\n"
                + summary_text
            ),
        }
        return [summary_message] + recent

    def _count_tokens(self, messages: list[dict]) -> int:
        """
        Rough token estimate: 1 token ≈ 4 characters.
        Replace with tiktoken or your model's tokeniser for precision.
        """
        # `content` can be None on tool-call messages, so fall back to ""
        total_chars = sum(len(m.get("content") or "") for m in messages)
        return total_chars // 4
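
The class above calls `_truncate_to_budget` without showing it. One plausible shape for that fallback, written here as standalone functions, is to drop the oldest messages until the estimate fits. This is an assumption about the behaviour, not Forge's actual implementation, and `count_tokens` and `truncate_to_budget` are hypothetical names.

```python
def count_tokens(messages):
    # Same rough heuristic as above: about 4 characters per token
    return sum(len(m.get("content") or "") for m in messages) // 4


def truncate_to_budget(messages, budget_tokens):
    """Drop the oldest messages until the estimate fits the budget.

    The newest message is always kept, even if it alone exceeds the budget.
    """
    kept = list(messages)
    while len(kept) > 1 and count_tokens(kept) > budget_tokens:
        kept.pop(0)
    return kept


history = [
    {"role": "user", "content": "a" * 400},       # ~100 tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 tokens
    {"role": "tool", "content": "c" * 400},       # ~100 tokens
]
trimmed = truncate_to_budget(history, budget_tokens=200)
print([m["role"] for m in trimmed])
```

A production version would likely drop tool calls and their results together, since many chat APIs reject a tool result whose originating call is missing.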

VRAM-Aware Budgets for Local Inference

For developers running local backends (llama.cpp, Ollama, Llamafile), context management has a hardware dimension. The KV cache grows linearly with context length and lives in VRAM. Exceed your VRAM budget and either the server OOMs or starts offloading to RAM — at which point inference slows to a crawl.

# vram_budget.py
from __future__ import annotations
import subprocess

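def estimate_safe_token_budget(quantisation: str = "Q8_0") -> int:
    """
    Estimates a safe context token budget based on available GPU VRAM.

    `quantisation` is the KV-cache element type, not the weight quantisation.
    KV cache bytes per token = 2 (K and V) x n_layers x n_kv_heads x head_dim
    x bytes_per_element. The approximate figures below assume a
    Llama-3-8B-style layout (32 layers, 8 KV heads, head_dim 128);
    recompute them for your model:
      Q4_0  →  ~35 MB per 1K tokens
      Q8_0  →  ~66 MB per 1K tokens
      FP16  →  ~125 MB per 1K tokens
    Call this after the weights are loaded so free VRAM excludes them.
    """
    MB_PER_1K_TOKENS = {"Q4_0": 35.0, "Q8_0": 66.0, "FP16": 125.0}
    mb_rate = MB_PER_1K_TOKENS.get(quantisation, 66.0)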

    try:
        result = subprocess.run(
            [
                "nvidia-smi",
                "--query-gpu=memory.free",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            timeout=5,
        )
        free_mb = int(result.stdout.strip().split("\n")[0])
        # Use 70% of free VRAM to avoid thrashing
        usable_mb = free_mb * 0.70
        estimated_tokens = int((usable_mb / mb_rate) * 1000)
        return min(estimated_tokens, 128_000)  # Cap at model max
    except Exception:
        return 8_192  # Conservative safe default
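
Because the per-token cost depends so heavily on the architecture, it is worth deriving rather than relying on a fixed table. The sketch below uses the standard KV-cache size formula; the architecture numbers (32 layers, 8 KV heads, head dimension 128, 2-byte FP16 elements) are an assumed example, and both function names are hypothetical.

```python
def kv_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_element):
    # Factor of 2: one key vector and one value vector per layer per KV head
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_element


def max_context_tokens(free_vram_mib, bytes_per_token, headroom=0.70):
    # Keep some VRAM in reserve, mirroring the 70% rule used above
    usable_bytes = free_vram_mib * 1024 * 1024 * headroom
    return int(usable_bytes // bytes_per_token)


# Assumed example architecture with an FP16 cache
per_token = kv_cache_bytes_per_token(32, 8, 128, 2)
tokens = max_context_tokens(free_vram_mib=4096, bytes_per_token=per_token)
print(per_token, tokens)
```

With these assumptions each token costs 128 KiB of cache, so 4 GiB of free VRAM supports a context in the low tens of thousands of tokens, not the full 128K.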

6. Guardrails: The Reliability Stack That Changes Everything {#guardrails}

This is the section that explains the 53% → 99% jump. Guardrails are a composable middleware stack sitting between the raw model output and your application logic. Think of them as circuit breakers, validators, and auto-recovery mechanisms combined.

Component 1: Rescue Parsing

The single largest cause of agentic failures in local models is malformed tool-call JSON. The model generates output close to valid JSON — trailing commas, single quotes, unescaped characters, or truncated output from hitting max-token limits. Without rescue parsing, this is an unrecoverable hard crash.

# rescue_parser.py
from __future__ import annotations
import json
import re
from typing import Optional

class RescueParser:
    """
    Attempts to recover valid tool-call JSON from malformed model output.
    Applies a cascade of increasingly aggressive recovery strategies.
    """

    def __init__(self, lightweight_client=None):
        # Client for the last-resort reformat step; without one, that step returns None
        self.lightweight_client = lightweight_client

    async def parse(self, raw_output: str) -> Optional[dict]:
        # Strategy 1: Direct parse (happy path ~80% of the time)
        try:
            return json.loads(raw_output)
        except json.JSONDecodeError:
            pass

        # Strategy 2: Extract the span from the first "{" to the last "}"
        candidate = raw_output
        json_match = re.search(r"\{.*\}", raw_output, re.DOTALL)
        if json_match:
            candidate = json_match.group()
            try:
                return json.loads(candidate)
            except json.JSONDecodeError:
                pass

        # Strategy 3: Fix common syntactic issues in the extracted span
        cleaned = self._clean_json_string(candidate)
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            pass

        # Strategy 4 (last resort): Ask a lightweight model to reformat
        return await self._llm_reformat(raw_output)

    def _clean_json_string(self, s: str) -> str:
        """Fixes the most common JSON issues from local model outputs."""
        # Remove trailing commas before closing brackets/braces
        s = re.sub(r",\s*([}\]])", r"\1", s)
        # Replace smart quotes with straight quotes
        s = s.replace("\u201c", '"').replace("\u201d", '"')
        # Strip non-printable control characters (except newlines/tabs)
        s = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]", "", s)
        return s

    async def _llm_reformat(self, malformed: str) -> Optional[dict]:
        """
        Final fallback: ask a fast lightweight model to fix the JSON.
        Uses a strict prompt to prevent the reformatter from adding content.
        """
        prompt = (
            "Fix the following malformed JSON. "
            "Return ONLY valid JSON — no explanation, no markdown fences.\n\n"
            f"{malformed}"
        )
        try:
            fixed = await self.lightweight_client.complete(prompt, max_tokens=512)
            return json.loads(fixed)
        except Exception:  # covers json.JSONDecodeError and client errors
            return None  # Truly unrecoverable: triggers retry nudge upstream
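
The greedy regex in Strategy 2 spans from the first `{` to the last `}`, so it breaks when the output contains more than one object or trailing prose with braces. A stricter alternative, sketched here with the standard library's `json.JSONDecoder.raw_decode`, tries each `{` in turn and returns the first object that parses. `extract_first_json_object` is a hypothetical helper, not part of the parser above.

```python
import json


def extract_first_json_object(text):
    """Return the first parseable JSON object embedded in text, or None."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest
            obj, _end = decoder.raw_decode(text, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass
        start = text.find("{", start + 1)
    return None


noisy = 'Sure! {"name": "search", "parameters": {"q": "kv cache"}} Then {"name": "read"}'
first = extract_first_json_object(noisy)
print(first)
```

This only helps when the embedded JSON is already valid; syntactic repair such as trailing-comma removal still needs a cleaning pass like the one above.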

Component 2: Retry Nudges

When a model produces invalid output even after rescue parsing, a naive retry sends the same prompt and, at low sampling temperatures, tends to get the same bad output back. A retry nudge appends a targeted correction to the context before retrying:

# retry_nudge.py
from __future__ import annotations
from dataclasses import dataclass
from typing import Optional

@dataclass
class ValidationFailure:
    type: str          # e.g. "invalid_json", "wrong_tool", "missing_required_field"
    details: dict      # Context-specific metadata for nudge templating


class RetryNudgeMiddleware:
    """
    Modifies context on retry to explicitly address the specific failure mode.
    The nudge is appended as a 'user' role message, since models tend to respond better to user-role corrections.
    """
    MAX_RETRIES = 3

    _NUDGE_TEMPLATES: dict[str, str] = {
        "invalid_json": (
            "Your previous response contained invalid JSON. "
            "You MUST respond with a valid JSON tool call. "
            'Required format: {{"name": "tool_name", "parameters": {{"key": "value"}}}}'
        ),
        "wrong_tool": (
            "You attempted to call '{called}' but the next required step is '{expected}'. "
            "Call '{expected}' now — do not skip required steps."
        ),
        "missing_required_field": (
            "Your tool call is missing the required field '{field}'. "
            "Include '{field}' in your parameters and try again."
        ),
    }

    async def handle(
        self,
        context: "AgentContext",
        failure: ValidationFailure,
        retry_count: int,
    ) -> "AgentContext":
        if retry_count >= self.MAX_RETRIES:
            raise MaxRetriesExceeded(
                f"Agent failed after {self.MAX_RETRIES} retries. "
                f"Last failure: {failure.type} — {failure.details}"
            )

        template = self._NUDGE_TEMPLATES.get(
            failure.type, "Your response was invalid. Please try again carefully."
        )
        nudge_text = template.format(**failure.details)

        return context.append_message({
            "role": "user",
            "content": f"[CORRECTION REQUIRED] {nudge_text}",
        })
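
To show where a nudge sits in the retry cycle, here is a stripped-down synchronous driver. It is a sketch under simplifying assumptions: `model_fn` is a hypothetical callable that maps a message list to a string, only the invalid-JSON case is handled, and `flaky_model` is a stub that succeeds only after it has seen a correction.

```python
import json

MAX_RETRIES = 3
NUDGE = (
    "[CORRECTION REQUIRED] Your previous response contained invalid JSON. "
    "Respond with a valid JSON tool call."
)


def call_with_nudges(model_fn, messages):
    """model_fn is a hypothetical callable: list[dict] -> str."""
    messages = list(messages)  # do not mutate the caller's history
    for attempt in range(MAX_RETRIES + 1):
        raw = model_fn(messages)
        try:
            return json.loads(raw), attempt
        except json.JSONDecodeError:
            # Show the model its own bad output, then the targeted correction
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": NUDGE})
    raise RuntimeError(f"still invalid after {MAX_RETRIES} retries")


def flaky_model(messages):
    # Stub: answers correctly only once it has seen a correction
    if messages[-1]["content"].startswith("[CORRECTION REQUIRED]"):
        return '{"name": "search", "parameters": {"q": "docs"}}'
    return "{'name': 'search'}"  # single quotes: not valid JSON


call, retries_used = call_with_nudges(
    flaky_model, [{"role": "user", "content": "Find the docs"}]
)
print(call, retries_used)
```

Each retry sees a different context than the attempt before it, which is what separates a nudge from a blind resend.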

Component 3: Required Step Enforcement

Agents skip steps when they think they already know the answer, and that confidence is often misplaced. Step enforcement ensures the model completes required tool calls in the correct order before producing a final answer:

# step_enforcer.py
from __future__ import annotations
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class WorkflowStep:
    tool_name: str
    required: bool = True
    description: str = ""


class StepEnforcer:
    """
    Ensures the agent follows required workflow steps in order.
    Prevents premature task completion and step-skipping.
    """

    def __init__(self, required_steps: list[WorkflowStep]):
        self.required_steps = required_steps
        self.completed_steps: list[str] = []

    def validate_step(
        self, proposed_tool: str
    ) -> tuple[bool, Optional[ValidationFailure]]:
        """
        Returns (is_valid, failure_or_None).
        Call this before every tool execution.
        """
        next_required = self._get_next_required_step()

        if next_required is None:
            # All required steps complete — model has free choice
            return True, None

        if proposed_tool == next_required.tool_name:
            self.completed_steps.append(proposed_tool)
            return True, None

        # Model tried to skip a required step
        return False, ValidationFailure(
            type="wrong_tool",
            details={
                "called": proposed_tool,
                "expected": next_required.tool_name,
            },
        )

    def _get_next_required_step(self) -> Optional[WorkflowStep]:
        for step in self.required_steps:
            if step.required and step.tool_name not in self.completed_steps:
                return step
        return None
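
The enforcer above tracks completion by tool name, so a workflow that requires the same tool twice (search, read, search again) treats the second occurrence as already done, and it has no explicit gate for final answers. A position-based variant covers both cases. This is a hypothetical sketch, not Forge's implementation; `OrderedStepTracker` and its method names are illustrative.

```python
class OrderedStepTracker:
    """Tracks progress by position so the same tool can be required twice."""

    def __init__(self, required_tools):
        self.required_tools = list(required_tools)
        self.position = 0  # index of the next required tool

    def next_required(self):
        if self.position < len(self.required_tools):
            return self.required_tools[self.position]
        return None

    def check_tool(self, proposed_tool):
        # Any tool is allowed once all required steps are complete
        expected = self.next_required()
        return expected is None or proposed_tool == expected

    def mark_done(self, tool_name):
        # Call only after the tool has executed successfully
        if tool_name == self.next_required():
            self.position += 1

    def may_finish(self):
        # A final answer is allowed only once every required step has run
        return self.next_required() is None


tracker = OrderedStepTracker(["search", "read_file", "search"])
print(tracker.check_tool("read_file"), tracker.may_finish())
```

Separating `check_tool` from `mark_done` also means a step counts as complete only after its tool call succeeds, not at the moment it is proposed.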

The combination of these three components creates a nearly impenetrable reliability floor. Each recovers from a distinct failure mode. Together with the context budgeting from the previous section, they account for the 46-percentage-point improvement from 53% to 99%.

7. Memory Across Sessions: CLAUDE.md, AGENTS.md, and Handoff Artifacts {#memory}

Local context management solves the within-session problem. But production agents often span multiple sessions — a multi-day refactor, a long-running research pipeline, a continuous integration agent. Each new session starts cold. Without a memory system, the agent either tries to do too much in one session or repeats work already completed by a previous session.

The industry has converged on a simple, file-based solution.

The AGENTS.md Standard

OpenAI's Codex CLI introduced AGENTS.md, a Markdown file at the project root that the harness automatically injects into every session's system prompt. Since adopted by Google Jules and Cursor, and now managed under the Linux Foundation as an open standard, it solves the "stateless model + stateful project" mismatch:

# AGENTS.md — Project Instructions for AI Agents

## Architecture Overview
- FastAPI application, PostgreSQL backend, Redis cache
- All database access goes through `src/db/repository.py`
- Never write raw SQL outside repository methods

## Coding Standards
- Run `make lint` (ruff) before any commit — zero tolerance for lint errors
- All new API endpoints require an integration test in `tests/integration/`
- Type hints are mandatory on all public functions and methods

## Prohibited Actions
- Never modify `migrations/` directly — use `alembic revision --autogenerate`
- Never commit `.env` files, secrets, or credentials under any circumstances
- Do not push directly to `main` — all changes go through PRs

## Current Work Context
- Refactoring auth system per `docs/auth-refactor-plan.md`
- Active branch: `feature/auth-v2`

## Definition of Done
- All tests green: `make test`
- Zero lint errors: `make lint`
- PR description updated with a clear summary of changes made

Progress File Handoffs

For multi-session tasks, the harness creates a structured progress file before any work begins, and updates it after every meaningful step:

# session_handoff.py
from __future__ import annotations
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional

HANDOFF_FILE = Path(".agent_progress.json")


def initialise_session(task: str) -> dict:
    """
    Call this at the very start of a new task.
    Creates a progress file that subsequent sessions can resume from.
    """
    progress = {
        "task": task,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "status": "in_progress",
        "completed_steps": [],
        "artifacts_created": [],
        "notes": "",
        "last_updated": datetime.now(timezone.utc).isoformat(),
    }
    HANDOFF_FILE.write_text(json.dumps(progress, indent=2))
    return progress


def resume_session() -> Optional[dict]:
    """
    Loads in-progress task state for a resuming session.
    Returns None if no task is in flight.
    """
    if not HANDOFF_FILE.exists():
        return None
    progress = json.loads(HANDOFF_FILE.read_text())
    return progress if progress.get("status") == "in_progress" else None


def update_progress(step: str, artifact: Optional[str] = None) -> None:
    """Call after every meaningful step to persist progress."""
    progress = json.loads(HANDOFF_FILE.read_text())
    progress["completed_steps"].append({
        "step": step,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    if artifact:
        progress["artifacts_created"].append(artifact)
    progress["last_updated"] = datetime.now(timezone.utc).isoformat()
    HANDOFF_FILE.write_text(json.dumps(progress, indent=2))


def build_memory_block(progress: dict) -> str:
    """
    Renders progress state as a system prompt block
    injected at the start of a resuming session.
    """
    completed = "\n".join(
        f"  ✓ {s['step']}" for s in progress["completed_steps"]
    )
    artifacts = ", ".join(progress["artifacts_created"]) or "none"
    return (
        f"## Session Memory (resumed from previous work)\n"
        f"Task: {progress['task']}\n\n"
        f"Completed steps:\n{completed}\n\n"
        f"Artifacts created: {artifacts}\n\n"
        f"**IMPORTANT: Do not redo completed steps. "
        f"Continue from where the previous session left off.**"
    )
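
One practical caveat: `Path.write_text` is not atomic, so a crash in the middle of a write can leave a truncated progress file that the next session cannot parse. A common mitigation is to write to a temporary file and rename it into place. The helper below is a sketch of that pattern using only the standard library; `write_json_atomic` is a hypothetical name and is not part of the handoff module above.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, data: dict) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(data, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_name, path)  # atomic when both paths share a filesystem
    except BaseException:
        # Clean up the temp file if anything failed before the rename
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / ".agent_progress.json"
    write_json_atomic(target, {"status": "in_progress", "completed_steps": []})
    write_json_atomic(target, {"status": "in_progress", "completed_steps": ["lint"]})
    reloaded = json.loads(target.read_text())
    leftovers = [p.name for p in Path(workdir).iterdir()]
print(reloaded, leftovers)
```

Readers then see either the old file or the new one, never a half-written mix.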

8. Tool Permissions and Security Enforcement {#tools-security}

This is where production deployments fail most dangerously. The model decides what to do; the harness decides what it is allowed to do. These are entirely separate concerns and must be enforced separately.

The failure mode is not malicious intent — it's optimisation. Given broad filesystem access, a model tasked with "cleaning up the project" will sometimes delete things it shouldn't. The model isn't broken; it's doing its job. Without a permission enforcer, nothing stops it.

# permission_enforcer.py
from __future__ import annotations
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Optional

@dataclass
class PermissionConfig:
    allowed_read_dirs: list[str] = field(default_factory=list)
    allowed_write_dirs: list[str] = field(default_factory=list)
    network_allowed: bool = False
    require_approval_for: list[str] = field(default_factory=list)


@dataclass
class PermissionResult:
    allowed: bool
    reason: str = ""


class ToolPermissionEnforcer:
    """
    Hard enforcement layer — the model cannot override this.
    Every tool call passes through here before execution.
    """

    def __init__(self, config: PermissionConfig):
        self._read_dirs = [Path(d).resolve() for d in config.allowed_read_dirs]
        self._write_dirs = [Path(d).resolve() for d in config.allowed_write_dirs]
        self.network_allowed = config.network_allowed
        self.approval_required = set(config.require_approval_for)

    def check(self, tool_name: str, tool_args: dict) -> PermissionResult:
        # High-risk tools require explicit human approval
        if tool_name in self.approval_required:
            if not self._request_human_approval(tool_name, tool_args):
                return PermissionResult(allowed=False, reason="Human denied approval")

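        # Filesystem write check. Note: for run_shell this validates only the
        # `path` argument (defaulting to the current directory); the command
        # string itself is not inspected, so pair it with the sandbox below.
        if tool_name in {"write_file", "edit_file", "delete_file", "run_shell"}:
            path = Path(tool_args.get("path", ".")).resolve()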
            if not any(path.is_relative_to(d) for d in self._write_dirs):
                return PermissionResult(
                    allowed=False,
                    reason=f"Write denied: '{path}' is outside allowed write directories.",
                )

        # Filesystem read check
        if tool_name in {"read_file", "list_directory"}:
            path = Path(tool_args.get("path", ".")).resolve()
            allowed = self._read_dirs + self._write_dirs
            if not any(path.is_relative_to(d) for d in allowed):
                return PermissionResult(
                    allowed=False,
                    reason=f"Read denied: '{path}' is outside allowed directories.",
                )

        # Network check
        if tool_name in {"http_request", "web_search", "fetch_url"}:
            if not self.network_allowed:
                return PermissionResult(
                    allowed=False,
                    reason="Network access is disabled in this sandbox.",
                )

        return PermissionResult(allowed=True)

    def _request_human_approval(self, tool_name: str, tool_args: dict) -> bool:
        """Pause agent execution and request human approval."""
        print(f"\n⚠️  Agent requests permission to run: {tool_name}")
        print(f"   Args: {json.dumps(tool_args, indent=2)}")
        return input("Approve? [y/N]: ").strip().lower() == "y"
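
The directory checks above rely on resolving paths before comparing them. The short sketch below shows why that matters, using a temporary directory as a stand-in workspace. `is_within` is a hypothetical helper; it uses `Path.parents` for the containment test, which gives the same answer as `is_relative_to` on resolved paths.

```python
import tempfile
from pathlib import Path


def is_within(allowed_dir, candidate) -> bool:
    """True if candidate, after resolving '..' and symlinks, is inside allowed_dir."""
    root = Path(allowed_dir).resolve()
    target = Path(candidate).resolve()
    return target == root or root in target.parents


with tempfile.TemporaryDirectory() as base:
    workspace = Path(base) / "workspace"
    workspace.mkdir()
    inside = is_within(workspace, workspace / "src" / "app.py")
    # '..' escapes the workspace once the path is resolved
    escaped = is_within(workspace, workspace / ".." / "secrets.env")
    # A plain string-prefix test would wrongly accept this sibling directory
    sibling = is_within(workspace, Path(base) / "workspace_backup" / "x.txt")
print(inside, escaped, sibling)
```

Comparing unresolved strings would accept both the traversal and the sibling directory, which is exactly the class of mistake a permission layer exists to catch.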

Belt-and-Suspenders: Docker Sandbox

For production deployments, process-level checks aren't enough. The belt-and-suspenders approach wraps tool executions in Docker:

# docker_sandbox.py
from __future__ import annotations
import docker

class DockerSandbox:
    """
    Executes agent-generated code in an isolated container.
    Only the bind-mounted workspace directory is shared with the host.
    """

    def __init__(
        self,
        image: str = "python:3.12-slim",
        workspace_path: str = "/tmp/agent_workspace",
    ):
        self.client = docker.from_env()
        self.image = image
        self.workspace_path = workspace_path

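    def execute_code(self, code: str, timeout: int = 30) -> str:
        """
        Runs Python code in an isolated container with:
          • No network access (network_mode="none")
          • Read-only root filesystem
          • Writable /workspace (bind mount) and /tmp (tmpfs, 100 MB cap)
          • 512 MB memory limit
          • 50% single-CPU quota
          • Hard timeout
        """
        container = None
        try:
            # containers.run() takes no timeout argument, so start detached
            # and enforce the deadline with wait(timeout=...).
            container = self.client.containers.run(
                self.image,
                command=["python", "-c", code],
                volumes={
                    self.workspace_path: {"bind": "/workspace", "mode": "rw"}
                },
                network_mode="none",
                read_only=True,
                tmpfs={"/tmp": "size=100m"},
                mem_limit="512m",
                cpu_quota=50_000,   # 50% of one core
                detach=True,
            )
            status = container.wait(timeout=timeout)
            output = container.logs(stdout=True, stderr=True).decode("utf-8")
            if status.get("StatusCode", 0) != 0:
                return f"ExecutionError: {output}"
            return output
        except docker.errors.APIError as e:
            return f"DockerAPIError: {e}"
        except Exception as e:
            # wait() raises a requests-level error when the deadline passes
            return f"ExecutionError: timed out or failed: {e}"
        finally:
            if container is not None:
                container.remove(force=True)  # also kills a still-running container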

9. Multi-Agent Orchestration with SlotWorker Patterns {#multi-agent}

Single-agent systems have a ceiling. The most capable agentic architectures divide labour across specialist agents that verify each other's work. The engineering challenge: on local hardware you usually have a single GPU, so multiple agents must share one inference slot without any of them starving.

Forge's SlotWorker solves this with priority-queued shared inference:

# slot_worker.py
from __future__ import annotations
import asyncio
from dataclasses import dataclass, field
from typing import Optional, Any

@dataclass(order=True)
class AgentJob:
    priority: int                          # Lower = higher priority (Unix convention)
    agent_id: str = field(compare=False)
    workflow: Any = field(compare=False)   # Type: Workflow
    user_input: str = field(compare=False)
    future: asyncio.Future = field(compare=False)


class SlotWorker:
    """
    Shared inference slot for multi-agent architectures on a single GPU.
    Implements priority queuing: higher-priority jobs are dequeued ahead of
    waiting lower-priority ones. A job that is already running is never interrupted.

    Usage: instantiate ONE SlotWorker per GPU, inject into all WorkflowRunners.
    """

    def __init__(self, client: Any, max_queue_size: int = 50):
        self.client = client
        self._queue: asyncio.PriorityQueue[AgentJob] = asyncio.PriorityQueue(
            max_queue_size
        )
        self._worker_task: Optional[asyncio.Task] = None

    async def start(self) -> None:
        """Start the background processing loop."""
        self._worker_task = asyncio.create_task(self._process_queue())

    async def submit(
        self,
        agent_id: str,
        workflow: Any,
        user_input: str,
        priority: int = 5,
    ) -> str:
        """
        Submit a job. Blocks until the job completes and returns the result.
        Priority 1 = highest urgency, 10 = background/low priority.
        """
        loop = asyncio.get_running_loop()
        future: asyncio.Future[str] = loop.create_future()
        await self._queue.put(
            AgentJob(
                priority=priority,
                agent_id=agent_id,
                workflow=workflow,
                user_input=user_input,
                future=future,
            )
        )
        return await future

    async def _process_queue(self) -> None:
        """
        Worker loop: dequeues jobs in priority order, runs them sequentially
        on the shared inference slot, resolves futures with results.
        """
        while True:
            job = await self._queue.get()
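            try:
                from forge import WorkflowRunner
                runner = WorkflowRunner(client=self.client)
                result = await runner.run(job.workflow, job.user_input)
                # The caller may have cancelled (e.g. via a timeout) while the
                # job ran; setting a result on a finished future would raise
                # and kill this worker loop.
                if not job.future.done():
                    job.future.set_result(result)
            except Exception as exc:
                if not job.future.done():
                    job.future.set_exception(exc)
            finally:
                self._queue.task_done()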


# ── Example: Parallel specialist code review ──────────────────────────────────
async def multi_agent_code_review(pr_diff: str) -> dict:
    """
    Three specialist agents analyse a PR diff, then a Critic agent
    synthesises their findings. The specialists are submitted concurrently,
    but all four agents share a single GPU slot via SlotWorker, so
    inference itself runs one job at a time.
    """
    from forge import OllamaClient
    slot = SlotWorker(client=OllamaClient(model="ministral-3:8b-instruct"))
    await slot.start()

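    # Specialists are submitted together; the shared slot runs them one at a time.
    # planner_workflow, security_workflow, perf_workflow and critic_workflow
    # are assumed to be defined elsewhere.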
    planner_out, security_out, perf_out = await asyncio.gather(
        slot.submit("planner",  planner_workflow,  pr_diff, priority=2),
        slot.submit("security", security_workflow, pr_diff, priority=2),
        slot.submit("perf",     perf_workflow,     pr_diff, priority=2),
    )

    # Critic synthesises — high priority, runs after specialists complete
    synthesis = await slot.submit(
        "critic",
        critic_workflow,
        f"Planner findings:\n{planner_out}\n\n"
        f"Security findings:\n{security_out}\n\n"
        f"Performance findings:\n{perf_out}",
        priority=1,
    )

    return {
        "synthesis": synthesis,
        "specialist_outputs": {
            "planner": planner_out,
            "security": security_out,
            "performance": perf_out,
        },
    }

The SlotWorker pattern mirrors how effective engineering teams operate: specialists work in parallel on their domains, a synthesiser integrates their output. The harness provides the coordination layer the model cannot provide for itself.
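
To see the queuing behaviour in isolation, here is a minimal, dependency-free sketch of the same idea. It is not Forge's implementation: MiniSlot, Job and the seq tie-breaker are hypothetical names, and asyncio.sleep(0) stands in for the model call. The sequence number is an addition that keeps jobs of equal priority in submission order.

```python
import asyncio
import contextlib
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Job:
    priority: int                      # lower number = served first
    seq: int                           # tie-breaker: submission order
    name: str = field(compare=False)
    future: asyncio.Future = field(compare=False)


class MiniSlot:
    """One shared 'inference slot': jobs run strictly one at a time."""

    def __init__(self) -> None:
        self._queue: asyncio.PriorityQueue = asyncio.PriorityQueue()
        self._seq = itertools.count()
        self.order: list = []          # names in the order they actually ran

    async def submit(self, name: str, priority: int) -> str:
        future = asyncio.get_running_loop().create_future()
        await self._queue.put(Job(priority, next(self._seq), name, future))
        return await future

    async def worker(self) -> None:
        while True:
            job = await self._queue.get()
            try:
                await asyncio.sleep(0)             # stand-in for the model call
                self.order.append(job.name)
                if not job.future.done():
                    job.future.set_result(f"{job.name}: done")
            finally:
                self._queue.task_done()


async def demo() -> list:
    slot = MiniSlot()
    submissions = [
        asyncio.create_task(slot.submit("background", 9)),
        asyncio.create_task(slot.submit("perf", 2)),
        asyncio.create_task(slot.submit("security", 2)),
        asyncio.create_task(slot.submit("urgent", 1)),
    ]
    await asyncio.sleep(0)             # let every submit() enqueue its job
    worker = asyncio.create_task(slot.worker())
    await asyncio.gather(*submissions)
    worker.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await worker
    return slot.order


def run_demo() -> list:
    return asyncio.run(demo())


if __name__ == "__main__":
    print(run_demo())  # ['urgent', 'perf', 'security', 'background']
```

All four jobs are queued before the worker starts, so the execution order depends only on priority and, within a priority, on submission order. The callers wait concurrently, but the slot itself never runs two jobs at once.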

10. Benchmarking Your Harness: Building a 26-Scenario Eval Suite {#benchmarking}

The 53% → 99% story is compelling only because there is a rigorous benchmark behind it. You cannot improve what you don't measure, and you cannot trust improvements you haven't validated. Here's how to build your own eval suite.

Anatomy of a Good Agentic Benchmark Scenario

Each scenario should isolate and test one capability or one failure mode:

# benchmark.py
from __future__ import annotations
import asyncio
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class BenchmarkScenario:
    name: str
    tier: str                              # "easy" | "medium" | "hard"
    user_input: str
    required_tools_in_order: list[str]     # Must be called in this sequence
    success_criteria: Callable[[str], bool]  # Validates the final output
    max_steps: int = 10
    description: str = ""


# Example scenarios covering the most common failure modes
BENCHMARK_SCENARIOS: list[BenchmarkScenario] = [
    # Tier 1 — Basic single-tool call
    BenchmarkScenario(
        name="single_tool_weather",
        tier="easy",
        user_input="What's the weather in Tokyo right now?",
        required_tools_in_order=["get_weather"],
        success_criteria=lambda r: "tokyo" in r.lower() and any(
            c.isdigit() for c in r
        ),
        description="Verifies basic tool selection and execution.",
    ),
    # Tier 2 — Multi-step with data dependency
    BenchmarkScenario(
        name="search_then_summarise",
        tier="medium",
        user_input="Find recent papers on transformer attention and summarise the key findings.",
        required_tools_in_order=["web_search", "fetch_url", "summarise"],
        success_criteria=lambda r: len(r) > 200 and "attention" in r.lower(),
        description="Verifies step ordering and data-flow between tools.",
    ),
    # Tier 3 — Error recovery under tool failure
    BenchmarkScenario(
        name="recover_from_api_error",
        tier="hard",
        user_input="Fetch and process the latest sales data from the internal API.",
        required_tools_in_order=["http_request", "process_data"],
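        # Scenario fixture: first http_request returns HTTP 500
        # Note: this is a loose check. It passes whenever the output mentions
        # an error or a retry, so tighten it for your own data.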
        success_criteria=lambda r: (
            "retry" in r.lower() or "error" in r.lower()
        ),
        description="Verifies graceful handling of tool-level failures.",
    ),
]


class HarnessBenchmark:
    def __init__(self, harness: Any, scenarios: list[BenchmarkScenario]):
        self.harness = harness
        self.scenarios = scenarios
        self.results: list[dict] = []

    async def run(self) -> dict:
        """Execute all scenarios and return a structured report."""
        self.results = []  # reset so repeated run() calls don't accumulate
        for scenario in self.scenarios:
            result = await self._run_one(scenario)
            self.results.append(result)

        total = len(self.scenarios)
        passed = sum(1 for r in self.results if r["passed"])
        return {
            "total": total,
            "passed": passed,
            "success_rate": f"{passed / total:.1%}" if total else "n/a",
            "by_tier": self._aggregate_by_tier(),
            "details": self.results,
        }

    async def _run_one(self, scenario: BenchmarkScenario) -> dict:
        loop = asyncio.get_running_loop()
        start = loop.time()
        # Every result carries the same core keys, whether it passed or raised.
        result: dict = {"name": scenario.name, "tier": scenario.tier, "passed": False}
        try:
            # _workflow_from_scenario is harness-specific and not shown here:
            # it should build a workflow that enforces
            # scenario.required_tools_in_order and scenario.max_steps.
            output = await self.harness.run(
                workflow=self._workflow_from_scenario(scenario),
                user_input=scenario.user_input,
            )
            result["passed"] = bool(scenario.success_criteria(output))
            result["output_preview"] = output[:200]
        except Exception as exc:
            result["error"] = str(exc)
        result["elapsed_s"] = round(loop.time() - start, 2)
        return result

    def _aggregate_by_tier(self) -> dict[str, str]:
        tiers: dict[str, list[bool]] = {}
        for r in self.results:
            tiers.setdefault(r["tier"], []).append(r["passed"])
        return {
            tier: f"{sum(results)}/{len(results)}"
            for tier, results in tiers.items()
        }
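
The scenario dataclass declares required_tools_in_order, but the runner above only validates the final output. One way to check the ordering is sketched below, assuming your harness can expose the list of tool names it actually called. That trace and the function name tools_called_in_order are hypothetical, not part of any library:

```python
from __future__ import annotations


def tools_called_in_order(required: list[str], trace: list[str]) -> bool:
    """Return True if `required` appears in `trace` as an ordered subsequence.

    Extra or repeated tool calls in between are allowed; only the relative
    order of the required tools matters.
    """
    remaining = iter(trace)
    # `tool in remaining` advances the iterator until it finds `tool`,
    # so each required tool must appear after the previous one.
    return all(tool in remaining for tool in required)


if __name__ == "__main__":
    required = ["web_search", "fetch_url", "summarise"]
    print(tools_called_in_order(required, ["web_search", "web_search", "fetch_url", "summarise"]))  # True
    print(tools_called_in_order(required, ["fetch_url", "web_search", "summarise"]))  # False
```

A scenario would then pass only when both this check and success_criteria hold. Allowing extra calls between the required ones is a design choice; use an exact list comparison if retries should count as failures.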

Run this benchmark before and after every harness change. Even 10 well-chosen scenarios are far more valuable than shipping on vibes.
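
Comparing two runs can be automated. The sketch below assumes both reports have the shape returned by HarnessBenchmark.run(), with a "details" list of per-scenario dicts. The function name find_regressions and the sample data are made up for illustration:

```python
from __future__ import annotations


def find_regressions(before: dict, after: dict) -> list[str]:
    """Names of scenarios that passed in `before` but fail, or are missing, in `after`."""
    after_passed = {r["name"]: r["passed"] for r in after["details"]}
    return [
        r["name"]
        for r in before["details"]
        if r["passed"] and not after_passed.get(r["name"], False)
    ]


if __name__ == "__main__":
    # Illustrative reports, not real benchmark results.
    before = {"details": [
        {"name": "single_tool_weather", "passed": True},
        {"name": "search_then_summarise", "passed": True},
        {"name": "recover_from_api_error", "passed": False},
    ]}
    after = {"details": [
        {"name": "single_tool_weather", "passed": True},
        {"name": "search_then_summarise", "passed": False},
        {"name": "recover_from_api_error", "passed": True},
    ]}
    print(find_regressions(before, after))  # ['search_then_summarise']
```

An overall success rate can stay flat while individual scenarios flip, so listing regressions by name catches changes that the headline number hides.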

11. Conclusion: The Harness Is the Product {#conclusion}

Let's return to where we started: an 8-billion-parameter model, 53% task success, then 99% after adding a harness. Same model. Same hardware. Different engineering.

The lesson isn't "guardrails are a nice-to-have." The lesson is that LLM agent harness engineering is the core product engineering discipline of 2026.

The model is a commodity. GPT-5.x, Claude Opus 4.x, Qwen3.7-Max — they are all extraordinarily capable foundations, and they get cheaper every quarter. DeepSeek is hiring an entire team to build a harness because they understand that the model alone doesn't ship value; the harness does. OpenAI formalised AGENTS.md. Anthropic published their harness engineering playbook. The industry has spoken.

Five things to do this week:

1. Install Forge (pip install forge-guardrails). Run its 26-scenario eval suite against your current stack. Confront your real baseline number.
2. Add AGENTS.md to every project. Five minutes of setup, and every future agent session gets the project context for free.
3. Instrument your context usage. Log token counts at every loop iteration. Know your ceiling before you hit it.
4. Build 10 benchmark scenarios. Cover the three tiers: basic tool call, multi-step dependency, error recovery. Run them on every harness change.
5. Separate the permission layer from the prompt layer. Prompts are suggestions. The harness permission enforcer is law. Never conflate them.
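
For the context-instrumentation item, a rough starting point might look like this. It is a sketch under stated assumptions: the four-characters-per-token estimate is a crude heuristic rather than a real tokenizer, the 80% warning threshold is arbitrary, and the names estimate_tokens and log_context_usage are hypothetical:

```python
from __future__ import annotations

import logging

logger = logging.getLogger("harness.context")


def estimate_tokens(text: str) -> int:
    """Crude estimate (~4 characters per token). Swap in the model's
    own tokenizer for accurate numbers."""
    return max(1, len(text) // 4)


def log_context_usage(step: int, messages: list[dict], budget: int) -> float:
    """Log estimated token usage for one loop iteration and return the
    fraction of the budget consumed."""
    used = sum(estimate_tokens(m["content"]) for m in messages)
    ratio = used / budget
    # Assumed threshold: warn once 80% of the budget is used.
    level = logging.WARNING if ratio >= 0.8 else logging.INFO
    logger.log(level, "step=%d tokens=%d budget=%d used=%.0f%%",
               step, used, budget, ratio * 100)
    return ratio


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    history = [{"role": "user", "content": "x" * 400}]
    log_context_usage(step=1, messages=history, budget=1000)
```

Calling this once per loop iteration gives a per-step usage trail, which makes it easier to see which tool outputs are consuming the budget.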

The engineers who master LLM agent harness engineering in the next 12 months will have a fundamental structural advantage over those who keep treating the model as the product.

The model is the engine. The harness is the car. Nobody buys an engine.

Resources & Further Reading

Forge Framework (GitHub) — Open-source harness with guardrails, ContextManager, SlotWorker, and eval suite
AGENTS.md Specification — OpenAI's open standard for agent project instructions (Linux Foundation)
Anthropic: Building Effective Agents — Engineering principles from the Claude team
Anthropic: Effective Harnesses for Long-Running Agents — Session continuity and handoff artifact patterns
The Decoder: State of AI Agents in 2026 — Comprehensive industry overview of the agentic shift
Qwen3.7-Max: The Agent Frontier — Alibaba's agent-native model release notes

Found this useful? Follow me for weekly deep dives into production AI engineering. Have a harness pattern that's worked well in your stack? Drop it in the comments — I read every one.
