DEV Community

kination
kination

Posted on

Part 1 - Case for Harness Engineering

Probably you've check the news, that one SaaS startup lost three months of customer data in less than a minute.

Their AI coding agent ran inside "Cursor" editor, powered by "Claude Opus" (actually this is not a matter of tool or model). Depend on report, it was done while resolving a minor staging credential mismatch. The agent scanned codebase, located an overly permissive cloud infrastructure API token, and executed API call volumeDelete to production cluster.

Because the cloud provider store backups on the same volume as primary data, the single mutation destroyed the database and all backups.

The team spent thirty hours recovering critical customer records. They cross-referenced payment logs, email invitations, and calendar entries.

Logs later revealed a detailed "confession" from the agent. It listed the safety rules it knew it was violating, including instructions explicitly forbidding database deletion. Yet, it ran the mutation.

Blaming model intelligence is missing the point. In this case, root cause lies in system architecture. Developers grant autonomous agents raw API keys and shell access, then only relies on 'security prompts'(such as 'done't delete table') to enforce security boundaries.

Prompts are soft. They fail under pressure. So AI agents need more strict, physical boundary. This is execution harness.


Why Models forgetting Negative Constraints

We expect models to follow negative constraints, such as "never delete the database" or "do not modify production credentials". This expectation ignores how large language models process information.

Two structural factors explain why agents fail these constraints:

1. Instruction Drift in Probabilistic Loops

Language models process inputs via attention. As an agent run progresses, the context window fills with terminal logs, error messages, and code snippets. The negative constraint, originally placed in system prompt, but while model operates on next-token probability, it focuses on resolving the immediate compilation error or credential mismatch.

If a destructive command appears to clear the blocker, the model executes it. It does not think about the physical consequences, and focusing on resolving 'pressure' to resolve the current error string.

2. Missing Infrastructure Maps

Models do not know about semantic map of physical topology. They does not have understanding of staging and production, and are physically isolated environments separated by database credentials.

It treats a database volume as a variable in a script, and deleting and recreating a volume is just logical troubleshooting step to the model, equivalent to a local server restart.


Defining the Harness

In software testing, a harness is the scaffolding that runs code under controlled conditions. It feeds inputs, captures outputs, and isolates the system under test.

┌────────────────────────────────────────────────────────┐
│                    Agent Harness                       │
│                                                        │
│   ┌─────────────┐    Harness View    ┌─────────────┐   │
│   │             ├───────────────────>│             │   │
│   │  Agent LLM  │                    │ Environment │   │
│   │             │<───────────────────┤ (Files/DB)  │   │
│   └─────────────┘    Targeted Edit   └─────────────┘   │
│                                                        │
└────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

For AI agents, a harness is boundary layer wrapping the model. It separates reasoning from execution. Instead of giving a model direct access to files, databases, or API endpoints, the harness presents a restricted interface.

The harness acts as a security gate, and a validator. When the agent attempts an action, the request does not go directly to the operating system or the network. Instead, the harness intercepts it, validating/checking the input data, confirm the constraints, and decides whether the action is safe and aligned with the current workflow.

By placing a strict execution layer between the model and the environment, the harness guarantees that:

  • Context Control: The agent only sees curated, safe inputs, preventing prompt injection or memory leakage.
  • Action Isolation: The environment is protected from raw commands; the harness converts model intent into safe, deterministic API calls.
  • Drift Detection: The target system state is verified, preventing stale code edits or outdated assumptions.
  • State Auditing: All changes are logged sequentially, enabling automatic rollbacks if the agent runs into a dead end.

Algorithmic Correctness via Self-Correction Harnesses

While negative constraints prevent raw destructive actions, we need a way to verify that the agent actually does its job correctly. To see the value of a harness in its simplest, most practical form, we look at how a harness enforces algorithmic correctness. When asked to write an algorithm, a model often outputs code with syntax errors or logical bugs.

If we run the agent in a raw loop, it writes the buggy file and exits, leaving the user with broken code.

An algorithmic harness introduces an execution and evaluation loop. The harness does not simply write the generated function. It runs the code in an isolated environment, executes a battery of test assertions, and captures the tracebacks.

If a test fails, the harness does not quit. It grabs the exact error traceback, feeds it back to the model, and instructs the agent to self-correct. The loop repeats until the code passes all tests. The harness converts a probabilistic generator into a deterministic, verified output.


Code Example: The Self-Correction Loop

To demonstrate this concept, we built a test environment that generates a Python algorithm. The task is simple: write a function parse_and_sum(text) that finds all integers in a string (including negative numbers) and returns their sum.

The verification tests are defined as:

assert parse_and_sum("1 2 3") == 6
assert parse_and_sum("The temperature is -5 degrees") == -5
assert parse_and_sum("No numbers here") == 0
Enter fullscreen mode Exit fullscreen mode

The Without-Harness Attempt

Without a harness, the agent generates the code once. In its first draft, the model commonly uses a naive regular expression that ignores negative signs:

import re

def parse_and_sum(text):
    numbers = re.findall(r'\d+', text)
    return sum(int(num) for num in numbers)
Enter fullscreen mode Exit fullscreen mode

The function executes, but fails the second assertion because it parses -5 as 5, returning 5. Without a harness to catch this, the broken file is written, and the application fails in production.

The With-Harness Attempt

With our harness, the initial code is run inside a test runner. The assertion fails, throwing an AssertionError.

The harness intercepts the traceback and sends it back to the model as feedback:

Your previous Python function failed verification tests. 
The execution harness returned the following error:
AssertionError: Test case verification failed.
Failing Test: assert parse_and_sum("The temperature is -5 degrees") == -5
Enter fullscreen mode Exit fullscreen mode

The model reads the traceback, realizes its regular expression failed to capture the negative sign, and outputs the corrected code:

import re

def parse_and_sum(text):
    numbers = re.findall(r'-?\d+', text)
    return sum(int(num) for num in numbers)
Enter fullscreen mode Exit fullscreen mode

The harness executes the new code, verifies that it passes all assertions, and saves the verified script. The correction loop succeeds.


Focus moves from 'Prompts' to 'Harnesses'

Prompt engineering is soft. It attempts to control agent behavior inside the context window. This approach treats safety as an alignment problem and relies on the model to behave.

Harness engineering is hard. It controls agent behavior at the boundary. This approach treats safety as an architectural problem and makes destructive actions physically impossible.

Now let's stop relying on instructions to secure AI agents. As agents gain autonomy, the safety of our infrastructure depends on 'constraints' we build outside the model, not the prompts we write inside.


References

Top comments (0)