vishalmysore

Posted on Jun 2

Why Your AI Agents Fail in Production: What Harness Engineering Is NOT

#agents #ai #llm #softwareengineering

A technical introduction, grounded in code

If you've been building AI agents, you've probably felt the gap between "the model works in a notebook" and "the model works reliably in production." Harness engineering is the discipline that closes that gap. But it's widely misunderstood — often confused with things it is not.

It is NOT the model

The most common mistake is treating the LLM as the unit of engineering. Swap GPT-4 for Claude, tune a prompt, and call it done.

In the demo that accompanies this post (linked at the end), the same orchestrator loop drives four domains (healthcare, insurance, career counselling, and drug discovery) using a local Llama 3.2, Phi-3.5, or Qwen 2.5 model, or a mock simulation with no model at all. The core agentic loop in src/execution/orchestrator.js is model-agnostic: whether a <tool_call> response comes from a 3B quantized model running on WebGPU or from a deterministic mock, the harness processes it identically.

The model is a component. The harness is the system.
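To make the model-agnostic point concrete, here is a minimal sketch of how a harness might extract a <tool_call> from raw text regardless of which backend produced it. The function name, the JSON payload format inside the tag, the `lookupPatient` tool, and both backends are assumptions for illustration; they are not taken from orchestrator.js.

```javascript
// Hypothetical sketch: names and the JSON payload format are assumptions,
// not the demo's actual implementation.

// Pull the first <tool_call>...</tool_call> block out of raw model text.
function extractToolCall(text) {
  const match = text.match(/<tool_call>([\s\S]*?)<\/tool_call>/);
  if (!match) return null; // plain answer, no tool requested
  try {
    const parsed = JSON.parse(match[1]);
    if (typeof parsed.name !== "string") return null;
    return { name: parsed.name, arguments: parsed.arguments ?? {} };
  } catch {
    return null; // malformed JSON is treated as "no tool call"
  }
}

// Two interchangeable stand-in backends: the harness only ever sees a string.
const mockBackend = () =>
  '<tool_call>{"name": "lookupPatient", "arguments": {"id": "P-001"}}</tool_call>';
const localModelBackend = () =>
  'Let me check the record.\n<tool_call>{"name": "lookupPatient", "arguments": {"id": "P-001"}}</tool_call>';

const fromMock = extractToolCall(mockBackend());
const fromModel = extractToolCall(localModelBackend());
console.log(JSON.stringify(fromMock) === JSON.stringify(fromModel)); // true
```

Everything downstream of `extractToolCall` can then work with the same `{ name, arguments }` shape, which is what lets the backend be swapped without touching the loop.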

It is NOT prompt engineering

Prompt engineering is about what you say to the model. Harness engineering is about what you do with the model's outputs — and what you do before the model ever sees a query.

In src/information/memoryManager.js, past clinician corrections stored in localStorage are retrieved via keyword matching and injected into the system prompt before each run. The model doesn't know this is happening. It just receives a richer context. The retrieval, filtering, and injection logic is harness work — not prompt work.

Prompt engineering operates on one turn. Harness engineering operates on the full trajectory.
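As a rough illustration of the "before the model ever sees a query" step, the sketch below filters already-scored corrections and injects the survivors into the system prompt. The function name, the `score` field, and the `minScore` and `limit` options are hypothetical; memoryManager.js may do this differently.

```javascript
// Hypothetical sketch: function, field, and option names are assumptions.
function buildSystemPrompt(basePrompt, scoredMemories, { minScore = 1, limit = 3 } = {}) {
  // Filtering: keep only corrections that matched the query well enough,
  // best matches first, capped so the prompt does not grow without bound.
  const selected = scoredMemories
    .filter((m) => m.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);

  if (selected.length === 0) return basePrompt;

  // Injection: the model just sees a richer system prompt.
  const notes = selected.map((m, i) => `${i + 1}. ${m.text}`).join("\n");
  return `${basePrompt}\n\nPast reviewer corrections (treat as constraints):\n${notes}`;
}

const prompt = buildSystemPrompt("You are a clinical planning assistant.", [
  { text: "Patient X is allergic to penicillin", score: 3 },
  { text: "Unrelated note", score: 0 },
]);
console.log(prompt);
```

None of this changes the wording of the base prompt; it changes what surrounds it on a given run, which is the part the harness owns.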

It is NOT a pipeline

A pipeline is a linear sequence: input → model → output. That's not what an agent harness is.

The execution loop in orchestrator.js runs up to 10 iterations. On each iteration it calls the LLM, extracts tool calls from the response, executes the tool, runs a guardrail check, and either appends the result and continues or forces a revision and loops. The path through that loop is not predetermined — it depends on what the model calls, what the tool returns, and whether the guardrail passes.

The harness is a control structure, not a pipeline. It has branches, retries, and termination conditions.
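The loop described above could be sketched roughly as follows. This is an illustration under assumptions, not the code in orchestrator.js: `runAgent`, `parseToolCall`, the message format, and the injected `callModel`, `tools`, and `guardrail` parameters are all hypothetical names. Only the cap of 10 iterations comes from the article.

```javascript
// Hypothetical sketch of a bounded agent loop; all names are assumptions.
const MAX_ITERATIONS = 10; // the cap the article mentions

function parseToolCall(text) {
  const m = text.match(/<tool_call>([\s\S]*?)<\/tool_call>/);
  if (!m) return null;
  try {
    return JSON.parse(m[1]);
  } catch {
    return null;
  }
}

async function runAgent({ callModel, tools, guardrail, query }) {
  const messages = [{ role: "user", content: query }];

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const reply = await callModel(messages);
    messages.push({ role: "assistant", content: reply });

    // Termination: no tool call means the model produced its answer.
    const call = parseToolCall(reply);
    if (!call) return { status: "done", answer: reply, iterations: i + 1 };

    // Branch: unknown tool, ask for a revision and loop again.
    if (!Object.hasOwn(tools, call.name)) {
      messages.push({ role: "user", content: `Unknown tool: ${call.name}. Revise.` });
      continue;
    }

    const result = await tools[call.name](call.arguments ?? {});

    // Branch: the guardrail decides whether the result is appended as-is
    // or replaced by a correction that forces the plan to be revised.
    const verdict = guardrail(call, result);
    messages.push({
      role: "user",
      content: verdict.safe
        ? `Tool result: ${JSON.stringify(result)}`
        : `Guardrail violation: ${verdict.reason}. Revise the plan.`,
    });
  }

  // Termination: the iteration budget ran out without a final answer.
  return { status: "max_iterations", answer: null, iterations: MAX_ITERATIONS };
}
```

Because the model, tools, and guardrail are passed in, the same function can be exercised with a scripted model in a test: feed it a reply that triggers a guardrail failure, then a revised answer, and check that the loop ends after two iterations.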

It is NOT optional validation bolted on at the end

Every domain in this project enforces guardrails at three distinct points: before tool execution (validateToolCall), after tool execution (validateToolOutput), and before the final plan is returned (validateFinalPlan).

In the drug discovery domain (src/domains/drugDiscovery.js), if a compound's hepatotoxicity score is ≥0.7, the guardrail sets safe: false and the orchestrator appends a correction message and re-enters the loop — the IND filing is blocked before it ever reaches the user. The guardrail doesn't annotate a bad answer; it prevents the bad answer from being produced.

In the insurance domain, a fraud risk score ≥0.7 triggers an SIU referral and blocks settlement — not as a UI warning, but as an execution-layer intervention that forces plan revision.

Guardrails are not postprocessing. They are load-bearing logic inside the execution loop.
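Here is one way the three checkpoints might look for the drug discovery case. The hook names (validateToolCall, validateToolOutput, validateFinalPlan) and the 0.7 threshold come from the text above; the `assessToxicity` tool, the field names, and the specific pre-execution and final-plan rules are assumptions made up for this sketch.

```javascript
// Hypothetical sketch: hook names and the threshold are from the article;
// the tool name, field names, and two of the three rules are assumptions.
const HEPATOTOXICITY_LIMIT = 0.7;

const drugDiscoveryGuardrails = {
  // Checkpoint 1, before tool execution (assumed rule: required argument).
  validateToolCall(call) {
    if (call.name === "assessToxicity" && !call.arguments?.compoundId) {
      return { safe: false, reason: "assessToxicity requires a compoundId" };
    }
    return { safe: true };
  },

  // Checkpoint 2, after tool execution: block on a high hepatotoxicity score.
  validateToolOutput(_call, output) {
    if (typeof output.hepatotoxicity === "number" && output.hepatotoxicity >= HEPATOTOXICITY_LIMIT) {
      return {
        safe: false,
        reason: `hepatotoxicity ${output.hepatotoxicity} >= ${HEPATOTOXICITY_LIMIT}`,
      };
    }
    return { safe: true };
  },

  // Checkpoint 3, before the plan is returned (assumed rule: a flagged
  // compound may not appear in a plan that recommends an IND filing).
  validateFinalPlan(plan, flaggedCompoundIds) {
    if (plan.recommendIND && flaggedCompoundIds.has(plan.compoundId)) {
      return { safe: false, reason: `IND filing blocked for ${plan.compoundId}` };
    }
    return { safe: true };
  },
};
```

Each hook returns the same `{ safe, reason }` shape, so the orchestrator can treat any failure the same way: append the reason as a correction message and re-enter the loop.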

It is NOT framework-agnostic glue code

The harness in this project has explicit architectural layers with defined responsibilities:

Information layer (src/information/): memory retrieval, tool schemas, context assembly
Execution layer (src/execution/): agentic loop, tool dispatch, guardrail enforcement
Feedback layer (src/feedback/): schema verification, event tracing, HITL correction capture

These aren't just directories — each layer has a specific job that the others do not do. The orchestrator never touches localStorage. The memory manager never calls a tool. The tracer never modifies execution state. This separation is what makes the harness maintainable and independently testable.
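The "tracer never modifies execution state" rule can be enforced in code and not only by convention. The sketch below is a hypothetical tracer, not the one in src/feedback/: it stores a copy of every payload, so the recorded trace and the live state cannot affect each other. It assumes payloads are JSON-serializable.

```javascript
// Hypothetical sketch of a read-only tracer; names are assumptions.
function createTracer() {
  const events = [];
  return {
    record(type, payload) {
      // Copy via a JSON round trip (assumes JSON-serializable payloads), so
      // the tracer holds no reference into the orchestrator's live state.
      const snapshot = JSON.parse(JSON.stringify(payload));
      events.push(Object.freeze({ type, at: Date.now(), payload: snapshot }));
    },
    // Hand out a shallow copy of the list so callers cannot push or splice.
    getEvents: () => events.slice(),
  };
}

const tracer = createTracer();
const state = { iteration: 1, lastTool: "assessToxicity" };
tracer.record("tool_executed", state);
state.iteration = 2; // later mutation by the execution layer
console.log(tracer.getEvents()[0].payload.iteration); // 1
```

A tracer built this way can be unit-tested on its own, which is the practical payoff of keeping the layers separate.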

The tool schemas in src/information/tools.js are exported in both OpenAI and Anthropic formats — the harness doesn't assume a provider. The contract between orchestrator and model is an explicit JSON schema, not implicit string matching.
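A small sketch of what dual-format export can look like: one neutral tool definition, plus two converters that produce the entry shapes used in the `tools` arrays of the OpenAI Chat Completions API and the Anthropic Messages API. The `checkAllergy` tool and the converter names are hypothetical and are not taken from tools.js.

```javascript
// Hypothetical tool definition; the parameters block is plain JSON Schema.
const checkAllergy = {
  name: "checkAllergy",
  description: "Check whether a patient has a recorded allergy to a drug.",
  parameters: {
    type: "object",
    properties: {
      patientId: { type: "string" },
      drug: { type: "string" },
    },
    required: ["patientId", "drug"],
  },
};

// OpenAI-style entry: the schema sits under function.parameters.
const toOpenAI = (tool) => ({
  type: "function",
  function: {
    name: tool.name,
    description: tool.description,
    parameters: tool.parameters,
  },
});

// Anthropic-style entry: the same schema sits under input_schema.
const toAnthropic = (tool) => ({
  name: tool.name,
  description: tool.description,
  input_schema: tool.parameters,
});

console.log(JSON.stringify(toOpenAI(checkAllergy), null, 2));
console.log(JSON.stringify(toAnthropic(checkAllergy), null, 2));
```

Keeping a single source definition means the JSON schema, which is the actual contract, is written once and only the envelope differs per provider.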

It is NOT a static configuration

The harness in this project learns at runtime. When a clinician rejects a plan and types a correction (for example, "Patient X is allergic to penicillin"), that correction is structured and written to localStorage via saveCorrection(). On the next run, retrieveRelevantMemories() splits the query into tokens, matches them against the stored correction text and tags, and injects the matching corrections into the system prompt.

No redeployment. No fine-tuning. No model update. The harness changed behavior based on human feedback within the same session.

This is distinct from prompt engineering (which is static) and fine-tuning (which requires a training run). It is runtime adaptation through structured memory — a harness-level capability.
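The save-and-retrieve cycle might be sketched like this. The function names saveCorrection and retrieveRelevantMemories come from the article; the storage key, the record shape, the overlap-count scoring, and the in-memory storage stand-in (used so the sketch also runs outside a browser) are assumptions.

```javascript
// Hypothetical sketch: storage key, record shape, and scoring are assumptions.
// `storage` is anything with getItem/setItem, e.g. window.localStorage.
const STORAGE_KEY = "harness.corrections";

// In-memory stand-in for localStorage so the sketch runs in Node as well.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => void data.set(key, String(value)),
  };
}

const tokenize = (s) => s.toLowerCase().match(/[a-z0-9]+/g) ?? [];

function saveCorrection(storage, text, tags = []) {
  const all = JSON.parse(storage.getItem(STORAGE_KEY) ?? "[]");
  all.push({ text, tags, savedAt: new Date().toISOString() });
  storage.setItem(STORAGE_KEY, JSON.stringify(all));
}

function retrieveRelevantMemories(storage, query, limit = 3) {
  const queryTokens = new Set(tokenize(query));
  const all = JSON.parse(storage.getItem(STORAGE_KEY) ?? "[]");
  return all
    .map((memory) => {
      // Score = number of distinct words (text + tags) shared with the query.
      const words = new Set([...tokenize(memory.text), ...memory.tags.flatMap(tokenize)]);
      const score = [...words].filter((w) => queryTokens.has(w)).length;
      return { ...memory, score };
    })
    .filter((memory) => memory.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

const storage = createMemoryStorage();
saveCorrection(storage, "Patient X is allergic to penicillin", ["allergy"]);
saveCorrection(storage, "Cap ibuprofen dose for renal patients", ["renal"]);
const hits = retrieveRelevantMemories(storage, "Prescribe penicillin for patient X");
console.log(hits.map((h) => `${h.score}: ${h.text}`));
```

One caveat with plain token overlap: common words such as "for" also count as matches, so a less naive version would probably add a stopword list or a minimum score before injecting anything.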

It is NOT the same thing across domains

The orchestrator loop is domain-agnostic, but the domain modules (src/domains/) are not interchangeable black boxes — they each define their own tools, guardrail thresholds, mock patients or compounds, and scenario sets.

Healthcare enforces weight-based dosage caps and allergy checks. Drug discovery blocks IND filings on positive Ames tests (genotoxicity: POSITIVE → safe: false, blockIND: true). Career counselling flags recommendations that guarantee salary figures, and it flags advice given to applicants over 50 that lacks age-neutral framing.

The harness provides the execution container. Domain logic provides the constraints. Neither replaces the other — and the quality of the overall system depends on both being correct.
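The split between a generic container and domain-specific constraints could look something like the sketch below. The rules (weight-based dose cap, allergy check, positive Ames test blocking the IND) are the ones described above; the module shape, the `proposeDose` tool, and every field name other than genotoxicity, safe, and blockIND are assumptions.

```javascript
// Hypothetical sketch: module shape, tool name, and most field names are
// assumptions; only the rules themselves are described in the article.
const healthcare = {
  name: "healthcare",
  validateToolOutput(call, out) {
    if (call.name !== "proposeDose") return { safe: true };
    const capMg = out.maxMgPerKg * out.weightKg; // weight-based dosage cap
    if (out.doseMg > capMg) {
      return { safe: false, reason: `dose ${out.doseMg} mg exceeds cap ${capMg} mg` };
    }
    if ((out.patientAllergies ?? []).includes(out.drug)) {
      return { safe: false, reason: `patient is allergic to ${out.drug}` };
    }
    return { safe: true };
  },
};

const drugDiscovery = {
  name: "drugDiscovery",
  validateToolOutput(_call, out) {
    // Positive Ames test: unsafe, and the IND filing is blocked.
    if (out.genotoxicity === "POSITIVE") {
      return { safe: false, blockIND: true, reason: "positive Ames test" };
    }
    return { safe: true };
  },
};

const domains = { healthcare, drugDiscovery };

// The harness side stays generic: it only knows the shared interface.
function checkOutput(domainName, call, out) {
  return domains[domainName].validateToolOutput(call, out);
}

console.log(checkOutput("drugDiscovery", { name: "runAmesTest" }, { genotoxicity: "POSITIVE" }));
```

Swapping domains changes which constraints run, not how the loop runs them, which is the sense in which neither side replaces the other.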

What it actually is

Harness engineering is the practice of building the execution container that surrounds an LLM: the control flow that drives multi-step agent behavior, the guardrails that enforce domain constraints mid-execution, the memory system that persists and retrieves human corrections, the schema validation that verifies structured outputs, and the tracing infrastructure that makes all of it observable.

It is the engineering layer between "a model that can answer questions" and "a system that reliably makes correct decisions in a specific domain."

The model is one component. The harness is the product.

Live demo: https://vishalmysore.github.io/harnessEngineeringDemo/
Code: https://github.com/vishalmysore/harnessEngineeringDemo

Explore the full implementation in src/execution/orchestrator.js, src/domains/, and src/feedback/. The three layers are readable in under 600 lines of code.

Top comments (1)

Gilder Miller • Jun 12

This is a really clear explanation. The distinction between prompt work and what actually happens in the harness makes a lot of sense. It’s interesting to see how guardrails act as part of the execution instead of just warnings after the fact. The layered approach feels much more manageable for real production use.
Have you noticed the harness getting more complicated when agents start handling multiple tools at once?