Richard Dillon

Primitive Shifts: The Harness-as-Primitive Shift

External Verification Loops Are Becoming Non-Negotiable Infrastructure

Every few months, the baseline of how AI systems work quietly moves. Engineers who noticed early weren't smarter — they were just paying attention to the right signals. The shift from "AI generates, humans review" to "AI generates within executable constraints" is one of those moves. If your mental model still treats verification as something that happens after AI output, you're already behind.

What Is It?

A harness is an external verification layer that wraps LLM execution — not prompt engineering, not fine-tuning, but deterministic constraints enforced outside the model's reasoning loop. The pattern is deceptively simple: the LLM generates output, the harness validates against executable specifications (tests, type checks, physics constraints, domain invariants), a feedback signal loops back, and the LLM iterates until convergence or rejection.
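
In skeleton form, the loop is just this (a minimal sketch; `generate` and `validate` are placeholder hooks for your model call and your executable checks, not any specific library's API):

```python
# Minimal harness loop sketch. `generate(task, feedback)` calls your LLM;
# `validate(candidate)` runs deterministic checks (tests, type checks,
# domain invariants) outside the model's reasoning loop.
def harness_loop(generate, validate, task, max_iterations=5):
    feedback = None
    for _ in range(max_iterations):
        candidate = generate(task, feedback)  # LLM proposes a candidate
        ok, feedback = validate(candidate)    # harness judges it, deterministically
        if ok:
            return candidate                  # convergence: output clears the harness
    raise RuntimeError("Rejected: iteration budget exhausted")  # rejection path
```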

This inverts the 2023-2024 paradigm where validation happened after AI output reached humans. Now verification is a runtime primitive that gates AI execution before it ever surfaces.

The research driving adoption is unambiguous: LLMs cannot reliably self-correct intrinsic reasoning failures without external grounding. The Convergent AI Agent Framework (CAAF) makes this explicit — the "verification gap" is structural, not a capability limitation to be trained away. When an LLM hallucinates incorrect code, no amount of "think step by step" prompting fixes it; only external execution feedback does.

The IACDM methodology formalizes this as "Interactive Adversarial Convergence" — treating the harness as an adversarial validator that pressure-tests outputs. Production systems like Claude Code's built-in safety checkpoints, detailed in recent architectural analyses, implement variants of this pattern with execution-based verification loops.

Here's the mental model shift: the harness isn't scaffolding you remove later. It's the actual product, with the LLM as a component inside it.

Why It's Flying Under the Radar

Engineers see "add tests" and think they already do this — but harness-as-primitive means tests run during generation, not after merge. Your CI pipeline catches bugs post-commit; a harness catches them mid-generation, before the code ever exists in your repository. This distinction sounds subtle but changes everything about how AI integrates into development workflows.

The pattern looks like "just good engineering" rather than a new AI primitive, so it doesn't get labeled or discussed as such. Framework marketing emphasizes agent autonomy and capability benchmarks; verification infrastructure is unglamorous plumbing that doesn't demo well.

Early adopters discovered it through failure. A 2025 study by METR showed experienced developers using frontier models were measurably slower despite believing they were faster — the verification gap made them confident and wrong. They trusted model output, shipped bugs, and spent debugging time that exceeded any generation speedup.

Multi-agent architectures get attention at conferences; single-agent-with-harness quietly outperforms in production. Both OpenAI Codex and Claude Code run single ReAct loops with heavy external verification, not the multi-agent swarms that dominate research papers.

The shift is happening inside build systems and CI pipelines, not in prompts or model configs. If you're not touching infrastructure, you're not seeing it happen.

Hands-On: Try It Today

Here's a minimal harness implementation that wraps any code generation task with pytest verification:

```python
# harness.py - Minimal verification harness for LLM code generation
# Requires: anthropic>=0.34.0, pytest>=8.0.0
# pip install anthropic pytest

import subprocess
import tempfile
import os
from pathlib import Path
from anthropic import Anthropic

# Configuration
MAX_ITERATIONS = 5
MODEL = "claude-sonnet-4-20250514"

def run_tests(code: str, test_code: str, work_dir: Path) -> tuple[bool, str]:
    """Execute pytest against generated code, return (passed, output)."""
    # Write the implementation file
    impl_path = work_dir / "implementation.py"
    impl_path.write_text(code)

    # Write the test file
    test_path = work_dir / "test_implementation.py"
    test_path.write_text(test_code)

    # Run pytest with captured output
    result = subprocess.run(
        ["python", "-m", "pytest", str(test_path), "-v", "--tb=short"],
        capture_output=True,
        text=True,
        cwd=work_dir,
        timeout=30  # Hard timeout prevents infinite loops
    )

    passed = result.returncode == 0
    output = result.stdout + result.stderr
    return passed, output

def generate_with_harness(
    client: Anthropic,
    task_description: str,
    test_code: str,
    initial_code: str = ""
) -> tuple[str, int]:
    """
    Generate code that passes tests, iterating until success or budget exhaustion.
    Returns (final_code, iterations_used).
    """

    current_code = initial_code
    last_test_output = ""  # defined up front so a provided initial_code can be repaired too
    iteration = 0

    with tempfile.TemporaryDirectory() as temp_dir:
        work_dir = Path(temp_dir)

        while iteration < MAX_ITERATIONS:
            iteration += 1

            # First iteration: generate from scratch
            # Subsequent iterations: fix based on test failures
            if current_code == "":
                prompt = f"""Write Python code to solve this task:

{task_description}

The code will be tested against these tests:

{test_code}

Output ONLY the implementation code, no markdown fencing."""
            else:
                prompt = f"""The following code failed tests:

{current_code}

Test output:
{last_test_output}

Fix the code to pass all tests. Output ONLY the fixed implementation code, no markdown fencing."""

            # Generate candidate solution
            response = client.messages.create(
                model=MODEL,
                max_tokens=2048,
                messages=[{"role": "user", "content": prompt}]
            )

            current_code = response.content[0].text.strip()

            # Strip markdown code fences if model included them anyway
            if current_code.startswith("```"):
                lines = current_code.split("\n")
                current_code = "\n".join(lines[1:-1])

            # Run harness validation
            passed, last_test_output = run_tests(current_code, test_code, work_dir)

            if passed:
                print(f"✓ Tests passed on iteration {iteration}")
                return current_code, iteration
            else:
                print(f"✗ Iteration {iteration} failed, retrying...")

    # Budget exhausted
    raise RuntimeError(f"Failed to generate passing code after {MAX_ITERATIONS} iterations")

# Example usage
if __name__ == "__main__":
    client = Anthropic()

    # The harness spec (your tests) IS the requirement
    test_code = """
from implementation import merge_sorted_lists

def test_basic_merge():
    assert merge_sorted_lists([1, 3, 5], [2, 4, 6]) == [1, 2, 3, 4, 5, 6]

def test_empty_lists():
    assert merge_sorted_lists([], [1, 2, 3]) == [1, 2, 3]
    assert merge_sorted_lists([1, 2, 3], []) == [1, 2, 3]

def test_duplicates():
    assert merge_sorted_lists([1, 2, 2], [2, 3]) == [1, 2, 2, 2, 3]

def test_single_elements():
    assert merge_sorted_lists([1], [2]) == [1, 2]
"""

    task = "Implement merge_sorted_lists(list1, list2) that merges two sorted lists into one sorted list."

    code, iterations = generate_with_harness(client, task, test_code)
    print(f"\nGenerated in {iterations} iteration(s):\n{code}")
```

For TypeScript projects, apply the same pattern with Zod schema validation as the harness:

```typescript
// harness.ts - Schema validation harness for structured generation
// Requires: zod@3.23.0, @anthropic-ai/sdk@0.30.0
// npm install zod @anthropic-ai/sdk

import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

// Define your domain schema - this IS your harness
const OrderSchema = z.object({
  orderId: z.string().uuid(),
  customerId: z.string().min(1),
  items: z.array(
    z.object({
      sku: z.string().regex(/^[A-Z]{3}-\d{4}$/),
      quantity: z.number().int().positive(),
      unitPrice: z.number().positive(),
    })
  ).min(1),
  // Domain invariant: total must equal sum of (quantity * unitPrice)
  total: z.number().positive(),
}).refine(
  (order) => {
    const calculatedTotal = order.items.reduce(
      (sum, item) => sum + item.quantity * item.unitPrice,
      0
    );
    return Math.abs(order.total - calculatedTotal) < 0.01;
  },
  { message: "Total must equal sum of item prices" }
);

type Order = z.infer<typeof OrderSchema>;

const MAX_ITERATIONS = 3;

async function generateWithSchemaHarness(
  client: Anthropic,
  prompt: string
): Promise<Order> {
  let lastError = "";

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const fullPrompt = lastError
      ? `${prompt}\n\nPrevious attempt failed validation: ${lastError}\n\nFix the JSON and try again.`
      : prompt;

    const response = await client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: fullPrompt }],
    });

    const text = response.content[0].type === "text"
      ? response.content[0].text
      : "";

    // Extract JSON from response (handle markdown fencing)
    const jsonMatch = text.match(/`{3}(?:json)?\s*([\s\S]*?)`{3}/) ||
                      text.match(/(\{[\s\S]*\})/);

    if (!jsonMatch) {
      lastError = "No valid JSON found in response";
      continue;
    }

    try {
      const parsed = JSON.parse(jsonMatch[1]);
      // Harness validation - schema + domain invariants
      const validated = OrderSchema.parse(parsed);
      console.log(`✓ Validation passed on iteration ${i + 1}`);
      return validated;
    } catch (e) {
      if (e instanceof z.ZodError) {
        lastError = e.errors.map((err) => 
          `${err.path.join(".")}: ${err.message}`
        ).join("; ");
        console.log(`✗ Iteration ${i + 1}: ${lastError}`);
      } else {
        lastError = `JSON parse error: ${e}`;
      }
    }
  }

  throw new Error(`Failed after ${MAX_ITERATIONS} iterations: ${lastError}`);
}

// Usage
const client = new Anthropic();

generateWithSchemaHarness(
  client,
  `Generate a sample e-commerce order as JSON with:
   - A valid UUID for orderId
   - SKUs in format ABC-1234
   - At least 2 items
   - Correctly calculated total`
).then(console.log);


```

The key insight from both implementations: your test suite or schema is the specification. BDD/TDD-first workflows write Gherkin specs or failing tests before prompting, treating them as the harness signal rather than human review.
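
As a sketch of that workflow using the Python harness above (the `slugify` task and its tests are illustrative examples, not from any cited source): write the failing tests first, then hand them to the loop.

```python
# Illustrative TDD-first usage of generate_with_harness from harness.py:
# the failing tests below are written before any prompt and act as the spec.
from anthropic import Anthropic
from harness import generate_with_harness

spec_tests = """
from implementation import slugify

def test_lowercases_and_hyphenates():
    assert slugify("Hello World") == "hello-world"

def test_strips_punctuation():
    assert slugify("Hello, World!") == "hello-world"
"""

client = Anthropic()
code, used = generate_with_harness(
    client,
    "Implement slugify(text): lowercase the text, drop punctuation, "
    "and join words with single hyphens.",
    spec_tests,
)
print(f"Converged in {used} iteration(s)")
```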

What This Means for Your Stack

Test coverage becomes AI capability. Teams with comprehensive test suites get dramatically better AI output; teams without them hit a ceiling no prompt engineering crosses. This isn't a metaphor — the harness literally cannot validate what you haven't specified. Research on AI agent failures shows specification completeness directly correlates with generation success rates.
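
A contrived sketch of that ceiling (the `median` example is mine, not from the cited research): the buggy implementation below clears a one-test harness, because the even-length requirement was never made executable.

```python
# Illustrative only: an under-specified harness accepts a wrong implementation.
def median(values):
    return sorted(values)[len(values) // 2]  # wrong for even-length input

def test_odd_length():
    assert median([3, 1, 2]) == 2  # the only test: the buggy code passes

def test_even_length():  # the missing spec that closes the gap
    assert median([1, 2, 3, 4]) == 2.5  # fails against the implementation above
```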

CI/CD pipelines become AI infrastructure. Your existing verification tooling (linters, type checkers, integration tests) is now part of your AI system's runtime, not just your human workflow. The Shift-Up framework explicitly positions software engineering guardrails as AI-native infrastructure.
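
A sketch of what that reuse can look like, chaining a type checker ahead of the tests as sequential gates (`mypy` is my stand-in here; any tool with a meaningful exit code slots in the same way):

```python
# Sketch: reusing existing CI tools as in-loop harness gates.
# Assumes mypy and pytest are installed; each gate fails fast with its output.
import subprocess
from pathlib import Path

def run_gates(code: str, work_dir: Path) -> tuple[bool, str]:
    impl_path = work_dir / "implementation.py"
    impl_path.write_text(code)

    gates = [
        ["python", "-m", "mypy", str(impl_path)],         # static types first
        ["python", "-m", "pytest", str(work_dir), "-q"],  # then behavior
    ]
    for gate in gates:
        result = subprocess.run(
            gate, capture_output=True, text=True, cwd=work_dir, timeout=60
        )
        if result.returncode != 0:
            return False, result.stdout + result.stderr
    return True, "all gates passed"
```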

"Vibe coding" produces technical debt faster. A large-scale empirical study found AI-generated code without harness validation accumulated 484,366 distinct issues across 302.6k commits — code smells at 89.3%. The speed advantage of AI generation becomes negative if you're generating bugs faster than you fix them.

Architecture decision: harness logic belongs in the orchestration layer. Separate "what the AI does" from "what constraints it operates under." Recent analysis of agentic systems argues this separation is essential for maintainability and auditability.
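
One way to express that separation (an illustrative design, not a published API): constraints implement a small interface, and the orchestrator consumes it without knowing what sits behind it.

```python
# Illustrative split: the orchestrator owns the loop and the budget;
# harnesses own the constraints. Neither knows the other's internals.
from typing import Callable, Protocol

class Harness(Protocol):
    def check(self, candidate: str) -> tuple[bool, str]:
        """Return (passed, feedback) for a candidate output."""
        ...

def orchestrate(
    generate: Callable[[str, str], str],
    harnesses: list[Harness],
    task: str,
    budget: int = 5,
) -> str:
    feedback = ""
    for _ in range(budget):
        candidate = generate(task, feedback)
        failures = [msg for h in harnesses
                    for ok, msg in [h.check(candidate)] if not ok]
        if not failures:
            return candidate
        feedback = "; ".join(failures)
    raise RuntimeError(f"Budget exhausted; last feedback: {feedback}")
```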

Human review shifts from gatekeeping to harness design. Engineers spend time writing better constraints, not reviewing more AI output. The Agent Skills specification assumes skills come with verification criteria — skills without validators are incomplete primitives.

The Infrastructure Signal

The convergent evolution tells the story. CAAF, IACDM, the Shift-Up framework, and Anthropic's internal practices all independently arrived at "external verification as first-class primitive." When multiple research groups solving different problems converge on the same pattern, it's usually load-bearing.

The tooling investment pattern is revealing. Letta's 74% LoCoMo score came from filesystem-based memory with validation, not sophisticated retrieval — simple harnesses beat complex memory architectures. Platform engineering integration follows: IDPs projected to reach 80% adoption are natural homes for harness infrastructure, with "golden paths" essentially functioning as pre-validated execution corridors.

Benchmark evolution provides another signal. Terminal-Bench, SWE-bench, and similar evaluations are harness-native — they measure agent performance inside verification loops, not raw generation quality. When the benchmarks assume harnesses, the production systems will too.

The quiet deprecation is already visible in the literature. Prompt-only approaches are being called "anti-patterns" in 2025-2026 publications; "unstructured vibe coding" is explicitly positioned as the thing harnesses fix.

Shift Rating

🟢 Adopt Now — Teams without harness infrastructure are already accumulating technical debt faster than they realize. The primitive is production-ready, framework-agnostic, and builds on existing testing/CI investments. The implementation cost is low (you likely have most of the pieces already), and the payoff compounds: better AI output today, less debt tomorrow, and infrastructure that scales as models improve.

Engineers who internalize "verification is runtime infrastructure, not post-hoc review" will feel the gap close. Those who don't will wonder why their AI tooling plateaued while others kept accelerating.


Sources

- Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework

*This is part of **Primitive Shifts** — a monthly series tracking when new AI building blocks move from novel experiments to infrastructure you'll be expected to know.*

Follow the Next MCP Watch series on Dev.to to catch every edition.

Spotted a shift happening in your stack? Drop it in the comments.
