DEV Community

Truong Phung
πŸ—οΈ πŸ“ Harness Engineering: The Emerging Discipline of Making AI Agents Reliable πŸ€–

A comprehensive guide to the practice of shaping the environment around AI agents so they can work dependably β€” based on references from the Awesome Harness Engineering collection.


Table of Contents

  1. What Is Harness Engineering?
  2. Why It Matters Now
  3. The Core Equation: Agent = Model + Harness
  4. Foundations & Key Mental Models
  5. Context Engineering: The Working Memory Budget
  6. Constraints, Guardrails & Safe Autonomy
  7. Specs, Agent Files & Workflow Design
  8. Evals & Observability
  9. Runtimes, Harnesses & Reference Implementations
  10. Benchmarks: Measuring Harness Quality
  11. Practical Playbook: Engineering Your Own Harness
  12. The Future of Harness Engineering
  13. Conclusion
  14. References & Further Reading

1. What Is Harness Engineering?

Harness engineering is the practice of designing, building, and iterating on the environment, tooling, constraints, and feedback loops that surround an AI agent β€” everything that isn't the model itself. The term gained widespread traction in early 2026, popularized by field reports from OpenAI, Anthropic, LangChain, Thoughtworks, and HumanLayer, all converging on the same insight: the reliability of an AI agent depends less on the model and more on the system wrapped around it.

As LangChain's Vivek Trivedy crystallized it:

Agent = Model + Harness. If you're not the model, you're the harness.

A harness includes:

  • System prompts β€” the instructions that shape the agent's persona and constraints
  • Tools, skills, and MCP servers β€” capabilities the agent can invoke
  • Bundled infrastructure β€” filesystem, sandboxes, browsers, observability stacks
  • Orchestration logic β€” sub-agent spawning, handoffs, model routing
  • Hooks and middleware β€” deterministic control flow for compaction, continuation, lint checks, and verification
  • Memory and state management β€” progress files, git history, structured knowledge bases

A raw model is not an agent. It becomes one only when a harness gives it state, tool execution, feedback loops, and enforceable constraints. Harness engineering is the discipline of making all of that work well.


2. Why It Matters Now

The "Skill Issue" Realization

As HumanLayer argued, teams that blame weak agent results on model limitations are usually wrong. After hundreds of agent sessions across dozens of projects, the pattern is consistent:

It's not a model problem. It's a configuration problem.

Every time a team instinctively says "GPT-6 will fix it" or "we just need better instruction-following," the real fix is almost always in the harness β€” better context management, smarter tool selection, proper verification loops, or cleaner handoff artifacts.

The OpenAI Proof Point

OpenAI's flagship field report provided dramatic evidence. A three-person engineering team built and shipped an internal product with zero manually-written code β€” roughly a million lines across application logic, tests, CI, documentation, and tooling β€” all generated by Codex agents. The team averaged 3.5 merged PRs per engineer per day, and Codex runs regularly worked autonomously for six hours or more.

The key insight was that early progress was slower than expected β€” not because Codex was incapable, but because the environment was underspecified. The primary job of human engineers became enabling agents to do useful work: building the harness.

"Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: 'what capability is missing, and how do we make it both legible and enforceable for the agent?'"

Harness Changes Move Benchmarks

LangChain demonstrated that harness changes alone can significantly improve benchmark performance β€” moving their coding agent from Top 30 to Top 5 on Terminal-Bench 2.0 by only changing the harness, not the model. Anthropic showed that infrastructure configuration can move coding benchmark scores by more than many leaderboard gaps. The implication is profound: benchmarks often measure harness quality as much as β€” or more than β€” model quality.


3. The Core Equation: Agent = Model + Harness

LangChain's Anatomy of an Agent Harness provides the clearest decomposition. Working backwards from what models cannot do natively reveals why each harness component exists:

| What We Want | What Models Can't Do Natively | Harness Solution |
|---|---|---|
| Persistent memory | Maintain durable state across interactions | Filesystem, git, progress files, AGENTS.md |
| Autonomous problem-solving | Execute arbitrary code | Bash tool, code execution sandboxes |
| Real-time knowledge | Access information beyond training cutoff | Web search, MCP tools, Context7 |
| Safe operation | Understand risk boundaries | Sandboxes, allow-lists, network isolation |
| Long-horizon coherence | Work across multiple context windows | Compaction, Ralph Loops, planning files |
| Self-verification | Know if their work is correct | Test runners, browser automation, linters |
The filesystem emerges as the most foundational harness primitive because it unlocks everything else: agents get a workspace, work can be incrementally persisted, and multiple agents can coordinate through shared files. Git adds versioning so agents can track work, rollback errors, and branch experiments.


4. Foundations & Key Mental Models

4.1 Feedforward and Feedback (Thoughtworks)

Birgitta BΓΆckeler's framework at Thoughtworks provides the most rigorous mental model for harness engineering. She frames it through two control mechanisms:

  • Guides (feedforward controls) β€” anticipate the agent's behavior and steer it before it acts. They increase the probability of good results on the first attempt. Examples: AGENTS.md files, architecture documentation, skills, coding conventions, reference applications.

  • Sensors (feedback controls) β€” observe after the agent acts and help it self-correct. Most powerful when they produce signals optimized for LLM consumption. Examples: linters with custom error messages, test suites, code review agents, browser screenshots.

Without guides, the agent keeps repeating mistakes. Without sensors, the agent encodes rules but never finds out whether they worked. A good harness requires both.

4.2 Computational vs. Inferential

Each control can be either:

  • Computational β€” deterministic and fast, run by the CPU. Tests, linters, type checkers, structural analysis. Milliseconds to seconds; results are reliable.
  • Inferential β€” semantic analysis, AI code review, "LLM as judge." Slower, more expensive, non-deterministic β€” but capable of richer judgment.

| Control | Direction | Type | Example |
|---|---|---|---|
| Coding conventions | Feedforward | Inferential | AGENTS.md, Skills |
| Structural tests | Feedback | Computational | ArchUnit tests checking module boundaries |
| Code review agent | Feedback | Inferential | A review skill using a strong model |
| Bootstrap scripts | Feedforward | Both | Skill with instructions and a bootstrap script |
| Code mods | Feedforward | Computational | OpenRewrite recipes |

4.3 The Cybernetic Governor

The harness acts as a cybernetic governor β€” combining feedforward and feedback to regulate the codebase toward its desired state. BΓΆckeler identifies three regulation dimensions:

  1. Maintainability harness β€” internal code quality (linters, complexity checks, coverage). The most mature category with extensive pre-existing tooling.
  2. Architecture fitness harness β€” system characteristics (performance, observability, security). Essentially architectural fitness functions.
  3. Behaviour harness β€” functional correctness. The hardest category: how do we verify that the application does what we need? This remains the elephant in the room.

4.4 The Three Pillars (Thoughtworks)

Thoughtworks frames harness work into three pillars:

  1. Context engineering β€” managing what the agent knows and when
  2. Architectural constraints β€” enforcing invariants mechanically
  3. Garbage collection β€” fighting entropy and drift continuously

4.5 Control–Agency–Runtime (CAR) Decomposition

An academic position paper proposes treating the harness layer as a first-class research object with three dimensions:

  • Control β€” constraints, guardrails, permissions
  • Agency β€” planning, decision-making, self-evaluation
  • Runtime β€” execution environment, tools, infrastructure

5. Context Engineering: The Working Memory Budget

Context engineering is the practice of managing the agent's context window as a working memory budget rather than a dumping ground. It is arguably the most critical aspect of harness engineering.

5.1 The One Big File Anti-Pattern

OpenAI learned the hard way that a monolithic AGENTS.md doesn't scale:

  • Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs.
  • Too much guidance becomes non-guidance. When everything is "important," nothing is.
  • It rots instantly. A monolithic manual turns into a graveyard of stale rules.
  • It's hard to verify. A single blob doesn't lend itself to mechanical checks.

Their solution: treat AGENTS.md as a table of contents (~100 lines) that points to deeper sources of truth in a structured docs/ directory. This enables progressive disclosure β€” agents start with a small, stable entry point and are taught where to look next.

5.2 Progressive Disclosure

Progressive disclosure is the principle that agents should only receive specific instructions, knowledge, or tools when they actually need them. Loading everything upfront pushes the agent into what HumanLayer calls "the dumb zone" β€” where context window fill degrades performance even on simple tasks.

Chroma's research on context rot provides empirical backing: models perform measurably worse at longer context lengths, and degradation is steeper when there's low semantic similarity between the query and the relevant information in context.

Skills solve this: they're activated on demand, bringing in focused knowledge only when needed.

5.3 Context-Efficient Backpressure

HumanLayer's backpressure philosophy is essential: verification mechanisms must be context-efficient. Running a full test suite after every change floods the context window with thousands of lines of passing tests. The agent loses track of its actual task.

The rule: success is silent, only failures produce output. Swallow the output of passing checks and only surface errors.
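The idea can be sketched as a small wrapper around any verification command: run it, swallow all output on success, and surface only failures. The `run_check` helper below is illustrative, not part of any particular harness.

```python
import subprocess

def run_check(cmd: list[str]) -> str:
    """Run a verification command; stay silent on success, surface failures.

    Only failing output reaches the agent's context window, keeping
    verification context-efficient (backpressure).
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return ""  # success is silent: no context spent on passing checks
    # Surface only the failure details, truncated to protect the budget
    output = (result.stdout + result.stderr).strip()
    return output[-4000:]  # keep the tail, where errors usually appear

# Only a failing check produces anything for the agent to read:
if msg := run_check(["python3", "-c", "import sys; sys.exit('lint: missing type hints')"]):
    print(msg)  # prints "lint: missing type hints"
```

The truncation limit is a tunable context budget: even a failing check should not be allowed to flood the window with thousands of lines.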

5.4 Sub-Agents as Context Firewalls

Sub-agents provide context isolation β€” each gets a fresh, small, high-relevance context window for its task, and only the condensed result flows back to the parent. This is far more powerful than simply making context windows bigger:

"A bigger context window doesn't make the model better at finding the needle β€” it just makes the haystack bigger."

Effective sub-agent tasks: codebase exploration, grep/search operations, tracing information flow, research tasks β€” anything with a straightforward question and simple answer that requires many intermediate tool calls.

5.5 Lessons from Manus

Manus' playbook contributed specific techniques: KV-cache locality optimization, tool masking, filesystem memory, and keeping useful failures in-context while discarding noise.

5.6 OpenHands Context Condensation

OpenHands' approach to bounded conversation memory preserves goals, progress, critical files, and failing tests while condensing everything else β€” keeping long-running coding sessions efficient without losing essential state.


6. Constraints, Guardrails & Safe Autonomy

6.1 Enforcing Invariants, Not Micromanaging

OpenAI's approach is instructive: enforce boundaries centrally, allow autonomy locally. They require Codex to parse data shapes at the boundary (parse, don't validate), but don't prescribe how. Each business domain follows a fixed layered architecture (Types β†’ Config β†’ Repo β†’ Service β†’ Runtime β†’ UI) with strictly validated dependency directions enforced by custom linters and structural tests β€” all Codex-generated.

"This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it's an early prerequisite: the constraints are what allows speed without decay."

Custom linter error messages are written to inject remediation instructions into agent context β€” a positive form of prompt injection that guides self-correction.
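A sketch of what such a linter rule might look like: a deterministic check whose error message tells the agent exactly how to remediate. The rule itself (no `print` in library code) and the message wording are invented for illustration.

```python
import re
from pathlib import Path

# Hypothetical rule: library modules must use structured logging, not print().
REMEDIATION = (
    "print call found in {path}:{line}. "
    "Replace it with logger.info(...) from the project's logging module, "
    "then re-run this linter to confirm the fix."
)

def lint_no_print(root: str) -> list[str]:
    """Return one remediation-style error per violation; empty list if clean."""
    errors = []
    for path in Path(root).rglob("*.py"):
        for lineno, text in enumerate(path.read_text().splitlines(), start=1):
            if re.search(r"\bprint\(", text):
                errors.append(REMEDIATION.format(path=path, line=lineno))
    return errors
```

Because each message names the preferred API and the next action, a failing run injects remediation instructions directly into the agent's context rather than a bare rule violation.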

6.2 Sandboxing and Controlled Execution

Anthropic's work on sandboxing focuses on reducing approval friction without losing control. Rather than prompting humans for every action, better sandboxing and policy design allow agents to work more autonomously while staying within safe boundaries.

MCP-based code execution gives agents controlled execution power through explicit, inspectable tool boundaries β€” making it clear what the agent can and cannot do.

6.3 Tool Design for Safety

Anthropic's guidance on writing tools for agents emphasizes that tool interfaces should be easy for models to call correctly and safely. Poorly designed tools lead to misuse; well-designed tools guide the agent toward correct behavior.

6.4 Prompt Injection Defense

OpenHands' practical guide covers confirmation mode, analyzers, sandboxing, and hard policies for reducing prompt-injection risk. This is especially important given that MCP server tool descriptions are injected into system prompts β€” never connect to one you don't trust.

6.5 Quality Checks in the Loop

Rather than relying on after-the-fact manual review, Thoughtworks advocates moving quality checks into the agent's own loop. Anchoring agents to reference applications constrains output with concrete exemplars. The question for humans becomes not "how do I review every line?" but "where should I strengthen the harness?"


7. Specs, Agent Files & Workflow Design

7.1 AGENTS.md and Agent Files

AGENTS.md is a lightweight open format for repo-local instructions that tell agents how to work inside a codebase. A related effort, agent.md, pursues machine-readable agent instructions across projects and tools.

The key insights from practitioners:

  • Keep it concise β€” HumanLayer's own CLAUDE.md is under 60 lines, a good target.
  • Avoid auto-generating it β€” LLM-generated agent files actually hurt performance while costing 20%+ more tokens.
  • Don't include directory listings β€” agents discover repository structure on their own just fine.
  • Use progressive disclosure β€” point to deeper resources rather than inlining everything.
  • Make instructions universally applicable β€” avoid conditional rules that confuse the model.

7.2 Spec-Driven Development

GitHub's Spec Kit enables spec-driven development where agents execute against explicit product and engineering specifications. Thoughtworks' analysis explains why strong specs make AI-assisted delivery more dependable: they give agents unambiguous goals to work toward.

7.3 The 12-Factor Agent

HumanLayer's 12-Factor Agents establishes operating principles for production agents:

  • Explicit prompts over implicit behavior
  • State ownership β€” agents manage their own state
  • Clean pause-resume behavior
  • Context discipline
  • Reproducible workflows

The companion 12-Factor AgentOps extends these principles to operations.

7.4 Feature Lists for Long-Running Work

Anthropic's approach to long-running agents uses an initializer agent that generates a comprehensive feature list (200+ features for a web app), all initially marked as "failing." Subsequent coding agents work through features one at a time, marking them as passing only after verification. This prevents two critical failure modes:

  1. The one-shot attempt β€” the agent tries to build everything at once, runs out of context mid-implementation, and leaves a broken state.
  2. Premature victory β€” the agent sees existing progress and declares the job done.

The feature list uses JSON rather than Markdown because models are less likely to inappropriately modify structured JSON files.


8. Evals & Observability

8.1 Why Evals Are Hard for Agents

Anthropic's guidance on demystifying evals highlights the fundamental challenge: agents have many possible trajectories to success or failure. A single task can be completed through vastly different tool-call sequences, making traditional input-output evaluation insufficient.

8.2 Eval Taxonomies

OpenAI's eval guide introduces turning agent traces into repeatable evals with JSONL logs and deterministic checks. LangChain's breakdown distinguishes:

  • Single-step evals β€” does one tool call produce the right result?
  • Full-run evals β€” does the complete task get solved?
  • Multi-turn evals β€” does the agent handle conversations and evolving goals?
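The JSONL-plus-deterministic-checks pattern can be sketched as: log each run as one JSON record per step, then replay the log through a mechanical grader. The record fields and the check below are illustrative assumptions.

```python
import json

def grade_trace(jsonl_lines: list[str]) -> dict:
    """Deterministic single-step check over an agent trace log.

    Each JSONL record is assumed to look like:
    {"step": 1, "tool": "bash", "exit_code": 0}
    """
    records = [json.loads(line) for line in jsonl_lines]
    tool_calls = [r for r in records if "tool" in r]
    failures = [r for r in tool_calls if r.get("exit_code", 0) != 0]
    return {
        "steps": len(records),
        "tool_calls": len(tool_calls),
        "failed_calls": len(failures),
        "passed": not failures,
    }

trace = [
    '{"step": 1, "tool": "bash", "exit_code": 0}',
    '{"step": 2, "tool": "edit", "exit_code": 0}',
]
print(grade_trace(trace))  # all tool calls succeeded, so "passed" is True
```

Because the grader is pure and deterministic, the same trace always produces the same grade, which is what makes a logged trace a repeatable eval.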

8.3 Trace Grading

OpenAI's trace grading enables grading agent traces directly β€” especially helpful for long multi-step tasks where the final output alone doesn't reveal whether the agent's process was sound.

8.4 Verification Stacks

OpenHands' layered verification uses trajectory critics trained on production traces for reranking, early stopping, and review-time quality control. This goes beyond simple pass/fail testing to assess the quality of the agent's reasoning process.

8.5 Skill-Level Evals

OpenHands' playbook emphasizes measuring whether a specific skill actually helps using:

  • Bounded tasks
  • Deterministic verifiers
  • No-skill baselines
  • Trace review

8.6 Infrastructure Noise

Anthropic's research on infrastructure noise shows that runtime configuration can move coding benchmark scores by more than many leaderboard gaps β€” meaning that benchmark results may reflect infrastructure choices more than model intelligence.


9. Runtimes, Harnesses & Reference Implementations

9.1 The Framework/Runtime/Harness Distinction

LangChain's decomposition clarifies what belongs where:

  • Framework β€” reusable abstractions for building agents (LangGraph, CrewAI)
  • Runtime β€” execution infrastructure (sandboxes, state management, scheduling)
  • Harness β€” the task-specific configuration and environment around a particular agent deployment

9.2 Notable Implementations

| Project | Description |
|---|---|
| SWE-agent | Mature research coding agent with inspectable harness, prompt, tools, and environment design |
| Claude Agent SDK | Production-oriented SDK with sessions, tools, and orchestration |
| deepagents | LangChain's open-source project for building deeper, longer-running agents with middleware and harness patterns |
| Citadel | Harness for Claude Code and Codex with isolated worktrees, multi-agent coordination, and persisted memory |
| AgentKit | TypeScript toolkit for building durable, workflow-aware agents on event-driven infrastructure |
| Harbor | Generalized harness for evaluating and improving agents at scale |
| Harness Evolver | Claude Code plugin that autonomously evolves agent harnesses using multi-agent proposers and evaluation |
| SWE-ReX | Sandboxed code execution infrastructure for AI agents |
| Uni-CLI | Universal CLI connecting agents to 134 sites via 711 declarative YAML pipelines with a self-repair loop |

9.3 Skills Ecosystem

skills.sh is a community marketplace for discovering, sharing, and installing reusable AI agent skills across runtimes like Claude Code β€” making harness capabilities portable and composable. However, caution is warranted: skill registries have already been caught distributing malicious skills, so treat them like npm install random-package and read what you're installing.


10. Benchmarks: Measuring Harness Quality

Benchmarks are especially useful when you want to compare harness quality, not just model quality. They stress context handling, tool calling, environment control, verification logic, and the runtime scaffolding around the model.

10.1 Software Engineering Benchmarks

| Benchmark | Focus |
|---|---|
| SWE-bench Verified | Real GitHub issues and tests; makes harness choices around retrieval, patching, and validation highly visible |
| Terminal-Bench | Terminal-native agents in shells and filesystems; especially useful for comparing coding-agent harnesses |
| EvoClaw | Dependent milestone sequences from real repo history; surfaces regression accumulation |

10.2 Web & Browser Benchmarks

| Benchmark | Focus |
|---|---|
| WebArena | Self-hostable web environment for evaluating autonomous agents |
| VisualWebArena | Multimodal web agents with image and screenshot inputs |
| BrowserGym | Reproducible framework comparing harnesses across multiple web benchmarks |
| WorkArena | Enterprise-style web workflows |

10.3 General & Multi-Domain Benchmarks

| Benchmark | Focus |
|---|---|
| AgentBench | Cross-environment: OS, databases, knowledge graphs, web browsing |
| GAIA | General AI assistant tasks comparing harness-level choices |
| OSWorld | Real computer use across Ubuntu, Windows, and macOS |
| AppWorld | Interactive coding agents with state-based and execution-based unit tests |
| ClawBench | Search, reasoning, coding, safety, and multi-turn conversation |

10.4 MCP-Specific Benchmarks

| Benchmark | Focus |
|---|---|
| MCP Bench | Tool accuracy, latency, and token use across MCP server types |
| MCPMark | Stress-testing on real-world MCP tasks (Notion, GitHub, Postgres) |
| OSWorld-MCP | Real-world computer tasks using the Model Context Protocol |

10.5 Leaderboards

| Leaderboard | Focus |
|---|---|
| Agent Arena | ELO-style ratings from head-to-head agent battles |
| HAL | Holistic agent evaluation with reliability, cost, and broad task coverage |
| Galileo Agent Leaderboard | LLM agents on task completion and tool calling across business domains |

11. Practical Playbook: Engineering Your Own Harness

Drawing from all the sources in the Awesome Harness Engineering collection, here is a practical playbook for building an effective harness.

Step 1: Start with Agent Files (But Keep Them Lean)

Create a concise AGENTS.md or CLAUDE.md at the root of your repository:

  • Keep it under 60 lines
  • Treat it as a table of contents, not an encyclopedia
  • Include: build commands, test commands, key conventions, project structure pointers
  • Exclude: directory listings, conditional rules, auto-generated content
  • Point to deeper docs/ for architectural decisions, design principles, and domain knowledge
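As a concrete illustration, a table-of-contents style agent file might look like the sketch below; every path, command, and convention in it is a placeholder for your project's own, not a prescribed template.

```markdown
# AGENTS.md

## Commands
- Build: `make build`
- Test (targeted): `make test FILE=<path>`
- Lint: `make lint`

## Conventions
- Parse, don't validate: decode external data at the boundary.
- Structured logging only; no print statements in library code.

## Where to look next
- Architecture decisions: docs/adr/
- Domain glossary: docs/domain.md
- Layering rules: docs/architecture.md
```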

Step 2: Set Up Computational Feedback Loops

These are your highest-leverage, lowest-cost sensors:

  • Type checking β€” catches structural errors deterministically
  • Linting with custom error messages that include remediation instructions
  • Fast test suites β€” run a targeted subset, not the full suite
  • Structural/architectural tests β€” enforce module boundaries and dependency directions

Critical rule: success is silent. Swallow output from passing checks; only surface errors to the agent.

Step 3: Add Verification Tools

Give the agent ways to verify its own work as a human would:

  • Browser automation (Puppeteer, Playwright) for end-to-end UI verification
  • Screenshot capture so the agent can visually inspect results
  • Observability stack β€” expose logs via LogQL, metrics via PromQL, traces via TraceQL
  • Dev server per worktree β€” isolate each agent's environment

Step 4: Implement Hooks for Control Flow

Use harness hooks to create deterministic checkpoints:

  • Pre-stop hooks: run formatter + type checker before the agent finishes; re-engage on failure
  • Notification hooks: alert humans when the agent needs attention
  • Approval hooks: auto-approve safe operations, deny dangerous ones (e.g., migrations)
  • Integration hooks: create PRs, post to Slack, set up preview environments
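A minimal approval-hook sketch: a function the harness calls before executing a command, auto-approving a safe allow-list and denying known-dangerous patterns. The patterns and the allow/deny/ask return protocol are assumptions, not any specific runtime's hook API.

```python
import re

# Hypothetical policy lists; tune these to your project's risk profile.
SAFE = [r"^git (status|diff|log)\b", r"^(ls|cat|grep|rg)\b", r"^make (test|lint)\b"]
DANGEROUS = [r"\brm -rf\b", r"\bdb:migrate\b", r"\bgit push --force\b"]

def approve(command: str) -> str:
    """Return 'allow', 'deny', or 'ask' (escalate to a human)."""
    if any(re.search(p, command) for p in DANGEROUS):
        return "deny"   # never run migrations or destructive commands unattended
    if any(re.match(p, command) for p in SAFE):
        return "allow"  # read-only and verification commands are auto-approved
    return "ask"        # anything unrecognized notifies a human
```

Denials are checked first so that a dangerous command embedded in an otherwise safe-looking invocation never slips through on a prefix match.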

Step 5: Use Sub-Agents for Context Control

Don't use sub-agents as "frontend engineer" or "backend engineer" personas β€” that doesn't work. Use them as context firewalls:

  • Delegate research, grep, and exploration to sub-agents
  • Use cheaper models (Sonnet, Haiku) for sub-agents; expensive models (Opus) for the parent
  • Return condensed results with source citations (filepath:line format)
  • Keep the parent thread in the "smart zone" with minimal context pollution
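The delegation pattern above can be sketched as follows, assuming a hypothetical `call_model` callable that runs an isolated sub-agent session and returns only its final message:

```python
def research_subtask(question: str, call_model) -> str:
    """Run exploration in a fresh context; only the condensed answer returns.

    `call_model(prompt)` is a placeholder for your runtime's API for
    spawning a sub-agent (typically a cheaper model) with an empty context.
    All intermediate tool calls stay inside the sub-agent's own window.
    """
    prompt = (
        f"Answer this question about the codebase: {question}\n"
        "Reply with at most 5 bullet points, each citing its source "
        "as filepath:line. Do not include raw file contents."
    )
    return call_model(prompt)

# The parent agent's window only ever sees the condensed bullets:
fake_runtime = lambda prompt: "- retries configured in src/http.py:42"
print(research_subtask("Where are HTTP retries configured?", fake_runtime))
```

The prompt's output contract (few bullets, `filepath:line` citations, no raw file dumps) is what makes the sub-agent a context firewall rather than just a second agent.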

Step 6: Enforce Architecture Mechanically

  • Define layered architecture with fixed dependency directions
  • Write custom linters (the agent can write them!)
  • Add structural tests that check invariants on every commit
  • Encode "taste" as rules: structured logging, naming conventions, file size limits
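A minimal structural test in this spirit: scan sources and fail if a lower layer imports from a higher one. The layer names and the top-level-import convention are invented for illustration.

```python
import re
from pathlib import Path

# Hypothetical layering, lowest to highest: types -> repo -> service -> ui.
LAYERS = ["types", "repo", "service", "ui"]

def check_dependency_directions(src_root: str) -> list[str]:
    """Return violations where a module imports from a higher layer."""
    rank = {name: i for i, name in enumerate(LAYERS)}
    violations = []
    for layer in LAYERS:
        for path in Path(src_root, layer).rglob("*.py"):
            for line in path.read_text().splitlines():
                match = re.match(r"\s*from (\w+)", line)
                if match and rank.get(match.group(1), -1) > rank[layer]:
                    violations.append(f"{path}: {layer} may not import {match.group(1)}")
    return violations
```

Run on every commit, a check like this turns the architecture diagram into an enforced invariant instead of a convention the agent may drift from.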

Step 7: Manage Long-Running Work

For tasks spanning multiple context windows:

  • Use an initializer agent to set up the environment: init.sh, progress file, feature list, initial git commit
  • Each subsequent coding agent reads progress, works on one feature, commits, and writes a summary
  • Always start a session by reading progress files and running a basic health check
  • Always end a session with a clean, mergeable state
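The session protocol above can be sketched as a pair of helpers, assuming a plain-text progress file and a `health_check` callable supplied by the harness (both are illustrative conventions, not a standard):

```python
from pathlib import Path

PROGRESS = Path("PROGRESS.md")

def start_session(health_check) -> str:
    """Begin by reading prior progress and verifying the workspace is sane."""
    notes = PROGRESS.read_text() if PROGRESS.exists() else "No prior progress."
    if not health_check():
        raise RuntimeError("Health check failed; fix the workspace before coding.")
    return notes  # goes into the agent's opening context

def end_session(summary: str) -> None:
    """Leave a clean, mergeable state: append what was done and what's next."""
    with PROGRESS.open("a") as f:
        f.write(summary.rstrip() + "\n")
```

Each agent in the chain starts from the durable file rather than from a predecessor's context window, which is what lets the work span arbitrarily many sessions.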

Step 8: Fight Entropy Continuously

OpenAI's "garbage collection" pattern:

  • Encode golden principles directly in the repository
  • Run recurring background agents that scan for deviations
  • Open targeted, small refactoring PRs that can be reviewed in under a minute
  • Track quality grades per domain and per architectural layer
  • Treat technical debt like a high-interest loan: pay it down daily

Step 9: Make Knowledge Repository-Local

Everything the agent needs must be in the repo:

  • Slack discussions about architecture? Encode them as markdown
  • Design decisions? Write ADRs
  • Onboarding knowledge? Put it in structured docs
  • Knowledge in people's heads? Doesn't exist for the agent

"From the agent's point of view, anything it can't access in-context while running effectively doesn't exist."

Step 10: Iterate Based on Failures

The most important meta-principle:

"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again." β€” Mitchell Hashimoto

Don't design the ideal harness upfront. Bias toward shipping. Add configuration only when the agent actually fails. Throw away things that don't help. Distribute battle-tested configurations via repository-level config.


12. The Future of Harness Engineering

Model-Harness Co-Evolution

Today's frontier coding agents (Claude Code, Codex) are post-trained with models and harnesses in the loop. This creates a feedback cycle where useful primitives are discovered, added to the harness, and then used when training the next generation of models. But this co-evolution has interesting side effects: models can become over-fitted to their training harness, performing worse in alternative harness configurations.

Harnessability as a Design Criterion

Not every codebase is equally amenable to harnessing. Thoughtworks introduces "harnessability" β€” the structural properties that make a codebase governable:

  • Strongly typed languages naturally have type-checking as a sensor
  • Clearly definable module boundaries afford architectural constraint rules
  • Mature frameworks abstract away details agents don't need to worry about

Greenfield teams can bake harnessability in from day one. Legacy teams face the harder problem: the harness is most needed where it is hardest to build.

Harness Templates

Enterprises have a few common service topologies (CRUD APIs, event processors, data dashboards). These may evolve into harness templates β€” bundles of guides and sensors pre-configured for a topology. Teams may start picking tech stacks partly based on what harnesses are already available.

Autonomous Harness Evolution

Projects like Harness Evolver point toward a future where agents autonomously improve their own harnesses using multi-agent proposers, evaluation-backed selection, and git worktree isolation.

Open Problems

  • How do we keep a harness coherent as it grows, with guides and sensors in sync?
  • How far can we trust agents to make trade-offs when instructions conflict?
  • If sensors never fire, is that high quality or inadequate detection?
  • How do we evaluate harness coverage similar to code coverage?
  • How does architectural coherence evolve over years in a fully agent-generated system?
  • Can we generalize these findings beyond coding to scientific research, financial modeling, and other domains?

13. Conclusion

Harness engineering represents a fundamental shift in how we think about AI-assisted software development. The discipline acknowledges a counterintuitive truth: the model is usually fine; the problem is the system around it.

The implications reshape what it means to be a software engineer:

  • Engineers become environment designers β€” specifying intent, building feedback loops, and shaping constraints rather than writing code directly.
  • Architecture becomes an early prerequisite β€” not a luxury for large teams, but a necessity for agent reliability.
  • Repository knowledge becomes the system of record β€” everything the agent needs must be versioned, discoverable, and mechanically verifiable.
  • Quality is enforced mechanically β€” once encoded, standards apply everywhere at once, at every hour.
  • Entropy is managed continuously β€” technical debt is treated as a high-interest loan, paid down daily by background agents.

As Thoughtworks' BΓΆckeler frames it: building this outer harness is emerging as an ongoing engineering practice, not a one-time configuration. Harnesses externalize what human developer experience brings to the table β€” conventions, quality intuitions, architectural judgment, organizational alignment β€” making it explicit, verifiable, and continuously enforceable.

The field is young, evolving rapidly, and full of open questions. But one thing is clear: the teams that invest in harness engineering β€” shaping the environment around their agents β€” will get dramatically better results than those waiting for the next model release to solve their problems.

"Building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code." β€” OpenAI


14. References & Further Reading

Foundational Articles

  1. Harness engineering: leveraging Codex in an agent-first world β€” OpenAI
  2. Effective harnesses for long-running agents β€” Anthropic
  3. Harness design for long-running application development β€” Anthropic
  4. The Anatomy of an Agent Harness β€” LangChain
  5. Harness Engineering β€” Thoughtworks
  6. Building effective agents β€” Anthropic
  7. Skill Issue: Harness Engineering for Coding Agents β€” HumanLayer
  8. Your Agent Needs a Harness, Not a Framework β€” Inngest
  9. Harness Engineering for Language Agents (CAR Decomposition) β€” Academic Paper

Context & Memory

  1. Effective context engineering for AI agents β€” Anthropic
  2. Context Engineering for AI Agents: Lessons from Building Manus β€” Manus
  3. Context Engineering for Coding Agents β€” Thoughtworks
  4. Advanced Context Engineering for Coding Agents β€” HumanLayer
  5. Context-Efficient Backpressure for Coding Agents β€” HumanLayer
  6. OpenHands Context Condensation β€” OpenHands
  7. Writing a good CLAUDE.md β€” HumanLayer

Safety & Constraints

  1. Beyond permission prompts: making Claude Code more secure β€” Anthropic
  2. Code execution with MCP β€” Anthropic
  3. Writing effective tools for agents β€” Anthropic
  4. Mitigating Prompt Injection Attacks β€” OpenHands
  5. Claude Code: Best practices for agentic coding β€” Anthropic

Workflow Design

  1. AGENTS.md β€” Open Standard
  2. GitHub Spec Kit β€” GitHub
  3. 12 Factor Agents β€” HumanLayer
  4. 12-Factor AgentOps β€” AgentOps

Evals & Observability

  1. Testing Agent Skills Systematically with Evals β€” OpenAI
  2. Demystifying Evals for AI Agents β€” Anthropic
  3. Evaluating Deep Agents β€” LangChain
  4. Improving Deep Agents with harness engineering β€” LangChain

Courses

  1. walkinglabs/learn-harness-engineering β€” A project-based course on making Codex and Claude Code more reliable

Curated Collection

  1. walkinglabs/awesome-harness-engineering β€” The comprehensive, community-maintained list that informed this article

This article synthesizes insights from the Awesome Harness Engineering collection β€” a curated list maintained by Walking Labs. All referenced works are credited to their original authors and organizations.


If you found this helpful, let me know by leaving a πŸ‘ or a comment, and if you think this post could help someone, feel free to share it! Thank you very much! πŸ˜ƒ
