DEV Community

Truong Phung
πŸ—οΈ πŸ“ Harness Engineering: The Emerging Discipline of Making AI Agents Reliable πŸ€–

A comprehensive guide to the practice of shaping the environment around AI agents so they can work dependably β€” based on references from the Awesome Harness Engineering collection.


Table of Contents

  1. What Is Harness Engineering?
  2. Why It Matters Now
  3. The Core Equation: Agent = Model + Harness
  4. Foundations & Key Mental Models
  5. Context Engineering: The Working Memory Budget
  6. Constraints, Guardrails & Safe Autonomy
  7. Specs, Agent Files & Workflow Design
  8. Evals & Observability
  9. Runtimes, Harnesses & Reference Implementations
  10. Benchmarks: Measuring Harness Quality
  11. Practical Playbook: Engineering Your Own Harness
  12. The Future of Harness Engineering
  13. Conclusion
  14. References & Further Reading

1. What Is Harness Engineering?

Harness engineering is the practice of designing, building, and iterating on the environment, tooling, constraints, and feedback loops that surround an AI agent β€” everything that isn't the model itself. The term gained widespread traction in early 2026, popularized by field reports from OpenAI, Anthropic, LangChain, Thoughtworks, and HumanLayer, all converging on the same insight: the reliability of an AI agent depends less on the model and more on the system wrapped around it.

As LangChain's Vivek Trivedy crystallized it:

Agent = Model + Harness. If you're not the model, you're the harness.

A harness includes:

  • System prompts β€” the instructions that shape the agent's persona and constraints
  • Tools, skills, and MCP servers β€” capabilities the agent can invoke
  • Bundled infrastructure β€” filesystem, sandboxes, browsers, observability stacks
  • Orchestration logic β€” sub-agent spawning, handoffs, model routing
  • Hooks and middleware β€” deterministic control flow for compaction, continuation, lint checks, and verification
  • Memory and state management β€” progress files, git history, structured knowledge bases

A raw model is not an agent. It becomes one only when a harness gives it state, tool execution, feedback loops, and enforceable constraints. Harness engineering is the discipline of making all of that work well.


2. Why It Matters Now

The "Skill Issue" Realization

As HumanLayer argued, teams that blame weak agent results on model limitations are usually wrong. After hundreds of agent sessions across dozens of projects, the pattern is consistent:

It's not a model problem. It's a configuration problem.

Every time a team instinctively says "GPT-6 will fix it" or "we just need better instruction-following," the real fix is almost always in the harness β€” better context management, smarter tool selection, proper verification loops, or cleaner handoff artifacts.

The OpenAI Proof Point

OpenAI's flagship field report provided dramatic evidence. A three-person engineering team built and shipped an internal product with zero manually-written code β€” roughly a million lines across application logic, tests, CI, documentation, and tooling β€” all generated by Codex agents. The team averaged 3.5 merged PRs per engineer per day, and Codex runs regularly worked autonomously for six hours or more.

The key insight was that early progress was slower than expected β€” not because Codex was incapable, but because the environment was underspecified. The primary job of human engineers became enabling agents to do useful work: building the harness.

"Because the only way to make progress was to get Codex to do the work, human engineers always stepped into the task and asked: 'what capability is missing, and how do we make it both legible and enforceable for the agent?'"

Harness Changes Move Benchmarks

LangChain demonstrated that harness changes alone can significantly improve benchmark performance β€” moving their coding agent from Top 30 to Top 5 on Terminal-Bench 2.0 by only changing the harness, not the model. Anthropic showed that infrastructure configuration can move coding benchmark scores by more than many leaderboard gaps. The implication is profound: benchmarks often measure harness quality as much as β€” or more than β€” model quality.


3. The Core Equation: Agent = Model + Harness

LangChain's Anatomy of an Agent Harness provides the clearest decomposition. Working backwards from what models cannot do natively reveals why each harness component exists:

| What We Want | What Models Can't Do Natively | Harness Solution |
|---|---|---|
| Persistent memory | Maintain durable state across interactions | Filesystem, git, progress files, AGENTS.md |
| Autonomous problem-solving | Execute arbitrary code | Bash tool, code execution sandboxes |
| Real-time knowledge | Access information beyond training cutoff | Web search, MCP tools, Context7 |
| Safe operation | Understand risk boundaries | Sandboxes, allow-lists, network isolation |
| Long-horizon coherence | Work across multiple context windows | Compaction, Ralph Loops, planning files |
| Self-verification | Know if their work is correct | Test runners, browser automation, linters |
The filesystem emerges as the most foundational harness primitive because it unlocks everything else: agents get a workspace, work can be incrementally persisted, and multiple agents can coordinate through shared files. Git adds versioning so agents can track work, rollback errors, and branch experiments.


4. Foundations & Key Mental Models

4.1 Feedforward and Feedback (Thoughtworks)

Birgitta BΓΆckeler's framework at Thoughtworks provides the most rigorous mental model for harness engineering. She frames it through two control mechanisms:

  • Guides (feedforward controls) β€” anticipate the agent's behavior and steer it before it acts. They increase the probability of good results on the first attempt. Examples: AGENTS.md files, architecture documentation, skills, coding conventions, reference applications.

  • Sensors (feedback controls) β€” observe after the agent acts and help it self-correct. Most powerful when they produce signals optimized for LLM consumption. Examples: linters with custom error messages, test suites, code review agents, browser screenshots.

Without guides, the agent keeps repeating mistakes. Without sensors, the agent encodes rules but never finds out whether they worked. A good harness requires both.

4.2 Computational vs. Inferential

Each control can be either:

  • Computational β€” deterministic and fast, run by the CPU. Tests, linters, type checkers, structural analysis. Milliseconds to seconds; results are reliable.
  • Inferential β€” semantic analysis, AI code review, "LLM as judge." Slower, more expensive, non-deterministic β€” but capable of richer judgment.

| Control | Direction | Type | Example |
|---|---|---|---|
| Coding conventions | Feedforward | Inferential | AGENTS.md, Skills |
| Structural tests | Feedback | Computational | ArchUnit tests checking module boundaries |
| Code review agent | Feedback | Inferential | A review skill using a strong model |
| Bootstrap scripts | Feedforward | Both | Skill with instructions and a bootstrap script |
| Code mods | Feedforward | Computational | OpenRewrite recipes |

4.3 The Cybernetic Governor

The harness acts as a cybernetic governor β€” combining feedforward and feedback to regulate the codebase toward its desired state. BΓΆckeler identifies three regulation dimensions:

  1. Maintainability harness β€” internal code quality (linters, complexity checks, coverage). The most mature category with extensive pre-existing tooling.
  2. Architecture fitness harness β€” system characteristics (performance, observability, security). Essentially architectural fitness functions.
  3. Behaviour harness β€” functional correctness. The hardest category: how do we verify that the application does what we need? This remains the elephant in the room.

4.4 The Three Pillars (Thoughtworks)

Thoughtworks frames harness work into three pillars:

  1. Context engineering β€” managing what the agent knows and when
  2. Architectural constraints β€” enforcing invariants mechanically
  3. Garbage collection β€” fighting entropy and drift continuously

4.5 Control–Agency–Runtime (CAR) Decomposition

An academic position paper proposes treating the harness layer as a first-class research object with three dimensions:

  • Control β€” constraints, guardrails, permissions
  • Agency β€” planning, decision-making, self-evaluation
  • Runtime β€” execution environment, tools, infrastructure

5. Context Engineering: The Working Memory Budget

Context engineering is the practice of managing the agent's context window as a working memory budget rather than a dumping ground. It is arguably the most critical aspect of harness engineering.

5.1 The One Big File Anti-Pattern

OpenAI learned the hard way that a monolithic AGENTS.md doesn't scale:

  • Context is a scarce resource. A giant instruction file crowds out the task, the code, and the relevant docs.
  • Too much guidance becomes non-guidance. When everything is "important," nothing is.
  • It rots instantly. A monolithic manual turns into a graveyard of stale rules.
  • It's hard to verify. A single blob doesn't lend itself to mechanical checks.

Their solution: treat AGENTS.md as a table of contents (~100 lines) that points to deeper sources of truth in a structured docs/ directory. This enables progressive disclosure β€” agents start with a small, stable entry point and are taught where to look next.

5.2 Progressive Disclosure

Progressive disclosure is the principle that agents should only receive specific instructions, knowledge, or tools when they actually need them. Loading everything upfront pushes the agent into what HumanLayer calls "the dumb zone" β€” where context window fill degrades performance even on simple tasks.

Chroma's research on context rot provides empirical backing: models perform measurably worse at longer context lengths, and degradation is steeper when there's low semantic similarity between the query and the relevant information in context.

Skills solve this: they're activated on demand, bringing in focused knowledge only when needed.

5.3 Context-Efficient Backpressure

HumanLayer's backpressure philosophy is essential: verification mechanisms must be context-efficient. Running a full test suite after every change floods the context window with thousands of lines of passing tests. The agent loses track of its actual task.

The rule: success is silent, only failures produce output. Swallow the output of passing checks and only surface errors.
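The idea can be sketched as a small wrapper around any verification command: run it, swallow all output on success, and surface only failures. The `run_check` helper below is illustrative, not part of any particular harness.

```python
import subprocess

def run_check(cmd: list[str]) -> str:
    """Run a verification command; stay silent on success, surface failures.

    Only failing output reaches the agent's context window, keeping
    verification context-efficient (backpressure).
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode == 0:
        return ""  # success is silent: no context spent on passing checks
    # Surface only the failure details, truncated to protect the budget
    output = (result.stdout + result.stderr).strip()
    return output[-4000:]  # keep the tail, where errors usually appear

# Only a failing check produces anything for the agent to read:
if msg := run_check(["python3", "-c", "import sys; sys.exit('lint: missing type hints')"]):
    print(msg)  # prints "lint: missing type hints"
```

The truncation limit is a tunable context budget: even a failing check should not be allowed to flood the window with thousands of lines.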

5.4 Sub-Agents as Context Firewalls

Sub-agents provide context isolation β€” each gets a fresh, small, high-relevance context window for its task, and only the condensed result flows back to the parent. This is far more powerful than simply making context windows bigger:

"A bigger context window doesn't make the model better at finding the needle β€” it just makes the haystack bigger."

Effective sub-agent tasks: codebase exploration, grep/search operations, tracing information flow, research tasks β€” anything with a straightforward question and simple answer that requires many intermediate tool calls.

5.5 Lessons from Manus

Manus' playbook contributed specific techniques: KV-cache locality optimization, tool masking, filesystem memory, and keeping useful failures in-context while discarding noise.

5.6 OpenHands Context Condensation

OpenHands' approach to bounded conversation memory preserves goals, progress, critical files, and failing tests while condensing everything else β€” keeping long-running coding sessions efficient without losing essential state.


6. Constraints, Guardrails & Safe Autonomy

6.1 Enforcing Invariants, Not Micromanaging

OpenAI's approach is instructive: enforce boundaries centrally, allow autonomy locally. They require Codex to parse data shapes at the boundary (parse, don't validate), but don't prescribe how. Each business domain follows a fixed layered architecture (Types β†’ Config β†’ Repo β†’ Service β†’ Runtime β†’ UI) with strictly validated dependency directions enforced by custom linters and structural tests β€” all Codex-generated.

"This is the kind of architecture you usually postpone until you have hundreds of engineers. With coding agents, it's an early prerequisite: the constraints are what allows speed without decay."

Custom linter error messages are written to inject remediation instructions into agent context β€” a positive form of prompt injection that guides self-correction.
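A sketch of what such a linter rule might look like: a deterministic check whose error message tells the agent exactly how to remediate. The rule itself (no `print` in library code) and the message wording are invented for illustration.

```python
import re
from pathlib import Path

# Hypothetical rule: library modules must use structured logging, not print().
REMEDIATION = (
    "print call found in {path}:{line}. "
    "Replace it with logger.info(...) from the project's logging module, "
    "then re-run this linter to confirm the fix."
)

def lint_no_print(root: str) -> list[str]:
    """Return one remediation-style error per violation; empty list if clean."""
    errors = []
    for path in Path(root).rglob("*.py"):
        for lineno, text in enumerate(path.read_text().splitlines(), start=1):
            if re.search(r"\bprint\(", text):
                errors.append(REMEDIATION.format(path=path, line=lineno))
    return errors
```

Because each message names the preferred API and the next action, a failing run injects remediation instructions directly into the agent's context rather than a bare rule violation.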

6.2 Sandboxing and Controlled Execution

Anthropic's work on sandboxing focuses on reducing approval friction without losing control. Rather than prompting humans for every action, better sandboxing and policy design allow agents to work more autonomously while staying within safe boundaries.

MCP-based code execution gives agents controlled execution power through explicit, inspectable tool boundaries β€” making it clear what the agent can and cannot do.

6.3 Tool Design for Safety

Anthropic's guidance on writing tools for agents emphasizes that tool interfaces should be easy for models to call correctly and safely. Poorly designed tools lead to misuse; well-designed tools guide the agent toward correct behavior.

6.4 Prompt Injection Defense

OpenHands' practical guide covers confirmation mode, analyzers, sandboxing, and hard policies for reducing prompt-injection risk. This is especially important given that MCP server tool descriptions are injected into system prompts β€” never connect to one you don't trust.

6.5 Quality Checks in the Loop

Rather than relying on after-the-fact manual review, Thoughtworks advocates moving quality checks into the agent's own loop. Anchoring agents to reference applications constrains output with concrete exemplars. The question for humans becomes not "how do I review every line?" but "where should I strengthen the harness?"


7. Specs, Agent Files & Workflow Design

7.1 AGENTS.md and Agent Files

AGENTS.md is a lightweight open format for repo-local instructions that tell agents how to work inside a codebase. A related effort, agent.md, pursues machine-readable agent instructions across projects and tools.

The key insights from practitioners:

  • Keep it concise β€” HumanLayer's own CLAUDE.md is under 60 lines, a good target.
  • Avoid auto-generating it β€” LLM-generated agent files actually hurt performance while costing 20%+ more tokens.
  • Don't include directory listings β€” agents discover repository structure on their own just fine.
  • Use progressive disclosure β€” point to deeper resources rather than inlining everything.
  • Make instructions universally applicable β€” avoid conditional rules that confuse the model.

7.2 Spec-Driven Development

GitHub's Spec Kit enables spec-driven development where agents execute against explicit product and engineering specifications. Thoughtworks' analysis explains why strong specs make AI-assisted delivery more dependable: they give agents unambiguous goals to work toward.

7.3 The 12-Factor Agent

HumanLayer's 12-Factor Agents establishes operating principles for production agents:

  • Explicit prompts over implicit behavior
  • State ownership β€” agents manage their own state
  • Clean pause-resume behavior
  • Context discipline
  • Reproducible workflows

The companion 12-Factor AgentOps extends these principles to operations.

7.4 Feature Lists for Long-Running Work

Anthropic's approach to long-running agents uses an initializer agent that generates a comprehensive feature list (200+ features for a web app), all initially marked as "failing." Subsequent coding agents work through features one at a time, marking them as passing only after verification. This prevents two critical failure modes:

  1. The one-shot attempt β€” the agent tries to build everything at once, runs out of context mid-implementation, and leaves a broken state.
  2. Premature victory β€” the agent sees existing progress and declares the job done.

The feature list uses JSON rather than Markdown because models are less likely to inappropriately modify structured JSON files.


8. Evals & Observability

8.1 Why Evals Are Hard for Agents

Anthropic's guidance on demystifying evals highlights the fundamental challenge: agents have many possible trajectories to success or failure. A single task can be completed through vastly different tool-call sequences, making traditional input-output evaluation insufficient.

8.2 Eval Taxonomies

OpenAI's eval guide introduces turning agent traces into repeatable evals with JSONL logs and deterministic checks. LangChain's breakdown distinguishes:

  • Single-step evals β€” does one tool call produce the right result?
  • Full-run evals β€” does the complete task get solved?
  • Multi-turn evals β€” does the agent handle conversations and evolving goals?
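The JSONL-plus-deterministic-checks pattern can be sketched as: log each run as one JSON record per step, then replay the log through a mechanical grader. The record fields and the check below are illustrative assumptions.

```python
import json

def grade_trace(jsonl_lines: list[str]) -> dict:
    """Deterministic single-step check over an agent trace log.

    Each JSONL record is assumed to look like:
    {"step": 1, "tool": "bash", "exit_code": 0}
    """
    records = [json.loads(line) for line in jsonl_lines]
    tool_calls = [r for r in records if "tool" in r]
    failures = [r for r in tool_calls if r.get("exit_code", 0) != 0]
    return {
        "steps": len(records),
        "tool_calls": len(tool_calls),
        "failed_calls": len(failures),
        "passed": not failures,
    }

trace = [
    '{"step": 1, "tool": "bash", "exit_code": 0}',
    '{"step": 2, "tool": "edit", "exit_code": 0}',
]
print(grade_trace(trace))  # all tool calls succeeded, so "passed" is True
```

Because the grader is pure and deterministic, the same trace always produces the same grade, which is what makes a logged trace a repeatable eval.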

8.3 Trace Grading

OpenAI's trace grading enables grading agent traces directly β€” especially helpful for long multi-step tasks where the final output alone doesn't reveal whether the agent's process was sound.

8.4 Verification Stacks

OpenHands' layered verification uses trajectory critics trained on production traces for reranking, early stopping, and review-time quality control. This goes beyond simple pass/fail testing to assess the quality of the agent's reasoning process.

8.5 Skill-Level Evals

OpenHands' playbook emphasizes measuring whether a specific skill actually helps using:

  • Bounded tasks
  • Deterministic verifiers
  • No-skill baselines
  • Trace review

8.6 Infrastructure Noise

Anthropic's research on infrastructure noise shows that runtime configuration can move coding benchmark scores by more than many leaderboard gaps β€” meaning that benchmark results may reflect infrastructure choices more than model intelligence.


9. Runtimes, Harnesses & Reference Implementations

9.1 The Framework/Runtime/Harness Distinction

LangChain's decomposition clarifies what belongs where:

  • Framework β€” reusable abstractions for building agents (LangGraph, CrewAI)
  • Runtime β€” execution infrastructure (sandboxes, state management, scheduling)
  • Harness β€” the task-specific configuration and environment around a particular agent deployment

9.2 Notable Implementations

| Project | Description |
|---|---|
| SWE-agent | Mature research coding agent with inspectable harness, prompt, tools, and environment design |
| Claude Agent SDK | Production-oriented SDK with sessions, tools, and orchestration |
| deepagents | LangChain's open-source project for building deeper, longer-running agents with middleware and harness patterns |
| Citadel | Harness for Claude Code and Codex with isolated worktrees, multi-agent coordination, and persisted memory |
| AgentKit | TypeScript toolkit for building durable, workflow-aware agents on event-driven infrastructure |
| Harbor | Generalized harness for evaluating and improving agents at scale |
| Harness Evolver | Claude Code plugin that autonomously evolves agent harnesses using multi-agent proposers and evaluation |
| SWE-ReX | Sandboxed code execution infrastructure for AI agents |
| Uni-CLI | Universal CLI connecting agents to 134 sites via 711 declarative YAML pipelines with a self-repair loop |

9.3 Skills Ecosystem

skills.sh is a community marketplace for discovering, sharing, and installing reusable AI agent skills across runtimes like Claude Code β€” making harness capabilities portable and composable. However, caution is warranted: skill registries have already been caught distributing malicious skills, so treat them like npm install random-package and read what you're installing.


10. Benchmarks: Measuring Harness Quality

Benchmarks are especially useful when you want to compare harness quality, not just model quality. They stress context handling, tool calling, environment control, verification logic, and the runtime scaffolding around the model.

10.1 Software Engineering Benchmarks

| Benchmark | Focus |
|---|---|
| SWE-bench Verified | Real GitHub issues and tests; makes harness choices around retrieval, patching, and validation highly visible |
| Terminal-Bench | Terminal-native agents in shells and filesystems; especially useful for comparing coding-agent harnesses |
| EvoClaw | Dependent milestone sequences from real repo history; surfaces regression accumulation |

10.2 Web & Browser Benchmarks

| Benchmark | Focus |
|---|---|
| WebArena | Self-hostable web environment for evaluating autonomous agents |
| VisualWebArena | Multimodal web agents with image and screenshot inputs |
| BrowserGym | Reproducible framework comparing harnesses across multiple web benchmarks |
| WorkArena | Enterprise-style web workflows |

10.3 General & Multi-Domain Benchmarks

| Benchmark | Focus |
|---|---|
| AgentBench | Cross-environment: OS, databases, knowledge graphs, web browsing |
| GAIA | General AI assistant tasks comparing harness-level choices |
| OSWorld | Real computer use across Ubuntu, Windows, and macOS |
| AppWorld | Interactive coding agents with state-based and execution-based unit tests |
| ClawBench | Search, reasoning, coding, safety, and multi-turn conversation |

10.4 MCP-Specific Benchmarks

| Benchmark | Focus |
|---|---|
| MCP Bench | Tool accuracy, latency, and token use across MCP server types |
| MCPMark | Stress-testing on real-world MCP tasks (Notion, GitHub, Postgres) |
| OSWorld-MCP | Real-world computer tasks using the Model Context Protocol |

10.5 Leaderboards

| Leaderboard | Focus |
|---|---|
| Agent Arena | ELO-style ratings from head-to-head agent battles |
| HAL | Holistic agent evaluation with reliability, cost, and broad task coverage |
| Galileo Agent Leaderboard | LLM agents on task completion and tool calling across business domains |

11. Practical Playbook: Engineering Your Own Harness

Drawing from all the sources in the Awesome Harness Engineering collection, here is a practical playbook for building an effective harness.

Step 1: Start with Agent Files (But Keep Them Lean)

Create a concise AGENTS.md or CLAUDE.md at the root of your repository:

  • Keep it under 60 lines
  • Treat it as a table of contents, not an encyclopedia
  • Include: build commands, test commands, key conventions, project structure pointers
  • Exclude: directory listings, conditional rules, auto-generated content
  • Point to deeper docs/ for architectural decisions, design principles, and domain knowledge
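As a concrete illustration, a table-of-contents style agent file might look like the sketch below; every path, command, and convention in it is a placeholder for your project's own, not a prescribed template.

```markdown
# AGENTS.md

## Commands
- Build: `make build`
- Test (targeted): `make test FILE=<path>`
- Lint: `make lint`

## Conventions
- Parse, don't validate: decode external data at the boundary.
- Structured logging only; no print statements in library code.

## Where to look next
- Architecture decisions: docs/adr/
- Domain glossary: docs/domain.md
- Layering rules: docs/architecture.md
```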

Step 2: Set Up Computational Feedback Loops

These are your highest-leverage, lowest-cost sensors:

  • Type checking β€” catches structural errors deterministically
  • Linting with custom error messages that include remediation instructions
  • Fast test suites β€” run a targeted subset, not the full suite
  • Structural/architectural tests β€” enforce module boundaries and dependency directions

Critical rule: success is silent. Swallow output from passing checks; only surface errors to the agent.

Step 3: Add Verification Tools

Give the agent ways to verify its own work as a human would:

  • Browser automation (Puppeteer, Playwright) for end-to-end UI verification
  • Screenshot capture so the agent can visually inspect results
  • Observability stack β€” expose logs via LogQL, metrics via PromQL, traces via TraceQL
  • Dev server per worktree β€” isolate each agent's environment

Step 4: Implement Hooks for Control Flow

Use harness hooks to create deterministic checkpoints:

  • Pre-stop hooks: run formatter + type checker before the agent finishes; re-engage on failure
  • Notification hooks: alert humans when the agent needs attention
  • Approval hooks: auto-approve safe operations, deny dangerous ones (e.g., migrations)
  • Integration hooks: create PRs, post to Slack, set up preview environments
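A minimal approval-hook sketch: a function the harness calls before executing a command, auto-approving a safe allow-list and denying known-dangerous patterns. The patterns and the allow/deny/ask return protocol are assumptions, not any specific runtime's hook API.

```python
import re

# Hypothetical policy lists; tune these to your project's risk profile.
SAFE = [r"^git (status|diff|log)\b", r"^(ls|cat|grep|rg)\b", r"^make (test|lint)\b"]
DANGEROUS = [r"\brm -rf\b", r"\bdb:migrate\b", r"\bgit push --force\b"]

def approve(command: str) -> str:
    """Return 'allow', 'deny', or 'ask' (escalate to a human)."""
    if any(re.search(p, command) for p in DANGEROUS):
        return "deny"   # never run migrations or destructive commands unattended
    if any(re.match(p, command) for p in SAFE):
        return "allow"  # read-only and verification commands are auto-approved
    return "ask"        # anything unrecognized notifies a human
```

Denials are checked first so that a dangerous command embedded in an otherwise safe-looking invocation never slips through on a prefix match.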

Step 5: Use Sub-Agents for Context Control

Don't use sub-agents as "frontend engineer" or "backend engineer" personas β€” that doesn't work. Use them as context firewalls:

  • Delegate research, grep, and exploration to sub-agents
  • Use cheaper models (Sonnet, Haiku) for sub-agents; expensive models (Opus) for the parent
  • Return condensed results with source citations (filepath:line format)
  • Keep the parent thread in the "smart zone" with minimal context pollution
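The delegation pattern above can be sketched as follows, assuming a hypothetical `call_model` callable that runs an isolated sub-agent session and returns only its final message:

```python
def research_subtask(question: str, call_model) -> str:
    """Run exploration in a fresh context; only the condensed answer returns.

    `call_model(prompt)` is a placeholder for your runtime's API for
    spawning a sub-agent (typically a cheaper model) with an empty context.
    All intermediate tool calls stay inside the sub-agent's own window.
    """
    prompt = (
        f"Answer this question about the codebase: {question}\n"
        "Reply with at most 5 bullet points, each citing its source "
        "as filepath:line. Do not include raw file contents."
    )
    return call_model(prompt)

# The parent agent's window only ever sees the condensed bullets:
fake_runtime = lambda prompt: "- retries configured in src/http.py:42"
print(research_subtask("Where are HTTP retries configured?", fake_runtime))
```

The prompt's output contract (few bullets, `filepath:line` citations, no raw file dumps) is what makes the sub-agent a context firewall rather than just a second agent.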

Step 6: Enforce Architecture Mechanically

  • Define layered architecture with fixed dependency directions
  • Write custom linters (the agent can write them!)
  • Add structural tests that check invariants on every commit
  • Encode "taste" as rules: structured logging, naming conventions, file size limits
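A minimal structural test in this spirit: scan sources and fail if a lower layer imports from a higher one. The layer names and the top-level-import convention are invented for illustration.

```python
import re
from pathlib import Path

# Hypothetical layering, lowest to highest: types -> repo -> service -> ui.
LAYERS = ["types", "repo", "service", "ui"]

def check_dependency_directions(src_root: str) -> list[str]:
    """Return violations where a module imports from a higher layer."""
    rank = {name: i for i, name in enumerate(LAYERS)}
    violations = []
    for layer in LAYERS:
        for path in Path(src_root, layer).rglob("*.py"):
            for line in path.read_text().splitlines():
                match = re.match(r"\s*from (\w+)", line)
                if match and rank.get(match.group(1), -1) > rank[layer]:
                    violations.append(f"{path}: {layer} may not import {match.group(1)}")
    return violations
```

Run on every commit, a check like this turns the architecture diagram into an enforced invariant instead of a convention the agent may drift from.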

Step 7: Manage Long-Running Work

For tasks spanning multiple context windows:

  • Use an initializer agent to set up the environment: init.sh, progress file, feature list, initial git commit
  • Each subsequent coding agent reads progress, works on one feature, commits, and writes a summary
  • Always start a session by reading progress files and running a basic health check
  • Always end a session with a clean, mergeable state
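The session protocol above can be sketched as a pair of helpers, assuming a plain-text progress file and a `health_check` callable supplied by the harness (both are illustrative conventions, not a standard):

```python
from pathlib import Path

PROGRESS = Path("PROGRESS.md")

def start_session(health_check) -> str:
    """Begin by reading prior progress and verifying the workspace is sane."""
    notes = PROGRESS.read_text() if PROGRESS.exists() else "No prior progress."
    if not health_check():
        raise RuntimeError("Health check failed; fix the workspace before coding.")
    return notes  # goes into the agent's opening context

def end_session(summary: str) -> None:
    """Leave a clean, mergeable state: append what was done and what's next."""
    with PROGRESS.open("a") as f:
        f.write(summary.rstrip() + "\n")
```

Each agent in the chain starts from the durable file rather than from a predecessor's context window, which is what lets the work span arbitrarily many sessions.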

Step 8: Fight Entropy Continuously

OpenAI's "garbage collection" pattern:

  • Encode golden principles directly in the repository
  • Run recurring background agents that scan for deviations
  • Open targeted, small refactoring PRs that can be reviewed in under a minute
  • Track quality grades per domain and per architectural layer
  • Treat technical debt like a high-interest loan: pay it down daily

Step 9: Make Knowledge Repository-Local

Everything the agent needs must be in the repo:

  • Slack discussions about architecture? Encode them as markdown
  • Design decisions? Write ADRs
  • Onboarding knowledge? Put it in structured docs
  • Knowledge in people's heads? Doesn't exist for the agent

"From the agent's point of view, anything it can't access in-context while running effectively doesn't exist."

Step 10: Iterate Based on Failures

The most important meta-principle:

"Anytime you find an agent makes a mistake, you take the time to engineer a solution such that the agent never makes that mistake again." β€” Mitchell Hashimoto

Don't design the ideal harness upfront. Bias toward shipping. Add configuration only when the agent actually fails. Throw away things that don't help. Distribute battle-tested configurations via repository-level config.


12. The Future of Harness Engineering

Model-Harness Co-Evolution

Today's frontier coding agents (Claude Code, Codex) are post-trained with models and harnesses in the loop. This creates a feedback cycle where useful primitives are discovered, added to the harness, and then used when training the next generation of models. But this co-evolution has interesting side effects: models can become over-fitted to their training harness, performing worse in alternative harness configurations.

Harnessability as a Design Criterion

Not every codebase is equally amenable to harnessing. Thoughtworks introduces "harnessability" β€” the structural properties that make a codebase governable:

  • Strongly typed languages naturally have type-checking as a sensor
  • Clearly definable module boundaries afford architectural constraint rules
  • Mature frameworks abstract away details agents don't need to worry about

Greenfield teams can bake harnessability in from day one. Legacy teams face the harder problem: the harness is most needed where it is hardest to build.

Harness Templates

Enterprises have a few common service topologies (CRUD APIs, event processors, data dashboards). These may evolve into harness templates β€” bundles of guides and sensors pre-configured for a topology. Teams may start picking tech stacks partly based on what harnesses are already available.

Autonomous Harness Evolution

Projects like Harness Evolver point toward a future where agents autonomously improve their own harnesses using multi-agent proposers, evaluation-backed selection, and git worktree isolation.

Open Problems

  • How do we keep a harness coherent as it grows, with guides and sensors in sync?
  • How far can we trust agents to make trade-offs when instructions conflict?
  • If sensors never fire, is that high quality or inadequate detection?
  • How do we evaluate harness coverage similar to code coverage?
  • How does architectural coherence evolve over years in a fully agent-generated system?
  • Can we generalize these findings beyond coding to scientific research, financial modeling, and other domains?

13. Conclusion

Harness engineering represents a fundamental shift in how we think about AI-assisted software development. The discipline acknowledges a counterintuitive truth: the model is usually fine; the problem is the system around it.

The implications reshape what it means to be a software engineer:

  • Engineers become environment designers β€” specifying intent, building feedback loops, and shaping constraints rather than writing code directly.
  • Architecture becomes an early prerequisite β€” not a luxury for large teams, but a necessity for agent reliability.
  • Repository knowledge becomes the system of record β€” everything the agent needs must be versioned, discoverable, and mechanically verifiable.
  • Quality is enforced mechanically β€” once encoded, standards apply everywhere at once, at every hour.
  • Entropy is managed continuously β€” technical debt is treated as a high-interest loan, paid down daily by background agents.

As Thoughtworks' BΓΆckeler frames it: building this outer harness is emerging as an ongoing engineering practice, not a one-time configuration. Harnesses externalize what human developer experience brings to the table β€” conventions, quality intuitions, architectural judgment, organizational alignment β€” making it explicit, verifiable, and continuously enforceable.

The field is young, evolving rapidly, and full of open questions. But one thing is clear: the teams that invest in harness engineering β€” shaping the environment around their agents β€” will get dramatically better results than those waiting for the next model release to solve their problems.

"Building software still demands discipline, but the discipline shows up more in the scaffolding rather than the code." β€” OpenAI


14. References & Further Reading

Foundational Articles

  1. Harness engineering: leveraging Codex in an agent-first world β€” OpenAI
  2. Effective harnesses for long-running agents β€” Anthropic
  3. Harness design for long-running application development β€” Anthropic
  4. The Anatomy of an Agent Harness β€” LangChain
  5. Harness Engineering β€” Thoughtworks
  6. Building effective agents β€” Anthropic
  7. Skill Issue: Harness Engineering for Coding Agents β€” HumanLayer
  8. Your Agent Needs a Harness, Not a Framework β€” Inngest
  9. Harness Engineering for Language Agents (CAR Decomposition) β€” Academic Paper

Context & Memory

  1. Effective context engineering for AI agents β€” Anthropic
  2. Context Engineering for AI Agents: Lessons from Building Manus β€” Manus
  3. Context Engineering for Coding Agents β€” Thoughtworks
  4. Advanced Context Engineering for Coding Agents β€” HumanLayer
  5. Context-Efficient Backpressure for Coding Agents β€” HumanLayer
  6. OpenHands Context Condensation β€” OpenHands
  7. Writing a good CLAUDE.md β€” HumanLayer

Safety & Constraints

  1. Beyond permission prompts: making Claude Code more secure β€” Anthropic
  2. Code execution with MCP β€” Anthropic
  3. Writing effective tools for agents β€” Anthropic
  4. Mitigating Prompt Injection Attacks β€” OpenHands
  5. Claude Code: Best practices for agentic coding β€” Anthropic

Workflow Design

  1. AGENTS.md β€” Open Standard
  2. GitHub Spec Kit β€” GitHub
  3. 12 Factor Agents β€” HumanLayer
  4. 12-Factor AgentOps β€” AgentOps

Evals & Observability

  1. Testing Agent Skills Systematically with Evals β€” OpenAI
  2. Demystifying Evals for AI Agents β€” Anthropic
  3. Evaluating Deep Agents β€” LangChain
  4. Improving Deep Agents with harness engineering β€” LangChain

Courses

  1. walkinglabs/learn-harness-engineering β€” A project-based course on making Codex and Claude Code more reliable

Curated Collection

  1. walkinglabs/awesome-harness-engineering β€” The comprehensive, community-maintained list that informed this article

This article synthesizes insights from the Awesome Harness Engineering collection β€” a curated list maintained by Walking Labs. All referenced works are credited to their original authors and organizations.


If you found this helpful, let me know by leaving a πŸ‘ or a comment, and if you think this post could help someone, feel free to share it! Thank you very much! πŸ˜ƒ
