Most people are still building AI agents like demos.
They connect an LLM to a few tools, add a system prompt, wrap everything in a chat UI, and call it an agent.
That is not an agent system.
That is a model with tool access.
A real AI agent is not just a prompt, a model, or a framework. A real AI agent is an engineered runtime.
It needs a harness.

The agentic harness is the system around the model that makes agent behavior useful, repeatable, observable, and safe.
It decides:
- How the model receives context
- How it uses tools
- How progress is persisted
- How failures are handled
- How work is evaluated
- How the system improves over time
The mindset shift is simple:
The model is not the product.
The harness around the model is the product.
A stronger model can improve reasoning.
But the harness determines whether that reasoning turns into reliable action.
What Is an Agentic Harness?
An agentic harness is the runtime layer that enables a model to behave like an agent.
It receives a task, loads the right instructions and context, exposes the right tools, manages the execution loop, captures state, verifies progress, handles errors, records traces, and returns the final result.
A simple version looks like this:
1. Receive task
2. Load identity
3. Load task instructions
4. Load relevant context
5. Retrieve memory
6. Select tools
7. Plan next action
8. Execute tool call
9. Observe result
10. Update state
11. Verify outcome
12. Write durable progress
13. Return response
14. Record trace
The important part is not that every agent uses this exact loop.
The important part is that the loop exists outside the model.
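To make "the loop exists outside the model" concrete, here is a minimal sketch of a harness loop. It is not a reference implementation: `Action`, `RunState`, `run_harness`, and the stub model are hypothetical names invented for illustration, and a real model client would replace `stub_model`.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Action:
    """Hypothetical model output: either a tool request or a final answer."""
    kind: str                      # "tool" or "final"
    name: str = ""                 # tool name when kind == "tool"
    args: dict = field(default_factory=dict)
    text: str = ""                 # answer text when kind == "final"


@dataclass
class RunState:
    task: str
    observations: List[object] = field(default_factory=list)
    trace: List[Tuple[int, str, str]] = field(default_factory=list)


def run_harness(task, model, tools, max_steps=5):
    """Drive the plan/execute/observe loop from outside the model."""
    state = RunState(task=task)
    for step in range(max_steps):
        action = model(state)                                  # plan next action
        state.trace.append((step, action.kind, action.name))   # record trace
        if action.kind == "final":
            return action.text, state                          # return response
        tool = tools.get(action.name)
        if tool is None:
            state.observations.append(f"error: unknown tool {action.name}")
            continue
        result = tool(**action.args)                           # execute tool call
        state.observations.append(result)                      # observe, update state
    return "stopped: step budget exhausted", state


def stub_model(state):
    """Stand-in for a real model call (assumption for this sketch)."""
    if not state.observations:
        return Action(kind="tool", name="add", args={"a": 2, "b": 3})
    return Action(kind="final", text=f"result is {state.observations[-1]}")


answer, final_state = run_harness("add 2 and 3", stub_model, {"add": lambda a, b: a + b})
print(answer)
```

The step budget, the trace, and the unknown-tool handling all live in the harness. The model only chooses the next action.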
A weak agent relies on the model to figure everything out inside one giant context window.
A strong agent externalizes responsibilities into the harness:
1. Identity lives in a stable instruction layer
2. Memory lives outside the prompt
3. Skills live as reusable procedures
4. Tools expose controlled actions
5. Policies constrain execution
6. Progress files preserve continuity
7. Traces capture behavior
8. Evals measure outcomes and trajectories
9. Governance defines ownership
The model should reason.
The harness should govern.
Practical rule: Do not put everything inside the prompt. Build the system around the prompt.
Start Simple, Then Add Agency Only Where It Pays Off
One of the biggest mistakes in agent development is adding autonomy too early.
Not every AI system needs to be an agent.
Some tasks are better served by:
1. A single model call
2. Retrieval-augmented generation
3. A deterministic workflow
4. A simple rules engine
5. A human review flow
Some tasks genuinely need an agent that can decide what to do next, use tools, and adapt across multiple turns.
A useful distinction:
Workflow: the system follows predefined code paths.
Agent: the model dynamically decides its process and tool usage.
A good harness lets you mix both.
User request
↓
Intent router
↓
Simple task? → deterministic workflow
Complex task? → agent loop
High-risk task? → human review gate
This gives you a practical architecture:
Keep deterministic paths deterministic.
Reserve agentic behavior for places where model-driven decision-making actually creates value.
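The routing diagram above can be sketched in a few lines. The intent names and the two sets below are hypothetical placeholders, and the sketch assumes some upstream classifier has already produced an intent label.

```python
from enum import Enum


class Route(Enum):
    WORKFLOW = "deterministic workflow"
    AGENT = "agent loop"
    HUMAN_REVIEW = "human review gate"


# Hypothetical intent labels; a real system would define its own.
HIGH_RISK_INTENTS = {"delete_account", "issue_refund"}
SIMPLE_INTENTS = {"reset_password", "order_status"}


def route(intent: str) -> Route:
    """Pick an execution path. Risk is checked first so it always wins."""
    if intent in HIGH_RISK_INTENTS:
        return Route.HUMAN_REVIEW
    if intent in SIMPLE_INTENTS:
        return Route.WORKFLOW
    return Route.AGENT


print(route("order_status").value)
```

Checking risk before simplicity is a deliberate choice in this sketch: a request that is both simple and high-risk still goes to a human.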
Practical rule: Use the simplest system that can solve the task. Add agency only when flexibility is worth the cost.
Define the Agent’s Operating Identity
Before memory, tools, skills, and evals, the agent needs identity.
Identity is not personality decoration.
It is behavioral control.
A weak identity says:
You are a helpful AI assistant.
That does almost nothing.
A stronger identity says:
You are a pragmatic staff engineer operating in production systems. You optimize for correctness, reliability, maintainability, and small safe diffs. You read before editing. You verify before claiming completion. You preserve existing architecture unless the architecture itself is the failure. You surface uncertainty instead of hiding it.
This gives the model an operating posture.
In a real harness, this identity can live in a stable file such as:
1. SOUL.md
2. AGENTS.md
3. system profile
4. team-owned instruction file
It should define:
1. Who the agent is
2. What it optimizes for
3. How it communicates
4. What it refuses to do
5. How it uses tools
6. What it remembers
7. What it ignores
8. When it asks for help
9. When it stops
Example:
## Core Truths
- Read before writing.
Existing systems contain context that the prompt does not.
- Small diffs beat broad rewrites.
Local fixes are safer unless the abstraction itself is broken.
- Verification is part of the task.
Never claim success without evidence.
- Production systems punish cleverness.
Prefer explicit, observable, boring solutions.
- Uncertainty must be surfaced.
A confident guess is worse than a clearly labeled assumption.
A next-level agent needs judgment, not just capability.
Identity is where that judgment starts.
Practical rule: Give the agent a stable operating profile before giving it powerful tools.
Make the Execution Contract Explicit
Every agent should have an execution contract.
The execution contract tells the agent how work moves from task to completion.
For a coding agent, the contract might be:
1. Understand the request.
2. Inspect relevant files.
3. Identify the smallest safe change.
4. Apply the change.
5. Run targeted tests.
6. Run broader tests if risk is high.
7. Summarize the diff.
8. Document verification.
9. List residual risk.
Without this contract, the agent improvises.
Improvisation is fine for chat.
It is dangerous for production systems.
A better coding-agent instruction looks like this:
You are debugging a production Python service.
Mission:
Find the smallest safe fix.
Workflow:
1. Read the exact error.
2. Inspect the file where the error originates.
3. Inspect the caller.
4. Search for similar patterns in the repository.
5. Identify the smallest local fix.
6. Apply the patch.
7. Run the narrowest relevant test.
8. If the touched surface is broad, run the related suite.
9. Report changed files, verification, and remaining risks.
Rules:
- Do not edit before reading.
- Do not introduce dependencies unless existing tools are insufficient.
- Do not rewrite modules for local bugs.
- Do not claim tests passed unless the command actually ran.
- Do not suppress uncertainty.
This is what separates an agent from a chatbot.
The chatbot answers.
The agent follows an execution contract.
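Prompt rules such as "do not edit before reading" can also be enforced by the harness. The sketch below is one possible shape, assuming the harness sees every file read, edit, and test command. `ExecutionContract` and `ContractViolation` are hypothetical names, not part of any framework.

```python
class ContractViolation(Exception):
    """Raised when the agent tries to skip a step of the contract."""


class ExecutionContract:
    def __init__(self):
        self.read_files = set()
        self.test_runs = []          # list of (command, exit_code)

    def record_read(self, path):
        self.read_files.add(path)

    def check_edit(self, path):
        # Rule: do not edit before reading.
        if path not in self.read_files:
            raise ContractViolation(f"edit before read: {path}")

    def record_test_run(self, command, exit_code):
        self.test_runs.append((command, exit_code))

    def check_claim_tests_passed(self):
        # Rule: do not claim tests passed unless a command actually ran
        # and exited with status 0.
        if not any(code == 0 for _, code in self.test_runs):
            raise ContractViolation("no passing test run on record")


contract = ExecutionContract()
contract.record_read("worker.py")
contract.check_edit("worker.py")                       # allowed: file was read
contract.record_test_run("run narrow test", 0)
contract.check_claim_tests_passed()                    # allowed: a run passed
```

The harness would call `check_edit` before applying a patch and `check_claim_tests_passed` before accepting a completion report.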
Practical rule: Define how the agent starts, acts, verifies, and stops.
Treat Tools as Privileged Interfaces
Most agent demos expose tools too casually.
They give the model a shell, browser, database, file editor, or API client and trust the prompt to keep behavior sane.
That is not enough.
Tool use needs policy.
For every tool, define:
1. When to use it
2. When not to use it
3. Required preconditions
4. Allowed scope
5. Failure behavior
6. Retry limits
7. Logging requirements
8. Approval boundaries
Example:
## Shell Tool Policy
Use shell for:
- running tests
- inspecting repo structure
- searching files
- checking git state
Do not use shell for:
- destructive commands
- credential access
- broad file deletion
- installing dependencies without approval
Before mutation:
- inspect target files
- check git status
- prefer minimal commands
After mutation:
- run relevant verification
- summarize command output
Every tool expands the agent’s action space.
A larger action space means more capability, but also more failure modes.
The harness should make tool use scoped, observable, and reversible where possible.
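Part of the shell policy above can be checked in code before a command runs. This is an allowlist sketch, not a sandbox: the allowed programs and git subcommands are assumptions chosen for illustration, and it assumes the harness executes the parsed argument list without a shell (for example `subprocess.run(argv)`), so shell operators are never interpreted.

```python
import shlex

# Hypothetical allowlists; a real policy would be owned and reviewed by a team.
ALLOWED_PROGRAMS = {"pytest", "ls", "grep", "git"}
ALLOWED_GIT_SUBCOMMANDS = {"status", "diff", "log"}


def check_shell_command(command: str):
    """Return (allowed, reason) for a proposed shell command."""
    try:
        argv = shlex.split(command)
    except ValueError:
        return False, "unparseable command"
    if not argv:
        return False, "empty command"
    program = argv[0]
    if program not in ALLOWED_PROGRAMS:
        return False, f"program not allowed: {program}"
    if program == "git":
        # Only read-only git subcommands pass; mutation needs approval.
        if len(argv) < 2 or argv[1] not in ALLOWED_GIT_SUBCOMMANDS:
            return False, "git subcommand not allowed"
    return True, "ok"


for cmd in ["git status", "git push origin main", "rm -rf build"]:
    print(cmd, "->", check_shell_command(cmd))
```

Every decision, allowed or denied, is a natural thing to log, which covers the "observable" half of the policy.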
Practical rule: Tools should be powerful, scoped, observable, and policy-constrained.
Engineer Context Like a Runtime Resource
Context is not a giant text box.
Context is working memory.
If you treat the context window like a dumping ground, agent quality degrades.
The agent becomes distracted. Stale information competes with fresh information. The model starts to miss details that should have been obvious.
A better mental model is a memory hierarchy:
L0: stable identity
L1: task instructions
L2: active working context
L3: retrieved project context
L4: long-term memory
L5: external documents and tools
L6: durable progress artifacts
Each layer has a job.
The identity layer should be small and stable.
The task layer should be specific.
Retrieved context should be relevant and fresh.
Memory should contain durable facts, not noise.
Tool outputs should be summarized instead of blindly appended forever.
Progress artifacts should preserve state across sessions.
Context engineering asks:
What must be in the prompt?
What can be retrieved on demand?
What should be summarized?
What should be persisted?
What should be forgotten?
What should never enter context?
More context is not always better.
Better-routed context is better.
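One way to route context is to assemble the prompt from prioritized layers under a budget. The sketch below makes two simplifying assumptions: the budget is counted in characters rather than tokens, and a block that does not fit is skipped instead of summarized. `ContextBlock` and `assemble_context` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ContextBlock:
    layer: int      # 0 = stable identity; lower number means higher priority
    name: str
    text: str


def assemble_context(blocks, budget_chars):
    """Pick blocks in layer order until the budget is used up."""
    chosen, used = [], 0
    for block in sorted(blocks, key=lambda b: b.layer):
        if used + len(block.text) > budget_chars:
            continue                 # skip; a fuller harness might summarize
        chosen.append(block)
        used += len(block.text)
    return chosen


blocks = [
    ContextBlock(3, "retrieved_docs", "x" * 100),
    ContextBlock(0, "identity", "i" * 10),
    ContextBlock(4, "memory", "m" * 5),
    ContextBlock(1, "task", "t" * 10),
]
selected = assemble_context(blocks, budget_chars=30)
print([b.name for b in selected])
```

Here the oversized retrieval block is dropped while identity, task, and a small memory block survive, which is the behavior the hierarchy is meant to produce.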
Practical rule: Treat context like RAM, not storage.
Build Durable State Outside the Context Window
Long-running agents fail when all state lives in the chat.
Eventually, the context is compacted, degrades, or disappears.
The agent forgets why it made a decision, repeats work, loses track of tests, or declares success without remembering what is still broken.
A serious harness needs durable progress artifacts.
Examples:
PROGRESS.md
PLAN.md
DECISIONS.md
RISKS.md
TODO.md
CHANGELOG.md
git commits
trace logs
test reports
A weak long-running agent does this:
1. Make many changes
2. Lose context
3. Forget why
4. Declare success
5. Leave broken state
A strong long-running agent does this:
1. Read progress
2. Select one task
3. Make a small change
4. Run verification
5. Commit or checkpoint
6. Update progress
7. Record risks
8. Continue
For coding agents, a good PROGRESS.md might look like this:
## Current Goal
Implement scoped retry handling for failed ingestion jobs.
## Completed
- Identified retry path in worker.py
- Added unit test for transient network failure
- Confirmed existing backoff utility exists
## In Progress
- Wiring retry policy into ingestion worker
## Blockers
- Need to confirm max retry count for production
## Next Step
- Add integration test for failed job replay
## Risks
- Duplicate processing if idempotency key is missing
This gives the next agent session a clean handoff.
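A harness can read and update a progress file like this programmatically at the start and end of each session. The sketch below assumes the simple format shown above (`## ` headings followed by plain lines); `parse_progress`, `mark_done`, and `render_progress` are hypothetical helpers, and file I/O is left out.

```python
def parse_progress(text):
    """Split a PROGRESS.md-style document into {heading: [lines]}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("## "):
            current = line[3:]
            sections[current] = []
        elif line and current is not None:
            sections[current].append(line)
    return sections


def mark_done(sections, item):
    """Move a bullet from 'In Progress' to 'Completed'."""
    line = f"- {item}"
    in_progress = sections.setdefault("In Progress", [])
    if line in in_progress:
        in_progress.remove(line)
    sections.setdefault("Completed", []).append(line)


def render_progress(sections):
    parts = []
    for heading, lines in sections.items():
        parts.append(f"## {heading}")
        parts.extend(lines)
    return "\n".join(parts) + "\n"


SAMPLE = """## Current Goal
Implement scoped retry handling for failed ingestion jobs.
## Completed
- Identified retry path in worker.py
## In Progress
- Wiring retry policy into ingestion worker
## Next Step
- Add integration test for failed job replay
"""

sections = parse_progress(SAMPLE)
mark_done(sections, "Wiring retry policy into ingestion worker")
print(render_progress(sections))
```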
Practical rule: Long-running agents need state outside the model.
Separate Memory From Skills
Many agent systems confuse memory and skills.
They are not the same. Memory stores facts. Skills store procedures.
Memory answers:
What does the agent know?
Skills answer:
How does the agent do something?
Examples of memory:
The project uses Poetry.
The user prefers concise technical explanations.
The staging deploy requires manual approval.
The API gateway owns refresh-token handling.
Examples of skills:
How to debug a failing Kubernetes pod.
How to review a pull request.
How to investigate a latency regression.
How to create a database migration safely.
How to summarize a production incident.
A skill should be structured and reusable:
---
name: latency-regression-debug
description: Use when p95/p99 latency increases after a deploy.
version: 1.0.0
---
## When to Use
Use when latency regression is reported after a code, config, infra, or model change.
## Procedure
1. Identify affected endpoint or job.
2. Compare p50, p95, and p99 before and after deploy.
3. Check recent diffs.
4. Inspect dependency latency.
5. Check queue depth and saturation.
6. Reproduce with a controlled benchmark if possible.
7. Propose the smallest reversible fix.
## Pitfalls
- Optimizing average latency while ignoring p99.
- Blaming the database before checking queueing.
- Ignoring cold starts.
- Comparing different traffic windows.
## Verification
- Same traffic class.
- Same time window.
- p95/p99 restored.
- No regression in error rate.
This is procedural memory. It helps the agent avoid rediscovering workflows.
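A skill file in this shape is easy to load into a registry. The sketch below assumes the front matter is flat `key: value` pairs between two `---` lines, so it avoids a YAML dependency. `parse_skill` is a hypothetical helper, and the shortened skill text is only sample input.

```python
def parse_skill(text):
    """Return (metadata dict, body) from a skill file with simple front matter."""
    lines = [line.rstrip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "---":
        raise ValueError("missing front matter")
    try:
        end = lines.index("---", 1)          # closing delimiter
    except ValueError:
        raise ValueError("unterminated front matter") from None
    meta = {}
    for line in lines[1:end]:
        key, _, value = line.partition(":")  # split on the first colon only
        meta[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


SKILL_TEXT = """---
name: latency-regression-debug
description: Use when p95/p99 latency increases after a deploy.
version: 1.0.0
---
## When to Use
Use when latency regression is reported after a code, config, infra, or model change.
"""

meta, body = parse_skill(SKILL_TEXT)
skills = {meta["name"]: body}        # procedures, keyed by skill name
memory = ["The project uses Poetry."]  # facts, kept in a separate store
print(meta["name"], meta["version"])
```

Keeping `skills` and `memory` as separate stores mirrors the distinction above: one is looked up by task type, the other by relevance to the current facts.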
Practical rule: Facts go into memory. Repeatable procedures become skills.
Build the Evaluation Harness With the Agent Harness
Agent evals are harder than normal LLM evals.
A chatbot produces an answer. An agent produces a trajectory.
That trajectory includes:
1. Tool calls
2. File reads
3. Edits
4. API calls
5. Retries
6. Failures
7. Recoveries
8. Test runs
9. Final output
10. State changes
A final answer can look correct while the trajectory is bad.
For example, the test passes, but the agent:
1. Edited the wrong abstraction
2. Ignored an existing helper
3. Introduced duplicate logic
4. Skipped security-sensitive checks
5. Used 40 unnecessary tool calls
6. Failed to document risk
That should not be a full pass.
A serious eval harness should measure both outcome quality and process quality.
For agent systems, useful eval dimensions include:
1. Task success
2. Tool selection
3. Tool efficiency
4. State changes
5. Policy violations
6. Latency
7. Token cost
8. Retry behavior
9. Verification quality
10. Diff quality
11. Failure recovery
The key idea is simple:
Agent harness = runs the agent
Eval harness = runs the agent against tasks, captures traces, grades outcomes, and aggregates results
You need both.
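A very small eval harness might look like the sketch below. It grades each run on outcome and on one process dimension (tool efficiency). `Trace`, `EvalTask`, `grade`, `run_evals`, and the stub agent are hypothetical; a real harness would capture traces from actual agent runs and use richer graders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Trace:
    tool_calls: List[str] = field(default_factory=list)
    final_output: str = ""


@dataclass
class EvalTask:
    name: str
    prompt: str
    expected: str
    max_tool_calls: int


def grade(task, trace):
    """Grade outcome and trajectory separately."""
    return {
        "task": task.name,
        "outcome_pass": trace.final_output == task.expected,
        "efficiency_pass": len(trace.tool_calls) <= task.max_tool_calls,
    }


def run_evals(agent, tasks):
    """Run the agent on every task, grade each trace, aggregate results."""
    results = [grade(task, agent(task.prompt)) for task in tasks]
    full = sum(r["outcome_pass"] and r["efficiency_pass"] for r in results)
    rate = full / len(results) if results else 0.0
    return {"results": results, "full_pass_rate": rate}


def stub_agent(prompt):
    """Stand-in agent: correct output, but wasteful on the second task."""
    if prompt == "fix typo":
        return Trace(["read_file", "edit_file", "run_tests"], "tests passed")
    return Trace(["search"] * 40, "tests passed")


TASKS = [
    EvalTask("typo", "fix typo", "tests passed", max_tool_calls=5),
    EvalTask("rename", "rename helper", "tests passed", max_tool_calls=5),
]
report = run_evals(stub_agent, TASKS)
print(report["full_pass_rate"])
```

Both runs produce the expected output, but only one counts as a full pass, because the second burned 40 tool calls.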
Practical rule: Evaluate the trajectory, not just the answer.
Use Macro Evals to Debug Systemic Failures
Single-run debugging is not enough.
Agent systems fail in patterns.
Examples:
Planner delegates too late
Researcher over-collects sources
Coder edits before reading
Reviewer focuses on style instead of correctness
Memory retrieval injects stale context
Tool retry loop burns tokens
Subagents duplicate work
Escalation happens too late
Macro evals look across many traces to identify repeated failure modes.
The workflow looks like this:
1. Collect traces
2. Score individual runs
3. Compress traces into comparable summaries
4. Cluster recurring behavior patterns
5. Rank patterns by impact
6. Inspect representative examples
7. Patch system behavior
8. Rerun evals
This moves you from anecdotal debugging to distribution-level engineering.
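The ranking step can be sketched with plain counting. This example assumes each scored run already carries a failure label, so exact-match grouping stands in for the compress-and-cluster steps; real macro evals would need a summarizer and a clustering method. The run records and field names are made up for illustration.

```python
from collections import Counter, defaultdict


def rank_failure_patterns(runs):
    """Group failed runs by label; rank by frequency, then token cost."""
    counts = Counter()
    tokens = defaultdict(int)
    examples = {}
    for run in runs:
        if run["passed"]:
            continue
        label = run["failure"]
        counts[label] += 1
        tokens[label] += run["tokens"]
        examples.setdefault(label, run["id"])   # first run as representative
    rows = [(label, counts[label], tokens[label], examples[label]) for label in counts]
    return sorted(rows, key=lambda row: (-row[1], -row[2]))


# Hypothetical scored runs.
RUNS = [
    {"id": "r1", "passed": False, "failure": "edit_before_read", "tokens": 1200},
    {"id": "r2", "passed": True, "failure": None, "tokens": 700},
    {"id": "r3", "passed": False, "failure": "retry_loop", "tokens": 9000},
    {"id": "r4", "passed": False, "failure": "edit_before_read", "tokens": 800},
    {"id": "r5", "passed": False, "failure": "stale_memory", "tokens": 500},
]

ranked = rank_failure_patterns(RUNS)
for label, count, cost, example in ranked:
    print(label, count, cost, example)
```

The output is a ranked list of patterns with one representative run each, which is the input to "inspect representative examples".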
Instead of asking:
Why did this one run fail?
Ask:
What class of runs fails, and what system behavior causes it?
That is the difference between debugging an example and improving a platform.
Practical rule: Beginners debug examples. Advanced teams debug failure distributions.
Measure Reliability, Not Just Capability
Agents are nondeterministic.
One successful run does not mean the system is reliable.
Two useful metrics are:
pass@k = did at least one of k attempts succeed?
pass^k = did all k attempts succeed?
These measure different things.
pass@k measures capability.
It asks whether the system can solve the task if given multiple chances.
pass^k measures consistency.
It asks whether the system succeeds every time.
A coding agent that solves a task once out of five attempts is capable.
It is not reliable.
A support agent that gives the correct policy once but fails randomly later is dangerous.
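The gap between the two metrics is easy to see with a little arithmetic. The sketch below computes both from observed trial results, then shows expected values under a simplifying assumption that is not in the definitions above: each attempt succeeds independently with the same probability p.

```python
def pass_at_k(trials):
    """True if at least one of the k attempts succeeded."""
    return any(trials)


def pass_hat_k(trials):
    """True only if all k attempts succeeded (pass^k)."""
    return all(trials)


def expected_pass_at_k(p, k):
    # Assumes independent attempts with success probability p.
    return 1 - (1 - p) ** k


def expected_pass_hat_k(p, k):
    # Assumes independent attempts with success probability p.
    return p ** k


trials = [False, False, True, False, False]   # one success in five attempts
print(pass_at_k(trials), pass_hat_k(trials))

for p in (0.2, 0.9):
    print(p, round(expected_pass_at_k(p, 5), 5), round(expected_pass_hat_k(p, 5), 5))
```

Under that assumption, an agent with a 90% per-attempt success rate passes all five attempts only about 59% of the time, which is why a high single-run success rate can still feel unreliable in production.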
Track:
1. Success rate
2. Variance
3. Retry count
4. Cost per success
5. Latency per success
6. Tool calls per success
7. Failure categories
8. Recovery rate
Practical rule: Production agents need consistency, not occasional brilliance.
Design Failure Handling Explicitly
Most agent demos ignore failure handling.
Real systems cannot.
Every agent needs a failure model.
Define what happens when:
1. A tool call fails
2. Retrieval returns stale context
3. Tests fail
4. An API rate limit is hit
5. The agent loops
6. Required context is missing
7. Permissions are insufficient
8. Output confidence is low
9. Subagents disagree
10. Verification cannot be completed
A good failure policy looks like this:
## Failure Policy
If a tool fails:
- retry once if the failure is transient
- do not retry destructive actions automatically
- summarize the failure
- choose an alternate path if available
If tests fail:
- inspect the failure
- make at most one targeted fix
- rerun the narrow test
- if still failing, stop and report
If context is insufficient:
- state what is missing
- proceed only with clearly labeled assumptions
- avoid irreversible actions
Agents should not silently push through uncertainty.
A reliable agent knows when to continue, when to retry, and when to stop.
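The tool-failure part of this policy can be encoded in a small wrapper. This is a sketch under assumptions: `TransientError` is a hypothetical exception the tool layer would raise for retryable failures, and `call_tool` returns a plain dict that the harness can summarize for the model.

```python
class TransientError(Exception):
    """Hypothetical marker for retryable failures (timeouts, rate limits)."""


def call_tool(tool, *, destructive=False, max_retries=1):
    """Run a tool; retry transient failures, never retry destructive actions."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return {"ok": True, "result": tool(), "attempts": attempts}
        except TransientError as exc:
            if destructive or attempts > max_retries:
                return {"ok": False, "error": f"transient: {exc}", "attempts": attempts}
            # otherwise fall through and retry
        except Exception as exc:
            # Non-transient failures are reported, not retried.
            return {"ok": False, "error": f"permanent: {exc}", "attempts": attempts}


def make_flaky(failures):
    """Build a test tool that fails transiently a fixed number of times."""
    state = {"left": failures}

    def tool():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientError("timeout")
        return "done"

    return tool


print(call_tool(make_flaky(1)))
print(call_tool(make_flaky(1), destructive=True))
```

The first call recovers on its single retry. The second is marked destructive, so it stops after one attempt and reports the failure.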
Practical rule: Failure handling is part of the harness, not an afterthought.
Use Multi-Agent Systems Only When Coordination Pays Off
Multi-agent systems sound advanced.
Often they are just expensive chaos.
Use multiple agents only when the task benefits from parallelism or specialization.
Good fits:
1. Broad research
2. Multi-source investigation
3. Red-team / blue-team review
4. Planner-coder-reviewer workflows
5. Independent verification
6. Large codebase exploration
Bad fits:
1. Simple Q&A
2. Small code edits
3. Basic summarization
4. Single-file changes
5. Narrow classification
A useful architecture:
Lead agent
owns task framing, planning, and synthesis
Research agents
explore independent branches
Coder agent
makes implementation changes
Reviewer agent
checks correctness, safety, and regressions
Verifier agent
runs tests and validates outputs
Important harness rules:
1. Give each agent a narrow role
2. Set token and tool budgets
3. Require compressed findings
4. Avoid raw context dumps
5. Prevent duplicate work
6. Define handoff contracts
7. Evaluate the system as a whole
Multi-agent systems are not automatically better.
They are better only when the gains from parallelism or specialization outweigh the cost of coordination.
Practical rule: Add agents when specialization creates leverage, not because the diagram looks impressive.
Add Observability From Day One
If you cannot inspect an agent, you cannot improve it.
A production-grade harness should emit traces.
Capture:
1. Input task
2. Loaded context
3. Retrieved memories
4. Selected skills
5. Tool calls
6. Tool outputs
7. State transitions
8. Errors
9. Retries
10. Final answer
11. Cost
12. Latency
13. User feedback
Without traces, you cannot answer:
Why did the agent choose this tool?
Why did it ignore the relevant file?
Why did it retrieve stale memory?
Why did it loop?
Why did cost spike?
Why did the final answer look correct but fail?
Observability enables debugging, evals, macro analysis, cost control, policy enforcement, skill improvement, and memory cleanup.
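A trace does not need heavy infrastructure to start with. The sketch below records structured events in memory and serializes them as JSON lines. `TraceRecorder`, the event type names, and the sample values are hypothetical, and it assumes every recorded field is JSON-serializable.

```python
import json
import time
import uuid


class TraceRecorder:
    """Collect structured events for one agent run."""

    def __init__(self, task):
        self.run_id = uuid.uuid4().hex
        self.events = []
        self.record("task_received", task=task)

    def record(self, event_type, **fields):
        self.events.append(
            {"run_id": self.run_id, "ts": time.time(), "type": event_type, **fields}
        )

    def to_jsonl(self):
        # One JSON object per line, ready to append to a log file.
        return "\n".join(json.dumps(event, sort_keys=True) for event in self.events)


recorder = TraceRecorder("fix failing test")
recorder.record("tool_call", tool="run_tests", args={"path": "tests/"})
recorder.record("tool_output", tool="run_tests", exit_code=1)
recorder.record("final_answer", text="1 test still failing", latency_s=3.4)
print(recorder.to_jsonl())
```

Because every event shares a `run_id`, the same records can feed single-run debugging and the cross-run analysis described earlier.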
Practical rule: No traces, no serious agent engineering.
Put Governance Around the Harness
Agent adoption is not only technical.
It is organizational.
Without governance:
1. Every developer writes their own prompts
2. Permissions drift
3. Skills duplicate
4. Memory gets messy
5. Evals are missing
6. Tools are unsafe
7. Nobody owns regressions
With governance:
1. Shared configs
2. Shared skills
3. Shared evals
4. Clear permissions
5. Standard review process
6. Centralized observability
7. Safer rollout
8. Faster onboarding
Every serious agent platform needs a DRI (directly responsible individual).
Someone must own:
1. Identity files
2. Tool policies
3. Memory policy
4. Skill library
5. Eval suite
6. Permission model
7. Release process
8. Incident review
9. Documentation
Bottom-up experimentation creates energy.
Governance turns it into infrastructure.
Practical rule: If nobody owns the harness, nobody owns the agent.
Final Takeaway
Next-level AI agents are not built by writing bigger prompts.
They are built by engineering better harnesses.
The model is the reasoning engine.
The harness is the operating system around it.
A serious agentic harness needs:
1. Identity
2. Execution contracts
3. Tool policies
4. Context engineering
5. Memory discipline
6. Skills
7. Durable state
8. Failure handling
9. Trajectory evals
10. Macro evals
11. Observability
12. Governance
If you are a student, learn this early.
If you are a developer, practice this deliberately.
If you are building AI products, treat this as infrastructure.
The best AI developers will not be the ones who only know how to call an API.
They will be the ones who know how to design the system around the model.
That is how we move from:
“This AI agent helps me sometimes.”
To:
“This agentic harness is part of my engineering system.”
References
- Anthropic: Demystifying evals for AI agents
- Anthropic: Building effective agents
- OpenAI Cookbook: Building Governed AI Agents
- Anthropic: Effective harnesses for long-running agents
- Anthropic: Effective context engineering for AI agents
- OpenAI Cookbook: Macro Evals for Agentic Systems
- OpenAI Cookbook: Getting started with OpenAI Evals
- Anthropic: How we built our multi-agent research system
