I'm Luhui Dev, a developer who has been breaking down Agent engineering and exploring how AI can be applied in education.
I focus on Agent Harness, LLM application engineering, AI for Math, and the productization of education SaaS.
Intro
Anthropic's recent posts on Agent Harness are worth your time.
They quietly pushed the whole field forward: from "how do I write a smarter loop" to "how do I design a runtime that survives production."
This piece walks through the latest practice: Session, Harness, Sandbox, Credentials, Tool Protocol, Context Builder, Trace, Eval.
How the thinking shifted
Anthropic didn't wake up one day and decide Agents needed a Runtime. The center of gravity moved a few times over the past couple of years.
Phase 1: Long context as the main lever
Early Anthropic talked a lot about long context.
100K, then 200K context windows showed up. Claude could read more docs, hold longer conversations, juggle more complex material. Most problems were still framed as prompt engineering: how to stuff information in, how to make the model find the right piece in a long window, how to cut down on misses.
Made sense at the time. When the window suddenly gets bigger, everyone wants to throw task state, docs, and chat history into it.
But real Agent work proved a simple point: a bigger workspace is not the same as reliable memory.
No matter how long the window is, it's still tokens the model sees in a single call. It gets expensive. It degrades. It gets compressed. It gets polluted by noise.
Phase 2: Splitting Agents into workflow vs autonomous loop
By the Building Effective AI Agents era, Anthropic started drawing a hard line between workflow and agent.
A workflow has a defined process and controllable paths. The model makes calls at certain nodes.
An agent is an open loop. The model plans, calls tools, reads results, and keeps going on its own.
This distinction matters more than people give it credit for. Most products don't need a highly autonomous Agent.
Stable business processes are cheaper, more reliable, and easier to debug as workflows. Forcing an Agent in usually just turns a controllable process into an uncontrollable black box.
The takeaway from this phase still holds: start simple. Only reach for higher autonomy when the task actually demands open-ended decisions.
Phase 3: Tools, context, and safety become the main battlefield
After 2025, Anthropic's posts pivoted hard toward engineering details.
- think tool: gives space for reasoning inside complex tool calls.
- Multi-agent research system: parallel search and division of labor for heavy research tasks.
- Context engineering: selecting, compressing, trimming, and dynamically loading context.
- Agent Skills: procedural domain knowledge, loaded on demand.
- Claude Code sandboxing: drawing the line around code execution, filesystem, network, and credentials.
- MCP, code execution with MCP, advanced tool use: connecting tools, discovering them, and stopping tool-definition bloat and intermediate-result pollution from wrecking the context.
These look like scattered topics. They all point at the same thing:
Once an Agent does real work, the question stops being "can the model answer" and becomes "can the system carry the model's actions."
- Too many tools → context explodes.
- Tasks too long → chat history runs out of road.
- Execution too free → safety boundary collapses.
- Multi-Agent too eager → cost and coordination overhead pile up.
- Models upgrading too fast → old harness assumptions expire.
Phase 4: Lift the problem to the Runtime layer
In the latest Managed Agents post, Anthropic stopped debating how to write a specific harness. They started talking about a stable interface for an Agent Runtime.
The system gets split into Session, Harness, Sandbox.
Claude + harness = the brain. Sandbox and execution environment = the hands.
Session lives outside the context window.
Credentials live outside the sandbox.
Execution environments are allowed to fail, get replaced, get rebuilt.
That's the whole arc of Anthropic's thinking:
Long context → Tool loop → Context engineering → Safe execution → Recoverable runtime
The core idea of Managed Agents: stable interfaces, swappable strategies
Managed Agents boils down to one line: don't bolt together the things that will keep changing.
Models change. Harness strategies change. Tools change. Sandbox shapes change. Context strategies change. Customer deployment environments change. Safety requirements change.
Cram all of that into one container, one loop, one prompt stack, and within a year your system is a brick you can't replace.
Harnesses encode assumptions that go stale as models improve.
A harness encodes the current model's weaknesses. Model can't plan long tasks? Add a planner. Model misses checks? Add an evaluator. Model bails early when context is close to full? Add context reset. Model is shaky on tool calls? Add elaborate retry logic.
These strategies work on one generation of the model. The next generation, they're dead weight.
Anthropic gave a sharp example: Claude Sonnet 4.5 tended to wrap up early near the context limit, so the harness added a context reset. With Claude Opus 4.5, that behavior was gone, and the reset logic became overhead.
The lesson:
Don't bake today's model defects into tomorrow's architecture.
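To make that concrete, here's a minimal sketch (my own hypothetical names, not Anthropic's API) of harness strategies gated by per-model capability flags, so each patch retires together with the weakness it covers:

```python
# Hypothetical sketch: each harness strategy is tied to the model weakness it
# patches, so a model upgrade retires the patch instead of leaving dead weight.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    wraps_up_early_near_limit: bool  # the Sonnet 4.5 behavior from the example
    reliable_tool_calls: bool

@dataclass
class HarnessConfig:
    use_context_reset: bool
    use_tool_retry: bool

def config_for(model: ModelProfile) -> HarnessConfig:
    # Strategies are derived from capability flags, never hardcoded in the loop.
    return HarnessConfig(
        use_context_reset=model.wraps_up_early_near_limit,
        use_tool_retry=not model.reliable_tool_calls,
    )

sonnet = ModelProfile("claude-sonnet-4-5", wraps_up_early_near_limit=True, reliable_tool_calls=True)
opus = ModelProfile("claude-opus-4-5", wraps_up_early_near_limit=False, reliable_tool_calls=True)
print(config_for(sonnet))  # context reset on
print(config_for(opus))    # context reset retired with the upgrade
```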
The core interfaces Managed Agents pulls out look roughly like this:
- Session: what happened during the task
- Harness: what to do next
- Sandbox: where actions execute
- Tool interface: how actions get called
- Credential boundary: whether actions are authorized
- Context builder: what the model sees this turn
- Trace / Eval: how the run gets reviewed
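As a rough illustration, these boundaries could be typed as interfaces. A minimal Python sketch, with names and signatures of my own invention:

```python
# Illustrative sketch of the runtime boundaries as Python protocols.
# These names and signatures are mine, not Anthropic's API.
from typing import Protocol

class Session(Protocol):
    def append(self, event: dict) -> None: ...            # record what happened
    def events(self) -> list[dict]: ...                   # durable, queryable log

class Sandbox(Protocol):
    def execute(self, name: str, input: str) -> str: ...  # where actions run

class ContextBuilder(Protocol):
    def build(self, session: Session) -> list[dict]: ...  # what the model sees this turn

class Harness(Protocol):
    def step(self, session: Session, sandbox: Sandbox) -> bool: ...  # what to do next; False when done
```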
The point isn't to land an elegant fixed Agent loop.
The point is: when models, tools, and execution environments change, the system can keep evolving.
That's what's actually worth stealing from Managed Agents.
Key idea #1: Brain / Hands decoupling
The most important cut in Managed Agents is splitting brain from hands.
Brain = Claude + harness.
Hands = sandbox, MCP server, external tools, devices, browser, code execution environment.
The early default was putting the brain inside the hands. One container running the harness, holding the session, executing tools, sitting on the filesystem, sometimes with credentials thrown in for fun.
In production, this creates the classic problem: the container becomes a pet server.
You can't toss it. Can't easily restart it. If it crashes you have to rescue it. To debug you have to SSH in.
User data, execution state, tool calls, and credential boundaries all mashed together.
Anthropic's later approach: let the harness leave the sandbox. The harness becomes a relatively stateless control plane. The sandbox becomes a callable, rebuildable execution resource.
The two talk through a dead-simple interface:
`execute(name, input) -> string`
The harness doesn't need to know whether the other side is a container, a remote service, an MCP server, or some tool environment inside a customer's VPC. It calls the action. It gets the result back.
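A minimal sketch of that contract, assuming nothing about the other side. Both sandbox functions here are hypothetical stand-ins:

```python
# Sketch: the harness only sees `execute(name, input) -> str`; whether the
# hands are a local container, a remote service, or an MCP server is hidden
# behind that call. All names here are illustrative.
from typing import Callable

Execute = Callable[[str, str], str]

def local_sandbox(name: str, input: str) -> str:
    # e.g. run the action in a local container
    return f"[local] ran {name}"

def remote_sandbox(name: str, input: str) -> str:
    # e.g. POST to an execution service in a customer VPC
    return f"[remote] ran {name}"

def run_action(execute: Execute, name: str, input: str) -> str:
    # The brain doesn't care which hands it got; it calls and reads the result.
    return execute(name, input)

print(run_action(local_sandbox, "read_file", "notes.md"))
print(run_action(remote_sandbox, "read_file", "notes.md"))
```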
What you get out of this:
- Sandbox dies, task doesn't die.
- Brain can start work, sandbox can load later.
- One brain, many hands.
Key idea #2: Session design
The other big call in Managed Agents:
Session is not Claude's context window.
Lots of Agent systems blur session, chat history, memory, and context window together. Short tasks survive that. Long tasks fall apart.
The context window is just the tokens the model sees in a single inference call. It's a workspace.
The session should be the durable record of what happened, closer to an event log.
A serious session should capture at least:
- user_input
- model_response
- tool_call
- tool_result
- file_change
- error
- retry
- approval
- checkpoint
Every time the harness calls the model, it pulls from the session and assembles a context for that turn.
This separation is the whole game: Prompt is the workspace. Session is the ledger.
A workspace gets organized, compressed, trimmed, rearranged. A ledger stays as complete, queryable, and recoverable as you can make it. Dump all history into context and cost explodes while the model drowns in noise. Rely only on summaries and the detail you dropped becomes tomorrow's critical bug.
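Here's one possible shape for that ledger: a minimal append-only JSONL log. The event schema and class names are illustrative, not from Anthropic's posts:

```python
# Sketch of a session as an append-only event log, separate from any prompt.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class SessionEvent:
    type: str      # user_input | tool_call | tool_result | error | checkpoint | ...
    payload: dict
    ts: float

class SessionLog:
    """The ledger: complete, queryable, recoverable. Not the workspace."""
    def __init__(self, path: str):
        self.path = path

    def append(self, type: str, payload: dict) -> None:
        event = SessionEvent(type, payload, time.time())
        with open(self.path, "a") as f:
            f.write(json.dumps(asdict(event)) + "\n")

    def events(self, *types: str) -> list[SessionEvent]:
        out = []
        with open(self.path) as f:
            for line in f:
                e = SessionEvent(**json.loads(line))
                if not types or e.type in types:
                    out.append(e)
        return out

log = SessionLog("session.jsonl")
log.append("user_input", {"text": "refactor the auth module"})
log.append("tool_call", {"name": "read_file", "input": "auth.py"})
print(len(log.events("tool_call")))  # query the ledger, not the prompt
```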
The shape that holds up:

```
Raw events kept long-term
        ↓
Context Builder picks dynamically
        ↓
This model call sees a high-signal context
```
This is also where context engineering and durable state split apart.
Context engineering decides what the model sees this turn.
The session event log records what actually happened in the system.
Long-running Agents that skip this layer pay for it later: resume, trace, eval, and debug all get painful.
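As a sketch of the "what the model sees this turn" side, here's a toy context builder that keeps checkpoints plus a recent slice instead of replaying the whole log. The selection rules are illustrative only:

```python
# Sketch: a context builder that assembles this turn's prompt from the
# session log instead of replaying it wholesale. Real builders also pull
# memory, skills, summaries, and tool results; this only shows the shape.
def build_context(events: list[dict], max_events: int = 20) -> list[dict]:
    system = [{"role": "system", "content": "You are a coding agent."}]
    # Keep every checkpoint: cheap, high-signal anchors for long tasks...
    anchors = [e for e in events if e["type"] == "checkpoint"]
    # ...but only the most recent slice of the raw turn-by-turn detail.
    recent = events[-max_events:]
    picked = anchors + [e for e in recent if e not in anchors]
    return system + [
        {"role": "user" if e["type"] == "user_input" else "assistant",
         "content": str(e["payload"])}
        for e in picked
    ]
```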
Key idea #3: Sandbox design
Sandbox is the most underrated piece in an Agent system.
Most teams start by giving the Agent a shell. It can run commands, read files, edit code. Feels like enough.
Fine for demos. In production, the sandbox is your security boundary, your execution boundary, and a meaningful source of cost and latency.
What Anthropic pushed in Claude Code sandboxing and Managed Agents:
- Sandboxes isolate filesystem and network. Treat model-generated code as untrusted code. The sandbox needs to limit filesystem access and limit network reach. Otherwise prompt injection can talk the Agent into reading files it shouldn't, hitting services it shouldn't, and exfiltrating the result.
- Sandboxes shouldn't hold long-lived credentials. Every GitHub token, DB key, or cloud secret sitting inside the sandbox is something an attacker can talk the Agent into leaking.
- Sandboxes need to be rebuildable and recoverable. Long-running Agents will hit failures. Bind the sandbox to a session too tightly and the failure takes the whole task down. Better: make the sandbox rebuildable, recoverable, and ideally snapshot/resume-able. This is just Brain / Hands decoupling, taken seriously.
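One way this might look in code: a hypothetical policy object where filesystem and network are default-deny, credentials stay out, and rebuild is the recovery path. Field names and snapshot mechanics are my own placeholders:

```python
# Sketch of a sandbox policy: default-deny filesystem and network, nothing
# long-lived inside, rebuild is cheap. All names are illustrative.
from dataclasses import dataclass, field

@dataclass
class SandboxPolicy:
    writable_paths: list[str] = field(default_factory=lambda: ["/workspace"])
    readable_paths: list[str] = field(default_factory=lambda: ["/workspace"])
    network_allowlist: list[str] = field(default_factory=list)  # default: no egress
    holds_credentials: bool = False  # tokens stay behind a proxy, never in here

@dataclass
class Sandbox:
    policy: SandboxPolicy
    snapshot_id: str | None = None

    def snapshot(self) -> str:
        # Stand-in for persisting filesystem state outside the sandbox.
        self.snapshot_id = "snap-001"
        return self.snapshot_id

    @classmethod
    def rebuild(cls, policy: SandboxPolicy, snapshot_id: str) -> "Sandbox":
        # A dead sandbox is replaced, not rescued; the session log survives it.
        return cls(policy=policy, snapshot_id=snapshot_id)
```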
TL;DR: Anthropic's 2026 Agent Harness architecture
Stitching the whole 2026 line of thinking together, you get the following picture. What matters is the responsibility boundary of each piece:
- Harness is the control plane: it schedules models, context, tools, strategies.
- Session event log is durable state, not bound to any container.
- Context Builder assembles a high-signal context from session, memory, skills, and tool results.
- Tool Router dispatches actions to MCP, the code execution environment, the sandbox, or other hands.
- Sandbox executes actions; allowed to fail, allowed to be rebuilt.
- Credential Proxy / Vault holds credentials; untrusted execution environments never get the raw token.
- Trace / Eval runs through the whole thing, so you can review, regress, and A/B harness changes.
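To make the credential boundary concrete, here's a toy proxy sketch: the sandbox asks for an action, the proxy mints a short-lived scoped token, and the raw secret never leaves the vault. Everything here is illustrative:

```python
# Sketch of a credential proxy: the sandbox never sees the long-lived secret,
# only the result of an action performed on its behalf.
import time

class CredentialProxy:
    def __init__(self, vault: dict[str, str]):
        self._vault = vault  # long-lived secrets live only here

    def call(self, service: str, request: dict, scopes: list[str]) -> dict:
        secret = self._vault[service]
        token = self._mint_scoped_token(secret, scopes, ttl_s=300)
        # Forward the request with the short-lived token attached;
        # the sandbox only ever sees the response.
        return {"service": service, "request": request, "auth": token[:8] + "..."}

    def _mint_scoped_token(self, secret: str, scopes: list[str], ttl_s: int) -> str:
        # Stand-in for a real STS / OAuth token-exchange step.
        return f"{hash((secret, tuple(scopes), int(time.time()) // ttl_s)):x}"

proxy = CredentialProxy({"github": "ghp_longlived_secret"})
print(proxy.call("github", {"op": "open_pr"}, scopes=["repo:write"]))
```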
Research-grade harness vs production-grade harness
Plenty of Agent demos look great and then get clunky, expensive, and impossible to debug in production.
The reason: research harness and production harness have different goals.
A research harness chases capability ceiling. Burn more tokens, spawn more subagents, stack more evaluators, run another round. If the task success rate ticks up, the experiment was worth it.
A production harness chases stable returns. It has to count cost, watch latency, control permissions, recover from failure, be observable, ship gradually, roll back cleanly.
| Dimension | Research Harness | Production Harness |
|---|---|---|
| Goal | Push task success ceiling | Stable delivery under cost / latency / safety constraints |
| State | Transcript, local files, temp progress files | External session log, checkpoints, event history |
| Context | Give the model everything you can | Smaller, higher-signal context set |
| Tools | Wire up as many as possible | Dynamic discovery, on-demand loading, scoped permissions |
| Multi-Agent | Try parallelism and role splits first | Only on high-value, parallelizable tasks |
| Safety | Manual confirmation, light isolation | Sandbox, proxy, vault, scoped credentials |
| Failure recovery | Retry or human handoff | Resume, replay, checkpoint, trace |
| Evaluation | Did the final demo work | Outcome eval, trace analysis, regression suite |
| Iteration | Add modules, add strategies, add agents | Run ablations, delete the strategies that no longer pay rent |
One line from Anthropic's posts sticks: harness strategies get repriced every time the model upgrades.
Today's planner is helpful. Tomorrow it slows the system down. Today's evaluator catches errors. Tomorrow it just adds cost. Today's context reset is a necessary patch. Tomorrow it's dead weight.
So a production harness can't only add things. It has to delete things. That's the whole point of ablations.
Every model upgrade should re-test: is memory still earning its keep? Is the critic? Is tool search? Is multi-agent fanout? Is context reset?
In Agent engineering, the ability to delete obsolete complexity is its own skill.
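A hedged sketch of what that re-test could look like: an ablation loop over strategy toggles. The eval itself is a random stand-in here; in practice it would be the regression suite:

```python
# Sketch: on every model upgrade, re-run the eval suite with each strategy
# ablated and drop the ones that no longer pay rent. Illustrative only.
import random

STRATEGIES = ["memory", "critic", "tool_search", "multi_agent", "context_reset"]

def run_eval_suite(model: str, disabled: set[str]) -> float:
    # Stand-in for: run the regression tasks, return the success rate.
    random.seed(hash((model, tuple(sorted(disabled)))))
    return round(random.uniform(0.6, 0.9), 3)

def ablation_report(model: str) -> dict[str, float]:
    baseline = run_eval_suite(model, disabled=set())
    # Positive delta means the strategy still earns its keep on this model.
    return {s: round(baseline - run_eval_suite(model, {s}), 3) for s in STRATEGIES}

print(ablation_report("claude-opus-4-5"))
```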
Closing thoughts
Agent products will keep getting more complex. But the complexity shouldn't all live in the prompt and the loop. It belongs in the runtime:
- Session: durable state
- Harness: control plane
- Context Builder: context scheduling
- Tool Router: action dispatch
- Sandbox: isolated execution
- Credential Proxy: credential boundary
- Trace: process record
- Eval: outcome judgment
That's the actual foundation for Agents in production.
Multi-agent setups will keep evolving. The MCP ecosystem will keep growing. Context windows will keep getting longer. Models will keep getting better at tool calls, planning, and self-repair.
But none of that softens the core problem. It sharpens it: your system has to be able to swap out old strategies.
An Agent platform that hardcodes everything into prompts, containers, and a fixed loop gets harder to maintain every quarter.
A system that draws clear boundaries between state, execution, credentials, context, and evaluation is the one that gets to evolve alongside the model.
References
- Anthropic, "Scaling Managed Agents: Decoupling the brain from the hands." https://www.anthropic.com/engineering/managed-agents
- Anthropic, "Harness design for long-running application development." https://www.anthropic.com/engineering/harness-design-long-running-apps
- Anthropic, "Effective harnesses for long-running agents." https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- Anthropic, "Effective context engineering for AI agents." https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- Anthropic, "Making Claude Code more secure and autonomous with sandboxing." https://www.anthropic.com/engineering/claude-code-sandboxing
- Anthropic, "Code execution with MCP: building more efficient AI agents." https://www.anthropic.com/engineering/code-execution-with-mcp
- Anthropic, "Introducing advanced tool use on the Claude Developer Platform." https://www.anthropic.com/engineering/advanced-tool-use
- Anthropic, "Equipping agents for the real world with Agent Skills." https://www.anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills
- Anthropic, "Building Effective AI Agents." https://www.anthropic.com/engineering/building-effective-agents
