TL;DR
I worked on a coding agent harness that executes AI-generated code in Docker sandboxes, and learned that the real bottleneck isn't code generation, but everything after: environment setup, error handling, and integration with external services.
While this works great for local development, connecting to tools like GitHub, Slack, or databases would have meant building dozens of API integrations, managing OAuth flows, and handling edge cases. That's where Composio's ToolRouter came in: instead of building integrations myself, I now get tools through 6 simple meta-tools, with authentication and execution handled automatically.
The result? An agent that can write code, test it locally, create GitHub issues, and notify Slack, all through a single, observable execution loop. Turns out the coolest part wasn't just watching AI write code, but watching it interact with the real world safely and transparently.
Note: This is a learning project where I explored agent execution patterns and integration approaches. The Composio integration shown here demonstrates the concept, though a production implementation would need additional error handling, cost controls, and testing.
Why Build a Coding Agent Harness
I didn’t start thinking about a coding agent harness because existing tools are bad. Quite the opposite. Tools like Claude Code and Codex are excellent, and I use them regularly for debugging and iteration.
What sparked this work wasn’t dissatisfaction, but curiosity.
While using these tools, I kept noticing that a lot of the most important work happens outside the model: code execution, error capture, retries, environment setup, and tool integration. These systems handle that complexity well, but largely invisibly. You get an answer, a fix, or a result, but not always a clear view into how the system arrived there.
I wanted to understand that execution loop more deeply.
Not just whether it worked, but:
- what code actually ran
- what error was thrown
- what context was passed back to the model
- what changed between attempts
- and where control really lives when things fail
Most modern AI coding tools optimize for speed and convenience. They hide complexity to make workflows feel smooth. That’s usually the right tradeoff. But it also means the execution layer, the part where code meets reality, is opaque.
The Real Bottleneck Isn’t Code Generation
Writing code is no longer the hard part. Models are already very good at producing first drafts, boilerplate, and obvious fixes.
The harder part is everything that happens after: running code in a real environment, capturing raw errors, installing missing dependencies, handling environment variables, and retrying with proper context when something fails.
Humans still do most of this manually. We run the code, read the error, decide what matters, and feed it back into the model. That makes the human the slowest and most expensive part of the loop.
If an AI system can write code but cannot directly observe failures and respond to them, it isn’t really an agent. It’s a suggestion engine with a human acting as the executor.
What I Mean by a “Harness”
A coding agent harness isn’t a model and it isn’t a framework. It’s the infrastructure around the model.
It’s the execution loop that:
- runs code in a controlled environment
- captures raw outputs and errors
- feeds that reality back to the model
- applies guardrails around what the system can touch
- and makes every step visible and debuggable
The language model provides reasoning and code.
The harness provides execution, feedback, and constraints.
Together, they turn “generate code” into “try, fail, observe, and improve.”
Why This Matters
This isn’t about replacing developers or claiming AI writes perfect code.
It’s about removing the least valuable parts of development work: rerunning commands, fixing missing imports, interpreting test failures, and repeating the same debug loop over and over.
By making execution and feedback explicit, AI systems become easier to trust, easier to debug, and easier to reason about.
That’s the motivation behind this work: not building a better model, but understanding and shaping the execution layer that makes these tools actually useful.
And honestly, the first time you watch an agent write code, see it fail, understand what broke, and fix itself, that’s genuinely fucking cool. Not because it’s magic, but because you can finally see the entire loop working.
Building the Harness
The first part was the motivation. Next comes the part where the idea becomes a real execution loop you can watch, debug, and trust.
Before writing any code, I forced myself to answer one question:
What does an "agent" actually do, step by step, when you strip away the UI?
Here's the loop in plain English:
- Task comes in
- Model decides what to do next
- Model asks to use a tool
- The harness runs something in the real world
- The harness returns raw results
- Model updates its plan
- Repeat until done or we hit a cap
The important detail is the one people forget:
The model never touches your machine.
It only sees whatever the harness returns.
That boundary is the whole point. The harness is the interface between intelligence and execution. If you control that interface, you control the agent.
The Architecture
This harness has three main parts.
1) The workspace
A fresh directory on the host for every run.
This is where the agent writes and reads files. Think of it like a tiny project repo the agent can manipulate.
That choice matters because it's the difference between:
- "run a snippet of Python"
- and "build and iterate on an actual codebase"
2) The Docker sandbox
Every command runs inside a Docker container with the workspace mounted at /workspace.
That gives me isolation, repeatability, and a place to put guardrails.
By default:
- Resource limits are enforced (CPU + memory)
- Commands have a hard timeout
- Networking is disabled unless explicitly enabled
If you are going to let a model execute arbitrary code, you need a safety story. Docker isn't perfect, but it is an enormous step up from running things directly on the host.
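Concretely, the sandbox is little more than a docker run invocation with the guardrails baked in. Here's a minimal sketch of the runner, assuming the custom image below is tagged agent-sandbox (the tag and the specific limits are placeholders, not the exact implementation):

import subprocess

def run_in_sandbox(workspace_dir: str, command: str,
                   timeout: int = 60, network_enabled: bool = False) -> dict:
    """Run one command inside a throwaway container with the workspace mounted."""
    docker_cmd = [
        "docker", "run", "--rm",
        "--memory", "512m", "--cpus", "1",                      # resource caps
        "--network", "bridge" if network_enabled else "none",   # no network unless asked
        "-v", f"{workspace_dir}:/workspace",
        "-w", "/workspace",
        "agent-sandbox",                                         # image tag: assumption
        "bash", "-lc", command,
    ]
    try:
        proc = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode, "stdout": proc.stdout,
                "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        # This kills the docker client; a hardened version would also docker-kill
        # a named container so nothing keeps running in the background.
        return {"exit_code": -1, "stdout": "", "stderr": "", "timed_out": True}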
The Image Setup:
I use a custom Docker image with pytest pre-installed:
FROM python:3.11-slim
RUN pip install --no-cache-dir pytest
WORKDIR /workspace
CMD ["bash"]
This solves a critical problem: if the agent has to install pytest every run, it wastes iterations and API calls. Pre-installing common dependencies in the image means every container starts ready to work.
3) The tool contract
The model does not run anything directly. It calls tools. The harness executes those tools and returns structured results.
I kept the tool surface area small but real:
- list_files()
- read_file(path)
- write_file(path, content)
- run_command(command, timeout, network_enabled)
- task_complete(summary)
That is enough for real developer workflows:
- Create modules
- Write tests
- Run pytest
- Read failures
- Patch code
- Rerun until green
Why Docker
Yes, I could have used subprocess.run() and called it a day.
But that's not "agent infrastructure." That's handing a model a loaded gun.
You do not want a tool-using model to have direct access to:
- Your filesystem
- Your SSH keys
- Your environment variables
- Your network
- Your ability to fork-bomb your laptop into a space heater
Docker gives me a baseline set of protections:
- Process isolation: Code runs in a container, not on my host
- Resource limits: CPU and memory caps prevent obvious abuse
- Network control: Networking can be off by default
- Reproducibility: Every run starts from a known image
The tradeoff is complexity. It adds friction and edge cases. But for this category of problem, it's the right trade.
The Tools
I used to think tool calling was the "agent" part.
It's not. Tool calling is just the API plumbing that makes the loop possible.
The model is effectively saying: "Please run this for me, and tell me what happened."
The harness is the one doing the doing.
Tool: write_file
This is how the model creates and edits code. It writes directly into the workspace.
In the simplest form, this tool is just an interface to a safe path resolver plus a size limit so the agent can't spam huge files.
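A stripped-down version looks roughly like this; the 200 KB cap is arbitrary and the return shape is just what my harness happens to use:

from pathlib import Path

MAX_FILE_BYTES = 200_000  # arbitrary cap so the agent can't dump huge blobs

def write_file(workspace_root: Path, rel_path: str, content: str) -> dict:
    """Write a file inside the workspace, refusing oversized or escaping paths."""
    data = content.encode("utf-8")
    if len(data) > MAX_FILE_BYTES:
        return {"ok": False, "error": f"file larger than {MAX_FILE_BYTES} bytes"}

    # Safe path resolution: reject anything that resolves outside the workspace
    target = (workspace_root / rel_path).resolve()
    if not target.is_relative_to(workspace_root.resolve()):
        return {"ok": False, "error": f"path escapes workspace: {rel_path}"}

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return {"ok": True, "path": rel_path, "bytes": len(data)}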
Tool: run_command
This is the core tool. It's how the model runs tests, lints, scripts, whatever.
Two design choices here ended up being surprisingly important:
- Timeouts are enforced. If a command hangs, it dies.
- Networking is off by default. If the agent wants internet (pip install, curl, etc.), it has to explicitly ask for it.
That single boolean makes the system easier to reason about, and it's the kind of guardrail you never see from the outside in most agent products.
Here's what the tool schema looks like:
{
  "name": "run_command",
  "description": "Run a shell command in Docker in /workspace. Networking is off by default.",
  "parameters": {
    "type": "object",
    "properties": {
      "command": {"type": "string"},
      "timeout": {"type": "integer", "description": "Seconds"},
      "network_enabled": {"type": "boolean", "description": "Enable for pip install, etc."},
      "description": {"type": "string", "description": "What this is trying to do"}
    },
    "required": ["command", "description"]
  }
}
Tool: task_complete
This is the stop button. Without it, the harness just keeps looping until it hits max_iterations, even if the task is obviously done.
For an agent loop, you need a clear termination condition.
How tool calling actually works
The mechanics are simple:
- You define tools using a JSON schema
- You send them with your API request
- The model can respond with tool calls rather than plain text
- You execute the tool calls in your harness
- You send the results back as tool messages
- The model sees those results and continues
So the conversation looks like:
User: "Build X"
Model: calls write_file and run_command
Harness: executes and returns real output and errors
Model: uses that reality to decide what to do next
Repeat
The key point again:
The model is not executing. It's requesting.
You are the executor. The harness is the gate.
The code that matters
Most of the harness is glue code. The heart is the loop:
for i in range(max_iterations):
    # Get model response with tools
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = resp.choices[0].message

    # If the model wants to use tools
    if msg.tool_calls:
        # Store the assistant message, including its tool calls
        messages.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
        })

        # Execute each tool call
        for tc in msg.tool_calls:
            name = tc.function.name
            args = json.loads(tc.function.arguments)
            result = execute_tool(name, args)

            # Add the result back into the conversation
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })

            # Check for completion
            if name == "task_complete":
                return result
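The execute_tool call in that loop is just a dispatcher. A simplified sketch, building on the run_in_sandbox and write_file sketches above (read_file, list_files, and the WORKSPACE_ROOT global are the analogous pieces, not shown here):

def execute_tool(name: str, args: dict) -> dict:
    """Route a model-requested tool call to the right harness function."""
    if name == "write_file":
        return write_file(WORKSPACE_ROOT, args["path"], args["content"])
    if name == "read_file":
        return read_file(WORKSPACE_ROOT, args["path"])
    if name == "list_files":
        return list_files(WORKSPACE_ROOT)
    if name == "run_command":
        return run_in_sandbox(
            str(WORKSPACE_ROOT),
            args["command"],
            timeout=args.get("timeout", 60),
            network_enabled=args.get("network_enabled", False),
        )
    if name == "task_complete":
        return {"ok": True, "summary": args.get("summary", "")}
    return {"ok": False, "error": f"unknown tool: {name}"}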
In practice, the "agent" behavior comes from one thing:
The model sees raw stdout, exit codes, and failures, and it learns what to do next from those signals.
It's the difference between:
- "I think this should work"
- and "I ran it, here is the traceback"
A Real Run
Here's what actually happened when I gave it a real task:
Task: Create a Python text normalization utility with pytest tests
Iteration 1: Create project structure
→ write_file("textutils/core.py")
→ write_file("tests/test_normalize.py")
✓ Files created
The agent writes a normalize() function and 6 test cases covering edge cases like empty strings, multiple spaces, tabs, and newlines.
Iteration 2: Run tests (first attempt)
→ run_command("pytest tests/test_normalize.py")
✗ exit_code=2
Error: "ERROR: file or directory not found: tests/test_normalize.py"
Pytest can't find the module. The agent doesn't understand why yet.
Iterations 3-7: The struggle
→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2
→ run_command("pytest --pythonpath=/workspace")
✗ exit_code=4
→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2
This is where you see real problem-solving behavior. The agent is stuck trying different pytest configuration flags, but that's not actually the problem.
Five iterations of trying variations before it figures out the real issue.
Iteration 8: The breakthrough
→ write_file("textutils/__init__.py", "")
→ run_command("pytest tests/test_normalize.py")
✓ exit_code=0
===== test session starts =====
collected 6 items
tests/test_normalize.py ...... [100%]
===== 6 passed in 0.02s =====
The agent realizes textutils needs to be a proper Python package. Adds the __init__.py file. All tests pass.
Iteration 9: Done
→ task_complete("Created a text normalization utility with tests. All tests passed successfully.")
Total time: ~25 seconds. Iterations: 9.
What This Teaches You
The satisfying part isn't just that it worked; it's how it worked.
The agent:
- Created real files
- Ran real commands
- Saw real failures
- Got stuck for a bit (iterations 3-7)
- Had a realization (iteration 8)
- Fixed the actual problem
- Verified success
That's the loop working. Not magic, not cherry-picked—just a feedback cycle that eventually converges on the right answer.
And when something goes wrong, you can replay it:
{
  "iteration": 8,
  "tool": "run_command",
  "args": {"command": "pytest tests/test_normalize.py"},
  "result": {
    "exit_code": 0,
    "stdout": "===== 6 passed in 0.02s =====",
    "duration": 1.85
  }
}
Every tool call is logged. Every decision is traceable.
Error handling and retries
I don't do explicit "retry three times" logic.
The retry loop is implicit.
The model retries because it sees structured execution results like:
- exit_code
- timed_out
- stdout (often includes tracebacks)
- duration
When a command fails, the model gets the failure, thinks, changes code, and reruns.
This works well for:
- Syntax errors
- Missing imports
- Obvious mistakes in logic
- Test failures with clear assertions
It works less well when:
- Requirements are ambiguous
- The failure needs domain knowledge
- The fix is large and multi-step
That's not a model problem. That's a harness ergonomics problem, because the agent still lacks better tools like patch editing, diffing, and memory.
Logs: the feature you don't appreciate until you need it
I log every tool call to a JSONL file inside the workspace.
That log includes:
- What tool was called
- Arguments
- The full result payload
- Timestamps
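In code, that's roughly one append per tool call (the agent_log.jsonl filename is just what I picked):

import json
import time
from pathlib import Path

def log_tool_call(workspace_root: Path, iteration: int,
                  tool: str, args: dict, result: dict) -> None:
    """Append one tool call as a single JSON line inside the workspace."""
    entry = {
        "timestamp": time.time(),
        "iteration": iteration,
        "tool": tool,
        "args": args,
        "result": result,
    }
    with (workspace_root / "agent_log.jsonl").open("a") as f:
        f.write(json.dumps(entry) + "\n")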
It sounds boring, but it changes everything.
When something goes wrong, you can answer:
- What command actually ran
- What the model actually saw
- Whether it tried to install dependencies
- Whether it timed out
- What changed right before it broke
Logs turn agent runs into something you can debug, replay, and trust.
What broke (and what I learned)
Let me be honest about the failures, because this is where the actual lessons are.
Problem 1: Docker dependency management
Initially, I tried letting the agent install pytest on every run. Bad idea.
Fresh containers mean fresh installs every time. The agent would waste 2-3 iterations just getting its environment ready before doing actual work.
The fix: Pre-install common dependencies (pytest, pip tools) in the Docker image. Every container starts with them available.
This is a general pattern: if you know the agent will need something frequently, bake it into the base image.
Problem 2: Context growth
Each iteration adds messages:
system → user → assistant → tool → assistant → tool → …
On longer runs you eventually run into context limits or degraded performance.
What helped in practice:
- Keep max_iterations sane (12-15)
- Make tasks more specific
- Reduce output size (truncate stdout to 20k chars)
The real fix (not fully implemented yet) is proper context management (a rough sketch follows this list):
- Sliding windows
- Summarizing older tool results
- Storing full logs separately from working context
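Here's a sketch of both the stopgap (truncation) and a crude sliding window. The numbers are arbitrary, and a real implementation has to keep each assistant tool_calls message paired with its tool results or the API will reject the conversation:

MAX_TOOL_OUTPUT = 20_000   # characters of stdout/stderr kept per tool result
MAX_MESSAGES = 40          # crude sliding window over the conversation

def truncate_output(text: str, limit: int = MAX_TOOL_OUTPUT) -> str:
    """Keep tool output bounded so it doesn't blow up the context."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n... [truncated {len(text) - limit} chars]"

def trim_context(messages: list) -> list:
    """Keep the system prompt plus the most recent messages, drop the middle.
    Warning: naive trimming can split an assistant tool_calls message from its
    tool results; a real version trims at safe boundaries."""
    if len(messages) <= MAX_MESSAGES:
        return messages
    system, rest = messages[:1], messages[1:]
    return system + rest[-(MAX_MESSAGES - 1):]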
Problem 3: The agent can be too polite
Sometimes the model hits a wall and says "can't do that" instead of pushing.
Example: missing dependency, or a test failure it doesn't understand, and it decides to stop.
The fix is not "better model." It's better scaffolding:
- Stronger system prompt about persistence (example after this list)
- Better tools (like search within workspace, apply patch)
- Clearer success criteria (tests passing)
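For the prompt piece, the persistence nudge can be as blunt as a few extra lines in the system prompt. Illustrative wording, not a magic incantation:

PERSISTENCE_PROMPT = """
When a command fails, do not give up or ask the user what to do next.
Read the error output, form a hypothesis, change the code or the command,
and run it again. Only call task_complete when the tests pass, or when you
can explain exactly what is blocking you after several genuine attempts.
"""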
Problem 4: Networking policy
Some tasks need internet, often just for installing additional deps.
If network is on by default, the agent will use it constantly.
If it's off forever, the agent is stuck.
So I made it explicit: network is disabled by default, but the model can request it via network_enabled=true on run_command.
That's a good compromise because it forces you to see when and why the agent touches the network. In the logs, you can track every network-enabled command.
Beyond the Sandbox - Adding External Tools
The Integration Problem
When you start thinking about adding external tools, you realize the scope pretty quickly.
If you want your agent to create GitHub issues, you need:
- OAuth flow for GitHub
- Token management and refresh
- API wrapper for GitHub's REST API
- Error handling for rate limits
- Updates when GitHub's API changes
Now multiply that by every service you want: Slack, Gmail, Linear, Notion, databases...
You're looking at months of integration work. Or you could use something like Composio that already built those 1000+ integrations.
The important thing is that external tools don't change the agent loop, they stress it. Authentication failures, network timeouts, and rate limits are just another form of "reality" the harness has to surface back to the model. ToolRouter fits cleanly because it preserves the same contract: the model requests, the harness executes, and raw results come back.
What Composio's ToolRouter Actually Is
Composio ToolRouter is an integration layer that gives your agent access to external tools through a simple API.
The core concept is sessions. Each session:
- Is scoped to a specific user
- Manages that user's connected apps
- Provides tools the agent can call
- Handles authentication automatically
Here's the basic setup:
from composio import Composio
# Initialize Composio
composio = Composio(api_key=COMPOSIO_API_KEY)
# Create a session for your user
session = composio.create(user_id="alice@company.com")
# Get tools for this session
tools = session.tools()
That's it. Now tools contains the ToolRouter meta-tools scoped to this user.
How ToolRouter Actually Works
Here's what I learned when integrating it: ToolRouter doesn't give you individual tools upfront.
Instead, it provides 6 meta-tools that handle everything:
- COMPOSIO_SEARCH_TOOLS - Searches for relevant tools based on task description
- COMPOSIO_MULTI_EXECUTE_TOOL - Executes discovered tools
- COMPOSIO_MANAGE_CONNECTIONS - Handles authentication flows
- COMPOSIO_REMOTE_WORKBENCH - Processes large responses
- COMPOSIO_REMOTE_BASH_TOOL - Runs bash commands remotely
- COMPOSIO_SEARCH_ENTITIES - Searches for entities in connected apps
So instead of calling GITHUB_CREATE_ISSUE directly, the workflow is:
- Agent calls COMPOSIO_SEARCH_TOOLS with "create a GitHub issue"
- ToolRouter finds the right tool
- Agent calls COMPOSIO_MULTI_EXECUTE_TOOL with parameters
- ToolRouter executes it using the user's connected GitHub account
This is actually smarter than loading every possible tool. The meta-tools handle discovery, authentication, and execution dynamically.
Two Ways to Use ToolRouter
Composio gives you two integration patterns:
Option 1: As Native Tools (What I Used)
You get the 6 meta-tools and pass them to your agent framework:
# Create session
session = composio.create(user_id="user_123")
# Get meta-tools
tools = session.tools()
# Pass to your agent (OpenAI, Anthropic, etc.)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    tools=tools,  # The 6 ToolRouter meta-tools
)
The agent uses these meta-tools to discover and execute what it needs.
Option 2: As MCP Server
If you're using an MCP-compatible framework (like Claude Agent SDK), you can connect via the MCP protocol:
# Create session
session = composio.create(user_id="user_123")
# Get MCP endpoint
mcp_url = session.mcp.url
mcp_headers = session.mcp.headers
# Configure your MCP client
# The MCP server handles tool routing automatically
The MCP approach is more dynamic but adds complexity. For the harness, native tools made more sense.
For this harness, I wanted to keep the execution boundary explicit in my own code rather than delegate it to an MCP runtime.
How It Integrates With the Harness
The integration pattern is straightforward:
Agent Loop
├─→ Local Tools (files, Docker commands)
└─→ Composio Meta-Tools (discovery, execution)
↓
Execute tool
├─→ Local? Run in Docker
└─→ Composio? Route to ToolRouter
↓
Results back to agent
The key is combining both tool types:
class AgentExecutor:
    def __init__(self, user_id, enable_composio=False):
        self.workspace = Workspace()
        self.sandbox = DockerSandbox(self.workspace)

        # Composio setup
        self.composio_session = None
        if enable_composio:
            composio = Composio(api_key=COMPOSIO_API_KEY)
            self.composio_session = composio.create(user_id=user_id)

    def get_all_tools(self):
        """Combine local and Composio tools"""
        tools = list(LOCAL_TOOLS)  # copy so we don't mutate the shared list
        if self.composio_session:
            composio_tools = self.composio_session.tools()
            tools.extend(composio_tools)  # Add the 6 meta-tools
        return tools
The agent doesn't know or care where tools come from. It just calls them. We handle the routing.
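Inside execute_tool, that routing is a one-line decision. A sketch, assuming execute_local_tool stands in for the local dispatcher shown earlier and composio_session.execute is a placeholder for whatever execution call the Composio SDK actually exposes (check their docs for the real method):

def execute_tool(self, name: str, args: dict) -> dict:
    """Route a tool call to either the local sandbox or Composio (sketch)."""
    if name.startswith("COMPOSIO_"):
        # Placeholder call: substitute the real execution method from the Composio SDK
        return self.composio_session.execute(name, args)
    return execute_local_tool(self.workspace, self.sandbox, name, args)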
Authentication Flow
The authentication part is what makes ToolRouter useful. Here's how it works:
First time a user needs an app:
- Agent tries to use a tool (searches for "GitHub issue")
- User hasn't connected GitHub yet
- ToolRouter (via COMPOSIO_MANAGE_CONNECTIONS) returns an auth URL
- User clicks the URL and completes OAuth
- Agent retries, now it works
After that:
- Composio manages the tokens
- Handles refresh automatically
- Agent just calls the meta-tools
You can pre-connect apps for users via the Composio dashboard (https://app.composio.dev/apps), or let them authenticate on-demand.
For the harness, pre-connecting apps is smoother. Less interruption during agent runs.
A Simple Example
Let me show you what a workflow looks like with ToolRouter.
Task: "Create a calculator module, test it, then create a GitHub issue."
The agent:
- Writes calc.py with functions (local tools)
- Writes test_calc.py (local tools)
- Runs pytest (local tool: Docker execution)
- Sees tests pass
- Calls COMPOSIO_SEARCH_TOOLS to find GitHub tools
- Calls COMPOSIO_MULTI_EXECUTE_TOOL to create the issue
- Returns the issue URL
Steps 5-6 are where ToolRouter shines. The agent doesn't need to know GitHub's API. It just describes what it wants, and ToolRouter handles the rest.
What's Available
Through ToolRouter's meta-tools, you get access to many integrations:
Development:
- GitHub: create issues, PRs, manage repos
- GitLab, Bitbucket
- Jira, Linear: task management
Communication:
- Slack: send messages, create channels
- Discord, Teams
- Gmail: send/read emails
Productivity:
- Notion, Google Docs
- Calendar, Drive
- Trello, Asana
Databases:
- PostgreSQL, MongoDB, MySQL
- Airtable, Supabase
The agent discovers these through COMPOSIO_SEARCH_TOOLS as needed.
What I Learned
What works well:
- The meta-tool approach is cleaner than loading individual tools
- Authentication handling is automatic
- Discovery works - agent finds the right tools
- Error messages are clear (e.g., "connect this app first")
What needs more work:
- Execution can be slow (network round trips)
- Error handling for transient failures could be better
- Cost tracking per workflow
- Pre-checking connection status before trying operations
The goal isn't to integrate everything. It's to integrate the 3-5 apps that matter for your specific workflows.
The Honest Assessment
Composio solves a real problem. Building and maintaining multiple integrations yourself isn't realistic.
But it's not magic:
- You still need to understand how the meta-tools work
- Authentication setup takes time (connecting apps)
- External APIs add latency and failure modes
- Costs go up (more API calls)
The value proposition: trade integration complexity for a simpler API. Instead of learning OAuth flows for 10 services, you learn how ToolRouter's 6 meta-tools work.
For most cases, that's the right trade.
Wrapping Up
At this point, the agent isn't just writing code in isolation. It can change a codebase, verify behavior, and take real actions on behalf of a user, all through explicit, inspectable boundaries.
That's the pattern that keeps showing up: models reason, harnesses execute, and tools expose reality. Once you get that separation right, adding integrations stops being scary and starts being composable.
The result: an agent that can write code, test it locally, and take actions through connected services.
It's not perfect. It's not production-ready. But it's functional and extensible.
The integration pattern works: the agent can discover and call external tools through ToolRouter's meta-tools. I've tested the basic workflow locally, and extending it to GitHub issues, Slack notifications, and other integrations is straightforward from here. The APIs are manageable. The possibilities are interesting.
That's where this is at right now.