TL;DR
I worked on a coding agent harness that executes AI-generated code in Docker sandboxes, and learned that the real bottleneck isn't code generation, but everything after: environment setup, error handling, and integration with external services.
While this works great for local development, connecting to tools like GitHub, Slack, or databases would have meant building dozens of API integrations, managing OAuth flows, and handling edge cases. That's where Composio's ToolRouter came in: instead of building integrations myself, I now get tools through 6 simple meta-tools, with authentication and execution handled automatically.
The result? An agent that can write code, test it locally, create GitHub issues, and notify Slack, all through a single, observable execution loop. Turns out the coolest part wasn't just watching AI write code, but watching it interact with the real world safely and transparently.
Note: This is a learning project where I explored agent execution patterns and integration approaches. The Composio integration shown here demonstrates the concept, though a production implementation would need additional error handling, cost controls, and testing.
Why Build a Coding Agent Harness
I didn’t start thinking about a coding agent harness because existing tools are bad. Quite the opposite. Tools like Claude Code and Codex are excellent, and I use them regularly for debugging and iteration.
What sparked this work wasn’t dissatisfaction, but curiosity.
While using these tools, I kept noticing that a lot of the most important work happens outside the model: code execution, error capture, retries, environment setup, and tool integration. These systems handle that complexity well, but largely invisibly. You get an answer, a fix, or a result, but not always a clear view into how the system arrived there.
I wanted to understand that execution loop more deeply.
Not just whether it worked, but:
- what code actually ran
- what error was thrown
- what context was passed back to the model
- what changed between attempts
- and where control really lives when things fail
Most modern AI coding tools optimize for speed and convenience. They hide complexity to make workflows feel smooth. That’s usually the right tradeoff. But it also means the execution layer, the part where code meets reality, is opaque.
The Real Bottleneck Isn’t Code Generation
Writing code is no longer the hard part. Models are already very good at producing first drafts, boilerplate, and obvious fixes.
The harder part is everything that happens after: running code in a real environment, capturing raw errors, installing missing dependencies, handling environment variables, and retrying with proper context when something fails.
Humans still do most of this manually. We run the code, read the error, decide what matters, and feed it back into the model. That makes the human the slowest and most expensive part of the loop.
If an AI system can write code but cannot directly observe failures and respond to them, it isn’t really an agent. It’s a suggestion engine with a human acting as the executor.
What I Mean by a “Harness”
A coding agent harness isn’t a model and it isn’t a framework. It’s the infrastructure around the model.
It’s the execution loop that:
- runs code in a controlled environment
- captures raw outputs and errors
- feeds that reality back to the model
- applies guardrails around what the system can touch
- and makes every step visible and debuggable
The language model provides reasoning and code.
The harness provides execution, feedback, and constraints.
Together, they turn “generate code” into “try, fail, observe, and improve.”
Why This Matters
This isn’t about replacing developers or claiming AI writes perfect code.
It’s about removing the least valuable parts of development work: rerunning commands, fixing missing imports, interpreting test failures, and repeating the same debug loop over and over.
By making execution and feedback explicit, AI systems become easier to trust, easier to debug, and easier to reason about.
That’s the motivation behind this work: not building a better model, but understanding and shaping the execution layer that makes these tools actually useful.
And honestly, the first time you watch an agent write code, see it fail, understand what broke, and fix itself, that’s genuinely fucking cool. Not because it’s magic, but because you can finally see the entire loop working.
Building the Harness
The first part was the motivation. Next comes the part where the idea becomes a real execution loop you can watch, debug, and trust.
Before writing any code, I forced myself to answer one question:
What does an "agent" actually do, step by step, when you strip away the UI?
Here's the loop in plain English:
- Task comes in
- Model decides what to do next
- Model asks to use a tool
- The harness runs something in the real world
- The harness returns raw results
- Model updates its plan
- Repeat until done or we hit a cap
The important detail is the one people forget:
The model never touches your machine.
It only sees whatever the harness returns.
That boundary is the whole point. The harness is the interface between intelligence and execution. If you control that interface, you control the agent.
The Architecture
This harness has three main parts.
1) The workspace
A fresh directory on the host for every run.
This is where the agent writes and reads files. Think of it like a tiny project repo the agent can manipulate.
That choice matters because it's the difference between:
- "run a snippet of Python"
- and "build and iterate on an actual codebase"
2) The Docker sandbox
Every command runs inside a Docker container with the workspace mounted at /workspace.
That gives me isolation, repeatability, and a place to put guardrails.
By default:
- Resource limits are enforced (CPU + memory)
- Commands have a hard timeout
- Networking is disabled unless explicitly enabled
If you are going to let a model execute arbitrary code, you need a safety story. Docker isn't perfect, but it is an enormous step up from running things directly on the host.
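Concretely, the sandbox is little more than a docker run invocation with the guardrails baked in. Here's a minimal sketch of the runner, assuming the custom image below is tagged agent-sandbox (the tag and the specific limits are placeholders, not the exact implementation):

import subprocess

def run_in_sandbox(workspace_dir: str, command: str,
                   timeout: int = 60, network_enabled: bool = False) -> dict:
    """Run one command inside a throwaway container with the workspace mounted."""
    docker_cmd = [
        "docker", "run", "--rm",
        "--memory", "512m", "--cpus", "1",                      # resource caps
        "--network", "bridge" if network_enabled else "none",   # no network unless asked
        "-v", f"{workspace_dir}:/workspace",
        "-w", "/workspace",
        "agent-sandbox",                                         # image tag: assumption
        "bash", "-lc", command,
    ]
    try:
        proc = subprocess.run(docker_cmd, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode, "stdout": proc.stdout,
                "stderr": proc.stderr, "timed_out": False}
    except subprocess.TimeoutExpired:
        # This kills the docker client; a hardened version would also docker-kill
        # a named container so nothing keeps running in the background.
        return {"exit_code": -1, "stdout": "", "stderr": "", "timed_out": True}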
The Image Setup:
I use a custom Docker image with pytest pre-installed:
FROM python:3.11-slim
RUN pip install --no-cache-dir pytest
WORKDIR /workspace
CMD ["bash"]
This solves a critical problem: if the agent has to install pytest every run, it wastes iterations and API calls. Pre-installing common dependencies in the image means every container starts ready to work.
3) The tool contract
The model does not run anything directly. It calls tools. The harness executes those tools and returns structured results.
I kept the tool surface area small but real:
- list_files()
- read_file(path)
- write_file(path, content)
- run_command(command, timeout, network_enabled)
- task_complete(summary)
That is enough for real developer workflows:
- Create modules
- Write tests
- Run pytest
- Read failures
- Patch code
- Rerun until green
Why Docker
Yes, I could have used subprocess.run() and called it a day.
But that's not "agent infrastructure." That's handing a model a loaded gun.
You do not want a tool-using model to have direct access to:
- Your filesystem
- Your SSH keys
- Your environment variables
- Your network
- Your ability to fork-bomb your laptop into a space heater
Docker gives me a baseline set of protections:
- Process isolation: Code runs in a container, not on my host
- Resource limits: CPU and memory caps prevent obvious abuse
- Network control: Networking can be off by default
- Reproducibility: Every run starts from a known image
The tradeoff is complexity. It adds friction and edge cases. But for this category of problem, it's the right trade.
The Tools
I used to think tool calling was the "agent" part.
It's not. Tool calling is just the API plumbing that makes the loop possible.
The model is effectively saying: "Please run this for me, and tell me what happened."
The harness is the one doing the doing.
Tool: write_file
This is how the model creates and edits code. It writes directly into the workspace.
In the simplest form, this tool is just an interface to a safe path resolver plus a size limit so the agent can't spam huge files.
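A stripped-down version looks roughly like this; the 200 KB cap is arbitrary and the return shape is just what my harness happens to use:

from pathlib import Path

MAX_FILE_BYTES = 200_000  # arbitrary cap so the agent can't dump huge blobs

def write_file(workspace_root: Path, rel_path: str, content: str) -> dict:
    """Write a file inside the workspace, refusing oversized or escaping paths."""
    data = content.encode("utf-8")
    if len(data) > MAX_FILE_BYTES:
        return {"ok": False, "error": f"file larger than {MAX_FILE_BYTES} bytes"}

    # Safe path resolution: reject anything that resolves outside the workspace
    target = (workspace_root / rel_path).resolve()
    if not target.is_relative_to(workspace_root.resolve()):
        return {"ok": False, "error": f"path escapes workspace: {rel_path}"}

    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return {"ok": True, "path": rel_path, "bytes": len(data)}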
Tool: run_command
This is the core tool. It's how the model runs tests, lints, scripts, whatever.
Two design choices here ended up being surprisingly important:
- Timeouts are enforced. If a command hangs, it dies.
- Networking is off by default. If the agent wants internet (pip install, curl, etc.), it has to explicitly ask for it.
That single boolean makes the system easier to reason about, and it's the kind of guardrail you never see from the outside in most agent products.
Here's what the tool schema looks like:
{
  "name": "run_command",
  "description": "Run a shell command in Docker in /workspace. Networking is off by default.",
  "parameters": {
    "type": "object",
    "properties": {
      "command": {"type": "string"},
      "timeout": {"type": "integer", "description": "Seconds"},
      "network_enabled": {"type": "boolean", "description": "Enable for pip install, etc."},
      "description": {"type": "string", "description": "What this is trying to do"}
    },
    "required": ["command", "description"]
  }
}
Tool: task_complete
This is the stop button. Without it, the harness just keeps looping until it hits max_iterations, even if the task is obviously done.
For an agent loop, you need a clear termination condition.
How tool calling actually works
The mechanics are simple:
- You define tools using a JSON schema
- You send them with your API request
- The model can respond with tool calls rather than plain text
- You execute the tool calls in your harness
- You send the results back as tool messages
- The model sees those results and continues
So the conversation looks like:
User: "Build X"
Model: calls write_file and run_command
Harness: executes and returns real output and errors
Model: uses that reality to decide what to do next
Repeat
The key point again:
The model is not executing. It's requesting.
You are the executor. The harness is the gate.
The code that matters
Most of the harness is glue code. The heart is the loop:
for i in range(max_iterations):
    # Get model response with tools
    resp = client.chat.completions.create(
        model=MODEL,
        messages=messages,
        tools=TOOLS,
        tool_choice="auto",
    )
    msg = resp.choices[0].message

    # If the model wants to use tools
    if msg.tool_calls:
        # Store the assistant message, including its tool calls
        messages.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
        })

        # Execute each tool call
        for tc in msg.tool_calls:
            name = tc.function.name
            args = json.loads(tc.function.arguments)
            result = execute_tool(name, args)

            # Add the result back into the conversation
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": json.dumps(result),
            })

            # Check for completion
            if name == "task_complete":
                return result
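The execute_tool call in that loop is just a dispatcher. A simplified sketch, building on the run_in_sandbox and write_file sketches above (read_file, list_files, and the WORKSPACE_ROOT global are the analogous pieces, not shown here):

def execute_tool(name: str, args: dict) -> dict:
    """Route a model-requested tool call to the right harness function."""
    if name == "write_file":
        return write_file(WORKSPACE_ROOT, args["path"], args["content"])
    if name == "read_file":
        return read_file(WORKSPACE_ROOT, args["path"])
    if name == "list_files":
        return list_files(WORKSPACE_ROOT)
    if name == "run_command":
        return run_in_sandbox(
            str(WORKSPACE_ROOT),
            args["command"],
            timeout=args.get("timeout", 60),
            network_enabled=args.get("network_enabled", False),
        )
    if name == "task_complete":
        return {"ok": True, "summary": args.get("summary", "")}
    return {"ok": False, "error": f"unknown tool: {name}"}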
In practice, the "agent" behavior comes from one thing:
The model sees raw stdout, exit codes, and failures, and it learns what to do next from those signals.
It's the difference between:
- "I think this should work"
- and "I ran it, here is the traceback"
A Real Run
Here's what actually happened when I gave it a real task:
Task: Create a Python text normalization utility with pytest tests
Iteration 1: Create project structure
→ write_file("textutils/core.py")
→ write_file("tests/test_normalize.py")
✓ Files created
The agent writes a normalize() function and 6 test cases covering edge cases like empty strings, multiple spaces, tabs, and newlines.
Iteration 2: Run tests (first attempt)
→ run_command("pytest tests/test_normalize.py")
✗ exit_code=2
Error: "ERROR: file or directory not found: tests/test_normalize.py"
Pytest can't find the module. The agent doesn't understand why yet.
Iterations 3-7: The struggle
→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2
→ run_command("pytest --pythonpath=/workspace")
✗ exit_code=4
→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2
This is where you see real problem-solving behavior. The agent is stuck trying different pytest configuration flags, but that's not actually the problem.
Five iterations of trying variations before it figures out the real issue.
Iteration 8: The breakthrough
→ write_file("textutils/__init__.py", "")
→ run_command("pytest tests/test_normalize.py")
✓ exit_code=0
===== test session starts =====
collected 6 items
tests/test_normalize.py ...... [100%]
===== 6 passed in 0.02s =====
The agent realizes textutils needs to be a proper Python package. Adds the __init__.py file. All tests pass.
Iteration 9: Done
→ task_complete("Created a text normalization utility with tests. All tests passed successfully.")
Total time: ~25 seconds. Iterations: 9.
What This Teaches You
The satisfying part isn't just that it worked; it's how it worked.
The agent:
- Created real files
- Ran real commands
- Saw real failures
- Got stuck for a bit (iterations 3-7)
- Had a realization (iteration 8)
- Fixed the actual problem
- Verified success
That's the loop working. Not magic, not cherry-picked—just a feedback cycle that eventually converges on the right answer.
And when something goes wrong, you can replay it:
{
  "iteration": 8,
  "tool": "run_command",
  "args": {"command": "pytest tests/test_normalize.py"},
  "result": {
    "exit_code": 0,
    "stdout": "===== 6 passed in 0.02s =====",
    "duration": 1.85
  }
}
Every tool call is logged. Every decision is traceable.
Error handling and retries
I don't do explicit "retry three times" logic.
The retry loop is implicit.
The model retries because it sees structured execution results like:
- exit_code
- timed_out
- stdout (often includes tracebacks)
- duration
When a command fails, the model gets the failure, thinks, changes code, and reruns.
This works well for:
- Syntax errors
- Missing imports
- Obvious mistakes in logic
- Test failures with clear assertions
It works less well when:
- Requirements are ambiguous
- The failure needs domain knowledge
- The fix is large and multi-step
That's not a model problem. That's a harness ergonomics problem, because the agent still lacks better tools like patch editing, diffing, and memory.
Logs: the feature you don't appreciate until you need it
I log every tool call to a JSONL file inside the workspace.
That log includes:
- What tool was called
- Arguments
- The full result payload
- Timestamps
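In code, that's roughly one append per tool call (the agent_log.jsonl filename is just what I picked):

import json
import time
from pathlib import Path

def log_tool_call(workspace_root: Path, iteration: int,
                  tool: str, args: dict, result: dict) -> None:
    """Append one tool call as a single JSON line inside the workspace."""
    entry = {
        "timestamp": time.time(),
        "iteration": iteration,
        "tool": tool,
        "args": args,
        "result": result,
    }
    with (workspace_root / "agent_log.jsonl").open("a") as f:
        f.write(json.dumps(entry) + "\n")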
It sounds boring, but it changes everything.
When something goes wrong, you can answer:
- What command actually ran
- What the model actually saw
- Whether it tried to install dependencies
- Whether it timed out
- What changed right before it broke
Logs turn agent runs into something you can debug, replay, and trust.
What broke (and what I learned)
Let me be honest about the failures, because this is where the actual lessons are.
Problem 1: Docker dependency management
Initially, I tried letting the agent install pytest on every run. Bad idea.
Fresh containers mean fresh installs every time. The agent would waste 2-3 iterations just getting its environment ready before doing actual work.
The fix: Pre-install common dependencies (pytest, pip tools) in the Docker image. Every container starts with them available.
This is a general pattern: if you know the agent will need something frequently, bake it into the base image.
Problem 2: Context growth
Each iteration adds messages:
system → user → assistant → tool → assistant → tool → …
On longer runs you eventually run into context limits or degraded performance.
What helped in practice:
- Keep max_iterations sane (12-15)
- Make tasks more specific
- Reduce output size (truncate stdout to 20k chars)
The real fix (not fully implemented yet) is proper context management (a rough sketch follows this list):
- Sliding windows
- Summarizing older tool results
- Storing full logs separately from working context
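Here's a sketch of both the stopgap (truncation) and a crude sliding window. The numbers are arbitrary, and a real implementation has to keep each assistant tool_calls message paired with its tool results or the API will reject the conversation:

MAX_TOOL_OUTPUT = 20_000   # characters of stdout/stderr kept per tool result
MAX_MESSAGES = 40          # crude sliding window over the conversation

def truncate_output(text: str, limit: int = MAX_TOOL_OUTPUT) -> str:
    """Keep tool output bounded so it doesn't blow up the context."""
    if len(text) <= limit:
        return text
    return text[:limit] + f"\n... [truncated {len(text) - limit} chars]"

def trim_context(messages: list) -> list:
    """Keep the system prompt plus the most recent messages, drop the middle.
    Warning: naive trimming can split an assistant tool_calls message from its
    tool results; a real version trims at safe boundaries."""
    if len(messages) <= MAX_MESSAGES:
        return messages
    system, rest = messages[:1], messages[1:]
    return system + rest[-(MAX_MESSAGES - 1):]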
Problem 3: The agent can be too polite
Sometimes the model hits a wall and says "can't do that" instead of pushing.
Example: missing dependency, or a test failure it doesn't understand, and it decides to stop.
The fix is not "better model." It's better scaffolding:
- Stronger system prompt about persistence (example after this list)
- Better tools (like search within workspace, apply patch)
- Clearer success criteria (tests passing)
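For the prompt piece, the persistence nudge can be as blunt as a few extra lines in the system prompt. Illustrative wording, not a magic incantation:

PERSISTENCE_PROMPT = """
When a command fails, do not give up or ask the user what to do next.
Read the error output, form a hypothesis, change the code or the command,
and run it again. Only call task_complete when the tests pass, or when you
can explain exactly what is blocking you after several genuine attempts.
"""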
Problem 4: Networking policy
Some tasks need internet, often just for installing additional deps.
If network is on by default, the agent will use it constantly.
If it's off forever, the agent is stuck.
So I made it explicit: network is disabled by default, but the model can request it via network_enabled=true on run_command.
That's a good compromise because it forces you to see when and why the agent touches the network. In the logs, you can track every network-enabled command.
Beyond the Sandbox - Adding External Tools
The Integration Problem
When you start thinking about adding external tools, you realize the scope pretty quickly.
If you want your agent to create GitHub issues, you need:
- OAuth flow for GitHub
- Token management and refresh
- API wrapper for GitHub's REST API
- Error handling for rate limits
- Updates when GitHub's API changes
Now multiply that by every service you want: Slack, Gmail, Linear, Notion, databases...
You're looking at months of integration work. Or you could use something like Composio that already built those 1000+ integrations.
The important thing is that external tools don't change the agent loop, they stress it. Authentication failures, network timeouts, and rate limits are just another form of "reality" the harness has to surface back to the model. ToolRouter fits cleanly because it preserves the same contract: the model requests, the harness executes, and raw results come back.
What Composio's ToolRouter Actually Is
Composio ToolRouter is an integration layer that gives your agent access to external tools through a simple API.
The core concept is sessions. Each session:
- Is scoped to a specific user
- Manages that user's connected apps
- Provides tools the agent can call
- Handles authentication automatically
Here's the basic setup:
from composio import Composio
# Initialize Composio
composio = Composio(api_key=COMPOSIO_API_KEY)
# Create a session for your user
session = composio.create(user_id="alice@company.com")
# Get tools for this session
tools = session.tools()
That's it. Now tools contains the ToolRouter meta-tools scoped to this user.
How ToolRouter Actually Works
Here's what I learned when integrating it: ToolRouter doesn't give you individual tools upfront.
Instead, it provides 6 meta-tools that handle everything:
- COMPOSIO_SEARCH_TOOLS - Searches for relevant tools based on task description
- COMPOSIO_MULTI_EXECUTE_TOOL - Executes discovered tools
- COMPOSIO_MANAGE_CONNECTIONS - Handles authentication flows
- COMPOSIO_REMOTE_WORKBENCH - Processes large responses
- COMPOSIO_REMOTE_BASH_TOOL - Runs bash commands remotely
- COMPOSIO_SEARCH_ENTITIES - Searches for entities in connected apps
So instead of calling GITHUB_CREATE_ISSUE directly, the workflow is:
- Agent calls COMPOSIO_SEARCH_TOOLS with "create a GitHub issue"
- ToolRouter finds the right tool
- Agent calls COMPOSIO_MULTI_EXECUTE_TOOL with parameters
- ToolRouter executes it using the user's connected GitHub account
This is actually smarter than loading every possible tool. The meta-tools handle discovery, authentication, and execution dynamically.
Two Ways to Use ToolRouter
Composio gives you two integration patterns:
Option 1: As Native Tools (What I Used)
You get the 6 meta-tools and pass them to your agent framework:
# Create session
session = composio.create(user_id="user_123")
# Get meta-tools
tools = session.tools()
# Pass to your agent (OpenAI, Anthropic, etc.)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[...],
    tools=tools,  # The 6 ToolRouter meta-tools
)
The agent uses these meta-tools to discover and execute what it needs.
Option 2: As MCP Server
If you're using an MCP-compatible framework (like Claude Agent SDK), you can connect via the MCP protocol:
# Create session
session = composio.create(user_id="user_123")
# Get MCP endpoint
mcp_url = session.mcp.url
mcp_headers = session.mcp.headers
# Configure your MCP client
# The MCP server handles tool routing automatically
The MCP approach is more dynamic but adds complexity. For the harness, native tools made more sense.
For this harness, I wanted to keep the execution boundary explicit in my own code rather than delegate it to an MCP runtime.
How It Integrates With the Harness
The integration pattern is straightforward:
Agent Loop
├─→ Local Tools (files, Docker commands)
└─→ Composio Meta-Tools (discovery, execution)
↓
Execute tool
├─→ Local? Run in Docker
└─→ Composio? Route to ToolRouter
↓
Results back to agent
The key is combining both tool types:
class AgentExecutor:
    def __init__(self, user_id, enable_composio=False):
        self.workspace = Workspace()
        self.sandbox = DockerSandbox(self.workspace)

        # Composio setup
        self.composio_session = None
        if enable_composio:
            composio = Composio(api_key=COMPOSIO_API_KEY)
            self.composio_session = composio.create(user_id=user_id)

    def get_all_tools(self):
        """Combine local and Composio tools"""
        tools = list(LOCAL_TOOLS)  # copy so we don't mutate the shared list
        if self.composio_session:
            composio_tools = self.composio_session.tools()
            tools.extend(composio_tools)  # Add the 6 meta-tools
        return tools
The agent doesn't know or care where tools come from. It just calls them. We handle the routing.
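Inside execute_tool, that routing is a one-line decision. A sketch, assuming execute_local_tool stands in for the local dispatcher shown earlier and composio_session.execute is a placeholder for whatever execution call the Composio SDK actually exposes (check their docs for the real method):

def execute_tool(self, name: str, args: dict) -> dict:
    """Route a tool call to either the local sandbox or Composio (sketch)."""
    if name.startswith("COMPOSIO_"):
        # Placeholder call: substitute the real execution method from the Composio SDK
        return self.composio_session.execute(name, args)
    return execute_local_tool(self.workspace, self.sandbox, name, args)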
Authentication Flow
The authentication part is what makes ToolRouter useful. Here's how it works:
First time a user needs an app:
- Agent tries to use a tool (searches for "GitHub issue")
- User hasn't connected GitHub yet
- ToolRouter (via COMPOSIO_MANAGE_CONNECTIONS) returns an auth URL
- User clicks the URL and completes OAuth
- Agent retries, now it works
After that:
- Composio manages the tokens
- Handles refresh automatically
- Agent just calls the meta-tools
You can pre-connect apps for users via the Composio dashboard (https://app.composio.dev/apps), or let them authenticate on-demand.
For the harness, pre-connecting apps is smoother. Less interruption during agent runs.
A Simple Example
Let me show you what a workflow looks like with ToolRouter.
Task: "Create a calculator module, test it, then create a GitHub issue."
The agent:
- Writes calc.py with functions (local tools)
- Writes test_calc.py (local tools)
- Runs pytest (local tool: Docker execution)
- Sees tests pass
- Calls COMPOSIO_SEARCH_TOOLS to find GitHub tools
- Calls COMPOSIO_MULTI_EXECUTE_TOOL to create the issue
- Returns the issue URL
Steps 5-6 are where ToolRouter shines. The agent doesn't need to know GitHub's API. It just describes what it wants, and ToolRouter handles the rest.
What's Available
Through ToolRouter's meta-tools, you get access to many integrations:
Development:
- GitHub: create issues, PRs, manage repos
- GitLab, Bitbucket
- Jira, Linear: task management
Communication:
- Slack: send messages, create channels
- Discord, Teams
- Gmail: send/read emails
Productivity:
- Notion, Google Docs
- Calendar, Drive
- Trello, Asana
Databases:
- PostgreSQL, MongoDB, MySQL
- Airtable, Supabase
The agent discovers these through COMPOSIO_SEARCH_TOOLS as needed.
What I Learned
What works well:
- The meta-tool approach is cleaner than loading individual tools
- Authentication handling is automatic
- Discovery works - agent finds the right tools
- Error messages are clear (e.g., "connect this app first")
What needs more work:
- Execution can be slow (network round trips)
- Error handling for transient failures could be better
- Cost tracking per workflow
- Pre-checking connection status before trying operations
The goal isn't to integrate everything. It's to integrate the 3-5 apps that matter for your specific workflows.
The Honest Assessment
Composio solves a real problem. Building and maintaining multiple integrations yourself isn't realistic.
But it's not magic:
- You still need to understand how the meta-tools work
- Authentication setup takes time (connecting apps)
- External APIs add latency and failure modes
- Costs go up (more API calls)
The value proposition: trade integration complexity for a simpler API. Instead of learning OAuth flows for 10 services, you learn how ToolRouter's 6 meta-tools work.
For most cases, that's the right trade.
Wrapping Up
At this point, the agent isn't just writing code in isolation. It can change a codebase, verify behavior, and take real actions on behalf of a user, all through explicit, inspectable boundaries.
That's the pattern that keeps showing up: models reason, harnesses execute, and tools expose reality. Once you get that separation right, adding integrations stops being scary and starts being composable.
The result: an agent that can write code, test it locally, and take actions through connected services.
It's not perfect. It's not production-ready. But it's functional and extensible.
The integration pattern works: the agent can discover and call external tools through ToolRouter's meta-tools. I've tested the basic workflow locally, and extending it to GitHub issues, Slack notifications, and other integrations is straightforward from here. The APIs are manageable. The possibilities are interesting.
That's where this is at right now.