<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: VK</title>
    <description>The latest articles on DEV Community by VK (@vk_durden).</description>
    <link>https://dev.to/vk_durden</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3589287%2F004229ee-6c94-42c8-8768-2bdc0f615e03.jpg</url>
      <title>DEV Community: VK</title>
      <link>https://dev.to/vk_durden</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vk_durden"/>
    <language>en</language>
    <item>
      <title>I'm Building My Own Coding Agent Harness (And It's Pretty Cool)</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Mon, 19 Jan 2026 16:09:04 +0000</pubDate>
      <link>https://dev.to/composiodev/im-building-my-own-coding-agent-harness-and-its-pretty-cool-1lpf</link>
      <guid>https://dev.to/composiodev/im-building-my-own-coding-agent-harness-and-its-pretty-cool-1lpf</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
I worked on a coding agent harness that executes AI-generated code in Docker sandboxes, and learned that the real bottleneck isn't code generation, but everything after: environment setup, error handling, and integration with external services.&lt;/p&gt;

&lt;p&gt;While this works great for local development, connecting to tools like GitHub, Slack, or databases meant building dozens of API integrations, managing OAuth flows, and handling edge cases. That's where Composio's ToolRouter came in, instead of building integrations myself, I now get tools through 6 simple meta-tools, with authentication and execution handled automatically.&lt;/p&gt;

&lt;p&gt;The result? An agent that can write code, test it locally, create GitHub issues, and notify Slack, all through a single, observable execution loop. Turns out the coolest part wasn't just watching AI write code, but watching it interact with the real world safely and transparently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This is a learning project where I explored agent execution patterns and integration approaches. The Composio integration shown here demonstrates the concept, though a production implementation would need additional error handling, cost controls, and testing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Build a Coding Agent Harness
&lt;/h2&gt;

&lt;p&gt;I didn’t start thinking about a coding agent harness because existing tools are bad. Quite the opposite. Tools like Claude Code and Codex are excellent, and I use them regularly for debugging and iteration.&lt;/p&gt;

&lt;p&gt;What sparked this work wasn’t dissatisfaction, but curiosity.&lt;/p&gt;

&lt;p&gt;While using these tools, I kept noticing that a lot of the most important work happens outside the model: code execution, error capture, retries, environment setup, and tool integration. These systems handle that complexity well, but largely invisibly. You get an answer, a fix, or a result, but not always a clear view into how the system arrived there.&lt;/p&gt;

&lt;p&gt;I wanted to understand that execution loop more deeply.&lt;/p&gt;

&lt;p&gt;Not just whether it worked, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what code actually ran&lt;/li&gt;
&lt;li&gt;what error was thrown&lt;/li&gt;
&lt;li&gt;what context was passed back to the model&lt;/li&gt;
&lt;li&gt;what changed between attempts&lt;/li&gt;
&lt;li&gt;and where control really lives when things fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most modern AI coding tools optimize for speed and convenience. They hide complexity to make workflows feel smooth. That’s usually the right tradeoff. But it also means the execution layer, the part where code meets reality, is opaque.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Real Bottleneck Isn’t Code Generation
&lt;/h3&gt;

&lt;p&gt;Writing code is no longer the hard part. Models are already very good at producing first drafts, boilerplate, and obvious fixes.&lt;/p&gt;

&lt;p&gt;The harder part is everything that happens after: running code in a real environment, capturing raw errors, installing missing dependencies, handling environment variables, and retrying with proper context when something fails.&lt;/p&gt;

&lt;p&gt;Humans still do most of this manually. We run the code, read the error, decide what matters, and feed it back into the model. That makes the human the slowest and most expensive part of the loop.&lt;/p&gt;

&lt;p&gt;If an AI system can write code but cannot directly observe failures and respond to them, it isn’t really an agent. It’s a suggestion engine with a human acting as the executor.&lt;/p&gt;
&lt;h3&gt;
  
  
  What I Mean by a “Harness”
&lt;/h3&gt;

&lt;p&gt;A coding agent harness isn’t a model and it isn’t a framework. It’s the infrastructure around the model.&lt;/p&gt;

&lt;p&gt;It’s the execution loop that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runs code in a controlled environment&lt;/li&gt;
&lt;li&gt;captures raw outputs and errors&lt;/li&gt;
&lt;li&gt;feeds that reality back to the model&lt;/li&gt;
&lt;li&gt;applies guardrails around what the system can touch&lt;/li&gt;
&lt;li&gt;and makes every step visible and debuggable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The language model provides reasoning and code.&lt;/p&gt;

&lt;p&gt;The harness provides execution, feedback, and constraints.&lt;/p&gt;

&lt;p&gt;Together, they turn “generate code” into “try, fail, observe, and improve.”&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;This isn’t about replacing developers or claiming AI writes perfect code.&lt;/p&gt;

&lt;p&gt;It’s about removing the least valuable parts of development work: rerunning commands, fixing missing imports, interpreting test failures, and repeating the same debug loop over and over.&lt;/p&gt;

&lt;p&gt;By making execution and feedback explicit, AI systems become easier to trust, easier to debug, and easier to reason about.&lt;/p&gt;

&lt;p&gt;That’s the motivation behind this work: not building a better model, but understanding and shaping the execution layer that makes these tools actually useful.&lt;/p&gt;

&lt;p&gt;And honestly, the first time you watch an agent write code, see it fail, understand what broke, and fix itself, that’s genuinely fucking cool. Not because it’s magic, but because you can finally see the entire loop working.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the Harness
&lt;/h2&gt;

&lt;p&gt;The first part was the motivation. Next comes the part where the idea becomes a real execution loop you can watch, debug, and trust.&lt;/p&gt;

&lt;p&gt;Before writing any code, I forced myself to answer one question:&lt;/p&gt;

&lt;p&gt;What does an "agent" actually do, step by step, when you strip away the UI?&lt;/p&gt;

&lt;p&gt;Here's the loop in plain English:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Task comes in&lt;/li&gt;
&lt;li&gt;Model decides what to do next&lt;/li&gt;
&lt;li&gt;Model asks to use a tool&lt;/li&gt;
&lt;li&gt;The harness runs something in the real world&lt;/li&gt;
&lt;li&gt;The harness returns raw results&lt;/li&gt;
&lt;li&gt;Model updates its plan&lt;/li&gt;
&lt;li&gt;Repeat until done or we hit a cap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important detail is the one people forget:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model never touches your machine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It only sees whatever the harness returns.&lt;/p&gt;

&lt;p&gt;That boundary is the whole point. The harness is the interface between intelligence and execution. If you control that interface, you control the agent.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;This harness has three main parts.&lt;/p&gt;
&lt;h3&gt;
  
  
  1) The workspace
&lt;/h3&gt;

&lt;p&gt;A fresh directory on the host for every run.&lt;/p&gt;

&lt;p&gt;This is where the agent writes and reads files. Think of it like a tiny project repo the agent can manipulate.&lt;/p&gt;

&lt;p&gt;That choice matters because it's the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"run a snippet of Python"&lt;/li&gt;
&lt;li&gt;and "build and iterate on an actual codebase"&lt;/li&gt;
&lt;/ul&gt;
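
&lt;p&gt;A minimal sketch of that choice (the function name is mine, not the harness's actual API): each run gets its own throwaway directory via the standard library.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def create_workspace(prefix: str = "agent-run-") -> Path:
    """Create a fresh, empty directory for a single agent run."""
    # A new directory per run means runs never share state,
    # and inspecting a failed run is as easy as opening the folder.
    return Path(tempfile.mkdtemp(prefix=prefix))
```

&lt;p&gt;The resulting path is what gets mounted into the container at &lt;code&gt;/workspace&lt;/code&gt;.&lt;/p&gt;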
&lt;h3&gt;
  
  
  2) The Docker sandbox
&lt;/h3&gt;

&lt;p&gt;Every command runs inside a Docker container with the workspace mounted at &lt;code&gt;/workspace&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That gives me isolation, repeatability, and a place to put guardrails.&lt;/p&gt;

&lt;p&gt;By default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource limits are enforced (CPU + memory)&lt;/li&gt;
&lt;li&gt;Commands have a hard timeout&lt;/li&gt;
&lt;li&gt;Networking is disabled unless explicitly enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are going to let a model execute arbitrary code, you need a safety story. Docker isn't perfect, but it is an enormous step up from running things directly on the host.&lt;/p&gt;
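
&lt;p&gt;As a rough sketch, those defaults map directly onto &lt;code&gt;docker run&lt;/code&gt; flags. The function, image name, and specific limit values here are illustrative, not the harness's exact configuration:&lt;/p&gt;

```python
def sandbox_argv(command: str, workspace: str,
                 image: str = "agent-sandbox",
                 network_enabled: bool = False) -> list:
    """Build the `docker run` argv that enforces the sandbox guardrails."""
    argv = [
        "docker", "run", "--rm",
        "--memory", "512m",               # memory cap (illustrative value)
        "--cpus", "1.0",                  # CPU cap (illustrative value)
        "-v", f"{workspace}:/workspace",  # mount the run's workspace
        "-w", "/workspace",
    ]
    if not network_enabled:
        argv += ["--network", "none"]     # networking off unless asked for
    return argv + [image, "bash", "-lc", command]
```

&lt;p&gt;The command-level timeout lives outside this argv: it's enforced by whatever invokes Docker, for example &lt;code&gt;subprocess.run(..., timeout=...)&lt;/code&gt;.&lt;/p&gt;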

&lt;p&gt;&lt;strong&gt;The Image Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I use a custom Docker image with pytest pre-installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; pytest
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /workspace&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["bash"]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solves a critical problem: if the agent has to install pytest every run, it wastes iterations and API calls. Pre-installing common dependencies in the image means every container starts ready to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) The tool contract
&lt;/h3&gt;

&lt;p&gt;The model does not run anything directly. It calls tools. The harness executes those tools and returns structured results.&lt;/p&gt;

&lt;p&gt;I kept the tool surface area small but real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;list_files()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;read_file(path)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write_file(path, content)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;run_command(command, timeout, network_enabled)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;task_complete(summary)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is enough for real developer workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create modules&lt;/li&gt;
&lt;li&gt;Write tests&lt;/li&gt;
&lt;li&gt;Run pytest&lt;/li&gt;
&lt;li&gt;Read failures&lt;/li&gt;
&lt;li&gt;Patch code&lt;/li&gt;
&lt;li&gt;Rerun until green&lt;/li&gt;
&lt;/ul&gt;
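
&lt;p&gt;Under the hood, that small surface area is just a dispatcher. This is a simplified sketch, not the harness's real code; &lt;code&gt;run_command&lt;/code&gt; is left out because it would shell into the Docker sandbox:&lt;/p&gt;

```python
from pathlib import Path

def execute_tool(name: str, args: dict, workspace: Path) -> dict:
    """Route a model tool call to the matching harness function."""
    if name == "list_files":
        return {"files": sorted(str(p.relative_to(workspace))
                                for p in workspace.rglob("*") if p.is_file())}
    if name == "read_file":
        return {"content": (workspace / args["path"]).read_text()}
    if name == "write_file":
        target = workspace / args["path"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(args["content"])
        return {"ok": True}
    if name == "task_complete":
        return {"done": True, "summary": args["summary"]}
    # run_command would go through the Docker sandbox here
    return {"error": f"unknown tool: {name}"}
```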




&lt;h3&gt;
  
  
  Why Docker
&lt;/h3&gt;

&lt;p&gt;Yes, I could have used &lt;code&gt;subprocess.run()&lt;/code&gt; and called it a day.&lt;/p&gt;

&lt;p&gt;But that's not "agent infrastructure." That's handing a model a loaded gun.&lt;/p&gt;

&lt;p&gt;You do not want a tool-using model to have direct access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your filesystem&lt;/li&gt;
&lt;li&gt;Your SSH keys&lt;/li&gt;
&lt;li&gt;Your environment variables&lt;/li&gt;
&lt;li&gt;Your network&lt;/li&gt;
&lt;li&gt;Your ability to fork-bomb your laptop into a space heater&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker gives me a baseline set of protections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process isolation:&lt;/strong&gt; Code runs in a container, not on my host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits:&lt;/strong&gt; CPU and memory caps prevent obvious abuse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network control:&lt;/strong&gt; Networking can be off by default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility:&lt;/strong&gt; Every run starts from a known image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is complexity. It adds friction and edge cases. But for this category of problem, it's the right trade.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Tools
&lt;/h3&gt;

&lt;p&gt;I used to think tool calling was the "agent" part.&lt;/p&gt;

&lt;p&gt;It's not. Tool calling is just the API plumbing that makes the loop possible.&lt;/p&gt;

&lt;p&gt;The model is effectively saying: "Please run this for me, and tell me what happened."&lt;/p&gt;

&lt;p&gt;The harness is the one doing the doing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool: &lt;code&gt;write_file&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is how the model creates and edits code. It writes directly into the workspace.&lt;/p&gt;

&lt;p&gt;In the simplest form, this tool is just an interface to a safe path resolver plus a size limit so the agent can't spam huge files.&lt;/p&gt;
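
&lt;p&gt;A sketch of what that resolver can look like (the size cap and function name are mine): resolve the requested path, then refuse anything that lands outside the workspace.&lt;/p&gt;

```python
from pathlib import Path

MAX_FILE_BYTES = 256 * 1024  # illustrative cap, not the harness's real limit

def safe_write(workspace: Path, rel_path: str, content: str) -> Path:
    """Write a file inside the workspace, rejecting escapes and huge payloads."""
    if len(content.encode()) > MAX_FILE_BYTES:
        raise ValueError(f"file too large: {rel_path}")
    root = workspace.resolve()
    target = (root / rel_path).resolve()
    # Block traversal like "../../etc/passwd" before anything touches disk
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {rel_path}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target
```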

&lt;h3&gt;
  
  
  Tool: &lt;code&gt;run_command&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the core tool. It's how the model runs tests, lints, scripts, whatever.&lt;/p&gt;

&lt;p&gt;Two design choices here ended up being surprisingly important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timeouts are enforced.&lt;/strong&gt; If a command hangs, it dies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Networking is off by default.&lt;/strong&gt; If the agent wants internet (pip install, curl, etc.), it has to explicitly ask for it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That single boolean makes the system easier to reason about, and it's the kind of guardrail you never see from the outside in most agent products.&lt;/p&gt;

&lt;p&gt;Here's what the tool schema looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Run a shell command in Docker in /workspace. Networking is off by default."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Seconds"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"network_enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enable for pip install, etc."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What this is trying to do"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
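
&lt;p&gt;And here's roughly how the harness can honor that schema. This is a hedged sketch: the image name is illustrative, the truncation limit is arbitrary, and the &lt;code&gt;runner&lt;/code&gt; parameter is my addition so the Docker call can be swapped out in tests:&lt;/p&gt;

```python
import subprocess

def run_command(command: str, workspace: str, timeout: int = 60,
                network_enabled: bool = False, runner=subprocess.run) -> dict:
    """Run a command in the Docker sandbox and return a structured result."""
    argv = ["docker", "run", "--rm",
            "-v", f"{workspace}:/workspace", "-w", "/workspace"]
    if not network_enabled:
        argv += ["--network", "none"]
    argv += ["agent-sandbox", "bash", "-lc", command]  # image name is illustrative
    try:
        proc = runner(argv, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode,
                # Truncate so huge logs don't blow up the model's context window
                "stdout": proc.stdout[-10_000:],
                "stderr": proc.stderr[-10_000:],
                "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"exit_code": -1, "stdout": "", "stderr": "", "timed_out": True}
```

&lt;p&gt;If a command hangs, &lt;code&gt;TimeoutExpired&lt;/code&gt; fires and the model still gets a structured result instead of a stuck loop.&lt;/p&gt;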



&lt;h3&gt;
  
  
  Tool: &lt;code&gt;task_complete&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the stop button. Without it, the harness just keeps looping until it hits &lt;code&gt;max_iterations&lt;/code&gt;, even if the task is obviously done.&lt;/p&gt;

&lt;p&gt;For an agent loop, you need a clear termination condition.&lt;/p&gt;
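
&lt;p&gt;A hedged sketch of that termination logic. The model call and tool executor are injected as callables here so only the stop conditions are in view; both names are stand-ins, not the harness's real functions:&lt;/p&gt;

```python
def run_agent(task, next_tool_call, execute_tool, max_iterations: int = 20) -> dict:
    """Drive the loop until task_complete fires or the iteration cap is hit."""
    for i in range(max_iterations):
        name, args = next_tool_call(task)   # ask the model what to do next
        result = execute_tool(name, args)   # the harness does the doing
        if name == "task_complete":         # the explicit stop button
            return {"status": "done", "iterations": i + 1,
                    "summary": result.get("summary", "")}
    # Hard cap: without it, a confused agent burns API calls forever
    return {"status": "max_iterations_reached", "iterations": max_iterations}
```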




&lt;h3&gt;
  
  
  How tool calling actually works
&lt;/h3&gt;

&lt;p&gt;The mechanics are simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You define tools using a JSON schema&lt;/li&gt;
&lt;li&gt;You send them with your API request&lt;/li&gt;
&lt;li&gt;The model can respond with tool calls rather than plain text&lt;/li&gt;
&lt;li&gt;You execute the tool calls in your harness&lt;/li&gt;
&lt;li&gt;You send the results back as tool messages&lt;/li&gt;
&lt;li&gt;The model sees those results and continues&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the conversation looks like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User:&lt;/strong&gt; "Build X"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; calls &lt;code&gt;write_file&lt;/code&gt; and &lt;code&gt;run_command&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness:&lt;/strong&gt; executes and returns real output and errors&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; uses that reality to decide what to do next&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key point again:&lt;/p&gt;

&lt;p&gt;The model is not executing. It's requesting.&lt;/p&gt;

&lt;p&gt;You are the executor. The harness is the gate.&lt;/p&gt;




&lt;h3&gt;
  
  
  The code that matters
&lt;/h3&gt;

&lt;p&gt;Most of the harness is glue code. The heart is the loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get model response with tools
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

    &lt;span class="c1"&gt;# If model wants to use tools
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Store assistant message
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute each tool call
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Add result back to conversation
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Check for completion
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, the "agent" behavior comes from one thing:&lt;/p&gt;

&lt;p&gt;The model sees raw stdout, exit codes, and failures, and it learns what to do next from those signals.&lt;/p&gt;

&lt;p&gt;It's the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I think this should work"&lt;/li&gt;
&lt;li&gt;and "I ran it, here is the traceback"&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  A Real Run
&lt;/h3&gt;

&lt;p&gt;Here's what actually happened when I gave it a real task:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Create a Python text normalization utility with pytest tests&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 1: Create project structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ write_file("textutils/core.py")
→ write_file("tests/test_normalize.py")
✓ Files created

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent writes a &lt;code&gt;normalize()&lt;/code&gt; function and 6 test cases covering edge cases like empty strings, multiple spaces, tabs, and newlines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 2: Run tests (first attempt)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ run_command("pytest tests/test_normalize.py")
✗ exit_code=2
Error: "ERROR: file or directory not found: tests/test_normalize.py"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pytest can't find the test file. The agent doesn't understand why yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iterations 3-7: The struggle
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2

→ run_command("pytest --pythonpath=/workspace")
✗ exit_code=4

→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where you see real problem-solving behavior. The agent is stuck trying different pytest configuration flags, but that's not actually the problem.&lt;/p&gt;

&lt;p&gt;Five iterations of trying variations before it figures out the real issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 8: The breakthrough
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ write_file("textutils/__init__.py", "")
→ run_command("pytest tests/test_normalize.py")
✓ exit_code=0

===== test session starts =====
collected 6 items

tests/test_normalize.py ......                [100%]

===== 6 passed in 0.02s =====

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent realizes &lt;code&gt;textutils&lt;/code&gt; needs to be a proper Python package. Adds the &lt;code&gt;__init__.py&lt;/code&gt; file. All tests pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 9: Done
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ task_complete("Created a text normalization utility with tests. All tests passed successfully.")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time: ~25 seconds. Iterations: 9.&lt;/p&gt;




&lt;h3&gt;
  
  
  What This Teaches You
&lt;/h3&gt;

&lt;p&gt;The satisfying part was &lt;em&gt;how&lt;/em&gt; it worked.&lt;/p&gt;

&lt;p&gt;The agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created real files&lt;/li&gt;
&lt;li&gt;Ran real commands&lt;/li&gt;
&lt;li&gt;Saw real failures&lt;/li&gt;
&lt;li&gt;Got stuck for a bit (iterations 3-7)&lt;/li&gt;
&lt;li&gt;Had a realization (iteration 8)&lt;/li&gt;
&lt;li&gt;Fixed the actual problem&lt;/li&gt;
&lt;li&gt;Verified success&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the loop working. Not magic, not cherry-picked—just a feedback cycle that eventually converges on the right answer.&lt;/p&gt;

&lt;p&gt;And when something goes wrong, you can replay it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pytest tests/test_normalize.py"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exit_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stdout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"===== 6 passed in 0.02s ====="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.85&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every tool call is logged. Every decision is traceable.&lt;/p&gt;




&lt;h3&gt;
  
  
  Error handling and retries
&lt;/h3&gt;

&lt;p&gt;I don't do explicit "retry three times" logic.&lt;/p&gt;

&lt;p&gt;The retry loop is implicit.&lt;/p&gt;

&lt;p&gt;The model retries because it sees structured execution results like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;exit_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;timed_out&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stdout&lt;/code&gt; (often includes tracebacks)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;duration&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a command fails, the model gets the failure, thinks, changes code, and reruns.&lt;/p&gt;
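&lt;p&gt;The loop that makes this possible is minimal. A sketch (the &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;execute&lt;/code&gt; callbacks and the message shape are illustrative, not the actual harness code):&lt;/p&gt;

```python
import json

def agent_loop(step, execute, max_iterations=15):
    """step(messages) returns either ("done", text) or ("tool", call);
    execute(call) returns a structured result dict. Note there is no retry
    logic here: failures flow back as data and the model decides to rerun."""
    messages = []
    for _ in range(max_iterations):
        kind, payload = step(messages)
        if kind == "done":
            return payload
        result = execute(payload)  # {"exit_code": ..., "stdout": ..., "timed_out": ..., "duration": ...}
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None  # iteration budget exhausted
```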

&lt;p&gt;This works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syntax errors&lt;/li&gt;
&lt;li&gt;Missing imports&lt;/li&gt;
&lt;li&gt;Obvious mistakes in logic&lt;/li&gt;
&lt;li&gt;Test failures with clear assertions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works less well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requirements are ambiguous&lt;/li&gt;
&lt;li&gt;The failure needs domain knowledge&lt;/li&gt;
&lt;li&gt;The fix is large and multi-step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not a model problem; it's a harness ergonomics problem: the agent still lacks tools like patch editing, diffing, and memory.&lt;/p&gt;




&lt;h3&gt;
  
  
  Logs: the feature you don't appreciate until you need it
&lt;/h3&gt;

&lt;p&gt;I log every tool call to a JSONL file inside the workspace.&lt;/p&gt;

&lt;p&gt;That log includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What tool was called&lt;/li&gt;
&lt;li&gt;Arguments&lt;/li&gt;
&lt;li&gt;The full result payload&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;/ul&gt;
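&lt;p&gt;The whole logger is a few lines, which is part of why skipping it is so tempting. A sketch (file name is illustrative):&lt;/p&gt;

```python
import json
import time
from pathlib import Path

def log_tool_call(workspace: Path, tool: str, args: dict, result: dict):
    """Append one JSON object per tool call to a JSONL file in the workspace."""
    entry = {
        "ts": time.time(),
        "tool": tool,
        "args": args,
        "result": result,  # full payload, not a summary
    }
    with open(workspace / "agent_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
```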

&lt;p&gt;It sounds boring, but it changes everything.&lt;/p&gt;

&lt;p&gt;When something goes wrong, you can answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What command actually ran&lt;/li&gt;
&lt;li&gt;What the model actually saw&lt;/li&gt;
&lt;li&gt;Whether it tried to install dependencies&lt;/li&gt;
&lt;li&gt;Whether it timed out&lt;/li&gt;
&lt;li&gt;What changed right before it broke&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logs turn agent runs into something you can debug, replay, and trust.&lt;/p&gt;




&lt;h3&gt;
  
  
  What broke (and what I learned)
&lt;/h3&gt;

&lt;p&gt;Let me be honest about the failures, because this is where the actual lessons are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Docker dependency management
&lt;/h3&gt;

&lt;p&gt;Initially, I tried letting the agent install pytest on every run. Bad idea.&lt;/p&gt;

&lt;p&gt;Fresh containers mean fresh installs every time. The agent would waste 2-3 iterations just getting its environment ready before doing actual work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Pre-install common dependencies (pytest, pip tools) in the Docker image. Every container starts with them available.&lt;/p&gt;

&lt;p&gt;This is a general pattern: if you know the agent will need something frequently, bake it into the base image.&lt;/p&gt;
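&lt;p&gt;An illustrative Dockerfile for that base image (the package list is an example, not my exact image):&lt;/p&gt;

```dockerfile
# Bake frequent dependencies in so the agent never burns
# iterations on environment setup.
FROM python:3.12-slim
RUN pip install --no-cache-dir pytest requests
WORKDIR /workspace
```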

&lt;h3&gt;
  
  
  Problem 2: Context growth
&lt;/h3&gt;

&lt;p&gt;Each iteration adds messages:&lt;/p&gt;

&lt;p&gt;system → user → assistant → tool → assistant → tool → …&lt;/p&gt;

&lt;p&gt;On longer runs you eventually run into context limits or degraded performance.&lt;/p&gt;

&lt;p&gt;What helped in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep &lt;code&gt;max_iterations&lt;/code&gt; sane (12-15)&lt;/li&gt;
&lt;li&gt;Make tasks more specific&lt;/li&gt;
&lt;li&gt;Reduce output size (truncate stdout to 20k chars)&lt;/li&gt;
&lt;/ul&gt;
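&lt;p&gt;The output cap is simple but worth getting right: tracebacks and pytest summaries live at the end of stdout, so a sketch that keeps both head and tail (limits are illustrative):&lt;/p&gt;

```python
def truncate_output(text: str, limit: int = 20_000) -> str:
    """Keep tool results bounded before they enter the model's context.
    Preserve the tail: tracebacks and test summaries end there."""
    if len(text) <= limit:
        return text
    head, tail = text[: limit // 2], text[-(limit // 2):]
    omitted = len(text) - limit
    return head + f"\n... [{omitted} chars truncated] ...\n" + tail
```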

&lt;p&gt;The real fix (not fully implemented yet) is proper context management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sliding windows&lt;/li&gt;
&lt;li&gt;Summarizing older tool results&lt;/li&gt;
&lt;li&gt;Storing full logs separately from working context&lt;/li&gt;
&lt;/ul&gt;
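&lt;p&gt;A minimal sliding-window sketch of that idea (message shapes and the elision marker are illustrative; full results stay in the JSONL log):&lt;/p&gt;

```python
def trim_context(messages, keep_recent=8):
    """Keep the system prompt and the task verbatim, collapse older tool
    results to one-line stubs, keep the most recent messages intact."""
    head = messages[:2]  # system prompt + initial user task
    middle = messages[2:-keep_recent] if len(messages) > keep_recent + 2 else []
    recent = messages[max(2, len(messages) - keep_recent):]
    summarized = [
        {"role": m["role"], "content": "[older result elided, see log]"}
        if m["role"] == "tool" else m
        for m in middle
    ]
    return head + summarized + recent
```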

&lt;h3&gt;
  
  
  Problem 3: The agent can be too polite
&lt;/h3&gt;

&lt;p&gt;Sometimes the model hits a wall and says "can't do that" instead of pushing.&lt;/p&gt;

&lt;p&gt;Example: it hits a missing dependency, or a test failure it doesn't understand, and decides to stop.&lt;/p&gt;

&lt;p&gt;The fix is not "better model." It's better scaffolding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stronger system prompt about persistence&lt;/li&gt;
&lt;li&gt;Better tools (like search within workspace, apply patch)&lt;/li&gt;
&lt;li&gt;Clearer success criteria (tests passing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem 4: Networking policy
&lt;/h3&gt;

&lt;p&gt;Some tasks need internet, often just for installing additional deps.&lt;/p&gt;

&lt;p&gt;If network is on by default, the agent will use it constantly.&lt;/p&gt;

&lt;p&gt;If it's off forever, the agent is stuck.&lt;/p&gt;

&lt;p&gt;So I made it explicit: network is disabled by default, but the model can request it via &lt;code&gt;network_enabled=true&lt;/code&gt; on &lt;code&gt;run_command&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's a good compromise because it forces you to see when and why the agent touches the network. In the logs, you can track every network-enabled command.&lt;/p&gt;
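&lt;p&gt;The policy itself is one flag on the container invocation. A sketch of how the command line gets built (the &lt;code&gt;agent-sandbox&lt;/code&gt; image name is illustrative):&lt;/p&gt;

```python
def build_docker_command(command: str, network_enabled: bool = False):
    """Network policy: default-deny, opt-in per command via --network none."""
    docker_cmd = ["docker", "run", "--rm"]
    if not network_enabled:
        docker_cmd += ["--network", "none"]  # no network unless requested
    return docker_cmd + ["agent-sandbox", "sh", "-c", command]
```

&lt;p&gt;Because the flag lives in the tool arguments, it shows up in the JSONL log automatically.&lt;/p&gt;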




&lt;h2&gt;
  
  
  Beyond the Sandbox - Adding External Tools
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Integration Problem
&lt;/h2&gt;

&lt;p&gt;When you start thinking about adding external tools, you realize the scope pretty quickly.&lt;/p&gt;

&lt;p&gt;If you want your agent to create GitHub issues, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth flow for GitHub&lt;/li&gt;
&lt;li&gt;Token management and refresh&lt;/li&gt;
&lt;li&gt;API wrapper for GitHub's REST API&lt;/li&gt;
&lt;li&gt;Error handling for rate limits&lt;/li&gt;
&lt;li&gt;Updates when GitHub's API changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now multiply that by every service you want: Slack, Gmail, Linear, Notion, databases...&lt;/p&gt;

&lt;p&gt;You're looking at months of integration work. Or you can use something like Composio, which has already built 1,000+ of those integrations.&lt;/p&gt;

&lt;p&gt;The important thing is that external tools don't change the agent loop, they stress it. Authentication failures, network timeouts, and rate limits are just another form of "reality" the harness has to surface back to the model. ToolRouter fits cleanly because it preserves the same contract: the model requests, the harness executes, and raw results come back.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Composio's ToolRouter Actually Is
&lt;/h2&gt;

&lt;p&gt;Composio ToolRouter is an integration layer that gives your agent access to external tools through a simple API.&lt;/p&gt;

&lt;p&gt;The core concept is &lt;strong&gt;sessions&lt;/strong&gt;. Each session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is scoped to a specific user&lt;/li&gt;
&lt;li&gt;Manages that user's connected apps&lt;/li&gt;
&lt;li&gt;Provides tools the agent can call&lt;/li&gt;
&lt;li&gt;Handles authentication automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the basic setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Composio&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Composio
&lt;/span&gt;&lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Composio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Create a session for your user
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Get tools for this session
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now &lt;code&gt;tools&lt;/code&gt; contains the ToolRouter meta-tools scoped to this user.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ToolRouter Actually Works
&lt;/h2&gt;

&lt;p&gt;Here's what I learned when integrating it: ToolRouter doesn't give you individual tools upfront.&lt;/p&gt;

&lt;p&gt;Instead, it provides &lt;strong&gt;6 meta-tools&lt;/strong&gt; that handle everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_SEARCH_TOOLS&lt;/strong&gt; - Searches for relevant tools based on task description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_MULTI_EXECUTE_TOOL&lt;/strong&gt; - Executes discovered tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_MANAGE_CONNECTIONS&lt;/strong&gt; - Handles authentication flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_REMOTE_WORKBENCH&lt;/strong&gt; - Processes large responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_REMOTE_BASH_TOOL&lt;/strong&gt; - Runs bash commands remotely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_SEARCH_ENTITIES&lt;/strong&gt; - Searches for entities in connected apps&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So instead of calling &lt;code&gt;GITHUB_CREATE_ISSUE&lt;/code&gt; directly, the workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent calls &lt;code&gt;COMPOSIO_SEARCH_TOOLS&lt;/code&gt; with "create a GitHub issue"&lt;/li&gt;
&lt;li&gt;ToolRouter finds the right tool&lt;/li&gt;
&lt;li&gt;Agent calls &lt;code&gt;COMPOSIO_MULTI_EXECUTE_TOOL&lt;/code&gt; with parameters&lt;/li&gt;
&lt;li&gt;ToolRouter executes it using the user's connected GitHub account&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is actually smarter than loading every possible tool. The meta-tools handle discovery, authentication, and execution dynamically.&lt;/p&gt;
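&lt;p&gt;In code, the two-step flow looks roughly like this. The meta-tool argument and result schemas are simplified here, and &lt;code&gt;dispatch&lt;/code&gt; stands in for whatever executes a tool call in your loop:&lt;/p&gt;

```python
def github_issue_workflow(dispatch, repo, title):
    """Search for a tool, then execute it: the discover-then-run pattern."""
    search = dispatch("COMPOSIO_SEARCH_TOOLS", {"query": "create a GitHub issue"})
    slug = search["results"][0]["slug"]  # the discovered tool
    return dispatch("COMPOSIO_MULTI_EXECUTE_TOOL", {
        "tools": [{"slug": slug, "arguments": {"repo": repo, "title": title}}],
    })
```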

&lt;h2&gt;
  
  
  Two Ways to Use ToolRouter
&lt;/h2&gt;

&lt;p&gt;Composio gives you two integration patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: As Native Tools (What I Used)
&lt;/h3&gt;

&lt;p&gt;You get the 6 meta-tools and pass them to your agent framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create session
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get meta-tools
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pass to your agent (OpenAI, Anthropic, etc.)
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;  &lt;span class="c1"&gt;# The 6 ToolRouter meta-tools
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent uses these meta-tools to discover and execute what it needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: As MCP Server
&lt;/h3&gt;

&lt;p&gt;If you're using an MCP-compatible framework (like Claude Agent SDK), you can connect via the MCP protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create session
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get MCP endpoint
&lt;/span&gt;&lt;span class="n"&gt;mcp_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
&lt;span class="n"&gt;mcp_headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;

&lt;span class="c1"&gt;# Configure your MCP client
# The MCP server handles tool routing automatically
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP approach is more dynamic but adds complexity. For this harness, native tools made more sense: I wanted to keep the execution boundary explicit in my own code rather than delegate it to an MCP runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Integrates With the Harness
&lt;/h2&gt;

&lt;p&gt;The integration pattern is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent Loop
    ├─→ Local Tools (files, Docker commands)
    └─→ Composio Meta-Tools (discovery, execution)
    ↓
Execute tool
    ├─→ Local? Run in Docker
    └─→ Composio? Route to ToolRouter
    ↓
Results back to agent

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is combining both tool types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enable_composio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Workspace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DockerSandbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Composio setup
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;enable_composio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Composio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_all_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Combine local and Composio tools&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LOCAL_TOOLS&lt;/span&gt;  &lt;span class="c1"&gt;# write_file, run_command, etc.
&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;composio_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;composio_tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Add the 6 meta-tools
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn't know or care where tools come from. It just calls them. We handle the routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication Flow
&lt;/h2&gt;

&lt;p&gt;The authentication part is what makes ToolRouter useful. Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First time a user needs an app:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent tries to use a tool (searches for "GitHub issue")&lt;/li&gt;
&lt;li&gt;User hasn't connected GitHub yet&lt;/li&gt;
&lt;li&gt;ToolRouter (via &lt;code&gt;COMPOSIO_MANAGE_CONNECTIONS&lt;/code&gt;) returns an auth URL&lt;/li&gt;
&lt;li&gt;User clicks URL, completes OAuth&lt;/li&gt;
&lt;li&gt;Agent retries, now it works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;After that:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composio manages the tokens&lt;/li&gt;
&lt;li&gt;Handles refresh automatically&lt;/li&gt;
&lt;li&gt;Agent just calls the meta-tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can pre-connect apps for users via the Composio dashboard (&lt;a href="https://app.composio.dev/apps" rel="noopener noreferrer"&gt;https://app.composio.dev/apps&lt;/a&gt;), or let them authenticate on-demand.&lt;/p&gt;

&lt;p&gt;For the harness, pre-connecting apps is smoother. Less interruption during agent runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Example
&lt;/h2&gt;

&lt;p&gt;Let me show you what a workflow looks like with ToolRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; "Create a calculator module, test it, then create a GitHub issue."&lt;/p&gt;

&lt;p&gt;The agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writes &lt;code&gt;calc.py&lt;/code&gt; with functions (local tools)&lt;/li&gt;
&lt;li&gt;Writes &lt;code&gt;test_calc.py&lt;/code&gt; (local tools)&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;pytest&lt;/code&gt; (local tool: Docker execution)&lt;/li&gt;
&lt;li&gt;Sees tests pass&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;COMPOSIO_SEARCH_TOOLS&lt;/code&gt; to find GitHub tools&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;COMPOSIO_MULTI_EXECUTE_TOOL&lt;/code&gt; to create the issue&lt;/li&gt;
&lt;li&gt;Returns the issue URL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 5-6 are where ToolRouter shines. The agent doesn't need to know GitHub's API. It just describes what it wants, and ToolRouter handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Available
&lt;/h2&gt;

&lt;p&gt;Through ToolRouter's meta-tools, you get access to many integrations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: create issues, PRs, manage repos&lt;/li&gt;
&lt;li&gt;GitLab, Bitbucket&lt;/li&gt;
&lt;li&gt;Jira, Linear: task management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Communication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack: send messages, create channels&lt;/li&gt;
&lt;li&gt;Discord, Teams&lt;/li&gt;
&lt;li&gt;Gmail: send/read emails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Productivity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notion, Google Docs&lt;/li&gt;
&lt;li&gt;Calendar, Drive&lt;/li&gt;
&lt;li&gt;Trello, Asana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Databases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL, MongoDB, MySQL&lt;/li&gt;
&lt;li&gt;Airtable, Supabase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent discovers these through &lt;code&gt;COMPOSIO_SEARCH_TOOLS&lt;/code&gt; as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What works well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The meta-tool approach is cleaner than loading individual tools&lt;/li&gt;
&lt;li&gt;Authentication handling is automatic&lt;/li&gt;
&lt;li&gt;Discovery works; the agent finds the right tools&lt;/li&gt;
&lt;li&gt;Error messages are clear (e.g., "connect this app first")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What needs more work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution can be slow (network round trips)&lt;/li&gt;
&lt;li&gt;Error handling for transient failures could be better&lt;/li&gt;
&lt;li&gt;Cost tracking per workflow&lt;/li&gt;
&lt;li&gt;Pre-checking connection status before trying operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn't to integrate everything. It's to integrate the 3-5 apps that matter for your specific workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Assessment
&lt;/h2&gt;

&lt;p&gt;Composio solves a real problem. Building and maintaining dozens of integrations yourself isn't realistic.&lt;/p&gt;

&lt;p&gt;But it's not magic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You still need to understand how the meta-tools work&lt;/li&gt;
&lt;li&gt;Authentication setup takes time (connecting apps)&lt;/li&gt;
&lt;li&gt;External APIs add latency and failure modes&lt;/li&gt;
&lt;li&gt;Costs go up (more API calls)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The value proposition: &lt;strong&gt;trade integration complexity for a simpler API&lt;/strong&gt;. Instead of learning OAuth flows for 10 services, you learn how ToolRouter's 6 meta-tools work.&lt;/p&gt;

&lt;p&gt;For most cases, that's the right trade.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;At this point, the agent isn't just writing code in isolation. It can change a codebase, verify behavior, and take real actions on behalf of a user, all through explicit, inspectable boundaries.&lt;/p&gt;

&lt;p&gt;That's the pattern that keeps showing up: models reason, harnesses execute, and tools expose reality. Once you get that separation right, adding integrations stops being scary and starts being composable.&lt;/p&gt;

&lt;p&gt;The result: an agent that can write code, test it locally, and take actions through connected services.&lt;/p&gt;

&lt;p&gt;It's not perfect. It's not production-ready. But it's functional and extensible.&lt;/p&gt;

&lt;p&gt;The integration pattern works: the agent can discover and call external tools through ToolRouter's meta-tools. I've tested the basic workflow locally, and extending it to GitHub issues, Slack notifications, and other integrations is straightforward from here.&lt;/p&gt;

&lt;p&gt;The APIs are manageable, and the possibilities are interesting.&lt;/p&gt;

&lt;p&gt;That's where this is at right now.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>automation</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I tested the top 3 AI coding models on real engineering problems. The results surprised me.</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Fri, 28 Nov 2025 13:14:52 +0000</pubDate>
      <link>https://dev.to/composiodev/i-tested-the-top-3-ai-coding-models-on-real-engineering-problems-the-results-surprised-me-pkc</link>
      <guid>https://dev.to/composiodev/i-tested-the-top-3-ai-coding-models-on-real-engineering-problems-the-results-surprised-me-pkc</guid>
      <description>&lt;p&gt;Over the last week, three of the biggest coding-focused AI models dropped almost back to back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus 4.5&lt;/li&gt;
&lt;li&gt;GPT-5.1&lt;/li&gt;
&lt;li&gt;Gemini 3.0 Pro&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everyone has been posting charts, benchmarks, and SWE-bench numbers. Those do not tell me much about how these models behave when dropped into a real codebase with real constraints, real logs, real edge cases, and real integrations.&lt;/p&gt;

&lt;p&gt;So I decided to test them in my own system.&lt;/p&gt;

&lt;p&gt;I took the exact same two engineering problems from my observability platform and asked each model to implement them directly inside my repository. No special prep, no fine-tuning, no scaffolding. Just: "Here is the context. Build it."&lt;/p&gt;

&lt;p&gt;This is what happened.&lt;/p&gt;

&lt;p&gt;TL;DR — Quick Results&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;What It's Good For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Pro&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;Fastest (~5–6m)&lt;/td&gt;
&lt;td&gt;Fast prototyping, creative solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.1 Codex&lt;/td&gt;
&lt;td&gt;$0.51&lt;/td&gt;
&lt;td&gt;Medium (~5–6m)&lt;/td&gt;
&lt;td&gt;Production-ready code that integrates cleanly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$1.76&lt;/td&gt;
&lt;td&gt;Slowest (~12m)&lt;/td&gt;
&lt;td&gt;Deep architecture, system design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  What I tested (identical for all models)
&lt;/h1&gt;

&lt;p&gt;I gave all three models two core components from my system.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Statistical anomaly detection
&lt;/h2&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn baseline error rates&lt;/li&gt;
&lt;li&gt;Use EWMA and z-scores&lt;/li&gt;
&lt;li&gt;Detect spikes of roughly 5x the baseline&lt;/li&gt;
&lt;li&gt;Handle more than 100,000 logs per minute&lt;/li&gt;
&lt;li&gt;Do not crash from NaN, Infinity, or zero division&lt;/li&gt;
&lt;li&gt;Adapt as the system evolves&lt;/li&gt;
&lt;/ul&gt;
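&lt;p&gt;To make the requirements concrete, here is a minimal sketch of what such a detector can look like: EWMA baseline plus z-score, with guards against NaN, Infinity, and zero division. The thresholds and smoothing factor are illustrative, not what any of the models produced:&lt;/p&gt;

```python
import math

class EWMADetector:
    """EWMA mean/variance with z-score spike detection."""
    def __init__(self, alpha=0.1, z_threshold=4.0, eps=1e-9):
        self.alpha, self.z_threshold, self.eps = alpha, z_threshold, eps
        self.mean = None
        self.var = 0.0

    def observe(self, x: float) -> bool:
        if not math.isfinite(x):       # NaN / Infinity: ignore, never crash
            return False
        if self.mean is None:          # first sample seeds the baseline
            self.mean = x
            return False
        diff = x - self.mean
        std = max(self.var ** 0.5, self.eps)  # never divide by zero
        is_anomaly = abs(diff) / std > self.z_threshold and x > 5 * max(self.mean, self.eps)
        # Update after scoring, so the spike doesn't judge itself.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return is_anomaly
```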

&lt;h2&gt;
  
  
  2. Distributed alert deduplication
&lt;/h2&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple processors detecting the same anomaly&lt;/li&gt;
&lt;li&gt;Up to 3 seconds of clock skew&lt;/li&gt;
&lt;li&gt;Survive crashes&lt;/li&gt;
&lt;li&gt;Enforce a 5-second dedupe window&lt;/li&gt;
&lt;li&gt;Avoid duplicate alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All implementations were tested inside my actual codebase.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why this experiment matters
&lt;/h1&gt;

&lt;p&gt;This was not about ranking models. It was about understanding their behavior where it actually matters: real systems with real traffic.&lt;/p&gt;

&lt;p&gt;Some observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architectural intelligence is not the same as production safety&lt;/li&gt;
&lt;li&gt;Minimal designs often outperform complex ones when load is high&lt;/li&gt;
&lt;li&gt;Defensive programming is still an essential skill, even for AI models&lt;/li&gt;
&lt;li&gt;Agentic tooling like Composio can simplify integration work dramatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly: model choice should be driven by the engineering problem, not leaderboard hype.&lt;/p&gt;




&lt;h1&gt;
  
  
  Claude Opus 4.5: "Let me architect this properly."
&lt;/h1&gt;

&lt;p&gt;Claude treated the task like a platform redesign.&lt;/p&gt;

&lt;p&gt;For anomaly detection, it produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A complete statistical engine&lt;/li&gt;
&lt;li&gt;Welford variance&lt;/li&gt;
&lt;li&gt;Snapshotting and serialization&lt;/li&gt;
&lt;li&gt;Configuration layers&lt;/li&gt;
&lt;li&gt;A documentation-level explanation of every component&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture was genuinely impressive.&lt;/p&gt;

&lt;p&gt;Where things failed was in execution. One edge case crashed the entire service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// previous = 0 -&amp;gt; Infinity&lt;/span&gt;
&lt;span class="nx"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                        &lt;span class="c1"&gt;// Crash&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a restart, the serialized baseline was also reconstructed incorrectly, which left the system in a corrupted state.&lt;/p&gt;

&lt;p&gt;My takeaway: Claude behaves like an architect, not a production IC. The design quality is excellent, but I needed to harden the output before trusting it in a high-volume ingestion path.&lt;/p&gt;




&lt;h1&gt;
  
  
  GPT-5.1: "Let us ship something that will not break."
&lt;/h1&gt;

&lt;p&gt;Codex produced the most balanced and production-safe output in my tests.&lt;/p&gt;

&lt;p&gt;For anomaly detection it used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A straightforward O(1) update loop&lt;/li&gt;
&lt;li&gt;EWMA with no unnecessary complexity&lt;/li&gt;
&lt;li&gt;Defensive programming on every numerical operation&lt;/li&gt;
&lt;li&gt;Clean integration with my existing pipeline on the first attempt&lt;/li&gt;
&lt;/ul&gt;
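&lt;p&gt;For context, this is roughly what that defensive O(1) EWMA loop can look like. The sketch below is my own illustration (the class name, 3-sigma threshold, and warmup length are my choices, not the model's actual output):&lt;/p&gt;

```typescript
// Illustrative EWMA anomaly detector with defensive guards.
// All names and constants here are my own choices, not model output.
class EwmaDetector {
  private mean = 0;
  private variance = 0;
  private count = 0;

  constructor(
    private readonly alpha = 0.1,    // smoothing factor
    private readonly threshold = 3,  // flag points beyond ~3 sigma
    private readonly epsilon = 1e-9, // guard against a zero-scale divide
    private readonly warmup = 10     // don't flag until a baseline exists
  ) {}

  // O(1) update; returns true when `value` looks anomalous.
  update(value: number): boolean {
    if (!Number.isFinite(value)) return false; // drop NaN / Infinity inputs

    this.count++;
    if (this.count === 1) {
      this.mean = value; // seed the baseline
      return false;
    }

    const diff = value - this.mean;
    const stddev = Math.sqrt(Math.max(this.variance, 0));
    const isAnomaly =
      Math.abs(diff) > this.threshold * Math.max(stddev, this.epsilon);

    // Standard EWMA recurrences for mean and variance.
    this.mean += this.alpha * diff;
    this.variance =
      (1 - this.alpha) * (this.variance + this.alpha * diff * diff);

    return this.count <= this.warmup ? false : isAnomaly;
  }
}
```

&lt;p&gt;The &lt;code&gt;Number.isFinite&lt;/code&gt; guard is exactly the kind of check that would have caught the division-by-zero edge case from the Claude section before it propagated.&lt;/p&gt;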

&lt;p&gt;For deduplication it suggested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A simple reservation table&lt;/li&gt;
&lt;li&gt;Postgres row-level locks with FOR UPDATE&lt;/li&gt;
&lt;li&gt;TTL cleanup&lt;/li&gt;
&lt;li&gt;Clock skew handled at the database layer&lt;/li&gt;
&lt;/ul&gt;
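&lt;p&gt;The actual suggestion used Postgres (&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; inside a transaction), which needs a live database, but the reserve-then-expire logic can be sketched in-memory. Everything below is my own approximation of the idea, not the model's code:&lt;/p&gt;

```typescript
// In-memory approximation of the reservation-table idea: the first caller
// to reserve a fingerprint wins, and duplicates inside the TTL window are
// suppressed. In the Postgres version the check-and-insert is made atomic
// with row-level locks (SELECT ... FOR UPDATE) instead of a Map.
class ReservationTable {
  private reservations = new Map<string, number>(); // fingerprint -> expiry (ms)

  constructor(private readonly ttlMs: number) {}

  // Returns true if the caller acquired the reservation (alert should fire).
  tryReserve(fingerprint: string, now = Date.now()): boolean {
    const expiry = this.reservations.get(fingerprint);
    if (expiry !== undefined && expiry > now) return false; // duplicate
    this.reservations.set(fingerprint, now + this.ttlMs);
    return true;
  }

  // TTL cleanup, meant to run periodically.
  sweep(now = Date.now()): void {
    this.reservations.forEach((expiry, key) => {
      if (expiry <= now) this.reservations.delete(key);
    });
  }
}
```

&lt;p&gt;Handling clock skew at the database layer, as the model suggested, means using the database's &lt;code&gt;now()&lt;/code&gt; for expiry instead of each worker's local clock.&lt;/p&gt;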

&lt;p&gt;It worked on the first run without crashes or inconsistencies.&lt;/p&gt;

&lt;p&gt;My takeaway: this model behaves like a senior engineer who optimizes for reliability and fails safe by default. It was not flashy, but it was dependable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Gemini 3.0 Pro: "Let us get something clean and fast into the repo."
&lt;/h1&gt;

&lt;p&gt;Gemini felt like the fastest and most concise contributor.&lt;/p&gt;

&lt;p&gt;For anomaly detection it gave:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A compact EWMA implementation&lt;/li&gt;
&lt;li&gt;Minimal and readable code&lt;/li&gt;
&lt;li&gt;Proper epsilon checks&lt;/li&gt;
&lt;li&gt;Simple logic that was easy to review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For alert deduplication it produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Postgres INSERT ON CONFLICT design for atomic suppression&lt;/li&gt;
&lt;li&gt;No unnecessary layers&lt;/li&gt;
&lt;li&gt;The cleanest code to read among the three&lt;/li&gt;
&lt;/ul&gt;
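&lt;p&gt;The core of that design is a single statement. This is my own reconstruction of the pattern (table and column names are illustrative, not Gemini's actual output): the insert either claims the alert fingerprint or silently loses to an earlier row.&lt;/p&gt;

```typescript
// Builds the parameterized dedup query in the INSERT ... ON CONFLICT style.
// Table and column names are illustrative. If the query returns a row, the
// caller fires the alert; zero rows back means it was a duplicate. A
// periodic DELETE of expired rows reopens the suppression window.
interface SqlQuery {
  text: string;
  values: unknown[];
}

function suppressAlertQuery(
  fingerprint: string,
  windowSeconds: number
): SqlQuery {
  return {
    text: `INSERT INTO alert_dedup (fingerprint, expires_at)
           VALUES ($1, now() + make_interval(secs => $2))
           ON CONFLICT (fingerprint) DO NOTHING
           RETURNING fingerprint`,
    values: [fingerprint, windowSeconds],
  };
}
```

&lt;p&gt;Because the conflict is resolved inside Postgres, concurrent workers never race on the fingerprint, which is why no extra locking layer is needed.&lt;/p&gt;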

&lt;p&gt;The limitation was that some edge cases were left for me to think through manually, and the design was tied closely to Postgres.&lt;/p&gt;

&lt;p&gt;My takeaway: Gemini is an excellent rapid prototyper. It is fast, clean, and efficient. I would simply perform an extra pass before deploying it to production.&lt;/p&gt;




&lt;h1&gt;
  
  
  What I learned from running all three in a live codebase
&lt;/h1&gt;

&lt;p&gt;This experiment made something clear:&lt;/p&gt;

&lt;p&gt;Models differ in engineering philosophy, not just accuracy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some try to design a platform&lt;/li&gt;
&lt;li&gt;Some try to ship robust production code&lt;/li&gt;
&lt;li&gt;Some try to produce fast and usable prototypes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on the problem, each approach can be the best one.&lt;/p&gt;

&lt;p&gt;For my observability system, the style that emphasized correctness and integration performed best in this specific context.&lt;/p&gt;

&lt;p&gt;The architectural depth from Claude and the simplicity and speed of Gemini were also valuable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Integrating Composio Tool Router
&lt;/h1&gt;

&lt;p&gt;For the Gemini branch, I also wired in Composio's Tool Router. It is essentially a unified way to give the agent access to Slack, Jira, PagerDuty, Gmail, and similar tools without hand-building each integration.&lt;/p&gt;

&lt;p&gt;A simplified version of my setup looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;composioClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ComposioClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tracer-system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;slack&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jira&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pagerduty&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;composioClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMCPClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;agentName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;log-anomaly-alert-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Anomaly detected in production...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool Router streamlined agentic actions significantly and removed the overhead of wiring multiple third-party integrations manually.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final thoughts
&lt;/h1&gt;

&lt;p&gt;This was not a competition. It was an experiment inside a real, running observability pipeline.&lt;/p&gt;

&lt;p&gt;Three models. Same tasks. Same repository. Same constraints.&lt;/p&gt;

&lt;p&gt;Each one delivered a different tradeoff, a different strength, and a different engineering personality.&lt;/p&gt;

&lt;p&gt;If you build real systems, these differences matter more than leaderboard numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Results &amp;amp; Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Complete analysis:&lt;/strong&gt; &lt;a href="https://composio.dev/blog/claude-4-5-opus-vs-gemini-3-pro-vs-gpt-5-codex-max-the-sota-coding-model" rel="noopener noreferrer"&gt;Read the full blog post&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This was an experimental comparison to understand model capabilities, not production deployment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>performance</category>
    </item>
    <item>
      <title>Cursor Composer 1 vs SWE 1.5 What Surprised Me Most After Testing Both</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Mon, 10 Nov 2025 13:11:25 +0000</pubDate>
      <link>https://dev.to/composiodev/cursor-composer-1-vs-swe-15-what-surprised-me-most-after-testing-both-25gh</link>
      <guid>https://dev.to/composiodev/cursor-composer-1-vs-swe-15-what-surprised-me-most-after-testing-both-25gh</guid>
      <description>&lt;p&gt;I’ve spent the last few weeks living with two of the most talked-about AI coding assistants, Cursor Composer 1 and Cognition SWE 1.5, inside real multi-service projects connected through Composio’s Rube MCP gateway.&lt;/p&gt;

&lt;p&gt;Not toy apps. Not single-file demos. Actual workflows: browser extensions, API connections, and live data running through real services.&lt;/p&gt;

&lt;p&gt;Here’s what stood out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor’s Secret Strength: Flow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor still nails what it set out to do: get you to a working prototype fast.&lt;br&gt;
It keeps you in a "flow" state where ideas turn into working code almost immediately. The feedback loop feels natural, like coding with a hyperactive pair programmer who doesn’t get tired.&lt;/p&gt;

&lt;p&gt;But when the project grew past one file, that same speed started working against it. Quick fixes piled up. Error handling got messy. The MVP was done, but scaling it felt like untangling a ball of wires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE 1.5’s Advantage: Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SWE 1.5 took longer to reach the same MVP, but the code it wrote looked like something a senior engineer would hand off to a team.&lt;br&gt;
It separated logic cleanly, anticipated edge cases, and wrote comments that actually explained why things worked.&lt;/p&gt;

&lt;p&gt;When I connected it through Rube MCP to multiple services, it handled streaming events, retries, and failure cases like a pro. It wasn’t flashy, but it was quietly solid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Surprised Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Error recovery&lt;/u&gt;: SWE 1.5 caught and retried partial SSE events automatically. Cursor often just… stopped.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Architecture&lt;/u&gt;: SWE 1.5 created multi-file structures with clear boundaries. Cursor favored single-file speed.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Debugging&lt;/u&gt;: SWE 1.5 left breadcrumbs in logs. Cursor left mystery.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Iteration speed&lt;/u&gt;: Cursor was addictive for prototyping. SWE 1.5 rewarded patience with cleaner long-term code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Numbers&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;u&gt;Speed &amp;amp; Scaffolding:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor reached a working build in ~25 minutes (~40-50K tokens, ~$0.15-0.25) but required several debugging loops. &lt;/p&gt;

&lt;p&gt;SWE 1.5 took ~45 minutes (~55-65K tokens, ~$0.50-0.60) but fewer debugging loops (~3 vs ~6) and a more modular structure. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Architecture &amp;amp; Maintainability:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor sample: single background.js, minimal separation of concerns. Fine for MVPs but weak on error handling. &lt;/p&gt;

&lt;p&gt;SWE 1.5: multi-file (background, popup, config, proxy), strong error recovery, buffered SSE handling, fallback logic. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Error Handling &amp;amp; Debugging:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor: Syntax or stream parsing errors required manual fixes.&lt;/p&gt;

&lt;p&gt;SWE 1.5: Detected root causes, implemented retries, managed partial SSE messages, clearer logs.&lt;/p&gt;
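&lt;p&gt;To make the partial-SSE point concrete, here is the buffering pattern in minimal form. This is my own sketch of the technique, not SWE 1.5's actual code:&lt;/p&gt;

```typescript
// Buffered SSE parsing: chunks can split an event anywhere, so incomplete
// trailing data stays in the buffer until the rest arrives. Names here are
// my own, written to illustrate the pattern.
function createSseBuffer(onEvent: (data: string) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    // SSE events are separated by a blank line; the last split piece may be
    // a partial event, so it goes back into the buffer.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.startsWith("data: ")) onEvent(line.slice(6));
      }
    }
  };
}
```

&lt;p&gt;Cursor's single-file version tended to parse each chunk in isolation, which is exactly where partial events made it stop.&lt;/p&gt;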

&lt;p&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want momentum, something you can see and share within an hour, Cursor Composer is still unmatched.&lt;br&gt;
If you want something you can build on top of, with fewer “why did it break?” moments, SWE 1.5 is the safer bet.&lt;/p&gt;

&lt;p&gt;Both are excellent in their lanes. But in real multi-service builds powered by Composio, structure beats speed more often than not.&lt;/p&gt;

&lt;p&gt;I’ve detailed the full experiment, metrics, and side-by-side comparisons here:&lt;br&gt;
Read the full write-up on &lt;a href="https://composio.dev/blog/cursor-composer-vs-swe-1-5" rel="noopener noreferrer"&gt;the Composio blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious, have you tried building real integrations with these assistants (or others like Devin or Aider)?&lt;br&gt;
What patterns or failure modes have you noticed?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>api</category>
    </item>
    <item>
      <title>10 Claude Skills that actually changed how I work</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Thu, 06 Nov 2025 11:06:24 +0000</pubDate>
      <link>https://dev.to/composiodev/10-claude-skills-that-actually-changed-how-i-work-2b58</link>
      <guid>https://dev.to/composiodev/10-claude-skills-that-actually-changed-how-i-work-2b58</guid>
      <description>&lt;p&gt;Okay so Skills dropped last month and I've been testing them nonstop. Some are genuinely useful, others are kinda whatever. Here's what I actually use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rube MCP Connector (community skill)&lt;/strong&gt; - This one's wild. Connect Claude to like 500 apps (Slack, GitHub, Notion, etc) through ONE server instead of setting up auth for each one separately. Saves so much time if you're doing automation stuff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Superpowers&lt;/strong&gt; - obra's dev toolkit. Has /brainstorm, /write-plan, /execute-plan commands that basically turn Claude into a proper dev workflow instead of just a chatbot. Game changer if you're coding seriously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Suite&lt;/strong&gt; - Official one. Makes Claude actually good at Word/Excel/PowerPoint/PDF. Not just reading them but ACTUALLY creating proper docs with formatting, formulas, all that. Built-in for Pro users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Theme Factory&lt;/strong&gt; - Upload your brand guidelines once, every artifact Claude makes follows your colors/fonts automatically. Marketing teams will love this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Algorithmic Art&lt;/strong&gt; - p5.js generative art but you just describe it. "Blue-purple gradient flow field, 5000 particles, seed 42" and boom, reproducible artwork. Creative coders eating good.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slack GIF Creator&lt;/strong&gt; - Custom animated GIFs optimized for Slack. Instead of searching Giphy, just tell Claude what you want. Weirdly fun.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Webapp Testing&lt;/strong&gt; - Playwright automation. Tell Claude "test the login flow" and it writes + runs the tests. QA engineers this is for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Builder&lt;/strong&gt; - Generates MCP server boilerplate. If you're building custom integrations, this cuts setup time by like 80%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Brand Guidelines&lt;/strong&gt; - Similar to Theme Factory but handles multiple brands. Switch between them easily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Systematic Debugging&lt;/strong&gt; - Makes Claude debug like a senior dev. Root cause → hypotheses → fixes → documentation. No more random stabbing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Quick thoughts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills are just markdown files with YAML metadata (super easy to make your own)&lt;/li&gt;
&lt;li&gt;They're token-efficient (~30-50 tokens until loaded)&lt;/li&gt;
&lt;li&gt;They work across Claude.ai, Claude Code, and the API&lt;/li&gt;
&lt;li&gt;Community ones on GitHub are hit or miss, so use at your own risk&lt;/li&gt;
&lt;/ul&gt;
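&lt;p&gt;To show how light the format is, here is a complete (hypothetical) skill, a single SKILL.md whose YAML frontmatter carries the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; Claude uses to decide when to load it:&lt;/p&gt;

```markdown
---
name: commit-message-helper
description: Writes a conventional-commit message from the staged diff.
---

# Commit Message Helper

When asked for a commit message, read the staged diff and produce one
conventional-commit subject line (type(scope): summary) plus an optional
body explaining the why, not the what.
```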

&lt;p&gt;The Rube connector and Superpowers are my daily drivers now. Document Suite is clutch when clients send weird file formats.&lt;/p&gt;

&lt;p&gt;Anyone else trying these? What am I missing?&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ComposioHQ/awesome-claude-skills" rel="noopener noreferrer"&gt;Claude Skills repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rube.app/" rel="noopener noreferrer"&gt;Rube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
