<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: VK</title>
    <description>The latest articles on DEV Community by VK (@vk_durden).</description>
    <link>https://dev.to/vk_durden</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3589287%2F004229ee-6c94-42c8-8768-2bdc0f615e03.jpg</url>
      <title>DEV Community: VK</title>
      <link>https://dev.to/vk_durden</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vk_durden"/>
    <language>en</language>
    <item>
      <title>I'm Building My Own Coding Agent Harness (And It's Pretty Cool)</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Mon, 19 Jan 2026 16:09:04 +0000</pubDate>
      <link>https://dev.to/composiodev/im-building-my-own-coding-agent-harness-and-its-pretty-cool-1lpf</link>
      <guid>https://dev.to/composiodev/im-building-my-own-coding-agent-harness-and-its-pretty-cool-1lpf</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;br&gt;
I worked on a coding agent harness that executes AI-generated code in Docker sandboxes, and learned that the real bottleneck isn't code generation, but everything after: environment setup, error handling, and integration with external services.&lt;/p&gt;

&lt;p&gt;While this works great for local development, connecting to tools like GitHub, Slack, or databases meant building dozens of API integrations, managing OAuth flows, and handling edge cases. That's where Composio's ToolRouter came in, instead of building integrations myself, I now get tools through 6 simple meta-tools, with authentication and execution handled automatically.&lt;/p&gt;

&lt;p&gt;The result? An agent that can write code, test it locally, create GitHub issues, and notify Slack, all through a single, observable execution loop. Turns out the coolest part wasn't just watching AI write code, but watching it interact with the real world safely and transparently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This is a learning project where I explored agent execution patterns and integration approaches. The Composio integration shown here demonstrates the concept, though a production implementation would need additional error handling, cost controls, and testing.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Build a Coding Agent Harness
&lt;/h2&gt;

&lt;p&gt;I didn’t start thinking about a coding agent harness because existing tools are bad. Quite the opposite. Tools like Claude Code and Codex are excellent, and I use them regularly for debugging and iteration.&lt;/p&gt;

&lt;p&gt;What sparked this work wasn’t dissatisfaction, but curiosity.&lt;/p&gt;

&lt;p&gt;While using these tools, I kept noticing that a lot of the most important work happens outside the model: code execution, error capture, retries, environment setup, and tool integration. These systems handle that complexity well, but largely invisibly. You get an answer, a fix, or a result, but not always a clear view into how the system arrived there.&lt;/p&gt;

&lt;p&gt;I wanted to understand that execution loop more deeply.&lt;/p&gt;

&lt;p&gt;Not just whether it worked, but:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;what code actually ran&lt;/li&gt;
&lt;li&gt;what error was thrown&lt;/li&gt;
&lt;li&gt;what context was passed back to the model&lt;/li&gt;
&lt;li&gt;what changed between attempts&lt;/li&gt;
&lt;li&gt;and where control really lives when things fail&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most modern AI coding tools optimize for speed and convenience. They hide complexity to make workflows feel smooth. That’s usually the right tradeoff. But it also means the execution layer, the part where code meets reality, is opaque.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Real Bottleneck Isn’t Code Generation
&lt;/h3&gt;

&lt;p&gt;Writing code is no longer the hard part. Models are already very good at producing first drafts, boilerplate, and obvious fixes.&lt;/p&gt;

&lt;p&gt;The harder part is everything that happens after: running code in a real environment, capturing raw errors, installing missing dependencies, handling environment variables, and retrying with proper context when something fails.&lt;/p&gt;

&lt;p&gt;Humans still do most of this manually. We run the code, read the error, decide what matters, and feed it back into the model. That makes the human the slowest and most expensive part of the loop.&lt;/p&gt;

&lt;p&gt;If an AI system can write code but cannot directly observe failures and respond to them, it isn’t really an agent. It’s a suggestion engine with a human acting as the executor.&lt;/p&gt;
&lt;h3&gt;
  
  
  What I Mean by a “Harness”
&lt;/h3&gt;

&lt;p&gt;A coding agent harness isn’t a model and it isn’t a framework. It’s the infrastructure around the model.&lt;/p&gt;

&lt;p&gt;It’s the execution loop that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;runs code in a controlled environment&lt;/li&gt;
&lt;li&gt;captures raw outputs and errors&lt;/li&gt;
&lt;li&gt;feeds that reality back to the model&lt;/li&gt;
&lt;li&gt;applies guardrails around what the system can touch&lt;/li&gt;
&lt;li&gt;and makes every step visible and debuggable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The language model provides reasoning and code.&lt;/p&gt;

&lt;p&gt;The harness provides execution, feedback, and constraints.&lt;/p&gt;

&lt;p&gt;Together, they turn “generate code” into “try, fail, observe, and improve.”&lt;/p&gt;
&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;This isn’t about replacing developers or claiming AI writes perfect code.&lt;/p&gt;

&lt;p&gt;It’s about removing the least valuable parts of development work: rerunning commands, fixing missing imports, interpreting test failures, and repeating the same debug loop over and over.&lt;/p&gt;

&lt;p&gt;By making execution and feedback explicit, AI systems become easier to trust, easier to debug, and easier to reason about.&lt;/p&gt;

&lt;p&gt;That’s the motivation behind this work: not building a better model, but understanding and shaping the execution layer that makes these tools actually useful.&lt;/p&gt;

&lt;p&gt;And honestly, the first time you watch an agent write code, see it fail, understand what broke, and fix itself, that’s genuinely fucking cool. Not because it’s magic, but because you can finally see the entire loop working.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the Harness
&lt;/h2&gt;

&lt;p&gt;The first part was the motivation. Next comes the part where the idea becomes a real execution loop you can watch, debug, and trust.&lt;/p&gt;

&lt;p&gt;Before writing any code, I forced myself to answer one question:&lt;/p&gt;

&lt;p&gt;What does an "agent" actually do, step by step, when you strip away the UI?&lt;/p&gt;

&lt;p&gt;Here's the loop in plain English:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Task comes in&lt;/li&gt;
&lt;li&gt;Model decides what to do next&lt;/li&gt;
&lt;li&gt;Model asks to use a tool&lt;/li&gt;
&lt;li&gt;The harness runs something in the real world&lt;/li&gt;
&lt;li&gt;The harness returns raw results&lt;/li&gt;
&lt;li&gt;Model updates its plan&lt;/li&gt;
&lt;li&gt;Repeat until done or we hit a cap&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The important detail is the one people forget:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The model never touches your machine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It only sees whatever the harness returns.&lt;/p&gt;

&lt;p&gt;That boundary is the whole point. The harness is the interface between intelligence and execution. If you control that interface, you control the agent.&lt;/p&gt;


&lt;h3&gt;
  
  
  The Architecture
&lt;/h3&gt;

&lt;p&gt;This harness has three main parts.&lt;/p&gt;
&lt;h3&gt;
  
  
  1) The workspace
&lt;/h3&gt;

&lt;p&gt;A fresh directory on the host for every run.&lt;/p&gt;

&lt;p&gt;This is where the agent writes and reads files. Think of it like a tiny project repo the agent can manipulate.&lt;/p&gt;

&lt;p&gt;That choice matters because it's the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"run a snippet of Python"&lt;/li&gt;
&lt;li&gt;and "build and iterate on an actual codebase"&lt;/li&gt;
&lt;/ul&gt;
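
&lt;p&gt;A minimal sketch of that choice (the function name is mine, not the harness's actual API): each run gets its own throwaway directory via the standard library.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def create_workspace(prefix: str = "agent-run-") -> Path:
    """Create a fresh, empty directory for a single agent run."""
    # A new directory per run means runs never share state,
    # and inspecting a failed run is as easy as opening the folder.
    return Path(tempfile.mkdtemp(prefix=prefix))
```

&lt;p&gt;The resulting path is what gets mounted into the container at &lt;code&gt;/workspace&lt;/code&gt;.&lt;/p&gt;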
&lt;h3&gt;
  
  
  2) The Docker sandbox
&lt;/h3&gt;

&lt;p&gt;Every command runs inside a Docker container with the workspace mounted at &lt;code&gt;/workspace&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That gives me isolation, repeatability, and a place to put guardrails.&lt;/p&gt;

&lt;p&gt;By default:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Resource limits are enforced (CPU + memory)&lt;/li&gt;
&lt;li&gt;Commands have a hard timeout&lt;/li&gt;
&lt;li&gt;Networking is disabled unless explicitly enabled&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you are going to let a model execute arbitrary code, you need a safety story. Docker isn't perfect, but it is an enormous step up from running things directly on the host.&lt;/p&gt;
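
&lt;p&gt;As a rough sketch, those defaults map directly onto &lt;code&gt;docker run&lt;/code&gt; flags. The function, image name, and specific limit values here are illustrative, not the harness's exact configuration:&lt;/p&gt;

```python
def sandbox_argv(command: str, workspace: str,
                 image: str = "agent-sandbox",
                 network_enabled: bool = False) -> list:
    """Build the `docker run` argv that enforces the sandbox guardrails."""
    argv = [
        "docker", "run", "--rm",
        "--memory", "512m",               # memory cap (illustrative value)
        "--cpus", "1.0",                  # CPU cap (illustrative value)
        "-v", f"{workspace}:/workspace",  # mount the run's workspace
        "-w", "/workspace",
    ]
    if not network_enabled:
        argv += ["--network", "none"]     # networking off unless asked for
    return argv + [image, "bash", "-lc", command]
```

&lt;p&gt;The command-level timeout lives outside this argv: it's enforced by whatever invokes Docker, for example &lt;code&gt;subprocess.run(..., timeout=...)&lt;/code&gt;.&lt;/p&gt;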

&lt;p&gt;&lt;strong&gt;The Image Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I use a custom Docker image with pytest pre-installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; python:3.11-slim&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--no-cache-dir&lt;/span&gt; pytest
&lt;span class="k"&gt;WORKDIR&lt;/span&gt;&lt;span class="s"&gt; /workspace&lt;/span&gt;
&lt;span class="k"&gt;CMD&lt;/span&gt;&lt;span class="s"&gt; ["bash"]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solves a critical problem: if the agent has to install pytest every run, it wastes iterations and API calls. Pre-installing common dependencies in the image means every container starts ready to work.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) The tool contract
&lt;/h3&gt;

&lt;p&gt;The model does not run anything directly. It calls tools. The harness executes those tools and returns structured results.&lt;/p&gt;

&lt;p&gt;I kept the tool surface area small but real:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;list_files()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;read_file(path)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write_file(path, content)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;run_command(command, timeout, network_enabled)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;task_complete(summary)&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is enough for real developer workflows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create modules&lt;/li&gt;
&lt;li&gt;Write tests&lt;/li&gt;
&lt;li&gt;Run pytest&lt;/li&gt;
&lt;li&gt;Read failures&lt;/li&gt;
&lt;li&gt;Patch code&lt;/li&gt;
&lt;li&gt;Rerun until green&lt;/li&gt;
&lt;/ul&gt;
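
&lt;p&gt;Under the hood, that small surface area is just a dispatcher. This is a simplified sketch, not the harness's real code; &lt;code&gt;run_command&lt;/code&gt; is left out because it would shell into the Docker sandbox:&lt;/p&gt;

```python
from pathlib import Path

def execute_tool(name: str, args: dict, workspace: Path) -> dict:
    """Route a model tool call to the matching harness function."""
    if name == "list_files":
        return {"files": sorted(str(p.relative_to(workspace))
                                for p in workspace.rglob("*") if p.is_file())}
    if name == "read_file":
        return {"content": (workspace / args["path"]).read_text()}
    if name == "write_file":
        target = workspace / args["path"]
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(args["content"])
        return {"ok": True}
    if name == "task_complete":
        return {"done": True, "summary": args["summary"]}
    # run_command would go through the Docker sandbox here
    return {"error": f"unknown tool: {name}"}
```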




&lt;h3&gt;
  
  
  Why Docker
&lt;/h3&gt;

&lt;p&gt;Yes, I could have used &lt;code&gt;subprocess.run()&lt;/code&gt; and called it a day.&lt;/p&gt;

&lt;p&gt;But that's not "agent infrastructure." That's handing a model a loaded gun.&lt;/p&gt;

&lt;p&gt;You do not want a tool-using model to have direct access to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your filesystem&lt;/li&gt;
&lt;li&gt;Your SSH keys&lt;/li&gt;
&lt;li&gt;Your environment variables&lt;/li&gt;
&lt;li&gt;Your network&lt;/li&gt;
&lt;li&gt;Your ability to fork-bomb your laptop into a space heater&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Docker gives me a baseline set of protections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Process isolation:&lt;/strong&gt; Code runs in a container, not on my host&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits:&lt;/strong&gt; CPU and memory caps prevent obvious abuse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Network control:&lt;/strong&gt; Networking can be off by default&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility:&lt;/strong&gt; Every run starts from a known image&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The tradeoff is complexity. It adds friction and edge cases. But for this category of problem, it's the right trade.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Tools
&lt;/h3&gt;

&lt;p&gt;I used to think tool calling was the "agent" part.&lt;/p&gt;

&lt;p&gt;It's not. Tool calling is just the API plumbing that makes the loop possible.&lt;/p&gt;

&lt;p&gt;The model is effectively saying: "Please run this for me, and tell me what happened."&lt;/p&gt;

&lt;p&gt;The harness is the one doing the doing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tool: &lt;code&gt;write_file&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is how the model creates and edits code. It writes directly into the workspace.&lt;/p&gt;

&lt;p&gt;In the simplest form, this tool is just an interface to a safe path resolver plus a size limit so the agent can't spam huge files.&lt;/p&gt;
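
&lt;p&gt;A sketch of what that resolver can look like (the size cap and function name are mine): resolve the requested path, then refuse anything that lands outside the workspace.&lt;/p&gt;

```python
from pathlib import Path

MAX_FILE_BYTES = 256 * 1024  # illustrative cap, not the harness's real limit

def safe_write(workspace: Path, rel_path: str, content: str) -> Path:
    """Write a file inside the workspace, rejecting escapes and huge payloads."""
    if len(content.encode()) > MAX_FILE_BYTES:
        raise ValueError(f"file too large: {rel_path}")
    root = workspace.resolve()
    target = (root / rel_path).resolve()
    # Block traversal like "../../etc/passwd" before anything touches disk
    if not target.is_relative_to(root):
        raise ValueError(f"path escapes workspace: {rel_path}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target
```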

&lt;h3&gt;
  
  
  Tool: &lt;code&gt;run_command&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the core tool. It's how the model runs tests, lints, scripts, whatever.&lt;/p&gt;

&lt;p&gt;Two design choices here ended up being surprisingly important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Timeouts are enforced.&lt;/strong&gt; If a command hangs, it dies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Networking is off by default.&lt;/strong&gt; If the agent wants internet (pip install, curl, etc.), it has to explicitly ask for it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That single boolean makes the system easier to reason about, and it's the kind of guardrail you never see from the outside in most agent products.&lt;/p&gt;

&lt;p&gt;Here's what the tool schema looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Run a shell command in Docker in /workspace. Networking is off by default."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"properties"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"integer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Seconds"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"network_enabled"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boolean"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Enable for pip install, etc."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What this is trying to do"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"required"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
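
&lt;p&gt;And here's roughly how the harness can honor that schema. This is a hedged sketch: the image name is illustrative, the truncation limit is arbitrary, and the &lt;code&gt;runner&lt;/code&gt; parameter is my addition so the Docker call can be swapped out in tests:&lt;/p&gt;

```python
import subprocess

def run_command(command: str, workspace: str, timeout: int = 60,
                network_enabled: bool = False, runner=subprocess.run) -> dict:
    """Run a command in the Docker sandbox and return a structured result."""
    argv = ["docker", "run", "--rm",
            "-v", f"{workspace}:/workspace", "-w", "/workspace"]
    if not network_enabled:
        argv += ["--network", "none"]
    argv += ["agent-sandbox", "bash", "-lc", command]  # image name is illustrative
    try:
        proc = runner(argv, capture_output=True, text=True, timeout=timeout)
        return {"exit_code": proc.returncode,
                # Truncate so huge logs don't blow up the model's context window
                "stdout": proc.stdout[-10_000:],
                "stderr": proc.stderr[-10_000:],
                "timed_out": False}
    except subprocess.TimeoutExpired:
        return {"exit_code": -1, "stdout": "", "stderr": "", "timed_out": True}
```

&lt;p&gt;If a command hangs, &lt;code&gt;TimeoutExpired&lt;/code&gt; fires and the model still gets a structured result instead of a stuck loop.&lt;/p&gt;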



&lt;h3&gt;
  
  
  Tool: &lt;code&gt;task_complete&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the stop button. Without it, the harness just keeps looping until it hits &lt;code&gt;max_iterations&lt;/code&gt;, even if the task is obviously done.&lt;/p&gt;

&lt;p&gt;For an agent loop, you need a clear termination condition.&lt;/p&gt;
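
&lt;p&gt;A hedged sketch of that termination logic. The model call and tool executor are injected as callables here so only the stop conditions are in view; both names are stand-ins, not the harness's real functions:&lt;/p&gt;

```python
def run_agent(task, next_tool_call, execute_tool, max_iterations: int = 20) -> dict:
    """Drive the loop until task_complete fires or the iteration cap is hit."""
    for i in range(max_iterations):
        name, args = next_tool_call(task)   # ask the model what to do next
        result = execute_tool(name, args)   # the harness does the doing
        if name == "task_complete":         # the explicit stop button
            return {"status": "done", "iterations": i + 1,
                    "summary": result.get("summary", "")}
    # Hard cap: without it, a confused agent burns API calls forever
    return {"status": "max_iterations_reached", "iterations": max_iterations}
```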




&lt;h3&gt;
  
  
  How tool calling actually works
&lt;/h3&gt;

&lt;p&gt;The mechanics are simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You define tools using a JSON schema&lt;/li&gt;
&lt;li&gt;You send them with your API request&lt;/li&gt;
&lt;li&gt;The model can respond with tool calls rather than plain text&lt;/li&gt;
&lt;li&gt;You execute the tool calls in your harness&lt;/li&gt;
&lt;li&gt;You send the results back as tool messages&lt;/li&gt;
&lt;li&gt;The model sees those results and continues&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So the conversation looks like:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User:&lt;/strong&gt; "Build X"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; calls &lt;code&gt;write_file&lt;/code&gt; and &lt;code&gt;run_command&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness:&lt;/strong&gt; executes and returns real output and errors&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; uses that reality to decide what to do next&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repeat&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The key point again:&lt;/p&gt;

&lt;p&gt;The model is not executing. It's requesting.&lt;/p&gt;

&lt;p&gt;You are the executor. The harness is the gate.&lt;/p&gt;




&lt;h3&gt;
  
  
  The code that matters
&lt;/h3&gt;

&lt;p&gt;Most of the harness is glue code. The heart is the loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_iterations&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get model response with tools
&lt;/span&gt;    &lt;span class="n"&gt;resp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;TOOLS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tool_choice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;auto&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;

    &lt;span class="c1"&gt;# If model wants to use tools
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Store assistant message
&lt;/span&gt;        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt;
        &lt;span class="p"&gt;})&lt;/span&gt;

        &lt;span class="c1"&gt;# Execute each tool call
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Add result back to conversation
&lt;/span&gt;            &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_call_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

            &lt;span class="c1"&gt;# Check for completion
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;success&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In practice, the "agent" behavior comes from one thing:&lt;/p&gt;

&lt;p&gt;The model sees raw stdout, exit codes, and failures, and it learns what to do next from those signals.&lt;/p&gt;

&lt;p&gt;It's the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I think this should work"&lt;/li&gt;
&lt;li&gt;and "I ran it, here is the traceback"&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  A Real Run
&lt;/h3&gt;

&lt;p&gt;Here's what actually happened when I gave it a real task:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; Create a Python text normalization utility with pytest tests&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 1: Create project structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ write_file("textutils/core.py")
→ write_file("tests/test_normalize.py")
✓ Files created

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent writes a &lt;code&gt;normalize()&lt;/code&gt; function and 6 test cases covering edge cases like empty strings, multiple spaces, tabs, and newlines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 2: Run tests (first attempt)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ run_command("pytest tests/test_normalize.py")
✗ exit_code=2
Error: "ERROR: file or directory not found: tests/test_normalize.py"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pytest can't find the test file. The agent doesn't understand why yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iterations 3-7: The struggle
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2

→ run_command("pytest --pythonpath=/workspace")
✗ exit_code=4

→ run_command("pytest --rootdir=/workspace")
✗ exit_code=2

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is where you see real problem-solving behavior. The agent is stuck trying different pytest configuration flags, but that's not actually the problem.&lt;/p&gt;

&lt;p&gt;Five iterations of trying variations before it figures out the real issue.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 8: The breakthrough
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ write_file("textutils/__init__.py", "")
→ run_command("pytest tests/test_normalize.py")
✓ exit_code=0

===== test session starts =====
collected 6 items

tests/test_normalize.py ......                [100%]

===== 6 passed in 0.02s =====

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent realizes &lt;code&gt;textutils&lt;/code&gt; needs to be a proper Python package. Adds the &lt;code&gt;__init__.py&lt;/code&gt; file. All tests pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Iteration 9: Done
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;→ task_complete("Created a text normalization utility with tests. All tests passed successfully.")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Total time: ~25 seconds. Iterations: 9.&lt;/p&gt;




&lt;h3&gt;
  
  
  What This Teaches You
&lt;/h3&gt;

&lt;p&gt;The satisfying part was &lt;em&gt;how&lt;/em&gt; it worked.&lt;/p&gt;

&lt;p&gt;The agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created real files&lt;/li&gt;
&lt;li&gt;Ran real commands&lt;/li&gt;
&lt;li&gt;Saw real failures&lt;/li&gt;
&lt;li&gt;Got stuck for a bit (iterations 3-7)&lt;/li&gt;
&lt;li&gt;Had a realization (iteration 8)&lt;/li&gt;
&lt;li&gt;Fixed the actual problem&lt;/li&gt;
&lt;li&gt;Verified success&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's the loop working. Not magic, not cherry-picked—just a feedback cycle that eventually converges on the right answer.&lt;/p&gt;

&lt;p&gt;And when something goes wrong, you can replay it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"iteration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"tool"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_command"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"pytest tests/test_normalize.py"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"exit_code"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"stdout"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"===== 6 passed in 0.02s ====="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.85&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every tool call is logged. Every decision is traceable.&lt;/p&gt;




&lt;h3&gt;
  
  
  Error handling and retries
&lt;/h3&gt;

&lt;p&gt;I don't do explicit "retry three times" logic.&lt;/p&gt;

&lt;p&gt;The retry loop is implicit.&lt;/p&gt;

&lt;p&gt;The model retries because it sees structured execution results like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;exit_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;timed_out&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;stdout&lt;/code&gt; (often includes tracebacks)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;duration&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a command fails, the model gets the failure, thinks, changes code, and reruns.&lt;/p&gt;
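&lt;p&gt;The loop that makes this possible is minimal. A sketch (the &lt;code&gt;step&lt;/code&gt; and &lt;code&gt;execute&lt;/code&gt; callbacks and the message shape are illustrative, not the actual harness code):&lt;/p&gt;

```python
import json

def agent_loop(step, execute, max_iterations=15):
    """step(messages) returns either ("done", text) or ("tool", call);
    execute(call) returns a structured result dict. Note there is no retry
    logic here: failures flow back as data and the model decides to rerun."""
    messages = []
    for _ in range(max_iterations):
        kind, payload = step(messages)
        if kind == "done":
            return payload
        result = execute(payload)  # {"exit_code": ..., "stdout": ..., "timed_out": ..., "duration": ...}
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None  # iteration budget exhausted
```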

&lt;p&gt;This works well for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Syntax errors&lt;/li&gt;
&lt;li&gt;Missing imports&lt;/li&gt;
&lt;li&gt;Obvious mistakes in logic&lt;/li&gt;
&lt;li&gt;Test failures with clear assertions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It works less well when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requirements are ambiguous&lt;/li&gt;
&lt;li&gt;The failure needs domain knowledge&lt;/li&gt;
&lt;li&gt;The fix is large and multi-step&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's not a model problem; it's a harness ergonomics problem: the agent still lacks tools like patch editing, diffing, and memory.&lt;/p&gt;




&lt;h3&gt;
  
  
  Logs: the feature you don't appreciate until you need it
&lt;/h3&gt;

&lt;p&gt;I log every tool call to a JSONL file inside the workspace.&lt;/p&gt;

&lt;p&gt;That log includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What tool was called&lt;/li&gt;
&lt;li&gt;Arguments&lt;/li&gt;
&lt;li&gt;The full result payload&lt;/li&gt;
&lt;li&gt;Timestamps&lt;/li&gt;
&lt;/ul&gt;
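&lt;p&gt;The whole logger is a few lines, which is part of why skipping it is so tempting. A sketch (file name is illustrative):&lt;/p&gt;

```python
import json
import time
from pathlib import Path

def log_tool_call(workspace: Path, tool: str, args: dict, result: dict):
    """Append one JSON object per tool call to a JSONL file in the workspace."""
    entry = {
        "ts": time.time(),
        "tool": tool,
        "args": args,
        "result": result,  # full payload, not a summary
    }
    with open(workspace / "agent_log.jsonl", "a") as f:
        f.write(json.dumps(entry) + "\n")
```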

&lt;p&gt;It sounds boring, but it changes everything.&lt;/p&gt;

&lt;p&gt;When something goes wrong, you can answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What command actually ran&lt;/li&gt;
&lt;li&gt;What the model actually saw&lt;/li&gt;
&lt;li&gt;Whether it tried to install dependencies&lt;/li&gt;
&lt;li&gt;Whether it timed out&lt;/li&gt;
&lt;li&gt;What changed right before it broke&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Logs turn agent runs into something you can debug, replay, and trust.&lt;/p&gt;




&lt;h3&gt;
  
  
  What broke (and what I learned)
&lt;/h3&gt;

&lt;p&gt;Let me be honest about the failures, because this is where the actual lessons are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Docker dependency management
&lt;/h3&gt;

&lt;p&gt;Initially, I tried letting the agent install pytest on every run. Bad idea.&lt;/p&gt;

&lt;p&gt;Fresh containers mean fresh installs every time. The agent would waste 2-3 iterations just getting its environment ready before doing actual work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Pre-install common dependencies (pytest, pip tools) in the Docker image. Every container starts with them available.&lt;/p&gt;

&lt;p&gt;This is a general pattern: if you know the agent will need something frequently, bake it into the base image.&lt;/p&gt;
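&lt;p&gt;An illustrative Dockerfile for that base image (the package list is an example, not my exact image):&lt;/p&gt;

```dockerfile
# Bake frequent dependencies in so the agent never burns
# iterations on environment setup.
FROM python:3.12-slim
RUN pip install --no-cache-dir pytest requests
WORKDIR /workspace
```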

&lt;h3&gt;
  
  
  Problem 2: Context growth
&lt;/h3&gt;

&lt;p&gt;Each iteration adds messages:&lt;/p&gt;

&lt;p&gt;system → user → assistant → tool → assistant → tool → …&lt;/p&gt;

&lt;p&gt;On longer runs you eventually run into context limits or degraded performance.&lt;/p&gt;

&lt;p&gt;What helped in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keep &lt;code&gt;max_iterations&lt;/code&gt; sane (12-15)&lt;/li&gt;
&lt;li&gt;Make tasks more specific&lt;/li&gt;
&lt;li&gt;Reduce output size (truncate stdout to 20k chars)&lt;/li&gt;
&lt;/ul&gt;
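&lt;p&gt;The output cap is simple but worth getting right: tracebacks and pytest summaries live at the end of stdout, so a sketch that keeps both head and tail (limits are illustrative):&lt;/p&gt;

```python
def truncate_output(text: str, limit: int = 20_000) -> str:
    """Keep tool results bounded before they enter the model's context.
    Preserve the tail: tracebacks and test summaries end there."""
    if len(text) <= limit:
        return text
    head, tail = text[: limit // 2], text[-(limit // 2):]
    omitted = len(text) - limit
    return head + f"\n... [{omitted} chars truncated] ...\n" + tail
```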

&lt;p&gt;The real fix (not fully implemented yet) is proper context management:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sliding windows&lt;/li&gt;
&lt;li&gt;Summarizing older tool results&lt;/li&gt;
&lt;li&gt;Storing full logs separately from working context&lt;/li&gt;
&lt;/ul&gt;
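&lt;p&gt;A minimal sliding-window sketch of that idea (message shapes and the elision marker are illustrative; full results stay in the JSONL log):&lt;/p&gt;

```python
def trim_context(messages, keep_recent=8):
    """Keep the system prompt and the task verbatim, collapse older tool
    results to one-line stubs, keep the most recent messages intact."""
    head = messages[:2]  # system prompt + initial user task
    middle = messages[2:-keep_recent] if len(messages) > keep_recent + 2 else []
    recent = messages[max(2, len(messages) - keep_recent):]
    summarized = [
        {"role": m["role"], "content": "[older result elided, see log]"}
        if m["role"] == "tool" else m
        for m in middle
    ]
    return head + summarized + recent
```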

&lt;h3&gt;
  
  
  Problem 3: The agent can be too polite
&lt;/h3&gt;

&lt;p&gt;Sometimes the model hits a wall and says "can't do that" instead of pushing.&lt;/p&gt;

&lt;p&gt;Example: it hits a missing dependency, or a test failure it doesn't understand, and decides to stop.&lt;/p&gt;

&lt;p&gt;The fix is not "better model." It's better scaffolding:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stronger system prompt about persistence&lt;/li&gt;
&lt;li&gt;Better tools (like search within workspace, apply patch)&lt;/li&gt;
&lt;li&gt;Clearer success criteria (tests passing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Problem 4: Networking policy
&lt;/h3&gt;

&lt;p&gt;Some tasks need internet, often just for installing additional deps.&lt;/p&gt;

&lt;p&gt;If network is on by default, the agent will use it constantly.&lt;/p&gt;

&lt;p&gt;If it's off forever, the agent is stuck.&lt;/p&gt;

&lt;p&gt;So I made it explicit: network is disabled by default, but the model can request it via &lt;code&gt;network_enabled=true&lt;/code&gt; on &lt;code&gt;run_command&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That's a good compromise because it forces you to see when and why the agent touches the network. In the logs, you can track every network-enabled command.&lt;/p&gt;
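&lt;p&gt;The policy itself is one flag on the container invocation. A sketch of how the command line gets built (the &lt;code&gt;agent-sandbox&lt;/code&gt; image name is illustrative):&lt;/p&gt;

```python
def build_docker_command(command: str, network_enabled: bool = False):
    """Network policy: default-deny, opt-in per command via --network none."""
    docker_cmd = ["docker", "run", "--rm"]
    if not network_enabled:
        docker_cmd += ["--network", "none"]  # no network unless requested
    return docker_cmd + ["agent-sandbox", "sh", "-c", command]
```

&lt;p&gt;Because the flag lives in the tool arguments, it shows up in the JSONL log automatically.&lt;/p&gt;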




&lt;h2&gt;
  
  
  Beyond the Sandbox - Adding External Tools
&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Integration Problem
&lt;/h2&gt;

&lt;p&gt;When you start thinking about adding external tools, you realize the scope pretty quickly.&lt;/p&gt;

&lt;p&gt;If you want your agent to create GitHub issues, you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth flow for GitHub&lt;/li&gt;
&lt;li&gt;Token management and refresh&lt;/li&gt;
&lt;li&gt;API wrapper for GitHub's REST API&lt;/li&gt;
&lt;li&gt;Error handling for rate limits&lt;/li&gt;
&lt;li&gt;Updates when GitHub's API changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now multiply that by every service you want: Slack, Gmail, Linear, Notion, databases...&lt;/p&gt;

&lt;p&gt;You're looking at months of integration work. Or you can use something like Composio, which has already built 1,000+ of those integrations.&lt;/p&gt;

&lt;p&gt;The important thing is that external tools don't change the agent loop, they stress it. Authentication failures, network timeouts, and rate limits are just another form of "reality" the harness has to surface back to the model. ToolRouter fits cleanly because it preserves the same contract: the model requests, the harness executes, and raw results come back.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Composio's ToolRouter Actually Is
&lt;/h2&gt;

&lt;p&gt;Composio ToolRouter is an integration layer that gives your agent access to external tools through a simple API.&lt;/p&gt;

&lt;p&gt;The core concept is &lt;strong&gt;sessions&lt;/strong&gt;. Each session:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is scoped to a specific user&lt;/li&gt;
&lt;li&gt;Manages that user's connected apps&lt;/li&gt;
&lt;li&gt;Provides tools the agent can call&lt;/li&gt;
&lt;li&gt;Handles authentication automatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the basic setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Composio&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize Composio
&lt;/span&gt;&lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Composio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Create a session for your user
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;alice@company.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Get tools for this session
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. Now &lt;code&gt;tools&lt;/code&gt; contains the ToolRouter meta-tools scoped to this user.&lt;/p&gt;

&lt;h2&gt;
  
  
  How ToolRouter Actually Works
&lt;/h2&gt;

&lt;p&gt;Here's what I learned when integrating it: ToolRouter doesn't give you individual tools upfront.&lt;/p&gt;

&lt;p&gt;Instead, it provides &lt;strong&gt;6 meta-tools&lt;/strong&gt; that handle everything:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_SEARCH_TOOLS&lt;/strong&gt; - Searches for relevant tools based on task description&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_MULTI_EXECUTE_TOOL&lt;/strong&gt; - Executes discovered tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_MANAGE_CONNECTIONS&lt;/strong&gt; - Handles authentication flows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_REMOTE_WORKBENCH&lt;/strong&gt; - Processes large responses&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_REMOTE_BASH_TOOL&lt;/strong&gt; - Runs bash commands remotely&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;COMPOSIO_SEARCH_ENTITIES&lt;/strong&gt; - Searches for entities in connected apps&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So instead of calling &lt;code&gt;GITHUB_CREATE_ISSUE&lt;/code&gt; directly, the workflow is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent calls &lt;code&gt;COMPOSIO_SEARCH_TOOLS&lt;/code&gt; with "create a GitHub issue"&lt;/li&gt;
&lt;li&gt;ToolRouter finds the right tool&lt;/li&gt;
&lt;li&gt;Agent calls &lt;code&gt;COMPOSIO_MULTI_EXECUTE_TOOL&lt;/code&gt; with parameters&lt;/li&gt;
&lt;li&gt;ToolRouter executes it using the user's connected GitHub account&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is actually smarter than loading every possible tool. The meta-tools handle discovery, authentication, and execution dynamically.&lt;/p&gt;
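&lt;p&gt;In code, the two-step flow looks roughly like this. The meta-tool argument and result schemas are simplified here, and &lt;code&gt;dispatch&lt;/code&gt; stands in for whatever executes a tool call in your loop:&lt;/p&gt;

```python
def github_issue_workflow(dispatch, repo, title):
    """Search for a tool, then execute it: the discover-then-run pattern."""
    search = dispatch("COMPOSIO_SEARCH_TOOLS", {"query": "create a GitHub issue"})
    slug = search["results"][0]["slug"]  # the discovered tool
    return dispatch("COMPOSIO_MULTI_EXECUTE_TOOL", {
        "tools": [{"slug": slug, "arguments": {"repo": repo, "title": title}}],
    })
```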

&lt;h2&gt;
  
  
  Two Ways to Use ToolRouter
&lt;/h2&gt;

&lt;p&gt;Composio gives you two integration patterns:&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 1: As Native Tools (What I Used)
&lt;/h3&gt;

&lt;p&gt;You get the 6 meta-tools and pass them to your agent framework:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create session
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get meta-tools
&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Pass to your agent (OpenAI, Anthropic, etc.)
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[...],&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;  &lt;span class="c1"&gt;# The 6 ToolRouter meta-tools
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent uses these meta-tools to discover and execute what it needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Option 2: As MCP Server
&lt;/h3&gt;

&lt;p&gt;If you're using an MCP-compatible framework (like Claude Agent SDK), you can connect via the MCP protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create session
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_123&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get MCP endpoint
&lt;/span&gt;&lt;span class="n"&gt;mcp_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;
&lt;span class="n"&gt;mcp_headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;

&lt;span class="c1"&gt;# Configure your MCP client
# The MCP server handles tool routing automatically
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The MCP approach is more dynamic but adds complexity. For this harness, native tools made more sense: I wanted to keep the execution boundary explicit in my own code rather than delegate it to an MCP runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Integrates With the Harness
&lt;/h2&gt;

&lt;p&gt;The integration pattern is straightforward:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent Loop
    ├─→ Local Tools (files, Docker commands)
    └─→ Composio Meta-Tools (discovery, execution)
    ↓
Execute tool
    ├─→ Local? Run in Docker
    └─→ Composio? Route to ToolRouter
    ↓
Results back to agent

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The key is combining both tool types:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentExecutor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enable_composio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Workspace&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sandbox&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DockerSandbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;workspace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Composio setup
&lt;/span&gt;        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;enable_composio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;composio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Composio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;composio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_all_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Combine local and Composio tools&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;LOCAL_TOOLS&lt;/span&gt;  &lt;span class="c1"&gt;# write_file, run_command, etc.
&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;composio_tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;composio_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;composio_tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Add the 6 meta-tools
&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent doesn't know or care where tools come from. It just calls them. We handle the routing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Authentication Flow
&lt;/h2&gt;

&lt;p&gt;The authentication part is what makes ToolRouter useful. Here's how it works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First time a user needs an app:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent tries to use a tool (searches for "GitHub issue")&lt;/li&gt;
&lt;li&gt;User hasn't connected GitHub yet&lt;/li&gt;
&lt;li&gt;ToolRouter (via &lt;code&gt;COMPOSIO_MANAGE_CONNECTIONS&lt;/code&gt;) returns an auth URL&lt;/li&gt;
&lt;li&gt;User clicks URL, completes OAuth&lt;/li&gt;
&lt;li&gt;Agent retries, now it works&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;After that:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Composio manages the tokens&lt;/li&gt;
&lt;li&gt;Handles refresh automatically&lt;/li&gt;
&lt;li&gt;Agent just calls the meta-tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can pre-connect apps for users via the Composio dashboard (&lt;a href="https://app.composio.dev/apps" rel="noopener noreferrer"&gt;https://app.composio.dev/apps&lt;/a&gt;), or let them authenticate on-demand.&lt;/p&gt;

&lt;p&gt;For the harness, pre-connecting apps is smoother. Less interruption during agent runs.&lt;/p&gt;

&lt;h2&gt;
  
  
  A Simple Example
&lt;/h2&gt;

&lt;p&gt;Let me show you what a workflow looks like with ToolRouter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task:&lt;/strong&gt; "Create a calculator module, test it, then create a GitHub issue."&lt;/p&gt;

&lt;p&gt;The agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Writes &lt;code&gt;calc.py&lt;/code&gt; with functions (local tools)&lt;/li&gt;
&lt;li&gt;Writes &lt;code&gt;test_calc.py&lt;/code&gt; (local tools)&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;pytest&lt;/code&gt; (local tool: Docker execution)&lt;/li&gt;
&lt;li&gt;Sees tests pass&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;COMPOSIO_SEARCH_TOOLS&lt;/code&gt; to find GitHub tools&lt;/li&gt;
&lt;li&gt;Calls &lt;code&gt;COMPOSIO_MULTI_EXECUTE_TOOL&lt;/code&gt; to create the issue&lt;/li&gt;
&lt;li&gt;Returns the issue URL&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 5-6 are where ToolRouter shines. The agent doesn't need to know GitHub's API. It just describes what it wants, and ToolRouter handles the rest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Available
&lt;/h2&gt;

&lt;p&gt;Through ToolRouter's meta-tools, you get access to many integrations:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub: create issues, PRs, manage repos&lt;/li&gt;
&lt;li&gt;GitLab, Bitbucket&lt;/li&gt;
&lt;li&gt;Jira, Linear: task management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Communication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slack: send messages, create channels&lt;/li&gt;
&lt;li&gt;Discord, Teams&lt;/li&gt;
&lt;li&gt;Gmail: send/read emails&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Productivity:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notion, Google Docs&lt;/li&gt;
&lt;li&gt;Calendar, Drive&lt;/li&gt;
&lt;li&gt;Trello, Asana&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Databases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL, MongoDB, MySQL&lt;/li&gt;
&lt;li&gt;Airtable, Supabase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The agent discovers these through &lt;code&gt;COMPOSIO_SEARCH_TOOLS&lt;/code&gt; as needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What works well:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The meta-tool approach is cleaner than loading individual tools&lt;/li&gt;
&lt;li&gt;Authentication handling is automatic&lt;/li&gt;
&lt;li&gt;Discovery works; the agent finds the right tools&lt;/li&gt;
&lt;li&gt;Error messages are clear (e.g., "connect this app first")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What needs more work:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execution can be slow (network round trips)&lt;/li&gt;
&lt;li&gt;Error handling for transient failures could be better&lt;/li&gt;
&lt;li&gt;Cost tracking per workflow&lt;/li&gt;
&lt;li&gt;Pre-checking connection status before trying operations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal isn't to integrate everything. It's to integrate the 3-5 apps that matter for your specific workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Assessment
&lt;/h2&gt;

&lt;p&gt;Composio solves a real problem. Building and maintaining dozens of integrations yourself isn't realistic.&lt;/p&gt;

&lt;p&gt;But it's not magic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You still need to understand how the meta-tools work&lt;/li&gt;
&lt;li&gt;Authentication setup takes time (connecting apps)&lt;/li&gt;
&lt;li&gt;External APIs add latency and failure modes&lt;/li&gt;
&lt;li&gt;Costs go up (more API calls)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The value proposition: &lt;strong&gt;trade integration complexity for a simpler API&lt;/strong&gt;. Instead of learning OAuth flows for 10 services, you learn how ToolRouter's 6 meta-tools work.&lt;/p&gt;

&lt;p&gt;For most cases, that's the right trade.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;At this point, the agent isn't just writing code in isolation. It can change a codebase, verify behavior, and take real actions on behalf of a user, all through explicit, inspectable boundaries.&lt;/p&gt;

&lt;p&gt;That's the pattern that keeps showing up: models reason, harnesses execute, and tools expose reality. Once you get that separation right, adding integrations stops being scary and starts being composable.&lt;/p&gt;

&lt;p&gt;The result: an agent that can write code, test it locally, and take actions through connected services.&lt;/p&gt;

&lt;p&gt;It's not perfect. It's not production-ready. But it's functional and extensible.&lt;/p&gt;

&lt;p&gt;The integration pattern works: the agent can discover and call external tools through ToolRouter's meta-tools. I've tested the basic workflow locally, and extending it to GitHub issues, Slack notifications, and other integrations is straightforward from here.&lt;/p&gt;

&lt;p&gt;The APIs are manageable, and the possibilities are interesting.&lt;/p&gt;

&lt;p&gt;That's where this is at right now.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>automation</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I tested the top 3 AI coding models on real engineering problems. The results surprised me.</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Fri, 28 Nov 2025 13:14:52 +0000</pubDate>
      <link>https://dev.to/composiodev/i-tested-the-top-3-ai-coding-models-on-real-engineering-problems-the-results-surprised-me-pkc</link>
      <guid>https://dev.to/composiodev/i-tested-the-top-3-ai-coding-models-on-real-engineering-problems-the-results-surprised-me-pkc</guid>
      <description>&lt;p&gt;Over the last week, three of the biggest coding-focused AI models dropped almost back to back:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Claude Opus 4.5&lt;/li&gt;
&lt;li&gt;GPT-5.1&lt;/li&gt;
&lt;li&gt;Gemini 3.0 Pro&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Everyone has been posting charts, benchmarks, and SWE-bench numbers. Those do not tell me much about how these models behave when dropped into a real codebase with real constraints, real logs, real edge cases, and real integrations.&lt;/p&gt;

&lt;p&gt;So I decided to test them in my own system.&lt;/p&gt;

&lt;p&gt;I took the exact same two engineering problems from my observability platform and asked each model to implement them directly inside my repository. No special prep, no fine-tuning, no scaffolding. Just: "Here is the context. Build it."&lt;/p&gt;

&lt;p&gt;This is what happened.&lt;/p&gt;

&lt;p&gt;TL;DR — Quick Results&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Total Cost&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;What It's Good For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3 Pro&lt;/td&gt;
&lt;td&gt;$0.25&lt;/td&gt;
&lt;td&gt;Fastest (~5–6m)&lt;/td&gt;
&lt;td&gt;Fast prototyping, creative solutions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-5.1 Codex&lt;/td&gt;
&lt;td&gt;$0.51&lt;/td&gt;
&lt;td&gt;Medium (~5–6m)&lt;/td&gt;
&lt;td&gt;Production-ready code that integrates cleanly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Opus 4.5&lt;/td&gt;
&lt;td&gt;$1.76&lt;/td&gt;
&lt;td&gt;Slowest (~12m)&lt;/td&gt;
&lt;td&gt;Deep architecture, system design&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h1&gt;
  
  
  What I tested (identical for all models)
&lt;/h1&gt;

&lt;p&gt;I gave all three models two core components from my system.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Statistical anomaly detection
&lt;/h2&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learn baseline error rates&lt;/li&gt;
&lt;li&gt;Use EWMA and z-scores&lt;/li&gt;
&lt;li&gt;Detect spikes of roughly 5x the baseline&lt;/li&gt;
&lt;li&gt;Handle more than 100,000 logs per minute&lt;/li&gt;
&lt;li&gt;Do not crash from NaN, Infinity, or zero division&lt;/li&gt;
&lt;li&gt;Adapt as the system evolves&lt;/li&gt;
&lt;/ul&gt;
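&lt;p&gt;To make the requirements concrete, here is a minimal sketch of what such a detector can look like: EWMA baseline plus z-score, with guards against NaN, Infinity, and zero division. The thresholds and smoothing factor are illustrative, not what any of the models produced:&lt;/p&gt;

```python
import math

class EWMADetector:
    """EWMA mean/variance with z-score spike detection."""
    def __init__(self, alpha=0.1, z_threshold=4.0, eps=1e-9):
        self.alpha, self.z_threshold, self.eps = alpha, z_threshold, eps
        self.mean = None
        self.var = 0.0

    def observe(self, x: float) -> bool:
        if not math.isfinite(x):       # NaN / Infinity: ignore, never crash
            return False
        if self.mean is None:          # first sample seeds the baseline
            self.mean = x
            return False
        diff = x - self.mean
        std = max(self.var ** 0.5, self.eps)  # never divide by zero
        is_anomaly = abs(diff) / std > self.z_threshold and x > 5 * max(self.mean, self.eps)
        # Update after scoring, so the spike doesn't judge itself.
        self.mean += self.alpha * diff
        self.var = (1 - self.alpha) * (self.var + self.alpha * diff * diff)
        return is_anomaly
```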

&lt;h2&gt;
  
  
  2. Distributed alert deduplication
&lt;/h2&gt;

&lt;p&gt;Requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple processors detecting the same anomaly&lt;/li&gt;
&lt;li&gt;Up to 3 seconds of clock skew&lt;/li&gt;
&lt;li&gt;Survive crashes&lt;/li&gt;
&lt;li&gt;Enforce a 5-second dedupe window&lt;/li&gt;
&lt;li&gt;Avoid duplicate alerts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All implementations were tested inside my actual codebase.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why this experiment matters
&lt;/h1&gt;

&lt;p&gt;This was not about ranking models. It was about understanding their behavior where it actually matters: real systems with real traffic.&lt;/p&gt;

&lt;p&gt;Some observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Architectural intelligence is not the same as production safety&lt;/li&gt;
&lt;li&gt;Minimal designs often outperform complex ones when load is high&lt;/li&gt;
&lt;li&gt;Defensive programming is still an essential skill, even for AI models&lt;/li&gt;
&lt;li&gt;Agentic tooling like Composio can simplify integration work dramatically&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly: model choice should be driven by the engineering problem, not leaderboard hype.&lt;/p&gt;




&lt;h1&gt;
  
  
  Claude Opus 4.5: "Let me architect this properly."
&lt;/h1&gt;

&lt;p&gt;Claude treated the task like a platform redesign.&lt;/p&gt;

&lt;p&gt;For anomaly detection, it produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A complete statistical engine&lt;/li&gt;
&lt;li&gt;Welford variance&lt;/li&gt;
&lt;li&gt;Snapshotting and serialization&lt;/li&gt;
&lt;li&gt;Configuration layers&lt;/li&gt;
&lt;li&gt;A documentation-level explanation of every component&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The architecture was genuinely impressive.&lt;/p&gt;

&lt;p&gt;Where things failed was in execution. One edge case crashed the entire service:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;current&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;       &lt;span class="c1"&gt;// previous = 0 -&amp;gt; Infinity&lt;/span&gt;
&lt;span class="nx"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toFixed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;                        &lt;span class="c1"&gt;// Crash&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After a restart, the serialized baseline was also reconstructed incorrectly, which left the system in a corrupted state.&lt;/p&gt;

&lt;p&gt;My takeaway: Claude behaves like an architect, not a production IC. The design quality is excellent, but I needed to harden the output before trusting it in a high-volume ingestion path.&lt;/p&gt;




&lt;h1&gt;
  
  
  GPT-5.1: "Let us ship something that will not break."
&lt;/h1&gt;

&lt;p&gt;Codex produced the most balanced and production-safe output in my tests.&lt;/p&gt;

&lt;p&gt;For anomaly detection it used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A straightforward O(1) update loop&lt;/li&gt;
&lt;li&gt;EWMA with no unnecessary complexity&lt;/li&gt;
&lt;li&gt;Defensive programming on every numerical operation&lt;/li&gt;
&lt;li&gt;Clean integration with my existing pipeline on the first attempt&lt;/li&gt;
&lt;/ul&gt;
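&lt;p&gt;For context, this is roughly what that defensive O(1) EWMA loop can look like. The sketch below is my own illustration (the class name, 3-sigma threshold, and warmup length are my choices, not the model's actual output):&lt;/p&gt;

```typescript
// Illustrative EWMA anomaly detector with defensive guards.
// All names and constants here are my own choices, not model output.
class EwmaDetector {
  private mean = 0;
  private variance = 0;
  private count = 0;

  constructor(
    private readonly alpha = 0.1,    // smoothing factor
    private readonly threshold = 3,  // flag points beyond ~3 sigma
    private readonly epsilon = 1e-9, // guard against a zero-scale divide
    private readonly warmup = 10     // don't flag until a baseline exists
  ) {}

  // O(1) update; returns true when `value` looks anomalous.
  update(value: number): boolean {
    if (!Number.isFinite(value)) return false; // drop NaN / Infinity inputs

    this.count++;
    if (this.count === 1) {
      this.mean = value; // seed the baseline
      return false;
    }

    const diff = value - this.mean;
    const stddev = Math.sqrt(Math.max(this.variance, 0));
    const isAnomaly =
      Math.abs(diff) > this.threshold * Math.max(stddev, this.epsilon);

    // Standard EWMA recurrences for mean and variance.
    this.mean += this.alpha * diff;
    this.variance =
      (1 - this.alpha) * (this.variance + this.alpha * diff * diff);

    return this.count <= this.warmup ? false : isAnomaly;
  }
}
```

&lt;p&gt;The &lt;code&gt;Number.isFinite&lt;/code&gt; guard is exactly the kind of check that would have caught the division-by-zero edge case from the Claude section before it propagated.&lt;/p&gt;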

&lt;p&gt;For deduplication it suggested:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A simple reservation table&lt;/li&gt;
&lt;li&gt;Postgres row-level locks with FOR UPDATE&lt;/li&gt;
&lt;li&gt;TTL cleanup&lt;/li&gt;
&lt;li&gt;Clock skew handled at the database layer&lt;/li&gt;
&lt;/ul&gt;
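&lt;p&gt;The actual suggestion used Postgres (&lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; inside a transaction), which needs a live database, but the reserve-then-expire logic can be sketched in-memory. Everything below is my own approximation of the idea, not the model's code:&lt;/p&gt;

```typescript
// In-memory approximation of the reservation-table idea: the first caller
// to reserve a fingerprint wins, and duplicates inside the TTL window are
// suppressed. In the Postgres version the check-and-insert is made atomic
// with row-level locks (SELECT ... FOR UPDATE) instead of a Map.
class ReservationTable {
  private reservations = new Map<string, number>(); // fingerprint -> expiry (ms)

  constructor(private readonly ttlMs: number) {}

  // Returns true if the caller acquired the reservation (alert should fire).
  tryReserve(fingerprint: string, now = Date.now()): boolean {
    const expiry = this.reservations.get(fingerprint);
    if (expiry !== undefined && expiry > now) return false; // duplicate
    this.reservations.set(fingerprint, now + this.ttlMs);
    return true;
  }

  // TTL cleanup, meant to run periodically.
  sweep(now = Date.now()): void {
    this.reservations.forEach((expiry, key) => {
      if (expiry <= now) this.reservations.delete(key);
    });
  }
}
```

&lt;p&gt;Handling clock skew at the database layer, as the model suggested, means using the database's &lt;code&gt;now()&lt;/code&gt; for expiry instead of each worker's local clock.&lt;/p&gt;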

&lt;p&gt;It worked on the first run without crashes or inconsistencies.&lt;/p&gt;

&lt;p&gt;My takeaway: this model behaves like a senior engineer who optimizes for reliability and fails safe by default. It was not flashy, but it was dependable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Gemini 3.0 Pro: "Let us get something clean and fast into the repo."
&lt;/h1&gt;

&lt;p&gt;Gemini felt like the fastest and most concise contributor.&lt;/p&gt;

&lt;p&gt;For anomaly detection it gave:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A compact EWMA implementation&lt;/li&gt;
&lt;li&gt;Minimal and readable code&lt;/li&gt;
&lt;li&gt;Proper epsilon checks&lt;/li&gt;
&lt;li&gt;Simple logic that was easy to review&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For alert deduplication it produced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Postgres INSERT ON CONFLICT design for atomic suppression&lt;/li&gt;
&lt;li&gt;No unnecessary layers&lt;/li&gt;
&lt;li&gt;The cleanest code to read among the three&lt;/li&gt;
&lt;/ul&gt;
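&lt;p&gt;The core of that design is a single statement. This is my own reconstruction of the pattern (table and column names are illustrative, not Gemini's actual output): the insert either claims the alert fingerprint or silently loses to an earlier row.&lt;/p&gt;

```typescript
// Builds the parameterized dedup query in the INSERT ... ON CONFLICT style.
// Table and column names are illustrative. If the query returns a row, the
// caller fires the alert; zero rows back means it was a duplicate. A
// periodic DELETE of expired rows reopens the suppression window.
interface SqlQuery {
  text: string;
  values: unknown[];
}

function suppressAlertQuery(
  fingerprint: string,
  windowSeconds: number
): SqlQuery {
  return {
    text: `INSERT INTO alert_dedup (fingerprint, expires_at)
           VALUES ($1, now() + make_interval(secs => $2))
           ON CONFLICT (fingerprint) DO NOTHING
           RETURNING fingerprint`,
    values: [fingerprint, windowSeconds],
  };
}
```

&lt;p&gt;Because the conflict is resolved inside Postgres, concurrent workers never race on the fingerprint, which is why no extra locking layer is needed.&lt;/p&gt;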

&lt;p&gt;The limitation was that some edge cases were left for me to think through manually, and the design was tied closely to Postgres.&lt;/p&gt;

&lt;p&gt;My takeaway: Gemini is an excellent rapid prototyper. It is fast, clean, and efficient. I would simply perform an extra pass before deploying it to production.&lt;/p&gt;




&lt;h1&gt;
  
  
  What I learned from running all three in a live codebase
&lt;/h1&gt;

&lt;p&gt;This experiment made something clear:&lt;/p&gt;

&lt;p&gt;Models differ in engineering philosophy, not just accuracy.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some try to design a platform&lt;/li&gt;
&lt;li&gt;Some try to ship robust production code&lt;/li&gt;
&lt;li&gt;Some try to produce fast and usable prototypes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Depending on the problem, each approach can be the best one.&lt;/p&gt;

&lt;p&gt;For my observability system, the style that emphasized correctness and integration performed best in this specific context.&lt;/p&gt;

&lt;p&gt;The architectural depth from Claude and the simplicity and speed of Gemini were also valuable.&lt;/p&gt;




&lt;h1&gt;
  
  
  Integrating Composio Tool Router
&lt;/h1&gt;

&lt;p&gt;For the Gemini branch, I also wired in Composio's Tool Router. It is essentially a unified way to give the agent access to Slack, Jira, PagerDuty, Gmail, and similar tools without hand-building each integration.&lt;/p&gt;

&lt;p&gt;A simplified version of my setup looked like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tsx"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;composioClient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ComposioClient&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;COMPOSIO_API_KEY&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;userId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;tracer-system&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;toolkits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;slack&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jira&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;pagerduty&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mcp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;composioClient&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMCPClient&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;callAgent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;agentName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;log-anomaly-alert-agent&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Anomaly detected in production...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tool Router streamlined agentic actions significantly and removed the overhead of wiring multiple third-party integrations manually.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final thoughts
&lt;/h1&gt;

&lt;p&gt;This was not a competition. It was an experiment inside a real, running observability pipeline.&lt;/p&gt;

&lt;p&gt;Three models. Same tasks. Same repository. Same constraints.&lt;/p&gt;

&lt;p&gt;Each one delivered a different tradeoff, a different strength, and a different engineering personality.&lt;/p&gt;

&lt;p&gt;If you build real systems, these differences matter more than leaderboard numbers.&lt;/p&gt;




&lt;h2&gt;
  
  
  Full Results &amp;amp; Code
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Complete analysis:&lt;/strong&gt; &lt;a href="https://composio.dev/blog/claude-4-5-opus-vs-gemini-3-pro-vs-gpt-5-codex-max-the-sota-coding-model" rel="noopener noreferrer"&gt;Read the full blog post&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This was an experimental comparison to understand model capabilities, not production deployment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>performance</category>
    </item>
    <item>
      <title>Cursor Composer 1 vs SWE 1.5 What Surprised Me Most After Testing Both</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Mon, 10 Nov 2025 13:11:25 +0000</pubDate>
      <link>https://dev.to/composiodev/cursor-composer-1-vs-swe-15-what-surprised-me-most-after-testing-both-25gh</link>
      <guid>https://dev.to/composiodev/cursor-composer-1-vs-swe-15-what-surprised-me-most-after-testing-both-25gh</guid>
      <description>&lt;p&gt;I’ve spent the last few weeks living with two of the most talked-about AI coding assistants, Cursor Composer 1 and Cognition SWE 1.5, inside real multi-service projects connected through Composio’s Rube MCP gateway.&lt;/p&gt;

&lt;p&gt;Not toy apps. Not single-file demos. Actual workflows: browser extensions, API connections, and live data running through real services.&lt;/p&gt;

&lt;p&gt;Here’s what stood out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cursor’s Secret Strength: Flow&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor still nails what it set out to do: get you to a working prototype fast.&lt;br&gt;
It keeps you in a "flow" state where ideas turn into working code almost immediately. The feedback loop feels natural, like coding with a hyperactive pair programmer who doesn’t get tired.&lt;/p&gt;

&lt;p&gt;But when the project grew past one file, that same speed started working against it. Quick fixes piled up. Error handling got messy. The MVP was done, but scaling it felt like untangling a ball of wires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SWE 1.5’s Advantage: Structure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SWE 1.5 took longer to reach the same MVP, but the code it wrote looked like something a senior engineer would hand off to a team.&lt;br&gt;
It separated logic cleanly, anticipated edge cases, and wrote comments that actually explained why things worked.&lt;/p&gt;

&lt;p&gt;When I connected it through Rube MCP to multiple services, it handled streaming events, retries, and failure cases like a pro. It wasn’t flashy, but it was quietly solid.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Surprised Me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Error recovery&lt;/u&gt;: SWE 1.5 caught and retried partial SSE events automatically. Cursor often just… stopped.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Architecture&lt;/u&gt;: SWE 1.5 created multi-file structures with clear boundaries. Cursor favored single-file speed.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Debugging&lt;/u&gt;: SWE 1.5 left breadcrumbs in logs. Cursor left mystery.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Iteration speed&lt;/u&gt;: Cursor was addictive for prototyping. SWE 1.5 rewarded patience with cleaner long-term code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Numbers&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;&lt;u&gt;Speed &amp;amp; Scaffolding:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor reached a working build in ~25 minutes (~40-50K tokens, ~$0.15-0.25) but required several debugging loops. &lt;/p&gt;

&lt;p&gt;SWE 1.5 took ~45 minutes (~55-65K tokens, ~$0.50-0.60) but fewer debugging loops (~3 vs ~6) and a more modular structure. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Architecture &amp;amp; Maintainability:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor sample: single background.js, minimal separation of concerns. Fine for MVPs but weak on error handling. &lt;/p&gt;

&lt;p&gt;SWE 1.5: multi-file (background, popup, config, proxy), strong error recovery, buffered SSE handling, fallback logic. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;u&gt;Error Handling &amp;amp; Debugging:&lt;/u&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cursor: Syntax or stream parsing errors required manual fixes.&lt;/p&gt;

&lt;p&gt;SWE 1.5: Detected root causes, implemented retries, managed partial SSE messages, clearer logs.&lt;/p&gt;
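&lt;p&gt;To make the partial-SSE point concrete, here is the buffering pattern in minimal form. This is my own sketch of the technique, not SWE 1.5's actual code:&lt;/p&gt;

```typescript
// Buffered SSE parsing: chunks can split an event anywhere, so incomplete
// trailing data stays in the buffer until the rest arrives. Names here are
// my own, written to illustrate the pattern.
function createSseBuffer(onEvent: (data: string) => void) {
  let buffer = "";
  return (chunk: string) => {
    buffer += chunk;
    // SSE events are separated by a blank line; the last split piece may be
    // a partial event, so it goes back into the buffer.
    const events = buffer.split("\n\n");
    buffer = events.pop() ?? "";
    for (const event of events) {
      for (const line of event.split("\n")) {
        if (line.startsWith("data: ")) onEvent(line.slice(6));
      }
    }
  };
}
```

&lt;p&gt;Cursor's single-file version tended to parse each chunk in isolation, which is exactly where partial events made it stop.&lt;/p&gt;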

&lt;p&gt;&lt;strong&gt;The Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you want momentum, something you can see and share within an hour, Cursor Composer is still unmatched.&lt;br&gt;
If you want something you can build on top of, with fewer “why did it break?” moments, SWE 1.5 is the safer bet.&lt;/p&gt;

&lt;p&gt;Both are excellent in their lanes. But in real multi-service builds powered by Composio, structure beats speed more often than not.&lt;/p&gt;

&lt;p&gt;I’ve detailed the full experiment, metrics, and side-by-side comparisons here:&lt;br&gt;
Read the full write-up on &lt;a href="https://composio.dev/blog/cursor-composer-vs-swe-1-5" rel="noopener noreferrer"&gt;the Composio blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious, have you tried building real integrations with these assistants (or others like Devin or Aider)?&lt;br&gt;
What patterns or failure modes have you noticed?&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>api</category>
    </item>
    <item>
      <title>10 Claude Skills that actually changed how I work</title>
      <dc:creator>VK</dc:creator>
      <pubDate>Thu, 06 Nov 2025 11:06:24 +0000</pubDate>
      <link>https://dev.to/composiodev/10-claude-skills-that-actually-changed-how-i-work-2b58</link>
      <guid>https://dev.to/composiodev/10-claude-skills-that-actually-changed-how-i-work-2b58</guid>
      <description>&lt;p&gt;Okay so Skills dropped last month and I've been testing them nonstop. Some are genuinely useful, others are kinda whatever. Here's what I actually use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rube MCP Connector (community skill)&lt;/strong&gt; - This one's wild. Connect Claude to like 500 apps (Slack, GitHub, Notion, etc) through ONE server instead of setting up auth for each one separately. Saves so much time if you're doing automation stuff.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Superpowers&lt;/strong&gt; - obra's dev toolkit. Has /brainstorm, /write-plan, /execute-plan commands that basically turn Claude into a proper dev workflow instead of just a chatbot. Game changer if you're coding seriously.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Suite&lt;/strong&gt; - Official one. Makes Claude actually good at Word/Excel/PowerPoint/PDF. Not just reading them but ACTUALLY creating proper docs with formatting, formulas, all that. Built-in for Pro users.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Theme Factory&lt;/strong&gt; - Upload your brand guidelines once, every artifact Claude makes follows your colors/fonts automatically. Marketing teams will love this.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Algorithmic Art&lt;/strong&gt; - p5.js generative art but you just describe it. "Blue-purple gradient flow field, 5000 particles, seed 42" and boom, reproducible artwork. Creative coders eating good.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Slack GIF Creator&lt;/strong&gt; - Custom animated GIFs optimized for Slack. Instead of searching Giphy, just tell Claude what you want. Weirdly fun.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Webapp Testing&lt;/strong&gt; - Playwright automation. Tell Claude "test the login flow" and it writes + runs the tests. QA engineers this is for you.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;MCP Builder&lt;/strong&gt; - Generates MCP server boilerplate. If you're building custom integrations, this cuts setup time by like 80%.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Brand Guidelines&lt;/strong&gt; - Similar to Theme Factory but handles multiple brands. Switch between them easily.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Systematic Debugging&lt;/strong&gt; - Makes Claude debug like a senior dev. Root cause → hypotheses → fixes → documentation. No more random stabbing.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Quick thoughts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Skills are just markdown files with YAML metadata (super easy to make your own)&lt;/li&gt;
&lt;li&gt;They're token-efficient (~30-50 tokens until loaded)&lt;/li&gt;
&lt;li&gt;They work across Claude.ai, Claude Code, and the API&lt;/li&gt;
&lt;li&gt;Community ones on GitHub are hit or miss, so use at your own risk&lt;/li&gt;
&lt;/ul&gt;
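&lt;p&gt;To show how light the format is, here is a complete (hypothetical) skill, a single SKILL.md whose YAML frontmatter carries the &lt;code&gt;name&lt;/code&gt; and &lt;code&gt;description&lt;/code&gt; Claude uses to decide when to load it:&lt;/p&gt;

```markdown
---
name: commit-message-helper
description: Writes a conventional-commit message from the staged diff.
---

# Commit Message Helper

When asked for a commit message, read the staged diff and produce one
conventional-commit subject line (type(scope): summary) plus an optional
body explaining the why, not the what.
```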

&lt;p&gt;The Rube connector and Superpowers are my daily drivers now. Document Suite is clutch when clients send weird file formats.&lt;/p&gt;

&lt;p&gt;Anyone else trying these? What am I missing?&lt;/p&gt;

&lt;p&gt;Resources:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/ComposioHQ/awesome-claude-skills" rel="noopener noreferrer"&gt;Claude Skills repo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/obra/superpowers" rel="noopener noreferrer"&gt;Superpowers&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://rube.app/" rel="noopener noreferrer"&gt;Rube&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>productivity</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
