Over the last week, three of the biggest coding-focused AI models dropped almost back to back:
- Claude Opus 4.5
- GPT-5.1
- Gemini 3.0 Pro
Everyone has been posting charts, benchmarks, and SWE-bench numbers. Those do not tell me much about how these models behave when dropped into a real codebase with real constraints, real logs, real edge cases, and real integrations.
So I decided to test them in my own system.
I took the exact same two engineering problems from my observability platform and asked each model to implement them directly inside my repository. No special prep, no fine-tuning, no scaffolding. Just: "Here is the context. Build it."
This is what happened.
TL;DR — Quick Results
| Model | Total Cost | Time | What It's Good For |
|---|---|---|---|
| Gemini 3 Pro | $0.25 | Fastest (~5–6m) | Fast prototyping, creative solutions |
| GPT-5.1 Codex | $0.51 | Medium (~5–6m) | Production-ready code that integrates cleanly |
| Claude Opus 4.5 | $1.76 | Slowest (~12m) | Deep architecture, system design |
What I tested (identical for all models)
I asked all three models to implement the same two core components from my system.
1. Statistical anomaly detection
Requirements (a rough sketch of what this looks like follows the list):
- Learn baseline error rates
- Use EWMA and z-scores
- Detect spikes of roughly 5x the learned baseline error rate
- Handle more than 100,000 logs per minute
- Do not crash on NaN, Infinity, or division by zero
- Adapt as the system evolves
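To make that spec concrete, here is a bare-bones sketch of the kind of detector I was asking for. This is my own illustration, not any model's output, and the smoothing factor, warm-up count, and thresholds are placeholder values:

```typescript
// Bare-bones EWMA baseline with z-score spike detection. Illustration only:
// the smoothing factor, warm-up count, and thresholds are placeholder values.
const ALPHA = 0.05;        // EWMA smoothing factor
const EPSILON = 1e-9;      // guards every division against zero
const Z_THRESHOLD = 4;     // how many deviations count as a spike
const SPIKE_RATIO = 5;     // roughly 5x the learned baseline

class ErrorRateBaseline {
  private mean = 0;
  private variance = 0;
  private samples = 0;

  // O(1) per data point, so 100,000+ logs per minute is just arithmetic.
  observe(errorRate: number): { isAnomaly: boolean; zScore: number } {
    if (!Number.isFinite(errorRate) || errorRate < 0) {
      return { isAnomaly: false, zScore: 0 }; // never let NaN/Infinity poison the baseline
    }

    const stdDev = Math.sqrt(this.variance);
    const deviation = errorRate - this.mean;
    const zScore = deviation / (stdDev + EPSILON);
    const ratio = errorRate / (this.mean + EPSILON);
    const isAnomaly =
      this.samples > 100 && zScore > Z_THRESHOLD && ratio > SPIKE_RATIO;

    // EWMA update keeps the baseline adapting as the system evolves.
    this.samples += 1;
    this.mean += ALPHA * deviation;
    this.variance = (1 - ALPHA) * (this.variance + ALPHA * deviation * deviation);

    return { isAnomaly, zScore };
  }
}
```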
2. Distributed alert deduplication
Requirements (a framing sketch follows the list):
- Handle multiple processors detecting the same anomaly
- Up to 3 seconds of clock skew
- Survive crashes
- Enforce a 5-second dedupe window
- Avoid duplicate alerts
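One way to frame this, purely as an illustration and not any model's design: key the alert on what it is rather than when it happened, and enforce the time window against a single clock so the skew between processors stops mattering.

```typescript
// Illustration only: placeholder shape for the alerts the processors emit.
interface AnomalyAlert {
  service: string;     // e.g. 'checkout-api'
  anomalyType: string; // e.g. 'error-rate-spike'
  observedAt: Date;    // local processor time, skewed by up to ~3 seconds
}

// The dedupe key deliberately excludes the timestamp; the 5-second window is
// enforced against one clock (the database's), which absorbs the skew.
function dedupeKey(alert: AnomalyAlert): string {
  return `${alert.service}:${alert.anomalyType}`;
}
```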
All implementations were tested inside my actual codebase.
Why this experiment matters
This was not about ranking models. It was about understanding their behavior where it actually matters: real systems with real traffic.
Some observations:
- Architectural intelligence is not the same as production safety
- Minimal designs often outperform complex ones when load is high
- Defensive programming is still an essential skill, even for AI models
- Agentic tooling like Composio can simplify integration work dramatically
Most importantly: model choice should be driven by the engineering problem, not leaderboard hype.
Claude Opus 4.5: "Let me architect this properly."
Claude treated the task like a platform redesign.
For anomaly detection, it produced:
- A complete statistical engine
- Welford variance (sketched below for reference)
- Snapshotting and serialization
- Configuration layers
- A documentation-level explanation of every component
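For reference, Welford's algorithm computes a running mean and variance in a single numerically stable pass. A bare-bones version (my sketch, not Claude's actual code) looks like this:

```typescript
// Welford's online algorithm: one pass, O(1) memory, numerically stable.
class RunningStats {
  private count = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the current mean

  push(x: number): void {
    if (!Number.isFinite(x)) return; // keep NaN/Infinity out of the accumulators
    this.count += 1;
    const delta = x - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (x - this.mean);
  }

  get variance(): number {
    return this.count > 1 ? this.m2 / (this.count - 1) : 0; // sample variance
  }
}
```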
The architecture was genuinely impressive.
Where things failed was in execution. One edge case crashed the entire service:
```typescript
const ratio = current / previous; // previous = 0 -> ratio = Infinity
ratio.toFixed(2);                 // the non-finite value propagated and crashed the service
```
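The fix is small. A guard along these lines, which is my patch sketch rather than anything Claude produced, keeps the formatting path safe:

```typescript
// Defensive version of the same snippet: never divide by zero, never format a
// non-finite number. `current` and `previous` are the same values as above.
const EPSILON = 1e-9;
const ratio = current / Math.max(previous, EPSILON);
const display = Number.isFinite(ratio) ? ratio.toFixed(2) : 'n/a';
```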
After a restart, the serialized baseline was also reconstructed incorrectly, which left the system in a corrupted state.
My takeaway: Claude behaves like an architect, not a production IC. The design quality is excellent, but I needed to harden the output before trusting it in a high-volume ingestion path.
GPT-5.1: "Let's ship something that won't break."
Codex produced the most balanced and production-safe output in my tests.
For anomaly detection it used:
- A straightforward O(1) update loop
- EWMA with no unnecessary complexity
- Defensive programming on every numerical operation
- Clean integration with my existing pipeline on the first attempt
For deduplication it suggested:
- A simple reservation table
- Postgres row-level locks with FOR UPDATE (sketched after this list)
- TTL cleanup
- Clock skew handled at the database layer
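Here is my reconstruction of that shape, not the model's verbatim output, with a placeholder alert_reservations table and TTL cleanup of stale rows omitted:

```typescript
// Assumed schema: alert_reservations(dedupe_key text PRIMARY KEY, reserved_at timestamptz NOT NULL).
import { Pool } from 'pg';

export async function tryReserveAlert(pool: Pool, dedupeKey: string): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // Make sure a row exists so there is always something to lock.
    await client.query(
      `INSERT INTO alert_reservations (dedupe_key, reserved_at)
       VALUES ($1, to_timestamp(0))
       ON CONFLICT (dedupe_key) DO NOTHING`,
      [dedupeKey],
    );

    // The row lock serializes concurrent processors; using the database clock
    // sidesteps the ~3 seconds of skew between them.
    const { rows } = await client.query(
      `SELECT now() - reserved_at < interval '5 seconds' AS suppressed
       FROM alert_reservations
       WHERE dedupe_key = $1
       FOR UPDATE`,
      [dedupeKey],
    );

    if (rows[0].suppressed) {
      await client.query('COMMIT');
      return false; // another processor already alerted inside the window
    }

    await client.query(
      'UPDATE alert_reservations SET reserved_at = now() WHERE dedupe_key = $1',
      [dedupeKey],
    );
    await client.query('COMMIT');
    return true; // this processor owns the alert
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```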
It worked on the first run without crashes or inconsistencies.
My takeaway: this model behaves like a senior engineer who optimizes for reliability and failsafe conditions. It was not flashy but it was dependable.
Gemini 3.0 Pro: "Let's get something clean and fast into the repo."
Gemini felt like the fastest and most concise contributor.
For anomaly detection it gave:
- A compact EWMA implementation
- Minimal and readable code
- Proper epsilon checks
- Simple logic that was easy to review
For alert deduplication it produced:
- A Postgres INSERT ON CONFLICT design for atomic suppression (sketched below)
- No unnecessary layers
- The cleanest code to read among the three
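For comparison, the single-statement version of that idea, again my sketch rather than Gemini's code, using the same placeholder table and pool as above:

```typescript
// One atomic statement: insert the reservation, or refresh it only if the
// previous alert for this key is older than the 5-second window.
const result = await pool.query(
  `INSERT INTO alert_reservations (dedupe_key, reserved_at)
   VALUES ($1, now())
   ON CONFLICT (dedupe_key) DO UPDATE
     SET reserved_at = now()
     WHERE alert_reservations.reserved_at < now() - interval '5 seconds'
   RETURNING dedupe_key`,
  [dedupeKey],
);

// A returned row means we won the reservation; zero rows means suppress the duplicate.
const shouldAlert = result.rows.length === 1;
```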
The limitation was that some edge cases were left for me to think through manually, and the design was tied closely to Postgres.
My takeaway: Gemini is an excellent rapid prototyper. It is fast, clean, and efficient. I would simply perform an extra pass before deploying it to production.
What I learned from running all three in a live codebase
This experiment made something clear:
Models differ in engineering philosophy, not just accuracy.
- Some try to design a platform
- Some try to ship robust production code
- Some try to produce fast and usable prototypes
Depending on the problem, each approach can be the best one.
For my observability system specifically, the style that emphasized correctness and clean integration performed best.
The architectural depth from Claude and the simplicity and speed of Gemini were also valuable.
Integrating Composio Tool Router
For the Gemini branch, I also wired in Composio's Tool Router. It is essentially a unified way to give the agent access to Slack, Jira, PagerDuty, Gmail, and similar tools without hand-building each integration.
A simplified version of my setup looked like this:
```typescript
// Unified client scoped to the toolkits the alert agent needs.
const composioClient = new ComposioClient({
  apiKey: process.env.COMPOSIO_API_KEY!,
  userId: 'tracer-system',
  toolkits: ['slack', 'jira', 'pagerduty'],
});

// Expose those tools to the agent over MCP and route the alert through it.
const mcp = await composioClient.createMCPClient();

await mcp.callAgent({
  agentName: 'log-anomaly-alert-agent',
  input: 'Anomaly detected in production...',
});
```
Tool Router streamlined agentic actions significantly and removed the overhead of wiring multiple third-party integrations manually.
Final thoughts
This was not a competition. It was an experiment inside a real, running observability pipeline.
Three models.
Same tasks.
Same repository.
Same constraints.
Each one delivered a different tradeoff, a different strength, and a different engineering personality.
If you build real systems, these differences matter more than leaderboard numbers.
Full Results & Code
Complete analysis: Read the full blog post
Note: This was an experimental comparison to understand model capabilities, not production deployment.