Varshith Krishna for Composio

Originally published at composio.dev

I tested the top 3 AI coding models on real engineering problems. The results surprised me.

Over the last week, three of the biggest coding-focused AI models dropped almost back to back:

  • Claude Opus 4.5
  • GPT-5.1
  • Gemini 3.0 Pro

Everyone has been posting charts, benchmarks, and SWE-bench numbers. Those do not tell me much about how these models behave when dropped into a real codebase with real constraints, real logs, real edge cases, and real integrations.

So I decided to test them in my own system.

I took the exact same two engineering problems from my observability platform and asked each model to implement them directly inside my repository. No special prep, no fine-tuning, no scaffolding. Just: "Here is the context. Build it."

This is what happened.

TL;DR — Quick Results

Model           | Total Cost | Time            | What It's Good For
----------------|------------|-----------------|----------------------------------------------
Gemini 3 Pro    | $0.25      | Fastest (~5–6m) | Fast prototyping, creative solutions
GPT-5.1 Codex   | $0.51      | Medium (~5–6m)  | Production-ready code that integrates cleanly
Claude Opus 4.5 | $1.76      | Slowest (~12m)  | Deep architecture, system design

What I tested (identical for all models)

I asked all three models to build the same two core components of my system.

1. Statistical anomaly detection

Requirements:

  • Learn baseline error rates
  • Use EWMA and z-scores (see the sketch after this list)
  • Detect spikes of roughly 5x the baseline
  • Handle more than 100,000 logs per minute
  • Do not crash on NaN, Infinity, or division by zero
  • Adapt as the system evolves
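
To make the first requirement concrete, here is a minimal sketch of the kind of detector this spec describes. The class name, the alpha value, and the 3-sigma/5x thresholds are my own illustrative choices, not any model's output:

// Minimal EWMA + z-score baseline (illustrative sketch, not any model's output).
// ALPHA controls how quickly the baseline adapts; EPSILON guards near-zero divisions.
const ALPHA = 0.1;
const EPSILON = 1e-9;

class ErrorRateBaseline {
  private mean = 0;
  private variance = 0;
  private initialized = false;

  // Feed one errors-per-minute sample; returns true when it looks like a spike.
  update(rate: number): boolean {
    if (!Number.isFinite(rate)) return false; // ignore NaN and Infinity inputs

    if (!this.initialized) {
      this.mean = rate;
      this.initialized = true;
      return false;
    }

    const deviation = rate - this.mean;
    const zScore = deviation / Math.max(Math.sqrt(this.variance), EPSILON);
    const ratio = rate / Math.max(this.mean, EPSILON); // never divide by zero

    // Update the exponentially weighted mean and variance after scoring the sample.
    this.mean += ALPHA * deviation;
    this.variance = (1 - ALPHA) * (this.variance + ALPHA * deviation * deviation);

    // Flag roughly 5x spikes that are also more than 3 sigma above the baseline.
    return zScore > 3 && ratio >= 5;
  }
}

The point is the shape of the problem: a constant-time update per sample, with an explicit guard around every division.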

2. Distributed alert deduplication

Requirements:

  • Multiple processors may detect the same anomaly
  • Tolerate up to 3 seconds of clock skew
  • Survive crashes
  • Enforce a 5-second dedupe window
  • Avoid duplicate alerts

All implementations were tested inside my actual codebase.


Why this experiment matters

This was not about ranking models. It was about understanding their behavior where it actually matters: real systems with real traffic.

Some observations:

  • Architectural intelligence is not the same as production safety
  • Minimal designs often outperform complex ones when load is high
  • Defensive programming is still an essential skill, even for AI models
  • Agentic tooling like Composio can simplify integration work dramatically

Most importantly: model choice should be driven by the engineering problem, not leaderboard hype.


Claude Opus 4.5: "Let me architect this properly."

Claude treated the task like a platform redesign.

For anomaly detection, it produced:

  • A complete statistical engine
  • Welford's online variance algorithm (a generic sketch follows this list)
  • Snapshotting and serialization
  • Configuration layers
  • A documentation-level explanation of every component
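
For context, Welford's algorithm computes a running mean and variance in a single pass, which is what makes it attractive for streaming baselines. This is a generic textbook sketch, not Claude's actual code:

// Generic Welford online mean/variance (textbook sketch, not Claude's code).
class WelfordStats {
  private count = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the mean

  push(x: number): void {
    this.count += 1;
    const delta = x - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (x - this.mean);
  }

  variance(): number {
    return this.count > 1 ? this.m2 / (this.count - 1) : 0;
  }
}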

Claude's architecture was genuinely impressive.

Where it fell down was execution. One edge case crashed the entire service:

const ratio = current / previous;       // previous = 0 -> Infinity
ratio.toFixed(2);                        // Crash

After a restart, the serialized baseline was also reconstructed incorrectly, which left the system in a corrupted state.

My takeaway: Claude behaves like an architect, not a production IC. The design quality is excellent, but I needed to harden the output before trusting it in a high-volume ingestion path.


GPT-5.1: "Let us ship something that will not break."

Codex produced the most balanced and production-safe output in my tests.

For anomaly detection it used:

  • A straightforward O(1) update loop
  • EWMA with no unnecessary complexity
  • Defensive programming on every numerical operation
  • Clean integration with my existing pipeline on the first attempt

For deduplication it suggested (sketched after this list):

  • A simple reservation table
  • Postgres row-level locks with FOR UPDATE
  • TTL cleanup
  • Clock skew handled at the database layer
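
Translated into code, that reservation pattern looks roughly like the sketch below. The table name, column names, and node-postgres usage are my own stand-ins; GPT-5.1's actual schema differed in the details:

// Rough sketch of the reservation-table pattern (hypothetical table and names).
// Assumed table: alert_reservations(fingerprint TEXT PRIMARY KEY, reserved_at TIMESTAMPTZ NOT NULL)
// TTL cleanup of stale reservations is omitted from this sketch.
import { Pool } from 'pg';

const pool = new Pool();

async function shouldSendAlert(fingerprint: string): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // Seed the row far in the past so a brand-new fingerprint always alerts.
    await client.query(
      `INSERT INTO alert_reservations (fingerprint, reserved_at)
       VALUES ($1, to_timestamp(0))
       ON CONFLICT (fingerprint) DO NOTHING`,
      [fingerprint],
    );

    // FOR UPDATE: concurrent processors for the same fingerprint serialize here.
    // Comparing against now() in the database sidesteps processor clock skew.
    const { rows } = await client.query(
      `SELECT reserved_at < now() - interval '5 seconds' AS expired
       FROM alert_reservations
       WHERE fingerprint = $1
       FOR UPDATE`,
      [fingerprint],
    );

    if (!rows[0].expired) {
      await client.query('COMMIT');
      return false; // someone already alerted inside the dedupe window
    }

    await client.query(
      'UPDATE alert_reservations SET reserved_at = now() WHERE fingerprint = $1',
      [fingerprint],
    );
    await client.query('COMMIT');
    return true;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}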

GPT-5.1's implementation worked on the first run without crashes or inconsistencies.

My takeaway: this model behaves like a senior engineer who optimizes for reliability and fail-safe behavior. It was not flashy, but it was dependable.


Gemini 3.0 Pro: "Let us get something clean and fast into the repo."

Gemini felt like the fastest and most concise contributor.

For anomaly detection it gave:

  • A compact EWMA implementation
  • Minimal and readable code
  • Proper epsilon checks
  • Simple logic that was easy to review

For alert deduplication it produced (sketched after this list):

  • A Postgres INSERT ON CONFLICT design for atomic suppression
  • No unnecessary layers
  • The cleanest code to read among the three
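
That atomic-suppression idea boils down to a single statement. The sketch below is my paraphrase of the pattern with a hypothetical table name, not Gemini's exact SQL:

// Paraphrase of the INSERT ... ON CONFLICT suppression pattern (hypothetical names).
// Assumed table: alert_suppressions(fingerprint TEXT PRIMARY KEY, alerted_at TIMESTAMPTZ NOT NULL)
import { Pool } from 'pg';

const pool = new Pool();

async function claimAlert(fingerprint: string): Promise<boolean> {
  // New fingerprints insert; existing ones only update once the previous alert
  // is older than the 5-second window. Otherwise zero rows come back.
  const result = await pool.query(
    `INSERT INTO alert_suppressions (fingerprint, alerted_at)
     VALUES ($1, now())
     ON CONFLICT (fingerprint) DO UPDATE
       SET alerted_at = now()
       WHERE alert_suppressions.alerted_at < now() - interval '5 seconds'
     RETURNING fingerprint`,
    [fingerprint],
  );
  return (result.rowCount ?? 0) > 0; // true -> this processor owns the alert
}

One statement, one round trip, and the database clock decides the window.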

The limitation of Gemini's version was that some edge cases were left for me to think through manually, and the design was tied closely to Postgres.

My takeaway: Gemini is an excellent rapid prototyper. It is fast, clean, and efficient. I would simply perform an extra pass before deploying it to production.


What I learned from running all three in a live codebase

This experiment made something clear:

Models differ in engineering philosophy, not just accuracy.

  • Some try to design a platform
  • Some try to ship robust production code
  • Some try to produce fast and usable prototypes

Depending on the problem, each approach can be the best one.

For my observability system, the style that emphasized correctness and clean integration performed best.

The architectural depth from Claude and the simplicity and speed of Gemini were also valuable.


Integrating Composio Tool Router

For the Gemini branch, I also wired in Composio's Tool Router. It is essentially a unified way to give the agent access to Slack, Jira, PagerDuty, Gmail, and similar tools without hand-building each integration.

A simplified version of my setup looked like this:

const composioClient = new ComposioClient({
  apiKey: process.env.COMPOSIO_API_KEY!,
  userId: 'tracer-system',
  toolkits: ['slack', 'jira', 'pagerduty'],
});

const mcp = await composioClient.createMCPClient();

await mcp.callAgent({
  agentName: 'log-anomaly-alert-agent',
  input: 'Anomaly detected in production...',
});

Tool Router streamlined agentic actions significantly and removed the overhead of wiring multiple third-party integrations manually.


Final thoughts

This was not a competition. It was an experiment inside a real, running observability pipeline.

Three models.

Same tasks.

Same repository.

Same constraints.

Each one delivered a different tradeoff, a different strength, and a different engineering personality.

If you build real systems, these differences matter more than leaderboard numbers.


Full Results & Code

Complete analysis: Read the full blog post

Note: This was an experimental comparison to understand model capabilities, not production deployment.
