Over the last week, three of the biggest coding-focused AI models dropped almost back to back:
- Claude Opus 4.5
- GPT-5.1
- Gemini 3.0 Pro
Everyone has been posting charts, benchmarks, and SWE-bench numbers. Those do not tell me much about how these models behave when dropped into a real codebase with real constraints, real logs, real edge cases, and real integrations.
So I decided to test them in my own system.
I took the exact same two engineering problems from my observability platform and asked each model to implement them directly inside my repository. No special prep, no fine-tuning, no scaffolding. Just: "Here is the context. Build it."
This is what happened.
TL;DR — Quick Results
| Model | Total Cost | Time | What It's Good For |
|---|---|---|---|
| Gemini 3 Pro | $0.25 | Fastest (~5–6m) | Fast prototyping, creative solutions |
| GPT-5.1 Codex | $0.51 | Medium (~5–6m) | Production-ready code that integrates cleanly |
| Claude Opus 4.5 | $1.76 | Slowest (~12m) | Deep architecture, system design |
What I tested (identical for all models)
I asked all three models to implement the same two core components from my system.
1. Statistical anomaly detection
Requirements (a rough sketch of what this looks like follows the list):
- Learn baseline error rates
- Use EWMA and z-scores
- Detect spikes of roughly 5x the learned baseline error rate
- Handle more than 100,000 logs per minute
- Do not crash on NaN, Infinity, or division by zero
- Adapt as the system evolves
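To make that spec concrete, here is a bare-bones sketch of the kind of detector I was asking for. This is my own illustration, not any model's output, and the smoothing factor, warm-up count, and thresholds are placeholder values:

```typescript
// Bare-bones EWMA baseline with z-score spike detection. Illustration only:
// the smoothing factor, warm-up count, and thresholds are placeholder values.
const ALPHA = 0.05;        // EWMA smoothing factor
const EPSILON = 1e-9;      // guards every division against zero
const Z_THRESHOLD = 4;     // how many deviations count as a spike
const SPIKE_RATIO = 5;     // roughly 5x the learned baseline

class ErrorRateBaseline {
  private mean = 0;
  private variance = 0;
  private samples = 0;

  // O(1) per data point, so 100,000+ logs per minute is just arithmetic.
  observe(errorRate: number): { isAnomaly: boolean; zScore: number } {
    if (!Number.isFinite(errorRate) || errorRate < 0) {
      return { isAnomaly: false, zScore: 0 }; // never let NaN/Infinity poison the baseline
    }

    const stdDev = Math.sqrt(this.variance);
    const deviation = errorRate - this.mean;
    const zScore = deviation / (stdDev + EPSILON);
    const ratio = errorRate / (this.mean + EPSILON);
    const isAnomaly =
      this.samples > 100 && zScore > Z_THRESHOLD && ratio > SPIKE_RATIO;

    // EWMA update keeps the baseline adapting as the system evolves.
    this.samples += 1;
    this.mean += ALPHA * deviation;
    this.variance = (1 - ALPHA) * (this.variance + ALPHA * deviation * deviation);

    return { isAnomaly, zScore };
  }
}
```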
2. Distributed alert deduplication
Requirements (a framing sketch follows the list):
- Handle multiple processors detecting the same anomaly
- Up to 3 seconds of clock skew
- Survive crashes
- Enforce a 5-second dedupe window
- Avoid duplicate alerts
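One way to frame this, purely as an illustration and not any model's design: key the alert on what it is rather than when it happened, and enforce the time window against a single clock so the skew between processors stops mattering.

```typescript
// Illustration only: placeholder shape for the alerts the processors emit.
interface AnomalyAlert {
  service: string;     // e.g. 'checkout-api'
  anomalyType: string; // e.g. 'error-rate-spike'
  observedAt: Date;    // local processor time, skewed by up to ~3 seconds
}

// The dedupe key deliberately excludes the timestamp; the 5-second window is
// enforced against one clock (the database's), which absorbs the skew.
function dedupeKey(alert: AnomalyAlert): string {
  return `${alert.service}:${alert.anomalyType}`;
}
```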
All implementations were tested inside my actual codebase.
Why this experiment matters
This was not about ranking models. It was about understanding their behavior where it actually matters: real systems with real traffic.
Some observations:
- Architectural intelligence is not the same as production safety
- Minimal designs often outperform complex ones when load is high
- Defensive programming is still an essential skill, even for AI models
- Agentic tooling like Composio can simplify integration work dramatically
Most importantly: model choice should be driven by the engineering problem, not leaderboard hype.
Claude Opus 4.5: "Let me architect this properly."
Claude treated the task like a platform redesign.
For anomaly detection, it produced:
- A complete statistical engine
- Welford variance (sketched below for reference)
- Snapshotting and serialization
- Configuration layers
- A documentation-level explanation of every component
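For reference, Welford's algorithm computes a running mean and variance in a single numerically stable pass. A bare-bones version (my sketch, not Claude's actual code) looks like this:

```typescript
// Welford's online algorithm: one pass, O(1) memory, numerically stable.
class RunningStats {
  private count = 0;
  private mean = 0;
  private m2 = 0; // running sum of squared deviations from the current mean

  push(x: number): void {
    if (!Number.isFinite(x)) return; // keep NaN/Infinity out of the accumulators
    this.count += 1;
    const delta = x - this.mean;
    this.mean += delta / this.count;
    this.m2 += delta * (x - this.mean);
  }

  get variance(): number {
    return this.count > 1 ? this.m2 / (this.count - 1) : 0; // sample variance
  }
}
```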
The architecture was genuinely impressive.
Where things failed was in execution. One edge case crashed the entire service:
```typescript
const ratio = current / previous; // previous = 0 -> ratio = Infinity
ratio.toFixed(2);                 // the non-finite value propagated and crashed the service
```
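The fix is small. A guard along these lines, which is my patch sketch rather than anything Claude produced, keeps the formatting path safe:

```typescript
// Defensive version of the same snippet: never divide by zero, never format a
// non-finite number. `current` and `previous` are the same values as above.
const EPSILON = 1e-9;
const ratio = current / Math.max(previous, EPSILON);
const display = Number.isFinite(ratio) ? ratio.toFixed(2) : 'n/a';
```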
After a restart, the serialized baseline was also reconstructed incorrectly, which left the system in a corrupted state.
My takeaway: Claude behaves like an architect, not a production IC. The design quality is excellent, but I needed to harden the output before trusting it in a high-volume ingestion path.
GPT-5.1: "Let's ship something that won't break."
Codex produced the most balanced and production-safe output in my tests.
For anomaly detection it used:
- A straightforward O(1) update loop
- EWMA with no unnecessary complexity
- Defensive programming on every numerical operation
- Clean integration with my existing pipeline on the first attempt
For deduplication it suggested:
- A simple reservation table
- Postgres row-level locks with FOR UPDATE (sketched after this list)
- TTL cleanup
- Clock skew handled at the database layer
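Here is my reconstruction of that shape, not the model's verbatim output, with a placeholder alert_reservations table and TTL cleanup of stale rows omitted:

```typescript
// Assumed schema: alert_reservations(dedupe_key text PRIMARY KEY, reserved_at timestamptz NOT NULL).
import { Pool } from 'pg';

export async function tryReserveAlert(pool: Pool, dedupeKey: string): Promise<boolean> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');

    // Make sure a row exists so there is always something to lock.
    await client.query(
      `INSERT INTO alert_reservations (dedupe_key, reserved_at)
       VALUES ($1, to_timestamp(0))
       ON CONFLICT (dedupe_key) DO NOTHING`,
      [dedupeKey],
    );

    // The row lock serializes concurrent processors; using the database clock
    // sidesteps the ~3 seconds of skew between them.
    const { rows } = await client.query(
      `SELECT now() - reserved_at < interval '5 seconds' AS suppressed
       FROM alert_reservations
       WHERE dedupe_key = $1
       FOR UPDATE`,
      [dedupeKey],
    );

    if (rows[0].suppressed) {
      await client.query('COMMIT');
      return false; // another processor already alerted inside the window
    }

    await client.query(
      'UPDATE alert_reservations SET reserved_at = now() WHERE dedupe_key = $1',
      [dedupeKey],
    );
    await client.query('COMMIT');
    return true; // this processor owns the alert
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}
```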
It worked on the first run without crashes or inconsistencies.
My takeaway: this model behaves like a senior engineer who optimizes for reliability and failsafe conditions. It was not flashy but it was dependable.
Gemini 3.0 Pro: "Let's get something clean and fast into the repo."
Gemini felt like the fastest and most concise contributor.
For anomaly detection it gave:
- A compact EWMA implementation
- Minimal and readable code
- Proper epsilon checks
- Simple logic that was easy to review
For alert deduplication it produced:
- A Postgres INSERT ON CONFLICT design for atomic suppression (sketched below)
- No unnecessary layers
- The cleanest code to read among the three
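For comparison, the single-statement version of that idea, again my sketch rather than Gemini's code, using the same placeholder table and pool as above:

```typescript
// One atomic statement: insert the reservation, or refresh it only if the
// previous alert for this key is older than the 5-second window.
const result = await pool.query(
  `INSERT INTO alert_reservations (dedupe_key, reserved_at)
   VALUES ($1, now())
   ON CONFLICT (dedupe_key) DO UPDATE
     SET reserved_at = now()
     WHERE alert_reservations.reserved_at < now() - interval '5 seconds'
   RETURNING dedupe_key`,
  [dedupeKey],
);

// A returned row means we won the reservation; zero rows means suppress the duplicate.
const shouldAlert = result.rows.length === 1;
```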
The limitation was that some edge cases were left for me to think through manually, and the design was tied closely to Postgres.
My takeaway: Gemini is an excellent rapid prototyper. It is fast, clean, and efficient. I would simply perform an extra pass before deploying it to production.
What I learned from running all three in a live codebase
This experiment made something clear:
Models differ in engineering philosophy, not just accuracy.
- Some try to design a platform
- Some try to ship robust production code
- Some try to produce fast and usable prototypes
Depending on the problem, each approach can be the best one.
For my observability system specifically, the style that emphasized correctness and clean integration performed best.
The architectural depth from Claude and the simplicity and speed of Gemini were also valuable.
Integrating Composio Tool Router
For the Gemini branch, I also wired in Composio's Tool Router. It is essentially a unified way to give the agent access to Slack, Jira, PagerDuty, Gmail, and similar tools without hand-building each integration.
A simplified version of my setup looked like this:
```typescript
// Unified client scoped to the toolkits the alert agent needs.
const composioClient = new ComposioClient({
  apiKey: process.env.COMPOSIO_API_KEY!,
  userId: 'tracer-system',
  toolkits: ['slack', 'jira', 'pagerduty'],
});

// Expose those tools to the agent over MCP and route the alert through it.
const mcp = await composioClient.createMCPClient();

await mcp.callAgent({
  agentName: 'log-anomaly-alert-agent',
  input: 'Anomaly detected in production...',
});
```
Tool Router streamlined agentic actions significantly and removed the overhead of wiring multiple third-party integrations manually.
Final thoughts
This was not a competition. It was an experiment inside a real, running observability pipeline.
Three models.
Same tasks.
Same repository.
Same constraints.
Each one delivered a different tradeoff, a different strength, and a different engineering personality.
If you build real systems, these differences matter more than leaderboard numbers.
Full Results & Code
Complete analysis: Read the full blog post
Note: This was an experimental comparison to understand model capabilities, not production deployment.