I Built an AI Debate Room for Cricket Captains. Here's What Multi-Agent Systems Actually Feel Like in Practice.
A hackathon build story — Gemini 2.5 Flash, four specialized agents, and the moment I realized the disagreement was the product.
The brief was simple: use the Gemini Developer API to build something interesting.
I built a tactical command room for IPL captains.
Not a dashboard. Not a prediction widget. A system where four AI agents argue with each other about field placement, bowling matchups, and dew forecasts — in real time, with the debate streaming live to the screen — until they either reach consensus or publicly admit they can't.
This is the story of how Captain Cool OS came together, what I learned about multi-agent orchestration the hard way, and the one design decision that changed how I think about AI systems entirely.
The Problem I Was Actually Solving
Cricket captains make high-stakes tactical decisions in about 30 seconds. Change the bowler, shift the fielder to deep square leg, set the field for a slower ball outside off. One wrong read of a batter — or a missed dew forecast — and a winnable final over becomes a highlight reel for the other side.
There are tools that help with this. Most of them are dashboards. They show you what happened: win probability curves, matchup history, venue averages. They hand you the data and trust you to do the reasoning.
What they don't do is reason with you.
That gap is what I wanted to close. Not by building a smarter dashboard, but by building something that behaves more like a room full of analysts who disagree with each other — and have to work it out before the next ball is bowled.
Why One LLM Call Isn't Enough
The first prototype was exactly what you'd expect: a single Gemini API call with a long system prompt, a structured JSON schema for the output, and a nicely formatted frontend to display the recommendation.
It worked. It produced sensible-sounding tactical advice. And it was almost completely useless.
The problem wasn't the quality of the output. It was that there was no way to trust it. A single model call producing a single confident answer has no mechanism for surfacing its own uncertainty. It can't tell you which part of its reasoning is load-bearing and which part is a guess. It can't show you the scenario where its plan falls apart.
It just tells you what to do.
Real analysts don't work that way. Real analysts argue. Someone proposes the plan; someone else attacks it; a third person looks for internal contradictions; and eventually — sometimes through several rounds of revision — a consensus emerges that has been stress-tested enough to act on.
That's the system I needed to build.
The Architecture: Four Agents, One Debate
Captain Cool OS runs four specialized agents in sequence, orchestrated by a central state machine:
Stats Analyst goes first. It establishes the quantitative ground truth — matchup history, current pressure score, dew severity, live win probability. Every downstream agent reasons from this baseline and cannot contradict it without engaging the deterministic tool layer.
Strategist produces Plan A. Bowling selection, field placement geometry, delivery sequencing, risk-weighted reasoning. This is the opening position in the debate.
Devil's Advocate attacks Plan A. Specifically. It is not designed to be balanced — it is designed to find the one scenario where the Strategist is wrong. Wrong batter read, wrong dew assumption, wrong boundary geometry. Its job is adversarial pressure, not nuance.
Reflection Evaluator scans for internal contradictions between the Strategist and Devil's Advocate. If the plan cannot survive scrutiny, it triggers a revision loop. The Strategist revises. The Devil's Advocate attacks again. This can go up to three rounds before consensus is forced.
Narrative Layer translates the final consensus into broadcast commentary — directive, counterfactuals, risk profile, and a voice that sounds like it belongs on air.
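The revision loop those agents run can be sketched as a small function. This is an illustration of the control flow only, not the actual MatchOrchestrator; the agent callables and return shapes are stand-ins.

```python
# Hypothetical sketch of the debate loop: the Strategist proposes, the
# Devil's Advocate attacks, the Reflection Evaluator raises structured
# contradiction flags, and the loop revises up to three rounds before
# consensus is forced.
MAX_ROUNDS = 3

def run_debate(strategist, devils_advocate, evaluator, baseline):
    plan = strategist(baseline, critique=None)          # Plan A
    for round_no in range(1, MAX_ROUNDS + 1):
        critique = devils_advocate(plan)                # adversarial pass
        flags = evaluator(plan, critique)               # contradiction flags
        if not flags:
            return {"plan": plan, "status": "consensus", "rounds": round_no}
        plan = strategist(baseline, critique=critique)  # revision loop
    return {"plan": plan, "status": "forced", "rounds": MAX_ROUNDS}
```

The key property is that the exit condition is structural — an empty list of contradiction flags — rather than any agent's self-reported confidence.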
What makes this different from just calling Gemini five times is the structured contract between agents. Every handoff uses a fully typed Pydantic v2 schema. Agents cannot produce malformed output. The Reflection Evaluator doesn't read prose — it reads structured contradiction flags that the prior agents have written in a format designed to be machine-interrogated.
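The shape of such a contract, roughly — field names here are hypothetical, not the project's actual schemas, but they show how a contradiction flag can be made machine-interrogable rather than prose:

```python
# Sketch of a typed handoff contract between agents (Pydantic v2).
# Field names are invented for illustration; the real schemas live in
# the Captain Cool OS repository.
from typing import Literal
from pydantic import BaseModel, ConfigDict, Field

class ContradictionFlag(BaseModel):
    model_config = ConfigDict(extra="forbid")    # reject unknown keys
    assumption: str                              # e.g. "dew_severity"
    severity: Literal["low", "medium", "high"]
    rationale: str

class CritiqueHandoff(BaseModel):
    model_config = ConfigDict(extra="forbid")
    agent: Literal["devils_advocate"]
    flags: list[ContradictionFlag] = Field(default_factory=list)
    confidence: float = Field(ge=0.0, le=1.0)
```

Calling `CritiqueHandoff.model_validate(raw)` raises a `ValidationError` on anything malformed, so a bad handoff fails at the boundary between agents instead of propagating downstream as prose.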
The debate is structured. The disagreement is legible. And the whole thing streams live over WebSocket so you can watch it happen.
The Design Decision That Changed Everything
Halfway through the build, I added the Confidence Telemetry Graph — an SVG visualization that tracks the system's confidence score in real time as the debate progresses.
When the Strategist and Devil's Advocate disagree and trigger a revision loop, the graph dips. It turns red. It stays red if the agents can't resolve the disagreement. It returns to green when the Reflection Evaluator certifies consensus.
I added it as a cosmetic feature. I thought it would look good in the demo.
Then I realized it was the most important thing in the product.
Here's why: the question judges (and real users) actually want answered isn't "what should I do?" It's "how sure are you?" A confident wrong answer is worse than an uncertain right one, because the confident wrong answer gets acted on.
The telemetry graph answers the trust question visually, in real time, without requiring the user to read agent logs or interpret probability scores. When the graph is red, the debate is unresolved. When it's green, the system has earned its recommendation.
This is something a single-agent system structurally cannot do. Uncertainty in a single LLM call is hidden inside the model weights. In a multi-agent system, uncertainty is a property of the debate — and if you instrument the debate correctly, uncertainty becomes visible.
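One way to instrument it, purely as an illustration — the scoring weights below are invented for the sketch, not the project's actual telemetry model:

```python
# Illustrative confidence telemetry: confidence is a function of the
# debate state (open contradiction flags, rounds consumed), not of any
# single model's self-report. The weights are placeholder values.
def debate_confidence(unresolved_flags, rounds_used, max_rounds=3):
    """Map debate state to a 0..1 score for the telemetry graph."""
    base = 1.0 - 0.25 * len(unresolved_flags)   # each open flag costs 0.25
    fatigue = 0.1 * (rounds_used / max_rounds)  # long debates erode confidence
    return max(0.0, min(1.0, base - fatigue))
```

The graph renders red whenever this score sits below a threshold and green once the flags clear — the visualization is just a readout of the debate state.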
What It Actually Looks Like Running
The Wankhede Final Over Thriller scenario — Suryakumar Yadav vs Jasprit Bumrah, heavy dew, death overs — is the clearest example of the system doing its job.
You select the scenario and hit Formulate Tactical Play. The HUD immediately starts cycling: COMPUTING → STRESS TESTING → CONTRADICTION SCAN.
The Stats Analyst comes back with a high dew severity score. The Strategist builds a plan around it — adjusted field for the wet ball, slower variations prioritized. The Devil's Advocate attacks the dew assumption directly: if the dew estimate is wrong by even one point on the severity scale, the field geometry inverts. The Reflection Evaluator flags the dew figure as a load-bearing assumption in the Strategist's plan.
The confidence graph dips red.
The Strategist revises — produces a contingency branch that works under both dew scenarios. The Devil's Advocate runs it again. The contradiction is resolved. The graph recovers.
The Tactical Field SVG renders: bowling trajectory arc from crease to impact zone, fielder positions geometrically placed, boundary geometry highlighted for the high-risk quadrants the Devil's Advocate identified. The Narrative Layer writes it up in commentary voice.
The whole process takes about 12 seconds. Every step of the debate is visible as it happens.
The Hard Parts
Gemini schema validation. The Gemini Developer API is strict about additionalProperties in JSON schemas — stricter than OpenAI's function calling and stricter than I expected. I spent an embarrassing amount of time hardening the Pydantic models before the structured output was reliable. The eventual solution: fully typed schemas in which every optional field is explicitly declared, and a validation pass before every agent handoff.
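One shape such a validation pass can take — the helper name and the example model are mine, not from the project, but the pattern is standard Pydantic v2:

```python
# Hypothetical handoff guard: parse an agent's raw JSON output against
# its typed schema before the next agent ever sees it.
import json
from pydantic import BaseModel, ConfigDict, ValidationError

class FieldPlacement(BaseModel):
    model_config = ConfigDict(extra="forbid")  # surface unexpected keys early
    position: str
    x: float
    y: float

def validate_handoff(model_cls, raw_text):
    """Return (payload, errors); a malformed handoff never leaves this function."""
    try:
        return model_cls.model_validate(json.loads(raw_text)), []
    except (json.JSONDecodeError, ValidationError) as exc:
        # The caller can re-prompt the agent with the error details
        # instead of passing bad data downstream.
        return None, [str(exc)]
```

Failing loudly at every handoff is what makes the rest of the pipeline trustworthy: a schema violation becomes a re-prompt, not a silent corruption three agents later.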
Deterministic grounding. Early versions of the system let the agents reason freely about pressure scores and dew impact. The results were confident and wrong in subtle ways that were hard to catch. The fix was building a separate deterministic tool layer — PressureEngine, DewImpact, WinProbability — that agents are given as read-only context. They reason from the numbers; they don't produce the numbers themselves.
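The shape of that tool layer, roughly — the class and function names echo the post, but the formula below is an invented placeholder, not the real pressure model:

```python
# Sketch of a deterministic tool layer. These are pure functions of the
# match state: agents read the outputs as context but never generate
# these numbers themselves. The formula is a placeholder illustration.
from dataclasses import dataclass

@dataclass(frozen=True)            # frozen: agents get read-only context
class MatchState:
    runs_needed: int
    balls_left: int
    wickets_in_hand: int

def pressure_score(state: MatchState) -> float:
    """Placeholder pressure model: required rate scaled by wickets lost."""
    required_rate = state.runs_needed / max(state.balls_left, 1)
    return round(required_rate * (10 - state.wickets_in_hand), 2)
```

The orchestrator computes these once per scenario and injects the results into every agent's prompt, so all four agents argue from the same numbers.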
WebSocket streaming architecture. FastAPI's native WebSocket support is clean, but streaming partial agent output while maintaining state machine correctness across the debate loop required careful design. The MatchOrchestrator handles state routing; the WebSocket layer handles serialization. Keeping those concerns separated was the right call and took two refactors to get there.
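That separation can be sketched framework-agnostically: the orchestrator yields typed events and knows nothing about sockets, while a thin transport layer handles serialization. In the real system the `send` callable below would wrap FastAPI's `websocket.send_text`; here it is a plain function so the sketch stands alone, and the event names are illustrative.

```python
# Sketch of the orchestrator/transport split: debate state on one side,
# JSON-over-the-wire on the other.
import asyncio
import json

async def orchestrate():
    """Yield debate events; knows nothing about WebSockets."""
    for phase in ("COMPUTING", "STRESS TESTING", "CONTRADICTION SCAN"):
        yield {"type": "phase", "value": phase}
        await asyncio.sleep(0)          # yield control, as real agent calls would
    yield {"type": "consensus", "confidence": 0.9}

def serialize(event) -> str:
    """Transport concern only: turn an event into a wire-format frame."""
    return json.dumps(event, separators=(",", ":"))

async def stream(send):
    async for event in orchestrate():
        await send(serialize(event))    # in FastAPI: await ws.send_text(...)
```

Because the orchestrator never touches serialization, the state machine can be unit-tested without a socket — which is what made the two refactors survivable.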
Timeout handling. Gemini 2.5 Flash is fast, but multi-agent chains compound latency. An 8-second timeout per agent with graceful degradation to simulation mode was the right balance — it meant the demo never hung, even on a flaky connection.
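The pattern is small enough to show in full — the 8-second budget matches the post, while the fallback callable is a stand-in for simulation mode:

```python
# Sketch of per-agent timeout with graceful degradation: a slow or
# failed model call falls back to a pre-canned response instead of
# hanging the demo.
import asyncio

async def call_with_fallback(agent, fallback, timeout_s=8.0):
    """Run one agent call; on timeout, degrade rather than hang."""
    try:
        return await asyncio.wait_for(agent(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback()   # stand-in for simulation-mode output
```

Because the timeout is enforced per agent rather than per debate, one slow call degrades one step of the chain instead of the whole run.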
What Multi-Agent Systems Are Actually Good For
The honest answer, after building this: multi-agent systems are not better than single-agent systems at most things.
They are better at one specific thing: decisions where uncertainty is as important as the answer itself.
If you need a summary, a draft, or a classification, one well-prompted model call is faster, cheaper, and simpler. Multi-agent overhead is real.
But if you need a decision that will be acted on under pressure, by a person who needs to trust it — a tactical call, a clinical recommendation, a financial judgment — then the debate is not overhead. The debate is the product. The disagreement between agents surfaces the scenarios where the plan fails before it gets acted on.
Captain Cool OS is a cricket system. But the architecture is not cricket-specific. Any domain where confident wrong answers are dangerous and visible uncertainty is valuable is a domain where this pattern applies.
What's Next
The system is demo-grade right now. Three pre-loaded scenarios, a hardened WebSocket backend, a custom SVG renderer that does more than it looks like it should.
The direction I'm most interested in: integrating real match data so the Stats Analyst operates on live ball-by-ball feeds rather than scenario presets. The agent debate architecture doesn't change — only the input changes. That's the point of building the orchestration layer correctly from the start.
If you're building in the multi-agent space and want to talk about structured debate architectures, agent schema design, or how to make uncertainty visible in AI systems, I'm around.
Captain Cool OS was built for the Google Developer Groups AI Hackathon 2026. The full architecture, agent design, and setup instructions are in the GitHub repository.
Powered by Gemini 2.5 Flash. Four agents. One debate. Thirty seconds.