DEV Community

Jack Elston

Posted on • Originally published at madness.elstonj.com

I Gave 6 AI Models the Same March Madness Bracket. They All Agreed.

The guy who flew drones into tornadoes used 6 AI models to predict March Madness — and discovered they all think the same thing.

I spent last night doing what any engineer would do during March Madness: building a prediction system that pits Claude, GPT-4o, Gemini, Grok, Llama, and DeepSeek against each other.

The question was not just "who wins the tournament" — it was "do these models actually think differently, or are they all trained on the same data and producing the same outputs?"

The Setup

6 frontier AI models from 6 different companies:

| Model | Provider | Cost |
| --- | --- | --- |
| Claude Sonnet 4 | Anthropic | ~$5-10 |
| GPT-4o | OpenAI | $0.27 |
| Gemini 2.5 Flash | Google | ~$0.05 |
| Grok 3 Mini | xAI | ~$0.10 |
| Llama 3.3 70B | Meta (via Groq) | $0.00 |
| DeepSeek Chat | DeepSeek AI | $0.01 |

Each model independently predicted all 32 first-round games. Then we ran Monte Carlo sensitivity analysis, cross-model adversarial debate, ML calibration on 10 years of data, and KenPom integration.
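The sensitivity-analysis idea can be sketched in a few lines: jitter each estimated win probability and count how often the pick flips. The probabilities and noise level below are illustrative stand-ins, not the actual pipeline's values.

```python
import random

def pick_stability(p_win: float, noise: float = 0.05, trials: int = 10_000) -> float:
    """Fraction of Monte Carlo trials in which the favored pick survives
    a random Gaussian perturbation of the estimated win probability."""
    baseline = p_win >= 0.5
    survived = 0
    for _ in range(trials):
        perturbed = p_win + random.gauss(0, noise)
        if (perturbed >= 0.5) == baseline:
            survived += 1
    return survived / trials

random.seed(42)
print(pick_stability(0.99))  # a 1-vs-16 pick: rock solid, close to 1.0
print(pick_stability(0.52))  # an 8-vs-9 coin flip: noticeably shakier
```

Games whose stability drops well below 1.0 are the ones worth a second look; that is the whole point of sensitivity analysis here.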

Total: 1,300+ API calls. ~$15.

The Big Finding

80.6% of predictions were identical across all six models.

Adding models 4, 5, and 6 changed zero bracket picks. DeepSeek at $0.01 produced the same predictions as GPT-4o at $0.27 — a 27x cost difference for the same output.
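Measuring that agreement rate is straightforward: line up each model's picks game by game and count the unanimous ones. The team names below are toy data, not the real bracket.

```python
# Hypothetical first-round picks from each model (illustrative only).
picks = {
    "claude":   ["Duke", "Houston", "Auburn", "Gonzaga"],
    "gpt4o":    ["Duke", "Houston", "Auburn", "Gonzaga"],
    "gemini":   ["Duke", "Houston", "Auburn", "Drake"],
    "grok":     ["Duke", "Houston", "Auburn", "Gonzaga"],
    "llama":    ["Duke", "Houston", "Auburn", "Gonzaga"],
    "deepseek": ["Duke", "Houston", "Auburn", "Gonzaga"],
}

def unanimous_rate(picks: dict[str, list[str]]) -> float:
    """Fraction of games where every model chose the same winner."""
    games = list(zip(*picks.values()))  # one tuple of 6 picks per game
    unanimous = sum(1 for game in games if len(set(game)) == 1)
    return unanimous / len(games)

print(unanimous_rate(picks))  # 0.75 in this toy example
```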

Why Do They All Agree?

They all learned basketball from the same internet. Same ESPN articles. Same KenPom data. Same Reddit takes.

Model diversity ≠ information diversity.

This is the most important finding. It has massive implications for anyone paying for multiple AI services.
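A quick simulation shows why correlated models barely help. Treat each model's error as a shared component (the internet they all trained on) plus an independent one; averaging only cancels the independent part. The magnitudes here are made up for illustration.

```python
import random

def ensemble_error(n_models: int, shared: float, trials: int = 20_000) -> float:
    """RMS error of an n-model average when each model's error has a
    shared component (same training data) plus an independent component."""
    total = 0.0
    for _ in range(trials):
        common = random.gauss(0, shared)  # error baked into the shared data
        avg = sum(common + random.gauss(0, 0.1) for _ in range(n_models)) / n_models
        total += avg ** 2
    return (total / trials) ** 0.5

random.seed(0)
print(ensemble_error(1, shared=0.3))  # ~0.32 with one model
print(ensemble_error(6, shared=0.3))  # ~0.30 with six: barely better
print(ensemble_error(6, shared=0.0))  # ~0.04 if errors were truly independent
```

Six models sharing a data source behave like one model; six models with genuinely independent information would cut the error dramatically.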

The Human Insight That Beat 1,300 API Calls

The most impactful change to our bracket? Not the ML model (92.1% accuracy). Not the Monte Carlo analysis. Not the cross-model debate.

It was me saying "history says we need more upsets."

That single human observation — grounded in the fact that the average tournament has 10+ first-round upsets — changed more picks than all 1,300 API calls combined.
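That kind of adjustment is easy to express in code: start from chalk, compare the bracket's upset count to the historical base rate, and flip the shakiest favorites until they match. The matchups, probabilities, and scaled-down upset rate below are all hypothetical.

```python
# Toy sketch of a base-rate correction. Probabilities are made up, and the
# target is a scaled-down stand-in for ~10 upsets per 32 first-round games.
games = [
    ("1 Duke",    "16 AmerU",   0.98),
    ("8 Miss St", "9 Baylor",   0.53),
    ("5 Oregon",  "12 Liberty", 0.62),
    ("6 BYU",     "11 VCU",     0.57),
]

HISTORICAL_UPSETS_PER_4 = 1.2

def add_upsets(games, target):
    picks = {fav: fav for fav, dog, p in games}       # chalk baseline
    # Sort favorites by how shaky they are (closest to a coin flip first).
    shaky = sorted(games, key=lambda g: g[2])
    for fav, dog, p in shaky[:round(target)]:
        picks[fav] = dog                              # take the upset instead
    return picks

print(add_upsets(games, HISTORICAL_UPSETS_PER_4))
```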

The UAS Analogy

I fly drones into tornadoes for a living (CEO of Black Swift Technologies). Same principle:

  • Two weather AI companies with the same NOAA data = same forecast
  • One company with sensors inside a hurricane = better forecast

The AI model is not the moat. The DATA is the moat.

The Numbers

  • 92.1% — ML model accuracy on 10 years of historical data
  • 80.6% — Agreement rate across 6 models
  • $0.01 — DeepSeek cost for same predictions as $0.27 GPT-4o
  • 1 in 763 — Perfect bracket odds even at 90% per-game accuracy
  • $22M — Kentucky NIL spend (most expensive roster, only a 7-seed)
  • 77th — Florida ranked in NIL spending (won championship last year)
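The perfect-bracket figure falls out of simple arithmetic: 63 games at 90% per-game accuracy compounds to roughly 1 in 763.

```python
# Sanity check on the "1 in 763" figure: even at 90% per-game accuracy,
# a perfect 63-game bracket is wildly unlikely.
p_game = 0.90
games = 63
p_perfect = p_game ** games
print(f"1 in {1 / p_perfect:,.0f}")  # 1 in 763
```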

The Seldon Parallel

"Psychohistory dealt not with man, but with man-masses. The reaction of one man could be forecast by no known mathematics; the reaction of a billion is something else again." — Isaac Asimov, Foundation

68 teams is somewhere between one man and a billion. We're using the same principle: aggregate enough independent signals and the noise cancels out, leaving the signal.

The ~8% we can't capture? A player having the game of their life. A referee whistle. A lucky bounce. Even Hari Seldon couldn't predict that.

Try It Yourself

Champion pick: Duke over Arizona. Let's see if psychohistory holds up. 🏀


Jack Elston, Ph.D. — CEO, Black Swift Technologies — Boulder, CO
The guy who flew the first drone into a tornadic supercell thunderstorm.
