The guy who flew drones into tornadoes used 6 AI models to predict March Madness — and discovered they all think the same thing.
I spent last night doing what any engineer would do during March Madness: building a prediction system that pits Claude, GPT-4o, Gemini, Grok, Llama, and DeepSeek against each other.
The question was not just "who wins the tournament" — it was "do these models actually think differently, or are they all trained on the same data and producing the same outputs?"
The Setup
6 frontier AI models from 6 different companies:
| Model | Provider | Cost |
|---|---|---|
| Claude Sonnet 4 | Anthropic | ~$5-10 |
| GPT-4o | OpenAI | $0.27 |
| Gemini 2.5 Flash | Google | ~$0.05 |
| Grok 3 Mini | xAI | ~$0.10 |
| Llama 3.3 70B | Meta (via Groq) | $0.00 |
| DeepSeek Chat | DeepSeek AI | $0.01 |
Each model independently predicted all 32 first-round games. Then we layered on Monte Carlo sensitivity analysis, cross-model adversarial debate, ML calibration against 10 years of historical data, and KenPom integration.
Total: 1,300+ API calls. ~$15.
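The core prediction loop is simple to sketch. This is a minimal illustration, not the project's actual code: `ask_model` is a hypothetical stub standing in for the real API wrappers, and the matchups are made up.

```python
# Minimal sketch of the cross-model prediction loop.
# `ask_model` is a HYPOTHETICAL stub, not a real API wrapper.
MODELS = ["claude", "gpt-4o", "gemini", "grok", "llama", "deepseek"]

def ask_model(model: str, matchup: tuple) -> str:
    """Stand-in for an API call: return this model's predicted winner."""
    # Stub behavior: every model picks the first (higher-seeded) team,
    # mimicking the agreement pattern described in the post.
    return matchup[0]

def predict_round(matchups):
    """Collect each model's pick for every first-round game."""
    return {game: {m: ask_model(m, game) for m in MODELS}
            for game in matchups}

# Illustrative matchups, not the real 2026 bracket
games = [("One Seed", "Sixteen Seed"), ("Two Seed", "Fifteen Seed")]
picks = predict_round(games)

# Count games where all six models name the same winner
unanimous = sum(1 for votes in picks.values()
                if len(set(votes.values())) == 1)
```

In the real system, each model call would be an HTTP request to a different provider, which is where the 1,300+ API calls come from.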
The Big Finding
80.6% of predictions were identical across all six models.
Adding models 4, 5, and 6 changed zero bracket picks. DeepSeek at $0.01 produced the same predictions as GPT-4o at $0.27 — a 27x cost difference for the same output.
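The 80.6% figure is just the share of games where all six picks coincide. A minimal sketch with toy data (these picks are illustrative, not the real tournament results):

```python
# Toy picks keyed by game; each list is six models' predicted winners
# (illustrative data, NOT the actual bracket).
toy_picks = {
    "game 1": ["Alpha"] * 6,                                      # unanimous
    "game 2": ["Beta", "Beta", "Gamma", "Beta", "Beta", "Beta"],  # split
    "game 3": ["Delta"] * 6,                                      # unanimous
    "game 4": ["Epsilon"] * 6,                                    # unanimous
    "game 5": ["Zeta"] * 6,                                       # unanimous
}

def agreement_rate(picks: dict) -> float:
    """Fraction of games where every model named the same winner."""
    identical = sum(1 for votes in picks.values() if len(set(votes)) == 1)
    return identical / len(picks)

rate = agreement_rate(toy_picks)  # 4 of 5 games unanimous -> 0.8
```

Run the same calculation over all 32 first-round games with the real picks and you get the 80.6% reported above.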
Why Do They All Agree?
They all learned basketball from the same internet. Same ESPN articles. Same KenPom data. Same Reddit takes.
Model diversity ≠ information diversity.
This is the most important finding. It has massive implications for anyone paying for multiple AI services.
The Human Insight That Beat 1,300 API Calls
The most impactful change to our bracket? Not the ML model (92.1% accuracy). Not the Monte Carlo analysis. Not the cross-model debate.
It was me saying "history says we need more upsets."
That single human observation — grounded in the fact that the average tournament has 10+ first-round upsets — changed more picks than all 1,300 API calls combined.
The UAS Analogy
I fly drones into tornadoes for a living (CEO of Black Swift Technologies). Same principle:
- Two weather AI companies with the same NOAA data = same forecast
- One company with sensors inside a hurricane = better forecast
The AI model is not the moat. The DATA is the moat.
The Numbers
- 92.1% — ML model accuracy on 10 years of historical data
- 80.6% — Agreement rate across 6 models
- $0.01 — DeepSeek cost for same predictions as $0.27 GPT-4o
- 1 in 763 — Perfect bracket odds even at 90% per-game accuracy
- $22M — Kentucky NIL spend (most expensive roster, only a 7-seed)
- 77th — Florida ranked in NIL spending (won championship last year)
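The "1 in 763" figure falls straight out of the per-game accuracy: a full bracket is 63 games, and independent 90% picks compound.

```python
# Worked check of the "1 in 763" perfect-bracket figure.
p_game = 0.9
n_games = 63                   # 32 + 16 + 8 + 4 + 2 + 1 games in a bracket
p_perfect = p_game ** n_games  # probability of getting every game right
odds = round(1 / p_perfect)    # ~763
```

This assumes each pick is an independent 90% coin flip, which is generous: real per-game accuracy is lower, and errors in early rounds cascade into later ones.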
The Seldon Parallel
"Psychohistory dealt not with man, but with man-masses. The reaction of one man could be forecast by no known mathematics; the reaction of a billion is something else again." — Isaac Asimov, Foundation
68 teams is somewhere between one man and a billion. We are using the same principle: aggregate enough independent signals and the noise cancels out, leaving the signal.
The ~8% we cannot capture? A player having the game of their life. A referee whistle. A lucky bounce. Even Hari Seldon could not predict that.
Try It Yourself
- Dashboard: madness.elstonj.com
- Paper (9 pages): madness.elstonj.com/paper.pdf
- Code: github.com/elstonj/march-madness-2026
- Hidden Easter egg: Try the Konami code on the site 🦬
Champion pick: Duke over Arizona. Let us see if psychohistory holds up. 🏀
Jack Elston, Ph.D. — CEO, Black Swift Technologies — Boulder, CO
The guy who flew the first drone into a tornadic supercell thunderstorm.