The guy who flew drones into tornadoes used 6 AI models to predict March Madness — and discovered they all think the same thing.
I spent last night doing what any engineer would do during March Madness: building a prediction system that pits Claude, GPT-4o, Gemini, Grok, Llama, and DeepSeek against each other.
The question was not just "who wins the tournament" — it was "do these models actually think differently, or are they all trained on the same data and producing the same outputs?"
The Setup
6 frontier AI models from 6 different companies:
| Model | Provider | Cost |
|---|---|---|
| Claude Sonnet 4 | Anthropic | ~$5-10 |
| GPT-4o | OpenAI | $0.27 |
| Gemini 2.5 Flash | Google | ~$0.05 |
| Grok 3 Mini | xAI | ~$0.10 |
| Llama 3.3 70B | Meta (via Groq) | $0.00 |
| DeepSeek Chat | DeepSeek AI | $0.01 |
Each model independently predicted all 32 first-round games. Then we layered on Monte Carlo sensitivity analysis, cross-model adversarial debate, ML calibration against 10 years of historical data, and KenPom integration.
Total: 1,300+ API calls. ~$15.
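The core prediction loop is simple to sketch. This is a minimal illustration, not the project's actual code: `ask_model` is a hypothetical stub standing in for the real API wrappers, and the matchups are made up.

```python
# Minimal sketch of the cross-model prediction loop.
# `ask_model` is a HYPOTHETICAL stub, not a real API wrapper.
MODELS = ["claude", "gpt-4o", "gemini", "grok", "llama", "deepseek"]

def ask_model(model: str, matchup: tuple) -> str:
    """Stand-in for an API call: return this model's predicted winner."""
    # Stub behavior: every model picks the first (higher-seeded) team,
    # mimicking the agreement pattern described in the post.
    return matchup[0]

def predict_round(matchups):
    """Collect each model's pick for every first-round game."""
    return {game: {m: ask_model(m, game) for m in MODELS}
            for game in matchups}

# Illustrative matchups, not the real 2026 bracket
games = [("One Seed", "Sixteen Seed"), ("Two Seed", "Fifteen Seed")]
picks = predict_round(games)

# Count games where all six models name the same winner
unanimous = sum(1 for votes in picks.values()
                if len(set(votes.values())) == 1)
```

In the real system, each model call would be an HTTP request to a different provider, which is where the 1,300+ API calls come from.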
The Big Finding
80.6% of predictions were identical across all six models.
Adding models 4, 5, and 6 changed zero bracket picks. DeepSeek at $0.01 produced the same predictions as GPT-4o at $0.27 — a 27x cost difference for the same output.
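The 80.6% figure is just the share of games where all six picks coincide. A minimal sketch with toy data (these picks are illustrative, not the real tournament results):

```python
# Toy picks keyed by game; each list is six models' predicted winners
# (illustrative data, NOT the actual bracket).
toy_picks = {
    "game 1": ["Alpha"] * 6,                                      # unanimous
    "game 2": ["Beta", "Beta", "Gamma", "Beta", "Beta", "Beta"],  # split
    "game 3": ["Delta"] * 6,                                      # unanimous
    "game 4": ["Epsilon"] * 6,                                    # unanimous
    "game 5": ["Zeta"] * 6,                                       # unanimous
}

def agreement_rate(picks: dict) -> float:
    """Fraction of games where every model named the same winner."""
    identical = sum(1 for votes in picks.values() if len(set(votes)) == 1)
    return identical / len(picks)

rate = agreement_rate(toy_picks)  # 4 of 5 games unanimous -> 0.8
```

Run the same calculation over all 32 first-round games with the real picks and you get the 80.6% reported above.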
Why Do They All Agree?
They all learned basketball from the same internet. Same ESPN articles. Same KenPom data. Same Reddit takes.
Model diversity ≠ information diversity.
This is the most important finding. It has massive implications for anyone paying for multiple AI services.
The Human Insight That Beat 1,300 API Calls
The most impactful change to our bracket? Not the ML model (92.1% accuracy). Not the Monte Carlo analysis. Not the cross-model debate.
It was me saying "history says we need more upsets."
That single human observation — grounded in the fact that the average tournament has 10+ first-round upsets — changed more picks than all 1,300 API calls combined.
The UAS Analogy
I fly drones into tornadoes for a living (CEO of Black Swift Technologies). Same principle:
- Two weather AI companies with the same NOAA data = same forecast
- One company with sensors inside a hurricane = better forecast
The AI model is not the moat. The DATA is the moat.
The Numbers
- 92.1% — ML model accuracy on 10 years of historical data
- 80.6% — Agreement rate across 6 models
- $0.01 — DeepSeek cost for same predictions as $0.27 GPT-4o
- 1 in 763 — Perfect bracket odds even at 90% per-game accuracy
- $22M — Kentucky NIL spend (most expensive roster, only a 7-seed)
- 77th — Florida ranked in NIL spending (won championship last year)
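The "1 in 763" figure falls straight out of the per-game accuracy: a full bracket is 63 games, and independent 90% picks compound.

```python
# Worked check of the "1 in 763" perfect-bracket figure.
p_game = 0.9
n_games = 63                   # 32 + 16 + 8 + 4 + 2 + 1 games in a bracket
p_perfect = p_game ** n_games  # probability of getting every game right
odds = round(1 / p_perfect)    # ~763
```

This assumes each pick is an independent 90% coin flip, which is generous: real per-game accuracy is lower, and errors in early rounds cascade into later ones.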
The Seldon Parallel
"Psychohistory dealt not with man, but with man-masses. The reaction of one man could be forecast by no known mathematics; the reaction of a billion is something else again." — Isaac Asimov, Foundation
68 teams is somewhere between one man and a billion. We are using the same principle: aggregate enough independent signals and the noise cancels out, leaving the signal.
The ~8% we cannot capture? A player having the game of their life. A referee whistle. A lucky bounce. Even Hari Seldon could not predict that.
Try It Yourself
- Dashboard: madness.elstonj.com
- Paper (9 pages): madness.elstonj.com/paper.pdf
- Code: github.com/elstonj/march-madness-2026
- Hidden Easter egg: Try the Konami code on the site 🦬
Champion pick: Duke over Arizona. Let us see if psychohistory holds up. 🏀
Jack Elston, Ph.D. — CEO, Black Swift Technologies — Boulder, CO
The guy who flew the first drone into a tornadic supercell thunderstorm.