I read about this study on Twitter and couldn't stop thinking about it.
In 2009, neuroscientists put a dead Atlantic salmon in an fMRI scanner. They showed it pictures of humans in social situations. Asked it to determine what emotion the people were feeling.
The scanner detected brain activity. The salmon appeared to be thinking.
Obviously, the fish wasn't thinking. The "activity" was random noise. But here's the point: without proper statistical controls, your tools will find patterns where none exist.
This is happening in machine learning right now. We're celebrating model improvements that vanish when you add proper baselines. We're finding brain activity in dead fish, except now we call it architectural innovation.
Dead fish score 90%
Researchers submitted "null models" to LLM benchmarks. These models output constant responses regardless of input. They don't read your question. They just generate formatted text that looks good.
These null models achieved 80-90% win rates on AlpacaEval.
Think about that. A model that completely ignores your input can hit 90%. That's not measuring intelligence. That's measuring how well you format markdown.
The paper is called "Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates" and it should terrify you if you're making decisions based on leaderboard positions. (arxiv.org/abs/2410.07137)
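To make the trick concrete, here's a minimal sketch of what a null model amounts to (my illustration, not the paper's actual submission). It never reads the prompt; it just returns one polished, well-structured answer of the kind automatic judges tend to reward.

```python
# Sketch of the "null model" idea: ignore the input entirely and return
# one canned, nicely formatted answer. Names and wording are illustrative.

CANNED_RESPONSE = """\
Great question! Let's break this down step by step.

1. **Understand the goal.** Identify what really matters here.
2. **Consider the trade-offs.** Each option has strengths and weaknesses.
3. **Recommendation.** Based on the analysis above, the balanced approach is best.

I hope this helps! Let me know if you'd like more detail on any point.
"""

def null_model(prompt: str) -> str:
    """Return the same formatted text no matter what the prompt says."""
    return CANNED_RESPONSE

if __name__ == "__main__":
    # Two completely different questions, one identical answer.
    for q in ["How do I reverse a linked list?", "Plan a three-day trip to Lisbon."]:
        print(null_model(q)[:60], "...")
```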
This isn't isolated. "Shortcut Learning in Deep Neural Networks" shows ImageNet models learn texture instead of shape. Show them an elephant with cat texture and they say "cat" with confidence. They learned the wrong thing entirely, but the benchmark never caught it. (arxiv.org/abs/2004.07780)
We're celebrating dead salmon because nobody checked if they were alive.
"An embarrassingly simple approach"
There's a whole genre of papers with "An Embarrassingly Simple Approach" in the title. They keep beating state-of-the-art by just not doing the complex thing.
Zero-shot learning? Linear regression beat fancy meta-learning architectures.
One-shot learning? Just prune irrelevant features from a pretrained model. Set new records on miniImageNet and tieredImageNet. Beat all the complex meta-learning networks.
Imbalanced semi-supervised learning? Basic resampling won by 12-16% over complex balancing techniques.
The pattern is clear. These papers didn't discover new techniques. They just bothered to implement the baseline that everyone else skipped.
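To show what "embarrassingly simple" looks like in practice, here's a sketch of a few-shot baseline in that spirit: nearest-class-centroid over frozen pretrained embeddings. This is my illustration, not the exact method from any of those papers, and the function and variable names are made up.

```python
import numpy as np

def nearest_centroid_few_shot(support_x, support_y, query_x):
    """Few-shot baseline: assign each query to the class whose support
    embeddings have the closest mean (cosine similarity).

    support_x: (n_support, d) pretrained feature vectors
    support_y: (n_support,) integer class labels
    query_x:   (n_query, d) pretrained feature vectors
    """
    classes = np.unique(support_y)
    # One centroid per class, L2-normalized so dot product ~= cosine similarity.
    centroids = np.stack([support_x[support_y == c].mean(axis=0) for c in classes])
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    queries = query_x / np.linalg.norm(query_x, axis=1, keepdims=True)
    # Highest cosine similarity wins.
    return classes[(queries @ centroids.T).argmax(axis=1)]

# Example with random "embeddings" just to show the shape of the API.
rng = np.random.default_rng(0)
preds = nearest_centroid_few_shot(rng.normal(size=(25, 64)),
                                  rng.integers(0, 5, size=25),
                                  rng.normal(size=(10, 64)))
```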
If you're not testing against simple baselines, you're not doing science. You're doing marketing.
XGBoost from 2016 keeps winning
The most damning evidence comes from tabular data.
"Tabular Data: Deep Learning Is Not All You Need" tested fancy deep learning models against XGBoost. That's a 2016 algorithm. Most of you already know it. (arxiv.org/abs/2106.03253)
XGBoost won on most datasets. XGBoost trained significantly faster. Each deep model performed best only on datasets from its own paper.
Move to new data and the architectural innovations collapse. The researchers tested models from four recent papers across eleven datasets. Every single "novel architecture" dominated its home turf and failed everywhere else.
That's not innovation. That's p-hacking with neural nets.
Meanwhile, the 2016 baseline just kept winning.
To be fair, deep learning does pull ahead on tabular data in specific cases. Massive datasets over a million rows. When you need to learn complex feature interactions without manual engineering. But those scenarios are rarer than the hype suggests. For most tabular problems, XGBoost with good features beats any deep model with bad ones.
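If you want the boring baseline in code, it's about ten lines. This is a sketch with synthetic stand-in data; swap in your own feature matrix and labels.

```python
# Sketch: the tabular baseline to beat before trusting any deep model.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

# Stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=5000, n_features=30, n_informative=10, random_state=0)

baseline = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=6,
    eval_metric="logloss",
)

scores = cross_val_score(baseline, X, y, cv=5, scoring="roc_auc")
print(f"XGBoost baseline AUC: {scores.mean():.3f} +/- {scores.std():.3f}")
```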
Data quality beats model architecture
Andrew Ng said it: "Improving data quality often beats developing a better model architecture."
Microsoft proved it with its Phi models. A tiny model trained on high-quality synthetic textbooks outperformed massive models trained on raw web scrapes. Not because of architecture. Because of data.
The pattern holds everywhere. XGBoost with good features beats any deep model with bad features. Good prompts on GPT-3.5 beat bad prompts on GPT-4. Clean data beats novel architecture.
But here's why we ignore this: "we cleaned our data better" doesn't win Best Paper awards. "Novel attention mechanism with architectural innovations" does.
"Troubling Trends in Machine Learning Scholarship" documents the problem. (arxiv.org/abs/1807.03341)
Papers claiming architectural innovations are actually just better hyperparameter tuning. They compare their tuned model to an untuned baseline and declare victory. Or they cherry-pick datasets where their approach works. Or they skip the simple baseline that would expose them. Or they use math to make trivial ideas sound profound.
We're drowning in innovations that don't replicate outside their original paper.
What actually transfers
If you're shipping AI products, this should feel liberating. You don't need early access to the latest model. You don't need to obsess over Claude vs GPT vs Gemini.
Build your baseline first. Before reaching for transformers, try the simple thing. For tabular data, that's XGBoost. For text classification, try TF-IDF with logistic regression. For code search, try cosine similarity. If your fancy approach doesn't convincingly beat this, you're staring at a dead fish.
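For the text case, the simple thing really is a handful of lines. A sketch, where `texts` and `labels` are placeholders for your own list of documents and their classes:

```python
# Sketch: the text-classification baseline to beat. `texts` and `labels`
# are placeholders for your own data (list of strings, list of labels).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

baseline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000),
)

scores = cross_val_score(baseline, texts, labels, cv=5, scoring="f1_macro")
print(f"TF-IDF + logistic regression macro-F1: {scores.mean():.3f}")
```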
Own what you control. Your model choice is temporary. Models change every few months. What you own lasts.
Fix your data pipeline. A great model can't save bad data, but good data can save a mediocre model. The Phi models proved this. Winning Kaggle teams routinely spend most of their time on feature engineering, not model selection. Invest in data cleaning, consistent labeling, removing duplicates, fixing class imbalance, and representing missing values as proper nulls instead of sentinel strings.
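A first pass at that pipeline can be plain pandas. A sketch, with `df`, `text`, and `label` as placeholder names for your own schema:

```python
# Sketch of an unglamorous data-quality pass. `df`, "text", and "label"
# are placeholder names; adapt to your own schema.
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates(subset=["text"])            # exact duplicates
    df = df[df["text"].str.strip().str.len() > 0]       # empty examples
    # Drop examples where the same input got conflicting labels.
    conflicts = df.groupby("text")["label"].nunique()
    df = df[~df["text"].isin(conflicts[conflicts > 1].index)]
    # Make missing values explicit instead of hiding them in sentinels.
    df = df.replace({"N/A": pd.NA, "": pd.NA})
    print(df["label"].value_counts(normalize=True))     # check class balance
    return df
```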
Master prompt engineering. This is model-agnostic and transfers. A well-crafted prompt works across Claude, GPT, and Gemini with minimal changes. Learn to break problems into steps, provide clear examples, use structured outputs, iterate based on failures. This transfers. Model choice doesn't.
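One way to keep that investment portable is to treat the prompt as data rather than provider-specific code. A sketch, with a made-up triage template just to show the shape:

```python
# Sketch: a provider-agnostic prompt template. The wording is illustrative;
# the point is that nothing here depends on which model receives it.
CLASSIFY_TICKET = """\
You are a support triage assistant.

Steps:
1. Read the ticket below.
2. Pick exactly one category from: {categories}.
3. Answer in JSON: {{"category": "...", "confidence": "high|medium|low"}}

Example:
Ticket: "I was charged twice this month."
Answer: {{"category": "billing", "confidence": "high"}}

Ticket:
{ticket}
Answer:"""

def build_prompt(ticket: str, categories: list[str]) -> str:
    return CLASSIFY_TICKET.format(ticket=ticket, categories=", ".join(categories))

# The same string goes to Claude, GPT, or Gemini via whatever client you use.
print(build_prompt("The app crashes on login.", ["billing", "bug", "feature request"]))
```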
Add proper controls. The dead salmon study taught neuroscientists to test null hypotheses. Do the same.
Does your model beat random guessing by enough to matter? Does it beat the simple baseline? Are you comparing tuned vs tuned, or tuned vs vanilla? Does it work beyond your training distribution?
If your gains disappear when you add these controls, you're celebrating noise.
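In code, the null controls are a few lines of scikit-learn. A sketch, where `model`, `X`, and `y` stand in for your own estimator and data:

```python
# Sketch: null controls for a classifier. `model`, `X`, and `y` are
# placeholders for your own estimator and data.
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

def with_controls(model, X, y, cv=5, scoring="f1_macro"):
    candidates = {
        "random guess": DummyClassifier(strategy="uniform", random_state=0),
        "majority class": DummyClassifier(strategy="most_frequent"),
        "your model": model,
    }
    for name, est in candidates.items():
        scores = cross_val_score(est, X, y, cv=cv, scoring=scoring)
        print(f"{name:>15}: {scores.mean():.3f} +/- {scores.std():.3f}")
```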
Stop treating benchmarks like brain scans
A recent interpretability paper makes the same point. If your interpretability tool finds convincing patterns in randomly initialized, untrained models, you're not finding meaning. You're finding statistical noise. (arxiv.org/abs/2512.18792)
Saliency maps look plausible on random networks. Sparse autoencoders find "interpretable features" in random transformers. Benchmark scores improve with null models. Architectures beat baselines that were never properly tuned.
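The same control is cheap to run on your own interpretability tooling: point it at a randomly initialized copy of the network first. A sketch using plain input-gradient saliency in PyTorch, with a toy model standing in for the real one:

```python
# Sketch: a null control for saliency methods. If the maps from a randomly
# initialized network look as "meaningful" as your trained model's, the tool
# is showing you noise. The model and data here are toy placeholders.
import torch
import torch.nn as nn

def input_gradient_saliency(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()
    return x.grad.abs()

torch.manual_seed(0)
x = torch.randn(8, 32)                                                     # toy inputs
trained = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))    # stand-in for your trained model
control = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 1))    # never trained

s_trained = input_gradient_saliency(trained, x)
s_control = input_gradient_saliency(control, x)
# If these two maps are hard to tell apart, be suspicious of the "insight".
print(torch.corrcoef(torch.stack([s_trained.flatten(), s_control.flatten()]))[0, 1])
```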
Before you chase the latest model release, try the simple baseline. Fix your data. Invest in prompts that transfer. Add controls to your evals.
You don't need insider access or the newest model. You need to own your data, your prompts, your retrieval, and your evaluation. The model is often the least interesting part of the system, which is exactly why it gets the most hype.
Your model choice doesn't matter nearly as much as you think. Once you accept that, you can focus on the things that do.
If your improvement disappears against a strong baseline on clean data, you're not measuring intelligence. You're measuring brain activity in a dead fish.