I used to say it confidently:
“The model learned this.”
It felt obvious. We initialize weights. We run gradient descent. We minimize loss. The network learns.
End of story.
But then, while reading late one night, I stumbled upon a research paper by Jonathan Frankle and Michael Carbin that genuinely disturbed that simple narrative. At first, it looked like just another pruning paper. But the core claim made me stop and reread it twice.
It suggested something radical:
What if neural networks don’t build intelligence from scratch during training?
What if, hidden inside a randomly initialized network, there already exists a smaller subnetwork that is capable of solving the task and training merely discovers it?
That idea is known as the Lottery Ticket Hypothesis.
And if it’s even partially true, then gradient descent isn’t constructing intelligence.
It’s searching for a winning ticket inside structured randomness.
Before we go further, let’s unpack what that actually means.
The Idea That Quietly Shook Deep Learning
In 2018, Jonathan Frankle and Michael Carbin proposed something almost heretical:
Inside every randomly initialized neural network, there exists a smaller subnetwork that is already capable of learning the task.
They called it the Lottery Ticket Hypothesis (LTH).
The claim?
A huge neural network contains a sparse “winning ticket”:
a subnetwork that, when trained on its own from the original initialization, matches the full network’s performance in roughly the same number of training steps.
Why “Lottery Ticket”?
Imagine you randomly initialize a huge neural network.
Most weights are useless.
But hidden inside that random initialization is a rare configuration of connections that is already aligned with the task.
Training doesn’t create intelligence from scratch.
It discovers and amplifies that lucky subnetwork.
Like buying lottery tickets, most are worthless, but a few are already winners.
The Experiment (Simplified)
Researchers performed a surprisingly simple experiment:
- Train a full network to convergence.
- Prune (remove) a fraction of the smallest-magnitude weights.
- Reset the remaining weights back to their original random initialization.
- Retrain only the pruned subnetwork (and, in the iterative variant, repeat the cycle).
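Steps 2 and 3 of that recipe can be sketched in a few lines of NumPy. This is a toy illustration of magnitude pruning plus reset, not a real training run; the 4×4 weight matrix and the “trained” weights below are hypothetical stand-ins for an actual layer.

```python
import numpy as np

def winning_ticket(theta_0, theta_trained, prune_fraction):
    """Drop the smallest-magnitude trained weights, then reset the
    survivors to their ORIGINAL initialization values."""
    flat = np.abs(theta_trained).ravel()
    k = int(flat.size * prune_fraction)
    threshold = np.partition(flat, k)[k]          # magnitude cutoff
    mask = np.abs(theta_trained) >= threshold     # keep only the big ones
    return np.where(mask, theta_0, 0.0), mask

# Hypothetical stand-ins for a real layer's weights:
rng = np.random.default_rng(0)
theta_0 = rng.normal(size=(4, 4))                             # random init
theta_trained = theta_0 + rng.normal(scale=0.5, size=(4, 4))  # "after training"

ticket, mask = winning_ticket(theta_0, theta_trained, prune_fraction=0.75)
print(f"fraction of weights kept: {mask.mean():.2f}")  # ~0.25
```

Note that the mask is computed from the *trained* magnitudes, but the surviving values come from the *initial* weights; that reset is the whole point of the experiment.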
Shockingly:
The pruned network trained just as well.
Sometimes even faster.
This means the full network wasn’t necessary.
The winning subnetwork was already present at initialization.
Why This Is Mind-Blowing
It challenges the intuitive story:
❌ “Training builds intelligence.”
Instead, it suggests:
✅ “Initialization already contains many potential intelligent subnetworks.”
Training becomes less about constructing intelligence and more about searching for one good structure hidden inside randomness.
That shifts deep learning from:
Optimization theory
to
Combinatorial search through structured randomness.
The Mathematical Angle
Let a neural network be:
f(x; θ)
where:
- θ ∈ ℝᵈ is the parameter vector
- d, the parameter count, can be in the millions or billions
LTH says:
There exists a binary mask m ∈ {0,1}ᵈ such that:
f(x; m ⊙ θ₀)
trained alone performs as well as the full model.
Where:
- θ₀ = original random initialization
- ⊙ = elementwise multiplication
So the “winning ticket” is:
m ⊙ θ₀
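Concretely, m ⊙ θ₀ is just an elementwise product. Here is a minimal sketch with a single linear layer standing in for the whole network; the shapes and the ~80% sparsity level are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
d_in, d_out = 8, 3

theta_0 = rng.normal(size=(d_out, d_in))              # θ₀: original random init
m = (rng.random((d_out, d_in)) < 0.2).astype(float)   # m ∈ {0,1}ᵈ: ~80% pruned

def f(x, theta):
    """One linear layer standing in for f(x; θ) (bias, nonlinearity omitted)."""
    return theta @ x

x = rng.normal(size=d_in)
full_out = f(x, theta_0)        # the dense network
ticket_out = f(x, m * theta_0)  # f(x; m ⊙ θ₀): only the subnetwork fires
```

The pruned connections contribute exactly zero to the output; the subnetwork is the dense network with most of its wiring switched off.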
The crucial detail:
If you randomly reinitialize after pruning, performance drops.
That means the specific random draw of θ₀ contained structure.
Randomness wasn’t neutral noise.
It already encoded useful geometry.
Even Deeper Implications
Overparameterization Might Be a Search Strategy
Why are large language models enormous?
Maybe not because they need every parameter.
Maybe because a larger network increases the probability that a good subnetwork exists.
If each subnetwork independently has a tiny probability p of being “trainable,” and you have N candidate subnetworks, then the probability that at least one works is roughly:
1 − (1 − p)^N
As N grows, this rapidly approaches 1.
Bigger models might simply mean more lottery tickets.
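A quick numerical check makes the argument vivid. The value p = 10⁻⁷ below is made up purely for illustration, and the formula assumes the subnetworks are independent; the point is only how fast the probability saturates as N grows.

```python
import math

def p_at_least_one(p, n):
    """P(at least one of n subnetworks is trainable) = 1 - (1 - p)^n,
    assuming independence. Computed in log space for huge n."""
    return -math.expm1(n * math.log1p(-p))

# p = 1e-7 is a hypothetical per-subnetwork success probability.
for n in (10**3, 10**6, 10**9):
    print(f"N = {n:>13,}: P ≈ {p_at_least_one(1e-7, n):.4f}")
```

With p fixed at 10⁻⁷, the probability climbs from about 10⁻⁴ at N = 10³ to essentially 1 at N = 10⁹: more tickets, near-certain win.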
Random Initialization Is Not “Just Noise”
We usually think:
Random weights = meaningless chaos.
But in high dimensions, strange things happen:
- Concentration of measure
- Emergent correlations
- Structured spectral properties of random matrices
Random high-dimensional systems often contain surprising regularities.
LTH suggests intelligence might emerge from those hidden regularities.
This connects to:
- Random matrix theory
- High-dimensional probability
- Sparse approximation theory
Compression and Pruning Make More Sense
LTH helps explain why:
- Pruned networks often retain performance
- Sparse models can match dense ones
- Quantized models still work
Redundancy may not be inefficiency.
It may be a probabilistic discovery mechanism.
LTH and Large Language Models
Now the dangerous thought.
Modern systems like GPT have billions of parameters.
What if:
- Only a fraction are truly necessary?
- Reasoning lives in sparse subnetworks?
- Scaling works because it increases the chance of containing rare reasoning-capable configurations?
Instead of:
“Bigger models are smarter.”
It might be:
“Bigger models are more likely to contain a rare intelligent structure.”
That’s a radically different interpretation of scaling laws.
But It Gets Stranger
Later research found that, for very large networks, rewinding to weights from early in training works better than resetting all the way back to the original initialization.
This led to ideas like:
- Early-bird tickets
- Mode connectivity
- Linear low-loss paths between solutions
The loss landscape appears smoother and more connected than classical intuition suggests.
The geometry of deep learning is far stranger than we once believed.
Open Questions
We still don’t fully understand:
- Why winning tickets exist
- How large they must be
- Whether the hypothesis scales cleanly to transformers
- Whether reasoning is localized or distributed
The theory is incomplete.
The implications are enormous.
The Radical Interpretation
Some researchers speculate:
Deep networks behave like massive random feature ensembles.
Training selects coherent sparse structures from that ensemble.
That comes dangerously close to saying:
Intelligence emerges from structured randomness.
Not from careful deterministic construction.
The Truly Mind-Bending Part
If the Lottery Ticket Hypothesis holds at scale…
Then scaling laws might reflect:
Extreme value statistics in high dimensions.
GPT’s performance curves might be explainable through probability theory, not just optimization.
That would connect large language models to:
- Statistical physics
- Spin glass theory
- Phase transitions
And that’s when deep learning stops looking like engineering…
…and starts looking like high-dimensional statistical mechanics.
So next time someone says:
“GPT learned that.”
You might ask:
Did it learn?
Or did it find a winning ticket hidden inside randomness?
The Actual Research Paper
You can find it here:
https://arxiv.org/abs/1803.03635