<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Maulik Sompura</title>
    <description>The latest articles on DEV Community by Maulik Sompura (@maulik_sompura_22).</description>
    <link>https://dev.to/maulik_sompura_22</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3130157%2Fdc83bf1e-6346-4192-8ed9-d1885b047012.jpeg</url>
      <title>DEV Community: Maulik Sompura</title>
      <link>https://dev.to/maulik_sompura_22</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/maulik_sompura_22"/>
    <language>en</language>
    <item>
      <title>What If GPT Didn’t “Learn”, It Just Found a Winning Lottery Ticket?</title>
      <dc:creator>Maulik Sompura</dc:creator>
      <pubDate>Sat, 21 Feb 2026 16:10:36 +0000</pubDate>
      <link>https://dev.to/maulik_sompura_22/what-if-gpt-didnt-learn-it-just-found-a-winning-lottery-ticket-4bcl</link>
      <guid>https://dev.to/maulik_sompura_22/what-if-gpt-didnt-learn-it-just-found-a-winning-lottery-ticket-4bcl</guid>
      <description>&lt;p&gt;I used to say it confidently:&lt;/p&gt;

&lt;p&gt;“The model learned this.”&lt;/p&gt;

&lt;p&gt;It felt obvious. We initialize weights. We run gradient descent. We minimize loss. The network learns.&lt;/p&gt;

&lt;p&gt;End of story.&lt;/p&gt;

&lt;p&gt;But then I came across a research paper that genuinely disturbed that simple narrative.&lt;/p&gt;

&lt;p&gt;While reading late one night, I stumbled upon a paper by Jonathan Frankle and Michael Carbin. At first, it looked like just another pruning paper. But the core claim made me stop and reread it twice.&lt;/p&gt;

&lt;p&gt;It suggested something radical:&lt;/p&gt;

&lt;p&gt;What if neural networks don’t build intelligence from scratch during training?&lt;/p&gt;

&lt;p&gt;What if, hidden inside a randomly initialized network, there already exists a smaller subnetwork that is capable of solving the task and training merely discovers it?&lt;/p&gt;

&lt;p&gt;That idea is known as the Lottery Ticket Hypothesis.&lt;/p&gt;

&lt;p&gt;And if it’s even partially true, then gradient descent isn’t constructing intelligence.&lt;/p&gt;

&lt;p&gt;It’s searching for a winning ticket inside structured randomness.&lt;/p&gt;

&lt;p&gt;Before we go further, let’s unpack what that actually means.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Idea That Quietly Shook Deep Learning
&lt;/h2&gt;

&lt;p&gt;In 2018, &lt;strong&gt;Jonathan Frankle&lt;/strong&gt; and &lt;strong&gt;Michael Carbin&lt;/strong&gt; proposed something almost heretical:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Inside every randomly initialized neural network, there exists a smaller subnetwork that is already capable of learning the task.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They called it the &lt;strong&gt;Lottery Ticket Hypothesis (LTH)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The claim?&lt;/p&gt;

&lt;p&gt;A huge neural network contains a sparse “winning ticket” —&lt;br&gt;
a subnetwork that, when trained alone (starting from the original initialization), matches the full network’s performance.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why “Lottery Ticket”?
&lt;/h2&gt;

&lt;p&gt;Imagine you randomly initialize a huge neural network.&lt;/p&gt;

&lt;p&gt;Most weights are useless.&lt;/p&gt;

&lt;p&gt;But hidden inside that random initialization is a rare configuration of connections that is already aligned with the task.&lt;/p&gt;

&lt;p&gt;Training doesn’t create intelligence from scratch.&lt;br&gt;
It discovers and amplifies that lucky subnetwork.&lt;/p&gt;

&lt;p&gt;Like lottery tickets: most are worthless, but a few are already winners.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Experiment (Simplified)
&lt;/h2&gt;

&lt;p&gt;Researchers performed a surprisingly simple experiment:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Train a full network to convergence.&lt;/li&gt;
&lt;li&gt;Prune (remove) the smallest-magnitude weights.&lt;/li&gt;
&lt;li&gt;Reset the remaining weights back to their original random initialization.&lt;/li&gt;
&lt;li&gt;Retrain only the pruned subnetwork.&lt;/li&gt;
&lt;/ol&gt;
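&lt;p&gt;The four steps above are easy to sketch on a toy problem. This is not the paper's setup (they used real vision networks); it's a minimal numpy illustration of train-prune-reset-retrain on linear regression, with every name and number invented for the demo:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy task: y depends on only 4 of the 20 input features.
X = rng.normal(size=(200, 20))
true_w = np.zeros(20)
true_w[:4] = [2.0, -1.5, 1.0, 0.5]
y = X @ true_w + 0.01 * rng.normal(size=200)

def train(w_init, mask, steps=500, lr=0.01):
    """Gradient descent on MSE, updating only the unmasked weights."""
    w = w_init * mask
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad * mask  # pruned weights stay frozen at zero
    return w

def mse(w):
    return float(np.mean((X @ w - y) ** 2))

w0 = rng.normal(scale=0.1, size=20)          # random initialization
w_full = train(w0, np.ones(20))              # 1. train the full model
keep = np.abs(w_full) >= np.quantile(np.abs(w_full), 0.8)
mask = keep.astype(float)                    # 2. prune the 80% smallest-magnitude weights
w_ticket = train(w0, mask)                   # 3.+4. reset survivors to w0, retrain them alone

print(f"full MSE: {mse(w_full):.4f}, ticket MSE (20% of weights): {mse(w_ticket):.4f}")
```

&lt;p&gt;On a toy like this the sparse subnetwork typically matches the full model, because the useful weights are identifiable by magnitude after training; the real experiments show the same effect on far messier networks.&lt;/p&gt;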

&lt;p&gt;Shockingly:&lt;/p&gt;

&lt;p&gt;The pruned network trained just as well.&lt;/p&gt;

&lt;p&gt;Sometimes even faster.&lt;/p&gt;

&lt;p&gt;This means the full network wasn’t necessary.&lt;br&gt;
The winning subnetwork was already present at initialization.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Is Mind-Blowing
&lt;/h2&gt;

&lt;p&gt;It challenges the intuitive story:&lt;/p&gt;

&lt;p&gt;❌ “Training builds intelligence.”&lt;/p&gt;

&lt;p&gt;Instead, it suggests:&lt;/p&gt;

&lt;p&gt;✅ “Initialization already contains many potential intelligent subnetworks.”&lt;/p&gt;

&lt;p&gt;Training becomes less about constructing intelligence and more about &lt;strong&gt;searching for one good structure hidden inside randomness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That shifts deep learning from:&lt;/p&gt;

&lt;p&gt;Optimization theory&lt;br&gt;
to&lt;br&gt;
Combinatorial search through structured randomness.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Mathematical Angle
&lt;/h2&gt;

&lt;p&gt;Let a neural network be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(x; θ)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;θ ∈ ℝᵈ&lt;/li&gt;
&lt;li&gt;d could be millions or billions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LTH says:&lt;/p&gt;

&lt;p&gt;There exists a binary mask m ∈ {0,1}ᵈ such that:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;f(x; m ⊙ θ₀)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;trained alone performs as well as the full model.&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;θ₀ = original random initialization&lt;/li&gt;
&lt;li&gt;⊙ = elementwise multiplication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So the “winning ticket” is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;m ⊙ θ₀
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
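&lt;p&gt;In array terms the masking is just elementwise multiplication. A tiny numpy illustration (mask and weights invented):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

theta0 = rng.normal(size=8)                    # θ₀: the original random initialization
m = np.array([1, 0, 1, 0, 0, 1, 0, 1])         # binary mask m selecting a subnetwork

ticket = m * theta0                            # m ⊙ θ₀: the candidate winning ticket
print(ticket)                                  # pruned positions are exactly zero
```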



&lt;p&gt;The crucial detail:&lt;/p&gt;

&lt;p&gt;If you randomly reinitialize after pruning, performance drops.&lt;/p&gt;

&lt;p&gt;That means the specific random draw of θ₀ contained structure.&lt;/p&gt;

&lt;p&gt;Randomness wasn’t neutral noise.&lt;br&gt;
It already encoded useful geometry.&lt;/p&gt;


&lt;h2&gt;
  
  
  Even Deeper Implications
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Overparameterization Might Be a Search Strategy
&lt;/h2&gt;

&lt;p&gt;Why are large language models enormous?&lt;/p&gt;

&lt;p&gt;Maybe not because they need every parameter.&lt;/p&gt;

&lt;p&gt;Maybe because a larger network increases the probability that a good subnetwork exists.&lt;/p&gt;

&lt;p&gt;If each subnetwork has a tiny probability p of being “trainable,” and you have N possible subnetworks, then the probability at least one works is roughly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1 − (1 − p)^N
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As N grows, this rapidly approaches 1.&lt;/p&gt;
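&lt;p&gt;Plugging in toy numbers shows how fast this saturates (the values of p and N here are invented for illustration):&lt;/p&gt;

```python
def p_at_least_one(p, n):
    """Probability that at least one of n independent subnetworks is a winning ticket."""
    return 1 - (1 - p) ** n

# Even a tiny per-subnetwork probability becomes near-certainty at scale.
for n in (10, 1_000, 100_000):
    print(f"N = {n:>7}: {p_at_least_one(1e-4, n):.5f}")
```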

&lt;p&gt;Bigger models might simply mean more lottery tickets.&lt;/p&gt;




&lt;h2&gt;
  
  
  Random Initialization Is Not “Just Noise”
&lt;/h2&gt;

&lt;p&gt;We usually think:&lt;/p&gt;

&lt;p&gt;Random weights = meaningless chaos.&lt;/p&gt;

&lt;p&gt;But in high dimensions, strange things happen:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concentration of measure&lt;/li&gt;
&lt;li&gt;Emergent correlations&lt;/li&gt;
&lt;li&gt;Structured spectral properties of random matrices&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Random high-dimensional systems often contain surprising regularities.&lt;/p&gt;

&lt;p&gt;LTH suggests intelligence might emerge from those hidden regularities.&lt;/p&gt;

&lt;p&gt;This connects to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Random matrix theory&lt;/li&gt;
&lt;li&gt;High-dimensional probability&lt;/li&gt;
&lt;li&gt;Sparse approximation theory&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Compression and Pruning Make More Sense
&lt;/h2&gt;

&lt;p&gt;LTH helps explain why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pruned networks often retain performance&lt;/li&gt;
&lt;li&gt;Sparse models can match dense ones&lt;/li&gt;
&lt;li&gt;Quantized models still work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Redundancy may not be inefficiency.&lt;/p&gt;

&lt;p&gt;It may be a probabilistic discovery mechanism.&lt;/p&gt;




&lt;h2&gt;
  
  
  LTH and Large Language Models
&lt;/h2&gt;

&lt;p&gt;Now the dangerous thought.&lt;/p&gt;

&lt;p&gt;Modern systems like GPT have billions of parameters.&lt;/p&gt;

&lt;p&gt;What if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only a fraction are truly necessary?&lt;/li&gt;
&lt;li&gt;Reasoning lives in sparse subnetworks?&lt;/li&gt;
&lt;li&gt;Scaling works because it increases the chance of containing rare reasoning-capable configurations?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Instead of:&lt;/p&gt;

&lt;p&gt;“Bigger models are smarter.”&lt;/p&gt;

&lt;p&gt;It might be:&lt;/p&gt;

&lt;p&gt;“Bigger models are more likely to contain a rare intelligent structure.”&lt;/p&gt;

&lt;p&gt;That’s a radically different interpretation of scaling laws.&lt;/p&gt;




&lt;h2&gt;
  
  
  But It Gets Stranger
&lt;/h2&gt;

&lt;p&gt;Later research found that, for very large networks, resetting to &lt;strong&gt;early training weights&lt;/strong&gt; works better than resetting to full initialization.&lt;/p&gt;

&lt;p&gt;This led to ideas like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Early-bird tickets&lt;/li&gt;
&lt;li&gt;Mode connectivity&lt;/li&gt;
&lt;li&gt;Linear low-loss paths between solutions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The loss landscape appears smoother and more connected than classical intuition suggests.&lt;/p&gt;

&lt;p&gt;The geometry of deep learning is far stranger than we once believed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Open Questions
&lt;/h2&gt;

&lt;p&gt;We still don’t fully understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why winning tickets exist&lt;/li&gt;
&lt;li&gt;How large they must be&lt;/li&gt;
&lt;li&gt;Whether the hypothesis scales cleanly to transformers&lt;/li&gt;
&lt;li&gt;Whether reasoning is localized or distributed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The theory is incomplete.&lt;/p&gt;

&lt;p&gt;The implications are enormous.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Radical Interpretation
&lt;/h2&gt;

&lt;p&gt;Some researchers speculate:&lt;/p&gt;

&lt;p&gt;Deep networks behave like massive random feature ensembles.&lt;br&gt;
Training selects coherent sparse structures from that ensemble.&lt;/p&gt;

&lt;p&gt;That comes dangerously close to saying:&lt;/p&gt;

&lt;p&gt;Intelligence emerges from structured randomness.&lt;/p&gt;

&lt;p&gt;Not from careful deterministic construction.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Truly Mind-Bending Part
&lt;/h2&gt;

&lt;p&gt;If the Lottery Ticket Hypothesis holds at scale…&lt;/p&gt;

&lt;p&gt;Then scaling laws might reflect:&lt;/p&gt;

&lt;p&gt;Extreme value statistics in high dimensions.&lt;/p&gt;

&lt;p&gt;GPT’s performance curves might be explainable through probability theory, not just optimization.&lt;/p&gt;

&lt;p&gt;That would connect large language models to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistical physics&lt;/li&gt;
&lt;li&gt;Spin glass theory&lt;/li&gt;
&lt;li&gt;Phase transitions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And that’s when deep learning stops looking like engineering…&lt;/p&gt;

&lt;p&gt;…and starts looking like high-dimensional statistical mechanics.&lt;/p&gt;

&lt;p&gt;So next time someone says:&lt;/p&gt;

&lt;p&gt;“GPT learned that.”&lt;/p&gt;

&lt;p&gt;You might ask:&lt;/p&gt;

&lt;p&gt;Did it learn?&lt;/p&gt;

&lt;p&gt;Or did it find a winning ticket hidden inside randomness?&lt;/p&gt;

&lt;p&gt;You can find the original research paper here:&lt;br&gt;
&lt;a href="https://arxiv.org/abs/1803.03635" rel="noopener noreferrer"&gt;https://arxiv.org/abs/1803.03635&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>mcp</category>
    </item>
    <item>
      <title>Stop Manual Segmentation: Meet NotumAi - An Open-Source AI Annotation Tool</title>
      <dc:creator>Maulik Sompura</dc:creator>
      <pubDate>Sat, 14 Feb 2026 15:41:48 +0000</pubDate>
      <link>https://dev.to/maulik_sompura_22/introducing-notumai-the-open-source-ai-powered-annotation-tool-1o9</link>
      <guid>https://dev.to/maulik_sompura_22/introducing-notumai-the-open-source-ai-powered-annotation-tool-1o9</guid>
      <description>&lt;p&gt;If you've ever built a computer vision model, you know this truth:&lt;/p&gt;

&lt;p&gt;Data annotation is the slowest, most painful part of the pipeline.&lt;/p&gt;

&lt;p&gt;You have thousands of images.&lt;br&gt;
You need high-quality segmentation masks.&lt;br&gt;
And your options usually look like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use a clunky, outdated desktop tool.&lt;/li&gt;
&lt;li&gt;Upload sensitive data to a cloud service and pay monthly.&lt;/li&gt;
&lt;li&gt;Spend hours manually outlining objects.&lt;/li&gt;
&lt;li&gt;Or build your own annotation tool from scratch.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There had to be a better way.&lt;/p&gt;

&lt;p&gt;So we built one.&lt;/p&gt;

&lt;p&gt;Meet NotumAi.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jzghd77swer37nfec2m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5jzghd77swer37nfec2m.png" alt="NotumAi Main Page" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;💡 Why We Built NotumAi&lt;/p&gt;

&lt;p&gt;Modern segmentation models like Segment Anything Model 2 (SAM 2) from Meta can generate high-quality masks instantly.&lt;/p&gt;

&lt;p&gt;But here’s the problem:&lt;/p&gt;

&lt;p&gt;There isn’t a clean, developer-friendly, fully local tool that integrates these models into a smooth dataset creation workflow.&lt;/p&gt;

&lt;p&gt;Most solutions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud-based&lt;/li&gt;
&lt;li&gt;Expensive&lt;/li&gt;
&lt;li&gt;Closed source&lt;/li&gt;
&lt;li&gt;Not optimized for custom pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We wanted a tool that is:&lt;/p&gt;

&lt;p&gt;🔒 100% Local — Your data never leaves your machine&lt;/p&gt;

&lt;p&gt;⚡ GPU Accelerated — Real-time AI-assisted segmentation&lt;/p&gt;

&lt;p&gt;🎨 Modern &amp;amp; Clean — Built for long annotation sessions&lt;/p&gt;

&lt;p&gt;🌍 Open Source — Built with and for the community&lt;/p&gt;

&lt;p&gt;That’s how NotumAi was born.&lt;/p&gt;

&lt;p&gt;🛠️ What is NotumAi?&lt;br&gt;
NotumAi is a professional-grade image annotation tool specifically designed for creating computer vision datasets. It combines a robust Python backend (handling the heavy lifting with PyTorch and SAM 2) with a modern Electron frontend (ensuring a responsive, beautiful interface).&lt;/p&gt;

&lt;p&gt;Key Features&lt;br&gt;
⚡ AI-Assisted Segmentation: Just click on an object, and NotumAi instantly generates a precise polygon mask using SAM 2.&lt;br&gt;
🎨 Professional UI: A glassmorphism-inspired design with a focus on usability and aesthetics.&lt;br&gt;
📂 Project Management: Organize your datasets, persist your progress, and manage multiple classes effortlessly.&lt;br&gt;
💾 Flexible Export: Export your work in standard formats like COCO, YOLO, and Pascal VOC, ready for training.&lt;br&gt;
🔒 Local &amp;amp; Secure: Your data stays with you. Perfect for sensitive or proprietary datasets.&lt;/p&gt;

&lt;p&gt;🏗️ Under the Hood&lt;br&gt;
For the developers out there, here's how we built it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend: Electron, HTML5, Vanilla JS, and CSS (no heavy frameworks, just pure performance).&lt;/li&gt;
&lt;li&gt;Backend: Python with FastAPI and Uvicorn.&lt;/li&gt;
&lt;li&gt;AI Engine: PyTorch running Meta's SAM 2.1 model.&lt;/li&gt;
&lt;li&gt;Communication: Seamless HTTP between the frontend client and the local inference server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🌟 We Need You!&lt;br&gt;
NotumAi is an Open Source project, and we are just getting started. We have a solid foundation, but to make this the ultimate annotation tool, we need the community's help.&lt;/p&gt;

&lt;p&gt;Whether you are a:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend Wizard: Help us polish the UI, add new interaction modes, or optimize the canvas rendering.&lt;/li&gt;
&lt;li&gt;AI Engineer: Help us optimize the inference pipeline or integrate new models.&lt;/li&gt;
&lt;li&gt;Pythonista: Help us refine the backend logic, improve file handling, or add new export formats.
...your contributions are welcome!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔗 &lt;a href="https://maulik225.github.io/NotumAi/" rel="noopener noreferrer"&gt;Get Involved&lt;/a&gt;&lt;br&gt;
Check out the repository, give us a star, and let's build the best open-source annotation tool together.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://github.com/maulik225/NotumAi" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Happy Annotating! 🖊️✨&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>javascript</category>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>Why Studying the Turing Machine Changed How I See AI And Why Every New AI Engineer Should Revisit It</title>
      <dc:creator>Maulik Sompura</dc:creator>
      <pubDate>Thu, 27 Nov 2025 18:29:32 +0000</pubDate>
      <link>https://dev.to/maulik_sompura_22/why-studying-the-turing-machine-changed-how-i-see-ai-and-why-every-new-ai-engineer-should-revisit-it-43hp</link>
      <guid>https://dev.to/maulik_sompura_22/why-studying-the-turing-machine-changed-how-i-see-ai-and-why-every-new-ai-engineer-should-revisit-it-43hp</guid>
      <description>&lt;h2&gt;
  
  
  How a “boring theory subject” ended up shaping my entire AI career
&lt;/h2&gt;

&lt;p&gt;When I was doing my Master’s in computer engineering at the University of Padova, there was one subject everyone whispered about:&lt;/p&gt;

&lt;p&gt;Automata Theory &amp;amp; Computation.&lt;/p&gt;

&lt;p&gt;Not because it was exciting…&lt;br&gt;
…but because most students wanted to survive it.&lt;/p&gt;

&lt;p&gt;I remember sitting in the lecture hall asking myself:&lt;/p&gt;

&lt;p&gt;“Why are we learning about an imaginary tape machine in 2024?&lt;br&gt;
I want to build AI systems, not decode puzzles from 1936.”&lt;/p&gt;

&lt;p&gt;What I didn’t know was that this single subject—the one we all underestimated—would quietly reshape the way I think about AI, computation, and even my day-to-day engineering work.&lt;/p&gt;

&lt;p&gt;Let me tell you how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Day I Realized a Neural Network Is Not Magic
&lt;/h2&gt;

&lt;p&gt;Months later, when I was working on high-speed machine vision projects (with 1ms deadlines), something struck me:&lt;/p&gt;

&lt;p&gt;Everything I was building, every pipeline, every RL loop, every segmentation model could be reduced to:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;State -&amp;gt; Transition -&amp;gt; New State&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exactly like the thing I thought was useless in university.&lt;/p&gt;

&lt;p&gt;Suddenly, the Turing Machine wasn’t a historical artifact.&lt;br&gt;
It was a mirror showing me the essence of modern AI.&lt;/p&gt;

&lt;p&gt;A lot of students think the Turing Machine is just a boring theoretical device.&lt;br&gt;
But in reality, it answers two of the most important questions in modern AI:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What can be computed?&lt;/li&gt;
&lt;li&gt;What cannot be computed by ANY machine — even GPT-50?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;No matter how big or advanced a neural network becomes, it still cannot solve anything beyond what a Turing Machine can solve.&lt;br&gt;
This means AI is still bound by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;undecidable problems&lt;/li&gt;
&lt;li&gt;halting limitations&lt;/li&gt;
&lt;li&gt;computational complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern AI might look magical, but it does not break the laws established in 1936.&lt;/p&gt;

&lt;p&gt;The Classic Turing Machine We All Ignored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2dzebedu2lvw8l1kch4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2dzebedu2lvw8l1kch4.png" alt="Turing Machine" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Back then, it was just this.&lt;br&gt;
A tape.&lt;br&gt;
Some states.&lt;br&gt;
A transition function.&lt;/p&gt;
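&lt;p&gt;That whole picture, a tape, some states, a transition function, fits in a few lines of Python. A minimal sketch (the bit-flipping machine at the end is a toy chosen purely for illustration):&lt;/p&gt;

```python
def run_turing_machine(tape, transitions, state="start", head=0, max_steps=1000):
    """Minimal one-tape Turing machine.

    transitions maps (state, symbol) -> (new_state, new_symbol, move),
    where move is -1 (left), +1 (right), or 0. Stops in state 'halt'.
    """
    tape = list(tape)
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape[head] if 0 <= head < len(tape) else "_"  # "_" = blank
        state, new_symbol, move = transitions[(state, symbol)]
        if 0 <= head < len(tape):
            tape[head] = new_symbol
        head += move
    return "".join(tape)

# A toy machine that inverts a binary string, then halts at the blank.
flip = {
    ("start", "0"): ("start", "1", +1),
    ("start", "1"): ("start", "0", +1),
    ("start", "_"): ("halt", "_", 0),
}
print(run_turing_machine("0110_", flip))  # -> 1001_
```

&lt;p&gt;State, transition, new state; nothing else. Everything the rest of this post says about AI sits on top of exactly this loop.&lt;/p&gt;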

&lt;p&gt;But what I didn’t understand was:&lt;/p&gt;

&lt;p&gt;This is literally the foundation of all computation, including today’s AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Modern AI Model vs. Turing Machine
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1n2h5v7uj2i6jbb1yme.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1n2h5v7uj2i6jbb1yme.png" alt="AI model vs TM" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a conceptual level:&lt;/p&gt;

&lt;p&gt;A Transformer is a sophisticated state machine&lt;br&gt;
built on a theory created in the 1930s.&lt;/p&gt;

&lt;p&gt;Mind = blown.&lt;br&gt;
Mine definitely was.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hidden Superpower of Automata Theory
&lt;/h2&gt;

&lt;p&gt;Once you get it, something changes:&lt;/p&gt;

&lt;p&gt;You stop thinking like a coder.&lt;br&gt;
You start thinking like a computational architect.&lt;/p&gt;

&lt;p&gt;Automata teaches you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how to break problems down into minimal logic&lt;/li&gt;
&lt;li&gt;how to reason in sequences (vital for NLP + RL)&lt;/li&gt;
&lt;li&gt;why some problems are inherently slow&lt;/li&gt;
&lt;li&gt;why some optimizations are impossible&lt;/li&gt;
&lt;li&gt;how systems transition, not just compute&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most importantly:&lt;/p&gt;

&lt;p&gt;It gives you a mental model so strong&lt;br&gt;
that AI becomes less of a black box and more of a predictable system.&lt;/p&gt;

&lt;p&gt;Every pipeline is a giant state machine.&lt;/p&gt;

&lt;p&gt;Even the most advanced RLHF systems.&lt;br&gt;
Even computer vision.&lt;br&gt;
Even GPT.&lt;/p&gt;

&lt;p&gt;This is what separates AI “users” from AI “engineers.”&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Message to New AI Learners
&lt;/h2&gt;

&lt;p&gt;If you’re entering AI today…&lt;/p&gt;

&lt;p&gt;Don’t skip the fundamentals.&lt;br&gt;
Don’t choose short-term speed over long-term mastery.&lt;br&gt;
And don’t underestimate the Turing Machine the way I did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Because when the hype fades (and trust me, it will),&lt;br&gt;
the engineers who understand theory&lt;br&gt;
are the ones who keep building the future.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>learning</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>The Hidden Role of Probability in Large Language Models</title>
      <dc:creator>Maulik Sompura</dc:creator>
      <pubDate>Tue, 06 May 2025 16:01:25 +0000</pubDate>
      <link>https://dev.to/maulik_sompura_22/the-hidden-role-of-probability-in-large-language-models-5f9g</link>
      <guid>https://dev.to/maulik_sompura_22/the-hidden-role-of-probability-in-large-language-models-5f9g</guid>
      <description>&lt;p&gt;Have you ever wondered how this LLM works? this question sparks curiosity in me and lead me into deep research. Allow me to share some of my thoughts on this latest trend. &lt;/p&gt;

&lt;p&gt;Most people believe large language models like GPT-4 and Claude "understand" language and give the best answer through genuine intelligence.&lt;/p&gt;

&lt;p&gt;But the truth is that every word these models generate is a mathematical gamble: a calculated probability distribution over thousands of possible next tokens.&lt;/p&gt;

&lt;p&gt;In this post we will go beyond the usual "transformers and attention" explanation and look at how probability is the real hero behind everything an LLM does, from creativity to hallucinations.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is really happening inside an LLM?
&lt;/h2&gt;

&lt;p&gt;When you prompt a model with a sentence, it doesn't just look up an answer. Instead, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converts the prompt into tokens (like "Hello", ",", "how", "are", "you").&lt;/li&gt;
&lt;li&gt;Passes those tokens through layers of a &lt;a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)" rel="noopener noreferrer"&gt;neural network&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Produces a list of logits: raw scores for each possible next token.&lt;/li&gt;
&lt;li&gt;Applies a &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Softmax_function" rel="noopener noreferrer"&gt;softmax function&lt;/a&gt;&lt;/strong&gt; to convert those scores into a &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Probability_distribution" rel="noopener noreferrer"&gt;probability distribution&lt;/a&gt;&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Samples or selects the next token based on that distribution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This process repeats one token at a time.&lt;/p&gt;
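&lt;p&gt;Steps 3 to 5 fit in a few lines. A toy sketch with made-up logits over five candidate tokens:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

tokens = ["mat", "floor", "roof", "table", "car"]
logits = np.array([4.0, 2.7, 1.9, 1.2, -0.2])  # raw scores from the network (invented)

# Softmax turns raw logits into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Sampling picks the next token in proportion to its probability.
next_token = rng.choice(tokens, p=probs)
print(dict(zip(tokens, probs.round(3))), next_token)
```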

&lt;h2&gt;
  
  
  How LLMs Choose Words: It's All Probabilities
&lt;/h2&gt;

&lt;p&gt;Here's a simple example.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;"The cat sat on the _____"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It might internally assign probabilities like this:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Token&lt;/th&gt;&lt;th&gt;Probability&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;mat&lt;/td&gt;&lt;td&gt;0.64&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;floor&lt;/td&gt;&lt;td&gt;0.17&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;roof&lt;/td&gt;&lt;td&gt;0.08&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;table&lt;/td&gt;&lt;td&gt;0.04&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;car&lt;/td&gt;&lt;td&gt;0.01&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp0jvxsdfyupvainrpu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxp0jvxsdfyupvainrpu0.png" alt="Probability distribution" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It might choose "mat" 64% of the time, but with temperature adjustment or top-k sampling it could choose "floor" or even "roof" to keep things creative.&lt;/p&gt;
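&lt;p&gt;Temperature and top-k are small tweaks to that same softmax step. A minimal sketch (the logits are invented, and &lt;code&gt;sample_probs&lt;/code&gt; is an illustrative helper, not a library function):&lt;/p&gt;

```python
import numpy as np

def sample_probs(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after temperature scaling and top-k filtering."""
    logits = np.asarray(logits, dtype=float) / temperature  # low T sharpens, high T flattens
    if top_k is not None:
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits >= cutoff, logits, -np.inf)  # drop everything outside top-k
    p = np.exp(logits - logits.max())  # exp(-inf) = 0, so dropped tokens get probability 0
    return p / p.sum()

logits = [4.0, 2.7, 1.9, 1.2, -0.2]
print(sample_probs(logits, temperature=0.5).round(3))  # sharper: "mat" dominates
print(sample_probs(logits, temperature=2.0).round(3))  # flatter: more creative
print(sample_probs(logits, top_k=2).round(3))          # only the two best tokens survive
```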

&lt;p&gt;&lt;strong&gt;Why this matters?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;LLMs don't know facts; they predict what's most probable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This is why they &lt;strong&gt;hallucinate&lt;/strong&gt;: sometimes the most likely token just sounds right, even if it is wrong.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Tools like temperature, top-k, and top-p control this randomness.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Even prompt engineering is really just guiding the probability space.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next time ChatGPT feels smart, remember: it is not reasoning like a human. It's rolling a weighted die, one token at a time, and that die is shaped by your input, the training data, and probability.&lt;/p&gt;

</description>
      <category>learning</category>
      <category>programming</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
