The Question
What if we skipped training entirely?
Every language model — GPT, LLaMA, BERT — learns by optimising a loss function over millions of gradient steps. But the underlying data is just text: words appearing near other words. Co-occurrence. Counting.
So I asked: how far can pure mathematics take us toward text generation, without a single training step?
I built the whole thing from scratch in Python with NumPy. No PyTorch, no TensorFlow, no model.train(). Just matrices, statistics, and formulas.
Here's what happened.
The Setup: nanoVectorDB
I started with nanoVectorDB — a vector database I'd built from scratch using only NumPy. The original goal was embeddings and similarity search. But then I thought: if I can build word vectors without training, can I also generate text without training?
The corpus: WikiText-103 — 80 million tokens of Wikipedia articles. A 10,000-word vocabulary covering 89% of all tokens.
The math pipeline:
- Co-occurrence matrix — For each word pair, count how often the words appear within a window of 5 tokens. Forward-heavy weighting (0.7 forward, 0.3 backward), because "the king" tells you more about what follows than what came before.
- PPMI (Positive Pointwise Mutual Information) — Raw counts are dominated by common words. PPMI asks: "does this word pair appear together MORE often than chance would predict?" The formula: PMI(x, y) = log(P(x, y) / (P(x)P(y))), clamped to zero for negative values.
- SVD (Singular Value Decomposition) — Compress the sparse 10,000×10,000 PPMI matrix into dense 64-dimensional word embeddings. Each word becomes a vector of 64 numbers.
- Bigram grammar matrix — Separately, count every word-to-word transition: P(next | last_word), a 10,000×10,000 matrix of raw transition probabilities.
No training. Just counting and matrix factorisation.
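The whole pipeline fits in a short function. Here is a minimal sketch of the counting-plus-factorisation idea (the real pipeline streams 80M tokens and builds the matrix on GPU; this toy version holds everything in memory, and the function name and defaults are illustrative, not the actual code):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def build_embeddings(tokens, vocab, window=5, fwd=0.7, bwd=0.3, dim=64):
    """Co-occurrence -> PPMI -> truncated SVD. No training, just counting."""
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    C = lil_matrix((V, V))
    ids = [idx[t] for t in tokens if t in idx]
    for i, w in enumerate(ids):
        for j in range(max(0, i - window), min(len(ids), i + window + 1)):
            if j == i:
                continue
            # forward-heavy weighting: what follows matters more
            C[w, ids[j]] += fwd if j > i else bwd
    C = C.tocsr()
    total = C.sum()
    row = np.asarray(C.sum(axis=1)).ravel() / total   # P(x)
    col = np.asarray(C.sum(axis=0)).ravel() / total   # P(y)
    P = C.toarray() / total                           # P(x, y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(P / np.outer(row, col))
    # PPMI: clamp negative (and -inf) PMI values to zero
    ppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)
    # truncated SVD: keep only the top `dim` singular directions
    U, S, _ = svds(ppmi, k=dim)
    return U * np.sqrt(S)  # one dense embedding row per vocab word
```

Feeding it any tokenised corpus and a vocabulary list yields one dense vector per word, ready for cosine-similarity lookups.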
What Pure Math Gets Right: Meaning
The embeddings were shockingly good.
Word neighbours:
king → heir, regent, throne, prince, emperor, pope
queen → princess, duchess, sophia, isabella, catherine
music → indie, pop, hop, jazz, rap, songs, dance
river → lake, creek, valley, upstream, canyon
Analogies (on real data, 80M tokens):
king:man :: queen:? → woman ✓
man:woman :: boy:? → girl ✓
france:paris :: japan:? → tokyo ✓
king:queen :: prince:? → princess ✓
5 out of 8 exact matches at rank 1. 7 out of 8 in the top 5.
This isn't a toy result. The SVD embeddings understand that king-queen has the same relationship as prince-princess. They understand that France-Paris maps to Japan-Tokyo. All from counting word co-occurrences in Wikipedia, factorising the matrix, and computing cosine similarities.
This validates Levy & Goldberg (2014) — Word2Vec is implicitly factorising a PMI co-occurrence matrix. We just did it explicitly.
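The analogy test itself is just vector arithmetic plus cosine similarity — the standard 3CosAdd method. A minimal sketch (assuming `emb` is the embedding matrix from the pipeline above and `idx` maps words to row indices; both names are illustrative):

```python
import numpy as np

def analogy(emb, idx, a, b, c, topk=5):
    """Solve a:b :: c:? by computing b - a + c and ranking by cosine."""
    E = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit rows
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    target /= np.linalg.norm(target)
    sims = E @ target
    for w in (a, b, c):            # exclude the query words themselves
        sims[idx[w]] = -np.inf
    return np.argsort(-sims)[:topk]
```

"5 out of 8 at rank 1" means the correct answer was `analogy(...)[0]` for five of the eight test analogies.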
Where It Breaks: Generation
Meaning was solved. Now I tried to generate text. Seed the system with "the king and queen" and predict the next word, then the next, and so on.
Attempt 1: Semantic only (cosine similarity to context)
the king and queen → isabella sophia catherine isabella sophia
catherine isabella sophia catherine...
Pure synonym loop. The most similar word to "queen" is "isabella". The most similar word to "isabella" is "sophia". Then back to "catherine". Forever.
Attempt 2: Bigram grammar only
Grammar knew that "queen" is often followed by "anne", and "anne" is followed by "elizabeth". But it produced generic Wikipedia filler with no topic awareness.
Attempt 3: Two-stage (the breakthrough)
This was the key idea:
- Semantic filter: find the 20 words most similar to the current context (cosine similarity of SVD embeddings).
- Grammar rerank: among those 20, score each by bigram probability — how often does it actually follow the last word in real text?
- Combine: final = 0.7 × grammar + 0.3 × semantic.
- No repeat: block every word that has already been used.
Semantic proposes. Grammar disposes. No-repeat forces forward motion.
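The two-stage pick can be sketched in a few lines (a simplified illustration of the scoring described above, not the exact code; `emb` rows are assumed unit-normalised and `bigram[i]` holds P(next | word i)):

```python
import numpy as np

def next_word(context_vec, last_id, emb, bigram, used, k=20, alpha=0.7):
    """Semantic filter -> grammar rerank -> no-repeat."""
    sem = emb @ (context_vec / np.linalg.norm(context_vec))
    sem[list(used)] = -np.inf              # no-repeat: block used words
    cand = np.argsort(-sem)[:k]            # stage 1: top-k by meaning
    gram = bigram[last_id, cand]           # stage 2: bigram probability
    score = alpha * gram + (1 - alpha) * sem[cand]
    return cand[np.argmax(score)]
```

Generation is then just calling this in a loop, appending each pick to `used` and updating the context vector.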
The result:
the war → guerrilla forces fighting troops advancing germans
italians retreated retreat battle captured ottoman army soldiers
surrendered marched garrison surrender siege reinforcements
assault force deployed units corps cavalry division th infantry
regiment rd battalion nd brigade headquarters unit commanding
anzac divisional artillery
That's a coherent military narrative — from guerrilla warfare through retreat, surrender, siege, to specific military units and hierarchy. Every transition makes bigram sense. The semantic filter keeps it on-topic. No-repeat pushes it forward.
More outputs from the same system:
the school was built → constructed building construction block
tower walls towers arches columns carved wooden stone wall arch
roof tiles marble floors panels decorated brick exterior
decoration decorative sculptures paintings depicting figures
Architecture → materials → decoration → art. A visual journey through a building.
she won the award → winning medal awarded prize award recipient
honorary academy graduate school student faculty students
enrolled
Awards → academia → enrollment. A career trajectory.
The 15 Versions That Followed
The two-stage system generated impressive topic walks but not sentences. So I spent the next 15 versions trying to fix it.
v3.2 — Union pools. Instead of only semantic candidates, I combined semantic top-20 + grammar top-20 into a pool of ~40 candidates. Grammar words like "was", "of", "the" could now compete. Result: grammar words dominated after a few steps. Every seed converged to "...of his own right to be used as well known..." — the same generic Wikipedia filler.
v3.3 — Dual memory. Semantic context tracked only content words (skipping grammar picks). Grammar context used the full sentence. Result: semantic stayed on topic but grammar picks were random glue words. Content and structure weren't coordinated.
v3.4 — Forced alternation (SEM/GRAM/SEM/GRAM). Forced the system to alternate between semantic and grammar picks. Result: the most readable output yet:
the war → victorious IN surrender OF surrendered TO seized BY
besieged AND captured ON
Grammar words (of, in, by, to, and) appeared as glue between content words. Almost readable — but the grammar words weren't chosen for the content words. "Of" appeared because it has a high bigram score after almost anything, not because the sentence needed it there.
v3.5 — Trigram grammar. Built a trigram dictionary from the corpus (5.9 million unique contexts). Trigrams captured real phrases that bigrams couldn't:
dining → hall
shopping → centre
tourist → attraction
nobel → peace
honorary → degree
These are genuine multi-word expressions. The bigram only saw "dining → room" or "dining → area". The trigram saw "dining hall" as a unit. But trigram sparsity meant frequent fallback to bigram.
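The fallback logic is a classic backoff lookup. A minimal sketch of the v3.5 behaviour as described (dictionary shapes are assumptions: `trigrams` maps a (w1, w2) context to next-word counts, `bigrams` maps a single word to next-word counts):

```python
def ngram_predict(context, trigrams, bigrams):
    """Try the trigram context first; back off to bigram on a miss."""
    dist = trigrams.get(tuple(context[-2:]))
    if not dist:                      # trigram sparsity -> back off
        dist = bigrams.get(context[-1], {})
    if not dist:
        return None
    return max(dist, key=dist.get)    # most frequent continuation
```

With 5.9 million unique trigram contexts over a 10,000-word vocabulary, most two-word contexts at generation time still missed, which is why the backoff branch fired so often.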
v3.7 — 4-gram. Even sparser, rarely fired, fell back to trigram → bigram. Marginal improvement.
v3.8 — Fuzzy n-grams. The most creative attempt. Instead of exact trigram lookup, find similar contexts via embedding cosine similarity. "emperor empress" could borrow predictions from "king queen" because their embeddings are close. Result: the fuzzy matching was too loose — it matched contexts that sounded similar but had completely different meanings. Pulled in noise.
v3.9 — Union pools + fuzzy trigram. Combined everything. Same gravity-well problem — converged to generic filler after ~8 steps.
v4.0 — Alpha sweep. Tested grammar weights from 0.3 to 0.9 across 10 seeds. Different seeds needed different alpha values. No single alpha worked universally.
v4.1 — MMR soft diversity. Instead of hard-blocking used words, computed max cosine similarity to all previously used word embeddings as a penalty. final = relevance - λ × redundancy. λ=0.4 forced exploration of adjacent semantic regions. "the war" at λ=0.4 traced history across civilizations:
guerrilla forces fighting retreat battle army troops captured
turkish soldiers surrendered italians germans retreated
outnumbered defenders withdrew exhausted armies marched siege
ottoman turks byzantine empire conquered egypt syria lebanon
palestine israel occupation vietnam cambodia independence
From guerrilla warfare → Ottoman Empire → Byzantine Empire → Egypt/Syria → Israel/Palestine → Vietnam/Cambodia → independence. A walk through centuries of military history, forced by diversity to keep exploring.
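The v4.1 penalty is a few lines on top of the existing scorer. A minimal sketch of the `final = relevance - λ × redundancy` formula (function name is illustrative; `emb` rows are assumed unit-normalised):

```python
import numpy as np

def mmr_score(relevance, cand_ids, used_ids, emb, lam=0.4):
    """Penalise each candidate by its max cosine similarity to used words."""
    if not used_ids:
        return relevance
    redund = (emb[cand_ids] @ emb[list(used_ids)].T).max(axis=1)
    return relevance - lam * redund
```

Unlike the hard no-repeat block, this lets near-duplicates through when they are relevant enough, while steadily pushing generation toward unexplored regions of the embedding space.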
The Scorecard
| Capability | Toy (213 words) | 100k tokens | 80M tokens |
|---|---|---|---|
| Similarity separation | 0.93 | 0.21 | 0.87 |
| Analogies @5 | 100% | 33% | 70% |
| NTP (token accuracy) | 41% | 2.7% | 0% |
| Generation | Semantic chains | — | Topic walks, no sentences |
PPMI + SVD solves meaning. Bigrams solve local transitions. Together they generate coherent topic walks. But they cannot generate grammatical sentences.
Why It Can't Generate Sentences
After 15 versions, the diagnosis is clear. Every fix solved one problem and created another:
| What we tried | What it fixed | What it broke |
|---|---|---|
| SVD semantic only | Meaning | Loops, no grammar |
| + Bigram grammar | Basic transitions | Generic glue chains |
| + No repeat | No loops | Exhausts topic words |
| + Dual pool | Grammar words appear | Grammar dominates |
| + Dual memory | Topic stays alive | Grammar picks random glue |
| + Alternation | Content+glue pattern | No coordination |
| + Trigram | Real phrases | Sparsity |
| + Fuzzy n-gram | Generalisation | Too loose, noise |
| + MMR diversity | Explores new regions | Still no sentences |
The missing piece is always the same: position-dependent context tracking.
After "the king ruled the", our system needs to know "we need a noun here — specifically an object of 'ruled'." But:
- Semantic scoring only knows "what word is RELATED to the recent context" — it doesn't know about syntactic roles.
- Grammar scoring only knows "what word commonly FOLLOWS the last word" — P(next | "the") doesn't know we're in the object position of "ruled".
A transformer solves this with attention over the full sequence. At position 5, it can look back at position 2 ("ruled") and learn that "ruled the ___" needs a noun object. Our system can only look at the last 1-4 words, and it can't learn positional patterns because there's no learning.
Static embeddings give every word one fixed vector regardless of context. "King" after "the" (needs a verb next) has the same vector as "king" after "became" (needs a determiner). Dynamic, context-dependent representations require attention — and attention requires training.
What This Proves
Levy & Goldberg (2014) proved that Word2Vec implicitly factorises a PMI matrix. Zhao et al. (2025) proved that next-token prediction training converges to SVD factors of co-occurrence structure.
Our experiments confirm both from the other direction: we built the SVD factorisation explicitly and got embeddings that rival Word2Vec quality. But we also proved WHERE that equivalence breaks down — at generation.
Transformers aren't doing something fundamentally different from SVD for meaning. But they add the crucial missing piece: positional, context-dependent reweighting of those factors at every step.
The map of meaning can be built with pure math. The navigator through that map requires learning.
The Stack
- Language: Python
- Core: NumPy, SciPy (sparse SVD)
- GPU acceleration: CuPy (cupyx.scatter_add for co-occurrence matrix building on CUDA)
- Data: WikiText-103 via HuggingFace datasets (80M tokens)
- Hardware: Kaggle T4 GPU
- Training: Zero. None. Not a single gradient step.
What I'd Build Next
This isn't a dead end — it's a foundation. The experiments point to several directions:
Retrieval instead of generation. The embeddings are excellent for finding relevant content. Instead of generating word-by-word, use the SVD vectors to RETRIEVE real sentences from the corpus that match the semantic context. That's what vector databases are actually for.
Hybrid systems. Use the pure-math embeddings as a pre-computed semantic layer, then a small trained model (even a simple RNN) just for the sequential state tracking. The heavy lifting of meaning is already done.
Educational tool. This entire pipeline is transparent — every number is interpretable. No black boxes. Perfect for teaching how language models work from first principles.
Try It Yourself
All you need is NumPy, SciPy, and WikiText-103. Build a co-occurrence matrix, apply PPMI, run SVD, add a bigram grammar matrix. Two matrices. Two stages. No training. Just math.
And now you know exactly where the math stops and the learning begins.
This research was conducted as an independent exploration. Thanks to Levy & Goldberg (2014) for the theoretical foundation and to Zhao et al. (2025) for extending the connection to next-token prediction.