The Question
What if we skipped training entirely?
Every language model — GPT, LLaMA, BERT — learns by optimising a loss function over millions of gradient steps. But the underlying data is just text: words appearing near other words. Co-occurrence. Counting.
So I asked: how far can pure mathematics take us toward text generation, without a single training step?
I built the whole thing from scratch in Python with NumPy. No PyTorch, no TensorFlow, no model.train(). Just matrices, statistics, and formulas.
Here's what happened.
The Setup: nanoVectorDB
I started with nanoVectorDB — a vector database I'd built from scratch using only NumPy. The original goal was embeddings and similarity search. But then I thought: if I can build word vectors without training, can I also generate text without training?
The corpus: WikiText-103 — 80 million tokens of Wikipedia articles. A 10,000-word vocabulary covering 89% of all tokens.
The math pipeline:
- Co-occurrence matrix — For each word pair, count how often the words appear within a window of 5 tokens. Forward-heavy weighting (0.7 forward, 0.3 backward), because "the king" tells you more about what follows than what came before.
- PPMI (Positive Pointwise Mutual Information) — Raw counts are dominated by common words. PPMI asks: "does this word pair appear together MORE often than chance would predict?" The formula: PMI(x, y) = log(P(x, y) / (P(x)P(y))), clamped to zero for negative values.
- SVD (Singular Value Decomposition) — Compress the sparse 10,000×10,000 PPMI matrix into dense 64-dimensional word embeddings. Each word becomes a vector of 64 numbers.
- Bigram grammar matrix — Separately, count every word-to-word transition: P(next | last_word), a 10,000×10,000 matrix of raw transition probabilities.
No training. Just counting and matrix factorisation.
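The whole pipeline fits in a short function. Here is a minimal sketch of the counting-plus-factorisation idea (the real pipeline streams 80M tokens and builds the matrix on GPU; this toy version holds everything in memory, and the function name and defaults are illustrative, not the actual code):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import svds

def build_embeddings(tokens, vocab, window=5, fwd=0.7, bwd=0.3, dim=64):
    """Co-occurrence -> PPMI -> truncated SVD. No training, just counting."""
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    C = lil_matrix((V, V))
    ids = [idx[t] for t in tokens if t in idx]
    for i, w in enumerate(ids):
        for j in range(max(0, i - window), min(len(ids), i + window + 1)):
            if j == i:
                continue
            # forward-heavy weighting: what follows matters more
            C[w, ids[j]] += fwd if j > i else bwd
    C = C.tocsr()
    total = C.sum()
    row = np.asarray(C.sum(axis=1)).ravel() / total   # P(x)
    col = np.asarray(C.sum(axis=0)).ravel() / total   # P(y)
    P = C.toarray() / total                           # P(x, y)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(P / np.outer(row, col))
    # PPMI: clamp negative (and -inf) PMI values to zero
    ppmi = np.maximum(np.nan_to_num(pmi, neginf=0.0), 0.0)
    # truncated SVD: keep only the top `dim` singular directions
    U, S, _ = svds(ppmi, k=dim)
    return U * np.sqrt(S)  # one dense embedding row per vocab word
```

Feeding it any tokenised corpus and a vocabulary list yields one dense vector per word, ready for cosine-similarity lookups.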
What Pure Math Gets Right: Meaning
The embeddings were shockingly good.
Word neighbours:
king → heir, regent, throne, prince, emperor, pope
queen → princess, duchess, sophia, isabella, catherine
music → indie, pop, hop, jazz, rap, songs, dance
river → lake, creek, valley, upstream, canyon
Analogies (on real data, 80M tokens):
king:man :: queen:? → woman ✓
man:woman :: boy:? → girl ✓
france:paris :: japan:? → tokyo ✓
king:queen :: prince:? → princess ✓
5 out of 8 exact matches at rank 1. 7 out of 8 in the top 5.
This isn't a toy result. The SVD embeddings understand that king-queen has the same relationship as prince-princess. They understand that France-Paris maps to Japan-Tokyo. All from counting word co-occurrences in Wikipedia, factorising the matrix, and computing cosine similarities.
This validates Levy & Goldberg (2014) — Word2Vec is implicitly factorising a PMI co-occurrence matrix. We just did it explicitly.
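The analogy test itself is just vector arithmetic plus cosine similarity — the standard 3CosAdd method. A minimal sketch (assuming `emb` is the embedding matrix from the pipeline above and `idx` maps words to row indices; both names are illustrative):

```python
import numpy as np

def analogy(emb, idx, a, b, c, topk=5):
    """Solve a:b :: c:? by computing b - a + c and ranking by cosine."""
    E = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit rows
    target = E[idx[b]] - E[idx[a]] + E[idx[c]]
    target /= np.linalg.norm(target)
    sims = E @ target
    for w in (a, b, c):            # exclude the query words themselves
        sims[idx[w]] = -np.inf
    return np.argsort(-sims)[:topk]
```

"5 out of 8 at rank 1" means the correct answer was `analogy(...)[0]` for five of the eight test analogies.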
Where It Breaks: Generation
Meaning was solved. Now I tried to generate text. Seed the system with "the king and queen" and predict the next word, then the next, and so on.
Attempt 1: Semantic only (cosine similarity to context)
the king and queen → isabella sophia catherine isabella sophia
catherine isabella sophia catherine...
Pure synonym loop. The most similar word to "queen" is "isabella". The most similar word to "isabella" is "sophia". Then back to "catherine". Forever.
Attempt 2: Bigram grammar only
Grammar knew that "queen" is often followed by "anne", and "anne" is followed by "elizabeth". But it produced generic Wikipedia filler with no topic awareness.
Attempt 3: Two-stage (the breakthrough)
This was the key idea:
- Semantic filter: find the 20 words most similar to the current context (cosine similarity of SVD embeddings).
- Grammar rerank: among those 20, score each by bigram probability — how often does it actually follow the last word in real text?
- Combine: final = 0.7 × grammar + 0.3 × semantic.
- No repeat: block every word that has already been used.
Semantic proposes. Grammar disposes. No-repeat forces forward motion.
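The two-stage pick can be sketched in a few lines (a simplified illustration of the scoring described above, not the exact code; `emb` rows are assumed unit-normalised and `bigram[i]` holds P(next | word i)):

```python
import numpy as np

def next_word(context_vec, last_id, emb, bigram, used, k=20, alpha=0.7):
    """Semantic filter -> grammar rerank -> no-repeat."""
    sem = emb @ (context_vec / np.linalg.norm(context_vec))
    sem[list(used)] = -np.inf              # no-repeat: block used words
    cand = np.argsort(-sem)[:k]            # stage 1: top-k by meaning
    gram = bigram[last_id, cand]           # stage 2: bigram probability
    score = alpha * gram + (1 - alpha) * sem[cand]
    return cand[np.argmax(score)]
```

Generation is then just calling this in a loop, appending each pick to `used` and updating the context vector.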
The result:
the war → guerrilla forces fighting troops advancing germans
italians retreated retreat battle captured ottoman army soldiers
surrendered marched garrison surrender siege reinforcements
assault force deployed units corps cavalry division th infantry
regiment rd battalion nd brigade headquarters unit commanding
anzac divisional artillery
That's a coherent military narrative — from guerrilla warfare through retreat, surrender, siege, to specific military units and hierarchy. Every transition makes bigram sense. The semantic filter keeps it on-topic. No-repeat pushes it forward.
More outputs from the same system:
the school was built → constructed building construction block
tower walls towers arches columns carved wooden stone wall arch
roof tiles marble floors panels decorated brick exterior
decoration decorative sculptures paintings depicting figures
Architecture → materials → decoration → art. A visual journey through a building.
she won the award → winning medal awarded prize award recipient
honorary academy graduate school student faculty students
enrolled
Awards → academia → enrollment. A career trajectory.
The 15 Versions That Followed
The two-stage system generated impressive topic walks but not sentences. So I spent the next 15 versions trying to fix it.
v3.2 — Union pools. Instead of only semantic candidates, I combined semantic top-20 + grammar top-20 into a pool of ~40 candidates. Grammar words like "was", "of", "the" could now compete. Result: grammar words dominated after a few steps. Every seed converged to "...of his own right to be used as well known..." — the same generic Wikipedia filler.
v3.3 — Dual memory. Semantic context tracked only content words (skipping grammar picks). Grammar context used the full sentence. Result: semantic stayed on topic but grammar picks were random glue words. Content and structure weren't coordinated.
v3.4 — Forced alternation (SEM/GRAM/SEM/GRAM). Forced the system to alternate between semantic and grammar picks. Result: the most readable output yet:
the war → victorious IN surrender OF surrendered TO seized BY
besieged AND captured ON
Grammar words (of, in, by, to, and) appeared as glue between content words. Almost readable — but the grammar words weren't chosen for the content words. "Of" appeared because it has a high bigram score after almost anything, not because the sentence needed it there.
v3.5 — Trigram grammar. Built a trigram dictionary from the corpus (5.9 million unique contexts). Trigrams captured real phrases that bigrams couldn't:
dining → hall
shopping → centre
tourist → attraction
nobel → peace
honorary → degree
These are genuine multi-word expressions. The bigram only saw "dining → room" or "dining → area". The trigram saw "dining hall" as a unit. But trigram sparsity meant frequent fallback to bigram.
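The fallback logic is a classic backoff lookup. A minimal sketch of the v3.5 behaviour as described (dictionary shapes are assumptions: `trigrams` maps a (w1, w2) context to next-word counts, `bigrams` maps a single word to next-word counts):

```python
def ngram_predict(context, trigrams, bigrams):
    """Try the trigram context first; back off to bigram on a miss."""
    dist = trigrams.get(tuple(context[-2:]))
    if not dist:                      # trigram sparsity -> back off
        dist = bigrams.get(context[-1], {})
    if not dist:
        return None
    return max(dist, key=dist.get)    # most frequent continuation
```

With 5.9 million unique trigram contexts over a 10,000-word vocabulary, most two-word contexts at generation time still missed, which is why the backoff branch fired so often.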
v3.7 — 4-gram. Even sparser, rarely fired, fell back to trigram → bigram. Marginal improvement.
v3.8 — Fuzzy n-grams. The most creative attempt. Instead of exact trigram lookup, find similar contexts via embedding cosine similarity. "emperor empress" could borrow predictions from "king queen" because their embeddings are close. Result: the fuzzy matching was too loose — it matched contexts that sounded similar but had completely different meanings. Pulled in noise.
v3.9 — Union pools + fuzzy trigram. Combined everything. Same gravity-well problem — converged to generic filler after ~8 steps.
v4.0 — Alpha sweep. Tested grammar weights from 0.3 to 0.9 across 10 seeds. Different seeds needed different alpha values. No single alpha worked universally.
v4.1 — MMR soft diversity. Instead of hard-blocking used words, computed max cosine similarity to all previously used word embeddings as a penalty. final = relevance - λ × redundancy. λ=0.4 forced exploration of adjacent semantic regions. "the war" at λ=0.4 traced history across civilizations:
guerrilla forces fighting retreat battle army troops captured
turkish soldiers surrendered italians germans retreated
outnumbered defenders withdrew exhausted armies marched siege
ottoman turks byzantine empire conquered egypt syria lebanon
palestine israel occupation vietnam cambodia independence
From guerrilla warfare → Ottoman Empire → Byzantine Empire → Egypt/Syria → Israel/Palestine → Vietnam/Cambodia → independence. A walk through centuries of military history, forced by diversity to keep exploring.
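The v4.1 penalty is a few lines on top of the existing scorer. A minimal sketch of the `final = relevance - λ × redundancy` formula (function name is illustrative; `emb` rows are assumed unit-normalised):

```python
import numpy as np

def mmr_score(relevance, cand_ids, used_ids, emb, lam=0.4):
    """Penalise each candidate by its max cosine similarity to used words."""
    if not used_ids:
        return relevance
    redund = (emb[cand_ids] @ emb[list(used_ids)].T).max(axis=1)
    return relevance - lam * redund
```

Unlike the hard no-repeat block, this lets near-duplicates through when they are relevant enough, while steadily pushing generation toward unexplored regions of the embedding space.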
The Scorecard
| Capability | Toy (213 words) | 100k tokens | 80M tokens |
|---|---|---|---|
| Similarity separation | 0.93 | 0.21 | 0.87 |
| Analogies @5 | 100% | 33% | 70% |
| NTP (token accuracy) | 41% | 2.7% | 0% |
| Generation | Semantic chains | — | Topic walks, no sentences |
PPMI + SVD solves meaning. Bigrams solve local transitions. Together they generate coherent topic walks. But they cannot generate grammatical sentences.
Why It Can't Generate Sentences
After 15 versions, the diagnosis is clear. Every fix solved one problem and created another:
| What we tried | What it fixed | What it broke |
|---|---|---|
| SVD semantic only | Meaning | Loops, no grammar |
| + Bigram grammar | Basic transitions | Generic glue chains |
| + No repeat | No loops | Exhausts topic words |
| + Dual pool | Grammar words appear | Grammar dominates |
| + Dual memory | Topic stays alive | Grammar picks random glue |
| + Alternation | Content+glue pattern | No coordination |
| + Trigram | Real phrases | Sparsity |
| + Fuzzy n-gram | Generalisation | Too loose, noise |
| + MMR diversity | Explores new regions | Still no sentences |
The missing piece is always the same: position-dependent context tracking.
After "the king ruled the", our system needs to know "we need a noun here — specifically an object of 'ruled'." But:
- Semantic scoring only knows "what word is RELATED to the recent context" — it doesn't know about syntactic roles.
- Grammar scoring only knows "what word commonly FOLLOWS the last word" — P(next | "the") doesn't know we're in the object position of "ruled".
A transformer solves this with attention over the full sequence. At position 5, it can look back at position 2 ("ruled") and learn that "ruled the ___" needs a noun object. Our system can only look at the last 1-4 words, and it can't learn positional patterns because there's no learning.
Static embeddings give every word one fixed vector regardless of context. "King" after "the" (needs a verb next) has the same vector as "king" after "became" (needs a determiner). Dynamic, context-dependent representations require attention — and attention requires training.
What This Proves
Levy & Goldberg (2014) proved that Word2Vec implicitly factorises a PMI matrix. Zhao et al. (2025) proved that next-token prediction training converges to SVD factors of co-occurrence structure.
Our experiments confirm both from the other direction: we built the SVD factorisation explicitly and got embeddings that rival Word2Vec quality. But we also proved WHERE that equivalence breaks down — at generation.
Transformers aren't doing something fundamentally different from SVD for meaning. But they add the crucial missing piece: positional, context-dependent reweighting of those factors at every step.
The map of meaning can be built with pure math. The navigator through that map requires learning.
The Stack
- Language: Python
- Core: NumPy, SciPy (sparse SVD)
- GPU acceleration: CuPy (cupyx.scatter_add for co-occurrence matrix building on CUDA)
- Data: WikiText-103 via HuggingFace datasets (80M tokens)
- Hardware: Kaggle T4 GPU
- Training: Zero. None. Not a single gradient step.
What I'd Build Next
This isn't a dead end — it's a foundation. The experiments point to several directions:
Retrieval instead of generation. The embeddings are excellent for finding relevant content. Instead of generating word-by-word, use the SVD vectors to RETRIEVE real sentences from the corpus that match the semantic context. That's what vector databases are actually for.
Hybrid systems. Use the pure-math embeddings as a pre-computed semantic layer, then a small trained model (even a simple RNN) just for the sequential state tracking. The heavy lifting of meaning is already done.
Educational tool. This entire pipeline is transparent — every number is interpretable. No black boxes. Perfect for teaching how language models work from first principles.
Try It Yourself
All you need is NumPy, SciPy, and WikiText-103. Build a co-occurrence matrix, apply PPMI, run SVD, add a bigram grammar matrix. Two matrices. Two stages. No training. Just math.
And now you know exactly where the math stops and the learning begins.
This research was conducted as an independent exploration. Thanks to Levy & Goldberg (2014) for the theoretical foundation and to Zhao et al. (2025) for extending the connection to next-token prediction.