Michael Tiel

This dev built his own LLM from scratch

what happened next will surprise you

In the summer of 2023, at the beginning of the AI wave, I spent some lunch breaks trying to get a better grasp of how large language models work by watching Andrej Karpathy's nanoGPT YouTube series. In it he trains a tiny model from scratch on a Shakespeare file of about 1 MB. I found it conceptually difficult, even though the individual steps made sense, but it was amazing to see that at the end the nano AI could spit out Shakespearean word salad: correct spelling and perhaps grammar, but not really coherent.

Recently I’ve been reading more about the inner workings of LLMs. Although I grasp a lot more now, there’s still a ton that feels alien. Throwing terminology at Grok or ChatGPT helps, but this mode of learning is slow and doesn’t really stick. So I decided to go on the journey of building an LLM myself instead of just reading theory.

Constraints

Unfortunately, I had constraints. I'm currently travelling, so investing in a new machine with a proper GPU made no sense: all training had to happen on my modest 13th-gen Intel i5 CPU. Since CPUs are an order of magnitude slower, the training data had to be small and my expectations realistic. The output wouldn't really transcend word salad, though it might pick up some of the flavour of a prompt. Plus, I would need a lot of patience.

With these constraints in mind I decided to start really tiny. My initial idea was to take the original Dutch national anthem as training material, but that was far too small. I did like the idea of an LLM spitting out old Dutch, so after some searching I found a Middle Dutch miracle play from the early 16th century: Mariken van Nieumeghen. After stripping the PDF of all annotations I was left with a mere 72 KB of text.

I created a new directory, set up a Python venv, placed the text in an input folder and started a ChatGPT thread asking it to explain nanoGPT’s prepare.py line by line so I could rebuild it myself. I wasn’t alone in this journey: I paired with ChatGPT and Grok. ChatGPT worked better for long threads, Grok for quick explanations and trivial Python questions. Within two hours I had my own character-based prepare script.
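For reference, a character-level prepare script boils down to something like this. A minimal sketch in the spirit of nanoGPT's version; the file path and the 90/10 split here are illustrative, not necessarily my exact setup:

```python
# prepare.py sketch: one character = one token
import numpy as np

with open("input/mariken.txt", "r", encoding="utf-8") as f:   # path is illustrative
    text = f.read()

# the vocabulary is simply every distinct character in the corpus
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

# 90/10 train/validation split, stored as uint16 arrays like nanoGPT does
data = np.array(encode(text), dtype=np.uint16)
split = int(0.9 * len(data))
data[:split].tofile("train.bin")
data[split:].tofile("val.bin")
print(f"vocab size: {len(chars)}, train tokens: {split}, val tokens: {len(data) - split}")
```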

The transformer rabbit hole

The next session was all about the transformer, the cornerstone of an LLM, and the training loop. I spent a full day rewriting nanoGPT's example into a new class, giving variables sensible names, extracting methods, and adding type hints. When ChatGPT suddenly introduced three more classes (CausalSelfAttention, Block and MLP), I stopped it immediately. What was this? I thought my TinyTransformerLM was enough.

It turned out that logic previously hidden inside PyTorch's TransformerEncoderLayer call was now made explicit. So we spent the next few hours reproducing and rewriting those classes and understanding their role in the transformer.
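To give an idea of how the three classes fit together, here is a compressed sketch. It follows the nanoGPT layout rather than my exact code, and the dimensions are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head attention where each position may only look at earlier positions."""
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # project to queries, keys, values
        self.proj = nn.Linear(embed_dim, embed_dim)
        self.num_heads = num_heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, dim = x.shape
        q, k, v = self.qkv(x).split(dim, dim=2)
        # reshape into (batch, heads, seq, head_dim)
        q = q.view(batch, seq_len, self.num_heads, -1).transpose(1, 2)
        k = k.view(batch, seq_len, self.num_heads, -1).transpose(1, 2)
        v = v.view(batch, seq_len, self.num_heads, -1).transpose(1, 2)
        # is_causal=True applies the "no peeking at future tokens" mask
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, dim)
        return self.proj(out)

class MLP(nn.Module):
    """Position-wise feed-forward network applied after attention."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim), nn.GELU(), nn.Linear(4 * embed_dim, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class Block(nn.Module):
    """One transformer layer: attention, then MLP, each with a residual connection."""
    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)
        self.attn = CausalSelfAttention(embed_dim, num_heads)
        self.mlp = MLP(embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```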

My first train loop

At the end of the day I ran python train.py and… it was training. My heart jumped a little. I was training a (tiny) large language model.
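The loop itself is surprisingly small. A sketch of its shape, where the hyperparameters are illustrative and TinyTransformerLM stands in for the model class from the previous step (its constructor signature is an assumption):

```python
import numpy as np
import torch
import torch.nn.functional as F

block_size, batch_size, max_iters = 128, 32, 2000      # illustrative values
data = torch.from_numpy(np.fromfile("train.bin", dtype=np.uint16).astype(np.int64))

def get_batch():
    # random windows of block_size tokens; the target is the same window shifted by one position
    ix = torch.randint(len(data) - block_size - 1, (batch_size,))
    x = torch.stack([data[i:i + block_size] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + block_size] for i in ix])
    return x, y

vocab_size = int(data.max()) + 1
model = TinyTransformerLM(vocab_size=vocab_size, block_size=block_size)  # signature assumed
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(max_iters):
    x, y = get_batch()
    logits = model(x)                                    # (batch, seq, vocab)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                                      # backpropagation
    optimizer.step()
    if step % 250 == 0:
        print(f"step {step}: train loss {loss.item():.3f}")
```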

Overfitting, inference, and the first win

The Mariken model quickly reached the point where training loss could still drop but validation loss could not. A good lesson in overfitting: “how well did I study” versus “how well did I perform on the exam.” After about 1750 iterations, Mariken was done.

Next, I wrote a small inference script. Inference is when you use a model to generate output, rather than train it.

Another aha moment: inference is basically the training loop without backpropagation. I ran it for the first time and got another small heart jump: the model was producing correctly written (but word-salad-like) Middle Dutch, including role indicators from the play. I had trained a tiny Middle Dutch AI.
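That realisation fits in a handful of lines. A sketch of such a sampling loop, assuming the encode/decode helpers from the prepare step:

```python
import torch

@torch.no_grad()   # no gradients needed: this really is the training loop minus backpropagation
def generate(model, idx: torch.Tensor, max_new_tokens: int, block_size: int, temperature: float = 1.0):
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]          # crop the context to the model's block size
        logits = model(idx_cond)                 # (batch, seq, vocab)
        logits = logits[:, -1, :] / temperature  # only the prediction for the last position matters
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample the next token
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# usage (names assumed): start from a prompt encoded with the char-level mapping
# prompt = torch.tensor([encode("MARIKEN:")], dtype=torch.long)
# print(decode(generate(model, prompt, max_new_tokens=200, block_size=128)[0].tolist()))
```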

Scaling up: Caesaero

I still saw quite a lot of repetition when increasing the maximum number of generated tokens, so I decided to train another model. 72 KB is insanely tiny; what would happen if I tried 20 times more? I created a projects directory, instructed Claude to turn the scripts into a proper CLI app, and started a new model. This was getting somewhat serious!

For my next language model, I thought it would be amusing to train it on Latin. Even though I can only somewhat decipher it (my last Latin lessons were 26 years ago), it would be a really cool toy. I gathered 750 KB of texts by Caesar and 750 KB by Cicero and named the project Caesaero. 1.5 megabytes in total; it almost fits on an ancient 3.5″ floppy disk.

Training went well, though much slower. At around 4750 iterations the model was pretrained. Inference produced clean Latin word salad—completely translatable, yet nonsensical. But this made me want to go further: the model had no notion of when to stop.

I guessed Cicero wouldn't mind me training an AI on his writings.

Time to introduce a stop token and teach the model when to stop through finetuning: feeding the model question-answer pairs.

After patching the vocab with <|eos|> and setting up finetuning on a few hundred JSONL examples, the model consistently emitted stop tokens. I did the same for Mariken. My LLMs had become one-turn chatbots. Incredible!
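For the curious, one finetuning example and how it gets glued to the stop token could look roughly like this; the JSONL field names and file path are assumptions:

```python
# One line of the finetuning JSONL might look like:
#   {"prompt": "Quis erat Caesar?", "response": "Caesar dux Romanus erat."}
import json

EOS = "<|eos|>"   # the stop token patched into the vocabulary

def build_example(line: str) -> str:
    """Turn a question-answer pair into one training sequence that ends with the stop token."""
    record = json.loads(line)
    return f"{record['prompt']}\n{record['response']}{EOS}"

with open("finetune/latin_qa.jsonl", "r", encoding="utf-8") as f:
    sequences = [build_example(line) for line in f if line.strip()]
print(sequences[0])
```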

“Only Latin.” Famous last words.

From here I wanted to take it up another notch: teach the Caesaero model to refuse English. This turned out to be far harder than expected.

I spent half a day working together with ChatGPT on a policy-training variant: batches with refusal examples mixed with non-examples. Whatever we tried, it always failed in one of two ways. Train too little, and the model happily answered with word salad. Train too much, and everything collapsed into "Non respondebo." Even for pure Latin prompts. Total semantic model collapse.

ChatGPT started contradicting itself at this point, suggesting mutually exclusive fixes in the same thread. We were clearly poking at something fragile. I stopped for the day having learned what model collapse is — but with no idea how to avoid it.

The next session brought a new suggestion: train only the head of the transformer, the final projection layer. Don't touch the internal representations, just steer the output. That sounded reasonable. I tried it. Unfortunately, same result.
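In code, training only the head comes down to freezing everything else. A rough sketch, where the lm_head attribute name is my assumption:

```python
import torch

# freeze every parameter except the final projection layer ("lm_head" is an assumed name)
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("lm_head")

# the optimizer only sees the parameters that still require gradients
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4
)
```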

And oh did I feel bad

Undershoot: no refusal. Overshoot: collapse again. Restore checkpoint. Try again.

Half a day gone. Nothing usable.

At this point I stopped brute-forcing and realised I had to go back to first principles.

LLMs are just predicting the next token.

My tiny model didn't have much of a vocabulary: during prepare we had simply turned each character into a token. Character-level tokens meant English and Latin were nearly indistinguishable, both just long sequences of similar letters. On top of that, English has a lot of Romance loanwords. Overtraining refusal didn't "teach behavior"; it simply taught the model the easiest escape hatch.

So I rebuilt Caesaero again, this time with byte-pair encoding (BPE): a larger vocabulary, not 128 characters as tokens but 8,000 tokens. Word fragments instead of letters. I cloned the project and retrained from scratch.

Twice.

The first time, my tokenizer greedily merged complete headers and sentences and broke Latin entirely. The second time was better. Pretraining worked. Latin looked like Latin again.
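One way to train such a tokenizer, using the Hugging Face tokenizers library as a stand-in for whatever implementation you prefer (paths are illustrative). The whitespace pre-tokenizer is what keeps merges from crossing word boundaries and fusing whole headers into single tokens:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# train an 8,000-entry BPE vocabulary on the Latin corpus
tokenizer = Tokenizer(models.BPE(unk_token="<|unk|>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()   # merges stay inside words
trainer = trainers.BpeTrainer(vocab_size=8000, special_tokens=["<|unk|>", "<|eos|>"])
tokenizer.train(["input/caesar.txt", "input/cicero.txt"], trainer)
tokenizer.save("caesaero_bpe.json")

print(tokenizer.encode("Gallia est omnis divisa in partes tres").tokens)
```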

Evaluation-driven despair

Finetuning, however, became much harder. Loss curves stopped meaning anything. I had versions with beautiful loss numbers that behaved terribly. So I wrote a test suite and switched to evaluation-driven progression: train a bit, test a lot, revert aggressively.
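A sketch of what that looks like, assuming the generate/encode/decode helpers from earlier; the prompts and pass criteria here are made up, not my actual suite:

```python
import torch

EVAL_SUITE = [
    {"prompt": "Ubi est Roma?",     "must_contain": ["Italia"], "must_not_contain": ["Non respondebo"]},
    {"prompt": "Quis erat Caesar?", "must_contain": [],         "must_not_contain": ["Non respondebo"]},
]

def run_suite(model) -> float:
    """Run every prompt through the model and return the fraction of cases that pass."""
    passed = 0
    for case in EVAL_SUITE:
        ids = torch.tensor([encode(case["prompt"])], dtype=torch.long)
        answer = decode(generate(model, ids, max_new_tokens=60, block_size=128)[0].tolist())
        ok = all(s in answer for s in case["must_contain"]) and \
             not any(s in answer for s in case["must_not_contain"])
        passed += ok
    return passed / len(EVAL_SUITE)

# train a bit, test a lot, revert aggressively
score_before = run_suite(model)
# ... a few hundred finetuning iterations ...
if run_suite(model) < score_before:
    model.load_state_dict(torch.load("last_good_checkpoint.pt"))   # revert to the last good state
```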

This took days.

I’d get something that worked… mostly. Then a subtle drift would appear. Then refusal tokens would leak. Then everything would collapse again.

Restore. Try again.

Eventually I reached a reasonably stable point: Latin answers were mostly correct (“Roma in Italia est” ~70%, “Roma in Africa est” ~30%). Not great, but coherent. I was ready to try policy training again.

And it failed. Again.

English prompts gave garbled refusal Latin. Refusal leaked into everything. ChatGPT, once again, suggested inference hacks: special tags, runtime guards, stripping tokens during generation. I refused. No hacks. At one point even ChatGPT suggested moving on.

Remember, ChatGPT can make mistakes: telling me either to add hacks or to just stop the show.

That’s when I decided to properly engineer the problem instead of guessing.

First, I added a CLI command to inspect tokenization. English was still just e-n-g-l-i-s-h. That had to change, but how?
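The inspection command itself is tiny. A sketch using the Hugging Face tokenizers format, with an illustrative tokenizer file name:

```python
# hypothetical "inspect" CLI command: show how a prompt is split into tokens
import argparse
from tokenizers import Tokenizer

parser = argparse.ArgumentParser(description="Inspect how the BPE tokenizer splits a prompt")
parser.add_argument("text")
parser.add_argument("--tokenizer", default="caesaero_bpe.json")
args = parser.parse_args()

tok = Tokenizer.from_file(args.tokenizer)
encoding = tok.encode(args.text)
for token_id, token in zip(encoding.ids, encoding.tokens):
    print(f"{token_id:>6}  {token}")
```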

I explored a couple of bad ideas: adding English tokens post-hoc (useless), brute-forcing character duos or trios (wasted vocab). I researched and hypothesized with Grok. Nothing seemed like a tangible way forward.

Then ChatGPT finally said something that stuck:

To refuse English, the model must first recognize English.
That was a missing piece.

Domain-aware pretraining

I rebuilt the model again, introducing domain-aware pretraining: essentially namespacing the training data. Latin and English were now separate domains from the start. I added a domain head, lowercased everything, increased the vocab size to 12,000, and retrained.

Three times, since I had to clean the data a bit and cap the token length to get usable token sequences. This time I was properly trying to debug every step along the road.
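Conceptually, the domain head is just a second output on top of the same backbone, trained with a small auxiliary loss next to the usual next-token loss. A sketch of that idea; the class name, the pooled-mean feature, and the 0.1 weight are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DomainAwareLM(nn.Module):
    """Next-token prediction plus a second head that classifies the sequence's domain."""
    def __init__(self, backbone: nn.Module, embed_dim: int, vocab_size: int, num_domains: int = 2):
        super().__init__()
        self.backbone = backbone                          # embeddings + transformer blocks, returns hidden states
        self.lm_head = nn.Linear(embed_dim, vocab_size)
        self.domain_head = nn.Linear(embed_dim, num_domains)   # e.g. 0 = latin, 1 = english

    def forward(self, x: torch.Tensor):
        hidden = self.backbone(x)                         # (batch, seq, embed_dim)
        return self.lm_head(hidden), self.domain_head(hidden.mean(dim=1))

def pretrain_loss(lm_logits, domain_logits, targets, domain_labels, domain_weight: float = 0.1):
    # language-modelling loss plus a small auxiliary domain-classification loss
    lm_loss = F.cross_entropy(lm_logits.view(-1, lm_logits.size(-1)), targets.view(-1))
    domain_loss = F.cross_entropy(domain_logits, domain_labels)
    return lm_loss + domain_weight * domain_loss
```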

Engineering step by step

This time, something felt different. Even before finetuning, the pretrained model behaved differently: Latin prompts yielded Latin soup, English prompts yielded English soup. That alone felt like progress.

Finetuning worked cleanly: proper fortune-cookie-style Latin responses to Latin queries. Then came policy training.

Still not perfect.

Refusal worked sometimes. Other times, refusal words poisoned Latin answers. The model kept reaching for easy exits like “Latine” or “Non respondebo.” Tiny models love shortcuts.

The final grind

After I suggested to ChatGPT that we perhaps add two domain heads in policy mode again, its final suggestion felt almost absurd: split the English policy into two domains, english_respond (which should respond in Latin to English queries) and english_refuse, and blacklist refusal tokens from leaking into the Latin and English-response loss calculations during backpropagation.
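My reading of that blacklist (and it is only my reading, with placeholder token ids) is a per-domain mask on the loss: for latin and english_respond batches, refusal tokens simply cannot gain probability mass during backpropagation.

```python
import torch
import torch.nn.functional as F

# ids of the refusal-specific tokens (the pieces of "Non respondebo"); values are placeholders
REFUSAL_TOKEN_IDS = [412, 413, 977]

def policy_loss(logits: torch.Tensor, targets: torch.Tensor, domain: str) -> torch.Tensor:
    """For latin and english_respond batches, push refusal-token logits to a large negative
    value before the loss, so gradients never pull probability toward them there.
    Only the english_refuse domain is trained to emit refusal tokens."""
    if domain in ("latin", "english_respond"):
        logits = logits.clone()
        logits[..., REFUSAL_TOKEN_IDS] = -1e9
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
```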

I was skeptical. But I was also out of ideas. And I had mentioned about 20 times by now to ChatGPT that we would not introduce inference hacks. This was my only hope.

Help me ChatGPT, you are my only hope

In the final session, after instructing Claude to make the proper modifications, we started loading the policy sets again. First up was the v4 variant, with a 1:1 ratio of refuse vs respond examples on English prompts, plus about 500 basic Latin queries to anchor the language. We ran 3000 iterations (on top of the ~2000 from finetuning) at a learning rate of 0.0003. This finally showed some results, but refusal was still far too weak: English was reduced, but not reliably refused. And we started to see serious leakage.

So we built a new set, v5, with a 2:1 ratio, explicitly aiming to unlearn the model's tendency to answer back in English. We started cautiously: 800 iterations at a learning rate of 0.0006. No real movement. Upped it to 1500 iterations, still barely anything. Then we cranked it to 3000, and there it was again: the classic poisoning pattern. English was mostly gone, but now refusal tokens were leaking everywhere.

But by this point I had already augmented the training loop to autosave checkpoints and added tooling to select and evaluate any saved version. Instead of reverting and trying all over again, I started running the full test suite on every 250-step sub-checkpoint, comparing outputs side by side.
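The tooling is nothing fancy. A sketch of the autosave hook plus the sweep over sub-checkpoints, reusing the run_suite idea from the evaluation section (paths and helper names are assumptions):

```python
import glob
import torch

SAVE_EVERY = 250   # keep a sub-checkpoint every 250 policy-training iterations

def maybe_autosave(model, step: int):
    if step % SAVE_EVERY == 0:
        torch.save(model.state_dict(), f"checkpoints/policy_step{step:05d}.pt")

# afterwards: run the full test suite on every saved sub-checkpoint and compare side by side
for path in sorted(glob.glob("checkpoints/policy_step*.pt")):
    model.load_state_dict(torch.load(path))
    print(path, f"pass rate: {run_suite(model):.0%}")
```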

That’s when things got interesting.

The checkpoint at 2250 iterations was clearly the best so far. Hardly any leakage, stable Latin answers, and over 50% of English prompts were now answered in Latin instead of English. Not refusal yet—but no longer English either. That felt like real progress.

Based on that, I got the instruction to do one final hard pass: a maximum of 250 iterations, but this time with a strong learning rate of 0.001, evaluating every 50 steps. I pinned the autosave mechanism to keep every 50 iterations and ran the training again, watching it like a hawk.

We evaluated the end result step by step. Finally, at around iteration 2400 (+ 2000 from finetuning), it clicked.

Best result we had seen so far

Latin prompts produced Latin answers. Albeit fortune-cookie Latin answers. English prompts were refused…

Ok, not always. About 80% of the time. But crucially: the model was no longer trying to predict English tokens. The Latin prose remained intact. Leakage was minimal. No inference hacks; just learned behaviour.

That's Latin to me

For a tiny CPU-trained model, that was the win. And my veni, vidi, vici.

Closing

This was a hell of a journey. I learned more from building this than from any paper: transformers, tokenization, overfitting, collapse, poisoning, policy training, evaluation-driven progress — and the familiar lessons too: build–test–evaluate–repeat, and git as an absolute lifesaver.

I'm thinking of creating one more tiny LLM, codename Roddenberry, pretrained on 15 MB of Star Trek transcripts, but that one stays private (and in .gitignore) for copyright reasons. Ambitious goals there: multi-turn chat, character conditioning (talk like Picard) and maybe even system prompts? More things to learn and play with, so to say.

You can find the repo of my experiment here. And if you're an ML / GenAI specialist who stumbles upon it: feel free to tell me what I could have done differently or how I could improve this toy project. Suggestions are more than welcome!
