Why? To create Skynet, of course.
Well, also because I wanted to understand, in detail, what is going on right now in this field that fascinates me so much. And the best way to understand something is to try to explain it to someone else.
Besides that, I want to move into deep learning professionally, and publishing my interesting projects on the internet seems like the fastest way to get noticed.
Personally, I enjoyed the process a lot, and I invite Habr readers to dive into this small journey with me.
Links to the dataset, weights, and code are attached at the end of the article. The dataset and weights are on Hugging Face; the codebase is on Codeberg, a GitHub-like platform with a similar workflow.
Let's go.
Important Note
The author is an experienced programmer, but everything below was vibe-coded with Claude Code from start to finish. That said, the author honestly tried to understand everything he wrote about. In any case, use the provided source code at your own risk. I warned you.
It is also worth keeping in mind that this is the author's first technical article ever. I made significant effort to make the text readable, but there may still be rough edges.
Most of the article was written by hand, but because of the amount of material I had to use Opus as an editor, mostly for the notes about model training. I tried to check and correct the information, but again, keep that in mind.
Background
I first encountered language models in early 2022, when the web studio where I worked as a Python developer was doing contract work for an American company called Inita. They were building an AI startup for small businesses.
I got access to the OpenAI API and GPT-3, and I was fascinated by this technology at first sight. There is something almost magical about ordinary lines of code being able to enter into dialogue with you and learn something that looks like thinking.
Unfortunately, because of well-known events, our clients eventually lost the ability to pay us. I had effectively been hired for that specific project, and they did not find other tasks for me.
After half a year of fruitless job searching, I managed to get a job at a small instrument-making factory as a microcontroller programmer. In practice, I wrote both firmware for MCUs and graphical interfaces for working with them.
All those years I tried to stay up to date with language models and actively tested different chatbots. I used a paid ChatGPT subscription for a long time too, but that was more user-level expertise than developer-level expertise, which did not satisfy me.
I worked like that for almost three years, until I was offered a position at a large corporation with a noticeable salary increase. Suddenly it turned out that working as a developer in Russian small business has its own special flavor: people constantly try to squeeze everything out of you. In a large company, the rules were different.
I suddenly had a large amount of mental resources available, and I used them to fulfill a long-standing dream I had cherished since 2022: to figure out deep learning.
First Steps
I started small. With Claude Code helping me, I tried small deep networks on tasks that interested me. The result was, for example, a self-learning 2D snake and an Anymal quadruped in MuJoCo learning to walk. I will not go deep into the details; I will just show a couple of demos.
A convolutional network learns to play snake.
A multilayer perceptron in a quadruped body learns the world in MuJoCo.
But all of that was preparation for the main boss: language models. I started by reading Sebastian Raschka's book "Build a Large Language Model (From Scratch)", then tried to turn GPT-2-small into an instruction-following bot with a LoRA adapter trained through SFT.
Let us unpack those two terms.
A LoRA adapter works by adding small trainable low-rank matrices alongside the frozen weights of the model's layers. This lets us fine-tune the model on the situations we care about without updating all of its weights.
SFT, or supervised fine-tuning, trains the model on pairs of user request and assistant answer. The training principle is the same as for raw text, but the important detail is that loss is computed only on the assistant continuation, not on the user's question.
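To make the LoRA idea concrete, here is a minimal sketch in plain Python, with lists instead of tensors. The sizes, the rank, and the `alpha` scale are illustrative assumptions, not the settings used for the GPT-2 experiment. The frozen matrix `W` is left untouched; only the two small matrices `A` and `B` would be trained.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """y = W x + (alpha / rank) * B (A x). W is frozen; only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

random.seed(0)
d_in, d_out, rank, alpha = 16, 16, 2, 4.0  # illustrative sizes only

W = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(d_out)]  # frozen weights
A = [[random.gauss(0, 0.02) for _ in range(d_in)] for _ in range(rank)]   # trainable, random init
B = [[0.0] * rank for _ in range(d_out)]                                  # trainable, zero init

x = [1.0] * d_in
y_base = matvec(W, x)
y_lora = lora_forward(W, A, B, x, alpha, rank)

full_params = d_in * d_out            # what full fine-tuning would update
lora_params = rank * (d_in + d_out)   # what the adapter updates
```

Because `B` starts at zero, the adapted layer initially behaves exactly like the original one, and training only has to move the small matrices.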
Overall, GPT-2 did turn into a chatbot, although it was obvious that the model lacked the "brains" to answer correctly. One interesting example was translation of an English phrase into French.
Prompt:
Translate "Good morning" to French.
Base GPT-2-small
Translation:
Translation:
Translation:
...and so on another hundred times
After SFT
Je suis arrive.
The model answered incorrectly, but the interesting part is that this phrase literally means "I have arrived"; in some contexts it can play a greeting-like role. One can conclude that the model understood what kind of answer was needed, but lacked the factual knowledge.
In fact, mistakes made by models, especially small LLMs, are often more interesting to analyze than correct answers, because their mistakes often resemble the way humans make mistakes - humans from whom they learned to think.
Teaching Arrays of Numbers to Think
I wanted to feel the magic of ordinary arrays of numbers starting, after thousands of iterations, to produce answers that require thinking in humans. So I decided to create and train a model myself.
Simplifying a lot, creating an LLM "from scratch" can be divided into four stages:
- Building the dataset.
- Writing and training the tokenizer, the model's vocabulary, then tokenizing the dataset.
- Configuring the model.
- Training the model on the chosen dataset.
Let us go through these stages in more detail.
- Dataset collection.
This means parsing the data we want to feed into the model, or downloading a ready-made dataset. When building a dataset from scratch, normalization is extremely important: cleaning data from irrelevant garbage. This is especially important when training LLMs, because the quality of the source data determines whether the model will output what you need.
- Writing and training the tokenizer, then tokenizing the dataset.
It is important to understand that the model learns to continue human text more easily when we first split text into "pieces" instead of forcing it to predict text character by character. The model trains faster and produces better results. Later I will show the difference between character-level training and tokenized training. Also, "training a tokenizer" does not mean training a deep network; it uses a classical algorithm.
- Model configuration.
The most important number here is the final parameter count, because it directly correlates with the maximum intellectual capability the model can have for generating meaningful text. This will be shown visually later, when I compare a 10-million-parameter model and a 50-million-parameter model on the same dataset.
- Training on the selected dataset.
Good practice in deep learning is to split the dataset into train and validation samples. The training sample directly affects the model weights; this is what the model learns from. The validation sample is needed to monitor training.
Deep networks optimize loss, not our wishes. This can lead to a situation where instead of learning to generalize, the network starts memorizing the training data. This is called overfitting.
The validation sample is what controls this. The network does not train on it; it only produces results there. As a rule, validation is used to save the best checkpoints and to stop training early if validation loss has stopped decreasing or begins to grow while training loss keeps falling.
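The checkpoint-and-stop logic described above can be sketched as follows. The `patience` parameter and the loss values are hypothetical and only illustrate the rule "keep the best validation checkpoint, stop when validation stops improving".

```python
def best_checkpoint(val_history, patience):
    """Scan (step, val_loss) pairs; return (best_step, best_loss, stop_step).

    stop_step is the step at which training would be stopped early,
    or None if validation kept improving often enough.
    """
    best_step, best_loss = None, float("inf")
    bad_evals = 0
    for step, loss in val_history:
        if loss < best_loss:
            best_step, best_loss = step, loss  # this is where a checkpoint would be saved
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, best_loss, step
    return best_step, best_loss, None

# Hypothetical validation curve: improves, then starts to grow.
history = [(0, 5.4), (500, 2.1), (1000, 1.6), (1500, 1.5),
           (2000, 1.52), (2500, 1.55), (3000, 1.6)]
result = best_checkpoint(history, patience=2)
```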
Does a Language Model Dream of The Cherry Orchard?
The lib.ru Parser
The hardest part of working with lib.ru was not fetching pages, but the editorial apparatus of academic editions. Chekhov's complete collected works are hosted there, and alongside the stories themselves there are variant readings, manuscript descriptions, textual comments, and biographical notes.
Raw parsing produced about 24 MB of text, but half of it was apparatus. I wrote a series of regex cleaners that iteratively cut out:
- Section headers such as "Notes", "Writing history", and "List of abbreviations".
- Letter headers such as "Chekhov to A. S. Suvorin" or "To Al. P. Chekhov".
- Textological notes such as "The following was begun:", "Inserted instead of:", or "A note in the margin:".
- Archive codes such as TsGALI, GPB, and IRLI.
- Bibliographic footnotes and references to volumes.
After cleaning, 16 MB of pure Chekhov prose remained: short stories, novellas, plays, notebooks. Letters entered the corpus fully; the apparatus did not.
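A line-based cleaner of this kind might look roughly like the sketch below. The patterns are hypothetical English stand-ins for the categories listed above; the real cleaners worked on Russian text and had many more rules.

```python
import re

# Hypothetical patterns, one per category of editorial apparatus.
APPARATUS_PATTERNS = [
    re.compile(r"^(Notes|Writing history|List of abbreviations)\s*$"),   # section headers
    re.compile(r"^(Chekhov to|To) [A-Z][\w. ]+$"),                        # letter headers
    re.compile(r"^(The following was begun:|Inserted instead of:|A note in the margin:)"),
    re.compile(r"\b(TsGALI|GPB|IRLI)\b"),                                 # archive codes
]

def clean(text):
    """Drop every line that matches at least one apparatus pattern."""
    kept = []
    for line in text.splitlines():
        if any(p.search(line) for p in APPARATUS_PATTERNS):
            continue
        kept.append(line)
    return "\n".join(kept)

sample = "\n".join([
    "Notes",
    "To Al. P. Chekhov",
    "The old man sighed and looked at the river.",
    "Inserted instead of: looked away",
    "Autograph: TsGALI, f. 549",
    "It was getting dark.",
])
cleaned = clean(sample)
```

Running the cleaners iteratively, as described above, amounts to adding a pattern, inspecting what is left, and repeating.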
Data Preparation
Character-level means literally that every character is a token. The model alphabet contained 201 unique characters: Cyrillic in both cases, Latin letters because Chekhov wrote in French and German, punctuation, dashes, quotation marks, digits, and typographic symbols from the editions.
# prepare.py - standard nanoGPT char-level preprocessing
chars = sorted(list(set(text)))
vocab_size = len(chars) # 201
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}
90% of the corpus went to train, 10% to validation. No special tokens, no EOS: the model simply learns a continuous stream of characters.
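For completeness, here is a small self-contained sketch of what the mapping is used for: encoding text into integer ids, decoding it back, and cutting the stream 90/10. The one-sentence `text` is a stand-in for the real corpus.

```python
text = "Anton Petrovich slowly looked at me."  # stand-in for the 16 MB corpus

chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}   # character -> id
itos = {i: c for i, c in enumerate(chars)}   # id -> character

def encode(s):
    """Turn a string into a list of character ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of character ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
split = int(0.9 * len(ids))                  # first 90% for train, the rest for validation
train_ids, val_ids = ids[:split], ids[split:]
```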
Architecture, About 10.7M Parameters
n_layer = 6
n_head = 6
n_embd = 384
block_size = 256 # 256 characters, roughly 40 Russian words in context
vocab_size = 201
dropout = 0.2 # small model, small corpus
bias = False
| Part | Parameters |
|---|---|
| Token embedding, wte, tied | 201 x 384 = 77K |
| Position embedding, wpe | 256 x 384 = 98K |
| Per-layer attention | 4 x 384^2 ~= 590K |
| Per-layer MLP | 8 x 384^2 ~= 1.18M |
| x 6 layers | ~10.6M |
| Total | ~10.76M |
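The table can be reproduced with a few lines of arithmetic. This is a sketch for a GPT-2-style block with tied embeddings and no biases; the function name is made up for illustration, and the result differs from the table only by rounding.

```python
def gpt_param_count(n_layer, n_embd, vocab_size, block_size):
    """Approximate parameter count: GPT-2-style model, tied embeddings, no biases."""
    wte = vocab_size * n_embd        # token embedding, shared with the output head
    wpe = block_size * n_embd        # learned position embedding
    attn = 4 * n_embd * n_embd       # c_attn (3 matrices) + c_proj (1 matrix)
    mlp = 8 * n_embd * n_embd        # c_fc (n_embd x 4n_embd) + c_proj (4n_embd x n_embd)
    norms = 2 * n_embd               # two LayerNorm weight vectors per block
    final_norm = n_embd
    return wte + wpe + n_layer * (attn + mlp + norms) + final_norm

chekhov_params = gpt_param_count(n_layer=6, n_embd=384, vocab_size=201, block_size=256)
```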
By today's standards this is a tiny model, comparable in scale to Andrej Karpathy's historical char-rnn models from his 2015 article "The Unreasonable Effectiveness of Recurrent Neural Networks". Back then, LSTMs of a similar scale were trained on Shakespeare and Tolstoy. This experiment is directly in that tradition.
Training Hyperparameters
learning_rate = 1e-3
max_iters = 5000
warmup_iters = 100
lr_decay_iters = 5000
min_lr = 1e-4
beta2 = 0.99
batch_size = 64
dtype = 'bfloat16'
Training
The starting loss was 5.41, close to ln(201) = 5.30. The model honestly began with an almost uniform distribution over the vocabulary. Iterations took 180-250 ms on an RTX 3050 Mobile. Total training time was about 20 minutes. VRAM usage was a little over 1 GB, leaving a huge margin on a 4 GB GPU.
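Why ln(201)? A model that assigns equal probability to every character has a cross-entropy of exactly ln(vocab_size). A minimal check in plain Python, with all-zero logits standing in for an untrained model:

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target class under softmax(logits)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

vocab_size = 201
uniform_logits = [0.0] * vocab_size   # an untrained model is close to this
loss_at_init = cross_entropy(uniform_logits, target=0)
```

A starting loss noticeably above ln(vocab_size) usually points to a problem with initialization or with the data pipeline.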
I tested it through sample.py on the prompt "Anton Petrovich" at different moments of training.
Quality Evolution
Iter ~600, loss ~1.9:
Anton Petrovich on the prepared itself meeting and talked-away it gave to you.
With instructing, which neither not blinked in the spectacle of two with joyful
clean, that he was he the composing of evening, he continued...
This is already learning Russian morphemes and endings: constructions like "he continued", "which", and "was" appear in recognizable form. It balances quotation marks correctly and capitalizes after periods. Pseudowords are normal at this stage: the model has learned phonotactics, meaning which letters can follow which, but it has not memorized whole words yet.
Iter ~2250, loss ~1.4:
Anton Petrovich at the passage to the side, thought: "They give a pleasant fist
and to give oneself to her ten falls of the girl and to laugh such. Wrote that
one needs a wonderful two for one's executive sendings..."
Almost all words are now real. Direct speech appears with a colon and quotation marks. Local agreement of gender and case often works, even when the words connect absurdly.
But the most interesting thing is the second sample from the same pair:
Anton Petrovich (see note to him by Chekhov). Was from money - see mentions
A. P. Chekhov - to A. P. Chekhov, Mikhail Pavlovich.
3 Next to the collection "New Rechange", to I, 45, 1.
4 "Three Years" (SM 8, 7 with change, 1886, No. 31.
The model learned the format of the editorial apparatus from the complete collected works. The cleaner removed almost all of it, but some pieces remained, and these rare pieces hardened inside the model as a separate generation mode.
Iter ~5000, final, validation loss ~1.20-1.22:
Anton Petrovich slowly looked at me and blinked his eyes. He became small and
put down a match.
"I saw you..." he said, laughing.
"I am asking you, I will wait for you."
"Should it be two years?" I asked, smiling.
"Yes. You think-formal that you will see me in this apartment live. Will you go
to Petersburg, and to Moscow before Petersburg?" I asked.
This is already almost a coherent micro-episode. Dialogue structure is ideal: dashes at the beginning of lines, authorial speech after dialogue, alternation between speakers, speech verbs with participles. There is connection between sentences: "looked at me and blinked his eyes" followed by another action of the same character. The scene is consistent: two speakers, a room, a conversation about the future, mentions of Petersburg and Moscow.
Where the model breaks is semantics and logic:
- "I saw you" uses feminine gender in Russian while the speaker is "he".
- It mixes informal "you" with formal "you".
- "Should it be two years?" is grammatically valid, but meaningless.
That is the boundary between syntax and semantics. A 10M character-level model learned syntax reasonably well, but holding global meaning requires higher-level representations that simply do not exist at this scale.
The most interesting part of this model is its attractors. On the prompt "Anton Petrovich", the model reliably splits roughly 50/50 between prose and the index/footnote style of the complete collected works:
Anton Petrovich - 127, 438 Maria Kaninovna (1848-1899), professor of A. P.
Chekhov - 188, 259, 313 Published Ivanovich (1884-1881), replacement "The
Seagull", nightingale Vladimirovna - 129, 383 "Sakhalin" - 178, 439
This is path dependence: the probability distribution of the next token after "Anton Petrovich" is sharply bimodal. If the next character is a line break, the model almost deterministically goes into name-index mode, because in the training data after a name plus line break there were almost always numbered items with dashes. If the next character is a comma or a space before a verb, the model goes into prose. One random choice of the first token commits the entire subsequent trajectory.
To remove footnotes, it is enough to rigidly fix the trajectory in the prompt. After "Anton Petrovich slowly ", the name index is no longer possible: after names in that format there are digits or years, not adverbs.
Anton Petrovich slowly smiled, even dryly walked by and said to him:
"So in the house Matvey Petrovich talked, so that receiving the count around
Petersburg began as significantly..."
What the Chekhov Model Can and Cannot Do
| Capability | Quality |
|---|---|
| Cyrillic, character distribution | Perfect |
| Morphology, cases, endings | Almost always correct |
| Dialogue structure, dashes, replies, authorial speech | Recognizable |
| Chekhov-like style, patronymics, rhythm, vocabulary | Bright |
| Local coherence for 2-3 sentences | Sometimes |
| Holding a topic across a paragraph | No |
| Semantics | Hallucinations |
| Facts | No |
I later poked this simple model with mech-interp analysis, but that is a separate chapter that did not make it into the final article, so as not to overload it.
It is also worth noting that later I trained a 10-million-parameter model with a normal tokenizer, which is discussed below, and on a larger dataset of Russian classics. The behavior did not change much: speech was still incoherent. Apparently, at 10M parameters it is impossible to get the model to "simulate thinking".
A Tribute to Russian Culture
After Chekhov, I wanted to know what would happen if I expanded the corpus many times over, added a BPE tokenizer, and increased the model to roughly GPT-2-nano scale.
Corpus
I expanded the parser to 21 authors: the golden age of Russian prose, including Tolstoy, Dostoevsky, Turgenev, Goncharov, Leskov, Bunin, Kuprin, Gogol, and Andreyev; drama, including Ostrovsky and Griboyedov; literary criticism, including Belinsky, Dobrolyubov, Pisarev, Herzen, and Chernyshevsky; plus smaller classics such as Garshin, Korolenko, Saltykov-Shchedrin, and Lermontov.
Raw parsing produced 369 MB. The same cleaning as for Chekhov, plus new patterns for each author's academic apparatus - Turgenev had French addresses and signatures, Tolstoy had edition variants such as "1868", Dostoevsky had textological markup - resulted in 264 MB of clean corpus.
Normalization Before BPE
Before training the tokenizer, I ran the text through a normalizer:
- Russian "ё" (yo) was normalized to "е", because that letter is used inconsistently in Russian typography, and it is better not to duplicate tokens for the model.
- Quotation marks were unified.
- Dashes were unified.
- Three-dot ellipses were normalized into a single canonical ellipsis form.
This strongly reduces the token vocabulary and gives the model one canonical form for each punctuation mark.
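A normalizer along these lines could be sketched as below. The exact character sets and canonical forms are assumptions: guillemets, the em dash, and the single-character ellipsis are chosen here only because they match the symbols later registered in the tokenizer.

```python
import re

YO_MAP = str.maketrans({"\u0451": "\u0435", "\u0401": "\u0415"})     # yo -> e, both cases
QUOTE_MAP = str.maketrans({"\u201e": "\u00ab", "\u201c": "\u00bb"})  # low/high quotes -> guillemets
DASH_RE = re.compile(r"[\u2013\u2015\u2212]|(?<= )-(?= )")           # en dash, bar, minus, spaced hyphen
ELLIPSIS_RE = re.compile(r"\.{3,}")                                  # three or more dots

def normalize(text):
    """Map typographic variants to one canonical form each."""
    text = text.translate(YO_MAP).translate(QUOTE_MAP)
    text = DASH_RE.sub("\u2014", text)       # canonical em dash
    text = ELLIPSIS_RE.sub("\u2026", text)   # canonical ellipsis character
    return text

raw = "\u2013 Ещё чаю... \u201eнет\u201c - сказал он."
norm = normalize(raw)
```

Hyphens inside words are left alone; only a hyphen surrounded by spaces is treated as a dash.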
Tokenizer: SentencePiece BPE 16k
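import sentencepiece as spm

# Train a 16k-vocabulary BPE tokenizer on the normalized corpus.
spm.SentencePieceTrainer.train(
    input="corpus_clean.norm.txt",
    model_prefix="spm",
    vocab_size=16000,
    model_type="bpe",
    character_coverage=1.0,
    byte_fallback=True,
    # em dash, opening and closing guillemets, ellipsis
    user_defined_symbols=["\u2014", "\u00ab", "\u00bb", "\u2026"],
)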
user_defined_symbols guarantees that those signs are never split into bytes: the model sees them as atomic tokens. On plain BPE without this option, an em dash, U+2014, three UTF-8 bytes, could be split into pieces. For Russian classics this is catastrophic: the dash is the main syntactic sign of dialogue.
Tokenizer efficiency: 3.49 characters per token on average. Full words such as "Dostoevsky", "landowner", and "young lady" become one token. First names plus patronymics are two or three tokens.
After tokenization: 75M tokens in train.bin.
Architecture, About 46M Parameters
n_layer = 12
n_head = 8
n_embd = 512
block_size = 512 # about 1700 characters in context
vocab_size = 16000
bias = False
dropout = 0.1
| Part | Parameters |
|---|---|
| Token embedding, 16000 x 512, tied | 8.2M |
| Position embedding | 0.26M |
| Per-layer attention + MLP | ~3.15M |
| x 12 layers | 37.8M |
| Total | ~46M |
Training Hyperparameters
30,000 iterations x 34K tokens is about 1B tokens, or about 14 epochs over the corpus. By total token count that is close to the Chinchilla rule of thumb of roughly 20 tokens per parameter, but the corpus itself is far too small for it: 75M unique tokens is too little for a 46M model. With a limited corpus, extra epochs keep helping, especially with local coherence.
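The arithmetic behind these numbers is easy to check. The factor of 20 tokens per parameter is the commonly quoted Chinchilla rule of thumb and is an added assumption here, not a figure from the training logs.

```python
iters = 30_000
tokens_per_iter = 34_000        # "34K" from the text, rounded
corpus_tokens = 75_000_000      # size of train.bin in tokens
params = 46_000_000

tokens_seen = iters * tokens_per_iter     # total tokens processed during training
epochs = tokens_seen / corpus_tokens      # number of passes over the corpus
chinchilla_tokens = 20 * params           # rule-of-thumb budget for this model size
```

So the planned run processes about as many tokens as the rule of thumb suggests, but it gets there by repeating a corpus that is more than ten times too small.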
VRAM: about 3.2 GB with bf16, context 512, batch size 6. One iteration took about 1.5 seconds on an RTX 3050 Mobile.
Progress
iter 0 train loss 9.68 / val 9.68 (~ln(16000), random initialization)
iter 500 ~5.5 warmup finished
iter 3000 ~4.0
iter 8000 ~3.6
iter 14000 ~3.45
iter 21000 ~3.36 best validation, checkpoint saved
iter 24500 ~3.36 plateau, stopped manually
The best checkpoint was fixed at step 21000. After that, validation stopped improving. Train loss kept creeping down - a typical overfitting tail. always_save_checkpoint=False left exactly the useful checkpoint on disk.
Total training time was about 13 hours of continuous laptop work.
Quality Evolution
On the prompt "yyyy", an intentionally meaningless seed used to inspect the model's attractors, after 21000 iterations the model produced two radically different continuations in two samples.
Sample 1: a play in the style of Ostrovsky:
yyyy! (Leaves the hat.)
Zhmigulina. Well, what, and where else?
Avdotya Vasilievna. Ah, you have been waiting for a long time; and so you will
vanish, and vanish. (Looks out the window.) Go, go; I will wait for you.
Zhmigulina. I will not go. (Leaves.)
Avdotya Vasilievna (not listening to her husband's speeches). There! For the
last time!
Blocks of character names, stage directions in parentheses, semantically almost coherent dialogue. The model introduced relationships between characters by itself through the remark "not listening to her husband's speeches". This is no longer just pattern repetition; it is world-building.
Sample 2: the textual apparatus of an academic edition:
yyyyk same. (Takes the letter.)
26 Instead of: excessive ~ was not // les
38 Instead of: left // furnished
Page 391 2 Instead of: excessive // native
This is the "variants and readings" format from academic collected works. The cleaner did not remove it completely, and the model learned that format as one of the genres of the corpus.
The same effect as with Chekhov, but now the genres are more developed: Ostrovsky versus academic apparatus. The prompt "yyyy" was ambiguous enough to trigger both attractors in different samples.
What the 50M Classics Model Can and Cannot Do
| Capability | Quality |
|---|---|
| Russian grammar | Almost flawless |
| 19th-century classics style | Recognizable |
| Genre switching, prose/play/apparatus | Works |
| Patronymics, gender agreement | Holds |
| Local coherence across a paragraph | 4-6 sentences |
| Holding a topic for 100+ tokens | Rarely |
| Holding a plot across a scene | No |
| Facts | Hallucinations |
| Semantic tasks | No |
The important point: compared to the 10M model, coherent speech is clearly visible here, and it is provided precisely by the increase in parameters.
Entering Dialogue With the Machine
We had built a "wild" model that could only try to plausibly continue the text you typed. Now we needed a model that could conduct a dialogue with you. In other words, we needed to turn the language model into a chatbot, still completely from scratch.
Finding the Right Dataset
Compared with the previous chapter, only the dataset content changes. The easiest option is to choose high-quality distillations of large models from Hugging Face.
At first I wanted to translate Anthropic Opus 4.5/6 distillations into Russian, but then I settled on much larger distillation datasets from Kimi 2.6 and GLM-5, each weighing dozens of gigabytes.
But there was a problem. These datasets were obviously in English, while I needed a Russian-language one. I tried translating them myself with Google's good translation model, Translate-Gemma-4B, but on my Maibenben laptop, with an RTX 3050 with 4 GB of VRAM and 16 GB of RAM, translation would have taken monstrously long.
Renting GPU machines on vast.ai was an option, but I did not see much need, because I had already found a large Russian dialogue dataset.
I decided to use a dialogue dataset from the Russian company ZeroAgency. It more than satisfied my needs: it was fairly large and paid a lot of attention to reasoning. In the end I chose it:
https://huggingface.co/datasets/ZeroAgency/ru-big-russian-dataset
The dataset had already been split into train and test, so I could proceed to training the model itself.
Dataset Structure and Training Preparation
Meet our guest: big-russian-dataset on Hugging Face, a Russian-language SFT corpus. It contains 19 train files and 1 test file in .parquet format: about 3.9 GB of compressed parquet, about 7 GB after decoding into UTF-8.
Schema
conversation: list
question: string
source: string
topic: string
classified_topic: string
overall_score: int # 0-10, score from an LLM judge
safety: int # 0-10
pii_leak: int # 0/1
refusal: int # 0/1, refusal marker
coherence, relevance, helpful, ...: int # additional metrics
Size
| Split | Files | Dialogues | After filter |
|---|---|---|---|
| train | 19 | 1.71M | 1,709,621, 99.9% |
| val, test | 1 | 18.5K | 10,396, 56% |
After filtering with overall_score >= 6 AND safety >= 8 AND pii_leak = 0:
- Train passes almost entirely, 99.9%, because the dataset authors had already cleaned the train split.
- Validation has a score spread from 1 to 10 on purpose, so the model can be evaluated on the full difficulty range. That is why only 10.4K remain out of 18.5K.
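The filter itself is a one-line predicate over the dataset columns. The rows below are made up to show one dialogue passing and three being rejected for different reasons.

```python
def passes_filter(row):
    """overall_score >= 6 AND safety >= 8 AND pii_leak = 0."""
    return row["overall_score"] >= 6 and row["safety"] >= 8 and row["pii_leak"] == 0

# Hypothetical rows; only the three columns used by the filter are shown.
rows = [
    {"overall_score": 9, "safety": 10, "pii_leak": 0},  # passes
    {"overall_score": 3, "safety": 10, "pii_leak": 0},  # low judge score
    {"overall_score": 8, "safety": 5, "pii_leak": 0},   # unsafe
    {"overall_score": 7, "safety": 9, "pii_leak": 1},   # personal data leak
]
kept = [r for r in rows if passes_filter(r)]
```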
Quality Examples From Validation
Low-score example, score 3, removed by the filter:
[user] Who wanted to arrest Thomas Jefferson?
[assistant] Paul Revere was a descendant of Huguenot refugees, as was Henry
Laurens, who signed the Articles of Confederation for South Carolina; Jack
Jouett, who rode from Cuckoo Tavern to warn Thomas Jefferson...
The assistant copied context instead of answering the question. A typical artifact of RAG-generated training data.
High-score example, score 9, passes the filter:
[user] Summarize the article: "Death of a nurse": 50-year-old Vera Maresova
confessed to killing five women and one man in a hospital...
[assistant] Vera Maresova, 50, killed five women and one man over four years.
Police said she committed the murders to make her work easier...
Dialogue Lengths
| Metric | chars | approx. tokens, BPE-32k |
|---|---|---|
| mean | 1521 | ~550 |
| p50 | 1237 | ~450 |
| p90 | 2680 | ~1000 |
| p95 | 2914 | ~1100 |
| p99 | 3957 | ~1500 |
P90 around 1000 tokens motivated block_size=1024 for training, because it covers most dialogues in full.
Topics
Top 5 in validation: history, 37%; sports, 5%; news, 4%; crime, 4%; politics, 3%. Historical summarization tasks dominate, mostly from the ai-bond/ru-alpaca-summ subset.
Preprocessing
ChatML markup with special tokens:
<|system|>
<|user|>
<|assistant|>
<|end|>
All four special tokens were registered as user_defined_symbols in SentencePiece, which guarantees atomic tokenization. They are not split into pieces.
Loss mask: during training, only assistant response tokens are counted, including the closing end-of-turn token, <|end|>. System and user segments get target = -1, which is passed as ignore_index to F.cross_entropy (its default value is -100). Without this, a 50M model would not learn to answer; it would spend parameters predicting the questions themselves.
After filtering and markup: 1.04B tokens in train.bin, stored as uint16. Of those, 603M tokens are under loss (assistant responses plus the end-of-turn token), or 57.7%.
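A minimal sketch of how such a mask can be built for next-token training. The token ids and the split of the dialogue into segments are hypothetical; the only thing taken from the text is that every position outside the assistant response gets target -1.

```python
IGNORE = -1  # the value passed as ignore_index to the loss function

def build_targets(segments):
    """segments: list of (role, token_ids). Returns (inputs, targets) shifted by one."""
    tokens, trainable = [], []
    for role, ids in segments:
        tokens.extend(ids)
        trainable.extend([role == "assistant"] * len(ids))
    inputs = tokens[:-1]
    # The target at position i is token i+1, kept only if that token is part of the answer.
    targets = [tok if keep else IGNORE
               for tok, keep in zip(tokens[1:], trainable[1:])]
    return inputs, targets

# Hypothetical ids: 3 = <|user|>, 4 = <|assistant|>, 5 = <|end|>
segments = [
    ("user", [3, 100, 101, 5]),
    ("header", [4]),                 # the <|assistant|> marker itself is not trained on
    ("assistant", [200, 201, 5]),    # answer tokens plus the closing <|end|>
]
inputs, targets = build_targets(segments)
```

The model still sees the question in `inputs`; it just receives no gradient for predicting it.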
Tokenizer
SentencePiece BPE, vocabulary size 32000:
- ID 0: padding/unknown depending on tokenizer configuration.
- ID 1: default control token.
- IDs 2-5: ChatML special tokens.
- IDs 6-261: byte fallback.
- IDs 262-31999: ordinary BPE pieces.
Efficiency: average tokenization density of 5.67 chars/token on Russian text. Whole words like "hello", "great", and "thanks" often fit into one token.
For comparison, the previous 16k-vocab tokenization on Russian classics gave about 4 chars/token. Doubling the vocabulary gave a denser representation, so 1.4x more real text fits into the same 1024 context tokens.
Training
I chose the same NanoGPT as the base. The model parameters were as follows.
Architecture, About 48M Parameters
n_layer = 10
n_head = 8
n_embd = 512
ff_dim = 2048 # 4 x n_embd
block_size = 1024
vocab_size = 32000
bias = False # modern standard: no bias in Linear/LN
dropout = 0.1
Parameter count:
| Part | Parameters |
|---|---|
| Embedding, wte tied with lm_head | 32000 x 512 = 16.4M |
| Position embedding, wpe | 1024 x 512 = 0.5M |
| Per-layer attention, c_attn + c_proj | 4 x 512^2 = 1.05M |
| Per-layer MLP, c_fc + c_proj | 8 x 512^2 = 2.10M |
| Per-layer layer norms, x2 | ~0.001M |
| x 10 layers | ~31.5M |
| Final layer norm | 0.001M |
| Total | 47.85M |
Training Hyperparameters
# AdamW
learning_rate = 3e-4
weight_decay = 0.1
beta1, beta2 = 0.9, 0.95
grad_clip = 1.0
# LR schedule, cosine with warmup
warmup_iters = 200
lr_decay_iters = 16000
min_lr = 3e-5
# batch
batch_size = 2 # micro-batch
gradient_accumulation_steps = 32 # effective batch = 64 sequences
block_size = 1024
# tokens per iter = 2 x 32 x 1024 = 65,536
# training
max_iters = 16000 # about one epoch, 1.04B tokens / 65K tokens per iter
# system
dtype = 'bfloat16'
16000 iterations x 65K tokens = 1.04B tokens. This is roughly one epoch over the corpus.
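The schedule defined by these settings, linear warmup followed by cosine decay to `min_lr`, can be written as one small function. This is a sketch in the style of nanoGPT's schedule, not a copy of the training script.

```python
import math

learning_rate, min_lr = 3e-4, 3e-5
warmup_iters, lr_decay_iters = 200, 16000

def get_lr(it):
    """Learning rate at iteration `it`: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))   # goes from 1 down to 0
    return min_lr + coeff * (learning_rate - min_lr)

schedule = [get_lr(i) for i in (0, 100, 200, 8100, 16000)]
```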
Training Progress
iter 0 loss 10.49 (~ln(32000), random initialization)
iter 500 loss ~5 warmup finished, LR at peak
iter 3000 loss ~3
iter 5500 loss ~2.4 first saved checkpoint, val_loss ~3.3
iter 8500 loss ~2.0 second checkpoint
iter 11500 loss ~1.7 third checkpoint
iter 14500 loss ~1.5 fourth checkpoint, best val_loss ~1.8
iter 16000 max_iters reached; no further checkpoint saved
validation loss stopped improving near the end of cosine decay
Iteration time: about 3 seconds cold, about 4-4.5 seconds once thermally settled. Total time was about 13 hours.
VRAM: about 2.8 GB used with batch size 2, context 1024, bf16. There is headroom under the 4 GB limit.
always_save_checkpoint = False means saving only when validation loss improves. This is insurance: if something goes wrong near the end, the best version remains on disk. In practice, the last checkpoint was at iteration 14500.
Generation Quality Evolution
We tested through chat.py on one prompt, "Hello", at different training moments.
Iter ~6000, early stage:
Hello! Of course, here is a sentence with the sentence "Throw it to work"
for convenient screen reading:
1. "Plan your tasks and assignments" - this is a simple and understandable way...
[grammar is clean, but the content is patchwork]
Iter ~12000, middle:
Tell me about yourself:
As an artificial intelligence, I do not have the ability to provide me with
exact answers to your questions. Step 1: Defining the task. First of all, one
must determine how to do it. It may be like this: "I will be glad that my task
is to help with tasks"...
It has mastered the markdown tutorial genre, but not meaning.
Iter ~14500, final:
Hello
> Hello! I am ready to help you achieve your goals. Have a good day!
What is 7 multiplied by 8?
> To find out how much 7 multiplied by 8 is, we need to multiply the quantity
by 8. First let us calculate the total number of ways to divide this number by 8:
5! = (5 x 8) / 8 = 120
Write bubble sort in Python.
> Here is an example of bubble sort in Python...
def merge_sort(arr):
sorted_array = {}
for i in range(len(arr)):
if arr[i]
Now let us unpack all three examples.
- The first example: it understood the situation perfectly, greeted the user, and offered help.
- The second example: mathematically a mess, but the genre and grammar are ideal. The model clearly understood what was being asked of it.
- The third example: the function name is merge_sort instead of bubble_sort, and there is a dictionary instead of two nested loops. The form is correct; the implementation is meaningless.
In my opinion, this is a very decent result for a model of this size, although obviously it cannot be used in production. But it has hardly reached its ceiling. Generation quality should improve as the dataset grows.
What the Final SFT Model Can Do
| Capability | Quality |
|---|---|
| Russian grammar | Flawless: cases, agreement, syntax |
| Chat format | Reliably answers as an assistant |
| Markdown structure | Imitates GPT-4 style: lists, bold, code fences |
| Self-identification | Says the right words: "I am an AI assistant" |
| EOS completion | Usually stops by itself |
| Local coherence, 1-2 sentences | Sometimes meaningful |
| Answering on topic | Hears trigger words, not the essence |
| Facts | Hallucinations |
| Arithmetic | Imitates calculation without calculating |
| Logic, multi-step reasoning | Absent |
| Code, syntactic and semantic | Shape is correct, code does not work |
Mind.in.a.box, in Go
I wanted to share these artifacts with friends and relatives, but the problem was that they know nothing about llama.cpp or similar software.
The solution: make a single binary that can be sent to a friend in Telegram so they can run it.
Also, it is interesting when a model that behaves like an intelligent entity can be launched like some game through a compact .exe.
Options We Considered
| Approach | Problem |
|---|---|
| PyInstaller, Python to exe | Heavy file, about 700 MB to 1 GB; slow startup; not a "real" single binary |
| Go + ONNX Runtime | Requires ONNX Runtime DLL next to the executable, so not one file |
| Go + llama.cpp via CGO | Requires static llama.cpp build on Windows, MSVC headache |
| Pure Go, chosen | We write the forward pass and BPE encoder ourselves, but get a real single binary |
Solution Architecture
go_serve/
|-- export_weights.py # ckpt.pt -> weights.bin (fp16) + config.json + vocab.json
|-- go.mod # no external dependencies
|-- embed.go # //go:embed for weights/config/vocab
|-- config.go # config.json parsing
|-- tokenizer.go # pure-Go SentencePiece BPE encoder
|-- model.go # forward pass + KV cache + sampling
`-- main.go # chat REPL with slash commands
go build produces one 97 MB .exe with everything embedded.
Key Technical Decisions
1. fp16 Weights in a .bin File
Weights in state_dict are stored in fp32. Before embedding, we convert them to fp16: file size becomes 2x smaller, while precision is acceptable for inference. They are unpacked into fp32 on load via float16ToFloat32.
About 48M parameters x 2 bytes is roughly 96 million bytes, or about 92 MiB, for the whole model.
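As a rough check of the size arithmetic and of what the fp16 round trip costs in precision, here is a minimal Python sketch. It uses only the standard-library struct module, whose `e` format is IEEE 754 half precision. The sample values and the helper names `to_fp16_bytes` and `from_fp16_bytes` are illustrative assumptions and are not taken from export_weights.py.

```python
import struct

def to_fp16_bytes(values):
    """Pack floats as little-endian IEEE 754 half precision (2 bytes each)."""
    return struct.pack(f"<{len(values)}e", *values)

def from_fp16_bytes(blob):
    """Unpack half-precision bytes back into Python floats."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))

weights = [0.0123, -0.5, 1.0, 3.14159, -0.00042]  # made-up sample values
blob = to_fp16_bytes(weights)
restored = from_fp16_bytes(blob)

# Worst relative error introduced by the fp16 round trip
max_rel_err = max(abs(a - b) / abs(a) for a, b in zip(weights, restored))

n_params = 48_000_000  # approximate parameter count quoted in the text
size_bytes = n_params * 2
print(len(blob), "bytes for", len(weights), "values")
print(f"max relative error: {max_rel_err:.2e}")
print(f"{size_bytes / 1e6:.0f} MB = {size_bytes / 2**20:.1f} MiB")
```

For values in the normal fp16 range, the relative error stays below about 1e-3, which is why the precision loss is tolerable for inference.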
2. KV Cache
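Without a cache, every generation step reruns the whole prefix through the model, so the work per new token grows with the sequence length T, and attention alone costs O(T^2). With a cache, only the newest token is processed and attention costs O(T). Averaged over a 200-token answer, the model processes about 100 times fewer token positions.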
type Model struct {
    KCache [][]float32 // [layer][token_pos * n_embd]
    VCache [][]float32
}
// On each forward pass, for layer l:
m.KCache[l] = append(m.KCache[l], k...) // append the new token's K
m.VCache[l] = append(m.VCache[l], v...) // append the new token's V
// attention then runs over the accumulated cache
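The "100x" figure can be reproduced with a simple cost model. The sketch below counts how many token positions are pushed through the network while generating an answer. It assumes an empty prompt and that cost is proportional to the number of positions processed; both are simplifications, and the function name `positions_processed` is made up for this illustration.

```python
def positions_processed(n_new_tokens, prompt_len=0, use_cache=True):
    """Count token positions pushed through the network while generating."""
    total = 0
    for step in range(1, n_new_tokens + 1):
        context = prompt_len + step  # sequence length at this step
        # With a cache only the newest token is processed;
        # without it the whole context is recomputed.
        total += 1 if use_cache else context
    return total

n = 200
without_cache = positions_processed(n, use_cache=False)  # 1 + 2 + ... + 200
with_cache = positions_processed(n, use_cache=True)      # one position per step
print(without_cache, with_cache, without_cache / with_cache)
```

A non-empty prompt makes the gap larger, because the uncached variant reprocesses the prompt at every step.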
3. Parallel matVec Through Goroutines
In nanoGPT-style single-token inference, the main operation is matrix-vector multiplication: y = W @ x, where W has shape [out, in], x has shape [in], and y has shape [out].
The simplest implementation is a double loop, O(out x in). On a 4-core CPU, single-threaded code uses only a quarter of the available resource. We parallelized over rows:
// matVec computes y = W @ x, splitting the rows of W across goroutines.
// Requires imports: "runtime" and "sync".
func matVec(W []float32, rows, cols int, x, y []float32) {
    nworkers := runtime.NumCPU()
    chunk := (rows + nworkers - 1) / nworkers // ceil(rows / nworkers)
    var wg sync.WaitGroup
    for w := 0; w < nworkers; w++ {
        start := w * chunk
        end := start + chunk
        if end > rows {
            end = rows
        }
        if start >= end {
            continue // more workers than remaining rows
        }
        wg.Add(1)
        go func(start, end int) {
            defer wg.Done()
            for r := start; r < end; r++ {
                sum := float32(0)
                row := W[r*cols : (r+1)*cols]
                for c := 0; c < cols; c++ {
                    sum += row[c] * x[c]
                }
                y[r] = sum // each goroutine writes a disjoint range of y
            }
        }(start, end)
    }
    wg.Wait()
}
This gives a 3-4x speedup on a 4-core CPU without BLAS.
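The row partitioning in matVec is easy to get wrong at the edges, so here is a small Python sketch of the same ceil-division chunking, written only to show which row ranges each worker would receive. The function name `row_chunks` is hypothetical; the arithmetic mirrors the Go code above.

```python
def row_chunks(rows, nworkers):
    """Split [0, rows) into contiguous ranges using ceil-division chunking."""
    chunk = (rows + nworkers - 1) // nworkers
    ranges = []
    for w in range(nworkers):
        start = w * chunk
        end = min(start + chunk, rows)
        if start >= end:  # more workers than remaining rows: skip
            continue
        ranges.append((start, end))
    return ranges

print(row_chunks(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(row_chunks(5, 4))   # [(0, 2), (2, 4), (4, 5)]
```

In the second call the fourth worker gets no rows at all, which is the case the `start >= end` check guards against.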
4. Parallel Attention Heads
During the one-token forward pass with the KV cache, all attention heads are computed concurrently in goroutines joined by a sync.WaitGroup. Heads are independent of each other, so they need no locking.
5. Pure-Go SentencePiece BPE Encoder, or: The Rake Collection
The real SentencePiece BPE encoder repeatedly merges the adjacent pair of pieces with the best score. In practice, the result is often close to "take the longest piece that starts at the current position." We use that heuristic; its output usually matches the exact algorithm, but not always:
func (t *Tokenizer) segmentBPE(s string) []int {
    out := make([]int, 0, len(s)/3)
    i := 0
    for i < len(s) {
        matched := false
        // Try the longest candidate first, then shrink from the right.
        for j := len(s); j > i; j-- {
            if id, ok := t.pieceToID[s[i:j]]; ok {
                out = append(out, id)
                i = j
                matched = true
                break
            }
        }
        if !matched {
            out = append(out, t.byteToID[s[i]]) // byte fallback
            i++
        }
    }
    return out
}
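To see where the longest-match heuristic and score-based merging can disagree, here is a toy comparison in Python. The five-piece vocabulary and its scores are invented for the example, and `bpe_merge` is a deliberately simplified merge loop, not SentencePiece's actual implementation.

```python
def bpe_merge(text, scores):
    """Simplified BPE: repeatedly merge the adjacent pair with the best score."""
    parts = list(text)
    while True:
        best = None  # (index, merged_piece)
        for i in range(len(parts) - 1):
            cand = parts[i] + parts[i + 1]
            if cand in scores and (best is None or scores[cand] > scores[best[1]]):
                best = (i, cand)
        if best is None:
            return parts
        i, cand = best
        parts[i:i + 2] = [cand]

def longest_match(text, scores):
    """Greedy heuristic: take the longest known piece at the current position."""
    out, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in scores:
                out.append(text[i:j])
                i = j
                break
        else:
            out.append(text[i])  # stand-in for the byte fallback
            i += 1
    return out

# Toy vocabulary: "bc" has a better score than "ab", and "abc" is absent.
scores = {"a": -3.0, "b": -3.0, "c": -3.0, "ab": -1.0, "bc": -0.5}
print(bpe_merge("abc", scores))      # ['a', 'bc']
print(longest_match("abc", scores))  # ['ab', 'c']
```

Both segmentations are valid, but they map to different token IDs, so a model trained on one may see slightly unfamiliar input from the other.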
Artifact Sizes
| File | Size |
|---|---|
| weights.bin, 50M params, fp16 | 92 MB |
| vocab.json, 32k pieces with scores | 1.8 MB |
| config.json | ~12 KB |
| Final exe after go build -ldflags="-s -w" | 97 MB |
After upx --best, it compresses to about 70 MB.
Performance
$ printf '/max_tokens 50\nHello\n/quit\n' | ./nanogpt-chat.exe
=== nanoGPT chat (Go single-binary) ===
Model: 10 layers, 8 heads, n_embd=512, ctx=1024, vocab=32000
Weights loaded in 189ms
you> Hello
bot> Hello! I am ready to help you achieve your goals. Have a good day!
[15 tokens in 418ms, 35.9 tok/s]
Comparison on the same machine, same model, same prompt:
| Stack | Speed |
|---|---|
| Python + PyTorch + CUDA, GPU | ~14 tok/s |
| Python + PyTorch + CPU | not measured, expected ~3-5 tok/s |
| Go + parallel matVec + CPU | 35.9 tok/s |
PyTorch overhead does not pay off for single-token inference of a tiny model. Each token means dozens of kernel launches, with CPU-GPU synchronization between them, which slows things down even more. On a 50M model, this dominates the actual computation.
Cross-Compile
CGO is not used, so cross-compilation is trivial:
$env:GOOS="linux"; go build -ldflags="-s -w" -o nanogpt-chat-linux .
$env:GOOS="darwin"; go build -ldflags="-s -w" -o nanogpt-chat-mac .
$env:GOOS="windows"; go build -ldflags="-s -w" -o nanogpt-chat.exe .
All three variants can be built on any platform.
Features and Limitations
- CPU only. No GPU acceleration. For 50M this is fine; for 1B+ it will already be slow.
- fp32 inference. No quantization, no int8/int4. Model in RAM is about 190 MB.
- No batching. One user, one session at a time. A server scenario would need batching around it.
- Sampling uses sort.Slice, which is O(V * log V). With a 32k vocab this is not critical, but it could be sped up with a partial sort.
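The partial-sort idea from the last item can be sketched in a few lines of Python. This is an illustration of the technique only, with random numbers standing in for logits; in Go, a similar selection could be built on the standard container/heap package.

```python
import heapq
import random

def top_k_full_sort(logits, k):
    """Baseline: sort all V indices by logit, O(V log V)."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]

def top_k_heap(logits, k):
    """Partial selection with a size-k heap, O(V log k)."""
    return heapq.nlargest(k, range(len(logits)), key=lambda i: logits[i])

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(32_000)]  # fake logits
top = top_k_heap(logits, 40)
print(top[:5])
```

Both functions return the same indices in the same order; the heap version just avoids ordering the remaining tokens that sampling never looks at.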
How to Accidentally Turn a Model Into a Mad Philosopher
The model still had obvious problems: despite the significantly higher quality, it could answer almost no question correctly. The obvious solution was to increase the number of iterations and the amount of training data. But we decided to take a more interesting path: reinforcement learning, or more precisely its offline relative, preference optimization. There were two approaches to choose from, KTO and DPO.
KTO was chosen instead of DPO because DPO teaches comparative judgment: "X is better than Y." KTO teaches an absolute judgment: "this is good / this is bad relative to some baseline." For our task, the absolute signal is more precise. You are not telling the model "rejected is worse than chosen"; you are telling it "these three patterns are bad, period."
Collecting Pairs
Strategy: for every prompt from train, take chosen, the original high-score answer from the dataset, and rejected, generated by our SFT model with settings that provoke the needed failure mode.
Three presets, each catching its own type of error:
| preset | temperature | rep_penalty | max_tokens | target failure |
|---|---|---|---|---|
| loops | 1.20 | 1.00, off | 400 | token-level loops |
| canned | 0.40 | 1.15 | 80 | short canned templates |
| tutorial | 0.70 | 1.10 | 500 | long markdown walls |
1000 pairs per preset: 3000 pairs total. Collection took about 3 hours.
The quality of rejected was confirmed by checking random samples:
- loops: token salad, nonsense.
- canned: on "start a small business", it produced a template like "1. Collect information 2. Analyze data 3. Create reports" with no connection to the topic.
- tutorial: on a prompt about neuroplasticity, it produced "### 1. Understanding the topic" with a quote from English text, without answering.
All three presets worked as intended: rejected was clearly worse than chosen.
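To make the effect of the presets concrete, the sketch below applies temperature and a repetition penalty to a tiny made-up logit vector. The preset values come from the table above; everything else is an assumption. In particular, the sampler used for collection is not shown in the article, so the CTRL-style penalty here (divide positive logits, multiply negative ones) is just one common way to implement it.

```python
import math

# Values copied from the table above; the dict layout itself is hypothetical.
PRESETS = {
    "loops":    {"temperature": 1.20, "rep_penalty": 1.00, "max_tokens": 400},
    "canned":   {"temperature": 0.40, "rep_penalty": 1.15, "max_tokens": 80},
    "tutorial": {"temperature": 0.70, "rep_penalty": 1.10, "max_tokens": 500},
}

def adjust_logits(logits, generated_ids, temperature, rep_penalty):
    """Apply a CTRL-style repetition penalty, then temperature scaling."""
    out = list(logits)
    for tok in set(generated_ids):
        # Make already generated tokens less likely
        out[tok] = out[tok] / rep_penalty if out[tok] > 0 else out[tok] * rep_penalty
    return [x / temperature for x in out]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up logits for a 4-token vocabulary
history = [0]                   # token 0 was already generated
for name, p in PRESETS.items():
    probs = softmax(adjust_logits(logits, history, p["temperature"], p["rep_penalty"]))
    print(name, [round(x, 3) for x in probs])
```

The low temperature of the canned preset concentrates probability on the top token, which favors safe templates, while the high temperature and disabled penalty of the loops preset flatten the distribution and leave repetition unchecked.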
Implementation
train_kto.py, about 280 lines, is my own KTO loss implementation on top of nanoGPT.
# Four forward passes for each (prompt, chosen, rejected) triplet:
ref_lp_chosen = sequence_logp(ref, prompt, chosen) # frozen, no_grad
ref_lp_rejected = sequence_logp(ref, prompt, rejected) # frozen, no_grad
pol_lp_chosen = sequence_logp(policy, prompt, chosen) # gradients
pol_lp_rejected = sequence_logp(policy, prompt, rejected)
# Length-normalized log-ratios
chosen_lr = (pol_lp_chosen - ref_lp_chosen) / len(chosen)
rejected_lr = (pol_lp_rejected - ref_lp_rejected) / len(rejected)
# KTO loss
z = max(0, z_ref) # IMPORTANT: clamp to [0, +inf); I forgot this line in v1
L_chosen = 1 - sigmoid(beta * (chosen_lr - z))
L_rejected = 1 - sigmoid(beta * (z - rejected_lr))
loss = lambda_d * L_chosen + lambda_u * L_rejected
# z_ref is then updated as an EMA of the mean log-ratio, for diagnostics
z_ref = 0.99 * z_ref + 0.01 * (chosen_lr + rejected_lr) / 2
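The pseudocode above can be turned into a runnable scalar version for experimenting with the loss shape. This is a sketch under stated assumptions, not the code from train_kto.py: sequence log-probabilities are passed in as plain numbers, the function names `kto_loss` and `update_z_ref` are made up, and `lambda_d = lambda_u = 1.0` is an assumed default chosen because it reproduces a starting loss of about 1.0.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def kto_loss(pol_lp_chosen, ref_lp_chosen, len_chosen,
             pol_lp_rejected, ref_lp_rejected, len_rejected,
             z_ref, beta=0.1, lambda_d=1.0, lambda_u=1.0):
    """Scalar KTO-style loss for one (prompt, chosen, rejected) triplet."""
    # Length-normalized log-ratios of the policy vs. the frozen reference
    chosen_lr = (pol_lp_chosen - ref_lp_chosen) / len_chosen
    rejected_lr = (pol_lp_rejected - ref_lp_rejected) / len_rejected
    z = max(0.0, z_ref)  # clamp the baseline to [0, +inf)
    l_chosen = 1.0 - sigmoid(beta * (chosen_lr - z))
    l_rejected = 1.0 - sigmoid(beta * (z - rejected_lr))
    loss = lambda_d * l_chosen + lambda_u * l_rejected
    return loss, chosen_lr, rejected_lr

def update_z_ref(z_ref, chosen_lr, rejected_lr, decay=0.99):
    """EMA of the mean log-ratio, following the pseudocode."""
    return decay * z_ref + (1.0 - decay) * (chosen_lr + rejected_lr) / 2.0

# Step 0: the policy equals the reference, so both log-ratios are zero.
loss, c_lr, r_lr = kto_loss(-120.0, -120.0, 60, -300.0, -300.0, 150, z_ref=0.0)
print(loss)  # 0.5 + 0.5 = 1.0
```

With identical policy and reference, each term is 1 - sigmoid(0) = 0.5, so the loss starts at 1.0 and falls as the chosen log-ratio rises or the rejected one drops.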
Attempt 1: beta=0.1, max_iters=2000
We launched with default parameters:
beta = 0.1
learning_rate = 5e-6
max_iters = 2000
batch_size = 1
grad_accum = 16
block_size = 768
Progress:
iter 0: loss=0.9997 gap=+0.01 z_ref=-0.001
iter 50: loss=0.9949 gap=+0.20 z_ref=-0.069
iter 100: ?
iter 500: loss=~0.5 gap=+30 z_ref=-15 suspicious
iter 900: loss=0.24 gap=+55 z_ref=-37 val_acc=1.000, saved
iter 1500: loss=0.18 gap=+70 z_ref=-40 val_acc=1.000, best
iter 1900: loss=0.31 gap=~+70 z_ref=-40 val_acc=0.93, degrading
val_acc=1.000 by iteration 900: the model perfectly distinguishes chosen/rejected on validation. But z_ref=-37 is a colossal shift. The mean log-ratio fell by 37 nats. This means the model strongly lowered the likelihood of both answers on average relative to the reference model.
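One way to read these numbers is to split them back into the two log-ratios. The sketch below assumes that the logged gap is chosen_lr minus rejected_lr and that z_ref is close to the current mean of the two; since z_ref is an EMA, it lags the true mean, so the result is only an estimate. The helper name `split_log_ratios` is hypothetical.

```python
def split_log_ratios(gap, mean_lr):
    """Recover (chosen_lr, rejected_lr) from their difference and their mean."""
    chosen_lr = mean_lr + gap / 2.0
    rejected_lr = mean_lr - gap / 2.0
    return chosen_lr, rejected_lr

# Logged values at iteration 900
chosen_lr, rejected_lr = split_log_ratios(gap=55.0, mean_lr=-37.0)
print(chosen_lr, rejected_lr)  # -9.5 -64.5
```

Under these assumptions, even the chosen answers became less likely than under the reference model. The gap grew because rejected fell faster, not because chosen improved.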
Test in chat.py on the checkpoint from iteration 1500, best validation loss:
you> Hello
bot> Alexey! One. You know you on Earth. I told you again, you here
your name is you,, your day, Alexey. Of course. You always know yours on Earth!
you> Tell me about yourself
bot> - pledge - pledge serves as the basis that - from the state, the state
- and-em of the state{, to and,o and-e - to eternal...
[broken UTF-8, destroyed syntax, fragments of random words]
The model was completely destroyed. It still "distinguishes" chosen/rejected on validation, with accuracy 1, but it stopped being an assistant and began producing strange poetic nonsense.
Curious Side Effect: "Artistic Mode"
After both KTO attempts, the model produced not just nonsense, but nonsense in a specific stylistic register: pseudo-philosophical.
you> Hello
bot> You know your special soul? You want, smile; You know your day,
you may be waited. Together? Only everything - wisdom.
you> Tell me about yourself
bot> - pledge - pledge serves as the basis that - from the state, the state
- and-em of the state{, to and,o and-e - to eternal...
Notice the second-person address, emotionally colored vocabulary such as "wisdom", "eternal", and "Earth", abrupt images, and neologisms. This is not random garbage. It is a recognizable genre, and the model was never explicitly trained on it. Where did it come from?
Hypothesis: inverse mode collapse.
- In SFT train, about 1-2% of the corpus is writing, literature, language, and linguistics. Summarization tasks sometimes also contain fragments of fiction as input. The model saw this register weakly, but it saw it.
- Our rejected samples (canned, tutorial, loops) are the dominant modes of the distribution. KTO pushed them down.
- The distribution remains normalized through softmax. When frequent modes are pushed down, the remaining rare modes receive relatively larger weight during training.
Ordinary mode collapse means the model converges to one frequent pattern. Here it is the reverse: we killed frequent patterns, and the model converged into rare ones.
This is a vivid side example showing that preference learning rewrites not only the thing it is aimed at, but the whole distribution. A narrow KTO signal over three failure modes unexpectedly rebuilt the model's entire generative geometry.
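The renormalization argument behind this hypothesis can be shown with a toy softmax over four "modes". The logits and the size of the push-down are invented, and a real model distributes probability over tokens rather than over named styles, so this only illustrates the mechanism.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

modes = ["canned", "tutorial", "loops", "literary"]
before = [3.0, 2.5, 2.0, -1.0]  # made-up logits: the literary mode is rare
penalty = 6.0                   # hypothetical push-down on the three rejected modes
after = [x - penalty for x in before[:3]] + [before[3]]

for name, p0, p1 in zip(modes, softmax(before), softmax(after)):
    print(f"{name:9s} {p0:.3f} -> {p1:.3f}")
```

The literary logit never changes, yet its probability goes from under 1% to the majority share, simply because everything else was pushed down and the distribution must still sum to one.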
Conclusions
This adventure suggests several interesting conclusions.
- A language model can be trained from scratch and turned into a chatbot using only a dialogue dataset. Its world model will be poorer than that of a model that went through pretraining on raw text, but in production this can be patched with RAG.
- Somewhere between 10M and 50M parameters there is a boundary where a model starts absorbing something that resembles human thinking. At 10M parameters we get incoherent muttering that loses the thread after a couple of words. At 50M parameters we get a model that can generate coherent text and even conduct dialogue.
- At a certain scale, a transformer-based language model begins to demonstrate behavior that, in humans, requires thinking. Even the mistakes the model makes resemble the mistakes humans make when trying to remember something. This raises interesting questions about their nature.
Sources
Dialogue model Mini-Tron-50: https://huggingface.co/Imperius/mini-tron-50
My corpus of Russian classics and publicist writing from the 19th and early 20th centuries: https://huggingface.co/datasets/Imperius/ru-classic
Parser and tokenizer for Russian classics and publicist writing of the 19th century: https://codeberg.org/imperius/libru-classics-bpe
Code for the 10M-parameter LLM trained on Russian classics, based on NanoGPT: https://codeberg.org/imperius/nanogpt-chekhov
Code for the 50M-parameter LLM trained on Russian classics, based on NanoGPT: https://codeberg.org/imperius/nanogpt-ru-classics
Code for the 50M-parameter dialogue LLM trained on the dialogue dataset, based on NanoGPT: https://codeberg.org/imperius/mini-tron-50

