Artem X

Posted on • Originally published at habr.com

How I Grew a Digital Homunculus and Became a Neuro-Punk

Why? To create Skynet, of course.

Well, also because I wanted to understand in detail what is happening right now in a field that fascinates me so much. And the best way to understand something is to try to explain it to someone else.

Besides that, I want to move into deep learning professionally, and publishing my interesting projects on the internet seems like the fastest way to get noticed.

Personally, I enjoyed the process a lot, and I invite Habr readers to dive into this small journey with me.

Links to the dataset, weights, and code are attached at the end of the article. The dataset and weights are on Hugging Face; the codebase is on Codeberg, a GitHub-like platform with a similar workflow.

Let's go.

Important Note

The author is an experienced programmer, but everything below was vibe-coded with Claude Code from start to finish. That said, the author honestly tried to understand everything he wrote about. In any case, use the provided source code at your own risk. I warned you.

It is also worth keeping in mind that this is the author's first technical article ever. I made significant effort to make the text readable, but there may still be rough edges.

Most of the article was written by hand, but because of the amount of material I had to use Opus as an editor, mostly for the notes about model training. I tried to check and correct the information, but again, keep that in mind.

Background

I first encountered language models in early 2022, when the web studio where I worked as a Python developer was doing contract work for an American company called Inita. They were building an AI startup for small businesses.

I got access to the OpenAI API and GPT-3, and I was fascinated by this technology at first sight. There is something almost magical about ordinary lines of code being able to enter into dialogue with you and learn something that looks like thinking.

Unfortunately, because of well-known events, our clients eventually lost the ability to pay us. I had effectively been hired for that specific project, and they did not find other tasks for me.

After half a year of fruitless job searching, I managed to get a job at a small instrument-making factory as a microcontroller programmer. In practice, I wrote both firmware for MCUs and graphical interfaces for working with them.

All those years I tried to stay up to date with language models and actively tested different chatbots. I used a paid ChatGPT subscription for a long time too, but that was more user-level expertise than developer-level expertise, which did not satisfy me.

I worked like that for almost three years, until I was offered a position at a large corporation with a noticeable salary increase. Suddenly it turned out that working as a developer in Russian small business has its own special flavor: people constantly try to squeeze everything out of you. In a large company, the rules were different.

I suddenly had a large amount of mental resources available, and I used them to fulfill a long-standing dream I had cherished since 2022: to figure out deep learning.

First Steps

I started small. With Claude Code helping me, I tried small deep networks on tasks that interested me. The result was, for example, a self-learning 2D snake and an Anymal quadruped in MuJoCo learning to walk. I will not go deep into the details; I will just show a couple of demos.

A convolutional network learns to play snake.

A multilayer perceptron in a quadruped body learns the world in MuJoCo.

But all of that was preparation for the main boss: language models. I started by reading Sebastian Raschka's book "Build a Large Language Model (From Scratch)", then tried to turn GPT-2-small into an instruction-following bot with a LoRA adapter trained through SFT.

Let us unpack those two terms.

A LoRA adapter works by adding small matrices to the model's layers. This gives us the ability to fine-tune the model on situations we care about without changing the entire model.
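To make the "small matrices" concrete, here is a minimal sketch of the low-rank update in plain Python. The width of 768 matches GPT-2-small's hidden size; the rank r = 8 and the scaling factor alpha are illustrative assumptions, not the settings used in this experiment.

```python
# Minimal LoRA sketch in plain Python (no frameworks), for illustration only.
# d = 768 matches GPT-2-small's hidden size; r and alpha are assumed values.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B train."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # A: r x d_in, B: d_out x r
    scale = alpha / r
    return [b + scale * l for b, l in zip(base, low_rank)]

def lora_param_counts(d_in, d_out, r):
    """Trainable parameters: full fine-tuning vs. a LoRA adapter."""
    return d_in * d_out, r * (d_in + d_out)

full, adapter = lora_param_counts(768, 768, 8)
print(full, adapter)  # 589824 12288
```

For one 768 x 768 layer the adapter trains about 2% of the weights that full fine-tuning would touch. B is usually initialized to zero, so the adapted model starts out identical to the base model.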

SFT, or supervised fine-tuning, trains the model on pairs of user request and assistant answer. The training principle is the same as for raw text, but the important detail is that loss is computed only on the assistant continuation, not on the user's question.

Overall, GPT-2 did turn into a chatbot, although it was obvious that the model lacked the "brains" to answer correctly. One interesting example was translation of an English phrase into French.

Prompt:

Translate "Good morning" to French.

Base GPT-2-small

Translation:

Translation:

Translation:

...and so on another hundred times

After SFT

Je suis arrive.

The model answered incorrectly, but the interesting part is that this phrase literally means "I have arrived"; in some contexts it can play a greeting-like role. One can conclude that the model understood what kind of answer was needed, but lacked the factual knowledge.

In fact, mistakes made by models, especially small LLMs, are often more interesting to analyze than correct answers, because their mistakes often resemble the way humans make mistakes - humans from whom they learned to think.

Teaching Arrays of Numbers to Think

I wanted to feel the magic of ordinary arrays of numbers starting, after thousands of iterations, to produce answers that require thinking in humans. So I decided to create and train a model myself.

Simplifying a lot, creating an LLM "from scratch" can be divided into four stages:

  1. Building the dataset.
  2. Writing and training the tokenizer, the model's vocabulary, then tokenizing the dataset.
  3. Configuring the model.
  4. Training the model on the chosen dataset.

Let us go through these stages in more detail.

  1. Dataset collection.

This means parsing the data we want to feed into the model, or downloading a ready-made dataset. When building a dataset from scratch, normalization is extremely important: cleaning data from irrelevant garbage. This is especially important when training LLMs, because the quality of the source data determines whether the model will output what you need.

  2. Writing and training the tokenizer, then tokenizing the dataset.

It is important to understand that the model learns to continue human text more easily when we first split text into "pieces" instead of forcing it to predict text character by character. The model trains faster and produces better results. Later I will show the difference between character-level training and tokenized training. Also, "training a tokenizer" does not mean training a deep network; it uses a classical algorithm.

  3. Model configuration.

The most important number here is the final parameter count, because it directly correlates with the maximum intellectual capability the model can have for generating meaningful text. This will be shown visually later, when I compare a 10-million-parameter model and a 50-million-parameter model on the same dataset.

  4. Training on the selected dataset.

Good practice in deep learning is to split the dataset into train and validation samples. The training sample directly affects the model weights; this is what the model learns from. The validation sample is needed to monitor training.

Deep networks optimize loss, not our wishes. This can lead to a situation where instead of learning to generalize, the network starts memorizing the training data. This is called overfitting.

The validation sample is what controls this. The network does not train on it; it only produces results there. As a rule, validation is used to save the best checkpoints and to stop training early if validation loss has stopped decreasing or begins to grow while training loss keeps falling.
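The "save the best checkpoint, stop when validation stalls" logic can be sketched in a few lines. This is a generic illustration, not the training loop used later in the article; the `patience` value and the loss numbers are made up.

```python
# Sketch of "keep the best checkpoint, stop when validation stalls".
# `patience` and the loss values below are illustrative assumptions.

def track_validation(val_losses, patience=3):
    """Return (best_step, stop_step) for a sequence of validation losses.

    best_step: index of the lowest validation loss (the checkpoint to keep).
    stop_step: index at which training would stop early, or None.
    """
    best_loss = float("inf")
    best_step = None
    bad_evals = 0
    for step, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_step = loss, step  # save a checkpoint here
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step
    return best_step, None

# Validation loss falls, then creeps up while train loss would keep falling.
print(track_validation([3.0, 2.5, 2.2, 2.3, 2.4, 2.6]))  # (2, 5)
```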

Does a Language Model Dream of The Cherry Orchard?

The lib.ru Parser

The hardest part of working with lib.ru was not fetching pages, but the editorial apparatus of academic editions. Chekhov's complete collected works are hosted there, and alongside the stories themselves there are variant readings, manuscript descriptions, textual comments, and biographical notes.

Raw parsing produced about 24 MB of text, but half of it was apparatus. I wrote a series of regex cleaners that iteratively cut out:

  • Section headers such as "Notes", "Writing history", and "List of abbreviations".
  • Letter headers such as "Chekhov to A. S. Suvorin" or "To Al. P. Chekhov".
  • Textological notes such as "The following was begun:", "Inserted instead of:", or "A note in the margin:".
  • Archive codes such as TsGALI, GPB, and IRLI.
  • Bibliographic footnotes and references to volumes.

After cleaning, 16 MB of pure Chekhov prose remained: short stories, novellas, plays, notebooks. Letters entered the corpus fully; the apparatus did not.
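A rough sketch of what one pass of such a cleaner might look like. The patterns here are hypothetical and written in English for readability; the real ones target Russian headers, and there are many more of them.

```python
import re

# Hypothetical patterns, shown in English for readability; the real cleaners
# target Russian headers and are more numerous.
SECTION_HEADERS = re.compile(r"^(Notes|Writing history|List of abbreviations)\s*$")
TEXT_NOTES = re.compile(r"^(The following was begun:|Inserted instead of:|A note in the margin:)")
ARCHIVE_CODES = re.compile(r"\b(TsGALI|GPB|IRLI)\b")

def clean_lines(lines):
    """Drop lines that look like editorial apparatus; keep everything else."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if SECTION_HEADERS.match(stripped):
            continue
        if TEXT_NOTES.match(stripped):
            continue
        if ARCHIVE_CODES.search(stripped):
            continue
        kept.append(line)
    return kept

sample = [
    "Notes",
    "He walked slowly along the embankment.",
    "Inserted instead of: quickly",
    "Autograph held in TsGALI, f. 549.",
    "The evening was warm.",
]
print(clean_lines(sample))
```

A line-level filter like this only removes the marker lines themselves. Dropping the whole body of a "Notes" section needs extra state, which is one reason the cleaning had to be iterative.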

Data Preparation

Character-level means literally that every character is a token. The model alphabet contained 201 unique characters: Cyrillic in both cases, Latin letters because Chekhov wrote in French and German, punctuation, dashes, quotation marks, digits, and typographic symbols from the editions.

# prepare.py - standard nanoGPT char-level preprocessing
chars = sorted(list(set(text)))
vocab_size = len(chars)   # 201
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

90% of the corpus went to train, 10% to validation. No special tokens, no EOS: the model simply learns a continuous stream of characters.
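For completeness, here is how the mapping is used for encoding, decoding, and the 90/10 split. It is a toy sketch: the one-sentence string stands in for the real corpus, and the contiguous split is an assumption modeled on the usual nanoGPT char-level setup.

```python
# Char-level encoding and a 90/10 split on a toy string.
# The sample text stands in for the real corpus.
text = "Anton Petrovich slowly looked at me."

chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}

def encode(s):
    """String -> list of integer ids, one per character."""
    return [stoi[c] for c in s]

def decode(ids):
    """List of integer ids -> string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
split = int(0.9 * len(ids))          # first 90% for training
train_ids, val_ids = ids[:split], ids[split:]

assert decode(ids) == text           # encoding is lossless
print(len(train_ids), len(val_ids))
```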

Architecture, About 10.7M Parameters

n_layer    = 6
n_head     = 6
n_embd     = 384
block_size = 256        # about 120 Russian words in context
vocab_size = 201
dropout    = 0.2        # small model, small corpus
bias       = False

Parameter count:

| Part | Parameters |
|---|---|
| Token embedding, wte, tied | 201 x 384 = 77K |
| Position embedding, wpe | 256 x 384 = 98K |
| Per-layer attention | 4 x 384^2 ~= 590K |
| Per-layer MLP | 8 x 384^2 ~= 1.18M |
| x 6 layers | ~10.6M |
| Total | ~10.76M |

By today's standards this is comparable to Andrej Karpathy's historical char-rnn models from his 2015 article "The Unreasonable Effectiveness of Recurrent Neural Networks". Back then, LSTMs of a similar scale were trained on Shakespeare and Tolstoy. This experiment is directly in that tradition.

Training Hyperparameters

learning_rate = 1e-3
max_iters = 2000
warmup_iters = 100
lr_decay_iters = 2000
min_lr = 1e-4
beta2 = 0.99
batch_size = 64
dtype = 'bfloat16'

Training

The starting loss was 5.41, close to ln(201) = 5.30. The model honestly began with an almost uniform distribution over the vocabulary. Iterations took 180-250 ms on an RTX 3050 Mobile. Total training time was about 20 minutes. VRAM usage was a little over 1 GB, leaving a huge margin on a 4 GB GPU.
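The ln(vocab_size) baseline is easy to check: a model that spreads probability evenly over V tokens has a cross-entropy of exactly ln(V). A quick sketch for the three vocabulary sizes that appear in this article:

```python
import math

def uniform_loss(vocab_size):
    """Cross-entropy (in nats) of a uniform guess over vocab_size classes."""
    return math.log(vocab_size)

# Vocabulary sizes used in this article: char-level, BPE 16k, BPE 32k.
for vocab_size in (201, 16000, 32000):
    print(vocab_size, round(uniform_loss(vocab_size), 2))
```

A starting loss slightly above this value is normal, because random initialization is only approximately uniform.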

I tested it through sample.py on the prompt "Anton Petrovich" at different moments of training.

Quality Evolution

Iter ~600, loss ~1.9:

Anton Petrovich on the prepared itself meeting and talked-away it gave to you.
With instructing, which neither not blinked in the spectacle of two with joyful
clean, that he was he the composing of evening, he continued...

The model is already learning Russian morphemes and endings: constructions like "he continued", "which", and "was" appear in recognizable form. It balances quotation marks correctly and capitalizes after periods. Pseudowords are normal at this stage: the model has learned phonotactics, that is, which letters can follow which, but it has not yet memorized whole words.

Iter ~2250, loss ~1.4:

Anton Petrovich at the passage to the side, thought: "They give a pleasant fist
and to give oneself to her ten falls of the girl and to laugh such. Wrote that
one needs a wonderful two for one's executive sendings..."

Almost all words are now real. Direct speech appears with a colon and quotation marks. Local agreement of gender and case often works, even when the words connect absurdly.

But the most interesting thing is the second sample from the same pair:

Anton Petrovich (see note to him by Chekhov). Was from money - see mentions
A. P. Chekhov - to A. P. Chekhov, Mikhail Pavlovich.

3 Next to the collection "New Rechange", to I, 45, 1.
4 "Three Years" (SM 8, 7 with change, 1886, No. 31.

The model learned the format of the editorial apparatus from the complete collected works. The cleaner removed almost all of it, but some pieces remained, and these rare pieces hardened inside the model as a separate generation mode.

Iter ~5000, final, validation loss ~1.20-1.22:

Anton Petrovich slowly looked at me and blinked his eyes. He became small and
put down a match.

"I saw you..." he said, laughing.

"I am asking you, I will wait for you."

"Should it be two years?" I asked, smiling.

"Yes. You think-formal that you will see me in this apartment live. Will you go
to Petersburg, and to Moscow before Petersburg?" I asked.

This is already almost a coherent micro-episode. Dialogue structure is ideal: dashes at the beginning of lines, authorial speech after dialogue, alternation between speakers, speech verbs with participles. There is connection between sentences: "looked at me and blinked his eyes" followed by another action of the same character. The scene is consistent: two speakers, a room, a conversation about the future, mentions of Petersburg and Moscow.

Where the model breaks is semantics and logic:

  • "I saw you" uses feminine gender in Russian while the speaker is "he".
  • It mixes informal "you" with formal "you".
  • "Should it be two years?" is grammatically valid, but meaningless.

That is the boundary between syntax and semantics. A 10M character-level model learned syntax reasonably well, but holding global meaning requires higher-level representations that simply do not exist at this scale.

The most interesting part of this model is its attractors. On the prompt "Anton Petrovich", the model splits roughly 50/50 between prose and the index/footnote style of the complete collected works:

Anton Petrovich - 127, 438 Maria Kaninovna (1848-1899), professor of A. P.
Chekhov - 188, 259, 313 Published Ivanovich (1884-1881), replacement "The
Seagull", nightingale Vladimirovna - 129, 383 "Sakhalin" - 178, 439

This is path dependence: the probability distribution of the next token after "Anton Petrovich" is sharply bimodal. If the next character is a line break, the model almost deterministically goes into name-index mode, because in the training data after a name plus line break there were almost always numbered items with dashes. If the next character is a comma or a space before a verb, the model goes into prose. One random choice of the first token commits the entire subsequent trajectory.

To avoid the footnote mode, it is enough to pin down the trajectory in the prompt. After "Anton Petrovich slowly ", the name index is no longer plausible: in that format, names are followed by digits or years, not adverbs.

Anton Petrovich slowly smiled, even dryly walked by and said to him:
"So in the house Matvey Petrovich talked, so that receiving the count around
Petersburg began as significantly..."

What the Chekhov Model Can and Cannot Do

| Capability | Quality |
|---|---|
| Cyrillic, character distribution | Perfect |
| Morphology, cases, endings | Almost always correct |
| Dialogue structure, dashes, replies, authorial speech | Recognizable |
| Chekhov-like style, patronymics, rhythm, vocabulary | Bright |
| Local coherence for 2-3 sentences | Sometimes |
| Holding a topic across a paragraph | No |
| Semantics | Hallucinations |
| Facts | No |

I later poked this simple model with mech-interp analysis, but that is a separate chapter that did not make it into the final article, so as not to overload it.

It is also worth noting that later I trained a 10-million-parameter model with a normal tokenizer, which is discussed below, and on a larger dataset of Russian classics. The behavior did not change much: speech was still incoherent. Apparently, at 10M parameters it is impossible to get the model to "simulate thinking".

A Tribute to Russian Culture

After Chekhov, I wanted to know what would happen if I expanded the corpus many times over, added a BPE tokenizer, and increased the model to roughly GPT-2-nano scale.

Corpus

I expanded the parser to 21 authors: the golden age of Russian prose, including Tolstoy, Dostoevsky, Turgenev, Goncharov, Leskov, Bunin, Kuprin, Gogol, and Andreyev; drama, including Ostrovsky and Griboyedov; literary criticism, including Belinsky, Dobrolyubov, Pisarev, Herzen, and Chernyshevsky; plus smaller classics such as Garshin, Korolenko, Saltykov-Shchedrin, and Lermontov.

Raw parsing produced 369 MB. The same cleaning as for Chekhov, plus new patterns for each author's academic apparatus - Turgenev had French addresses and signatures, Tolstoy had edition variants such as "1868", Dostoevsky had textological markup - resulted in 264 MB of clean corpus.

Normalization Before BPE

Before training the tokenizer, I ran the text through a normalizer:

  • Russian yo was normalized to e, because that letter is used inconsistently in Russian typography, and it is better not to duplicate tokens for the model.
  • Quotation marks were unified.
  • Dashes were unified.
  • Three-dot ellipses were normalized into a single canonical ellipsis form.

This strongly reduces the token vocabulary and gives the model one canonical form for each punctuation mark.
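A sketch of such a normalizer. Only the four categories (yo, quotes, dashes, ellipses) come from the description above; the exact character mapping, including the choice of guillemets and the em dash as canonical forms, is my assumption for illustration.

```python
# Sketch of a pre-tokenization normalizer. The exact mapping is an assumption;
# only the four categories (yo, quotes, dashes, ellipses) come from the text.
REPLACEMENTS = {
    "\u0451": "\u0435",   # lowercase yo -> ye
    "\u0401": "\u0415",   # uppercase yo -> ye
    "\u2013": "\u2014",   # en dash -> em dash
    "\u2015": "\u2014",   # horizontal bar -> em dash
    "\u201e": "\u00ab",   # low opening quote -> left guillemet
    "\u201c": "\u00bb",   # its closing partner -> right guillemet
    "...": "\u2026",      # three dots -> single ellipsis character
}

def normalize(text):
    """Apply every replacement in order and return the canonical text."""
    for old, new in REPLACEMENTS.items():
        text = text.replace(old, new)
    return text

print(normalize("\u201e\u0401\u043b\u043a\u0430\u201c \u2013 \u0441\u043a\u0430\u0437\u0430\u043b \u043e\u043d..."))
```

Quote handling is the fragile part: the same curly mark opens a quotation in English typography and closes one in the low-high Russian style, so a real normalizer needs to know which convention a given source follows.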

Tokenizer: SentencePiece BPE 16k

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_clean.norm.txt",
    model_prefix="spm",
    vocab_size=16000,
    model_type="bpe",
    character_coverage=1.0,
    byte_fallback=True,
    # em dash, guillemets, ellipsis: always kept as single tokens
    user_defined_symbols=["\u2014", "\u00ab", "\u00bb", "\u2026"],
)

user_defined_symbols guarantees that those signs are never split into bytes: the model sees them as atomic tokens. Without this option, an em dash (U+2014, three bytes in UTF-8) could be split into pieces. For Russian classics this is catastrophic: the dash is the main syntactic sign of dialogue.

Tokenizer efficiency: 3.49 characters per token on average. Full words such as "Dostoevsky", "landowner", and "young lady" become one token. First names plus patronymics are two or three tokens.
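How do whole words end up as single tokens? BPE training is a counting loop, not a neural network: find the most frequent adjacent pair of symbols, merge it into a new symbol, repeat. Below is a toy sketch of that loop on an invented three-word corpus; SentencePiece's real implementation is far more elaborate.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: word (as a tuple of characters) -> frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("slow"): 3}
for _ in range(2):
    words = merge_pair(words, most_frequent_pair(words))
print(words)
```

After two merges the frequent stem "low" is a single symbol in all three words. With 16,000 merges over a large corpus, the same process turns common whole words into single tokens.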

After tokenization: 75M tokens in train.bin.

Architecture, About 46M Parameters

n_layer    = 12
n_head     = 8
n_embd     = 512
block_size = 512         # about 1700 characters in context
vocab_size = 16000
bias       = False
dropout    = 0.1

Parameter count:

| Part | Parameters |
|---|---|
| Token embedding, 16000 x 512, tied | 8.2M |
| Position embedding | 0.26M |
| Per-layer attention + MLP | ~3.15M |
| x 12 layers | 37.8M |
| Total | ~46M |

Training Hyperparameters

The planned 30,000 iterations x 34K tokens is about 1B tokens, or about 14 epochs over the corpus. Repeating the data this many times is far from the single-pass regime that Chinchilla-style estimates assume, but 75M tokens is too few for a 46M model to begin with, and with a limited corpus extra epochs keep helping, especially with local coherence.
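The arithmetic behind these numbers, as a quick sketch. The figure of roughly 20 training tokens per parameter is the common Chinchilla rule of thumb, used here only as an approximation; the other values come from the text.

```python
# Back-of-the-envelope token budget for the 46M model.
# The 20 tokens/parameter figure is the common Chinchilla rule of thumb,
# used here as an approximation.
iterations = 30_000
tokens_per_iter = 34_000        # "34K" from the text
corpus_tokens = 75_000_000
params = 46_000_000

tokens_seen = iterations * tokens_per_iter
epochs = tokens_seen / corpus_tokens
chinchilla_tokens = 20 * params

print(f"tokens seen:        {tokens_seen / 1e9:.2f}B")
print(f"epochs over corpus: {epochs:.1f}")
print(f"rule-of-thumb need: {chinchilla_tokens / 1e9:.2f}B unique tokens")
```

So the total number of tokens seen is in the right ballpark for a model of this size; the catch is that they are the same 75M tokens repeated over and over.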

VRAM: about 3.2 GB with bf16, context 512, batch size 6. One iteration took about 1.5 seconds on an RTX 3050 Mobile.

Progress

iter     0  train loss 9.68 / val 9.68   (~ln(16000), random initialization)
iter   500  ~5.5    warmup finished
iter  3000  ~4.0
iter  8000  ~3.6
iter 14000  ~3.45
iter 21000  ~3.36   best validation, checkpoint saved
iter 24500  ~3.36   plateau, stopped manually

The best checkpoint was saved at step 21000. After that, validation stopped improving while train loss kept creeping down, a typical overfitting tail. With always_save_checkpoint=False, exactly that useful checkpoint was left on disk.

Total training time was about 13 hours of continuous laptop work.

Quality Evolution

On the prompt "yyyy", an intentionally meaningless seed used to inspect the model's attractors, after 21000 iterations the model produced two radically different continuations in two samples.

Sample 1: a play in the style of Ostrovsky:

yyyy! (Leaves the hat.)

Zhmigulina. Well, what, and where else?

Avdotya Vasilievna. Ah, you have been waiting for a long time; and so you will
vanish, and vanish. (Looks out the window.) Go, go; I will wait for you.

Zhmigulina. I will not go. (Leaves.)

Avdotya Vasilievna (not listening to her husband's speeches). There! For the
last time!

Blocks of character names, stage directions in parentheses, semantically almost coherent dialogue. The model introduced relationships between characters by itself through the remark "not listening to her husband's speeches". This is no longer just pattern repetition; it is world-building.

Sample 2: the textual apparatus of an academic edition:

yyyyk same. (Takes the letter.)

26 Instead of: excessive ~ was not // les

38 Instead of: left // furnished

Page 391 2 Instead of: excessive // native

This is the "variants and readings" format from academic collected works. The cleaner did not remove it completely, and the model learned that format as one of the genres of the corpus.

The same effect as with Chekhov, but now the genres are more developed: Ostrovsky versus academic apparatus. The prompt "yyyy" was ambiguous enough to trigger both attractors in different samples.

What the 50M Classics Model Can and Cannot Do

| Capability | Quality |
|---|---|
| Russian grammar | Almost flawless |
| 19th-century classics style | Recognizable |
| Genre switching, prose/play/apparatus | Works |
| Patronymics, gender agreement | Holds |
| Local coherence across a paragraph | 4-6 sentences |
| Holding a topic for 100+ tokens | Rarely |
| Holding a plot across a scene | No |
| Facts | Hallucinations |
| Semantic tasks | No |

The important point: compared to the 10M model, coherent speech is clearly visible here, and it is provided precisely by the increase in parameters.

Entering Dialogue With the Machine

We had built a "wild" model that could only try to plausibly continue the text you typed. Now we needed a model that could conduct a dialogue with you. In other words, we needed to turn the language model into a chatbot, still completely from scratch.

Finding the Right Dataset

Compared with the previous chapter, only the dataset content changes. The easiest option is to choose high-quality distillations of large models from Hugging Face.

At first I wanted to translate Anthropic Opus 4.5/6 distillations into Russian, but then I settled on much larger distillation datasets from Kimi 2.6 and GLM-5, each weighing dozens of gigabytes.

But there was a problem. These datasets were in English, while I needed a Russian-language one. I tried translating them myself with Google's good translation model, Translate-Gemma-4B, but on my Maibenben laptop, with an RTX 3050 (4 GB VRAM) and 16 GB RAM, translation would have taken monstrously long.

Renting GPU machines on vast.ai was an option, but I did not see much need, because I had already found a large Russian dialogue dataset.

I settled on a dialogue dataset from the Russian company ZeroAgency. It more than satisfied my needs: it is fairly large and pays a lot of attention to reasoning.

https://huggingface.co/datasets/ZeroAgency/ru-big-russian-dataset

The dataset had already been split into train and test, so I could proceed to training the model itself.

Dataset Structure and Training Preparation

Meet our guest: big-russian-dataset on Hugging Face, a Russian-language SFT corpus. It contains 19 train files and 1 test file in .parquet format: about 3.9 GB of compressed parquet, about 7 GB after decoding into UTF-8.

Schema

conversation: list
question: string
source: string
topic: string
classified_topic: string
overall_score: int           # 0-10, score from an LLM judge
safety: int                  # 0-10
pii_leak: int                # 0/1
refusal: int                 # 0/1, refusal marker
coherence, relevance, helpful, ...: int  # additional metrics

Size

| Split | Files | Dialogues | After filter |
|---|---|---|---|
| train | 19 | 1.71M | 1,709,621 (99.9%) |
| val, test | 1 | 18.5K | 10,396 (56%) |

After filtering with overall_score >= 6 AND safety >= 8 AND pii_leak = 0:

  • Train passes almost entirely, 99.9%, because the dataset authors had already cleaned the train split.
  • Validation has a score spread from 1 to 10 on purpose, so the model can be evaluated on the full difficulty range. That is why only 10.4K remain out of 18.5K.
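The filter itself is a one-line predicate. A sketch on plain dicts, with field names taken from the schema above and invented sample records:

```python
# Sketch of the quality filter on plain dicts. Field names follow the schema
# above; the sample records are invented for illustration.

def passes_filter(row):
    """Keep a dialogue only if it is good enough, safe, and leaks no PII."""
    return (
        row["overall_score"] >= 6
        and row["safety"] >= 8
        and row["pii_leak"] == 0
    )

rows = [
    {"overall_score": 9, "safety": 10, "pii_leak": 0},
    {"overall_score": 3, "safety": 10, "pii_leak": 0},   # low score
    {"overall_score": 8, "safety": 5, "pii_leak": 0},    # unsafe
    {"overall_score": 8, "safety": 9, "pii_leak": 1},    # leaks PII
]
kept = [r for r in rows if passes_filter(r)]
print(f"{len(kept)} of {len(rows)} kept")  # 1 of 4 kept
```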

Quality Examples From Validation

Low-score example, score 3, removed by the filter:

[user] Who wanted to arrest Thomas Jefferson?

[assistant] Paul Revere was a descendant of Huguenot refugees, as was Henry
Laurens, who signed the Articles of Confederation for South Carolina; Jack
Jouett, who rode from Cuckoo Tavern to warn Thomas Jefferson...

The assistant copied context instead of answering the question. A typical artifact of RAG-generated training data.

High-score example, score 9, passes the filter:

[user] Summarize the article: "Death of a nurse": 50-year-old Vera Maresova
confessed to killing five women and one man in a hospital...

[assistant] Vera Maresova, 50, killed five women and one man over four years.
Police said she committed the murders to make her work easier...

Dialogue Lengths

| Metric | chars | approx. tokens, BPE-32k |
|---|---|---|
| mean | 1521 | ~550 |
| p50 | 1237 | ~450 |
| p90 | 2680 | ~1000 |
| p95 | 2914 | ~1100 |
| p99 | 3957 | ~1500 |

P90 around 1000 tokens motivated block_size=1024 for training, because it covers most dialogues in full.
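One way to sanity-check a context length is to measure what share of dialogues fits into it without truncation. A small sketch; the token lengths are made up, and in practice they would come from running the tokenizer over the dataset.

```python
# Fraction of dialogues that fit entirely into the context window.
# The token lengths below are made up; real ones come from the tokenizer.

def coverage(lengths, block_size):
    """Share of sequences whose length is <= block_size."""
    return sum(1 for n in lengths if n <= block_size) / len(lengths)

lengths = [120, 450, 480, 600, 900, 1000, 1100, 1500, 300, 700]
for block_size in (512, 1024, 2048):
    print(block_size, coverage(lengths, block_size))
```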

Topics

Top 5 in validation: history, 37%; sports, 5%; news, 4%; crime, 4%; politics, 3%. Historical summarization tasks dominate, mostly from the ai-bond/ru-alpaca-summ subset.

Preprocessing

ChatML-style markup with special tokens:

<|system|>
<|user|>
<|assistant|>
<|end|>

All four special tokens were registered as user_defined_symbols in SentencePiece, which guarantees atomic tokenization. They are not split into pieces.

Loss mask: during training, only assistant response tokens are counted, including the closing EOT token. System and user segments get target = -1, the value that nanoGPT passes as ignore_index to F.cross_entropy (PyTorch's own default is -100). Without this, a 50M model would not learn to answer; it would spend its capacity predicting the questions themselves.
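A sketch of how such masked targets can be built. To keep it readable, a whitespace split stands in for SentencePiece and tokens stay as strings instead of integer ids; closing every turn with <|end|> is also an assumption about the exact markup.

```python
# Sketch of building next-token targets with a loss mask.
# A whitespace "tokenizer" stands in for SentencePiece; -1 marks positions
# that the loss should ignore.
IGNORE = -1

def build_example(turns):
    """turns: list of (role, text). Returns (tokens, targets)."""
    tokens, in_loss = [], []
    for role, text in turns:
        tokens.append(f"<|{role}|>")
        in_loss.append(False)                 # role marker is never a target
        body = text.split() + ["<|end|>"]
        tokens.extend(body)
        # Only assistant text and its closing <|end|> contribute to the loss.
        in_loss.extend([role == "assistant"] * len(body))
    # Position i predicts token i + 1, so targets are shifted left by one.
    targets = [
        tokens[i + 1] if in_loss[i + 1] else IGNORE
        for i in range(len(tokens) - 1)
    ]
    return tokens[:-1], targets

tokens, targets = build_example([
    ("user", "Hello"),
    ("assistant", "Hello! How can I help?"),
])
for tok, tgt in zip(tokens, targets):
    print(f"{tok:14} -> {tgt}")
```

The model still reads the user turn as context; it is simply never penalized for failing to predict it.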

After filtering and markup: 1.04B tokens in train.bin, stored as uint16. Of those, 603M tokens are under loss (assistant plus EOT), or 57.7%.

Tokenizer

SentencePiece BPE, vocabulary size 32000:

  • ID 0: padding/unknown depending on tokenizer configuration.
  • ID 1: default control token.
  • IDs 2-5: ChatML special tokens.
  • IDs 6-261: byte fallback.
  • IDs 262-31999: ordinary BPE pieces.

Efficiency: average tokenization density of 5.67 chars/token on Russian text. Whole words like "hello", "great", and "thanks" often fit into one token.

For comparison, the previous 16k-vocab tokenization on Russian classics gave about 3.5 chars/token. Doubling the vocabulary gave a denser representation, so roughly 1.6x more real text fits into the same 1024 context tokens.

Training

I chose the same NanoGPT as the base. The model parameters were as follows.

Architecture, About 48M Parameters

n_layer    = 10
n_head     = 8
n_embd     = 512
ff_dim     = 2048   # 4 x n_embd
block_size = 1024
vocab_size = 32000
bias       = False  # modern standard: no bias in Linear/LN
dropout    = 0.1

Parameter count:

| Part | Parameters |
|---|---|
| Embedding, wte tied with lm_head | 32000 x 512 = 16.4M |
| Position embedding, wpe | 1024 x 512 = 0.5M |
| Per-layer attention, c_attn + c_proj | 4 x 512^2 = 1.05M |
| Per-layer MLP, c_fc + c_proj | 8 x 512^2 = 2.10M |
| Per-layer layer norms, x2 | ~0.001M |
| x 10 layers | ~31.5M |
| Final layer norm | 0.001M |
| Total | 47.85M |
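The rows can be re-added with a short function. One detail worth knowing: summing every row gives about 48.4M, while the 47.85M total matches the count with position embeddings excluded, which, as far as I can tell, is how nanoGPT reports the number by default. The function below is my own sketch for a GPT-2-style layout with tied embeddings and no biases.

```python
def count_params(n_layer, n_embd, vocab_size, block_size, ff_dim=None):
    """Parameter count for a GPT-2-style model with tied embeddings, no biases."""
    ff_dim = ff_dim or 4 * n_embd
    wte = vocab_size * n_embd                 # shared with lm_head
    wpe = block_size * n_embd
    attn = 4 * n_embd * n_embd                # c_attn (3x) + c_proj (1x)
    mlp = 2 * n_embd * ff_dim                 # c_fc + c_proj
    norms = 2 * n_embd                        # two LayerNorms, weight only
    total = wte + wpe + n_layer * (attn + mlp + norms) + n_embd  # + final norm
    return total, total - wpe

total, without_wpe = count_params(10, 512, 32000, 1024, ff_dim=2048)
print(f"{total / 1e6:.2f}M total, {without_wpe / 1e6:.2f}M excluding wpe")
```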

Training Hyperparameters

# AdamW
learning_rate = 3e-4
weight_decay  = 0.1
beta1, beta2  = 0.9, 0.95
grad_clip     = 1.0

# LR schedule, cosine with warmup
warmup_iters    = 200
lr_decay_iters  = 16000
min_lr          = 3e-5

# batch
batch_size                  = 2     # micro-batch
gradient_accumulation_steps = 32    # effective batch = 64 sequences
block_size                  = 1024
# tokens per iter = 2 x 32 x 1024 = 65,536

# training
max_iters = 16000   # about one epoch, 1.04B tokens / 65K tokens per iter

# system
dtype = 'bfloat16'

16000 iterations x 65K tokens = 1.04B tokens. This is roughly one epoch over the corpus.
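The "cosine with warmup" schedule from the config can be sketched as a small function. It follows the usual linear-warmup-then-cosine-decay recipe with the values above; treat it as an illustration of the shape, not a copy of the training script.

```python
import math

# Values from the config above.
learning_rate = 3e-4
min_lr = 3e-5
warmup_iters = 200
lr_decay_iters = 16000

def get_lr(it):
    """Linear warmup, then cosine decay down to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it >= lr_decay_iters:
        return min_lr
    progress = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))   # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

for it in (0, 200, 8100, 16000):
    print(it, f"{get_lr(it):.2e}")
```

The rate peaks at 3e-4 right after warmup, passes the midpoint of 1.65e-4 halfway through the decay, and ends at min_lr on the last iteration.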

Training Progress

iter     0  loss 10.49        (~ln(32000), random initialization)
iter   500  loss ~5           warmup finished, LR at peak
iter  3000  loss ~3
iter  5500  loss ~2.4         first saved checkpoint, val_loss ~3.3
iter  8500  loss ~2.0         second checkpoint
iter 11500  loss ~1.7         third checkpoint
iter 14500  loss ~1.5         fourth checkpoint, best val_loss ~1.8
iter 16000  max_iters reached; no further checkpoint saved
            validation loss stopped improving near the end of cosine decay

Iteration time: about 3 seconds cold, about 4-4.5 seconds once thermally settled. Total time was about 13 hours.

VRAM: about 2.8 GB used with batch size 2, context 1024, bf16. There is headroom under the 4 GB limit.

always_save_checkpoint = False means saving only when validation loss improves. This is insurance: if something goes wrong near the end, the best version remains on disk. In practice, the last checkpoint was at iteration 14500.

Generation Quality Evolution

We tested through chat.py on one prompt, "Hello", at different training moments.

Iter ~6000, early stage:

Hello! Of course, here is a sentence with the sentence "Throw it to work"
for convenient screen reading:

1. "Plan your tasks and assignments" - this is a simple and understandable way...

[grammar is clean, but the content is patchwork]

Iter ~12000, middle:

Tell me about yourself:

As an artificial intelligence, I do not have the ability to provide me with
exact answers to your questions. Step 1: Defining the task. First of all, one
must determine how to do it. It may be like this: "I will be glad that my task
is to help with tasks"...

It has mastered the markdown tutorial genre, but not meaning.

Iter ~14500, final:

Hello
> Hello! I am ready to help you achieve your goals. Have a good day!

What is 7 multiplied by 8?
> To find out how much 7 multiplied by 8 is, we need to multiply the quantity
by 8. First let us calculate the total number of ways to divide this number by 8:
5! = (5 x 8) / 8 = 120

Write bubble sort in Python.
> Here is an example of bubble sort in Python...
def merge_sort(arr):
    sorted_array = {}
    for i in range(len(arr)):
        if arr[i]

Now let us unpack all three examples.

  • The first example: it understood the situation perfectly, greeted the user, and offered help.
  • The second example: mathematically a mess, but the genre and grammar are ideal. The model clearly understood what was being asked of it.
  • The third example: function name merge_sort instead of bubble_sort, dictionary instead of two loops. The form is correct; the implementation is meaningless.

In my opinion, this is a very decent result for a model of this size, although obviously it cannot be used in production. But it has hardly reached its ceiling. Generation quality should improve as the dataset grows.

What the Final SFT Model Can Do

| Capability | Quality |
| --- | --- |
| Russian grammar | Flawless: cases, agreement, syntax |
| Chat format | Reliably answers as an assistant |
| Markdown structure | Imitates GPT-4 style: lists, bold, code fences |
| Self-identification | Says the right words: "I am an AI assistant" |
| EOS completion | Usually stops by itself |
| Local coherence, 1-2 sentences | Sometimes meaningful |
| Answering on topic | Hears trigger words, not the essence |
| Facts | Hallucinations |
| Arithmetic | Imitates calculation without calculating |
| Logic, multi-step reasoning | Absent |
| Code, syntactic and semantic | Shape is correct, code does not work |

Mind.in.a.box, in Go

I wanted to share these artifacts with friends and relatives, but the problem was that they know nothing about llama.cpp or similar software.

The solution: make a single binary that can be sent to a friend in Telegram so they can run it.

There is also something appealing about launching a model that behaves like an intelligent entity the way you would launch a small game: from one compact .exe.

Options We Considered

| Approach | Problem |
| --- | --- |
| PyInstaller, Python to exe | Heavy file, about 700 MB to 1 GB; slow startup; not a "real" single binary |
| Go + ONNX Runtime | Requires ONNX Runtime DLL next to the executable, so not one file |
| Go + llama.cpp via CGO | Requires static llama.cpp build on Windows, MSVC headache |
| Pure Go (chosen) | We write the forward pass and BPE encoder ourselves, but get a real single binary |

Solution Architecture

go_serve/
|-- export_weights.py   # ckpt.pt -> weights.bin (fp16) + config.json + vocab.json
|-- go.mod              # no external dependencies, zero deps
|-- embed.go            # //go:embed for weights/config/vocab
|-- config.go           # config.json parsing
|-- tokenizer.go        # pure-Go SentencePiece BPE encoder
|-- model.go            # forward pass + KV cache + sampling
`-- main.go             # chat REPL with slash commands

go build produces one 97 MB .exe with everything embedded.

Key Technical Decisions

1. fp16 Weights in a .bin File

Weights in state_dict are stored in fp32. Before embedding, we convert them to fp16: file size becomes 2x smaller, while precision is acceptable for inference. They are unpacked into fp32 on load via float16ToFloat32.

About 48M parameters x 2 bytes is roughly 96 million bytes, or about 92 MiB for the whole model.
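The size estimate can be reproduced from the configuration that the chat banner prints (10 layers, n_embd=512, ctx=1024, vocab=32000). The sketch below assumes GPT-2-style blocks with tied input/output embeddings and no bias terms; those assumptions are mine, so the exact count may differ slightly from the real checkpoint. It also round-trips one value through half precision to show the kind of rounding fp16 introduces.

```python
import struct

# Values from the chat banner. Assumed: GPT-2-style blocks, tied embeddings, no biases.
n_layer, n_embd, block_size, vocab_size = 10, 512, 1024, 32000

embeddings = vocab_size * n_embd + block_size * n_embd  # token + position tables
attention = 3 * n_embd * n_embd + n_embd * n_embd       # qkv projection + output projection
mlp = 2 * 4 * n_embd * n_embd                           # up and down projections (4x hidden)
layer_norms = 2 * n_embd                                # two LayerNorm weight vectors per block

# The trailing n_embd is the final LayerNorm weight vector.
n_params = embeddings + n_layer * (attention + mlp + layer_norms) + n_embd
fp16_bytes = n_params * 2

print(n_params)                       # 48376320
print(round(fp16_bytes / 2**20, 1))   # 92.3 (MiB)

# Round-trip one weight through IEEE half precision ('e' format in struct).
w = 0.1
w16 = struct.unpack("<e", struct.pack("<e", w))[0]
print(w16, abs(w - w16))  # the error is small compared with typical weight magnitudes
```

Under these assumptions the count lands at about 48.4M parameters, which is consistent with both the "50M" label and the 92 MB weights file.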

2. KV Cache

Without a cache, every generation step recomputes the whole prefix, so attention costs O(T^2) per token. With a cache, it is O(T) per token. For a 200-token answer the average prefix is about 100 tokens long, so the difference is roughly 100x.

type Model struct {
    KCache [][]float32  // [layer][token_pos * n_embd]
    VCache [][]float32
}

// On each forward:
m.KCache[l] = append(m.KCache[l], k...)  // added new K
m.VCache[l] = append(m.VCache[l], v...)
// attention works over the accumulated cache
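The "roughly 100x" figure can be checked with simple counting. This sketch only counts how many token positions go through the network while generating an answer; it ignores the prompt length and the fact that attention itself still reads the whole cache, so treat it as a back-of-the-envelope estimate.

```python
def positions_processed(answer_len, use_cache):
    """Count token positions pushed through the network while generating."""
    if use_cache:
        return answer_len  # one new position per step
    # Without a cache the whole prefix is recomputed at every step: 1 + 2 + ... + T.
    return sum(range(1, answer_len + 1))


T = 200
without_cache = positions_processed(T, use_cache=False)
with_cache = positions_processed(T, use_cache=True)
print(without_cache, with_cache, without_cache / with_cache)  # 20100 200 100.5
```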

3. Parallel matVec Through Goroutines

In nanoGPT-style single-token inference, the main operation is matrix-vector multiplication: y = W @ x, where W has shape [out, in], x has shape [in], and y has shape [out].

The simplest implementation is a double loop, O(out x in). On a 4-core CPU, single-threaded code uses only a quarter of the available resource. We parallelized over rows:

func matVec(W []float32, rows, cols int, x, y []float32) {
    nworkers := runtime.NumCPU()
    chunk := (rows + nworkers - 1) / nworkers
    var wg sync.WaitGroup

    for w := 0; w < nworkers; w++ {
        start := w * chunk
        end := start + chunk
        if end > rows {
            end = rows
        }
        if start >= end {
            continue
        }

        wg.Add(1)
        go func(start, end int) {
            defer wg.Done()
            for r := start; r < end; r++ {
                sum := float32(0)
                row := W[r*cols : (r+1)*cols]
                for c := 0; c < cols; c++ {
                    sum += row[c] * x[c]
                }
                y[r] = sum
            }
        }(start, end)
    }

    wg.Wait()
}

This gives a 3-4x speedup on a 4-core CPU without BLAS.
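The row partitioning in matVec is the part that is easiest to get wrong, so here is the same ceiling-division logic in isolation. It is a Python transcription of the chunk arithmetic only (the function name `row_chunks` is mine), useful for checking that every row is covered exactly once, including when there are fewer rows than workers.

```python
def row_chunks(rows, nworkers):
    """Split [0, rows) into at most nworkers contiguous ranges, as matVec does."""
    chunk = (rows + nworkers - 1) // nworkers  # ceiling division
    ranges = []
    for w in range(nworkers):
        start = w * chunk
        end = min(start + chunk, rows)
        if start >= end:
            continue  # more workers than rows: this worker gets nothing
        ranges.append((start, end))
    return ranges


print(row_chunks(10, 4))  # [(0, 3), (3, 6), (6, 9), (9, 10)]
print(row_chunks(2, 4))   # [(0, 1), (1, 2)]
```

Each worker writes to its own slice of y, which is why the Go version needs no mutex around the output.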

4. Parallel Attention Heads

All attention heads are computed concurrently and joined through a sync.WaitGroup, using the one-token forward pass and the KV cache. Heads are independent of each other, so none of them waits on another.

5. Pure-Go SentencePiece BPE Encoder, or: The Rake Collection

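The real SentencePiece BPE encoder repeatedly applies the best-ranked merge until no merges are left. We replaced it with a simpler heuristic: take the longest vocabulary piece that starts at the current position, and fall back to a single byte if nothing matches. This greedy longest-match is not the exact algorithm and can split some words differently, but in practice the result is close: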

func (t *Tokenizer) segmentBPE(s string) []int {
    out := make([]int, 0, len(s)/3)
    i := 0
    for i < len(s) {
        matched := false
        for j := len(s); j > i; j-- {
            if id, ok := t.pieceToID[s[i:j]]; ok {
                out = append(out, id)
                i = j
                matched = true
                break
            }
        }
        if !matched {
            out = append(out, t.byteToID[s[i]])  // byte fallback
            i++
        }
    }
    return out
}
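To make the heuristic easy to experiment with, here is the same greedy longest-match loop in Python with a toy vocabulary. The vocabulary, the IDs, and the byte-fallback offset of 100 are all invented for the example; the real vocab.json has 32k pieces. Like the Go version, it works on UTF-8 bytes, so the fallback always consumes exactly one byte.

```python
def segment_longest_match(text, piece_to_id, byte_to_id):
    """Greedy longest-match segmentation with byte fallback (an approximation of BPE)."""
    data = text.encode("utf-8")
    out = []
    i = 0
    while i < len(data):
        # Try the longest candidate first and shrink until a known piece is found.
        for j in range(len(data), i, -1):
            piece = data[i:j]
            if piece in piece_to_id:
                out.append(piece_to_id[piece])
                i = j
                break
        else:
            # No piece starts here: emit a single byte token.
            out.append(byte_to_id[data[i]])
            i += 1
    return out


# Toy vocabulary (hypothetical IDs).
piece_to_id = {b"H": 3, b"Hel": 0, b"lo": 1, b"Hello": 2}
byte_to_id = {b: 100 + b for b in range(256)}

print(segment_longest_match("Hello!", piece_to_id, byte_to_id))  # [2, 133]
```

"Hello" matches as one piece even though shorter pieces are also present, and "!" (byte 33) is not in the toy vocabulary, so it falls back to the byte token 133.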

Artifact Sizes

| File | Size |
| --- | --- |
| weights.bin, 50M params, fp16 | 92 MB |
| vocab.json, 32k pieces with scores | 1.8 MB |
| config.json | ~12 KB |
| Final exe after go build -ldflags="-s -w" | 97 MB |

After upx --best, it compresses to about 70 MB.

Performance

$ printf '/max_tokens 50\nHello\n/quit\n' | ./nanogpt-chat.exe
=== nanoGPT chat (Go single-binary) ===
Model: 10 layers, 8 heads, n_embd=512, ctx=1024, vocab=32000
Weights loaded in 189ms
you> Hello
bot> Hello! I am ready to help you achieve your goals. Have a good day!
[15 tokens in 418ms, 35.9 tok/s]

Comparison on the same machine, same model, same prompt:

| Stack | Speed |
| --- | --- |
| Python + PyTorch + CUDA, GPU | ~14 tok/s |
| Python + PyTorch + CPU | not measured, expected ~3-5 tok/s |
| Go + parallel matVec + CPU | 35.9 tok/s |

PyTorch overhead does not pay off for single-token inference of a tiny model. Each token means dozens of kernel launches, with CPU-GPU synchronization between them, which slows things down even more. On a 50M model, this dominates the actual computation.

Cross-Compile

CGO is not used, so cross-compilation is trivial:

$env:GOOS="linux";   go build -ldflags="-s -w" -o nanogpt-chat-linux .
$env:GOOS="darwin";  go build -ldflags="-s -w" -o nanogpt-chat-mac .
$env:GOOS="windows"; go build -ldflags="-s -w" -o nanogpt-chat.exe .

The commands above use PowerShell syntax for setting GOOS. All three variants can be built from any platform that has the Go toolchain.

Features and Limitations

  • CPU only. No GPU acceleration. For 50M this is fine; for 1B+ it will already be slow.
  • fp32 inference. No quantization, no int8/int4. Model in RAM is about 190 MB.
  • No batching. One user, one session at a time. A server scenario would need batching around it.
  • Sampling uses sort.Slice, which is O(V * log V). With a 32k vocab this is not critical, but it could be sped up with partial sort.
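The partial-sort idea from the last bullet can be sketched as follows. I am assuming top-k sampling here, which the article does not state explicitly; the function name and parameters are illustrative. A heap-based selection of the k largest logits costs about O(V * log k) instead of O(V * log V) for a full sort.

```python
import heapq
import math
import random


def sample_top_k(logits, k, temperature=1.0, rng=random):
    """Sample a token id from the k highest logits without sorting the full vocabulary."""
    # heapq.nlargest keeps a heap of size k while scanning the vocabulary once.
    top = heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    weights = [math.exp((logits[i] - m) / temperature) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.0, 5.0, 1.0, 4.0, -2.0]
rng = random.Random(0)
print(sample_top_k(logits, k=2, rng=rng))  # always 1 or 3, the two largest logits
```

The same idea carries over to Go with container/heap or a quickselect-style partition.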

How to Accidentally Turn a Model Into a Mad Philosopher

The model still had obvious problems: despite the significantly higher quality, it could not correctly answer almost any question you asked. The obvious solution was to increase the number of iterations and the amount of training data. But we decided to take a more interesting path: preference optimization, a close relative of reinforcement learning from human feedback. There were two approaches to choose from, KTO and DPO.

KTO was chosen instead of DPO because DPO teaches comparative judgment: "X is better than Y." KTO teaches an absolute judgment: "this is good / this is bad relative to some baseline." For our task, the absolute signal is more precise. You are not telling the model "rejected is worse than chosen"; you are telling it "these three patterns are bad, period."

Collecting Pairs

Strategy: for every prompt from train, take chosen, the original high-score answer from the dataset, and rejected, generated by our SFT model with settings that provoke the needed failure mode.

Three presets, each catching its own type of error:

| preset | temperature | rep_penalty | max_tokens | target failure |
| --- | --- | --- | --- | --- |
| loops | 1.20 | 1.00 (off) | 400 | token-level loops |
| canned | 0.40 | 1.15 | 80 | short canned templates |
| tutorial | 0.70 | 1.10 | 500 | long markdown walls |

1000 pairs per preset: 3000 pairs total. Collection took about 3 hours.
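The collection strategy can be written down as data plus a short loop. This is a sketch, not the actual collection script: the preset values come from the table above, while `collect_pairs`, the record layout, and the `generate` callable are hypothetical names standing in for whatever sampler wraps the SFT model.

```python
# Preset values are taken from the table; everything else is illustrative.
PRESETS = {
    "loops":    {"temperature": 1.20, "rep_penalty": 1.00, "max_tokens": 400},
    "canned":   {"temperature": 0.40, "rep_penalty": 1.15, "max_tokens": 80},
    "tutorial": {"temperature": 0.70, "rep_penalty": 1.10, "max_tokens": 500},
}


def collect_pairs(examples, generate, preset_name):
    """Build (prompt, chosen, rejected) records for one preset.

    examples: iterable of (prompt, chosen) taken from the training set.
    generate: caller-supplied function (hypothetical) that samples from the SFT model.
    """
    settings = PRESETS[preset_name]
    pairs = []
    for prompt, chosen in examples:
        rejected = generate(prompt, **settings)
        pairs.append({"prompt": prompt, "chosen": chosen,
                      "rejected": rejected, "preset": preset_name})
    return pairs


# Usage with a stub sampler in place of the real model:
def fake_generate(prompt, temperature, rep_penalty, max_tokens):
    return f"[sampled at T={temperature}]"


pairs = collect_pairs([("Hello", "Hello! How can I help?")], fake_generate, "canned")
print(pairs[0]["rejected"])  # [sampled at T=0.4]
```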

The quality of rejected was confirmed by checking random samples:

  • loops: token salad, nonsense.
  • canned: on "start a small business", it produced a template like "1. Collect information 2. Analyze data 3. Create reports" with no connection to the topic.
  • tutorial: on a prompt about neuroplasticity, it produced "### 1. Understanding the topic" with a quote from English text, without answering.

All three presets worked as intended: rejected was clearly worse than chosen.

Implementation

train_kto.py, about 280 lines, is my own KTO loss implementation on top of nanoGPT.

# Forward, 4 forwards for each triplet:
ref_lp_chosen   = sequence_logp(ref, prompt, chosen)     # frozen, no_grad
ref_lp_rejected = sequence_logp(ref, prompt, rejected)   # frozen, no_grad
pol_lp_chosen   = sequence_logp(policy, prompt, chosen)  # gradients
pol_lp_rejected = sequence_logp(policy, prompt, rejected)

# Length-normalized log-ratios
chosen_lr   = (pol_lp_chosen - ref_lp_chosen) / len(chosen)
rejected_lr = (pol_lp_rejected - ref_lp_rejected) / len(rejected)

# KTO loss
z = max(0, z_ref)  # IMPORTANT: clamp to [0, +inf); I forgot this line in v1
L_chosen   = 1 - sigmoid(beta * (chosen_lr - z))
L_rejected = 1 - sigmoid(beta * (z - rejected_lr))
loss = lambda_d * L_chosen + lambda_u * L_rejected

# z_ref is updated as an EMA of the mean log-ratio, for diagnostics
z_ref = 0.99 * z_ref + 0.01 * (chosen_lr + rejected_lr) / 2
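To see what the clamp does, the loss above can be evaluated on plain numbers. This is a scalar sketch of the same formulas, not train_kto.py: I assume lambda_d = lambda_u = 1, and the log-ratio values in the second example are invented to resemble a heavily drifted model, not taken from the run.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def kto_loss(chosen_lr, rejected_lr, z_ref, beta=0.1,
             lambda_d=1.0, lambda_u=1.0, clamp=True):
    """Scalar KTO loss for one pair; lambda values of 1.0 are an assumption."""
    z = max(0.0, z_ref) if clamp else z_ref
    l_chosen = 1.0 - sigmoid(beta * (chosen_lr - z))
    l_rejected = 1.0 - sigmoid(beta * (z - rejected_lr))
    return lambda_d * l_chosen + lambda_u * l_rejected


# At initialization policy == reference, so all log-ratios are zero.
print(kto_loss(0.0, 0.0, 0.0))  # 1.0

# Hypothetical drifted state: both answers became less likely, rejected far more so.
drifted = dict(chosen_lr=-2.0, rejected_lr=-72.0, z_ref=-37.0)
print(kto_loss(**drifted, clamp=True), kto_loss(**drifted, clamp=False))
```

The starting value of 1.0 agrees with the loss of about 0.9997 logged at iteration 0. In the drifted example the unclamped loss comes out much lower than the clamped one: a negative baseline lets the chosen term look satisfied even though the chosen answer itself became less likely.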

Attempt 1: beta=0.1, max_iters=2000

We launched with default parameters:

beta             = 0.1
learning_rate    = 5e-6
max_iters        = 2000
batch_size       = 1
grad_accum       = 16
block_size       = 768

Progress:

iter    0: loss=0.9997  gap=+0.01   z_ref=-0.001
iter   50: loss=0.9949  gap=+0.20   z_ref=-0.069
iter  100: ?
iter  500: loss=~0.5    gap=+30     z_ref=-15        suspicious
iter  900: loss=0.24    gap=+55     z_ref=-37        val_acc=1.000, saved
iter 1500: loss=0.18    gap=+70     z_ref=-40        val_acc=1.000, best
iter 1900: loss=0.31    gap=~+70    z_ref=-40        val_acc=0.93, degrading

val_acc=1.000 by iteration 900: the model perfectly distinguishes chosen from rejected on validation. But z_ref=-37 is a colossal shift. Because the log-ratios are length-normalized, this is a drop of about 37 nats per token in the mean log-ratio. In other words, on average the model sharply lowered the likelihood of both answers relative to the reference model.
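A quick calculation shows how large that shift is. The only input is the logged z_ref of about -37; the 20-token answer length is an arbitrary example.

```python
import math

per_token_shift = -37.0  # mean length-normalized log-ratio, in nats

# Probability ratio policy/reference for a single token at this shift.
per_token_factor = math.exp(per_token_shift)
print(per_token_factor)  # about 8.5e-17

# For a 20-token answer the ratio underflows quickly, so report it in log10.
log10_ratio_20_tokens = 20 * per_token_shift / math.log(10)
print(round(log10_ratio_20_tokens, 1))  # -321.4
```

A per-token probability ratio on the order of 1e-16 means the policy no longer resembles the reference model at all, which matches the destroyed generations in the chat test.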

Test in chat.py on the checkpoint from iteration 1500, best validation loss:

you> Hello
bot> Alexey! One. You know you on Earth. I told you again, you here
your name is you,, your day, Alexey. Of course. You always know yours on Earth!

you> Tell me about yourself
bot> - pledge - pledge serves as the basis that - from the state, the state
- and-em of the state{, to and,o and-e - to eternal...
[broken UTF-8, destroyed syntax, fragments of random words]

The model was completely destroyed. It still "distinguishes" chosen/rejected on validation, with accuracy 1, but it stopped being an assistant and began producing strange poetic nonsense.

Curious Side Effect: "Artistic Mode"

After both KTO attempts, the model produced not just nonsense, but nonsense in a specific stylistic register: pseudo-philosophical.

you> Hello
bot> You know your special soul? You want, smile; You know your day,
you may be waited. Together? Only everything - wisdom.

you> Tell me about yourself
bot> - pledge - pledge serves as the basis that - from the state, the state
- and-em of the state{, to and,o and-e - to eternal...

Notice the second-person address, emotionally colored vocabulary such as "wisdom", "eternal", and "Earth", abrupt images, and neologisms. This is not random garbage. It is a recognizable genre, and the model was never explicitly trained on it. Where did it come from?

Hypothesis: inverse mode collapse.

  • In SFT train, about 1-2% of the corpus is writing, literature, language, and linguistics. Summarization tasks sometimes also contain fragments of fiction as input. The model saw this register weakly, but it saw it.
  • Our rejected samples, canned + tutorial + loops, are the dominant modes of the distribution. KTO pushed them down.
  • The distribution remains normalized through softmax. When frequent modes are pushed down, the remaining rare modes receive a relatively larger share of the probability mass.
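The normalization argument in the last bullet can be illustrated with a toy softmax over four "modes". The logits and the size of the penalty are invented; the point is only the direction of the effect, not its real magnitude inside the model.

```python
import math


def softmax(logits):
    """Softmax over a dict of named logits."""
    m = max(logits.values())
    exps = {name: math.exp(v - m) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


# Hypothetical mode logits: three dominant failure modes and one rare literary register.
before_logits = {"canned": 4.0, "tutorial": 3.5, "loops": 3.0, "literary": 0.0}
# Preference training pushes down only the three targeted modes.
after_logits = {name: (v - 6.0 if name != "literary" else v)
                for name, v in before_logits.items()}

before = softmax(before_logits)
after = softmax(after_logits)
print(round(before["literary"], 3), round(after["literary"], 3))  # 0.009 0.789
```

The literary logit never changed, yet its share went from under 1% to the majority, simply because everything around it was suppressed.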

Ordinary mode collapse means the model converges to one frequent pattern. Here it is the reverse: we killed frequent patterns, and the model converged into rare ones.

This is a vivid side example showing that preference learning rewrites not only the thing it is aimed at, but the whole distribution. A narrow KTO signal over three failure modes unexpectedly rebuilt the model's entire generative geometry.

Conclusions

This adventure suggests several interesting conclusions.

  • A language model can be trained from scratch and turned into a chatbot using only a dialogue dataset. Its world model will be poorer than that of a model that went through pretraining on raw text, but in production this can be patched with RAG.
  • Somewhere between 10M and 50M parameters there is a boundary where a model starts absorbing something that resembles human thinking. At 10M parameters we get incoherent muttering that loses the thread after a couple of words. At 50M parameters we get a model that can generate coherent text and even conduct dialogue.
  • At a certain scale, a transformer-based language model begins to demonstrate behavior that, in humans, requires thinking. Even the mistakes the model makes resemble the mistakes humans make when trying to remember something. This raises interesting questions about their nature.

Sources

Dialogue model Mini-Tron-50: https://huggingface.co/Imperius/mini-tron-50

My corpus of Russian classics and publicist writing from the 19th and early 20th centuries: https://huggingface.co/datasets/Imperius/ru-classic

Parser and tokenizer for Russian classics and publicist writing of the 19th century: https://codeberg.org/imperius/libru-classics-bpe

Code for the 10M-parameter LLM trained on Russian classics, based on NanoGPT: https://codeberg.org/imperius/nanogpt-chekhov

Code for the 50M-parameter LLM trained on Russian classics, based on NanoGPT: https://codeberg.org/imperius/nanogpt-ru-classics

Code for the 50M-parameter dialogue LLM trained on the dialogue dataset, based on NanoGPT: https://codeberg.org/imperius/mini-tron-50
