
SEN LLC

Text Generation Before Transformers: Building a Markov Chain in 200 Lines of Python

A zero-dependency CLI that trains a Markov model on any text file, saves it as JSON, and generates new text from it. The point isn't the tool; the point is to understand what's actually happening when a model "generates text", in the one setting simple enough that you can watch every step.

Language models used to be simple enough to fit on a whiteboard. Then they became too big to fit in a data center. Somewhere between those two extremes is the Markov chain, and if you've never implemented one, you've missed the single best pedagogical experience in NLP: not because Markov chains are what we use today, but because they're exactly comprehensible. Every step of the algorithm is inspectable. There's no "the model learned it". There's no embedding space, no attention, no gradient. There's just counting, and then rolling a weighted die.

markov-gen is a small CLI I built to make that experience easy to reach. Three commands, zero dependencies, about 200 lines of Python. You train a model on a text file, save it as JSON, and generate new text. The model is a dict (a literal dict[tuple[str, ...], Counter]) and you can open the JSON in your editor if you want to see what it learned.

This post walks through what Markov chains actually capture, the tradeoff between "order" and creativity, the little problems you only notice when you implement one yourself, and an honest accounting of what they can't do (which is a lot).

📦 GitHub: https://github.com/sen-ltd/markov-gen

markov-gen in action

The problem, stated without hand-waving

You have a text. You want new text that "sounds like" it. How?

The Markov-chain answer is: look at what followed each short phrase, and when generating, reuse those continuations at the same frequencies you saw them at. That's it. A Markov model of order N is a table keyed by N-grams (tuples of N consecutive tokens), where each value is a frequency distribution over "tokens that followed this N-gram in the training corpus".

If the training text is:

the quick brown fox jumps over the lazy dog the quick red fox

and the order is 2, then the model is:

("the", "quick")   -> {"brown": 1, "red": 1}
("quick", "brown") -> {"fox": 1}
("brown", "fox")   -> {"jumps": 1}
("fox", "jumps")   -> {"over": 1}
("jumps", "over")  -> {"the": 1}
("over", "the")    -> {"lazy": 1}
("the", "lazy")    -> {"dog": 1}
("lazy", "dog")    -> {"the": 1}
("dog", "the")     -> {"quick": 1}
("quick", "red")   -> {"fox": 1}

To generate, pick a starting N-gram, sample from its follow-up distribution to get the next token, slide the window forward by one, repeat. That's the whole algorithm. The only "model" in "language model" here is literally the dict above.
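To make that concrete, here's a hand-runnable trace of the loop on the toy order-2 model above (a sketch with variable names of my own; the repo's actual generate function comes later in the post):

```python
import random

# The toy order-2 model from above (only the states this trace can reach).
model = {
    ("the", "quick"): {"brown": 1, "red": 1},
    ("quick", "brown"): {"fox": 1},
    ("brown", "fox"): {"jumps": 1},
}

rng = random.Random(42)
current = ("the", "quick")   # starting 2-gram
out = list(current)

for _ in range(3):
    counter = model.get(current)
    if counter is None:
        break                       # dead end: this 2-gram has no observed successor
    tokens = sorted(counter)
    nxt = rng.choices(tokens, weights=[counter[t] for t in tokens], k=1)[0]
    out.append(nxt)
    current = tuple(out[-2:])       # slide the window forward by one token

print(" ".join(out))  # either "the quick brown fox jumps" or "the quick red"
```

If the die lands on "red", the next state ("quick", "red") was only ever seen at the end of the corpus, so the chain dead-ends immediately; if it lands on "brown", the deterministic tail "fox jumps" follows.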

This is unglamorous, and that's the point. Everything you might be tempted to wave your hands at in a transformer ("the model attends to relevant context", "it learns representations of the words", "it implicitly learns syntax") has no analogue here. The Markov chain has no representations. It has no syntax. It doesn't even know that "fox" and "foxes" are related words. All it has is: after this specific two-word phrase, here are the specific tokens I saw next, and how often.

That constraint is the whole educational value. You can see the limits of the model because the model has no hidden state you could blame things on.

The algorithm, in code

Training is one function:

from collections import Counter, defaultdict

def build_model(tokens: list[str], order: int) -> dict[tuple[str, ...], Counter]:
    if not (1 <= order <= 5):
        raise ValueError(f"order must be between 1 and 5, got {order}")
    if len(tokens) <= order:
        raise ValueError(f"corpus too small for order {order}")

    model = defaultdict(Counter)
    for i in range(len(tokens) - order):
        key = tuple(tokens[i : i + order])
        nxt = tokens[i + order]
        model[key][nxt] += 1
    return dict(model)

That's literally it. Slide a window of size N across the token stream, record what follows each window. I've never been able to beat those seven lines in terms of "understanding returned per line of code".
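A quick sanity check against the toy corpus from earlier, with the same sliding-window count inlined so the snippet runs standalone:

```python
from collections import Counter, defaultdict

tokens = "the quick brown fox jumps over the lazy dog the quick red fox".split()
order = 2

# Same sliding-window count as build_model above.
model = defaultdict(Counter)
for i in range(len(tokens) - order):
    model[tuple(tokens[i : i + order])][tokens[i + order]] += 1

print(model[("the", "quick")])  # Counter({'brown': 1, 'red': 1})
print(len(model))               # 10 distinct 2-gram states
```

Ten states, matching the table listed earlier, with ("the", "quick") the only state that branches.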

Generation is one more function:

import random

def generate(model, order, length, rng: random.Random, start=None):
    current = start or _pick_start_key(model, rng)
    out = list(current)
    while len(out) < length:
        counter = model.get(current)
        if not counter:
        break  # dead end: terminate cleanly
        nxt = _weighted_choice(counter, rng)
        out.append(nxt)
        current = tuple(out[-order:])
    return out

def _weighted_choice(counter, rng: random.Random) -> str:
    items = sorted(counter.items())  # sort for determinism
    choices = [k for k, _ in items]
    weights = [w for _, w in items]
    return rng.choices(choices, weights=weights, k=1)[0]

The only subtle detail is sorted(counter.items()) in _weighted_choice. Python's dict is insertion-ordered since 3.7, but across different trainings or different JSON load orders you can end up with different insertion histories, which would make the RNG produce different output from "the same" seed depending on how you got there. Sorting the items before sampling pins the iteration order and makes the output reproducible bit-for-bit across runs. That sounds like a detail until you try to write a test for it.
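Here's a minimal demonstration of the problem (names are mine, not markov-gen internals): two counters with the same contents but different insertion order feed the same random draw to different tokens unless you sort first.

```python
import random

# Same contents, different insertion order, as two training runs might produce.
a = {"brown": 1, "red": 1}
b = {"red": 1, "brown": 1}

def weighted_choice(counter, rng):
    items = sorted(counter.items())  # pin the iteration order
    return rng.choices([k for k, _ in items],
                       weights=[w for _, w in items], k=1)[0]

# Without sorting, the same seed maps the same draw to different tokens:
unsorted_a = random.Random(7).choices(list(a), k=1)[0]
unsorted_b = random.Random(7).choices(list(b), k=1)[0]
assert unsorted_a != unsorted_b

# With sorting, insertion history no longer matters:
assert weighted_choice(a, random.Random(7)) == weighted_choice(b, random.Random(7))
```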

Speaking of which: the RNG is passed in explicitly. rng: random.Random. Not import random; random.choices(...). This is the single most important design decision in the whole codebase, because it means the test suite can inject a Random(42) and assert on exact output:

def test_generate_is_deterministic_with_same_seed():
    model = build_model(corpus, order=2)
    a = generate(model, 2, 20, random.Random(42))
    b = generate(model, 2, 20, random.Random(42))
    assert a == b

Every time I've written "generative" code without injectable RNG, I've regretted it. Tests for probabilistic code that can't be made deterministic are either flaky or don't actually test the thing.

Serializing a dict keyed by tuples

This is the one JSON detail that trips people up. You can't do this:

json.dumps({("the", "quick"): {"brown": 1, "red": 1}})
# TypeError: keys must be str, int, float, bool or None, not tuple

Python tuples aren't valid JSON object keys. You have three options:

  1. Stringify the tuple: "the|quick" as the key. Fast to write, but now you've invented a separator and any token containing it is broken.
  2. Use a list of [key_list, value_dict] pairs: ugly for small examples, but correct for arbitrary tokens.
  3. Nest: {"the": {"quick": {"brown": 1, "red": 1}}}. Pretty for word models, but quickly becomes a nightmare at order 4 with char-mode.

I went with option 2 because it's the only one that round-trips correctly for any input:

def to_json(model, order, mode):
    transitions = []
    for key, counter in sorted(model.items()):
        transitions.append([list(key), dict(counter)])
    return {
        "version": 1,
        "order": order,
        "mode": mode,
        "transitions": transitions,
    }

def from_json(data):
    if data.get("version") != 1:
        raise ValueError(f"unsupported model version: {data.get('version')!r}")
    if data.get("mode") not in ("word", "char"):
        raise ValueError(f"unknown mode: {data.get('mode')!r}")
    model = {}
    for entry in data["transitions"]:
        key_list, counter_dict = entry
        if len(key_list) != data["order"]:
            raise ValueError("transition key length doesn't match order")
        model[tuple(key_list)] = Counter(counter_dict)
    return model, data["order"], data["mode"]

Two notes. First, the version: 1 field is there so that if I ever change the format, I can refuse to load old files instead of silently producing garbage. This costs one line and has saved me more than once. Second, validation is strict on load: wrong version, wrong order type, wrong mode, and mismatched key length are all immediate ValueErrors. It would be easy to skip these and let generate() explode deep inside the sampler. Don't. A clear error at load time is much easier to debug than a weird output two seconds later.
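The round-trip property is easy to check directly. Here's a condensed sketch of the to_json/from_json pair, inlined so it runs standalone, including a token that contains the would-be separator from option 1:

```python
import json
from collections import Counter

# Round-trip sketch using the [key_list, value_dict] pair format above.
model = {("the", "quick"): Counter({"brown": 1, "red": 1}),
         ("pipe|token", "x"): Counter({"y": 2})}  # a token containing '|' survives

data = {"version": 1, "order": 2, "mode": "word",
        "transitions": [[list(k), dict(c)] for k, c in sorted(model.items())]}

blob = json.dumps(data)          # tuples became lists: valid JSON
loaded = json.loads(blob)
restored = {tuple(k): Counter(c) for k, c in loaded["transitions"]}

assert restored == model         # exact round-trip, even with '|' in a token
```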

Order vs creativity

This is the part of Markov chains I find most interesting, because it's a clean illustration of the bias-variance tradeoff in a setting where you can actually see what's happening.

Order 1 (unigram context): at each step, the chain looks at just the previous token and picks a continuation. This has almost no structure. Output will be technically grammatical-looking (because the follow-up distribution after, say, "the" is all words that followed "the" somewhere in the corpus, and those tend to be nouns) but semantically total word salad. "the dog over jumps the brown lazy the fox red..."

Order 2 is usually the sweet spot for medium-sized corpora. Enough context to track short phrases, not so much that you lose all variation. This is where output starts to feel like it could be a sentence.

Order 3 or 4: output becomes noticeably more coherent at a local level, but also noticeably less creative. Each 3-word phrase has fewer possible follow-ups, so you get long runs that are verbatim from the training corpus with small perturbations at the joins.

Order 5: on most real corpora, this degenerates to literally reproducing the source. If your corpus has 10,000 tokens and your order is 5, the vast majority of your 5-grams appear exactly once, which means each N-gram has exactly one successor, which means the chain is fully deterministic. You've effectively just memorized the training text. This is the ur-example of overfitting, and it's satisfying to see it happen in real time when you crank up --order.

You can watch this directly with the stats subcommand:

$ markov-gen train sherlock.txt --order 2 --out o2.json
$ markov-gen stats o2.json
avg entropy bits: 1.21

$ markov-gen train sherlock.txt --order 5 --out o5.json
$ markov-gen stats o5.json
avg entropy bits: 0.08

That "avg entropy bits" number is the Shannon entropy of the next-token distribution, averaged over all states weighted by how often each state was visited. It's a one-number summary of "how many choices does the chain have at each step, on average". 1.21 bits is real branching: a couple of plausible continuations at each state. 0.08 bits means the chain is effectively deterministic: at almost every state, there's just one successor the model has ever seen.
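That statistic is straightforward to compute. Here's a sketch of my own (not markov-gen's exact code) of one reasonable definition:

```python
import math
from collections import Counter

# Shannon entropy of each state's next-token distribution, weighted by how
# often the state was visited (its total outgoing count).
def avg_entropy_bits(model: dict) -> float:
    total_visits = sum(sum(c.values()) for c in model.values())
    weighted = 0.0
    for counter in model.values():
        visits = sum(counter.values())
        h = -sum((n / visits) * math.log2(n / visits) for n in counter.values())
        weighted += (visits / total_visits) * h
    return weighted

# A fully deterministic chain has 0 bits; a 50/50 branch contributes 1 bit.
det = {("a",): Counter({"b": 3})}
branch = {("a",): Counter({"b": 1, "c": 1})}
print(avg_entropy_bits(det))     # 0.0
print(avg_entropy_bits(branch))  # 1.0
```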

The lesson (and this generalizes far beyond Markov chains) is that more context is not always more intelligence. With a fixed corpus, every additional token of context makes each "state" rarer, makes each distribution sparser, and eventually makes the model memorize instead of generalize. It's the small-corpus version of the same problem that transformer trainers fight with pretraining datasets measured in terabytes.

Word mode vs character mode

The CLI supports two tokenizations, and they're useful for different things.

Word mode: whitespace-split. Use for sentence-level generation on prose corpora. Punctuation stays glued to its word ("fox." is one token), which is fine: the chain learns that after "fox." comes a capital-letter word, because that's what the corpus showed it.

Character mode: one token per character. Use for generating pronounceable pseudo-words. Train on a list of fantasy names with order 3, and you get output that looks like plausible fantasy names because the chain has learned the local letter-transition statistics: "th" is often followed by a vowel, "qu" is often followed by a vowel, etc. Before neural text models, this was the standard way to build random-name generators in games, and it still works fine for that use case.

Char mode also makes a nice demo of the order tradeoff, because at order 1 you get gibberish, at order 3 you get almost-words, and at order 5 you get verbatim names from your training list. You can watch the model cross from "creative" to "memorizing" live.
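To see what a char model actually stores, here's a tiny order-3 example over a handful of made-up names (the names and the inlined counting loop are mine, not from the repo):

```python
from collections import Counter, defaultdict

# Char-mode tokenization: one token per character, newline as a name separator.
names = ["theron", "thalia", "quentis", "quorra", "thessaly"]
corpus = list("\n".join(names))

order = 3
model = defaultdict(Counter)
for i in range(len(corpus) - order):
    model[tuple(corpus[i : i + order])][corpus[i + order]] += 1

# Local letter statistics: after "the" this corpus offers 'r' or 's'.
print(model[tuple("the")])  # Counter({'r': 1, 's': 1})
```

The chain only ever emits letter sequences it has seen at the 3-gram level, which is exactly why the output is pronounceable and exactly why order 5 on a short name list just replays the list.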

What Markov chains genuinely can't do

It's worth being honest about this, because a lot of Markov-chain tutorials hand-wave past the limits, and those limits are the reason you don't see Markov chains in production NLP anymore.

No context beyond order N. A word at position 100 has no influence on a word at position 103 unless there's a chain of N-grams linking them. If your corpus has a sentence "The cat, after a long journey through the house and the yard and the garden, finally slept", an order-2 model has no way to know that "slept" was caused by "cat". All it knows is ("garden,", "finally") -> "slept". Long-range dependencies, pronoun resolution, subject-verb agreement over a subordinate clause: none of these work. A transformer's whole pitch is that it does handle these, via attention.

No semantic understanding. The chain has no concept that "dog" and "cat" are similar kinds of things, or that "run" is related to "ran". Each token is an opaque string. Smoothing techniques can help a bit (backoff to shorter N-grams for rare states, etc.) but they don't add semantics.

Sparse-data problem. Mentioned above in the order-vs-creativity section. Higher orders would give better local coherence in theory, but in practice the combinatorial explosion of possible N-grams means most of them are seen zero or one time, and the chain becomes useless. The whole history of statistical NLP from the 1990s through the 2010s was about sophisticated smoothing techniques to paper over this (Kneser-Ney smoothing, interpolation, backoff), and they only kind of worked.

The regurgitation failure mode. At very high order, or very small corpora, the chain's "generation" is indistinguishable from "copy the training text". This is the exact same failure mode LLMs have when they memorize training examples, except that with a Markov chain you can prove it's happening just by looking at the entropy number.
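The backoff idea mentioned above is simple to sketch. This is a stupid-backoff-style lookup of my own, not a markov-gen feature: when the full context is an unseen state, retry with progressively shorter suffixes of the context.

```python
from collections import Counter

# models maps order -> {ngram_tuple: Counter}; try the longest context first.
def lookup_with_backoff(models: dict[int, dict], context: tuple):
    for n in range(len(context), 0, -1):
        counter = models.get(n, {}).get(context[-n:])
        if counter:
            return counter
    return None  # not even the last single token was ever seen

models = {
    2: {("the", "quick"): Counter({"brown": 1})},
    1: {("quick",): Counter({"brown": 1, "red": 1})},
}

# Unseen 2-gram ("a", "quick") backs off to the unigram state ("quick",).
assert lookup_with_backoff(models, ("a", "quick")) == Counter({"brown": 1, "red": 1})
assert lookup_with_backoff(models, ("the", "quick")) == Counter({"brown": 1})
```

This fixes dead ends but, as noted, adds no semantics: the fallback distribution is just a coarser count table.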

If you squint, an LLM is a massively-over-parameterized Markov model with a much larger effective context window and a learned representation of tokens instead of opaque strings. That's not technically accurate (the differences really matter), but it's a useful intuition, and the thing that makes it useful is that you've actually implemented the Markov chain and know exactly what it does and doesn't do. You have a baseline.

Try it in 30 seconds

git clone https://github.com/sen-ltd/markov-gen
cd markov-gen
docker build -t markov-gen .

# Train a small model.
mkdir -p /tmp/mg && cat > /tmp/mg/corpus.txt << 'EOT'
The quick brown fox jumps over the lazy dog.
The brown fox runs fast.
A lazy dog sleeps all day.
The quick brown dog chases the slow fox.
EOT

docker run --rm -v /tmp/mg:/work markov-gen train corpus.txt --order 2 --out model.json
docker run --rm -v /tmp/mg:/work markov-gen generate model.json --length 20 --seed 42
docker run --rm -v /tmp/mg:/work markov-gen stats model.json

Then drop in any text file you like: a Project Gutenberg book, a list of names, your own journal, whatever. Train at order 2, generate. Train at order 4, generate. Train at order 1 and laugh at the word salad. Flip to --mode char with a list of place names and try to generate a plausible fictional city name. Watch the avg entropy bits drop as you raise the order.

At some point, if you haven't before, you'll get the sensation I got the first time I implemented this: the sudden feeling that you know exactly what the model is doing, because you wrote the dict. That feeling doesn't happen with transformers, and it's a surprisingly useful thing to have in your back pocket when you're later trying to reason about why a larger model is behaving strangely. The smaller model is a baseline. Build it once, keep the intuition forever.

Code

All the code is on GitHub under MIT: https://github.com/sen-ltd/markov-gen. It's stdlib-only Python, about 200 lines of non-test code, with 48 tests. The generator takes an injected random.Random so the tests are deterministic. The JSON format is documented. The Dockerfile is alpine multi-stage and the image is under 60 MB. Tear it apart and rewrite it in your favorite language; that's probably the best way to really internalize what's going on.
