Big LLMs are opaque. Billions of parameters, months of training, layers of RLHF and safety filters—good luck tracing why a specific output came out wrong. So I took a different approach: I grabbed a 4,192-parameter GPT you can fit in your head and ran a few experiments to see how it actually fails.
Here's a sketch of what I did, without full code—you can wire it up yourself in an afternoon.
Step 1: Get a Tiny GPT Running
I used Andrej Karpathy's microGPT—about 200 lines of pure Python, no PyTorch, no NumPy. One attention layer, four heads, 16-dimensional embeddings, 16-token context window. It trains on a list of 32,000 names and learns to generate new name-like strings.
The rough idea:
```python
model = MicroGPT(vocab=27, n_embd=16, n_head=4, block_size=16)
for step in range(500):
    loss = model.train_step(batch)
samples = [model.generate(max_len=12) for _ in range(5)]
print(samples)
# >>> ['neleren', 'soleran', 'breran', 'meberer', 'lelerey']
```
Loss drops from ~3.3 (random guessing across 27 characters) to ~2.2. The model learns character-level statistics—common bigrams, plausible syllable patterns, roughly how long names should be. Not real names, but name-shaped.
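That starting loss is exactly the uniform baseline: the cross-entropy of a blind guess over a 27-character vocabulary is ln 27. Quick sanity check:

```python
import math

# cross-entropy of a uniform distribution over 27 characters
print(round(math.log(27), 2))  # → 3.3, the "random guessing" starting loss
```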
You just need it to train and sample. You don't need to understand every line of the autograd yet.
Step 2: Change the Data, Watch It Drift
The idea: Make a copy of the training data with each name reversed (emma → amme, olivia → aivilo). Retrain from scratch. Compare outputs.
For a simple divergence metric, compute character frequency distributions from a batch of generated samples, then calculate KL divergence between the baseline and drifted distributions. Nothing fancy—just count letters and compare.
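A minimal sketch of that counting step, assuming only the standard library (the `char_dist` and `kl_divergence` helpers are illustrative names, and with five-name toy lists the exact value will not reproduce the 0.20 figure, which came from a larger batch):

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def char_dist(samples):
    """Character frequency distribution over a batch of generated names."""
    counts = Counter("".join(samples))
    total = sum(counts[c] for c in ALPHABET)
    # Laplace smoothing so KL stays finite when a letter never appears
    return {c: (counts[c] + 1) / (total + len(ALPHABET)) for c in ALPHABET}

def kl_divergence(p, q):
    """KL(p || q) over a shared alphabet."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)

baseline = char_dist(["neleren", "soleran", "breran", "meberer", "lelerey"])
drifted  = char_dist(["narela", "sorelel", "relere", "nolnam", "neredll"])
print(round(kl_divergence(baseline, drifted), 3))
```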
Baseline samples:
neleren, soleran, breran, meberer, lelerey
Drifted samples (trained on reversed names):
narela, sorelel, relere, nolnam, neredll
The KL divergence came out to 0.20—the distributions are measurably different, and you can see it in the samples. The model shifted its entire output surface because the training data changed.
Why this matters:
- The model happily learns whatever distribution you feed it. It has no notion of "correct" training signals versus altered ones.
- Fine-tuning on skewed data will move behavior, and the substrate can't tell you it drifted. It just learns the new pattern and keeps going.
Step 3: Push Past the Context Window
The idea: The model has a 16-token context window. Force it to generate 32 tokens—twice the window—and log what happens.
What comes out:
```text
sannene|||donnek|z|k|k|k|k|k|te|
liindey|||j|n|k|k|k|k|th|k|ter|j
erinin||||kelenle|k|k|k|k|k|k|k|
```
The first ~16 tokens look normal. After that, it collapses into repetitive single-character patterns. Every trial does this. The positional encoding can't represent positions it wasn't trained on, so attention degenerates into a fixed loop.
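One common way to generate past the window without crashing is to crop the context to the last `block_size` tokens, which makes the limit explicit: the model can never condition on anything older than the window. A minimal sketch of that loop (the `next_token` callable here is a dummy stand-in, not the real model):

```python
def generate(next_token, block_size=16, max_len=32):
    """Generate max_len tokens; the model only ever sees the last block_size."""
    tokens, widest = [], 0
    for _ in range(max_len):
        ctx = tokens[-block_size:]          # crop to the trained window
        widest = max(widest, len(ctx))
        tokens.append(next_token(ctx))
    return tokens, widest

# dummy next-token rule, just to exercise the loop
toks, widest = generate(lambda ctx: len(ctx) % 27)
print(len(toks), widest)  # 32 tokens generated, but the context never exceeds 16
```

No amount of cropping recovers the information that fell off the left edge; it just turns a hard failure into a silent one.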
Why this matters:
- Finite context is architectural, not a training problem. No amount of better data fixes a window that's too small for the task.
- This is the toy version of why long-context models can still "lose the plot" at the tail. The mechanism is the same—positional representations have limits, and past those limits the model doesn't gracefully degrade. It breaks.
Step 4: Nudge the Weights, Measure the Damage
The idea: Save a trained checkpoint. Add small Gaussian noise to every weight at different scales (0.001, 0.01, 0.05, 0.1). Sample from the noisy model and check how often the outputs match the original.
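A sketch of the perturbation step, assuming the checkpoint can be flattened into a plain list of floats (the helper names are illustrative; `overlap` is the fraction of identical sample strings shared by the clean and noisy models):

```python
import random

def perturb(weights, scale, seed=0):
    """Return a copy of a flat weight list with Gaussian noise added."""
    rng = random.Random(seed)
    return [w + rng.gauss(0.0, scale) for w in weights]

def overlap(samples_a, samples_b):
    """Fraction of sample strings the two models have in common."""
    return len(set(samples_a) & set(samples_b)) / max(len(set(samples_a)), 1)

base = [random.Random(42).gauss(0.0, 0.3) for _ in range(4192)]
for scale in (0.001, 0.01, 0.05, 0.1):
    noisy = perturb(base, scale)
    # load `noisy` back into the model, sample, then compare with overlap()
    print(scale, max(abs(a - b) for a, b in zip(base, noisy)))
```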
| Noise Scale | Overlap with Baseline |
|---|---|
| 0.001 | 4% |
| 0.01 | 2% |
| 0.05 | 2% |
| 0.1 | 0% |
At noise scale 0.1—which is small relative to typical weight magnitudes—the model produces completely different outputs. Zero overlap. Not "slightly different." Unrecognizably different.
Why this matters:
- The learned behavior lives on a narrow ridge in parameter space. Small weight changes can produce very different outputs.
- This is why "slight" changes from fine-tuning, quantization, model merging, or even checkpoint selection can have outsized effects in practice. The surface isn't smooth—it's spiky, and you're always closer to an edge than you think.
Step 5: Check If It Can Follow Rules
The idea: Generate 100 names and check, after the fact, which ones violate simple constraints:
- Length ≤ 5 characters
- Must start with a vowel
- No letter 'z'
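The post-hoc audit is a few lines of counting. A sketch (the five names below are stand-ins; run it on your own batch of 100 generated samples):

```python
def violation_rates(names):
    """Fraction of generated names violating each post-hoc constraint."""
    vowels = set("aeiou")
    n = len(names)
    return {
        "max length <= 5": sum(len(x) > 5 for x in names) / n,
        "vowel start":     sum(x[0] not in vowels for x in names) / n,
        "no letter 'z'":   sum("z" in x for x in names) / n,
    }

print(violation_rates(["neleren", "soleran", "breran", "ava", "zed"]))
```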
Results:
| Constraint | Violation Rate |
|---|---|
| Max length ≤ 5 | 92% |
| Vowel start | 88% |
| No letter 'z' | 1% |
The 'z' constraint is met almost by accident—'z' is just rare in the training data. The other two constraints are violated at rates that directly reflect the training distribution. There is no constraint layer. There is nothing in the architecture that can say "this output is structurally invalid, reject it."
Why this matters:
- "Always reply in JSON" is a hope, not a guarantee, unless you enforce it outside the model. The model can only produce what its probability distribution favors—it can't rule things out.
- Every constraint you care about in production—format, safety, content boundaries—needs enforcement at a layer above the model. The substrate won't do it for you.
The Punchline
None of these failures are bugs in a toy. They're what you should expect from this architecture:
- Drift when data changes—the model learns whatever you feed it, with no self-awareness about distribution shifts.
- Brittleness under small perturbations—learned behavior is fragile, not robust.
- No built-in constraints—everything is probability, and probability doesn't say "no."
Bigger models add more layers on top—RLHF, safety training, output filters, prompt engineering. Those layers are valuable. But they're mitigations, not fixes. The core substrate underneath behaves the same way this 4,192-parameter model does.
If you're building on LLMs, you need to assume these properties and design around them. Not because the model is bad, but because this is what the architecture is.
Try It Yourself
Karpathy's microGPT is about 200 lines of pure Python, MIT License. The dataset is names.txt from makemore. You can wire up every experiment described above in an afternoon. No GPU needed. If you want the formal write-up, the technical report is on Zenodo: Substrate-Layer Failure Taxonomy: Drift, Brittleness, and Desynchronization in a Minimal GPT (Truong, 2026).
Top comments (2)
I'm struck by how the failures of this GPT model mirror our own struggles with language and creativity. From what I've learned, it appears that the model's limitations serve as a kind of cautionary tale for the potential pitfalls of large language models - namely, their tendency to drift and lack built-in constraints. It's almost as if the model is a reflection of our own imperfect processes, a reminder that even the most advanced language generation is still rooted in the flaws and biases of the data that trained it.
I appreciate the perspective—I approached drift as an architectural phenomenon rather than a psychological one, but it’s interesting how people map these behaviors onto human processes.