TL;DR
Building a minimal language model from scratch takes fewer than 300 lines of Python. The process shows how tokenization, attention, and inference work, which makes you a better API consumer when integrating production LLMs into applications.
Introduction
Most developers treat language models as black boxes: send text in, get tokens out, and assume the middle layer is magic. That works until you need to debug a broken API integration, tune sampling parameters, or understand why a model keeps failing to return valid structured data.
GuppyLM, a project that recently reached the Hacker News front page with 842 points, makes the internals visible. It is an 8.7M-parameter transformer written from scratch in Python. It trains in under an hour on a consumer GPU, and the code fits in a single file. The goal is not to compete with GPT-4; it is to make LLM mechanics understandable.
This article walks through how to build a tiny LLM, what each component does, and what that teaches you when you work with AI APIs professionally.
💡 If you're testing AI API integrations, Apidog's Test Scenarios can help you verify streaming responses, assert token-related fields, and simulate edge-case completions without burning production credits.
What makes a language model "tiny"?
A production LLM like GPT-4 has hundreds of billions of parameters. A tiny LLM usually sits around 1M to 25M parameters. Projects like GuppyLM, Karpathy's nanoGPT, and MicroLM fall into this category.
Tiny LLMs can:
- Train on a laptop or Google Colab
- Fit entirely in CPU memory
- Be inspected, modified, and debugged at the weight level
They cannot:
- Handle complex reasoning reliably
- Generate coherent long-form text consistently
- Match the factual depth of production models
The value is not the output quality. The value is the implementation knowledge you gain by building one.
Core components: how an LLM works
Before writing code, you need to understand the four main pieces.
Tokenizer
The tokenizer converts raw text into integer IDs.
For example:
"Hello, world!" -> [15496, 11, 995, 0]
Each integer maps to a subword unit from a fixed vocabulary.
Why this matters for API work:
- Token counts affect latency and cost.
- Tokenization determines whether prompts fit within the model context window.
- Unexpected token splitting can cause truncation or malformed structured output.
GuppyLM uses a simple character-level tokenizer. Production models such as GPT-4 use BPE-style tokenization with vocabularies commonly in the tens of thousands of tokens.
Embedding layer
The embedding layer converts token IDs into dense vectors.
For example, a token ID might become a 384-dimensional vector. These learned vectors carry statistical meaning: tokens used in similar contexts tend to end up closer together in vector space.
Position embeddings are added so the model can distinguish token order.
Transformer blocks
Transformer blocks perform the core computation. Each block has two main parts.
Self-attention
Self-attention lets each token look at previous tokens in the sequence and decide which ones matter for predicting the next token.
GuppyLM uses 6 attention heads across 6 layers.
Feed-forward network
After attention, each token representation goes through a small MLP. GuppyLM uses ReLU activation, which is simpler than newer activations such as SwiGLU.
Output head
After the final transformer block, a linear layer projects each token representation into a vector the size of the vocabulary.
Then the model:
- Applies softmax to convert logits into probabilities.
- Selects or samples the next token.
- Appends that token to the context.
- Repeats the process.
Building a minimal LLM in Python
The following example implements a tiny character-level transformer in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F
# Hyperparameters
VOCAB_SIZE = 256 # character-level: one slot per ASCII char
D_MODEL = 128 # embedding dimension
N_HEADS = 4 # attention heads
N_LAYERS = 3 # transformer blocks
SEQ_LEN = 64 # context window
DROPOUT = 0.1
class SelfAttention(nn.Module):
def __init__(self, d_model, n_heads):
super().__init__()
self.n_heads = n_heads
self.head_dim = d_model // n_heads
self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
self.proj = nn.Linear(d_model, d_model, bias=False)
self.dropout = nn.Dropout(DROPOUT)
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
q, k, v = qkv.unbind(dim=2)
q = q.transpose(1, 2)
k = k.transpose(1, 2)
v = v.transpose(1, 2)
# Causal mask: each token can only attend to previous tokens
scale = self.head_dim ** -0.5
attn = (q @ k.transpose(-2, -1)) * scale
mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
attn = attn.masked_fill(mask, float("-inf"))
attn = F.softmax(attn, dim=-1)
attn = self.dropout(attn)
out = (attn @ v).transpose(1, 2).reshape(B, T, C)
return self.proj(out)
class TransformerBlock(nn.Module):
def __init__(self, d_model, n_heads):
super().__init__()
self.attn = SelfAttention(d_model, n_heads)
self.ff = nn.Sequential(
nn.Linear(d_model, 4 * d_model),
nn.ReLU(),
nn.Linear(4 * d_model, d_model),
nn.Dropout(DROPOUT),
)
self.ln1 = nn.LayerNorm(d_model)
self.ln2 = nn.LayerNorm(d_model)
def forward(self, x):
x = x + self.attn(self.ln1(x))
x = x + self.ff(self.ln2(x))
return x
class TinyLLM(nn.Module):
def __init__(self):
super().__init__()
self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
self.pos_embed = nn.Embedding(SEQ_LEN, D_MODEL)
self.blocks = nn.ModuleList([
TransformerBlock(D_MODEL, N_HEADS)
for _ in range(N_LAYERS)
])
self.ln_f = nn.LayerNorm(D_MODEL)
self.head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)
def forward(self, idx):
B, T = idx.shape
tok_emb = self.embed(idx)
pos = torch.arange(T, device=idx.device)
pos_emb = self.pos_embed(pos)
x = tok_emb + pos_emb
for block in self.blocks:
x = block(x)
x = self.ln_f(x)
logits = self.head(x)
return logits
model = TinyLLM()
total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {total_params:,} parameters")
Training loop
Training next-token prediction is straightforward:
- Feed the model a sequence.
- Shift the same sequence by one token to create the target.
- Compute cross-entropy loss.
- Backpropagate.
import torch.optim as optim
def train(model, data, epochs=100, lr=3e-4):
optimizer = optim.AdamW(model.parameters(), lr=lr)
model.train()
for epoch in range(epochs):
# data: tensor of token IDs, shape [batch, seq_len + 1]
x = data[:, :-1] # input: all tokens except last
y = data[:, 1:] # target: all tokens shifted by 1
logits = model(x)
loss = F.cross_entropy(
logits.reshape(-1, VOCAB_SIZE),
y.reshape(-1)
)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if epoch % 10 == 0:
print(f"Epoch {epoch}, loss: {loss.item():.4f}")
Inference: text generation
Generation is just repeated next-token prediction.
@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, top_k=10):
model.eval()
ids = torch.tensor([prompt_ids])
for _ in range(max_new_tokens):
# Crop to context window
idx_cond = ids[:, -SEQ_LEN:]
logits = model(idx_cond)
# Use only the final token position
logits = logits[:, -1, :] / temperature
# Top-k sampling
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float("-inf")
probs = F.softmax(logits, dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
ids = torch.cat([ids, next_id], dim=1)
return ids[0].tolist()
What this teaches you about AI API behavior
Building the model exposes implementation details that matter when using production AI APIs.
Temperature and sampling are mechanical
Temperature divides logits before softmax.
Higher temperature produces a flatter probability distribution, which makes output more random. Lower temperature sharpens the distribution, which makes output more deterministic.
If a production API returns slightly different responses at temperature=0, that does not always mean the API is broken. Some systems avoid a strict greedy path or apply additional decoding behavior.
Context windows are hard limits
This line shows what happens at the context boundary:
idx_cond = ids[:, -SEQ_LEN:]
The model only sees the last SEQ_LEN tokens. Older tokens are dropped.
For API integrations, this means:
- The model does not remember the full conversation forever.
- Long chat histories need summarization, truncation, retrieval, or another memory strategy.
- Token budgeting should be part of your request-building logic.
See [internal: how-ai-agent-memory-works] for more on agent memory patterns.
Streaming tokens are inference steps made visible
Streaming APIs do not use a fundamentally different architecture. They generate tokens one step at a time and flush each token or chunk to the client.
This matters for retry logic:
- If a stream drops halfway through, you generally cannot resume from the exact internal state.
- Your client should decide whether to retry the full request, show partial output, or ask the model to continue from visible context.
Logits explain why structured output is hard
At every generation step, the model assigns probability to every token in the vocabulary.
Generating valid JSON requires the right token to be selected at every position. One wrong comma, quote, brace, or field name can break the output.
Structured output modes often constrain generation so only grammar-valid tokens are allowed at each step. Libraries such as Outlines and Guidance use this kind of constrained decoding approach.
How to test AI API integrations with Apidog
Once you understand inference behavior, you can write better API tests.
For example, when testing a streaming chat API, create a Test Scenario with your /v1/chat/completions endpoint and validate the parts of the response your app depends on.
Test cases to include:
- Assert that
response.choices[0].finish_reason == "stop" - Assert that
response.usage.total_tokens < 4096 - Send the previous response as context in a follow-up step to simulate a multi-turn conversation
- Mock truncation with
finish_reason: "length" - Mock moderation or filtering with
finish_reason: "content_filter" - Simulate a network timeout during a stream
This lets you test AI integrations without spending API credits on every CI run.
See [internal: api-testing-tutorial] for a broader look at API testing approaches.
Testing token count assertions
Example assertion payload:
{
"assertions": [
{
"field": "response.usage.completion_tokens",
"operator": "less_than",
"value": 512
},
{
"field": "response.choices[0].finish_reason",
"operator": "equals",
"value": "stop"
},
{
"field": "response.choices[0].message.content",
"operator": "not_empty"
}
]
}
Run the same scenario across different providers and models, such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, to catch schema and behavior differences before they reach production.
Advanced: quantization and inference optimization
After you have a working tiny LLM, two production-serving concepts are worth understanding: quantization and KV cache.
Quantization
The model above uses 32-bit floating-point weights by default.
Quantization reduces weights to lower-precision formats such as INT8 or INT4. This can reduce memory usage significantly, usually with some accuracy tradeoff.
Example dynamic INT8 quantization in PyTorch:
import torch.quantization
quantized_model = torch.quantization.quantize_dynamic(
model,
{nn.Linear},
dtype=torch.qint8
)
Production APIs often run optimized or quantized model variants. When quality differs between versions of the same model family, serving optimizations may be one factor.
KV cache
The simple inference loop recomputes attention over the full visible sequence at every generation step.
Production systems commonly cache key-value pairs from previous tokens. This KV cache lets the model process each new token more efficiently.
That is one reason the first token in a streaming response can take longer than later tokens.
Tiny LLM vs. production API: when to use each
| Use case | Tiny LLM | Production API |
|---|---|---|
| Learning model internals | Best for | Overkill |
| Prototyping a new app | Insufficient quality | Best for |
| Private or sensitive data | Good option | Depends on provider |
| Offline or edge deployment | Viable | Not possible |
| Cost-sensitive, high-volume workloads | Possible with tradeoffs | Expensive at scale |
| Reasoning-heavy tasks | Not viable | Required |
For most developers, the practical approach is:
- Use a production API for your application.
- Build or run a tiny model to understand what is happening under the hood.
The two are not competing. See [internal: open-source-coding-assistants-2026] for tools that blur this line with bring-your-own-model setups.
Conclusion
Building a tiny LLM from scratch takes a weekend. You do not get a production system, but you do get a working mental model of how language models such as GuppyLM and GPT-4o generate text.
That understanding helps when you:
- Debug streaming integrations
- Tune sampling parameters
- Design token-budgeting logic
- Write assertions for AI API tests
- Handle malformed structured output
GuppyLM is a good starting point. Clone it, train it on a text dataset, and read the inference loop. Then return to your production API integrations with a clearer model of what the API is doing.
Try Apidog's Test Scenarios to apply the same testing rigor to AI APIs that you already apply to backend services.
FAQ
How many parameters does a tiny LLM need to generate coherent text?
Around 10M to 50M parameters with a decent training dataset can produce locally coherent sentences. Below 1M parameters, most tasks produce poor output. GuppyLM has 8.7M parameters and works for short conversations in its training domain.
Can I run a tiny LLM without a GPU?
Yes. Models under 100M parameters can run on CPU, although inference is slower. The small model in this article can generate tokens on a laptop CPU.
What dataset should I train on?
Character-level models work with plain text corpora such as Project Gutenberg texts, Wikipedia subsets, or domain-specific text files.
GuppyLM uses a 60K-entry conversation dataset on Hugging Face: arman-bd/guppylm-60k-generic.
For code generation experiments, datasets such as The Stack or CodeParrot are common options.
What's the difference between temperature and top-k sampling?
Temperature scales the logit distribution and controls overall randomness.
Top-k restricts the sampling pool to the k most likely tokens.
They can be used together:
- Filter candidates with top-k.
- Apply temperature to shape probabilities within that candidate set.
- Sample the next token.
Why does my LLM sometimes repeat itself?
Repetition happens when the model keeps assigning high probability to tokens or phrases it recently generated.
Production APIs often use repetition penalties or similar decoding controls. If your API supports it, a small repetition penalty such as 1.1 may reduce repeated phrases.
How long does it take to train a tiny LLM?
The model in this article can train to recognizable output in under 2 hours on a single consumer GPU such as an RTX 3060 or equivalent. GuppyLM trains in Colab in roughly the same range.
Larger models, such as 100M+ parameter models, require more compute and longer training runs.
What's the fastest way to go from tiny LLM to a real API endpoint?
One path is to export the model to GGUF format and serve it with llama-server from llama.cpp. That gives you an OpenAI-compatible local endpoint.
You can then point your API testing workflow at the local server. See [internal: rest-api-best-practices].
How do production LLMs handle context longer than their training window?
Common approaches include:
- RoPE scaling
- Sliding window attention
- Retrieval-augmented generation
- Summarization-based memory
- External vector search
The core transformer idea remains the same, but the way position information and attention windows are handled changes.
Top comments (0)