DEV Community

Wanda

Posted on • Originally published at apidog.com

How to build an LLM from scratch (and what it teaches you)

TL;DR

Building a minimal language model from scratch takes fewer than 300 lines of Python. The process reveals exactly how tokenization, attention, and inference work, which makes you a far better API consumer when you're integrating production LLMs into your applications.

Try Apidog today

Introduction

Most developers treat language models as black boxes. You send text in, tokens come out, and somewhere in between, magic happens. That mental model works fine until you need to debug a broken API integration, tune sampling parameters, or figure out why your model keeps hallucinating structured data.

GuppyLM, a project that recently hit the Hacker News front page with 842 points, makes the internals visible. It's an 8.7M-parameter transformer written from scratch in Python. It trains in under an hour on a consumer GPU. The code fits in a single file. The goal isn't to compete with GPT-4; it's to demystify what LLMs actually do.

This article walks through how to build a tiny LLM, what each component does, and what understanding the internals teaches you when you're working with AI APIs professionally.

💡 Tip: If you're testing AI API integrations, Apidog's Test Scenarios let you verify streaming responses, assert on token structure, and simulate edge-case completions without burning production credits. More on that later.

What makes a language model "tiny"?

A production LLM like GPT-4 has hundreds of billions of parameters. A "tiny" LLM sits roughly in the range of 1M to 25M parameters. GuppyLM (8.7M) and MicroLM (1-2M) fit squarely in that range; Karpathy's nanoGPT (124M in its GPT-2 reproduction, but easily scaled down) is built in the same spirit.

Tiny LLMs can:

  • Train on a laptop or Google Colab
  • Fit entirely in CPU memory
  • Be inspected, modified, and debugged at the weight level

They can't:

  • Handle complex reasoning
  • Generate coherent long-form text reliably
  • Match the factual depth of production models

The value isn't the output. It's the understanding you get from building one.

Core components: how an LLM actually works

Understand these four main pieces before you write any code.

Tokenizer

The tokenizer converts raw text into integer IDs. "Hello, world!" becomes something like [15496, 11, 995, 0]. Each integer maps to a subword unit from a fixed vocabulary.

API relevance: Token counts affect latency and cost. Knowing how tokenizers split text helps you fit prompts within context limits and avoid truncation.

GuppyLM uses a simple character-level tokenizer. Production models like GPT-4 use BPE (byte-pair encoding) with vocabularies of 50K-100K tokens.
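A character-level tokenizer of the kind GuppyLM uses can be sketched in a few lines. This is a generic illustration, not GuppyLM's actual code:

```python
class CharTokenizer:
    """Map each unique character to an integer ID and back."""
    def __init__(self, corpus):
        chars = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")  # one small integer per character
```

A BPE tokenizer works the same way conceptually, except the vocabulary entries are learned subword chunks instead of single characters, so the same text produces far fewer IDs.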

Embedding layer

The embedding layer converts token IDs into dense vectors. Each token gets a learned vector (e.g. 384 dimensions in GuppyLM). These vectors have semantic meaning; similar tokens cluster in vector space.

Position embeddings are added so the model knows token order.
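The lookup-and-add is all there is to it. A toy pure-Python sketch (4 dimensions instead of 384, tables initialized randomly as training would begin):

```python
import random

random.seed(0)
D_MODEL, VOCAB, SEQ = 4, 10, 5  # toy sizes for illustration

# Learned lookup tables: one vector per token ID, one per position
tok_table = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(VOCAB)]
pos_table = [[random.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(SEQ)]

def embed(ids):
    # token vector + position vector, element-wise, at each position
    return [[t + p for t, p in zip(tok_table[tok_id], pos_table[pos])]
            for pos, tok_id in enumerate(ids)]

x = embed([3, 1, 4])  # three tokens -> three 4-dim vectors
```

In the PyTorch model later in this article, `nn.Embedding` plays the role of these tables, and the vectors are updated by backprop rather than fixed at random.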

Transformer blocks

The computation core. Each block has two parts:

  • Self-attention: Each token looks at all others in the sequence to decide relevance for predicting the next token. GuppyLM uses 6 attention heads across 6 layers.
  • Feed-forward network: A two-layer MLP applied to each token's representation after attention. GuppyLM uses ReLU activation for simplicity.

Output head

After the transformer blocks, a linear layer projects each token's representation to a vector with size equal to the vocabulary. Apply softmax to get probabilities, pick the highest (or sample), and repeat.
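That last step, logits to probabilities to a pick, is small enough to sketch in plain Python (the 3-word vocabulary here is invented for illustration):

```python
import math

VOCAB = ["the", "cat", "sat"]  # toy 3-token vocabulary

def softmax(logits):
    m = max(logits)  # subtract the max so exp() can't overflow
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 0.5, -1.0]   # one raw score per vocabulary entry
probs = softmax(logits)     # normalized into a probability distribution
next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
next_token = VOCAB[next_id]
```

Greedy argmax is just one choice; the generation section below samples from the distribution instead.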

Building a minimal LLM in Python

Here’s a working minimal LLM based on GuppyLM, using PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
VOCAB_SIZE = 256     # character-level: one slot per ASCII char
D_MODEL = 128        # embedding dimension
N_HEADS = 4          # attention heads
N_LAYERS = 3         # transformer blocks
SEQ_LEN = 64         # context window
DROPOUT = 0.1

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)
        # Causal mask: each token can only attend to previous tokens
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale
        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        attn = attn.masked_fill(mask, float('-inf'))
        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = SelfAttention(d_model, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(DROPOUT),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

class TinyLLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(SEQ_LEN, D_MODEL)
        self.blocks = nn.ModuleList([
            TransformerBlock(D_MODEL, N_HEADS) for _ in range(N_LAYERS)
        ])
        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        tok_emb = self.embed(idx)
        pos = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embed(pos)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        x = self.ln_f(x)
        logits = self.head(x)
        return logits

# Initialize and count parameters
model = TinyLLM()
total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {total_params:,} parameters")  # ~0.67M with these hyperparameters

Training loop

import torch.optim as optim

def train(model, data, epochs=100, lr=3e-4):
    optimizer = optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for epoch in range(epochs):
        # data: tensor of token IDs, shape [batch, seq_len+1]
        x = data[:, :-1]   # input: all tokens except last
        y = data[:, 1:]    # target: all tokens shifted by 1
        logits = model(x)
        loss = F.cross_entropy(logits.reshape(-1, VOCAB_SIZE), y.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch % 10 == 0:
            print(f"Epoch {epoch}, loss: {loss.item():.4f}")

Inference (text generation)

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, top_k=10):
    model.eval()
    ids = torch.tensor([prompt_ids])
    for _ in range(max_new_tokens):
        idx_cond = ids[:, -SEQ_LEN:]  # crop to context window
        logits = model(idx_cond)
        logits = logits[:, -1, :] / temperature  # last token only
        # top-k sampling
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float('-inf')
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids[0].tolist()

What this teaches you about AI API behavior

Building your own LLM clarifies several API behaviors you’ll encounter.

Temperature and sampling are mechanical, not magical

Temperature divides logits before softmax. Higher temperature = more random output; lower = more deterministic. temperature=0.0 is greedy argmax. Many APIs floor it slightly to prevent degenerate outputs.
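You can see the effect directly by dividing logits before the softmax; this standalone sketch shows the distribution sharpening at low temperature and flattening at high temperature:

```python
import math

def softmax(logits):
    m = max(logits)  # max-subtraction for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
base  = softmax(logits)                      # temperature = 1.0
sharp = softmax([v / 0.5 for v in logits])   # low temperature: peaked
flat  = softmax([v / 2.0 for v in logits])   # high temperature: flatter
```

The same three logits, three very different distributions. That's the whole mechanism behind the `temperature` parameter in every LLM API.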

Context windows are hard limits, not soft suggestions

In the inference loop, idx_cond = ids[:, -SEQ_LEN:] shows how the model drops older tokens once context is full. Your API integration must handle this: don’t assume infinite memory. For more, see [internal: how-ai-agent-memory-works].
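One common client-side strategy is to drop the oldest turns until the conversation fits. A sketch, with character count (`len`) standing in for a real tokenizer, which is an assumption you'd replace with your provider's token counting:

```python
def fit_context(system_msg, history, budget, count_tokens=len):
    """Drop the oldest turns until the conversation fits the token budget.
    The system message is always kept. count_tokens defaults to character
    count here; a real integration would use the provider's tokenizer."""
    kept = list(history)
    while kept and count_tokens(system_msg) + sum(map(count_tokens, kept)) > budget:
        kept.pop(0)  # oldest turn goes first
    return [system_msg] + kept
```

This mirrors what the model itself does with `ids[:, -SEQ_LEN:]`, except you choose what gets dropped instead of silently losing the prompt's beginning.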

Streaming tokens are just inference steps made visible

Streaming APIs run the inference loop and flush each token as it’s generated. If a stream drops mid-generation, it can’t be resumed—restart from the beginning.
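On the client side, that means reassembling per-token deltas. Here's a small parser for the OpenAI-style `data: {...}` server-sent-events format; the `choices[0].delta` layout is the common convention, so adapt it to your provider:

```python
import json

def collect_stream(lines):
    """Reassemble content deltas from OpenAI-style `data: {...}` lines."""
    parts = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives and blank lines
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # sentinel marking the end of the stream
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            parts.append(delta["content"])
    return "".join(parts)
```

In a real integration, `lines` would come from an HTTP response iterated line by line; here it's any iterable of strings, which also makes the parser easy to test.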

Logits explain why structured output is hard

LLMs assign probability to every token at every step. Generating valid JSON, for example, requires the right token at every position. Libraries like Outlines and Guidance constrain the logit distribution to enforce grammar. Structured output modes in APIs do similar things.
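The core trick in those libraries is masking: before sampling, set the logit of every grammatically invalid token to negative infinity. A toy sketch of the idea:

```python
def constrained_pick(logits, allowed_ids):
    """Mask every token outside `allowed_ids` to -inf, then take the
    argmax. A toy version of grammar-constrained decoding: at a point
    where only `{` or whitespace is valid JSON, only those IDs would
    be in the allowed set."""
    masked = [v if i in allowed_ids else float("-inf")
              for i, v in enumerate(logits)]
    return max(range(len(masked)), key=masked.__getitem__)
```

The model still "wants" to emit its highest-logit token, but the mask guarantees the output stays inside the grammar at every step.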

How to test AI API integrations with Apidog

Understanding inference lets you write robust API tests. Apidog's Test Scenarios can chain API calls and assert on AI responses.

Example: testing a streaming chat API

  1. Create a Test Scenario in Apidog with your /v1/chat/completions endpoint.
  2. Set assertions on response structure:
    • response.choices[0].finish_reason == "stop"
    • response.usage.total_tokens < 4096
  3. Add a follow-up step to send the response as context to the next turn (simulate multi-turn conversation).
  4. Use Apidog’s Smart Mock to stub the AI endpoint and test error handling. Simulate:
    • finish_reason: "length" (truncated output)
    • finish_reason: "content_filter"
    • network timeout mid-stream

This approach tests AI integrations without burning API credits on every CI run. See [internal: api-testing-tutorial] for more API testing strategies.

Testing token count assertions

{
  "assertions": [
    {
      "field": "response.usage.completion_tokens",
      "operator": "less_than",
      "value": 512
    },
    {
      "field": "response.choices[0].finish_reason",
      "operator": "equals",
      "value": "stop"
    },
    {
      "field": "response.choices[0].message.content",
      "operator": "not_empty"
    }
  ]
}

Run these across models (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro) in a single Test Scenario to catch schema differences before they hit production.

Advanced: quantization and inference optimization

Once you have a working tiny LLM, two optimization techniques directly apply to production deployments.

Quantization

Model weights are 32-bit floats by default. Quantization reduces them to 8-bit (INT8) or 4-bit (INT4), cutting memory usage 4-8x with minor accuracy loss.

# Example: dynamic INT8 quantization in PyTorch
import torch.quantization
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

Many production APIs serve quantized models, and subtle quality shifts between model versions sometimes trace back to quantization changes rather than retraining.

KV cache

In the inference loop above, attention is recomputed for the entire sequence at each token. Production systems cache key-value pairs from previous tokens (the KV cache), so each new token only needs one new attention calculation. That’s why the first token in a streaming response is slower than the rest.
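A toy pure-Python illustration of the idea (real implementations cache per-layer, per-head tensors, but the mechanics are the same):

```python
import math

class KVCache:
    """Toy cache of per-token key/value vectors. Each decode step appends
    one entry, so attention only scores the new query against the cache
    instead of recomputing every pairwise score."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def attend_one(query, cache):
    # softmax over dot products of the new query with all cached keys
    scores = [sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    m = max(scores)  # max-subtraction for stable exponentials
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(cache.values[0])
    return [sum(w * v[d] for w, v in zip(weights, cache.values))
            for d in range(dim)]
```

Without the cache, generating token N costs attention over all N tokens at every layer, every step; with it, each step does one query against stored keys and values.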

Tiny LLM vs. production API: when to use each

| Use case | Tiny LLM | Production API |
| --- | --- | --- |
| Learning model internals | Best for | Overkill |
| Prototyping a new app | Insufficient quality | Best for |
| Private/sensitive data | Good option | Depends on provider |
| Offline/edge deployment | Viable | Not possible |
| Cost-sensitive, high volume | Possible with tradeoffs | Expensive at scale |
| Reasoning-heavy tasks | Not viable | Required |

For most developers: use production APIs for your application, but run a tiny model to understand what’s happening under the hood. The two aren’t competing. See [internal: open-source-coding-assistants-2026] for tools that blend these approaches.

Conclusion

Building a tiny LLM from scratch takes a weekend. What you get isn’t a production system; it’s a working mental model of how every language model, from GuppyLM to GPT-4o, actually works. That understanding pays off every time you debug a streaming integration, tune sampling parameters, or design assertions for your AI API tests.

The GuppyLM project is a good starting point. Clone it, train it on any text dataset, and study the inference loop. Then, return to your production API integrations—you’ll approach them with deeper insight.

Try Apidog’s Test Scenarios to bring the same rigor to your AI API testing that you'd apply to any other backend system.

FAQ

How many parameters does a "tiny" LLM need to generate coherent text?

Around 10M-50M parameters with a solid dataset can produce locally coherent sentences. Below 1M, you get gibberish on most tasks. GuppyLM at 8.7M works for short conversations on its training domain (60 topics).

Can I run a tiny LLM without a GPU?

Yes—models under 100M parameters run on CPU, though inference is slower. The model above (~670K params) generates tokens in milliseconds on a laptop CPU.

What dataset should I train on?

Character-level models work with Project Gutenberg texts, Wikipedia, or any plain text. GuppyLM uses a 60K-entry conversation dataset on HuggingFace (arman-bd/guppylm-60k-generic). For code, use The Stack or CodeParrot.
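Whatever corpus you pick, the preprocessing for the char-level model above is the same: encode the text to IDs, then slice into overlapping rows where `row[:-1]` is the input and `row[1:]` is the shifted target (wrap the result in `torch.tensor` before passing it to the `train()` loop above). A sketch:

```python
def make_rows(ids, seq_len):
    """Slice one long stream of token IDs into rows of length seq_len+1,
    so each row yields an input (row[:-1]) and a target (row[1:])."""
    return [ids[i:i + seq_len + 1]
            for i in range(0, len(ids) - seq_len, seq_len)]

# Toy stream of 21 "token IDs" sliced into 5 training rows
rows = make_rows(list(range(21)), seq_len=4)
```

Note the off-by-one: every row is `seq_len + 1` long precisely so input and target can be the same row shifted by one position.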

What's the difference between temperature and top-k sampling?

Temperature scales logit distribution (controls randomness). Top-k restricts the sampling pool to the k most likely tokens before temperature is applied. They’re often combined.

Why does my LLM sometimes repeat itself?

Repetition occurs when the model assigns high probability to recent tokens. Production APIs use repetition penalties (logit adjustment). Try repetition_penalty=1.1 in your API call to reduce this.
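The logit adjustment itself is simple. This sketch follows the common formulation behind the `repetition_penalty` parameter (providers vary in exact details, so treat it as illustrative):

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Common formulation: divide positive logits (and multiply negative
    ones) by the penalty for every token that has already appeared,
    making repeats less likely at the next sampling step."""
    out = list(logits)
    for i in set(generated_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out
```

Dividing a positive logit and multiplying a negative one both push the token's probability down, which is why the two cases are handled separately.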

How long does it take to train a tiny LLM?

The model above trains to coherent output in under 2 hours on a single GPU (RTX 3060 or similar). GuppyLM trains in Colab in similar time. Larger models (100M+) need multi-GPU setups and days.

What's the fastest way to go from tiny LLM to a real API endpoint?

Export to GGUF format with llama.cpp’s script, serve with llama-server for an OpenAI-compatible API endpoint running locally. Then point Apidog at it for testing. See [internal: rest-api-best-practices].

How do production LLMs handle context longer than their training window?

Techniques like RoPE (Rotary Position Embedding) scaling, sliding window attention, and retrieval-augmented generation extend effective context. The core transformer doesn’t change; these modify how position info and attention windows work.
