Preecha

Posted on Jun 8

How to build a LLM from scratch (and what it teaches you)

TL;DR

Building a minimal language model from scratch takes fewer than 300 lines of Python. The process shows how tokenization, attention, and inference work, which makes you a better API consumer when integrating production LLMs into applications.

Try Apidog today

Introduction

Most developers treat language models as black boxes: send text in, get tokens out, and assume the middle layer is magic. That works until you need to debug a broken API integration, tune sampling parameters, or understand why a model keeps failing to return valid structured data.

GuppyLM, a project that recently reached the Hacker News front page with 842 points, makes the internals visible. It is an 8.7M-parameter transformer written from scratch in Python. It trains in under an hour on a consumer GPU, and the code fits in a single file. The goal is not to compete with GPT-4; it is to make LLM mechanics understandable.

This article walks through how to build a tiny LLM, what each component does, and what that teaches you when you work with AI APIs professionally.

💡 If you're testing AI API integrations, Apidog's Test Scenarios can help you verify streaming responses, assert token-related fields, and simulate edge-case completions without burning production credits.

What makes a language model "tiny"?

A production LLM like GPT-4 has hundreds of billions of parameters. A tiny LLM usually sits around 1M to 25M parameters. Projects like GuppyLM, Karpathy's nanoGPT, and MicroLM fall into this category.

Tiny LLMs can:

Train on a laptop or Google Colab
Fit entirely in CPU memory
Be inspected, modified, and debugged at the weight level

They cannot:

Handle complex reasoning reliably
Generate coherent long-form text consistently
Match the factual depth of production models

The value is not the output quality. The value is the implementation knowledge you gain by building one.

Core components: how an LLM works

Before writing code, you need to understand the four main pieces.

Tokenizer

The tokenizer converts raw text into integer IDs.

For example:

"Hello, world!" -> [15496, 11, 995, 0]

Each integer maps to a subword unit from a fixed vocabulary.

Why this matters for API work:

Token counts affect latency and cost.
Tokenization determines whether prompts fit within the model context window.
Unexpected token splitting can cause truncation or malformed structured output.

GuppyLM uses a simple character-level tokenizer. Production models such as GPT-4 use BPE-style tokenization with vocabularies commonly in the tens of thousands of tokens.

Embedding layer

The embedding layer converts token IDs into dense vectors.

For example, a token ID might become a 384-dimensional vector. These learned vectors carry statistical meaning: tokens used in similar contexts tend to end up closer together in vector space.

Position embeddings are added so the model can distinguish token order.

Transformer blocks

Transformer blocks perform the core computation. Each block has two main parts.

Self-attention

Self-attention lets each token look at previous tokens in the sequence and decide which ones matter for predicting the next token.

GuppyLM uses 6 attention heads across 6 layers.

Feed-forward network

After attention, each token representation goes through a small MLP. GuppyLM uses ReLU activation, which is simpler than newer activations such as SwiGLU.

Output head

After the final transformer block, a linear layer projects each token representation into a vector the size of the vocabulary.

Then the model:

Applies softmax to convert logits into probabilities.
Selects or samples the next token.
Appends that token to the context.
Repeats the process.

Building a minimal LLM in Python

The following example implements a tiny character-level transformer in PyTorch.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters
VOCAB_SIZE = 256     # character-level: one slot per ASCII char
D_MODEL = 128        # embedding dimension
N_HEADS = 4          # attention heads
N_LAYERS = 3         # transformer blocks
SEQ_LEN = 64         # context window
DROPOUT = 0.1

class SelfAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.dropout = nn.Dropout(DROPOUT)

    def forward(self, x):
        B, T, C = x.shape

        qkv = self.qkv(x).reshape(B, T, 3, self.n_heads, self.head_dim)
        q, k, v = qkv.unbind(dim=2)

        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Causal mask: each token can only attend to previous tokens
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale

        mask = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
        attn = attn.masked_fill(mask, float("-inf"))

        attn = F.softmax(attn, dim=-1)
        attn = self.dropout(attn)

        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return self.proj(out)

class TransformerBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()

        self.attn = SelfAttention(d_model, n_heads)

        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.ReLU(),
            nn.Linear(4 * d_model, d_model),
            nn.Dropout(DROPOUT),
        )

        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.ff(self.ln2(x))
        return x

class TinyLLM(nn.Module):
    def __init__(self):
        super().__init__()

        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        self.pos_embed = nn.Embedding(SEQ_LEN, D_MODEL)

        self.blocks = nn.ModuleList([
            TransformerBlock(D_MODEL, N_HEADS)
            for _ in range(N_LAYERS)
        ])

        self.ln_f = nn.LayerNorm(D_MODEL)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE, bias=False)

    def forward(self, idx):
        B, T = idx.shape

        tok_emb = self.embed(idx)

        pos = torch.arange(T, device=idx.device)
        pos_emb = self.pos_embed(pos)

        x = tok_emb + pos_emb

        for block in self.blocks:
            x = block(x)

        x = self.ln_f(x)
        logits = self.head(x)

        return logits

model = TinyLLM()

total_params = sum(p.numel() for p in model.parameters())
print(f"Model size: {total_params:,} parameters")

Training loop

Training next-token prediction is straightforward:

Feed the model a sequence.
Shift the same sequence by one token to create the target.
Compute cross-entropy loss.
Backpropagate.

import torch.optim as optim

def train(model, data, epochs=100, lr=3e-4):
    optimizer = optim.AdamW(model.parameters(), lr=lr)

    model.train()

    for epoch in range(epochs):
        # data: tensor of token IDs, shape [batch, seq_len + 1]
        x = data[:, :-1]   # input: all tokens except last
        y = data[:, 1:]    # target: all tokens shifted by 1

        logits = model(x)

        loss = F.cross_entropy(
            logits.reshape(-1, VOCAB_SIZE),
            y.reshape(-1)
        )

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if epoch % 10 == 0:
            print(f"Epoch {epoch}, loss: {loss.item():.4f}")

Inference: text generation

Generation is just repeated next-token prediction.

@torch.no_grad()
def generate(model, prompt_ids, max_new_tokens=50, temperature=1.0, top_k=10):
    model.eval()

    ids = torch.tensor([prompt_ids])

    for _ in range(max_new_tokens):
        # Crop to context window
        idx_cond = ids[:, -SEQ_LEN:]

        logits = model(idx_cond)

        # Use only the final token position
        logits = logits[:, -1, :] / temperature

        # Top-k sampling
        v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
        logits[logits < v[:, [-1]]] = float("-inf")

        probs = F.softmax(logits, dim=-1)

        next_id = torch.multinomial(probs, num_samples=1)

        ids = torch.cat([ids, next_id], dim=1)

    return ids[0].tolist()

What this teaches you about AI API behavior

Building the model exposes implementation details that matter when using production AI APIs.

Temperature and sampling are mechanical

Temperature divides logits before softmax.

Higher temperature produces a flatter probability distribution, which makes output more random. Lower temperature sharpens the distribution, which makes output more deterministic.

If a production API returns slightly different responses at temperature=0, that does not always mean the API is broken. Some systems avoid a strict greedy path or apply additional decoding behavior.

Context windows are hard limits

This line shows what happens at the context boundary:

idx_cond = ids[:, -SEQ_LEN:]

The model only sees the last SEQ_LEN tokens. Older tokens are dropped.

For API integrations, this means:

The model does not remember the full conversation forever.
Long chat histories need summarization, truncation, retrieval, or another memory strategy.
Token budgeting should be part of your request-building logic.

See [internal: how-ai-agent-memory-works] for more on agent memory patterns.

Streaming tokens are inference steps made visible

Streaming APIs do not use a fundamentally different architecture. They generate tokens one step at a time and flush each token or chunk to the client.

This matters for retry logic:

If a stream drops halfway through, you generally cannot resume from the exact internal state.
Your client should decide whether to retry the full request, show partial output, or ask the model to continue from visible context.

Logits explain why structured output is hard

At every generation step, the model assigns probability to every token in the vocabulary.

Generating valid JSON requires the right token to be selected at every position. One wrong comma, quote, brace, or field name can break the output.

Structured output modes often constrain generation so only grammar-valid tokens are allowed at each step. Libraries such as Outlines and Guidance use this kind of constrained decoding approach.

How to test AI API integrations with Apidog

Once you understand inference behavior, you can write better API tests.

For example, when testing a streaming chat API, create a Test Scenario with your /v1/chat/completions endpoint and validate the parts of the response your app depends on.

Test cases to include:

Assert that response.choices[0].finish_reason == "stop"
Assert that response.usage.total_tokens < 4096
Send the previous response as context in a follow-up step to simulate a multi-turn conversation
Mock truncation with finish_reason: "length"
Mock moderation or filtering with finish_reason: "content_filter"
Simulate a network timeout during a stream

This lets you test AI integrations without spending API credits on every CI run.

See [internal: api-testing-tutorial] for a broader look at API testing approaches.

Testing token count assertions

Example assertion payload:

{
  "assertions": [
    {
      "field": "response.usage.completion_tokens",
      "operator": "less_than",
      "value": 512
    },
    {
      "field": "response.choices[0].finish_reason",
      "operator": "equals",
      "value": "stop"
    },
    {
      "field": "response.choices[0].message.content",
      "operator": "not_empty"
    }
  ]
}

Run the same scenario across different providers and models, such as GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro, to catch schema and behavior differences before they reach production.

Advanced: quantization and inference optimization

After you have a working tiny LLM, two production-serving concepts are worth understanding: quantization and KV cache.

Quantization

The model above uses 32-bit floating-point weights by default.

Quantization reduces weights to lower-precision formats such as INT8 or INT4. This can reduce memory usage significantly, usually with some accuracy tradeoff.

Example dynamic INT8 quantization in PyTorch:

import torch.quantization

quantized_model = torch.quantization.quantize_dynamic(
    model,
    {nn.Linear},
    dtype=torch.qint8
)

Production APIs often run optimized or quantized model variants. When quality differs between versions of the same model family, serving optimizations may be one factor.

KV cache

The simple inference loop recomputes attention over the full visible sequence at every generation step.

Production systems commonly cache key-value pairs from previous tokens. This KV cache lets the model process each new token more efficiently.

That is one reason the first token in a streaming response can take longer than later tokens.

Tiny LLM vs. production API: when to use each

Use case	Tiny LLM	Production API
Learning model internals	Best for	Overkill
Prototyping a new app	Insufficient quality	Best for
Private or sensitive data	Good option	Depends on provider
Offline or edge deployment	Viable	Not possible
Cost-sensitive, high-volume workloads	Possible with tradeoffs	Expensive at scale
Reasoning-heavy tasks	Not viable	Required

For most developers, the practical approach is:

Use a production API for your application.
Build or run a tiny model to understand what is happening under the hood.

The two are not competing. See [internal: open-source-coding-assistants-2026] for tools that blur this line with bring-your-own-model setups.

Conclusion

Building a tiny LLM from scratch takes a weekend. You do not get a production system, but you do get a working mental model of how language models such as GuppyLM and GPT-4o generate text.

That understanding helps when you:

Debug streaming integrations
Tune sampling parameters
Design token-budgeting logic
Write assertions for AI API tests
Handle malformed structured output

GuppyLM is a good starting point. Clone it, train it on a text dataset, and read the inference loop. Then return to your production API integrations with a clearer model of what the API is doing.

Try Apidog's Test Scenarios to apply the same testing rigor to AI APIs that you already apply to backend services.

FAQ

How many parameters does a tiny LLM need to generate coherent text?

Around 10M to 50M parameters with a decent training dataset can produce locally coherent sentences. Below 1M parameters, most tasks produce poor output. GuppyLM has 8.7M parameters and works for short conversations in its training domain.

Can I run a tiny LLM without a GPU?

Yes. Models under 100M parameters can run on CPU, although inference is slower. The small model in this article can generate tokens on a laptop CPU.

What dataset should I train on?

Character-level models work with plain text corpora such as Project Gutenberg texts, Wikipedia subsets, or domain-specific text files.

GuppyLM uses a 60K-entry conversation dataset on Hugging Face: arman-bd/guppylm-60k-generic.

For code generation experiments, datasets such as The Stack or CodeParrot are common options.

What's the difference between temperature and top-k sampling?

Temperature scales the logit distribution and controls overall randomness.

Top-k restricts the sampling pool to the k most likely tokens.

They can be used together:

Filter candidates with top-k.
Apply temperature to shape probabilities within that candidate set.
Sample the next token.

Why does my LLM sometimes repeat itself?

Repetition happens when the model keeps assigning high probability to tokens or phrases it recently generated.

Production APIs often use repetition penalties or similar decoding controls. If your API supports it, a small repetition penalty such as 1.1 may reduce repeated phrases.

How long does it take to train a tiny LLM?

The model in this article can train to recognizable output in under 2 hours on a single consumer GPU such as an RTX 3060 or equivalent. GuppyLM trains in Colab in roughly the same range.

Larger models, such as 100M+ parameter models, require more compute and longer training runs.

What's the fastest way to go from tiny LLM to a real API endpoint?

One path is to export the model to GGUF format and serve it with llama-server from llama.cpp. That gives you an OpenAI-compatible local endpoint.

You can then point your API testing workflow at the local server. See [internal: rest-api-best-practices].

How do production LLMs handle context longer than their training window?

Common approaches include:

RoPE scaling
Sliding window attention
Retrieval-augmented generation
Summarization-based memory
External vector search

The core transformer idea remains the same, but the way position information and attention windows are handled changes.

DEV Community