Who this is for: Senior QA / Automation Engineers transitioning into AI and LLM testing. This blog is structured in two parts: first we go deep on how LLMs actually work (grounded in Andrej Karpathy's "Deep Dive into LLMs"), then we use that foundation to reason clearly about how to test them.
Understanding the internals is not optional. If you don't know why an LLM hallucinates, you can't design a test that catches it.
Table of Contents
Part 1 — How LLMs Actually Work
- What Is an LLM?
- Tokens and Tokenization
- Pre-Training — Where Knowledge Comes From
- Loss — The Compass of Training
- Neural Networks — Inside the Black Box
- Inference — How Text Gets Generated
- Why Outputs Are Non-Deterministic
- Generation Parameters — Temperature, Top-K, Top-P
- Fine-Tuning and RLHF
- Hallucinations — Why LLMs Make Things Up
- Bias — Where It Comes From
- Prompting Strategies — Zero, One, Few-Shot
PART 1 — How LLMs Actually Work
1. What Is an LLM?
At its core, a Large Language Model does exactly one thing: it predicts the next token given a sequence of preceding tokens.
That's it. Everything you see ChatGPT, Claude, or Gemini do — answer questions, write code, summarize documents, roleplay characters — emerges from one deeply trained function: what token is most likely to come next?
Think of your phone's autocomplete. When you type "I'll be there in" your keyboard suggests "five", "a few", "an hour". An LLM is that autocomplete, but trained on essentially the entire internet, with hundreds of billions of parameters, capable of maintaining coherent context across thousands of tokens.
The mental model that matters:
Input: [token_1, token_2, ..., token_n]
Output: probability distribution over ~100,000 possible next tokens
Every time the model "speaks," it's sampling from that probability distribution, appending the result to the context, and repeating. That loop is the entirety of text generation.
Why this matters for QA: The model isn't reasoning in the way a human programmer reasons. It's not executing logic. It's pattern-matching at massive scale. When it fails, it fails in pattern-matching ways — not logic errors.
2. Tokens and Tokenization
Before any text enters a neural network, it has to be converted into numbers. The process is called tokenization.
How it works
Neural networks require a finite vocabulary of discrete symbols. Raw text is converted into these symbols — called tokens — using an algorithm called Byte Pair Encoding (BPE).
Here's the pipeline:
- Start with the raw UTF-8 bytes of text (256 possible byte values).
- Find the most common consecutive byte pairs and merge them into new symbols.
- Repeat until you reach your target vocabulary size (~100,000 for GPT-4).
The result is a vocabulary where common English words become single tokens, common word-pieces become tokens, and rare or novel strings get split into multiple tokens.
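The merge step at the heart of BPE can be sketched in a few lines of Python. This is a toy illustration only: real tokenizers operate on raw bytes and learn their merges from terabytes of text, and the tiny corpus here is invented.

```python
from collections import Counter

def bpe_merge_step(tokens):
    """One BPE training step: merge the most frequent adjacent pair."""
    pairs = Counter(zip(tokens, tokens[1:]))
    best = max(pairs, key=pairs.get)                   # most common adjacent pair
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
            merged.append(tokens[i] + tokens[i + 1])   # fuse the pair into one symbol
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from individual characters (standing in for raw UTF-8 bytes).
tokens = list("low low lower lowest")
for _ in range(3):
    tokens = bpe_merge_step(tokens)
# After a few merges, the common substring "low" has become a single token.
```

Run enough merge steps on enough text and you get exactly the behavior in the examples above: frequent strings become single tokens, rare strings stay fragmented.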
GPT-4's tokenizer (cl100k_base) has a vocabulary of 100,277 tokens.
Concrete examples
"hello world" → ["hello", " world"] → [15339, 1917]
"helloworld" → ["h", "elloworld"] → [71, 96392]
"HELLO WORLD" → ["HEL", "LO", " WORLD"] → [51812, 1623, 51991]
Notice a few things:
- The space before "world" is included in the token. Spacing matters.
- Case changes the tokenization entirely.
- The same letters in a different arrangement → completely different tokens.
Why tokenization matters for QA
Tokenization is a silent source of bugs in LLM systems. The model doesn't see characters — it sees token IDs. This has concrete implications:
- Spelling tasks break: the model operates on tokens, not letters. Ask it to count the letters in "strawberry" and it often fails because "strawberry" might tokenize as ["straw", "berry"] — the model never "sees" individual letters.
- Numbers behave unexpectedly: "9.11" and "9.9" tokenize differently, and the model's "understanding" of which is larger has been shown to be influenced by how those strings appear in training data (Bible verse chapter numbers, for instance, where 9.11 > 9.9).
- Language boundary bugs: a prompt that works in English may tokenize to more tokens in another language, consuming more context window and potentially truncating critical content.
Tokenization Insight:
┌───────────────────────────────────────────────────────────────────┐
│ "strawberry" → ["straw", "berry"] → [19535, 15717]                │
│                                                                   │
│ Model perspective: Two tokens. No character-level access.         │
│ "Count the r's in strawberry" → the model guesses from patterns,  │
│ not by literally counting characters.                             │
└───────────────────────────────────────────────────────────────────┘
3. Pre-Training
Pre-training is how an LLM acquires its knowledge. It's the most expensive phase — weeks or months on thousands of GPUs — and it's where the model learns everything it knows about language, facts, reasoning patterns, code, and the world.
The data: the internet
The training corpus starts with a massive scrape of the web. For reference, Hugging Face's FineWeb dataset (an open corpus of the kind used to train models like Llama) contains approximately 15 trillion tokens (~44 terabytes of text).
But raw web data is messy. The pipeline to clean it involves multiple stages:
Raw Web Crawl (Common Crawl)
│
▼
URL Filtering (blacklists: spam, malware, adult content)
│
▼
Text Extraction (strip HTML → keep readable text)
│
▼
Language Filtering (e.g., keep pages >65% English)
│
▼
Deduplication (remove near-duplicate documents)
│
▼
PII Removal (strip addresses, SSNs, etc.)
│
▼
Final Corpus (high-quality, diverse, deduplicated text)
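A sketch of that filtering chain in Python. The document records and field names below are invented for illustration; real pipelines like FineWeb's use trained language classifiers and fuzzy deduplication at vastly larger scale.

```python
# Hypothetical document records; the field names are illustrative only.
docs = [
    {"url": "example.com/a", "text": "Good article", "english_score": 0.9},
    {"url": "spam.biz/x",    "text": "BUY NOW",      "english_score": 0.8},
    {"url": "example.com/a", "text": "Good article", "english_score": 0.9},  # duplicate
    {"url": "example.fr/b",  "text": "Bon article",  "english_score": 0.2},
]

def url_filter(doc):
    """Stand-in for a real blocklist lookup."""
    return "spam" not in doc["url"]

def language_filter(doc):
    """Keep pages classified as mostly English (threshold from the diagram)."""
    return doc["english_score"] > 0.65

def deduplicate(docs):
    """Exact-match dedup on text; real pipelines use fuzzy/near-dup matching."""
    seen, out = set(), []
    for d in docs:
        if d["text"] not in seen:
            seen.add(d["text"])
            out.append(d)
    return out

corpus = deduplicate([d for d in docs if url_filter(d) and language_filter(d)])
```

Of the four raw documents, only one survives: the spam URL, the non-English page, and the duplicate are all dropped before training.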
The training loop
Here's what actually happens during pre-training:
- Sample a window of tokens from the corpus.
- Run the network forward to predict the next token.
- Compute the loss: how much probability did the model assign to the actual next token?
- Adjust every parameter slightly so the correct token becomes more likely next time.
This loop runs billions of times across trillions of tokens. A single training run for a large model like GPT-4 might cost tens of millions of dollars and take months.
The intuition: imagine reading the entire internet, and every time you read a sentence, you predict the next word, then check if you were right, then slightly adjust your mental model to be more accurate next time. Do this trillions of times. That's pre-training.
The result is a base model — a token simulator that has internalized the statistical patterns of human language. It's not yet an assistant. It's a very sophisticated "continue this text" machine.
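To make "internalized statistical patterns" concrete, here's the smallest possible "base model": a bigram table that simply counts which token follows which. A real LLM replaces the counting with billions of learned parameters and far longer context, but the predict-the-next-token objective is the same. The corpus is invented.

```python
from collections import defaultdict, Counter

text = "the cat sat on the mat the cat ate"
tokens = text.split()

# "Training": count which token follows which. This table is the entire
# "knowledge" of our toy model.
counts = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    counts[prev][nxt] += 1

def predict_next(token):
    """Return the next-token probability distribution after `token`."""
    following = counts[token]
    total = sum(following.values())
    return {t: c / total for t, c in following.items()}

# predict_next("the") reflects the corpus statistics: "cat" followed
# "the" twice, "mat" once.
```

That's the whole idea of a base model: a statistical "continue this text" machine, just scaled up by many orders of magnitude.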
4. Loss
Loss is the single most important number during training. It answers: how wrong is the model right now?
How loss works
The neural network outputs a probability for every token in the vocabulary as the next token. The loss measures how much probability the model assigned to the correct next token.
Correct next token in corpus: " Post" (token 3962)
Model's prediction:
" Direction" → 4% probability
" Case" → 2% probability
" Post" → 3% probability ← should be HIGH
(the other 100,274 tokens share the remaining ~91%)
Loss = how surprised were we that the correct token appeared?
(formally: negative log probability of the correct token)
Low loss = high probability assigned to correct tokens = good model.
High loss = model is surprised by what actually comes next = poor model.
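The loss above is just the negative log of the probability the model assigned to the correct token. A minimal sketch using the numbers from the example:

```python
import math

def next_token_loss(predicted_probs, correct_token):
    """Cross-entropy for one prediction: -log P(correct next token)."""
    return -math.log(predicted_probs[correct_token])

# Numbers from the example above: the model gave " Post" only 3%.
probs = {" Direction": 0.04, " Case": 0.02, " Post": 0.03}
bad = next_token_loss(probs, " Post")   # ≈ 3.51: high loss, model "surprised"
good = -math.log(0.90)                  # ≈ 0.11: score of a confident, correct model
```

Training is the process of nudging parameters so that, averaged over trillions of predictions, this number goes down.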
The loss curve
Loss
│
4.0│ ●
│ ●
3.0│ ●●
│ ●●●
2.0│ ●●●●●●●
│ ●●●●●●●●●●●●
1.0│ ●●●●●●●●●●●●●●●●●●●●●●
└────────────────────────────────────────────────────────
Training Steps
A decreasing loss is a healthy training run. If loss plateaus or spikes, something is wrong — data quality issues, learning rate problems, or architecture bugs.
Why QA engineers care about loss: When evaluating a fine-tuned model, validation loss is a key health metric. If you're running A/B tests on two model versions, the one with lower validation loss on your domain-specific data will generally perform better on your use case.
5. Neural Networks
You don't need to know the math, but you do need the right mental model of what a neural network actually is.
The core idea
A neural network is a mathematical function that takes an input (your token sequence) and produces an output (probability distribution over next tokens). It has parameters — billions of numbers — that determine how inputs get transformed into outputs.
Think of it like a massive mixing console with billions of dials. Random settings → random output. Carefully tuned settings (from training) → useful predictions.
Parameters (weights): The "knowledge" of the model.
~8 billion for Llama 3 8B
~405 billion for Llama 3 405B
~1.8 trillion estimated for GPT-4
Input tokens ──────────────────────────────────────────────┐
│
┌───────────────────────────────────────────────┐ │
│ Embedding Layer │◄─┘
│ (tokens → vectors) │
└───────────────┬───────────────────────────────┘
│
┌───────────────▼───────────────────────────────┐
│ Transformer Block 1 │
│ ┌────────────┐ ┌─────────────────────────┐ │
│ │ Attention │ │ Feed-Forward (MLP) │ │
│ └────────────┘ └─────────────────────────┘ │
└───────────────┬───────────────────────────────┘
│
┌───────────────▼───────────────────────────────┐
│ Transformer Block 2 (same structure) │
└───────────────┬───────────────────────────────┘
│
[...]
│
┌───────────────▼───────────────────────────────┐
│ Output Layer (Logits → Softmax) │
└───────────────┬───────────────────────────────┘
│
▼
Probability distribution over 100,277 tokens
The attention mechanism is the key innovation in modern LLMs (from the "Attention Is All You Need" paper). It allows each token to "look at" other tokens in the context and weight their relevance. This is what gives LLMs their ability to maintain coherent context over long passages.
Important nuance: the parameters are fixed once training is done. When you're chatting with ChatGPT, no learning is happening. Those weights were locked months ago. The model is just computing — very expensively — the same mathematical function.
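The final "Logits → Softmax" step in the diagram is what converts the network's raw output scores into the probability distribution everything else in this post depends on. A minimal implementation:

```python
import math

def softmax(logits):
    """Turn raw output scores (logits) into probabilities that sum to 1."""
    m = max(logits)                            # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Invented logits for a 3-token vocabulary; real models emit ~100k logits.
probs = softmax([2.0, 1.0, 0.1])
```

Note that softmax preserves ranking: the largest logit always becomes the largest probability.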
6. Inference
Inference is what happens when you send a prompt to an LLM and get a response. Here's the exact generation loop:
Step by step with a concrete example:
Context: [91, 860, 287] = "|Viewing ing"
↓
Neural network runs forward pass
↓
Output probability vector:
" Single" → 12%
" Article" → 8%
" Post" → 7%
" Page" → 4%
... ...
↓
Sample: say we draw " Single" (token 11579)
↓
New context: [91, 860, 287, 11579] = "|Viewing ing Single"
↓
Repeat...
The context window is the model's "working memory" — everything it can see while generating the next token. For GPT-2 this was 1,024 tokens. For modern models it's 128K to 1M+ tokens. Content inside the context window is directly accessible; the model doesn't need to "remember" it from training.
Key inference insight: the model only ever appends tokens to the sequence. It can't go back and revise a previous token once it's generated. This is why LLMs sometimes talk themselves into a corner — they're committed to their prior output.
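The append-and-repeat loop above can be sketched with a toy "model": a lookup table of invented probabilities standing in for the network's forward pass.

```python
import random

# Toy stand-in for the neural network: maps the last token of the context
# to a next-token distribution. Vocabulary and probabilities are invented.
MODEL = {
    "|Viewing": {" Single": 0.4, " Article": 0.3, " Post": 0.3},
    " Single":  {" Post": 0.6, " Page": 0.4},
    " Article": {"<END>": 1.0},
    " Post":    {"<END>": 1.0},
    " Page":    {"<END>": 1.0},
}

def generate(context, max_tokens=10, seed=None):
    """Inference loop: forward pass, sample, append, repeat."""
    rng = random.Random(seed)
    context = list(context)
    for _ in range(max_tokens):
        dist = MODEL[context[-1]]                      # stand-in forward pass
        tokens, weights = zip(*dist.items())
        nxt = rng.choices(tokens, weights=weights)[0]  # sample, not argmax
        if nxt == "<END>":
            break
        context.append(nxt)                            # commit; never revised
    return context

out = generate(["|Viewing"], seed=0)
```

Notice the structural point from above: `context.append(nxt)` only ever adds tokens; there is no operation that revises an earlier one.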
7. Non-Determinism
Ask ChatGPT the same question twice. You'll likely get different answers. Why?
The sampling process
At each step, the model produces a probability distribution over the next token. It doesn't always pick the highest probability token (that would be called greedy decoding and would produce repetitive, boring text). Instead, it samples from the distribution — which introduces randomness.
Token probabilities for next token:
" apple" → 35%
" banana" → 25%
" orange" → 20%
" grape" → 15%
(others) → 5%
Greedy: always picks " apple" → deterministic, repetitive
Sampling: picks " banana" 25% of the time → varied, creative
This is the same reason the model hallucinated three different fake biographies of "Orson Kovacs" (a made-up person) in Karpathy's demo — it doesn't "know" the right answer, so it samples plausible-sounding text each time, landing on different random outputs.
The implications for QA are profound: the same prompt can yield different outputs on different runs. You cannot use simple assertEqual comparisons to verify correctness. This is the single biggest shift in testing philosophy when you move from traditional software to LLM-based systems.
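You can reproduce the greedy-vs-sampling difference with nothing but the standard library; the distribution below is the invented fruit example from above.

```python
import random

# The invented fruit distribution from the example above.
dist = {" apple": 0.35, " banana": 0.25, " orange": 0.20,
        " grape": 0.15, " kiwi": 0.05}

def greedy(dist):
    """Always pick the single most probable token: deterministic."""
    return max(dist, key=dist.get)

def sample(dist, rng):
    """Draw a token in proportion to its probability: non-deterministic."""
    tokens, weights = zip(*dist.items())
    return rng.choices(tokens, weights=weights)[0]

rng = random.Random(42)
greedy_runs = {greedy(dist) for _ in range(5)}         # always the same token
sampled_runs = {sample(dist, rng) for _ in range(50)}  # several distinct tokens
```

Five greedy runs produce one unique output; fifty sampled runs produce several. This is exactly why `assertEqual` on raw LLM output is a losing strategy.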
8. Generation Parameters
These are the knobs that control how the model samples from its probability distributions. Understanding them is essential for both building and testing LLM systems.
Temperature
Temperature controls how "flat" or "peaked" the probability distribution is before sampling.
Token probabilities BEFORE temperature:
" apple" → 35%
" banana" → 25%
" orange" → 20%
Temperature = 0.1 (LOW — more deterministic):
" apple" → 91% (dominant choice amplified)
" banana" → 6%
" orange" → 3%
→ Very predictable, somewhat repetitive output
Temperature = 1.0 (NEUTRAL):
Original distribution preserved → balanced exploration
Temperature = 2.0 (HIGH — more random):
" apple" → 12% (differences flattened)
" banana" → 11%
" orange" → 10%
→ Wildly creative, often incoherent output
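Temperature scaling is easy to implement directly. Dividing the logits by T is equivalent, for an existing probability distribution, to raising each probability to the power 1/T and renormalizing. A sketch with invented probabilities:

```python
def apply_temperature(probs, temperature):
    """Equivalent to dividing logits by T: raise probs to 1/T, renormalize."""
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: v / total for t, v in scaled.items()}

probs = {" apple": 0.35, " banana": 0.25, " orange": 0.20,
         " grape": 0.15, " kiwi": 0.05}
cold = apply_temperature(probs, 0.1)   # low T: " apple" dominates
hot = apply_temperature(probs, 2.0)    # high T: the gaps shrink
```

At T=0.1 the leading token absorbs nearly all the probability mass; at T=2.0 the distribution flattens toward uniform, matching the behavior shown above.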
Rule of thumb:
- Factual Q&A, code generation → temperature: 0.1–0.3
- Creative writing, brainstorming → temperature: 0.7–1.0
- Random/experimental output → temperature > 1.0 (usually a mistake)
Top-K
Limits sampling to the K most probable tokens. All others are zeroed out.
Top-K = 3:
Only sample from [" apple", " banana", " orange"]
Tokens ranked 4th and below are excluded
Effect: Prevents very unlikely tokens from ever being sampled.
Can make output feel more constrained.
Top-P (Nucleus Sampling)
Instead of a fixed K, samples from the smallest set of tokens whose cumulative probability exceeds P.
Top-P = 0.9:
Add tokens by probability until cumulative sum ≥ 90%
" apple" → 35% (sum: 35%)
" banana" → 25% (sum: 60%)
" orange" → 20% (sum: 80%)
" grape" → 15% (sum: 95%) ← crosses 90% here
Sample only from {" apple", " banana", " orange", " grape"}
Top-P is generally preferred over Top-K because it adapts to the actual probability distribution. When the model is confident (one token dominates), the nucleus is small. When the model is uncertain, the nucleus expands.
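Nucleus sampling is a short function: sort tokens by probability, accumulate until the threshold is crossed, then renormalize what remains. A sketch using the same invented distribution:

```python
def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose cumulative prob reaches p."""
    nucleus, cumulative = {}, 0.0
    for token, prob in sorted(probs.items(), key=lambda kv: -kv[1]):
        nucleus[token] = prob
        cumulative += prob
        if cumulative >= p:                 # nucleus is complete
            break
    total = sum(nucleus.values())
    return {t: pr / total for t, pr in nucleus.items()}   # renormalize

probs = {" apple": 0.35, " banana": 0.25, " orange": 0.20,
         " grape": 0.15, " kiwi": 0.05}
nucleus = top_p_filter(probs, p=0.9)   # " kiwi" falls outside the nucleus
```

With p=0.9 the nucleus is the four tokens from the worked example above; the 5% tail token is excluded and can never be sampled.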
Parameters summary
| Parameter | Low Value | High Value | QA Implication |
|---|---|---|---|
| Temperature | Predictable, deterministic | Random, creative | Low temp → easier to test; High temp → need more runs |
| Top-K | Few token candidates | Many token candidates | Lower K → more consistent outputs |
| Top-P | Small nucleus (confident choices) | Large nucleus (broad choices) | Lower P → less variance in outputs |
9. Fine-Tuning and RLHF
A pre-trained base model is brilliant but unusable. It doesn't answer questions — it just "continues" text in the style of the internet. Turning it into an assistant requires two more training stages.
Stage 2: Supervised Fine-Tuning (SFT)
The training procedure is identical to pre-training — same algorithm, same loss function. The only change is the dataset.
Instead of internet documents, the training data is now human-curated conversations:
[
{
"role": "user",
"content": "What's the capital of France?"
},
{
"role": "assistant",
"content": "The capital of France is Paris."
}
]
Millions of such conversations, written by paid expert annotators following detailed labeling guidelines, are used to teach the model to adopt the "assistant" persona and response format.
The limitation of SFT: the model imitates human experts. It can never exceed human performance on tasks where the human labeler was the ceiling. And the labeler doesn't always know the optimal solution — especially for math problems where the best "chain of thought" for a human differs from what works best for the model.
Stage 3: Reinforcement Learning from Human Feedback (RLHF)
This is where the model learns to discover solutions on its own through trial and error.
The model generates many candidate responses, checks which ones are correct (or preferred), and updates its parameters to make the correct responses more likely. Crucially, no human is writing the solutions — the model discovers them itself.
This is analogous to how DeepMind's AlphaGo went from "imitating human moves" (SFT) to "discovering move 37" — a move no human would make, but which emerged from RL because it statistically led to winning.
The result of RLHF is what you interact with on ChatGPT: a model that doesn't just imitate — it has developed internal "reasoning strategies" that it discovered were effective.
The three-stage summary:
| Stage | Data | Goal | Analogy |
|---|---|---|---|
| Pre-Training | Internet documents | Build knowledge | Reading every textbook |
| SFT | Human-curated conversations | Become an assistant | Studying worked examples |
| RLHF | Self-generated (trial & error) | Discover effective strategies | Doing practice problems |
10. Hallucinations
This is where things get uncomfortable — and where most teams are surprised the first time they encounter it in production.
Why hallucinations happen
The model doesn't have an "I don't know" default. It was trained on data where questions of the form "Who is X?" are answered confidently with correct answers. So when you ask "Who is Orson Kovacs?" (a made-up person), the model doesn't say "I don't know" — it samples the most statistically likely continuation of a "Who is X?" prompt, which happens to sound like a confident biographical description.
Training data pattern:
"Who is Tom Cruise?" → "[confident answer about Tom Cruise]"
"Who is John Barrasso?" → "[confident answer about Senator Barrasso]"
"Who is Genghis Khan?" → "[confident answer about Mongol ruler]"
Learned behavior:
"Who is Orson Kovacs?" → "[confident answer about... someone invented on the spot]"
The model is not "lying". It's doing exactly what it was trained to do: produce the statistically most likely token sequence given the context. It just happens that the most likely token sequence for "Who is [unknown person]?" in its training data was a confident-sounding response.
The deeper issue
Even when internal network activations may "know" the answer is uncertain, that knowledge isn't wired to the output. The model has no direct mechanism to surface its own uncertainty unless it was explicitly trained on examples where "I don't know" was the labeled correct answer.
Modern mitigations
Epistemic training: interrogate the model on thousands of factual questions, identify which it gets consistently wrong, then add "I don't know" responses for those to the training data.
Tool use: give the model a <SEARCH_START>/<SEARCH_END> token protocol. When uncertain, it can emit a search query, retrieve web results, and place them into its context window. The context window functions as working memory — anything in it is directly accessible, unlike knowledge in parameters, which is more like vague long-term memory.
Knowledge in parameters = vague recollection (what you remember from something you read months ago)
Knowledge in context window = working memory (what's right in front of you)
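The epistemic-training idea also suggests a testing technique you can sketch today: ask the same factual question several times and treat inconsistent answers as a "the model doesn't know" signal. The model call below is a deterministic toy stub, not a real API; the questions and canned answers are invented.

```python
import itertools

_calls = itertools.count()

def stub_model(question):
    """Deterministic stand-in for an LLM call (not a real API)."""
    known = {"What's the capital of France?": "Paris"}
    if question in known:
        return known[question]
    # Invented person: each call "hallucinates" a different biography.
    bios = ["a film director", "a chemist", "a footballer"]
    return bios[next(_calls) % len(bios)]

def seems_to_know(question, runs=5):
    """Flag questions where repeated answers disagree with each other."""
    answers = {stub_model(question) for _ in range(runs)}
    return len(answers) == 1          # disagreement signals "doesn't know"
```

The same consistency probe, pointed at a real model with temperature above zero, is one practical way to surface hallucination-prone questions before users do.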
11. Bias
LLMs absorb bias from three sources:
1. Training data bias
The internet over-represents certain perspectives: English speakers, Western cultures, certain age demographics, certain political viewpoints. If 90% of web pages in the training corpus express opinion X on a topic, the model will tend toward X.
A model trained primarily on English web data will perform worse on low-resource languages. A model trained on Wikipedia will reflect the coverage biases in Wikipedia. These aren't bugs per se — they're statistical reflections of the data.
2. Labeler bias
During SFT and RLHF, human annotators make judgment calls. Their cultural background, political views, and personal style preferences all influence what gets labeled as "ideal" responses. Annotator guidelines try to minimize this, but can't eliminate it.
3. Amplification through sampling
Because the model tends toward the mean of its training distribution, it can amplify stereotypes that are statistically common in training data even if they're not normatively accurate. If "CEO" in training data is overwhelmingly paired with male pronouns, the model will associate CEO with male pronouns even if no one explicitly programmed that association.
Why this matters for QA: bias is hard to test with unit tests. It shows up in aggregate — across thousands of test cases, certain demographic groups, certain topic areas. Your testing strategy needs to explicitly probe for it.
12. Prompting Strategies
The way you frame a prompt dramatically affects the model's output. This is one of the most practically important concepts for QA engineers to understand, because your prompt design becomes part of your test case design.
Zero-Shot Prompting
No examples provided. Just the task description.
Classify the sentiment of the following review as POSITIVE, NEGATIVE, or NEUTRAL:
"The delivery was late but the product itself was excellent."
Use when: the task is simple and well-represented in training data. The model has seen many examples of sentiment classification during pre-training.
Limitation: the model must infer the desired output format entirely from context. Ambiguous instructions produce inconsistent formatting.
One-Shot Prompting
One example provided before the actual task.
Classify the sentiment of the following review:
Review: "Absolutely loved the packaging and the smell. Will buy again!"
Sentiment: POSITIVE
Review: "The delivery was late but the product itself was excellent."
Sentiment:
Use when: you need a specific output format the model might not default to, or for edge cases where the classification is ambiguous and you want to demonstrate intent.
Few-Shot Prompting
Multiple examples (typically 3–10) before the task.
Classify the sentiment of the following review:
Review: "Absolutely loved the packaging." → POSITIVE
Review: "Took 3 weeks to arrive and was damaged." → NEGATIVE
Review: "Does what it says, nothing more." → NEUTRAL
Review: "Good price, but customer service was horrible." → MIXED
Review: "The delivery was late but the product itself was excellent."
Sentiment:
Use when: tasks are complex, output format needs to be precise, or the model needs to learn a classification scheme that goes beyond what's common in its training data (e.g., your company's specific taxonomy).
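Since few-shot prompts are just assembled strings, it's worth templating them rather than hand-writing each one. A hypothetical helper (the function name and output format are illustrative, mirroring the examples above, and not tied to any provider's API):

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot classification prompt from labeled examples."""
    lines = [instruction, ""]
    for text, label in examples:                    # demonstration pairs
        lines.append(f'Review: "{text}" → {label}')
    lines.append(f'Review: "{query}"')              # the actual task
    lines.append("Sentiment:")                      # cue the answer slot
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of the following review:",
    [("Absolutely loved the packaging.", "POSITIVE"),
     ("Took 3 weeks to arrive and was damaged.", "NEGATIVE")],
    "The delivery was late but the product itself was excellent.",
)
```

Templating like this keeps the example set version-controlled and testable, which matters once prompts become part of your test-case design.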
The QA angle on prompting
Every prompt you write is a specification. It deserves the same rigor as any test specification:
- Is it unambiguous? Can the model interpret the instruction in multiple ways?
- Does the example cover edge cases? One good example often does more than five generic ones.
- Is the output format specified? If you need JSON, say so explicitly.
- How robust is it to variations? If the input contains typos, does the prompt still work?
📚 References & Further Reading
If you want to go deeper, these are the few resources that actually matter:
- How LLMs Work (Karpathy-style breakdown PDF)
- Andrej Karpathy – Intro to LLMs (YouTube)
- Attention Is All You Need (Transformer paper)
- GPT-3 Paper (few-shot learning)
- OpenAI tiktoken (how tokenization actually works)
💡 Suggested Reading Flow
If you actually want to understand this space:
- Start with Karpathy (intuition first)
- Move to Transformers + GPT papers (core mechanics)
- Learn tokenization (how models see text)
- Understand decoding (why outputs vary)
- Study RLHF (why models behave like assistants)
- Focus on evals + hallucination (this is where QA adds real value)


