DEV Community: vipin singh

From Tokens to Test Suites: Understanding How LLMs Work for QA Engineers

vipin singh — Wed, 15 Apr 2026 04:38:28 +0000

Who this is for: Senior QA / Automation Engineers transitioning into AI and LLM testing. This blog is structured in two parts: first we go deep on how LLMs actually work (grounded in Andrej Karpathy's "Deep Dive into LLMs"), then we use that foundation to reason clearly about how to test them.

Understanding the internals is not optional. If you don't know why an LLM hallucinates, you can't design a test that catches it.

Part 1 — How LLMs Actually Work

What Is an LLM?
Tokens and Tokenization
Pre-Training — Where Knowledge Comes From
Loss — The Compass of Training
Neural Networks — Inside the Black Box
Inference — How Text Gets Generated
Why Outputs Are Non-Deterministic
Generation Parameters — Temperature, Top-K, Top-P
Fine-Tuning and RLHF
Hallucinations — Why LLMs Make Things Up
Bias — Where It Comes From
Prompting Strategies — Zero, One, Few-Shot

PART 1 — How LLMs Actually Work

1. What Is an LLM?

At its core, a Large Language Model does exactly one thing: it predicts the next token given a sequence of preceding tokens.

That's it. Everything you see ChatGPT, Claude, or Gemini do — answer questions, write code, summarize documents, roleplay characters — emerges from one deeply trained function: what token is most likely to come next?

Think of your phone's autocomplete. When you type "I'll be there in" your keyboard suggests "five", "a few", "an hour". An LLM is that autocomplete, but trained on essentially the entire internet, with hundreds of billions of parameters, capable of maintaining coherent context across thousands of tokens.

The mental model that matters:

Input:  [token_1, token_2, ..., token_n]
Output: probability distribution over ~100,000 possible next tokens

Every time the model "speaks," it's sampling from that probability distribution, appending the result to the context, and repeating. That loop is the entirety of text generation.

Why this matters for QA: The model isn't reasoning in the way a human programmer reasons. It's not executing logic. It's pattern-matching at massive scale. When it fails, it fails in pattern-matching ways — not logic errors.

2. Tokens and Tokenization

Before any text enters a neural network, it has to be converted into numbers. The process is called tokenization.

How it works

Neural networks require a finite vocabulary of discrete symbols. Raw text is converted into these symbols — called tokens — using an algorithm called Byte Pair Encoding (BPE).

Here's the pipeline:

Start with the raw UTF-8 bytes of text (256 possible byte values).
Find the most common consecutive byte pairs and merge them into new symbols.
Repeat until you reach your target vocabulary size (~100,000 for GPT-4).

The result is a vocabulary where common English words become single tokens, common word-pieces become tokens, and rare or novel strings get split into multiple tokens.
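The merge procedure above can be sketched in a few lines. This is a toy version, assuming a tiny training string and a hypothetical `trainToyBpe` helper; production tokenizers add pre-tokenization rules and train on far larger corpora.

```javascript
// Toy byte pair encoding: repeatedly merge the most frequent adjacent pair.
// Illustrative only; function names here are hypothetical, not a library API.

// Count how often each adjacent pair of symbol IDs occurs.
function countPairs(ids) {
  const counts = new Map();
  for (let i = 0; i < ids.length - 1; i++) {
    const key = `${ids[i]},${ids[i + 1]}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Replace every occurrence of the pair (a, b) with the new symbol ID.
function mergePair(ids, a, b, newId) {
  const out = [];
  let i = 0;
  while (i < ids.length) {
    if (i < ids.length - 1 && ids[i] === a && ids[i + 1] === b) {
      out.push(newId);
      i += 2;
    } else {
      out.push(ids[i]);
      i += 1;
    }
  }
  return out;
}

// Learn up to `numMerges` merges starting from raw UTF-8 bytes (IDs 0..255).
function trainToyBpe(text, numMerges) {
  let ids = Array.from(new TextEncoder().encode(text));
  const merges = []; // each entry: { pair: [a, b], newId }
  for (let step = 0; step < numMerges; step++) {
    let bestKey = null;
    let bestCount = 1; // only merge pairs that occur at least twice
    for (const [key, count] of countPairs(ids)) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    if (bestKey === null) break; // nothing left worth merging
    const [a, b] = bestKey.split(',').map(Number);
    const newId = 256 + merges.length; // new symbols start after the 256 byte values
    ids = mergePair(ids, a, b, newId);
    merges.push({ pair: [a, b], newId });
  }
  return { ids, merges };
}

const { ids, merges } = trainToyBpe('low lower lowest low low', 3);
console.log(`learned ${merges.length} merges, sequence length is now ${ids.length}`);
```

Each merge shortens the sequence and grows the vocabulary by one symbol, which is the trade-off a real tokenizer tunes when it picks a target vocabulary size.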

GPT-4 uses exactly 100,277 tokens.

Concrete examples

"hello world"     → ["hello", " world"]        → [15339, 1917]
"helloworld"      → ["h", "elloworld"]          → [71, 96392]
"HELLO WORLD"     → ["HEL", "LO", " WORLD"]     → [51812, 1623, 51991]

Notice a few things:

The space before "world" is included in the token. Spacing matters.
Case changes the tokenization entirely.
The same letters in a different arrangement → completely different tokens.

Why tokenization matters for QA

Tokenization is a silent source of bugs in LLM systems. The model doesn't see characters — it sees token IDs. This has concrete implications:

Spelling tasks break: the model operates on tokens, not letters. Ask it to count the letters in "strawberry" and it often fails because "strawberry" might tokenize as ["straw", "berry"] — the model never "sees" individual letters.
Numbers behave unexpectedly: "9.11" and "9.9" tokenize differently, and the model's "understanding" of which is larger has been shown to be influenced by how those strings appear in training data (Bible verse chapter numbers, for instance, where 9.11 > 9.9).
Language boundary bugs: a prompt that works in English may tokenize to more tokens in another language, consuming more context window and potentially truncating critical content.

Tokenization Insight:
┌───────────────────────────────────────────────────────────────────┐
│  "strawberry"  →  ["straw", "berry"]  →  [19535, 15717]           │
│                                                                   │
│  Model perspective: Two tokens. No character-level access.        │
│  "Count the r's in strawberry" → the model guesses from patterns, │
│  not by literally counting characters.                            │
└───────────────────────────────────────────────────────────────────┘
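Since the model cannot be relied on to count characters, the expected value for a character-level test is better computed in ordinary code and then compared with the model's answer. A minimal sketch, with `countLetter` as a hypothetical helper name:

```javascript
// Ground-truth letter count computed in code, to compare against a model's answer.
function countLetter(word, letter) {
  const target = letter.toLowerCase();
  return [...word.toLowerCase()].filter((ch) => ch === target).length;
}

const expected = countLetter('strawberry', 'r');
console.log(`expected count of "r": ${expected}`);
```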

3. Pre-Training

Pre-training is how an LLM acquires its knowledge. It's the most expensive phase — weeks or months on thousands of GPUs — and it's where the model learns everything it knows about language, facts, reasoning patterns, code, and the world.

The data: the internet

The training corpus starts with a massive scrape of the web. For reference, Hugging Face's FineWeb dataset contains approximately 15 trillion tokens (~44 terabytes of text).

But raw web data is messy. The pipeline to clean it involves multiple stages:

Raw Web Crawl (Common Crawl)
         │
         ▼
   URL Filtering (blacklists: spam, malware, adult content)
         │
         ▼
   Text Extraction (strip HTML → keep readable text)
         │
         ▼
   Language Filtering (e.g., keep pages >65% English)
         │
         ▼
   Deduplication (remove near-duplicate documents)
         │
         ▼
   PII Removal (strip addresses, SSNs, etc.)
         │
         ▼
   Final Corpus (high-quality, diverse, deduplicated text)
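A few of these stages can be sketched as plain functions. This is a heavily simplified illustration: the blocklist domains, the sample pages, and every function name are assumptions, the language filter is omitted because it needs a trained classifier, and real pipelines use fuzzy deduplication rather than the exact match shown here.

```javascript
// Hypothetical blocklist; real pipelines use large curated lists.
const BLOCKED_DOMAINS = new Set(['spam.example', 'malware.example']);

// URL filtering: drop pages hosted on blocked domains.
function urlAllowed(url) {
  return !BLOCKED_DOMAINS.has(new URL(url).hostname);
}

// Text extraction (very rough): drop tags and collapse whitespace.
function extractText(html) {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// PII removal (toy): mask strings shaped like US Social Security numbers.
function maskPii(text) {
  return text.replace(/\b\d{3}-\d{2}-\d{4}\b/g, '[REDACTED]');
}

// Deduplication (toy): exact match after lowercasing, first copy wins.
function dedupe(docs) {
  const seen = new Set();
  return docs.filter((doc) => {
    const key = doc.text.toLowerCase();
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

function cleanCorpus(pages) {
  const docs = pages
    .filter((page) => urlAllowed(page.url))
    .map((page) => ({ url: page.url, text: maskPii(extractText(page.html)) }));
  return dedupe(docs);
}

const cleaned = cleanCorpus([
  { url: 'https://blog.example/a', html: '<p>Hello   <b>world</b></p>' },
  { url: 'https://mirror.example/a', html: '<div>hello world</div>' },
  { url: 'https://spam.example/x', html: '<p>Buy now</p>' },
  { url: 'https://blog.example/b', html: '<p>SSN 123-45-6789 on file</p>' },
]);
console.log(cleaned.map((doc) => doc.text));
```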

The training loop

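Here's what actually happens during pre-training: the model reads a window of tokens from the corpus, predicts the next token, compares that prediction with the token that actually follows, and nudges its parameters so the correct token becomes slightly more likely.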

This loop runs billions of times across trillions of tokens. A single training run for a large model like GPT-4 might cost tens of millions of dollars and take months.

The intuition: imagine reading the entire internet, and every time you read a sentence, you predict the next word, then check if you were right, then slightly adjust your mental model to be more accurate next time. Do this trillions of times. That's pre-training.
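The predict, check, adjust cycle can be shown with a toy model. The sketch below is an assumption-laden stand-in: it trains a bigram table (one row of logits per previous token) with plain gradient descent on a twelve-token corpus, whereas a real LLM trains a transformer over trillions of tokens. The shape of the loop is the same.

```javascript
// Toy next-token training loop: a bigram model, not a transformer.

function softmax(logits) {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

function trainBigram(tokens, vocabSize, epochs, learningRate) {
  // weights[prev][next] is the logit for `next` following `prev`.
  // All zeros means the model starts with a uniform guess.
  const weights = Array.from({ length: vocabSize }, () => new Array(vocabSize).fill(0));
  const lossHistory = [];
  for (let epoch = 0; epoch < epochs; epoch++) {
    let totalLoss = 0;
    for (let i = 0; i < tokens.length - 1; i++) {
      const prev = tokens[i];
      const target = tokens[i + 1];
      const probs = softmax(weights[prev]); // 1. predict the next token
      totalLoss += -Math.log(probs[target]); // 2. measure how surprised we were
      for (let j = 0; j < vocabSize; j++) {
        // 3. adjust: gradient of cross-entropy w.r.t. logit j is (p_j - 1[j == target])
        const grad = probs[j] - (j === target ? 1 : 0);
        weights[prev][j] -= learningRate * grad;
      }
    }
    lossHistory.push(totalLoss / (tokens.length - 1));
  }
  return { weights, lossHistory };
}

const corpus = [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 1, 2];
const { lossHistory } = trainBigram(corpus, 3, 50, 0.5);
console.log(
  `average loss: first epoch ${lossHistory[0].toFixed(3)}, ` +
    `last epoch ${lossHistory[lossHistory.length - 1].toFixed(3)}`
);
```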

The result is a base model — a token simulator that has internalized the statistical patterns of human language. It's not yet an assistant. It's a very sophisticated "continue this text" machine.

4. Loss

Loss is the single most important number during training. It answers: how wrong is the model right now?

How loss works

The neural network outputs a probability for every token in the vocabulary as the next token. The loss measures how much probability the model assigned to the correct next token.

Correct next token in corpus: " Post" (token 3962)

Model's prediction:
  " Direction"  →  4%  probability
  " Case"       →  2%  probability
  " Post"       →  3%  probability  ← should be HIGH
  (other 100,274 tokens share the remaining ~91%)

Loss = how surprised were we that the correct token appeared?
       (formally: negative log probability of the correct token)

Low loss = high probability assigned to correct tokens = good model.

High loss = model is surprised by what actually comes next = poor model.
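To make the formula concrete, here is the per-token loss for the example above, where the model gave the correct token " Post" a probability of 3%. The function name is a hypothetical helper; the calculation is just the negative natural log.

```javascript
// Cross-entropy loss for one prediction: negative natural log of the
// probability the model assigned to the token that actually came next.
function nextTokenLoss(probOfCorrectToken) {
  return -Math.log(probOfCorrectToken);
}

const surprised = nextTokenLoss(0.03); // the " Post" example: only 3% on the right token
const confident = nextTokenLoss(0.9); // a confident, correct prediction
console.log(surprised.toFixed(2), confident.toFixed(2));
```

The 3% prediction scores a loss of roughly 3.5, while a 90% prediction scores about 0.1, which is why a falling loss curve means the model is putting more probability on the right tokens.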

The loss curve

Loss
  │
4.0│ ●
   │  ●
3.0│    ●●
   │       ●●●
2.0│           ●●●●●●●
   │                  ●●●●●●●●●●●●
1.0│                               ●●●●●●●●●●●●●●●●●●●●●●
   └────────────────────────────────────────────────────────
                            Training Steps

A steadily decreasing loss is the sign of a healthy training run. If loss spikes, diverges, or flattens out much earlier than expected, something is wrong: data quality issues, learning rate problems, or architecture bugs.

Why QA engineers care about loss: When evaluating a fine-tuned model, validation loss is a key health metric. If you're running A/B tests on two model versions, the one with lower validation loss on your domain-specific data will generally perform better on your use case, though you should confirm that with task-level evaluations rather than rely on loss alone.

5. Neural Networks

You don't need to know the math, but you do need the right mental model of what a neural network actually is.

The core idea

A neural network is a mathematical function that takes an input (your token sequence) and produces an output (probability distribution over next tokens). It has parameters — billions of numbers — that determine how inputs get transformed into outputs.

Think of it like a massive mixing console with billions of dials. Random settings → random output. Carefully tuned settings (from training) → useful predictions.

Parameters (weights):  The "knowledge" of the model.
                        ~8 billion for Llama 3 8B
                        ~405 billion for Llama 3 405B
                        ~1.8 trillion estimated for GPT-4

Input tokens ──────────────────────────────────────────────┐
                                                           │
        ┌───────────────────────────────────────────────┐  │
        │  Embedding Layer                              │◄─┘
        │  (tokens → vectors)                           │
        └───────────────┬───────────────────────────────┘
                        │
        ┌───────────────▼───────────────────────────────┐
        │  Transformer Block 1                          │
        │  ┌────────────┐  ┌─────────────────────────┐  │
        │  │  Attention │  │  Feed-Forward (MLP)     │  │
        │  └────────────┘  └─────────────────────────┘  │
        └───────────────┬───────────────────────────────┘
                        │
        ┌───────────────▼───────────────────────────────┐
        │  Transformer Block 2  (same structure)        │
        └───────────────┬───────────────────────────────┘
                        │
                      [...]
                        │
        ┌───────────────▼───────────────────────────────┐
        │  Output Layer (Logits → Softmax)              │
        └───────────────┬───────────────────────────────┘
                        │
                        ▼
        Probability distribution over 100,277 tokens

The attention mechanism is the key innovation in modern LLMs (from the "Attention Is All You Need" paper). It allows each token to "look at" other tokens in the context and weight their relevance. This is what gives LLMs their ability to maintain coherent context over long passages.

Important nuance: the parameters are fixed once training is done. When you're chatting with ChatGPT, no learning is happening. Those weights were locked months ago. The model is just computing — very expensively — the same mathematical function.

6. Inference

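Inference is what happens when you send a prompt to an LLM and get a response. The generation loop is: run a forward pass over the current context, sample one token from the resulting distribution, append it to the context, and repeat.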

Step by step with a concrete example:

Context:   [91, 860, 287]   =  "|Viewing"
                                       ↓
Neural network runs forward pass
                                       ↓
Output probability vector:
  " Single"  → 12%
  " Article" → 8%
  " Post"    → 7%
  " Page"    → 4%
  ...         ...
                                       ↓
Sample: say we draw " Single" (token 11579)
                                       ↓
New context: [91, 860, 287, 11579] = "|Viewing Single"
                                       ↓
Repeat...

The context window is the model's "working memory" — everything it can see while generating the next token. For GPT-2 this was 1,024 tokens. For modern models it's 128K to 1M+ tokens. Content inside the context window is directly accessible; the model doesn't need to "remember" it from training.

Key inference insight: the model only ever appends tokens to the sequence. It can't go back and revise a previous token once it's generated. This is why LLMs sometimes talk themselves into a corner — they're committed to their prior output.
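The append-only loop can be sketched with a stand-in model. Everything here is an assumption for illustration: `toyModel` is a hypothetical three-token lookup table rather than a neural network, and the seeded random number generator exists only to make the demo repeatable.

```javascript
// Small seeded PRNG (mulberry32-style) so the demo is repeatable.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Draw one index from a probability array using a random number in [0, 1).
function sampleIndex(probs, rng) {
  const r = rng();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating-point round-off
}

// Hypothetical stand-in for the network: maps a context to next-token probabilities.
// It only looks at the last token; a real model attends to the whole context.
function toyModel(context) {
  const table = {
    0: [0.1, 0.6, 0.3],
    1: [0.2, 0.1, 0.7],
    2: [0.5, 0.4, 0.1],
  };
  return table[context[context.length - 1]];
}

// The generation loop: forward pass, sample, append, repeat.
function generate(model, prompt, maxNewTokens, rng) {
  const context = [...prompt];
  for (let step = 0; step < maxNewTokens; step++) {
    const probs = model(context);
    const next = sampleIndex(probs, rng);
    context.push(next); // append only; earlier tokens are never revised
  }
  return context;
}

const output = generate(toyModel, [0], 5, makeRng(42));
console.log(output);
```

Re-running with the same seed reproduces the same sequence, which is the idea behind fixing a seed (where an API offers one) to make LLM tests more repeatable.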

7. Non-Determinism

Ask ChatGPT the same question twice. You'll likely get different answers. Why?

The sampling process

At each step, the model produces a probability distribution over the next token. It doesn't always pick the highest probability token (that would be called greedy decoding and would produce repetitive, boring text). Instead, it samples from the distribution — which introduces randomness.

Token probabilities for next token:
  " apple"  → 35%
  " banana" → 25%
  " orange" → 20%
  " grape"  → 15%
  (others)  → 5%

Greedy:  always picks " apple" → deterministic, repetitive
Sampling: picks " banana" 25% of the time → varied, creative

This is the same reason the model hallucinated three different fake biographies of "Orson Kovacs" (a made-up person) in Karpathy's demo — it doesn't "know" the right answer, so it samples plausible-sounding text each time, landing on different random outputs.

The implications for QA are profound: the same prompt can yield different outputs on different runs. You cannot use simple assertEqual comparisons to verify correctness. This is the single biggest shift in testing philosophy when you move from traditional software to LLM-based systems.
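One way to express that shift is to assert on a property of the output across several runs instead of on an exact string. The sketch below uses a hypothetical `fakeLlm` stub in place of a real API call, and the 95% threshold is an arbitrary example, not a recommendation.

```javascript
// Hypothetical stand-in for an LLM call: same prompt, varying wording.
function fakeLlm(prompt) {
  const phrasings = [
    'The capital of France is Paris.',
    'Paris is the capital of France.',
    "France's capital city is Paris.",
  ];
  return phrasings[Math.floor(Math.random() * phrasings.length)];
}

// Run the prompt several times and report the share of runs that satisfy a property.
function passRate(callModel, prompt, runs, property) {
  let passed = 0;
  for (let i = 0; i < runs; i++) {
    if (property(callModel(prompt))) passed++;
  }
  return passed / runs;
}

// The property under test: the answer mentions Paris, whatever the wording.
const mentionsParis = (text) => /\bParis\b/.test(text);

const rate = passRate(fakeLlm, 'What is the capital of France?', 20, mentionsParis);
console.log(rate >= 0.95 ? 'PASS' : 'FAIL', `pass rate ${rate}`);
```

An exact-match assertion against any single phrasing would fail on most runs here even though every answer is correct.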

8. Generation Parameters

These are the knobs that control how the model samples from its probability distributions. Understanding them is essential for both building and testing LLM systems.

Temperature

Temperature controls how "flat" or "peaked" the probability distribution is before sampling.

Token probabilities BEFORE temperature:
  " apple"  → 35%
  " banana" → 25%
  " orange" → 20%

Temperature = 0.1 (LOW — more deterministic):
  " apple"  → 91%  (dominant choice amplified)
  " banana" → 6%
  " orange" → 3%
  → Very predictable, somewhat repetitive output

Temperature = 1.0 (NEUTRAL):
  Original distribution preserved → balanced exploration

Temperature = 2.0 (HIGH — more random):
  " apple"  → 12%  (differences flattened)
  " banana" → 11%
  " orange" → 10%
  → Wildly creative, often incoherent output

Rule of thumb:

Factual Q&A, code generation → temperature: 0.1–0.3
Creative writing, brainstorming → temperature: 0.7–1.0
Random/experimental output → temperature > 1.0 (usually a mistake)
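The effect of temperature can be checked numerically. Dividing logits by T before the softmax is equivalent to raising each probability to the power 1/T and renormalizing, which is what this sketch does. Lumping "everything else" into a single 5% outcome is a simplifying assumption, so the printed values will differ from the rounded illustrative percentages above; the direction of the effect is the point.

```javascript
// Rescale a probability distribution by temperature.
// Equivalent to softmax(logits / T) when probs = softmax(logits).
function applyTemperature(probs, temperature) {
  const scaled = probs.map((p) => Math.pow(p, 1 / temperature));
  const sum = scaled.reduce((a, b) => a + b, 0);
  return scaled.map((s) => s / sum);
}

// apple, banana, orange, grape, and all other tokens lumped together
const base = [0.35, 0.25, 0.2, 0.15, 0.05];

for (const t of [0.1, 1.0, 2.0]) {
  const adjusted = applyTemperature(base, t).map((p) => p.toFixed(3));
  console.log(`T=${t}:`, adjusted.join(', '));
}
```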

Top-K

Limits sampling to the K most probable tokens. All others are zeroed out.

Top-K = 3:
  Only sample from [" apple", " banana", " orange"]
  Tokens ranked 4th and below are excluded

Effect: Prevents very unlikely tokens from ever being sampled.
        Can make output feel more constrained.

Top-P (Nucleus Sampling)

Instead of a fixed K, samples from the smallest set of tokens whose cumulative probability exceeds P.

Top-P = 0.9:
  Add tokens by probability until cumulative sum ≥ 90%

  " apple"  → 35%  (sum: 35%)
  " banana" → 25%  (sum: 60%)
  " orange" → 20%  (sum: 80%)
  " grape"  → 15%  (sum: 95%)  ← crosses 90% here

  Sample only from {" apple", " banana", " orange", " grape"}

Top-P is generally preferred over Top-K because it adapts to the actual probability distribution. When the model is confident (one token dominates), the nucleus is small. When the model is uncertain, the nucleus expands.
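Both filters are short enough to sketch directly. The function names and the token/probability object shape are assumptions for illustration; inference libraries implement the same idea on logit tensors. The nucleus cut-off here uses "reaches or exceeds P", matching the worked example.

```javascript
// Keep the k most probable tokens, drop the rest, renormalize.
function topK(dist, k) {
  const kept = [...dist].sort((a, b) => b.prob - a.prob).slice(0, k);
  const total = kept.reduce((sum, t) => sum + t.prob, 0);
  return kept.map((t) => ({ token: t.token, prob: t.prob / total }));
}

// Keep the smallest set of top tokens whose cumulative probability reaches p, renormalize.
function topP(dist, p) {
  const sorted = [...dist].sort((a, b) => b.prob - a.prob);
  const kept = [];
  let cumulative = 0;
  for (const t of sorted) {
    kept.push(t);
    cumulative += t.prob;
    if (cumulative >= p - 1e-12) break; // small tolerance for float round-off
  }
  const total = kept.reduce((sum, t) => sum + t.prob, 0);
  return kept.map((t) => ({ token: t.token, prob: t.prob / total }));
}

const dist = [
  { token: ' apple', prob: 0.35 },
  { token: ' banana', prob: 0.25 },
  { token: ' orange', prob: 0.2 },
  { token: ' grape', prob: 0.15 },
  { token: '(others)', prob: 0.05 },
];

console.log('top-k (k=3):', topK(dist, 3).map((t) => t.token));
console.log('top-p (p=0.9):', topP(dist, 0.9).map((t) => t.token));

// When one token dominates, the nucleus shrinks on its own.
const confident = [
  { token: ' yes', prob: 0.95 },
  { token: ' no', prob: 0.05 },
];
console.log('top-p on a confident distribution:', topP(confident, 0.9).map((t) => t.token));
```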

Parameters summary

| Parameter | Low Value | High Value | QA Implication |
|---|---|---|---|
| Temperature | Predictable, deterministic | Random, creative | Low temp → easier to test; High temp → need more runs |
| Top-K | Few token candidates | Many token candidates | Lower K → more consistent outputs |
| Top-P | Small nucleus (confident choices) | Large nucleus (broad choices) | Lower P → less variance in outputs |

9. Fine-Tuning and RLHF

A pre-trained base model is brilliant but unusable. It doesn't answer questions — it just "continues" text in the style of the internet. Turning it into an assistant requires two more training stages.

Stage 2: Supervised Fine-Tuning (SFT)

The training procedure is identical to pre-training — same algorithm, same loss function. The only change is the dataset.

Instead of internet documents, the training data is now human-curated conversations:

[
  {
    "role": "user",
    "content": "What's the capital of France?"
  },
  {
    "role": "assistant",
    "content": "The capital of France is Paris."
  }
]

Millions of such conversations, written by paid expert annotators following detailed labeling guidelines, are used to teach the model to adopt the "assistant" persona and response format.

The limitation of SFT: the model imitates human experts. It can never exceed human performance on tasks where the human labeler was the ceiling. And the labeler doesn't always know the optimal solution — especially for math problems where the best "chain of thought" for a human differs from what works best for the model.

Stage 3: Reinforcement Learning from Human Feedback (RLHF)

This is where the model learns to discover solutions on its own through trial and error.

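The model generates many candidate responses, scores them, and updates its parameters to make the better-scoring responses more likely. Where answers can be verified automatically (math, code), the score is simply whether the response is correct; in RLHF specifically, the score comes from a reward model trained on human preference rankings. Crucially, no human is writing the solutions: the model discovers them itself.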

This is analogous to how DeepMind's AlphaGo went from "imitating human moves" (SFT) to "discovering move 37" — a move no human would make, but which emerged from RL because it statistically led to winning.

The result of RLHF is what you interact with on ChatGPT: a model that doesn't just imitate — it has developed internal "reasoning strategies" that it discovered were effective.

The three-stage summary:

| Stage | Data | Goal | Analogy |
|---|---|---|---|
| Pre-Training | Internet documents | Build knowledge | Reading every textbook |
| SFT | Human-curated conversations | Become an assistant | Studying worked examples |
| RLHF | Self-generated (trial & error) | Discover effective strategies | Doing practice problems |

10. Hallucinations

This is where things get uncomfortable — and where most teams are surprised the first time they encounter it in production.

Why hallucinations happen

The model doesn't have an "I don't know" default. It was trained on data where questions of the form "Who is X?" are answered confidently with correct answers. So when you ask "Who is Orson Kovacs?" (a made-up person), the model doesn't say "I don't know". Instead, it samples the most statistically likely continuation of a "Who is X?" prompt, which happens to sound like a confident biographical description.

Training data pattern:
  "Who is Tom Cruise?"  → "[confident answer about Tom Cruise]"
  "Who is John Barrasso?" → "[confident answer about Senator Barrasso]"
  "Who is Genghis Khan?" → "[confident answer about Mongol ruler]"

Learned behavior:
  "Who is Orson Kovacs?" → "[confident answer about... someone invented on the spot]"

The model is not "lying". It's doing exactly what it was trained to do: produce the statistically most likely token sequence given the context. It just happens that the most likely token sequence for "Who is [unknown person]?" in its training data was a confident-sounding response.

The deeper issue

Even when internal network activations may "know" the answer is uncertain, that knowledge isn't wired to the output. The model has no direct mechanism to surface its own uncertainty unless it was explicitly trained on examples where "I don't know" was the labeled correct answer.

Modern mitigations

Epistemic training: interrogate the model on thousands of factual questions, identify which it gets consistently wrong, then add "I don't know" responses for those to the training data.
Tool use: give the model a <SEARCH_START> / <SEARCH_END> token protocol. When uncertain, it can emit a search query, retrieve web results, and place them into its context window. The context window functions as working memory — anything in it is directly accessible, unlike knowledge in parameters which is more like vague long-term memory.

Knowledge in parameters = vague recollection (what you remember from something you read months ago)
Knowledge in context window = working memory (what's right in front of you)

11. Bias

LLMs absorb bias from three sources:

1. Training data bias

The internet over-represents certain perspectives: English speakers, Western cultures, certain age demographics, certain political viewpoints. If 90% of web pages in the training corpus express opinion X on a topic, the model will tend toward X.

A model trained primarily on English web data will perform worse on low-resource languages. A model trained on Wikipedia will reflect the coverage biases in Wikipedia. These aren't bugs per se — they're statistical reflections of the data.

2. Labeler bias

During SFT and RLHF, human annotators make judgment calls. Their cultural background, political views, and personal style preferences all influence what gets labeled as "ideal" responses. Annotator guidelines try to minimize this, but can't eliminate it.

3. Amplification through sampling

Because the model tends toward the mean of its training distribution, it can amplify stereotypes that are statistically common in training data even if they're not normatively accurate. If "CEO" in training data is overwhelmingly paired with male pronouns, the model will associate CEO with male pronouns even if no one explicitly programmed that association.

Why this matters for QA: bias is hard to test with unit tests. It shows up in aggregate — across thousands of test cases, certain demographic groups, certain topic areas. Your testing strategy needs to explicitly probe for it.

12. Prompting Strategies

The way you frame a prompt dramatically affects the model's output. This is one of the most practically important concepts for QA engineers to understand, because your prompt design becomes part of your test case design.

Zero-Shot Prompting

No examples provided. Just the task description.

Classify the sentiment of the following review as POSITIVE, NEGATIVE, or NEUTRAL:

"The delivery was late but the product itself was excellent."

Use when: the task is simple and well-represented in training data. The model has seen many examples of sentiment classification during pre-training.

Limitation: the model must infer the desired output format entirely from context. Ambiguous instructions produce inconsistent formatting.

One-Shot Prompting

One example provided before the actual task.

Classify the sentiment of the following review:

Review: "Absolutely loved the packaging and the smell. Will buy again!"
Sentiment: POSITIVE

Review: "The delivery was late but the product itself was excellent."
Sentiment:

Use when: you need a specific output format the model might not default to, or for edge cases where the classification is ambiguous and you want to demonstrate intent.

Few-Shot Prompting

Multiple examples (typically 3–10) before the task.

Classify the sentiment of the following review:

Review: "Absolutely loved the packaging." → POSITIVE
Review: "Took 3 weeks to arrive and was damaged." → NEGATIVE
Review: "Does what it says, nothing more." → NEUTRAL
Review: "Good price, but customer service was horrible." → MIXED

Review: "The delivery was late but the product itself was excellent." →

Use when: tasks are complex, output format needs to be precise, or the model needs to learn a classification scheme that goes beyond what's common in its training data (e.g., your company's specific taxonomy).
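In a test suite, few-shot prompts are usually assembled from data rather than typed by hand, so the examples and the final query always share one format. A minimal sketch, assuming a hypothetical `buildFewShotPrompt` helper and the Review/Sentiment layout from the one-shot example:

```javascript
// Build a few-shot classification prompt from labeled examples.
// Every example and the final query use the same Review / Sentiment layout.
function buildFewShotPrompt(instruction, examples, query) {
  const lines = [instruction, ''];
  for (const { review, label } of examples) {
    lines.push(`Review: "${review}"`);
    lines.push(`Sentiment: ${label}`);
    lines.push('');
  }
  lines.push(`Review: "${query}"`);
  lines.push('Sentiment:'); // left open for the model to complete
  return lines.join('\n');
}

const prompt = buildFewShotPrompt(
  'Classify the sentiment of the following review:',
  [
    { review: 'Absolutely loved the packaging.', label: 'POSITIVE' },
    { review: 'Took 3 weeks to arrive and was damaged.', label: 'NEGATIVE' },
    { review: 'Does what it says, nothing more.', label: 'NEUTRAL' },
  ],
  'The delivery was late but the product itself was excellent.'
);
console.log(prompt);
```

Keeping the examples in a data structure also makes it easy to vary them per test run and check how sensitive the output is to the choice of examples.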

The QA angle on prompting

Every prompt you write is a specification. It deserves the same rigor as any test specification:

Is it unambiguous? Can the model interpret the instruction in multiple ways?
Does the example cover edge cases? One good example often does more than five generic ones.
Is the output format specified? If you need JSON, say so explicitly.
How robust is it to variations? If the input contains typos, does the prompt still work?

💡 Suggested Reading Flow

If you actually want to understand this space:

Start with Karpathy (intuition first)
Move to Transformers + GPT papers (core mechanics)
Learn tokenisation (how models see text)
Understand decoding (why outputs vary)
Study RLHF (why models behave like assistants)
Focus on evals + hallucination (this is where QA adds real value)

I Shipped 126 Tests Last Month. Here's the AI Workflow That Got Me There.

vipin singh — Fri, 06 Mar 2026 11:21:14 +0000

Last month I shipped 112 API tests and 14 UI tests. Two months ago, that would've taken me months.

The Payoff — Before You Read Anything Else

| Metric | Before AI Agents | After AI Agents |
|---|---|---|
| Tests shipped in a month | ~15–20 | 126 (112 API + 14 UI) |
| Error scenario coverage | Only P0 errors | Systematically covered per endpoint |
| Code consistency | Variable (depends on the day) | High: agents follow patterns better than tired humans |
| PR review comments | Many | Fewer: AI code review catches issues before humans see them |
| My time spent on | Writing boilerplate | Test design & strategy |

Now let me tell you how I got here.

My Setup

I use two AI-powered tools daily — they serve different purposes, and the combination is where the real power lies.

| Tool | What It Is | Best For |
|---|---|---|
| Claude Code | CLI-based coding agent with full codebase access | Multi-file research, large test suites, gap analysis |
| Cursor | AI-powered IDE built on VS Code | Quick edits, in-context tweaks, focused single-file work |

The Secret Weapon: Skill Files & Markdown Context

This is the most important part of the entire workflow.

Before I ask an agent to write a single line of code, I make sure it has context. Without it, the agent guesses. With it, the agent is an informed collaborator.

Without Skill Files              With Skill Files
─────────────────────           ─────────────────────
❌ Agent guesses                 ✅ Agent knows your patterns
❌ Generic output                ✅ Code that fits your codebase
❌ Re-explain every session      ✅ Instant onboarding every time
❌ Lots of manual editing        ✅ Minimal corrections needed

Think of it like onboarding a new contractor. You wouldn't hand them a Jira ticket and say "go." You'd give them architecture docs, point them at example code, and explain conventions. Skill files are that onboarding — except you write them once and every AI session benefits forever.

Here's what I've built:

| File | What It Captures |
|---|---|
| `PROJECT.md` | System architecture, domain terminology, environment details, requirements/specs |
| API Test Skill | Framework setup, dynamic payload construction, test data creation APIs & sequences, auth patterns, existing helpers, response validation patterns |
| UI Test Skill | Page Object Model structure, locator strategy, component interaction patterns, assertion approaches, best practices |
| `CLAUDE.md` / `.cursorrules` | Repo conventions, build commands, coding standards |

What's Inside the API Test Skill

This is the file that made 112 API tests possible in a month. It tells the agent:

How to build dynamic payloads — which fields are required, which are generated (unique IDs, timestamps), how to construct valid payloads per scenario
How to create test data — the exact sequence of API calls needed (e.g., "create customer → create order → authenticate"), how to generate unique data to avoid collisions, how to clean up after
Auth & environment config — how to obtain tokens, which headers to include, how to target staging vs. QA
Existing utilities — what helpers already exist so the agent doesn't reinvent the wheel

Here's a real excerpt from my API test skill file — this is what the agent reads before writing a single test:

/**
 * ENDPOINT: POST /v2/payments
 *
 * Required fields: amount, currency, source_id
 * Generated fields: idempotency_key (UUID per request)
 *
 * Error coverage per endpoint:
 *   400 → Invalid request (missing fields, bad format)
 *   401 → Auth expired / invalid token
 *   402 → Card declined (insufficient funds, expired card)
 *   404 → Resource not found (bad source_id)
 *   422 → Unprocessable (amount = 0, currency mismatch)
 *   429 → Rate limited
 *   500 → Server error (retry with backoff)
 */

// Test data creation sequence:
// 1. Create customer   → POST /v2/customers
// 2. Create card       → POST /v2/cards  (use sandbox nonce)
// 3. Create payment    → POST /v2/payments (reference customer + card)
// 4. Verify status     → GET  /v2/payments/:id (poll until COMPLETED)
// 5. Cleanup           → POST /v2/refunds (refund test payment)

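// Payload builder: the agent uses this pattern for every endpoint.
const { randomUUID } = require('node:crypto');

function buildPaymentPayload(overrides = {}) {
  // Pull the money fields out so they don't leak in as top-level keys.
  const { amount, currency, ...rest } = overrides;
  return {
    idempotency_key: randomUUID(),
    source_id: testCard.id,
    amount_money: {
      // ?? instead of || so an explicit amount of 0 survives (needed for the 422 case)
      amount: amount ?? 1000,
      currency: currency ?? 'AUD',
    },
    customer_id: testCustomer.id,
    reference_id: `test-${Date.now()}`,
    ...rest, // source_id, customer_id, etc. replace the defaults above
  };
}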

Why this works: The agent now knows the exact payload structure, the test data sequence, which fields to randomize, and the error codes to cover. It generates one test per error scenario without me dictating each one.

Here's what the agent produces from that skill file — a complete error scenario test:

test('POST /v2/payments with declined card returns 402', async () => {
  const payload = buildPaymentPayload({
    source_id: 'cnon:card-nonce-declined',  // sandbox decline token
  });

  console.log(`Testing: declined card → expect 402`);
  const res = await api.post('/v2/payments', payload);
  console.log(`Response: ${res.status} — ${res.data?.errors?.[0]?.code}`);

  expect(res.status).toBe(402);
  expect(res.data.errors[0].category).toBe('PAYMENT_METHOD_ERROR');
  expect(res.data.errors[0].code).toBe('CARD_DECLINED');
});

test('POST /v2/payments with expired token returns 401', async () => {
  const payload = buildPaymentPayload();

  const res = await api.post('/v2/payments', payload, {
    headers: { Authorization: 'Bearer expired-token-xxx' },
  });

  expect(res.status).toBe(401);
  expect(res.data.errors[0].category).toBe('AUTHENTICATION_ERROR');
});

Before skill files, I was only covering P0 happy-path scenarios. Now the agent systematically generates tests for every error code listed in the skill file — 400, 401, 402, 404, 422, 429, 500 — per endpoint.

What's Inside the UI Test Skill

This is why 14 UI tests came out consistent and maintainable:

POM structure — how page objects are organized, base classes, naming conventions, directory layout
Locator strategy — the single biggest source of flaky UI tests, locked down with clear priorities
Component interaction patterns — how to interact with custom components (dropdowns, date pickers, modals)
Best practices — never hard-code sleeps, always clean state between tests, use beforeEach for setup

Here's the test structure template from the skill file — every UI test the agent writes follows this exact shape:

const { test, expect } = require('@playwright/test')
const { chrome } = require('../../utils/browser')
const { navigateToCheckout } = require('../../utils')

const { LandingPage } = require('../pages/Landing')
const { LoginPage } = require('../pages/Login')
// PasswordPage path is assumed here, following the naming of the other page objects
const { PasswordPage } = require('../pages/Password')
const { SummaryPage } = require('../pages/Summary')

const { describe, beforeEach } = test

describe('@checkout_regression_au_login', () => {
  let page
  let landingPage, loginPage, passwordPage, summaryPage

  beforeEach(async () => {
    const browserInstance = await chrome()
    page = browserInstance.page
    landingPage = new LandingPage(page)
    loginPage = new LoginPage(page)
    passwordPage = new PasswordPage(page)
    summaryPage = new SummaryPage(page)
  })

  test('Existing user completes login and confirms order', async () => {
    // Arrange
    await navigateToCheckout({ landingPage })

    // Act
    await loginPage.setEmailAddress('jane@doe.com')
    await loginPage.continue()
    await passwordPage.setPassword('password')
    await passwordPage.login()

    // Assert
    await summaryPage.confirmOrder()
    const action = await landingPage.getCallbackAction()
    expect(action).toBe('confirm')
  })
})

And here's the page object pattern — the FIELDS convention that keeps selectors organized:

const FIELDS = {
  submitButton: {
    selector: '[data-testid="submit-button"]',
  },
  emailInput: {
    selector: '[data-testid="email-input"]',
  },
}

exports.MyPage = class {
  constructor(page) {
    this.page = page
  }

  async clickSubmit() {
    await this.page.waitForSelector(
      FIELDS.submitButton.selector, { state: 'visible' }
    )
    await this.page.click(FIELDS.submitButton.selector)
  }

  async setEmail(text) {
    await this.page.waitForSelector(
      FIELDS.emailInput.selector, { state: 'visible' }
    )
    await this.page.fill(FIELDS.emailInput.selector, text)
  }
}

The locator strategy is defined as a strict priority order:

✅ Priority 1: data-testid attributes
   [data-testid="summary-button"]

✅ Priority 2: ARIA selectors
   button[aria-controls="order-summary-panel"]

✅ Priority 3: Role- or type-based attribute selectors
   button[type="submit"]

❌ Avoid: CSS selectors tied to styling classes
❌ Avoid: XPath tied to DOM structure
❌ Avoid: Hardcoded sleeps — use explicit waits
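A priority order like this can also be checked mechanically. The following is a rough heuristic sketch, not part of the skill file: `selectorPriority` is a hypothetical helper that classifies a selector string with simple pattern checks, so it will misjudge compound selectors.

```javascript
// Classify a selector string against the priority order above.
// Heuristic only: the first matching rule wins.
function selectorPriority(selector) {
  // Playwright treats selectors starting with "//" or "xpath=" as XPath.
  if (selector.startsWith('//') || selector.startsWith('xpath=')) {
    return 'avoid: xpath';
  }
  if (/\[data-testid=/.test(selector)) return 1;
  if (/\[aria-[a-z]+[=\]]/.test(selector)) return 2;
  if (/\[(role|type)=/.test(selector)) return 3;
  if (/\.[A-Za-z_-]/.test(selector)) return 'avoid: styling class';
  return 'unclassified';
}

console.log(selectorPriority('[data-testid="summary-button"]')); // 1
console.log(selectorPriority('button[aria-controls="order-summary-panel"]')); // 2
console.log(selectorPriority('button[type="submit"]')); // 3
console.log(selectorPriority('.btn-primary')); // avoid: styling class
```

A check like this could run over the `FIELDS` objects in page objects during review to flag selectors that fall outside the first three priorities.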

And the anti-patterns section — without these rules, agents produce code that works in demos but fails in CI:

// ❌ BAD — arbitrary wait, masks timing issues
await page.waitForTimeout(3000)
await page.click('[data-testid="button"]')

// ✅ GOOD — explicit wait for element
await page.waitForSelector(
  '[data-testid="button"]', { state: 'visible' }
)
await page.click('[data-testid="button"]')

// ❌ BAD — try-catch masks real failures
try {
  await page.waitForSelector(selector1)
} catch {
  await page.waitForSelector(selector2)
}

// ✅ GOOD — explicit conditional
if (country === 'us') {
  await page.waitForSelector(usSelector)
} else {
  await page.waitForSelector(defaultSelector)
}

Before I added these anti-patterns to the skill file, roughly 1 in 3 generated tests had at least one of these issues.

Bonus: Writing skill files forces you to codify knowledge that usually lives only in your head. It becomes documentation that helps human teammates too.

The Workflow: Research → Plan → Implement

I never just say "write me some tests." I follow a deliberate three-phase process.

In Claude Code: Research → Plan → Implement

Research — "Read existing tests, read the API spec, read the skill files. What's covered? What's missing?" The agent explores and builds a mental model. I review its understanding before moving forward.
Plan — "Propose which tests to write, in what order, and why." The agent produces a prioritized list of scenarios. I review and approve before any code is written.
Implement — Only after the plan is approved does the agent write code. Because it's already done the research and has an approved plan, the code is targeted, well-structured, and aligned.

This prevents the most common failure mode: the agent eagerly writing 500 lines of code that miss the point entirely.

In Cursor: Plan → Implement

Cursor's workflow is lighter-weight since I'm usually already in the code:

Plan — I describe what I want in the chat, referencing specific files. Cursor proposes an approach inline, and I review it.
Implement — Once I approve, Cursor applies the changes directly in the editor. I review each diff as it appears.

My rule of thumb: Claude Code for large, multi-file efforts. Cursor for focused, in-context edits.

Quality Gates Before Every PR

Writing tests fast means nothing if the tests are broken, unreadable, or unmaintainable. Every piece of AI-generated test code must pass three gates before I raise a PR.

1. All Tests Running and Passing

Non-negotiable. I run the full test suite — not just the new tests — to make sure nothing is broken. If a new test is flaky, it doesn't ship. I iterate with the agent until it's stable.

2. Proper Logging for Human Verification

Every test must include meaningful logging so that a human reviewing the test output can understand what happened without reading the code:

Log the test scenario being executed in plain English
Log key request payloads and response data (sanitized of sensitive info)
Log assertion results with context ("Expected order status to be ACTIVE, got ACTIVE — PASS")
Log setup and teardown steps so failures can be traced to their root cause

I explicitly instruct the agent to add this logging. Left to its own devices, it'll write tests that either log nothing or log everything. The skill files include examples of what "good logging" looks like.
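As a rough idea of what such logging helpers might look like, here is a minimal sketch. The helper names (`sanitize`, `formatAssertion`) and the list of sensitive keys are assumptions for illustration, not excerpts from my skill files.

```javascript
// Keys treated as sensitive are an assumption; adjust per project.
const SENSITIVE_KEYS = new Set(['authorization', 'password', 'card_number', 'token']);

// Recursively replace sensitive values so payloads are safe to log.
function sanitize(value) {
  if (Array.isArray(value)) return value.map(sanitize);
  if (value && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value).map(([key, inner]) =>
        SENSITIVE_KEYS.has(key.toLowerCase())
          ? [key, '[REDACTED]']
          : [key, sanitize(inner)]
      )
    );
  }
  return value;
}

// Build a one-line, human-readable assertion record.
function formatAssertion(label, expected, actual) {
  const verdict = Object.is(expected, actual) ? 'PASS' : 'FAIL';
  return `Expected ${label} to be ${expected}, got ${actual} - ${verdict}`;
}

console.log(formatAssertion('order status', 'ACTIVE', 'ACTIVE'));
console.log(JSON.stringify(sanitize({ password: 'secret', amount: 1000 })));
```

Keeping the format in one helper means every test logs assertions the same way, which is what makes the output skimmable for a reviewer.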

3. AI-Powered Code Review Before PR

Before raising a PR, I spin up another agent session specifically for code review. I ask the agent to review the test code with fresh eyes — checking for:

Code consistency with existing patterns
Missing edge cases or assertions
Hardcoded values that should be dynamic
Proper error handling and cleanup
Test isolation (no shared state between tests)
Readability and naming clarity

This is like having a second pair of eyes, except it's instant and never annoyed that you're asking for a review at 6pm on a Friday.

Only after this code review pass — and after addressing any findings — do I raise the PR for human review.

What Works Surprisingly Well

Capability	Why It's Great
Pattern matching	Tell the agent "follow the same pattern as existing tests" and it genuinely does — naming, helpers, assertions, structure
Spec → Tests	Give it a requirements doc and it produces a structured test suite mapped directly to the spec
Error scenarios	Agents don't have the human bias toward happy paths — they'll systematically cover timeouts, invalid inputs, auth failures, rate limits
Dynamic payloads	Once it understands your payload structure from the skill file, it generates valid variations without you dictating every field
Boilerplate	Setup, teardown, data builders, config files — all the tedious-but-essential stuff, handled effortlessly

What Doesn't Work (Yet)

Flaky test debugging — If a test passes sometimes and fails sometimes, agents struggle. Flakiness stems from timing, environment issues, or shared state — things that require runtime observation, not just code reading.
Complex environment setup — Agents can write the test code, but they can't spin up your Docker containers, seed your database, or configure your VPN. You still own the infrastructure.
Business logic judgment — The agent can write a test that checks "the response status is 200," but it can't tell you whether 200 is the correct behavior for that scenario. You still need domain knowledge to validate the what, even if the agent handles the how.

Getting Started

Step 1: Create Your Context Files (4–6 hours)

File	Purpose	Key Contents
`PROJECT.md`	Project context	Architecture, terminology, requirements, environment details
API Test Skill	API test knowledge	Framework setup, payload construction, test data APIs, auth patterns, helper utilities
UI Test Skill	UI test knowledge	POM structure, locator strategy, interaction patterns, assertion approaches, best practices
`CLAUDE.md` / `.cursorrules`	Tool-specific config	Repository conventions, build commands, coding standards

Step 2: Establish Your Workflow

Always research before planning, plan before implementing
Start with one test, iterate, then scale — don't ask for 20 tests at once
Run tests after every change — paste failures back to the agent and let it self-correct

Step 3: Set Your Quality Gates

All tests green before PR
Meaningful logging in every test
AI code review pass before human review
No hardcoded test data, no flaky waits, no shared state

Step 4: Invest Time Upfront, Save Time Forever

Writing skill files takes a few hours. But those hours pay dividends across every future session. Every time you or a teammate starts a new AI session, you skip the "explain everything from scratch" phase and go straight to productive work.

Final Thought

AI coding agents don't replace the engineer. They replace the tedium. The judgment calls — what to test, why it matters, whether the behavior is correct — those are still yours. But the mechanical work of translating those decisions into running code? That's where agents shine.

The real unlock isn't the AI itself — it's the context you build around it. Skill files, structured workflows, and quality gates transform an AI from a generic code generator into a team member who understands your codebase, follows your conventions, and produces work you're confident shipping.

112 API tests. 14 UI tests. One month. Invest a day building your skill files and try pairing with an AI agent for a week. You won't go back.

Full disclosure: The ideas, workflow, skill files, and real-world experience in this post are entirely mine — born from months of actually doing this work day in, day out. AI helped me write and structure the blog post itself. Practice what you preach, right?

I Shipped 126 Tests Last Month. Here's the AI Workflow That Got Me There.

vipin singh — Fri, 06 Mar 2026 10:57:43 +0000

112 API tests. 14 UI tests. At my old pace of ~15/month, that would have taken 8+ months.

The Payoff — Before You Read Anything Else

Metric	Before	After
Tests shipped / month	~15–20	126 (112 API + 14 UI)
AI output needing manual fixes	—	~30%
Error scenario coverage	Only P0 errors	Systematically covered per endpoint
Code consistency	Variable	High — agents follow skill files faithfully
My time spent on	Writing boilerplate	Test design & strategy

Now let me tell you how I got here.

My Setup

Two AI tools, different purposes. The combination is where the power lies.

Tool	What It Is	Best For
Claude Code	CLI-based coding agent with full codebase access	Multi-file research, large test suites, gap analysis
Cursor	AI-powered IDE built on VS Code	Quick edits, in-context tweaks, single-file work

The Secret Weapon: Skill Files & Markdown Context

Before asking an agent to write a single line of code, I make sure it has context. Without it, the agent guesses. With it, the agent is an informed collaborator.

Think of it like onboarding a new contractor. You wouldn't hand them a Jira ticket and say "go." You'd give them architecture docs, point them at example code, and explain conventions. Skill files are that onboarding — except you write them once and every session benefits.

File	What It Captures
`PROJECT.md`	Architecture, domain terminology, environment details, requirements
API Test Skill	Framework setup, payload construction, test data sequences, auth, helpers
UI Test Skill	Page Object Model structure, locator strategy, interaction patterns, anti-patterns
`CLAUDE.md`	Repo conventions, build commands, coding standards

What's Actually Inside (Real Excerpts)

API Test Skill — How 112 API Tests Became Possible

This is the file that made the biggest difference. It tells the agent exactly how to construct tests for our payment API:

// API Test Skill Excerpt — Dynamic Payload Construction

/**
 * ENDPOINT: POST /v2/payments
 * 
 * Required fields: amount, currency, source_id
 * Generated fields: idempotency_key (UUID per request)
 * 
 * Error coverage per endpoint:
 *   400 → Invalid request (missing fields, bad format)
 *   401 → Auth expired / invalid token
 *   402 → Card declined (insufficient funds, expired card)
 *   404 → Resource not found (bad source_id)
 *   422 → Unprocessable (amount = 0, currency mismatch)
 *   429 → Rate limited
 *   500 → Server error (retry with backoff)
 */

// Test data creation sequence:
// 1. Create customer   → POST /v2/customers
// 2. Create card       → POST /v2/cards  (use sandbox token: cnon:card-nonce-ok)
// 3. Create payment    → POST /v2/payments (reference customer + card)
// 4. Verify status     → GET  /v2/payments/:id (poll until COMPLETED)
// 5. Cleanup           → POST /v2/refunds (refund test payment)

// Payload builder — agent uses this pattern for every endpoint:
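const crypto = require('node:crypto');

function buildPaymentPayload(overrides = {}) {
  // Pull out fields that map to nested or defaulted keys so the final
  // spread does not leak `amount` / `currency` to the top level.
  const { amount, currency, source_id, customer_id, ...rest } = overrides;
  return {
    idempotency_key: crypto.randomUUID(),
    source_id: source_id ?? testCard.id,
    amount_money: {
      // `??` instead of `||` so an explicit amount of 0 (the 422 case) survives.
      amount: amount ?? 1000,
      currency: currency ?? 'AUD',
    },
    customer_id: customer_id ?? testCustomer.id,
    reference_id: `test-${Date.now()}`,
    ...rest,
  };
}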

// Error scenario template — agent generates one per error code:
test('POST /v2/payments with declined card returns 402', async () => {
  const payload = buildPaymentPayload({
    source_id: 'cnon:card-nonce-declined',  // sandbox decline token
  });

  console.log('Testing: declined card → expect 402');
  const res = await api.post('/v2/payments', payload);
  console.log(`Response: ${res.status} — ${res.data?.errors?.[0]?.code}`);

  expect(res.status).toBe(402);
  expect(res.data.errors[0].category).toBe('PAYMENT_METHOD_ERROR');
  expect(res.data.errors[0].code).toBe('CARD_DECLINED');
});

// Auth failure template:
test('POST /v2/payments with expired token returns 401', async () => {
  const payload = buildPaymentPayload();

  const res = await api.post('/v2/payments', payload, {
    headers: { Authorization: 'Bearer expired-token-xxx' },
  });

  expect(res.status).toBe(401);
  expect(res.data.errors[0].category).toBe('AUTHENTICATION_ERROR');
});

Why this works: The agent now knows the exact payload structure, the sandbox tokens for each error scenario, the assertion patterns, and the cleanup sequence. It generates one test per error code without me dictating each one.
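Step 4 of the data sequence ("poll until COMPLETED") is worth spelling out, because an unbounded poll is just a hidden sleep. Here is a minimal sketch of a bounded poll; `pollUntilStatus` and its option names are hypothetical, and the status fetcher is passed in so nothing is assumed about the API client.

```javascript
// Call `fetchStatus` repeatedly until it returns `target`,
// or throw once the time budget is used up.
async function pollUntilStatus(
  fetchStatus,
  target,
  { intervalMs = 500, timeoutMs = 10000 } = {}
) {
  const deadline = Date.now() + timeoutMs;
  let last;
  for (;;) {
    last = await fetchStatus();
    if (last === target) return last;
    // Stop if another wait would take us past the deadline.
    if (Date.now() + intervalMs > deadline) break;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(
    `Timed out after ${timeoutMs}ms waiting for ${target}; last status: ${last}`
  );
}

// Usage with a stubbed status source standing in for GET /v2/payments/:id.
const statuses = ['PENDING', 'PENDING', 'COMPLETED'];
pollUntilStatus(async () => statuses.shift(), 'COMPLETED', { intervalMs: 10 })
  .then((status) => console.log(`Final status: ${status}`));
```

The timeout error includes the last observed status, so a failing run says whether the payment was stuck or simply slow.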

UI Test Structure Template — what every test looks like

The skill file defines a standard template so the agent produces tests that match the codebase from the first generation:

const { test, expect } = require('@playwright/test')
const { chrome } = require('../../utils/browser')
const { navigateToCheckout } = require('../../utils')

const { LandingPage } = require('../pages/Landing')
const { LoginPage } = require('../pages/Login')
// Assumed path: the test below uses passwordPage, so it needs an import.
const { PasswordPage } = require('../pages/Password')
const { SummaryPage } = require('../pages/Summary')

const { describe, beforeEach } = test

describe('@checkout_regression_au_login', () => {
  let page
  let landingPage, loginPage, passwordPage, summaryPage

  beforeEach(async () => {
    const browserInstance = await chrome()
    page = browserInstance.page
    landingPage = new LandingPage(page)
    loginPage = new LoginPage(page)
    passwordPage = new PasswordPage(page)
    summaryPage = new SummaryPage(page)
  })

  test('Existing user completes login and confirms order', async () => {
    // Arrange
    await navigateToCheckout({ landingPage })

    // Act
    await loginPage.setEmailAddress('jane@doe.com')
    await loginPage.continue()
    await passwordPage.setPassword('password')
    await passwordPage.login()

    // Assert
    await summaryPage.confirmOrder()
    const action = await landingPage.getCallbackAction()
    expect(action).toBe('confirm')
  })
})

Every test follows this exact shape — imports, beforeEach setup, Arrange/Act/Assert structure.

Page Object Pattern — the FIELDS convention

const FIELDS = {
  submitButton: {
    selector: '[data-testid="submit-button"]',
  },
  emailInput: {
    selector: '[data-testid="email-input"]',
  },
}

exports.MyPage = class {
  constructor(page) {
    this.page = page
  }

  async clickSubmit() {
    await this.page.waitForSelector(
      FIELDS.submitButton.selector, { state: 'visible' }
    )
    await this.page.click(FIELDS.submitButton.selector)
  }

  async setEmail(text) {
    await this.page.waitForSelector(
      FIELDS.emailInput.selector, { state: 'visible' }
    )
    await this.page.fill(FIELDS.emailInput.selector, text)
  }
}

Selectors live in a FIELDS object at the top. Every interaction method waits for visibility first. The agent follows this pattern without being reminded.
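Because every method repeats the same wait-then-act pair, one option is to factor it into a base class. This is a sketch of that idea rather than my actual page objects: `BasePage`, `clickWhenVisible`, and `fillWhenVisible` are hypothetical names, while `waitForSelector`, `click`, and `fill` are the same Playwright page methods used above.

```javascript
// Base class that owns the "wait for visibility, then act" rule.
class BasePage {
  constructor(page) {
    this.page = page;
  }

  async whenVisible(selector) {
    await this.page.waitForSelector(selector, { state: 'visible' });
    return selector;
  }

  async clickWhenVisible(selector) {
    await this.page.click(await this.whenVisible(selector));
  }

  async fillWhenVisible(selector, text) {
    await this.page.fill(await this.whenVisible(selector), text);
  }
}

const FIELDS = {
  submitButton: { selector: '[data-testid="submit-button"]' },
  emailInput: { selector: '[data-testid="email-input"]' },
};

// Page objects shrink to one line per interaction.
class MyPage extends BasePage {
  async clickSubmit() {
    await this.clickWhenVisible(FIELDS.submitButton.selector);
  }

  async setEmail(text) {
    await this.fillWhenVisible(FIELDS.emailInput.selector, text);
  }
}
```

The trade-off is one more layer of indirection in exchange for a single place to change the waiting behavior.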

Selector Strategy — the #1 source of flaky tests, locked down

Selector Priority Order:

✅ Priority 1: data-testid attributes
   [data-testid="summary-button"]

✅ Priority 2: ARIA selectors
   button[aria-controls="order-summary-panel"]

✅ Priority 3: Role- or type-based attribute selectors
   button[type="submit"]

❌ Avoid: CSS selectors tied to styling classes
❌ Avoid: XPath tied to DOM structure
❌ Avoid: Hardcoded sleeps — use explicit waits

Anti-Patterns — what the agent must never do

Without these rules, agents produce code that works in demos but fails in CI:

// ❌ BAD — arbitrary wait, masks timing issues
await page.waitForTimeout(3000)
await page.click('[data-testid="button"]')

// ✅ GOOD — explicit wait for element
await page.waitForSelector(
  '[data-testid="button"]', { state: 'visible' }
)
await page.click('[data-testid="button"]')

// ❌ BAD — try-catch masks real failures
try {
  await page.waitForSelector(selector1)
} catch {
  await page.waitForSelector(selector2)
}

// ✅ GOOD — explicit conditional
if (country === 'us') {
  await page.waitForSelector(usSelector)
} else {
  await page.waitForSelector(defaultSelector)
}

Before I added these to the skill file, roughly 1 in 3 generated tests had at least one of these issues.
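Rules like these can be backed by a cheap automated check on generated code. The sketch below is a regex-based heuristic, assuming test files are read as plain strings; `findAntiPatterns` and the pattern list are hypothetical, and a real linter rule would be more reliable.

```javascript
// Heuristic patterns for the two anti-patterns shown above.
const ANTI_PATTERNS = [
  { name: 'hardcoded sleep', regex: /\bwaitForTimeout\s*\(/ },
  {
    name: 'try/catch selector fallback',
    regex: /\bcatch\b[^{]*\{[^}]*waitForSelector/,
  },
];

// Return the names of every anti-pattern found in a source string.
function findAntiPatterns(source) {
  return ANTI_PATTERNS
    .filter(({ regex }) => regex.test(source))
    .map(({ name }) => name);
}

const generated = [
  'await page.waitForTimeout(3000)',
  "await page.click('[data-testid=\"button\"]')",
].join('\n');

console.log(findAntiPatterns(generated)); // [ 'hardcoded sleep' ]
```

Running a check like this before review turns the "1 in 3" problem into a failed gate instead of a manual catch.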

The Workflow: Research → Plan → Implement

I never just say "write me some tests." I follow a deliberate three-phase process.

1. Research — "Read existing tests, the API spec, and the skill files. What's covered? What's missing?" The agent explores and builds a mental model. I review before moving on. (~5 min)

2. Plan — "Propose which tests to write, in what order, and why." The agent produces a prioritized list. I review and approve before any code is written. (~5 min)

3. Implement — Only now does the agent write code. Because it's done research and has an approved plan, the output is targeted and aligned. (~15–20 min/test)

This prevents the most common failure mode: the agent eagerly writing 500 lines of code that miss the point entirely.

Tool split: Claude Code for large, multi-file efforts (research → plan → implement). Cursor for focused, in-context single-file edits (plan → implement).

Quality Gates Before Every PR

Writing tests fast means nothing if they're broken. Every AI-generated test must pass three gates before I raise a PR:

1. All tests running and passing — Full suite, not just new tests. If a test is flaky, it doesn't ship.

2. Proper logging for human verification — Every test logs the scenario in plain English, key request/response data (sanitized), and assertion results with context. The skill files include examples of what "good logging" looks like.

3. AI code review before human review — Before raising a PR, I spin up a separate agent session for code review — checking pattern consistency, missing edge cases, hardcoded values, test isolation, and naming clarity.

What Works Well

✅ Pattern matching — "Follow the same pattern as existing tests" and it genuinely does
✅ Spec → Tests — Hand it a requirements doc, get a structured test suite mapped to the spec
✅ Error scenarios — Agents don't have the human happy-path bias; they systematically cover failures
✅ Dynamic payloads — Once it knows your structure from the skill file, it generates valid variations
✅ Boilerplate — Setup, teardown, data builders, config — all handled in minutes

What Doesn't Work (Honestly)

❌ Flaky test debugging — Flakiness usually stems from timing, environment issues, or shared state, which rarely show up when reading code alone. Diagnosing it takes runtime observation, which agents typically lack.
❌ Infrastructure setup — Agents write test code but can't spin up Docker, seed databases, or configure VPNs.
❌ Business logic judgment — The agent handles the how (writing the assertions), but you still decide what the correct behavior is. Every test still needs human validation of intent.
❌ Hardcoded values — ~30% of generated tests have hardcoded IDs or timestamps that should be dynamic. Always review before merging.
❌ Stale skill files — When APIs change, skill files can cause the agent to generate outdated tests. Maintain them like any other documentation.
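For the hardcoded-values problem, the usual fix is a small data builder that generates unique values per call. Here is a minimal sketch; `buildTestCustomer` and its field names are hypothetical and not taken from the payment API above.

```javascript
const { randomUUID } = require('node:crypto');

// Short random suffix so parallel or repeated runs never collide.
function uniqueSuffix() {
  return randomUUID().slice(0, 8);
}

// Build customer test data with unique identifiers; overrides win.
function buildTestCustomer(overrides = {}) {
  const suffix = uniqueSuffix();
  return {
    email_address: `qa+${suffix}@example.com`,
    reference_id: `test-customer-${suffix}`,
    ...overrides,
  };
}

const first = buildTestCustomer();
const second = buildTestCustomer({ reference_id: 'fixed-for-this-test' });
console.log(first.email_address !== second.email_address); // true
```

When reviewing generated tests, any literal ID or timestamp that could come from a builder like this is a candidate for replacement.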

Getting Started

1. Write your context files (4–6 hours) — PROJECT.md, API test skill, UI test skill. Include real code examples — the excerpts above are a good template.

2. Start with one test (30 min) — Use the research → plan → implement workflow. Iterate until it passes.

3. Scale gradually (1–2 weeks) — Write 5–10 tests. Refine your skill files based on what the agent gets wrong. Expect ~30% of output to need manual fixes early on.

4. Establish quality gates — All tests green. Meaningful logging. AI code review pass. No hardcoded data. Then human review.

Final Thought

The real unlock isn't the AI itself — it's the context you build around it. Skill files, structured workflows, and quality gates transform an AI from a generic code generator into a tool that understands your codebase, follows your conventions, and produces work you're confident shipping.

112 API tests. 14 UI tests. One month. The skill files took a day to write.

Full disclosure: The ideas, workflow, and real-world experience in this post are entirely mine — born from months of actually doing this work. AI helped me write and structure the blog post itself. Practice what you preach, right?
