<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: vipin singh</title>
    <description>The latest articles on DEV Community by vipin singh (@vipin_singh_701b96b0df516).</description>
    <link>https://dev.to/vipin_singh_701b96b0df516</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3809563%2F4f3c27b1-fe70-49d4-be86-e6ade3908b1a.png</url>
      <title>DEV Community: vipin singh</title>
      <link>https://dev.to/vipin_singh_701b96b0df516</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vipin_singh_701b96b0df516"/>
    <language>en</language>
    <item>
      <title>From Tokens to Test Suites: Understanding How LLMs Work for QA Engineers</title>
      <dc:creator>vipin singh</dc:creator>
      <pubDate>Wed, 15 Apr 2026 04:38:28 +0000</pubDate>
      <link>https://dev.to/vipin_singh_701b96b0df516/from-tokens-to-test-suites-understanding-how-llms-work-for-qa-engineers-21f1</link>
      <guid>https://dev.to/vipin_singh_701b96b0df516/from-tokens-to-test-suites-understanding-how-llms-work-for-qa-engineers-21f1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; Senior QA / Automation Engineers transitioning into AI and LLM testing. This blog is structured in two parts: first we go deep on &lt;em&gt;how LLMs actually work&lt;/em&gt; (grounded in Andrej Karpathy's "Deep Dive into LLMs"), then we use that foundation to reason clearly about &lt;em&gt;how to test them&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Understanding the internals is not optional. If you don't know why an LLM hallucinates, you can't design a test that catches it.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Part 1 — How LLMs Actually Work
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;What Is an LLM?&lt;/li&gt;
&lt;li&gt;Tokens and Tokenization&lt;/li&gt;
&lt;li&gt;Pre-Training — Where Knowledge Comes From&lt;/li&gt;
&lt;li&gt;Loss — The Compass of Training&lt;/li&gt;
&lt;li&gt;Neural Networks — Inside the Black Box&lt;/li&gt;
&lt;li&gt;Inference — How Text Gets Generated&lt;/li&gt;
&lt;li&gt;Why Outputs Are Non-Deterministic&lt;/li&gt;
&lt;li&gt;Generation Parameters — Temperature, Top-K, Top-P&lt;/li&gt;
&lt;li&gt;Fine-Tuning and RLHF&lt;/li&gt;
&lt;li&gt;Hallucinations — Why LLMs Make Things Up&lt;/li&gt;
&lt;li&gt;Bias — Where It Comes From&lt;/li&gt;
&lt;li&gt;Prompting Strategies — Zero, One, Few-Shot&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  PART 1 — How LLMs Actually Work
&lt;/h2&gt;




&lt;h2&gt;
  
  
  1. What Is an LLM?
&lt;/h2&gt;

&lt;p&gt;At its core, a Large Language Model does exactly one thing: &lt;strong&gt;it predicts the next token given a sequence of preceding tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That's it. Everything you see ChatGPT, Claude, or Gemini do — answer questions, write code, summarize documents, roleplay characters — emerges from one deeply trained function: &lt;em&gt;what token is most likely to come next?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Think of your phone's autocomplete. When you type "I'll be there in" your keyboard suggests "five", "a few", "an hour". An LLM is that autocomplete, but trained on essentially the entire internet, with hundreds of billions of parameters, capable of maintaining coherent context across thousands of tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The mental model that matters:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input:  [token_1, token_2, ..., token_n]
Output: probability distribution over ~100,000 possible next tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every time the model "speaks," it's sampling from that probability distribution, appending the result to the context, and repeating. That loop is the entirety of text generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for QA:&lt;/strong&gt; The model isn't reasoning in the way a human programmer reasons. It's not executing logic. It's pattern-matching at massive scale. When it fails, it fails in pattern-matching ways — not logic errors.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Tokens and Tokenization
&lt;/h2&gt;

&lt;p&gt;Before any text enters a neural network, it has to be converted into numbers. The process is called &lt;strong&gt;tokenization&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Neural networks require a finite vocabulary of discrete symbols. Raw text is converted into these symbols — called &lt;strong&gt;tokens&lt;/strong&gt; — using an algorithm called &lt;strong&gt;Byte Pair Encoding (BPE)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here's the pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with the raw UTF-8 bytes of text (256 possible byte values).&lt;/li&gt;
&lt;li&gt;Find the most common consecutive byte pairs and merge them into new symbols.&lt;/li&gt;
&lt;li&gt;Repeat until you reach your target vocabulary size (~100,000 for GPT-4).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The result is a vocabulary where common English words become single tokens, common word-pieces become tokens, and rare or novel strings get split into multiple tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4 uses exactly 100,277 tokens.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Concrete examples
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"hello world"     → ["hello", " world"]        → [15339, 1917]
"helloworld"      → ["h", "elloworld"]          → [71, 96392]
"HELLO WORLD"     → ["HEL", "LO", " WORLD"]     → [51812, 1623, 51991]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The space before "world" is &lt;em&gt;included&lt;/em&gt; in the token. Spacing matters.&lt;/li&gt;
&lt;li&gt;Case changes the tokenization entirely.&lt;/li&gt;
&lt;li&gt;The same letters in a different arrangement → completely different tokens.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why tokenization matters for QA
&lt;/h3&gt;

&lt;p&gt;Tokenization is a silent source of bugs in LLM systems. The model doesn't see characters — it sees token IDs. This has concrete implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Spelling tasks break&lt;/strong&gt;: the model operates on tokens, not letters. Ask it to count the letters in "strawberry" and it often fails because "strawberry" might tokenize as &lt;code&gt;["straw", "berry"]&lt;/code&gt; — the model never "sees" individual letters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Numbers behave unexpectedly&lt;/strong&gt;: "9.11" and "9.9" tokenize differently, and the model's "understanding" of which is larger has been shown to be influenced by how those strings appear in training data (Bible verse chapter numbers, for instance, where 9.11 &amp;gt; 9.9).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Language boundary bugs&lt;/strong&gt;: a prompt that works in English may tokenize to more tokens in another language, consuming more context window and potentially truncating critical content.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tokenization Insight:
┌───────────────────────────────────────────────────────────────────┐
│  "strawberry"  →  ["straw", "berry"]  →  [19535, 15717]           |
│                                                                   │
│  Model perspective: Two tokens. No character-level access.        │
│  "Count the r's in strawberry" → the model guesses from patterns, │
│  not by literally counting characters.                            |
└───────────────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  3. Pre-Training
&lt;/h2&gt;

&lt;p&gt;Pre-training is how an LLM acquires its knowledge. It's the most expensive phase — weeks or months on thousands of GPUs — and it's where the model learns everything it knows about language, facts, reasoning patterns, code, and the world.&lt;/p&gt;

&lt;h3&gt;
  
  
  The data: the internet
&lt;/h3&gt;

&lt;p&gt;The training corpus starts with a massive scrape of the web. For reference, Meta's Fineweb dataset used in training Llama models contains approximately &lt;strong&gt;15 trillion tokens&lt;/strong&gt; (~44 terabytes of text).&lt;/p&gt;

&lt;p&gt;But raw web data is messy. The pipeline to clean it involves multiple stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw Web Crawl (Common Crawl)
         │
         ▼
   URL Filtering (blacklists: spam, malware, adult content)
         │
         ▼
   Text Extraction (strip HTML → keep readable text)
         │
         ▼
   Language Filtering (e.g., keep pages &amp;gt;65% English)
         │
         ▼
   Deduplication (remove near-duplicate documents)
         │
         ▼
   PII Removal (strip addresses, SSNs, etc.)
         │
         ▼
   Final Corpus (high-quality, diverse, deduplicated text)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The training loop
&lt;/h3&gt;

&lt;p&gt;Here's what actually happens during pre-training:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsc767ba20ah6iewqh7gs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsc767ba20ah6iewqh7gs.png" alt="The Training Loop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This loop runs billions of times across trillions of tokens. A single training run for a large model like GPT-4 might cost tens of millions of dollars and take months.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The intuition&lt;/strong&gt;: imagine reading the entire internet, and every time you read a sentence, you predict the next word, then check if you were right, then slightly adjust your mental model to be more accurate next time. Do this trillions of times. That's pre-training.&lt;/p&gt;

&lt;p&gt;The result is a &lt;strong&gt;base model&lt;/strong&gt; — a token simulator that has internalized the statistical patterns of human language. It's not yet an assistant. It's a very sophisticated "continue this text" machine.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Loss
&lt;/h2&gt;

&lt;p&gt;Loss is the single most important number during training. It answers: &lt;strong&gt;how wrong is the model right now?&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How loss works
&lt;/h3&gt;

&lt;p&gt;The neural network outputs a probability for every token in the vocabulary as the next token. The loss measures how much probability the model assigned to the &lt;em&gt;correct&lt;/em&gt; next token.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Correct next token in corpus: " Post" (token 3962)

Model's prediction:
  " Direction"  →  4%  probability
  " Case"       →  2%  probability
  " Post"       →  3%  probability  ← should be HIGH
  (other 99,274 tokens share the remaining ~91%)

Loss = how surprised were we that the correct token appeared?
       (formally: negative log probability of the correct token)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Low loss = high probability assigned to correct tokens = good model.&lt;/p&gt;

&lt;p&gt;High loss = model is surprised by what actually comes next = poor model.&lt;/p&gt;

&lt;h3&gt;
  
  
  The loss curve
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Loss
  │
4.0│ ●
   │  ●
3.0│    ●●
   │       ●●●
2.0│           ●●●●●●●
   │                  ●●●●●●●●●●●●
1.0│                               ●●●●●●●●●●●●●●●●●●●●●●
   └────────────────────────────────────────────────────────
                            Training Steps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A decreasing loss is a healthy training run. If loss plateaus or spikes, something is wrong — data quality issues, learning rate problems, or architecture bugs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why QA engineers care about loss&lt;/strong&gt;: When evaluating a fine-tuned model, validation loss is a key health metric. If you're running A/B tests on two model versions, the one with lower validation loss on your domain-specific data will generally perform better on your use case.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Neural Networks
&lt;/h2&gt;

&lt;p&gt;You don't need to know the math, but you do need the right mental model of what a neural network actually is.&lt;/p&gt;

&lt;h3&gt;
  
  
  The core idea
&lt;/h3&gt;

&lt;p&gt;A neural network is a &lt;strong&gt;mathematical function&lt;/strong&gt; that takes an input (your token sequence) and produces an output (probability distribution over next tokens). It has &lt;strong&gt;parameters&lt;/strong&gt; — billions of numbers — that determine how inputs get transformed into outputs.&lt;/p&gt;

&lt;p&gt;Think of it like a massive mixing console with billions of dials. Random settings → random output. Carefully tuned settings (from training) → useful predictions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Parameters (weights):  The "knowledge" of the model.
                        ~7 billion for Llama 3 8B
                        ~405 billion for Llama 3 405B
                        ~1.8 trillion estimated for GPT-4

Input tokens ──────────────────────────────────────────────┐
                                                           │
        ┌───────────────────────────────────────────────┐  │
        │  Embedding Layer                              │◄─┘
        │  (tokens → vectors)                           │
        └───────────────┬───────────────────────────────┘
                        │
        ┌───────────────▼───────────────────────────────┐
        │  Transformer Block 1                          │
        │  ┌────────────┐  ┌─────────────────────────┐  │
        │  │  Attention │  │  Feed-Forward (MLP)     │  │
        │  └────────────┘  └─────────────────────────┘  │
        └───────────────┬───────────────────────────────┘
                        │
        ┌───────────────▼───────────────────────────────┐
        │  Transformer Block 2  (same structure)        │
        └───────────────┬───────────────────────────────┘
                        │
                      [...]
                        │
        ┌───────────────▼───────────────────────────────┐
        │  Output Layer (Logits → Softmax)              │
        └───────────────┬───────────────────────────────┘
                        │
                        ▼
        Probability distribution over 100,277 tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;attention mechanism&lt;/strong&gt; is the key innovation in modern LLMs (from the "Attention Is All You Need" paper). It allows each token to "look at" other tokens in the context and weight their relevance. This is what gives LLMs their ability to maintain coherent context over long passages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important nuance&lt;/strong&gt;: the parameters are fixed once training is done. When you're chatting with ChatGPT, &lt;em&gt;no learning is happening&lt;/em&gt;. Those weights were locked months ago. The model is just computing — very expensively — the same mathematical function.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Inference
&lt;/h2&gt;

&lt;p&gt;Inference is what happens when you send a prompt to an LLM and get a response. Here's the exact generation loop:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsghnn94zds2a98pglgez.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsghnn94zds2a98pglgez.png" alt="Inference"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step by step with a concrete example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context:   [91, 860, 287]   =  "|Viewing ing"
                                       ↓
Neural network runs forward pass
                                       ↓
Output probability vector:
  " Single"  → 12%
  " Article" → 8%
  " Post"    → 7%
  " Page"    → 4%
  ...         ...
                                       ↓
Sample: say we draw " Single" (token 11579)
                                       ↓
New context: [91, 860, 287, 11579] = "|Viewing ing Single"
                                       ↓
Repeat...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The context window&lt;/strong&gt; is the model's "working memory" — everything it can see while generating the next token. For GPT-2 this was 1,024 tokens. For modern models it's 128K to 1M+ tokens. Content inside the context window is directly accessible; the model doesn't need to "remember" it from training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key inference insight&lt;/strong&gt;: the model only ever &lt;em&gt;appends&lt;/em&gt; tokens to the sequence. It can't go back and revise a previous token once it's generated. This is why LLMs sometimes talk themselves into a corner — they're committed to their prior output.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. Non-Determinism
&lt;/h2&gt;

&lt;p&gt;Ask ChatGPT the same question twice. You'll likely get different answers. Why?&lt;/p&gt;

&lt;h3&gt;
  
  
  The sampling process
&lt;/h3&gt;

&lt;p&gt;At each step, the model produces a probability distribution over the next token. It doesn't always pick the &lt;em&gt;highest probability&lt;/em&gt; token (that would be called &lt;em&gt;greedy decoding&lt;/em&gt; and would produce repetitive, boring text). Instead, it &lt;strong&gt;samples&lt;/strong&gt; from the distribution — which introduces randomness.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token probabilities for next token:
  " apple"  → 35%
  " banana" → 25%
  " orange" → 20%
  " grape"  → 15%
  (others)  → 5%

Greedy:  always picks " apple" → deterministic, repetitive
Sampling: picks " banana" 25% of the time → varied, creative
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same reason the model hallucinated three different fake biographies of "Orson Kovacs" (a made-up person) in Karpathy's demo — it doesn't "know" the right answer, so it samples plausible-sounding text each time, landing on different random outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The implications for QA are profound&lt;/strong&gt;: the same prompt can yield different outputs on different runs. You cannot use simple &lt;code&gt;assertEqual&lt;/code&gt; comparisons to verify correctness. This is the single biggest shift in testing philosophy when you move from traditional software to LLM-based systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Generation Parameters
&lt;/h2&gt;

&lt;p&gt;These are the knobs that control how the model samples from its probability distributions. Understanding them is essential for both building and testing LLM systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Temperature
&lt;/h3&gt;

&lt;p&gt;Temperature controls how "flat" or "peaked" the probability distribution is before sampling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Token probabilities BEFORE temperature:
  " apple"  → 35%
  " banana" → 25%
  " orange" → 20%

Temperature = 0.1 (LOW — more deterministic):
  " apple"  → 91%  (dominant choice amplified)
  " banana" → 6%
  " orange" → 3%
  → Very predictable, somewhat repetitive output

Temperature = 1.0 (NEUTRAL):
  Original distribution preserved → balanced exploration

Temperature = 2.0 (HIGH — more random):
  " apple"  → 12%  (differences flattened)
  " banana" → 11%
  " orange" → 10%
  → Wildly creative, often incoherent output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Rule of thumb:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Factual Q&amp;amp;A, code generation → &lt;code&gt;temperature: 0.1–0.3&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Creative writing, brainstorming → &lt;code&gt;temperature: 0.7–1.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Random/experimental output → &lt;code&gt;temperature &amp;gt; 1.0&lt;/code&gt; (usually a mistake)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Top-K
&lt;/h3&gt;

&lt;p&gt;Limits sampling to the K most probable tokens. All others are zeroed out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Top-K = 3:
  Only sample from [" apple", " banana", " orange"]
  Tokens ranked 4th and below are excluded

Effect: Prevents very unlikely tokens from ever being sampled.
        Can make output feel more constrained.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Top-P (Nucleus Sampling)
&lt;/h3&gt;

&lt;p&gt;Instead of a fixed K, samples from the smallest set of tokens whose cumulative probability exceeds P.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Top-P = 0.9:
  Add tokens by probability until cumulative sum ≥ 90%

  " apple"  → 35%  (sum: 35%)
  " banana" → 25%  (sum: 60%)
  " orange" → 20%  (sum: 80%)
  " grape"  → 15%  (sum: 95%)  ← crosses 90% here

  Sample only from {" apple", " banana", " orange", " grape"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Top-P is generally preferred over Top-K because it adapts to the actual probability distribution. When the model is confident (one token dominates), the nucleus is small. When the model is uncertain, the nucleus expands.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters summary
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Low Value&lt;/th&gt;
&lt;th&gt;High Value&lt;/th&gt;
&lt;th&gt;QA Implication&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Temperature&lt;/td&gt;
&lt;td&gt;Predictable, deterministic&lt;/td&gt;
&lt;td&gt;Random, creative&lt;/td&gt;
&lt;td&gt;Low temp → easier to test; High temp → need more runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-K&lt;/td&gt;
&lt;td&gt;Few token candidates&lt;/td&gt;
&lt;td&gt;Many token candidates&lt;/td&gt;
&lt;td&gt;Lower K → more consistent outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Top-P&lt;/td&gt;
&lt;td&gt;Small nucleus (confident choices)&lt;/td&gt;
&lt;td&gt;Large nucleus (broad choices)&lt;/td&gt;
&lt;td&gt;Lower P → less variance in outputs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  9. Fine-Tuning and RLHF
&lt;/h2&gt;

&lt;p&gt;A pre-trained base model is brilliant but unusable. It doesn't answer questions — it just "continues" text in the style of the internet. Turning it into an assistant requires two more training stages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 2: Supervised Fine-Tuning (SFT)
&lt;/h3&gt;

&lt;p&gt;The training procedure is identical to pre-training — same algorithm, same loss function. The only change is the &lt;strong&gt;dataset&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of internet documents, the training data is now &lt;strong&gt;human-curated conversations&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What's the capital of France?"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The capital of France is Paris."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Millions of such conversations, written by paid expert annotators following detailed labeling guidelines, are used to teach the model to adopt the "assistant" persona and response format.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The limitation of SFT&lt;/strong&gt;: the model imitates human experts. It can never exceed human performance on tasks where the human labeler was the ceiling. And the labeler doesn't always know the optimal solution — especially for math problems where the best "chain of thought" for a human differs from what works best for the model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Stage 3: Reinforcement Learning from Human Feedback (RLHF)
&lt;/h3&gt;

&lt;p&gt;This is where the model learns to discover solutions on its own through trial and error.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotb2cthp4w6at70fgz97.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fotb2cthp4w6at70fgz97.png" alt="RLHF"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model generates many candidate responses, checks which ones are correct (or preferred), and updates its parameters to make the correct responses more likely. Crucially, &lt;strong&gt;no human is writing the solutions&lt;/strong&gt; — the model discovers them itself.&lt;/p&gt;

&lt;p&gt;This is analogous to how DeepMind's AlphaGo went from "imitating human moves" (SFT) to "discovering move 37" — a move no human would make, but which emerged from RL because it statistically led to winning.&lt;/p&gt;

&lt;p&gt;The result of RLHF is what you interact with on ChatGPT: a model that doesn't just imitate — it has developed internal "reasoning strategies" that it discovered were effective.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The three-stage summary:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Data&lt;/th&gt;
&lt;th&gt;Goal&lt;/th&gt;
&lt;th&gt;Analogy&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pre-Training&lt;/td&gt;
&lt;td&gt;Internet documents&lt;/td&gt;
&lt;td&gt;Build knowledge&lt;/td&gt;
&lt;td&gt;Reading every textbook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SFT&lt;/td&gt;
&lt;td&gt;Human-curated conversations&lt;/td&gt;
&lt;td&gt;Become an assistant&lt;/td&gt;
&lt;td&gt;Studying worked examples&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RLHF&lt;/td&gt;
&lt;td&gt;Self-generated (trial &amp;amp; error)&lt;/td&gt;
&lt;td&gt;Discover effective strategies&lt;/td&gt;
&lt;td&gt;Doing practice problems&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  10. Hallucinations
&lt;/h2&gt;

&lt;p&gt;This is where things get uncomfortable — and where most teams are surprised the first time they encounter it in production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why hallucinations happen
&lt;/h3&gt;

&lt;p&gt;The model doesn't have a "I don't know" default. It was trained on data where questions of the form "Who is X?" are answered confidently with correct answers. So when you ask "Who is Orson Kovacs?" (a made-up person), the model doesn't say "I don't know" — it samples the most statistically likely continuation of a "Who is X?" prompt, which happens to sound like a confident biographical description.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Training data pattern:
  "Who is Tom Cruise?"  → "[confident answer about Tom Cruise]"
  "Who is John Barrasso?" → "[confident answer about Senator Barrasso]"
  "Who is Genghis Khan?" → "[confident answer about Mongol ruler]"

Learned behavior:
  "Who is Orson Kovacs?" → "[confident answer about... someone invented on the spot]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model is not "lying". It's doing exactly what it was trained to do: produce the statistically most likely token sequence given the context. It just happens that the most likely token sequence for "Who is [unknown person]?" in its training data was a confident-sounding response.&lt;/p&gt;

&lt;h3&gt;
  
  
  The deeper issue
&lt;/h3&gt;

&lt;p&gt;Even when internal network activations may "know" the answer is uncertain, that knowledge isn't wired to the output. The model has no direct mechanism to surface its own uncertainty unless it was explicitly trained on examples where "I don't know" was the labeled correct answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Modern mitigations
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Epistemic training&lt;/strong&gt;: interrogate the model on thousands of factual questions, identify which it gets consistently wrong, then add "I don't know" responses for those to the training data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Tool use&lt;/strong&gt;: give the model a &lt;code&gt;&amp;lt;SEARCH_START&amp;gt;&lt;/code&gt; / &lt;code&gt;&amp;lt;SEARCH_END&amp;gt;&lt;/code&gt; token protocol. When uncertain, it can emit a search query, retrieve web results, and place them into its context window. The context window functions as working memory — anything in it is directly accessible, unlike knowledge in parameters which is more like vague long-term memory.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Knowledge in parameters&lt;/strong&gt; = vague recollection (what you remember from something you read months ago)&lt;br&gt;
&lt;strong&gt;Knowledge in context window&lt;/strong&gt; = working memory (what's right in front of you)&lt;/p&gt;


&lt;h2&gt;
  
  
  11. Bias
&lt;/h2&gt;

&lt;p&gt;LLMs absorb bias from three sources:&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Training data bias
&lt;/h3&gt;

&lt;p&gt;The internet over-represents certain perspectives: English speakers, Western cultures, certain age demographics, certain political viewpoints. If 90% of web pages in the training corpus express opinion X on a topic, the model will tend toward X.&lt;/p&gt;

&lt;p&gt;A model trained primarily on English web data will perform worse on low-resource languages. A model trained on Wikipedia will reflect the coverage biases in Wikipedia. These aren't bugs per se — they're statistical reflections of the data.&lt;/p&gt;
&lt;h3&gt;
  
  
  2. Labeler bias
&lt;/h3&gt;

&lt;p&gt;During SFT and RLHF, human annotators make judgment calls. Their cultural background, political views, and personal style preferences all influence what gets labeled as "ideal" responses. Annotator guidelines try to minimize this, but can't eliminate it.&lt;/p&gt;
&lt;h3&gt;
  
  
  3. Amplification through sampling
&lt;/h3&gt;

&lt;p&gt;Because the model tends toward the mean of its training distribution, it can amplify stereotypes that are statistically common in training data even if they're not normatively accurate. If "CEO" in training data is overwhelmingly paired with male pronouns, the model will associate CEO with male pronouns even if no one explicitly programmed that association.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters for QA&lt;/strong&gt;: bias is hard to test with unit tests. It shows up in aggregate — across thousands of test cases, certain demographic groups, certain topic areas. Your testing strategy needs to explicitly probe for it.&lt;/p&gt;


&lt;h2&gt;
  
  
  12. Prompting Strategies
&lt;/h2&gt;

&lt;p&gt;The way you frame a prompt dramatically affects the model's output. This is one of the most practically important concepts for QA engineers to understand, because your prompt design becomes part of your test case design.&lt;/p&gt;
&lt;h3&gt;
  
  
  Zero-Shot Prompting
&lt;/h3&gt;

&lt;p&gt;No examples provided. Just the task description.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify the sentiment of the following review as POSITIVE, NEGATIVE, or NEUTRAL:

"The delivery was late but the product itself was excellent."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use when&lt;/strong&gt;: the task is simple and well-represented in training data. The model has seen many examples of sentiment classification during pre-training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Limitation&lt;/strong&gt;: the model must infer the desired output format entirely from context. Ambiguous instructions produce inconsistent formatting.&lt;/p&gt;

&lt;h3&gt;
  
  
  One-Shot Prompting
&lt;/h3&gt;

&lt;p&gt;One example provided before the actual task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify the sentiment of the following review:

Review: "Absolutely loved the packaging and the smell. Will buy again!"
Sentiment: POSITIVE

Review: "The delivery was late but the product itself was excellent."
Sentiment:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use when&lt;/strong&gt;: you need a specific output format the model might not default to, or for edge cases where the classification is ambiguous and you want to demonstrate intent.&lt;/p&gt;

&lt;h3&gt;
  
  
  Few-Shot Prompting
&lt;/h3&gt;

&lt;p&gt;Multiple examples (typically 3–10) before the task.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Classify the sentiment of the following review:

Review: "Absolutely loved the packaging." → POSITIVE
Review: "Took 3 weeks to arrive and was damaged." → NEGATIVE
Review: "Does what it says, nothing more." → NEUTRAL
Review: "Good price, but customer service was horrible." → MIXED

Review: "The delivery was late but the product itself was excellent."
Sentiment:
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Use when&lt;/strong&gt;: tasks are complex, output format needs to be precise, or the model needs to learn a classification scheme that goes beyond what's common in its training data (e.g., your company's specific taxonomy).&lt;/p&gt;

&lt;h3&gt;
  
  
  The QA angle on prompting
&lt;/h3&gt;

&lt;p&gt;Every prompt you write is a specification. It deserves the same rigor as any test specification:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is it unambiguous?&lt;/strong&gt; Can the model interpret the instruction in multiple ways?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the example cover edge cases?&lt;/strong&gt; One good example often does more than five generic ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the output format specified?&lt;/strong&gt; If you need JSON, say so explicitly.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How robust is it to variations?&lt;/strong&gt; If the input contains typos, does the prompt still work?&lt;/li&gt;
&lt;/ul&gt;




&lt;h4&gt;
  
  
  📚 References &amp;amp; Further Reading
&lt;/h4&gt;

&lt;p&gt;If you want to go deeper, these are the few resources that actually matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://lapasserelle.com/documents/how_llms_work.pdf" rel="noopener noreferrer"&gt;How LLMs Work (Karpathy-style breakdown PDF)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.youtube.com/watch?v=7xTGNNLPyMI" rel="noopener noreferrer"&gt;Andrej Karpathy – Intro to LLMs (YouTube)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/1706.03762" rel="noopener noreferrer"&gt;Attention Is All You Need (Transformer paper)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/abs/2005.14165" rel="noopener noreferrer"&gt;GPT-3 Paper (few-shot learning)&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/openai/tiktoken" rel="noopener noreferrer"&gt;OpenAI tiktoken (how tokenization actually works)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h5&gt;
  
  
  💡 Suggested Reading Flow
&lt;/h5&gt;

&lt;p&gt;If you actually want to understand this space:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with Karpathy (intuition first)&lt;/li&gt;
&lt;li&gt;Move to Transformers + GPT papers (core mechanics)&lt;/li&gt;
&lt;li&gt;Learn tokenisation (how models see text)&lt;/li&gt;
&lt;li&gt;Understand decoding (why outputs vary)&lt;/li&gt;
&lt;li&gt;Study RLHF (why models behave like assistants)&lt;/li&gt;
&lt;li&gt;Focus on evals + hallucination (this is where QA adds real value)&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>qa</category>
      <category>testing</category>
      <category>testdev</category>
    </item>
    <item>
      <title>I Shipped 126 Tests Last Month. Here's the AI Workflow That Got Me There.</title>
      <dc:creator>vipin singh</dc:creator>
      <pubDate>Fri, 06 Mar 2026 11:21:14 +0000</pubDate>
      <link>https://dev.to/vipin_singh_701b96b0df516/i-shipped-126-tests-last-month-heres-the-ai-workflow-that-got-me-there-bb4</link>
      <guid>https://dev.to/vipin_singh_701b96b0df516/i-shipped-126-tests-last-month-heres-the-ai-workflow-that-got-me-there-bb4</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Last month I shipped 112 API tests and 14 UI tests. Two months ago, that would've taken me a Month.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Payoff — Before You Read Anything Else
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before AI Agents&lt;/th&gt;
&lt;th&gt;After AI Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tests shipped in a month&lt;/td&gt;
&lt;td&gt;~15–20&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;126&lt;/strong&gt; (112 API + 14 UI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error scenario coverage&lt;/td&gt;
&lt;td&gt;Only P0 errors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Systematically covered per endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code consistency&lt;/td&gt;
&lt;td&gt;Variable (depends on the day)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt; — agents follow patterns better than tired humans&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PR review comments&lt;/td&gt;
&lt;td&gt;Many&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Fewer&lt;/strong&gt; — AI code review catches issues before humans see them&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;My time spent on&lt;/td&gt;
&lt;td&gt;Writing boilerplate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Test design &amp;amp; strategy&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now let me tell you how I got here.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;I use two AI-powered tools daily — they serve different purposes, and the &lt;strong&gt;combination&lt;/strong&gt; is where the real power lies.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-based coding agent with full codebase access&lt;/td&gt;
&lt;td&gt;Multi-file research, large test suites, gap analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI-powered IDE built on VS Code&lt;/td&gt;
&lt;td&gt;Quick edits, in-context tweaks, focused single-file work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Secret Weapon: Skill Files &amp;amp; Markdown Context
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This is the most important part of the entire workflow.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before I ask an agent to write a single line of code, I make sure it has &lt;strong&gt;context&lt;/strong&gt;. Without it, the agent guesses. With it, the agent is an informed collaborator.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Without Skill Files              With Skill Files
─────────────────────           ─────────────────────
❌ Agent guesses                 ✅ Agent knows your patterns
❌ Generic output                ✅ Code that fits your codebase
❌ Re-explain every session      ✅ Instant onboarding every time
❌ Lots of manual editing        ✅ Minimal corrections needed
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Think of it like &lt;strong&gt;onboarding a new contractor&lt;/strong&gt;. You wouldn't hand them a Jira ticket and say "go." You'd give them architecture docs, point them at example code, and explain conventions. Skill files are that onboarding — except you write them once and every AI session benefits forever.&lt;/p&gt;

&lt;p&gt;Here's what I've built:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;What It Captures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;PROJECT.md&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;System architecture, domain terminology, environment details, requirements/specs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;API Test Skill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Framework setup, dynamic payload construction, test data creation APIs &amp;amp; sequences, auth patterns, existing helpers, response validation patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;UI Test Skill&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Page Object Model structure, locator strategy, component interaction patterns, assertion approaches, best practices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt; / &lt;code&gt;.cursorrules&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Repo conventions, build commands, coding standards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  What's Inside the API Test Skill
&lt;/h3&gt;

&lt;p&gt;This is the file that made 112 API tests possible in a month. It tells the agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How to build dynamic payloads&lt;/strong&gt; — which fields are required, which are generated (unique IDs, timestamps), how to construct valid payloads per scenario&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to create test data&lt;/strong&gt; — the exact sequence of API calls needed (e.g., "create customer → create order → authenticate"), how to generate unique data to avoid collisions, how to clean up after&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auth &amp;amp; environment config&lt;/strong&gt; — how to obtain tokens, which headers to include, how to target staging vs. QA&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Existing utilities&lt;/strong&gt; — what helpers already exist so the agent doesn't reinvent the wheel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a real excerpt from my API test skill file — this is what the agent reads before writing a single test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="cm"&gt;/**
 * ENDPOINT: POST /v2/payments
 *
 * Required fields: amount, currency, source_id
 * Generated fields: idempotency_key (UUID per request)
 *
 * Error coverage per endpoint:
 *   400 → Invalid request (missing fields, bad format)
 *   401 → Auth expired / invalid token
 *   402 → Card declined (insufficient funds, expired card)
 *   404 → Resource not found (bad source_id)
 *   422 → Unprocessable (amount = 0, currency mismatch)
 *   429 → Rate limited
 *   500 → Server error (retry with backoff)
 */&lt;/span&gt;

&lt;span class="c1"&gt;// Test data creation sequence:&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Create customer   → POST /v2/customers&lt;/span&gt;
&lt;span class="c1"&gt;// 2. Create card       → POST /v2/cards  (use sandbox nonce)&lt;/span&gt;
&lt;span class="c1"&gt;// 3. Create payment    → POST /v2/payments (reference customer + card)&lt;/span&gt;
&lt;span class="c1"&gt;// 4. Verify status     → GET  /v2/payments/:id (poll until COMPLETED)&lt;/span&gt;
&lt;span class="c1"&gt;// 5. Cleanup           → POST /v2/refunds (refund test payment)&lt;/span&gt;

&lt;span class="c1"&gt;// Payload builder — agent uses this pattern for every endpoint:&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildPaymentPayload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;overrides&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;testCard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;amount_money&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AUD&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;testCustomer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;reference_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`test-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; The agent now knows the exact payload structure, the test data sequence, which fields to randomize, and the error codes to cover. It generates one test per error scenario without me dictating each one.&lt;/p&gt;

&lt;p&gt;Here's what the agent produces from that skill file — a complete error scenario test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST /v2/payments with declined card returns 402&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildPaymentPayload&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cnon:card-nonce-declined&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// sandbox decline token&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Testing: declined card → expect 402`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v2/payments&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Response: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; — &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;402&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PAYMENT_METHOD_ERROR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CARD_DECLINED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST /v2/payments with expired token returns 401&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildPaymentPayload&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v2/payments&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer expired-token-xxx&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AUTHENTICATION_ERROR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before skill files, I was only covering P0 happy-path scenarios. Now the agent systematically generates tests for every error code listed in the skill file — 400, 401, 402, 404, 422, 429, 500 — per endpoint.&lt;/p&gt;




&lt;h3&gt;
  
  
  What's Inside the UI Test Skill
&lt;/h3&gt;

&lt;p&gt;This is why 14 UI tests came out consistent and maintainable:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;POM structure&lt;/strong&gt; — how page objects are organized, base classes, naming conventions, directory layout&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Locator strategy&lt;/strong&gt; — the single biggest source of flaky UI tests, locked down with clear priorities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Component interaction patterns&lt;/strong&gt; — how to interact with custom components (dropdowns, date pickers, modals)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Best practices&lt;/strong&gt; — never hard-code sleeps, always clean state between tests, use &lt;code&gt;beforeEach&lt;/code&gt; for setup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's the test structure template from the skill file — every UI test the agent writes follows this exact shape:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;chrome&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../../utils/browser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;navigateToCheckout&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../../utils&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LandingPage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../pages/Landing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LoginPage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../pages/Login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SummaryPage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../pages/Summary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;beforeEach&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@checkout_regression_au_login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;landingPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;loginPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summaryPage&lt;/span&gt;

  &lt;span class="nf"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browserInstance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;browserInstance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;
    &lt;span class="nx"&gt;landingPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LandingPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;loginPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LoginPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;summaryPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SummaryPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Existing user completes login and confirms order&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Arrange&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;navigateToCheckout&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;landingPage&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;// Act&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loginPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setEmailAddress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jane@doe.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loginPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;passwordPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;passwordPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;// Assert&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;summaryPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;confirmOrder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;landingPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCallbackAction&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;confirm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And here's the page object pattern — the &lt;code&gt;FIELDS&lt;/code&gt; convention that keeps selectors organized:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;FIELDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="submit-button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;emailInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="email-input"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MyPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="err"&gt;{
  &lt;/span&gt;&lt;span class="nc"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;clickSubmit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;setEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;emailInput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;emailInput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The locator strategy is defined as a strict priority order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✅ Priority 1: data-testid attributes
   [data-testid="summary-button"]

✅ Priority 2: ARIA selectors
   button[aria-controls="order-summary-panel"]

✅ Priority 3: Role-based selectors
   button[type="submit"]

❌ Avoid: CSS selectors tied to styling classes
❌ Avoid: XPath tied to DOM structure
❌ Avoid: Hardcoded sleeps — use explicit waits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the anti-patterns section — without these rules, agents produce code that works in demos but fails in CI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ BAD — arbitrary wait, masks timing issues&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ GOOD — explicit wait for element&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ BAD — try-catch masks real failures&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ GOOD — explicit conditional&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;us&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usSelector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;defaultSelector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before I added these anti-patterns to the skill file, roughly 1 in 3 generated tests had at least one of these issues.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt; Writing skill files forces &lt;em&gt;you&lt;/em&gt; to codify knowledge that usually lives only in your head. It becomes documentation that helps human teammates too.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Workflow: Research → Plan → Implement
&lt;/h2&gt;

&lt;p&gt;I never just say "write me some tests." I follow a deliberate three-phase process.&lt;/p&gt;

&lt;h3&gt;
  
  
  In Claude Code: Research → Plan → Implement
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Research&lt;/strong&gt; — &lt;em&gt;"Read existing tests, read the API spec, read the skill files. What's covered? What's missing?"&lt;/em&gt; The agent explores and builds a mental model. I review its understanding before moving forward.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Plan&lt;/strong&gt; — &lt;em&gt;"Propose which tests to write, in what order, and why."&lt;/em&gt; The agent produces a prioritized list of scenarios. &lt;strong&gt;I review and approve before any code is written.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Implement&lt;/strong&gt; — Only after the plan is approved does the agent write code. Because it's already done the research and has an approved plan, the code is targeted, well-structured, and aligned.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;This prevents the most common failure mode: &lt;strong&gt;the agent eagerly writing 500 lines of code that miss the point entirely.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  In Cursor: Plan → Implement
&lt;/h3&gt;

&lt;p&gt;Cursor's workflow is lighter-weight since I'm usually already in the code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Plan&lt;/strong&gt; — I describe what I want in the chat, referencing specific files. Cursor proposes an approach inline, and I review it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement&lt;/strong&gt; — Once I approve, Cursor applies the changes directly in the editor. I review each diff as it appears.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;My rule of thumb:&lt;/strong&gt; Claude Code for large, multi-file efforts. Cursor for focused, in-context edits.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Quality Gates Before Every PR
&lt;/h2&gt;

&lt;p&gt;Writing tests fast means nothing if the tests are broken, unreadable, or unmaintainable. Every piece of AI-generated test code must pass three gates before I raise a PR.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. All Tests Running and Passing
&lt;/h3&gt;

&lt;p&gt;Non-negotiable. I run the &lt;strong&gt;full test suite&lt;/strong&gt; — not just the new tests — to make sure nothing is broken. If a new test is flaky, it doesn't ship. I iterate with the agent until it's stable.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Proper Logging for Human Verification
&lt;/h3&gt;

&lt;p&gt;Every test must include meaningful logging so that a human reviewing the test output can &lt;strong&gt;understand what happened without reading the code&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log the test scenario being executed in plain English&lt;/li&gt;
&lt;li&gt;Log key request payloads and response data (sanitized of sensitive info)&lt;/li&gt;
&lt;li&gt;Log assertion results with context (&lt;em&gt;"Expected order status to be ACTIVE, got ACTIVE — PASS"&lt;/em&gt;)&lt;/li&gt;
&lt;li&gt;Log setup and teardown steps so failures can be traced to their root cause&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I explicitly instruct the agent to add this logging. Left to its own devices, it'll write tests that either log nothing or log everything. The skill files include examples of what "good logging" looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. AI-Powered Code Review Before PR
&lt;/h3&gt;

&lt;p&gt;Before raising a PR, I spin up another agent session specifically for code review. I ask the agent to review the test code with fresh eyes — checking for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code consistency with existing patterns&lt;/li&gt;
&lt;li&gt;Missing edge cases or assertions&lt;/li&gt;
&lt;li&gt;Hardcoded values that should be dynamic&lt;/li&gt;
&lt;li&gt;Proper error handling and cleanup&lt;/li&gt;
&lt;li&gt;Test isolation (no shared state between tests)&lt;/li&gt;
&lt;li&gt;Readability and naming clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;This is like having a second pair of eyes, except it's instant and never annoyed that you're asking for a review at 6pm on a Friday.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Only after this code review pass — and after addressing any findings — do I raise the PR for human review.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Works Surprisingly Well
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;th&gt;Why It's Great&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pattern matching&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tell the agent "follow the same pattern as existing tests" and it genuinely does — naming, helpers, assertions, structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spec → Tests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Give it a requirements doc and it produces a structured test suite mapped directly to the spec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Error scenarios&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Agents don't have the human bias toward happy paths — they'll systematically cover timeouts, invalid inputs, auth failures, rate limits&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dynamic payloads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Once it understands your payload structure from the skill file, it generates valid variations without you dictating every field&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Boilerplate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Setup, teardown, data builders, config files — all the tedious-but-essential stuff, handled effortlessly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What Doesn't Work (Yet)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flaky test debugging&lt;/strong&gt; — If a test passes sometimes and fails sometimes, agents struggle. Flakiness stems from timing, environment issues, or shared state — things that require runtime observation, not just code reading.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Complex environment setup&lt;/strong&gt; — Agents can write the test code, but they can't spin up your Docker containers, seed your database, or configure your VPN. You still own the infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Business logic judgment&lt;/strong&gt; — The agent can write a test that checks "the response status is 200," but it can't tell you whether 200 is the &lt;em&gt;correct&lt;/em&gt; behavior for that scenario. You still need domain knowledge to validate the &lt;em&gt;what&lt;/em&gt;, even if the agent handles the &lt;em&gt;how&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Create Your Context Files (4–6 hours)&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Key Contents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PROJECT.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Project context&lt;/td&gt;
&lt;td&gt;Architecture, terminology, requirements, environment details&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Test Skill&lt;/td&gt;
&lt;td&gt;API test knowledge&lt;/td&gt;
&lt;td&gt;Framework setup, payload construction, test data APIs, auth patterns, helper utilities&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI Test Skill&lt;/td&gt;
&lt;td&gt;UI test knowledge&lt;/td&gt;
&lt;td&gt;POM structure, locator strategy, interaction patterns, assertion approaches, best practices&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;CLAUDE.md&lt;/code&gt; / &lt;code&gt;.cursorrules&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Tool-specific config&lt;/td&gt;
&lt;td&gt;Repository conventions, build commands, coding standards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Establish Your Workflow&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always research before planning, plan before implementing&lt;/li&gt;
&lt;li&gt;Start with one test, iterate, then scale — don't ask for 20 tests at once&lt;/li&gt;
&lt;li&gt;Run tests after every change — paste failures back to the agent and let it self-correct&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Set Your Quality Gates&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All tests green before PR&lt;/li&gt;
&lt;li&gt;Meaningful logging in every test&lt;/li&gt;
&lt;li&gt;AI code review pass before human review&lt;/li&gt;
&lt;li&gt;No hardcoded test data, no flaky waits, no shared state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Invest Time Upfront, Save Time Forever&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Writing skill files takes a few hours. But those hours pay dividends across every future session. Every time you or a teammate starts a new AI session, you skip the "explain everything from scratch" phase and go straight to productive work.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;AI coding agents don't replace the engineer. They replace the tedium. The judgment calls — &lt;em&gt;what&lt;/em&gt; to test, &lt;em&gt;why&lt;/em&gt; it matters, &lt;em&gt;whether&lt;/em&gt; the behavior is correct — those are still yours. But the mechanical work of translating those decisions into running code? &lt;strong&gt;That's where agents shine.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The real unlock isn't the AI itself — it's the &lt;strong&gt;context you build around it&lt;/strong&gt;. Skill files, structured workflows, and quality gates transform an AI from a generic code generator into a team member who understands your codebase, follows your conventions, and produces work you're confident shipping.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;112 API tests. 14 UI tests. One month. Invest a day building your skill files and try pairing with an AI agent for a week. You won't go back.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;&lt;em&gt;Full disclosure: The ideas, workflow, skill files, and real-world experience in this post are entirely mine — born from months of actually doing this work day in, day out. AI helped me write and structure the blog post itself. Practice what you preach, right?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>ai</category>
      <category>playwright</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I Shipped 126 Tests Last Month. Here's the AI Workflow That Got Me There.</title>
      <dc:creator>vipin singh</dc:creator>
      <pubDate>Fri, 06 Mar 2026 10:57:43 +0000</pubDate>
      <link>https://dev.to/vipin_singh_701b96b0df516/i-shipped-126-tests-last-month-heres-the-ai-workflow-that-got-me-there-599n</link>
      <guid>https://dev.to/vipin_singh_701b96b0df516/i-shipped-126-tests-last-month-heres-the-ai-workflow-that-got-me-there-599n</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;112 API tests. 14 UI tests. At my old pace of ~15/month, that would have taken 8+ months.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Payoff — Before You Read Anything Else
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tests shipped / month&lt;/td&gt;
&lt;td&gt;~15–20&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;126&lt;/strong&gt; (112 API + 14 UI)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI output needing manual fixes&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~30%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error scenario coverage&lt;/td&gt;
&lt;td&gt;Only P0 errors&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Systematically covered per endpoint&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code consistency&lt;/td&gt;
&lt;td&gt;Variable&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;High&lt;/strong&gt; — agents follow skill files faithfully&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;My time spent on&lt;/td&gt;
&lt;td&gt;Writing boilerplate&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Test design &amp;amp; strategy&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Now let me tell you how I got here.&lt;/p&gt;




&lt;h2&gt;
  
  
  My Setup
&lt;/h2&gt;

&lt;p&gt;Two AI tools, different purposes. The combination is where the power lies.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;What It Is&lt;/th&gt;
&lt;th&gt;Best For&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;CLI-based coding agent with full codebase access&lt;/td&gt;
&lt;td&gt;Multi-file research, large test suites, gap analysis&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cursor&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;AI-powered IDE built on VS Code&lt;/td&gt;
&lt;td&gt;Quick edits, in-context tweaks, single-file work&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  The Secret Weapon: Skill Files &amp;amp; Markdown Context
&lt;/h2&gt;

&lt;p&gt;Before asking an agent to write a single line of code, I make sure it has &lt;strong&gt;context&lt;/strong&gt;. Without it, the agent guesses. With it, the agent is an informed collaborator.&lt;/p&gt;

&lt;p&gt;Think of it like onboarding a new contractor. You wouldn't hand them a Jira ticket and say "go." You'd give them architecture docs, point them at example code, and explain conventions. Skill files are that onboarding — except you write them once and every session benefits.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;What It Captures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;PROJECT.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Architecture, domain terminology, environment details, requirements&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Test Skill&lt;/td&gt;
&lt;td&gt;Framework setup, payload construction, test data sequences, auth, helpers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;UI Test Skill&lt;/td&gt;
&lt;td&gt;Page Object Model structure, locator strategy, interaction patterns, anti-patterns&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;CLAUDE.md&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Repo conventions, build commands, coding standards&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  What's Actually Inside (Real Excerpts)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  API Test Skill — How 112 API Tests Became Possible
&lt;/h4&gt;

&lt;p&gt;This is the file that made the biggest difference. It tells the agent exactly how to construct tests for our payment API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// API Test Skill Excerpt — Dynamic Payload Construction&lt;/span&gt;

&lt;span class="cm"&gt;/**
 * ENDPOINT: POST /v2/payments
 * 
 * Required fields: amount, currency, source_id
 * Generated fields: idempotency_key (UUID per request)
 * 
 * Error coverage per endpoint:
 *   400 → Invalid request (missing fields, bad format)
 *   401 → Auth expired / invalid token
 *   402 → Card declined (insufficient funds, expired card)
 *   404 → Resource not found (bad source_id)
 *   422 → Unprocessable (amount = 0, currency mismatch)
 *   429 → Rate limited
 *   500 → Server error (retry with backoff)
 */&lt;/span&gt;

&lt;span class="c1"&gt;// Test data creation sequence:&lt;/span&gt;
&lt;span class="c1"&gt;// 1. Create customer   → POST /v2/customers&lt;/span&gt;
&lt;span class="c1"&gt;// 2. Create card       → POST /v2/cards  (use sandbox token: cnon:card-nonce-ok)&lt;/span&gt;
&lt;span class="c1"&gt;// 3. Create payment    → POST /v2/payments (reference customer + card)&lt;/span&gt;
&lt;span class="c1"&gt;// 4. Verify status     → GET  /v2/payments/:id (poll until COMPLETED)&lt;/span&gt;
&lt;span class="c1"&gt;// 5. Cleanup           → POST /v2/refunds (refund test payment)&lt;/span&gt;

&lt;span class="c1"&gt;// Payload builder — agent uses this pattern for every endpoint:&lt;/span&gt;
&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildPaymentPayload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;overrides&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;idempotency_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randomUUID&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;source_id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;testCard&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;amount_money&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;amount&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;amount&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;currency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;currency&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AUD&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customer_id&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;testCustomer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;reference_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`test-&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;overrides&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Error scenario template — agent generates one per error code:&lt;/span&gt;
&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST /v2/payments with expired card returns 402&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildPaymentPayload&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;source_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;cnon:card-nonce-declined&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// sandbox decline token&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Testing: expired card → expect 402`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v2/payments&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Response: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; — &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;?.[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;402&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;PAYMENT_METHOD_ERROR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;code&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CARD_DECLINED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Auth failure template:&lt;/span&gt;
&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;POST /v2/payments with expired token returns 401&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;buildPaymentPayload&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;api&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/v2/payments&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;Authorization&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Bearer expired-token-xxx&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;401&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;errors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;AUTHENTICATION_ERROR&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt; The agent now knows the exact payload structure, the sandbox tokens for each error scenario, the assertion patterns, and the cleanup sequence. It generates one test per error code without me dictating each one.&lt;/p&gt;




&lt;p&gt;&lt;b&gt;UI Test Structure Template — what every test looks like&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;The skill file defines a standard template so the agent produces tests that match the codebase from the first generation:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;expect&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@playwright/test&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;chrome&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../../utils/browser&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;navigateToCheckout&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../../utils&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LandingPage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../pages/Landing&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;LoginPage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../pages/Login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;SummaryPage&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;../pages/Summary&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;beforeEach&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;test&lt;/span&gt;

&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;@checkout_regression_au_login&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;landingPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;loginPage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;summaryPage&lt;/span&gt;

  &lt;span class="nf"&gt;beforeEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;browserInstance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chrome&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;browserInstance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;
    &lt;span class="nx"&gt;landingPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LandingPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;loginPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;LoginPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;summaryPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;SummaryPage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;

  &lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Existing user completes login and confirms order&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;amp;&lt;/span&gt;&lt;span class="nx"&gt;gt&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Arrange&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;navigateToCheckout&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;landingPage&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="c1"&gt;// Act&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loginPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setEmailAddress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;jane@doe.com&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;loginPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;continue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;passwordPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setPassword&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;password&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;passwordPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;// Assert&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;summaryPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;confirmOrder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;landingPage&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getCallbackAction&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;action&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;confirm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Every test follows this exact shape — imports, &lt;code&gt;beforeEach&lt;/code&gt; setup, Arrange/Act/Assert structure.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Page Object Pattern — the FIELDS convention&lt;/b&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;FIELDS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="submit-button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;emailInput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="email-input"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;exports&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;MyPage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="err"&gt;{
  &lt;/span&gt;&lt;span class="nc"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;clickSubmit&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;submitButton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;setEmail&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;emailInput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fill&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;FIELDS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;emailInput&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;selector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Selectors live in a &lt;code&gt;FIELDS&lt;/code&gt; object at the top. Every interaction method waits for visibility first. The agent follows this pattern without being reminded.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Selector Strategy — the #1 source of flaky tests, locked down&lt;/b&gt;&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Selector Priority Order:

✅ Priority 1: data-testid attributes
   [data-testid="summary-button"]

✅ Priority 2: ARIA selectors
   button[aria-controls="order-summary-panel"]

✅ Priority 3: Role-based selectors
   button[type="submit"]

❌ Avoid: CSS selectors tied to styling classes
❌ Avoid: XPath tied to DOM structure
❌ Avoid: Hardcoded sleeps — use explicit waits
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;&lt;b&gt;Anti-Patterns — what the agent must never do&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Without these rules, agents produce code that works in demos but fails in CI:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ BAD — arbitrary wait, masks timing issues&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ GOOD — explicit wait for element&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;visible&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[data-testid="button"]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// ❌ BAD — try-catch masks real failures&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;selector2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// ✅ GOOD — explicit conditional&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;us&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;usSelector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;page&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitForSelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;defaultSelector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;Before I added these to the skill file, roughly 1 in 3 generated tests had at least one of these issues.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Workflow: Research → Plan → Implement
&lt;/h2&gt;

&lt;p&gt;I never just say "write me some tests." I follow a deliberate three-phase process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Research&lt;/strong&gt; — "Read existing tests, the API spec, and the skill files. What's covered? What's missing?" The agent explores and builds a mental model. I review before moving on. (~5 min)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Plan&lt;/strong&gt; — "Propose which tests to write, in what order, and why." The agent produces a prioritized list. I review and approve before any code is written. (~5 min)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Implement&lt;/strong&gt; — Only now does the agent write code. Because it's done research and has an approved plan, the output is targeted and aligned. (~15–20 min/test)&lt;/p&gt;

&lt;p&gt;This prevents the most common failure mode: the agent eagerly writing 500 lines of code that miss the point entirely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool split:&lt;/strong&gt; Claude Code for large, multi-file efforts (research → plan → implement). Cursor for focused, in-context single-file edits (plan → implement).&lt;/p&gt;




&lt;h2&gt;
  
  
  Quality Gates Before Every PR
&lt;/h2&gt;

&lt;p&gt;Writing tests fast means nothing if they're broken. Every AI-generated test passes three gates:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. All tests running and passing&lt;/strong&gt; — Full suite, not just new tests. If a test is flaky, it doesn't ship.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Proper logging for human verification&lt;/strong&gt; — Every test logs the scenario in plain English, key request/response data (sanitized), and assertion results with context. The skill files include examples of what "good logging" looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. AI code review before human review&lt;/strong&gt; — Before raising a PR, I spin up a separate agent session for code review — checking pattern consistency, missing edge cases, hardcoded values, test isolation, and naming clarity.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Works Well
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;✅ &lt;strong&gt;Pattern matching&lt;/strong&gt; — "Follow the same pattern as existing tests" and it genuinely does&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Spec → Tests&lt;/strong&gt; — Hand it a requirements doc, get a structured test suite mapped to the spec&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Error scenarios&lt;/strong&gt; — Agents don't have the human happy-path bias; they systematically cover failures&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Dynamic payloads&lt;/strong&gt; — Once it knows your structure from the skill file, it generates valid variations&lt;/li&gt;
&lt;li&gt;✅ &lt;strong&gt;Boilerplate&lt;/strong&gt; — Setup, teardown, data builders, config — all handled in minutes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Doesn't Work (Honestly)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;❌ &lt;strong&gt;Flaky test debugging&lt;/strong&gt; — Flakiness stems from timing and environment, not code. Agents need runtime observation, which they don't have.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Infrastructure setup&lt;/strong&gt; — Agents write test code but can't spin up Docker, seed databases, or configure VPNs.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Business logic judgment&lt;/strong&gt; — The agent checks &lt;em&gt;how&lt;/em&gt;, but you still decide &lt;em&gt;what's correct&lt;/em&gt;. Every test still needs human validation of intent.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Hardcoded values&lt;/strong&gt; — ~30% of generated tests have hardcoded IDs or timestamps that should be dynamic. Always review before merging.&lt;/li&gt;
&lt;li&gt;❌ &lt;strong&gt;Stale skill files&lt;/strong&gt; — When APIs change, skill files can cause the agent to generate outdated tests. Maintain them like any other documentation.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Write your context files (4–6 hours)&lt;/strong&gt; — PROJECT.md, API test skill, UI test skill. Include real code examples — the excerpts above are a good template.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start with one test (30 min)&lt;/strong&gt; — Use the research → plan → implement workflow. Iterate until it passes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Scale gradually (1–2 weeks)&lt;/strong&gt; — Write 5–10 tests. Refine your skill files based on what the agent gets wrong. Expect ~30% of output to need manual fixes early on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Establish quality gates&lt;/strong&gt; — All tests green. Meaningful logging. AI code review pass. No hardcoded data. Then human review.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thought
&lt;/h2&gt;

&lt;p&gt;AI coding agents don't replace the engineer. They replace the tedium. The judgment calls — &lt;em&gt;what&lt;/em&gt; to test, &lt;em&gt;why&lt;/em&gt; it matters, &lt;em&gt;whether&lt;/em&gt; the behavior is correct — those are still yours. But the mechanical work of translating those decisions into running code? That's where agents shine.&lt;/p&gt;

&lt;p&gt;The real unlock isn't the AI itself — it's the &lt;strong&gt;context you build around it&lt;/strong&gt;. Skill files, structured workflows, and quality gates transform an AI from a generic code generator into a tool that understands your codebase, follows your conventions, and produces work you're confident shipping.&lt;/p&gt;




&lt;p&gt;112 API tests. 14 UI tests. One month. The skill files took a day to write.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Full disclosure: The ideas, workflow, and real-world experience in this post are entirely mine — born from months of actually doing this work. AI helped me write and structure the blog post itself. Practice what you preach, right?&lt;/em&gt;&lt;/p&gt;

</description>
      <category>testing</category>
      <category>ai</category>
      <category>playwright</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
