DEV Community

vipin singh


I Shipped 126 Tests Last Month. Here's the AI Workflow That Got Me There.

112 API tests. 14 UI tests. At my old pace of ~15/month, that would have taken 8+ months.


The Payoff — Before You Read Anything Else

| Metric | Before | After |
| --- | --- | --- |
| Tests shipped / month | ~15–20 | 126 (112 API + 14 UI) |
| AI output needing manual fixes | ~30% | |
| Error scenario coverage | Only P0 errors | Systematically covered per endpoint |
| Code consistency | Variable | High — agents follow skill files faithfully |
| My time spent on | Writing boilerplate | Test design & strategy |

Now let me tell you how I got here.


My Setup

Two AI tools, different purposes. The combination is where the power lies.

| Tool | What It Is | Best For |
| --- | --- | --- |
| Claude Code | CLI-based coding agent with full codebase access | Multi-file research, large test suites, gap analysis |
| Cursor | AI-powered IDE built on VS Code | Quick edits, in-context tweaks, single-file work |

The Secret Weapon: Skill Files & Markdown Context

Before asking an agent to write a single line of code, I make sure it has context. Without it, the agent guesses. With it, the agent is an informed collaborator.

Think of it like onboarding a new contractor. You wouldn't hand them a Jira ticket and say "go." You'd give them architecture docs, point them at example code, and explain conventions. Skill files are that onboarding — except you write them once and every session benefits.

| File | What It Captures |
| --- | --- |
| PROJECT.md | Architecture, domain terminology, environment details, requirements |
| API Test Skill | Framework setup, payload construction, test data sequences, auth, helpers |
| UI Test Skill | Page Object Model structure, locator strategy, interaction patterns, anti-patterns |
| CLAUDE.md | Repo conventions, build commands, coding standards |
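None of these files needs to be long to pay off. A hypothetical CLAUDE.md sketch, purely illustrative (the commands and paths here are assumptions, not from my repo):

```markdown
# CLAUDE.md

## Commands
- `npm test`: run the full suite
- `npm run test:api`: API tests only

## Conventions
- Tests live in `tests/api` and `tests/ui`, one file per endpoint or page
- Never hardcode IDs or timestamps; use builders and dynamic values
- Every UI interaction waits for visibility before acting
```

A dozen lines like this is often enough to stop an agent from reinventing your build commands every session.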

What's Actually Inside (Real Excerpts)

API Test Skill — How 112 API Tests Became Possible

This is the file that made the biggest difference. It tells the agent exactly how to construct tests for our payment API:

// API Test Skill Excerpt — Dynamic Payload Construction

/**
 * ENDPOINT: POST /v2/payments
 * 
 * Required fields: amount, currency, source_id
 * Generated fields: idempotency_key (UUID per request)
 * 
 * Error coverage per endpoint:
 *   400 → Invalid request (missing fields, bad format)
 *   401 → Auth expired / invalid token
 *   402 → Card declined (insufficient funds, expired card)
 *   404 → Resource not found (bad source_id)
 *   422 → Unprocessable (amount = 0, currency mismatch)
 *   429 → Rate limited
 *   500 → Server error (retry with backoff)
 */

// Test data creation sequence:
// 1. Create customer   → POST /v2/customers
// 2. Create card       → POST /v2/cards  (use sandbox token: cnon:card-nonce-ok)
// 3. Create payment    → POST /v2/payments (reference customer + card)
// 4. Verify status     → GET  /v2/payments/:id (poll until COMPLETED)
// 5. Cleanup           → POST /v2/refunds (refund test payment)

// Payload builder — agent uses this pattern for every endpoint:
function buildPaymentPayload(overrides = {}) {
  return {
    idempotency_key: crypto.randomUUID(),
    source_id: overrides.source_id || testCard.id,
    amount_money: {
      amount: overrides.amount || 1000,
      currency: overrides.currency || 'AUD',
    },
    customer_id: overrides.customer_id || testCustomer.id,
    reference_id: `test-${Date.now()}`,
    ...overrides,
  };
}

// Error scenario template — agent generates one per error code:
test('POST /v2/payments with declined card returns 402', async () => {
  const payload = buildPaymentPayload({
    source_id: 'cnon:card-nonce-declined',  // sandbox decline token
  });

  console.log(`Testing: declined card → expect 402`);
  const res = await api.post('/v2/payments', payload);
  console.log(`Response: ${res.status} ${res.data?.errors?.[0]?.code}`);

  expect(res.status).toBe(402);
  expect(res.data.errors[0].category).toBe('PAYMENT_METHOD_ERROR');
  expect(res.data.errors[0].code).toBe('CARD_DECLINED');
});

// Auth failure template:
test('POST /v2/payments with expired token returns 401', async () => {
  const payload = buildPaymentPayload();

  const res = await api.post('/v2/payments', payload, {
    headers: { Authorization: 'Bearer expired-token-xxx' },
  });

  expect(res.status).toBe(401);
  expect(res.data.errors[0].category).toBe('AUTHENTICATION_ERROR');
});

Why this works: The agent now knows the exact payload structure, the sandbox tokens for each error scenario, the assertion patterns, and the cleanup sequence. It generates one test per error code without me dictating each one.
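The "one test per error code" rule lends itself to data-driven generation. A minimal sketch of the idea: distill the error table into data, then derive one test descriptor per scenario. Only the 402 token comes from the skill-file excerpt above; the zero-amount case is an assumption based on the 422 note.

```javascript
// Hypothetical scenario table distilled from the error-coverage comment block.
// A test runner (or the agent) loops over it to emit one test per error code.
const ERROR_SCENARIOS = [
  { status: 402, name: 'declined card', override: { source_id: 'cnon:card-nonce-declined' } },
  { status: 422, name: 'zero amount',   override: { amount: 0 } },
];

// Turn each scenario into a test descriptor: title, payload override, expectation.
function errorCases(endpoint, scenarios) {
  return scenarios.map((s) => ({
    title: `POST ${endpoint} with ${s.name} returns ${s.status}`,
    override: s.override,
    expectStatus: s.status,
  }));
}

const cases = errorCases('/v2/payments', ERROR_SCENARIOS);
console.log(cases[0].title); // POST /v2/payments with declined card returns 402
```

The point is not the loop itself; it's that the skill file gives the agent a table it can extend, so adding a new error code is a one-line change, not a new hand-written test.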


UI Test Structure Template — what every test looks like

The skill file defines a standard template so the agent produces tests that match the codebase from the first generation:

const { test, expect } = require('@playwright/test')
const { chrome } = require('../../utils/browser')
const { navigateToCheckout } = require('../../utils')

const { LandingPage } = require('../pages/Landing')
const { LoginPage } = require('../pages/Login')
const { PasswordPage } = require('../pages/Password')
const { SummaryPage } = require('../pages/Summary')

const { describe, beforeEach } = test

describe('@checkout_regression_au_login', () => {
  let page
  let landingPage, loginPage, passwordPage, summaryPage

  beforeEach(async () => {
    const browserInstance = await chrome()
    page = browserInstance.page
    landingPage = new LandingPage(page)
    loginPage = new LoginPage(page)
    passwordPage = new PasswordPage(page)
    summaryPage = new SummaryPage(page)
  })

  test('Existing user completes login and confirms order', async () => {
    // Arrange
    await navigateToCheckout({ landingPage })

    // Act
    await loginPage.setEmailAddress('jane@doe.com')
    await loginPage.continue()
    await passwordPage.setPassword('password')
    await passwordPage.login()

    // Assert
    await summaryPage.confirmOrder()
    const action = await landingPage.getCallbackAction()
    expect(action).toBe('confirm')
  })
})

Every test follows this exact shape — imports, beforeEach setup, Arrange/Act/Assert structure.

Page Object Pattern — the FIELDS convention

const FIELDS = {
  submitButton: {
    selector: '[data-testid="submit-button"]',
  },
  emailInput: {
    selector: '[data-testid="email-input"]',
  },
}

exports.MyPage = class {
  constructor(page) {
    this.page = page
  }

  async clickSubmit() {
    await this.page.waitForSelector(
      FIELDS.submitButton.selector, { state: 'visible' }
    )
    await this.page.click(FIELDS.submitButton.selector)
  }

  async setEmail(text) {
    await this.page.waitForSelector(
      FIELDS.emailInput.selector, { state: 'visible' }
    )
    await this.page.fill(FIELDS.emailInput.selector, text)
  }
}

Selectors live in a FIELDS object at the top. Every interaction method waits for visibility first. The agent follows this pattern without being reminded.
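One nice side effect of this pattern: you can sanity-check the "wait before interact" contract without launching a browser. A hypothetical sketch using a stub page object that records calls (the stub is mine, not from the codebase):

```javascript
// Minimal page object following the FIELDS convention above.
const FIELDS = {
  submitButton: { selector: '[data-testid="submit-button"]' },
}

class MyPage {
  constructor(page) { this.page = page }

  async clickSubmit() {
    // Contract: always wait for visibility before interacting.
    await this.page.waitForSelector(FIELDS.submitButton.selector, { state: 'visible' })
    await this.page.click(FIELDS.submitButton.selector)
  }
}

// Stub standing in for a Playwright page: records every call, in order.
function makeStubPage() {
  const calls = []
  return {
    calls,
    async waitForSelector(sel) { calls.push(['wait', sel]) },
    async click(sel) { calls.push(['click', sel]) },
  }
}

;(async () => {
  const page = makeStubPage()
  await new MyPage(page).clickSubmit()
  console.log(page.calls.map(([op]) => op).join(' then ')) // wait then click
})()
```

This is also a cheap unit test to hand the review agent: if a generated method clicks before waiting, the recorded call order exposes it immediately.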

Selector Strategy — the #1 source of flaky tests, locked down

Selector Priority Order:

✅ Priority 1: data-testid attributes
   [data-testid="summary-button"]

✅ Priority 2: ARIA selectors
   button[aria-controls="order-summary-panel"]

✅ Priority 3: Role-based selectors
   button[type="submit"]

❌ Avoid: CSS selectors tied to styling classes
❌ Avoid: XPath tied to DOM structure
❌ Avoid: Hardcoded sleeps — use explicit waits
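The priority order can even be encoded as a helper so generated page objects make the choice mechanically. A hypothetical sketch (the function and its parameter names are mine, for illustration):

```javascript
// Encode the selector priority order: given the hooks an element exposes,
// return the most stable selector available, or fail loudly.
function pickSelector({ testId, ariaControls, role } = {}) {
  if (testId) return `[data-testid="${testId}"]`                      // priority 1
  if (ariaControls) return `button[aria-controls="${ariaControls}"]` // priority 2
  if (role) return `button[type="${role}"]`                          // priority 3
  throw new Error('No stable selector available; ask for a data-testid')
}

console.log(pickSelector({ testId: 'summary-button' }))
// [data-testid="summary-button"]
```

Failing loudly when nothing stable exists is deliberate: it surfaces missing `data-testid` attributes as a conversation with the dev team instead of a brittle CSS selector.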

Anti-Patterns — what the agent must never do

Without these rules, agents produce code that works in demos but fails in CI:

// ❌ BAD — arbitrary wait, masks timing issues
await page.waitForTimeout(3000)
await page.click('[data-testid="button"]')

// ✅ GOOD — explicit wait for element
await page.waitForSelector(
  '[data-testid="button"]', { state: 'visible' }
)
await page.click('[data-testid="button"]')
// ❌ BAD — try-catch masks real failures
try {
  await page.waitForSelector(selector1)
} catch {
  await page.waitForSelector(selector2)
}

// ✅ GOOD — explicit conditional
if (country === 'us') {
  await page.waitForSelector(usSelector)
} else {
  await page.waitForSelector(defaultSelector)
}

Before I added these to the skill file, roughly 1 in 3 generated tests had at least one of these issues.


The Workflow: Research → Plan → Implement

I never just say "write me some tests." I follow a deliberate three-phase process.

1. Research — "Read existing tests, the API spec, and the skill files. What's covered? What's missing?" The agent explores and builds a mental model. I review before moving on. (~5 min)

2. Plan — "Propose which tests to write, in what order, and why." The agent produces a prioritized list. I review and approve before any code is written. (~5 min)

3. Implement — Only now does the agent write code. Because it's done research and has an approved plan, the output is targeted and aligned. (~15–20 min/test)

This prevents the most common failure mode: the agent eagerly writing 500 lines of code that miss the point entirely.

Tool split: Claude Code for large, multi-file efforts (research → plan → implement). Cursor for focused, in-context single-file edits (plan → implement).


Quality Gates Before Every PR

Writing tests fast means nothing if they're broken. Every AI-generated test passes three gates:

1. All tests running and passing — Full suite, not just new tests. If a test is flaky, it doesn't ship.

2. Proper logging for human verification — Every test logs the scenario in plain English, key request/response data (sanitized), and assertion results with context. The skill files include examples of what "good logging" looks like.

3. AI code review before human review — Before raising a PR, I spin up a separate agent session for code review — checking pattern consistency, missing edge cases, hardcoded values, test isolation, and naming clarity.
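Gate 2 is easy to standardize with a tiny helper the skill file can point at. A hypothetical sketch; the field names it sanitizes are assumptions, not our real list:

```javascript
// Log a test step in plain English with key data, sanitizing sensitive fields
// before anything reaches CI logs. Returns the sanitized view for reuse.
const SENSITIVE_FIELDS = ['source_id', 'customer_id'] // assumed list

function logStep(scenario, data = {}) {
  const sanitized = { ...data }
  for (const key of SENSITIVE_FIELDS) {
    if (sanitized[key]) sanitized[key] = '***'
  }
  console.log(`[test] ${scenario}`, JSON.stringify(sanitized))
  return sanitized
}

logStep('declined card, expect 402', { source_id: 'cnon:card-nonce-declined', amount: 1000 })
```

Once a helper like this exists, "proper logging" stops being a judgment call in review; the agent either used it or it didn't.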


What Works Well

  • Pattern matching — "Follow the same pattern as existing tests" and it genuinely does
  • Spec → Tests — Hand it a requirements doc, get a structured test suite mapped to the spec
  • Error scenarios — Agents don't have the human happy-path bias; they systematically cover failures
  • Dynamic payloads — Once it knows your structure from the skill file, it generates valid variations
  • Boilerplate — Setup, teardown, data builders, config — all handled in minutes

What Doesn't Work (Honestly)

  • Flaky test debugging — Flakiness stems from timing and environment, not code. Agents need runtime observation, which they don't have.
  • Infrastructure setup — Agents write test code but can't spin up Docker, seed databases, or configure VPNs.
  • Business logic judgment — The agent checks how, but you still decide what's correct. Every test still needs human validation of intent.
  • Hardcoded values — ~30% of generated tests have hardcoded IDs or timestamps that should be dynamic. Always review before merging.
  • Stale skill files — When APIs change, skill files can cause the agent to generate outdated tests. Maintain them like any other documentation.

Getting Started

1. Write your context files (4–6 hours) — PROJECT.md, API test skill, UI test skill. Include real code examples — the excerpts above are a good template.

2. Start with one test (30 min) — Use the research → plan → implement workflow. Iterate until it passes.

3. Scale gradually (1–2 weeks) — Write 5–10 tests. Refine your skill files based on what the agent gets wrong. Expect ~30% of output to need manual fixes early on.

4. Establish quality gates — All tests green. Meaningful logging. AI code review pass. No hardcoded data. Then human review.


Final Thought

AI coding agents don't replace the engineer. They replace the tedium. The judgment calls — what to test, why it matters, whether the behavior is correct — those are still yours. But the mechanical work of translating those decisions into running code? That's where agents shine.

The real unlock isn't the AI itself — it's the context you build around it. Skill files, structured workflows, and quality gates transform an AI from a generic code generator into a tool that understands your codebase, follows your conventions, and produces work you're confident shipping.


112 API tests. 14 UI tests. One month. The skill files took a day to write.

Full disclosure: The ideas, workflow, and real-world experience in this post are entirely mine — born from months of actually doing this work. AI helped me write and structure the blog post itself. Practice what you preach, right?
