vipin singh

I Shipped 126 Tests Last Month. Here's the AI Workflow That Got Me There.

Last month I shipped 112 API tests and 14 UI tests. Two months ago, that volume of work would've taken me months.


The Payoff — Before You Read Anything Else

| Metric | Before AI Agents | After AI Agents |
|---|---|---|
| Tests shipped in a month | ~15–20 | 126 (112 API + 14 UI) |
| Error scenario coverage | Only P0 errors | Systematically covered per endpoint |
| Code consistency | Variable (depends on the day) | High — agents follow patterns better than tired humans |
| PR review comments | Many | Fewer — AI code review catches issues before humans see them |
| My time spent on | Writing boilerplate | Test design & strategy |

Now let me tell you how I got here.


My Setup

I use two AI-powered tools daily — they serve different purposes, and the combination is where the real power lies.

| Tool | What It Is | Best For |
|---|---|---|
| Claude Code | CLI-based coding agent with full codebase access | Multi-file research, large test suites, gap analysis |
| Cursor | AI-powered IDE built on VS Code | Quick edits, in-context tweaks, focused single-file work |

The Secret Weapon: Skill Files & Markdown Context

This is the most important part of the entire workflow.

Before I ask an agent to write a single line of code, I make sure it has context. Without it, the agent guesses. With it, the agent is an informed collaborator.

Without Skill Files              With Skill Files
─────────────────────           ─────────────────────
❌ Agent guesses                 ✅ Agent knows your patterns
❌ Generic output                ✅ Code that fits your codebase
❌ Re-explain every session      ✅ Instant onboarding every time
❌ Lots of manual editing        ✅ Minimal corrections needed

Think of it like onboarding a new contractor. You wouldn't hand them a Jira ticket and say "go." You'd give them architecture docs, point them at example code, and explain conventions. Skill files are that onboarding — except you write them once and every AI session benefits forever.

Here's what I've built:

| File | What It Captures |
|---|---|
| PROJECT.md | System architecture, domain terminology, environment details, requirements/specs |
| API Test Skill | Framework setup, dynamic payload construction, test data creation APIs & sequences, auth patterns, existing helpers, response validation patterns |
| UI Test Skill | Page Object Model structure, locator strategy, component interaction patterns, assertion approaches, best practices |
| CLAUDE.md / .cursorrules | Repo conventions, build commands, coding standards |
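To give a flavor, here's a minimal skeleton for a CLAUDE.md / .cursorrules file. The specific entries are illustrative, not copied from my repo — the point is short, declarative rules the agent can follow without asking:

```
# Repo conventions
- Tests use @playwright/test with CommonJS requires
- Every page object keeps its selectors in a FIELDS map
- Never use waitForTimeout; wait for an element state instead

# Build commands
- npm install
- npm test

# Coding standards
- One scenario per test, with Arrange / Act / Assert comments
- No shared state between tests; create data in beforeEach
```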

What's Inside the API Test Skill

This is the file that made 112 API tests possible in a month. It tells the agent:

  • How to build dynamic payloads — which fields are required, which are generated (unique IDs, timestamps), how to construct valid payloads per scenario
  • How to create test data — the exact sequence of API calls needed (e.g., "create customer → create order → authenticate"), how to generate unique data to avoid collisions, how to clean up after
  • Auth & environment config — how to obtain tokens, which headers to include, how to target staging vs. QA
  • Existing utilities — what helpers already exist so the agent doesn't reinvent the wheel

Here's a real excerpt from my API test skill file — this is what the agent reads before writing a single test:

/**
 * ENDPOINT: POST /v2/payments
 *
 * Required fields: amount, currency, source_id
 * Generated fields: idempotency_key (UUID per request)
 *
 * Error coverage per endpoint:
 *   400 → Invalid request (missing fields, bad format)
 *   401 → Auth expired / invalid token
 *   402 → Card declined (insufficient funds, expired card)
 *   404 → Resource not found (bad source_id)
 *   422 → Unprocessable (amount = 0, currency mismatch)
 *   429 → Rate limited
 *   500 → Server error (retry with backoff)
 */

// Test data creation sequence:
// 1. Create customer   → POST /v2/customers
// 2. Create card       → POST /v2/cards  (use sandbox nonce)
// 3. Create payment    → POST /v2/payments (reference customer + card)
// 4. Verify status     → GET  /v2/payments/:id (poll until COMPLETED)
// 5. Cleanup           → POST /v2/refunds (refund test payment)

// Payload builder — agent uses this pattern for every endpoint:
function buildPaymentPayload(overrides = {}) {
  return {
    idempotency_key: crypto.randomUUID(),
    source_id: overrides.source_id || testCard.id,
    amount_money: {
      amount: overrides.amount || 1000,
      currency: overrides.currency || 'AUD',
    },
    customer_id: overrides.customer_id || testCustomer.id,
    reference_id: `test-${Date.now()}`,
    ...overrides,
  };
}

Why this works: The agent now knows the exact payload structure, the test data sequence, which fields to randomize, and the error codes to cover. It generates one test per error scenario without me dictating each one.

Here's what the agent produces from that skill file — a complete error scenario test:

test('POST /v2/payments with declined card returns 402', async () => {
  const payload = buildPaymentPayload({
    source_id: 'cnon:card-nonce-declined',  // sandbox decline token
  });

  console.log(`Testing: declined card → expect 402`);
  const res = await api.post('/v2/payments', payload);
  console.log(`Response: ${res.status} ${res.data?.errors?.[0]?.code ?? ''}`);

  expect(res.status).toBe(402);
  expect(res.data.errors[0].category).toBe('PAYMENT_METHOD_ERROR');
  expect(res.data.errors[0].code).toBe('CARD_DECLINED');
});

test('POST /v2/payments with expired token returns 401', async () => {
  const payload = buildPaymentPayload();

  const res = await api.post('/v2/payments', payload, {
    headers: { Authorization: 'Bearer expired-token-xxx' },
  });

  expect(res.status).toBe(401);
  expect(res.data.errors[0].category).toBe('AUTHENTICATION_ERROR');
});

Before skill files, I was only covering P0 happy-path scenarios. Now the agent systematically generates tests for every error code listed in the skill file — 400, 401, 402, 404, 422, 429, 500 — per endpoint.
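One way to picture the systematic coverage: the skill file's error table becomes a data structure, and each entry becomes a test. This is a hypothetical sketch — the scenario details below are illustrative, not my exact suite:

```javascript
// Mirror the skill file's error table as data; each entry maps to one test.
const errorMatrix = [
  { status: 400, name: 'missing required field', overrides: { source_id: null } },
  { status: 401, name: 'expired token', headers: { Authorization: 'Bearer expired-token-xxx' } },
  { status: 402, name: 'declined card', overrides: { source_id: 'cnon:card-nonce-declined' } },
  { status: 404, name: 'unknown source', overrides: { source_id: 'cnon:does-not-exist' } },
  { status: 429, name: 'rate limited', repeat: 50 },
];

// In the real suite each entry drives a test(); here we just derive the titles
// to show the shape the agent generates.
const titles = errorMatrix.map(
  (s) => `POST /v2/payments ${s.name} returns ${s.status}`
);
console.log(titles.length); // 5
console.log(titles[2]);     // POST /v2/payments declined card returns 402
```

Because the matrix lives in the skill file, adding a new error code there means the agent generates the matching test in the next session without being asked.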


What's Inside the UI Test Skill

This is why 14 UI tests came out consistent and maintainable:

  • POM structure — how page objects are organized, base classes, naming conventions, directory layout
  • Locator strategy — the single biggest source of flaky UI tests, locked down with clear priorities
  • Component interaction patterns — how to interact with custom components (dropdowns, date pickers, modals)
  • Best practices — never hard-code sleeps, always clean state between tests, use beforeEach for setup

Here's the test structure template from the skill file — every UI test the agent writes follows this exact shape:

const { test, expect } = require('@playwright/test')
const { chrome } = require('../../utils/browser')
const { navigateToCheckout } = require('../../utils')

const { LandingPage } = require('../pages/Landing')
const { LoginPage } = require('../pages/Login')
const { PasswordPage } = require('../pages/Password')
const { SummaryPage } = require('../pages/Summary')

const { describe, beforeEach } = test

describe('@checkout_regression_au_login', () => {
  let page
  let landingPage, loginPage, passwordPage, summaryPage

  beforeEach(async () => {
    const browserInstance = await chrome()
    page = browserInstance.page
    landingPage = new LandingPage(page)
    loginPage = new LoginPage(page)
    passwordPage = new PasswordPage(page)
    summaryPage = new SummaryPage(page)
  })

  test('Existing user completes login and confirms order', async () => {
    // Arrange
    await navigateToCheckout({ landingPage })

    // Act
    await loginPage.setEmailAddress('jane@doe.com')
    await loginPage.continue()
    await passwordPage.setPassword('password')
    await passwordPage.login()

    // Assert
    await summaryPage.confirmOrder()
    const action = await landingPage.getCallbackAction()
    expect(action).toBe('confirm')
  })
})

And here's the page object pattern — the FIELDS convention that keeps selectors organized:

const FIELDS = {
  submitButton: {
    selector: '[data-testid="submit-button"]',
  },
  emailInput: {
    selector: '[data-testid="email-input"]',
  },
}

exports.MyPage = class {
  constructor(page) {
    this.page = page
  }

  async clickSubmit() {
    await this.page.waitForSelector(
      FIELDS.submitButton.selector, { state: 'visible' }
    )
    await this.page.click(FIELDS.submitButton.selector)
  }

  async setEmail(text) {
    await this.page.waitForSelector(
      FIELDS.emailInput.selector, { state: 'visible' }
    )
    await this.page.fill(FIELDS.emailInput.selector, text)
  }
}

The locator strategy is defined as a strict priority order:

✅ Priority 1: data-testid attributes
   [data-testid="summary-button"]

✅ Priority 2: ARIA selectors
   button[aria-controls="order-summary-panel"]

✅ Priority 3: Role-based selectors
   button[type="submit"]

❌ Avoid: CSS selectors tied to styling classes
❌ Avoid: XPath tied to DOM structure
❌ Avoid: Hardcoded sleeps — use explicit waits
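The priority order can even be encoded as code, so the convention is enforceable rather than just documented. This is a hypothetical sketch — `pickLocator` and its parameter names are mine, not part of the framework:

```javascript
// Resolve a selector by the skill file's priority order; fail loudly rather
// than fall back to styling classes or XPath.
function pickLocator({ testId, ariaControls, role } = {}) {
  if (testId) return `[data-testid="${testId}"]`;               // Priority 1
  if (ariaControls) return `[aria-controls="${ariaControls}"]`; // Priority 2
  if (role) return `role=${role}`;                              // Priority 3
  throw new Error('No stable locator — ask the devs for a data-testid');
}

console.log(pickLocator({ testId: 'summary-button' }));
// [data-testid="summary-button"]
console.log(pickLocator({ ariaControls: 'order-summary-panel' }));
// [aria-controls="order-summary-panel"]
```

Throwing instead of silently degrading is deliberate: a missing test hook is a conversation with the frontend team, not something to paper over.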

And the anti-patterns section — without these rules, agents produce code that works in demos but fails in CI:

// ❌ BAD — arbitrary wait, masks timing issues
await page.waitForTimeout(3000)
await page.click('[data-testid="button"]')

// ✅ GOOD — explicit wait for element
await page.waitForSelector(
  '[data-testid="button"]', { state: 'visible' }
)
await page.click('[data-testid="button"]')
// ❌ BAD — try-catch masks real failures
try {
  await page.waitForSelector(selector1)
} catch {
  await page.waitForSelector(selector2)
}

// ✅ GOOD — explicit conditional
if (country === 'us') {
  await page.waitForSelector(usSelector)
} else {
  await page.waitForSelector(defaultSelector)
}

Before I added these anti-patterns to the skill file, roughly 1 in 3 generated tests had at least one of these issues.

Bonus: Writing skill files forces you to codify knowledge that usually lives only in your head. It becomes documentation that helps human teammates too.


The Workflow: Research → Plan → Implement

I never just say "write me some tests." I follow a deliberate three-phase process.

In Claude Code: Research → Plan → Implement

  1. Research — "Read existing tests, read the API spec, read the skill files. What's covered? What's missing?" The agent explores and builds a mental model. I review its understanding before moving forward.

  2. Plan — "Propose which tests to write, in what order, and why." The agent produces a prioritized list of scenarios. I review and approve before any code is written.

  3. Implement — Only after the plan is approved does the agent write code. Because it's already done the research and has an approved plan, the code is targeted, well-structured, and aligned.

This prevents the most common failure mode: the agent eagerly writing 500 lines of code that miss the point entirely.

In Cursor: Plan → Implement

Cursor's workflow is lighter-weight since I'm usually already in the code:

  1. Plan — I describe what I want in the chat, referencing specific files. Cursor proposes an approach inline, and I review it.
  2. Implement — Once I approve, Cursor applies the changes directly in the editor. I review each diff as it appears.

My rule of thumb: Claude Code for large, multi-file efforts. Cursor for focused, in-context edits.


Quality Gates Before Every PR

Writing tests fast means nothing if the tests are broken, unreadable, or unmaintainable. Every piece of AI-generated test code must pass three gates before I raise a PR.

1. All Tests Running and Passing

Non-negotiable. I run the full test suite — not just the new tests — to make sure nothing is broken. If a new test is flaky, it doesn't ship. I iterate with the agent until it's stable.

2. Proper Logging for Human Verification

Every test must include meaningful logging so that a human reviewing the test output can understand what happened without reading the code:

  • Log the test scenario being executed in plain English
  • Log key request payloads and response data (sanitized of sensitive info)
  • Log assertion results with context ("Expected order status to be ACTIVE, got ACTIVE — PASS")
  • Log setup and teardown steps so failures can be traced to their root cause

I explicitly instruct the agent to add this logging. Left to its own devices, it'll write tests that either log nothing or log everything. The skill files include examples of what "good logging" looks like.
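Here's the shape of "good logging" as a tiny helper — `logAssert` is a name I'm inventing for illustration, not a helper from my suite. The idea is that every assertion produces a sentence in the CI log:

```javascript
// Log an assertion with enough context that a reviewer can follow the test
// run without opening the test code.
function logAssert(label, expected, actual) {
  const pass = expected === actual;
  console.log(
    `Expected ${label} to be ${expected}, got ${actual} — ${pass ? 'PASS' : 'FAIL'}`
  );
  return pass;
}

logAssert('order status', 'ACTIVE', 'ACTIVE');
// Expected order status to be ACTIVE, got ACTIVE — PASS
```

With a pattern like this in the skill file, the agent logs consistently instead of choosing between silence and a firehose.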

3. AI-Powered Code Review Before PR

Before raising a PR, I spin up another agent session specifically for code review. I ask the agent to review the test code with fresh eyes — checking for:

  • Code consistency with existing patterns
  • Missing edge cases or assertions
  • Hardcoded values that should be dynamic
  • Proper error handling and cleanup
  • Test isolation (no shared state between tests)
  • Readability and naming clarity

This is like having a second pair of eyes, except it's instant and never annoyed that you're asking for a review at 6pm on a Friday.

Only after this code review pass — and after addressing any findings — do I raise the PR for human review.


What Works Surprisingly Well

| Capability | Why It's Great |
|---|---|
| Pattern matching | Tell the agent "follow the same pattern as existing tests" and it genuinely does — naming, helpers, assertions, structure |
| Spec → Tests | Give it a requirements doc and it produces a structured test suite mapped directly to the spec |
| Error scenarios | Agents don't have the human bias toward happy paths — they'll systematically cover timeouts, invalid inputs, auth failures, rate limits |
| Dynamic payloads | Once it understands your payload structure from the skill file, it generates valid variations without you dictating every field |
| Boilerplate | Setup, teardown, data builders, config files — all the tedious-but-essential stuff, handled effortlessly |

What Doesn't Work (Yet)

  • Flaky test debugging — If a test passes sometimes and fails sometimes, agents struggle. Flakiness stems from timing, environment issues, or shared state — things that require runtime observation, not just code reading.

  • Complex environment setup — Agents can write the test code, but they can't spin up your Docker containers, seed your database, or configure your VPN. You still own the infrastructure.

  • Business logic judgment — The agent can write a test that checks "the response status is 200," but it can't tell you whether 200 is the correct behavior for that scenario. You still need domain knowledge to validate the what, even if the agent handles the how.


Getting Started

Step 1: Create Your Context Files (4–6 hours)

| File | Purpose | Key Contents |
|---|---|---|
| PROJECT.md | Project context | Architecture, terminology, requirements, environment details |
| API Test Skill | API test knowledge | Framework setup, payload construction, test data APIs, auth patterns, helper utilities |
| UI Test Skill | UI test knowledge | POM structure, locator strategy, interaction patterns, assertion approaches, best practices |
| CLAUDE.md / .cursorrules | Tool-specific config | Repository conventions, build commands, coding standards |

Step 2: Establish Your Workflow

  • Always research before planning, plan before implementing
  • Start with one test, iterate, then scale — don't ask for 20 tests at once
  • Run tests after every change — paste failures back to the agent and let it self-correct

Step 3: Set Your Quality Gates

  • All tests green before PR
  • Meaningful logging in every test
  • AI code review pass before human review
  • No hardcoded test data, no flaky waits, no shared state

Step 4: Invest Time Upfront, Save Time Forever

Writing skill files takes a few hours. But those hours pay dividends across every future session. Every time you or a teammate starts a new AI session, you skip the "explain everything from scratch" phase and go straight to productive work.


Final Thought

AI coding agents don't replace the engineer. They replace the tedium. The judgment calls — what to test, why it matters, whether the behavior is correct — those are still yours. But the mechanical work of translating those decisions into running code? That's where agents shine.

The real unlock isn't the AI itself — it's the context you build around it. Skill files, structured workflows, and quality gates transform an AI from a generic code generator into a team member who understands your codebase, follows your conventions, and produces work you're confident shipping.

112 API tests. 14 UI tests. One month. Invest a day building your skill files and try pairing with an AI agent for a week. You won't go back.


Full disclosure: The ideas, workflow, skill files, and real-world experience in this post are entirely mine — born from months of actually doing this work day in, day out. AI helped me write and structure the blog post itself. Practice what you preach, right?
