
Shiplight

Posted on • Originally published at shiplight.ai

Pragmatic Testing for AI-Generated Code

AI coding agents are already changing how software gets built. They implement UI updates quickly, refactor aggressively, and ship more surface area per sprint than most teams planned for. The bottleneck has moved: if code is produced faster than it can be verified, quality becomes a matter of luck.

This post outlines a practical workflow for teams that want to trust AI-generated code — not just review it, but actually prove it works in a real browser before it ships.

Why AI-Generated Code Needs Automated Verification

Traditional automation assumes a clear boundary between "building" and "testing." AI-native development blurs that line. When an agent can implement a feature in minutes, waiting hours for manual QA or brittle UI scripts isn't just slow — it's structurally misaligned.

Manual code review catches logic errors, but it cannot verify that a UI actually renders correctly. Traditional E2E frameworks like Playwright require someone to write test scripts after the code is done — a separate step that rarely keeps pace with AI-generated output. The gap between "code written" and "code verified" is where regressions live.

Step 1: Connect Your Coding Agent to a Real Browser

Using MCP (Model Context Protocol), AI coding agents can now open a real browser, navigate, click, type, take screenshots, and run "verify" actions — without leaving the coding workflow.

# Claude Code setup
claude mcp add shiplight -- npx -y @shiplightai/mcp@latest

Two important details:

  1. Start with browser automation. Core browser actions work without AI keys; AI-powered VERIFY actions need an AI provider key.
  2. This is for real development work. Validate the UI changes the agent just made on your local environment, staging, or a preview deploy.

Step 2: Verify a Change, Then Turn It Into a Test

The verification loop should be fast enough that engineers actually use it:

  1. Start a browser session
  2. Navigate to the changed UI
  3. Act on the UI (click, fill, submit)
  4. Confirm the outcome with a VERIFY assertion
  5. Save the interaction as a YAML test file

Tests are expressed in YAML with natural language intent — readable in code review and accessible to non-QA engineers:

goal: Verify checkout flow after payment UI update
statements:
  - intent: Navigate to product catalog
  - intent: Add first product to cart
  - intent: Proceed to checkout
  - intent: Enter test card details
  - intent: Submit payment
  - VERIFY: Order confirmation page shows order number

This test lives in your git repo, appears in PR diffs, and runs in CI on every future merge.
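To make the transpilation idea concrete, here is a minimal sketch of how intent statements could be mapped to a Playwright-compatible spec. This is illustrative only, not Shiplight's actual transpiler; the `act` and `verify` helper names it emits are assumptions:

```typescript
// Illustrative sketch: turn YAML-style intent statements into the text of
// a Playwright spec. Shiplight's real transpiler is more involved.

interface YamlTest {
  goal: string;
  statements: { intent?: string; VERIFY?: string }[];
}

function transpile(test: YamlTest): string {
  const steps = test.statements.map((s) => {
    if (s.VERIFY !== undefined) {
      // VERIFY statements become runtime assertions.
      return `  await verify(page, ${JSON.stringify(s.VERIFY)});`;
    }
    return `  await act(page, ${JSON.stringify(s.intent)});`;
  });
  return [
    `import { test } from '@playwright/test';`,
    ``,
    `test(${JSON.stringify(test.goal)}, async ({ page }) => {`,
    ...steps,
    `});`,
  ].join('\n');
}

const spec = transpile({
  goal: 'Verify checkout flow after payment UI update',
  statements: [
    { intent: 'Navigate to product catalog' },
    { intent: 'Add first product to cart' },
    { VERIFY: 'Order confirmation page shows order number' },
  ],
});
console.log(spec);
```

The key property is that the generated spec is ordinary Playwright code, so it diffs cleanly in PRs like any other test file.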

Step 3: Make Tests Fast Without Making Them Fragile

Natural language intent is great for reviewability, but CI needs deterministic replay. The solution: enrich steps with cached locators for speed, with an AI fallback that kicks in when the UI changes.

This is the intent-cache-heal pattern:

  • First run: AI resolves intent to a locator, caches it
  • Subsequent runs: uses the cached locator (fast)
  • After a UI change: cached locator fails → AI re-resolves from intent → updates cache

This removes the classic automation tax: minor UI refactors no longer demand constant selector repairs.
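The pattern can be sketched in a few lines, with the AI resolver and the locator check mocked out. The function names below are illustrative, not Shiplight's API:

```typescript
// Illustrative intent-cache-heal loop. `locatorWorks` stands in for trying
// a locator against the live page; `resolveWithAI` stands in for AI-based
// resolution of a natural-language intent. Neither is a real Shiplight API.

type Cache = Map<string, string>;

async function locateByIntent(
  intent: string,
  cache: Cache,
  locatorWorks: (locator: string) => Promise<boolean>,
  resolveWithAI: (intent: string) => Promise<string>,
): Promise<string> {
  const cached = cache.get(intent);
  if (cached && (await locatorWorks(cached))) {
    return cached; // fast path: cached locator still valid
  }
  // heal path: cache miss or stale locator — re-resolve from intent
  const fresh = await resolveWithAI(intent);
  cache.set(intent, fresh);
  return fresh;
}
```

On the first run the cache is empty, so every step pays the AI cost once; subsequent runs hit the cache until a UI change invalidates it, at which point only the broken step is re-resolved.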

Step 4: Run Tests Locally Like a Normal Playwright Suite

YAML test files live alongside *.test.ts tests and execute via npx playwright test. The YAML transpiles transparently to Playwright-compatible spec files that run under your existing Playwright configuration.

This keeps verification in the same place as development: your repo, your review process, your CI conventions.
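For reference, a minimal playwright.config.ts under these assumptions (the directory layout and BASE_URL fallback are illustrative; check Shiplight's docs for the actual integration):

```typescript
import { defineConfig } from '@playwright/test';

// Assumed layout: hand-written *.test.ts specs sit next to
// transpiled YAML specs under a single test directory.
export default defineConfig({
  testDir: './tests',
  use: {
    // Point at local dev, staging, or a preview deploy via BASE_URL.
    baseURL: process.env.BASE_URL ?? 'http://localhost:3000',
  },
});
```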

Step 5: Gate PRs on Flow Verification

Add a CI step that runs your critical user flows on every PR:

# .github/workflows/e2e.yml
- name: Run E2E smoke tests
  run: npx playwright test --grep @smoke
  env:
    BASE_URL: ${{ secrets.STAGING_URL }}

Not the full suite — just the 5–10 flows that, if broken, would cause a production incident. This keeps CI fast while catching the bugs that matter.
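Playwright's --grep flag filters by test title, so gating on @smoke just means putting that tag in the titles of your critical flows. A tiny sketch of the selection logic, with illustrative titles:

```typescript
// Mimics how `npx playwright test --grep @smoke` picks which tests run:
// the pattern is matched against each test's full title.
function selectByGrep(titles: string[], pattern: string): string[] {
  const grep = new RegExp(pattern);
  return titles.filter((t) => grep.test(t));
}

const titles = [
  'checkout completes @smoke',
  'login succeeds @smoke',
  'settings page renders',
];
// Only the two @smoke-tagged flows run in the PR gate.
console.log(selectByGrep(titles, '@smoke'));
```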

Step 6: Scale Into CI and Ongoing Visibility

When you're ready to operationalize:

  • Schedule full regression suites on merge to main
  • Use AI-generated failure summaries (screenshot-aware, with root cause guidance)
  • Add email flow testing — Shiplight supports LLM-based extraction of verification codes and links from real inboxes

The Conceptual Shift That Makes This Work

In an AI-native workflow, testing is not a separate project. Verification becomes a byproduct of shipping:

  • Agent implements a change
  • Tool validates it in a real browser
  • The validation becomes a durable test
  • The suite grows with every meaningful release

If your team is already building with AI agents, the next competitive advantage isn't writing more code. It's proving, continuously, that what you built still works.

Quick-Start Checklist

  • [ ] Install the Shiplight MCP server in your coding agent
  • [ ] Verify your next AI-generated UI change in a real browser before PR
  • [ ] Save the verification as a YAML test file
  • [ ] Add the test to your CI smoke suite
  • [ ] Measure: how many bugs are caught before review now vs. before this workflow?

FAQ

What is the fastest way to start testing AI-generated code?

Install the Shiplight Plugin, make your next AI code change, and immediately ask the agent to verify it in a browser before you write the PR description. The verification itself is the test — save it as YAML and you're done.

Do I need to rewrite my existing tests?

No. YAML tests run alongside your existing Playwright *.test.ts files. Start with new flows from AI-generated code; migrate existing tests only if the maintenance burden justifies it.

How do intent-based tests handle UI framework migrations?

Because intent describes what the user wants ("click the submit button") rather than how to find it (#submit-btn-v2), tests survive CSS renames, component refactors, and even framework migrations. The AI re-resolves the element from the live DOM when the cached locator breaks.

What's the ROI on this workflow?

Teams report eliminating 60%+ of test maintenance time. More importantly, they catch logic and flow bugs that AI-generated code introduces before they reach production — the kind of bugs that code review alone misses because the code looks correct.


