DEV Community

Nova Elvaris
The Debug-First AI Workflow: Why I Make My Assistant Break Things on Purpose

Most people use AI assistants to write code. I've started using mine to break code first.

It sounds counterintuitive, but this one change to my workflow cut my bug rate in half and made code reviews actually meaningful.

The Problem With Write-First

The default AI coding workflow looks like this:

  1. Describe what you want
  2. AI writes the code
  3. You review it
  4. You find bugs (maybe)
  5. You fix them

Step 4 is where things fall apart. Reviewing AI-generated code is hard because it looks correct: it's well-structured, properly indented, and uses reasonable variable names. Your brain pattern-matches it as "good code" and skips over the logic errors.

I call this correctness theater — the code performs competence without actually being correct.

The Debug-First Approach

Here's what I do instead:

Step 1: Write the Spec (30 seconds)

```
Function: calculateShippingCost(order)
- Input: order with items[], destination, shipping_method
- Output: { cost: number, estimatedDays: number }
- Rules: free shipping over $100, express = 2x base rate,
  international adds flat $15, empty orders throw error
```
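A spec like this can also be pinned down as types before any prompting. The names below (`Order`, `OrderItem`, `ShippingQuote`) are my own sketch, not from a real codebase:

```typescript
// Type-level version of the spec — one possible shape, assumed for illustration.
interface OrderItem {
  price: number;    // unit price in dollars
  quantity: number;
}

interface Order {
  items: OrderItem[];
  destination: string;                       // e.g. "US", "DE"
  shipping_method: "standard" | "express";
}

interface ShippingQuote {
  cost: number;
  estimatedDays: number;
}

// Example order that should qualify for free shipping ($101 is over $100):
const sample: Order = {
  items: [{ price: 101, quantity: 1 }],
  destination: "US",
  shipping_method: "standard",
};
```

Writing the types first already forces one question the prose spec dodges: is `shipping_method` an open string or a closed union?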

Step 2: Ask for Failure Scenarios (AI's first job)

```
Given this spec, list 10 ways this function could fail or produce
wrong results. Be specific — include the input that would trigger
each failure.
```

AI returns something like:

  1. Empty items array → should throw, might return $0
  2. Order total exactly $100 → boundary: is it > or >=?
  3. Negative item prices → cost could go negative
  4. Missing destination → international check fails silently
  5. Express + international + free shipping → which rules apply in what order?
  6. Single item worth $101 → free shipping, but express is 2x of what?
  7. Floating point: $33.33 × 3 = $99.99, not $100
  8. Null shipping_method → undefined behavior
  9. destination = "" (empty string) → not null, not valid
  10. Items with quantity: 0 → included in total?

This takes 10 seconds, and I already understand my own spec better than I did when I wrote it.

Step 3: Turn Failures Into Tests (AI's second job)

```
Write a test for each failure scenario. Each test should currently FAIL
(because we haven't written the implementation yet). Use Vitest.
```

Now I have 10 tests that define what "correct" actually means — before a single line of implementation exists.

Step 4: Write the Implementation (AI's third job)

```
Write calculateShippingCost() that passes all 10 tests.
Run through each test mentally and confirm it would pass.
```

The implementation is now constrained by real edge cases. The AI can't take shortcuts because the tests are already there, waiting to catch them.
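To make "constrained by the tests" concrete, here is one possible implementation sketch. Every pinned decision in it — the base rate of $5.99, the free-shipping boundary being strictly over $100, the $15 international fee surviving free shipping, the delivery-day numbers — is my assumption for illustration, not the article's:

```typescript
interface Item { price: number; quantity: number }
interface Order {
  items: Item[];
  destination: string;
  shipping_method: "standard" | "express";
}

const BASE_RATE = 5.99;   // assumed flat base rate in dollars
const DOMESTIC = "US";    // assumed home country

function calculateShippingCost(order: Order): { cost: number; estimatedDays: number } {
  if (order.items.length === 0) throw new Error("empty order");

  // Work in integer cents to dodge the $33.33 × 3 floating-point trap (failure #7).
  const totalCents = order.items.reduce((sum, item) => {
    if (item.price < 0) throw new Error("negative price"); // failure #3
    return sum + Math.round(item.price * 100) * item.quantity;
  }, 0);

  const express = order.shipping_method === "express";
  const international = order.destination !== DOMESTIC;

  // Pinned rule order (failure #5): base rate → express multiplier → free-shipping waiver.
  let costCents = Math.round(BASE_RATE * 100) * (express ? 2 : 1);
  if (totalCents > 100 * 100) costCents = 0;   // strictly over $100 (failure #2)

  // Pinned decision: the flat international fee is not waived by free shipping.
  if (international) costCents += 15 * 100;

  return {
    cost: costCents / 100,
    estimatedDays: express ? 2 : international ? 10 : 5,  // assumed values
  };
}

// Spot check against failure #2: exactly $100 still pays the base rate.
console.log(calculateShippingCost({
  items: [{ price: 100, quantity: 1 }],
  destination: "US",
  shipping_method: "standard",
}).cost); // 5.99
```

The comments tie each branch back to a numbered failure scenario — that traceability is what the tests buy you.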

Step 5: Verify (Mine)

I run the tests. Usually 8-9 pass on the first try. The 1-2 that fail reveal genuine ambiguities in my spec that I need to decide on. That's a feature — I'd rather resolve ambiguity now than in production.

Why This Works Better

You Review Tests, Not Code

Reading 10 test descriptions is faster and easier than reading 50 lines of implementation. If the tests cover the right cases, the implementation is almost certainly correct. If they don't, you know exactly what's missing.

AI Is Better at Breaking Than Building

Ironically, AI assistants are more reliable at finding problems than solving them. When you ask "what could go wrong?", the model draws on thousands of bug reports and failure modes from its training data. When you ask "write correct code," it draws on thousands of code samples — many of which contain the very bugs it could have warned you about.

The Spec Improves Before Implementation

Half the bugs I ship come from ambiguous specs, not bad code. The debug-first workflow forces spec clarification before I write anything. By step 3, I've already caught the "$100 boundary" and "express + free shipping" ambiguities.

Code Reviews Become Meaningful

When someone reviews my PR, they see the test names:

```
✓ throws on empty order
✓ free shipping at exactly $100
✓ rejects negative prices
✓ handles express + international combo
```

The reviewer immediately understands what the code does, what edge cases were considered, and what "correct" means. They can focus on what's missing from the tests rather than trying to read implementation logic.

The Template

Here's the prompt sequence I use every time:

Prompt 1 — Failure Analysis:

```
Here's my function spec: [spec]
List 10 specific ways this could fail or produce wrong results.
Include the exact input for each.
```

Prompt 2 — Test Generation:

```
Write a failing test for each scenario. Use [test framework].
Tests should fail because the function doesn't exist yet.
```

Prompt 3 — Implementation:

```
Write the function to pass all tests. After writing it,
walk through each test and explain why it passes.
```

Prompt 4 — Gap Check:

```
Look at the tests and implementation together.
Are there any edge cases the tests don't cover?
```

Four prompts. Usually takes 5-10 minutes total. The resulting code is more correct than anything I'd get from a single "write this function" prompt.

When to Skip This

This is overkill for trivial functions. I don't debug-first a formatDate() helper. My rule of thumb: if the function has more than 2 conditional branches or touches external state (DB, API, file system), it gets the debug-first treatment.


Try this on your next feature. The first time all 10 tests pass on the first run, you'll feel like you have a superpower.
