I shipped AI-generated code to production exactly once without a validation workflow. It took down our payment processing for forty minutes and cost us three customer escalations.
The code looked perfect. Clean structure, proper error handling, comprehensive logging. It passed our test suite. The AI that generated it—Claude Opus 4.6—confidently assured me it was production-ready.
The bug was subtle: the payment retry logic used exponential backoff with no maximum delay. After five retries, it was waiting sixteen minutes before attempting the sixth retry. Users saw pending payments that never resolved. Our monitoring didn't catch it because technically nothing crashed—the code was just waiting.
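A minimal sketch of the failure mode (the function and parameter names are my reconstruction, not the actual production code): the delay doubles on every attempt, with no ceiling.

```python
def backoff_delay(attempt, base_delay=30):
    """Seconds to wait after failed attempt N. Bug: there is no maximum."""
    return base_delay * 2 ** attempt

for attempt in range(1, 6):
    print(f"wait after retry {attempt}: {backoff_delay(attempt)} s")
# 60, 120, 240, 480, 960. After the fifth retry the code waits 960
# seconds: sixteen minutes of "pending" in which nothing crashes.
```

Nothing here fails a test suite. The delays are simply absurd once you write them out.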
A human would have questioned sixteen-minute delays. The AI never considered whether the behavior made sense in production context. It implemented the algorithm correctly but didn't reason about the consequences.
That incident forced me to build a systematic validation workflow. Not because AI code is inherently bad, but because AI-generated code fails in different ways than human-written code, and our traditional review processes don't catch those failures.
The Core Problem With AI Code Review
Traditional code review assumes the author understood the requirements and attempted to meet them. The reviewer checks if the implementation matches the intent.
AI-generated code breaks this assumption. The AI didn't understand requirements—it pattern-matched against similar code in its training data. Sometimes the pattern is right. Sometimes it's subtly wrong in ways that look correct until you reason about behavior.
This means standard code review questions don't work:
"Does this implementation match the requirements?" — AI code usually matches the literal requirements while missing implicit constraints you'd assume any developer would understand.
"Are there edge cases that aren't handled?" — AI code often handles edge cases you specified while introducing new edge cases you didn't think to mention.
"Is this maintainable?" — AI code is usually well-structured and readable. Maintainability isn't the problem. Correctness is.
I needed a validation workflow that accounted for AI's specific failure modes, not just general code quality issues.
The Validation Workflow That Actually Works
After six months of shipping AI-generated code without incidents, here's the workflow that survived:
Stage 1: Multi-Model Generation
I never ship code generated by a single AI model. I generate implementations from at least two different models and compare them.
When I needed a function to parse and validate user-uploaded configuration files, I asked both Claude Opus 4.6 and Gemini 3.1 Pro to implement it independently.
Claude's version prioritized error messages and validation feedback. It returned detailed errors explaining what was wrong with malformed configs.
Gemini's version prioritized performance. It validated config structure in a single pass and returned boolean valid/invalid with minimal error detail.
Neither was wrong. But the comparison revealed an implicit requirement I hadn't specified: we needed detailed error messages for user feedback, not just validation results.
If I'd shipped the first implementation I received, I would have implemented the wrong behavior. The multi-model comparison forced me to clarify requirements I'd assumed were obvious.
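The contrast looked roughly like this (hypothetical key names and simplified checks, not the models' actual output):

```python
def validate_verbose(config: dict) -> list[str]:
    """Claude-style: collect human-readable errors for user feedback."""
    errors = []
    if "name" not in config:
        errors.append("missing required key: 'name'")
    if not isinstance(config.get("retries", 0), int):
        errors.append("'retries' must be an integer")
    return errors  # empty list means the config is valid


def validate_fast(config: dict) -> bool:
    """Gemini-style: single-pass boolean check, no error detail."""
    return "name" in config and isinstance(config.get("retries", 0), int)
```

Seeing the two return types side by side is what surfaced the real requirement: callers needed the error list, not just the verdict.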
Using platforms that let you compare AI models side-by-side makes this stage practical. You can see both implementations simultaneously without copy-pasting between interfaces.
Stage 2: Behavioral Verification
I don't review AI-generated code the way I review human code. I don't ask "does this look right?" I ask "what does this actually do?"
For every AI-generated function, I manually trace execution with specific inputs:
Happy path input: Does it produce the expected output?
Boundary conditions: Empty strings, null values, zero, maximum values—what happens at the edges?
Malformed input: What happens with invalid data? Does it fail gracefully or crash?
Production-scale input: What happens with realistic data volumes? Does performance degrade?
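A concrete way to run that grid, using a hypothetical `parse_amount` helper as the function under review:

```python
def parse_amount(raw: str) -> int:
    """Hypothetical helper: parse a currency string like '12.50' into cents."""
    dollars, _, cents = raw.strip().partition(".")
    return int(dollars or 0) * 100 + int((cents or "0").ljust(2, "0")[:2])

cases = {
    "happy path": "12.50",  # expect 1250
    "boundary":   "0",      # expect 0: no decimal point at all
    "malformed":  "abc",    # what actually happens here?
}
for label, raw in cases.items():
    try:
        print(f"{label}: {parse_amount(raw)}")
    except Exception as exc:
        print(f"{label}: raised {type(exc).__name__}")  # fails loudly, not gracefully
```

The malformed case is deliberately left uncaught: the point of the trace is to discover that it raises, then decide whether the caller handles that.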
For the payment retry logic that failed, this stage would have caught the issue. Tracing through the exponential backoff with actual numbers would have revealed the sixteen-minute delay.
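Manual tracing points straight at the fix: a cap on the delay (the parameter values here are illustrative, not from the real system).

```python
def capped_backoff_delay(attempt, base_delay=30, max_delay=120):
    """Exponential backoff with a ceiling: delays can never exceed max_delay."""
    return min(base_delay * 2 ** attempt, max_delay)

print([capped_backoff_delay(n) for n in range(1, 6)])  # plateaus at 120 s
```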
I use tools that help verify the logical flow of generated code, not just syntax. The goal is to confirm the code behaves correctly under all conditions, not just that it compiles and runs.
Stage 3: Cross-Model Review
After selecting an implementation, I have a different AI model review it.
If Claude generated the code, I ask Gemini to review it. If Gemini generated it, I ask GPT-5.4 to review it.
Each model has different blind spots. Code that passes Claude's conceptual review might fail Gemini's performance analysis. Code that passes GPT's readability check might have architectural issues Claude would catch.
The key is asking the right review questions:
Not: "Is this code correct?"
Instead: "What could go wrong with this code in production?"
Not: "Does this follow best practices?"
Instead: "What implicit assumptions does this code make?"
Not: "Is this well-written?"
Instead: "What edge cases might this code not handle?"
Cross-model review isn't about finding syntax errors. It's about surfacing assumptions the generating model made that might be invalid for your specific context.
Stage 4: Test Case Generation
I have AI generate comprehensive test cases for the code, then review those tests more carefully than the code itself.
AI-generated tests reveal assumptions the model made during implementation. If the tests don't cover a scenario you care about, the code probably doesn't handle it correctly.
For the payment retry function, I had Claude Sonnet 4.5 generate test cases. The tests covered retry counts, error handling, and backoff timing—but none tested total elapsed time.
That omission revealed the model didn't consider time limits as a constraint worth testing. Which meant it didn't consider them during implementation either.
I now add test cases the AI didn't generate, specifically targeting:
- Time-based behavior (timeouts, delays, expiration)
- Resource constraints (memory, connections, file handles)
- Concurrent access (race conditions, locking)
- Production-scale data (performance, pagination)
These are areas where AI-generated code consistently has gaps.
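The total-elapsed-time gap from the retry incident is exactly the kind of test I now add by hand. Assuming a schedule like the one in the incident (30-second base, doubling each time), the check is a few lines:

```python
def total_retry_wait(base_delay=30, retries=5):
    """Cumulative seconds spent waiting across all retries."""
    return sum(base_delay * 2 ** n for n in range(1, retries + 1))

# The test the model never generated: a bound on total elapsed time.
# 300 s is an illustrative budget, not a real requirement from our system.
budget_s = 300
total = total_retry_wait()
print(f"total wait: {total} s, budget: {budget_s} s, over budget: {total > budget_s}")
```

Five retries at this schedule wait 1,860 seconds in total, which no sane budget survives. That single number is what the generated test suite never computed.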
Stage 5: Context Validation
This is the stage most developers skip, and it's where the subtlest bugs hide.
AI doesn't know your system architecture, your constraints, or your operational requirements. It generates code that works in isolation but might fail in context.
For every AI-generated component, I explicitly verify:
Does this integrate correctly with existing systems? AI might use patterns that conflict with how the rest of your codebase works.
Does this match our performance requirements? AI optimizes for correctness, not performance. It might choose approaches that work but don't scale.
Does this handle our operational constraints? Retry limits, timeout budgets, connection pools—AI doesn't know these exist unless you specify them explicitly.
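One habit that closes this gap: write the constraints down as a value the prompt, the code, and the reviewer all share. A sketch with invented numbers:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetryPolicy:
    """Operational constraints made explicit. All values are illustrative."""
    max_retries: int = 5
    base_delay_s: float = 1.0
    max_delay_s: float = 30.0     # the ceiling the incident code was missing
    total_budget_s: float = 60.0  # hard cap on cumulative wait

POLICY = RetryPolicy()
print(POLICY)
```

Pasting something like this into the prompt turns implicit operational limits into something the model can actually respect.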
Does this maintain our security posture? AI might use libraries or approaches that introduce vulnerabilities in your specific context.
I use AI-powered analysis tools to validate that generated code handles our specific data patterns correctly. But the final verification is always manual—checking that the code makes sense within our system's constraints.
The Validation Checklist I Actually Use
Before shipping any AI-generated code, I run through this checklist:
- [ ] Generated by at least two different models and compared: different implementations reveal ambiguities in requirements
- [ ] Manually traced execution with realistic inputs: confirms the code does what I think it does, not just what it claims to do
- [ ] Reviewed by a different AI model than the one that generated it: catches blind spots specific to the generating model
- [ ] Test cases generated and reviewed for gaps: AI-generated tests reveal what the model considered important
- [ ] Additional tests written for time, resources, concurrency, scale: areas where AI consistently misses edge cases
- [ ] Verified integration with existing systems: confirms the code works in context, not just in isolation
- [ ] Checked against operational constraints: ensures the code respects system-specific limits and requirements
- [ ] Security review for libraries, approaches, data handling: AI might introduce vulnerabilities specific to your context
- [ ] Performance tested with production-scale data: confirms the code doesn't just work but works at scale
- [ ] Documentation reviewed for accuracy: AI-generated docs often describe what the code should do, not what it actually does
This sounds like a lot. In practice, it takes 10-15 minutes for a typical function. That's longer than reviewing human-written code, but shorter than debugging production incidents caused by skipping validation.
What This Workflow Catches
Implicit requirements the AI missed: Multi-model generation reveals ambiguities you didn't realize existed in your requirements.
Logic errors that look syntactically correct: Manual execution tracing catches bugs that pass automated testing.
Model-specific blind spots: Cross-model review surfaces assumptions one model made that another would question.
Missing edge cases: Test case generation plus manual additions ensure coverage of scenarios AI doesn't naturally consider.
Context mismatches: Validation against system constraints catches code that works in isolation but fails in production.
What This Workflow Costs
Time: 10-15 minutes per function instead of 2-3 minutes for standard review.
Context switching: Using multiple models means explaining the same requirements multiple times.
Cognitive load: Comparing implementations and tracing execution requires more mental effort than accepting the first plausible solution.
Tool overhead: Managing multiple AI models and comparison workflows requires infrastructure.
But here's what I learned: the time cost of validation is negligible compared to the time cost of debugging production issues caused by skipped validation.
That forty-minute payment outage cost me six hours of debugging, incident response, and customer communication. Plus reputation damage that's harder to quantify.
The validation workflow would have caught that bug in ten minutes. The ROI is obvious.
The Skills This Workflow Requires
Validating AI code isn't about knowing how to prompt better. It's about developing specific review skills:
The ability to read code behaviorally, not structurally. Don't ask if the code looks right. Ask what it actually does with specific inputs.
Pattern recognition for AI failure modes. After validating dozens of AI-generated functions, you start recognizing the types of bugs AI consistently introduces.
The discipline to check what seems obvious. AI code looks so clean and confident that your brain wants to trust it. You need to develop skepticism that overrides that instinct.
Comfort with multiple models. You need to be fluent enough with different AI systems to quickly generate and compare implementations.
The judgment to know when validation is overkill. Not every AI-generated snippet needs full validation. A one-line string transformation doesn't need multi-model review. A payment processing function does.
When I Skip Steps
I don't run every AI-generated code snippet through full validation. That would be inefficient.
I skip validation for:
Pure data transformations with no side effects. If the function just transforms input to output with no external dependencies, the input/output tests are usually sufficient.
Code I'm going to manually rewrite anyway. Sometimes I use AI to generate a starting point that I'll completely refactor. Full validation is overkill.
Non-critical scripts and tools. Deployment scripts, data migration helpers, one-off analysis tools—if failure is low-cost, lightweight validation is fine.
Code that's easy to verify through use. UI components, formatting utilities, display logic—if you can immediately see whether it works through normal use, formal validation isn't necessary.
I run full validation for:
Anything that handles money, authentication, or user data. High-stakes code gets maximum scrutiny.
Performance-critical paths. Code that needs to scale or run efficiently requires validation of resource usage and timing behavior.
Complex business logic. Anything implementing domain-specific rules where correctness isn't obvious from casual inspection.
Integration points between systems. Code that connects different parts of your architecture where bugs can cascade.
The judgment about when to validate thoroughly is a skill you develop by seeing what types of AI-generated code tend to have subtle bugs versus what's usually fine.
What Changed After Adopting This Workflow
I ship AI-generated code confidently. Before the workflow, every deploy felt risky. Now I trust validated AI code as much as code I wrote myself.
I catch bugs before they reach production. The last six months: zero production incidents from AI-generated code. Compare that to one major incident in the first month before I had a validation workflow.
I write less code but understand it better. AI handles implementation, I focus on verification. This forces me to think deeply about behavior rather than syntax.
I'm faster overall despite validation overhead. AI generates code in seconds. Validation takes minutes. Writing code manually takes hours. Net win.
I've developed pattern recognition for AI failures. After validating hundreds of functions, I can spot likely bugs in AI code quickly. It's a learnable skill.
What I'd Tell Someone Starting Today
Don't ship AI code without validation. The time savings from AI generation disappear instantly when you have to debug production issues.
Build validation into your workflow from day one. Use multiple models to compare implementations. Manually trace execution. Cross-model review. Test comprehensively.
Use tools that make multi-model workflows practical. Platforms like Crompt AI let you generate and compare outputs from different models without switching between interfaces. This makes validation fast enough to actually do it.
Develop skepticism for code that looks too clean. AI-generated code is suspiciously well-structured. That's a red flag, not a green light.
Learn to recognize AI's failure patterns. Off-by-one errors, missing timeouts, ignored resource constraints, subtle regex bugs—these show up repeatedly. Pattern recognition makes validation faster.
The goal isn't to not use AI. It's to use it safely. AI can generate code faster than you can type. But only careful validation ensures that code actually works in production.
-Rohit