I shipped AI-generated code to production last week.
Not a toy project. Not a side hustle. Production code serving 50,000+ daily active users. The code passed code review, cleared CI/CD, handled edge cases, and has been running without issues for six days.
This isn't a "look how easy AI makes everything" post. This is about the unglamorous reality: the verification steps, the failure modes, the manual testing, the parts where AI screwed up, and the specific points in the workflow where human judgment is non-negotiable.
Because here's what nobody tells you about using AI in production: the prompt is 5% of the work. The other 95% is knowing when AI is lying to you.
I – The Failure Modes Nobody Talks About
AI-generated code fails in predictable ways.
I've reviewed about 200 AI-generated PRs over the last six months—my own and my team's. The patterns are consistent across GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro. Here's what actually breaks:
Failure Mode 1: Context Window Amnesia
AI forgets what you told it 50 messages ago. You're debugging a function, you've explained the business logic three times, then you ask it to refactor a related module and it completely ignores the constraints you specified earlier.
This isn't a model limitation—it's a workflow problem. You're treating AI like a stateful system when it's actually stateless beyond its context window.
Failure Mode 2: Confident Hallucination of APIs
AI will generate code that calls methods that don't exist. Not deprecated methods. Not misnamed methods. Methods it invented because they sound plausible.
I've seen it hallucinate Stripe API endpoints, Postgres functions, and AWS SDK methods. The code compiles. The types check. The logic looks sound. Then you run it and get a NoSuchMethodError.
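Here's a minimal sketch of that failure mode. The client and the method are hypothetical; the point is that a loose `any` type (common when wrapping untyped SDKs) lets an invented method sail through the compiler:

```typescript
// Hypothetical payments client: only `createRefund` actually exists.
const payments: any = {
  createRefund: async (chargeId: string) => ({ id: "re_123", chargeId }),
};

async function refundCustomer(customerId: string) {
  // Reads plausibly, compiles (the `any` type defeats checking), and fails at
  // runtime with "TypeError: payments.refundAllCharges is not a function".
  return payments.refundAllCharges(customerId); // hallucinated method
}

refundCustomer("cus_123").catch((err) => console.error(err.message));
```

Strictly typed SDK wrappers catch some of this at compile time. Everything else has to be caught by checking the docs, which is exactly what Stage 2 below is for.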
Failure Mode 3: Edge Case Blindness
AI optimizes for the happy path. Ask it to write input validation and it'll handle empty strings and null values. It won't handle:
- Unicode edge cases that break regex
- Integer overflow on large inputs
- Timezone handling across DST transitions
- Race conditions under concurrent load
- Memory leaks with long-running processes
It's not that AI can't reason about edge cases. It's that it doesn't prioritize them unless you explicitly prompt for them. And by the time you've prompted for every edge case, you might as well have written the code yourself.
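To make the first bullet concrete, here's a minimal TypeScript sketch (the validators are hypothetical) of the regex and length handling AI typically produces versus what real input requires:

```typescript
// Hypothetical name validator of the kind AI tends to produce: fine for the
// ASCII happy path, wrong for real-world input.
const naiveName = /^[a-zA-Z ]+$/;

console.log(naiveName.test("Alice"));    // true, the happy path
console.log(naiveName.test("José"));     // false, legitimate input rejected
console.log(naiveName.test("François")); // false

// Length checks have the same blind spot: .length counts UTF-16 code units,
// not characters as users perceive them.
console.log("🚀".length); // 2, not 1

// A Unicode-aware version needs the `u` flag and property escapes.
const unicodeName = /^[\p{L}\p{M}' -]+$/u;
console.log(unicodeName.test("José"));   // true
```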
Failure Mode 4: Security Antipatterns
AI loves patterns. Unfortunately, some patterns are security vulnerabilities.
I've seen AI generate:
- SQL queries with string interpolation instead of parameterized queries
- JWT validation that doesn't verify signatures
- Password hashing with weak algorithms
- CORS configs that allow `*` (any origin) in production
- File upload handlers that don't validate content types
It mimics what it saw in training data. And training data includes a lot of bad code from StackOverflow and GitHub repos that prioritize "it works" over "it's secure."
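The first bullet is the one I catch most often. A minimal sketch with node-postgres (table and function names are illustrative):

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the PG* environment variables

// The shape AI reproduces from old tutorials: the email is spliced into the
// SQL string, so an input like "' OR '1'='1" becomes part of the query.
async function findUserUnsafe(email: string) {
  return pool.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// Parameterized version: the driver sends the value separately, so it is
// never parsed as SQL.
async function findUser(email: string) {
  return pool.query("SELECT * FROM users WHERE email = $1", [email]);
}
```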
The pattern is clear: AI accelerates velocity but introduces risk. The question is how to keep the velocity while mitigating the risk.
II – The Verification Protocol That Actually Works
Here's the workflow I use. It's not elegant. It's not fast. But it catches 95% of AI mistakes before they hit production.
Stage 1: Generation with Constraints
Don't just describe what you want. Describe what you don't want.
Bad prompt:
Write a function to process user uploads
Good prompt:
Write a function to process user uploads with these constraints:
- Max file size: 10MB
- Allowed types: PDF, DOCX only
- Must validate content type, not just extension
- Must scan for malware before processing
- Must handle S3 upload failures gracefully
- Must rate-limit to 5 uploads per user per hour
- Return specific error codes for each failure case
The constraints force AI to think about edge cases upfront. It won't catch everything, but it significantly reduces Stage 2 effort.
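The constraints also give you something concrete to verify against in Stage 2. A rough sketch of how the first three map to code, with illustrative names and error codes (malware scanning, S3, and rate limiting left out for brevity):

```typescript
const MAX_BYTES = 10 * 1024 * 1024; // 10MB
const ALLOWED_MIME_TYPES = new Set([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // DOCX
]);

type UploadError = "FILE_TOO_LARGE" | "UNSUPPORTED_TYPE";

// `detectedMime` is the sniffed content type (e.g. from magic bytes),
// not the user-supplied extension or Content-Type header.
function validateUpload(upload: {
  sizeBytes: number;
  detectedMime: string;
}): UploadError | null {
  if (upload.sizeBytes > MAX_BYTES) return "FILE_TOO_LARGE";
  if (!ALLOWED_MIME_TYPES.has(upload.detectedMime)) return "UNSUPPORTED_TYPE";
  return null; // passes the size and type constraints
}
```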
Stage 2: Diff Review (5 minutes per function)
Read every line AI generated. Not skimming—actually reading.
I use a checklist:
- [ ] Are all method calls real? (Check docs, don't trust AI)
- [ ] Are types correct? (Especially with TypeScript/Go generics)
- [ ] Is error handling comprehensive? (Not just `try/catch` everything)
- [ ] Are edge cases covered? (Null, empty, max values, unicode)
- [ ] Is this code defensive? (Assumes external input is hostile)
- [ ] Would this pass security review? (OWASP Top 10 basics)
If any answer is "no" or "unsure," don't merge. Send it back to AI with specific corrections or write it manually.
Stage 3: Unit Testing with Adversarial Cases
AI-generated code needs adversarial unit tests. Tests that try to break it.
Don't just test the happy path. Test:
- Empty inputs
- Null values
- Max integer values (Integer.MAX_VALUE, 2^63-1)
- Malformed strings (unterminated quotes, SQL injection attempts)
- Concurrent access (simulate race conditions)
- Timeout scenarios (mock slow external APIs)
Track the failures across test runs and you start seeing where each model consistently breaks: GPT-5 forgets null checks, Claude overcomplicates error handling, Gemini misses concurrent access issues.
Run these tests. Watch them fail. Fix the failures manually.
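Here's what "adversarial" looks like in practice, as a Vitest/Jest-style suite against a hypothetical parseAmount function (the function and its file path are illustrative, not from this project):

```typescript
import { describe, it, expect } from "vitest";
import { parseAmount } from "./parseAmount"; // hypothetical function under test

describe("parseAmount: adversarial cases", () => {
  it("rejects empty and whitespace-only input", () => {
    expect(() => parseAmount("")).toThrow();
    expect(() => parseAmount("   ")).toThrow();
  });

  it("rejects injection-looking strings instead of coercing them", () => {
    expect(() => parseAmount("100'; DROP TABLE orders;--")).toThrow();
  });

  it("handles values at the safe-integer boundary", () => {
    expect(parseAmount(String(Number.MAX_SAFE_INTEGER))).toBe(Number.MAX_SAFE_INTEGER);
    expect(() => parseAmount("1e309")).toThrow(); // overflows to Infinity
  });

  it("does not silently accept unicode digit lookalikes", () => {
    expect(() => parseAmount("１００")).toThrow(); // full-width digits
  });
});
```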
Stage 4: Integration Testing in Staging
AI doesn't understand your production environment.
It doesn't know your database is eventually consistent. It doesn't know your CDN has a 60-second cache TTL. It doesn't know your rate limiter works at the load balancer level, not the application level.
Deploy to staging. Run realistic load. Watch for:
- Memory leaks (use profilers, not guesses)
- N+1 queries (check query logs)
- Timeout cascades (simulate upstream service failures)
- Resource exhaustion (what happens at 10x load?)
If staging behaves differently than expected, don't assume production will be fine. Debug until you understand why.
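The N+1 check deserves a concrete picture, because ORMs and AI-generated loops both hide it well. A sketch, again with node-postgres and illustrative table names:

```typescript
import { Pool } from "pg";

// The N+1 shape: one query for the parent rows, then one more query per row.
async function loadOrdersNPlusOne(pool: Pool) {
  const { rows: orders } = await pool.query("SELECT id, customer_id FROM orders");
  for (const order of orders) {
    // 1,000 orders means 1,001 queries. Staging query logs make this obvious.
    await pool.query("SELECT name FROM customers WHERE id = $1", [order.customer_id]);
  }
}

// One round trip instead: join, or batch with WHERE id = ANY($1).
async function loadOrders(pool: Pool) {
  return pool.query(
    `SELECT o.id, c.name
       FROM orders o
       JOIN customers c ON c.id = o.customer_id`
  );
}
```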
Stage 5: Gradual Rollout
Never deploy AI-generated code to 100% of users immediately.
Feature flag it. Route 5% of traffic. Monitor error rates, latency, resource usage. If metrics look good after 24 hours, bump to 25%. Then 50%. Then 100%.
The moment error rates spike or latency degrades, roll back. Don't try to debug in production with real user traffic.
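The routing itself can be a few lines if you don't already have a feature-flag service. A sketch of deterministic percentage bucketing (the flag name is illustrative):

```typescript
import { createHash } from "node:crypto";

// Deterministic bucketing: the same user always lands in the same bucket, so
// raising the percentage from 5 to 25 only adds users; nobody flips back and
// forth between code paths.
function inRollout(userId: string, flag: string, percent: number): boolean {
  const digest = createHash("sha256").update(`${flag}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0-99
  return bucket < percent;
}

// Usage: route 5% of traffic through the new path and watch the metrics.
const userId = "user_42";
if (inRollout(userId, "ai-upload-pipeline", 5)) {
  console.log("new, AI-generated code path");
} else {
  console.log("existing code path");
}
```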
This protocol feels slow. It is slow. But "fast and broken" is slower than "deliberate and working."
III – When to Use AI vs When to Write It Yourself
Not every task benefits from AI assistance.
Use AI for:
Boilerplate that follows known patterns
- CRUD endpoints with standard validation
- Data model migrations
- API client wrappers
- Test fixtures and mocks
- Configuration files (Terraform, Kubernetes manifests)
Refactoring existing code
- AI is decent at structural changes when it can see the entire file
- Renaming variables consistently
- Extracting repeated logic into functions
- Converting callbacks to async/await
Exploratory prototyping
- Trying different algorithm approaches
- Comparing library APIs before committing
- Generating test data
- Scaffolding new services
Don't use AI for:
Security-critical code
- Authentication logic
- Authorization checks
- Cryptography
- Input sanitization
- Session management
Write this yourself. Or use battle-tested libraries. AI will generate something that "looks right" and has subtle vulnerabilities.
Performance-critical code
- Database query optimization
- Caching strategies
- Concurrency primitives
- Memory-efficient algorithms
AI optimizes for correctness, not performance. It will generate O(n²) algorithms when O(n log n) is needed. It will add locks that create contention. It will cache the wrong things.
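A typical example of the complexity trap (both functions are hypothetical but representative):

```typescript
// Both are "correct". The first is O(n²) because .includes() rescans the
// output array on every iteration; it passes review and dies at 100k IDs.
function dedupeQuadratic(ids: string[]): string[] {
  const out: string[] = [];
  for (const id of ids) {
    if (!out.includes(id)) out.push(id);
  }
  return out;
}

// Single pass with a Set: O(n), same result, same order.
function dedupeLinear(ids: string[]): string[] {
  return [...new Set(ids)];
}
```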
Novel problem spaces
- Architecture decisions for new systems
- Distributed system coordination
- Complex state machines
- Domain-specific business logic with nuanced rules
AI pattern-matches. If your problem hasn't been solved a thousand times already, AI won't solve it correctly. It will generate something that sounds plausible and is fundamentally wrong.
Code you don't understand
- If you can't explain what the AI-generated code does line by line, don't ship it
- If you can't debug it when it breaks, don't ship it
- If you can't modify it confidently, don't ship it
The rule is simple: only use AI for code you could have written yourself, just slower. AI accelerates what you already know how to do. It doesn't replace what you don't.
IV – The Multi-Model Strategy
Different models have different strengths. Use them strategically.
GPT-5 for Initial Implementation
GPT-5 generates clean, idiomatic code quickly. It's best for:
- Writing first drafts of new features
- Generating multiple implementation approaches
- Prototyping algorithms
Weakness: Hallucinates APIs and forgets constraints after long conversations.
Claude Opus 4.1 for Logic Review
Claude excels at analyzing existing code for logical flaws. It's best for:
- Reviewing AI-generated code for edge cases
- Identifying race conditions and concurrency issues
- Explaining complex code written by others
Weakness: Overly verbose. Overthinks simple problems.
Gemini 2.5 Pro for Documentation and Testing
Gemini is excellent at synthesis. It's best for:
- Generating comprehensive unit tests
- Writing API documentation
- Creating integration test scenarios
Weakness: Struggles with nuanced business logic and domain-specific rules.
Grok 3 for Library Selection
Grok has real-time awareness of the current ecosystem. It's best for:
- Comparing actively maintained libraries
- Finding recent security patches
- Identifying deprecated dependencies
Weakness: Sometimes prioritizes "new" over "stable."
The workflow: Generate with GPT-5. Review logic with Claude. Test with Gemini. Verify dependencies with Grok.
When you can switch between these intelligences in one conversation, you stop context-switching and start building on what each model surfaces. One generates, another critiques, another tests. The result is better than any single model could produce.
But even with multiple models, you're still the final reviewer.
V – The Code Review Checklist (What Actually Matters)
Before approving any AI-generated code, I run through this:
Correctness
- [ ] Does this do what the spec says?
- [ ] Are all edge cases handled?
- [ ] Is error handling comprehensive?
- [ ] Do the types make sense?
- [ ] Are APIs used correctly? (Verify against docs)
Security
- [ ] Is user input validated?
- [ ] Are queries parameterized? (No string interpolation)
- [ ] Is authentication checked?
- [ ] Is authorization enforced?
- [ ] Are secrets kept out of logs/errors?
- [ ] Is rate limiting in place?
Performance
- [ ] Is the algorithmic complexity acceptable?
- [ ] Are database queries optimized?
- [ ] Is caching used appropriately?
- [ ] Will this scale to production load?
- [ ] Are there obvious memory leaks?
Maintainability
- [ ] Can I understand this code in 6 months?
- [ ] Are variable names clear?
- [ ] Is the structure logical?
- [ ] Is there a test that would catch regressions?
- [ ] Is this consistent with our codebase conventions?
Production Readiness
- [ ] Is logging sufficient for debugging?
- [ ] Are metrics instrumented?
- [ ] Is there a rollback plan?
- [ ] What happens if dependencies fail?
- [ ] Can this be deployed incrementally?
If the answer to any question is "no" or "unclear," the code isn't ready. Fix it or rewrite it.
Don't trust AI's confidence. AI will generate broken code with the same confident tone it uses for correct code. Your judgment is the quality gate.
VI – What I've Learned Shipping AI Code for 6 Months
Lesson 1: AI Code Requires More Review, Not Less
AI accelerates writing. It doesn't accelerate review. You still need to verify everything. The time savings come from faster drafts, not faster reviews.
Lesson 2: Model Selection Matters
Using one model for everything is suboptimal. Different models have different failure modes. Route tasks to the model that's most reliable for that task type.
Lesson 3: Context Persistence is Critical
Losing context between model switches kills productivity. When you can maintain conversation continuity across multiple models, you stop re-explaining project context and start building complex features that require input from different intelligences.
Lesson 4: AI is a Junior Dev, Not a Senior Dev
Treat AI output like you'd treat code from a junior engineer: capable of good work, but requiring oversight. The more critical the code, the more scrutiny it needs.
Lesson 5: The Prompt is the Spec
If your prompt is vague, your code will be vague. If your prompt omits constraints, your code will violate constraints. The quality of AI output is bounded by the quality of your specification.
Lesson 6: Tests are Non-Negotiable
AI-generated code without tests is technical debt. Write tests. Run them. Watch them fail. Fix the failures. This isn't optional.
Lesson 7: Production is the Real Test
Staging catches 80% of issues. Production catches the remaining 20%. Deploy gradually. Monitor aggressively. Roll back quickly when metrics degrade.
The Honest Trade-off
Here's what using AI in production actually looks like:
Time to first draft: 10x faster
Time to reviewed, tested, production-ready code: 2-3x faster
Risk of subtle bugs: Higher if you skip verification steps
Code quality: Equal to manually written code if you verify properly
The 10x claim is marketing. The 2-3x improvement is real, but only if you have robust verification processes.
If your process is "generate code, ship code," you will ship bugs. If your process is "generate, review, test, stage, gradual rollout," you'll ship quality code faster than writing it manually.
The gap between developers with disciplined AI workflows and those winging it is growing every week.
I shipped AI-generated code to production. It's still running. But that's not because AI is magic—it's because I treated it like any other code: with skepticism, verification, and gradual rollout.
The question isn't whether to use AI in production. It's whether you have the discipline to use it safely.
-Leena:)