I broke production three times learning these rules.
Not catastrophically. No data loss, no security breaches, no angry customer emails. But three incidents where AI-generated code made it through review, passed tests, deployed successfully, then failed under real load in ways staging didn't catch.
The first time, I blamed the AI model. The second time, I blamed my testing. The third time, I realized the problem wasn't the tool—it was that I didn't have rules for when to trust it.
Here are the rules I use now. They're not comprehensive. They won't prevent every bug. But they've eliminated the entire class of "AI confidently generated something completely wrong" incidents that plagued my first six months of shipping AI code.
Rule 1: Never Trust AI on Code You Can't Debug
This sounds obvious. It's not.
You'll be tempted to use AI for complex algorithms, distributed system patterns, performance optimizations—areas where you're not an expert. AI will generate something that looks sophisticated. It will include design patterns you've heard of but never implemented. It will use language features you didn't know existed.
Don't ship it.
If you can't step through the code with a debugger and explain every line, you don't understand it well enough to maintain it. And code you can't maintain will become technical debt the moment something breaks.
Here's the test: Could you fix a bug in this code at 2 AM without Googling anything? If the answer is no, either learn it deeply enough that the answer becomes yes, or write simpler code that you understand.
The gap between "this looks right" and "I know why this is right" is where production incidents live.
What this looks like in practice:
AI suggests using a lock-free concurrent data structure to fix a race condition. You don't fully understand memory ordering guarantees.
Wrong move: Ship it because AI said it's correct.
Right move: Use a standard mutex. It's slower but you understand the guarantees.
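Concretely, the boring option might be nothing more than this minimal Python sketch (the counter stands in for whatever shared state the race was over):

```python
import threading

class Counter:
    """A shared counter protected by a plain mutex.

    Slower than a lock-free design, but the guarantee is simple:
    only one thread is ever inside the critical section.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:  # acquire/release is explicit and easy to reason about
            self._value += 1

    def value(self):
        with self._lock:
            return self._value
```

You can step through every line of that at 2 AM. That's the bar.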
AI generates a complex SQL query with multiple CTEs and window functions that somehow returns correct results.
Wrong move: Ship it because the tests pass.
Right move: Rewrite it with joins you understand, even if it's less elegant.
Sophistication is the enemy of maintainability. Simple code that works beats clever code that might work.
Rule 2: AI-Generated Code Gets Adversarial Testing
Unit tests verify the happy path. Adversarial tests try to break things.
AI optimizes for "does this work in the common case?" It doesn't optimize for "what happens when the input is malicious, the network is flaky, the database is slow, and memory is constrained?"
Your job is to think like an attacker, a chaos engineer, and a user who does everything wrong.
Adversarial test categories:
1. Boundary conditions
- Empty strings, null values, undefined
- Integer overflow (2^31-1, 2^63-1)
- Floating point edge cases (NaN, Infinity, -0)
- Empty arrays, single-element arrays, massive arrays
2. Malicious input
- SQL injection attempts (`' OR '1'='1`)
- XSS payloads (`<script>alert('xss')</script>`)
- Path traversal (`../../etc/passwd`)
- Regex DoS (deeply nested patterns)
- Unicode exploits (zero-width characters, RTL overrides)
3. Resource exhaustion
- What happens at 10x expected load?
- What happens when memory is limited?
- What happens when CPU is pegged?
- What happens when disk is full?
4. Timing and concurrency
- Race conditions under concurrent access
- Deadlocks when multiple resources are locked
- Starvation when one process monopolizes resources
- Data corruption when writes aren't atomic
5. Dependency failures
- What happens when the database is unreachable?
- What happens when external APIs timeout?
- What happens when cache is unavailable?
- What happens when file system writes fail?
When you can systematically analyze failure patterns across test runs, you start seeing which edge cases AI consistently misses. Then you add those to your adversarial test suite.
Write these tests before you review the AI code. Let the tests fail. Then evaluate whether AI handled the failure gracefully or catastrophically.
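In practice, that can be a single parametrized test file. A sketch with pytest, run against a hypothetical `sanitize_username` function (the module path, function name, and contract are placeholders for your own code):

```python
import pytest

from myapp.validation import sanitize_username  # hypothetical module under review

# Inputs drawn from the categories above: boundaries, malicious payloads, Unicode tricks.
ADVERSARIAL_INPUTS = [
    "",                                   # empty string
    None,                                 # null value
    "a" * 1_000_000,                      # absurdly long input
    "' OR '1'='1",                        # SQL injection attempt
    "<script>alert('xss')</script>",      # XSS payload
    "../../etc/passwd",                   # path traversal
    "admin\u200b",                        # zero-width space
    "\u202enimda",                        # right-to-left override
]

@pytest.mark.parametrize("raw", ADVERSARIAL_INPUTS)
def test_sanitize_rejects_or_neutralizes(raw):
    # The contract: reject cleanly with ValueError, or return output with the
    # dangerous characters stripped. Crashing or echoing input back are both failures.
    try:
        cleaned = sanitize_username(raw)
    except ValueError:
        return  # a clean rejection is an acceptable outcome
    assert cleaned is not None
    assert "<" not in cleaned and ">" not in cleaned
    assert ".." not in cleaned
```

Every input in that list comes from a category above. When you find a new way AI code falls over, it goes into the list.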
Rule 3: Context Switching Kills Quality—Minimize It
Every time you explain context to a new AI chat, you're playing telephone.
You explain the project. AI misunderstands something subtle. You don't notice. You ask for code. AI generates something based on its misunderstanding. You review it, see that it looks reasonable, and ship it. Then it breaks in production because the foundational assumption was wrong.
The problem isn't AI's fault—it's that you're re-explaining context every time you switch models or start a new conversation. And every re-explanation introduces drift.
The pattern I see constantly:
Developer starts a feature in GPT-5. Explains the requirements. Gets 60% of the way done. Switches to Claude Opus 4.1 for better logic. Re-explains the requirements from memory. Misses a constraint. Claude generates code that violates that constraint. Developer doesn't notice because Claude's code looks correct in isolation.
The fix isn't better memory. The fix is maintaining context across model switches.
When you can carry conversation history across different models, each model sees the full context. GPT-5 explains the architectural constraints. Claude sees those constraints when it generates the implementation. Gemini sees both when it writes tests. Nothing gets lost in translation.
This isn't about convenience. It's about correctness. Fragmented context creates fragmented understanding. Continuous context creates continuous verification.
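Mechanically, continuous context just means every call sees the same running transcript instead of a summary you retype from memory. A rough sketch in Python (the `call_model` stub and the constraint text are placeholders; swap in whatever client or router you actually use):

```python
def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder client: swap in your real API or router. Returns a stub reply."""
    return f"[{model}] reply based on {len(messages)} prior messages"

# One shared transcript, appended to by every model in the workflow.
history: list[dict] = [
    {"role": "user", "content": "Constraints: multi-tenant, Postgres, no PII in logs."},  # illustrative
]

def ask(model: str, prompt: str) -> str:
    """Send the full history plus the new prompt, then record the reply."""
    history.append({"role": "user", "content": prompt})
    reply = call_model(model, messages=history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Each hand-off inherits everything said before it; nothing is re-explained from memory.
design = ask("gpt-5", "Propose a module layout for the billing feature.")
impl = ask("claude-sonnet-4.5", "Implement the proration logic from that layout.")
tests = ask("gemini-2.5-pro", "Write adversarial tests for the proration logic.")
```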
Rule 4: Different Models for Different Risks
Not all code is equal. Authentication logic isn't the same as a button's hover state. Production database migrations aren't the same as test fixtures.
Match the model to the risk profile.
Low risk, high repetition → GPT-5
- UI components with no business logic
- Test fixtures and mock data
- Configuration files (YAML, JSON, TOML)
- Documentation and comments
- Boilerplate CRUD operations
GPT-5 is fast. Use it for code where mistakes are obvious and easy to fix.
Medium risk, logic-heavy → Claude Sonnet 4.5
- Complex business logic
- State machines and workflows
- Data transformation pipelines
- Algorithm implementations
- API integrations
Claude is precise. Use it for code where logical correctness matters more than speed.
High risk, research-required → Gemini 2.5 Pro
- Evaluating library trade-offs
- Security-sensitive implementations
- Performance-critical paths
- Compliance-related code
- Integration with unfamiliar systems
Gemini synthesizes well. Use it when you need to compare multiple approaches before committing.
Critical risk → Write it yourself
- Authentication and authorization
- Cryptography and key management
- Payment processing
- Data deletion and privacy
- Disaster recovery
No AI. No exceptions. These are the parts where "looks right" isn't good enough.
The mistake I see developers make: using one model for everything. GPT-5 is great for prototyping but hallucinates APIs. Claude is thorough but over-engineers simple tasks. Gemini is comprehensive but verbose.
Know which model fails where. Route accordingly.
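If you want the routing to be explicit rather than a per-task judgment call, even a trivial lookup helps. A sketch that mirrors the tiers above (the mapping is the point, not the specific model strings):

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"            # boilerplate, fixtures, config, docs
    MEDIUM = "medium"      # business logic, pipelines, integrations
    HIGH = "high"          # security-sensitive, performance-critical, compliance
    CRITICAL = "critical"  # auth, crypto, payments, deletion, disaster recovery

# Route by risk, not by habit. CRITICAL deliberately maps to no model at all.
MODEL_FOR_RISK = {
    Risk.LOW: "gpt-5",
    Risk.MEDIUM: "claude-sonnet-4.5",
    Risk.HIGH: "gemini-2.5-pro",
    Risk.CRITICAL: None,
}

def pick_model(risk: Risk) -> str:
    model = MODEL_FOR_RISK[risk]
    if model is None:
        raise ValueError("Critical-risk code gets written by a human. No exceptions.")
    return model
```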
Rule 5: Stage Like Your Job Depends on It
AI code that works in development will fail in staging. AI code that works in staging will fail in production. This is inevitable.
The question isn't whether it will fail. The question is whether you catch the failure before users do.
Staging must be production-like:
Use production data volumes
AI-generated queries that work on 100 test records might time out on 10M production records. Test with realistic data sizes.
Use production traffic patterns
AI doesn't understand your actual usage patterns. Spike traffic. Simulate bursty load. Test concurrent users.
Use production resource constraints
AI assumes unlimited memory and CPU. Constrain resources in staging to match production limits.
Use production dependency behaviors
Mock external services to return realistic latencies, error rates, and timeout patterns. AI optimizes for "everything works perfectly." Reality doesn't.
Use production monitoring
Deploy the same observability stack. Watch metrics in real time. If error rates spike, latency increases, or memory grows unbounded, don't deploy to production.
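Of these, dependency behavior is the easiest to fake badly. A sketch of a wrapper that injects realistic latency and failures around any client call (the rates and latency here are invented; tune them to what your production dashboards actually show):

```python
import random
import time

class FlakyDependency:
    """Wrap a real or mocked client so staging sees realistic misbehavior."""

    def __init__(self, client, error_rate=0.02, p95_latency_s=0.8, timeout_rate=0.01):
        self._client = client
        self._error_rate = error_rate
        self._p95_latency_s = p95_latency_s
        self._timeout_rate = timeout_rate

    def call(self, method: str, *args, **kwargs):
        # Crude latency model: anywhere up to your measured p95, instead of "instant".
        time.sleep(random.uniform(0, self._p95_latency_s))
        roll = random.random()
        if roll < self._timeout_rate:
            raise TimeoutError(f"{method} timed out (simulated)")
        if roll < self._timeout_rate + self._error_rate:
            raise ConnectionError(f"{method} failed (simulated 5xx)")
        return getattr(self._client, method)(*args, **kwargs)
```

Point staging at the wrapper instead of the raw client, and the "everything responds instantly and never fails" assumption dies in staging instead of production.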
The deployment protocol:
- Deploy to staging
- Route 5% of staging traffic to the new code
- Monitor for 24 hours
- If metrics are stable, bump to 50%
- Monitor for another 24 hours
- If still stable, deploy to production at 5%
- Gradual rollout: 5% → 25% → 50% → 100% over 3-4 days
The moment anything looks wrong—error rate increase, latency spike, memory leak—roll back immediately. Don't try to debug in production.
This feels slow. It is slow. But "slow and stable" beats "fast and broken" every time.
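The piece worth automating is the rollback decision itself, so nobody talks themselves into watching a bad canary a little longer. A rough sketch (stages, thresholds, and metric names are placeholders; wire them to your own monitoring):

```python
ROLLOUT_STAGES = [5, 25, 50, 100]  # percent of traffic, spread over 3-4 days

def should_roll_back(baseline: dict, canary: dict) -> bool:
    """Compare the canary against the current baseline; any regression means roll back."""
    return (
        canary["error_rate"] > baseline["error_rate"] * 1.5
        or canary["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.3
        or canary["memory_mb"] > baseline["memory_mb"] * 1.2
    )
```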
Rule 6: Code Review is Non-Negotiable (Even for "Simple" Changes)
AI will convince you that the code is so simple it doesn't need review.
It's lying.
I've seen AI generate code that:
- Looked like a trivial refactor but introduced a subtle race condition
- Appeared to be a simple null check but broke error propagation
- Seemed like basic input validation but missed Unicode edge cases
- Looked like an optimization but changed semantics
Every AI-generated change gets reviewed. No exceptions.
The review checklist I use:
Correctness
- Does this actually solve the problem?
- Are edge cases handled?
- Is error handling appropriate?
- Do types make sense?
Security
- Is input validated?
- Are queries parameterized?
- Are secrets kept out of logs?
- Is authorization enforced?
Performance
- Will this scale to production load?
- Are database queries optimized?
- Is there unnecessary computation?
- Are there obvious memory leaks?
Maintainability
- Can I understand this in 6 months?
- Is it consistent with our codebase?
- Are variable names clear?
- Is there test coverage?
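A couple of the security items are mechanical enough to show. "Are queries parameterized?" is the difference between these two functions (SQLite and the schema are just for illustration):

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Fails the check: user input is spliced into the SQL string,
    # so "' OR '1'='1" walks straight through.
    return conn.execute(
        f"SELECT id, email FROM users WHERE username = '{username}'"
    ).fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Passes the check: the driver binds the value; input stays data, never SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?",
        (username,),
    ).fetchone()
```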
If you can't confidently answer "yes" to everything, don't merge.
And here's the uncomfortable part: most AI code fails at least one of these checks. You need to be willing to reject AI suggestions, ask for revisions, or rewrite sections manually.
Treating AI like a senior developer who doesn't need oversight is how bugs reach production.
Rule 7: AI Explanations are Not Documentation
AI will explain its code confidently. Those explanations are often wrong.
I've seen AI:
- Claim code is thread-safe when it has race conditions
- Say error handling is comprehensive when it only catches one exception type
- Describe algorithms as O(n) when they're O(n²)
- Assert security practices are followed when they're violated
Don't trust AI's explanations. Verify the code itself.
What to do instead:
Read the actual code, not the explanation
Line by line. What does it actually do, not what AI says it does?
Run the code in a debugger
Step through with real inputs. Watch what actually happens. Compare to what AI claimed would happen.
Write your own tests
Don't ask AI to generate tests for its own code. It will generate tests that pass, not tests that verify correctness.
Check external documentation
If AI says "this API endpoint accepts these parameters," verify against official docs. AI hallucinates API behavior constantly.
Use static analysis tools
Linters, type checkers, security scanners. They catch what code review misses.
The rule: Trust what the code does, not what AI says it does.
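Even the complexity claim is cheap to check. A crude doubling test is noisy, but it catches the "O(n)" that's actually O(n²); `dedupe` here stands in for whatever function AI just explained to you:

```python
import random
import time

def dedupe(items):
    # Stand-in for the AI-generated function whose explanation you're verifying.
    out = []
    for x in items:
        if x not in out:  # hidden O(n^2): this membership test scans the whole list
            out.append(x)
    return out

def timed(fn, n):
    data = [random.randrange(n) for _ in range(n)]
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

small, large = timed(dedupe, 5_000), timed(dedupe, 10_000)
# A genuinely O(n) function roughly doubles when the input doubles; this one roughly quadruples.
print(f"5k: {small:.2f}s  10k: {large:.2f}s  ratio: {large / small:.1f}x")
```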
Rule 8: Keep Humans in the Loop on High-Stakes Decisions
AI should accelerate your decisions, not make them for you.
Architecture choices, library selections, database schema changes, API contracts—these are one-way doors. Once you commit, reversing is expensive.
For high-stakes decisions:
Use AI to generate options
Ask multiple models to propose different approaches. GPT-5 might suggest microservices, Claude might suggest a modular monolith, Gemini might suggest serverless.
Use AI to analyze trade-offs
Each approach has pros and cons. AI can articulate them. But it can't weigh them according to your specific context.
Make the decision yourself
You know your team's skill level, your timeline, your constraints, your organization's risk tolerance. AI doesn't.
Use AI to implement the decision
Once you've chosen the approach, AI can help build it. But you chose the direction.
The pattern I've seen fail: Developer asks AI "should I use Postgres or MongoDB?" AI picks one. Developer builds the entire system around that choice. Six months later, it's the wrong choice and refactoring is prohibitively expensive.
AI can inform. It can't decide. Keep the strategic decisions in human hands.
What Actually Works
Here's the honest reality after a year of shipping AI code:
AI velocity multiplier: 2-3x, not 10x
The marketing says 10x faster. The reality is 2-3x faster if you have disciplined processes. Without processes, you ship bugs faster, not features.
Review time doesn't decrease
AI accelerates writing. Review still takes the same time because you're verifying correctness, not just reading code.
Bugs per line of code stays constant
AI doesn't magically write bug-free code. It writes code at roughly the same defect rate as humans. You ship more code, you ship more bugs—unless you verify rigorously.
The quality gap compounds
Teams with strong AI + review processes pull ahead. Teams treating AI like magic fall behind because they're constantly fixing incidents.
Model selection matters more than prompting
Better prompts help. But routing tasks to the right model prevents entire classes of failures.
When you can switch models mid-workflow without losing context, you stop debating "which model should I use?" and start using the right model for each task. Generate with GPT-5, review with Claude, test with Gemini, all in one continuous conversation.
The developers who figure this out ship faster and break less. The developers who treat AI as a black box ship faster and break more.
I broke production three times. I learned eight rules. I haven't broken production with AI code since.
The rules aren't complex. They're just discipline. Discipline to verify. Discipline to test adversarially. Discipline to stage properly. Discipline to reject code you don't understand.
AI makes it easy to move fast. These rules make it safe.
-Leena:)