I broke production three times learning these rules.
Not catastrophically. No data loss, no security breaches, no angry customer emails. But three incidents where AI-generated code made it through review, passed tests, deployed successfully, then failed under real load in ways staging didn't catch.
The first time, I blamed the AI model. The second time, I blamed my testing. The third time, I realized the problem wasn't the tool—it was that I didn't have rules for when to trust it.
Here are the rules I use now. They're not comprehensive. They won't prevent every bug. But they've eliminated the entire class of "AI confidently generated something completely wrong" incidents that plagued my first six months of shipping AI code.
Rule 1: Never Trust AI on Code You Can't Debug
This sounds obvious. It's not.
You'll be tempted to use AI for complex algorithms, distributed system patterns, performance optimizations—areas where you're not an expert. AI will generate something that looks sophisticated. It will include design patterns you've heard of but never implemented. It will use language features you didn't know existed.
Don't ship it.
If you can't step through the code with a debugger and explain every line, you don't understand it well enough to maintain it. And code you can't maintain will become technical debt the moment something breaks.
Here's the test: Could you fix a bug in this code at 2 AM without Googling anything? If the answer is no, either learn it deeply enough that the answer becomes yes, or write simpler code that you understand.
The gap between "this looks right" and "I know why this is right" is where production incidents live.
What this looks like in practice:
AI suggests using a lock-free concurrent data structure to fix a race condition. You don't fully understand memory ordering guarantees.
Wrong move: Ship it because AI said it's correct.
Right move: Use a standard mutex. It's slower but you understand the guarantees.
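Concretely, the boring option might be nothing more than this minimal Python sketch (the counter stands in for whatever shared state the race was over):

```python
import threading

class Counter:
    """A shared counter protected by a plain mutex.

    Slower than a lock-free design, but the guarantee is simple:
    only one thread is ever inside the critical section.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._value = 0

    def increment(self):
        with self._lock:  # acquire/release is explicit and easy to reason about
            self._value += 1

    def value(self):
        with self._lock:
            return self._value
```

You can step through every line of that at 2 AM. That's the bar.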
AI generates a complex SQL query with multiple CTEs and window functions that somehow returns correct results.
Wrong move: Ship it because the tests pass.
Right move: Rewrite it with joins you understand, even if it's less elegant.
Sophistication is the enemy of maintainability. Simple code that works beats clever code that might work.
Rule 2: AI-Generated Code Gets Adversarial Testing
Unit tests verify the happy path. Adversarial tests try to break things.
AI optimizes for "does this work in the common case?" It doesn't optimize for "what happens when the input is malicious, the network is flaky, the database is slow, and memory is constrained?"
Your job is to think like an attacker, a chaos engineer, and a user who does everything wrong.
Adversarial test categories:
1. Boundary conditions
- Empty strings, null values, undefined
- Integer overflow (2^31-1, 2^63-1)
- Floating point edge cases (NaN, Infinity, -0)
- Empty arrays, single-element arrays, massive arrays
2. Malicious input
- SQL injection attempts (`' OR '1'='1`)
- XSS payloads (`<script>alert('xss')</script>`)
- Path traversal (`../../etc/passwd`)
- Regex DoS (deeply nested patterns)
- Unicode exploits (zero-width characters, RTL overrides)
3. Resource exhaustion
- What happens at 10x expected load?
- What happens when memory is limited?
- What happens when CPU is pegged?
- What happens when disk is full?
4. Timing and concurrency
- Race conditions under concurrent access
- Deadlocks when multiple resources are locked
- Starvation when one process monopolizes resources
- Data corruption when writes aren't atomic
5. Dependency failures
- What happens when the database is unreachable?
- What happens when external APIs timeout?
- What happens when cache is unavailable?
- What happens when file system writes fail?
When you can systematically analyze failure patterns across test runs, you start seeing which edge cases AI consistently misses. Then you add those to your adversarial test suite.
Write these tests before you review the AI code. Let the tests fail. Then evaluate whether AI handled the failure gracefully or catastrophically.
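In practice, that can be a single parametrized test file. A sketch with pytest, run against a hypothetical `sanitize_username` function (the module path, function name, and contract are placeholders for your own code):

```python
import pytest

from myapp.validation import sanitize_username  # hypothetical module under review

# Inputs drawn from the categories above: boundaries, malicious payloads, Unicode tricks.
ADVERSARIAL_INPUTS = [
    "",                                   # empty string
    None,                                 # null value
    "a" * 1_000_000,                      # absurdly long input
    "' OR '1'='1",                        # SQL injection attempt
    "<script>alert('xss')</script>",      # XSS payload
    "../../etc/passwd",                   # path traversal
    "admin\u200b",                        # zero-width space
    "\u202enimda",                        # right-to-left override
]

@pytest.mark.parametrize("raw", ADVERSARIAL_INPUTS)
def test_sanitize_rejects_or_neutralizes(raw):
    # The contract: reject cleanly with ValueError, or return output with the
    # dangerous characters stripped. Crashing or echoing input back are both failures.
    try:
        cleaned = sanitize_username(raw)
    except ValueError:
        return  # a clean rejection is an acceptable outcome
    assert cleaned is not None
    assert "<" not in cleaned and ">" not in cleaned
    assert ".." not in cleaned
```

Every input in that list comes from a category above. When you find a new way AI code falls over, it goes into the list.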
Rule 3: Context Switching Kills Quality—Minimize It
Every time you explain context to a new AI chat, you're playing telephone.
You explain the project. AI misunderstands something subtle. You don't notice. You ask for code. AI generates something based on its misunderstanding. You review it, see that it looks reasonable, and ship it. Then it breaks in production because the foundational assumption was wrong.
The problem isn't AI's fault—it's that you're re-explaining context every time you switch models or start a new conversation. And every re-explanation introduces drift.
The pattern I see constantly:
Developer starts a feature in GPT-5. Explains the requirements. Gets 60% of the way done. Switches to Claude Opus 4.1 for better logic. Re-explains the requirements from memory. Misses a constraint. Claude generates code that violates that constraint. Developer doesn't notice because Claude's code looks correct in isolation.
The fix isn't better memory. The fix is maintaining context across model switches.
When you can carry conversation history across different models, each model sees the full context. GPT-5 explains the architectural constraints. Claude sees those constraints when it generates the implementation. Gemini sees both when it writes tests. Nothing gets lost in translation.
This isn't about convenience. It's about correctness. Fragmented context creates fragmented understanding. Continuous context creates continuous verification.
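Mechanically, continuous context just means every call sees the same running transcript instead of a summary you retype from memory. A rough sketch in Python (the `call_model` stub and the constraint text are placeholders; swap in whatever client or router you actually use):

```python
def call_model(model: str, messages: list[dict]) -> str:
    """Placeholder client: swap in your real API or router. Returns a stub reply."""
    return f"[{model}] reply based on {len(messages)} prior messages"

# One shared transcript, appended to by every model in the workflow.
history: list[dict] = [
    {"role": "user", "content": "Constraints: multi-tenant, Postgres, no PII in logs."},  # illustrative
]

def ask(model: str, prompt: str) -> str:
    """Send the full history plus the new prompt, then record the reply."""
    history.append({"role": "user", "content": prompt})
    reply = call_model(model, messages=history)
    history.append({"role": "assistant", "content": reply})
    return reply

# Each hand-off inherits everything said before it; nothing is re-explained from memory.
design = ask("gpt-5", "Propose a module layout for the billing feature.")
impl = ask("claude-sonnet-4.5", "Implement the proration logic from that layout.")
tests = ask("gemini-2.5-pro", "Write adversarial tests for the proration logic.")
```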
Rule 4: Different Models for Different Risks
Not all code is equal. Authentication logic isn't the same as a button's hover state. Production database migrations aren't the same as test fixtures.
Match the model to the risk profile.
Low risk, high repetition → GPT-5
- UI components with no business logic
- Test fixtures and mock data
- Configuration files (YAML, JSON, TOML)
- Documentation and comments
- Boilerplate CRUD operations
GPT-5 is fast. Use it for code where mistakes are obvious and easy to fix.
Medium risk, logic-heavy → Claude Sonnet 4.5
- Complex business logic
- State machines and workflows
- Data transformation pipelines
- Algorithm implementations
- API integrations
Claude is precise. Use it for code where logical correctness matters more than speed.
High risk, research-required → Gemini 2.5 Pro
- Evaluating library trade-offs
- Security-sensitive implementations
- Performance-critical paths
- Compliance-related code
- Integration with unfamiliar systems
Gemini synthesizes well. Use it when you need to compare multiple approaches before committing.
Critical risk → Write it yourself
- Authentication and authorization
- Cryptography and key management
- Payment processing
- Data deletion and privacy
- Disaster recovery
No AI. No exceptions. These are the parts where "looks right" isn't good enough.
The mistake I see developers make: using one model for everything. GPT-5 is great for prototyping but hallucinates APIs. Claude is thorough but over-engineers simple tasks. Gemini is comprehensive but verbose.
Know which model fails where. Route accordingly.
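If you want the routing to be explicit rather than a per-task judgment call, even a trivial lookup helps. A sketch that mirrors the tiers above (the mapping is the point, not the specific model strings):

```python
from enum import Enum

class Risk(Enum):
    LOW = "low"            # boilerplate, fixtures, config, docs
    MEDIUM = "medium"      # business logic, pipelines, integrations
    HIGH = "high"          # security-sensitive, performance-critical, compliance
    CRITICAL = "critical"  # auth, crypto, payments, deletion, disaster recovery

# Route by risk, not by habit. CRITICAL deliberately maps to no model at all.
MODEL_FOR_RISK = {
    Risk.LOW: "gpt-5",
    Risk.MEDIUM: "claude-sonnet-4.5",
    Risk.HIGH: "gemini-2.5-pro",
    Risk.CRITICAL: None,
}

def pick_model(risk: Risk) -> str:
    model = MODEL_FOR_RISK[risk]
    if model is None:
        raise ValueError("Critical-risk code gets written by a human. No exceptions.")
    return model
```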
Rule 5: Stage Like Your Job Depends on It
AI code that works in development will fail in staging. AI code that works in staging will fail in production. This is inevitable.
The question isn't whether it will fail. The question is whether you catch the failure before users do.
Staging must be production-like:
Use production data volumes
AI-generated queries that work on 100 test records might time out on 10M production records. Test with realistic data sizes.
Use production traffic patterns
AI doesn't understand your actual usage patterns. Spike traffic. Simulate bursty load. Test concurrent users.
Use production resource constraints
AI assumes unlimited memory and CPU. Constrain resources in staging to match production limits.
Use production dependency behaviors
Mock external services to return realistic latencies, error rates, and timeout patterns. AI optimizes for "everything works perfectly." Reality doesn't.
Use production monitoring
Deploy the same observability stack. Watch metrics in real time. If error rates spike, latency increases, or memory grows unbounded, don't deploy to production.
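Of these, dependency behavior is the easiest to fake badly. A sketch of a wrapper that injects realistic latency and failures around any client call (the rates and latency here are invented; tune them to what your production dashboards actually show):

```python
import random
import time

class FlakyDependency:
    """Wrap a real or mocked client so staging sees realistic misbehavior."""

    def __init__(self, client, error_rate=0.02, p95_latency_s=0.8, timeout_rate=0.01):
        self._client = client
        self._error_rate = error_rate
        self._p95_latency_s = p95_latency_s
        self._timeout_rate = timeout_rate

    def call(self, method: str, *args, **kwargs):
        # Crude latency model: anywhere up to your measured p95, instead of "instant".
        time.sleep(random.uniform(0, self._p95_latency_s))
        roll = random.random()
        if roll < self._timeout_rate:
            raise TimeoutError(f"{method} timed out (simulated)")
        if roll < self._timeout_rate + self._error_rate:
            raise ConnectionError(f"{method} failed (simulated 5xx)")
        return getattr(self._client, method)(*args, **kwargs)
```

Point staging at the wrapper instead of the raw client, and the "everything responds instantly and never fails" assumption dies in staging instead of production.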
The deployment protocol:
- Deploy to staging
- Route 5% of staging traffic to the new code
- Monitor for 24 hours
- If metrics are stable, bump to 50%
- Monitor for another 24 hours
- If still stable, deploy to production at 5%
- Gradual rollout: 5% → 25% → 50% → 100% over 3-4 days
The moment anything looks wrong—error rate increase, latency spike, memory leak—roll back immediately. Don't try to debug in production.
This feels slow. It is slow. But "slow and stable" beats "fast and broken" every time.
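The piece worth automating is the rollback decision itself, so nobody talks themselves into watching a bad canary a little longer. A rough sketch (stages, thresholds, and metric names are placeholders; wire them to your own monitoring):

```python
ROLLOUT_STAGES = [5, 25, 50, 100]  # percent of traffic, spread over 3-4 days

def should_roll_back(baseline: dict, canary: dict) -> bool:
    """Compare the canary against the current baseline; any regression means roll back."""
    return (
        canary["error_rate"] > baseline["error_rate"] * 1.5
        or canary["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.3
        or canary["memory_mb"] > baseline["memory_mb"] * 1.2
    )
```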
Rule 6: Code Review is Non-Negotiable (Even for "Simple" Changes)
AI will convince you that the code is so simple it doesn't need review.
It's lying.
I've seen AI generate code that:
- Looked like a trivial refactor but introduced a subtle race condition
- Appeared to be a simple null check but broke error propagation
- Seemed like basic input validation but missed Unicode edge cases
- Looked like an optimization but changed semantics
Every AI-generated change gets reviewed. No exceptions.
The review checklist I use:
Correctness
- Does this actually solve the problem?
- Are edge cases handled?
- Is error handling appropriate?
- Do types make sense?
Security
- Is input validated?
- Are queries parameterized?
- Are secrets kept out of logs?
- Is authorization enforced?
Performance
- Will this scale to production load?
- Are database queries optimized?
- Is there unnecessary computation?
- Are there obvious memory leaks?
Maintainability
- Can I understand this in 6 months?
- Is it consistent with our codebase?
- Are variable names clear?
- Is there test coverage?
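A couple of the security items are mechanical enough to show. "Are queries parameterized?" is the difference between these two functions (SQLite and the schema are just for illustration):

```python
import sqlite3

def find_user_unsafe(conn: sqlite3.Connection, username: str):
    # Fails the check: user input is spliced into the SQL string,
    # so "' OR '1'='1" walks straight through.
    return conn.execute(
        f"SELECT id, email FROM users WHERE username = '{username}'"
    ).fetchone()

def find_user_safe(conn: sqlite3.Connection, username: str):
    # Passes the check: the driver binds the value; input stays data, never SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?",
        (username,),
    ).fetchone()
```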
If you can't confidently answer "yes" to everything, don't merge.
And here's the uncomfortable part: most AI code fails at least one of these checks. You need to be willing to reject AI suggestions, ask for revisions, or rewrite sections manually.
Treating AI like a senior developer who doesn't need oversight is how bugs reach production.
Rule 7: AI Explanations are Not Documentation
AI will explain its code confidently. Those explanations are often wrong.
I've seen AI:
- Claim code is thread-safe when it has race conditions
- Say error handling is comprehensive when it only catches one exception type
- Describe algorithms as O(n) when they're O(n²)
- Assert security practices are followed when they're violated
Don't trust AI's explanations. Verify the code itself.
What to do instead:
Read the actual code, not the explanation
Line by line. What does it actually do, not what AI says it does?
Run the code in a debugger
Step through with real inputs. Watch what actually happens. Compare to what AI claimed would happen.
Write your own tests
Don't ask AI to generate tests for its own code. It will generate tests that pass, not tests that verify correctness.
Check external documentation
If AI says "this API endpoint accepts these parameters," verify against official docs. AI hallucinates API behavior constantly.
Use static analysis tools
Linters, type checkers, security scanners. They catch what code review misses.
The rule: Trust what the code does, not what AI says it does.
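Even the complexity claim is cheap to check. A crude doubling test is noisy, but it catches the "O(n)" that's actually O(n²); `dedupe` here stands in for whatever function AI just explained to you:

```python
import random
import time

def dedupe(items):
    # Stand-in for the AI-generated function whose explanation you're verifying.
    out = []
    for x in items:
        if x not in out:  # hidden O(n^2): this membership test scans the whole list
            out.append(x)
    return out

def timed(fn, n):
    data = [random.randrange(n) for _ in range(n)]
    start = time.perf_counter()
    fn(data)
    return time.perf_counter() - start

small, large = timed(dedupe, 5_000), timed(dedupe, 10_000)
# A genuinely O(n) function roughly doubles when the input doubles; this one roughly quadruples.
print(f"5k: {small:.2f}s  10k: {large:.2f}s  ratio: {large / small:.1f}x")
```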
Rule 8: Keep Humans in the Loop on High-Stakes Decisions
AI should accelerate your decisions, not make them for you.
Architecture choices, library selections, database schema changes, API contracts—these are one-way doors. Once you commit, reversing is expensive.
For high-stakes decisions:
Use AI to generate options
Ask multiple models to propose different approaches. GPT-5 might suggest microservices, Claude might suggest a modular monolith, Gemini might suggest serverless.
Use AI to analyze trade-offs
Each approach has pros and cons. AI can articulate them. But it can't weigh them according to your specific context.
Make the decision yourself
You know your team's skill level, your timeline, your constraints, your organization's risk tolerance. AI doesn't.
Use AI to implement the decision
Once you've chosen the approach, AI can help build it. But you chose the direction.
The pattern I've seen fail: Developer asks AI "should I use Postgres or MongoDB?" AI picks one. Developer builds the entire system around that choice. Six months later, it's the wrong choice and refactoring is prohibitively expensive.
AI can inform. It can't decide. Keep the strategic decisions in human hands.
What Actually Works
Here's the honest reality after a year of shipping AI code:
AI velocity multiplier: 2-3x, not 10x
The marketing says 10x faster. The reality is 2-3x faster if you have disciplined processes. Without processes, you ship bugs faster, not features.
Review time doesn't decrease
AI accelerates writing. Review still takes the same time because you're verifying correctness, not just reading code.
Bugs per line of code stays constant
AI doesn't magically write bug-free code. It writes code at roughly the same defect rate as humans. You ship more code, you ship more bugs—unless you verify rigorously.
The quality gap compounds
Teams with strong AI + review processes pull ahead. Teams treating AI like magic fall behind because they're constantly fixing incidents.
Model selection matters more than prompting
Better prompts help. But routing tasks to the right model prevents entire classes of failures.
When you can switch models mid-workflow without losing context, you stop debating "which model should I use?" and start using the right model for each task. Generate with GPT-5, review with Claude, test with Gemini, all in one continuous conversation.
The developers who figure this out ship faster and break less. The developers who treat AI as a black box ship faster and break more.
I broke production three times. I learned eight rules. I haven't broken production with AI code since.
The rules aren't complex. They're just discipline. Discipline to verify. Discipline to test adversarially. Discipline to stage properly. Discipline to reject code you don't understand.
AI makes it easy to move fast. These rules make it safe.
-Leena:)