I shipped AI-generated code to production last week.
Not a toy project. Not a side hustle. Production code serving 50,000+ daily active users. The code passed code review, cleared CI/CD, handled edge cases, and has been running without issues for six days.
This isn't a "look how easy AI makes everything" post. This is about the unglamorous reality: the verification steps, the failure modes, the manual testing, the parts where AI screwed up, and the specific points in the workflow where human judgment is non-negotiable.
Because here's what nobody tells you about using AI in production: the prompt is 5% of the work. The other 95% is knowing when AI is lying to you.
I – The Failure Modes Nobody Talks About
AI-generated code fails in predictable ways.
I've reviewed about 200 AI-generated PRs over the last six months—my own and my team's. The patterns are consistent across GPT-5, Claude Opus 4.1, and Gemini 2.5 Pro. Here's what actually breaks:
Failure Mode 1: Context Window Amnesia
AI forgets what you told it 50 messages ago. You're debugging a function, you've explained the business logic three times, then you ask it to refactor a related module and it completely ignores the constraints you specified earlier.
This isn't a model limitation—it's a workflow problem. You're treating AI like a stateful system when it's actually stateless beyond its context window.
Failure Mode 2: Confident Hallucination of APIs
AI will generate code that calls methods that don't exist. Not deprecated methods. Not misnamed methods. Methods it invented because they sound plausible.
I've seen it hallucinate Stripe API endpoints, Postgres functions, and AWS SDK methods. The code compiles. The types check. The logic looks sound. Then you run it and get a NoSuchMethodError.
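Here's a minimal sketch of that failure mode. The client and the method are hypothetical; the point is that a loose `any` type (common when wrapping untyped SDKs) lets an invented method sail through the compiler:

```typescript
// Hypothetical payments client: only `createRefund` actually exists.
const payments: any = {
  createRefund: async (chargeId: string) => ({ id: "re_123", chargeId }),
};

async function refundCustomer(customerId: string) {
  // Reads plausibly, compiles (the `any` type defeats checking), and fails at
  // runtime with "TypeError: payments.refundAllCharges is not a function".
  return payments.refundAllCharges(customerId); // hallucinated method
}

refundCustomer("cus_123").catch((err) => console.error(err.message));
```

Strictly typed SDK wrappers catch some of this at compile time. Everything else has to be caught by checking the docs, which is exactly what Stage 2 below is for.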
Failure Mode 3: Edge Case Blindness
AI optimizes for the happy path. Ask it to write input validation and it'll handle empty strings and null values. It won't handle:
- Unicode edge cases that break regex
- Integer overflow on large inputs
- Timezone handling across DST transitions
- Race conditions under concurrent load
- Memory leaks with long-running processes
It's not that AI can't reason about edge cases. It's that it doesn't prioritize them unless you explicitly prompt for them. And by the time you've prompted for every edge case, you might as well have written the code yourself.
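To make the first bullet concrete, here's a minimal TypeScript sketch (the validators are hypothetical) of the regex and length handling AI typically produces versus what real input requires:

```typescript
// Hypothetical name validator of the kind AI tends to produce: fine for the
// ASCII happy path, wrong for real-world input.
const naiveName = /^[a-zA-Z ]+$/;

console.log(naiveName.test("Alice"));    // true, the happy path
console.log(naiveName.test("José"));     // false, legitimate input rejected
console.log(naiveName.test("François")); // false

// Length checks have the same blind spot: .length counts UTF-16 code units,
// not characters as users perceive them.
console.log("🚀".length); // 2, not 1

// A Unicode-aware version needs the `u` flag and property escapes.
const unicodeName = /^[\p{L}\p{M}' -]+$/u;
console.log(unicodeName.test("José"));   // true
```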
Failure Mode 4: Security Antipatterns
AI loves patterns. Unfortunately, some patterns are security vulnerabilities.
I've seen AI generate:
- SQL queries with string interpolation instead of parameterized queries
- JWT validation that doesn't verify signatures
- Password hashing with weak algorithms
- CORS configs that allow `*` (any origin) in production
- File upload handlers that don't validate content types
It mimics what it saw in training data. And training data includes a lot of bad code from StackOverflow and GitHub repos that prioritize "it works" over "it's secure."
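The first bullet is the one I catch most often. A minimal sketch with node-postgres (table and function names are illustrative):

```typescript
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the PG* environment variables

// The shape AI reproduces from old tutorials: the email is spliced into the
// SQL string, so an input like "' OR '1'='1" becomes part of the query.
async function findUserUnsafe(email: string) {
  return pool.query(`SELECT * FROM users WHERE email = '${email}'`);
}

// Parameterized version: the driver sends the value separately, so it is
// never parsed as SQL.
async function findUser(email: string) {
  return pool.query("SELECT * FROM users WHERE email = $1", [email]);
}
```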
The pattern is clear: AI accelerates velocity but introduces risk. The question is how to keep the velocity while mitigating the risk.
II – The Verification Protocol That Actually Works
Here's the workflow I use. It's not elegant. It's not fast. But it catches 95% of AI mistakes before they hit production.
Stage 1: Generation with Constraints
Don't just describe what you want. Describe what you don't want.
Bad prompt:
Write a function to process user uploads
Good prompt:
Write a function to process user uploads with these constraints:
- Max file size: 10MB
- Allowed types: PDF, DOCX only
- Must validate content type, not just extension
- Must scan for malware before processing
- Must handle S3 upload failures gracefully
- Must rate-limit to 5 uploads per user per hour
- Return specific error codes for each failure case
The constraints force AI to think about edge cases upfront. It won't catch everything, but it significantly reduces Stage 2 effort.
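The constraints also give you something concrete to verify against in Stage 2. A rough sketch of how the first three map to code, with illustrative names and error codes (malware scanning, S3, and rate limiting left out for brevity):

```typescript
const MAX_BYTES = 10 * 1024 * 1024; // 10MB
const ALLOWED_MIME_TYPES = new Set([
  "application/pdf",
  "application/vnd.openxmlformats-officedocument.wordprocessingml.document", // DOCX
]);

type UploadError = "FILE_TOO_LARGE" | "UNSUPPORTED_TYPE";

// `detectedMime` is the sniffed content type (e.g. from magic bytes),
// not the user-supplied extension or Content-Type header.
function validateUpload(upload: {
  sizeBytes: number;
  detectedMime: string;
}): UploadError | null {
  if (upload.sizeBytes > MAX_BYTES) return "FILE_TOO_LARGE";
  if (!ALLOWED_MIME_TYPES.has(upload.detectedMime)) return "UNSUPPORTED_TYPE";
  return null; // passes the size and type constraints
}
```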
Stage 2: Diff Review (5 minutes per function)
Read every line AI generated. Not skimming—actually reading.
I use a checklist:
- [ ] Are all method calls real? (Check docs, don't trust AI)
- [ ] Are types correct? (Especially with TypeScript/Go generics)
- [ ] Is error handling comprehensive? (Not just `try/catch` everything)
- [ ] Are edge cases covered? (Null, empty, max values, unicode)
- [ ] Is this code defensive? (Assumes external input is hostile)
- [ ] Would this pass security review? (OWASP Top 10 basics)
If any answer is "no" or "unsure," don't merge. Send it back to AI with specific corrections or write it manually.
Stage 3: Unit Testing with Adversarial Cases
AI-generated code needs adversarial unit tests. Tests that try to break it.
Don't just test the happy path. Test:
- Empty inputs
- Null values
- Max integer values (Integer.MAX_VALUE, 2^63-1)
- Malformed strings (unterminated quotes, SQL injection attempts)
- Concurrent access (simulate race conditions)
- Timeout scenarios (mock slow external APIs)
Track the failures across test runs and you start seeing where each model consistently breaks: GPT-5 forgets null checks, Claude overcomplicates error handling, Gemini misses concurrent access issues.
Run these tests. Watch them fail. Fix the failures manually.
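Here's what "adversarial" looks like in practice, as a Vitest/Jest-style suite against a hypothetical parseAmount function (the function and its file path are illustrative, not from this project):

```typescript
import { describe, it, expect } from "vitest";
import { parseAmount } from "./parseAmount"; // hypothetical function under test

describe("parseAmount: adversarial cases", () => {
  it("rejects empty and whitespace-only input", () => {
    expect(() => parseAmount("")).toThrow();
    expect(() => parseAmount("   ")).toThrow();
  });

  it("rejects injection-looking strings instead of coercing them", () => {
    expect(() => parseAmount("100'; DROP TABLE orders;--")).toThrow();
  });

  it("handles values at the safe-integer boundary", () => {
    expect(parseAmount(String(Number.MAX_SAFE_INTEGER))).toBe(Number.MAX_SAFE_INTEGER);
    expect(() => parseAmount("1e309")).toThrow(); // overflows to Infinity
  });

  it("does not silently accept unicode digit lookalikes", () => {
    expect(() => parseAmount("１００")).toThrow(); // full-width digits
  });
});
```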
Stage 4: Integration Testing in Staging
AI doesn't understand your production environment.
It doesn't know your database is eventually consistent. It doesn't know your CDN has a 60-second cache TTL. It doesn't know your rate limiter works at the load balancer level, not the application level.
Deploy to staging. Run realistic load. Watch for:
- Memory leaks (use profilers, not guesses)
- N+1 queries (check query logs)
- Timeout cascades (simulate upstream service failures)
- Resource exhaustion (what happens at 10x load?)
If staging behaves differently than expected, don't assume production will be fine. Debug until you understand why.
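The N+1 check deserves a concrete picture, because ORMs and AI-generated loops both hide it well. A sketch, again with node-postgres and illustrative table names:

```typescript
import { Pool } from "pg";

// The N+1 shape: one query for the parent rows, then one more query per row.
async function loadOrdersNPlusOne(pool: Pool) {
  const { rows: orders } = await pool.query("SELECT id, customer_id FROM orders");
  for (const order of orders) {
    // 1,000 orders means 1,001 queries. Staging query logs make this obvious.
    await pool.query("SELECT name FROM customers WHERE id = $1", [order.customer_id]);
  }
}

// One round trip instead: join, or batch with WHERE id = ANY($1).
async function loadOrders(pool: Pool) {
  return pool.query(
    `SELECT o.id, c.name
       FROM orders o
       JOIN customers c ON c.id = o.customer_id`
  );
}
```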
Stage 5: Gradual Rollout
Never deploy AI-generated code to 100% of users immediately.
Feature flag it. Route 5% of traffic. Monitor error rates, latency, resource usage. If metrics look good after 24 hours, bump to 25%. Then 50%. Then 100%.
The moment error rates spike or latency degrades, roll back. Don't try to debug in production with real user traffic.
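The routing itself can be a few lines if you don't already have a feature-flag service. A sketch of deterministic percentage bucketing (the flag name is illustrative):

```typescript
import { createHash } from "node:crypto";

// Deterministic bucketing: the same user always lands in the same bucket, so
// raising the percentage from 5 to 25 only adds users; nobody flips back and
// forth between code paths.
function inRollout(userId: string, flag: string, percent: number): boolean {
  const digest = createHash("sha256").update(`${flag}:${userId}`).digest();
  const bucket = digest.readUInt32BE(0) % 100; // 0-99
  return bucket < percent;
}

// Usage: route 5% of traffic through the new path and watch the metrics.
const userId = "user_42";
if (inRollout(userId, "ai-upload-pipeline", 5)) {
  console.log("new, AI-generated code path");
} else {
  console.log("existing code path");
}
```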
This protocol feels slow. It is slow. But "fast and broken" is slower than "deliberate and working."
III – When to Use AI vs When to Write It Yourself
Not every task benefits from AI assistance.
Use AI for:
Boilerplate that follows known patterns
- CRUD endpoints with standard validation
- Data model migrations
- API client wrappers
- Test fixtures and mocks
- Configuration files (Terraform, Kubernetes manifests)
Refactoring existing code
- AI is decent at structural changes when it can see the entire file
- Renaming variables consistently
- Extracting repeated logic into functions
- Converting callbacks to async/await
Exploratory prototyping
- Trying different algorithm approaches
- Comparing library APIs before committing
- Generating test data
- Scaffolding new services
Don't use AI for:
Security-critical code
- Authentication logic
- Authorization checks
- Cryptography
- Input sanitization
- Session management
Write this yourself. Or use battle-tested libraries. AI will generate something that "looks right" and has subtle vulnerabilities.
Performance-critical code
- Database query optimization
- Caching strategies
- Concurrency primitives
- Memory-efficient algorithms
AI optimizes for correctness, not performance. It will generate O(n²) algorithms when O(n log n) is needed. It will add locks that create contention. It will cache the wrong things.
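A typical example of the complexity trap (both functions are hypothetical but representative):

```typescript
// Both are "correct". The first is O(n²) because .includes() rescans the
// output array on every iteration; it passes review and dies at 100k IDs.
function dedupeQuadratic(ids: string[]): string[] {
  const out: string[] = [];
  for (const id of ids) {
    if (!out.includes(id)) out.push(id);
  }
  return out;
}

// Single pass with a Set: O(n), same result, same order.
function dedupeLinear(ids: string[]): string[] {
  return [...new Set(ids)];
}
```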
Novel problem spaces
- Architecture decisions for new systems
- Distributed system coordination
- Complex state machines
- Domain-specific business logic with nuanced rules
AI pattern-matches. If your problem hasn't been solved a thousand times already, AI won't solve it correctly. It will generate something that sounds plausible and is fundamentally wrong.
Code you don't understand
- If you can't explain what the AI-generated code does line by line, don't ship it
- If you can't debug it when it breaks, don't ship it
- If you can't modify it confidently, don't ship it
The rule is simple: only use AI for code you could have written yourself, just slower. AI accelerates what you already know how to do. It doesn't replace what you don't.
IV – The Multi-Model Strategy
Different models have different strengths. Use them strategically.
GPT-5 for Initial Implementation
GPT-5 generates clean, idiomatic code quickly. It's best for:
- Writing first drafts of new features
- Generating multiple implementation approaches
- Prototyping algorithms
Weakness: Hallucinates APIs and forgets constraints after long conversations.
Claude Opus 4.1 for Logic Review
Claude excels at analyzing existing code for logical flaws. It's best for:
- Reviewing AI-generated code for edge cases
- Identifying race conditions and concurrency issues
- Explaining complex code written by others
Weakness: Overly verbose. Overthinks simple problems.
Gemini 2.5 Pro for Documentation and Testing
Gemini is excellent at synthesis. It's best for:
- Generating comprehensive unit tests
- Writing API documentation
- Creating integration test scenarios
Weakness: Struggles with nuanced business logic and domain-specific rules.
Grok 3 for Library Selection
Grok has real-time awareness of the current ecosystem. It's best for:
- Comparing actively maintained libraries
- Finding recent security patches
- Identifying deprecated dependencies
Weakness: Sometimes prioritizes "new" over "stable."
The workflow: Generate with GPT-5. Review logic with Claude. Test with Gemini. Verify dependencies with Grok.
When you can switch between these intelligences in one conversation, you stop context-switching and start building on what each model surfaces. One generates, another critiques, another tests. The result is better than any single model could produce.
But even with multiple models, you're still the final reviewer.
V – The Code Review Checklist (What Actually Matters)
Before approving any AI-generated code, I run through this:
Correctness
- [ ] Does this do what the spec says?
- [ ] Are all edge cases handled?
- [ ] Is error handling comprehensive?
- [ ] Do the types make sense?
- [ ] Are APIs used correctly? (Verify against docs)
Security
- [ ] Is user input validated?
- [ ] Are queries parameterized? (No string interpolation)
- [ ] Is authentication checked?
- [ ] Is authorization enforced?
- [ ] Are secrets kept out of logs/errors?
- [ ] Is rate limiting in place?
Performance
- [ ] Is the algorithmic complexity acceptable?
- [ ] Are database queries optimized?
- [ ] Is caching used appropriately?
- [ ] Will this scale to production load?
- [ ] Are there obvious memory leaks?
Maintainability
- [ ] Can I understand this code in 6 months?
- [ ] Are variable names clear?
- [ ] Is the structure logical?
- [ ] Is there a test that would catch regressions?
- [ ] Is this consistent with our codebase conventions?
Production Readiness
- [ ] Is logging sufficient for debugging?
- [ ] Are metrics instrumented?
- [ ] Is there a rollback plan?
- [ ] What happens if dependencies fail?
- [ ] Can this be deployed incrementally?
If the answer to any question is "no" or "unclear," the code isn't ready. Fix it or rewrite it.
Don't trust AI's confidence. AI will generate broken code with the same confident tone it uses for correct code. Your judgment is the quality gate.
VI – What I've Learned Shipping AI Code for 6 Months
Lesson 1: AI Code Requires More Review, Not Less
AI accelerates writing. It doesn't accelerate review. You still need to verify everything. The time savings come from faster drafts, not faster reviews.
Lesson 2: Model Selection Matters
Using one model for everything is suboptimal. Different models have different failure modes. Route tasks to the model that's most reliable for that task type.
Lesson 3: Context Persistence is Critical
Losing context between model switches kills productivity. When you can maintain conversation continuity across multiple models, you stop re-explaining project context and start building complex features that require input from different intelligences.
Lesson 4: AI is a Junior Dev, Not a Senior Dev
Treat AI output like you'd treat code from a junior engineer: capable of good work, but requiring oversight. The more critical the code, the more scrutiny it needs.
Lesson 5: The Prompt is the Spec
If your prompt is vague, your code will be vague. If your prompt omits constraints, your code will violate constraints. The quality of AI output is bounded by the quality of your specification.
Lesson 6: Tests are Non-Negotiable
AI-generated code without tests is technical debt. Write tests. Run them. Watch them fail. Fix the failures. This isn't optional.
Lesson 7: Production is the Real Test
Staging catches 80% of issues. Production catches the remaining 20%. Deploy gradually. Monitor aggressively. Roll back quickly when metrics degrade.
The Honest Trade-off
Here's what using AI in production actually looks like:
Time to first draft: 10x faster
Time to reviewed, tested, production-ready code: 2-3x faster
Risk of subtle bugs: Higher if you skip verification steps
Code quality: Equal to manually written code if you verify properly
The 10x claim is marketing. The 2-3x improvement is real, but only if you have robust verification processes.
If your process is "generate code, ship code," you will ship bugs. If your process is "generate, review, test, stage, gradual rollout," you'll ship quality code faster than writing it manually.
The gap between developers with disciplined AI workflows and those winging it is growing every week.
I shipped AI-generated code to production. It's still running. But that's not because AI is magic—it's because I treated it like any other code: with skepticism, verification, and gradual rollout.
The question isn't whether to use AI in production. It's whether you have the discipline to use it safely.
-Leena:)