Manoj Mishra

Posted on Mar 22 • Edited on Apr 11

Debugging & Production Incidents with AI

#ai #devops #llm #distributedsystems

When production is on fire, AI can seem like a lifeline. But using AI carelessly during an incident often makes things worse. This post covers five mistakes developers make when using AI to debug or fix production issues, and how to keep your system safe while still leveraging AI’s power.

Mistake 1: Using AI to Fix Production Without Rollback Plan

Description: Applying AI‑suggested fixes directly to production without ability to rollback.

Realistic Scenario: 5xx errors spike. AI suggests code change. Developer applies without preparing rollback, makes things worse.

❌ Wrong Prompt:

Fix this production error: NullPointerException in payment processing

text
Developer applies AI fix directly to production.

⚠️ Why it is wrong: No rollback plan; if fix introduces new bug, outage extends.

✅ Better Prompt:

Payment service has NullPointerException in production (error rate 15%). Need fix with rollback strategy.

Current state:

Last deployment: 2 hours ago

Canary: 10% traffic

Rollback: kubectl rollout undo (last known good version: v2.3.1)

Plan:

AI suggests fix candidate

Test in staging with production traffic replay

Deploy to canary (10%) for 15 mins

Monitor error rate, latency, CPU

If successful, ramp to 50%, then 100%

Rollback script ready (./scripts/rollback-payment.sh)

Please suggest fix with these constraints.

💡 What changed: Added deployment strategy, rollback plan, and validation steps.

Mistake 2: AI Suggests Schema Change Under Load

Description: AI recommends schema migration that causes locks or downtime under production load.

Realistic Scenario: Database connection pool exhaustion during migration due to long-running ALTER TABLE.

❌ Wrong Prompt:

Add new column to users table in production

⚠️ Why it is wrong: AI may suggest ALTER TABLE users ADD COLUMN ... without considering locks on 50M row table.

✅ Better Prompt:

Add new column (preferences JSONB) to users table (50M rows, PostgreSQL 14, 2000 QPS).

Requirements:

Zero-downtime migration

Avoid table locks

Use pgroll or gh-ost for online migration

Backfill data in batches (1000 rows per batch)

Monitor replication lag during migration

Current approach: Use pgroll with:
ALTER TABLE users ADD COLUMN preferences JSONB DEFAULT '{}';
Followed by batch update script with throttling.

💡 What changed: Specified zero-downtime requirements and appropriate tools.

Mistake 3: No Observability Data in Prompt

Description: Asking for incident resolution without providing metrics, logs, or traces.

Realistic Scenario: Memory leak in production. Developer asks AI for fix without providing heap dump or GC logs.

❌ Wrong Prompt:

Fix memory leak in my Java app

⚠️ Why it is wrong: No data to identify leak source (caches, thread pools, or connections).

✅ Better Prompt:

Java app (Spring Boot, OpenJDK 17) has memory leak in production.

Observability:

Heap usage grows from 2GB to 8GB over 12 hours then OOM

GC logs show Old Gen not being collected

Memory leak suspects: Redis cache (no TTL) and WebSocket connections

Heap dump analysis: 3GB retained by Redis cache, 2GB by WebSocket sessions

Prometheus metrics attached: memory_usage_bytes, active_sessions

Current settings:

Xmx: 8GB

MaxWebSocketSessions: 10000

Redis cache max-size: 10k entries, no TTL

Need solution: add TTL to cache, limit session lifetime, and add metrics.

💡 What changed: Provided heap dump analysis, metrics, and config for targeted fixes.

Mistake 4: Applying AI Fix Without Replication in Staging

Description: Using AI to generate hotfix that hasn't been tested in staging with production-like data.

Realistic Scenario: AI suggests adding retry logic for database connections. Applied to production without testing staging, causes cascading failures.

❌ Wrong Prompt:

Add retry logic for database connection failures

Developer applies to production without staging test.

⚠️ Why it is wrong: Retry storms can amplify failures; staging test with traffic replay would reveal this.

✅ Better Prompt:

Add retry logic for database connection failures.

Process:

Generate fix with exponential backoff (1s, 2s, 4s), max 3 retries

Deploy to staging with production traffic replay (GoReplay)

Test failure scenarios: kill DB connection, network partition

Verify circuit breaker prevents cascading failures

After staging validation, deploy to production with gradual rollout

Current staging environment mirrors production with same load (2000 req/s).

💡 What changed: Added validation in staging before production deployment.

Mistake 5: AI‑Assisted Hotfix Bypassing Code Review

Description: Using AI-generated fix in production without peer review due to urgency.

Realistic Scenario: P0 incident; senior dev uses AI to generate fix and deploys without review; fix introduces another bug.

❌ Wrong Prompt:

Emergency: fix payment processing error NOW

Developer applies and deploys without review.

⚠️ Why it is wrong: Rushed AI-generated code may have side effects or introduce new bugs under pressure.

✅ Better Prompt:

Emergency fix for payment processing error.

Process:

Pair with another engineer for code review of AI-generated fix

Document the fix and reasoning in incident ticket

Test in staging with recent production traffic (last 5 min replay)

Deploy with feature flag for instant rollback

Post-incident: write regression test and run security review

Fix requirements: [error details]...

💡 What changed: Maintained review process even during incidents to prevent secondary failures.

Summary & Best Practices

Always have a rollback plan before applying any AI‑suggested production change.
Use zero‑downtime migration tools for schema changes.
Include observability data (logs, metrics, traces) in your incident prompts.
Test fixes in staging with production traffic replay before touching production.
Maintain code review discipline even during outages—two‑person review saves more time than it costs.

AI can accelerate incident resolution, but only if you integrate it into a safe, controlled process.

💬 Have you used AI during a live production incident?

What worked—and what backfired? Share your story or tips in the comments!

DEV Community

Debugging & Production Incidents with AI

Mistake 1: Using AI to Fix Production Without Rollback Plan

Mistake 2: AI Suggests Schema Change Under Load

Mistake 3: No Observability Data in Prompt

Mistake 4: Applying AI Fix Without Replication in Staging

Mistake 5: AI‑Assisted Hotfix Bypassing Code Review

Summary & Best Practices

💬 Have you used AI during a live production incident?

Top comments (0)