kol kol

Posted on Jun 6

I Broke Production 3 Times This Week — How a CI/CD Pipeline Audit Fixed Everything

#devops #cicd #programming #webdev

Last week I broke production three times. Not because of bad code — because our CI/CD pipeline was quietly lying to us.

Here's what happened, what I found during the audit, and the exact pipeline changes that have kept production from breaking since.

The Three Breakages

Break #1: A database migration ran twice because our pipeline didn't track which migrations had already executed. Result: duplicate key errors across 200+ records.

Break #2: Environment variable interpolation silently dropped a critical API key in production. The staging build passed because the variable was set in our .env.staging but not .env.production.

Break #3: A dependency update changed a function signature. Our test suite passed because the mocked version still matched the old signature. Production exploded at runtime.

Three different failure modes. One root cause: our CI/CD pipeline had more gaps than a net.

The Pipeline Audit

I mapped every step from git push to production deployment. Here's what I found:

Gap 1: No Migration Tracking

Our pipeline ran prisma migrate deploy blindly on every deployment. Nothing in the pipeline checked migration state before applying, and we had no rollback plan.

Fix: Added a migration status check before running anything new. prisma migrate status compares the migration files against the history recorded in _prisma_migrations:

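# Report applied vs pending migrations before changing anything.
# Recent Prisma versions exit non-zero here when migrations are pending, so don't let that abort the job.
npx prisma migrate status || true
# Apply only the migrations not yet recorded in _prisma_migrations
npx prisma migrate deploy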

Gap 2: Env Var Validation Missing

We had 23 environment variables in production. Zero validation that they existed before deploying.

Fix: Added a pre-deployment validation step:

# Required env vars checklist (bash only: ${!var} is indirect expansion)
REQUIRED_VARS="DATABASE_URL NEXTAUTH_SECRET API_KEY STRIPE_SECRET"
for var in $REQUIRED_VARS; do
  # ${!var:-} reads the variable named by $var and stays safe under `set -u`
  if [ -z "${!var:-}" ]; then
    echo "❌ Missing required variable: $var" >&2
    exit 1
  fi
done
echo "✅ All required environment variables present"

This single check has blocked 4 bad deployments since I added it.
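
The presence check only covers the variables you remembered to list. Break #2 was really a drift problem: a key existed in .env.staging but not in .env.production. Here's a rough sketch of a drift check, assuming both files use plain KEY=value lines (the env_keys and missing_keys helper names are mine, not part of any tool):

```shell
# Sketch: find variables defined in one env file but absent from another.
# Assumes simple KEY=value lines, optionally prefixed with "export".

# Print the variable names in an env file, sorted and de-duplicated.
# Comment lines and blank lines are ignored because they don't match KEY=.
env_keys() {
  grep -E '^[[:space:]]*(export[[:space:]]+)?[A-Za-z_][A-Za-z0-9_]*=' "$1" \
    | sed -E 's/^[[:space:]]*(export[[:space:]]+)?([A-Za-z_][A-Za-z0-9_]*)=.*/\2/' \
    | sort -u
}

# Print every key that appears in the first file but not in the second.
missing_keys() {
  env_keys "$1" | while IFS= read -r key; do
    env_keys "$2" | grep -qxF -- "$key" || printf '%s\n' "$key"
  done
}
```

Calling missing_keys .env.staging .env.production prints one key per line, so a pipeline step can fail whenever that output is non-empty. A key with an empty value still counts as defined here, so this complements the presence check rather than replacing it.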

Gap 3: Tests Didn't Match Reality

Our mock data was stale. Tests passed against a mocked API that hadn't been updated in months.

Fix: Two changes:

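Contract testing: Added @stoplight/spectral to lint our OpenAPI spec, so the spec stays a trustworthy contract to check actual responses against (Spectral lints the spec document itself; it does not call the API)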
Integration tests in CI: Running real API calls against a staging database, not just unit tests with mocks

The New Pipeline

git push → Lint → Type Check → Unit Tests → Contract Tests 
  → Build → Env Validation → Staging Deploy → Integration Tests 
  → Migration Check → Production Deploy → Health Check

Each stage blocks the next on failure. No more "tests passed but prod broke."
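
For anyone wiring this up by hand, here is a minimal sketch of the fail-fast chaining and a retried health check in plain shell. The run_stage and wait_healthy helpers are hypothetical, and the commands in the usage comment are placeholders, not our actual scripts. Most CI systems give you the same blocking behavior through job dependencies.

```shell
# Sketch: fail-fast stage chaining plus a retried health probe.

# Run one named stage; report and return non-zero if its command fails.
# Usage: run_stage NAME COMMAND [ARGS...]
run_stage() {
  name=$1
  shift
  echo "==> $name"
  if ! "$@"; then
    echo "Stage failed: $name" >&2
    return 1
  fi
}

# Retry a probe command a few times before declaring the deploy unhealthy.
# Usage: wait_healthy ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_healthy() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "Health check attempt $i/$attempts failed" >&2
    i=$((i + 1))
    # Only wait if another attempt is coming.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example wiring (commands and URL are placeholders). Chaining with &&
# means a failed stage stops everything after it:
#   run_stage "Lint" npm run lint &&
#   run_stage "Unit Tests" npm test &&
#   run_stage "Health Check" wait_healthy 5 3 curl -fsS https://example.com/healthz
```

The retry loop matters because a freshly deployed service often needs a few seconds before it answers, and a single immediate probe would fail deployments that are actually fine.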

The Results

Since the audit (2 weeks ago):

0 production breakages (vs 3 in the previous week)
4 blocked bad deployments before they reached staging
Deploy confidence: Team actually ships on Fridays now

The Takeaway

Your CI/CD pipeline isn't just automation — it's your last line of defense. If it has gaps, production will find them.

Audit your pipeline. Map every step. Ask "what happens if this lies to me?" for each one.

Then fix the gaps before production finds them for you.

What's the worst CI/CD failure you've dealt with? Drop it in the comments — misery loves company.

DEV Community