Last week I broke production three times. Not because of bad code — because our CI/CD pipeline was quietly lying to us.
Here's what happened, what I found during the audit, and the exact pipeline changes that have kept production free of deployment failures since.
The Three Breakages
Break #1: A database migration ran twice because our pipeline didn't track which migrations had already executed. Result: duplicate key errors across 200+ records.
Break #2: Environment variable interpolation silently dropped a critical API key in production. The staging build passed because the variable was set in our .env.staging but not .env.production.
Break #3: A dependency update changed a function signature. Our test suite passed because the mocked version still matched the old signature. Production exploded at runtime.
Three different failure modes. One root cause: our CI/CD pipeline had more gaps than a net.
The Pipeline Audit
I mapped every step from git push to production deployment. Here's what I found:
Gap 1: No Migration Tracking
Our pipeline ran prisma migrate deploy blindly on every deployment. No check for already-applied migrations. No rollback plan.
Fix: Added a migration status check that queries _prisma_migrations before running anything new:
# Check pending migrations before applying
npx prisma migrate status
# Apply only the migrations not yet recorded in _prisma_migrations
npx prisma migrate deploy
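One wrinkle when the status check and the deploy run back to back in a script: recent Prisma releases are documented as making `prisma migrate status` exit non-zero when migrations are pending, so under `set -e` the script could stop before the deploy ever runs. Here is a minimal sketch that tolerates that, assuming the exit-code behaviour holds for your Prisma version (worth verifying). `MIGRATE_CMD` and `run_migrations` are hypothetical names for illustration, not part of Prisma.

```bash
#!/usr/bin/env bash
# MIGRATE_CMD is a hypothetical override so the function can be exercised
# with a stub; by default it calls the real Prisma CLI.
MIGRATE_CMD="${MIGRATE_CMD:-npx prisma migrate}"

run_migrations() {
  local status_rc=0

  # Print the status report into the CI log. A non-zero exit is tolerated
  # here because it may simply mean "migrations are pending" (assumption).
  $MIGRATE_CMD status || status_rc=$?
  echo "migrate status exited with code $status_rc"

  # The exit code of migrate deploy is what decides whether the
  # pipeline continues, since it is the last command in the function.
  $MIGRATE_CMD deploy
}
```

In a pipeline step this could be called as `run_migrations || exit 1`, so a failed deploy stops the release while a "pending" status does not.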
Gap 2: Env Var Validation Missing
We had 23 environment variables in production. Zero validation that they existed before deploying.
Fix: Added a pre-deployment validation step:
# Required env vars checklist
REQUIRED_VARS="DATABASE_URL NEXTAUTH_SECRET API_KEY STRIPE_SECRET"
for var in $REQUIRED_VARS; do
  # ${!var} is bash indirect expansion: the value of the variable named by $var
  if [ -z "${!var}" ]; then
    echo "❌ Missing required variable: $var"
    exit 1
  fi
done
echo "✅ All required environment variables present"
This single check has blocked 4 bad deployments since I added it.
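Break #2 came down to a key that existed in .env.staging but not in .env.production, so a complementary check is to diff the key names of the two files directly. A rough sketch, assuming simple `KEY=value` env files; `env_keys` and `missing_env_keys` are hypothetical helper names:

```bash
#!/usr/bin/env bash
# Print the variable names defined in an env file, one per line, sorted.
env_keys() {
  # Keep only KEY=... lines (optionally prefixed with "export"); comments
  # and blank lines do not match and are dropped. Values are never printed.
  sed -n -E 's/^[[:space:]]*(export[[:space:]]+)?([A-Za-z_][A-Za-z0-9_]*)=.*/\2/p' "$1" \
    | LC_ALL=C sort -u
}

# Print keys that appear in the first file but are absent from the second.
missing_env_keys() {
  LC_ALL=C comm -23 <(env_keys "$1") <(env_keys "$2")
}
```

Used as a gate, it might look like this: `missing="$(missing_env_keys .env.staging .env.production)"`, then fail the step if `$missing` is non-empty. Only key names are compared, so secrets stay out of the CI log.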
Gap 3: Tests Didn't Match Reality
Our mock data was stale. Tests passed against a mocked API that hadn't been updated in months.
Fix: Two changes:
- Contract testing: Added @stoplight/spectral to validate our OpenAPI spec against actual responses
- Integration tests in CI: Running real API calls against a staging database, not just unit tests with mocks
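To illustrate the kind of drift that bit us, here is a deliberately crude sketch that compares the field names in a mock fixture against a response recorded from the real API. It is not how Spectral works and it is not a JSON parser (it just greps for `"key":` tokens), so treat it as an illustration for small fixtures only. `json_keys` and `stale_mock_keys` are hypothetical names.

```bash
#!/usr/bin/env bash
# Print every "key": name found in a JSON file, one per line, sorted.
# Crude by design: it pattern-matches text instead of parsing JSON.
json_keys() {
  grep -oE '"[^"]+"[[:space:]]*:' "$1" \
    | sed -E 's/^"([^"]+)".*/\1/' \
    | LC_ALL=C sort -u
}

# Print field names the mock still uses that the real response lacks.
stale_mock_keys() {
  LC_ALL=C comm -23 <(json_keys "$1") <(json_keys "$2")
}
```

Any output from `stale_mock_keys mock.json real.json` means the mock describes fields the real API no longer returns, which is a reasonable cue to fail the build and refresh the fixture.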
The New Pipeline
git push → Lint → Type Check → Unit Tests → Contract Tests
→ Build → Env Validation → Staging Deploy → Integration Tests
→ Migration Check → Production Deploy → Health Check
Each stage blocks the next on failure. No more "tests passed but prod broke."
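The blocking behaviour can be sketched in a few lines of bash. This is not our actual CI configuration, just a minimal illustration of "stop at the first failure"; `run_stages` and the `name::command` argument format are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Run stages given as "name::command" in order.
# The first stage that exits non-zero stops the run; later stages are skipped.
run_stages() {
  local stage name cmd
  for stage in "$@"; do
    name="${stage%%::*}"   # text before the first "::"
    cmd="${stage#*::}"     # text after the first "::"
    echo "==> $name"
    if ! bash -c "$cmd"; then
      echo "FAILED: $name (later stages skipped)" >&2
      return 1
    fi
  done
  echo "All stages passed"
}
```

A call might look like `run_stages "Lint::npm run lint" "Unit Tests::npm test" "Build::npm run build"`, where the npm scripts are placeholders for whatever each stage runs in your project.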
The Results
Since the audit (2 weeks ago):
- 0 production breakages (vs 3 in the previous week)
- 4 blocked bad deployments before they reached staging
- Deploy confidence: Team actually ships on Fridays now
The Takeaway
Your CI/CD pipeline isn't just automation — it's your last line of defense. If it has gaps, production will find them.
Audit your pipeline. Map every step. Ask "what happens if this lies to me?" for each one.
Then fix the gaps before production finds them for you.
What's the worst CI/CD failure you've dealt with? Drop it in the comments — misery loves company.