binadit

Posted on Apr 8 • Originally published at binadit.com

Why deployments break production systems

#deployment #production #reliability #infrastructure

The deployment that killed Black Friday (and how to prevent it)

Picture this: It's Black Friday afternoon, your site is handling peak traffic, and someone pushes a "quick fix" that immediately breaks checkout. Revenue drops to zero while your team frantically debugs in production.

Sound familiar? You're not alone. By some estimates, around 70% of production incidents occur within 48 hours of a deployment. The harsh truth is that most systems don't fail on their own; they fail when we change them.

The deployment danger zone

Every code push is a roll of the dice because you're modifying a live system. Even trivial changes can cascade into catastrophic failures due to one fundamental issue: environmental drift.

Your local environment differs from staging. Staging differs from production. These gaps accumulate over time, creating a minefield of potential failures.

Schema mismatches that kill apps instantly

-- Your code expects this column
SELECT user_id, created_at, email_verified 
FROM users WHERE id = ?

-- But production database doesn't have email_verified yet
-- Result: Immediate 500 errors for every user query

This happens when schema migrations run separately from code deployments. Deploy before the migration completes? Instant failure. Deploy too long after? You might miss your rollback window.
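One way to catch this before users do is to compare the columns the code expects with what the database actually reports, and refuse to serve traffic on a mismatch. A minimal sketch, assuming the list of actual columns has already been fetched at startup (for example from `information_schema.columns`); `findMissingColumns` and `EXPECTED_USER_COLUMNS` are hypothetical names used only for illustration:

```javascript
// Columns the new code reads from the users table (taken from the query above).
const EXPECTED_USER_COLUMNS = ['id', 'user_id', 'created_at', 'email_verified'];

// Returns the expected columns that the database does not have yet.
// `actualColumns` is assumed to come from a schema query run at startup.
function findMissingColumns(expectedColumns, actualColumns) {
  const present = new Set(actualColumns.map((name) => name.toLowerCase()));
  return expectedColumns.filter((name) => !present.has(name.toLowerCase()));
}

// Example: production has not run the migration yet.
const missing = findMissingColumns(
  EXPECTED_USER_COLUMNS,
  ['id', 'user_id', 'created_at', 'email'],
);
if (missing.length > 0) {
  console.log(`Schema check failed, missing columns: ${missing.join(', ')}`);
}
```

Running a check like this as a deployment gate turns "500 errors for every user query" into a failed deploy that never receives traffic.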

Configuration chaos

Modern apps depend on countless config values: database URLs, API keys, timeouts, feature flags. When production config doesn't match your expectations, failures cascade silently.

Worst case: A payment gateway timeout that doesn't crash your site but silently drops transactions. Customers think they've purchased something, but the payment never processes.
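A cheap defence is to validate configuration once at startup and fail loudly, rather than discovering a bad value through dropped transactions. The sketch below is illustrative only: the setting names (`DATABASE_URL`, `PAYMENT_API_KEY`, `PAYMENT_TIMEOUT_MS`) and the rules are assumptions, not taken from any particular application.

```javascript
// Hypothetical list of settings the app cannot run without.
const REQUIRED_SETTINGS = ['DATABASE_URL', 'PAYMENT_API_KEY', 'PAYMENT_TIMEOUT_MS'];

// Returns a list of human-readable problems; an empty list means the config is usable.
function validateConfig(env) {
  const problems = [];
  for (const key of REQUIRED_SETTINGS) {
    if (!env[key] || String(env[key]).trim() === '') {
      problems.push(`${key} is missing`);
    }
  }
  // Only check the format when a value is present; absence is reported above.
  const timeout = Number(env.PAYMENT_TIMEOUT_MS);
  if (env.PAYMENT_TIMEOUT_MS && (!Number.isInteger(timeout) || timeout <= 0)) {
    problems.push('PAYMENT_TIMEOUT_MS must be a positive integer');
  }
  return problems;
}

// Example: "30s" is not a number of milliseconds, so it is rejected at startup.
const problems = validateConfig({
  DATABASE_URL: 'postgres://db.internal/shop',
  PAYMENT_API_KEY: 'test-key',
  PAYMENT_TIMEOUT_MS: '30s',
});
console.log(problems); // [ 'PAYMENT_TIMEOUT_MS must be a positive integer' ]
```

In a real service you would likely pass `process.env` and exit with a non-zero status when the list is not empty, so the deployment pipeline stops before the new version takes traffic.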

Resource assumptions that don't scale

That function processing 10 records in development? It might need to handle 10,000 in production. Without proper resource limits, one poorly optimized query can consume all database connections and crash your entire system.
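One common guard is to cap how many expensive operations can run at once, so a slow query fails fast instead of draining the connection pool. The following is a rough sketch under stated assumptions: `ConcurrencyLimiter` and `runLimited` are hypothetical helpers, and the limit of 2 is an arbitrary number for the example.

```javascript
// Caps how many expensive operations may run at the same time.
class ConcurrencyLimiter {
  constructor(maxConcurrent) {
    this.maxConcurrent = maxConcurrent;
    this.active = 0;
  }

  // Reserves a slot and returns true, or returns false if the limit is reached.
  tryAcquire() {
    if (this.active >= this.maxConcurrent) return false;
    this.active += 1;
    return true;
  }

  // Frees a slot; never goes below zero.
  release() {
    if (this.active > 0) this.active -= 1;
  }
}

// Runs `task` only if a slot is free. The slot is released when the task
// settles, whether it succeeds or throws.
async function runLimited(limiter, task) {
  if (!limiter.tryAcquire()) {
    throw new Error('Too many concurrent heavy queries, try again later');
  }
  try {
    return await task();
  } finally {
    limiter.release();
  }
}

// Example with an illustrative limit of 2.
const reportLimiter = new ConcurrencyLimiter(2);
console.log(reportLimiter.tryAcquire()); // true
console.log(reportLimiter.tryAcquire()); // true
console.log(reportLimiter.tryAcquire()); // false, limit reached
reportLimiter.release();
reportLimiter.release();
```

Rejecting excess work is a design choice; queueing is an alternative, but an unbounded queue just moves the overload somewhere less visible.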

Deployment antipatterns that guarantee pain

❌ Testing in isolation only
Unit tests pass, code review approves, but nobody verified the complete system works with real data volumes and traffic patterns.

❌ No rollback strategy
Planning for success but not failure. When things break, "fixing forward" under pressure usually makes everything worse.

❌ Deploying during peak hours
High traffic amplifies every risk and makes diagnosis nearly impossible. Yet teams do it anyway, because peak hours are also when developers are around to fix issues.

❌ Batch deployments with multiple changes
Combining bug fixes, features, and infrastructure changes makes it impossible to identify what broke when things go sideways.

# This deploy includes too many changes
- Fix payment gateway timeout
- Add new user dashboard
- Update database connection pooling
- Enable new caching layer
# When it breaks, which change caused it?

Battle-tested deployment strategies

Blue-green deployments with validation gates

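Run the current version (blue) and the new version (green) side by side. Classic blue-green switches all traffic at once; adding a canary-style step that routes a small percentage of traffic to green first lets you watch metrics before committing. Only complete the switch after validation.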

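# Nginx config for gradual traffic shifting
# Weights are relative: about 90% of requests go to blue, 10% to green.
# Distribution is per request, not per user, unless you add session affinity.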
upstream backend {
    server blue-env:8080 weight=90;
    server green-env:8080 weight=10;
}

This requires running two environments in parallel, but it avoids deployment downtime and turns a rollback into a routing change.
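The "validation gate" itself can be as simple as a function that compares metrics from both environments and decides whether green may take more traffic. This is a sketch, not a definitive implementation: `evaluateGate`, the metric names, and the default thresholds are all assumptions to be tuned against your own baselines.

```javascript
// Decides whether the new (green) environment may take more traffic.
// Default limits are illustrative assumptions, not recommended values.
function evaluateGate(
  blue,
  green,
  limits = { maxErrorRateIncrease: 0.005, maxLatencyRatio: 1.2 },
) {
  const reasons = [];
  if (green.errorRate - blue.errorRate > limits.maxErrorRateIncrease) {
    reasons.push('error rate regressed');
  }
  if (green.p95LatencyMs > blue.p95LatencyMs * limits.maxLatencyRatio) {
    reasons.push('p95 latency regressed');
  }
  return { promote: reasons.length === 0, reasons };
}

// Example: latency looks fine, but green's error rate is about ten times blue's.
const decision = evaluateGate(
  { errorRate: 0.002, p95LatencyMs: 180 },
  { errorRate: 0.021, p95LatencyMs: 190 },
);
console.log(decision); // { promote: false, reasons: [ 'error rate regressed' ] }
```

Returning the reasons alongside the decision keeps the gate auditable: when a rollout stops, the pipeline log says why.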

Feature flags for risk mitigation

Separate deployment from activation. Deploy with features disabled, then enable them gradually for a growing percentage of users.

if (featureFlag.isEnabled('new-checkout-flow', userId)) {
    return newCheckoutProcess(cartItems);
}
return legacyCheckoutProcess(cartItems);

When problems appear, disable the feature instantly without redeploying code.
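For percentage rollouts to feel stable, each user should get the same answer on every request. One way a check like `featureFlag.isEnabled` might work internally is to hash the flag name and user ID into a fixed bucket. The sketch below is an assumption about the approach, not the API of any specific feature-flag library; `fnv1a` and `isInRollout` are illustrative names.

```javascript
// FNV-1a style 32-bit hash over the string's UTF-16 code units:
// small, deterministic, and adequate for bucketing.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Maps a user to a stable bucket in [0, 100) per flag. The same user always
// lands in the same bucket, so raising the percentage only ever adds users.
function isInRollout(flagName, userId, percentage) {
  const bucket = fnv1a(`${flagName}:${userId}`) % 100;
  return bucket < percentage;
}

console.log(isInRollout('new-checkout-flow', 'user-42', 0));   // false: 0% enables nobody
console.log(isInRollout('new-checkout-flow', 'user-42', 100)); // true: 100% enables everybody
```

Including the flag name in the hash input means a user who lands in the first 10% for one feature is not automatically in the first 10% for every other feature.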

Automated deployment validation

Build health checks into your pipeline that run after each deployment stage:

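#!/bin/bash
# Post-deployment validation script
# -f fails on HTTP error responses, -sS hides progress but still prints errors,
# --max-time keeps a hung service from stalling the pipeline
curl -fsS --max-time 10 http://api/health > /dev/null || exit 1
curl -fsS --max-time 10 http://api/database-check > /dev/null || exit 1
./run-critical-user-flow-tests.sh || exit 1
echo "Deployment validation passed"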

Backward-compatible database migrations

Handle schema changes in three phases:

1. Add new structures alongside old ones
2. Deploy code that works with both versions
3. Remove deprecated structures after validation

-- Phase 1: Add new column
ALTER TABLE users ADD COLUMN email_verified BOOLEAN DEFAULT FALSE;

-- Phase 2: Deploy code that can handle both schemas
-- Phase 3: Remove old email_verification_status column
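Phase 2 is where most of the care goes. Because Phase 1 adds `email_verified` with `DEFAULT FALSE`, existing rows read as `false` until they are backfilled, so code that trusts only the new column would treat every user as unverified. A minimal sketch of a transitional read path follows; the legacy value `'verified'` is an assumption, since the old column's real values are not shown here.

```javascript
// Phase 2 read path: a user counts as verified if EITHER column says so.
// Assumes writes update both columns until Phase 3 removes the old one.
// The legacy value 'verified' is hypothetical.
function isEmailVerified(userRow) {
  const newValue = userRow.email_verified === true;
  const legacyValue = userRow.email_verification_status === 'verified';
  return newValue || legacyValue;
}

// A row that has the new column but has not been backfilled yet.
console.log(isEmailVerified({
  email_verified: false,
  email_verification_status: 'verified',
})); // true

// A row read from a database where the migration has not run at all.
console.log(isEmailVerified({ email_verification_status: 'pending' })); // false
```

Once a backfill has copied the old values across and monitoring confirms both columns agree, the fallback can be deleted and Phase 3 becomes safe.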

Real war story: The silent payment failure

A team deployed checkout optimizations to their e-commerce platform during moderate Tuesday traffic. The database migration succeeded, the code deployed without errors, and response times looked normal.

But payments started failing silently. The new code assumed the timestamp format used in staging, and the production database stored timestamps in a different one. The payment processor received malformed data and quietly rejected transactions.

Customers completed "purchases" that never processed. By the time the team noticed the revenue drop, they had 200+ failed transactions and angry customers.

The fix: The team disabled the feature flag immediately to restore the old payment flow, then shipped a proper backward-compatible migration.

The lesson: Monitor business metrics, not just technical ones, during deployments.
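In this incident every technical signal was green, so the useful alarm is a business one: how many completed checkouts actually turn into settled payments. A rough sketch of such a check, where `checkPaymentHealth` and the 95% threshold are illustrative assumptions rather than recommended values:

```javascript
// Compares settled payments with completed checkouts over the same time window.
// minSettleRatio is an illustrative threshold; derive yours from normal traffic.
function checkPaymentHealth(completedCheckouts, settledPayments, minSettleRatio = 0.95) {
  if (completedCheckouts === 0) {
    return { healthy: true, ratio: null }; // no traffic, nothing to judge
  }
  const ratio = settledPayments / completedCheckouts;
  return { healthy: ratio >= minSettleRatio, ratio };
}

// Example: 240 customers finished checkout, but only 38 payments settled.
const result = checkPaymentHealth(240, 38);
if (!result.healthy) {
  console.log(`Payment settle ratio dropped to ${(result.ratio * 100).toFixed(1)}%`);
}
```

Evaluated every minute or so after a deploy, a check like this would flag the problem after a handful of failed transactions instead of a couple of hundred.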

Your deployment reliability checklist

[ ] Comprehensive integration testing with production-like data
[ ] Backward-compatible database migrations
[ ] Feature flags for gradual rollouts
[ ] Blue-green or canary deployment strategy
[ ] Real-time business and technical metric monitoring
[ ] Documented rollback procedures
[ ] Deploy during low-traffic windows
[ ] One change per deployment when possible

Deployments will always carry risk, but with proper engineering practices, you can deploy confidently without destroying your Black Friday.
