Why your production servers are failing health checks (and how to fix it for good)
Your staging environment passes all tests. Your production deployment worked flawlessly last month. But now your servers are throwing random 500s, failing health checks, and behaving differently across instances.
Sound familiar? You're likely dealing with configuration drift, and it's about to make your next zero-downtime migration a nightmare.
Let me walk you through the two approaches to solving this problem, and when to choose each one.
The configuration drift trap
Configuration drift is death by a thousand cuts. Someone applies a security patch during an incident. Another engineer tweaks a config file to fix a performance issue. A dependency gets updated on one server but not others.
Each change makes sense in isolation. Together, they create infrastructure that nobody fully understands.
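To make this concrete, here is a minimal Python sketch of how a few reasonable one-off changes show up as disagreement across a fleet. The hostnames, setting names, and version numbers are invented for illustration; a real snapshot would come from whatever inventory tooling you already run.

```python
from collections import defaultdict

MISSING = "<missing>"  # placeholder for a setting a host does not have at all

# Hypothetical snapshot: host -> {setting: observed value}.
FLEET = {
    "web-1": {"openssl": "3.0.13", "nginx.worker_connections": "1024",
              "libpq": "15.4", "timezone": "UTC"},
    # Someone patched openssl here during an incident.
    "web-2": {"openssl": "3.0.15", "nginx.worker_connections": "1024",
              "libpq": "15.4", "timezone": "UTC"},
    # A performance tweak and a dependency update landed only here.
    "web-3": {"openssl": "3.0.13", "nginx.worker_connections": "4096",
              "libpq": "15.6", "timezone": "UTC"},
}


def find_drift(fleet):
    """Return {setting: {value: [hosts]}} for settings the hosts disagree on."""
    all_keys = {key for settings in fleet.values() for key in settings}
    drift = {}
    for key in sorted(all_keys):
        by_value = defaultdict(list)
        for host in sorted(fleet):
            by_value[fleet[host].get(key, MISSING)].append(host)
        # More than one distinct value means the fleet has diverged.
        if len(by_value) > 1:
            drift[key] = dict(by_value)
    return drift


if __name__ == "__main__":
    for setting, values in find_drift(FLEET).items():
        print(setting, values)
```

For this snapshot the function reports three drifted settings and omits `timezone`, because every host agrees on it. No single change looks alarming, but no two servers are identical anymore.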
Managing drift: the gradual fix
Most teams reach for configuration management tools like Ansible or Puppet. The approach is straightforward:
- Define your desired system state
- Scan servers for differences
- Automatically correct drift when found
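# Ansible playbook example
- name: Keep nginx converged
  hosts: webservers  # assumed inventory group name
  become: true
  tasks:
    - name: Ensure nginx config is correct
      ansible.builtin.template:
        src: nginx.conf.j2
        dest: /etc/nginx/nginx.conf
      notify: restart nginx
    - name: Verify service is running
      ansible.builtin.systemd:
        name: nginx
        state: started
        enabled: true
  # Handlers run only when a notifying task reports a change
  handlers:
    - name: restart nginx
      ansible.builtin.systemd:
        name: nginx
        state: restarted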
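The three steps above amount to a convergence loop. The following Python sketch shows the idea under simplifying assumptions: system state is modeled as a flat dictionary, and the key names are hypothetical. It illustrates the concept only and is not how Ansible or Puppet work internally.

```python
def plan_changes(desired, actual):
    """Compare desired and actual state; return (key, current, wanted) tuples."""
    return [
        (key, actual.get(key), wanted)
        for key, wanted in sorted(desired.items())
        if actual.get(key) != wanted
    ]


def converge(desired, actual):
    """Apply the planned changes to `actual` in place and return what changed."""
    changes = plan_changes(desired, actual)
    for key, _current, wanted in changes:
        actual[key] = wanted
    return changes


if __name__ == "__main__":
    # Hypothetical state: the service was stopped by hand, and someone
    # left an undeclared debug flag behind.
    desired = {"nginx.service": "started", "nginx.enabled": True}
    actual = {"nginx.service": "stopped", "nginx.enabled": True, "debug_flag": "on"}

    print(converge(desired, actual))  # first run corrects the drift
    print(converge(desired, actual))  # second run finds nothing to do
    print(actual["debug_flag"])       # undeclared settings are left alone
```

The second call returns an empty list, which is the idempotence these tools depend on. Note that `debug_flag` survives both runs: in this sketch only declared state is corrected, so anything the desired state never mentions can keep drifting unnoticed.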
Why teams choose this approach:
- Works with existing infrastructure
- Preserves institutional knowledge
- Lower upfront costs
- Gradual implementation
The hidden problems:
- Detection happens after drift occurs
- Corrections often require service restarts
- Complex dependencies resist automated fixes
- Root cause remains: systems are still mutable
During zero-downtime migrations, these problems compound. You're never certain what state your servers are actually in, making rollbacks risky and deployments unpredictable.
The immutable alternative
Immutable infrastructure flips the script entirely. Instead of fixing drifted servers, you replace them.
Every deployment follows the same pattern:
- Build new infrastructure from scratch
- Deploy application to new servers
- Switch traffic over
- Destroy old infrastructure
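# Dockerfile ensuring consistent base
FROM node:18-alpine
WORKDIR /app
# Copy the manifests first so the dependency layer is cached between builds
COPY package*.json ./
# Install exactly what package-lock.json pins, skipping devDependencies
RUN npm ci --omit=dev
COPY . .
EXPOSE 3000
CMD ["npm", "start"]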
# Kubernetes deployment with immutable containers
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: myapp:v1.2.3
          ports:
            - containerPort: 3000
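The four-step pattern above can be sketched as a small simulation. This is a toy model, not a deployment tool: `Environment` and `Router` are hypothetical stand-ins for a group of servers and a load balancer, the `healthy` flag stands in for a real health check, and the `v1.2.4` and `v1.2.5` tags are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Environment:
    """One complete set of servers built from a single image, never modified."""
    image: str
    healthy: bool = True


@dataclass
class Router:
    """Stand-in for a load balancer that points at exactly one environment."""
    live: Environment
    previous: Optional[Environment] = None

    def deploy(self, image, healthy=True):
        # Steps 1-2: build fresh infrastructure and deploy the app to it.
        candidate = Environment(image=image, healthy=healthy)
        if not candidate.healthy:
            return False  # live traffic never reached the bad build
        # Step 3: switch traffic, keeping the old environment for rollback.
        self.previous, self.live = self.live, candidate
        return True

    def rollback(self):
        if self.previous is None:
            raise RuntimeError("previous environment was already destroyed")
        self.live, self.previous = self.previous, None

    def retire_previous(self):
        # Step 4: destroy the old infrastructure once the new one is trusted.
        self.previous = None


if __name__ == "__main__":
    router = Router(live=Environment("myapp:v1.2.3"))
    router.deploy("myapp:v1.2.4")
    print(router.live.image)
    router.rollback()
    print(router.live.image)
```

Nothing in this model ever edits a running environment: a deployment either replaces it or leaves it alone. The sketch also shows that rollback only stays a simple traffic switch until `retire_previous` runs, so how long you keep the old environment is a real design decision.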
Why this works better for zero downtime:
- Identical infrastructure every time
- Trivial rollbacks (switch traffic back to the previous environment, or redeploy the previous image tag)
- Predictable behavior during migrations
- No accumulated drift
The tradeoffs:
- Requires significant automation investment
- Extra capacity needed during deployments (double for a full blue-green switch)
- Applications must be stateless or externalize state
- Different debugging workflow
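The capacity tradeoff can be put in numbers. The sketch below compares a full blue-green switch with a Kubernetes-style rolling replacement, where `maxSurge` caps how many extra pods may exist at once (25% is the Kubernetes default, and a percentage is rounded up to a whole pod). The function names are hypothetical, and the replica counts are only examples.

```python
def peak_blue_green(replicas):
    """Both environments run in full until the old one is destroyed."""
    return 2 * replicas


def peak_rolling(replicas, max_surge_percent=25):
    """Old and new pods together never exceed replicas plus the surge allowance."""
    # Integer ceiling division: a fractional pod rounds up to a whole one.
    surge = (replicas * max_surge_percent + 99) // 100
    return replicas + surge


if __name__ == "__main__":
    for replicas in (3, 10):
        print(replicas, peak_blue_green(replicas), peak_rolling(replicas))
```

With the three replicas from the Deployment above, a blue-green switch peaks at six instances while a rolling replacement at the default surge peaks at four. The rolling approach is cheaper, but old and new versions serve traffic side by side while it runs.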
Quick decision framework
Choose drift management if:
- You deploy less often than weekly
- Your team has limited automation expertise
- You run legacy applications with local state
- Budget constraints prevent an infrastructure redesign
Choose immutable infrastructure if:
- You need reliable zero-downtime migrations
- You deploy multiple times per week
- Your applications are already containerized
- Your team has strong automation skills
My recommendation
If you're reading this because migrations are causing downtime, immutable infrastructure is probably your answer. The upfront investment is significant, but the operational benefits compound over time.
Start small: containerize one service, implement blue-green deployments for it, then expand the pattern to other components.
Configuration drift management can work, but it fights entropy instead of designing around it. For teams serious about zero-downtime operations, immutable patterns are worth the investment.
Originally published on binadit.com