FT MJ

Production Maintenance Drill: An Ops Checklist in Action

Validating Production Readiness Through Hands-On Practice

In this assignment, I stepped into the shoes of a production support engineer and performed a comprehensive maintenance drill on a live EC2 instance running an Nginx web server. The goal was simple but critical: validate that the production server is healthy, secure, and reliable — and document every step of the process.

This drill simulated exactly what real DevOps and SRE teams do when they investigate production issues, perform routine maintenance, or respond to incidents.

📋 Phase 1: Network & Connectivity Checks
Before diving into application-level checks, I verified that the server could actually communicate with the outside world. Network issues are often the root cause of "it's down" scenarios.

Check Network Interfaces

```bash
echo "Manjay Verma - Maintenance Drill"
ip a
```
What I observed: Active network interfaces with a valid private IP.

Why it matters: If no interface is up, the server cannot communicate with anything — including you.

Verify Default Gateway

```bash
echo "Manjay Verma - Maintenance Drill"
ip route
```
What I observed: Default gateway route is present.

Why it matters: Without a default route, the server cannot reach external services or the internet.

Test DNS Resolution

```bash
echo "Manjay Verma - Maintenance Drill"
host pravinmishra.com
```
What I observed: DNS resolution works — domain successfully resolved to IP.

Why it matters: If DNS fails, applications cannot reach external APIs or services.

Check Packet-Level Connectivity

```bash
echo "Manjay Verma - Maintenance Drill"
ping -c 4 thecloudadvisory.com
```
What I observed: 0% packet loss.

Why it matters: Packet loss in production can indicate network instability or congestion.

Inspect Open Ports

```bash
echo "Manjay Verma - Maintenance Drill"
sudo ss -tulpen
```
What I observed:

Port 22 → SSH (listening on 0.0.0.0)

Port 80 → Nginx (listening on 0.0.0.0)

No unexpected ports open

Why it matters: Every open port is a potential attack surface. Only required services should be exposed.
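To make this audit repeatable rather than a one-off eyeball check, the observed ports can be compared against an allowlist. The following is my own sketch (the `check_ports` helper and the "22 80" allowlist are not part of the drill itself):

```bash
# Sketch: flag listening ports that aren't on an expected allowlist.
# check_ports reads one port number per line on stdin.
check_ports() {
  local allowed="$1" port
  while read -r port; do
    case " $allowed " in
      *" $port "*) echo "OK: $port" ;;
      *)           echo "UNEXPECTED: $port" ;;
    esac
  done
}

# Live usage on Linux (iproute2): extract listening TCP ports, then audit.
# ss -tln | awk 'NR>1 {n=split($4, a, ":"); print a[n]}' | sort -un | check_ports "22 80"
```

Anything printed as UNEXPECTED is a candidate for investigation or shutdown.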

Check Firewall Status

```bash
echo "Manjay Verma - Maintenance Drill"
sudo ufw status
```
What I observed: ufw is not installed, meaning no host-level firewall is active.

Why it matters: In production, firewall rules should restrict traffic to required ports only. This is a security gap that needs addressing.
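Closing that gap on Ubuntu could look roughly like this. This is a sketch, assuming ufw is the chosen firewall and only SSH and HTTP should be reachable; allowing port 22 before enabling is what keeps you from locking yourself out of your own SSH session:

```bash
# Sketch: install ufw and restrict inbound traffic to SSH and HTTP only.
sudo apt-get update && sudo apt-get install -y ufw
sudo ufw default deny incoming    # drop all inbound traffic by default
sudo ufw default allow outgoing   # let the server reach out (updates, APIs)
sudo ufw allow 22/tcp             # keep SSH reachable BEFORE enabling
sudo ufw allow 80/tcp             # keep Nginx reachable
sudo ufw --force enable
sudo ufw status verbose           # verify the resulting ruleset
```

On AWS, security groups provide similar filtering at the instance boundary, but a host firewall adds defense in depth.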

🚦 Phase 2: Service Health Validation
Once network connectivity was confirmed, I validated that the Nginx service was running correctly.

Check Service Status

```bash
echo "Manjay Verma - Maintenance Drill"
systemctl status nginx --no-pager
```
What I observed: Nginx service is active and running under systemd.

Verify Boot-Time Enablement

```bash
echo "Manjay Verma - Maintenance Drill"
systemctl is-enabled nginx
```
What I observed: Nginx is enabled, meaning it starts automatically after reboot.

Why it matters: Ensures service availability after server restarts.

Test Configuration Syntax

```bash
echo "Manjay Verma - Maintenance Drill"
sudo nginx -t
```
What I observed: Configuration syntax is valid.

Why it matters: Always validate configuration before restarting to prevent downtime from bad configs.

Verify Process Structure

```bash
echo "Manjay Verma - Maintenance Drill"
ps aux | grep -E "nginx: master|nginx: worker" | grep -v grep
```
What I observed: Both master and worker processes are running.

Why it matters: Nginx uses a master process to manage its workers. If the worker processes are missing, the service isn't actually serving traffic.

Confirm PID on Port 80

```bash
echo "Manjay Verma - Maintenance Drill"
sudo ss -lptn '( sport = :80 )'
```
What I observed: Nginx PID is correctly bound to port 80.

Why it matters: Confirms the service is listening on the expected port.

Perform Safe Restart Drill

```bash
echo "Manjay Verma - Maintenance Drill"
sudo systemctl restart nginx
systemctl status nginx --no-pager
```
What I observed: Restart completed successfully.

Rollback plan: If nginx fails to restart, revert config changes and validate using nginx -t before restarting again.
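That validate-before-restart habit can be wrapped in a small helper so a restart simply never happens on a bad config. The function name and its two arguments are my own sketch, not part of the drill:

```bash
# Sketch: run a validation command; only run the restart command if it passes.
safe_restart() {
  local validate_cmd="$1" restart_cmd="$2"
  if eval "$validate_cmd"; then
    eval "$restart_cmd"
  else
    echo "Validation failed; restart aborted." >&2
    return 1
  fi
}

# Intended usage on the drill server:
# safe_restart "sudo nginx -t" "sudo systemctl restart nginx"
```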

📊 Phase 3: Log Analysis
Logs are the production engineer's best friend. I generated test traffic and examined what got logged.

Generate Test Traffic

```bash
curl -s http://34.230.8.38 > /dev/null
curl -I http://34.230.8.38
```
Review Access Logs

```bash
echo "Manjay Verma - Maintenance Drill"
sudo tail -n 30 /var/log/nginx/access.log
```
What I observed: My curl requests appeared in the logs, confirming traffic reached Nginx.

Why it matters: Access logs show who is requesting what and help identify traffic patterns or attacks.
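For spotting traffic patterns quickly, a short pipeline over the access log helps. This assumes the default combined log format, where the client IP is the first field; the `top_ips` helper name is my own:

```bash
# Sketch: count requests per client IP, busiest first.
top_ips() {
  awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -5
}

# Usage on the drill server:
# top_ips /var/log/nginx/access.log
```

A single IP dominating the counts can indicate a scanner, a misbehaving client, or a crawler worth rate-limiting.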

Check Error Logs

```bash
echo "Manjay Verma - Maintenance Drill"
sudo tail -n 30 /var/log/nginx/error.log
```
What I observed: No recent errors.

Why it matters: Empty error logs suggest no current runtime failures — a good sign.

Examine Systemd Journal

```bash
echo "Manjay Verma - Maintenance Drill"
sudo journalctl -u nginx --no-pager -n 50
```
What I observed: Service-level events and restart history.

Why it matters: The systemd journal records service lifecycle events (starts, stops, failures) that might not appear in the application's own logs.

💾 Phase 4: Resource Health
Production services don't just need working software — they need healthy infrastructure.

Check Uptime

```bash
echo "Manjay Verma - Maintenance Drill"
uptime
```
What I observed: Server has been running continuously.

Review Memory Usage

```bash
echo "Manjay Verma - Maintenance Drill"
free -h
```
What I observed: Memory usage is within acceptable limits.

Check Disk Space

```bash
echo "Manjay Verma - Maintenance Drill"
df -h
```
What I observed: Disk usage is below critical threshold.

Inspect /var Directory Usage

```bash
echo "Manjay Verma - Maintenance Drill"
sudo du -sh /var/* | sort -h
```
What I observed: No single directory consuming excessive space.

Why it matters: If disk fills up completely, services cannot write logs and may crash. Monitoring disk usage prevents this.
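The same check can be automated with a threshold so it can later feed an alert. This is a sketch; the `disk_alert` name and the idea of an 80% limit are my own choices, and it relies on GNU df's --output option:

```bash
# Sketch: warn when a filesystem's usage crosses a threshold percentage.
disk_alert() {
  local mount="$1" limit="$2" used
  # GNU df: print only the usage percentage column, strip everything but digits.
  used=$(df --output=pcent "$mount" | tail -n 1 | tr -dc '0-9')
  if [ "$used" -ge "$limit" ]; then
    echo "ALERT: $mount at ${used}% (limit ${limit}%)"
  else
    echo "OK: $mount at ${used}% (limit ${limit}%)"
  fi
}

# Usage: disk_alert /var 80
```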

⚠️ Phase 5: Incident Simulation (The Best Part)
Theory is useful, but practice is irreplaceable. I intentionally broke the configuration to understand real incident response.

Break the Config

```bash
sudo nano /etc/nginx/sites-available/default
# Removed one semicolon to break the config
```

Observe the Failure

```bash
sudo nginx -t
```
Result: Configuration test failed with syntax error.

Fix and Recover

```bash
# Restored the missing semicolon, then re-validated and restarted:
sudo nginx -t                    # Configuration test passes
sudo systemctl restart nginx
curl -I http://34.230.8.38       # HTTP/1.1 200 OK
```
Root Cause: Syntax error in nginx configuration (missing semicolon)

Fix: Restored valid configuration and validated using nginx -t

Prevention: Always test configuration before restarting. Never assume — verify.

🔒 Security & Reliability Notes
Based on my maintenance drill, here's the current security posture:

✅ SSH access uses key-based authentication (no password login)

✅ Open ports restricted to 22 (SSH) and 80 (HTTP)

✅ Nginx is enabled on boot for automatic recovery

✅ No secrets or private keys were exposed during this exercise

⚠️ Firewall not configured — this should be addressed

💡 Cost optimization: I will stop the EC2 instance when not in use to prevent unnecessary charges

📌 Key Learnings
This week taught me that real production validation goes far beyond checking if a website loads. Here's what I learned:

| Area | Key Takeaway |
| --- | --- |
| Network | Connectivity issues often masquerade as application problems |
| Services | Verify not just "running" but "actually serving traffic" |
| Logs | Logs tell the real story: access, errors, and system events |
| Resources | Disk, memory, and CPU monitoring prevent surprises |
| Incidents | Breaking things safely teaches more than reading docs |
| Recovery | Always have a rollback plan and test configuration changes |
🚀 Next Steps
This maintenance drill revealed both strengths and gaps. Moving forward, I will:

Implement firewall rules to restrict traffic to only required ports

Set up automated monitoring with alerts for:

High disk usage

Service failures

Unusual traffic patterns

Practice more failure scenarios to build muscle memory for incident response

Document runbooks so recovery steps are clear and repeatable
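As a sketch of how the monitoring piece could be scheduled to start with (the script path and the five-minute interval are hypothetical), a single crontab entry is enough:

```
# Hypothetical crontab entry: run health checks every 5 minutes,
# appending results to a log that can be reviewed or alerted on.
*/5 * * * * /usr/local/bin/health_check.sh >> /var/log/health_check.log 2>&1
```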

💭 Final Thoughts
Production maintenance isn't glamorous, but it's essential. This drill simulated exactly what on-call engineers do daily: validate, investigate, fix, and learn.

The most valuable takeaway? Simulating failures in a controlled environment builds confidence for handling real outages. By deliberately breaking and fixing Nginx, I experienced the full cycle of detection, diagnosis, recovery, and prevention — all without impacting real users.

Have you performed similar maintenance drills? What unexpected issues did you discover? Share your experiences below!
