In this assignment, I stepped into the shoes of a production support engineer and performed a comprehensive maintenance drill on a live EC2 instance running an Nginx web server. The goal was simple but critical: validate that the production server is healthy, secure, and reliable — and document every step of the process.
This drill simulated exactly what real DevOps and SRE teams do when they investigate production issues, perform routine maintenance, or respond to incidents.
📋 Phase 1: Network & Connectivity Checks
Before diving into application-level checks, I verified that the server could actually communicate with the outside world. Network issues are often the root cause of "it's down" scenarios.
Check Network Interfaces
```bash
echo "Manjay Verma - Maintenance Drill"
ip a
```
What I observed: Active network interfaces with a valid private IP.
Why it matters: If no interface is up, the server cannot communicate with anything — including you.
Verify Default Gateway
```bash
echo "Manjay Verma - Maintenance Drill"
ip route
```
What I observed: Default gateway route is present.
Why it matters: Without a default route, the server cannot reach external services or the internet.
Test DNS Resolution
```bash
echo "Manjay Verma - Maintenance Drill"
host pravinmishra.com
```
What I observed: DNS resolution works — domain successfully resolved to IP.
Why it matters: If DNS fails, applications cannot reach external APIs or services.
Check Packet-Level Connectivity
```bash
echo "Manjay Verma - Maintenance Drill"
ping -c 4 thecloudadvisory.com
```
What I observed: 0% packet loss.
Why it matters: Packet loss in production can indicate network instability or congestion.
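The loss percentage can also be checked programmatically, which is handy in a cron job. This is a minimal sketch of my own; `parse_loss` is a helper name I made up, and the sample line stands in for real `ping` output:

```bash
# Extract the packet-loss percentage from ping's summary line.
parse_loss() {
  sed -n 's/.*, \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Sample summary line, standing in for: ping -c 4 thecloudadvisory.com
sample='4 packets transmitted, 4 received, 0% packet loss, time 3004ms'
loss=$(printf '%s\n' "$sample" | parse_loss)
if [ "$loss" -gt 0 ]; then
  echo "WARN: ${loss}% packet loss"
else
  echo "OK: no packet loss"
fi
```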
Inspect Open Ports
```bash
echo "Manjay Verma - Maintenance Drill"
sudo ss -tulpen
```
What I observed:
- Port 22 → SSH (listening on 0.0.0.0)
- Port 80 → Nginx (listening on 0.0.0.0)
- No unexpected ports open
Why it matters: Every open port is a potential attack surface. Only required services should be exposed.
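As a quick sanity check, the listening ports can be compared against an expected allowlist. A sketch under the assumption that 22 and 80 are the only expected ports; the heredoc is illustrative `ss`-style output, not a capture from my server:

```bash
# Compare listening ports against an expected allowlist.
expected="22 80"

# The heredoc stands in for real `sudo ss -tulpen` output.
listening=$(awk 'NR>1 { n = split($5, a, ":"); print a[n] }' <<'EOF'
Netid State  Recv-Q Send-Q Local Address:Port Peer Address:Port
tcp   LISTEN 0      128    0.0.0.0:22         0.0.0.0:*
tcp   LISTEN 0      511    0.0.0.0:80         0.0.0.0:*
EOF
)

for port in $listening; do
  case " $expected " in
    *" $port "*) echo "OK: port $port is expected" ;;
    *)           echo "WARN: unexpected listening port $port" ;;
  esac
done
```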
Check Firewall Status
```bash
echo "Manjay Verma - Maintenance Drill"
sudo ufw status
```
What I observed: ufw is not installed on this instance.
Why it matters: In production, firewall rules should restrict traffic to required ports only. This is a security gap that needs addressing.
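If I were to close this gap, a baseline ufw policy might look like the sketch below (it assumes an Ubuntu host where ufw is available from apt). The script only prints the plan rather than executing it, so the rules can be reviewed before anything touches production:

```bash
# Sketch: baseline ufw policy for this server (assumes Ubuntu/Debian).
# Printed instead of executed so the plan can be reviewed first.
plan_firewall() {
  cat <<'EOF'
sudo apt-get install -y ufw
sudo ufw default deny incoming
sudo ufw default allow outgoing
sudo ufw allow 22/tcp   # SSH
sudo ufw allow 80/tcp   # HTTP
sudo ufw enable
EOF
}

plan_firewall
```

The deny-by-default inbound policy is the important part: anything not explicitly allowed stays closed.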
🚦 Phase 2: Service Health Validation
Once network connectivity was confirmed, I validated that the Nginx service was running correctly.
Check Service Status
```bash
echo "Manjay Verma - Maintenance Drill"
systemctl status nginx --no-pager
```
What I observed: Nginx service is active and running under systemd.
Verify Boot-Time Enablement
```bash
echo "Manjay Verma - Maintenance Drill"
systemctl is-enabled nginx
```
What I observed: Nginx is enabled, meaning it starts automatically after reboot.
Why it matters: Ensures service availability after server restarts.
Test Configuration Syntax
```bash
echo "Manjay Verma - Maintenance Drill"
sudo nginx -t
```
What I observed: Configuration syntax is valid.
Why it matters: Always validate configuration before restarting to prevent downtime from bad configs.
Verify Process Structure
```bash
echo "Manjay Verma - Maintenance Drill"
ps aux | grep -E "nginx: master|nginx: worker" | grep -v grep
```
What I observed: Both master and worker processes are running.
Why it matters: Nginx uses a master process to manage workers. Missing worker processes means the service isn't actually serving traffic.
Confirm PID on Port 80
```bash
echo "Manjay Verma - Maintenance Drill"
sudo ss -lptn '( sport = :80 )'
```
What I observed: Nginx PID is correctly bound to port 80.
Why it matters: Confirms the service is listening on the expected port.
Perform Safe Restart Drill
```bash
echo "Manjay Verma - Maintenance Drill"
sudo systemctl restart nginx
systemctl status nginx --no-pager
```
What I observed: Restart completed successfully.
Rollback plan: If nginx fails to restart, revert config changes and validate using nginx -t before restarting again.
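The rollback plan can also be captured as a small wrapper that refuses to restart when the config check fails. This is my own sketch, with the check and restart commands injectable so the same pattern works for any service (and can be exercised without a real nginx):

```bash
# Safe-restart pattern: only restart when the config check passes,
# and report which step failed.
safe_restart() {
  check_cmd=$1
  restart_cmd=$2
  if ! $check_cmd; then
    echo "ABORT: config check failed, not restarting"
    return 1
  fi
  if ! $restart_cmd; then
    echo "ERROR: restart failed, investigate before retrying"
    return 2
  fi
  echo "OK: restarted cleanly"
}

# Real usage would be:
#   safe_restart "sudo nginx -t" "sudo systemctl restart nginx"
# Demonstrated here with stub commands:
safe_restart true true
```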
📊 Phase 3: Log Analysis
Logs are the production engineer's best friend. I generated test traffic and examined what got logged.
Generate Test Traffic
```bash
curl -s http://34.230.8.38 > /dev/null
curl -I http://34.230.8.38
```
Review Access Logs
```bash
echo "Manjay Verma - Maintenance Drill"
sudo tail -n 30 /var/log/nginx/access.log
```
What I observed: My curl requests appeared in the logs, confirming traffic reached Nginx.
Why it matters: Access logs show who is requesting what and help identify traffic patterns or attacks.
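Beyond eyeballing the tail, a quick status-code summary makes traffic patterns jump out. A sketch assuming Nginx's default combined log format, where the status code is field 9; the heredoc lines are illustrative, not my real logs:

```bash
# Count requests per HTTP status code (combined log format, status = field 9).
status_summary() {
  awk '{ count[$9]++ } END { for (s in count) print s, count[s] }' | sort
}

# Sample lines standing in for /var/log/nginx/access.log
status_summary <<'EOF'
1.2.3.4 - - [01/Jan/2025:10:00:00 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.5.0"
1.2.3.4 - - [01/Jan/2025:10:00:01 +0000] "HEAD / HTTP/1.1" 200 0 "-" "curl/8.5.0"
1.2.3.4 - - [01/Jan/2025:10:00:02 +0000] "GET /missing HTTP/1.1" 404 162 "-" "curl/8.5.0"
EOF
```

On the server this would be fed real data, e.g. `sudo tail -n 1000 /var/log/nginx/access.log | status_summary`. A sudden spike in 4xx or 5xx counts is an early warning worth alerting on.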
Check Error Logs
```bash
echo "Manjay Verma - Maintenance Drill"
sudo tail -n 30 /var/log/nginx/error.log
```
What I observed: No recent errors.
Why it matters: Empty error logs suggest no current runtime failures — a good sign.
Examine Systemd Journal
```bash
echo "Manjay Verma - Maintenance Drill"
sudo journalctl -u nginx --no-pager -n 50
```
What I observed: Service-level events and restart history.
Why it matters: Journal provides additional context that might not appear in application logs.
💾 Phase 4: Resource Health
Production services don't just need working software — they need healthy infrastructure.
Check Uptime
```bash
echo "Manjay Verma - Maintenance Drill"
uptime
```
What I observed: Server has been running continuously.
Review Memory Usage
```bash
echo "Manjay Verma - Maintenance Drill"
free -h
```
What I observed: Memory usage is within acceptable limits.
Check Disk Space
```bash
echo "Manjay Verma - Maintenance Drill"
df -h
```
What I observed: Disk usage is below critical threshold.
Inspect /var Directory Usage
```bash
echo "Manjay Verma - Maintenance Drill"
sudo du -sh /var/* | sort -h
```
What I observed: No single directory consuming excessive space.
Why it matters: If disk fills up completely, services cannot write logs and may crash. Monitoring disk usage prevents this.
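This check is easy to automate. Below is a sketch of a threshold alert that parses `df -P` output (the heredoc is sample output, and 80% is an arbitrary threshold I chose for illustration):

```bash
# Flag any filesystem above a usage threshold. Reads `df -P` output on stdin;
# the -P flag keeps the output format stable for parsing.
check_disk() {
  threshold=$1
  awk -v max="$threshold" 'NR>1 {
    use = $5; sub(/%/, "", use)
    if (use + 0 > max) print "WARN:", $6, "at", $5
    else               print "OK:",   $6, "at", $5
  }'
}

# Sample output standing in for: df -P
check_disk 80 <<'EOF'
Filesystem     1024-blocks    Used Available Capacity Mounted on
/dev/root          8065444 3264120   4785276      41% /
tmpfs               498568       0    498568       0% /dev/shm
EOF
```

Wired into cron with real `df -P` input, the WARN lines could feed an email or chat alert before the disk actually fills.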
⚠️ Phase 5: Incident Simulation (The Best Part)
Theory is useful, but practice is irreplaceable. I intentionally broke the configuration to understand real incident response.
Break the Config
```bash
sudo nano /etc/nginx/sites-available/default
# Removed one semicolon to break the config
```
Observe the Failure
```bash
sudo nginx -t
```
Result: Configuration test failed with syntax error.
Fix and Recover
```bash
# Restored the missing semicolon
sudo nginx -t                    # Configuration test passes
sudo systemctl restart nginx
curl -I http://34.230.8.38       # HTTP/1.1 200 OK
```
Root Cause: Syntax error in nginx configuration (missing semicolon)
Fix: Restored valid configuration and validated using nginx -t
Prevention: Always test configuration before restarting. Never assume — verify.
🔒 Security & Reliability Notes
Based on my maintenance drill, here's the current security posture:
✅ SSH access uses key-based authentication (no password login)
✅ Open ports restricted to 22 (SSH) and 80 (HTTP)
✅ Nginx is enabled on boot for automatic recovery
✅ No secrets or private keys were exposed during this exercise
⚠️ Firewall not configured — this should be addressed
💡 Cost optimization: I will stop the EC2 instance when not in use to prevent unnecessary charges
📌 Key Learnings
This week taught me that real production validation goes far beyond checking if a website loads. Here's what I learned:
| Area | Key Takeaway |
| --- | --- |
| Network | Connectivity issues often masquerade as application problems |
| Services | Verify not just "running" but "actually serving traffic" |
| Logs | Logs tell the real story — access, errors, and system events |
| Resources | Disk, memory, and CPU monitoring prevent surprises |
| Incidents | Breaking things safely teaches more than reading docs |
| Recovery | Always have a rollback plan and test configuration changes |
🚀 Next Steps
This maintenance drill revealed both strengths and gaps. Moving forward, I will:
- Implement firewall rules to restrict traffic to only required ports
- Set up automated monitoring with alerts for:
  - High disk usage
  - Service failures
  - Unusual traffic patterns
- Practice more failure scenarios to build muscle memory for incident response
- Document runbooks so recovery steps are clear and repeatable
💭 Final Thoughts
Production maintenance isn't glamorous, but it's essential. This drill simulated exactly what on-call engineers do daily: validate, investigate, fix, and learn.
The most valuable takeaway? Simulating failures in a controlled environment builds confidence for handling real outages. By deliberately breaking and fixing Nginx, I experienced the full cycle of detection, diagnosis, recovery, and prevention — all without impacting real users.
Have you performed similar maintenance drills? What unexpected issues did you discover? Share your experiences below!