Vivian Chiamaka Okose

Posted on Jan 23

Production Maintenance Drill: Learning to Keep Systems Running

#devops #cloudcomputing #networking #learning

Introduction

Deploying an application is exciting — seeing your app load in a browser feels like success. But in real-world DevOps, deployment is only the beginning. What truly matters is whether the application stays available, recoverable, and stable when things go wrong.

As part of Week 2 of the DevOps Micro Internship (DMI Cohort-2), I completed a Production Maintenance Drill (OPS Checklist) on a React application deployed on an Ubuntu EC2 instance using Nginx. This exercise was designed to simulate how DevOps engineers think and act during real production checks and on-call situations.

Pre-flight Check: Is the Application Actually Live?

The first step was simple but critical: confirm that the application is truly reachable by users.

I accessed the application via its public IP and verified that it loaded successfully in the browser. This confirmed that:

The EC2 instance was reachable from the internet
Nginx was running and serving the React build correctly
Port 80 was open and functioning as expected

This step reinforced an important lesson: if the app is not reachable, nothing else matters yet.
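
The browser check above can also be scripted so it is repeatable. The following is a minimal sketch, not the exact check used in the drill: the function name `is_reachable` and the placeholder address are assumptions made for illustration.

```python
import urllib.request


def is_reachable(url, timeout=5.0):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses,
        # so refused connections, DNS failures and 4xx/5xx responses land here.
        return False
```

Calling `is_reachable("http://<public-ip>/")` with the instance's public IP would return `True` only when Nginx answers on port 80 with a successful response.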

Phase 1: Networking & Access Checks

Once availability was confirmed, I moved into networking fundamentals:

Verified network interfaces and IP configuration
Checked default routing to confirm outbound internet access
Tested DNS resolution to ensure domain-to-IP translation worked
Confirmed basic connectivity using ICMP checks
Reviewed listening ports and firewall status

These checks are essential in production because many outages are caused not by application bugs but by network misconfiguration, and a service that is exposed unintentionally is a security risk even while everything appears to work.
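
The "listening ports" review is the easiest of these checks to automate. The sketch below is illustrative only: it assumes the text comes from `ss -tln`, and the function name `unexpected_listeners` is hypothetical. It covers the host side only; the EC2 security group still decides what is reachable from the internet.

```python
def unexpected_listeners(ss_output, allowed_ports):
    """Return sorted ports listening on non-loopback addresses that are not allowed.

    `ss_output` is assumed to be the text printed by `ss -tln`.
    """
    exposed = set()
    for line in ss_output.splitlines():
        fields = line.split()
        # Data rows look like: LISTEN 0 511 0.0.0.0:80 0.0.0.0:*
        if len(fields) < 5 or fields[0] != "LISTEN":
            continue  # skips the header row and anything malformed
        address, _, port = fields[3].rpartition(":")
        if address.startswith("127.") or address == "[::1]":
            continue  # loopback-only listeners are not reachable from outside
        if port.isdigit():
            exposed.add(int(port))
    return sorted(exposed - set(allowed_ports))
```

For this drill the allowed set would be `{22, 80}`; any port the function returns is worth a second look.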

Phase 2: Service Health & Process Validation

Next, I validated Nginx the way it is managed in production, as a systemd service:

Confirmed Nginx service status (active and running)
Ensured Nginx is enabled to start on boot
Validated configuration syntax before restart
Verified master and worker processes
Confirmed which process was bound to port 80

This phase emphasized the importance of service reliability. A service that runs but fails after a reboot or restart is a hidden production risk.
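
The first three checks in this phase map onto commands whose exit codes can be read by a script. This is a sketch under assumptions: the service unit is named `nginx`, the `CHECKS` table and `run_checks` function are hypothetical names, and `nginx -t` may need root privileges to read every file it validates.

```python
import subprocess

# Each check passes when its command exits with status 0.
CHECKS = {
    "active": ["systemctl", "is-active", "--quiet", "nginx"],
    "enabled": ["systemctl", "is-enabled", "--quiet", "nginx"],
    "config_ok": ["nginx", "-t"],
}


def run_checks(runner=subprocess.run):
    """Run every check and return a dict mapping check name to True/False."""
    results = {}
    for name, command in CHECKS.items():
        try:
            completed = runner(command, capture_output=True, text=True)
            results[name] = completed.returncode == 0
        except FileNotFoundError:
            # The binary is not installed or not on PATH.
            results[name] = False
    return results
```

The `runner` parameter exists only so the function can be exercised without a real Nginx installation; on the server, `run_checks()` with no arguments is enough.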

Phase 3: Logs & Request Tracing

In DevOps, saying “it works” is not enough — logs tell the real story.

I generated traffic to the application and examined:

Nginx access logs to confirm real requests were processed
Error logs to detect failures or misconfigurations
Systemd journal logs for service-level issues

No critical errors were observed, and the access logs confirmed my test requests. This showed why logs are the first place to look during an incident.
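
Reading the access log by eye works for a handful of requests; a short script scales better. The sketch below assumes Nginx's default "combined" log format and, on Ubuntu, the default path /var/log/nginx/access.log. The names `LINE_RE` and `status_counts` are hypothetical.

```python
import re
from collections import Counter

# Matches the start of a line in the "combined" format:
# address - user [time] "request" status bytes ...
LINE_RE = re.compile(
    r'(?P<addr>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+)'
)


def status_counts(lines):
    """Count HTTP status codes across access-log lines, ignoring unparsable ones."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            counts[match.group("status")] += 1
    return counts
```

Passing an open log file to `status_counts` gives a quick view of how many requests succeeded and how many returned 4xx or 5xx codes.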

Phase 4: System Resource Health

I then checked system capacity indicators:

Uptime and load average
Memory availability
Disk usage
High-usage directories under /var

Although no immediate risks were detected, this phase taught me that disk exhaustion is one of the most common silent causes of production failure, especially when logs grow unchecked.
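
A disk check like this is simple enough to run on a schedule. The following is a rough sketch, assuming a Unix host; the 80% threshold and the names `disk_used_percent` and `health_summary` are illustrative choices, not values from the drill. The percentage may differ slightly from what `df` prints, because `df` excludes blocks reserved for root from its calculation.

```python
import os
import shutil


def disk_used_percent(path="/"):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def health_summary(path="/", disk_limit=80.0):
    """Summarise disk usage and 1-minute load relative to the CPU count."""
    load_1m, _load_5m, _load_15m = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1
    used = disk_used_percent(path)
    return {
        "disk_used_percent": round(used, 1),
        "disk_warning": used >= disk_limit,
        "load_per_cpu_1m": round(load_1m / cpus, 2),
    }
```

Pointing `path` at /var instead of / would watch the filesystem where logs usually accumulate.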

Phase 5: Configuration & Content Integrity

To ensure release safety, I verified:

The correct React build files existed in the Nginx web root
Nginx was serving the deployed build, not the default page
SPA routing was correctly configured using try_files

This step is critical because a successful deployment does not guarantee the correct version is being served to users.
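
Both content checks can be approximated in a few lines. This sketch makes several assumptions: the web root path is passed in by the caller, the stock Nginx page is recognised by its "Welcome to nginx!" title, and the SPA fallback is written in the common form `try_files $uri $uri/ /index.html;`. The function names are hypothetical, and the config check is a plain text search that does not skip commented-out directives.

```python
import re
from pathlib import Path


def serves_custom_build(web_root):
    """Return True if index.html exists and is not the stock Nginx welcome page."""
    index = Path(web_root) / "index.html"
    if not index.is_file():
        return False
    return "Welcome to nginx!" not in index.read_text(errors="replace")


def has_spa_fallback(nginx_config):
    """Return True if a try_files directive ends by falling back to /index.html."""
    pattern = r"try_files\s+[^;]*\s/index\.html\s*;"
    return re.search(pattern, nginx_config) is not None
```

Without the fallback, refreshing the browser on a client-side route would return a 404 from Nginx instead of loading the React app.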

Phase 6: Incident Simulation & Recovery

This was the most impactful part of the drill.

Faulty Configuration Simulation

I intentionally introduced a small syntax error into the Nginx configuration. As expected, configuration validation failed. After fixing the issue, I validated the configuration and restarted Nginx successfully.

Missing Content Simulation

I removed the web root directory, having backed it up first, to simulate a failed deployment. The application became unavailable. After I restored the content from the backup and restarted Nginx, the application loaded normally again.

These simulations reinforced a key DevOps principle: mistakes are inevitable — fast, safe recovery is the real skill.
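
The missing-content simulation depends entirely on having a backup to restore from. Here is a minimal sketch of that backup-and-restore step; the function names and the rule that a valid backup must contain index.html are assumptions for illustration, not the procedure used in the drill.

```python
import shutil
from pathlib import Path


def backup_web_root(web_root, backup_dir):
    """Copy the web root to a new backup directory and return its path."""
    web_root, backup_dir = Path(web_root), Path(backup_dir)
    if backup_dir.exists():
        # Never overwrite an existing backup silently.
        raise FileExistsError(f"backup already exists: {backup_dir}")
    shutil.copytree(web_root, backup_dir)
    return backup_dir


def restore_web_root(backup_dir, web_root):
    """Replace the web root with the backup, after a basic sanity check."""
    backup_dir, web_root = Path(backup_dir), Path(web_root)
    if not (backup_dir / "index.html").is_file():
        raise FileNotFoundError("backup has no index.html; refusing to restore")
    if web_root.exists():
        shutil.rmtree(web_root)
    shutil.copytree(backup_dir, web_root)
```

Taking the backup before the destructive step, and checking it before restoring, is what keeps a drill like this from turning into a real outage.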

Security & Reliability Notes

SSH access was secured using key-based authentication
Only required ports (22 and 80) were exposed
Nginx was enabled on boot for reliability
No secrets or credentials were exposed publicly
Cloud resources will be stopped or terminated when not in use

Key Takeaways

This production maintenance drill reshaped how I think about DevOps work. I learned that reliability, observability, and recovery matter just as much as deployment. Logs are not optional, validation should never be skipped, and proactive checks prevent reactive firefighting.

Most importantly, I now understand that DevOps is not just about building systems — it’s about keeping them running under real-world conditions.
