Introduction
Deploying an application is exciting — seeing your app load in a browser feels like success. But in real-world DevOps, deployment is only the beginning. What truly matters is whether the application stays available, recoverable, and stable when things go wrong.
As part of Week 2 of the DevOps Micro Internship (DMI Cohort-2), I completed a Production Maintenance Drill (OPS Checklist) on a React application deployed on an Ubuntu EC2 instance using Nginx. This exercise was designed to simulate how DevOps engineers think and act during real production checks and on-call situations.
Pre-flight Check: Is the Application Actually Live?
The first step was simple but critical: confirm that the application is truly reachable by users.
I accessed the application via its public IP and verified that it loaded successfully in the browser. This confirmed that:
- The EC2 instance was reachable from the internet
- Nginx was running and serving the React build correctly
- Port 80 was open and functioning as expected
This step reinforced an important lesson: if the app is not reachable, nothing else matters yet.
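The browser check above can also be scripted so that it is repeatable. What follows is a minimal Python sketch, not the procedure used in the drill: the function name check_http and the 5-second timeout are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def check_http(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (ok, detail) for a single GET request to url."""
    # An empty ProxyHandler ignores any proxy configured in the environment,
    # so the request goes straight to the server being checked.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx or 5xx status.
        return False, f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        # DNS failure, refused connection, timeout, and similar problems.
        return False, f"unreachable: {exc}"
```

Calling check_http with the instance's public address should return (True, "HTTP 200") when Nginx is serving the page. Any other result means the check has failed before the deeper phases even start.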
Phase 1: Networking & Access Checks
Once availability was confirmed, I moved into networking fundamentals:
- Verified network interfaces and IP configuration
- Checked default routing to confirm outbound internet access
- Tested DNS resolution to ensure domain-to-IP translation worked
- Confirmed basic connectivity using ICMP checks (ping)
- Reviewed listening ports and firewall status
These checks are essential in production because many outages are not caused by application bugs, but by network misconfigurations or unintended exposure of services.
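Two of the checks above, DNS resolution and port reachability, can be sketched in Python with only the standard library. This is an illustration under assumptions: the function names are hypothetical, and a TCP connection attempt stands in for the ICMP check, because sending ICMP from Python normally requires raw sockets and root privileges.

```python
import socket


def resolve(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed: no DNS answer, or the resolver is unreachable.
        return []
    return sorted({info[4][0] for info in infos})


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, filtered by a firewall, timed out, or no route to host.
        return False
```

Run from outside the instance, port_is_open against ports 22 and 80 shows what the security group really exposes. Run from the instance itself, resolve confirms that outbound DNS works.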
Phase 2: Service Health & Process Validation
Next, I validated Nginx the way a systemd-managed service is checked in production:
- Confirmed Nginx service status (active and running)
- Ensured Nginx is enabled to start on boot
- Validated configuration syntax before restart
- Verified master and worker processes
- Confirmed which process was bound to port 80
This phase emphasized the importance of service reliability. A service that runs but fails after a reboot or restart is a hidden production risk.
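The first three checks in this phase map onto commands that report their result through an exit code, so they are easy to automate. The sketch below wraps them in Python. The dictionary name SERVICE_CHECKS and the helper run_check are hypothetical. The commands (systemctl is-active, systemctl is-enabled, nginx -t) are the standard ones, though nginx -t usually needs root privileges.

```python
import subprocess

# Commands whose exit code is 0 when the check passes.
SERVICE_CHECKS = {
    "active": ["systemctl", "is-active", "nginx"],
    "enabled on boot": ["systemctl", "is-enabled", "nginx"],
    "config syntax": ["nginx", "-t"],
}


def run_check(cmd: list[str], timeout: float = 10.0) -> tuple[bool, str]:
    """Run one command and return (passed, combined output)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Command missing, not executable, or it hung past the timeout.
        return False, str(exc)
    return proc.returncode == 0, (proc.stdout + proc.stderr).strip()
```

Looping over SERVICE_CHECKS.items() and printing each result gives a short health report. Any failed entry is a reason not to restart the service yet.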
Phase 3: Logs & Request Tracing
In DevOps, saying “it works” is not enough — logs tell the real story.
I generated traffic to the application and examined:
- Nginx access logs to confirm real requests were processed
- Error logs to detect failures or misconfigurations
- Systemd journal logs for service-level issues
No critical errors were observed, and the access logs confirmed my test requests. This highlighted how logs serve as the first line of truth during incidents.
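Reading access logs by eye works for a handful of requests. A small summary helps once traffic grows. The sketch below groups requests by status class, assuming the log uses Nginx's default "combined" format. The regular expression and the function name summarize are illustrative, and a custom log_format would need a different pattern.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the start of a line in Nginx's default "combined" log format:
# addr - user [time] "request" status bytes ...
LINE_RE = re.compile(
    r'^(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)


def summarize(lines: Iterable[str]) -> dict[str, int]:
    """Count requests per status class (2xx, 3xx, 4xx, 5xx)."""
    classes: Counter[str] = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:  # lines that do not parse are skipped, not counted
            classes[match.group("status")[0] + "xx"] += 1
    return dict(classes)
```

Passing an open file object for the access log (on Ubuntu this is typically /var/log/nginx/access.log) to summarize returns a dictionary of counts. A sudden rise in the 5xx count is the signal to open the error log.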
Phase 4: System Resource Health
I then checked system capacity indicators:
- Uptime and load average
- Memory availability
- Disk usage
- High-usage directories under /var
Although no immediate risks were detected, this phase taught me that disk exhaustion is one of the most common silent causes of production failure, especially when logs grow unchecked.
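The capacity indicators listed above can be collected with Python's standard library. This is a rough sketch that assumes a Linux host. All four function names are hypothetical, and mem_available_mb expects the text of /proc/meminfo to be passed in.

```python
import os
import shutil
from typing import Optional


def disk_percent_used(path: str = "/") -> float:
    """Percentage of the filesystem holding path that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def load_per_cpu() -> float:
    """One-minute load average divided by the number of CPUs."""
    one_minute, _five, _fifteen = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def mem_available_mb(meminfo_text: str) -> Optional[int]:
    """Extract MemAvailable (reported in kB) from /proc/meminfo text, in MB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1]) // 1024
    return None


def dir_size_bytes(path: str) -> int:
    """Total size of the regular files under path, skipping unreadable ones."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass
    return total
```

Comparing dir_size_bytes("/var/log") across runs is one simple way to notice logs growing unchecked before the disk fills.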
Phase 5: Configuration & Content Integrity
To ensure release safety, I verified:
- The correct React build files existed in the Nginx web root
- Nginx was serving the deployed build, not the default page
- SPA routing was correctly configured using try_files
This step is critical because a successful deployment does not guarantee the correct version is being served to users.
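Two of these checks can be automated with a short script. The sketch below assumes the usual single-page-app setup, where a try_files directive ends with a fallback to /index.html (commonly written as try_files $uri $uri/ /index.html;). The function names and the regular expression are illustrative, and the pattern is a heuristic rather than a real Nginx config parser.

```python
import re
from pathlib import Path

# An uncommented try_files directive whose last argument is /index.html.
TRY_FILES_RE = re.compile(
    r"^\s*try_files\s+[^;]*\s/index\.html\s*;", re.MULTILINE
)


def has_spa_fallback(config_text: str) -> bool:
    """True if the config text contains a try_files fallback to /index.html."""
    return bool(TRY_FILES_RE.search(config_text))


def build_present(web_root: str) -> bool:
    """True if the web root contains an index.html file."""
    return (Path(web_root) / "index.html").is_file()
```

Without the fallback, a deep link such as /dashboard returns 404 on a browser refresh, because no file with that name exists on disk. With it, Nginx hands the request to index.html and the React router takes over.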
Phase 6: Incident Simulation & Recovery
This was the most impactful part of the drill.
Faulty Configuration Simulation
I intentionally introduced a small syntax error into the Nginx configuration. As expected, configuration validation failed. After fixing the issue, I validated the configuration and restarted Nginx successfully.
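The habit this simulation builds, validating first and restarting only if validation passes, can be expressed as a small gate. The sketch below is one way to do that in Python. The function name validate_then_apply is hypothetical, and the commands are passed in as arguments rather than hard-coded.

```python
import subprocess


def validate_then_apply(
    validate_cmd: list[str], apply_cmd: list[str]
) -> tuple[bool, str]:
    """Run apply_cmd only if validate_cmd exits with status 0."""
    check = subprocess.run(validate_cmd, capture_output=True, text=True)
    if check.returncode != 0:
        # Validation failed: report why and leave the running service alone.
        return False, check.stderr.strip()
    applied = subprocess.run(apply_cmd, capture_output=True, text=True)
    return applied.returncode == 0, applied.stderr.strip()
```

On the server this would be called as validate_then_apply(["nginx", "-t"], ["systemctl", "restart", "nginx"]) with root privileges. A broken config then never reaches the restart step, so the currently running Nginx keeps serving traffic.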
Missing Content Simulation
I removed the web root directory, after first backing it up, to simulate a failed deployment. The application became unavailable. After restoring the content from the backup and restarting Nginx, service was fully restored.
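The backup and restore steps can be sketched with Python's shutil module. This is an illustration only: the function names are hypothetical, and the actual web root and backup paths used in the drill are not specified here.

```python
import shutil
from pathlib import Path


def backup_dir(source: str, backup: str) -> None:
    """Copy source to backup before any destructive test."""
    # copytree raises FileExistsError if backup already exists,
    # which protects an earlier backup from being overwritten.
    shutil.copytree(source, backup)


def restore_dir(backup: str, target: str) -> None:
    """Replace target with the contents of backup."""
    if not Path(backup).is_dir():
        raise FileNotFoundError(f"no backup found at {backup}")
    if Path(target).exists():
        # Clear whatever partial content is left before restoring.
        shutil.rmtree(target)
    shutil.copytree(backup, target)
```

Checking for the backup before touching the target matters: it prevents a second mistake during recovery from deleting the only remaining copy of the content.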
These simulations reinforced a key DevOps principle: mistakes are inevitable — fast, safe recovery is the real skill.
Security & Reliability Notes
- SSH access was secured using key-based authentication
- Only required ports (22 and 80) were exposed
- Nginx was enabled on boot for reliability
- No secrets or credentials were exposed publicly
- Cloud resources will be stopped or terminated when not in use
Key Takeaways
This production maintenance drill reshaped how I think about DevOps work. I learned that reliability, observability, and recovery matter just as much as deployment. Logs are not optional, validation should never be skipped, and proactive checks prevent reactive firefighting.
Most importantly, I now understand that DevOps is not just about building systems — it’s about keeping them running under real-world conditions.






