It is 2 AM. Your phone buzzes with a health alert from Deploynix: your production server is unreachable. Your stomach drops. Customers are seeing error pages. Revenue is bleeding by the minute. What do you do first?
Every engineering team will face server downtime at some point. The difference between a minor blip and a catastrophic outage often comes down to preparation. Having a clear, rehearsed incident response playbook transforms panic into methodical problem-solving. This guide walks you through exactly what to do when your Deploynix-managed server goes down, from the moment you receive the alert to the post-incident review that prevents it from happening again.
Step 1: Acknowledge the Alert and Assess Severity
When Deploynix detects that your server is unhealthy, it triggers a health alert through your configured notification channels. The first thing you should do is acknowledge the alert and resist the urge to start randomly restarting services.
Take 30 seconds to assess what you know:
- Which server is affected? Check your Deploynix dashboard to see if the issue is isolated to a single server or affecting multiple servers. If you are running a load-balanced setup with multiple app servers behind a load balancer, one server going down may not mean total downtime for your users.
- What type of server is it? An app server going down has different implications than a database server or a cache server. Deploynix lets you provision specialized server types including App, Web, Database, Cache, Worker, and Meilisearch servers, and each has different recovery priorities.
- When did it start? Check the monitoring timeline in your Deploynix dashboard. Did it coincide with a recent deployment? A traffic spike? A scheduled cron job?
Classify the severity immediately. If your primary database server is down, that is a Severity 1 incident affecting all users. If a single worker server in a pool has stopped processing jobs, that might be a Severity 3 that can wait until morning.
Step 2: Use the Web Terminal for Immediate Diagnosis
One of the most powerful features Deploynix provides during an incident is the web terminal. Instead of scrambling to find SSH keys or remembering server IP addresses, you can connect directly to your server through the Deploynix dashboard.
Open the web terminal and start with these diagnostic commands:
# Check system uptime - did the server actually reboot?
uptime
# Check disk usage - full disks are a top cause of outages
df -h
# Check memory usage - OOM kills are extremely common
free -m
# Check running processes and resource consumption
top -bn1 | head -20
# Check if key services are running
systemctl status nginx
systemctl status php8.4-fpm
systemctl status mysql
If the web terminal itself is not connecting, the server may be completely unresponsive. In that case, you will need to use your cloud provider's console access. Deploynix supports DigitalOcean, Vultr, Hetzner, Linode, AWS, and custom providers, each of which offers some form of emergency console access through their own dashboard.
Step 3: Identify the Root Cause
Server outages almost always fall into one of a handful of categories. Here is how to diagnose each one.
Disk Full
A full disk is one of the most common and most preventable causes of server downtime. When the disk fills up, databases cannot write, logs cannot rotate, and applications crash.
# Find the largest directories
du -sh /* 2>/dev/null | sort -rh | head -10
# Check log file sizes
du -sh /var/log/*
# Check for old deployment releases consuming space
du -sh /home/deploynix/*/releases/*
Common culprits include unrotated log files, accumulated deployment releases, database binary logs, and uploaded files that were never cleaned up. Deploynix keeps a configurable number of deployment releases for rollback purposes, but if you have dozens of releases and a small disk, they add up.
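Unrotated logs in particular have a one-line fix. Below is a minimal logrotate rule as a sketch; the log path is illustrative (adjust it to wherever your application actually writes logs), and a real rule would be installed under /etc/logrotate.d/ with sudo:

```shell
# Write a sample logrotate rule (path is illustrative; a production rule
# would live in /etc/logrotate.d/ and typically needs sudo to install).
cat > /tmp/deploynix-app.logrotate <<'EOF'
/home/deploynix/*/storage/logs/*.log {
    daily
    rotate 14
    compress
    missingok
    notifempty
}
EOF
# Dry-run without rotating anything:
#   logrotate -d /tmp/deploynix-app.logrotate
```

This keeps two weeks of compressed history per log and quietly skips missing or empty files, which is usually enough to stop logs from eating a small disk.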
Out of Memory (OOM)
The Linux kernel's OOM killer will terminate processes when memory is exhausted. Check the system logs to confirm:
# Check for OOM kills (dmesg may need sudo on hardened kernels)
sudo dmesg | grep -i "out of memory"
grep -i "oom" /var/log/syslog  # or: journalctl -k | grep -i oom
# Check what is consuming memory right now
ps aux --sort=-%mem | head -10
Laravel applications running with Octane via FrankenPHP, Swoole, or RoadRunner can be particularly susceptible to memory leaks since the application process persists between requests. If you are using Octane, check the worker memory consumption and consider configuring the maximum request count before workers are recycled.
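A quick way to eyeball worker growth from the web terminal is to sort long-lived PHP processes by resident memory. The process-name pattern below is an assumption; adjust it to match your Octane server:

```shell
# Snapshot resident memory (RSS, in KB) of long-lived PHP/Octane workers.
# The name pattern is an assumption -- adjust it to match your setup.
ps axo pid,rss,etime,comm | awk 'tolower($4) ~ /php|frankenphp|swoole|roadrunner/' \
  | sort -k2 -rn | head -5

# Recycle each worker after a fixed number of requests so slow leaks
# cannot accumulate (Octane's --max-requests flag; tune for your workload):
#   php artisan octane:start --server=frankenphp --max-requests=500
```

If the same workers climb steadily in RSS between snapshots, a leak plus a `--max-requests` cap is the likely combination of diagnosis and mitigation.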
Process Crash
Sometimes a critical process simply crashes. Check the status of your core services:
# Check Nginx
systemctl status nginx
journalctl -u nginx --since "1 hour ago"
# Check PHP-FPM
systemctl status php8.4-fpm
journalctl -u php8.4-fpm --since "1 hour ago"
# Check your database
systemctl status mysql # or mariadb, or postgresql
journalctl -u mysql --since "1 hour ago"
# Check queue workers (daemons managed by Deploynix)
supervisorctl status
Deployment Gone Wrong
If the outage started immediately after a deployment, the cause is almost certainly the new code. Deploynix supports zero-downtime deployments, but a deployment can still break things if the new code has a bug, a migration failed, or environment variables are misconfigured.
Check the deployment log in your Deploynix dashboard. If you see a failed deployment, the fastest recovery option is a rollback, which we will cover in the next step.
Step 4: Execute the Recovery
Once you have identified the root cause, execute the appropriate recovery.
For Disk Full
# Clear old log files
sudo truncate -s 0 /var/log/nginx/error.log
sudo truncate -s 0 /var/log/nginx/access.log
# Remove old deployment releases (keep last 5)
cd /home/deploynix/your-site/releases
ls -t | tail -n +6 | xargs rm -rf
# Clear Laravel caches (run from the active release, not the releases dir)
cd /home/deploynix/your-site/current
php artisan cache:clear
php artisan view:clear
For OOM
Restart the affected services and consider upgrading your server:
sudo systemctl restart php8.4-fpm
sudo systemctl restart nginx
If OOM kills are recurring, it is time to either optimize your application's memory usage or upgrade to a larger server through your cloud provider. Deploynix makes it easy to provision a new, larger server and migrate your sites.
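Before upgrading, it is worth a back-of-envelope capacity check: average worker memory times the number of workers should fit well inside the server's RAM. The process name below is an assumption (adjust it for your PHP version):

```shell
# Average PHP-FPM worker RSS in KB. Multiply by pm.max_children and compare
# to total RAM; if the product exceeds it, the OOM killer will return.
# (Process name "php-fpm8.4" is an assumption -- adjust for your version.)
avg_kb=$(ps -o rss= -C php-fpm8.4 2>/dev/null \
  | awk '{s+=$1; n++} END {print (n ? int(s/n) : 0)}')
echo "average php-fpm worker RSS: ${avg_kb} KB"
```

If the math does not fit, either lower `pm.max_children`, reduce per-request memory, or move to a larger server.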
For Process Crashes
Restart the crashed service:
sudo systemctl restart nginx
sudo systemctl restart php8.4-fpm
sudo systemctl restart mysql
For Bad Deployments
Use Deploynix's rollback feature. In your dashboard, navigate to the site's deployment history and click rollback on the last known good deployment. This instantly symlinks the previous release as the active one and restarts the necessary services, bringing your site back online in seconds.
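Under the hood, release-based rollbacks like this usually come down to repointing a symlink. The sketch below uses illustrative paths (not necessarily Deploynix's exact layout) to show why the operation is near-instant:

```shell
# Sketch of a release-based rollback (paths are illustrative): the
# "current" symlink is repointed from the bad release to the previous one.
mkdir -p /tmp/demo-site/releases/20240101 /tmp/demo-site/releases/20240102
ln -sfn /tmp/demo-site/releases/20240102 /tmp/demo-site/current  # bad deploy live
ln -sfn /tmp/demo-site/releases/20240101 /tmp/demo-site/current  # rollback
readlink /tmp/demo-site/current  # -> /tmp/demo-site/releases/20240101
```

Because the web server only ever sees the `current` path, swapping the symlink switches every new request to the old code without copying any files.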
Step 5: Verify Recovery
After executing your recovery steps, verify that everything is actually working:
- Check the Deploynix dashboard. Your server's health status should return to green.
- Hit your application endpoints. Use curl or your browser to verify the site loads.
- Check your logs. Make sure no new errors are appearing.
- Monitor for 15 minutes. Do not declare victory too soon. Watch the server metrics in your Deploynix monitoring dashboard to ensure CPU, memory, and disk usage are stable.
# Quick smoke test from the server
curl -s -o /dev/null -w "%{http_code}" https://your-app.com
# Should return 200
Step 6: Communicate with Stakeholders
While you are deep in the technical weeds, someone needs to be communicating with stakeholders. If you are a solo founder or a small team, this might mean posting a status update on your status page or sending a message in your community channel.
Key information to communicate:
- What happened (in non-technical terms)
- How many users were affected
- How long the outage lasted
- What you are doing to prevent it from happening again
Step 7: Conduct a Post-Incident Review
Once the dust settles, schedule a post-incident review within 48 hours while the details are still fresh. This is not about blame. It is about learning and improving.
Document the following:
- Timeline: When did the incident start? When was it detected? When was it resolved?
- Root cause: What specifically caused the outage?
- Detection: How was the incident detected? Could it have been detected earlier?
- Response: What steps were taken to resolve it? Were there any missteps?
- Prevention: What changes will prevent this from happening again?
Preventive Measures to Implement
Based on your post-incident review, consider implementing these safeguards using Deploynix features:
Set up health alerts. If you were not already using health alerts, configure them now. Deploynix can monitor your servers and alert you when they become unreachable or when resource usage exceeds thresholds.
Configure firewall rules. If the incident was caused by a traffic flood or unauthorized access, review your firewall rules in the Deploynix dashboard. Add rules to block suspicious traffic patterns.
Schedule database backups. If your incident involved data loss, set up automated backups. Deploynix supports backup storage to AWS S3, DigitalOcean Spaces, Wasabi, and any custom S3-compatible storage. Schedule backups to run at least daily.
Add monitoring for disk and memory. Set up alerts before you hit critical thresholds. Getting an alert at 80% disk usage gives you time to act before you hit 100% and everything breaks.
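As a backstop alongside platform alerts, a tiny cron-driven check works anywhere. The threshold and wording below are illustrative; swap the `echo` for whatever notification hook you use:

```shell
#!/usr/bin/env bash
# Minimal disk backstop (threshold is an illustrative choice): warn when
# root filesystem usage crosses THRESHOLD percent.
THRESHOLD=80
usage=$(df --output=pcent / | tail -1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "WARN: / is at ${usage}% (threshold ${THRESHOLD}%)"
else
  echo "OK: / is at ${usage}%"
fi
```

Run it from cron every few minutes; it complements rather than replaces Deploynix's own threshold alerts.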
Review your cron jobs. Sometimes a runaway cron job is the root cause. Check your configured cron jobs in Deploynix and make sure they have appropriate timeouts and are not overlapping.
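A simple way to prevent overlapping runs is to wrap the job in `flock`, which skips a new invocation while the previous one still holds the lock. The crontab line below is illustrative (command and lock path are assumptions); the demonstration shows the behavior:

```shell
# Guarding a cron job with flock so runs never overlap (illustrative entry):
#   */5 * * * * flock -n /tmp/report.lock php artisan reports:generate

# Demonstration: the second invocation bails out while the first holds the lock.
flock -n /tmp/demo.lock -c 'sleep 1' &      # first run holds the lock
sleep 0.2
if flock -n /tmp/demo.lock -c 'true'; then  # second run tries to start
  echo "lock free"
else
  echo "overlap prevented"
fi
wait
```

With `-n`, the skipped run exits immediately instead of queueing, so a slow job cannot snowball into dozens of stacked processes.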
Consider a load balancer. If uptime is critical to your business, a single server is a single point of failure. Deploynix supports load balancers with Round Robin, Least Connections, and IP Hash methods. Putting multiple app servers behind a load balancer means one server going down does not take your entire application offline.
Building Your Incident Response Runbook
Every team should have a written runbook that anyone on the team can follow during an incident. Here is a template to adapt for your organization:
- Acknowledge the alert within 5 minutes
- Assess severity (S1 through S4)
- Diagnose using the web terminal and monitoring dashboard
- Recover using the appropriate playbook for the root cause
- Verify recovery by checking endpoints and monitoring
- Communicate with stakeholders
- Document the incident within 48 hours
- Implement preventive measures within one sprint
Assign clear roles if you have a team. Who is the incident commander? Who communicates with stakeholders? Who writes the post-incident report? Having these roles defined before an incident means less confusion during one.
Conclusion
Server downtime is stressful, but it does not have to be chaotic. With a clear playbook, the right tools, and a commitment to learning from every incident, you can minimize both the duration and impact of outages.
Deploynix gives you the tools to respond effectively: health alerts catch problems early, the web terminal lets you diagnose from anywhere, rollback capabilities let you undo bad deployments in seconds, and monitoring dashboards help you understand what went wrong and why.