It's midnight. You get an incident call on your phone: your application's web server has crashed, and users are seeing the dreaded 500 Internal Server Error. You stumble to your laptop, sleepy-eyed, to run the restart command or your restart script.
This is the "Old Way." It's manual, it's reactive, and it ruins your sleep.
In the world of DevOps, we don't just fix things; we build things that fix themselves (self-healing). In this article, we're going to build a simple automation that detects a web server crash and restarts the service before you even roll over in bed.
The "Healing" Architecture
To build this, we need five simple AWS resources:
- The Sensor (Amazon CloudWatch): The eye that monitors your server.
- The Tripwire (CloudWatch Alarms): The logic that says, "Wait, something is wrong."
- The Router (Amazon EventBridge): The nervous system that delivers the alert to our code.
- The Brain (AWS Lambda): The function that decides what to do.
- The Hands (AWS Systems Manager): The tool that actually reaches out and restarts the service.
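The flow looks like this: Nginx stops → the CloudWatch agent's metric drops → the alarm changes state → EventBridge matches the event → Lambda runs → Systems Manager restarts Nginx.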
Step 1: Prepare your server (The EC2 instance)
First, we need a server to monitor.
- Launch an EC2 instance: Use a t3.micro running Amazon Linux 2023 (it's Free Tier eligible). The commands in this tutorial use dnf, which is the package manager for Amazon Linux 2023. If you choose Amazon Linux 2 instead, swap dnf for yum.
- Configure the security group: Allow port 80 (HTTP) and port 22 (SSH) for this test. Without these, you won't be able to SSH into the instance or load the Nginx welcome page in your browser.
- Attach an IAM role: This is where most beginners get stuck. Your server needs an IAM role that allows AWS Systems Manager (SSM) to talk to it. Attach the AmazonSSMManagedInstanceCore and CloudWatchAgentServerPolicy policies to your instance.
- Start a service: SSH into your server and install Nginx:
sudo dnf update -y
sudo dnf install nginx -y
sudo systemctl enable --now nginx
Open your browser to the instance's public IP. You should see the "Welcome to Nginx" page.
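Before moving on, it can be worth confirming that Systems Manager actually sees the instance, because a missing IAM role otherwise only shows up later as a restart that never happens. Here is a minimal sketch, assuming boto3 is installed and your credentials are configured. The helper name is_ssm_managed is hypothetical, and the client is passed in as an argument so the logic can be tried without touching AWS:

```python
def is_ssm_managed(ssm_client, instance_id):
    """Return True if SSM reports the instance as registered and online."""
    response = ssm_client.describe_instance_information(
        Filters=[{"Key": "InstanceIds", "Values": [instance_id]}]
    )
    # An instance without the SSM role (or without a running SSM Agent)
    # is simply absent from this list.
    for info in response.get("InstanceInformationList", []):
        if info.get("InstanceId") == instance_id:
            return info.get("PingStatus") == "Online"
    return False


# Usage against a real account (placeholder instance ID):
#   import boto3
#   print(is_ssm_managed(boto3.client("ssm"), "i-YOUR_INSTANCE_ID_HERE"))
```

If this returns False a few minutes after attaching the role, the role attachment and the SSM Agent on the instance are the first things to check.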
Step 2: The "Eyes" Installation (CloudWatch Agent)
The default EC2 metrics can't see inside your server. We need the agent to monitor the Nginx process.
- Install the Agent: On your EC2, run:
sudo dnf install amazon-cloudwatch-agent -y
- Configure the "Nginx Watcher": CloudWatch doesn't know what Nginx is until you tell it. Create a configuration file to monitor the process:
sudo nano /opt/aws/amazon-cloudwatch-agent/bin/config.json
Paste this configuration:
{
  "agent": {
    "metrics_collection_interval": 60
  },
  "metrics": {
    "namespace": "CWAgent",
    "metrics_collected": {
      "procstat": [
        {
          "exe": "nginx",
          "measurement": ["pid_count"]
        }
      ]
    }
  }
}
Save the file, then start the agent and tell it to load the file:
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl -a fetch-config -m ec2 -c file:/opt/aws/amazon-cloudwatch-agent/bin/config.json -s
The agent is now running. Verify with:
sudo systemctl status amazon-cloudwatch-agent
After a minute or two, you should see a procstat_lookup_pid_count metric appear in CloudWatch under the CWAgent namespace.
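If you'd rather check from a script than from the console, the sketch below lists what the agent is publishing. It assumes boto3 and configured credentials, and it assumes the metric is published as procstat_lookup_pid_count. The helper name find_pid_count_metrics is hypothetical. Newly created metrics can take several minutes to show up in listings, so an empty result right after starting the agent is not necessarily a failure:

```python
def find_pid_count_metrics(cloudwatch_client, namespace="CWAgent",
                           metric_name="procstat_lookup_pid_count"):
    """Return the dimension sets under which the agent publishes the metric."""
    response = cloudwatch_client.list_metrics(
        Namespace=namespace, MetricName=metric_name
    )
    # Each entry carries the dimensions (for example the process name and
    # host) that an alarm has to reference exactly.
    return [
        {d["Name"]: d["Value"] for d in metric.get("Dimensions", [])}
        for metric in response.get("Metrics", [])
    ]


# Usage against a real account:
#   import boto3
#   print(find_pid_count_metrics(boto3.client("cloudwatch")))
```

The dimension names and values it prints are the ones to reuse when you define the alarm.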
Step 3: Write the "Healing Script" (The Lambda)
We need a tiny function that tells AWS: "Hey, go to Instance X and restart the web server." Create a new Lambda function (Python 3.12+) and paste this code:
import os
import boto3

def lambda_handler(event, context):
    ssm = boto3.client('ssm')
    # Read the instance ID from the INSTANCE_ID environment variable.
    # The placeholder fallback is only for this lab.
    instance_id = os.environ.get('INSTANCE_ID', 'i-YOUR_INSTANCE_ID_HERE')
    # EventBridge puts the alarm details under the "detail" key.
    alarm_name = event.get('detail', {}).get('alarmName', 'unknown')
    print(f"Alarm '{alarm_name}' triggered. Restarting Nginx on {instance_id}...")
    ssm.send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={'commands': ['sudo systemctl restart nginx']}
    )
    return {"status": "Restart command sent"}
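The function above is fire-and-forget: it sends the command and never learns whether the restart worked. While you're experimenting, a small polling helper can tell you. This is a sketch, assuming you run it from your own machine with credentials that allow ssm:GetCommandInvocation (the Lambda's policy in this article does not include it). The name wait_for_command is hypothetical. Right after send_command the invocation may not be registered yet and the call can raise an error, which this sketch does not handle:

```python
import time


def wait_for_command(ssm_client, command_id, instance_id,
                     attempts=10, delay_seconds=2.0, sleep=time.sleep):
    """Poll SSM until the command reaches a final status, then return it."""
    final_statuses = {"Success", "Failed", "Cancelled", "TimedOut"}
    status = "Pending"
    for _ in range(attempts):
        result = ssm_client.get_command_invocation(
            CommandId=command_id, InstanceId=instance_id
        )
        status = result["Status"]
        if status in final_statuses:
            return status
        sleep(delay_seconds)
    return status  # still in progress after the last attempt


# Usage (the command ID comes from the send_command response):
#   response = ssm.send_command(...)
#   command_id = response["Command"]["CommandId"]
#   print(wait_for_command(ssm, command_id, "i-YOUR_INSTANCE_ID_HERE"))
```

Keeping this out of the Lambda itself is a deliberate choice here: the function stays short and cheap, and the alarm returning to OK already tells you the restart succeeded.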
IAM permissions for the Lambda. In the Lambda console, open your function's Configuration → Permissions tab and click the execution role name to open it in IAM.
Confirm AWSLambdaBasicExecutionRole is attached (it usually is by default). Then click Add permissions → Create inline policy, switch to the JSON tab, and paste:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": "ssm:SendCommand",
    "Resource": [
      "arn:aws:ec2:REGION:ACCOUNT_ID:instance/i-YOUR_INSTANCE_ID",
      "arn:aws:ssm:REGION::document/AWS-RunShellScript"
    ]
  }]
}
Replace REGION, ACCOUNT_ID, and the instance ID with your values.
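Hand-editing ARNs is an easy place to make a typo, so you may prefer to generate the policy. The sketch below builds the same inline policy shown above from three values. The function name build_send_command_policy and the example region, account ID, and instance ID are placeholders for illustration, not real identifiers:

```python
import json


def build_send_command_policy(region, account_id, instance_id):
    """Build the inline policy shown above for one instance and one document."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "ssm:SendCommand",
            "Resource": [
                f"arn:aws:ec2:{region}:{account_id}:instance/{instance_id}",
                # AWS-owned documents have no account ID in their ARN.
                f"arn:aws:ssm:{region}::document/AWS-RunShellScript",
            ],
        }],
    }


# Placeholder values; substitute your own before pasting the output into IAM.
policy = build_send_command_policy("us-east-1", "123456789012", "i-0123456789abcdef0")
print(json.dumps(policy, indent=2))
```

Scoping the policy to a single instance and a single document is what makes this a least-privilege setup: the Lambda cannot run commands anywhere else.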
Step 4: Set the Tripwire (The Alarm)
Now we tell AWS when to trigger that code.
Go to CloudWatch > Alarms and click "Create Alarm."
- Select metric: In the CWAgent namespace, pick the procstat_lookup_pid_count metric for your Nginx process.
- Condition: Lower than 1 (meaning Nginx is not running).
- Missing data treatment: Treat missing data as breaching. This is important — when Nginx dies, the procstat plugin may stop reporting the metric entirely rather than reporting zero. Without this setting, the alarm may never fire.
- Notification: You can skip the SNS step entirely. With EventBridge, you don't configure any alarm action. CloudWatch automatically publishes every alarm state change to the default event bus.
Give the alarm a memorable name like nginx-down-alarm. You'll reference it in the next step.
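The same alarm can be expressed in code, which makes the settings easier to review. This is a sketch under a few assumptions: the metric is published as procstat_lookup_pid_count, the Minimum statistic and a single 60-second evaluation period are my choices rather than anything the console steps require, and the dimensions dictionary must be copied from what CloudWatch actually shows for your metric (the values in the usage comment are hypothetical). The function name build_alarm_settings is also hypothetical:

```python
def build_alarm_settings(alarm_name, dimensions, period_seconds=60):
    """Return keyword arguments for CloudWatch's put_metric_alarm call."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "CWAgent",
        "MetricName": "procstat_lookup_pid_count",
        "Dimensions": [{"Name": k, "Value": v} for k, v in dimensions.items()],
        "Statistic": "Minimum",            # assumption: any zero in the period counts
        "Period": period_seconds,
        "EvaluationPeriods": 1,
        "Threshold": 1.0,
        "ComparisonOperator": "LessThanThreshold",  # "Lower than 1"
        "TreatMissingData": "breaching",   # no data is treated as "Nginx is down"
    }


# Usage against a real account (dimension values are hypothetical):
#   import boto3
#   settings = build_alarm_settings(
#       "nginx-down-alarm", {"exe": "nginx", "host": "YOUR_HOSTNAME"}
#   )
#   boto3.client("cloudwatch").put_metric_alarm(**settings)
```

Note that no AlarmActions are set, which matches the console steps: the alarm only changes state, and EventBridge picks that up.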
Step 5: Route the Alarm to Lambda with EventBridge
Here's where EventBridge shines. CloudWatch Alarms publish state-change events to the default event bus automatically. We just need a rule that catches the ones we care about and sends them to our Lambda.
Go to EventBridge > Rules and click "Create rule."
- Name: nginx-down-rule
- Event bus: default
- Rule type: Rule with an event pattern
- Event source: AWS services
- Event pattern: Use the custom pattern editor and paste this:
{
  "source": ["aws.cloudwatch"],
  "detail-type": ["CloudWatch Alarm State Change"],
  "detail": {
    "alarmName": ["nginx-down-alarm"],
    "state": {
      "value": ["ALARM"]
    }
  }
}
This pattern says: "Match only when this specific alarm enters the ALARM state." No filtering logic is needed inside the Lambda; EventBridge handles it.
- Target: Select "AWS service" → "Lambda function" → pick your Lambda.
When you create the rule in the console, EventBridge automatically adds the invoke permission to your function. No subscriptions, no topic management.
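To get a feel for what the event pattern does, here is a deliberately simplified matcher you can run locally. It is not EventBridge's implementation: it only handles nested objects and "any of these values" lists, and ignores the richer filter operators the real service supports. The sample events are trimmed-down illustrations containing only the fields this article relies on; real events carry more:

```python
def matches(pattern, event):
    """Simplified matcher: nested dicts recurse, lists mean 'any of these values'."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
        "alarmName": ["nginx-down-alarm"],
        "state": {"value": ["ALARM"]},
    },
}

# Illustrative events: the alarm firing, and the same alarm recovering.
alarm_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "nginx-down-alarm", "state": {"value": "ALARM"}},
}
ok_event = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "nginx-down-alarm", "state": {"value": "OK"}},
}

print(matches(PATTERN, alarm_event))  # True
print(matches(PATTERN, ok_event))     # False
```

The second result is the important one: when Nginx comes back and the alarm returns to OK, the rule does not fire again, so the Lambda never restarts a healthy service.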
Step 6: Testing the Self-Heal
We're going to kill Nginx and watch it come back.
- Confirm your server's IP shows the Nginx welcome page in a browser.
- SSH into your server and stop the service:
sudo systemctl stop nginx
- Refresh the browser; the site is down.
- Wait. Within 1–2 minutes, the metric drops, the alarm trips, EventBridge fires, Lambda runs, and SSM restarts Nginx.
- Refresh again; the site is back up, and you didn't lift a finger.
If something doesn't fire, check the alarm history in CloudWatch first, then your Lambda's CloudWatch Logs. EventBridge also has a "Monitoring" tab on each rule showing invocation counts and failures, which is handy for debugging.
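Instead of refreshing the browser by hand, you can time the recovery. The sketch below uses only the Python standard library and is meant to run on your own machine right after you stop Nginx. The function names site_is_up and measure_recovery, the polling interval, and the five-minute cap are all assumptions for illustration:

```python
import time
import urllib.error
import urllib.request


def site_is_up(url, timeout=3):
    """Return True if the URL answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP error codes all count as "down".
        return False


def measure_recovery(probe, interval=5.0, max_wait=300.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll probe() until it returns True; return elapsed seconds, or None on timeout."""
    start = clock()
    while clock() - start < max_wait:
        if probe():
            return clock() - start
        sleep(interval)
    return None


# Usage (placeholder address), started right after "sudo systemctl stop nginx":
#   seconds = measure_recovery(lambda: site_is_up("http://YOUR_INSTANCE_PUBLIC_IP/"))
#   print(f"Recovered after {seconds} seconds" if seconds is not None else "Timed out")
```

The number it prints is a rough, real-world measurement of your recovery time for this setup.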
Why This Matters for Your Career
If you can walk an interviewer through this project, you're demonstrating skills that hiring managers genuinely look for in entry-level DevOps roles. You aren't just "using AWS"; you are demonstrating:
- Event-Driven Architecture: Triggering actions based on state changes.
- Least Privilege: Giving the Lambda only the permissions it needs.
- MTTR (Mean Time To Recovery): You just reduced your recovery time from "whenever I wake up" to "under 120 seconds."
A Note on Cost
Most of what we used in this article fits inside the AWS Free Tier: the EC2 t3.micro, Lambda invocations, and EventBridge events from AWS services (like CloudWatch alarm state changes) are all free. One caveat: CloudWatch custom metrics (which is what procstat produces) are only free for the first 10 metrics, so a single procstat metric is fine, but the cost can scale up if you expand this pattern broadly.
When you're done experimenting, terminate the EC2 instance, delete the CloudWatch alarm, and remove the EventBridge rule to avoid any surprise bills.





