How We Built a Self-Learning, Self-Healing AWS Operations Engine from Real Production Failures
Many cloud environments today are thoroughly monitored.
Alarms trigger. Dashboards display warnings. Notifications pop up in Slack.
Yet, when an issue arises at 3:17 AM, engineers still find themselves asking:
“Has this happened before… and how did we resolve it previously?”
The real source of operational distress lies not in insufficient monitoring, but in this gap.
In this blog, I aim to share how we designed and implemented an Incident Memory System on AWS — not as a theoretical concept, but as a fully operational system, created by intentionally disrupting production-like infrastructure and compelling it to restore itself.
This is not a tutorial.
This is not a PoC based on ideal scenarios.
This is a real exercise.
Architecture: Designing an Incident Memory System That Can Actually Learn
Before automation or self-healing can exist, responsibilities must be clearly defined.
We did not start by choosing services.
We started by defining roles.
At a high level, the system needed to do five things reliably:
- Detect a real failure
- Decide whether the failure is significant
- Record the incident in a way the system can remember
- Apply a known recovery action
- Track whether the recovery actually worked
Only after these responsibilities were clear did we map them to AWS services.
Logical Architecture Breakdown
The Incident Memory System was designed as a pipeline, not a monolith.
Each stage has a single responsibility.
Detection Layer
CloudWatch alarms are responsible only for answering one question:
Is something broken right now?
Event Routing Layer
EventBridge is responsible for routing only meaningful state changes.
It does not execute logic. It only forwards signals.
Memory Layer
DynamoDB acts as the system’s long-term memory.
Every incident is stored with its symptom, timestamp, and resolution state.
Execution Layer
Lambda functions execute predefined actions.
They do not diagnose. They do not guess.
Control Plane Execution
AWS Systems Manager performs the actual recovery on the instance.
This separation was intentional.
If any layer fails, the system degrades safely instead of silently pretending to heal.
High-level responsibility flow of the Incident Memory System
This diagram shows how detection, memory, and recovery are kept as separate, loosely coupled stages.
Launching the Compute Layer
The foundation of this system starts with a single Linux EC2 instance.
Nothing special was chosen here.
No autoscaling.
No containers.
No managed service.
The goal was to begin with the most common real-world backend setup: a VM running a web service.
After launching the instance, I connected to it using EC2 Instance Connect and verified basic OS access.
At this point, the instance was empty.
No application.
No web server.
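For readers who prefer code over console clicks, here is a minimal boto3 sketch of launching a comparable instance. The AMI, subnet, and security group IDs are placeholders, not the values used in this exercise.

```python
# Minimal boto3 sketch of launching a comparable Linux instance.
# All IDs below are placeholders, not the values used in this exercise.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",          # placeholder Amazon Linux AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    SubnetId="subnet-0123456789abcdef0",      # placeholder public subnet
    SecurityGroupIds=["sg-0123456789abcdef0"],
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "incident-memory-backend"}],
    }],
)

instance_id = response["Instances"][0]["InstanceId"]
print(f"Launched {instance_id}")
```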
Installing and Verifying the Web Service
I chose nginx for one reason only: predictability.
nginx is simple, stable, and widely used in production environments.
If something breaks here, it reflects a real operational failure, not a framework issue.
The installation was done directly from the package manager.
Once installed, I verified:
- nginx binaries were present
- the service could start
- the default page rendered correctly
This confirmed the backend was operational before introducing any AWS-level complexity.
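Those checks were run by hand on the instance. A rough Python sketch of the same verification, assuming systemd manages nginx and it listens locally on port 80, could look like this:

```python
# Rough sketch of the verification steps, assuming systemd and nginx on port 80.
import shutil
import subprocess
import urllib.request

# 1. nginx binary is present
assert shutil.which("nginx") is not None, "nginx binary not found"

# 2. the service can start and is active
subprocess.run(["sudo", "systemctl", "start", "nginx"], check=True)
subprocess.run(["systemctl", "is-active", "--quiet", "nginx"], check=True)

# 3. the default page renders correctly
with urllib.request.urlopen("http://localhost", timeout=5) as resp:
    assert resp.status == 200, f"unexpected status {resp.status}"

print("nginx is installed, running, and serving the default page")
```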
Placing an Application Load Balancer in Front
With a working backend, the next step was to expose it properly.
An Application Load Balancer was created with:
- An HTTP listener on port 80
- A target group pointing to the EC2 instance
- Health checks configured on the root path
This step is critical.
Most production failures do not happen on the instance itself.
They are observed at the load balancer layer.
So the system had to fail where users actually feel it.
Once the target group showed the instance as healthy, traffic through the ALB was tested and verified.
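For reference, the same setup expressed as a boto3 sketch; every ID below (VPC, subnets, security group, instance) is a placeholder rather than a value from this exercise.

```python
# Sketch of the ALB setup: target group with health checks on "/",
# an HTTP listener on port 80, and the EC2 instance as the only target.
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

tg = elbv2.create_target_group(
    Name="nginx-targets",
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",           # placeholder VPC ID
    HealthCheckProtocol="HTTP",
    HealthCheckPath="/",                      # health checks on the root path
)["TargetGroups"][0]

elbv2.register_targets(
    TargetGroupArn=tg["TargetGroupArn"],
    Targets=[{"Id": "i-0123456789abcdef0", "Port": 80}],   # the backend instance
)

alb = elbv2.create_load_balancer(
    Name="incident-memory-alb",
    Subnets=["subnet-aaaa1111", "subnet-bbbb2222"],         # two public subnets
    SecurityGroups=["sg-0123456789abcdef0"],
    Scheme="internet-facing",
    Type="application",
)["LoadBalancers"][0]

elbv2.create_listener(
    LoadBalancerArn=alb["LoadBalancerArn"],
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": tg["TargetGroupArn"]}],
)
```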
Opening the Network Path Correctly
Before generating traffic or failures, I validated the network layer.
Security group rules were adjusted to allow:
- HTTP traffic on port 80 from the internet
- ALB to instance communication
This might look trivial, but misconfigured security groups are one of the most common root causes of silent failures.
Only after confirming the network path was correct did I move forward.
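A boto3 sketch of those two ingress rules follows, with placeholder group IDs. In a typical layout the internet-facing rule sits on the ALB's security group while the instance group only trusts the ALB; the sketch keeps both rules in one place purely for illustration.

```python
# Sketch of the ingress rules described above; group IDs are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",   # security group being adjusted (placeholder)
    IpPermissions=[
        {   # HTTP on port 80 from the internet
            "IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
            "IpRanges": [{"CidrIp": "0.0.0.0/0", "Description": "HTTP from anywhere"}],
        },
        {   # ALB-to-instance traffic, referenced by the ALB's security group
            "IpProtocol": "tcp", "FromPort": 80, "ToPort": 80,
            "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210",
                                  "Description": "from the ALB security group"}],
        },
    ],
)
```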
Generating Controlled Traffic
Before breaking anything, I wanted to understand baseline behavior.
Traffic was generated manually from the instance using curl loops and ApacheBench.
This served two purposes:
- Confirm the ALB was routing traffic correctly
- Establish a normal performance baseline
At this point:
- Requests succeeded
- No errors were generated
- The system was stable
This baseline is important because later failures can be compared against it.
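The actual traffic came from curl loops and ApacheBench. As a stand-in, a minimal Python loop against the ALB captures the same baseline idea; the DNS name below is a placeholder.

```python
# Equivalent of a simple curl loop: hit the ALB repeatedly and record latencies.
# The ALB DNS name is a placeholder.
import time
import urllib.request

ALB_URL = "http://incident-memory-alb-1234567890.us-east-1.elb.amazonaws.com/"

latencies = []
for _ in range(100):
    start = time.monotonic()
    with urllib.request.urlopen(ALB_URL, timeout=5) as resp:
        assert resp.status == 200
    latencies.append(time.monotonic() - start)
    time.sleep(0.1)

print(f"avg latency: {sum(latencies) / len(latencies) * 1000:.1f} ms over {len(latencies)} requests")
```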
Introducing Failure Intentionally
With everything working, I introduced failure deliberately.
A small shell script was created that repeatedly stopped and restarted nginx.
This was not random chaos.
This was controlled instability.
The script simulated a backend service that flaps under load or crashes intermittently.
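The original was a small shell script; an equivalent sketch in Python, assuming systemd manages nginx, looks like this:

```python
# Controlled instability: repeatedly stop nginx, wait, then bring it back.
# Equivalent to the small shell script described above; assumes systemd.
import subprocess
import time

for cycle in range(10):
    subprocess.run(["sudo", "systemctl", "stop", "nginx"], check=True)
    time.sleep(45)   # long enough for ALB health checks to fail and 502s to surface
    subprocess.run(["sudo", "systemctl", "start", "nginx"], check=True)
    time.sleep(30)   # brief recovery window before the next failure
```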
As expected:
- ALB health checks began failing
- Users started receiving 502 Bad Gateway responses
This was the exact failure pattern I wanted to detect and respond to.
Detecting the Failure Using CloudWatch
Only after the failure existed did monitoring come into play.
A CloudWatch alarm was created on:
- Namespace: AWS/ApplicationELB
- Metric: HTTPCode_ELB_5XX_Count
- Threshold: greater than or equal to 1 within 1 minute
This alarm did one thing only.
It confirmed that the system could see the failure.
No automation yet.
No recovery logic.
Just detection.
As the backend instability continued, the alarm transitioned into the ALARM state.
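A boto3 sketch of an alarm with exactly those settings; the LoadBalancer dimension value is a placeholder.

```python
# Alarm on ALB 5XX responses: >= 1 error within a 1-minute period.
# The LoadBalancer dimension value is a placeholder.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="alb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_ELB_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer",
                 "Value": "app/incident-memory-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    TreatMissingData="notBreaching",
)
```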
Introducing the Incident Collector Lambda
Once the alarm could emit an event, something had to consume it.
This responsibility belonged to a Lambda function I called the incident collector.
The purpose of this function was deliberately limited.
It did not fix anything.
It did not analyze logs.
It did not make decisions.
Its only responsibility was to record the incident.
When the EventBridge rule fired, the Lambda extracted:
- Alarm name
- Timestamp
- Symptom type
- Initial incident state
This was the first time the system moved from detection into memory.
An incident was no longer just an alert.
It became a structured record.
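A stripped-down sketch of such a collector, using the table and attribute names described later in this post. The event shape is the standard CloudWatch alarm state-change event delivered by EventBridge; the symptom value is illustrative.

```python
# Incident collector sketch: turn an alarm state-change event into an OPEN incident record.
# Table and attribute names follow this post's description; treat values as illustrative.
import uuid

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("incident_memory")


def lambda_handler(event, context):
    detail = event.get("detail", {})

    incident = {
        "IncidentID": str(uuid.uuid4()),
        "AlarmName": detail.get("alarmName", "unknown"),
        "Timestamp": detail.get("state", {}).get("timestamp", event.get("time", "")),
        "Symptom": "ALB_5XX",          # symptom type for this alarm (illustrative)
        "Status": "OPEN",              # unresolved until a fix is confirmed
    }

    table.put_item(Item=incident)      # the moment an alert becomes memory
    return {"recorded": incident["IncidentID"]}
```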
Converting an Alarm Into an Actionable Event
At this stage, the system could detect failure, but detection alone is passive.
An alarm changing state does not fix anything.
It only tells humans that something went wrong.
To move beyond that, the alarm had to become an event that other services could react to.
This is where Amazon EventBridge was introduced.
I created an EventBridge rule that listens specifically for CloudWatch alarm state change events.
The rule was scoped carefully to avoid noise:
- Source set to CloudWatch
- Event type restricted to alarm state changes
- Filtered only when the alarm enters ALARM state
This ensured that only real failures triggered downstream logic.
No polling.
No scripts.
No manual checks.
The system was now event-driven.
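A boto3 sketch of that rule and its Lambda target; the rule name and function ARN are placeholders.

```python
# EventBridge rule scoped to CloudWatch alarm state changes that enter ALARM,
# targeting the incident collector Lambda. Names and ARNs are placeholders.
import json

import boto3

events = boto3.client("events", region_name="us-east-1")

events.put_rule(
    Name="alarm-state-to-incident-collector",
    EventPattern=json.dumps({
        "source": ["aws.cloudwatch"],
        "detail-type": ["CloudWatch Alarm State Change"],
        "detail": {"state": {"value": ["ALARM"]}},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="alarm-state-to-incident-collector",
    Targets=[{
        "Id": "incident-collector",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:incident-collector",
    }],
)
```

EventBridge also needs permission to invoke the function, which is granted with a separate Lambda add-permission call (not shown here).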
Creating Persistent Incident Memory With DynamoDB
Alerts are transient.
Logs rotate.
Dashboards reset.
Memory needs persistence.
To store incident history, I created a DynamoDB table named incident_memory.
The schema was intentionally simple:
- IncidentID as the partition key
- AlarmName
- Status
- Symptom
- Timestamp
When the incident collector Lambda executed, it wrote a new item with status set to OPEN.
This mattered more than it might seem.
OPEN meant unresolved.
OPEN meant the system knew work remained.
OPEN meant recovery could be tracked.
For the first time, the system had state.
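Creating that table is a one-off step. A boto3 sketch follows, with on-demand billing as an assumption the post does not state; only the partition key is part of the key schema, and the remaining attributes are written per item.

```python
# One-off creation of the incident_memory table.
# PAY_PER_REQUEST billing is an assumption; AlarmName, Status, Symptom, and
# Timestamp are regular item attributes and need no schema definition.
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

dynamodb.create_table(
    TableName="incident_memory",
    AttributeDefinitions=[{"AttributeName": "IncidentID", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "IncidentID", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```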
Resolving the Incident Manually, Once
At this point, I intentionally stopped automating.
The backend issue was resolved manually by restarting nginx.
This step was critical.
The goal was never to remove humans from the loop.
The goal was to learn from the first fix.
Once nginx was restarted:
- Health checks recovered
- ALB stopped serving 5XX responses
- The CloudWatch alarm returned to OK state
This state transition was just as important as the failure itself.
Teaching the System How to Resolve the Incident
With a successful manual resolution completed, I introduced the auto resolver Lambda.
This function was not reactive in the traditional sense.
It did not run blindly on every alarm.
It did not attempt diagnosis.
Instead, it followed a simple rule:
If an incident exists in OPEN state and its resolution pattern is known, apply the same fix.
For this incident type, the fix was clear:
Restart nginx on the affected instance
This action was executed using AWS Systems Manager Run Command.
No SSH keys.
No open ports.
No human login.
Once the command succeeded, the incident status was updated to AUTO RESOLVED in DynamoDB.
The system now had proof of recovery.
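A condensed sketch of what such a resolver can look like. The instance ID, the way the incident ID reaches the function, and the AUTO_RESOLVED status string are illustrative assumptions, not the exact implementation.

```python
# Auto resolver sketch: apply the known fix via SSM Run Command, then mark the
# incident AUTO_RESOLVED in DynamoDB. Instance ID and status strings are illustrative.
import boto3

ssm = boto3.client("ssm")
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("incident_memory")

INSTANCE_ID = "i-0123456789abcdef0"   # affected backend instance (placeholder)


def lambda_handler(event, context):
    incident_id = event["incident_id"]          # assumed to be passed in by the trigger

    # Known fix for this symptom: restart nginx. No SSH, no open ports, no login.
    command = ssm.send_command(
        InstanceIds=[INSTANCE_ID],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": ["sudo systemctl restart nginx"]},
    )
    command_id = command["Command"]["CommandId"]

    # Wait for the command to finish before claiming success.
    waiter = ssm.get_waiter("command_executed")
    waiter.wait(CommandId=command_id, InstanceId=INSTANCE_ID)

    # Record proof of recovery in the incident memory.
    table.update_item(
        Key={"IncidentID": incident_id},
        UpdateExpression="SET #s = :resolved",
        ExpressionAttributeNames={"#s": "Status"},      # Status is a reserved word
        ExpressionAttributeValues={":resolved": "AUTO_RESOLVED"},
    )
    return {"incident": incident_id, "command": command_id}
```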
This system did not prevent failures.
Services still crashed.
Traffic still failed.
Alarms still fired.
What changed was what happened after the first failure.
Instead of starting from zero every time, the system began to reuse what it had already learned.
The fix that worked once became a repeatable action, not tribal knowledge.
That shift is subtle, but powerful.
Most operational automation struggles because it tries to be intelligent too early.
It guesses causes.
It assumes fixes.
It reacts without context.
This approach was different.
The system waited for a real incident.
It observed how it was resolved.
Then it remembered that resolution.
Nothing more. Nothing less.
Over time, this kind of design can reduce on-call fatigue, shorten recovery times, and preserve operational knowledge that would otherwise disappear when people change teams.
Not because the system is smart, but because it is grounded in real outcomes.
In cloud operations, reliability often comes not from predicting the future, but from respecting the past.