<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhijeet Yadav</title>
    <description>The latest articles on DEV Community by Abhijeet Yadav (@shree_abhijeet).</description>
    <link>https://dev.to/shree_abhijeet</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3023460%2F398a8ec3-84c7-44c4-9025-99bf2c19e698.jpg</url>
      <title>DEV Community: Abhijeet Yadav</title>
      <link>https://dev.to/shree_abhijeet</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/shree_abhijeet"/>
    <language>en</language>
    <item>
      <title>Infrasheet Automation</title>
      <dc:creator>Abhijeet Yadav</dc:creator>
      <pubDate>Thu, 29 Jan 2026 11:08:22 +0000</pubDate>
      <link>https://dev.to/shree_abhijeet/infrasheet-automation-4edc</link>
      <guid>https://dev.to/shree_abhijeet/infrasheet-automation-4edc</guid>
      <description>&lt;h2&gt;
  
  
  I Stopped Copy-Pasting AWS Console Data and Let Lambda Do It for Me
&lt;/h2&gt;

&lt;p&gt;It did not start as an automation project.&lt;/p&gt;

&lt;p&gt;It started with one simple request.&lt;br&gt;
“Can you share the VPC and subnet details?”&lt;/p&gt;

&lt;p&gt;I opened the AWS console. Clicked into VPCs. Checked subnets. Copied a few values. Dropped them into a spreadsheet. Done.&lt;/p&gt;

&lt;p&gt;No complaints. No resistance.&lt;/p&gt;

&lt;p&gt;Then the same request came again.&lt;br&gt;
Then it came with a deadline.&lt;br&gt;
Then it came when I was already busy with something else.&lt;/p&gt;

&lt;p&gt;That is usually how these things begin.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Moment Manual Work Starts Feeling Heavy
&lt;/h2&gt;

&lt;p&gt;At first, the AWS console feels friendly. Everything is visual. Filters work. Lists make sense.&lt;/p&gt;

&lt;p&gt;But as the environment grows, the console starts telling a different story.&lt;/p&gt;

&lt;p&gt;More VPCs.&lt;br&gt;
More subnets.&lt;br&gt;
More availability zones.&lt;br&gt;
More scrolling.&lt;/p&gt;

&lt;p&gt;You stop trusting your eyes.&lt;br&gt;
You double check values.&lt;br&gt;
You still miss things.&lt;/p&gt;

&lt;p&gt;That is when a simple task turns into fragile work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh296vf5jyjn71w4qbj5l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh296vf5jyjn71w4qbj5l.png" alt=" " width="800" height="211"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, everything still looks manageable.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Repetition Is the Real Problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The problem was not difficulty.&lt;br&gt;
The problem was repetition.&lt;/p&gt;

&lt;p&gt;Every report followed the same pattern.&lt;br&gt;
Open console.&lt;br&gt;
Navigate.&lt;br&gt;
Filter.&lt;br&gt;
Copy.&lt;br&gt;
Paste.&lt;br&gt;
Format.&lt;/p&gt;

&lt;p&gt;The output depended entirely on how careful someone was that day.&lt;/p&gt;

&lt;p&gt;That is not a system. That is luck.&lt;/p&gt;

&lt;p&gt;And engineers do not like relying on luck.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Choosing a Small Entry Point&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I did not try to automate everything.&lt;/p&gt;

&lt;p&gt;No EC2.&lt;br&gt;
No IAM.&lt;br&gt;
No billing.&lt;/p&gt;

&lt;p&gt;Just VPCs and Subnets.&lt;/p&gt;

&lt;p&gt;They are foundational. If networking data is wrong, everything above it becomes questionable. It was the safest place to start.&lt;/p&gt;

&lt;p&gt;The goal became clear.&lt;/p&gt;

&lt;p&gt;Stop asking humans.&lt;br&gt;
Ask AWS directly.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Before Code, the Flow Was Already Clear&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Even before writing the function, the flow made sense in my head.&lt;/p&gt;

&lt;p&gt;Lambda would run the logic.&lt;br&gt;
Python would talk to AWS.&lt;br&gt;
The data would be structured properly.&lt;br&gt;
Excel would be the output because everyone understands Excel.&lt;br&gt;
S3 would store the result so it is always available.&lt;/p&gt;

&lt;p&gt;Simple. Linear. Predictable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnuyvehjhau0ay6g2jnct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnuyvehjhau0ay6g2jnct.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Permissions Decide Everything&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before the function could do anything useful, IAM had to be right.&lt;/p&gt;

&lt;p&gt;Lambda needed to describe VPCs.&lt;br&gt;
It needed to describe subnets.&lt;br&gt;
It needed to upload a file to S3.&lt;/p&gt;

&lt;p&gt;Nothing more.&lt;/p&gt;

&lt;p&gt;Keeping permissions minimal saved time later and avoided unnecessary debugging.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2qgmtzo2f0v9a33rqow.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2qgmtzo2f0v9a33rqow.png" alt=" " width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This shows the exact permissions that make the automation work.&lt;/p&gt;
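&lt;p&gt;A least-privilege policy of this shape can be sketched as follows. This is an illustrative rendering (written in Python so it can be generated or validated in code), not the project's exact policy; the bucket name is a placeholder.&lt;/p&gt;

```python
import json

# Illustrative least-privilege policy: describe networking, upload reports.
# "infra-report-bucket" is a placeholder; substitute your own bucket ARN.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["ec2:DescribeVpcs", "ec2:DescribeSubnets"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": "arn:aws:s3:::infra-report-bucket/*",
        },
    ],
}

policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

&lt;p&gt;Everything else stays denied by default, which is exactly what keeps debugging short later.&lt;/p&gt;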

&lt;p&gt;&lt;strong&gt;The First Real Friction Point&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Then came Pandas.&lt;/p&gt;

&lt;p&gt;Lambda does not ship with Pandas or OpenPyXL. That meant building a custom Lambda Layer.&lt;/p&gt;

&lt;p&gt;I built it locally with only what I needed. No extra libraries. No bloated packages. Just Pandas and OpenPyXL, aligned with Python 3.11.&lt;/p&gt;

&lt;p&gt;This step mattered more than expected.&lt;/p&gt;

&lt;p&gt;A clean layer meant fewer surprises later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rquoizf5ylbjya6cm5v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rquoizf5ylbjya6cm5v.png" alt=" " width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When Defaults Quietly Fail You&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The first execution failed.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Not because of bad logic.&lt;br&gt;
Because of default limits.&lt;br&gt;
Pandas needs memory.&lt;/p&gt;

&lt;p&gt;Excel generation needs time.&lt;/p&gt;

&lt;p&gt;Once memory was increased to 512 MB and timeout was adjusted, the function behaved perfectly.&lt;/p&gt;

&lt;p&gt;Serverless does not mean resource-less.&lt;/p&gt;

&lt;p&gt;&lt;u&gt;Lambda Function Setup&lt;/u&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runtime: Python 3.11&lt;/li&gt;
&lt;li&gt;Memory: 512 MB or higher&lt;/li&gt;
&lt;li&gt;Timeout: 15 minutes 0 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpb4bwe5kjrzreoqsm3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpb4bwe5kjrzreoqsm3w.png" alt=" " width="800" height="207"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Boto3 fetched VPCs.&lt;br&gt;
Boto3 fetched subnets.&lt;br&gt;
Pandas shaped the data into rows and columns.&lt;br&gt;
OpenPyXL created two sheets in a single Excel file.&lt;/p&gt;

&lt;p&gt;Everything stayed in memory. No temp files. No disk usage.&lt;/p&gt;

&lt;p&gt;The function did one thing and did it cleanly.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3
import pandas as pd
from io import BytesIO
import time
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx3t28xcixg7nsmtv2vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcx3t28xcixg7nsmtv2vh.png" alt=" " width="432" height="131"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; boto3 connects to AWS APIs. &lt;/li&gt;
&lt;li&gt; pandas structures tabular data. &lt;/li&gt;
&lt;li&gt; BytesIO handles Excel creation in-memory. &lt;/li&gt;
&lt;li&gt; time is used for tracking execution duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;&lt;code&gt;def lambda_handler(event, context):
    start = time.time()
    print("Lambda started")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wqsfmks20tk5wz55w6i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6wqsfmks20tk5wz55w6i.png" alt=" " width="448" height="86"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Logs the start of execution and records time.&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;ec2 = boto3.client('ec2')
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlammw3usktp85cqj97t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlammw3usktp85cqj97t.png" alt=" " width="392" height="46"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Establishes a client to interact with EC2 services.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;vpcs = ec2.describe_vpcs()['Vpcs']
print(f"Fetched {len(vpcs)} VPCs")
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Retrieves all VPC metadata from the AWS account.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;vpc_data = [{
    'VPC ID': vpc['VpcId'],
    'CIDR Block': vpc['CidrBlock'],
    'State': vpc['State'],
    'IsDefault': vpc.get('IsDefault', False)
} for vpc in vpcs]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Extracts required information from the VPCs into a list of dictionaries.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;subnets = ec2.describe_subnets()['Subnets']
print(f"Fetched {len(subnets)} Subnets")
&lt;/code&gt;&lt;/pre&gt;




&lt;p&gt;&lt;strong&gt;Retrieves all subnet data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0em68ufkbwl4u7qn7h9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb0em68ufkbwl4u7qn7h9.png" alt=" " width="555" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;subnet_data = [{
    'Subnet ID': subnet['SubnetId'],
    'VPC ID': subnet['VpcId'],
    'CIDR Block': subnet['CidrBlock'],
    'Availability Zone': subnet['AvailabilityZone'],
    'State': subnet['State']
} for subnet in subnets]
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Gathers details like subnet ID, VPC association, AZ, etc.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2qm5er5b0ol7lo98v4l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg2qm5er5b0ol7lo98v4l.png" alt=" " width="551" height="231"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Creates Excel sheets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1s0y094n2py8po98y5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1s0y094n2py8po98y5e.png" alt=" " width="768" height="155"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finalizes and prepares Excel file for upload.&lt;/p&gt;
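&lt;p&gt;The Excel step in the screenshots follows the standard Pandas in-memory pattern. Here is a minimal, self-contained sketch of it; the placeholder rows stand in for what boto3 actually returns:&lt;/p&gt;

```python
import pandas as pd
from io import BytesIO

# Placeholder rows standing in for the boto3 results above.
vpc_df = pd.DataFrame([{"VPC ID": "vpc-0abc", "CIDR Block": "10.0.0.0/16"}])
subnet_df = pd.DataFrame([{"Subnet ID": "subnet-0def", "VPC ID": "vpc-0abc"}])

# Build the workbook entirely in memory: two sheets, no temp files.
buffer = BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    vpc_df.to_excel(writer, sheet_name="VPCs", index=False)
    subnet_df.to_excel(writer, sheet_name="Subnets", index=False)
buffer.seek(0)  # rewind so the subsequent S3 upload reads from the start
```

&lt;p&gt;The buffer can then be passed directly as the upload body, which is why no disk access is needed at any point.&lt;/p&gt;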

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvgewkoqqhlbqx9gr2tl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdvgewkoqqhlbqx9gr2tl.png" alt=" " width="550" height="146"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Uploads the Excel to S3.&lt;/strong&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;duration = round(time.time() - start, 2)
print(f"Total Execution Time: {duration} seconds")
return {
    'statusCode': 200,
    'body': f'Report uploaded to s3://{bucket_name}/{file_name} in {duration} seconds.'
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;strong&gt;The Classic AWS Reminder&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next failure was quick.&lt;/p&gt;

&lt;p&gt;Access denied.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c4zoqejvezkd6xbujze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4c4zoqejvezkd6xbujze.png" alt=" " width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IAM role was missing permission to upload objects to S3. One small update fixed it immediately.&lt;/p&gt;

&lt;p&gt;AWS errors are rarely mysterious. They are usually precise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The First Successful Run&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The execution finally completed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3skrw1u08jgjazjy48ei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3skrw1u08jgjazjy48ei.png" alt=" " width="800" height="162"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Status showed “Succeeded”.&lt;br&gt;
Logs confirmed the runtime.&lt;br&gt;
S3 showed a new Excel file.&lt;/p&gt;

&lt;p&gt;No console clicking.&lt;br&gt;
No copy-paste.&lt;br&gt;
No manual formatting.&lt;/p&gt;

&lt;p&gt;The system worked.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20u93sknpvw1zi6bg6og.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F20u93sknpvw1zi6bg6og.png" alt=" " width="800" height="362"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr79fafea6ow4wbcnc1cj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr79fafea6ow4wbcnc1cj.png" alt=" " width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opening the Report Felt Different&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Excel file had a separate sheet for each service, showing all the resources.&lt;/p&gt;

&lt;p&gt;VPCs, subnets, EC2 instances, RDS databases, EBS volumes: whatever the Lambda function was configured to collect.&lt;/p&gt;

&lt;p&gt;Clean columns.&lt;br&gt;
Accurate data.&lt;br&gt;
Live infrastructure information.&lt;/p&gt;

&lt;p&gt;This was no longer a document created by a human. It was a report generated by the system itself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pf58rgnlz0q3govuh0u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9pf58rgnlz0q3govuh0u.png" alt=" " width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nbfgmnqdtgg9m7f92v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1nbfgmnqdtgg9m7f92v2.png" alt=" " width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gm1hmdcpqzhq9th325q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6gm1hmdcpqzhq9th325q.png" alt=" " width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This shows the real value of the automation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What This Quietly Became&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This started as a small fix for a repeating task.&lt;/p&gt;

&lt;p&gt;It turned into a reusable reporting foundation.&lt;/p&gt;

&lt;p&gt;Today, the same pattern can be extended easily. EC2 inventories. IAM users. Security groups. Cost summaries.&lt;/p&gt;

&lt;p&gt;Once the pipeline exists, expansion feels natural.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Final Thoughts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AWS already knows everything about your infrastructure.&lt;/p&gt;

&lt;p&gt;The real work is asking the right questions and shaping the answers into something humans can trust.&lt;/p&gt;

&lt;p&gt;This automation does exactly that.&lt;/p&gt;

&lt;p&gt;Quietly. Reliably. Every time it runs.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;If you need the full Lambda code, just drop a comment and I will share the IAM role (in JSON format) and the GitHub link for the automation code.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>automation</category>
      <category>python</category>
    </item>
    <item>
      <title>Incident Memory System: Building a Self-Learning, Self-Healing AWS Operations Engine</title>
      <dc:creator>Abhijeet Yadav</dc:creator>
      <pubDate>Fri, 16 Jan 2026 11:51:49 +0000</pubDate>
      <link>https://dev.to/shree_abhijeet/incident-memory-system-building-a-self-learning-self-healing-aws-operations-engine-4b44</link>
      <guid>https://dev.to/shree_abhijeet/incident-memory-system-building-a-self-learning-self-healing-aws-operations-engine-4b44</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;How We Built a Self-Learning, Self-Healing AWS Operations Engine from Real Production Failures&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Many cloud environments today are thoroughly monitored.&lt;/p&gt;

&lt;p&gt;Alarms trigger. Dashboards display warnings. Notifications pop up in Slack.&lt;/p&gt;

&lt;p&gt;Yet, when an issue arises at 3:17 AM, engineers still find themselves asking:&lt;/p&gt;

&lt;p&gt;“Has this happened before… and how did we resolve it previously?”&lt;/p&gt;

&lt;p&gt;The real source of operational distress lies not in insufficient monitoring, but in this gap.&lt;/p&gt;

&lt;p&gt;In this blog, I aim to share how we designed and implemented an Incident Memory System on AWS — not as a theoretical concept, but as a fully operational system, created by intentionally disrupting production-like infrastructure and compelling it to restore itself.&lt;/p&gt;

&lt;p&gt;This is not a tutorial.&lt;/p&gt;

&lt;p&gt;This is not a PoC based on ideal scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is a real exercise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architecture: Designing an Incident Memory System That Can Actually Learn&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before automation or self-healing can exist, responsibilities must be clearly defined.&lt;/p&gt;

&lt;p&gt;We did not start by choosing services.&lt;br&gt;
We started by defining roles.&lt;/p&gt;

&lt;p&gt;At a high level, the system needed to do five things reliably:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Detect a real failure&lt;/li&gt;
&lt;li&gt;Decide whether the failure is significant&lt;/li&gt;
&lt;li&gt;Record the incident in a way the system can remember&lt;/li&gt;
&lt;li&gt;Apply a known recovery action&lt;/li&gt;
&lt;li&gt;Track whether the recovery actually worked&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Only after these responsibilities were clear did we map them to AWS services.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Logical Architecture Breakdown&lt;/strong&gt;
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;The Incident Memory System was designed as a pipeline, not a monolith.&lt;br&gt;
Each stage has a single responsibility.&lt;/p&gt;

&lt;p&gt;Detection Layer&lt;br&gt;
CloudWatch alarms are responsible only for answering one question:&lt;br&gt;
Is something broken right now?&lt;/p&gt;

&lt;p&gt;Event Routing Layer&lt;br&gt;
EventBridge is responsible for routing only meaningful state changes.&lt;br&gt;
It does not execute logic. It only forwards signals.&lt;/p&gt;

&lt;p&gt;Memory Layer&lt;br&gt;
DynamoDB acts as the system’s long-term memory.&lt;br&gt;
Every incident is stored with its symptom, timestamp, and resolution state.&lt;/p&gt;

&lt;p&gt;Execution Layer&lt;br&gt;
Lambda functions execute predefined actions.&lt;br&gt;
They do not diagnose. They do not guess.&lt;/p&gt;

&lt;p&gt;Control Plane Execution&lt;br&gt;
AWS Systems Manager performs the actual recovery on the instance.&lt;/p&gt;

&lt;p&gt;This separation was intentional.&lt;br&gt;
If any layer fails, the system degrades safely instead of silently pretending to heal.&lt;/p&gt;

&lt;p&gt;High-level responsibility flow of the Incident Memory System&lt;/p&gt;
&lt;/blockquote&gt;
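&lt;p&gt;The routing layer above can be captured in a single EventBridge event pattern. This sketch (shown as a Python dict for readability) forwards only transitions into the ALARM state, so downstream Lambdas never fire on OK or INSUFFICIENT_DATA transitions:&lt;/p&gt;

```python
import json

# EventBridge event pattern for the routing layer: match CloudWatch alarm
# state changes, but only transitions into ALARM.
event_pattern = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}
print(json.dumps(event_pattern, indent=2))
```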

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficm53vu3fezyvh46gknd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficm53vu3fezyvh46gknd.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This diagram shows how detection, memory, and recovery are decoupled to avoid tight coupling.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79icvddqgohigrhh7cr0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F79icvddqgohigrhh7cr0.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;
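&lt;p&gt;As a sketch of the memory layer, one incident item might look like the dict below. The attribute names are illustrative assumptions, not the exact schema used in the system:&lt;/p&gt;

```python
from datetime import datetime, timezone

# Hypothetical shape of one incident record in the DynamoDB memory table:
# symptom, timestamp, and resolution state, as described above.
incident = {
    "incident_id": "alb-502-2026-01-16T03:17:00Z",
    "symptom": "ALB 5xx count above threshold",
    "resource": "i-0123456789abcdef0",
    "first_seen": datetime(2026, 1, 16, 3, 17, tzinfo=timezone.utc).isoformat(),
    "recovery_action": "restart-nginx",
    "resolution_state": "RESOLVED",
    "recovery_verified": True,
}
```

&lt;p&gt;Because every field is explicit, the next 3 AM lookup becomes a query instead of a memory test.&lt;/p&gt;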

&lt;h2&gt;
  
  
  Launching the Compute Layer
&lt;/h2&gt;

&lt;p&gt;The foundation of this system starts with a single Linux EC2 instance.&lt;/p&gt;

&lt;p&gt;Nothing special was chosen here.&lt;br&gt;
No autoscaling.&lt;br&gt;
No containers.&lt;br&gt;
No managed service.&lt;/p&gt;

&lt;p&gt;The goal was to begin with the most common real-world backend setup: a VM running a web service.&lt;/p&gt;

&lt;p&gt;After launching the instance, I connected to it using EC2 Instance Connect and verified basic OS access.&lt;/p&gt;

&lt;p&gt;At this point, the instance was empty.&lt;br&gt;
No application.&lt;br&gt;
No web server.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing and Verifying the Web Service
&lt;/h2&gt;

&lt;p&gt;I chose nginx for one reason only: predictability.&lt;/p&gt;

&lt;p&gt;nginx is simple, stable, and widely used in production environments.&lt;br&gt;
If something breaks here, it reflects a real operational failure, not a framework issue.&lt;/p&gt;

&lt;p&gt;The installation was done directly from the package manager.&lt;/p&gt;

&lt;p&gt;Once installed, I verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nginx binaries were present&lt;/li&gt;
&lt;li&gt;the service could start&lt;/li&gt;
&lt;li&gt;the default page rendered correctly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This confirmed the backend was operational before introducing any AWS-level complexity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7et3eqn9p8kxmamlm3pa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7et3eqn9p8kxmamlm3pa.png" alt=" " width="400" height="147"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1b4gkl731ki8590upcv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1b4gkl731ki8590upcv.png" alt=" " width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Placing an Application Load Balancer in Front
&lt;/h2&gt;

&lt;p&gt;With a working backend, the next step was to expose it properly.&lt;/p&gt;

&lt;p&gt;An Application Load Balancer was created with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An HTTP listener on port 80&lt;/li&gt;
&lt;li&gt;A target group pointing to the EC2 instance&lt;/li&gt;
&lt;li&gt;Health checks configured on the root path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This step is critical.&lt;/p&gt;

&lt;p&gt;Most production failures do not happen on the instance itself.&lt;br&gt;
They are observed at the load balancer layer.&lt;/p&gt;

&lt;p&gt;So the system had to fail where users actually feel it.&lt;/p&gt;

&lt;p&gt;Once the target group showed the instance as healthy, traffic through the ALB was tested and verified.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j4ytewdo5u5fuxqnjv2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6j4ytewdo5u5fuxqnjv2.png" alt=" " width="800" height="372"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Opening the Network Path Correctly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before generating traffic or failures, I validated the network layer.&lt;/p&gt;

&lt;p&gt;Security group rules were adjusted to allow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP traffic on port 80 from the internet&lt;/li&gt;
&lt;li&gt;ALB to instance communication&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This might look trivial, but misconfigured security groups are one of the most common root causes of silent failures.&lt;/p&gt;

&lt;p&gt;Only after confirming the network path was correct did I move forward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2eymur3tyu9xklhzh0c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy2eymur3tyu9xklhzh0c.png" alt=" " width="800" height="157"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generating Controlled Traffic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before breaking anything, I wanted to understand baseline behavior.&lt;/p&gt;

&lt;p&gt;Traffic was generated manually from the instance using curl loops and ApacheBench.&lt;/p&gt;

&lt;p&gt;This served two purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Confirm the ALB was routing traffic correctly&lt;/li&gt;
&lt;li&gt;Establish a normal performance baseline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests succeeded&lt;/li&gt;
&lt;li&gt;No errors were generated&lt;/li&gt;
&lt;li&gt;The system was stable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This baseline is important because later failures can be compared against it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy57go3t38bxmb5kqjo12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy57go3t38bxmb5kqjo12.png" alt=" " width="800" height="197"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  Introducing Failure Intentionally
&lt;/h2&gt;

&lt;p&gt;With everything working, I introduced failure deliberately.&lt;/p&gt;

&lt;p&gt;A small shell script was created that repeatedly stopped and restarted nginx.&lt;/p&gt;

&lt;p&gt;This was not random chaos.&lt;br&gt;
This was controlled instability.&lt;/p&gt;

&lt;p&gt;The script simulated a backend service that flaps under load or crashes intermittently.&lt;/p&gt;
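&lt;p&gt;The flap loop itself was trivial. Here is a Python equivalent of that shell script; the stop/start timings are assumptions, not the exact values I used.&lt;/p&gt;

```python
import subprocess
import time

def flap_plan(cycles, down_seconds=20, up_seconds=10):
    """Build the stop/start schedule: each cycle takes nginx down, then brings it back."""
    plan = []
    for _ in range(cycles):
        plan.append(("systemctl stop nginx", down_seconds))
        plan.append(("systemctl start nginx", up_seconds))
    return plan

def run_flap(cycles):
    """Execute the plan on the instance (requires root on the box; not run here)."""
    for command, pause in flap_plan(cycles):
        subprocess.run(command.split(), check=True)
        time.sleep(pause)
```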

&lt;p&gt;As expected:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ALB health checks began failing&lt;/li&gt;
&lt;li&gt;Users started receiving 502 Bad Gateway responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was the exact failure pattern I wanted to detect and respond to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9a47uniwdsjpx3192br.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9a47uniwdsjpx3192br.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqondw9ssvrgmcpvnytu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feqondw9ssvrgmcpvnytu.png" alt=" " width="800" height="261"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvu67c4txqaz7ddpez4e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgvu67c4txqaz7ddpez4e.png" alt=" " width="800" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Detecting the Failure Using CloudWatch
&lt;/h2&gt;

&lt;p&gt;Only after the failure existed did monitoring come into play.&lt;/p&gt;

&lt;p&gt;A CloudWatch alarm was created on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Namespace: AWS/ApplicationELB&lt;/li&gt;
&lt;li&gt;Metric: HTTPCode_ELB_5XX_Count&lt;/li&gt;
&lt;li&gt;Threshold: greater than or equal to 1 within 1 minute&lt;/li&gt;
&lt;/ul&gt;
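&lt;p&gt;In boto3, an equivalent alarm definition could be sketched like this. The alarm name and load balancer dimension are placeholders; I created mine through the console.&lt;/p&gt;

```python
def build_alarm_params(load_balancer_dimension):
    """Alarm definition matching the setup above; name and dimension are placeholders."""
    return {
        "AlarmName": "alb-5xx-errors",
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_ELB_5XX_Count",
        "Statistic": "Sum",
        "Period": 60,                # one-minute evaluation window
        "EvaluationPeriods": 1,
        "Threshold": 1,              # a single 5XX is enough to trip it
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Dimensions": [{"Name": "LoadBalancer", "Value": load_balancer_dimension}],
    }

def create_alarm(load_balancer_dimension):
    """Create the alarm (needs AWS credentials; defined here but never invoked)."""
    import boto3
    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_alarm_params(load_balancer_dimension))
```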

&lt;p&gt;This alarm did one thing only.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It confirmed that the system could see the failure.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No automation yet.&lt;br&gt;
No recovery logic.&lt;/p&gt;

&lt;p&gt;Just detection.&lt;/p&gt;

&lt;p&gt;Once the backend instability continued, the alarm transitioned into ALARM state.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj586suegmjzo5s3bpjk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj586suegmjzo5s3bpjk9.png" alt=" " width="800" height="377"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing the Incident Collector Lambda
&lt;/h2&gt;

&lt;p&gt;Once the alarm could emit an event, something had to consume it.&lt;/p&gt;

&lt;p&gt;This responsibility belonged to a Lambda function I called the incident collector.&lt;/p&gt;

&lt;p&gt;The purpose of this function was deliberately limited.&lt;/p&gt;

&lt;p&gt;It did not fix anything.&lt;br&gt;
It did not analyze logs.&lt;br&gt;
It did not make decisions.&lt;/p&gt;

&lt;p&gt;Its only responsibility was to record the incident.&lt;/p&gt;

&lt;p&gt;When the EventBridge rule fired, the Lambda extracted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alarm name&lt;/li&gt;
&lt;li&gt;Timestamp&lt;/li&gt;
&lt;li&gt;Symptom type&lt;/li&gt;
&lt;li&gt;Initial incident state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This was the first time the system moved from detection into memory.&lt;/p&gt;

&lt;p&gt;An incident was no longer just an alert.&lt;br&gt;
It became a structured record.&lt;/p&gt;
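&lt;p&gt;A minimal sketch of that collector handler, assuming the standard CloudWatch alarm state-change event shape. The symptom label and the IncidentID format are illustrative choices, not the exact ones from my function.&lt;/p&gt;

```python
import json

def extract_incident(event):
    """Pull the recorded fields out of a CloudWatch alarm state-change event."""
    detail = event["detail"]
    timestamp = detail["state"]["timestamp"]
    return {
        "IncidentID": f'{detail["alarmName"]}-{timestamp}',  # illustrative ID format
        "AlarmName": detail["alarmName"],
        "Timestamp": timestamp,
        "Symptom": "ALB_5XX",  # symptom label for this alarm type (assumed name)
        "Status": "OPEN",      # every new incident starts unresolved
    }

def lambda_handler(event, context):
    """Entry point invoked by the EventBridge rule."""
    incident = extract_incident(event)
    # The DynamoDB write happens here in the real function.
    return {"statusCode": 200, "body": json.dumps(incident)}
```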

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4zf5ehmhj8rehnhanvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4zf5ehmhj8rehnhanvo.png" alt=" " width="800" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Converting an Alarm Into an Actionable Event
&lt;/h2&gt;

&lt;p&gt;At this stage, the system could detect failure, but detection alone is passive.&lt;/p&gt;

&lt;p&gt;An alarm changing state does not fix anything.&lt;br&gt;
It only tells humans that something went wrong.&lt;/p&gt;

&lt;p&gt;To move beyond that, the alarm had to become an event that other services could react to.&lt;/p&gt;

&lt;p&gt;This is where Amazon EventBridge was introduced.&lt;/p&gt;

&lt;p&gt;I created an EventBridge rule that listens specifically for CloudWatch alarm state change events.&lt;/p&gt;

&lt;p&gt;The rule was scoped carefully to avoid noise:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source set to CloudWatch&lt;/li&gt;
&lt;li&gt;Event type restricted to alarm state changes&lt;/li&gt;
&lt;li&gt;Filtered only when the alarm enters ALARM state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This ensured that only real failures triggered downstream logic.&lt;/p&gt;
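&lt;p&gt;The rule's scoping boils down to three filters in the event pattern. Something like this, where the rule name is a placeholder:&lt;/p&gt;

```python
import json

# Three filters: CloudWatch as the source, alarm state changes only,
# and only transitions into the ALARM state.
ALARM_PATTERN = {
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"state": {"value": ["ALARM"]}},
}

def create_rule(rule_name):
    """Register the rule (needs AWS credentials; defined here but never invoked)."""
    import boto3
    events = boto3.client("events")
    events.put_rule(Name=rule_name, EventPattern=json.dumps(ALARM_PATTERN))
```

&lt;p&gt;Alarms returning to OK state never match, so recovery does not re-trigger the pipeline.&lt;/p&gt;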

&lt;p&gt;No polling.&lt;br&gt;
No scripts.&lt;br&gt;
No manual checks.&lt;/p&gt;

&lt;p&gt;The system was now event-driven.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yg1ol1rasmfj5zrxwzh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3yg1ol1rasmfj5zrxwzh.png" alt=" " width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flroqyrkxwnbu7of97h29.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flroqyrkxwnbu7of97h29.png" alt=" " width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvif0rpqdidgfgstlpcl3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvif0rpqdidgfgstlpcl3.png" alt=" " width="800" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating Persistent Incident Memory With DynamoDB
&lt;/h2&gt;

&lt;p&gt;Alerts are transient.&lt;br&gt;
Logs rotate.&lt;br&gt;
Dashboards reset.&lt;/p&gt;

&lt;p&gt;Memory needs persistence.&lt;/p&gt;

&lt;p&gt;To store incident history, I created a DynamoDB table named incident_memory.&lt;/p&gt;

&lt;p&gt;The schema was intentionally simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;IncidentID as the partition key&lt;/li&gt;
&lt;li&gt;AlarmName&lt;/li&gt;
&lt;li&gt;Status&lt;/li&gt;
&lt;li&gt;Symptom&lt;/li&gt;
&lt;li&gt;Timestamp&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When the incident collector Lambda executed, it wrote a new item with status set to OPEN.&lt;/p&gt;

&lt;p&gt;This mattered more than it looked.&lt;/p&gt;

&lt;p&gt;OPEN meant unresolved.&lt;br&gt;
OPEN meant the system knew work remained.&lt;br&gt;
OPEN meant recovery could be tracked.&lt;/p&gt;

&lt;p&gt;For the first time, the system had state.&lt;/p&gt;
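&lt;p&gt;The write itself is a single put_item call. A sketch, using the attribute names from the schema above; the helper wrapping the boto3 call is illustrative.&lt;/p&gt;

```python
def build_incident_item(incident_id, alarm_name, symptom, timestamp):
    """Item for the incident_memory table; IncidentID is the partition key."""
    return {
        "IncidentID": {"S": incident_id},
        "AlarmName": {"S": alarm_name},
        "Status": {"S": "OPEN"},  # new incidents always start OPEN
        "Symptom": {"S": symptom},
        "Timestamp": {"S": timestamp},
    }

def record_incident(item):
    """Persist the item (needs AWS credentials; defined here but never invoked)."""
    import boto3
    boto3.client("dynamodb").put_item(TableName="incident_memory", Item=item)
```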

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s8qz8o9e7qoomt9riot.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8s8qz8o9e7qoomt9riot.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfwscfwzmq9ubhbhiug9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfwscfwzmq9ubhbhiug9.png" alt=" " width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resolving the Incident Manually, Once
&lt;/h2&gt;

&lt;p&gt;At this point, I intentionally stopped automating.&lt;/p&gt;

&lt;p&gt;The backend issue was resolved manually by restarting nginx.&lt;/p&gt;

&lt;p&gt;This step was critical.&lt;/p&gt;

&lt;p&gt;The goal was never to remove humans from the loop.&lt;br&gt;
The goal was to learn from the first fix.&lt;/p&gt;

&lt;p&gt;Once nginx was restarted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Health checks recovered&lt;/li&gt;
&lt;li&gt;ALB stopped serving 5XX responses&lt;/li&gt;
&lt;li&gt;The CloudWatch alarm returned to OK state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This state transition was just as important as the failure itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Teaching the System How to Resolve the Incident
&lt;/h2&gt;

&lt;p&gt;With a successful manual resolution completed, I introduced the auto resolver Lambda.&lt;/p&gt;

&lt;p&gt;This function was not reactive in the traditional sense.&lt;/p&gt;

&lt;p&gt;It did not run blindly on every alarm.&lt;br&gt;
It did not attempt diagnosis.&lt;/p&gt;

&lt;p&gt;Instead, it followed a simple rule:&lt;/p&gt;

&lt;p&gt;If an incident exists in OPEN state and its resolution pattern is known, apply the same fix.&lt;/p&gt;

&lt;p&gt;For this incident type, the fix was clear:&lt;/p&gt;

&lt;p&gt;Restart nginx on the affected instance&lt;/p&gt;

&lt;p&gt;This action was executed using AWS Systems Manager Run Command.&lt;/p&gt;
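&lt;p&gt;In code, the Run Command invocation could look roughly like this; the instance ID and the comment text are placeholders.&lt;/p&gt;

```python
def build_restart_command(instance_id):
    """Arguments for ssm.send_command: restart nginx through Run Command."""
    return {
        "InstanceIds": [instance_id],
        "DocumentName": "AWS-RunShellScript",
        "Parameters": {"commands": ["systemctl restart nginx"]},
        "Comment": "auto-resolver: restart flapping nginx",  # placeholder text
    }

def auto_resolve(instance_id):
    """Fire the fix (needs AWS credentials; defined here but never invoked)."""
    import boto3
    ssm = boto3.client("ssm")
    ssm.send_command(**build_restart_command(instance_id))
    # On success, the incident row is flipped to AUTO RESOLVED in DynamoDB.
```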

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs02hbg56i9vwqn7vtr79.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs02hbg56i9vwqn7vtr79.png" alt=" " width="800" height="657"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No SSH keys.&lt;br&gt;
No open ports.&lt;br&gt;
No human login.&lt;/p&gt;

&lt;p&gt;Once the command succeeded, the incident status was updated to AUTO RESOLVED in DynamoDB.&lt;/p&gt;

&lt;p&gt;The system now had proof of recovery.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkll1mo8zkj07pj5lbwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqkll1mo8zkj07pj5lbwx.png" alt=" " width="800" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5jdh470bm93vw023pe9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5jdh470bm93vw023pe9.png" alt=" " width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoe6icjcwrxjd6fhblf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoe6icjcwrxjd6fhblf0.png" alt=" " width="800" height="369"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This system did not prevent failures.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Services still crashed.&lt;br&gt;
Traffic still failed.&lt;br&gt;
Alarms still fired.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What changed was what happened after the first failure.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of starting from zero every time, the system began to reuse what it had already learned.&lt;br&gt;
The fix that worked once became a repeatable action, not tribal knowledge.&lt;/p&gt;

&lt;p&gt;That shift is subtle, but powerful.&lt;/p&gt;

&lt;p&gt;Most operational automation struggles because it tries to be intelligent too early.&lt;br&gt;
It guesses causes.&lt;br&gt;
It assumes fixes.&lt;br&gt;
It reacts without context.&lt;/p&gt;

&lt;p&gt;This approach was different.&lt;/p&gt;

&lt;p&gt;The system waited for a real incident.&lt;br&gt;
It observed how it was resolved.&lt;br&gt;
Then it remembered that resolution.&lt;/p&gt;

&lt;p&gt;Nothing more. Nothing less.&lt;/p&gt;

&lt;p&gt;Over time, this kind of design can reduce on-call fatigue, shorten recovery times, and preserve operational knowledge that would otherwise disappear when people change teams.&lt;br&gt;
Not because the system is smart, but because it is grounded in real outcomes.&lt;/p&gt;

&lt;p&gt;In cloud operations, reliability often comes not from predicting the future, but from respecting the past.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>devops</category>
      <category>automation</category>
      <category>cloudengineering</category>
    </item>
    <item>
      <title>AI-Augmented Cloud Operations</title>
      <dc:creator>Abhijeet Yadav</dc:creator>
      <pubDate>Wed, 07 Jan 2026 07:49:48 +0000</pubDate>
      <link>https://dev.to/shree_abhijeet/ai-augmented-cloud-operations-4c3p</link>
      <guid>https://dev.to/shree_abhijeet/ai-augmented-cloud-operations-4c3p</guid>
      <description>&lt;p&gt;I've spent the last year buried in AWS stuff. You know, chasing CloudWatch alerts, digging through logs, hunting weird cost spikes, and sorting random errors that pop up right when you want to call it a day.&lt;/p&gt;

&lt;p&gt;Cloud ops isn't that smooth after all. Actually, it's getting way louder, with alarms going off nonstop.&lt;/p&gt;

&lt;p&gt;Think about it. Monday rolls around and RDS CPU spikes sky-high for no reason. Tuesday hits with a Lambda retrying over and over in some endless loop. Come Wednesday, a cost alert keeps beeping because that staging EC2 server never got shut down, or, just like me, someone forgot to stop the AMI &amp;amp; snapshot policy. (That one actually burnt $262 when I ignored the alerts.)&lt;/p&gt;

&lt;p&gt;If you've messed with AWS a bit, you totally get the routine.&lt;/p&gt;

&lt;p&gt;Then it hit me hard. Most of our work, close to 70 percent, boils down to the same grind. We analyze logs. Recap the chaos. Pinpoint what failed. Guess the real culprit. &lt;br&gt;
Wash, rinse, repeat every week.&lt;/p&gt;

&lt;p&gt;Things shifted when I troubleshot a client's Lambda glitch using CloudWatch and Bedrock. I stumbled into a setup where the AI ripped through a huge log file. It summed everything up and zeroed in on the problem within seconds. Way quicker than me pounding coffee over a screen.&lt;/p&gt;

&lt;p&gt;That's the aha moment. AI won't take over jobs. Instead, it acts like that reliable buddy who stays up all night. It catches patterns we overlook and lets us tackle bigger challenges.&lt;/p&gt;

&lt;p&gt;I kept experimenting after that. Linked Bedrock with CloudWatch, Lambda, CodeCommit for code trails, Cost Explorer data, plus Slack alerts. Suddenly the whole thing locked into place.&lt;/p&gt;

&lt;p&gt;AWS lays it all out ready to go. Logs, metrics, events, billing details, repos, old incident records, models that actually reason through stuff.&lt;/p&gt;

&lt;p&gt;Observe it smartly and you can build an AI helper that runs around the clock like a solid junior dev. No futuristic talk. This runs right now in real accounts.&lt;/p&gt;

&lt;p&gt;Here is how I assembled it step by step, using actual services and headaches I fixed, with code anyone can fire up.&lt;br&gt;
This is not a dry tutorial. It is just my numerous attempts, trying different approaches until I got the solution and hit the spot.&lt;/p&gt;

&lt;p&gt;I didn’t plan any of this properly. It sort of came together while handling a random RDS CPU alert. The alarm fired (again), and instead of going through the usual drill, I thought I’d quickly put something in place to save myself the back-and-forth.&lt;/p&gt;

&lt;p&gt;So the first thing I did was open the CloudWatch alarm to check the settings, mainly to confirm the threshold and the evaluation period. Since this is part of the story anyway.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6g0b20x35dgopfehh0g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc6g0b20x35dgopfehh0g.png" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After confirming the alarm settings, I jumped into the logs.&lt;br&gt;
I wanted Lambda to automatically pick up a small window of logs, so I opened the RDS log group to see what format and timestamps I’d be dealing with.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rreik3iajfhr1125zyw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rreik3iajfhr1125zyw.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once I saw the structure, I created a simple Lambda function. Nothing fancy: Python, default runtime, one file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Up Slack and Wiring the Notifications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once I had the basic flow in my head, the next thing I needed was a clean way to get alerts somewhere I actually check. Email works sometimes, but Slack is where most of us live anyway. So I created a small Slack app just for this setup.&lt;/p&gt;

&lt;p&gt;The moment you open the Slack developer page, it looks something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dxtjcjzmfpwu1puav4x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dxtjcjzmfpwu1puav4x.png" alt=" " width="384" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All I had to do was enable incoming webhooks for the workspace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa44f9024tzttfqis1cf1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa44f9024tzttfqis1cf1.png" alt=" " width="400" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then Slack asked me where exactly I wanted the messages to appear. I picked a fresh channel I made only for this experiment, mostly so I could keep the noise separate from everything else.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnz48fftmxf72upltbeu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcnz48fftmxf72upltbeu.png" alt=" " width="400" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After approving it, Slack generated the webhook URL. This is the only thing Lambda needs to trigger messages into that channel.&lt;/p&gt;
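&lt;p&gt;Posting to that webhook needs nothing beyond the Python standard library. A minimal sketch; the helper names are mine.&lt;/p&gt;

```python
import json
import urllib.request

def build_payload(text):
    """Minimal Slack incoming-webhook payload."""
    return json.dumps({"text": text}).encode("utf-8")

def post_to_slack(webhook_url, text):
    """POST the message to the webhook (needs network; defined here but never invoked)."""
    request = urllib.request.Request(
        webhook_url,
        data=build_payload(text),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.status
```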

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6yiqxsmx9ajfilx12zf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp6yiqxsmx9ajfilx12zf.png" alt=" " width="325" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I didn’t want to hardcode this webhook into Lambda though. Storing secrets in code is a recipe for embarrassment later. So I opened Secrets Manager and created a new secret for the webhook.&lt;/p&gt;

&lt;p&gt;With that done, the notification path was ready.&lt;br&gt;
Slack would receive the message.&lt;br&gt;
Secrets Manager would hold the sensitive piece.&lt;br&gt;
And Lambda would glue the two together.&lt;/p&gt;
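&lt;p&gt;At runtime, Lambda pulls the webhook back out of Secrets Manager. A sketch, assuming the secret stores JSON with a SLACK_WEBHOOK_URL key; that key name is my choice, yours may differ.&lt;/p&gt;

```python
import json

def parse_secret(secret_string):
    """Pull the webhook URL out of the stored secret JSON (key name is assumed)."""
    return json.loads(secret_string)["SLACK_WEBHOOK_URL"]

def get_webhook(secret_name):
    """Fetch the secret at runtime (needs AWS credentials; never invoked here)."""
    import boto3
    response = boto3.client("secretsmanager").get_secret_value(SecretId=secret_name)
    return parse_secret(response["SecretString"])
```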

&lt;p&gt;&lt;strong&gt;The idea: alarm fires → Lambda gets the event → fetch 10 minutes of logs → send to Bedrock.&lt;/strong&gt;&lt;/p&gt;
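&lt;p&gt;The "fetch 10 mins logs" step is just a time-window query against CloudWatch Logs. A sketch, with the log group name left as a parameter:&lt;/p&gt;

```python
import time

def log_window(minutes=10, now_ms=None):
    """Millisecond start and end bounds for the fetch window."""
    end = now_ms if now_ms is not None else int(time.time() * 1000)
    start = end - minutes * 60 * 1000
    return start, end

def fetch_logs(log_group, minutes=10):
    """Pull the last N minutes of events (needs AWS credentials; never invoked here)."""
    import boto3
    start, end = log_window(minutes)
    response = boto3.client("logs").filter_log_events(
        logGroupName=log_group, startTime=start, endTime=end
    )
    return [e["message"] for e in response.get("events", [])]
```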

&lt;h2&gt;
  
  
  Wiring Slack into the Flow
&lt;/h2&gt;

&lt;p&gt;Once the idea was clear, I needed a place where all this noise could land. Slack was the obvious choice.&lt;/p&gt;

&lt;p&gt;I created a small Slack app just for this project. Nothing fancy, just a name, a workspace, and a clean slate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sa8kcwlku5nb6lwbeb5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8sa8kcwlku5nb6lwbeb5.png" alt=" " width="384" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87x54psvm1qgpq2fhazj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87x54psvm1qgpq2fhazj.png" alt=" " width="388" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then I enabled incoming webhooks so external services could post messages into a channel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvli17a8yizso09ztpd3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwvli17a8yizso09ztpd3.png" alt=" " width="400" height="210"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After that, Slack asked which channel should receive the alerts. I created a fresh channel named #ai-ops and approved the app.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf1tve70xqkb8sd49nrt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frf1tve70xqkb8sd49nrt.png" alt=" " width="400" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Once approved, Slack generated the webhook URL.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rzdtnvla688b6hocb4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5rzdtnvla688b6hocb4v.png" alt=" " width="325" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That tiny URL is the bridge between AWS and my phone lighting up at 2 AM.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Hiding the Webhook Where It Belongs
&lt;/h2&gt;

&lt;p&gt;Hardcoding the webhook into Lambda felt wrong, so I pushed it into Secrets Manager.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting Up Secrets and Notifications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the basic pieces started forming, I realized I needed somewhere to store the Slack webhook properly. Hardcoding it into Lambda felt messy and slightly dangerous, so I moved it into Secrets Manager. It took just a minute but instantly made the whole thing feel cleaner. Almost like, OK fine, now this setup looks more aligned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe5nojt5vspwitqjkmd7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foe5nojt5vspwitqjkmd7.png" alt=" " width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving, the secret was sitting quietly in AWS, invisible to anyone who doesn’t have permission.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpor0risap9lr2t9sxmpp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpor0risap9lr2t9sxmpp.png" alt=" " width="800" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one step instantly made the setup feel less hacky and more production-ish.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Lambda That Holds Everything Together
&lt;/h2&gt;

&lt;p&gt;Now came the dispatcher.&lt;/p&gt;

&lt;p&gt;I created a new Lambda function using Python. No layers, no frameworks, just a single file.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgysslerrpji9zapv165t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgysslerrpji9zapv165t.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then I added environment variables so Lambda would know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;which bucket to write to&lt;/li&gt;
&lt;li&gt;where the Slack secret lives&lt;/li&gt;
&lt;li&gt;which region it is running in&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetk3i0e71sypeq1isndl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetk3i0e71sypeq1isndl.png" alt=" " width="800" height="303"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After saving, the configuration finally looked complete.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tylj8by8ukc57kq8mzk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tylj8by8ukc57kq8mzk.png" alt=" " width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, the plumbing was done.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pressing Run and Watching It Come Alive&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I triggered the Lambda manually using a small test payload:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkwsvyjxpg7ns9uulty7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdkwsvyjxpg7ns9uulty7.png" alt=" " width="800" height="346"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within seconds, Slack pinged.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2f7rul0nbx6eqyrlslk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2f7rul0nbx6eqyrlslk.png" alt=" " width="800" height="876"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not a test message. A real alert from something I just built.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Capturing the Full Story in S3
&lt;/h2&gt;

&lt;p&gt;Every execution drops a JSON file into S3. That file contains the logs, timestamps, event data and whatever analysis is produced.&lt;/p&gt;
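&lt;p&gt;Writing that record is one put_object call. A sketch; the results/ key layout mirrors what I saw in the bucket, and the helper names are illustrative.&lt;/p&gt;

```python
import json

def build_result_object(incident_id, logs, analysis):
    """Key and JSON body for one execution's record; the key layout is illustrative."""
    key = f"results/{incident_id}.json"
    body = json.dumps({"incident": incident_id, "logs": logs, "analysis": analysis})
    return key, body

def save_result(bucket, key, body):
    """Drop the record into S3 (needs AWS credentials; never invoked here)."""
    import boto3
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))
```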

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1r2deo4h8zstu1k2gn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1r2deo4h8zstu1k2gn3.png" alt=" " width="800" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And when I opened the results folder, the structure was clean and traceable.&lt;/p&gt;
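&lt;p&gt;That traceability comes mostly from the key layout. A sketch of how the trace write can be organised, assuming a &lt;code&gt;results/{alarm}/{timestamp}.json&lt;/code&gt; structure; the bucket and key names here are placeholders:&lt;/p&gt;

```python
import json
from datetime import datetime, timezone


def build_result_key(alarm_name, when):
    """One JSON object per execution, grouped by alarm, sorted by time."""
    stamp = when.strftime("%Y-%m-%dT%H-%M-%SZ")
    return f"results/{alarm_name}/{stamp}.json"


def write_trace(bucket, alarm_name, trace):
    """Persist the full execution trace (logs, event, analysis) to S3."""
    import boto3  # requires AWS credentials to actually run

    key = build_result_key(alarm_name, datetime.now(timezone.utc))
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(trace, default=str).encode("utf-8"),
        ContentType="application/json",
    )
    return key


key = build_result_key(
    "demo-cpu-high",
    datetime(2026, 1, 29, 11, 8, 22, tzinfo=timezone.utc),
)
```

&lt;p&gt;Because the timestamp sorts lexically, listing a prefix in the console already reads like a timeline.&lt;/p&gt;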

&lt;h2&gt;
  
  
  How Everything Is Actually Connected
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6lpk7hnes2oc0yf5icl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6lpk7hnes2oc0yf5icl.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the entire setup I have running today.&lt;br&gt;
No Bedrock yet. No heavy ML. Just automation that removes one painful step from my daily cloud routine.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Actually Changed
&lt;/h2&gt;

&lt;p&gt;I didn’t build this system because I wanted to “do AI in the cloud”.&lt;br&gt;
I built it because I was tired of reacting blindly.&lt;/p&gt;

&lt;p&gt;Before this, every alert meant the same ritual.&lt;br&gt;
Open CloudWatch.&lt;br&gt;
Scroll through logs.&lt;br&gt;
Guess which 3 lines out of 50,000 actually mattered.&lt;br&gt;
Hope I didn’t miss the real problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now the flow is different.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An alarm fires.&lt;br&gt;
A message hits Slack.&lt;br&gt;
A full execution trace lands in S3.&lt;br&gt;
By the time I even open the console, the context is already waiting for me.&lt;/p&gt;

&lt;p&gt;That changes the entire mental state.&lt;br&gt;
You don’t start from confusion anymore.&lt;br&gt;
You start from evidence.&lt;/p&gt;

&lt;p&gt;What surprised me most is how small the system is.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One Lambda.&lt;/li&gt;
&lt;li&gt;One secret.&lt;/li&gt;
&lt;li&gt;One S3 folder.&lt;/li&gt;
&lt;li&gt;One Slack channel.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That’s enough to cut out a painful slice of daily cloud work.&lt;/p&gt;

&lt;p&gt;And the future part isn’t some wild AI promise. It’s extremely practical.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;This same pattern can listen to cost anomalies.&lt;/li&gt;
&lt;li&gt;It can watch WAF logs.&lt;/li&gt;
&lt;li&gt;It can monitor security events.&lt;/li&gt;
&lt;li&gt;It can summarise incidents before anyone joins the bridge call.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not replacing engineers.&lt;br&gt;
Not auto-fixing production blindly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Just making sure the first thing you see is signal, not noise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s the kind of automation I actually want in my career.&lt;br&gt;
Quiet, helpful, and already working in my account today.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>serverless</category>
      <category>aiops</category>
      <category>automation</category>
    </item>
  </channel>
</rss>
