I've spent the last year buried in AWS stuff. You know, chasing CloudWatch alerts, digging through logs, hunting weird cost spikes, and sorting random errors that pop up right when you want to call it a day.
Cloud ops isn't that smooth after all. If anything, it's getting louder, with alarms going off nonstop.
Think about it. Monday rolls around and RDS CPU spikes for no apparent reason. Tuesday hits with a Lambda retrying over and over in some endless loop. Come Wednesday, a cost alert keeps beeping because that staging (testing) EC2 server never got shut down, or because, like me, someone forgot to stop the AMI & snapshot policy. (That one actually burnt $262 of mine when I ignored the alerts.)
If you've messed with AWS a bit, you totally get the routine.
Then it hit me hard. Most of our work, close to 70 percent, boils down to the same grind. We analyze logs. Recap the chaos. Pinpoint what failed. Guess the real culprit.
Wash, rinse, repeat every week.
Things shifted when I troubleshot a client's Lambda glitch using CloudWatch and Bedrock. I stumbled into a setup where the AI ripped through a huge log file. It summed everything up and zeroed in on the problem within seconds. Way quicker than me pounding coffee over a screen.
That's the aha moment. AI won't take over jobs. Instead, it acts like that reliable buddy who stays up all night. It catches patterns we overlook and lets us tackle bigger challenges.
I kept experimenting after that. I linked Bedrock with CloudWatch, Lambda, CodeCommit for code trails, Cost Explorer data, plus Slack alerts. Suddenly the whole thing locked into place.
AWS lays it all out ready to go. Logs, metrics, events, billing details, repos, old incident records, models that actually reason through stuff.
Observe it all smartly and you can build an AI helper that runs around the clock like a solid junior dev. No futuristic talk. This runs right now in real accounts.
Here's how I assembled it step by step, using actual services and headaches I fixed, with code anyone can fire up.
This isn't a dry tutorial, though. It's just my many attempts, trying different approaches until something finally hit the spot.
I didn’t plan any of this properly. It sort of came together while handling a random RDS CPU alert. The alarm fired (again), and instead of going through the usual drill, I thought I’d quickly put something in place to save myself the back-and-forth.
So the first thing I did was open the CloudWatch alarm to check the settings, mainly to confirm the threshold and the evaluation period, since that's part of the story anyway.
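If you'd rather do that check from code instead of the console, a quick boto3 one-off gets you the same numbers. The alarm name here is a placeholder, not the one from my account:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# "rds-cpu-high" is a placeholder -- swap in the alarm that actually fired
resp = cloudwatch.describe_alarms(AlarmNames=["rds-cpu-high"])

for alarm in resp["MetricAlarms"]:
    # Threshold, Period and EvaluationPeriods are the bits I wanted to confirm
    print(alarm["AlarmName"], alarm["Threshold"],
          alarm["Period"], alarm["EvaluationPeriods"])
```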
After confirming the alarm settings, I jumped into the logs.
I wanted Lambda to automatically pick up a small window of logs, so I opened the RDS log group to see what format and timestamps I’d be dealing with.
Once I saw the structure, I created a simple Lambda function. Nothing fancy: Python, default runtime, one file. The idea was to grab a small window of logs whenever the alarm fired and get them somewhere useful.
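To make that concrete, here's a rough sketch of the log-fetching piece. The log group name, the environment variable, and the helper name are placeholders I've made up here, not lifted from my actual function:

```python
import os
import time
import boto3

logs = boto3.client("logs")

# Placeholder log group -- point this at your RDS (or any other) log group
LOG_GROUP = os.environ.get("LOG_GROUP", "/aws/rds/instance/my-db/error")


def fetch_recent_logs(minutes=10):
    """Pull the last N minutes of events from the log group."""
    now_ms = int(time.time() * 1000)
    start_ms = now_ms - minutes * 60 * 1000

    events = []
    paginator = logs.get_paginator("filter_log_events")
    for page in paginator.paginate(
        logGroupName=LOG_GROUP, startTime=start_ms, endTime=now_ms
    ):
        events.extend(e["message"] for e in page["events"])
    return events
```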
Setting Up Slack and Wiring the Notifications
Once I had the basic flow in my head, the next thing I needed was a clean way to get alerts somewhere I actually check. Email works sometimes, but Slack is where most of us live anyway. So I created a small Slack app just for this setup.
Everything starts on the Slack developer page (api.slack.com/apps), where creating a new app takes a couple of clicks.
All I had to do was enable incoming webhooks for the workspace.
Then Slack asked me where exactly I wanted the messages to appear. I picked a fresh channel, #ai-ops, that I'd made only for this experiment, mostly so I could keep the noise separate from everything else.
After approving it, Slack generated the webhook URL. That tiny URL is the only thing Lambda needs to post messages into the channel, and it's also the bridge between AWS and my phone lighting up at 2 AM.
I didn't want to hardcode this webhook into Lambda though. Storing secrets in code is a recipe for embarrassment later. So I opened Secrets Manager and created a new secret for the webhook. It took a minute, and once saved, the secret sat quietly in AWS, invisible to anyone without permission. That one step instantly made the setup feel less hacky and more production-ish.
With that done, the notification path was ready.
Slack would receive the message.
Secrets Manager would hold the sensitive piece.
And Lambda would glue the two together.
Alarm fires → Lambda gets event → fetch 10 mins logs → send to Bedrock.
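The Slack half of that glue is genuinely tiny. Here's a minimal sketch, assuming the webhook is stored in Secrets Manager as JSON under a made-up name like slack/ai-ops-webhook with a webhook_url key (your names will differ):

```python
import json
import urllib.request

import boto3

secrets = boto3.client("secretsmanager")


def get_webhook_url(secret_name="slack/ai-ops-webhook"):
    """Read the Slack webhook URL out of Secrets Manager."""
    resp = secrets.get_secret_value(SecretId=secret_name)
    return json.loads(resp["SecretString"])["webhook_url"]


def post_to_slack(text):
    """Send a plain-text message to the channel behind the webhook."""
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        get_webhook_url(),
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

If you store the webhook as a plain string instead of JSON, drop the json.loads and use SecretString directly.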
The Lambda That Holds Everything Together
Now came the dispatcher.
It's the same function from earlier: Python, no layers, no frameworks, just a single file.
Then I added environment variables (there's a small sketch of how the handler reads them right after this list) so Lambda would know:
which bucket to write to
where the Slack secret lives
which region it is running in
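Inside the function, those settings arrive as plain environment variables. The variable names below are my own picks for illustration, not necessarily what you'd configure:

```python
import os

# Names here are illustrative -- match them to your own configuration
RESULTS_BUCKET = os.environ["RESULTS_BUCKET"]        # which bucket to write to
SLACK_SECRET_NAME = os.environ["SLACK_SECRET_NAME"]  # where the Slack secret lives
REGION = os.environ.get("AWS_REGION", "us-east-1")   # Lambda sets AWS_REGION for you
```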
After saving, the configuration finally looked complete.
At this point, the plumbing was done.
Pressing Run and Watching It Come Alive
I triggered the Lambda manually using a small test payload.
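I used the console for this, but the same manual trigger from boto3 looks roughly like the sketch below. The payload shape and the function name are stand-ins, not the exact event I fired:

```python
import json

import boto3

lambda_client = boto3.client("lambda")

# Stand-in event -- just enough to drive the handler during a manual test
test_event = {
    "alarmName": "rds-cpu-high",
    "state": "ALARM",
    "region": "us-east-1",
}

resp = lambda_client.invoke(
    FunctionName="ai-ops-dispatcher",  # hypothetical function name
    Payload=json.dumps(test_event).encode("utf-8"),
)
print(resp["StatusCode"], resp["Payload"].read().decode("utf-8"))
```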
Within seconds, Slack pinged.
Not a test message. A real alert from something I'd just built.
Capturing the Full Story in S3
Every execution drops a JSON file into S3. That file contains the logs, timestamps, event data and whatever analysis is produced.
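The write itself is a single put_object call. Here's a sketch, with a key layout I made up for illustration:

```python
import datetime
import json

import boto3

s3 = boto3.client("s3")


def save_run(bucket, alarm_name, record):
    """Drop one execution's full context into S3 as a timestamped JSON file."""
    ts = datetime.datetime.utcnow().strftime("%Y-%m-%d/%H-%M-%S")
    key = f"results/{alarm_name}/{ts}.json"  # made-up layout, adjust to taste
    s3.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(record, default=str).encode("utf-8"),
        ContentType="application/json",
    )
    return key
```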
And when I opened the results folder, the structure was clean and traceable.
How Everything Is Actually Connected
This is the entire setup I have running today.
No Bedrock yet. No heavy ML. Just automation that removes one painful step from my daily cloud routine.
What This Actually Changed
I didn’t build this system because I wanted to “do AI in the cloud”.
I built it because I was tired of reacting blindly.
Before this, every alert meant the same ritual.
Open CloudWatch.
Scroll through logs.
Guess which 3 lines out of 50,000 actually mattered.
Hope I didn’t miss the real problem.
Now the flow is different.
An alarm fires.
A message hits Slack.
A full execution trace lands in S3.
By the time I even open the console, the context is already waiting for me.
That changes the entire mental state.
You don’t start from confusion anymore.
You start from evidence.
What surprised me most is how small the system is.
- One Lambda.
- One secret.
- One S3 folder.
- One Slack channel.
That’s enough to cut out a painful slice of daily cloud work.
And the future part isn’t some wild AI promise. It’s extremely practical.
- This same pattern can listen to cost anomalies.
- It can watch WAF logs.
- It can monitor security events.
- It can summarise incidents before anyone joins the bridge call.
Not replacing engineers.
Not auto-fixing production blindly.
Just making sure the first thing you see is signal, not noise.
That’s the kind of automation I actually want in my career.
Quiet, helpful, and already working in my account today.