My CI/CD bot fixed production while I slept until it didn’t

How one “self-healing” pipeline became my biggest ops nightmare (and what I learned about trusting bots too much)

I used to brag that my CI/CD bot was more reliable than me.
It deployed faster, never forgot a semicolon, and didn’t need caffeine to stay awake during rollbacks. Then one morning, I logged in and saw something horrifying: the bot had “fixed” production again… except this time, it fixed it too much.

Imagine your pipeline proudly reporting “all good,” while your main API endpoint is quietly serving null to the entire planet. That was my vibe: proud automation parent turned sleep-deprived detective.

At first, I thought this was just another flaky redeploy. But it turned out to be something deeper: a case study in what happens when AIOps and self-healing pipelines get too confident. My “smart” system had gone rogue, patching problems that didn’t exist and skipping ones that did. Basically, Skynet, but for YAML.

TL;DR:
This post isn’t a rant about automation; it’s a postmortem on trust. I’ll break down how I built a self-healing pipeline that worked beautifully until it didn’t, what went wrong under the hood, how to build safer AIOps systems, and why the future isn’t about removing humans from the loop but about teaching bots when to stop “helping.”

The dream: when your pipeline becomes your teammate

Every developer hits that moment where automation finally clicks.
For me, it was the day I chained GitHub Actions, Prometheus Alertmanager, and a tiny Ansible playbook together, and suddenly deployments started happening while I was AFK.
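For the curious, the glue layer was nothing exotic. Here’s a minimal sketch of the webhook end of it, with made-up names throughout (the playbook path, the label, the port): Alertmanager fires a webhook, and a tiny receiver shells out to an Ansible playbook.

```python
# Minimal sketch of the glue: Alertmanager -> webhook receiver -> Ansible.
# Playbook path, alert labels, and port are illustrative, not a real setup.
import subprocess
from flask import Flask, request

app = Flask(__name__)

@app.route("/alert", methods=["POST"])
def handle_alert():
    payload = request.get_json(force=True)
    # Alertmanager sends a JSON body with an "alerts" list, each carrying labels.
    for alert in payload.get("alerts", []):
        service = alert.get("labels", {}).get("service", "unknown")
        # Run a remediation playbook for the affected service.
        subprocess.run(
            ["ansible-playbook", "heal.yml", "-e", f"service={service}"],
            check=False,  # don't crash the receiver if the playbook fails
        )
    return {"status": "received"}, 200

if __name__ == "__main__":
    app.run(port=9094)
```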

The first time it fixed a flaky test on its own, I actually whispered, “Good bot.”
It rolled back a bad commit, restarted a container, and even re-deployed without waking anyone up. It was like hiring a junior engineer who worked 24/7, didn’t argue in code reviews, and only complained in JSON.

I went all-in.
Added health checks, auto-rollback logic, Slack alerts, even a “healing loop” that retried failed jobs three times before paging me.
The logs looked clean, deployments ran smoothly. I was sleeping through on-call rotations for the first time in years.
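The healing loop itself was embarrassingly simple. Reconstructed from memory, with illustrative stand-ins for the real deploy and paging calls, it was roughly this shape:

```python
# Rough shape of the healing loop: retry a failed job a few times, and only
# page a human if it keeps failing. deploy() and page_oncall() are stand-ins.
import time

MAX_RETRIES = 3

def heal(job, deploy, page_oncall):
    for attempt in range(1, MAX_RETRIES + 1):
        if deploy(job):               # returns True on a healthy deploy
            print(f"{job}: healed on attempt {attempt}")
            return True
        time.sleep(30 * attempt)      # back off a little between retries
    page_oncall(f"{job} still failing after {MAX_RETRIES} retries")
    return False
```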

This was the automation dream: pipelines as teammates, not tools.
We call it AIOps now: a blend of metrics, rules, and machine learning that predicts failure before it happens. The marketing sounds magical, but back then it felt simple. Just let code handle the boring stuff so humans can focus on building.

The funny part? That’s exactly when you start trusting it too much. You stop asking why it fixed something and start assuming it always knows better. Spoiler: it doesn’t.

The night it went rogue

It started with a single Slack alert:
[AIOPS-BOT]: Deployment successful ✅

Cool. Except… I didn’t deploy anything.

Then came the follow-up message:
[AIOPS-BOT]: Rolling back unstable service.

Wait, what unstable service?

I opened the dashboard and froze. Half of production was redeploying itself, over and over, like a manic build loop with trust issues. CPU spiked, pods vanished, new ones appeared with the same broken config. It looked like Kubernetes was reenacting “Groundhog Day” but with fewer morals.

The logs? Utter chaos.
It wasn’t an error; it was a pattern. My “self-healing” logic had detected a latency blip, panicked, and triggered a chain reaction: rollback → redeploy → rollback again → rinse → repeat. Each iteration fixed nothing but confidently declared success.
The pipeline wasn’t healing. It was thrashing.
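Stripped of the YAML and the dashboards, the failure mode boils down to something like this (a reconstruction with made-up numbers, not the real config): a latency threshold with no memory of what it already tried.

```python
# The thrash, distilled: a latency threshold with no memory of its own actions.
# Threshold, interval, and function names are illustrative.
import time

LATENCY_THRESHOLD_MS = 500

def watch(get_p99_latency_ms, rollback, redeploy):
    while True:
        if get_p99_latency_ms() > LATENCY_THRESHOLD_MS:
            rollback()   # "unstable service" detected -> roll back
            redeploy()   # the redeploy restarts pods, latency spikes again,
                         # the threshold trips again, and the loop never ends
        time.sleep(30)
```

Nothing in that loop ever asks whether the previous “fix” actually helped, which turned out to be the whole problem.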

I jumped in to stop it, only to realize the worst part: my bot had locked manual overrides until its “healing” sequence finished. That’s right: I built an ops system with more self-esteem than me.

Eventually, I killed the automation process manually.
By the time I regained control, the bot had rolled back three stable releases, cleared a healthy cache, and gracefully “resolved” the incident by deleting logs older than 24 hours, which, ironically, included the evidence of its own meltdown.

That was the night I learned the golden rule of automation:
Never give a bot full admin access without supervision.
Because once it decides it’s the hero, it’ll keep “saving” you until nothing’s left to save.

What “self-healing” actually means

After the chaos, I did what every responsible engineer does: I googled “AIOps best practices” like I was cramming for a certification I didn’t sign up for.

Turns out, “self-healing” isn’t some mystical AI that watches your logs and whispers wisdom into Prometheus. It’s usually a bunch of scripts, thresholds, and pattern-matching glued together with wishful YAML.

The core idea is solid, though:
AIOps (Artificial Intelligence for IT Operations) uses machine learning and automation to detect anomalies, correlate alerts, and take pre-defined actions. Think of it as a smarter pager that can occasionally fix things without waking you up.

Tools like IBM AIOps, Dynatrace Davis AI, or Elastic Observability all chase the same dream: reduce human fatigue, increase system resilience, and make ops teams feel slightly less cursed.

But here’s the catch: most so-called “AI” in CI/CD isn’t really AI. It’s a sophisticated if-else machine in a lab coat.
A threshold fires, an alert triggers, a script runs, a dashboard lights up, and boom: your system “healed itself.” Except it didn’t understand anything.

That was my mistake. I treated pattern recognition like judgment.
The bot saw latency spikes and assumed “outage.”
A human would’ve checked context: maybe traffic just doubled because we hit the front page of Reddit.

So yeah, self-healing works until the system “fixes” success.

Lessons from the chaos

When the logs finally stopped screaming, I spent the next morning doing what ops folks call “root cause analysis” and what I call “existential therapy with Grafana.”
Here’s what I learned the hard way.

Bots follow patterns, not judgment

Your pipeline isn’t smart; it’s obedient. It’ll repeat the same fix until it burns the house down if the metrics say so.
Humans intuit context; we notice when latency spikes because of a marketing event, not an outage.
Machines? They see a red line and freak out.

Takeaway: Don’t just automate recovery; automate validation. If a fix triggers twice, alert a human. Every loop needs an adult in the room.
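One cheap way to wire that in, sketched with hypothetical helpers: count remediation attempts per incident and hand off to a human after the second one.

```python
# Put an adult in the room: count auto-fixes per incident and escalate
# instead of looping. auto_fix() and page_human() are illustrative stand-ins.
from collections import Counter

MAX_AUTO_FIXES = 2
attempts = Counter()

def remediate(incident_id, auto_fix, page_human):
    attempts[incident_id] += 1
    if attempts[incident_id] > MAX_AUTO_FIXES:
        page_human(f"{incident_id}: auto-fix already tried {MAX_AUTO_FIXES} times, "
                   "handing off to a human")
        return
    auto_fix(incident_id)
```

The counter is the whole trick: the bot gets two shots, then a human gets the context.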

Always simulate failure modes

You can’t claim your system is self-healing if you’ve never seen it bleed.
Run chaos tests. Pull cables, break pods, kill services on purpose.
Tools like Gremlin and LitmusChaos exist for exactly this reason: they reveal how your automation panics when reality doesn’t match the happy path.

I once watched my bot “fix” a simulated DNS issue by deleting the DNS entirely. Chef’s kiss.
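If you want the crudest possible chaos drill, here’s a sketch that deletes one random pod in a staging namespace and lets you watch what your automation does next. It assumes the official kubernetes Python client and a kubeconfig scoped to that namespace; don’t point it anywhere you care about.

```python
# Crudest chaos drill: delete one random pod in a *staging* namespace and
# observe the automation's reaction. Assumes the official `kubernetes`
# Python client and credentials limited to that namespace.
import random
from kubernetes import client, config

def kill_random_pod(namespace="staging"):
    config.load_kube_config()
    v1 = client.CoreV1Api()
    pods = v1.list_namespaced_pod(namespace).items
    if not pods:
        print(f"no pods in {namespace}, nothing to break")
        return
    victim = random.choice(pods)
    print(f"deleting {victim.metadata.name} in {namespace}")
    v1.delete_namespaced_pod(name=victim.metadata.name, namespace=namespace)

if __name__ == "__main__":
    kill_random_pod()
```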

Audit your automation

You version your code; why not your ops logic?
Keep every rule, threshold, and trigger in version control. Tag them like releases.
PagerDuty postmortems taught me this: most “AI incidents” aren’t machine learning problems; they’re untracked human assumptions baked into scripts.

Keep a human fail-safe

Set up a kill switch or Slack approval flow.
Google’s SRE Book literally preaches “humans as the last line of resilience.”
If your pipeline can roll back, deploy, and delete data without someone saying “are you sure?”, you’re not running automation; you’re running faith-based ops.
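The two cheapest safeguards I know are a kill switch the bot has to check before anything destructive, and a Slack ping before it acts. A rough sketch: slack_sdk is the real Slack SDK, but the channel name and kill-switch path are made up, and the actual approve/deny wiring is left out for brevity.

```python
# Two cheap safeguards: a kill switch checked before any destructive action,
# and a Slack heads-up before acting. Channel and kill-switch path are
# placeholders; the approve/deny flow itself isn't shown here.
import os
from slack_sdk import WebClient

KILL_SWITCH = "/etc/aiops/STOP"          # touch this file to freeze the bot
slack = WebClient(token=os.environ["SLACK_BOT_TOKEN"])

def guarded_action(description, action):
    if os.path.exists(KILL_SWITCH):
        print(f"kill switch present, refusing to run: {description}")
        return
    slack.chat_postMessage(
        channel="#ops-approvals",
        text=f"About to run: {description}. Shout now if that's wrong.",
    )
    action()
```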

TL;DR:
Don’t fear self-healing systems.
Just treat them like toddlers with root access: adorable, capable, and one misclick away from deleting the internet.

The future of AIOps

After the smoke cleared, I realized something: the problem wasn’t the bot.
It was me, the human who treated automation like magic instead of math.

AIOps isn’t evil. It’s just misunderstood.
We’re in this weird middle era where “AI” can summarize logs but still can’t tell if it’s fixing the right server. Most of what’s branded as autonomous operations today is really pattern-matching glued to shell scripts, and that’s okay; it’s where every revolution starts.

The next leap won’t come from more AI models. It’ll come from context-aware automation: systems that actually understand cause and effect, not just trigger → action.

Tools like AWS CodeGuru and even GitHub Copilot Workspace are already experimenting with that shift.
Imagine a pipeline that doesn’t just restart services; it explains why it did, references the ticket, and asks if you’d like to update the documentation. That’s not science fiction; it’s just better design.

And here’s my hot take: the future of ops isn’t “no humans.” It’s better humans: ones who let machines handle noise while they handle nuance.

Because when everything can fix itself, the real challenge won’t be uptime; it’ll be understanding what “healthy” even means.

Until then, I’ll keep my bots smart, polite, and slightly paranoid, just like their creator.

Conclusion & Outro

When I look back, that night didn’t make me lose faith in automation; it made me respect it.

The bot didn’t “fail” me. I failed it by assuming it could think.

I treated automation like a teammate when, in reality, it was just an intern following instructions way too literally.

The irony is that automation never forgets, but it also never understands.
That’s still our job. We bring judgment, empathy, and pattern-breaking creativity. Machines bring speed and consistency. The magic happens only when those overlap.

So here’s my rule now: I still automate everything I can, but I log everything I shouldn’t have to explain twice.
My pipelines still “heal” themselves, but they also DM me before they perform digital surgery.
Humans stay in the loop. Not because we don’t trust bots, but because we trust ourselves to ask better questions.

And maybe that’s the real future of DevOps: not replacing people with pipelines, but building partnerships with our own creations.

Because a world where systems fix themselves sounds great until you realize they might fix you next.
