Cloudflare and AWS prove: the biggest threat to your infra isn't a cyber spy. It's a bad Friday deploy. Here is why ITIL is actually a security control.
Remember that day in November 2025?
- ChatGPT stopped.
- Discord went mute.
- Canva froze.
Offices worldwide ground to a halt. In WhatsApp groups, the standard theory took over: "This is a massive state-sponsored attack, for sure!"
The reality? Much more boring. And, honestly, much scarier.
No invasion.
No malware.
Just an internal error.
It was classic "friendly fire": highly qualified people trying to improve the system ended up bringing the whole house down. This happens far more often than companies admit publicly.
And this is exactly when ITIL (that "alphabet soup" many devs roll their eyes at) stops being bureaucracy and becomes operational survival. Information Security isn't just firewalls and antivirus. It's ensuring that the people holding the house keys don't accidentally set the sofa on fire.
Before blaming the engineers, we need to understand what tool was missing from the table.
What is ITIL and what does it actually solve?
ITIL (Information Technology Infrastructure Library) isn't just a thick manual for certification exams. In practice, it's the operating system for an IT department that doesn't break.
It solves three pains that kill digital operations:
1. The chaos of firefighting
- Without ITIL, every incident is solved differently.
- With ITIL, there is process, pattern, and predictability.
2. The dependency on heroes
- You know that one dev who, if they win the lottery and leave, paralyzes the company?
- ITIL documents, distributes, and institutionalizes that knowledge.
3. The delivery of value
- IT stops "turning server screws" and starts guaranteeing services.
- Email working matters more than the green LED on the rack.
ITIL exists to provide predictability and resilience. Simple as that.
How ITIL reduces operational risk (the CIA triad)
Techies usually turn their noses up at Service Management. They see paperwork. And they idolize cybersecurity, thinking it's like an action movie.
But they play on the same team.
Information Security rests on a triad called CIA:
- Confidentiality (no one reads what they shouldn't).
- Integrity (no one changes what they shouldn't).
- Availability (the system works when needed).
ITIL is the guardian of availability.
Think with me. When Cloudflare goes down due to a config error, the result for the user is identical to a hacker attack: an error screen. The "A" of the triad broke just the same.
Change Management (controlling what changes in the system) isn't bureaucracy for auditors. It is a vital security control.
What really caused the Cloudflare and AWS outages: Changes without checks
Forget the conspiracy theories. The official reports, the famous post-mortems, show that the problem was a lack of brakes while speeding up.
1. The Cloudflare case (November 18, 2025)
- The symptom: half the internet vanished.
- What broke: a change in database permissions caused the system to output duplicated entries into the Bot Management feature file. This file grew far beyond its expected size and propagated across Cloudflare's global inspection layer. The oversized file exceeded memory limits in critical processes, triggering a cascading failure across the global protection system.
- Where the process failed: the update was not validated in a production-equivalent environment, and the rollback mechanism depended on the same component that had already crashed. This made rollback impossible once failure began. A controlled rollout such as a canary release would have reduced the blast radius significantly.
- (Source: Cloudflare Post-Mortem, Nov 2025)
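The two gaps named above, validating a generated file before it ships and limiting the blast radius with a canary, can be sketched in a few lines of Python. This is a minimal illustration, not Cloudflare's actual pipeline: the `MAX_FEATURES` limit, the stage fractions, and the `push` and `healthy` callbacks are all hypothetical.

```python
from typing import Callable, Sequence

MAX_FEATURES = 200  # hypothetical hard limit the consuming process can hold


def validate_features(features: Sequence[str], limit: int = MAX_FEATURES) -> list[str]:
    """Return a list of problems; an empty list means the file is safe to ship."""
    problems = []
    if not features:
        problems.append("file is empty")
    if len(features) > limit:
        problems.append(f"{len(features)} entries exceeds limit of {limit}")
    if len(set(features)) != len(features):
        problems.append("duplicate entries found")
    return problems


def canary_rollout(
    features: Sequence[str],
    hosts: Sequence[str],
    push: Callable[[str, Sequence[str]], None],
    healthy: Callable[[str], bool],
    stages: Sequence[float] = (0.01, 0.10, 1.0),
) -> bool:
    """Push to a growing fraction of hosts; stop at the first unhealthy one."""
    if validate_features(features):
        return False  # never ship a file that fails validation
    done = 0
    for fraction in stages:
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[done:target]:
            push(host, features)
            if not healthy(host):
                return False  # halt: remaining hosts keep the old file
        done = max(done, target)
    return True
```

With 100 hosts and the default stages, a bad file that passes validation but crashes the first host stops after one machine instead of one hundred. Reverting the hosts already touched is left out of the sketch and would need its own tested path.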
2. The AWS case (October 2025)
- The symptom: critical services like DynamoDB and Lambda stopped in the US-EAST-1 region.
- What broke: a latent defect in an automated DNS management system produced an empty DNS record for the DynamoDB regional endpoint (dynamodb.us-east-1.amazonaws.com). With no valid endpoint, the resolver fleet failed to route requests, triggering cascading failures across many dependent AWS services.
- Where the process failed: automation was treated as infallible. There was no effective sanity check or human-validation layer to prevent a script or process from deleting or corrupting critical production infrastructure. The lack of gradual rollout or fallback mechanism amplified the impact.
- (Source: AWS Post-Event Summaries)
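A sanity check of the kind described above could look roughly like this. It is a sketch under stated assumptions, not how AWS manages DNS: the `DnsPlan` type, the `max_shrink` threshold, and the example addresses are all invented for illustration. The idea is simply that automation refuses to publish an empty or drastically smaller record and falls back to the last known good state.

```python
import logging
from dataclasses import dataclass


@dataclass(frozen=True)
class DnsPlan:
    name: str                   # record name, e.g. a regional endpoint
    addresses: tuple[str, ...]  # IPs the record should resolve to


class UnsafePlan(Exception):
    """Raised when a proposed plan looks destructive."""


def check_plan(current: DnsPlan, proposed: DnsPlan, max_shrink: float = 0.5) -> None:
    """Raise UnsafePlan if the proposed record is empty or shrinks too much."""
    if proposed.name != current.name:
        raise UnsafePlan("plan targets a different record")
    if not proposed.addresses:
        raise UnsafePlan(f"refusing to publish an empty record for {proposed.name}")
    floor = len(current.addresses) * (1 - max_shrink)
    if len(proposed.addresses) < floor:
        raise UnsafePlan("plan removes too many addresses at once; needs human approval")


def apply_plan(current: DnsPlan, proposed: DnsPlan) -> DnsPlan:
    """Return the plan to serve: the proposed one if safe, else the current one."""
    try:
        check_plan(current, proposed)
    except UnsafePlan as reason:
        logging.warning("plan rejected: %s", reason)
        return current  # fall back to last known good
    return proposed
```

The check is deliberately dumb. It does not need to understand why the automation produced an empty record, only that an empty record for a production endpoint is never a valid outcome.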
3 governance controls that improve your daily life
If you don't want to be the next negative headline on TechCrunch, stop treating Governance as the enemy of speed. Do the basics well.
1. The Agile CAB (change advisory board)
The modern CAB isn't a coffee meeting. It's risk-focused. Before running any critical script, the question must be: "If this throws a blue screen, what is the exact command to undo it in 1 minute?" At Cloudflare, the rollback failed because there was no plan for total collapse.
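That CAB question can be turned into an automated gate in the deploy pipeline. The sketch below is one possible shape, not an ITIL-defined artifact: the `ChangeRequest` fields and the one-minute threshold are assumptions taken from the question above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRequest:
    summary: str
    critical: bool
    rollback_command: Optional[str] = None    # the exact command to undo the change
    rollback_tested: bool = False             # has the rollback been rehearsed?
    rollback_minutes: Optional[float] = None  # measured duration of that rehearsal


def cab_gate(change: ChangeRequest, max_rollback_minutes: float = 1.0) -> list[str]:
    """Return blocking findings; an empty list means the change may proceed."""
    if not change.critical:
        return []  # low-risk changes skip the gate
    findings = []
    if not change.rollback_command:
        findings.append("no rollback command documented")
    if not change.rollback_tested:
        findings.append("rollback has never been rehearsed")
    if change.rollback_minutes is None or change.rollback_minutes > max_rollback_minutes:
        findings.append(f"rollback not proven to finish within {max_rollback_minutes} min")
    return findings
```

A critical change submitted with no rollback details returns three findings and is blocked. The same change with a documented, rehearsed, 30-second rollback returns an empty list and goes through without a meeting.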
2. CMDB: total visibility
The CMDB (Configuration Management Database) is your IT inventory. If AWS changes a server and doesn't know which apps depend on it, it's Russian roulette. You can't protect what you can't see. Keeping it up to date is a security prerequisite.
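To make "which apps depend on it" concrete, here is a minimal sketch of a blast-radius query over a dependency map. The `DEPENDENTS` data is a made-up CMDB extract; a real CMDB would be queried through its own API.

```python
from collections import deque

# Hypothetical CMDB extract: each item maps to the items that depend on it.
DEPENDENTS = {
    "dns-automation": ["database-endpoint"],
    "database-endpoint": ["functions", "checkout-api"],
    "functions": ["image-resizer"],
    "checkout-api": [],
    "image-resizer": [],
}


def blast_radius(item: str, dependents: dict[str, list[str]]) -> set[str]:
    """Return every item that directly or indirectly depends on `item`."""
    seen: set[str] = set()
    queue = deque([item])
    while queue:
        # Breadth-first walk: visit each dependent once, then its dependents.
        for child in dependents.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Calling `blast_radius("dns-automation", DEPENDENTS)` returns all four downstream items, which is the list a reviewer should see before approving a change to that component.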
3. Unified incident management
When the site crashes, the DevOps team rushes to restore the backup, and the Security team rushes to check the logs. Working in parallel silos like this is slow. A modern, ITIL-aligned alternative is an integrated SIRT (Security Incident Response Team): every incident is treated as a security risk until proven otherwise, with the focus on rapid restoration.
Takeaway: DevOps needs brakes
The incidents of 2025 leave a brutal lesson for any manager, dev, or CEO.
You can have the best cybersecurity system in the world. If your change process is immature, you will take down your own company.
DevOps is the engine that speeds up the car. ITIL and Security are the brakes and the seatbelt. Try driving at 200 km/h without them and see what happens.
And in your company?
Does your deploy process have a tested "Rollback" button, or is it all based on faith and prayer?
Comment below: what was the most expensive or bizarre case of "friendly fire" you've ever witnessed? (No names needed, let's protect the innocent 🤫).
If this post opened your eyes, drop a like/unicorn and share it with your infra team.