DEV Community

Mariusz Gębala

Posted on • Originally published at haitmg.pl

89 critical vulnerabilities and nothing is on fire

Trivy runs every month, and this time was no different. The results came through the pipeline straight into our channel: across 54 container images, 89 CRITICAL findings. Just another routine scan.

A short while later, the CEO sent me a note. Nothing wordy - he just asked whether the issue in the security report needed immediate attention, and whether we were facing something serious.

It's harder to answer than you might think. The honest answer is: probably not - but there's a reason behind it, and that sort of reply never fits neatly into a quick message. Explaining takes space. Space rarely given.

Things didn't go well at first. Early on, my move was to toss over a Trivy JSON file with a note saying "see attached." Not one person looked. Then I flipped entirely - long walls of text unpacking each CVE one by one. Still nothing. One day the CTO cut in while I was presenting and told me to "say it in plain English." We were already speaking English.

Figuring out what actually lands doesn't happen fast. It took time, but now it makes sense.

Why raw scan results don't work for leadership

A monthly Trivy pass over 54 container images in a Kubernetes cluster typically produces something like this:

| Severity | Count |
| --- | ---: |
| CRITICAL | 89 |
| HIGH | 612 |
| MEDIUM | 1,247 |
| LOW | 731 |
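A tally like this can be produced straight from Trivy's JSON output (`trivy image --format json`). A minimal sketch - the function name and file-list input are my own choices, not part of our actual pipeline:

```python
import json
from collections import Counter

def severity_counts(report_paths):
    """Tally findings by severity across a set of Trivy JSON reports."""
    counts = Counter()
    for path in report_paths:
        with open(path) as f:
            report = json.load(f)
        # Trivy nests findings under Results[].Vulnerabilities[];
        # either key can be absent (or null) for clean targets.
        for result in report.get("Results", []):
            for vuln in result.get("Vulnerabilities") or []:
                counts[vuln.get("Severity", "UNKNOWN")] += 1
    return counts
```

Feed it one report per image and the Counter gives you the severity table in one call.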

That 89 might seem alarming at first glance - yet nearly all those flaws sit inside systems tucked behind our firewall. Since they're locked down by the VPN, outsiders can't touch them directly. Without checking which parts connect to the web, the figure tells you very little. What matters is access, not just counts.

What management actually wants to know:

  • Can any of these actually be exploited in the way our environment is set up?
  • Are any of the affected components exposed to the internet?
  • Is the situation getting worse, or are we just seeing newly published CVEs?
  • What have we already patched, and what's still open?

CVE-2024-something means nothing to them. What matters is risk.

How I organize these reports today

Finding the right structure took ages. Eventually I settled on four parts. A plain setup, nothing flashy.

What we scanned and what shifted

Without scope, people just see big numbers and freak out. So I always open with what we actually scanned:

"We scan all 54 infrastructure container images using Trivy every month. This covers cluster stuff - storage controllers, networking, monitoring, databases. Not business applications. This month we upgraded 5 components and did a Kubernetes engine bump."

This does two things. Sets up that scanning is routine, not a fire drill. And anchors those numbers to infra images specifically. 89 CRITICAL across 54 infra containers hits different than 89 CRITICAL on your customer-facing app.

Why the number went up after we patched

This always gets the same reaction. We upgraded stuff. The number went up. What?

It looks weird on a slide, yeah. But it's usually not a regression. In the month between scans, 27 new CVEs were published in the public databases. On its next pass, Trivy spots these newly listed issues and flags them. The systems didn't get worse - more flaws in widely used code simply came to light.
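One way to make this visible in the report is to diff the CVE IDs between last month's and this month's scans. A sketch, assuming two Trivy JSON reports for the same target; the function names are illustrative:

```python
import json

def cve_ids(report_path):
    """Collect the set of vulnerability IDs from one Trivy JSON report."""
    with open(report_path) as f:
        report = json.load(f)
    ids = set()
    for result in report.get("Results", []):
        for vuln in result.get("Vulnerabilities") or []:
            ids.add(vuln["VulnerabilityID"])
    return ids

def scan_delta(previous_path, current_path):
    """Split this month's picture into newly listed, fixed, and carried over."""
    prev, curr = cve_ids(previous_path), cve_ids(current_path)
    return {
        "new": curr - prev,      # published (or introduced) since last scan
        "fixed": prev - curr,    # gone after this month's upgrades
        "carried": curr & prev,  # still open
    }
```

Showing "we fixed 14, and 27 new ones were published" tells a very different story than a single total that went up.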

Without a clear explanation, managers will assume progress is slipping. That's the intuitive model - fewer issues means the fixes are working. Break that assumption early, or every update turns into a defense.

Breakdown by actual risk

Here's where things usually go off track. Most folks rank flaws by CVSS score alone, then call it done. That 9.8 rating? Looks terrifying - until you read the fine print. Imagine needing to send a specially crafted CMS packet to a module that ignores such packets entirely. Sitting on an internal network. Shielded by a VPN. Suddenly, not so urgent.

For every group of findings I answer three things: what it means in plain terms, whether it affects how we operate right now, and when the fix will land.

A few CVEs showed up in the XML parser our load balancer uses. Might sound alarming at first glance. This particular device handles traffic moving between backend systems only. External data never reaches it directly. Any exploitation would demand prior access to our internal network. The fix is ready - just held back until the vendor's upcoming update drops. That rollout fits into the planned March maintenance window, same time as always: first Wednesday, 22:00 to 23:00 CET.

Here's how it looks in table form:

| Component group | CRITICALs | Real risk | Internet-exposed |
| --- | ---: | --- | --- |
| Storage system | 32 | Low | No |
| Cluster networking | 12 | Low | No |
| Kubernetes engine | 19 | Low | No |
| Monitoring stack | 9 | Low | No |
| Load balancer | 6 | Minimal | No |
| Tracing + auth proxy | 5 | Low | No |
| Dashboards | 4 | Low | No |
| Log shipping | 2 | Minimal | No |
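That exposure column comes from us, not from the scanner - Trivy has no idea what sits behind the VPN. A sketch of how that mapping could live next to the scan tooling; the component names here are illustrative, not Trivy output:

```python
# Hand-maintained exposure map, reviewed whenever the ingress setup
# changes. The scanner can't infer this, so a human has to own it.
EXPOSURE = {
    "storage-system": False,
    "cluster-networking": False,
    "load-balancer": False,   # internal traffic only, not on the TLS path
    "monitoring-stack": False,
}

def annotate(findings):
    """Attach the internet-exposure flag to each finding. Unknown
    components default to exposed, so a new image gets looked at
    instead of silently ignored."""
    return [
        {**f, "internet_exposed": EXPOSURE.get(f["component"], True)}
        for f in findings
    ]
```

Defaulting unknowns to "exposed" is deliberate: the failure mode should be an unnecessary look, never a missed internet-facing component.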

That "No" column all the way down? That's the whole point. Everything sits behind the VPN. TLS terminates at the ingress controller but the vulnerable components aren't on that path. To exploit any of this you'd need to already be inside the network.

What we did, what we're doing, what we're blocked on

Actions with dates. Just that.

| Priority | Action | Timeline |
| --- | --- | --- |
| 1 | Keep an eye on Go stdlib updates (impacts 25+ images) | Ongoing |
| 2 | Move storage system to the updated version | Next maintenance window |
| 3 | Update load balancer for XML and SQLite fixes | March |
| 4 | Fix tracing setup once the upstream patch arrives | After upstream release |

The tracing one is stuck because the latest version breaks dashboard rendering. Maintainers know about it, there's an open issue with 40+ comments, but no fix yet. So we wait. That's not us being slow - that's the reality of running on open-source.

And this is what management needs to get: most of our "inaction" is waiting for upstream. We use official container images. When Go stdlib gets a CVE, every single Go-based tool in the cloud-native ecosystem gets hit. We can't fix it before the Go team does.
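The fan-out is easy to demonstrate: group findings by CVE and count how many images each one touches. A sketch, assuming findings flattened to (image, cve) pairs - the shape of the input is my assumption, not our actual pipeline format:

```python
from collections import defaultdict

def images_per_cve(findings):
    """Group (image, cve) pairs by CVE, most widespread first, so one
    Go stdlib CVE appearing in 25+ images reads as a single shared
    root cause rather than 25 separate fires."""
    hits = defaultdict(set)
    for image, cve in findings:
        hits[cve].add(image)
    return dict(sorted(hits.items(), key=lambda kv: -len(kv[1])))
```

The top of that list is usually a handful of entries accounting for most of the CRITICAL total - exactly the framing the next section is about.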

The framing that helped the most

Most of those 89 CRITICALs weren't 89 different fires. It was a handful of root causes - mainly the Go stdlib thing - that propagated across every Go-based image we run. Not "our infra is broken." More like "the entire cloud-native ecosystem has a known issue and we're tracking the upstream fix like everyone else."

When I frame it like that in the report, the reaction shifts. Not "why is our stuff broken" but "ok, so this is an industry thing." Which is what it actually is.

Things that went wrong before I figured this out

Once I sent a report that was just the raw numbers along with "5 components upgraded this month." No context on the CRITICAL count. My manager forwarded it to the VP. The VP saw 89 CRITICAL and almost delayed a product launch. Over vulnerabilities in internal monitoring tools that aren't even reachable from outside. Took a 30-minute call to undo that.

Another time a manager asked me "when will we have zero vulnerabilities?" I said "never" without any explanation. That went over about as well as you'd expect. What I should have said - and what I say now - is that any infrastructure running open-source will always have known CVEs. Always. Not about reaching zero. It's about whether they're exploitable in our specific environment and whether we have a process to patch them.

The worst is sending numbers without the "so what." People see 89 CRITICAL and fill in the blanks with their imagination. And their imagination usually involves hackers. Give them the numbers AND the context or don't send anything at all.

Why I keep doing this monthly

Writing these reports isn't fun. Not why I got into this field. But they've made my life easier in ways I didn't expect.

When our CEO now sees "storage system upgrade" on the maintenance calendar, he already knows that means getting rid of 32 critical vulnerabilities in our storage layer. No need to justify the downtime window anymore. He just approves it. I send a message and get a thumbs up.

That only works because he's been reading these reports for months and trusts the process. Without that trust, every maintenance window is a negotiation.

Zero CVEs is never going to happen. Can't happen, won't happen. But a leadership team that understands what the numbers mean, trusts that you're handling things, and doesn't block your maintenance windows? That's doable. And it starts with a report that doesn't just dump numbers.


Running Kubernetes in production and need help with vulnerability management? Let's talk.
