KubeCon + CloudNativeCon EU 2026 · Amsterdam · March 23–26
More than 13,000 engineers gathering around infrastructure might sound excessive until you realize what they're really there for: understanding how the next generation of systems is being built in real time.
All sessions referenced in this article are available through the CNCF KubeCon recordings.
The Why
I almost didn't go.
KubeCon felt overwhelming—too big, too technical, too crowded. But something about the energy of thousands of engineers gathering around the future of infrastructure made the trip worth it.
Not for the keynotes or the networking (though both mattered). I went because infrastructure is changing faster than most organizations can operationalize it, and I wanted to understand where the ecosystem was converging.
What I found was not a week of dramatic announcements or paradigm shifts.
It was something more interesting: operational maturity.
Across sessions, hallway conversations, and product announcements, the same themes kept repeating:
- observability moving deeper into the kernel,
- platform engineering focusing on developer cognition,
- AI workloads becoming operational infrastructure,
- and agentic systems forcing teams to rethink reliability entirely.
What became clear by the end of the week was this:
The cloud-native ecosystem is beginning to build the operational layer for AI agents the same way it once built the operational layer for containers—incrementally, pragmatically, and one infrastructure problem at a time.
01. LLM Inference on Kubernetes: Infrastructure Becomes the Product
The GKE session on optimizing large language models on Kubernetes was the first talk that shifted my perspective.
Not because it introduced radically new ideas, but because the conversation felt deeply operational.
The core challenge was straightforward:
LLMs are not typical workloads.
Inference systems introduce sustained resource pressure across networking, scheduling, memory allocation, and accelerator management in ways many Kubernetes environments were not originally designed for.
The session covered:
- model serving frameworks like vLLM, TGI, Triton, and Ray Serve,
- Kubernetes Dynamic Resource Allocation (DRA),
- GPU orchestration,
- and increasingly sophisticated networking strategies for inference optimization.
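To make the shape of these workloads concrete, here is a minimal, hypothetical Deployment for a vLLM server. The image tag, model name, and replica count are illustrative assumptions, and a production setup would add health probes, autoscaling, and, increasingly, DRA resource claims instead of the classic device-plugin GPU request shown here.

```yaml
# Hypothetical sketch: serving an LLM with vLLM on Kubernetes.
# Image, model, and replica count are illustrative, not a recommendation.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "meta-llama/Llama-3.1-8B-Instruct"]
          ports:
            - containerPort: 8000
          resources:
            limits:
              nvidia.com/gpu: 1  # device-plugin style; DRA enables finer-grained claims
```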
One recurring theme was KV cache efficiency and routing.
Not because it is flashy, but because inference optimization increasingly comes down to infrastructure efficiency rather than model novelty.
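The routing half of that theme can be illustrated with a toy sketch: if requests sharing a prompt prefix consistently land on the same replica, that replica's KV cache entries for the prefix can be reused instead of recomputed. This is not any particular gateway's algorithm, just the idea, using a hash of the leading words as a crude stand-in for tokenization.

```python
import hashlib

REPLICAS = ["replica-a", "replica-b", "replica-c"]  # hypothetical inference pods

def route(prompt: str, prefix_words: int = 32) -> str:
    """Send requests that share a prompt prefix to the same replica,
    so its KV cache entries for that prefix can be reused."""
    prefix = " ".join(prompt.split()[:prefix_words])  # crude token proxy: whitespace words
    digest = hashlib.sha256(prefix.encode()).hexdigest()
    return REPLICAS[int(digest, 16) % len(REPLICAS)]

# Two requests sharing a long system prompt hit the same replica,
# even though their user questions differ.
system = "You are a careful assistant. " * 8
assert route(system + "Summarize this doc.") == route(system + "Translate this doc.")
```

Real inference gateways do this with actual token prefixes and live cache state, but the infrastructure framing is the same: throughput comes from placement, not from the model.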
What stood out most was how normalized these conversations felt.
AI infrastructure discussions at KubeCon no longer sounded experimental. They sounded operational.
The Learning
The challenge with AI workloads is increasingly operational rather than conceptual.
Model access is becoming commoditized.
Reliable orchestration, scheduling, observability, and cost control are becoming the differentiators.
02. Backstage & the Philosophy of Developer Experience
Spotify's talk on Backstage was one of the more interesting non-technical sessions of the week.
A story from the session stayed with me:
Spotify teams had experienced the familiar problem many fast-growing engineering organizations encounter—operational knowledge becoming fragmented across tools, documentation systems, spreadsheets, ownership records, and tribal knowledge.
The example illustrated a broader organizational truth:
engineering complexity often grows faster than internal systems evolve to manage it.
Backstage emerged from Spotify's effort to centralize operational context and developer workflows into a more coherent platform experience.
What matters here is not only the tool itself, but the philosophy behind it.
Developers should not need deep infrastructure expertise simply to deploy software safely and reliably.
Backstage approaches this by treating operational metadata as infrastructure:
- ownership information,
- deployment workflows,
- dependency visibility,
- templates,
- scorecards,
- and documentation become integrated directly into the developer workflow.
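In Backstage terms, that metadata lives in a catalog descriptor checked in next to the code. A minimal, illustrative catalog-info.yaml (component and team names are placeholders):

```yaml
# Illustrative Backstage catalog descriptor; names are placeholders.
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: payments-service
  description: Handles payment processing
  annotations:
    backstage.io/techdocs-ref: dir:.  # docs live alongside the code
spec:
  type: service
  lifecycle: production
  owner: team-payments                # ownership is explicit, not tribal
  dependsOn:
    - component:ledger-service        # dependency visibility in the catalog
```

The point of the format is exactly the philosophy above: ownership, dependencies, and docs stop living in spreadsheets and start living with the service itself.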
What stood out was how operational context became centralized into a single interface.
Backstage was not acting like a dashboard.
It was acting more like an internal platform layer for developers.
The most important insight from the session was organizational rather than technical:
platform engineering succeeds when it reduces cognitive fragmentation.
The Learning
The strongest platform teams optimize for cognitive clarity as aggressively as they optimize for system reliability.
Golden paths scale better than undocumented complexity.
03. Cross-AZ Observability & the Real Cost of Visibility
Miro's session on cross-AZ observability costs highlighted something many teams underestimate:
observability architecture itself can become a significant infrastructure cost center.
When workloads run across availability zones, metrics and telemetry crossing network boundaries generate measurable egress costs.
At scale, observability design decisions become infrastructure decisions.
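A back-of-envelope calculation shows why. The numbers below are hypothetical, not Miro's figures; typical cloud inter-AZ transfer is billed around $0.01/GB in each direction.

```python
# Hypothetical cross-AZ telemetry cost (NOT Miro's figures).
gb_per_day = 2_000      # telemetry crossing AZ boundaries daily (assumed)
rate_per_gb = 0.02      # ~$0.01 egress + ~$0.01 ingress per GB (assumed)
monthly_cost = gb_per_day * rate_per_gb * 30
print(f"${monthly_cost:,.0f}/month")  # $1,200/month for metric transfer alone
```

At that scale, keeping scrapes inside a zone is not a micro-optimization; it is a line item.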
Miro discussed a relatively straightforward but effective pattern:
zone-aware scraping.
Prometheus scraped local targets, aggregated locally, and minimized unnecessary cross-zone metric transfer.
The session also highlighted VictoriaMetrics, which has gained attention for focusing heavily on efficiency and operational simplicity in metrics storage.
What made the talk compelling was not novelty.
It was practicality.
The operational maturity of cloud-native infrastructure increasingly depends on efficiency optimization at every layer.
What Happened Post-KubeCon
Shortly after KubeCon:
- Splunk announced OpenTelemetry eBPF Instrumentation (OBI) in beta,
- and Grafana continued integrating projects like Beyla into broader OpenTelemetry workflows.
The larger trend is becoming clearer:
observability instrumentation is moving closer to the kernel layer through eBPF, while operational standards increasingly converge around OpenTelemetry.
The Learning
At scale, observability becomes an architectural discipline rather than simply a tooling choice.
Tooling amplifies operational design decisions already embedded into the system.
04. AI Agents & Platform Engineering: Reliability for Non-Deterministic Systems
The panel on AI Agents & Platform Engineering was the session that tied many of the week's themes together.
Panelists:
- Idit Levine (Solo.io)
- Vincent Caldeira (Red Hat)
- Hasith Kalpage (Cisco)
- Sara Qasmi (United Nations)
- Carlos Santana (AWS, moderator)
The central tension discussed throughout the panel was this:
AI agents are probabilistic systems operating inside infrastructure environments historically optimized for deterministic behavior.
Traditional platform engineering assumes:
- reproducibility,
- consistency,
- predictable deployments,
- and stable execution paths.
Agentic systems challenge many of those assumptions.
The conversation repeatedly returned to observability, evaluation, and governance.
Rather than forcing agents into deterministic behavior models, the emerging operational pattern appears to focus on:
- continuous evaluation,
- instrumentation,
- permissions boundaries,
- and measurable reliability.
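What "continuous evaluation" means in practice can be as simple as scoring every agent run against a set of named checks and tracking the pass rate over time, rather than asserting correctness once at deploy time. A toy sketch, not any particular framework's API:

```python
from dataclasses import dataclass, field

@dataclass
class EvalHarness:
    """Score each agent run against named checks; reliability is a rate, not a boolean."""
    results: list = field(default_factory=list)

    def record(self, run_id: str, checks: dict) -> None:
        self.results.append(
            {"run": run_id, "passed": all(checks.values()), "checks": checks}
        )

    def pass_rate(self) -> float:
        if not self.results:
            return 0.0
        return sum(r["passed"] for r in self.results) / len(self.results)

harness = EvalHarness()
harness.record("run-1", {"cited_sources": True, "stayed_in_scope": True})
harness.record("run-2", {"cited_sources": False, "stayed_in_scope": True})
print(f"pass rate: {harness.pass_rate():.0%}")  # pass rate: 50%
```

The cultural shift is in the return type: a probabilistic system gets a rate and a trend, not a green checkmark.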
One of the strongest moments from the panel came from Vincent Caldeira:
"Agentic vulnerability is statistical, not deterministic."
That framing changes the operational question entirely.
Instead of asking:
"Is this system perfectly safe?"
Teams increasingly ask:
"Is this system measurably safer, more observable, and more governable than the existing human process?"
Another concept discussed heavily was the emergence of reusable "Skills" and tool abstractions for agents.
The architecture forming around agentic systems increasingly resembles familiar cloud-native operational patterns:
- modular capabilities,
- registries,
- sandboxed execution,
- observability,
- and governance layers.
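Those patterns map onto code quite directly: a registry of permission-scoped tools is structurally a service catalog. A minimal, hypothetical sketch of the discovery-plus-governance idea (not any announced project's API):

```python
from typing import Callable

class ToolRegistry:
    """Central registry: agents discover tools by name, and each tool
    declares the permissions required before it can be invoked."""

    def __init__(self):
        self._tools: dict = {}

    def register(self, name: str, fn: Callable, requires: set) -> None:
        self._tools[name] = (fn, requires)

    def invoke(self, name: str, granted: set, *args):
        fn, requires = self._tools[name]
        missing = requires - granted
        if missing:
            raise PermissionError(f"{name} needs {sorted(missing)}")
        return fn(*args)

registry = ToolRegistry()
registry.register("read_ticket", lambda tid: f"ticket {tid}", requires={"tickets:read"})

print(registry.invoke("read_ticket", {"tickets:read"}, "T-42"))  # ticket T-42
```

Swap the dict for a CRD-backed store and the permission check for policy evaluation, and it starts to look like the governance layer the panel was describing.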
What Happened at KubeCon (and After)
Solo.io announced:
- agentevals — an open-source framework for evaluating agent behavior using OpenTelemetry,
- agentregistry — donated to the CNCF ecosystem, focused on centralized discovery and governance for agents and tools.
These announcements felt notable not because they solved everything, but because they suggested the ecosystem is beginning to standardize operational patterns for agentic infrastructure.
The Learning
The shift from LLMs to agents is not simply about smarter models. It is about infrastructure adapting to probabilistic operational systems.
Observability, evaluation, governance, and orchestration are becoming foundational concerns.
The North Star: Where the Ecosystem Appears to Be Going
By Thursday afternoon, several patterns had become difficult to ignore.
The same operational themes kept surfacing:
- platform engineering,
- eBPF,
- OpenTelemetry,
- AI infrastructure,
- operational efficiency,
- and governance.
Three broader shifts stood out.
1. Platform Engineering ↔ eBPF
Infrastructure conversations are increasingly moving in two directions at once:
- upward toward developer experience,
- and downward toward kernel-level visibility and security.
eBPF sits at the center of that transition.
Instrumentation is becoming more deeply integrated into infrastructure itself while becoming increasingly invisible to developers.
2. AI on Kubernetes Is Becoming Operational Infrastructure
AI workloads are rapidly becoming standard platform concerns.
Platform teams are now regularly discussing:
- GPU scheduling,
- inference networking,
- accelerator orchestration,
- model serving reliability,
- and operational cost control.
The tooling ecosystem around Kubernetes AI workloads is maturing quickly.
3. Efficiency Is Becoming a Core Operational Metric
Energy usage, infrastructure efficiency, observability overhead, and GPU utilization are increasingly treated as operational concerns rather than secondary optimizations.
The broader trend is not only about sustainability messaging.
It is also about economic reality.
Efficient infrastructure compounds.
What This Means for Platform Teams
Several practical themes emerged repeatedly throughout the week.
Understand Observability as Infrastructure
Observability is no longer just a monitoring layer.
It increasingly shapes architecture, reliability, debugging, governance, and operational cost.
Prepare for Agentic Workloads
Even organizations not deploying agents today are beginning to think about:
- evaluation,
- permissions,
- governance,
- sandboxing,
- and auditability.
The operational questions are arriving quickly.
Efficiency Is Becoming Strategic
Cloud costs, telemetry overhead, and infrastructure efficiency are becoming platform-level concerns.
Operational efficiency increasingly creates long-term competitive advantages.
Developer Experience Is Infrastructure Work
The strongest platform teams reduce cognitive overhead.
They centralize operational context.
They provide safe defaults.
They simplify complexity without hiding it.
That is infrastructure engineering now.
The Real Value of KubeCon
At some point during a conference like KubeCon, you stop trying to absorb every session.
You start paying attention to the patterns connecting them.
That was the most valuable part of the week for me.
Not only the talks themselves, but the conversations afterward:
- engineers comparing operational tradeoffs,
- platform teams solving similar scaling problems independently,
- and a growing sense that infrastructure is evolving faster than most organizations can fully absorb.
The cloud-native ecosystem in 2026 feels less experimental than it did several years ago.
The tooling is maturing.
The operational patterns are stabilizing.
The architectural layers are becoming clearer.
And the next frontier—agentic systems, kernel-level observability, and infrastructure efficiency—appears to be emerging from the same engineering discipline that previously shaped the container ecosystem itself.
The most valuable reminder from KubeCon is that these are collective problems.
Thousands of engineers are iterating toward similar operational solutions at the same time.
Go watch the sessions.
But more importantly, operationalize one useful idea inside your own systems.
That is where transformation actually happens.
Infrastructure is no longer simply supporting transformation. Increasingly, it is becoming the mechanism through which transformation happens.
Resources
This article draws from sessions and discussions involving Google Cloud, Spotify Engineering, Miro, Solo.io, Red Hat, Cisco, AWS, and other contributors across the cloud-native ecosystem.
By Soumia — LinkedIn · Portfolio
Are you working on something similar? Drop a comment — I'm curious what you're building and what you're seeing in your own work.