Soumia

Posted on May 8 • Edited on May 17

What KubeCon Amsterdam 2026 Taught Me About Infrastructure as Transformation

#kubecon #kubernetes #cloudnative #observability

KubeCon + CloudNativeCon EU 2026 · Amsterdam · March 23–26

More than 13,000 engineers gathering around infrastructure might sound excessive until you realize what they're really there for: understanding how the next generation of systems is being built in real time.

All sessions referenced in this article are available through the CNCF KubeCon recordings.

The Why

I almost didn't go.

KubeCon felt overwhelming—too big, too technical, too crowded. But something about the energy of thousands of engineers gathering around the future of infrastructure made the trip worth it.

I went because infrastructure is changing faster than most organizations can operationalize it, and I wanted to understand where the ecosystem was converging — and how to explain that shift to the people who need to act on it.

What I found was not a week of dramatic announcements or paradigm shifts.

It was something more interesting: operational maturity.

Across sessions, hallway conversations, and product announcements, the same themes kept repeating:

observability moving deeper into the kernel,
platform engineering focusing on developer cognition,
AI workloads becoming operational infrastructure,
and agentic systems forcing teams to rethink reliability entirely.

What became clear by the end of the week was this:

The cloud-native ecosystem is beginning to build the operational layer for AI agents the same way it once built the operational layer for containers—incrementally, pragmatically, and one infrastructure problem at a time.

01. LLM Inference on Kubernetes: Infrastructure Becomes the Product

The GKE session on optimizing large language models on Kubernetes was the first talk that shifted my perspective.

Not because it introduced radically new ideas, but because the conversation felt deeply operational.

The core challenge was straightforward:
LLMs are not typical workloads.

Inference systems introduce sustained resource pressure across networking, scheduling, memory allocation, and accelerator management in ways many Kubernetes environments were not originally designed for.

The session covered:

model serving frameworks like vLLM, TGI, Triton, and Ray Serve,
Kubernetes Dynamic Resource Allocation (DRA),
GPU orchestration,
and increasingly sophisticated networking strategies for inference optimization.

One recurring theme was KV cache efficiency and routing.

Not because it is flashy, but because inference optimization increasingly comes down to infrastructure efficiency rather than model novelty.
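
To make the routing point concrete: if requests that share a prompt prefix land on the same replica, that replica has a better chance of reusing cached attention state instead of recomputing it. The following is a minimal sketch of that idea, not something shown in the session. The replica names, the `pick_replica` helper, and the fixed-length character prefix used as the routing key are all assumptions for illustration; production inference gateways use more elaborate schemes.

```python
import hashlib

def pick_replica(prompt: str, replicas: list[str], prefix_chars: int = 64) -> str:
    """Route by a hash of the prompt prefix so shared prefixes hit the same replica."""
    if not replicas:
        raise ValueError("need at least one replica")
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer key.
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]

replicas = ["vllm-0", "vllm-1", "vllm-2"]  # hypothetical pod names
system_prompt = "You are a support assistant for an imaginary product. " * 2

# Both requests share the same long system prompt, so they share a routing key.
a = pick_replica(system_prompt + "How do I reset my password?", replicas)
b = pick_replica(system_prompt + "Where is my invoice?", replicas)
```

The trade-off in a scheme like this is load balance: a very popular prefix concentrates traffic on one replica, which is one reason real routers combine cache affinity with load signals.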

What stood out most was how normalized these conversations felt.

AI infrastructure discussions at KubeCon no longer sounded experimental. They sounded operational.

The Learning

The challenge with AI workloads is increasingly operational rather than conceptual.

Model access is becoming commoditized.
Reliable orchestration, scheduling, observability, and cost control are becoming the differentiators.

02. Backstage & the Philosophy of Developer Experience

Spotify's talk on Backstage was one of the more interesting non-technical sessions of the week.

A story from the session stayed with me:
Spotify teams had experienced the familiar problem many fast-growing engineering organizations encounter—operational knowledge becoming fragmented across tools, documentation systems, spreadsheets, ownership records, and tribal knowledge.

The example illustrated a broader organizational truth:
engineering complexity often grows faster than internal systems evolve to manage it.

Backstage emerged from Spotify's effort to centralize operational context and developer workflows into a more coherent platform experience.

What matters here is not only the tool itself, but the philosophy behind it.

Developers should not need deep infrastructure expertise simply to deploy software safely and reliably.

Backstage approaches this by treating operational metadata as infrastructure:

ownership information,
deployment workflows,
dependency visibility,
templates,
scorecards,
and documentation become integrated directly into the developer workflow.
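
One way to picture "operational metadata as infrastructure" is as a typed record that checks can run against. The sketch below is an assumption-laden illustration: `ServiceEntry`, `scorecard`, and `unowned` are hypothetical names, and the record is not Backstage's actual entity schema or API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ServiceEntry:
    """Hypothetical catalog record; not Backstage's actual entity schema."""
    name: str
    owner: Optional[str] = None
    docs_url: Optional[str] = None
    depends_on: list[str] = field(default_factory=list)

def scorecard(entry: ServiceEntry) -> dict[str, bool]:
    """Return simple pass/fail checks on the operational metadata."""
    return {
        "has_owner": bool(entry.owner),
        "has_docs": bool(entry.docs_url),
    }

def unowned(entries: list[ServiceEntry]) -> list[str]:
    """List services that nobody has claimed ownership of."""
    return [e.name for e in entries if not e.owner]

catalog = [
    ServiceEntry("checkout", owner="team-payments",
                 docs_url="https://docs.example/checkout"),
    ServiceEntry("legacy-batch", depends_on=["checkout"]),
]
```

Once ownership and documentation live in one queryable structure, questions like "which services have no owner?" become a function call rather than a search through spreadsheets.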

What stood out was how operational context became centralized into a single interface.

Backstage was not acting like a dashboard.
It was acting more like an internal platform layer for developers.

The most important insight from the session was organizational rather than technical:
platform engineering succeeds when it reduces cognitive fragmentation.

The Learning

The strongest platform teams optimize for cognitive clarity as aggressively as they optimize for system reliability.

Golden paths scale better than undocumented complexity.

03. Cross-AZ Observability & the Real Cost of Visibility

Miro's session on cross-AZ observability costs highlighted something many teams underestimate:
observability architecture itself can become a significant infrastructure cost center.

When workloads run across availability zones, metrics and telemetry crossing network boundaries generate measurable egress costs.

At scale, observability design decisions become infrastructure decisions.
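
A back-of-envelope estimate shows why this adds up. Every input below is an illustrative assumption rather than a figure from the talk: the sample rate, the bytes per sample, the per-GB price, and the idea that with three zones and zone-unaware scraping roughly two thirds of targets sit in a different zone from the scraper.

```python
def monthly_cross_zone_cost(
    samples_per_second: float,
    bytes_per_sample: float,
    cross_zone_fraction: float,
    price_per_gb: float,
) -> float:
    """Estimate the monthly cost of telemetry that crosses zone boundaries."""
    seconds_per_month = 30 * 24 * 3600
    total_bytes = samples_per_second * bytes_per_sample * seconds_per_month
    cross_zone_gb = total_bytes * cross_zone_fraction / 1e9
    return cross_zone_gb * price_per_gb

# Illustrative inputs only: none of these figures come from the talk.
naive = monthly_cross_zone_cost(2_000_000, 100, 2 / 3, 0.02)
zone_aware = monthly_cross_zone_cost(2_000_000, 100, 0.05, 0.02)
```

Under these made-up inputs the cross-zone bill drops by more than a factor of ten when only 5% of traffic crosses a zone boundary. The absolute numbers matter less than the fact that the cost scales linearly with the cross-zone fraction.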

Miro discussed a relatively straightforward but effective pattern:
zone-aware scraping.

Each Prometheus instance scraped targets in its own zone, aggregated locally, and minimized unnecessary cross-zone metric transfer.
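
The selection logic behind zone-aware scraping can be sketched in a few lines. This is an assumption about how such filtering might look, not Miro's implementation: the target dictionaries and the `targets_by_zone` and `local_targets` helpers are hypothetical, and in Prometheus itself this kind of filtering is typically expressed through relabeling rules rather than application code.

```python
from collections import defaultdict

# Well-known Kubernetes topology label; the target layout below is hypothetical.
ZONE_LABEL = "topology.kubernetes.io/zone"

def targets_by_zone(targets: list[dict]) -> dict[str, list[str]]:
    """Group scrape target addresses by their zone label."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for target in targets:
        zone = target["labels"].get(ZONE_LABEL, "unknown")
        grouped[zone].append(target["address"])
    return dict(grouped)

def local_targets(targets: list[dict], scraper_zone: str) -> list[str]:
    """Return only the targets that live in the scraper's own zone."""
    return targets_by_zone(targets).get(scraper_zone, [])

targets = [
    {"address": "10.0.1.5:9100", "labels": {ZONE_LABEL: "zone-a"}},
    {"address": "10.0.2.7:9100", "labels": {ZONE_LABEL: "zone-b"}},
    {"address": "10.0.1.9:9100", "labels": {ZONE_LABEL: "zone-a"}},
]
zone_a = local_targets(targets, "zone-a")
```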

The session also highlighted VictoriaMetrics, which has gained attention for focusing heavily on efficiency and operational simplicity in metrics storage.

What made the talk compelling was not novelty.
It was practicality.

The operational maturity of cloud-native infrastructure increasingly depends on efficiency optimization at every layer.

What Happened Post-KubeCon

Shortly after KubeCon:

Splunk announced OpenTelemetry eBPF Instrumentation (OBI) in beta,
and Grafana continued integrating projects like Beyla into broader OpenTelemetry workflows.

The larger trend is becoming clearer:
observability instrumentation is moving closer to the kernel layer through eBPF, while operational standards increasingly converge around OpenTelemetry.

The Learning

At scale, observability becomes an architectural discipline rather than simply a tooling choice.

Tooling amplifies operational design decisions already embedded into the system.

04. AI Agents & Platform Engineering: Reliability for Non-Deterministic Systems

The panel on AI Agents & Platform Engineering was the session that tied many of the week's themes together.

Panelists:

Idit Levine (Solo.io)
Vincent Caldeira (Red Hat)
Hasith Kalpage (Cisco)
Sara Qasmi (United Nations)
Carlos Santana (AWS, moderator)

The central tension discussed throughout the panel was this:

AI agents are probabilistic systems operating inside infrastructure environments historically optimized for deterministic behavior.

Traditional platform engineering assumes:

reproducibility,
consistency,
predictable deployments,
and stable execution paths.

Agentic systems challenge many of those assumptions.

The conversation repeatedly returned to observability, evaluation, and governance.

Rather than forcing agents into deterministic behavior models, the emerging operational pattern appears to focus on:

continuous evaluation,
instrumentation,
permissions boundaries,
and measurable reliability.
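
"Measurable reliability" for a probabilistic system usually means reporting a success rate with an uncertainty range rather than a single pass or fail. The sketch below uses the Wilson score interval for that purpose. It is an illustration under stated assumptions, not a method presented by the panel: the trial counts are invented, and comparing intervals for non-overlap is a deliberately conservative test.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)

# Invented evaluation results: task outcomes judged correct out of 1,000 runs.
agent_low, agent_high = wilson_interval(970, 1000)
human_low, human_high = wilson_interval(930, 1000)

# Conservative claim: the agent is better only if the intervals do not overlap.
agent_measurably_better = agent_low > human_high
```

With more runs the intervals narrow, which is the practical argument for continuous evaluation: a claim about agent reliability is only as strong as the sample behind it.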

One of the strongest moments from the panel came from Vincent Caldeira:

"Agentic vulnerability is statistical, not deterministic."

That framing changes the operational question entirely.

Instead of asking:

"Is this system perfectly safe?"

Teams increasingly ask:

"Is this system measurably safer, more observable, and more governable than the existing human process?"

Another concept discussed heavily was the emergence of reusable "Skills" and tool abstractions for agents.

The architecture forming around agentic systems increasingly resembles familiar cloud-native operational patterns:

modular capabilities,
registries,
sandboxed execution,
observability,
and governance layers.

What Happened at KubeCon (and After)

Solo.io announced:

agentevals — an open-source framework for evaluating agent behavior using OpenTelemetry.
agentregistry donated to the CNCF ecosystem — focused on centralized discovery and governance for agents and tools.

These announcements felt notable not because they solved everything, but because they suggested the ecosystem is beginning to standardize operational patterns for agentic infrastructure.

The Learning

The shift from LLMs to agents is not simply about smarter models. It is about infrastructure adapting to probabilistic operational systems.

Observability, evaluation, governance, and orchestration are becoming foundational concerns.

05. Uber & The Industrialization of ML: Proving the Abstraction

During a deeply operational look at scaling ML, Uber highlighted how their foundational compute platforms and Michelangelo system have become the backbone for GenAI and deep learning development.

The numbers they shared to illustrate this were staggering:

1 million+ diverse workloads deployed onto 200 Kubernetes clusters across two regions,
20,000 machine learning models trained per month,
5,300 models actively in production,
and over 30 million peak predictions per second across roughly 1,000 serving nodes.
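
To put those figures in per-unit terms, here is the simple division. The inputs are the numbers quoted above; because they carry "+" and "roughly" qualifiers, the results are approximate averages and say nothing about how load is actually distributed.

```python
peak_predictions_per_second = 30_000_000  # "over 30 million"
serving_nodes = 1_000                     # "roughly 1,000"
workloads = 1_000_000                     # "1 million+"
clusters = 200

# Average peak predictions per second handled by a single serving node.
per_node = peak_predictions_per_second / serving_nodes
# Average number of workloads per Kubernetes cluster.
per_cluster = workloads / clusters
```

That works out to about 30,000 predictions per second per serving node at peak and about 5,000 workloads per cluster on average.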

What made Uber's presence at the conference so significant was not just the sheer scale but the clear validation of Kubernetes as a programmable control plane capable of handling distributed AI infrastructure.

AI workloads are notoriously stateful, hardware-constrained, and latency-sensitive. For a long time, there was healthy skepticism about whether cloud-native abstractions could endure GPU-heavy inference at enterprise scale without collapsing. Uber proved that they can.

The takeaway isn't that every enterprise will—or should—operate exactly like Uber. Rather, it is that the production blueprint for operationalizing AI already exists.

The Learning

The abstraction holds under pressure.

Kubernetes is successfully industrializing AI, shifting the enterprise focus away from raw model creation and toward lifecycle management, efficient serving, and reliable execution at scale.

Deep Dive: Want to know exactly how they went from fragmented Python scripts to 30 million predictions a second? Read the full architectural breakdown: The Industrialization of ML: A Deep Dive into Uber’s AI Platform Architecture.

06. The Missing Link: AI Provenance & The Cyber Resilience Act

However, leaving KubeCon thinking only about compute orchestration misses the week's most critical subtext: the standardization of the AI software supply chain.

With the European Cyber Resilience Act (CRA) reporting obligations due to apply from September 2026, the attack surface teams must account for is widening from traditional code vulnerabilities to poisoned weights and compromised training pipelines. Sessions like Airbus’s "Proving trust" and the debut of the CNCF's Agentics Day made one thing explicitly clear: smoothly orchestrating 10,000 agents is rapidly becoming a solved infrastructure problem. Governing and cryptographically verifying the cognitive provenance of those agents before they execute is the actual frontier.

The quiet consensus in Amsterdam was this: if your platform can deploy an army of agents but cannot cryptographically verify their provenance via aiBOMs and signed models, you haven't built an operational platform; you've just automated a massive liability.
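
The smallest building block of that verification is checking that an artifact matches the digest recorded for it before anything is allowed to run. The sketch below shows only that step, under loud assumptions: the manifest layout and the `artifact_matches_manifest` helper are hypothetical and do not follow any aiBOM standard, and verifying the signature on the manifest itself (the part tools like Sigstore address) is not shown.

```python
import hashlib
import hmac

def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()

def artifact_matches_manifest(artifact: bytes, manifest: dict, name: str) -> bool:
    """Check an artifact against the digest recorded for it in the manifest."""
    expected = manifest.get("artifacts", {}).get(name)
    if expected is None:
        return False  # unknown artifacts are rejected, not trusted by default
    return hmac.compare_digest(sha256_hex(artifact), expected)

# Stand-in bytes for model weights; a real check would hash the file contents.
weights = b"fake model weights"
manifest = {"artifacts": {"model.bin": sha256_hex(weights)}}  # hypothetical layout

ok = artifact_matches_manifest(weights, manifest, "model.bin")
tampered = artifact_matches_manifest(weights + b"!", manifest, "model.bin")
```

A digest match proves the bytes are the ones the manifest describes. It says nothing about who produced the manifest, which is why signing and provenance attestations sit on top of this step.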

Dig Deeper: How exactly do we secure probabilistic systems? For a technical deep dive into how SLSA standards, Sigstore, and Kubernetes admission controllers are being adapted to solve this, read my follow-up piece: Securing the Agentic Supply Chain: Why Provenance is the New Perimeter.

The North Star: Where the Ecosystem Appears to Be Going

By Thursday afternoon, several patterns had become difficult to ignore.

The same operational themes kept surfacing:

platform engineering,
eBPF,
OpenTelemetry,
AI infrastructure,
operational efficiency,
and governance.

Three broader shifts stood out.

01. Platform Engineering ↔ eBPF

Infrastructure conversations are increasingly moving in two directions at once:

upward toward developer experience,
and downward toward kernel-level visibility and security.

eBPF sits at the center of that transition.

Instrumentation is becoming more deeply integrated into infrastructure itself while becoming increasingly invisible to developers.

02. AI on Kubernetes Is Becoming Operational Infrastructure

AI workloads are rapidly becoming standard platform concerns.

Platform teams are now regularly discussing:

GPU scheduling,
inference networking,
accelerator orchestration,
model serving reliability,
and operational cost control.

The tooling ecosystem around Kubernetes AI workloads is maturing quickly.

03. Efficiency Is Becoming a Core Operational Metric

Energy usage, infrastructure efficiency, observability overhead, and GPU utilization are increasingly treated as operational concerns rather than secondary optimizations.

The broader trend is not only about sustainability messaging.
It is also about economic reality.

Efficient infrastructure compounds.

Infrastructure is no longer simply supporting transformation. Increasingly, it is becoming the mechanism through which transformation happens.

Resources

This article draws from sessions and discussions involving Google Cloud, Spotify Engineering, Miro, Solo.io, Red Hat, Uber, Netflix, and other contributors across the cloud-native ecosystem.

By Soumia, a developer advocate focused on making complex infrastructure legible — through writing, speaking, and helping technical and non-technical audiences find common ground. I work at the intersection of cloud-native systems, AI, and editorial craft. — LinkedIn · Portfolio
