Sonia Rahal

What I Learned at the CNCF Montreal KubeCon NA 2025 Recap

On December 10th, the Cloud Native Montreal community hosted a recap of KubeCon NA 2025, which took place in Atlanta. This was not a traditional conference but a community-driven evening of lightning talks and reflections on where the cloud-native ecosystem is heading.

Instead of focusing on slides or announcements, the event emphasized patterns and lessons emerging across the ecosystem — from AI agents and observability to GitOps and energy-aware infrastructure.

Here are the key takeaways that stood out.


Cloud Native Is Becoming AI-Native

One recurring theme was that AI workloads are now first-class citizens in cloud-native environments.

Traditional observability answers questions like:

  • Is the service up?
  • Is latency within SLOs?

AI systems introduce new operational questions:

  • What prompt triggered this behavior?
  • Which model call was expensive?
  • Why did this agent take a specific action?

Tools such as OpenLLMetry extend OpenTelemetry with instrumentation for LLM and agent workflows, while OpenCost provides visibility into Kubernetes and cloud spend across workloads, teams, and environments.
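To make those questions concrete, here is a minimal sketch of the kind of data an LLM-aware trace might carry. It uses only the Python standard library and is not the OpenLLMetry, OpenTelemetry, or OpenCost API; the attribute names, the `example-model` name, and the prices are all assumptions made up for illustration.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class LLMSpan:
    """One recorded model call. All attribute names here are hypothetical."""
    name: str
    attributes: dict = field(default_factory=dict)
    duration_s: float = 0.0


RECORDED_SPANS: list[LLMSpan] = []

# Assumed example prices in USD per 1,000 tokens; not real vendor pricing.
PRICE_PER_1K = {"example-model": {"input": 0.50, "output": 1.50}}


@contextmanager
def llm_span(name: str, model: str, prompt: str):
    """Record the prompt, model, and wall-clock time around one model call."""
    span = LLMSpan(name, {"llm.model": model, "llm.prompt": prompt})
    start = time.perf_counter()
    try:
        yield span
    finally:
        span.duration_s = time.perf_counter() - start
        RECORDED_SPANS.append(span)


def record_usage(span: LLMSpan, input_tokens: int, output_tokens: int) -> float:
    """Attach token counts and an estimated cost to the span."""
    price = PRICE_PER_1K[span.attributes["llm.model"]]
    cost = (input_tokens * price["input"] + output_tokens * price["output"]) / 1000
    span.attributes.update({
        "llm.input_tokens": input_tokens,
        "llm.output_tokens": output_tokens,
        "llm.estimated_cost_usd": cost,
    })
    return cost


def most_expensive(spans: list[LLMSpan]) -> LLMSpan:
    """Answer 'which model call was expensive?' from the recorded spans."""
    return max(spans, key=lambda s: s.attributes.get("llm.estimated_cost_usd", 0.0))


# record_usage stands in for a real model call that reports token usage.
with llm_span("summarize", "example-model", "Summarize the incident report") as s:
    record_usage(s, input_tokens=2000, output_tokens=400)
with llm_span("classify", "example-model", "Is this alert actionable?") as s:
    record_usage(s, input_tokens=200, output_tokens=10)

print(most_expensive(RECORDED_SPANS).name)  # prints: summarize
```

Because each span keeps the prompt next to the token counts and estimated cost, the same record can answer both "what prompt triggered this?" and "which call was expensive?".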

The takeaway is clear:

You can’t scale AI systems you can’t observe or financially understand.


Observability Is Shifting From Dashboards to Agents

Observability is evolving beyond dashboards and alerts toward agent-assisted operations.

Instead of engineers manually correlating metrics, logs, and recent deployments, emerging tools aim to:

  • Perform root-cause analysis
  • Triage alerts
  • Recommend remediation steps

Several projects, alongside newer agentic SRE tools, point to a future where observability systems do not just surface data but actively reason over it:

  • k8sgpt — AI-native Kubernetes troubleshooting
  • HolmesGPT / Seraph — Automated root cause analysis and alert mitigation

Emerging agent-based platforms correlate logs, metrics, deployments, and incidents to assist on-call engineers and reduce alert fatigue.

This doesn’t replace engineers, but it changes the workflow: less time searching for signals, more time making informed decisions.
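One small piece of that correlation can be sketched in a few lines: given an alert, find deployments to the same service that landed shortly before it fired. This is a simplified illustration, not how k8sgpt, HolmesGPT, or Seraph work internally; the `Deployment` and `Alert` types and the 30-minute window are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Deployment:
    service: str
    version: str
    deployed_at: datetime


@dataclass
class Alert:
    service: str
    summary: str
    fired_at: datetime


def suspect_deployments(alert, deployments, window=timedelta(minutes=30)):
    """Return deployments to the alerting service that landed within
    `window` before the alert fired, most recent first."""
    candidates = [
        d for d in deployments
        if d.service == alert.service
        and timedelta(0) <= alert.fired_at - d.deployed_at <= window
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


alert = Alert("checkout", "p99 latency above SLO", datetime(2025, 12, 10, 14, 20))
deployments = [
    Deployment("checkout", "v41", datetime(2025, 12, 10, 9, 0)),   # too old
    Deployment("checkout", "v42", datetime(2025, 12, 10, 14, 5)),  # 15 min before the alert
    Deployment("search", "v7", datetime(2025, 12, 10, 14, 10)),    # different service
]

for d in suspect_deployments(alert, deployments):
    print(d.service, d.version)  # prints: checkout v42
```

An agent would combine many signals like this one; the point is that the manual "what changed recently?" step is mechanical enough to automate.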

Image of agentic SRE tools

Abstraction Helps — but Security Must Follow

Another major topic was Cyclops, an open-source platform that simplifies Kubernetes by replacing raw YAML with structured, form-based abstractions.

Cyclops introduces:

  • Modules — logical groupings of all Kubernetes resources an application needs
  • Templates — mappings that translate module inputs into valid Kubernetes manifests

How Cyclops works with Helm:

  • Helm charts define the Kubernetes resources (Deployments, Services, Ingress, etc.) using templated YAML.
  • Cyclops wraps those Helm charts and exposes their values as validated forms instead of free-text YAML edits.
  • Users fill in forms, and Cyclops renders the underlying Helm templates into valid Kubernetes manifests.
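The "validated form in, manifest out" idea can be sketched as follows. This is not Cyclops or Helm code: the `FORM_SCHEMA` fields, their limits, and the `render_deployment` helper are hypothetical, and a plain Python function stands in for real Helm templating.

```python
# Hypothetical form definition: field name -> type and constraints.
FORM_SCHEMA = {
    "name": {"type": str, "required": True},
    "image": {"type": str, "required": True},
    "replicas": {"type": int, "required": False, "default": 1, "min": 1, "max": 10},
}


def validate(values: dict) -> dict:
    """Check user input against FORM_SCHEMA and fill in defaults."""
    errors = [f"unknown field: {key}" for key in values if key not in FORM_SCHEMA]
    clean = {}
    for key, rule in FORM_SCHEMA.items():
        if key not in values:
            if rule["required"]:
                errors.append(f"missing field: {key}")
            else:
                clean[key] = rule["default"]
            continue
        value = values[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, rule["type"]):
            errors.append(f"{key}: expected {rule['type'].__name__}")
            continue
        if "min" in rule and value < rule["min"]:
            errors.append(f"{key}: must be >= {rule['min']}")
        if "max" in rule and value > rule["max"]:
            errors.append(f"{key}: must be <= {rule['max']}")
        clean[key] = value
    if errors:
        raise ValueError("; ".join(errors))
    return clean


def render_deployment(values: dict) -> dict:
    """Turn validated values into a Kubernetes Deployment manifest."""
    labels = {"app": values["name"]}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": values["name"], "labels": labels},
        "spec": {
            "replicas": values["replicas"],
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": values["name"], "image": values["image"]}]},
            },
        },
    }


values = validate({"name": "web", "image": "nginx:1.27"})
manifest = render_deployment(values)
print(manifest["spec"]["replicas"])  # prints: 1
```

A typo such as `"replicas": "3"` is rejected at the form boundary with a readable error, instead of surfacing later as a failed apply.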

Cyclops also supports AI-driven operations through a Model Context Protocol (MCP) server, allowing agents to manage applications using natural language rather than direct cluster access.

The key lesson here wasn’t blind automation, but caution:

Code generated by AI should be treated as untrusted.

Security risks still apply. As abstraction increases, guardrails, validation, and testing become even more critical.
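As one illustration of such a guardrail, the sketch below checks a generated Deployment manifest (as a plain dict) against a few rules before anything is applied. The three rules are examples I chose, not a policy presented at the event, and a real setup would more likely rely on an admission controller or policy engine than on a hand-rolled function.

```python
def image_tag(image: str) -> str:
    """Return the tag (or digest) of an image reference, or '' if none is given."""
    last_segment = image.rsplit("/", 1)[-1]  # ignore a registry host:port prefix
    if "@" in last_segment:                  # digest-pinned, e.g. app@sha256:...
        return last_segment.split("@", 1)[1]
    return last_segment.split(":", 1)[1] if ":" in last_segment else ""


def policy_violations(manifest: dict) -> list[str]:
    """List rule violations in a Deployment-shaped manifest (empty list = pass)."""
    violations = []
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})

    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        if image_tag(container.get("image", "")) in ("", "latest"):
            violations.append(f"{name}: image must be pinned to a tag or digest")
        if container.get("securityContext", {}).get("privileged") is True:
            violations.append(f"{name}: privileged containers are not allowed")

    for volume in pod_spec.get("volumes", []):
        if "hostPath" in volume:
            violations.append(f"volume {volume.get('name', '<unnamed>')}: hostPath is not allowed")

    return violations


# A manifest as an AI assistant might produce it (made-up example).
generated = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "spec": {"template": {"spec": {
        "containers": [{
            "name": "web",
            "image": "nginx:latest",
            "securityContext": {"privileged": True},
        }],
        "volumes": [{"name": "host-root", "hostPath": {"path": "/"}}],
    }}},
}

for problem in policy_violations(generated):
    print("REJECTED:", problem)
```

The check treats the manifest as untrusted input regardless of who or what produced it, which is the posture the talk argued for.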


GitOps Works Best When Designed for Teams

A practical GitOps case study highlighted that repository structure matters as much as tooling.

Key principles discussed:

  • Align configuration structure with team ownership
  • Centralize configuration while keeping environments explicit
  • Keep related files close together (“proximity matters”)
  • Optimize for developer experience, not just correctness

With Argo CD, deployments become automated, auditable, and consistent, but only when GitOps is treated as both a technical and an organizational design problem.
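To illustrate "align structure with team ownership" and "keep environments explicit", here is a small sketch that checks a list of repository paths against one possible layout, `teams/<team>/<app>/<env>/...`. The layout, the environment names, and the example paths are assumptions for illustration, not the structure shown in the case study.

```python
from pathlib import PurePosixPath

# Assumed convention: teams/<team>/<app>/<env>/<files...>
REQUIRED_ENVS = ("dev", "staging", "prod")


def missing_environments(paths, required=REQUIRED_ENVS):
    """Map each (team, app) to the required environments it does not define."""
    seen = {}
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 5 or parts[0] != "teams":
            continue  # not a file inside teams/<team>/<app>/<env>/
        team, app, env = parts[1], parts[2], parts[3]
        seen.setdefault((team, app), set()).add(env)
    return {
        key: [env for env in required if env not in envs]
        for key, envs in sorted(seen.items())
        if any(env not in envs for env in required)
    }


repo_files = [
    "teams/payments/api/dev/values.yaml",
    "teams/payments/api/staging/values.yaml",
    "teams/payments/api/prod/values.yaml",
    "teams/search/indexer/dev/values.yaml",
    "teams/search/indexer/prod/values.yaml",
    "README.md",
]

print(missing_environments(repo_files))
# prints: {('search', 'indexer'): ['staging']}
```

With ownership encoded in the path, a check like this can run in CI, and a reviewer can tell from the directory alone which team and environment a change touches.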

Image of Before/After GitOps Repository Structure


Energy Efficiency Is Becoming a Platform Concern

The final talk focused on Kepler, a CNCF project designed to expose energy consumption at the container level.

Kepler provides:

  • Fine-grained container and process power metrics
  • Support for CPUs, GPUs, and heterogeneous hardware
  • Low overhead using eBPF
  • Integration with existing observability stacks

As GPU-heavy and AI workloads grow, energy usage and cooling costs are becoming operational concerns.
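Energy metrics of this kind are typically exposed as cumulative joule counters, so turning them into something actionable is simple arithmetic: average power is the energy delta divided by the elapsed time. The sketch below assumes two samples per container taken 60 seconds apart; the container names and readings are made up, and no Kepler or Prometheus API is used.

```python
def average_watts(joules_start: float, joules_end: float, seconds: float) -> float:
    """Average power over an interval from two cumulative energy readings."""
    if seconds <= 0:
        raise ValueError("interval must be positive")
    if joules_end < joules_start:
        raise ValueError("counter went backwards (possible reset)")
    return (joules_end - joules_start) / seconds


def joules_to_kwh(joules: float) -> float:
    """1 kWh = 3.6 million joules."""
    return joules / 3_600_000


# Made-up counter readings (joules) at the start and end of a 60 s window.
INTERVAL_S = 60.0
samples = {
    "inference-gpu": (1_000.0, 19_000.0),
    "web-frontend": (500.0, 1_100.0),
}

watts = {name: average_watts(start, end, INTERVAL_S) for name, (start, end) in samples.items()}

# Rank containers by power draw, highest first.
for name, power in sorted(watts.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {power:.1f} W")
```

In this made-up example the GPU workload averages 300 W against 10 W for the web frontend, the kind of gap that makes per-container energy data relevant to scheduling and cost decisions.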

The key message:

Sustainability is now part of platform engineering, not just hardware planning.

Image of the 8 concepts of Kepler Project


Final Reflection

This KubeCon recap wasn’t about memorizing tools — it was about understanding direction.

Across talks, a consistent shift emerged:

  • From reactive monitoring to AI-assisted operations
  • From raw YAML to safe, opinionated abstractions
  • From cost surprises to cost-aware platforms
  • From performance-only metrics to energy-aware infrastructure

Community-driven events like this help connect individual technologies into a cohesive mental model of where cloud-native systems are heading next.
