DEV Community: Gulcan Topcu

What `os.cpu_count()` Gets Wrong in a CPU-Limited Kubernetes Pod

Gulcan Topcu — Wed, 24 Jun 2026 20:05:43 +0000

This article was originally published on my personal blog.

I gave a pod 500m CPU and then went inside and asked Python how many CPUs it could see. The answer was 20, and that seemed worth understanding.

TL;DR: When a Gunicorn config sizes workers from os.cpu_count() inside a 500m-limited pod, it might see the node's full CPU count instead of what the cgroup actually allows, and most of what those extra workers do is wait to be scheduled.

I start by checking what Kubernetes reports for the node:

kubectl get node minikube -o jsonpath='{.status.capacity.cpu}{"\n"}{.status.allocatable.cpu}{"\n"}'

20
20

Both values are 20, which means Kubernetes scheduled the pod onto a node that advertises twenty CPUs as both capacity and allocatable. The important point is that a 500m limit does not remove CPUs from the process's view; it caps how much CPU time the cgroup may spend while all of those CPUs remain visible.

And the YAML is where that promise gets made.

What the YAML Actually Promises

The pod lives in its own namespace, python-cpu-quota-demo, and it's set up with matching request and limit: both 500m.

kubectl get pod cpu-probe -n python-cpu-quota-demo -o yaml

resources:
  limits:
    cpu: 500m
    memory: 128Mi
  requests:
    cpu: 500m
    memory: 128Mi
qosClass: Guaranteed

These two fields sit next to each other in the spec, but they do very different jobs. The CPU request is for the scheduler before the pod exists, because Kubernetes needs to know whether the node has room. The CPU limit is for the kernel after the pod is running, because the cgroup needs to know how much CPU time this workload may spend per period.
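As a rough sketch of how a CPU limit becomes a kernel quota, the conversion below maps a Kubernetes CPU quantity to a cgroup v2 cpu.max line. The function name is hypothetical, and the 100,000 microsecond period is an assumption chosen to match the cpu.max value this pod reports. Kubelet's real conversion has additional rules that this sketch leaves out.

```python
from decimal import Decimal


def cpu_limit_to_cpu_max(limit: str, period_us: int = 100_000) -> str:
    """Convert a Kubernetes CPU quantity such as "500m" or "2" to a cpu.max line."""
    if limit.endswith("m"):
        cores = Decimal(limit[:-1]) / 1000  # millicores -> cores
    else:
        cores = Decimal(limit)              # plain cores, e.g. "2" or "0.5"
    quota_us = int(cores * period_us)       # CPU time allowed per period
    return f"{quota_us} {period_us}"


print(cpu_limit_to_cpu_max("500m"))  # 50000 100000
```

The first field is the CPU time the cgroup may spend, the second is the length of the period it is measured against.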

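Memory is crueler when you cross the line: a container that exceeds its memory limit is OOM-killed, while a container that exhausts its CPU quota is only throttled.

That throttling happens while Python still sees every CPU on the node, and that is the gap I want to inspect from inside the pod.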

What Python Sees vs What the Kernel Enforces

I want three values next to each other: what Python reports, what the cgroup says, and what Linux affinity allows for the current process.

kubectl logs cpu-probe -n python-cpu-quota-demo

python 3.13.14
os.cpu_count 20
os.process_cpu_count 20
sched_getaffinity 20
cpu.max 50000 100000
cpu.cfs_quota_us missing
cpu.cfs_period_us missing
cpu.stat usage_usec 106789
user_usec 81012
system_usec 25776
nice_usec 0
nr_periods 2
nr_throttled 2
throttled_usec 8150
nr_bursts 0
burst_usec 0
cpus_allowed_list 0-19

The CPU-count answers agree with each other. Python says 20, and Linux affinity says the process is allowed to run on CPUs 0-19, so that also comes out as 20. Those answers are not wrong, because the process really can be scheduled on any of those logical CPUs.

The cgroup is looking at a different limit:

kubectl exec deployment/gunicorn-cpu-demo -n python-cpu-quota-demo -- cat /sys/fs/cgroup/cpu.max

50000 100000

This pod is using cgroup v2, which is why the value lives in cpu.max:

kubectl exec deployment/gunicorn-cpu-demo -n python-cpu-quota-demo -- stat -fc %T /sys/fs/cgroup

cgroup2fs

On cgroup v2, cpu.max has two fields: quota and period. Here the cgroup can spend 50,000 microseconds of CPU time in each 100,000 microsecond period, which works out to half a CPU:

50000 / 100000 = 0.5 CPU

That is the 500m limit, and now the mismatch is visible: Python sees twenty CPUs, affinity allows twenty CPUs, but the kernel quota allows half a CPU worth of time.

All three answers are technically true because they are answering different questions. Python's os.cpu_count() is answering "how many logical CPUs are in the system?", os.process_cpu_count() and affinity are answering "which CPUs can this thread run on?", and the cgroup is answering "how much CPU time can this group spend?" Worker sizing for a CPU-bound sync service should start from the third question, but a lot of worker formulas read the first answer.
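The three questions can be put side by side in a few lines of Python. This is a minimal sketch, not the probe script that produced the log output: the function name cpu_views is made up for illustration, and it only reads the cgroup v2 cpu.max file.

```python
import os
from pathlib import Path


def cpu_views(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> dict:
    """Collect the three answers: system CPUs, schedulable CPUs, cgroup quota."""
    views = {"os.cpu_count": os.cpu_count()}

    # os.process_cpu_count() only exists on Python 3.13+.
    process_cpu_count = getattr(os, "process_cpu_count", None)
    views["os.process_cpu_count"] = process_cpu_count() if process_cpu_count else None

    # sched_getaffinity is only available on some platforms (Linux among them).
    if hasattr(os, "sched_getaffinity"):
        views["sched_getaffinity"] = len(os.sched_getaffinity(0))
    else:
        views["sched_getaffinity"] = None

    # cgroup v2 quota: "<quota> <period>", or "max <period>" when unlimited.
    views["quota_cpus"] = None
    path = Path(cpu_max_path)
    if path.exists():
        quota, period = path.read_text().split()
        if quota != "max":
            views["quota_cpus"] = int(quota) / int(period)
    return views


if __name__ == "__main__":
    for name, value in cpu_views().items():
        print(f"{name}: {value}")
```

A quota_cpus of None means no cgroup v2 quota was found, either because the limit is "max" or because the file is absent.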

So what does Gunicorn do with twenty?

What Gunicorn Does With That Number

By default, if WEB_CONCURRENCY is not set, this installed version of Gunicorn starts with one worker:

kubectl exec deployment/gunicorn-cpu-demo -n python-cpu-quota-demo -- \
  python -c "import os; os.environ.pop('WEB_CONCURRENCY', None); \
  from gunicorn.config import Config; config = Config(); \
  print(config.settings['workers'].default)"

1

The problem starts when an application config overrides that default and calculates worker count from the CPU count Python reports. A common Gunicorn config pattern looks like this:

import multiprocessing

bind = "127.0.0.1:8000"
workers = multiprocessing.cpu_count() * 2 + 1

Inside this pod, multiprocessing.cpu_count() returns 20, which makes that formula produce 41 workers. The demo startup log prints both the what-if calculation and the value actually used:

kubectl logs deployment/gunicorn-cpu-demo -n python-cpu-quota-demo --limit-bytes=5000

os.cpu_count=20
os.process_cpu_count=20
cpu.max=50000 100000
quota_cpus=0.50
gunicorn_formula_from_quota=2
gunicorn_formula_from_os_cpu_count=41
WEB_CONCURRENCY=1

[2026-06-22 15:17:19 +0000] [1] [INFO] Starting gunicorn 23.0.0
[2026-06-22 15:17:19 +0000] [1] [INFO] Listening at: http://0.0.0.0:8000 (1)
[2026-06-22 15:17:19 +0000] [1] [INFO] Using worker: sync
[2026-06-22 15:17:19 +0000] [12] [INFO] Booting worker with pid: 12
[22/Jun/2026:15:17:19 +0000] "GET /healthz HTTP/1.1" 200 3 "-" "kube-probe/1.34"

gunicorn_formula_from_os_cpu_count=41 is the dangerous what-if number, while WEB_CONCURRENCY=1 is what this run actually uses. The startup lines confirm that Gunicorn boots one worker, which gives us a small baseline before adding more workers to the same half-CPU budget.
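One way to reproduce the two gunicorn_formula_* numbers is to feed the same formula two different inputs. This is a guess at the arithmetic behind those log lines, not the demo's actual startup script, and the function name is hypothetical.

```python
def workers_from(cpus: float) -> int:
    """The common (2 x CPUs) + 1 worker formula, truncated to a whole number."""
    return int(cpus * 2 + 1)


print(workers_from(20))   # 41, from the CPU count Python reports
print(workers_from(0.5))  # 2, from the 0.5 CPU quota in cpu.max
```

The formula is the same in both cases. Only the input changes, and that input is what decides whether the pod gets 2 workers or 41.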

To compare those worker counts, I need an endpoint that spends CPU in a boring and repeatable way.

The Endpoint and the Load Setup

The service exposes that endpoint as /burn. Each request runs a fixed Python loop and returns the worker pid with elapsed time:

import os
import time
from urllib.parse import parse_qs

DEFAULT_LOOPS = int(os.getenv("BURN_LOOPS", "1500000"))

def burn(iterations):
    total = 0
    for item in range(iterations):
        total = (total + (item * item)) % 1000003
    return total

def application(environ, start_response):
    path = environ.get("PATH_INFO", "/")

    if path == "/healthz":
        body = b"ok\n"
        start_response("200 OK", [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
        return [body]

    if path != "/burn":
        body = b"use /burn\n"
        start_response("404 Not Found", [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
        return [body]

    query = parse_qs(environ.get("QUERY_STRING", ""))
    iterations = int(query.get("n", [str(DEFAULT_LOOPS)])[0])

    started = time.perf_counter()
    result = burn(iterations)
    duration_ms = (time.perf_counter() - started) * 1000

    body = (
        f"pid={os.getpid()} loops={iterations} "
        f"result={result} duration_ms={duration_ms:.2f}\n"
    ).encode()
    start_response("200 OK", [("Content-Type", "text/plain"), ("Content-Length", str(len(body)))])
    return [body]

Each benchmark used the same setup. I port-forwarded the service to 127.0.0.1:18080, ran ab for 20 seconds against /burn, and captured /sys/fs/cgroup/cpu.stat from the app container before and after the run. That gave me both sides of the story: what the client saw and what the kernel counted.

With the workload fixed, the next question is what changed when I changed only the process layout inside the same tiny CPU budget.

So Which Worker Count Actually Won?

The pod spec, endpoint, CPU limit, and test duration stayed the same, while WEB_CONCURRENCY changed between runs. I also changed the client pressure: the 1-worker run used -c 5, and the 14-worker run used -c 20. That means this is a stress comparison, not a clean single-variable benchmark, but it still answers the question I cared about: what happens when too many workers fight over the same 0.5 CPU?

Workers	Why tested	Completed	ab length mismatches	Req/s	P50 latency	P95 latency
1	Explicit conservative value	101	0	5.03	1,002 ms	1,060 ms
14	Explicit high worker overcommit	46	45	2.18	6,390 ms	11,402 ms

One worker completed 101 responses with a 1,002ms median, while fourteen workers completed 46 responses and pushed the median past 6 seconds. The extra workers did not increase the amount of CPU available to the pod, so they could not buy real throughput, and instead gave the scheduler more runnable processes to pause and resume inside the same 0.5 CPU budget.

The ab length mismatches are also worth reading carefully.

The endpoint returns values like pid and duration_ms, so the response body is not a fixed length across requests. That is why ab reports length mismatches as failed requests in the 14-worker run. I still kept the column in the table so the raw client output is not hidden, but the more important numbers are completed requests, request rate, latency, and the kernel counters below.
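The length drift is easy to see by formatting two bodies the way the /burn handler does. The helper name and the pid, result, and timing values below are hypothetical, picked only to show the effect. ab treats the first response's length as the reference and counts later responses of a different length under "Length".

```python
def burn_body(pid: int, loops: int, result: int, duration_ms: float) -> bytes:
    """Build the response body the same way the /burn handler formats it."""
    return (
        f"pid={pid} loops={loops} "
        f"result={result} duration_ms={duration_ms:.2f}\n"
    ).encode()


# Hypothetical values: same worker, same loop count, different timings.
fast = burn_body(pid=12, loops=1_500_000, result=1, duration_ms=994.27)
slow = burn_body(pid=12, loops=1_500_000, result=1, duration_ms=6390.10)

print(len(fast), len(slow))  # the slower response is one byte longer
```

Once a duration crosses from three digits to four before the decimal point, the body grows by a byte and ab flags it, even though the request succeeded.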

With one worker, the full ab output:

ab -t 20 -c 5 -s 60 -q http://127.0.0.1:18080/burn

Concurrency Level:      5
Time taken for tests:   20.084 seconds
Complete requests:      101
Failed requests:        0
Requests per second:    5.03 [#/sec] (mean)
Time per request:       994.273 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   1002
  95%   1060
 100%   1299 (longest request)

With 14 workers:

ab -t 20 -c 20 -s 60 -q http://127.0.0.1:18080/burn

Concurrency Level:      20
Time taken for tests:   21.133 seconds
Complete requests:      46
Failed requests:        45
   (Connect: 0, Receive: 0, Length: 45, Exceptions: 0)
Requests per second:    2.18 [#/sec] (mean)
Time per request:       9188.476 [ms] (mean)

Percentage of the requests served within a certain time (ms)
  50%   6390
  95%  11402
 100%  12312 (longest request)

The client numbers tell us the service got slower, but cpu.stat tells us what the kernel was doing while the clients waited. Before the 14-worker run, the cgroup had 24 throttled periods and just under 1 second of throttled time accumulated since the pod started. After the load test, the same counters looked like this:

nr_periods 328
nr_throttled 272
throttled_usec 285826526

That is 248 newly throttled periods and about 285 seconds of newly accumulated throttled time during roughly 20 seconds of wall-clock load. The number can be much larger than wall time because cpu.stat accounts throttling across the runnable tasks in the cgroup. With 14 workers competing for the same 0.5 CPU quota, blocked time piles up in parallel.

The 1-worker run is the same workload with fewer processes trying to spend the quota. In that run, the cgroup had 16 throttled periods and about 0.7 seconds of throttled time before the test. After the load test:

nr_periods 830
nr_throttled 220
throttled_usec 10877871

That is 204 newly throttled periods and about 10.2 seconds of newly accumulated throttled time across the run. The single worker still hit the CPU limit, because the endpoint is CPU-bound and the quota is small, but it spent the quota on useful work instead of spreading it across a crowd of workers.
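The before and after arithmetic can be scripted instead of done by hand. A minimal sketch, assuming two cpu.stat snapshots captured as text; the function names are hypothetical and the sample snapshots are made up for illustration, not taken from the runs above.

```python
def parse_cpu_stat(text: str) -> dict[str, int]:
    """Parse cpu.stat's "key value" lines into a dict of integers."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def throttle_delta(before: str, after: str) -> dict[str, float]:
    """Return what changed between two cpu.stat snapshots."""
    b, a = parse_cpu_stat(before), parse_cpu_stat(after)
    periods = a["nr_periods"] - b["nr_periods"]
    throttled = a["nr_throttled"] - b["nr_throttled"]
    return {
        "periods": periods,
        "throttled_periods": throttled,
        "throttled_seconds": (a["throttled_usec"] - b["throttled_usec"]) / 1_000_000,
        "throttle_ratio": throttled / periods if periods else 0.0,
    }


# Made-up snapshots for illustration only.
before = "nr_periods 100\nnr_throttled 10\nthrottled_usec 500000\n"
after = "nr_periods 300\nnr_throttled 110\nthrottled_usec 2500000\n"
print(throttle_delta(before, after))
```

Working from deltas matters because the counters accumulate from the moment the cgroup is created, so a raw snapshot mixes the load test with everything that happened before it.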

At this point the kernel counters have already told the story, but dashboards are where people usually look first. So I wanted to know whether Prometheus would show the same failure mode, or whether this would stay hidden unless you went into cpu.stat.

The Prometheus View

This cluster runs kube-prometheus-stack and scrapes cAdvisor Prometheus metrics from the kubelet. Before writing a query, I first checked which container CPU metrics were actually available:

curl -sG 'http://localhost:9090/api/v1/label/__name__/values' \
  | python3 -c "
import sys, json
d = json.load(sys.stdin)
cadvisor = [m for m in d['data'] if 'container_cpu' in m]
print('\n'.join(sorted(cadvisor)))
"

container_cpu_cfs_periods_total
container_cpu_cfs_throttled_periods_total
container_cpu_usage_seconds_total

Prometheus gives me throttled periods here, while the raw throttled seconds still have to come from cpu.stat. container_cpu_cfs_throttled_periods_total is present, but container_cpu_cfs_throttled_seconds_total is not exposed by the kubelet cAdvisor on this cluster.

One other detail matters for the query. On this Minikube setup, cAdvisor emits these metrics at pod scope without a container label, which means queries filtering on container="app" return empty results. The actual label set looks like this:

{
  "__name__": "container_cpu_cfs_throttled_periods_total",
  "namespace": "python-cpu-quota-demo",
  "pod": "gunicorn-cpu-demo-694589fb97-45cq2",
  "node": "minikube"
}

There is no container key there, so the throttle ratio query filters by namespace and pod instead:

rate(container_cpu_cfs_throttled_periods_total{
  namespace="python-cpu-quota-demo",
  pod=~"gunicorn-cpu-demo-.*"
}[1m])
/
rate(container_cpu_cfs_periods_total{
  namespace="python-cpu-quota-demo",
  pod=~"gunicorn-cpu-demo-.*"
}[1m])

During the 14-worker load test, that query showed about 83% of measured periods hitting throttling. Because this is a 1-minute rate() query, the exact value moves as the load window ages out of Prometheus' lookback range:

throttle_ratio: 83.4 % | pod: gunicorn-cpu-demo-694589fb97-45cq2
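The same ratio can be pulled from Prometheus' instant-query endpoint with only the standard library. This is a sketch under assumptions: the helper names are hypothetical, the base URL is the localhost:9090 address used in the curl command earlier, and fetch_throttle_ratios needs a reachable Prometheus to return anything.

```python
import json
import urllib.parse
import urllib.request


def throttle_ratio_query(namespace: str, pod_regex: str, window: str = "1m") -> str:
    """Build the throttled-periods / total-periods PromQL expression."""
    selector = f'{{namespace="{namespace}", pod=~"{pod_regex}"}}'
    return (
        f"rate(container_cpu_cfs_throttled_periods_total{selector}[{window}])"
        f" / rate(container_cpu_cfs_periods_total{selector}[{window}])"
    )


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for Prometheus' instant-query endpoint, /api/v1/query."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def fetch_throttle_ratios(base_url: str, promql: str) -> dict[str, float]:
    """Run the query and map each pod label to its ratio."""
    with urllib.request.urlopen(instant_query_url(base_url, promql), timeout=10) as resp:
        payload = json.load(resp)
    return {
        series["metric"].get("pod", "?"): float(series["value"][1])
        for series in payload["data"]["result"]
    }


query = throttle_ratio_query("python-cpu-quota-demo", "gunicorn-cpu-demo-.*")
print(instant_query_url("http://localhost:9090", query))
```

Keeping the selector in one place avoids the numerator and denominator drifting apart when the label filters change.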

Both runs throttle because the workload is CPU-bound and the quota is tight, but the cost of that throttling is different. With 14 workers, the run completed 46 requests while the cgroup spent a large share of measured periods throttled. With 1 worker, the cgroup still throttled, but 101 requests completed cleanly because the single worker spent its quota on the loop instead of sharing it across extra processes.

Now the fix is less mysterious: choose worker count from the CPU time the cgroup can spend, not from the number of CPUs Python can see.

Reading the Quota Before Sizing Workers

Once you put the pieces next to each other, the chain from YAML to latency is short.

The safer starting point is to read the quota before sizing workers. A quota-aware helper can read cgroup v2 first and fall back to cgroup v1 CFS bandwidth files:

import math
import os
from pathlib import Path

def cgroup_cpu_quota():
    cpu_max = Path("/sys/fs/cgroup/cpu.max")
    if cpu_max.exists():
        quota, period = cpu_max.read_text().strip().split()
        if quota != "max":
            return int(quota) / int(period)

    for cgroup_cpu_dir in (
        Path("/sys/fs/cgroup/cpu"),
        Path("/sys/fs/cgroup/cpu,cpuacct"),
    ):
        quota_file = cgroup_cpu_dir / "cpu.cfs_quota_us"
        period_file = cgroup_cpu_dir / "cpu.cfs_period_us"
        if not quota_file.exists() or not period_file.exists():
            continue

        quota = int(quota_file.read_text().strip())
        period = int(period_file.read_text().strip())
        if quota > 0:
            return quota / period

    return len(os.sched_getaffinity(0))

quota_cpus = cgroup_cpu_quota()
cpu_bound_workers = max(1, math.floor(quota_cpus))
workers = int(os.getenv("WEB_CONCURRENCY", cpu_bound_workers))

This helper does two separate things on purpose. It reads the kernel-enforced quota first, then still leaves WEB_CONCURRENCY as an override, because the quota is the right starting point but the workload decides the final number.

Key Takeaways

Inside a CPU-limited Kubernetes pod, Python can report the node's full CPU count while the kernel enforces a much smaller CPU budget through the cgroup. In this lab, os.cpu_count() returned 20, but cpu.max was 50000 100000, which means the pod was limited to 0.5 CPU.
A Gunicorn worker formula based on os.cpu_count() can produce a worker count that has nothing to do with the pod's actual CPU quota. In the demo pod, the common workers = multiprocessing.cpu_count() * 2 + 1 pattern would calculate 41 workers from a container that only had half a CPU worth of time.
More workers did not mean more completed work for this CPU-bound sync endpoint. Under the same 500m limit, one worker completed 101 requests with a 1,002ms median latency, while fourteen workers completed 46 requests with a 6,390ms median because they were all sharing the same half-CPU quota.

Until next time!

Kubelet Metrics: How cAdvisor and CRI Collect Kubernetes Stats

Gulcan Topcu — Thu, 28 May 2026 11:23:15 +0000

This article was originally published on LearnKube

TL;DR: This article dissects the Kubernetes metrics pipeline through kubelet, cAdvisor, and CRI to show where your metrics actually come from and what breaks when the defaults change.

This article breaks down how Kubernetes collects container, pod, and node metrics, starting with cAdvisor and the Linux kernel, then shifting to a CRI-native model powered by gRPC.

You’ll see how kubelet exposes this data, what happens when you flip PodAndContainerStatsFromCRI, why container metrics on /metrics/cadvisor can be sourced from CRI instead of cAdvisor, and how to trace each metric back to its origin.

It also explains how kubelet talks to the CRI over gRPC, and why understanding this matters if you rely on Prometheus, Grafana, or any observability stack.

Table of contents
How Kubernetes Monitoring Layers Stack Up
Where Metrics Originate
cgroup v1 with cgroupfs: The Legacy Baseline
At the crux of how cgroup hierarchy is shaped
How Kubernetes Creates and Manages the Cgroup Hierarchy
Kubernetes QoS Classes and cgroup Placement
Auto-Detecting cgroup Drivers via KubeletCgroupDriverFromCRI
cAdvisor: Embedded Resource Monitoring in Kubelet
Kubelet’s Metrics Endpoints
From cAdvisor to CRI: How Kubelet Collects Metrics Today
Validating CRI-Based Metrics Collection in Kubelet
Summary
References

How Kubernetes Monitoring Layers Stack Up

Kubernetes metrics are the lifeblood of observability in your clusters.

While tools like Prometheus and Grafana often dominate the monitoring conversation, it's worth understanding the native mechanisms that Kubernetes uses to collect, expose, and leverage metrics before they ever reach those external systems.

Kubernetes monitoring works as a multi-layered system which provides insights that span from bare metal to application workloads.

Each layer builds upon the previous one to create a comprehensive picture of your cluster's health.

At the foundation sit node-level metrics.

These reveal the utilization of physical and virtual resources like CPU, memory, and disk I/O.

The Prometheus Node Exporter is commonly used to collect these fundamental metrics, but they originate from the operating system itself.

One layer up are Kubernetes component metrics.

These expose the health and performance of core services such as kubelet, kube-proxy, and the API server.

Metrics like pod startup latency or API request throughput can tell you whether your control plane is running efficiently and reliably.

Zooming out to the object layer, API resource metrics, often surfaced by tools like kube-state-metrics, offer visibility into Kubernetes objects.

They track details such as the number of pods in a namespace, deployment status, or the number of services running across your cluster.

Finally, at the top layer are pod and container workload metrics.

These focus on the actual performance of your applications.

This is where critical signals like CPU throttling come into play.

For instance, knowing how often a container is blocked from using CPU because it's hit its limit can reveal performance bottlenecks that might otherwise remain hidden.

Where Metrics Originate

Kubernetes defines resource requests and limits, but the kernel does the actual enforcement.

Kubernetes relies on the Linux kernel’s control groups, known as cgroups, to apply those rules.

Cgroups are directories in the /sys/fs/cgroup/ virtual filesystem.

They are a live view of resource allocation and enforcement at the kernel level, exposed as files you can read and write.

These directories define how much CPU time, memory, or I/O bandwidth a process is allowed to consume.

In this context, a resource is anything the system can allocate, limit, and monitor: CPU cycles, memory usage, disk throughput, network bandwidth, even the number of process IDs a container can spawn.

But defining resources is only half of the story.

That’s where controllers make all the difference.

A controller is a kernel component that enforces resource policies and monitors usage for a specific type of resource.

For every resource, there’s a controller in cgroups that governs it.

The kernel reads the files in these cgroup directories, applies the rules they define, and keeps every container within its resource boundaries.
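Every process can find out which cgroup directories it belongs to by reading /proc/self/cgroup. The sketch below parses that file; the function name is hypothetical, and the "unified" key is just a label chosen here for the single cgroup v2 entry.

```python
def parse_proc_cgroup(text: str) -> dict[str, str]:
    """Map controller name -> cgroup path from /proc/<pid>/cgroup content.

    Each line is "hierarchy-ID:controller-list:path". On cgroup v2 the line is
    "0::<path>", which is stored here under the key "unified".
    """
    paths = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        _hierarchy_id, controllers, path = line.split(":", 2)
        if controllers == "":
            paths["unified"] = path
        else:
            for controller in controllers.split(","):
                paths[controller] = path
    return paths


if __name__ == "__main__":
    try:
        with open("/proc/self/cgroup") as fh:
            for controller, path in parse_proc_cgroup(fh.read()).items():
                print(f"{controller}: {path}")
    except FileNotFoundError:
        print("/proc/self/cgroup not found (not a Linux system?)")
```

The paths are relative to the cgroup mount point, so appending one to /sys/fs/cgroup (or to the controller's mount on cgroup v1) gives the directory whose files hold the limits.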

Let's start a Minikube cluster with containerd as the container runtime, and deploy a Python pod to see this in action:

minikube start -c containerd
kubectl create deployment python \
  --image=ghcr.io/learnk8s/python-metrics \
  --port=8080 \
  -- /usr/local/bin/python3 -m http.server 8080

kubectl get po -o wide
NAME                      READY   STATUS    IP
python-66dc9f5c8b-w6x4b   1/1     Running   10.244.0.5

The Linux cgroup API has two versions: cgroup v1 and cgroup v2.

Each version structures resource management differently.

To understand why cgroup v2 and the systemd driver matter, it helps to start with the older model first: cgroup v1 with the cgroupfs driver.

cgroup v1 with cgroupfs: The Legacy Baseline

In this model, Kubernetes and the container runtime manage cgroups by writing directly to the cgroup filesystem.

That works, but it also means the hierarchy is shaped by separate controller trees rather than one unified resource tree.

In cgroup v1, kubelet and the container runtime can still be configured to use either systemd or cgroupfs, as long as both sides use the same driver.

Now let's step into a cgroup v1 environment and see how Kubernetes builds its QoS-based hierarchies when it uses the cgroupfs driver.

We’ll delete our existing Minikube cluster and reboot into a system where cgroup v1 is enabled:

minikube delete

There are several ways to switch a Linux system back to cgroup v1.

You might pass kernel boot parameters like systemd.unified_cgroup_hierarchy=0 or disable cgroup v2 entirely, depending on the environment, whether it’s bare metal, a VM, or WSL2.

Once the node boots into cgroup v1, Kubernetes automatically detects it and adjusts its resource management behavior.

First, confirm the system is operating under cgroup v1:

stat -fc %T /sys/fs/cgroup/
tmpfs

Now start a fresh Minikube cluster with the containerd runtime:

minikube start -c containerd

And deploy the Python pod:

kubectl create deployment python \
  --image=ghcr.io/learnk8s/python-metrics \
  --port=8080 \
  -- /usr/local/bin/python3 -m http.server 8080

kubectl get po -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP
python-66dc9f5c8b-4248r   1/1     Running   0          42s   10.244.0.4

Now we focus on how Kubernetes structures the cgroups under cgroup v1 with the cgroupfs driver.

Kubernetes enforces QoS-based resource isolation by creating separate hierarchies for each QoS class under every controller.

We confirm the kubelet configuration to verify this setting:

kubectl proxy --port=8001 &
curl -X GET http://127.0.0.1:8001/api/v1/nodes/minikube/proxy/configz | jq . | grep -i qos
"cgroupsPerQOS": true,

Per-QoS hierarchy creation is enabled, but which driver is kubelet using to manage these hierarchies?

minikube ssh -- "sudo cat /var/lib/kubelet/config.yaml | grep -i cgroupDriver"
cgroupDriver: cgroupfs

In cgroup v1 with cgroupsPerQOS: true, kubelet’s use of the cgroupfs driver results in Kubernetes creating and managing separate cgroup subtrees for QoS classes under each controller.

Let's inspect the CPU controller directory structure:

minikube ssh -- "ls -la /sys/fs/cgroup/cpu/kubepods/"
drwxr-xr-x 5 root root 0 Mar 20 12:10 besteffort
drwxr-xr-x 7 root root 0 Mar 20 12:11 burstable
drwxr-xr-x 3 root root 0 Mar 20 12:12 guaranteed

Each QoS class gets its own directory under each controller.

Since our Python pod was deployed without resource requests or limits, we can locate it under the besteffort QoS class:

minikube ssh -- "ls -la /sys/fs/cgroup/cpu/kubepods/besteffort/"
drwxr-xr-x 4 root root 0 Mar 20 03:51 pod23e59e27-abe5-4529-bf9c-581516ae0c0b
drwxr-xr-x 4 root root 0 Mar 20 03:51 pod9f874003-a948-425d-a072-f389dc21bdff
drwxr-xr-x 4 root root 0 Mar 20 03:51 podc1d8cd50-b50a-4b3c-a33d-8963242c60ef

We find multiple pod directories, named by their UID.

To correlate the pod directory with the actual python pod let's retrieve its UID from the Kubernetes API:

kubectl get pod python-66dc9f5c8b-4248r -o jsonpath='{.metadata.uid}'
c1d8cd50-b50a-4b3c-a33d-8963242c60ef

This matches the directory podc1d8cd50-b50a-4b3c-a33d-8963242c60ef under the besteffort class.

Inside this pod directory, each container has its own cgroup, named after the container ID:

minikube ssh -- "ls -la /sys/fs/cgroup/cpu/kubepods/besteffort/podc1d8cd50-b50a-4b3c-a33d-8963242c60ef/"
-rw-r--r-- 1 root root 0 Mar 20 12:16 cpu.shares
-rw-r--r-- 1 root root 0 Mar 20 12:16 cpu.cfs_quota_us
drwxr-xr-x 2 root root 0 Mar 20 03:52 ef455b35bf7e2afa0942e25b58cd10858d40ed1d97fffe7f0b6a664d2e64aa54
-rw-r--r-- 1 root root 0 Mar 20 04:22 tasks

For example, we can inspect the pod’s memory limit in the memory controller:

minikube ssh -- "cat /sys/fs/cgroup/memory/kubepods/besteffort/\
podc1d8cd50-b50a-4b3c-a33d-8963242c60ef/\
memory.limit_in_bytes"

9223372036854771712

This very large value is an effectively unlimited memory ceiling, which is expected for a BestEffort pod.
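The number is not arbitrary: it is the largest signed 64-bit integer rounded down to a whole number of memory pages. The sketch below reproduces it, assuming a 4 KiB page size (common on x86-64, but not universal); the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # assumed page size; common on x86-64 but not universal


def v1_unlimited_memory(page_size: int = PAGE_SIZE) -> int:
    """Largest signed 64-bit value rounded down to a whole number of pages."""
    max_int64 = 2**63 - 1
    return (max_int64 // page_size) * page_size


def is_unlimited(limit_in_bytes: int, page_size: int = PAGE_SIZE) -> bool:
    """True when a cgroup v1 memory.limit_in_bytes value means "no limit"."""
    return limit_in_bytes >= v1_unlimited_memory(page_size)


print(v1_unlimited_memory())              # 9223372036854771712
print(is_unlimited(9223372036854771712))  # True
print(is_unlimited(128 * 1024 * 1024))    # False (128Mi)
```

A check like is_unlimited is handy in tooling that reads cgroup v1 files, because the raw value would otherwise look like a real limit of roughly 8 EiB.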

At this point, kubelet decides where the pod belongs in the QoS hierarchy, the container runtime helps create and configure the container cgroups, and the kernel enforces the resulting cgroup settings for the processes attached to them.

At the crux of how cgroup hierarchy is shaped

In cgroup v1, each controller operates in its own separate hierarchy.

When we list the mounted cgroup controllers in cgroup v1, we see each one mounted independently as its own filesystem:

minikube ssh -- "mount | grep cgroup"

cgroup on /sys/fs/cgroup/cpu type cgroup (rw,relatime,cpu)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,relatime,memory)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,relatime,pids)

This indicates that each controller, whether CPU, memory, or pids, has its own mount point and hierarchy.

We can confirm this separation by checking /proc/cgroups:

minikube ssh -- "cat /proc/cgroups"

#subsys_name    hierarchy    num_cgroups    enabled
cpuset          1            34             1
cpu             2            52             1
cpuacct         3            34             1

When we check the filesystem type of /sys/fs/cgroup/ in cgroup v1, it reports tmpfs instead of cgroup2fs:

minikube ssh -- "stat -fc %T /sys/fs/cgroup/"

tmpfs

The cgroup fs structure looks like the following:

minikube ssh -- "ls -la /sys/fs/cgroup/"

drwxr-xr-x 15 root root   0 Feb 23 05:17 blkio
drwxr-xr-x 15 root root   0 Feb 23 05:17 cpu
drwxr-xr-x  2 root root  40 Feb 23 05:17 cpu,cpuacct
drwxr-xr-x 23 root root   0 Feb 23 05:17 cpuacct
drwxr-xr-x 23 root root   0 Feb 23 05:17 cpuset
drwxr-xr-x 18 root root   0 Feb 23 05:17 devices
drwxr-xr-x 23 root root   0 Feb 23 05:17 freezer

This is the core limitation of cgroup v1: CPU, memory, pids, and other controllers can each have their own hierarchy, so resource management is split across multiple trees.

cgroup v2 fixes that part by moving controllers into a single unified hierarchy.

Now let's switch to a cgroup v2 system and examine the structure of the cgroup filesystem.

minikube ssh -- "ls -la /sys/fs/cgroup/"

-r--r--r-- 1 root root 0 Apr 28 10:51 cgroup.controllers
-r--r--r-- 1 root root 0 Apr 28 10:58 cgroup.stat
-rw-r--r-- 1 root root 0 Apr 28 10:51 memory.high
drwxr-xr-x 5 root root 0 Apr 28 10:51 kubepods.slice
...

All resource controllers are managed together in a single tree rooted at /sys/fs/cgroup/.

To confirm that cgroup v2 is active, we can inspect the mounted cgroup filesystem:

minikube ssh -- "mount | grep cgroup"

cgroup on /sys/fs/cgroup type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate,...)

We can list the active controllers that the kernel has attached to this unified hierarchy by reading /proc/cgroups.

In cgroup v2, all controllers operate within a single hierarchy, and the hierarchy column reflects this by showing 0 for each controller:

minikube ssh -- "cat /proc/cgroups"

#subsys_name    hierarchy       num_cgroups     enabled
cpu     0       208     1
cpuacct 0       208     1
blkio   0       208     1
devices 0       208     1
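The hierarchy column can be checked programmatically. This is a heuristic sketch with hypothetical function names: it treats a node as unified when every listed controller reports hierarchy 0, which matches the two outputs shown in this article but does not cover hybrid setups.

```python
def hierarchy_ids(proc_cgroups_text: str) -> dict[str, int]:
    """Map each controller in /proc/cgroups to its hierarchy ID."""
    ids = {}
    for line in proc_cgroups_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        name, hierarchy, _num_cgroups, _enabled = line.split()
        ids[name] = int(hierarchy)
    return ids


def looks_unified(proc_cgroups_text: str) -> bool:
    """True when every controller sits in hierarchy 0."""
    ids = hierarchy_ids(proc_cgroups_text)
    return bool(ids) and all(h == 0 for h in ids.values())


v1_sample = """#subsys_name hierarchy num_cgroups enabled
cpuset 1 34 1
cpu 2 52 1
cpuacct 3 34 1
"""
v2_sample = """#subsys_name hierarchy num_cgroups enabled
cpu 0 208 1
cpuacct 0 208 1
blkio 0 208 1
devices 0 208 1
"""
print(looks_unified(v1_sample))  # False
print(looks_unified(v2_sample))  # True
```

The two samples are the /proc/cgroups outputs from the cgroup v1 and cgroup v2 nodes above, with the column spacing collapsed.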

To verify the filesystem type for /sys/fs/cgroup/, we can run the stat utility.

In cgroup v2, this command reports cgroup2fs:

minikube ssh -- "stat -fc %T /sys/fs/cgroup/"

cgroup2fs

If it shows cgroup2fs, we know we’re running cgroup v2.
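From inside a Python process there is no direct equivalent of stat -fc %T in the standard library, so a common workaround is to look for files that only one layout has. A minimal sketch, with a hypothetical function name, assuming the cgroup filesystem is mounted at /sys/fs/cgroup:

```python
from pathlib import Path


def cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Return "v2", "v1", or "unknown" for the cgroup filesystem mounted at root."""
    root_path = Path(root)
    # cgroup.controllers sits at the root of a unified (v2) hierarchy.
    if (root_path / "cgroup.controllers").is_file():
        return "v2"
    # cgroup v1 mounts each controller as its own directory.
    if any((root_path / name).is_dir() for name in ("cpu", "memory", "cpu,cpuacct")):
        return "v1"
    return "unknown"


print(cgroup_version())
```

The root path is a parameter so the same check can be pointed at a different mount point, or at a temporary directory in a test.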

So cgroup v2 cleans up the kernel-side hierarchy, but it does not answer the ownership question by itself.

On a systemd-based node, Kubernetes still needs to decide who owns and manages the cgroup tree: systemd or direct filesystem writes through cgroupfs.

cgroup v1 is now only relevant for legacy systems, and its days are officially numbered.

Modern distributions such as Ubuntu 22.04+, Fedora 31+, and RHEL 9+ enable cgroup v2 by default.

Kubernetes has supported cgroup v2 as stable since v1.25, and cgroup v1 has been officially deprecated since Kubernetes v1.35 as part of KEP-5573.

Starting with Kubernetes v1.35, kubelet no longer starts on cgroup v1 nodes by default unless failCgroupV1 is explicitly set to false.

If you’re running production clusters that still use cgroup v1, you should plan a migration to cgroup v2 and define an upgrade or rollback strategy in advance.

So far, we've seen how cgroup v1 and v2 shape the filesystem layout, and we've learned how to verify which mode the node is using.

But to understand how Kubernetes actually turns that kernel structure into pod and container boundaries, we now need to look at the two decisions kubelet makes next: which cgroup manager it initializes, and which cgroup driver owns the tree.

And that is where the cgroup driver comes in.

How Kubernetes Creates and Manages the Cgroup Hierarchy

On a Kubernetes node, kubelet and the container runtime collaborate to build and maintain the cgroup hierarchy used for enforcing pod-level resource constraints.

Before either component can create or manage any cgroups, kubelet needs to resolve one fundamental question: is the node running cgroup v1 or cgroup v2?

That answer comes early.

At startup, kubelet queries the kernel to determine the active cgroup mode.

If it detects cgroup v2, it initializes a v2-specific manager built for the unified hierarchy.

If the node is using cgroup v1, it falls back to a legacy manager.

This decision locks in the way kubelet will interact with kernel-level resource controls for the lifetime of the process.

But the cgroup version is only half the equation.

The other part is who is responsible for actually managing the cgroup tree within /sys/fs/cgroup/.

This is called the cgroup driver.

Kubelet supports two drivers: systemd and cgroupfs.

It picks one or the other, never both at the same time.

In cgroup v2, the unified hierarchy makes the systemd cgroup driver the recommended choice on systemd-based Linux distributions.

Kubelet can still be configured to use cgroupfs, but Kubernetes recommends avoiding a setup where systemd and Kubernetes manage cgroups separately.

If the driver is systemd, kubelet hands cgroup creation to systemd; instead of writing directories itself, it generates logical slice names like kubepods.slice or kubepods-besteffort.slice.

These slices represent pod resource groups.

After generating the slice names, kubelet asks systemd to instantiate and manage the cgroup structure beneath /sys/fs/cgroup.

This is the part cgroup v2 does not solve alone: ownership of the tree needs to be consistent.

From that point on, all resource controls for pods are expressed through systemd’s unit model.
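Slice names encode their own position in the tree: systemd treats each dash in a slice name as one level of nesting. The sketch below expands a slice name into the directory path it occupies relative to /sys/fs/cgroup; the function name is hypothetical.

```python
def slice_to_path(slice_name: str) -> str:
    """Expand a systemd slice name into its nested cgroup path.

    systemd treats each dash in a slice name as a level of nesting, so
    "a-b.slice" lives inside "a.slice".
    """
    if not slice_name.endswith(".slice"):
        raise ValueError(f"not a slice unit: {slice_name!r}")
    parts = slice_name[: -len(".slice")].split("-")
    levels = ["-".join(parts[: i + 1]) + ".slice" for i in range(len(parts))]
    return "/".join(levels)


print(slice_to_path("kubepods.slice"))
# kubepods.slice
print(slice_to_path("kubepods-besteffort.slice"))
# kubepods.slice/kubepods-besteffort.slice
```

This is why kubepods-besteffort.slice appears as a directory inside kubepods.slice rather than next to it.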

Why systemd?

Because when you boot a modern Linux system, systemd is the first userspace process the kernel runs.

It becomes PID 1.

As PID 1, systemd takes ownership of process supervision and resource control for the entire system.

Rather than using shell scripts, systemd defines behavior through typed units.

Units are structured configuration objects like .service, .scope, and .slice.

A slice is how systemd partitions the system for resource control.

In Kubernetes, slices are created automatically by systemd based on pod QoS classes.

Think of slices like namespaces for CPU and memory budgets, managed for you behind the scenes.

What matters is you can apply limits at the slice level.

Services are the more familiar systemd unit type.

A .service represents a process that systemd starts and supervises directly.

On a Kubernetes node, kubelet and containerd usually run as services:

kubelet.service
containerd.service

These services live under system.slice, not under kubepods.slice.

That distinction matters: kubelet and containerd are host daemons that coordinate pod placement and container startup, but the containers themselves do not become children of containerd.service.

The actual container processes are placed into Kubernetes pod cgroups under kubepods.slice.

Scopes are different.

Scopes are used when systemd needs to manage a process it inherits from another launcher and still wants to control.

For example, when the runtime launches a container, systemd can still take over and manage it.

It does this by wrapping the container process in a .scope unit (such as cri-containerd-<container-id>.scope) and placing that unit inside the slice determined by the pod’s quality of service (QoS) class.

But this only works if both kubelet and the container runtime agree on the cgroup driver.

If kubelet generates systemd slice names but containerd uses cgroupfs, the contract breaks.

If the cgroup driver is cgroupfs, kubelet goes back to the older model: direct filesystem ownership.

Kubelet interacts with the kernel’s cgroup API through the filesystem to create and manage cgroup directories.

Let’s step back into our Minikube cluster running cgroup v2 with containerd as the runtime.

Containerd handles its side of the driver agreement in its configuration file, /etc/containerd/config.toml, through the SystemdCgroup parameter:

minikube ssh -- "sudo cat /etc/containerd/config.toml | grep -i -C2 'SystemdCgroup'"
runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

  [plugins."io.containerd.grpc.v1.cri".cni]

This is the config version 2 format used by containerd 1.x.
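The agreement between the two config files can be checked mechanically. The sketch below is a line-oriented heuristic, not a full TOML or YAML parser. It assumes the kubelet config has a top-level cgroupDriver key and the containerd config has a single SystemdCgroup line, as in the outputs shown in this article. The function names are hypothetical.

```python
import re
from typing import Optional


def kubelet_driver(kubelet_yaml: str) -> Optional[str]:
    """Return the top-level cgroupDriver value from kubelet config text."""
    match = re.search(r"^cgroupDriver:[ \t]*(\S+)[ \t]*$", kubelet_yaml, re.MULTILINE)
    return match.group(1) if match else None


def containerd_driver(config_toml: str) -> Optional[str]:
    """Map the first SystemdCgroup line to a driver name, or None if absent."""
    match = re.search(
        r"^[ \t]*SystemdCgroup[ \t]*=[ \t]*(true|false)[ \t]*$",
        config_toml,
        re.MULTILINE,
    )
    if match is None:
        return None
    return "systemd" if match.group(1) == "true" else "cgroupfs"


def drivers_agree(kubelet_yaml: str, config_toml: str) -> bool:
    """True only when both settings are present and name the same driver."""
    kubelet = kubelet_driver(kubelet_yaml)
    return kubelet is not None and kubelet == containerd_driver(config_toml)
```

Feeding it the text of /var/lib/kubelet/config.yaml and /etc/containerd/config.toml gives a quick yes or no before digging into kubelet logs.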

Once kubelet and the runtime align on both the cgroup version and the driver, kubelet can safely take ownership of building the pod-level cgroup hierarchy.

But in systemd with cgroup v2, which scope unit goes into which systemd slice?

That’s determined by the pod’s QoS class, which kubelet calculates based on the pod’s resource requests and limits.

Kubernetes QoS Classes and cgroup Placement

Based on the pod’s resource requests and limits, Kubernetes assigns it to one of three Quality-of-Service (QoS) classes, which influences where the pod is placed in the cgroup hierarchy.

A pod is classified as Guaranteed only when every container has CPU and memory requests and limits set, and each request exactly matches its corresponding limit.
A pod is Burstable when it defines at least one CPU or memory request or limit but does not meet the stricter Guaranteed rules.
A pod is BestEffort when none of its containers define CPU or memory requests or limits.
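The three rules above can be sketched as a small Python function. This is a simplified illustration, not Kubernetes' implementation: it ignores init containers, and it compares quantities as plain strings, so "500m" and "0.5" would not count as equal here. The function name qos_class and the input shape are assumptions for this example.

```python
from typing import Dict, List

RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Dict[str, Dict[str, str]]]) -> str:
    """Classify a pod from its containers' requests and limits.

    Each container is a dict with optional "requests" and "limits" maps
    keyed by resource name.
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in RESOURCES:
            limit = limits.get(resource)
            # A missing request defaults to the limit when a limit is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

A pod with matching 500m CPU and 128Mi memory requests and limits comes out as Guaranteed, while a pod whose containers set nothing comes out as BestEffort.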

This QoS-to-cgroup hierarchy behavior is controlled by kubelet’s --cgroups-per-qos flag, which defaults to true.

When cgroupsPerQOS: true and systemd manages cgroups on a cgroup v2 node, systemd organizes pods under kubepods.slice and further into slices based on QoS classes.

Let's inspect the root qos directory:

minikube ssh -- "ls -d /sys/fs/cgroup/kubepods.slice/*/"
/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/
/sys/fs/cgroup/kubepods.slice/kubepods-burstable.slice/
/sys/fs/cgroup/kubepods.slice/kubepods-poded2df55a_639e_4beb_aee3_5db422c35910.slice/

Notice the third entry.

It is not a QoS slice like kubepods-besteffort.slice or kubepods-burstable.slice.

This is a pod-level cgroup.

The hex string after the pod prefix is a Kubernetes pod UID, ed2df55a-639e-4beb-aee3-5db422c35910, with its hyphens replaced by underscores.

Let's verify which pod owns that UID:

kubectl get pods -A \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,UID:.metadata.uid' \
  | grep ed2df55a
kube-system   kindnet-qkqvh   ed2df55a-639e-4beb-aee3-5db422c35910

So the third cgroup entry belongs to the kindnet-qkqvh pod in the kube-system namespace.

Now let's verify its QoS class from the Kubernetes API:

kubectl get pod kindnet-qkqvh -n kube-system -o jsonpath='{.status.qosClass}{"\n"}'
Guaranteed

Now, if we print the QoS class and UID together:

kubectl get pod kindnet-qkqvh -n kube-system -o jsonpath='QoS={.status.qosClass}{"\n"}UID={.metadata.uid}{"\n"}'
QoS=Guaranteed
UID=ed2df55a-639e-4beb-aee3-5db422c35910

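The mapping holds: this cgroup belongs to the kindnet-qkqvh pod, and Kubernetes classifies that pod as Guaranteed, which is why its slice sits directly under kubepods.slice rather than inside a QoS slice.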

Now let's look inside that pod cgroup:

minikube ssh -- "ls -la /sys/fs/cgroup/kubepods.slice/kubepods-poded2df55a_639e_4beb_aee3_5db422c35910.slice/"
cri-containerd-7ae5ffd3996a6ac09031cbf283d6bd9727a24bc723a06e76141132a8e57f1716.scope
cri-containerd-d24246f29f54f7adced123bc6194d9e0f15fd3a15c54326cd8c96d39961760c0.scope

The two cri-containerd-*.scope entries are the container-level systemd scope units running inside the kindnet-qkqvh pod.

We have traced a Guaranteed pod all the way down from the Kubernetes API to its pod slice and container scopes on disk.

Simplified to the branch we just inspected, the mapping looks like this:

/sys/fs/cgroup/
└── kubepods.slice
    └── kubepods-poded2df55a_639e_4beb_aee3_5db422c35910.slice
        ├── cri-containerd-7ae5ffd3996a6ac09031cbf283d6bd9727a24bc723a06e76141132a8e57f1716.scope
        └── cri-containerd-d24246f29f54f7adced123bc6194d9e0f15fd3a15c54326cd8c96d39961760c0.scope

Now let’s do the same for our Python workload, which lands in a different part of the hierarchy because it has a different QoS class.

Inside the root slice, systemd further organizes pods into separate slices based on their QoS classes.

Since our Python pod was deployed without any CPU or memory requests or limits, its resources are managed under kubepods-besteffort.slice.

Let's confirm the QoS classification of the pod:

kubectl get pod python-66dc9f5c8b-2kktd -o jsonpath='{.status.qosClass}'
BestEffort

Let's map our python pod and containers to their systemd-managed cgroup slices and scopes.

To achieve this we will get the pod UID to map it to the slice name:

kubectl get pod python-66dc9f5c8b-2kktd -o jsonpath='{.metadata.uid}'
b60baa0b-1e66-4990-8670-93c5919f09cb

Each pod gets its own slice under its QoS slice. Because systemd treats hyphens in a slice name as hierarchy separators, the hyphens in the pod UID are replaced with underscores in the directory name (kubepods-{qos class}-pod{pod UID with underscores}.slice).
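The naming rule can be written down as a short helper. This sketch assumes the layout observed in this article's Minikube node (systemd driver, cgroup v2, cgroupsPerQOS enabled), where Guaranteed pods sit directly under kubepods.slice. The function name pod_slice_path is hypothetical.

```python
def pod_slice_path(uid: str, qos: str, root: str = "/sys/fs/cgroup") -> str:
    """Build the expected pod slice directory for a pod UID and QoS class."""
    if qos not in ("Guaranteed", "Burstable", "BestEffort"):
        raise ValueError(f"unknown QoS class: {qos!r}")
    # Hyphens separate hierarchy levels in slice names, so the UID uses underscores.
    escaped = uid.replace("-", "_")
    if qos == "Guaranteed":
        return f"{root}/kubepods.slice/kubepods-pod{escaped}.slice"
    tier = qos.lower()  # "burstable" or "besteffort"
    return (
        f"{root}/kubepods.slice/kubepods-{tier}.slice/"
        f"kubepods-{tier}-pod{escaped}.slice"
    )


print(pod_slice_path("b60baa0b-1e66-4990-8670-93c5919f09cb", "BestEffort"))
```

The printed path should match the directory for our Python pod; on nodes with a different driver or cgroup layout, the paths will differ.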

List the available pod slices under kubepods-besteffort.slice:

minikube ssh -- "ls -d /sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/*/"
/sys/fs/cgroup/.../kubepods-besteffort-pod740242e7_85e5_4369_a8a0_d6101719e386.slice/
/sys/fs/cgroup/.../kubepods-besteffort-pod857495d4_07b5_45a2_895b_0298f68797d8.slice/
/sys/fs/cgroup/.../kubepods-besteffort-podb60baa0b_1e66_4990_8670_93c5919f09cb.slice/

The last pod slice corresponds to our Python pod (its UID matches b60baa0b-1e66-4990-8670-93c5919f09cb).

The other entries belong to other BestEffort pods on the node, typically kube-system pods that set no CPU or memory requests or limits.

Within this pod slice, systemd organizes each container into separate .scope units.

These scopes are named after the containerd runtime and container ID.

List the contents of the specific pod slice:

minikube ssh -- "ls /sys/fs/cgroup/kubepods.slice/\
kubepods-besteffort.slice/kubepods-besteffort-podb60baa0b_1e66_4990_8670_93c5919f09cb.slice/ | grep scope"
cri-containerd-b21e881ca9d6228281aa32cb1e2ebba5537f2a7b90e860a2f0cc6afec3305229.scope
cri-containerd-b8609ccf36f85b5a4fc652317358950861a6f0a538e6c4b4c4243241189fbc11.scope

The long hex strings above are the container IDs assigned by containerd.

Each ID is embedded in the name of the .scope unit created for that container.

So now the question is: which one of these is your Python container?

We query containerd to match the container ID:

minikube ssh -- "sudo crictl ps --name python"
CONTAINER           IMAGE          NAME              POD ID            POD
b21e881ca9d62       bdbec6b439339  python-metrics    b8609ccf36f85     python-66dc9f5c8b-2kktd

The container ID b21e881ca9d62 matches the first .scope unit above.

The other one (b8609ccf36f85...) is the pod sandbox, which is the pause container we will inspect shortly.

First, let's list the files inside the Python container's scope:

minikube ssh -- "\
ls -la \
/sys/fs/cgroup/kubepods.slice/kubepods-besteffort.slice/\
kubepods-besteffort-podb60baa0b_1e66_4990_8670_93c5919f09cb.slice/\
cri-containerd-b21e881ca9d6228281aa32cb1e2ebba5537f2a7b90e860a2f0cc6afec3305229.scope"
cpu.max
hugetlb.2MB.events
memory.high
memory.stat

At this point, the hierarchy for the Python pod looks like this:

/sys/fs/cgroup/
└── kubepods.slice
    └── kubepods-besteffort.slice
        └── kubepods-besteffort-podb60baa0b_1e66_4990_8670_93c5919f09cb.slice
            ├── cri-containerd-b21e881ca9d6228281aa32cb1e2ebba5537f2a7b90e860a2f0cc6afec3305229.scope
            │   └── python-metrics container
            └── cri-containerd-b8609ccf36f85b5a4fc652317358950861a6f0a538e6c4b4c4243241189fbc11.scope
                └── pod sandbox / pause container

We can now dig into its cgroup resource metrics like memory usage statistics.

minikube ssh -- "cat /sys/fs/cgroup/kubepods.slice/\
kubepods-besteffort.slice/kubepods-besteffort-podb60baa0b_1e66_4990_8670_93c5919f09cb.slice/\
cri-containerd-b21e881ca9d6228281aa32cb1e2ebba5537f2a7b90e860a2f0cc6afec3305229.scope/\
memory.stat" | head -5
anon 9601024
file 13496320
kernel 1056768
kernel_stack 16384
pagetables 94208

Great!
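Since memory.stat is a flat list of key and value pairs, it is easy to load into a dictionary. The following is a minimal sketch; the function name parse_memory_stat is an assumption, and the sample text is copied from the output above.

```python
from typing import Dict


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse cgroup v2 memory.stat text into a {counter: value} dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:  # skip blank or malformed lines
            stats[parts[0]] = int(parts[1])
    return stats


sample = """anon 9601024
file 13496320
kernel 1056768
kernel_stack 16384
pagetables 94208"""

stats = parse_memory_stat(sample)
print(f"anon: {stats['anon'] / 2**20:.2f} MiB, file: {stats['file'] / 2**20:.2f} MiB")
```

Values in memory.stat are byte counts, so dividing by 2**20 gives MiB.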

But what about the other scope?

In this setup, even a Pod with a single application container has two active container scopes under the pod slice: one for the application container, one for the pause container.

The pause container is the pod’s sandbox: it holds the shared network namespace (and with it the pod’s IP address) and the IPC namespace for the pod.

Once the sandbox is running and holding that shared environment, Kubernetes starts the Python container inside those namespaces.

Let’s inspect the pod sandbox b8609ccf36f85 to confirm the pause container:

minikube ssh -- "sudo crictl inspectp b8609ccf36f85 | grep image"
"image": "registry.k8s.io/pause:3.10.1",

The pause container maps to the other .scope unit, but how can we verify it?

We inspect the pod sandbox to retrieve the pause container's PID:

minikube ssh -- "sudo crictl inspectp b8609ccf36f85 | grep -E '\"pid\"'"
"pid": "CONTAINER",
    "pid": 1647,

PID 1647 corresponds to the pause container.

We correlate the PID with the running process and its parent shim:

minikube ssh -- "sudo ps -e -o pid,ppid,cmd | grep -E '\\b1603\\b|\\b1647\\b'"
1603       1 /usr/bin/containerd-shim-runc-v2 -namespace k8s.io -id b8609... -address /run/containerd/containerd.sock
1647    1603 /pause
1694    1603 /usr/local/bin/python3 -m http.server 8080

The second scope is the pause container.

PID 1647 is the /pause process, and it shares the same containerd-shim-runc-v2 parent, PID 1603, with the Python process 1694.

Auto-Detecting cgroup Drivers via KubeletCgroupDriverFromCRI

Kubernetes addressed some of the coordination challenges with the KubeletCgroupDriverFromCRI feature gate, introduced as alpha in v1.28 and graduated to GA in v1.34.

At startup, kubelet asks the runtime which cgroup driver to use through the CRI RuntimeConfig RPC.

On Kubernetes 1.34+, the feature gate no longer needs to be set explicitly.

If the runtime does not implement the RuntimeConfig RPC, kubelet falls back to the cgroupDriver value in its own configuration, but only in Kubernetes versions that still support this fallback.
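The selection order described above boils down to a simple preference rule. The Python sketch below only models that rule; it is not kubelet's code, and the function name resolve_cgroup_driver is hypothetical. It assumes the runtime's answer is passed in as None when the RuntimeConfig RPC is unavailable.

```python
from typing import Optional

KNOWN_DRIVERS = ("systemd", "cgroupfs")


def resolve_cgroup_driver(runtime_reported: Optional[str], configured: str) -> str:
    """Prefer the driver reported by the runtime, else the configured value.

    runtime_reported is None when the runtime does not implement the
    RuntimeConfig RPC, which triggers the fallback to kubelet's config.
    """
    if runtime_reported in KNOWN_DRIVERS:
        return runtime_reported
    return configured
```

With this rule, a runtime that reports systemd wins over a stale cgroupfs value in the kubelet config, which is the mismatch the feature is meant to prevent.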

Let's start a new cluster using CRI-O as the container runtime:

minikube start -p test-driverfromcri --container-runtime=cri-o

When we inspect the /var/lib/kubelet/config.yaml file, the kubelet config still shows the configured fallback driver:

minikube ssh -p test-driverfromcri -- "sudo cat /var/lib/kubelet/config.yaml | grep -A2 cgroupDriver"
cgroupDriver: systemd
clusterDNS:
- 10.96.0.10

If the CRI runtime does not implement the RuntimeConfig RPC, kubelet falls back to the configured cgroupDriver:

minikube ssh -p test-driverfromcri -- "sudo journalctl -u kubelet | grep -E 'RuntimeConfig|CRI implementation'"
"RuntimeConfig from runtime service failed" err="rpc error: code = Unimplemented desc = unknown method RuntimeConfig"
"CRI implementation should be updated to support RuntimeConfig. Falling back to using cgroupDriver from kubelet config."

Finally, once kubelet settles on a cgroup driver, it uses that driver consistently when placing pods and containers into the node’s cgroup hierarchy.

The container runtime then passes the resulting cgroup placement into the OCI runtime layer, where runc/libcontainer applies it by writing to the kernel’s cgroup interfaces.

Whether the hierarchy is represented through systemd slices and scopes or raw cgroupfs directories, the end result is the same: the Linux kernel enforces the configured CPU, memory, and other resource limits.

At this point, we have seen both sides: the cgroupfs driver with directly managed filesystem hierarchies, and the systemd driver with slices and scopes on cgroup v2.

But enforcement is only half of the story.

The kernel exposes raw counters, limits, and events through the cgroup filesystem, but Kubernetes still needs a component that can read those low-level files and turn them into useful container and pod-level metrics.

That is the visibility gap cAdvisor was designed to fill.

cAdvisor: Embedded Resource Monitoring in Kubelet

Container Advisor, or cAdvisor, is the default kubelet-integrated path for collecting container resource usage statistics on Kubernetes nodes.

It runs as an embedded component inside the kubelet process and is initialized automatically when kubelet starts.

Once initialized, it reads low-level resource usage data from the cgroup filesystem and attaches labels such as pod, namespace, container, and image.

Kubelet then exposes the collected metrics through its own HTTP endpoints: the Summary API and cAdvisor metrics endpoint.

If PodAndContainerStatsFromCRI is enabled and the container runtime supports stats through CRI, kubelet fetches pod and container metrics from the runtime instead of cAdvisor.

Kubelet’s Metrics Endpoints

Kubelet exposes several distinct metrics and stats endpoints on its HTTP server.

Each serves a specific purpose and differs in data granularity, format, and source.

The /metrics/cadvisor endpoint exposes high-resolution container metrics in Prometheus format.

These metrics come directly from cAdvisor, and kubelet passes them through as-is to the scraper.

Prometheus typically scrapes this endpoint to collect detailed per-container metrics such as CPU time, memory usage, and I/O statistics.

These metrics are useful for low-level monitoring, fine-grained alerting, and capacity planning.

To query the kubelet’s /metrics/cadvisor endpoint, we first need to establish a local proxy to the Kubernetes API server.

Run the following command and leave it running in another terminal:

kubectl proxy --port=8001

With the proxy running, requests sent to http://localhost:8001 reach the API server, which forwards node proxy paths to kubelet’s HTTP endpoints on the node.

curl -sS http://localhost:8001/api/v1/nodes/minikube/proxy/metrics/cadvisor

container_cpu_usage_seconds_total{container="python-metrics",cpu="total",pod="python-66dc9f5c8b-2kktd"} 0.105818
container_memory_usage_bytes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 2.5870336e+07
container_fs_reads_bytes_total{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 1.49504e+07
container_processes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 1
container_spec_cpu_shares{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 2
container_spec_memory_limit_bytes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 0
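Lines like these follow the Prometheus text exposition format: a metric name, an optional label set in braces, a value, and an optional timestamp. The sketch below parses one sample line with regular expressions. It is a simplified illustration (escaped characters inside label values are not unescaped), and the name parse_sample is an assumption.

```python
import re
from typing import Dict, Optional, Tuple

_SAMPLE = re.compile(
    r"^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)"
    r"(?:\{(?P<labels>.*)\})?"
    r"\s+(?P<value>\S+)"
    r"(?:\s+(?P<ts>-?\d+))?$"
)
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float, Optional[int]]:
    """Split one sample line into (name, labels, value, timestamp)."""
    match = _SAMPLE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(_LABEL.findall(match.group("labels") or ""))
    timestamp = int(match.group("ts")) if match.group("ts") else None
    return match.group("name"), labels, float(match.group("value")), timestamp


line = 'container_memory_usage_bytes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 2.5870336e+07'
name, labels, value, timestamp = parse_sample(line)
print(name, labels["container"], int(value))
```

Comment lines such as # HELP and # TYPE are not samples, so this function raises ValueError for them; a caller would skip lines starting with #.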

Related node, pod, container, and volume stats are also available through kubelet’s Summary API on /stats/summary, which returns structured JSON instead of Prometheus-formatted metrics. Metrics Server v0.6.0 and later use /metrics/resource for CPU and memory metrics instead.

For example, to inspect our pod’s resource consumption, we can run:

curl -sS \
  http://localhost:8001/api/v1/nodes/minikube/proxy/stats/summary \
  | jq '.pods[] | select(.podRef.name == "python-66dc9f5c8b-2kktd")'
{
  "podRef": {
    "name": "python-66dc9f5c8b-2kktd",
    "namespace": "default",
    "uid": "b60baa0b-1e66-4990-8670-93c5919f09cb"
  },
  "containers": [
    {
      "name": "python-metrics",
      "cpu": {
        "usageNanoCores": 151695,
        "usageCoreNanoSeconds": 226134000
      },
      "memory": {
        "usageBytes": 25870336,
        "workingSetBytes": 22114304,
        "rssBytes": 9596928,
        "pageFaults": 3346,
        "majorPageFaults": 136
      },
      "rootfs": {
        "usedBytes": 122880
      },
      "logs": {
        "usedBytes": 8192
      },
      "swap": {
        "swapAvailableBytes": 0,
        "swapUsageBytes": 0
      }
    }
  ]
}
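The same selection that jq performs can be done in Python. This is a minimal sketch that assumes kubectl proxy is listening on localhost:8001 as set up earlier; the function names fetch_summary and select_pod are hypothetical, and only select_pod is needed if the JSON is already in hand.

```python
import json
from typing import Any, Dict, Optional
from urllib.request import urlopen


def fetch_summary(node: str, proxy: str = "http://localhost:8001") -> Dict[str, Any]:
    """Fetch /stats/summary for a node through a running kubectl proxy."""
    url = f"{proxy}/api/v1/nodes/{node}/proxy/stats/summary"
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def select_pod(summary: Dict[str, Any], name: str) -> Optional[Dict[str, Any]]:
    """Return the pod entry whose podRef.name matches, or None."""
    for pod in summary.get("pods", []):
        if pod.get("podRef", {}).get("name") == name:
            return pod
    return None
```

For example, select_pod(fetch_summary("minikube"), "python-66dc9f5c8b-2kktd") returns the same object that the jq filter prints.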

If you only need simplified, high-level metrics, /metrics/resource serves that role.

It exposes CPU and memory usage in Prometheus format, optimized for lightweight node monitoring.

We can query this endpoint for aggregated container and pod metrics:

curl -sS http://localhost:8001/api/v1/nodes/minikube/proxy/metrics/resource | grep python-metrics
container_cpu_usage_seconds_total{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 0.298696 1777623311728
container_memory_working_set_bytes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 2.2114304e+07 1777623311728
container_start_time_seconds{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 1.7776221060112867e+09
container_swap_limit_bytes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 0 1777623324188
container_swap_usage_bytes{container="python-metrics",pod="python-66dc9f5c8b-2kktd"} 0 1777623324188

These metrics report the cumulative CPU time and the current memory working set of the pod and its containers.

What if we need to debug kubelet’s performance or runtime interactions?

Kubelet exposes its own internal metrics at the /metrics endpoint.

These metrics include runtime operation durations, event counters, and error rates that reflect how kubelet interacts with the container runtime and manages node resources.

For instance, if pods take longer to start or containers fail to stop cleanly, reviewing kubelet_runtime_operations_duration_seconds can reveal latency bottlenecks between kubelet and the runtime:

curl -sS \
  http://localhost:8001/api/v1/nodes/minikube/proxy/metrics \
  | grep kubelet_runtime_operations_duration_seconds \
  | tail -n 3
kubelet_runtime_operations_duration_seconds_bucket{operation_type="version",le="+Inf"} 152
kubelet_runtime_operations_duration_seconds_sum{operation_type="version"} 0.12228928199999994
kubelet_runtime_operations_duration_seconds_count{operation_type="version"} 152
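The _sum and _count series of a histogram give the average duration directly: divide the total seconds by the number of observations. A small sketch using the values from the output above (the function name mean_duration is an assumption):

```python
def mean_duration(sum_seconds: float, count: int) -> float:
    """Average observation from a Prometheus histogram's _sum and _count."""
    if count == 0:
        return 0.0
    return sum_seconds / count


mean = mean_duration(0.12228928199999994, 152)
print(f"{mean * 1000:.3f} ms per version call")
```

For the version operation this works out to roughly 0.8 ms per call, which suggests no latency problem on that path in this run.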

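The four kubelet metrics endpoints fit together like this:

/metrics/cadvisor serves cAdvisor’s per-container and machine metrics in Prometheus format.
/stats/summary serves node, pod, container, and volume stats as structured JSON.
/metrics/resource serves lightweight CPU and memory metrics in Prometheus format.
/metrics serves kubelet’s own internal metrics, such as runtime operation durations and error counts.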

Historically, cAdvisor was Kubernetes’ primary mechanism for container resource monitoring.

It provided an efficient mechanism for exposing container metrics when workloads were simpler and observability requirements were limited.

But as Kubernetes matured, a question appeared.

If kubelet already talks to the container runtime through CRI, why should it always ask cAdvisor to rediscover the same containers from the host filesystem?

To answer that, we need to look at cAdvisor’s design first.

From cAdvisor to CRI: How Kubelet Collects Metrics Today

Originally, cAdvisor collected container metrics by observing the Linux host directly.

That model worked well for the classic Linux container path, where containers were visible through the host’s cgroup hierarchy.

But Kubernetes later standardized kubelet-to-runtime communication through the Container Runtime Interface (CRI).

CRI is a gRPC-based API that lets kubelet talk to different container runtimes without being tied to a specific runtime implementation.

That brings us back to the earlier question.

If the runtime already created the containers and already tracks their state, why should kubelet always rely on cAdvisor to rediscover that information from the host?

That is the design reason behind the CRI stats path.

With this path, kubelet gets pod and container stats directly from the runtime.

That path avoids collecting the same data twice when the runtime already has it.

It also helps with runtimes where cAdvisor cannot easily see containers from the host.

But how does kubelet achieve that?

We can verify the exact method names directly from the CRI protobuf definition:

curl -sSL https://raw.githubusercontent.com/kubernetes/cri-api/master/pkg/apis/runtime/v1/api.proto \
  | grep -E 'rpc (ContainerStats|ListContainerStats|PodSandboxStats|ListPodSandboxStats)'
    rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
    rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}
    rpc PodSandboxStats(PodSandboxStatsRequest) returns (PodSandboxStatsResponse) {}
    rpc ListPodSandboxStats(ListPodSandboxStatsRequest) returns (ListPodSandboxStatsResponse) {}

The runtime exposes stats through CRI RPC methods.

These calls return structured Protobuf messages containing resource usage data such as CPU, memory, network, process, IO, and per-container stats, depending on the platform and runtime implementation.

With PodAndContainerStatsFromCRI enabled, kubelet can use CRI stats methods such as ListPodSandboxStats, PodSandboxStats, and ListContainerStats to collect pod and container metrics from the runtime.

Kubelet sends these gRPC requests to the runtime endpoint configured on the node.

For containerd, that endpoint is commonly /run/containerd/containerd.sock.

For CRI-O, it is commonly /var/run/crio/crio.sock.

Once kubelet receives stats from the runtime, it converts the CRI Protobuf responses into kubelet’s internal stats structures and then exposes the resulting stats.

But did we bypass cAdvisor completely?

No.

Even on the CRI stats path, kubelet can still rely on cAdvisor for node-level and filesystem-related stats that are outside the pod and container stats returned by CRI.

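The two stats paths look like this:

Default path: cAdvisor, running inside kubelet, reads the cgroup filesystem, and kubelet exposes those stats.
CRI path: with PodAndContainerStatsFromCRI enabled, kubelet calls the runtime’s CRI stats RPCs for pod and container stats, while cAdvisor still supplies node-level and filesystem-related stats.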

Validating CRI-Based Metrics Collection in Kubelet

Now that we understand why Kubernetes added a CRI-based stats path alongside cAdvisor, let’s validate that kubelet is actually pulling metrics from the runtime.

We’ll configure kubelet to use CRI-based metrics, confirm it through logs, and compare kubelet’s reported data to what containerd provides directly.

We start by increasing kubelet’s log verbosity by editing its unit file to pass the --v=5 argument.

/etc/systemd/system/kubelet.service.d/10-kubeadm.conf

Inside the above file, we ensure the ExecStart line includes the verbose logging flag.

[Unit]
Wants=containerd.service

[Service]
ExecStart=
ExecStart=/var/lib/minikube/binaries/v1.34.0/kubelet \
  --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf \
  --config=/var/lib/kubelet/config.yaml \
  --hostname-override=minikube \
  --kubeconfig=/etc/kubernetes/kubelet.conf \
  --node-ip=192.168.49.2 \
  --v=5

[Install]

Once we save the configuration, we reload the systemd daemon and restart kubelet.

sudo systemctl daemon-reload
sudo systemctl restart kubelet

First, validate that the container runtime’s socket is active and listening:

minikube ssh -- "ss -lx | grep containerd.sock"
u_str LISTEN 0      4096   /run/containerd/containerd.sock.ttrpc 80566      * 0
u_str LISTEN 0      4096   /run/containerd/containerd.sock 79442            * 0

Containerd is exposing its CRI endpoint over /run/containerd/containerd.sock.

Next, verify kubelet is configured to use the correct runtime endpoint:

minikube ssh -- "sudo cat /var/lib/kubelet/config.yaml | grep -i containerRuntimeEndpoint"
containerRuntimeEndpoint: unix:///run/containerd/containerd.sock

Kubelet is communicating with the correct CRI runtime over the expected UNIX domain socket.

Let's tell kubelet to use the CRI for collecting pod and container stats by enabling the PodAndContainerStatsFromCRI feature gate.

Before we flip this switch, one thing is worth knowing.

Kubelet reports the maturity of every feature gate it knows about through the /metrics endpoint, under the kubernetes_feature_enabled series.

Querying that series for PodAndContainerStatsFromCRI on a fresh Kubernetes 1.34 cluster gives us:

curl -sS http://localhost:8001/api/v1/nodes/minikube/proxy/metrics \
  | grep 'kubernetes_feature_enabled.*PodAndContainer'

kubernetes_feature_enabled{name="PodAndContainerStatsFromCRI",stage="ALPHA"} 0

stage="ALPHA" marks the gate as alpha, and the value 0 means it is disabled, which is the default for this gate.

We open kubelet's /var/lib/kubelet/config.yaml configuration file on the minikube node and make sure the following block is present:

...
featureGates:
  PodAndContainerStatsFromCRI: true

Then we restart kubelet once more.

sudo systemctl restart kubelet

At this point, kubelet should be sourcing pod and container metrics directly from containerd over the CRI API.

We inspect the kubelet logs with the following command:

sudo journalctl -u kubelet | grep -i containerstats

May 01 10:27:57 minikube kubelet[4205]: feature gates: {map[PodAndContainerStatsFromCRI:true]}
May 01 10:27:57 minikube kubelet[4205]: "PodAndContainerStatsFromCRI": true

Great!

We see kubelet successfully loads the PodAndContainerStatsFromCRI gate.

But its output doesn’t confirm that metrics are being retrieved from the runtime.

/stats/summary is kubelet's primary interface for exposing metrics that it collects, whether from cAdvisor or directly from the container runtime through the CRI.

When PodAndContainerStatsFromCRI is enabled, kubelet populates this endpoint with data retrieved from the runtime.

Let's query the /stats/summary endpoint to observe the metrics kubelet is serving and confirm whether they match what the runtime reports.

Start kubectl proxy in a separate terminal if it is not already running, then query the summary stats for our pod:

kubectl proxy --port=8001
curl -sS \
  http://localhost:8001/api/v1/nodes/minikube/proxy/stats/summary \
  | jq '.pods[] | select(.podRef.name == "python-66dc9f5c8b-2kktd")'
{
  "podRef": {
    "name": "python-66dc9f5c8b-2kktd",
    "namespace": "default"
  },
  "containers": [
    {
      "name": "python-metrics",
      "cpu": {
        "usageNanoCores": 149575,
        "usageCoreNanoSeconds": 1647087000
      },
      "memory": {
        "workingSetBytes": 22114304
      }
    }
  ]
}

The Summary API reports 22114304 bytes of memory working set, about 22.11 MB, and 149575 nanocores of current CPU usage for the python-metrics container.

But how do we know kubelet sourced this from containerd, not cAdvisor?

We can cross-check by querying containerd directly with crictl.

But first, we need to confirm the container ID:

kubectl get pod python-66dc9f5c8b-2kktd -o jsonpath='{.status.containerStatuses[*].containerID}'
containerd://9b508d38b441b

Now we SSH into the node and run crictl stats.

minikube ssh -- sudo crictl stats

CONTAINER           CPU %               MEM                 DISK                INODES
...
5e63e93291a32       0.21                75.7MB              36.86kB             11
62bbd4d869537       0.04                66.93MB             65.54kB             24
6cff256e868f3       0.00                37.74MB             65.54kB             24
9b508d38b441b       0.02                22.11MB             122.9kB             16

The python-metrics container appears as container ID 9b508d38b441b in crictl stats, with MEM reported as 22.11MB.

That matches the Summary API value.

CPU is harder to match exactly because both values are point-in-time samples, but they are consistent: kubelet reports 149575 nanocores, and crictl stats shows 0.02% CPU for the same container.
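To put the two CPU numbers on the same scale, convert nanocores to a percentage of one core. One core is 1,000,000,000 nanocores. The sketch below assumes that the CPU % column in crictl stats is expressed relative to a single core; the function name nanocores_to_percent is hypothetical.

```python
NANOCORES_PER_CORE = 1_000_000_000


def nanocores_to_percent(nanocores: int) -> float:
    """Express a nanocore reading as a percentage of one CPU core."""
    return nanocores / NANOCORES_PER_CORE * 100


percent = nanocores_to_percent(149575)
print(f"{percent:.4f}% of one core")
```

That is about 0.015% of a core, the same order of magnitude as the 0.02% that crictl stats sampled at a slightly different moment.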

Next, we query kubelet’s /metrics/resource endpoint to see the Prometheus exposition format.

curl -sS http://localhost:8001/api/v1/nodes/minikube/proxy/metrics/resource \
  | grep -i "python-66dc9f5c8b-2kktd"

pod_cpu_usage_seconds_total{namespace="default",pod="python-66dc9f5c8b-2kktd"} 1.760035 1777632057760
pod_memory_working_set_bytes{namespace="default",pod="python-66dc9f5c8b-2kktd"} 2.2421504e+07 1777632057760

Again, the working set is in the same range across all three views:

/metrics/resource reports about 22.42 MB,
/stats/summary and crictl stats report about 22.11 MB.

These matching values are consistent with kubelet sourcing pod and container metrics from containerd through the CRI API.

What happens when we check kubelet’s /metrics/cadvisor endpoint?

curl -sS http://localhost:8001/api/v1/nodes/minikube/proxy/metrics/cadvisor
machine_cpu_cores{machine_id="a5b246...",system_uuid="7bd5a1e2-ea5e-452b-a202-536452caf458"} 20
machine_cpu_physical_cores{machine_id="a5b246...",system_uuid="7bd5a1e2-ea5e-452b-a202-536452caf458"} 14
machine_cpu_sockets{machine_id="a5b246...",system_uuid="7bd5a1e2-ea5e-452b-a202-536452caf458"} 1
machine_memory_bytes{machine_id="a5b246...",system_uuid="7bd5a1e2-ea5e-452b-a202-536452caf458"} 3.338305536e+10
machine_swap_bytes{machine_id="a5b246...",system_uuid="7bd5a1e2-ea5e-452b-a202-536452caf458"} 3.4088153088e+10

Huh!

Before enabling the CRI stats path, /metrics/cadvisor exposed detailed container metrics emitted by cAdvisor and labeled by pod, namespace, container, image, and cgroup path.

Now, in this run, the endpoint only shows machine-level cAdvisor metrics such as CPU topology, installed memory, swap capacity, and machine scrape status.

In this run, no pod metrics or container-level data appeared in the /metrics/cadvisor output.

All the pod and container resource usage?

Those pod and container metrics are now sourced from containerd's CRI stats implementation.

Summary

Kubernetes does not directly enforce Linux resource limits; the Linux kernel enforces them through cgroups. Kubelet and the container runtime translate pod resource settings into cgroup configuration, then the kernel applies the actual CPU, memory, pids, and related controls.
cgroup v2 uses a single unified hierarchy where controllers coexist under /sys/fs/cgroup/. cgroup v1 uses separate controller hierarchies, so controllers such as CPU, memory, and pids can be mounted as separate cgroup trees.
cgroup v1 has been officially deprecated since Kubernetes v1.35. As part of KEP-5573, kubelet now fails by default on cgroup v1 nodes unless failCgroupV1 is explicitly set to false, with full code removal planned no earlier than Kubernetes v1.38.
Kubelet and the container runtime must use a compatible cgroup driver. With the systemd driver, kubelet and the runtime place containers under systemd-managed slices; with cgroupfs, they manage cgroup paths directly. For cgroup v2, Kubernetes strongly recommends the systemd cgroup driver.
KubeletCgroupDriverFromCRI graduated to GA in Kubernetes v1.34. At startup, kubelet asks the runtime for the cgroup driver through the CRI RuntimeConfig RPC when the runtime supports it; otherwise kubelet falls back to its configured cgroupDriver.
cAdvisor is embedded inside the kubelet process and starts as part of kubelet. By default, kubelet uses cAdvisor to collect node, pod, container, volume, and filesystem statistics, then exposes that data through kubelet HTTP endpoints. There is no separate cAdvisor sidecar or daemon in the normal kubelet setup.
Kubelet exposes several metrics and stats endpoints. /metrics/cadvisor exposes cAdvisor-style container and machine metrics in Prometheus format. /stats/summary returns structured JSON for node, pod, container, and volume stats. /metrics/resource exposes lightweight CPU and memory resource metrics used by modern Metrics Server versions. /metrics exposes kubelet’s own internal component metrics, such as operation counters and latencies. Metrics Server 0.6.x and later query /metrics/resource, not /stats/summary.
CRI is the gRPC API that standardizes kubelet-to-runtime communication. It lets kubelet manage pods and containers through the runtime, and with compatible runtimes it can also collect pod and container metrics directly from the runtime over the runtime socket.
PodAndContainerStatsFromCRI is an Alpha feature gate and is disabled by default. When enabled with a compatible runtime, kubelet collects pod and container stats through CRI instead of relying on cAdvisor for those pod and container stats.
Even with CRI-based pod and container metrics collection, kubelet still depends on cAdvisor for stats that CRI does not provide, especially node-level, machine-level, volume, and filesystem-related data.
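
To make the cgroup v1 versus v2 distinction from the summary concrete, here is a small heuristic sketch. It assumes that a unified cgroup v2 hierarchy exposes a `cgroup.controllers` file at its root, while a cgroup v1 layout shows per-controller directories such as `cpu`, `memory`, or `pids`. The function name `detect_cgroup_version` is hypothetical, and hybrid layouts are not handled.

```python
from pathlib import Path


def detect_cgroup_version(root: str = "/sys/fs/cgroup") -> str:
    """Guess the cgroup version from the layout under the given mount point."""
    base = Path(root)
    # cgroup v2: one unified hierarchy with cgroup.controllers at its root.
    if (base / "cgroup.controllers").is_file():
        return "v2"
    # cgroup v1: each controller is mounted as its own directory tree.
    if any((base / name).is_dir() for name in ("cpu", "memory", "pids")):
        return "v1"
    return "unknown"


print(detect_cgroup_version())
```

The `root` parameter exists so the check can be pointed at a test directory instead of the real mount point.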

Hacking Alibaba Cloud's Kubernetes Cluster

Gulcan Topcu — Tue, 02 Jul 2024 06:34:16 +0000

Hacking Alibaba Cloud's Kubernetes Cluster with Hillai Ben-Sasson & Ronen Shustin, Security Researchers at Wiz, and Bart Farrell, KubeFM Host

Securing Kubernetes clusters is one of the toughest challenges in cloud security, but for Ronen Shustin and Hillai Ben-Sasson at Wiz, it's just another day at work. These top-tier researchers are fearless in diving into the deep end. Their latest exploit? Cracking Alibaba Cloud's Kubernetes clusters through clever PostgreSQL vulnerabilities.

Join Bart Farrell as he dives into how their innovative approach identifies vulnerabilities and enhances the overall security of cloud ecosystems.

You can watch (or listen to) this interview here.

Bart: What are three emerging Kubernetes or other tools that you're keeping an eye on?

Hillai: Ronen and I have extensive knowledge of Kubernetes, but our expertise comes only from working directly with it. We're hackers who transitioned into Kubernetes hacking, not Kubernetes experts who started hacking. So, we aren't familiar with many Kubernetes tools. Most of the tools we know are those we've encountered and exploited during our engagements. Therefore, we might not be the best sources for the latest Kubernetes tools, but we are excited about ongoing Kubernetes research.

Bart: Are there any specific tools or infrastructure that you particularly like?

Ronen: Instead of specific tools, we're more interested in infrastructure elements like service meshes. From an attacker's perspective, engaging with these is quite fascinating. Currently, we don't have any standout tools to mention.

Bart: For those unfamiliar, can you tell us more about your roles and what you do at Wiz?

Hillai: Ronen and I work at Wiz, a cloud security company, as part of the vulnerability research team. We focus on researching primary cloud services and providers like Azure, GCP, AWS, and more. We utilize their open bug bounty programs to find and report vulnerabilities. By sharing our findings, we aim to enhance the security of the cloud community, not just for our clients but for everyone.

Bart: Is hacking cloud environments your primary focus, or is this a specialized area within security research?

Hillai: It's unique. We didn't start with cloud environments. We began as general security researchers, focusing on hacking techniques. Over time, we transitioned into specializing in cloud security. Our research aims to discover innovative ways attackers might exploit cloud systems, ultimately leading to more secure cloud environments for everyone.

Bart: How has your hacking experience influenced your approach to Kubernetes security? Did you discover any exciting findings during this research?

Hillai: Many cloud providers rely on Kubernetes and container technology to manage their services efficiently. Traditionally, setting up individual virtual or physical machines for each customer would not scale for most companies. Containers offer a more efficient way to manage large infrastructures. Focusing on cloud environments, we discovered Kubernetes as the go-to tool for Alibaba Cloud and companies like IBM. Our journey started with cloud security research and ultimately led us to specialize in Kubernetes security within that domain.

Ronen: Our initial focus was on container security. We researched container escapes and other vulnerabilities that might impact containers. This research naturally led us to Kubernetes, as many infrastructures we encountered used it. We had to learn Kubernetes and develop specific techniques to achieve our goals.

Bart: If you could go back in time and share one career tip with your younger self, what would it be?

Hillai: Always follow your curiosity. Research is all about pursuing leads and hunches. We were curious about cloud security, even though we didn't start in that field. It became popular, and we wanted to explore this new area.

Bart: What resources do you use to stay updated on Kubernetes?

Ronen: I rely on technical documents the most. I also follow blogs from cloud providers, and mainly the CNCF blog, because they have valuable information. I follow the Kubernetes community on Twitter to learn about new features and technologies; they are highly active there.

Hillai: Additionally, I recommend Reddit. Many communities focused on security, Kubernetes, and cloud computing offer great content.

Bart: We came across an article about how you hacked Alibaba Cloud's Kubernetes cluster and a talk you gave at KubeCon. What motivated you to do this research, and did your company support you?

Hillai: Our company supports security research. At Wiz, we focus on cloud security research, often utilizing offensive security methodologies. We act like attackers to find vulnerabilities and then report them to the vendors. By identifying vulnerabilities, we can report them to the cloud providers and prevent actual attacks. Alibaba Cloud is just one example of this engagement.

Ronen: Our research often leads us to discover new hacking techniques we need to learn about. We share these discoveries with everyone so they can protect themselves.

Bart: One of our previous guests talked about Kubernetes secrets management and threat modelling. How do you approach exploiting vulnerabilities from a hacker's perspective?

Ronen: Our best security insights come from working with different applications, frameworks, and cloud systems. When we engage with one, our primary goal is to find critical security mistakes in its setup. To do this, we must fully understand how the system works and where attackers might discover weaknesses.

Hillai: There's an interesting difference between traditional and cloud security research. In traditional research, the goal is often to achieve "Remote Code Execution" (RCE) on a specific application, which means taking control of a machine and running unauthorized code. However, in the cloud, things are different. Since you often have access to a virtual machine yourself, RCE becomes less attractive.

The real challenge in cloud security lies in breaching the barriers between different customers. Unlike traditional environments, the cloud is a shared space with hundreds of thousands of users. Our focus is to demonstrate that attackers could move between these customers, without accessing any customer data ourselves. This highlights a risk unique to the cloud: the potential for attackers to "jump" from one user to another and compromise their information. This type of research, proving a breach of trust without actually stealing data, is a crucial aspect of cloud security and something rarely seen in traditional security research.

Bart: When starting this research, why did you choose Alibaba Cloud?

Ronen: Our initial study focused on PostgreSQL. Since many cloud providers offer managed PostgreSQL instances, we were interested in how they handle the infrastructure. We discovered vulnerabilities that allowed us to execute code on these instances. We tested several providers, including Alibaba, and presented our findings at the Black Hat talk.

Hillai: We began with PostgreSQL and expanded to Alibaba and other cloud providers. Our blog post provides more details about PostgreSQL and our Black Hat talk.

Bart: Why did you choose to focus on PostgreSQL for your research?

Ronen: PostgreSQL is a robust database with many features, including the ability to execute code within the database. While this capability can benefit certain users, it poses a potential security risk in cloud environments.

Cloud providers typically modify PostgreSQL to prevent users from executing code on their managed instances. However, our research identified vulnerabilities in these modifications, not in the core PostgreSQL code itself. We were able to exploit these vulnerabilities to bypass the restrictions and still execute code on the managed databases.

Bart: How does PostgreSQL relate to Kubernetes in this context? Did you find a way to access a Kubernetes cluster by exploiting the PostgreSQL vulnerabilities?

Hillai: Cloud providers often use containers and orchestration tools like Kubernetes to manage large-scale services, including PostgreSQL. This approach allows them to offer these services to many customers efficiently. While exploiting the PostgreSQL vulnerabilities, we discovered that we were actually in a Kubernetes environment. The user interface typically abstracts away the underlying infrastructure from the user, but our research methods disclosed it.

Ronen: We've seen various infrastructures, but Alibaba and IBM used Kubernetes for their managed services. Other providers might use different implementations.

Bart: Security experts often talk about avoiding vulnerabilities caused by misconfigurations, which can be human errors. What were the biggest misconfigurations you found that created security risks?

Hillai: The biggest misconfiguration we found is treating containers as the only security barrier. Containers can be one security layer within a more extensive security system, but they shouldn't be relied on exclusively. Containers alone aren't strong enough to isolate each customer's data from the others, because any security flaw in the core Linux system (the kernel) could bypass container security. We were able to exploit such misconfigurations during our research.

Another problem is poorly managed secrets within the Kubernetes environment. These secrets could read information across the system and write and change it, which meant we could overwrite software packages used by many cloud services and customer accounts within Alibaba. Essentially, these powerful secrets allowed someone to access different environments, services, and customer data—all with a single key. That's a significant security risk we wouldn't recommend taking.

Ronen: The specific secret we found was the image pull secret. In Kubernetes, when you want to download images from a private registry, you need this secret to authenticate to the registry. If you misconfigure it, you might accidentally include a key with push permissions instead of pull-only permissions. This key should only allow downloading images, not uploading them. If an attacker gains access to a key with push permissions (like what we achieved in Alibaba), it could have devastating consequences for your entire environment.

Bart: To those without a strong background in security, it may seem that security experts click a button, scan your system, and find vulnerabilities. However, security research, like many other fields, is a blend of art and science. Can you elaborate on this further?

Hillai: Security research requires a lot of creativity. When you hear about a new attack vector, it boils down to creative thinking - coming up with something no one else has considered. In this research, we started by looking for patterns we already knew were risky, like overly permissive settings and shared volumes. We had to think outside the box. Returning to the Alibaba Cloud control panel, we began experimenting. This exploration led us to a breakthrough when we discovered a button enabling SSL encryption for the PostgreSQL instance. Clicking it triggered new activity in the container, which we followed to escape the container.

Bart: To help our audience understand, could you explain SCP, its role in the attack, and how you exploited it?

Hillai: SCP stands for Secure Copy. It's a standard tool on Linux systems that transfers files between machines using secure SSH connections. In our case, the SSL encryption feature we triggered used a new Alibaba management container. This container ran the SCP command on our container to move the SSL certificate.

By default, SCP reads its SSH configuration from the user's home directory, which in this case was a directory we controlled within our container. We placed a malicious SSH configuration file there. When the SCP command loaded this configuration, it ran a command we had placed in the file. This trick let us escape our limited container and jump to the Alibaba management container, because that container unknowingly executed our command.

Ronen: A crucial factor in this exploit was the shared volume. This volume acted like a shared home directory for our container and the management container since the same user existed in both containers. We could exploit this shared space because SCP reads its configuration from the user's home directory by default. By replacing the default configuration with ours containing a malicious command, we tricked the management container into running it when it used SCP.
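
From the defensive side, one way to reason about this class of problem is to scan any SSH client configuration that a less trusted party can write to. The interview does not say which directive was used, so the sketch below simply flags the `ssh_config` keywords that can make the client run a local command (`ProxyCommand`, `LocalCommand`, `KnownHostsCommand`, and `Match ... exec`). The function name `find_command_directives` and the sample file are illustrative only.

```python
import re

# ssh_config(5) keywords that can cause the SSH client to run a local command.
COMMAND_KEYWORDS = {"proxycommand", "localcommand", "knownhostscommand"}


def find_command_directives(config_text: str) -> list[tuple[int, str]]:
    """Return (line number, line) pairs for directives that can run commands."""
    findings = []
    for lineno, raw in enumerate(config_text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Keyword and arguments are separated by whitespace or a single "=".
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        keyword = parts[0].lower()  # keywords are case-insensitive
        args = parts[1] if len(parts) > 1 else ""
        if keyword in COMMAND_KEYWORDS:
            findings.append((lineno, line))
        elif keyword == "match" and re.search(r"\bexec\b", args, re.IGNORECASE):
            findings.append((lineno, line))
    return findings


# Hypothetical config with harmless commands, for illustration only.
sample_config = """\
Host *
    User demo
    ProxyCommand /bin/sh -c 'echo hypothetical'
Match exec "test -f /tmp/flag"
"""

for lineno, line in find_command_directives(sample_config):
    print(f"line {lineno}: {line}")
```

A scan like this is a heuristic, not a fix. The more robust mitigation, as the speakers describe, is to avoid sharing a home directory between trusted and untrusted containers at all.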

Bart: What does successfully creating a privileged container using the Docker API tell us about cloud security in general?

Ronen: Many cloud environments rely on Docker to manage their containers. You can create a new container through an HTTP request if you gain access to the Docker API socket. This container could be privileged, meaning it shares resources like namespaces and possibly even volumes with the underlying host machine, the Kubernetes node. Spawning a privileged container grants you access to almost everything the node has access to.

Hillai: You transition from being a guest in the container to gaining complete control of the host machine.
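
To check your own workloads for the two conditions mentioned here, a pod spec can be inspected for Docker socket mounts and privileged containers. This is a minimal sketch that assumes the pod spec has already been loaded into a Python dictionary, for example from the JSON output of `kubectl get pod`. The function name `audit_pod_spec`, the socket paths listed, and the sample spec are assumptions for illustration.

```python
# Common host paths for the Docker API socket (an assumption; adjust as needed).
DOCKER_SOCKET_PATHS = {"/var/run/docker.sock", "/run/docker.sock"}


def audit_pod_spec(pod_spec: dict) -> list[str]:
    """Flag Docker socket hostPath mounts and privileged containers in a pod spec."""
    findings = []
    for volume in pod_spec.get("volumes") or []:
        host_path = (volume.get("hostPath") or {}).get("path", "")
        if host_path.rstrip("/") in DOCKER_SOCKET_PATHS:
            findings.append(f"volume {volume.get('name')!r} mounts the Docker socket")
    containers = (pod_spec.get("containers") or []) + (pod_spec.get("initContainers") or [])
    for container in containers:
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            findings.append(f"container {container.get('name')!r} is privileged")
    return findings


# Hypothetical pod spec for illustration.
sample_spec = {
    "volumes": [{"name": "docker-sock", "hostPath": {"path": "/var/run/docker.sock"}}],
    "containers": [{"name": "app", "securityContext": {"privileged": True}}],
}

for finding in audit_pod_spec(sample_spec):
    print(finding)
```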

Bart: Gaining access to the node would only give you control of part of the Kubernetes cluster, wouldn't it?

Ronen: With code execution on the node, we could use the Kubelet credentials to explore further, looking for commands, code, secrets, and other information. In our case, Alibaba had misconfigured its Kubelet credentials: they were too powerful. We could list all pods, see all the code in the cluster, potentially containing customer data, and even retrieve all the secrets using the "kubectl get secret" command. This misconfiguration was the key that unlocked broader access for us.

Bart: Did you achieve the entire exploit on a single node within the cluster?

Ronen: Yes, we were on a single node. Using the compromised Kubelet credentials, we could see all the other nodes and resources in the cluster.

Hillai: While the specific node we compromised was isolated and didn't contain data from other customers, the service account associated with Kubelet had excessive permissions. Even though the node itself was secure, this service account allowed us to access sensitive information across the entire cluster, including pods, nodes, and secrets belonging to other customers.
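
The point about excessive permissions can be illustrated with a small check over RBAC-style rules. This sketch assumes the permissions in question are expressed as Kubernetes RBAC rules (dictionaries with `apiGroups`, `resources`, and `verbs`); how Alibaba's credentials were actually configured is not described in the interview. The function name `rule_reads_secrets` and the sample rules are hypothetical.

```python
READ_VERBS = {"get", "list", "watch", "*"}


def rule_reads_secrets(rule: dict) -> bool:
    """Return True if an RBAC-style rule allows reading Secret objects."""
    api_groups = set(rule.get("apiGroups") or [])
    resources = set(rule.get("resources") or [])
    verbs = set(rule.get("verbs") or [])
    # Secrets live in the core API group, written as "" (or matched by "*").
    in_core_group = bool(api_groups & {"", "*"})
    covers_secrets = bool(resources & {"secrets", "*"})
    return in_core_group and covers_secrets and bool(verbs & READ_VERBS)


# Hypothetical rules for illustration.
sample_rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list"]},
    {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get", "list"]},
]

flagged = [rule for rule in sample_rules if rule_reads_secrets(rule)]
print(f"{len(flagged)} rule(s) allow reading secrets")
```

Rules flagged this way are worth reviewing when they are bound cluster-wide to an identity that lives on every node.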

Bart: What was the next step after taking over Alibaba's managed PostgreSQL offering? Did you contact Alibaba to report your findings?

Hillai: Once we discovered the ability to access data belonging to other customers, our research stopped immediately. We wouldn't risk even accidentally accessing someone else's data. At that point, we documented everything we found and sent a detailed report to Alibaba Cloud, and they responded quickly and professionally. They kept us updated on the fixes they deployed throughout the research process. We immediately report any critical issues to prevent others from exploiting them.

Bart: Can you tell us about any specific fixes they implemented based on your findings?

Ronen: The first issue was a misconfiguration that falsely indicated increased resource consumption. We exploited it to execute unauthorized code on the operating system. We collaborated with Alibaba Cloud to fix this problem. They also resolved the SCP vulnerability problem that allowed unauthorized access to their management container. Finally, they restricted the Kubelet permissions to a narrower scope, granting only specific permissions.

Hillai: Following our research, Alibaba took several steps to address the vulnerabilities we discovered. They limited image pull secret permissions to read-only access, preventing unauthorized uploads. Additionally, they implemented a secure container technology similar to Google's gVisor project. This technology hardens containers and makes them more difficult to escape from, adding another layer of security.

Bart: Throughout this process, what key lessons did you learn?

Hillai: There are two main lessons learned. First, containers shouldn't be relied on as the sole security barrier. While they can be a layer of security, they can be bypassed in various ways. Additional precautions are crucial to ensure proper isolation between customers. We recommend building a layered defense so that a single vulnerability doesn't allow unauthorized access to a competitor company's data.

Second, strong credentials require careful management. As Ronen mentioned, Alibaba originally had a powerful secret that could be read and written across the cluster. This secret also had push access to the central Docker image registry. Following our report, they limited the scope of these credentials. It's essential to be very cautious with such powerful secrets. Ideally, you should scope the secrets to specific actions and minimize them whenever possible. A powerful secret can allow attackers to move across different environments, including production, development, testing, and even development workstations.

Another lesson learned relates to the container itself. The SCP vulnerability we exploited highlights the risk of shared namespaces between containers. In the Alibaba incident, the shared namespace and home directory allowed us to exploit the SCP vulnerability. Always be very careful when sharing namespaces between trusted and untrusted containers. The lesson learned is to minimize what you share and never grant unnecessary permissions. Attackers may exploit even seemingly minor misconfigurations.

Bart: Can you recommend any specific tools that people might not be aware of if they want to discuss implementing some of these mitigation tactics with their managers?

Hillai: There's one framework I highly recommend: Peach. It's an open-source project developed by our research team with contributions from fantastic people at many companies.

Peach is a framework that outlines how to build secure and isolated environments, whether in the cloud or not. Like a white paper, it's a valuable resource that guides you on properly isolating tenants or customers in a multi-tenant environment. It covers common mistakes to avoid, what to look out for, and how to implement the necessary precautions.

If you manage a multi-tenant environment or need to isolate resources within your environment, Peach is worth exploring. It's completely open-source and available on GitHub. We also welcome contributions from anyone with additional tips or tricks we might not know about.

Ronen: I also recommend using secret scanning tools. These tools are essential in our research; we use them to identify potential secrets-related vulnerabilities.

Bart: Do you have any recommendations for securing multi-tenant Kubernetes clusters?

Ronen: Securing multi-tenant Kubernetes clusters involves a few key areas. First, prioritize network security. By default, Kubernetes doesn't restrict network communication between pods, so strong network isolation is essential.

Second, separating namespaces between customers is a good practice when dealing with multi-tenancy.

Additionally, consider implementing container security technologies like gVisor or Kata Containers. Don't solely rely on Docker's security features to prevent container escapes.
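
As one possible starting point for the network isolation mentioned above, a namespace can be given a default-deny NetworkPolicy. The sketch below builds such a manifest as a Python dictionary and prints it as JSON, which `kubectl apply -f` accepts. The namespace `tenant-a`, the policy name, and the function name `default_deny_policy` are placeholders, and the policy only takes effect if the cluster's network plugin enforces NetworkPolicy.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """Build a NetworkPolicy that denies all ingress and egress in a namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {
            # An empty podSelector selects every pod in the namespace.
            "podSelector": {},
            # Listing both types with no allow rules denies all traffic.
            "policyTypes": ["Ingress", "Egress"],
        },
    }


print(json.dumps(default_deny_policy("tenant-a"), indent=2))
```

From there, narrower policies can allow only the traffic each tenant actually needs.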

Bart: What advice would you give for hardening containers to make them more secure?

Ronen: Our case study with Alibaba revealed they were using shared Linux namespaces between containers, such as their management container and our container. Sharing Linux namespaces can be dangerous. When designing a system that shares namespaces or resources between management and regular user containers, always carefully assess the risks involved. Container technologies like gVisor and Kata Containers can mitigate the risk of attackers exploiting Linux kernel vulnerabilities in your environment to achieve kernel-level code execution and jump to the Kubernetes node.

Bart: What advice would you give to Kubernetes engineers who lack security experience?

Hillai: Security is crucial. Companies of all sizes, from startups to large corporations, are constantly targeted by malicious actors, not just ethical hackers like us. Anyone managing a service on the internet must understand that they are a potential target for cyberattacks. These attacks range from data breaches to ransomware attacks that turn off your entire operation. Even small projects need to pay more attention to security.

The good news is that many tools can help you achieve security without being a security expert. Tools like gVisor are relatively easy to implement because you don't need to write them from scratch. By using security hardening tools, you gain significant protection benefits.

Ronen: Besides the tools, many online resources are available to learn about security. These resources can help you understand security risks and how to mitigate them. Kubernetes itself has built-in security features, including default security policies. Be security-conscious and take steps to secure your environment.

Bart: You discover a vulnerability and report it to the vendor. What prevents you from exploiting the vulnerability for malicious purposes instead? Wouldn't Alibaba eventually find the problem on its own?

Ronen: We started seeing signs that Alibaba was taking steps to address the issue while we were still in the research phase. They were transparent with us about their efforts. Cloud providers all have security teams that constantly monitor their environments. They likely knew we were there.

Hillai: Cloud providers are doing a great job with security. We're ethical hackers; our goal is to improve security for the cloud community. Penetration testing, or offensive research, is a tool to achieve that goal. We want to fix the vulnerabilities, and it's rewarding to hear that our reports lead to security updates that benefit many customers. We do this to make cloud products more secure and help users learn how to secure their deployments.

We publish blogs and give talks so that security professionals and developers can learn from our research and identify potential problems in their environments.

Bart: What's next on the agenda for you both?

Hillai: We're always working on new research projects. Sagi from our team recently published a blog about a vulnerability in Hugging Face, an AI provider. We have several ongoing projects under disclosure, meaning we can only reveal them once the vulnerabilities are fixed.

Follow our blog; it's the first place we announce new findings.

Ronen: Our research will benefit the Kubernetes security community as well.

Bart: How can people contact you if they have questions?

Hillai: We're both on Twitter. My handle is @hillai, and Ronen's is @RonenSHH. You can also email us at research@wiz.io, but Twitter is the best way. Make sure to spell the names correctly.

Wrap up

If you enjoyed this interview and want more Kubernetes stories and opinions, visit KubeFM and subscribe to the podcast.

If you want to keep up-to-date with Kubernetes, subscribe to Learn Kubernetes Weekly.
If you want to become an expert in Kubernetes, look at courses on Learnk8s.
If you want to keep in touch, follow me on LinkedIn.

eBPF, sidecars, and the future of the service mesh

Gulcan Topcu — Fri, 07 Jun 2024 07:58:53 +0000

Kubernetes and service meshes may seem complex, but not for William Morgan, an engineer-turned-CEO who excels at simplifying the intricacies. In this enlightening podcast, he shares his journey from AI to the cloud-native world with Bart Farrell.

Discover William's cost-saving strategies for service meshes, gain insights into the ongoing debate between sidecars, Ambient Mesh, and Cilium Cluster Mesh, and learn about his surprising connection to Twitter's early days and his unique perspective on balancing tech expertise with the humility of being a piano student.

You can watch (or listen to) this interview here.

Bart: Imagine you've just set up a fresh Kubernetes cluster. What's your go-to trio for the first tools to install?

William: My first pick would be Linkerd. It's a must-have for any Kubernetes cluster. I then lean towards tools that complement Linkerd, like Argo and cert-manager. You're off to a solid start with these three.

Bart: cert-manager and Argo are popular choices, especially in the GitOps domain. What about Flux?

William: Flux would work just fine. I don't have a strong preference between the two. Flux and Argo are great options, especially for tasks like progressive delivery. When paired with Linkerd, they provide a robust safety net for rolling out new code.

Bart: As the CEO, who are you accountable to? Could you elaborate on your role and responsibilities?

William: Being a CEO is an exciting shift from my previous role as an engineer. I work for myself, and I must say, I’m a demanding boss. As a CEO, I focus on the big picture and align everyone toward a common goal. These are the two skills I’ve had to develop rapidly since transitioning from an engineer, where my primary concern was writing and maintaining code.

Bart: From a technical perspective, how did you transition into the cloud-native space? What were you doing before it became mainstream?

William: My early career was primarily focused on AI, NLP, and machine learning long before they became trendy. I thought I’d enter academia but realized I enjoyed coding more than research.

I worked at several Bay Area startups, mainly in NLP and machine learning roles. I was part of a company called Powerset, which was building a natural language processing engine and was acquired by Microsoft. I then joined Twitter in its early days, around 2010, when it had about 200 employees. I started on the AI side but transitioned to infrastructure because I found it more satisfying and challenging. At Twitter, we were doing what I would now describe as cloud-native, even though the terminology differed. We didn’t have Kubernetes or Docker, but we had Mesos, the JVM for isolation, and cgroups for a basic form of containerization. We transitioned from a monolithic Ruby on Rails service to a massive microservices deployment. When I left Twitter, we tried to apply those same ideas to the emerging world of Kubernetes and Docker.

Bart: How do you keep up with the rapid changes in the Kubernetes and cloud-native ecosystems, especially transitioning from infrastructure and AI/NLP?

William: My current role primarily shapes my strategy. I learn a lot from the engineers and users of Linkerd, who are at the forefront of these technologies. I also keep myself updated by reading discussions in Reddit communities like r/kubernetes and r/Linkerd. Occasionally, I contribute to or follow discussions on Hacker News. Overall, my primary source of knowledge comes from the experts I work with daily, giving me valuable insights into the latest developments.

Bart: If you could return to your time at Twitter or even before that, what one tip would you give yourself?

William: I'd tell myself to prioritize impact. As an engineer, I was obsessed with building and exploring new technologies, which was rewarding. However, I later understood the value of stepping back to see where I could make a real difference in the company. Transitioning my focus to high-impact areas, such as infrastructure at Twitter, was a turning point. Despite my passion for NLP, I realized that infrastructure was where I could truly shine. Always look for opportunities where you can make the most significant impact.

Bart: Let’s focus on "Sidecarless eBPF Service Mesh Sparks Debate," which follows up on your previous article “eBPF, sidecars, and the future of the service mesh.” You're one of the creators of Linkerd. For those unfamiliar, what exactly is a service mesh? Why would someone need it, and what value does it add?

William: There are two ways to describe service mesh: what it does and how it works. Service mesh is an additional layer for Kubernetes that enhances key areas Kubernetes doesn't fully address.

The first area is security. It ensures all connections in your cluster are encrypted, authorized, and authenticated. You can set policies based on services, gRPC methods, or HTTP routes, like allowing Service A to talk to /foo but not /bar.

The second area is reliability. It enables graceful failovers, transparent traffic shifting between clusters, and progressive delivery. For example, you can deploy new code and gradually shift traffic to it, rather than exposing it to all production traffic at once. It also includes mechanisms like load balancing, circuit breaking, retries, and timeouts.

The last area is observability. It provides uniform metrics for all workloads across all services, such as success rates, latency distribution, and traffic volume. Importantly, it does this without requiring changes to your application code.

As for how it works, the most prevalent method today involves using many proxies. This approach has become feasible thanks to technological advancements like Kubernetes and containers, which simplify the deployment and management of many proxies as a unified fleet. A decade ago, deploying 10,000 proxies would have been absurd, but it is feasible and practical today. The specifics of deploying these proxies, their locations, programming languages, and practices are subject to debate. However, at a high level, service meshes work by running these layer seven proxies that understand HTTP, HTTP/2, and gRPC traffic and enable various functionalities without requiring changes to your application code.
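
To picture the kind of route-level policy William describes (Service A may call /foo but not /bar), here is a toy sketch of the decision a layer seven proxy could make. It is not Linkerd's API or configuration format. The `RouteRule` type, the `is_allowed` function, and the service names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteRule:
    """Allow one client identity to call paths under one prefix."""
    client: str
    path_prefix: str


# Hypothetical policy: service-a may call /foo (and sub-paths), nothing else.
ALLOWED_ROUTES = [RouteRule(client="service-a", path_prefix="/foo")]


def is_allowed(client: str, path: str, rules: list[RouteRule] = ALLOWED_ROUTES) -> bool:
    """Return True if any rule permits this client to call this path."""
    for rule in rules:
        prefix = rule.path_prefix.rstrip("/")
        # Match the prefix exactly or as a whole path segment, so /foo does
        # not accidentally authorize /foobar.
        matches_path = path == prefix or path.startswith(prefix + "/")
        if client == rule.client and matches_path:
            return True
    return False  # default deny


print(is_allowed("service-a", "/foo"))
print(is_allowed("service-a", "/bar"))
```

Because the proxy sits in the request path and can verify the caller's identity, a check like this can be enforced without touching application code, which is the property the answer emphasizes.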

Bart: Can you briefly explain how the data and control planes work in service meshes, especially compared to the older sidecar model with an extra container?

William: A service mesh architecture consists of two main components: a control plane and a data plane. The control plane allows you to manage and configure the data plane, which directs network traffic within the service mesh. In Kubernetes, the control plane operates as a collection of standard Kubernetes services, typically running within a dedicated namespace or across the entire cluster.

The data plane is the operational core of a service mesh, where proxies manage network traffic. The sidecar model, employed by service meshes like Linkerd, deploys a dedicated proxy alongside each application pod. Therefore, a service mesh with 20 pods would have 20 corresponding proxies. The overall efficiency and scalability of the service mesh rely heavily on the size and performance of these individual proxies.

In the sidecar model, communication between service A and service B flows through both services' proxies. Service A sends its message to its own sidecar proxy, which forwards it to service B's sidecar proxy. Finally, service B's proxy delivers the message to service B itself. This indirect communication path adds extra hops, leading to a slight increase in latency. You must carefully consider the potential performance impacts to ensure that service mesh benefits outweigh the trade-offs.
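
The extra hops can be written down as simple arithmetic. The sketch below is a rough model under stated assumptions: the `Hop` type, the hop names, and every latency figure are made up for illustration and are not measurements of Linkerd or any other mesh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hop:
    name: str
    latency_ms: float  # assumed figure, for illustration only


def path_latency_ms(hops: list[Hop]) -> float:
    """One-way latency of a path is the sum of the latency of each hop."""
    return sum(hop.latency_ms for hop in hops)


# Without a mesh: service A talks straight to service B over the network.
direct = [Hop("network: A -> B", 1.0)]

# Sidecar model: A -> A's proxy -> (network) -> B's proxy -> B.
meshed = [
    Hop("service A -> proxy A (same pod)", 0.25),
    Hop("network: proxy A -> proxy B", 1.0),
    Hop("proxy B -> service B (same pod)", 0.25),
]

if __name__ == "__main__":
    overhead = path_latency_ms(meshed) - path_latency_ms(direct)
    print(f"extra hops: {len(meshed) - len(direct)}")
    print(f"added latency per request: {overhead:.2f} ms")
```

Plugging in your own measured per-hop numbers is one way to check whether the added latency is acceptable for a given service.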

Bart: We've been discussing the benefits of service meshes, but running an extra container for each pod sounds expensive. Does cost become a significant issue?

William: Service meshes have a compute cost, just like adding any component to a system. You pay for CPU and memory, but memory tends to be the more significant concern, as it can force you to scale up instances or nodes.

However, Linkerd has minimized this issue with a "micro proxy" written in Rust. Rust's strict memory management allows fast, lightweight proxies and avoids memory vulnerabilities like buffer overflows, which are common in C and C++. Studies from both Google and Microsoft have shown that roughly 70% of security bugs in C and C++ code are due to memory management errors.

Our choice of Rust as the programming language in 2018 was a calculated risk. Rust offers the best of both worlds: the speed and control of languages like C/C++ and the safety and ease of use of languages with runtime environments like Go. Rust and its network library ecosystem were still relatively young at that time. We invested significantly in underlying libraries like Tokio, Tower, and H2 to build the necessary infrastructure.

The critical role of the data plane in handling sensitive application data drove this decision. We ensured its reliability and security. Rust enables us to build small, fast, and secure proxies that scale with traffic, typically using minimal memory, directly translating to the user experience. Instead of facing long response times (like 5-second tail latencies), users experience faster interactions (closer to 30 milliseconds). A service mesh can optimize these tail latencies, improving user experience and customer retention. Choosing Rust has proven to be instrumental in achieving these goals.

While cost is a factor, the actual cost often stems from operational complexity. Do you need dedicated engineers to maintain complex proxies, or does the system largely run itself? That human cost usually dwarfs the computational one.

Our design choices have made managing Linkerd’s costs relatively straightforward. However, for other service meshes, costs can escalate if the proxies are large and resource-intensive. Even so, the more significant cost is often not the resources but the operational overhead and complexity. This complexity can demand considerable time and expertise, increasing the overall cost.

Bart: You raise a crucial point about the human aspect. While we address technical challenges, the time spent resolving errors detracts from other tasks. The community has developed products and projects to tackle these concerns and costs. One such example is Istio with Ambient Mesh. Another approach is sidecarless service meshes like Cilium Cluster Mesh. Can you explain what Ambient Mesh is and how it enhances the classic sidecar model of service meshes?

William: We've delved deep into both of these options in Linkerd. While there might come a time when adopting these projects makes sense for us, we're not there yet.

Every decision in distributed systems involves trade-offs, especially in production environments within companies where the platform is a tool to support applications. At Linkerd, our priority is constantly reducing the operational workload.

Ambient Mesh and eBPF aren't primarily reactions to complexity but responses to the practical annoyances of sidecars. Their key selling point is eliminating the need for sidecars. However, the real question is: What's the cost of this shift? That's where the analysis becomes crucial.

In Ambient Mesh, rather than having sidecar containers, you utilize connective components, such as tunnels, within the namespace. These tunnels communicate with proxies located elsewhere in the cluster. So essentially, you have multiple proxies running outside of the pod, and the pods use these tunnels to communicate with the proxies, which then handle specific tasks.

This setup is indeed intriguing. As mentioned earlier, running sidecars can be challenging due to specific implications. One such implication is the cost factor, which we discussed earlier. In Linkerd’s case, this is a minor concern. However, a more significant implication is the need to restart the pod to upgrade the proxy to the latest version, given the immutability of pods in Kubernetes.

This situation necessitates managing two separate updates: one to keep the applications up-to-date and another to upgrade the service mesh. Therefore, while the setup has advantages, it also requires careful management to ensure smooth operation and optimal performance.

We operate the proxy as the first container for various reasons, which can lead to friction points, such as when using kubectl logs. Typically, when you request logs, you're interested in your application's logs, not the proxy's. This friction, combined with a desire for networking to operate seamlessly in the background, drives the development of solutions like Ambient and eBPF, which aim to eliminate the need for explicit sidecars.

Both Ambient and eBPF solutions, which are closely related, are reactions to this sentiment of not wanting to deal with sidecars directly. The aim is to make sidecars disappear. Take Istio and most service meshes built on Envoy, for instance. Envoy is complex and memory-intensive and requires constant attention and tuning based on traffic specifics.

The talk about challenges with sidecars is more of a cloud-native marketing trend, like writing a blog post proclaiming the death of sidecars, than something specific to Linkerd. Such claims can be an inaccurate reflection of engineering reality.

In Ambient, eliminating sidecars by running the proxy elsewhere and using tunnel components allows for separate proxy maintenance without needing to reboot applications for upgrades. However, in a Kubernetes environment, the idea is that pods should be rebootable anytime. Kubernetes can reschedule pods as needed, which aligns with the principles of building applications as distributed systems. Yet, there are legacy applications or specific scenarios where rebooting is inconvenient, making the sidecar approach less appealing.

Historically, running cron jobs with sidecar proxies in Kubernetes posed a significant challenge. Kubernetes lacked a built-in mechanism to signal the sidecar proxy when the main job was complete, necessitating manual intervention to prevent the proxy from running indefinitely. This manual process went against the core principle of service mesh, which aims to decouple services from their proxies for easier management and scalability.

Thankfully, one significant development is the Sidecar Container Kubernetes Enhancement Proposal. With this enhancement, you can designate your proxy as a sidecar container, leading to several benefits, like jobs terminating the proxy once finished and eliminating unnecessary resource consumption.

For Linkerd, adopting Ambient mesh architecture introduces more complexity than benefits. The additional components, like the tunnel and separate proxies, add unnecessary layers to the system. Unlike Istio, which has encountered issues due to its architecture, Linkerd's existing design hasn't faced similar challenges. Therefore, the trade-offs associated with Ambient aren't justified for Linkerd.

In contrast, the sidecar model offers distinct advantages. It creates clear operational and security boundaries at the pod level. Each pod becomes a self-contained unit, making independent decisions regarding security and operations, aligning with Kubernetes principles, and simplifying management in a cloud-native environment.

This sidecar approach is crucial for implementing zero-trust security. The critical principle of zero trust is to enforce security policies at the most granular level possible. Traditional approaches relying on a perimeter firewall and implicitly trusting internal components are no longer sufficient. Instead, each security decision must be made independently at every system layer. This granular enforcement is achieved by deploying a sidecar proxy within each application pod, acting as a security boundary and enabling fine-grained control over network traffic, authentication, and authorization.

In Linkerd, every request undergoes a rigorous security check within the pod. This check includes verifying the validity of the TLS encryption, confirming the client's identity through cryptographic algorithms, and ensuring the request comes from a trusted source. Additionally, Linkerd checks whether the request can access the specific resource or method it's trying to reach. This multi-layered scrutiny happens directly inside the pod, providing the highest possible level of security within the Kubernetes framework. Maintaining this tight security model is crucial, as any deviation, like separating the proxy and TLS certificate, weakens the model and introduces potential vulnerabilities.
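
The per-request checks described above can be sketched as a deny-by-default function. This is a simplified illustration and not Linkerd's API or policy model: the `Request` type, the `POLICY` table, the identity strings, and the `authorize` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    tls_valid: bool        # did the mutual TLS handshake verify?
    client_identity: str   # identity taken from the client certificate
    method: str
    path: str


# Hypothetical policy: which identities may call which (method, path) pairs.
POLICY = {
    ("GET", "/orders"): {"web.prod"},
    ("POST", "/orders"): {"checkout.prod"},
}


def authorize(req: Request) -> tuple[bool, str]:
    """Run every check on every request and deny unless all of them pass."""
    if not req.tls_valid:
        return False, "invalid TLS"
    allowed = POLICY.get((req.method, req.path))
    if allowed is None:
        return False, "no policy for route"
    if req.client_identity not in allowed:
        return False, "identity not authorized"
    return True, "ok"


if __name__ == "__main__":
    print(authorize(Request(True, "web.prod", "GET", "/orders")))
    print(authorize(Request(True, "web.prod", "POST", "/orders")))
```

The design point is that nothing is trusted because of where it sits on the network; each request has to prove its identity and its right to reach that specific route.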

Bart: The next point I'd like to discuss has garnered significant attention in recent years through Cilium Service Mesh and various domains. What is eBPF?

William: eBPF is a kernel technology that enables the execution of specific code within the kernel, offering significant advantages. Firstly, operations within the kernel are high-speed, eliminating the overhead of context switching between kernel and user space. Secondly, the kernel has unrestricted access to all system resources, requiring robust security measures to ensure eBPF programs are safe. This powerful technology empowers developers to create highly efficient and secure solutions for various system tasks, particularly networking, security, and observability.

Traditionally, user-space programs lacked direct access to kernel resources, relying on system calls to communicate with the kernel. While providing security, this syscall boundary introduced cost overhead, especially with frequent requests like network packet processing.

eBPF revolutionized this by enabling user-defined code to run within the kernel with stringent safety measures. The number of instructions an eBPF program can execute is limited, and infinite loops are prohibited to prevent resource monopolization. The bytecode verifier meticulously ensures every possible execution path can be explored to avoid unexpected behavior or malicious activity. The bytecode is also verified for GPL compliance by checking for specific strings in its initial bytes.

These security measures make eBPF a powerful but restrictive mechanism, enabling previously unattainable capabilities. Understanding what eBPF can and cannot do is crucial, despite marketing claims that might blur these lines. While many promote eBPF as a groundbreaking solution that could eliminate the need for sidecars, the reality is more nuanced. It's crucial to understand its limitations and not be swayed by marketing hype.

Bart: There appears to be some confusion regarding the extent of limitations associated with eBPF. If eBPF has limitations, does that imply that these limitations constrain all service meshes using eBPF?

William: The idea of an eBPF-based service mesh can be confusing. In reality, the Envoy proxy still handles the heavy lifting, even in these eBPF-powered meshes. eBPF has limitations, especially in the network space, and can't fully replace the functionality of a traditional proxy.

While eBPF has many applications, including security and performance monitoring, its most interesting potential lies in instrumenting applications. Because eBPF programs run inside the kernel, they can directly measure CPU usage, function calls, and other performance metrics.

However, when it comes to networking, eBPF faces significant challenges. Maintaining large amounts of state, essential for many network operations, is difficult, bordering on impossible. This challenge highlights the limitations of eBPF in entirely replacing traditional networking components like proxies.

The role of eBPF in networking, particularly within service meshes, is often overstated. While it excels in certain areas, like efficient TCP packet processing and simple metrics collection, it cannot replace traditional proxies. Complex tasks like HTTP/2 parsing, TLS handshakes, or layer seven routing are challenging, if not impossible, to implement purely with eBPF.

Some projects attempt complex eBPF implementations for these tasks but often involve convoluted workarounds that sacrifice performance and practicality. In practice, eBPF is typically used for layer 4 (transport layer) tasks, while user-space proxies like Envoy handle more complex layer 7 (application layer) operations.

Service meshes like Cilium, despite their claims of being sidecar-less, often rely on daemonset proxies to handle these complex tasks. While eliminating sidecars, this approach introduces its own set of problems. Security is compromised as TLS certificates are mixed in the proxy's memory, and operational challenges arise when the daemonset goes down, affecting seemingly random pods scheduled on that machine.

Linkerd, having experienced similar issues with its first version (Linkerd1.x) running as a daemonset, opted for the sidecar model in subsequent versions. Sidecars provide clear operational and security boundaries, making management and troubleshooting easier.

Looking ahead, eBPF can still be a valuable tool for service meshes. Linkerd, for instance, could significantly speed up raw TCP proxying by offloading tasks to the kernel. However, for complex layer seven operations, a user-space proxy remains essential.

The decision to use eBPF and the choice between sidecars and daemonsets are distinct considerations, each with advantages and drawbacks. While eBPF offers powerful capabilities, it doesn't inherently dictate a specific proxy architecture. Choosing the most suitable approach requires careful evaluation of the system's requirements and trade-offs.

Bart: Can you share your predictions about conflict or uncertainty concerning service meshes and sidecars for the next few years? Is there a possibility of resolving this? Should we anticipate the emergence of new groups? What are your expectations for the near and distant future?

William: While innovation in this field is valuable, relying on marketing over technical analysis holds little appeal, especially for those prioritizing tangible customer benefits.

Regarding the future of service meshes, their value proposition is now well-established. The initial hype has given way to a practical understanding of their necessity, with users selecting and implementing solutions without extensive deliberation. This maturity is a positive development, shifting the focus from explaining the need for a service mesh to optimizing its usage.

Functionally, service meshes converge on core features like mTLS, load balancing, and circuit breaking. However, a significant area of development and our primary focus is mesh expansion, which involves integrating non-Kubernetes components into the mesh. We have a big announcement regarding this in mid-February.

Bart: That sounds intriguing. Please give us a sneak peek into what this announcement is about.

William: It is about Linkerd 2.15! The release of Linkerd 2.15 is a significant step forward. It introduces the ability to run the data plane outside Kubernetes, enabling seamless TLS communication for both VM and pod workloads.

The industry mirrors this direction, as evidenced by developments like the Gateway API, which converges to handle both ingress and service mesh configuration within Kubernetes. This unified approach allows consistent configuration primitives for traffic entering, transiting, and exiting the cluster.

The industry will likely focus on refining details like eBPF integration or the advantages of Ambient Mesh while the fundamental value proposition of service meshes remains consistent. I'm particularly excited about how these advancements can be applied across entire organizations, starting with securing and optimizing Kubernetes environments and extending these benefits to the broader infrastructure.

Bart: Shifting away from the professional side, we heard you have an interesting tattoo. Is it of Linkerd, or what is it about?

William: It’s just a temporary one. We handed them out at KubeCon last year as part of our swag. While everyone else gave out stickers, we thought we'd do something more extraordinary. So, we made temporary tattoos of Linky the Lobster, our Linkerd mascot.

When Linkerd graduated within the CNCF, reaching the top tier of project maturity, we needed a mascot. Most mascots are cute and cuddly, like the Go Gopher. We wanted something different, so we chose a blue lobster—an unusual and rare creature reflecting Linkerd's unique position in the CNCF universe.

The tattoo featured Linky the Lobster crushing some sailboats, which is part of our logo. It was a fun little easter egg. If you were at KubeCon, you might have seen them. That event was in Amsterdam.

Bart: What's next for you? Are there any side projects or new ventures you're excited about?

William: I'm devoting all my energy to Linkerd and Buoyant. That takes up most of my focus. Outside of work, I'm a dad. My kids are learning the piano, so I decided to start learning, too. It's humbling to see how fast they pick it up compared to me. As an adult learner, it's a slow process. It's interesting to be in a role where I'm the student, taking lessons from a teacher who's probably a third my age and incredibly talented. It’s an excellent reminder to stay humble, especially since much of my day involves being the authority on something. It’s a nice change of pace and a bit of a reality check.

Bart: That's a good balance. It's important to remind people that doing something you could be better at is okay. As a kid, you're used to it—no expectations, no judgment.

William: Exactly. However, it can be a struggle as an adult, especially as a CEO. I've taught Linkerd to hundreds of people without any panic, but playing a piano recital in front of 20 people is terrifying. It's the complete opposite.

Bart: If people want to contact you, what's the best way?

William: You can email me at william@buoyant.io, find me on Linkerd Slack at slack.linkerd.io, or DM me at @wm on Twitter. I'd love to hear about your challenges and how I can help.

Wrap up

If you enjoyed this interview and want to hear more Kubernetes stories and opinions, visit KubeFM and subscribe to the podcast.
If you want to keep up-to-date with Kubernetes, subscribe to Learn Kubernetes Weekly.
If you want to become an expert in Kubernetes, look at courses on Learnk8s.
Finally, if you want to keep in touch, follow me on Linkedin.

Clusters Are Cattle Until You Deploy Ingress

Gulcan Topcu — Thu, 30 May 2024 14:07:24 +0000

Managing repeatable infrastructure is the bedrock of efficient Kubernetes operations. While the ideal is to have easily replaceable clusters, reality often dictates a more nuanced approach. Dan Garfield, Co-founder of Codefresh, succinctly captures this with the analogy: "A Kubernetes cluster is treated as disposable until you deploy ingress, and then it becomes a pet."

Dan Garfield joined Bart Farrell to explain how he manages Kubernetes clusters as they turn from "cattle" into "pets", weaving in fascinating anecdotes about fairy tales, crypto, and snowboarding.

You can watch (or listen) to this interview here.

Bart: What are your top three must-have tools starting with a fresh Kubernetes cluster?

Dan: Argo CD is the first tool I install. For AWS, I will add Karpenter to manage costs. I will also use Longhorn for on-prem storage solutions, though I'd need ingress. Depending on the situation, I will install Argo CD first and then one of those other two.

Bart: Many of our recent podcast guests have highlighted Argo or Flux, emphasizing their significance in the GitOps domain. Why do you think these tools are considered indispensable?

Dan: The entire deployment workflow for Kubernetes revolves around Argo CD. When I set up a cluster, some might default to using kubectl apply, or if they're using Terraform, they might opt for the Helm provider to install various Helm charts. However, with Argo CD, I have precise control over deployment processes.

Typically, the bootstrap pattern involves using Terraform to set up the cluster and Helm provider to install Argo CD and predefined repositories. From there, Argo CD takes care of the rest.

For those who can't see it, my Kubernetes cluster is on the screen behind me, running Argo CD. I use Argo CD Autopilot, which streamlines repository setup. Last year, when my system was compromised, Argo CD Autopilot swiftly restored everything. It's incredibly convenient. Moreover, when debugging, the ability to quickly toggle sync, reset applications, and access logs through the UI is invaluable. Argo CD is, without a doubt, my go-to tool for Kubernetes. Admittedly, I'm biased as an Argo maintainer, but it's hard to argue with its effectiveness.

Bart: Our numerous podcast discussions with seasoned professionals show that GitOps has been a recurring theme in about 90% of our conversations. Almost every guest we've interviewed has emphasized its importance, often mentioning it as their primary tool alongside other essentials like cert manager, Kyverno, or OPA, depending on their preferences.

Could you introduce yourself to those unfamiliar with you? Tell us your background, work, and where you're currently employed.

Dan: I'm Dan Garfield, the co-founder and chief open-source officer at Codefresh. As Argo maintainers, we're deeply involved in shaping the GitOps landscape. I've played a key role in creating the GitOps standard, establishing the GitOps working group, and spearheading the OpenGitOps project.

Our journey began seven years ago when we launched Codefresh to enhance software delivery in the cloud-native ecosystem, primarily focusing on Kubernetes. Alongside my responsibilities at Codefresh, I actively contribute to SIG security within the Kubernetes community and oversee community-driven events like ArgoCon. Outside of work, I reside in Salt Lake City, where I indulge in my passion for snowboarding. Oh, and I'm a proud father of four, eagerly awaiting the arrival of our fifth child.

Bart: It’s a fantastic journey. We'll have to catch up during KubeCon in Salt Lake City later this year. Before delving into your entrepreneurial venture, could you share how you entered Cloud Native?

Dan: My journey into the tech world began early on as a programmer. However, I found myself gravitating more towards the business side, where I discovered my knack for marketing. My pivotal experience was leading enterprise marketing at Atlassian during the release of Data Center, Atlassian's clustered tool version. Initially, it didn't garner much attention internally, but it soon became a game-changer, driving significant revenue for the company. Witnessing this transformation, including Atlassian's public offering, was exhilarating, although my direct contribution was modest as I spent less than two years there.

I noticed a significant shift toward containerization, which sparked my interest in taking on a new challenge. Conversations with friends starting container-focused ventures captivated me. Then, Raziel, the founder of Codefresh, reached out, sharing his vision for container-driven software development. His perspective resonated deeply, prompting me to join the venture.

Codefresh initially prioritized building robust CI tools, recognizing that effective CD hinges on solid CI practices, which needed improvement in many organizations at the time (and possibly still do). As we expanded, we delved into CD and explored ways to leverage Kubernetes insights.

Kubernetes had yet to emerge as the dominant force when we started this journey. We evaluated competitors like Rancher, OpenShift, Mesosphere, and Docker Swarm. However, after thorough analysis, Kubernetes emerged as the frontrunner, prompting us to make a bold bet on its potential.

Our decision proved visionary as other platforms gradually transitioned towards Kubernetes. Amazon's launch of EKS validated our foresight. This strategic alignment with Kubernetes paved the way for our deep dive into GitOps and Argo CD, driving the project's growth within the CNCF and its eventual graduation.

Bart: It's impressive how much you've accomplished in such a short timeframe, especially while balancing family life. With the industry evolving rapidly, how do you keep up with the cloud-native scene as a maintainer and a co-founder?

Dan: Indeed, staying updated involves reading blogs, scrolling through Twitter, and tuning into podcasts. However, I've found that my most insightful learnings come from direct conversations with individuals. For instance, I've assisted the community with Argo implementations, not as a sales pitch but to help gather insights genuinely. Interacting with Codefresh users and engaging with the broader community provides invaluable perspectives on adoption challenges and user needs.

Oddly enough, sometimes, the best way to learn is by putting forth incorrect opinions or questions. Recently, while wrestling with AI project complexities, I pondered aloud whether all Docker images with AI models would inevitably be bulky due to PyTorch dependencies. To my surprise, this sparked many helpful responses, offering insights into optimizing image sizes. Being willing to be wrong opens up avenues for rapid learning.

Bart: That vulnerability can indeed produce rich learning experiences. It's a valuable practice. Shifting gears slightly, if you could offer one piece of career advice to your younger self, what would it be?

Dan: Firstly, embrace a mindset of rapid learning and humility. Be more open to being wrong and detach ego from ideas. While standing firm on important matters is essential, recognize that failure and adaptation are part of the journey. Like a stone rolling down a mountain, each collision smooths out the sharp edges, leading to growth.

Secondly, prioritize hiring decisions. The people you bring into your business shape its trajectory more than any other factor. A wrong hire can have far-reaching consequences beyond their salary. Despite some missteps, I've been fortunate to work with exceptional individuals who contribute immensely to our success. When considering a job opportunity, I always emphasize the people's quality, the mission's significance, and fair compensation. Prioritizing in this order ensures fulfillment and satisfaction in your career journey.

Bart: That's insightful advice, especially about hiring. Surrounding yourself with talented individuals can make all the difference in navigating business challenges. Now, shifting gears to your recent tweet about Kubernetes and Ingress, who was the intended audience for that tweet?

Dan: Honestly, it was more of a reflection for myself, perhaps shouted into the void. I was weighing the significance of deploying Ingress within Kubernetes. In engineering, there's a saying that "the problem is always DNS," which suggests that your cluster becomes more tangible once you configure DNS settings. Similarly, setting up Ingress signifies a shift in how you perceive and manage your cluster. Without Ingress, it might be considered disposable, like a development environment. However, once Ingress is in place, your cluster hosts services that require more attention and care.

Bart: For those unfamiliar with the "cattle versus pets" analogy in Kubernetes, could you elaborate on its relevance, particularly in the context of Ingress?

Dan: While potentially controversial, the "cattle versus pets" analogy illustrates a fundamental concept in managing infrastructure. In this analogy, cattle represent interchangeable and disposable resources, much like livestock in a ranching operation. Conversely, pets are unique, loved entities requiring personalized care.

In Kubernetes, deploying resources as "cattle" means treating them as replaceable, identical units. However, Ingress introduces a shift towards a "pet" model, where individual services become distinct and valuable entities. Just as you wouldn't name every cow on a farm, you typically wouldn't concern yourself with the specific details of each interchangeable resource. But once you start deploying services accessible via Ingress, each service becomes unique and worthy of individual attention, akin to caring for a pet.

Bart: It seems the "cattle versus pets" analogy is stirring some controversy among vegans, which is understandable given its context. How does this analogy relate to Kubernetes and Ingress?

Dan: In software, the analogy helps distinguish between disposable, interchangeable components (cattle) and unique, loved entities (pets). For instance, in my Kubernetes cluster, the individual nodes are like cattle—replaceable and without specific significance. If one node malfunctions, I can easily swap it out without concern.

However, once I deploy Ingress and start hosting services, the cluster takes on a different role. While the individual nodes remain disposable, the cluster becomes more akin to a pet. I care about its state, its configuration, and its uptime. Suddenly, I'm monitoring metrics and ensuring its well-being, similar to caring for a pet's health.

So, the analogy underscores the shift in perception and care that occurs when transitioning from managing generic infrastructure to hosting meaningful services accessible via Ingress.

Bart: That's a fascinating perspective. How do Kubernetes and Ingress relate to all of this?

Dan: The ingress in Kubernetes is a central resource for managing incoming traffic to the cluster and routing it to different services. However, unlike other resources in Kubernetes, such as those managed by Argo CD, the ingress is often shared among multiple applications. Each application may have its own deployment rules, allowing for granular control over updates and configurations. For example, one application might only update when manually triggered, while another automatically updates when changes are detected.

The challenge arises because updating Ingress impacts multiple applications simultaneously. Through this centralized routing mechanism, you're essentially juggling the needs of various applications. This complexity underscores the importance of managing the cluster effectively, as each change to Ingress affects the entire ecosystem of applications.

The Argo CD community is discussing introducing delegated server-side field permissions. This feature would allow one application to modify components of another, easing the burden of managing shared resources like Ingress. However, it's still under debate, and alternative solutions may emerge. Other tools, like Contour, offer a different approach by treating each route as a separate custom resource, allowing applications to manage their routing independently.
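
One way to picture the per-route approach is that each application owns only its own route objects, and the shared routing table is derived from them. The Python sketch below illustrates that idea under stated assumptions; it is not Contour's API, and the `build_routing_table` helper, the app names, and the hosts are hypothetical.

```python
def build_routing_table(
    routes_by_app: dict[str, list[tuple[str, str]]],
) -> dict[tuple[str, str], str]:
    """Merge the (host, path) routes each app owns into one table.

    Raises ValueError when two apps claim the same route, instead of
    letting one silently overwrite the other.
    """
    table: dict[tuple[str, str], str] = {}
    for app, routes in routes_by_app.items():
        for host, path in routes:
            key = (host, path)
            if key in table:
                raise ValueError(
                    f"{app} and {table[key]} both claim {host}{path}"
                )
            table[key] = app  # route -> owning application
    return table


if __name__ == "__main__":
    table = build_routing_table(
        {
            "shop": [("example.com", "/shop")],
            "blog": [("example.com", "/blog")],
        }
    )
    for (host, path), app in sorted(table.items()):
        print(f"{host}{path} -> {app}")
```

Because each app edits only its own entries, one team can change its routes on its own schedule without touching a single shared object that every other application also depends on.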

Ultimately, deploying the ingress marks a shift in the cluster's dynamics, requiring considerations such as DNS settings and centralized routing configurations. As a result, the cluster becomes more specialized and less disposable as its configuration becomes bespoke to accommodate the routing needs of various applications.

Bart: Any recommendations for those who aim to keep their infrastructure reproducible while needing Ingress?

Dan: One approach is abstraction and leveraging wildcards. Technically, you can deploy an Ingress without anything external pointing at it, but I prefer the concept of self-updating components. Tools like Crossplane or Google Cloud's Config Connector let you represent non-Kubernetes resources as Kubernetes objects. Incorporating such tools into your cluster bootstrap process means the necessary components get created dynamically.

However, there's a caveat. Even when clusters are reproducible, external components like DNS settings may not be. Updating name servers remains a manual task. It's a tricky aspect of operations that still lacks a perfect solution.

Bart: How do GitOps and Argo CD fit into solving this challenge?

Dan: GitOps and Argo CD play a crucial role in managing complex infrastructure, especially with sensitive data. The key lies in representing all infrastructure resources, including secrets and certificates, as Kubernetes objects. This approach enables Argo CD to track and reconcile them, ensuring that the desired state defined in Git reflects accurately in your cluster.

Tools like Crossplane, vCluster (for managing multiple clusters), or Cluster API (for provisioning additional clusters) can extend this approach to handle various infrastructure resources beyond Kubernetes. Essentially, Git serves as the single source of truth for your entire infrastructure, with Argo CD functioning as the engine to enforce that truth.

A common issue with Terraform is that its state can get corrupted easily because it must constantly monitor changes. Crossplane often uses Terraform under the hood. The problem is not with Terraform's primitives but with the data store and its maintenance. Crossplane ensures the data store remains uncorrupted, accurately reflecting the current state. If changes occur, they appear as out of sync in Argo CD.

You can then define policies for reconciliation and updates, guiding the controller on the next steps. This approach is crucial for managing infrastructure effectively. Using etcd as your data store is an excellent pattern and likely the future of infrastructure management.

Bart: What happens when the challenges of managing Kubernetes infrastructure extend beyond handling ingress traffic to managing sensitive information like state secrets and certificates? This added complexity could lead to a "pet" cluster scenario. Do you think backup and recovery tools like Velero would be easier to use without these additional challenges?

Dan: I need to familiarize myself with Velero. Can you tell me about it?

Bart: Velero is a tool focused on backing up and restoring Kubernetes resources. Since you mentioned Argo CD and custom resources earlier, I'm curious about your approach to backing up persistent volumes. How did you manage disaster recovery in your home lab when everything went haywire?

Dan: I've used Longhorn for volume restoration, and clear protocols were in place. I'm currently exploring Velero, which looks like a promising tool for data migration.

Managing data is a lot like caring for a pet: it requires careful handling and migration. Many people struggle with stateful workloads in Kubernetes. Fortunately, most of my stateful workloads can rebuild their databases if data is lost, so data loss is manageable for me. Most of the elements I work with are replicable. Anything that needs to persist between sessions is stored in Git or in a versioned, immutable secret repository.

Bart: It's worth noting, especially considering what happened with your home lab. Should small startups prioritize treating their clusters like cattle, or is ClickOps sufficient?

Dan: It depends on the use cases. vCluster, a project I'm fond of, is particularly well-suited for creating disposable development clusters, providing developers with isolated sandboxes for testing and experimentation. It allows deploying a virtualized cluster on an existing Kubernetes setup, which saves significantly on ingress costs, especially on platforms like AWS, where you can consolidate ingress into one.

Another example is using Argo CD's application sets to create full-stack environments for each pull request in a Git repository. These environments, which include a virtual cluster, are unique to each pull request but remain completely disposable and easily recreated, much like cattle.

However, managing ingress for disposable clusters can be challenging. When deployed and applied to vClusters, ingress needs custom configurations, requiring separate tracking and maintenance. Despite this, it's still beneficial to prioritize treating infrastructure as disposable. For example, while my on-site Kubernetes cluster is a "pet" that requires careful maintenance, its nodes are considered "cattle" that can be replaced or reconfigured without disrupting overall operations. This abstraction is a core principle of Kubernetes and allows for greater flexibility and resilience.

By abstracting clusters away from custom configurations and focusing on reproducibility, you can treat them more like cattle, even if they have some pet-like qualities due to ingress deployment and DNS configurations. This commoditization of clusters simplifies management and enables greater scalability. The more you abstract and standardize your infrastructure, the smoother your operations will become. And to be clear, this analogy has nothing to do with dietary choices.

Bart: If you could rewind time and change anything, what scenario would you create to avoid writing that tweet?

Dan: We've been discussing a feature in Argo CD that allows for delegated field permissions to happen server-side. It addresses a problem inherent in Kubernetes architecture, particularly regarding ingress. The current setup doesn't allow for external delegation of its components, even though many users operate it that way. If I could make changes, I might have split ingress into an additional resource, including routes as a separate definition that users could manage independently.

Exploring other scenarios where delegated field permissions would be helpful is crucial. Ingress is the most obvious example, highlighting an area for potential improvement. Creating separate routes and resources could solve this issue without altering Argo CD. This approach, similar to Contour's, could be a promising solution. Contour's separate resource strategy demonstrates learning from Ingress and making improvements. We should consider adopting tools like Contour or other service mesh ingress providers, as several compelling options are available.

Bart: If you had to build a cluster from scratch today, how would you address these issues whenever possible?

Dan: Sometimes you just have to accept the challenge and not try to work around it. Setting up ingress and configuring DNS for a single cluster might not be a big deal, but it's worth considering a re-architecture if you're doing it on a large scale, like 250,000 times. For instance, with Codefresh, many users opt for our hybrid setup. They deploy our GitOps agent, based on Argo CD, on their cluster, which then connects to our control plane.

One of the perks we offer is a hosted ingress. Instead of setting up ingresses for each of their 5000 Argo CD instances, users can leverage our hosted ingress, saving money and configuration headaches. Consider alternatives like a tunneling system instead of custom ingress setups, depending on your use case. A hosted ingress can be a game-changer for large-scale distributed setups like multiple Argo CD instances, saving costs and simplifying configurations. Ultimately, re-architecting is always an option tailored to what works best for you.

Bart: We're nearing the end of the podcast and want to touch on a closing question, which we are looking at from a few different angles. How do you deal with the anxiety of adopting a new tool or practice, only to find out later that it might be wrong?

Dan: I've seen this dynamic play out. Sometimes organizations invest heavily in a tool, only to realize a few years later that something else would have been a better fit. Take the example of a company that moved to Argo Workflows for CI/CD and deployment, only to discover that Argo CD would have been a better fit for most of their use cases. However, these transitions are not wasted effort. In their case, the journey through Argo Workflows paved the way for a smoother transition to Argo CD. Sometimes a detour in the wrong direction is necessary to reach the correct destination faster.

You can't always foresee the ideal solution from where you are, and experimenting with different tools is part of the learning process. It's essential not to dwell on mistakes but to learn from them and move forward. After all, even if a tool ultimately proves to be the wrong choice, it often still brings value. The key is recognizing when a change is needed and adapting accordingly. Mistakes only become fatal if we fail to acknowledge and learn from them.

Bart: We stumbled upon your blog, Today Was Awesome, which hasn't seen an update in a while. You wrote a post about Bitcoin, priced at around $450 in 2015. Are you a crypto millionaire now?

Dan: Not quite! Crypto is a fascinating topic, often sparking wild debates. While there's no shortage of scams in the crypto world, there's also genuine innovation happening. I dabbled in Bitcoin early on and even mined a bit to understand its potential use cases better. One notable experience was mentoring at Hack the North, a massive hackathon where numerous projects leveraged Ethereum. I strategically sold my Bitcoin for Ethereum, which turned out well. However, I'm still waiting on those Lambos—I'm not quite at millionaire status yet!

Bart: Your blog covers many topics, including one post titled "What are we really supposed to learn from fairy tales." How did you decide on such diverse content?

Dan: I can't recall the exact inspiration, but my wife and I often joke about how outdated the moral lessons in fairy tales feel. Their relevance in today's world is an interesting angle to explore.

Bart: What's next for you? More fairy tales, moon-bound Lamborghinis, or snowboarding adventures? Also, let's discuss your recent tweet about making your own bacon. How did that start?

Dan: Ah, yes, making bacon! It's surprisingly simple. First, you get pork belly and cure it in the fridge for seven to ten days. Then, you smoke it for a couple of hours.

My primary motivation was to avoid the nitrates found in store-bought bacon, which are linked to health issues. Homemade bacon tastes better, is of higher quality, and is cheaper. My freezer now overflows with homemade bacon, which makes for a unique and well-received gift. People love the taste. Overall, it's been a rewarding and delicious effort!

Bart: Regardless of dietary choices, considering where your food comes from and being involved in the process—whether by growing your food or making it yourself and turning it into a gift for others—creates a different, enriching experience. What's next for you?

Dan: This year, my focus is on environment management and promotion. In the Kubernetes world, we often think about applications, clusters, and instances of Argo CD to manage everything. We're working on a paradigm shift: we think about products instead of applications. In our context, a product is an application in every environment in which it exists. Hence, if you deploy a development application, move it to stage, and finally to production, you're deploying the same application with variations three times. That's what we call a product. We’re shifting from thinking about where an application lives to considering its entire life cycle. Instead of focusing on clusters, we think about environments because an environment might have many clusters.

For instance, retail companies like Starbucks, Chick-fil-A, and Pizza Hut often have Kubernetes clusters on-site. Deploying to US West might mean deploying to 1,300 different clusters and 1,300 different Argo CD instances. We abstract all that complexity by grouping them into the environments bucket. We focus on helping people scale up and build their workflow using environments and establishing these relationships. The feedback has been incredible; people are amazed by what we’re demonstrating.

We're showcasing this at ArgoCon next month in Paris. After that, I plan to do some snowboarding and then make it back in time for the birth of my fifth child.

Bart: That's a big plan. 2024 is packed for you. If people want to contact you, what's the best way to do it?

Dan: Twitter is probably the best. You can find me at @todaywasawesome. If you visit my blog and leave comments, I won't see them, as it's more of an archive now. I keep it around because I worked on it ten years ago and occasionally reference something I wrote.

You can also reach out on LinkedIn, GitHub, or Slack. I respond slower on Slack, but I do get to it eventually.

Wrap up

If you enjoyed this interview and want to hear more Kubernetes stories and opinions, visit KubeFM and subscribe to the podcast.
If you want to keep up-to-date with Kubernetes, subscribe to Learn Kubernetes Weekly.
If you want to become an expert in Kubernetes, look at courses on Learnk8s.
Finally, if you want to keep in touch, follow me on LinkedIn.

Upgrading Hundreds of Kubernetes Clusters

Gulcan Topcu — Wed, 03 Apr 2024 07:22:58 +0000

Automating the upgrade process for hundreds of Kubernetes clusters is a formidable task, but it's one that Pierre Mavro, the co-founder and CTO at Qovery, is well-equipped to handle. With his extensive experience and a dedicated team of engineers, they have successfully automated the upgrade process for both public and private clouds.

Bart Farrell sat down with Pierre to understand how he did it without breaking the bank.

You can watch (or listen) to this interview here.

Bart: If you installed three tools on a new Kubernetes cluster, which tools would they be and why?

Pierre: The first tool I recommend is K9s. It's not just a time-saver but a productivity booster. With its intuitive interface, you can speed up all the usual kubectl commands, access logs, edit resources and configurations, and more. It's like having a personal assistant for your cluster management tasks.

The second one is a combination of tools: External DNS, cert-manager, and NGINX ingress. Using these as a stack, you can quickly deploy an application and make it available under a DNS name with TLS, without much effort, via simple annotations. When I first discovered External DNS, I was amazed at its quality.

The last one is mostly an observability stack with Prometheus, Metrics Server, and Prometheus Adapter to get excellent insights into what is happening on the cluster. You can reuse the same stack for autoscaling by repurposing all the data collected for monitoring.

Bart: Tell us more about your background and how you progressed through your career.

Pierre: My journey in the tech industry has been diverse and enriching. I've had the privilege of working for renowned companies like Red Hat and Criteo, where I honed my skills in cloud deployment. Today, as the co-founder and CTO of Qovery, I bring a wealth of experience in distributed systems, particularly for NoSQL databases, and a deep understanding of Kubernetes, which I began exploring in 2016 with version 1.2.

To provide some context to Qovery's services, we offer a self-service developer platform that allows code deployment on Kubernetes without requiring expertise in infrastructure. We keep our platform cloud-agnostic and place Kubernetes at the core to ensure our deployments are portable across different cloud providers.

Bart: How was your journey into Kubernetes and the cloud-native world, given the changes since 2016?

Pierre: Actually, learning Kubernetes was quite a journey. The landscape was less developed, with most Kubernetes components still in alpha at the time. In 2016, I was also juggling my job at Criteo and my own company.

When it came to deployment, I had several options, and I chose the hard way: deploying Kubernetes on bare metal nodes using KubeSpray. Troubleshooting bare metal Kubernetes deployments honed my skills in pinpointing issues. This hands-on experience provided a deep understanding of how each component, like the Control Plane, kubelet, Container Runtime, and scheduler, interacts to orchestrate containers.

Another resource that I found pretty helpful, despite its complexity, was "Kubernetes the Hard Way" by Kelsey Hightower.

Lastly, I got help from the official Kubernetes docs.

Bart: Looking back, is there anything you would do differently or advice you would give to your past self?

Pierre: Not really. Looking back, KubeSpray was the best option at the time, and there were no significant changes I would make to the decision.

Bart: You've worked on various projects involving bare metal and private clouds. Can you share more about your Kubernetes experience, such as the scale of clusters and nodes?

Pierre: At Criteo, I led a NoSQL team supporting several million requests per second on a massive 4,500-node bare-metal cluster. Managing this infrastructure - particularly node failures and data consistency across stateful databases like Cassandra, Couchbase, and Elasticsearch - was a constant challenge.

While at Criteo, I also had a personal project where I built a smaller 10-node bare-metal cluster.
This experience with bare metal management solidified my belief in the benefits of Kubernetes, which I later implemented at Criteo.

When we adopted Kubernetes at Criteo, we encountered initial hurdles. In 2018, Kubernetes operators were still new, and there was internal competition from Mesos. We addressed these challenges by validating Kubernetes performance for our specific needs and building custom Chef recipes, StatefulSet hooks, and startup scripts.

Migrating to Kubernetes took eight months of dedicated effort. It was a complex process, but it was worth it.

Bart: As you’ve mentioned, Kubernetes had competitors in 2018 and continues to do so today. Despite the tooling's immaturity, you led a team to adopt Kubernetes for stateful workloads, which was unconventional. How did you guide your team through this transition?

Pierre: We had large instances, each with between 50 and 100 CPUs and between 256 and 500 gigabytes of RAM.

We had multiple Cassandra clusters on a single Kubernetes cluster, and each Kubernetes node was dedicated to a single Cassandra node. We chose this bare metal setup to optimize disk access with SSD or NVMe.

Running these stateful workloads wasn't just a matter of starting them up. We had to handle them carefully because stateful applications like Elasticsearch and Cassandra must keep their data safe even if the machine they're running on fails.

Kubernetes helped us handle issues with these apps using features like Pod Disruption Budgets (PDBs), which limit how many pods can be voluntarily disrupted at the same time, StatefulSets, which provide consistent ordering and stable storage, and automated probes that trigger actions and alerts when something goes wrong.

Bart: Your experiences helped me better understand your blog post, The Cost of Upgrading Hundreds of Kubernetes Clusters. After managing large infrastructures, you founded Qovery. What drove you to take this step as an engineer?

Pierre: Kubernetes has become a standard, but managing it can be a headache for developers. Cloud providers offer a basic Kubernetes setup, but it often lacks the features developers need to get started and deploy applications quickly. Managing the cluster and nodes and keeping them up-to-date is time-consuming. Developers must add extra tools and configurations on top of the basic setup and then keep everything updated, which takes even more time.

To tackle these challenges, I founded Qovery.

Qovery provides two critical solutions. First, it offers a unified, user-friendly stack across cloud providers, simplifying Kubernetes deployment and management complexity. Second, it enables developers to deploy code without hassle.

Bart: Managing clusters can have various interpretations. The term can be broad. How do you define cluster management at Qovery in the context of upgrading and recovery?

Pierre: Yes, that's right. At Qovery, we understand the complexity of managing Kubernetes for customers. That's why we automate and simplify the entire process.

We automatically notify you about upcoming Kubernetes updates and handle the upgrade process on schedule, eliminating the need for manual intervention.

We deploy and manage various essential charts for your environment, including tools for logging, metrics collection, and certificate management. You don't need to worry about these intricacies.

We deploy all the necessary infrastructure elements to create a fully functional Kubernetes environment for production within 30 minutes. We provide a complete solution that's ready to go.

We build your container images, push them to a registry, and deploy them based on your preferences. We also handle the lifecycle of the applications deployed.

We use Cluster Autoscaler to automatically adjust the number of nodes (cluster size) based on your actual usage to ensure efficiency. Additionally, we deploy Vertical and Horizontal Pod Autoscalers to automatically scale your applications' resources as their needs change.

By taking care of these complexities, Qovery frees your developers to focus solely on what matters most: building incredible applications.

Bart: How large is your team of engineers?

Pierre: We have ten engineers working on the project.

Bart: How do you manage hundreds of clusters with such a small team?

Pierre: We run various tests on each code change, including unit tests for individual components and end-to-end tests that simulate real-world usage. These tests cover configurations and deployment scenarios to catch potential issues early on.

Before deploying a new cluster for a customer, we put it through its paces on our internal systems for weeks. Then, we deploy it to a separate non-production environment where we closely monitor its performance and address any problems before it reaches your applications.

We closely monitor Kubernetes and cloud providers' updates by following official changelogs and using RSS feeds, allowing us to anticipate potential issues and adapt our infrastructure proactively.

We also leverage tools like Kubent, popeye, kdave, and Pluto to help us manage API deprecations (when Kubernetes deprecates features in updates) and ensure the overall health of our infrastructure.
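As a rough illustration of what an API deprecation check does, here is a minimal Python sketch. It is not how Kubent or Pluto work internally: the `REMOVED_APIS` table is a small hand-picked subset of removals, and `find_removed_apis` is a hypothetical helper that takes manifests already parsed into dictionaries.

```python
# Hypothetical, hand-maintained subset of removed APIs:
# (apiVersion, kind) -> (major, minor) Kubernetes version that removed it.
REMOVED_APIS = {
    ("extensions/v1beta1", "Ingress"): (1, 22),
    ("networking.k8s.io/v1beta1", "Ingress"): (1, 22),
    ("batch/v1beta1", "CronJob"): (1, 25),
    ("policy/v1beta1", "PodDisruptionBudget"): (1, 25),
}


def find_removed_apis(manifests, target_version):
    """Return manifests whose apiVersion/kind is gone in target_version."""
    problems = []
    for manifest in manifests:
        key = (manifest.get("apiVersion"), manifest.get("kind"))
        removed_in = REMOVED_APIS.get(key)
        # Tuples compare element by element, so (1, 25) >= (1, 22) is True.
        if removed_in is not None and target_version >= removed_in:
            name = manifest.get("metadata", {}).get("name", "<unnamed>")
            problems.append((name, key[0], key[1], removed_in))
    return problems


# Illustrative manifests (names are made up for the example).
manifests = [
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget",
     "metadata": {"name": "db-pdb"}},
    {"apiVersion": "networking.k8s.io/v1", "kind": "Ingress",
     "metadata": {"name": "web"}},
]

for name, api_version, kind, removed_in in find_removed_apis(manifests, (1, 25)):
    print(f"{kind}/{name} uses {api_version}, removed in {removed_in[0]}.{removed_in[1]}")
```

Running the same check against the version you plan to upgrade to, before the upgrade, is what turns a changelog entry into an actionable list of resources to migrate.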

Our multi-layered approach has proven successful. We haven't encountered any significant problems when deploying clusters to production environments.

Bart: Managing new releases in the Kubernetes ecosystem can be daunting, especially with the extensive changelog. How do you navigate this complexity and spot potential difficulties when a new release is on the horizon?

Pierre: Reading the official update changelogs from Kubernetes and cloud providers is our first step, but it isn't enough on its own. These detailed technical documents can also be hard to understand, especially for newer team members who don't have prior on-premise Kubernetes experience.

Cloud providers typically offer well-defined upgrade processes and document significant changes like removed functionalities, changes in API behavior, or security updates in their changelogs. However, many elements are interconnected in a Kubernetes cluster, especially when you deploy multiple charts for components like logging, observability, and ingress. Even with automated tools, we still need extensive testing and a manual process to ensure everything functions smoothly after an update.

Bart: So, what is your upgrading plan for helm charts?

Pierre: Upgrading Helm charts can be tricky because they bundle both the deployment and the software; for example, upgrading the Loki chart also upgrades Loki itself. To better understand what's changing, you need to review two changelogs: one for the chart itself and another for the software it includes.

We keep a close eye on all the charts we use by storing them in a central repository. This way, we have a clear history of every version we've used. We use a tool called helm-freeze to lock down the specific version of each chart we want to use. We can also track changes between chart and software versions using the git diff command.
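To make the idea of tracking pinned chart versions concrete, here is a minimal Python sketch, assuming the pinned versions are available as plain dictionaries of chart name to version. It does not reproduce helm-freeze's configuration format or behavior; `diff_pinned_versions` and the chart versions shown are hypothetical.

```python
def diff_pinned_versions(old, new):
    """Return {chart: (old_version, new_version)} for every chart that changed.

    A missing chart is reported with None on the side where it is absent.
    """
    changes = {}
    for chart in sorted(set(old) | set(new)):
        before, after = old.get(chart), new.get(chart)
        if before != after:
            changes[chart] = (before, after)
    return changes


# Illustrative pins before and after a planned upgrade (versions are made up).
old_pins = {"loki": "5.0.0", "cert-manager": "v1.13.0"}
new_pins = {"loki": "5.1.0", "cert-manager": "v1.13.0", "external-dns": "1.14.0"}

for chart, (before, after) in diff_pinned_versions(old_pins, new_pins).items():
    print(f"{chart}: {before} -> {after}")
```

Each chart in the result is one that needs both changelogs reviewed: the chart's own and the bundled software's.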

If needed, we can also adjust specific settings within the chart using values override.

Like any other code change, we thoroughly test the upgraded charts with unit and functional tests to ensure everything works as expected.

Once testing is complete, we route the updated charts to our test cluster for a final round of real-world testing. After a few days of monitoring, if everything looks good, we confidently release the updates to our customers.

Bart: How do you handle unexpected situations? Do you have a specific strategy or write more automation in the Helm charts?

Pierre: We're excited to see more community Helm charts that include built-in tests! This practice will make it easier for everyone to trust and use these charts in the future.

At Qovery, we enable specific Helm options by default, like 'atomic' and 'wait,' which help prevent upgrade failures during the process. However, there can still be issues that only show up in the logs, so we run additional tests specifically designed to catch these hidden problems.
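For readers who haven't used these options, here is a minimal Python sketch that assembles a Helm 3 upgrade command with `--atomic` and `--wait`. With `--atomic`, Helm rolls the release back if the upgrade fails, and `--wait` makes Helm wait for resources to become ready before reporting success. The helper name `helm_upgrade_command` and the release, chart, namespace, and version values are assumptions for illustration, not Qovery's actual invocation.

```python
import shlex


def helm_upgrade_command(release, chart, namespace, version, timeout="10m"):
    """Build a `helm upgrade` argument list with atomic/wait enabled."""
    return [
        "helm", "upgrade", "--install", release, chart,
        "--namespace", namespace,
        "--version", version,   # pin the chart version explicitly
        "--atomic",             # roll back automatically if the upgrade fails
        "--wait",               # wait until resources report ready
        "--timeout", timeout,   # upper bound for the wait
    ]


# Illustrative values only.
cmd = helm_upgrade_command("loki", "grafana/loki", "logging", "5.0.0")
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps it safe to pass to `subprocess.run` without a shell.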

Upgrading charts that deploy Custom Resource Definitions (CRDs) requires special attention. We've automated this process to upgrade the CRDs first (to the required version) and then upgrade the chart itself. Additionally, for critical upgrades like cert-manager (which manages certificates), we back up and restore resources before applying the upgrade to avoid losing any critical certificates.

If you’re running an older version of a non-critical tool like a logging system, upgrading through each minor version one by one can be time-consuming. We have a better way! Our system allows you to skip to the desired newer version, bypassing all those intermediate updates.

We've also built safeguards into our system to handle potential problems before they occur during cluster upgrades. For example, the system checks for issues like failed jobs, incorrect Pod Disruption Budgets configuration, or ongoing processes that might block the upgrade. If it detects any problems, our engine automatically attempts to fix or clean up the issue. It will also warn you if any manual intervention is needed.
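A pre-upgrade check of this kind can be sketched in a few lines of Python. This is not Qovery's engine: the helpers below are hypothetical and operate on PodDisruptionBudget and Job objects already fetched as dictionaries (for example from `kubectl get ... -o json`). They rely on two real status fields, `status.disruptionsAllowed` on a PDB and `status.failed` on a Job.

```python
def blocking_pdbs(pdbs):
    """Names of PDBs that currently allow zero voluntary disruptions.

    A node drain cannot evict pods covered by such a PDB, so an upgrade
    that replaces nodes would stall on them.
    """
    return [
        pdb["metadata"]["name"]
        for pdb in pdbs
        if pdb.get("status", {}).get("disruptionsAllowed", 0) == 0
    ]


def failed_jobs(jobs):
    """Names of Jobs that report at least one failed pod."""
    return [
        job["metadata"]["name"]
        for job in jobs
        if job.get("status", {}).get("failed", 0) > 0
    ]


def preflight(pdbs, jobs):
    """Collect everything that should be fixed or flagged before upgrading."""
    return {"blocking_pdbs": blocking_pdbs(pdbs), "failed_jobs": failed_jobs(jobs)}


# Illustrative objects (names are made up for the example).
pdbs = [
    {"metadata": {"name": "cassandra"}, "status": {"disruptionsAllowed": 0}},
    {"metadata": {"name": "web"}, "status": {"disruptionsAllowed": 1}},
]
jobs = [
    {"metadata": {"name": "migrate"}, "status": {"failed": 2}},
    {"metadata": {"name": "backup"}, "status": {"succeeded": 1}},
]

print(preflight(pdbs, jobs))
```

An empty result means nothing was found that would block the upgrade; anything else is a candidate for automatic cleanup or a warning to the user.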

Our ultimate goal is to automate the upgrade process as much as possible.

Bart: Would you say CRDs are your favorite feature in Kubernetes, or do you have another one?

Pierre: CRDs are a powerful tool for customizing Kubernetes, offering a high degree of flexibility. However, the current support and tooling around them leave room for improvement. For example, enhancing Helm with better CRD management capabilities would significantly improve the user experience.

Despite these limitations, the potential of CRDs for customizing Kubernetes is undeniable, making them a genuinely standout feature.

Bart: With your vast Kubernetes experience since 2016, how does your current process scale beyond 100 clusters? What do you need for such scalability?

Pierre: While basic application metrics can provide a general sense of health, managing hundreds of clusters requires more in-depth testing. Here at Qovery, with our experience handling nearly 300 clusters, we've found that:

Basic metrics alone aren't enough. We need comprehensive testing that leverages application-specific metrics to ensure everything functions as expected.

Scaling requires more granular control over deployments, such as halting failures and providing detailed information to our users. For instance, quota issues from the cloud provider might necessitate user intervention.

Drawing from my experience at Criteo, where robust tooling was essential for managing complex tasks, powerful tools are the key to effectively scaling beyond 100 clusters.

Bart: Looking ahead at Qovery's roadmap, what's next for your team?

Pierre: Qovery will add Google Cloud Platform (GCP) by year-end, joining AWS and Scaleway! This expansion gives you more choices for your cloud needs.

We're extracting reusable code sections, like those related to Helm integration, and transforming them into dedicated libraries. By making these functionalities available as open-source libraries, we empower the developer community to leverage them in their projects.

We strongly believe in Rust as a powerful language for building production-grade software, especially for systems like ours that run alongside Kubernetes.

We're also developing a service catalog feature that offers a user-friendly interface and streamlines complex deployments. This feature will allow users to focus on their applications, not the intricacies of the underlying technology.

Bart: Do you have any plans to include Azure?

Pierre: Yes, we do, but integrating a new cloud provider is challenging given our current team size. While we are a team of seniors, each cloud provider has its own nuances, and some are more mature or demand more resources than others.

Today, our focus is on AWS and GCP, as those are what our customers request most. However, we're also working on a more modular approach that will allow Qovery to be deployed on any Kubernetes cluster, irrespective of the cloud provider, although this is still in progress.

Bart: We're looking forward to hearing more about that. So, with your black belt in karate, how does that experience influence how you approach challenges, breaking them down into manageable steps?

Pierre: Karate has taught me the importance of discipline, focus, and breaking down complex tasks into manageable steps. Like in karate, where each move is deliberate and precise, I apply the same approach to challenges in my work, breaking them down into smaller, achievable goals.

Karate has also instilled in me a sense of perseverance and resilience, which are invaluable when facing difficult situations.

Bart: I'm a huge martial arts fan. How do you see martial arts' influence on managing stress in challenging situations?

Pierre: It varies from person to person. My experience in the banking industry has shown me that while some can handle stressful situations, others struggle. Martial arts can help manage stress somewhat, depending on the person.

Bart: How has your 25-year journey in karate shaped your perspective?

Pierre: Karate has become a part of me, and I plan to continue as long as possible.

Bart: What's the best way to reach out to you?

Pierre: You can reach me on LinkedIn or via email. I'm always happy to help.

Wrap up 🌄

If you enjoyed this interview and want to listen to more Kubernetes stories and opinions, head to KubeFM and subscribe to the podcast.
If you want to keep up-to-date with Kubernetes, subscribe to Learn Kubernetes Weekly.
If you want to become an expert in Kubernetes, look at courses on Learnk8s.
And finally, if you want to keep in touch with me, follow me on LinkedIn.
