DEV Community: Ivan Porta

HAMi Explained: Why Your GPUs Sit Idle While Your Bill Doesn't

Ivan Porta — Tue, 14 Jul 2026 16:32:43 +0000

Before the release of Dynamic Resource Allocation and its alpha partitionable-device and consumable-capacity extensions, Kubernetes scheduled GPUs only as whole cards, with no native option to allocate a fraction of a device. For example, if a pod requires only 4 GB of an 80 GB H100, it still reserves the entire GPU for its lifetime. This leads to significant underutilization of expensive hardware in shared clusters. Chatbots designed for peak traffic may remain idle on dedicated cards during off-peak hours; multiple inference services often run on separate GPUs despite low usage, and monitoring dashboards may show all GPUs allocated while tools like nvidia-smi often reveal minimal actual utilization. As a result, new budget requests for additional GPUs arise because the cluster appears to be at capacity. DRA narrows this gap by preventing over-committing a device's declared budget, but it does not natively enforce resource limits once a pod is running, which remains the device driver's responsibility. This is where HAMi provides value.

HAMi (Heterogeneous AI Computing Virtualization Middleware) closes the gap by allowing pods to request an exact portion of a GPU’s memory and compute, with an in-container enforcement layer that holds them to it, with no changes to applications or images. Companies like SNOW and NIO run HAMi in production, and the project reached CNCF Incubating in July 2026.

What is HAMi, actually?

HAMi is an open-source GPU virtualization layer for Kubernetes. It lets a pod request a precise amount of memory and an AI accelerator’s compute, then schedules the pod onto a device with that much free capacity, binding it to that slice at runtime. The same model spans many accelerators, including NVIDIA GPUs, Cambricon MLUs, Hygon DCUs, Ascend NPUs, Moore Threads, MetaX, and others, each via its own device plugin and isolation backend. On NVIDIA, for example, HAMi replaces the stock device plugin and enforces the memory cap by intercepting the CUDA driver API within the container; other vendors provide analogous library- or hardware-level isolation.

The user experience is deliberately simple: the standard device resource request is extended with one or more limits that the scheduler reads at scheduling time to find a card with sufficient free memory and compute resources. On NVIDIA, you keep requesting nvidia.com/gpu and add the two extended resources beside it: nvidia.com/gpumem and nvidia.com/gpucores.

resources:
  limits:
    nvidia.com/gpu: 1       
    nvidia.com/gpumem: 8000 
    nvidia.com/gpucores: 30

Two comparisons put that isolation model in context. MIG draws a hardware boundary: each instance gets its own SMs and a dedicated route through L2 cache slices, memory controllers, and DRAM channels. HAMi’s boundary is a fence, not a wall. Its limits are enforced in software inside the container, which reliably holds workloads to their budgets but cannot contain a device-level failure: a hung process, a driver reset, or an XID error on the shared card still hits every tenant on it. The two work at different layers, though, and they can be complementary rather than competing; we will come back to how HAMi can drive MIG directly.

Compared with time-slicing, however, HAMi offers significant improvements in stability. Time-slicing simply advertises multiple schedulable replicas of a single card and interleaves their workloads on the compute engine, without per-replica memory caps or isolation. That makes it dangerously easy for one workload to allocate past its notional share and take the others down with it. HAMi refuses the excess allocation instead. The following log shows PyTorch reporting a total capacity matching the 1000 MB limit set by nvidia.com/gpumem; when the pod tried to exceed it, the allocator stopped it with an OOM error, leaving the other vGPU tenants unaffected:

(EngineCore pid=95) ERROR 07-11 21:15:46 [core.py:1231] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 1000.00 MiB of which 12.00 MiB is free. Including non-PyTorch memory, this process has 1020.00 MiB memory in use. Of the allocated memory 864.68 MiB is allocated by PyTorch, and 41.32 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://docs.pytorch.org/docs/stable/notes/cuda.html#optimizing-memory-usage-with-pytorch-cuda-alloc-conf)

How a shared-GPU request flows through HAMi

From manifest to enforced slice, a request passes through four hands, shown in the diagram below: the device plugin advertises each card’s capacity, the mutating webhook reroutes the pod to HAMi’s scheduler, the scheduler extender picks a node and a slice, and an in-container library enforces that slice at runtime.

On each node, the device-plugin (which runs as a DaemonSet) polls the local devices every 30 seconds via the vendor's management library (e.g., NVML for NVIDIA, which powers nvidia-smi). It then publishes this telemetry data by patching the node with a hami.io/node-<vendor>-register annotation.

kubectl get nodes gtrekter -o yaml
apiVersion: v1
kind: Node
metadata:
  annotations:
    hami.io/node-handshake: Requesting_2026-07-09 09:03:57
    hami.io/node-nvidia-register: '[{"id":"GPU-a2fd58d2-edbe-4cfd-4dbd-d4197b35431e","count":10,"devmem":6141,"devcore":100,"type":"NVIDIA GeForce RTX 4050 Laptop GPU","mode":"hami-core","health":true,"devicepairscore":{}}]'
    nvidia.com/gpu-driver-upgrade-enabled: "true"
  ...

When a pod is created, the API server calls HAMi's mutating webhook, which checks the pod's manifest against its supported device drivers, including NVIDIA, Ascend, AWS Neuron, and others. Each driver checks the container's resources.limits for recognized resource keys. For example, the NVIDIA driver looks for nvidia.com/gpu, nvidia.com/gpumem, and nvidia.com/gpucores, while the Cambricon MLU driver matches cambricon.com/vmlu. As soon as a compatible request is detected, HAMi updates the pod's schedulerName to hami-scheduler, ensuring the pod is handled by HAMi's device-aware scheduling logic rather than the default Kubernetes scheduler.

Note: Under HAMi, requesting nvidia.com/gpu: 1 no longer means "one whole card," but rather "one shared physical card," with the memory and core limits dictating the exact slice. However, a pod that omits nvidia.com/gpumem falls back to the configured default (100% of the card's memory).

kubectl logs -n hami-system deploy/hami-scheduler -c vgpu-scheduler-extender
I0710 13:51:38.552275  config.go:111] Initializing NVIDIA device
I0710 13:51:38.552355  config.go:248] All devices initialized successfully
...
I0710 14:02:14.161116  devices.go:647] "Resource requirements collected" pod="default/chat-qwen-64c4bd66f9-8qccs" requests=[{"NVIDIA":{"Nums":1,"Type":"NVIDIA","Memreq":4000,"MemPercentagereq":101,"Coresreq":60}}]

The vgpu-scheduler-extender builds a live usage map for every physical GPU in the cluster. It reads device metadata from each node's hami.io/node-nvidia-register annotation, then loops through the pods already assigned to that node to calculate the consumed GPU memory and compute capacity.

kubectl logs -n hami-system deploy/hami-scheduler -c vgpu-scheduler-extender
...
I0710 14:02:14.161182 node_policy.go:82] node node1 used 0, usedCore 0, usedMem 0,
I0710 14:02:14.161191 gpu_policy.go:104] device GPU-a2fd58d2-edbe-4cfd-4dbd-d4197b35431e user 0, userCore 0, userMem 0,
...
I0710 16:54:54.475656 node_policy.go:82] node node1 used 1, usedCore 60, usedMem 4000,
I0710 16:54:54.475667 gpu_policy.go:104] device GPU-a2fd58d2-edbe-4cfd-4dbd-d4197b35431e user 1, userCore 60, userMem 4000,

The scheduler then filters out any nodes lacking sufficient capacity and scores the remaining nodes based on the configured HAMi scheduling policy (nodeSchedulerPolicy and gpuSchedulerPolicy; the chart defaults are binpack at the node level and spread at the GPU level). The selected node and chosen GPU slice are recorded in the pod's annotations (hami.io/vgpu-devices-allocated) as a comma-separated list of UUID,Type,memMB,cores records.

$ kubectl describe pod -n default embed-bge-f84696cc9-7587j
Name:         embed-bge-f84696cc9-7587j
Labels:       hami.io/vgpu-node=node1
...
Annotations:  hami.io/vgpu-devices-allocated: GPU-a2fd58d2-edbe-4cfd-4dbd-d4197b35431e,NVIDIA,2000,30:;
              hami.io/vgpu-devices-to-allocate: ;;
              hami.io/bind-phase: success
              hami.io/bind-time: 1783702494
...

Once the pod is bound to a node, the Kubelet calls Allocate on that node's HAMi device-plugin. The plugin reads the pod's hami.io/vgpu-devices-to-allocate annotation (cleared once the plugin has consumed it, which is why the describe output above shows it empty) to decode the requested slice, identifying the target physical device and its exact compute and memory limits. It then returns a response detailing what the container needs to run against that specific slice: environment variables, host-path mounts (including the enforcement library), and device handles.

kubectl logs -n hami-system ds/hami-device-plugin -c device-plugin | grep -i 'Allocate Response'
...
I0710 16:54:54.499747  232069 server.go:719] Allocate Response [envs:{key:"CUDA_DEVICE_MEMORY_LIMIT_0" value:"2000m"} envs:{key:"CUDA_DEVICE_MEMORY_SHARED_CACHE" value:"/usr/local/vgpu/584ed76a-0856-40b3-8113-1ad696daa252.cache"} envs:{key:"CUDA_DEVICE_SM_LIMIT" value:"30"} envs:{key:"LIBCUDA_LOG_LEVEL" value:"1"} envs:{key:"NVIDIA_VISIBLE_DEVICES" value:"GPU-a2fd58d2-edbe-4cfd-4dbd-d4197b35431e"} mounts:{container_path:"/usr/local/vgpu/libvgpu.so" host_path:"/usr/local/vgpu/libvgpu.so.v2.9.0" read_only:true} mounts:{container_path:"/usr/local/vgpu" host_path:"/usr/local/vgpu/containers/1a5bac44-7888-4ea5-9382-3ef49ae1f1ba_tei"} mounts:{container_path:"/tmp/vgpulock" host_path:"/tmp/vgpulock"} mounts:{container_path:"/etc/ld.so.preload" host_path:"/usr/local/vgpu/ld.so.preload" read_only:true}]

The Kubelet applies this response directly to the container during creation. From that point, HAMi's preloaded control library sits between the application and the vendor driver, enforcing the slice in-process. It intercepts the driver's memory-allocation calls, checks each request against the slice's budget, and returns an OOM error if the workload attempts to exceed its cap. It also intercepts and rewrites driver management queries (like nvidia-smi), ensuring the reported total capacity reflects the assigned limit, and the free capacity reflects that limit minus current usage.

kubectl exec -n default chat-qwen-64c4bd66f9-8qccs -c vllm -- env
CUDA_VERSION=13.0.2
NVIDIA_VISIBLE_DEVICES=GPU-a2fd58d2-edbe-4cfd-4dbd-d4197b35431e
CUDA_DEVICE_MEMORY_LIMIT_0=4000m
CUDA_DEVICE_SM_LIMIT=60
CUDA_DEVICE_MEMORY_SHARED_CACHE=/usr/local/vgpu/5e999bc3-2dbf-46b0-ba81-cc479a1c11dd.cache
...

kubectl exec -n default chat-qwen-64c4bd66f9-8qccs -c vllm -- cat /etc/ld.so.preload
/usr/local/vgpu/libvgpu.so

kubectl exec -n default chat-qwen-64c4bd66f9-8qccs -c vllm -- nvidia-smi --query-gpu=memory.total --format=csv,noheader
4000 MiB
[HAMI-core Msg(272:133498069993280:multiprocess_memory_limit.c:703)]: Cleanup on exit for PID 272
[HAMI-core Msg(272:133498069993280:multiprocess_memory_limit.c:739)]: Exit cleanup complete for PID 272

Scheduling policies that move the bill

As mentioned earlier, the scheduler first filters out nodes lacking sufficient capacity. For the remaining candidates, HAMi determines the optimal placement using two distinct scheduling tiers: the node level and the GPU level. Currently, HAMi supports two primary strategies for each: binpack (pack tasks as densely as possible) and spread (distribute tasks as widely as possible).

metadata:
  annotations:
    hami.io/node-scheduler-policy: "binpack"
    hami.io/gpu-scheduler-policy: "binpack"

Both policies come with distinct operational and financial trade-offs.

Node-Level Strategies

At the node level, the binpack policy packs workloads densely onto the fewest possible machines. This approach enables cluster autoscalers to quickly drain and spin down empty nodes, directly cutting cloud infrastructure bills. However, this density concentrates tenants, meaning a single node or driver failure has a much larger blast radius, and leads to increased contention with less bandwidth headroom.

Conversely, the spread policy distributes workloads evenly across all available nodes. While this successfully lowers per-node contention to provide headroom for unexpected traffic spikes and securely distributes fault domains, it keeps every node "warm." As a result, the autoscaler cannot scale down the cluster, leaving you paying for idle capacity.

GPU-Level Strategies

At the GPU level, the binpack policy completely fills one physical card before scheduling workloads on the next. This preserves entirely free GPUs for large models that require undivided resources and maximizes the number of small slices sharing a single card. The inherent trade-off is that it creates the densest possible cards, maximizing "noisy-neighbor" pressure and exposing a high number of tenants to single-card XID faults.

Alternatively, the spread policy balances workloads across all available cards. This minimizes compute and memory bandwidth contention, making it the superior choice for latency-sensitive inference workloads. The downside is that it fragments available memory across the cluster, meaning a subsequent large job requiring massive, continuous memory might fail to schedule even if the total free memory across the cluster is technically sufficient.

These are placement-time policies only: HAMi scores new pods as they arrive; it does not actively migrate running workloads. Financial savings therefore materialize as pods churn, meaning binpack pays off aggressively on highly elastic inference fleets, but much less so on long-lived, pinned pods.

HAMi and MIG: complementary layers

The fence and the wall operate at different layers, and you do not have to pick one per cluster: HAMi can manage both from a single pool. With dynamic MIG enabled, HAMi drives nvidia-mig-parted to carve and re-carve MIG geometry to fit incoming jobs, removing manual nvidia-smi mig work and pre-committed slice layouts. Pods keep requesting nvidia.com/gpu and nvidia.com/gpumem exactly as before; by default, the scheduler may satisfy that with either a hami-core slice or a MIG instance, and a workload that needs hard isolation can pin itself to the wall with the nvidia.com/vgpu-mode: "mig" annotation. The trade-off is granularity: MIG placements round up to the nearest allowed geometry (an 8 GB request on an A100-40GB lands on a 2g.10gb instance), where hami-core would have fenced off exactly what you asked for.

Operational reality

HAMi replaces your device plugin. The HAMi device plugin and the standard nvidia-device-plugin both want to own nvidia.com/gpu; running both on the same node causes them to fight over resource registration, leading to capacity gaps between real and virtualized counts, and GPU allocation is served by whichever plugin registered last, silently bypassing HAMi’s isolation. If you run the GPU Operator, keep it for the driver, container toolkit, DCGM, and disable its device plugin on HAMi-managed nodes. Scope HAMi with the gpu=on node label and roll it pool by pool, not cluster-wide on day one.
The scheduler image tracks your control plane. hami-scheduler wraps a stock kube-scheduler image, and current charts resolve its tag from the API server’s version at install time, which means a Kubernetes upgrade quietly leaves the old scheduler running until your next Helm upgrade. Add scheduler.kubeScheduler.image.tag to your Kubernetes upgrade runbook, or it will bite you six months in.
gpucores throttles; it does not reserve. The scheduler books your share at placement time, but at runtime that share is a ceiling, never a floor. The limiter is a token bucket checked at kernel launch, not a hardware partition: expect oscillation around the target, bursting above the share while the card is idle, and a few seconds of overlap before limits re-arm when a neighbor arrives mid-burst (set GPU_CORE_UTILIZATION_POLICY=force to pin the limiter on at all times). Benchmark p99 under a noisy neighbor before you commit to an SLO.

A practical recommendation

If your GPU fleet is A100/H100-class, shared by multiple teams, and your utilization dashboards would embarrass you in the finance review, pilot HAMi on one inference node pool this week. The install fits in an afternoon: label two nodes gpu=on, helm install, convert one over-provisioned inference Deployment to gpumem/gpucores slices, and let it run for a week while you watch allocation against DCGM utilization. The arithmetic you’re testing is blunt: an 8-GPU H100 node runs roughly $290,000 to $480,000 a year on-demand depending on cloud and region, so every node that bin-packing drains is real money back. On owned hardware, at roughly $25–30k per H100, every ten points of sustained utilization on a thousand-GPU fleet is $2.5–3M of capex doing work instead of nothing.

AWS Summit Seoul 2026: Korean Enterprises And Agentic AI

Ivan Porta — Thu, 21 May 2026 02:32:09 +0000

Over the last few years, people have been asking the same question about AI: with so much money going into models, GPUs, and data centers, when will it pay off? Earnings calls and analyst reports have often mentioned a possible bubble, and companies that tried generative-AI pilots in 2024 have been waiting to see real production results.

At AWS Summit Seoul 2026, Samsung Electronics, Yogiyo, and Yanolja presented production-oriented agentic-AI/AIOps cases, while KB Kookmin Bank presented a production-scale KBaaS API-infrastructure modernization case. Together, the sessions showed how Korean enterprises are applying AI-era cloud architecture to operations, development, and embedded finance. Here's a summary of how the event was organized, what the enterprise sessions showed, which AWS tools were most common, and the key patterns you can apply to your own plans.

The shift the day was framed around

AWS Korea CEO Ham Ki-ho opened the event by outlining three stages: Generative AI, Agentic AI, and Physical AI. He described Agentic AI as the active phase, where large language models act as reasoning engines that plan, use tools, and take action, instead of just answering prompts. Physical AI was presented as the next step, with a new Physical AI Frontier Program for Korean robotics, AI-chip, and manufacturing companies. This program will support everything from data collection to edge inference and help with global expansion.

AWS CFO Jon Felton announced a cumulative ₩12.6 trillion investment plan in Korea by 2031, described publicly as the largest-ever investment in Korea by a global cloud provider, with an estimated ₩15.06 trillion contribution to Korea's GDP between 2023 and 2027. The Industry Day talks that followed showed how this commitment will play out in practice.

The keynote also highlighted the main agentic tools AWS featured at the event, with Vice President Jason Bennett explaining each one:

Kiro: a new agentic IDE for software development, described on stage as a peer to Claude Code and Copilot-style coding assistants
Amazon Quick Suite: an agentic AI workspace for finding insights, conducting research, automating tasks, visualizing data, and taking actions across apps
AWS Transform and AWS Transform Custom: agentic modernization tools for mainframe, VMware, Windows, and legacy-code modernization, including version upgrades, runtime migrations, language translations, and architecture changes

What the enterprise talks actually showed

Across the agentic-AI sessions, a pattern emerged, but it was not universal. Samsung and Yanolja both used a supervisor or core-agent structure that delegates work to specialist agents. Yogiyo showed a Bedrock AgentCore-based AIOps workflow connected to operational data and tool functions. KBaaS was different: an EKS-based API platform and gateway modernization for embedded finance.

The numbers made it clear these were real deployments, not just experiments. Samsung Electronics used Samsung Account, which supports 2.1 billion users, over 50 services, more than 2.7 million requests per second, and 200,000 transactions per second across four regions on EKS with over 70 namespaces. Samsung set Day-1 AIOps goals including 90%+ MTTR reduction, 99% incident detection within 10 minutes, and reducing human-in-the-loop operational work below 20%, as part of a roadmap toward 'Toil 0% / Human First.' KB Kookmin Bank's KBaaS platform now handles about 1,800 APIs and 200 million calls per day after switching from a third-party API gateway to their own on AWS EKS. Yogiyo showed how they moved from checking multiple consoles and logs during incidents to using an AgentCore-based AIOps loop, which brings metric observation, correlation analysis, change-history comparison, and RCA grounding into one workflow. Yanolja shared the most technical details, describing how their domain-specialist agents for DevOps and SRE are managed by a Core Agent on Bedrock AgentCore Runtime, using Strands Agent as the SDK and end-to-end authentication.

Samsung, Yogiyo, and Yanolja all centered their agentic-AI cases on AIOps or infrastructure operations rather than customer-facing copilots. KB Kookmin Bank's case showed a different but related production pattern: modernizing a high-volume API platform on EKS for embedded finance.

On the development side, the counterpart is AI-DLC (AI-Driven Development Lifecycle), which is AWS's approach to AI-led development with humans checking at validation and final approval. Samsung saw a 70% reduction in lead time using this method, and LG Electronics' MS division reported double the productivity. The main idea is to separate intent from execution: humans decide what should happen and check the results, while agents handle the steps in between.

The AWS-native stack that recurred

A few key components showed up again and again in the enterprise sessions:

Amazon Bedrock AgentCore appeared across Samsung, Yogiyo, and Yanolja. Yanolja explicitly presented Bedrock AgentCore Runtime with Strands Agent and authentication. Yogiyo showed an AIOps architecture using Amazon Bedrock AgentCore with runtime, memory, gateway, and MCP tool functions. Samsung showed an AgentCore-based AIOps architecture with Supervisor, Domain, and Task agents.
Kiro was used as the agentic IDE, and Samsung included it in their toolset along with Amazon Quick Suite and Bedrock AgentCore.
Amazon EKS was the hosting platform for both Samsung Account and KBaaS, the two cases where the host was clearly specified.

The themes worth taking home

Several patterns stood out during the day. First, the supervisor pattern appeared twice independently. Samsung and Yanolja reaching the same structure without coordination is a strong signal. Second, in three of the four cases (Samsung, Yogiyo, Yanolja), the entry point was AIOps or infrastructure operations rather than customer-facing copilots. KB Kookmin Bank instead modernized its API platform on EKS for embedded finance. Third, humans are intentionally kept in the loop. Datadog's session made this clear, stating that sovereignty and judgment remain with people, while agents handle execution via the Datadog AI Agent Builder. Samsung's AI-DLC approach on the development side followed the same principle. Fourth, infrastructure needs to catch up. GS Neotek's session highlighted that the real GPU issue in agent workloads is not a shortage but poor utilization, excessive idle time, and over-allocation. They suggested using Dynamic Resource Allocation (DRA) on EKS as a next-generation operations model that considers workload needs, sharing policies, and topology.

Kubecost Explained: Kubernetes FinOps That Moves the Bill

Ivan Porta — Tue, 19 May 2026 01:38:37 +0000

Most platform teams are familiar with Kubernetes costs. The monthly cloud bill arrives, finance asks why it’s higher, and engineering can only respond with “more workloads.” This gap between what the bill shows and what platform teams can explain is exactly what FinOps aims to address to help optimize operational costs. The real question is whether you need an enterprise platform for this. For most teams, the answer is no. OpenCost and Kubecost can give platform teams the visibility they need, as long as the tool is paired with an operating cadence.

The pressure is real

Kubernetes accounting is no longer just about a single cluster or a single cloud. Most teams now manage fleets of clusters, often across multiple cloud providers, and sometimes combine on-premises control planes with managed services. Containers move between nodes, nodes move between zones, and the same workload might run in several regions to meet latency or compliance needs, making the cost attribution even harder.

Industry data has shown the same problem for years. Back in 2021, a CNCF FinOps survey found that most teams couldn’t reliably measure their Kubernetes spending, with over-provisioning and lack of accountability as the main issues. The same story happened in 2025, with a fleet-telemetry benchmark using real cluster data from over 2,100 organizations on AWS, GCP, and Azure, showing average CPU utilization at 10% and memory utilization at 23%. These numbers come from production telemetry, not just survey responses. The problem hasn’t changed in four years; if anything, it’s become clearer as fleets have grown.

The real issue isn’t that Kubernetes is too expensive. It’s that most teams don’t have enough visibility into what they’re spending.

What Kubecost actually is

Kubecost is a Kubernetes platform for cost allocation, optimization, and governance. After IBM acquired it in 2024, Kubecost now offers both open-source and enterprise versions. At its core, it sits on top of OpenCost, and runs as an in-cluster agent stack with several microservices: data collectors, a cloud-cost ingestor, a forecasting service, a per-node network-costs DaemonSet, and a fast ClickHouse-backed aggregator. By combining Kubernetes telemetry with cloud-provider billing data, Kubecost provides detailed cost allocation views, and accuracy improves when cloud billing integrations are configured and reconciled.

The platform is built around three primary pillars:

Cost allocation rolls spend up by namespace, label, service, workload, or Collection. Kubecost v3 adds a grouping concept that bundles Kubernetes and external cloud-side costs into a single deduplicated unit. Costs are tracked across CPU, memory, persistent volumes, GPUs, and network traffic.
Optimization recommendations suggest right-sized requests based on real usage, propose cheaper node types, and let teams configure quantile-based controls instead of accepting a one-size default. Recommendations can be archived for historical reference and exported as CSV or PDF.
Alerts and governance ship as configurable budget actions, scheduled reports, and Slack/email notifications. Alerts live next to the budget they belong to rather than as a separate alerting subsystem.

How a Kubecost install flows

Cost data collection begins inside your clusters. When it starts, the FinOps agent sets up a watch on the Kubernetes API for the pricing ConfigMap, so any custom pricing rules can be applied without a restart. It also resolves node pricing per node (falling back to a computed value when the cloud provider's price is unavailable). It then collects metrics from each Network Costs DaemonSet pod, creates a new binary snapshot, and, if configured, writes it to external storage such as Azure Blob Storage or AWS S3.

The Network Costs DaemonSet subscribes to the kernel's conntrack table via a netlink socket, parsing each flow per-direction byte and packet counters, and maintains an in-memory map of pod, node, service, and endpoint state via Kubernetes API watches. It uses this map to link observed connections to specific workloads.

While these agents send internal snapshots to shared storage, the Cloud Cost Ingestor manages external financial data. It runs on a schedule, connects to cloud provider billing exports, pulls daily CSVs, and backfills historical data. Because cloud providers release billing data with a several-hour delay, Kubecost reconciles cluster data with cloud billing after a short wait. This means the most recent day or two of cost data is only an estimate, while older data is fully reconciled (assuming a working cloud-billing integration).

The Aggregator is the main engine and uses an embedded ClickHouse database. In a multi-cluster deployment, it fans in snapshots from several agent clusters. It checks multiple ConfigMaps for configuration and falls back to defaults when none are present. It ingests agent snapshots and, when configured, external billing CSVs, then drives them through a multi-stage SQL pipeline that reconciles and de-duplicates overlapping costs and produces the final cost tables that other microservices consume. The Aggregator also manages data retention by setting per-table, per-resolution TTLs in ClickHouse, so fine-grained windows expire within days while rollups are kept for weeks or months.

Finally, the Forecasting Service serves as a predictive cost-monitoring tool, using this data to generate cost forecasts. At the same time, the Cluster Controller uses the Aggregator’s optimization insights to take actions, such as applying right-sizing recommendations directly in the cluster.

Allocate to a real owner first

Kubecost spreads node costs across the pods running on each node, typically weighted by resource requests, and rolls the result up by namespace, label, service, workload, or Collection, covering CPU, memory, PV, GPU, and network.

What makes this approach effective is good organization, not just technical setup. Each namespace should match a real owner, like a team, product, or department. When this mapping is in place, the allocation view shows which team is responsible for each cost, without needing a spreadsheet. Teams that already use labels like team, cost-center, or product can use these for the same purpose, and Collections help make label-based views easy to use.

Labels can change over time, namespaces can increase, and someone will eventually deploy into default. Reviewing the unlabeled bucket each week helps keep the data accurate and useful.

Right-size requests against actual usage

This is the lever that moves the bill the most. Often, developers set CPU and memory requests defensively, never revisit them, and the gap between request and use shows up directly on the invoice. The 10%-CPU / 23%-memory benchmark cited above is a useful authority anchor when finance asks for a number.

A pattern that reliably finds savings: plot requested vs. actual CPU and memory per workload over a few weeks, then walk each workload's request down to what it actually uses. One practical case from a service mesh deployment had proxy sidecars set to 100 millicores each. The node could host roughly 200 pods on paper, but the scheduler exhausted allocatable CPU at around 90 pods because every pod carried a 100 millicores sidecar request on top of its own. After the request-rightsizing pass, pod density per node tripled with no application change and no node fleet change.

Kubecost 3.0 makes this loop tighter. Container Request Sizing Insights show usage visualizations directly in the UI, recommendations can be archived, CSV/PDF exports include labels, and quantile-based controls let you set tighter recommendation percentiles for predictable services and looser ones for bursty workloads. The enterprise tier adds an Automated Container Request Sizing UI that operates across clusters with custom profiles, suspension controls, audit history, and a comparison between recommended and realized savings; the open-source tier gets a free allowance up to 250 cores on EKS primary clusters.

Using a percentile-based recommendation policy is usually the most effective approach in production.

Set the CPU request to the 90th percentile of actual CPU usage from the past week or month, and then add a safety margin. The Kubernetes VPA default is about 15% for CPU. Because CPU is time-shared, the kernel lets bursty workloads use extra capacity when it is available. Adding more padding for rare spikes usually just increases requests without much benefit.
Set the memory request to a high percentile of peak usage, and then add a safety margin. Memory is not time-shared, so going over the limit can cause OOM kills instead of graceful degradation. Aim for about the 90th percentile of peaks, then add a margin. The VPA default is about 20% for memory.

These settings decide the QoS class. A pod is considered Guaranteed only if every container has CPU and memory requests set equal to their limits. If this is not the case for any container, the whole pod is treated as Burstable, or as BestEffort if no container sets any requests or limits. This setup works well for most workloads. Reserve the fully Guaranteed class for critical workloads with strict latency SLAs, where CPU and memory requests match their limits. In those cases, you might waste some headroom, but you get the best eviction protection and, if needed, exclusive CPU pinning.

If you want an additional feedback before automating any of this, you can run VPA in recommendation-only mode for several weeks, or use KRR open-source. Comparing recommendations across KRR, VPA-recommendation-mode, and Kubecost is more reliable than trusting any single tool's number.

Capacity-versus-request as the North Star metric

The ratio that tells you most about cluster efficiency is total pod requests ÷ total node allocatable capacity, across CPU and memory: how much of what you pay for is even claimed by a pod. It is also what Kubecost's request right-sizing is built on. Kubecost ships with several targets against which recommendations are computed: Production 0.65, Development 0.80, High Availability 0.50 (Cluster Right-Sizing API). Below it you carry capacity nothing asks for; above it you've spent the headroom that cluster class should keep. Kubecost also picks the utilization it sizes against by context; development the trending 85th-percentile, production the 98th, HA the 99.9th; and only on a one-day window; longer windows use maximum usage. The often-quoted "85th percentile" is just the development one-day default, not a universal setting.

The ratio is what you watch; an autoscaler (Karpenter, Cluster Autoscaler) moves it — but only if requests are honest, which is why Kubecost's request right-sizing sits upstream of any autoscaling story. The autoscaler reacts to requests; Kubecost tells you whether they reflect reality.

Recent days are directional by design: reconciliation needs a full day of billing data, so for a roughly 48-hour window costs stay at public on-demand pricing unless a node is provably not on-demand; Spot is accurate sooner only through a separately configured AWS Spot data feed (Cloud Billing Integrations). Read efficiency — independent of reconciled pricing — separately from cost.
Node Group Sizing — formerly Cluster Right-Sizing, rebuilt in v3.0 — turns this into an action: it analyzes in-cluster CPU, RAM, and GPU utilization against node capacity over a configurable window and recommends, per node group, changing the node count or switching the instance type. It runs from a preset profile or a custom metric — usage.max/p95/p85/avg or request.max/avg — with a target-utilization threshold per resource, never below average requested resources. It detects node groups by each provider's standard label, so it works across EKS, AKS, and GKE without setup (v3.x docs).

Find the always-on workloads that don't need to be

Once you’ve handled allocation and right-sizing, look at workloads that run all day, every day, even when they don’t have to. In one platform team’s review, 31% of workloads used less than 25% CPU for almost the entire day, yet Kubernetes costs still went up by about 18% over the year. This happened because engineers spent a lot of time tuning capacity and dealing with alerts, and because each team set up its own autoscaling rules differently.

The triage falls into three buckets. Production services that are genuinely over-spec’d belong in the right-sizing loop above. Non-production environments — dev, integration, demo — rarely need to run on weekends or overnight; a scheduled scale-to-zero is the highest-ROI change in this category. Batch and stateless workloads with retry tolerance are candidates for Spot instances, which trade roughly a 90% discount for a two-minute interruption notice.

Kubecost helps you find underused workloads. With Kubecost 3.0’s Advanced Filters, you can quickly sort workloads by namespace, label, or service using AND/OR conditions right in the UI, instead of having to do it elsewhere.

Track commitment coverage and utilization separately

Reserved capacity and savings commitments are common sources of unnecessary cloud costs. Teams often either ignore them and pay full on-demand prices, or buy them and forget to check if they are being used, leaving discounts unused on resources that are no longer needed. There are two important metrics to watch, and they are easy to mix up:

Coverage means the portion of your regular usage that is protected by a commitment.
Utilization is how much of your commitment you actually use.

Each cloud provider has different tools, but they all fit into three main types, and the calculations work the same way everywhere:

Mechanism	Typical max discount vs on-demand	Commitment
Flexible spend commitment	~60–66%	1 or 3 yr; hourly $ commitment; applies across families/regions
Instance-specific reservation	~55–72%	1 or 3 yr; locked to a region + instance family/SKU
Spot / preemptible	up to ~90%	none; interruption notice from ~30 sec to ~2 min

A good approach is to aim for commitment utilization between 80% and 95%, instead of trying to reach 100%. Going for 100% leaves no room for normal changes, like removing unused instances, changing instance types, or handling a drop in traffic. It may look efficient in a quarterly review, but it can cause problems day-to-day. For coverage, aiming for 60% to 75% is reasonable. This range is high enough to get a good discount, but low enough to allow for changes each quarter. These ranges are based on practical experience, not rules set by the cloud provider.

With Kubecost, costs are first estimated using public on-demand cloud provider prices until the actual cloud bill is ready. When the bill becomes available, usually within about 48 hours, Kubecost updates its estimates with the real costs. This update includes Reserved Instances, Savings Plans, committed-use discounts, and Spot pricing, along with any special rates you might have, such as Enterprise Discount Programs.

When to use a commercial FinOps platform instead

Most teams should start with the open-source chart. You can look at the commercial tiers later, once your needs grow.

Capability / Feature	Open-source Kubecost	Commercial FinOps platform
Cost allocation (namespace, label, workload)	✓	✓
Optimization recommendations (right-sizing)	✓ (manual application)	✓ + automated application across clusters
Cloud-billing reconciliation	✓ (basic)	✓ + EDP / RI / custom-discount aware
Multi-cluster aggregation	Manual / federation	✓ (built-in)
SSO, RBAC, audit log	Limited	✓
History retention	Limited by your storage layer	Long-term, vendor-managed
Collections (cloud + K8s dedup)	✓ (3.x)	✓
Automated Container Request Sizing UI	✕ (free tier limited)	✓
Quantile-based recommendation controls	✓ (3.x)	✓
Advanced filters (AND/OR)	✓ (3.x)	✓
Support / SLA	Community	Vendor SLA

If you just need per-namespace allocation, basic recommendations, and Slack alerts for a few clusters, the open-source version is enough. But if you manage many clusters across different clouds, need automated fixes, want vendor-managed history, or need to give your finance team detailed, reconciled discount numbers, then a commercial platform is worth it. The decision should be based on these needs, not just on how the dashboard looks.

Operational reality

ClickHouse and a unified agent replace the old stack. In v3, the 2.x DuckDB store is replaced with a ClickHouse database. This change makes allocation and cloud-cost API queries much faster and more reliable at scale. It also removes the need for Prometheus, which cuts down on memory use and makes deployment easier, while still providing OpenCost-standard metrics.
History is a deliberate choice, not a default. Whatever the storage backend, the retention window is the upper bound on the period-over-period reporting you can produce. Monthly reporting requires at least 30 days; year-over-year requires a year. Tier cold data to object storage if on-cluster retention gets expensive faster than the engineering time to set up the tiered pipeline.
Reconciliation lag is structural. The 24–48 hour billing-reconciliation delay is a property of cloud-provider billing exports, not of Kubecost. Build the operating model around it: argue about last week, not yesterday.
Multi-cluster needs a story. Open-source Kubecost can federate across clusters, but the experience is rougher than the commercial multi-cluster aggregator. Beyond five or six clusters, decide early whether to run per-cluster Kubecost and aggregate externally — into your own warehouse, for example — or pay for the commercial multi-cluster path. Either is defensible; drifting between the two is not.
The EKS add-on offers a quick way to get started. The Kubecost v3 free tier has a $100k USD spend limit over 30 days, while the Amazon EKS optimized Kubecost bundle is listed by AWS as exempt from that spend limit.
The operating model is the deliverable. If a team installs Kubecost but does not set up regular reviews, they will drift just like a team without any FinOps tools. The standard approach is to have a small FinOps group, such as a platform engineer, a finance analyst, and an SRE on rotation, meet each week to review the capacity-versus-request ratio, identify the most over-provisioned workloads, and check any namespace with a significant change in monthly cost. For smaller teams, a 30-minute review every two weeks with the platform engineer and CTO can achieve the same results.

A practical recommendation

If you are considering a FinOps approach for a Kubernetes platform and do not have a contractual obligation to choose a commercial product, start by piloting open-source Kubecost 3.x. Installation can be completed in an afternoon. Assign at least one namespace to a designated owner, provide a request-versus-usage dashboard to one team for two weeks, and share the capacity-versus-request ratio in a channel visible to the platform team. If regular reviews of these metrics become routine, you have achieved FinOps. If not, adopting a commercial platform will not resolve the underlying issues.

What Platform Teams Can Expect From Crossplane v2.2

Ivan Porta — Tue, 05 May 2026 05:21:20 +0000

A developer submits a ticket to request a database. Three days later, the platform team responds, but the configuration isn’t quite right. The developer files another ticket and waits again. By the third week, the database might finally be ready. This cycle repeats across every team and environment, creating a daily reality that most platform teams recognize: developers lose significant time waiting for infrastructure, while platform teams struggle to keep up with the constant flow of requests.

There are plenty of good tools for provisioning. Terraform is platform-agnostic and widely used. CloudFormation is the go-to option on AWS, and every cloud provider offers its own console and CLI. Each tool does its job well. This article isn’t about choosing the best one. Instead, it looks at the problem from a different perspective.

Crossplane offers a Kubernetes-native approach to managing cloud resources. Instead of setting up infrastructure outside the cluster and then linking it back, Crossplane brings that infrastructure under the same control loop as your applications. This fits naturally with how many teams already work. Paired with GitOps, a pull request becomes the primary way to manage changes, and the cluster continuously reconciles toward the desired state.

The project has moved quickly over the past year. In August 2025, Crossplane v2 introduced big changes, like removing Claims and adding namespaced composite and managed resources. The latest release, v2.2, adds an alpha Pipeline Inspector for troubleshooting, broader CEL validation, and more improvements.

What Crossplane actually is

Crossplane is a control plane framework for platform engineering. You install it into a Kubernetes cluster, known as the management cluster, and that cluster becomes the control plane for everything outside it: cloud accounts, SaaS APIs, internal tools, and even other Kubernetes clusters. All of these are managed through the same Kubernetes API your applications already use. The management cluster itself must be set up separately; Crossplane does not create it for you. Once Crossplane is running, there is no state file and no separate workflow. Drift is fixed by the same reconciliation loop that keeps your Deployments healthy.

Crossplane has four major components. You can use all four or only the ones you need.

Managed resources (MRs) map directly to external cloud resources in Kubernetes. For example, an S3 from AWS or a ResourceGroup from Azure is considered an MR. Crossplane uses spec.forProvider as the main reference and keeps the actual cloud resource in sync with it. You create MRs using kubectl, and the provider handles provisioning and reconciliation.
Composition lets you create custom APIs using a function pipeline. There are three main parts to understand:
- A CompositeResourceDefinition (XRD) defines a schema. It tells Kubernetes, “here’s a new custom API kind I’m creating, and these are its fields.” You can think of it as a CRD with added features for Crossplane.
- A Composition acts as a recipe. It says, “when someone creates an XR of kind Foo, run this set of functions to create these MRs or other Kubernetes resources.” In version 2, this always uses a function pipeline.
- A Composite Resource (XR) is an instance of the API you defined with an XRD. When a user creates an XR, Crossplane uses the matching Composition’s pipeline to generate the needed resources. You can write functions in YAML, KCL, Python, or Go.
Operations run function pipelines to completion, similar to a Kubernetes Job. There are three modes: Operation (one-time), CronOperation (scheduled), and WatchOperation (event-driven). Operations are currently in alpha.
The package manager handles installing and updating providers, configurations, and functions.

How a Crossplane request flows

There are two entry points into this flow, depending on what you're applying.

When a developer or a pipeline creates an XR in a namespace, the composition engine watches it, runs the configured function pipeline, and creates the needed resources. These resources can be other Kubernetes resources, managed resources, or both.

When a user applies an MR directly, either by itself or as part of a Composition, the provider takes over. It monitors the MR through the Kubernetes API, calls the external system to create or update the real resource, and updates the status. After that, it keeps checking: if the real resource changes from spec.forProvider, the provider fixes it. All state is stored in etcd, so there is no separate state file.

When to use traditional IaC instead

Crossplane and tools like Terraform or CloudFormation overlap in scope (both can provision a cloud database) and differ in how. The right choice depends on where your platform already lives.

Capability / Feature	Terraform	CloudFormation	Crossplane
Control loop	Manual `apply` (or pipeline)	Manual stack create/update	Continuous reconciliation
Drift handling	Detect with `plan`; correct manually	Detect drift action; correct via stack update	Detected and corrected automatically
State	`tfstate` in a remote backend you secure (e.g., S3 with versioning, HCP Terraform)	AWS-managed (server-side)	Kubernetes API objects in the management cluster's etcd
Workflow	Separate from app deployment	Separate from app deployment	Same as `kubectl apply`
Composition	Modules	Nested stacks, Modules	XRDs + Compositions + functions
Languages	HCL, JSON	YAML, JSON	YAML, Go, Python, KCL, CUE, HCL (via composition functions)
Built-in policy	Variable validation and pre/postconditions (OSS); Sentinel and OPA integration in HCP Terraform / Enterprise	cfn-guard, Hooks	XRD CEL validations (incl. metadata in v2.2)
Multi-cloud	Provider per cloud, separate state	AWS-first (third-party types via the CloudFormation registry)	One control plane, one API surface
Footprint	A binary	AWS-managed service (CLI/SDK only)	A Kubernetes control plane (Crossplane core, providers, functions) backed by etcd
Operates outside Kubernetes	✓	✓	✕ — requires a management cluster

If your team does not use Kubernetes, Crossplane is not the best place to start. Terraform is simpler and does not need a control plane. But if you are on Kubernetes, especially if you already use Argo CD or Flux, it is easy to manage your infrastructure in the same way. Crossplane is the closest option for writing infrastructure as code and handling it like the rest of your declarative cluster state.

What new with v2.2

v2.2 adds five things you'll notice in practice and one that quietly improves reliability. Each one closes a specific gap that platform teams have been hitting in production.

Pipeline inspector (alpha): Composition functions are powerful, but they have always been hard to debug. If a pipeline acts up on a running control plane, you could only see what each function got and returned by writing tests, running crossplane render locally, or adding your own instrumentation. v2.2 adds the pipeline inspector. When you turn on the feature flag, the Crossplane controller intercepts every RunFunctionRequest and RunFunctionResponse and forwards them over gRPC to a Unix socket you set up. A sidecar read from this socket and handle the data however you need: stream it to stdout during development or send it to an audit pipeline in production. To use it, add --enable-pipeline-inspector to Crossplane. The default socket path is /var/run/pipeline-inspector/socket, but you can change it with --pipeline-inspector-socket.

  # Enable the pipeline inspector feature flag
  args:
    - --enable-pipeline-inspector
    - --pipeline-inspector-socket=/var/run/pipeline-inspector/socket

  # Inject the pipeline inspector sidecar
  sidecarsCrossplane:
    - name: pipeline-inspector
      image: xpkg.crossplane.io/crossplane/inspector-sidecar:v0.0.3
      args:
        - --socket-path=/var/run/pipeline-inspector/socket
        - --max-recv-msg-size=8388608  # 8MB
      volumeMounts:
        - name: pipeline-inspector-socket
          mountPath: /var/run/pipeline-inspector
      resources:
        requests: { cpu: 10m, memory: 64Mi }
        limits:   { cpu: 100m, memory: 128Mi }

  # Add the shared volume for Unix socket communication
  extraVolumesCrossplane:
    - name: pipeline-inspector-socket
      emptyDir: {}

  extraVolumeMountsCrossplane:
    - name: pipeline-inspector-socket
      mountPath: /var/run/pipeline-inspector

XRD validation outside spec: XRD validation outside x-kubernetes-validations, (which are Kubernetes' CEL-based validation rules) used to only work on fields under an XR's spec. If you wanted to enforce rules like "all Database names must start with db-", you had to use an external admission controller such as Kyverno, OPA/Gatekeeper, or a custom webhook. With v2.2, that restriction is gone. Now, you can write CEL rules outside of spec, and the API server enforces them at admission time.

  apiVersion: apiextensions.crossplane.io/v1
  kind: CompositeResourceDefinition
  metadata:
    name: databases.platform.example.org
  spec:
    group: platform.example.org
    names:
      kind: Database
      plural: databases
    versions:
      - name: v1alpha1
        served: true
        referenceable: true
        schema:
          openAPIV3Schema:
            type: object
            x-kubernetes-validations:
              - rule: "self.metadata.name.startsWith('db-')"
                message: "Database names must start with 'db-'"
            properties:
              spec:
                type: object
                properties:
                  region:
                    type: string

ImageConfig runtime for dependencies: A Crossplane package, including Providers, runs as a Deployment. To customize the Deployment, such as by adding service account annotations, pod labels, or container arguments, use a DeploymentRuntimeConfig and reference it from the package.

kind: Provider
spec:
  package: xpkg.crossplane.io/crossplane-contrib/provider-azure-network:v1.0.0
  runtimeConfigRef:
    name: azure-workload-identity

This approach works well when you install the package directly. However, Crossplane can also install packages as dependencies. In that case, you could not get Workload Identity or any other runtime customization onto providers installed as dependencies.

ImageConfig is a cluster-scoped resource that matches packages based on their image prefix, not on which Provider or Configuration object created them. In v2.2, a new field was added: spec.runtime.configRef. With this change, Crossplane applies the DeploymentRuntimeConfig to any package whose image is matched, regardless of how it was installed.

  apiVersion: pkg.crossplane.io/v1beta1
  kind: ImageConfig
  metadata:
    name: azure-workload-identity
  spec:
    matchImages:
      - prefix: xpkg.crossplane.io/crossplane-contrib/provider-azure-
      - prefix: xpkg.crossplane.io/crossplane-contrib/provider-family-azure
    runtime:
      configRef:
        name: azure-workload-identity

Every Azure family provider, whether installed directly or added as a dependency, receives the runtime config.

RequiredSchemas for functions: Composition and functions sometimes need the OpenAPI schema of a resource to validate inputs, make schema-aware decisions, or generate resources dynamically. Before v2.2 you could ask Crossplane for the corresponding CRD as a RequiredResource, parse it, and extract the schema yourself, but only for custom resources, since built-in kinds like Deployment don't have a CRD. v2.2 introduces RequiredSchemas on the RunFunctionResponse thich returns the schema for any kind, built-in or custom.
crossplane beta trace improvements: You can now pass a kind (and optionally a namespace) instead of a single resource and get the dependency tree for every instance. And --watch (alias -w) keeps the output live, the way kubectl get -w does.
Function packages no longer install bundled CRDs: CRDs included in a function package are not applied to the cluster anymore. Also, packages with unknown or disallowed kinds now install successfully and simply skip those objects. Previously, the install would fail in these cases.
The package cache layout has changed: Cache filenames now come from the package’s OCI source and digest instead of the PackageRevision’s Kubernetes name. This change affects side-loading used by some provider e2e suites.

Operational reality

The management cluster is your state. Crossplane does not use an external state file. All XRDs, Compositions, XRs, and managed resources are stored in the management cluster’s etcd. If you lose that cluster without backups, your cloud resources keep running, but Crossplane loses track of them and stops reconciling. Silent drift can build up. Treat the management cluster like any production-critical Kubernetes cluster: use a highly available control plane, back up etcd, and avoid running it on your laptop. Local k3s or kind clusters are fine for learning, demos, or the Get Started guide, but not for important state. This is the trade-off for not having a Terraform state file: you solve one operational problem but gain another that is easier to overlook.
Upgrade through v2.1, not directly. Crossplane performs CRD migrations with each minor version upgrade, so skipping versions can cause you to miss important migrations. If you are on v1.x, use the Crossplane v2 upgrade guide. If you are on v2.1, upgrade directly to v2.2.
v1.20 is not yet end-of-life. v1.20 is still supported and has not reached end-of-life yet. However, since you are on a maintenance-only branch, it’s a good time to start planning your upgrade to v2.x.
Pipeline inspector is alpha. The flag is off by default and the contract may still change. Sidecar image versioning is also not stable yet. Try it in development, since function pipelines are much easier to understand when you can see them, but do not add it to your incident-response runbook yet.
Namespaced MRs are not yet universal. AWS managed resources are fully namespaced. The Upbound Azure provider, which is widely used, and GCP are currently rolling out this feature.
v2 removed several things. Native patch-and-transform composition, the ControllerConfig type, external secret stores, composite resource connection details, and the default registry for packages are no longer available. Most users can upgrade without breaking changes, but if you use these features, you will need to do some cleanup. Before upgrading, run kubectl get pkg and make sure every package uses a fully qualified image, such as registry.example.com/repo/package:tag.

A practical recommendation

If you are considering a control plane for your Kubernetes platform and do not have a strong reason to stick with Terraform, try Crossplane v2.2 first. The Get Started guide can be completed in an afternoon on any Kubernetes cluster. If Crossplane meets your needs, you can manage both application and infrastructure workflows with one declarative model. If not, you will have a clear, documented reason to keep your current tools.

If you are already using Crossplane v2.1, upgrade to v2.2. Features like server-side apply on the MRD controller, dependency-aware runtime config, schema access for functions, and better trace output are valuable even if you do not use the pipeline inspector. If you are still on v1.x, pin to v1.20, migrate any deprecated features, then upgrade to v2.x and continue from there. v2 offers good backward compatibility, but the deprecations are real.

Deep Dive Into Linkerd Automated Sidecar Injection Workflow

Ivan Porta — Sat, 21 Jun 2025 23:28:24 +0000

The Linkerd Proxy-Injector uses a mutating webhook to intercept requests to the Kubernetes API whenever a new Pod is created. If the namespace or Pod is annotated with linkerd.io/inject: enabled, the webhook automatically injects the Linkerd proxy and ProxyInit containers into the Pod spec. In this article, we will take a guided dive into its source code by walking through a sample application.

Prerequisites

macOS/Linux/Windows with a Unix‑style shell
k3d (v5+) for local Kubernetes clusters
kubectl (v1.25+)
Helm (v3+)
Smallstep (step) CLI for certificate generation

Setup

First, we need to spin up a new cluster with k3d. Create the following configuration file:

cat << 'EOF' > cluster.yaml
apiVersion: k3d.io/v1alpha5
kind: Simple
metadata:
  name: "cluster"
servers: 1
agents: 0
image: rancher/k3s:v1.33.0-k3s1
network: playground
options:
  k3s:
    extraArgs:
      - arg: --disable=traefik
        nodeFilters: ["server:*"]
      - arg: --cluster-cidr=10.23.0.0/16
        nodeFilters: ["server:*"]
      - arg: --service-cidr=10.247.0.0/16
        nodeFilters: ["server:*"]
      - arg: --debug
        nodeFilters: ["server:*"]
ports:
  - port: 8081:80
    nodeFilters: ["loadbalancer"]
EOF

We will then use this file to create the cluster:

k3d cluster create --kubeconfig-update-default -c ./cluster.yaml

Now that the cluster is running, we can install Linkerd. First, Linkerd requires a root trust anchor and an intermediate issuer certificate for mTLS identity.

step certificate create root.linkerd.cluster.local ./certificates/ca.crt ./certificates/ca.key \
    --profile root-ca \
    --no-password \
    --insecure
step certificate create identity.linkerd.cluster.local ./certificates/issuer.crt ./certificates/issuer.key \
    --profile intermediate-ca \
    --not-after 8760h \
    --no-password \
    --insecure \
    --ca ./certificates/ca.crt \
    --ca-key ./certificates/ca.key

Finally, install Linkerd with Helm:

helm repo add linkerd-edge https://helm.linkerd.io/edge
helm repo update
helm install linkerd-crds linkerd-edge/linkerd-crds \
  -n linkerd \
  --create-namespace \
  --set installGatewayAPI=true
helm upgrade --install linkerd-control-plane \
  -n linkerd \
  --set-file identityTrustAnchorsPEM=./certificates/ca.crt \
  --set-file identity.issuer.tls.crtPEM=./certificates/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=./certificates/issuer.key \
  --set controllerLogLevel=debug \
  --set policyController.logLevel=debug \
  linkerd-edge/linkerd-control-plane

The Injection process

First, let’s deploy the following sample application:

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: simple-app
  annotations:
    linkerd.io/inject: enabled
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: server
      version: v1
  template:
    metadata:
      labels:
        app: server
        version: v1
    spec:
      containers:
        - name: http-app
          image: kong/httpbin:latest
          ports:
            - containerPort: 80
EOF

Because the simple-app namespace is annotated with linkerd.io/inject: enabled, Linkerd’s proxy-injector webhook automatically injects a sidecar into the simple-app-v1 Deployment’s Pods. At this point, Kubernetes sends a CREATE request for the Deployment’s ReplicaSet and then for each Pod.

Interactions with the Kuberentes API

Before creating the actual Pod, Kubernetes processes the resource and invokes any matching mutating webhooks in alphabetical order. In Linkerd’s case, this is the *linkerd-proxy-injector-webhook *(defined in a MutatingWebhookConfiguration).

kubectl get mutatingwebhookconfiguration linkerd-proxy-injector-webhook-config -o yaml
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: linkerd-proxy-injector-webhook-config
webhooks:
- admissionReviewVersions:
  - v1
  - v1beta1
  clientConfig:
    caBundle: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSURVakNDQWpxZ0F3SUJBZ0lRSzVya2tEMHVmQVNyRTRPeEV0Q2JTekFOQmdrcWhraUc5dzBCQVFzRkFEQXQKTVNzd0tRWURWUVFERXlKc2FXNXJaWEprTFhCeWIzaDVMV2x1YW1WamRHOXlMbXhwYm10bGNtUXVjM1pqTUI0WApEVEkxTURZd05ERXhNVGcxTVZvWERUSTJNRFl3TkRFeE1UZzFNVm93TFRFck1Da0dBMVVFQXhNaWJHbHVhMlZ5ClpDMXdjbTk0ZVMxcGJtcGxZM1J2Y2k1c2FXNXJaWEprTG5OMll6Q0NBU0l3RFFZSktvWklodmNOQVFFQkJRQUQKZ2dFUEFEQ0NBUW9DZ2dFQkFNOGM3ZXNMNXhNakxFMzlXYUMwZVpJOThtTVhSK24zTUdvWHJJSXc0S3NCeUw1QwpuWHp3Um9ISTV1WnVMR1ZMY0N6L1h0YWozWWp3T0RhL2pLODVKRHZ4ajF2MTFMV3J2NWN5b1ladTBJRm8ybkVLCnpIY21TdVJZSjJwSHFFOHhZQXRmcnh0SktDdldWK3FZTTFLTTI2V1lVT2kzSU9DVGNoV0d4MS9vSENCclFiUnAKalRpSUEvY2d3QU55dXpqQUV3a1ZCRWl4UE92YnduVHl4YmhDZVFBTGZCV2JiM3Z6MGJwTUVKOUxpNkoxVms2egpWOW9ycFA2UW0yam1iNHJ3SElWVGRTN1dXOXU5YWY5SEFGdlozeFdldHhYRXkzRzNvSEl0REFiQ3YyemhaZDNWCkVlYmZHdGR3RDFTQmNqbnlHbTllc1IzSlMySU4vejRKWC9KWmoyVUNBd0VBQWFOdU1Hd3dEZ1lEVlIwUEFRSC8KQkFRREFnV2dNQjBHQTFVZEpRUVdNQlFHQ0NzR0FRVUZCd01CQmdnckJnRUZCUWNEQWpBTUJnTlZIUk1CQWY4RQpBakFBTUMwR0ExVWRFUVFtTUNTQ0lteHBibXRsY21RdGNISnZlSGt0YVc1cVpXTjBiM0l1YkdsdWEyVnlaQzV6CmRtTXdEUVlKS29aSWh2Y05BUUVMQlFBRGdnRUJBR1pxRlFFY0g0THBqK0l5K1dwVFY1VTFuOHFqRGNOMFcyS3AKMHg0T25RaHp3NkZNUm8rR2NBdUR0Nk5kNkROVlZHZjNEdFBtcXhBM21wTUxDTDFSbytnUm9FSWg5N3pxdlZjSQpNalBmeXpkNGhRQ09ocmhyblJFazh2OEN6Rm5YREtPYmkyaUx1THVTNlJtc3I0alpPV2FrdWRKTzlqaUREUmJVCnlvcHhpWTgycW81VmNoT1IvaGg4K1o3S1FKL29lT29BMlp0Zk9QbmZ0VGYvenBwekJPQmtXRUxvYlRRVHRUbUoKbVdNaS9URUQ5QlE4U0NMUU5TUk1SRXpuaElFTGhja0lPVzBqMkNkYmJWWXdkZ2wrTSt1aVYweHdlTk9pQ2RxZApyZm9Yamp6TTh6MWk1Y0FzMS9IcGcvaEt2czlpM2hKNHQwd0NrZ0JQVXcyZzQxN29MU2s9Ci0tLS0tRU5EIENFUlRJRklDQVRFLS0tLS0=
    service:
      name: linkerd-proxy-injector
      namespace: linkerd
      path: /
      port: 443
  failurePolicy: Ignore
  matchPolicy: Equivalent
  name: linkerd-proxy-injector.linkerd.io
  namespaceSelector:
    matchExpressions:
    - key: config.linkerd.io/admission-webhooks
      operator: NotIn
      values:
      - disabled
    - key: kubernetes.io/metadata.name
      operator: NotIn
      values:
      - kube-system
      - cert-manager
  objectSelector:
    matchExpressions:
    - key: linkerd.io/control-plane-component
      operator: DoesNotExist
    - key: linkerd.io/cni-resource
      operator: DoesNotExist
  reinvocationPolicy: Never
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    - services
    scope: Namespaced
  sideEffects: None
  timeoutSeconds: 10

Several selectors determine whether the webhook is triggered. For example, the webhook is skipped if:

The Pod is in a namespace with the label config.linkerd.io/admission-webhooks=disabled, kubernetes.io/metadata.name=kube-system, or kubernetes.io/metadata.name=cert-manager.
The Pod has the label linkerd.io/control-plane-component or linkerd.io/cni-resource.

As you can see in the webhooks.clientConfig.service block, the actual webhook logic resides in the linkerd-proxy-injector Service in the linkerd namespace. Next, let’s take a look at the linkerd-proxy-injector Pod, which contains the actual injection container:

kubectl get pods -n linkerd linkerd-proxy-injector-******** -o yaml
apiVersion: v1
kind: Pod
  ...
  name: linkerd-proxy-injector-********
  namespace: linkerd
spec:
  containers:
  ...
  - args:
    - proxy-injector
    - -log-level=debug
    - -log-format=plain
    - -linkerd-namespace=linkerd
    - -enable-pprof=false
    image: ghcr.io/buoyantio/controller:enterprise-2.18.0
    ...
    name: proxy-injector
    ports:
    - containerPort: 8443
      name: proxy-injector
      protocol: TCP
    - containerPort: 9995
      name: admin-http
      protocol: TCP
    volumeMounts:
    - mountPath: /var/run/linkerd/config
      name: config
    - mountPath: /var/run/linkerd/identity/trust-roots
      name: trust-roots
    - mountPath: /var/run/linkerd/tls
      name: tls
      readOnly: true
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access
      readOnly: true
  volumes:
  - configMap:
      defaultMode: 420
      name: linkerd-config
    name: config
  - configMap:
      defaultMode: 420
      name: linkerd-identity-trust-roots
    name: trust-roots
  - name: tls
    secret:
      defaultMode: 420
      secretName: linkerd-proxy-injector-k8s-tls
  - name: kube-api-access
    ...
  - name: linkerd-identity-token
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          audience: identity.l5d.io
          expirationSeconds: 86400
          path: linkerd-identity-token

Some important details worth mentioning are:

The Pod mounts a volume backed by the linkerd-config ConfigMap, which holds the chart‑rendered values.yaml, including defaults for the proxy image, opaque ports, resource limits, and more.
It also mounts a volume with the linkerd-identity-trust-roots ConfigMap, which contains the trust‑anchor certificate.

This information is read at the beginning of execution and will be used later during injection to configure proxy arguments and environment variables.

valuesConfig, err := config.Values(pkgK8s.MountPathValuesConfig)
if err != nil {
  return nil, err
}
caPEM, err := os.ReadFile(pkgK8s.MountPathTrustRootsPEM)
if err != nil {
  return nil, err
}
valuesConfig.IdentityTrustAnchorsPEM = string(caPEM)

Next, the webhook fetches the namespace object to read any namespace‑level Linkerd annotations. This is important because these values will be propagated to the proxy itself if no annotations in the Deployment override them.

ns, err := api.Get(k8s.NS, request.Namespace)
if err != nil {
  return nil, err
}

Then it constructs a ResourceConfig struct that carries:

The chart values merged with any overrides from Pod annotations.
All namespace‑level annotations, so that any Linkerd annotation set at the namespace is automatically inherited by each Pod.
The resource kind, so the code knows where to look for a Pod template if the resource is a Deployment.
An OwnerRetriever function that can look up the parent resource (e.g., Deployment → ReplicaSet → Pod) so that injected events can be attached to the highest‑level owner.

resourceConfig := inject.NewResourceConfig(valuesConfig, inject.OriginWebhook, linkerdNamespace).
  WithOwnerRetriever(ownerRetriever(ctx, api, request.Namespace)).
  WithNsAnnotations(ns.GetAnnotations()).
  WithKind(request.Kind.Kind)
...
func NewResourceConfig(values *l5dcharts.Values, origin Origin, ns string) *ResourceConfig {
 config := &ResourceConfig{
  namespace:     ns,
  nsAnnotations: make(map[string]string),
  values:        values,
  origin:        origin,
 }
 config.workload.Meta = &metav1.ObjectMeta{}
 config.pod.meta = &metav1.ObjectMeta{}
 config.pod.labels = map[string]string{k8s.ControllerNSLabel: ns}
 config.pod.annotations = map[string]string{}
 return config
}
func (conf *ResourceConfig) WithOwnerRetriever(f OwnerRetrieverFunc) *ResourceConfig {
 conf.ownerRetriever = f
 return conf
}
func (conf *ResourceConfig) WithNsAnnotations(m map[string]string) *ResourceConfig {
 conf.nsAnnotations = m
 return conf
}
func (conf *ResourceConfig) WithKind(kind string) *ResourceConfig {
 conf.workload.metaType = metav1.TypeMeta{Kind: kind}
 return conf
}

At this point, the webhook deserializes the raw JSON bytes from the admission request into typed Kubernetes objects.

report, err := resourceConfig.ParseMetaAndYAML(request.Object.Raw)
if err != nil {
  return nil, err
}
log.Infof("received %s", report.ResName())

You will see logs like the following:

time="2025-06-04T11:19:18Z" level=info msg="received service/simple-app-v1"
time="2025-06-04T11:19:18Z" level=info msg="received admission review request \"919c6889-a59c-4168-be0d-6d448460af98\""

If the resource has a parent, the code looks it up so that any Injected or Skipped Kubernetes Event is attached to the parent:

var parent *metav1.PartialObjectMetadata
var ownerKind string
if ownerRef := resourceConfig.GetOwnerRef(); ownerRef != nil {
  res, err := k8s.GetAPIResource(ownerRef.Kind)
  if err != nil {
    log.Tracef("skipping event for parent %s: %s", ownerRef.Kind, err)
  } else {
    objs, err := api.GetByNamespaceFiltered(res, request.Namespace, ownerRef.Name, labels.Everything())
    if err != nil {
      log.Warnf("couldn't retrieve parent object %s-%s-%s; error: %s", request.Namespace, ownerRef.Kind, ownerRef.Name, err)
    } else if len(objs) == 0 {
      log.Warnf("couldn't retrieve parent object %s-%s-%s", request.Namespace, ownerRef.Kind, ownerRef.Name)
    } else {
      parent = objs[0]
    }
    ownerKind = strings.ToLower(ownerRef.Kind)
  }
}

With those pieces in place, the webhook calls report.Injectable() to decide whether injection should proceed. To do so, it checks the following conditions:

HostNetwork mode (iptables will not work if the Pod is on the host network).
Existing sidecar detection that would make injection redundant
Unsupported resource kinds.
An explicit annotation that disables injection.
Whether the Pod automatically mounts its ServiceAccount token (needed for mTLS).

If any of these checks fail, the function returns false along with one or more human‑readable reasons.

injectable, reasons := report.Injectable()
...
func (r *Report) Injectable() (bool, []string) {
    var reasons []string
    if r.HostNetwork {
        reasons = append(reasons, hostNetworkEnabled)
    }
    if r.Sidecar {
        reasons = append(reasons, sidecarExists)
    }
    if r.UnsupportedResource {
        reasons = append(reasons, unsupportedResource)
    }
    if r.InjectDisabled {
        reasons = append(reasons, r.InjectDisabledReason)
    }

    if !r.AutomountServiceAccountToken {
        reasons = append(reasons, disabledAutomountServiceAccountToken)
    }

    if len(reasons) > 0 {
        return false, reasons
    }
    return true, nil
}

If Injectable() returns true, the webhook proceeds with injection and:

Adds the linkerd-init init container (which configures the iptables rules).
Adds the linkerd-proxy sidecar container with all required environment variables, volume mounts, and command‑line flags.
Appends a config.linkerd.io/created-by annotation.
If the Pod does not already have a config.linkerd.io/opaque-ports annotation, splits the comma‑separated default opaque ports from valuesConfig.Proxy.OpaquePorts, filters them against the actual container ports in the Pod spec, and then sets the annotation to just the matching ports.
If a parent was found, emits a Kubernetes Event on the parent resource with reason Injected and message Linkerd sidecar proxy injected.

Finally, it logs the generated patch at INFO level and debug‑prints the full JSON patch.


if injectable {
  resourceConfig.AppendPodAnnotation(pkgK8s.CreatedByAnnotation, fmt.Sprintf("linkerd/proxy-injector %s", version.Version))
  inject.AppendNamespaceAnnotations(resourceConfig.GetOverrideAnnotations(), resourceConfig.GetNsAnnotations(), resourceConfig.GetWorkloadAnnotations())
  if !resourceConfig.HasWorkloadAnnotation(pkgK8s.ProxyOpaquePortsAnnotation) {
    defaultPorts := strings.Split(resourceConfig.GetValues().Proxy.OpaquePorts, ",")
    filteredPorts := resourceConfig.FilterPodOpaquePorts(defaultPorts)
    if len(filteredPorts) != 0 {
      ports := strings.Join(filteredPorts, ",")
      resourceConfig.AppendPodAnnotation(pkgK8s.ProxyOpaquePortsAnnotation, ports)
    }
  }
  patchJSON, err := resourceConfig.GetPodPatch(true)
  if err != nil {
    return nil, err
  }
  if parent != nil {
    recorder.Event(parent, v1.EventTypeNormal, eventTypeInjected, "Linkerd sidecar proxy injected")
  }
  log.Infof("injection patch generated for: %s", report.ResName())
  log.Debugf("injection patch: %s", patchJSON)
  proxyInjectionAdmissionResponses.With(admissionResponseLabels(ownerKind, request.Namespace, "false", "", report.InjectAnnotationAt, configLabels)).Inc()
  patchType := admissionv1beta1.PatchTypeJSONPatch
  return &admissionv1beta1.AdmissionResponse{
    UID:       request.UID,
    Allowed:   true,
    PatchType: &patchType,
    Patch:     patchJSON,
  }, nil
}
...
func (conf *ResourceConfig) GetPodPatch(injectProxy bool) ([]byte, error) {
    namedPorts := make(map[string]int32)
    if conf.HasPodTemplate() {
        namedPorts = util.GetNamedPorts(conf.pod.spec.Containers)
    }
    values, err := GetOverriddenValues(conf.values, conf.getAnnotationOverrides(), namedPorts)
    values.Proxy.PodInboundPorts = getPodInboundPorts(conf.pod.spec)
    if err != nil {
        return nil, fmt.Errorf("could not generate Overridden Values: %w", err)
    }
    if values.ClusterNetworks != "" {
        for _, network := range strings.Split(strings.Trim(values.ClusterNetworks, ","), ",") {
            if _, _, err := net.ParseCIDR(network); err != nil {
                return nil, fmt.Errorf("cannot parse destination get networks: %w", err)
            }
        }
    }
    patch := &podPatch{
        Values:      *values,
        Annotations: map[string]string{},
        Labels:      map[string]string{},
    }
    switch strings.ToLower(conf.workload.metaType.Kind) {
    case k8s.Pod:
    case k8s.CronJob:
        patch.PathPrefix = "/spec/jobTemplate/spec/template"
    default:
        patch.PathPrefix = "/spec/template"
    }
    if conf.pod.spec != nil {
        conf.injectPodAnnotations(patch)
        if injectProxy {
            conf.injectObjectMeta(patch)
            conf.injectPodSpec(patch)
        } else {
            patch.Proxy = nil
            patch.ProxyInit = nil
        }
    }
    rawValues, err := yaml.Marshal(patch)
    if err != nil {
        return nil, err
    }
    files := []*loader.BufferedFile{
        {Name: chartutil.ChartfileName},
        {Name: "requirements.yaml"},
        {Name: "templates/patch.json"},
    }
    chart := &charts.Chart{
        Name:      "patch",
        Dir:       "patch",
        Namespace: conf.namespace,
        RawValues: rawValues,
        Files:     files,
        Fs:        static.Templates,
    }
    buf, err := chart.Render()
    if err != nil {
        return nil, err
    }
    res := rTrail.ReplaceAll(buf.Bytes(), []byte("\n"))
    return res, nil
}

You can see the related events in the parent deployment:

kubectl describe deployment/simple-app-v1 -n simple-app
...
Events:
  Type    Reason             Age                    From                    Message
  ----    ------             ----                   ----                    -------
  Normal  ScalingReplicaSet  2m26s                  deployment-controller   Scaled up replica set simple-app-v1-658b475d7c from 0 to 1
  Normal  Injected           2m25s (x2 over 2m26s)  linkerd-proxy-injector  Linkerd sidecar proxy injected
  Normal  ScalingReplicaSet  2m25s                  deployment-controller   Scaled up replica set simple-app-v1-76fc99b86b from 0 to 1
  Normal  ScalingReplicaSet  2m19s                  deployment-controller   Scaled down replica set simple-app-v1-658b475d7c from 1 to 0

Once the JSON is generated it returns the resulting []byte patch that Kubernetes applies. You can see the raw byte array of the incoming admission request, including the trust‑anchor PEM, Linkerd‑related annotations, and Pod spec in the logs. For example:

time="2025-06-04T11:19:18Z" level=debug msg="admission request: &AdmissionRequest{UID:919c6889-a59c-4168-be0d-6d448460af98,Kind:/v1, Kind=Service,Resource:{ v1 services},SubResource:,Name:simple-app-v2,Namespace:simple-app,Operation:CREATE,UserInfo:{system:admin  [system:masters system:authenticated] map[authentication.kubernetes.io/credential-id:[X509SHA256=4bc6de6278f805fc173745e09f4a564b7c7b4fac138201f729d912cd623fa55b]]},Object:{[123 34 97 112 105 86 101 114 115 105 111 110 34 58 34 118 49 34 44 34 107 105 110 100 34 58 34 83 101 114 118 105 99 101 34 44 34 109 101 116 97 100 97 116 97 34 58 123 34 97 110 110 111 116 97 116 105 111 110 115 34 58 123 34 107 117 98 101 99 116 108 46 107 117 98 101 114 110 101 116 101 115 46 105 111 47 108 97 115 116 45 97 112 112 108 105 101 100 45 99 111 110 102 105 103 117 114 97 116 105 111 110 34 58 34 123 92 34 97 112 105 86 101 114 115 105 111 110 92 34 58 92 34 118 49 92 34 44 92 34 107 105 110 100 92 34 58 92 34 83 101 114 118 105 99 101 92 34 44 92 34 109 101 116 97 100 97 116 97 92 34 58 123 92 34 97 110 110 111 116 97 116 105 111 110 115 92 34 58 123 125 44 92 34 110 97 109 101 92 34 58 92 34 115 105 109 112 108 101 45 97 112 112 45 118 50 92 34 44 92 34 110 97 109 101 115 112 97 99 101 92 34 58 92 34 115 105 109 112 108 101 45 97 112 112 92 34 125 44 92 34 115 112 101 99 92 34 58 123 92 34 112 111 114 116 115 92 34 58 91 123 92 34 112 111 114 116 92 34 58 56 48 44 92 34 116 97 114 103 101 116 80 111 114 116 92 34 58 53 54 55 56 125 93 44 92 34 115 101 108 101 99 116 111 114 92 34 58 123 92 34 97 112 112 92 34 58 92 34 115 105 109 112 108 101 45 97 112 112 45 118 50 92 34 44 92 34 118 101 114 115 105 111 110 92 34 58 92 34 118 50 92 34 125 125 125 92 110 34 125 44 34 99 114 101 97 116 105 111 110 84 105 109 101 115 116 97 109 112 34 58 110 117 108 108 44 34 109 97 110 97 103 101 100 70 105 101 108 100 115 34 58 91 123 34 97 112 105 86 101 114 115 105 111 110 34 58 34 118 49 34 44 34 102 105 101 108 100 115 84 121 112 101 34 58 34 70 105 101 108 100 115 86 49 34 44 34 102 105 101 108 100 115 86 49 34 58 123 34 102 58 109 101 116 97 100 97 116 97 34 58 123 34 102 58 97 110 110 111 116 97 116 105 111 110 115 34 58 123 34 46 34 58 123 125 44 34 102 58 107 117 98 101 99 116 108 46 107 117 98 101 114 110 101 116 101 115 46 105 111 47 108 97 115 116 45 97 112 112 108 105 101 100 45 99 111 110 102 105 103 117 114 97 116 105 111 110 34 58 123 125 125 125 44 34 102 58 115 112 101 99 34 58 123 34 102 58 105 110 116 101 114 110 97 108 84 114 97 102 102 105 99 80 111 108 105 99 121 34 58 123 125 44 34 102 58 112 111 114 116 115 34 58 123 34 46 34 58 123 125 44 34 107 58 123 92 34 112 111 114 116 92 34 58 56 48 44 92 34 112 114 111 116 111 99 111 108 92 34 58 92 34 84 67 80 92 34 125 34 58 123 34 46 34 58 123 125 44 34 102 58 112 111 114 116 34 58 123 125 44 34 102 58 112 114 111 116 111 99 111 108 34 58 123 125 44 34 102 58 116 97 114 103 101 116 80 111 114 116 34 58 123 125 125 125 44 34 102 58 115 101 108 101 99 116 111 114 34 58 123 125 44 34 102 58 115 101 115 115 105 111 110 65 102 102 105 110 105 116 121 34 58 123 125 44 34 102 58 116 121 112 101 34 58 123 125 125 125 44 34 109 97 110 97 103 101 114 34 58 34 107 117 98 101 99 116 108 45 99 108 105 101 110 116 45 115 105 100 101 45 97 112 112 108 121 34 44 34 111 112 101 114 97 116 105 111 110 34 58 34 85 112 100 97 116 101 34 44 34 116 105 109 101 34 58 34 50 48 50 53 45 48 54 45 48 52 84 49 49 58 49 57 58 49 56 90 34 125 93 44 34 110 97 109 101 34 58 34 115 105 109 112 108 101 45 97 112 112 45 118 50 34 44 34 110 97 109 101 115 112 97 99 101 34 58 34 115 105 109 112 108 101 45 97 112 112 34 125 44 34 115 112 101 99 34 58 123 34 105 110 116 101 114 110 97 108 84 114 97 102 102 105 99 80 111 108 105 99 121 34 58 34 67 108 117 115 116 101 114 34 44 34 112 111 114 116 115 34 58 91 123 34 112 111 114 116 34 58 56 48 44 34 112 114 111 116 111 99 111 108 34 58 34 84 67 80 34 44 34 116 97 114 103 101 116 80 111 114 116 34 58 53 54 55 56 125 93 44 34 115 101 108 101 99 116 111 114 34 58 123 34 97 112 112 34 58 34 115 105 109 112 108 101 45 97 112 112 45 118 50 34 44 34 118 101 114 115 105 111 110 34 58 34 118 50 34 125 44 34 115 101 115 115 105 111 110 65 102 102 105 110 105 116 121 34 58 34 78 111 110 101 34 44 34 116 121 112 101 34 58 34 67 108 117 115 116 101 114 73 80 34 125 44 34 115 116 97 116 117 115 34 58 123 34 108 111 97 100 66 97 108 97 110 99 101 114 34 58 123 125 125 125] <nil>},OldObject:{[] <nil>},DryRun:*false,Options:{[123 34 97 112 105 86 101 114 115 105 111 110 34 58 34 109 101 116 97 46 107 56 115 46 105 111 47 118 49 34 44 34 102 105 101 108 100 77 97 110 97 103 101 114 34 58 34 107 117 98 101 99 116 108 45 99 108 105 101 110 116 45 115 105 100 101 45 97 112 112 108 121 34 44 34 102 105 101 108 100 86 97 108 105 100 97 116 105 111 110 34 58 34 83 116 114 105 99 116 34 44 34 107 105 110 100 34 58 34 67 114 101 97 116 101 79 112 116 105 111 110 115 34 125] <nil>},RequestKind:/v1, Kind=Service,RequestResource:/v1, Resource=services,RequestSubResource:,}"
time="2025-06-04T11:19:18Z" level=debug msg="request object bytes: {\"apiVersion\":\"v1\",\"kind\":\"Service\",\"metadata\":{\"annotations\":{\"kubectl.kubernetes.io/last-applied-configuration\":\"{\\\"apiVersion\\\":\\\"v1\\\",\\\"kind\\\":\\\"Service\\\",\\\"metadata\\\":{\\\"annotations\\\":{},\\\"name\\\":\\\"simple-app-v2\\\",\\\"namespace\\\":\\\"simple-app\\\"},\\\"spec\\\":{\\\"ports\\\":[{\\\"port\\\":80,\\\"targetPort\\\":5678}],\\\"selector\\\":{\\\"app\\\":\\\"simple-app-v2\\\",\\\"version\\\":\\\"v2\\\"}}}\\n\"},\"creationTimestamp\":null,\"managedFields\":[{\"apiVersion\":\"v1\",\"fieldsType\":\"FieldsV1\",\"fieldsV1\":{\"f:metadata\":{\"f:annotations\":{\".\":{},\"f:kubectl.kubernetes.io/last-applied-configuration\":{}}},\"f:spec\":{\"f:internalTrafficPolicy\":{},\"f:ports\":{\".\":{},\"k:{\\\"port\\\":80,\\\"protocol\\\":\\\"TCP\\\"}\":{\".\":{},\"f:port\":{},\"f:protocol\":{},\"f:targetPort\":{}}},\"f:selector\":{},\"f:sessionAffinity\":{},\"f:type\":{}}},\"manager\":\"kubectl-client-side-apply\",\"operation\":\"Update\",\"time\":\"2025-06-04T11:19:18Z\"}],\"name\":\"simple-app-v2\",\"namespace\":\"simple-app\"},\"spec\":{\"internalTrafficPolicy\":\"Cluster\",\"ports\":[{\"port\":80,\"protocol\":\"TCP\",\"targetPort\":5678}],\"selector\":{\"app\":\"simple-app-v2\",\"version\":\"v2\"},\"sessionAffinity\":\"None\",\"type\":\"ClusterIP\"},\"status\":{\"loadBalancer\":{}}}"
time="2025-06-04T11:19:18Z" level=debug msg="/var/run/linkerd/config/values config YAML: clusterDomain: cluster.local\nclusterNetworks: 10.0.0.0/8,100.64.0.0/10,172.16.0.0/12,192.168.0.0/16,fd00::/8\ncniEnabled: false\ncommonLabels: {}\ncontrolPlaneTracing: false\ncontrolPlaneTracingNamespace: linkerd-jaeger\ncontroller:\n  podDisruptionBudget:\n    maxUnavailable: 1\ncontrollerGID: -1\ncontrollerImage: ghcr.io/buoyantio/controller\ncontrollerImageVersion: \"\"\ncontrollerLogFormat: plain\ncontrollerLogLevel: debug\ncontrollerReplicas: 1\ncontrollerUID: 2103\ndebugContainer:\n  image:\n    name: cr.l5d.io/linkerd/debug\n    pullPolicy: \"\"\n    version: edge-25.4.4\ndeploymentStrategy:\n  rollingUpdate:\n    maxSurge: 25%\n    maxUnavailable: 25%\ndestinationController:\n  additionalArgs:\n  - -ext-endpoint-zone-weights\n  livenessProbe:\n    timeoutSeconds: 1\n  podAnnotations: {}\n  readinessProbe:\n    timeoutSeconds: 1\ndisableHeartBeat: false\ndisableIPv6: true\negress:\n  globalEgressNetworkNamespace: linkerd-egress\nenableEndpointSlices: true\nenableH2Upgrade: true\nenablePSP: false\nenablePodAntiAffinity: false\nenablePodDisruptionBudget: false\nenablePprof: false\nidentity:\n  externalCA: false\n  issuer:\n    clockSkewAllowance: 20s\n    issuanceLifetime: 24h0m0s\n    scheme: linkerd.io/tls\n    tls:\n      crtPEM: |\n        -----BEGIN CERTIFICATE-----\n        MIIBsjCCAVigAwIBAgIQG4RR1EkQLvanRZspKw9R3jAKBggqhkjOPQQDAjAlMSMw\n        IQYDVQQDExpyb290LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAeFw0yNTA2MDQxMTE4\n        NDJaFw0yNjA2MDQxMTE4NDJaMCkxJzAlBgNVBAMTHmlkZW50aXR5LmxpbmtlcmQu\n        Y2x1c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABKyE5Px3kwpI\n        ZEGR9Ky0feN3/X/3DQOSDweb3B1O6JK4fAtYDetnyUul+T0zXKtrLX0lrAdRzyaj\n        MLhci5ZMEd6jZjBkMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEA\n        MB0GA1UdDgQWBBSkrmSMxXmF/CJz14sL5SNbwNh9qjAfBgNVHSMEGDAWgBSw5rC0\n        vxQuKzp3Qyo9+367k6kzMTAKBggqhkjOPQQDAgNIADBFAiA7L9KiSSJdKD8WxSXM\n        cLcyqPe7Sw9lBko/Wcgcue80iwIhAJjddq/892QBoQspnTBctEfUVovznJCIMSKq\n        P4YtzyEn\n        -----END CERTIFICATE-----\n  kubeAPI:\n    clientBurst: 200\n    clientQPS: 100\n  livenessProbe:\n    timeoutSeconds: 1\n  podAnnotations: {}\n  readinessProbe:\n    timeoutSeconds: 1\n  serviceAccountTokenProjection: true\nidentityTrustAnchorsPEM: |\n  -----BEGIN CERTIFICATE-----\n  MIIBjTCCATSgAwIBAgIRAIMD4XLxwxvmNPAOcIuzz/EwCgYIKoZIzj0EAwIwJTEj\n  MCEGA1UEAxMacm9vdC5saW5rZXJkLmNsdXN0ZXIubG9jYWwwHhcNMjUwNjA0MTEx\n  ODQyWhcNMzUwNjAyMTExODQyWjAlMSMwIQYDVQQDExpyb290LmxpbmtlcmQuY2x1\n  c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABLwQ70dJQiN0LHY6\n  q4fvIND1LqcyypW8P+qrhVuIdHThgPx/KXXLa2+KjAbUzzeu8PRagGriwRn6+A69\n  AixeeuKjRTBDMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEBMB0G\n  A1UdDgQWBBSw5rC0vxQuKzp3Qyo9+367k6kzMTAKBggqhkjOPQQDAgNHADBEAiAt\n  ZkhSf0dHy7c6dDorCcfUiwNVjSdV2Z+Sl2EJ0ZxorgIgO9hII30K/26KlicbXygh\n  CxaYQ3t5qyY437Z08s11FEg=\n  -----END CERTIFICATE-----\nidentityTrustDomain: cluster.local\nimagePullPolicy: IfNotPresent\nimagePullSecrets: []\nkubeAPI:\n  clientBurst: 200\n  clientQPS: 100\nlicenseResources:\n  resources:\n    limits:\n      cpu: 500m\n      memory: 256Mi\n    requests:\n      cpu: 250m\n      memory: 128Mi\nlicenseSecret: null\nlinkerdVersion: enterprise-2.18.0\nmanageExternalWorkloads: true\nnetworkValidator:\n  connectAddr: \"\"\n  enableSecurityContext: true\n  listenAddr: \"\"\n  logFormat: plain\n  logLevel: debug\n  timeout: 10s\nnodeSelector:\n  kubernetes.io/os: linux\npodAnnotations: {}\npodLabels: {}\npodMonitor:\n  controller:\n    enabled: true\n    namespaceSelector: |\n      matchNames:\n        - {{ .Release.Namespace }}\n        - linkerd-viz\n        - linkerd-jaeger\n  enabled: false\n  labels: {}\n  proxy:\n    enabled: true\n  scrapeInterval: 10s\n  scrapeTimeout: 10s\n  serviceMirror:\n    enabled: true\npolicyController:\n  image:\n    name: ghcr.io/buoyantio/policy-controller\n    pullPolicy: \"\"\n    version: \"\"\n  livenessProbe:\n    timeoutSeconds: 1\n  logLevel: info\n  probeNetworks:\n  - 0.0.0.0/0\n  - ::/0\n  readinessProbe:\n    timeoutSeconds: 1\n  resources:\n    cpu:\n      limit: \"\"\n      request: \"\"\n    ephemeral-storage:\n      limit: \"\"\n      request: \"\"\n    memory:\n      limit: \"\"\n      request: \"\"\npolicyValidator:\n  caBundle: \"\"\n  crtPEM: \"\"\n  externalSecret: false\n  injectCaFrom: \"\"\n  injectCaFromSecret: \"\"\n  namespaceSelector:\n    matchExpressions:\n    - key: config.linkerd.io/admission-webhooks\n      operator: NotIn\n      values:\n      - disabled\npriorityClassName: \"\"\nprofileValidator:\n  caBundle: \"\"\n  crtPEM: \"\"\n  externalSecret: false\n  injectCaFrom: \"\"\n  injectCaFromSecret: \"\"\n  namespaceSelector:\n    matchExpressions:\n    - key: config.linkerd.io/admission-webhooks\n      operator: NotIn\n      values:\n      - disabled\nprometheusUrl: \"\"\nproxy:\n  additionalEnv:\n  - name: BUOYANT_BALANCER_LOAD_LOW\n    value: \"0.1\"\n  - name: BUOYANT_BALANCER_LOAD_HIGH\n    value: \"3.0\"\n  await: true\n  control:\n    streams:\n      idleTimeout: 5m\n      initialTimeout: 3s\n      lifetime: 1h\n  cores: null\n  defaultInboundPolicy: all-unauthenticated\n  disableInboundProtocolDetectTimeout: false\n  disableOutboundProtocolDetectTimeout: false\n  enableExternalProfiles: false\n  enableShutdownEndpoint: false\n  gid: -1\n  image:\n    name: ghcr.io/buoyantio/proxy\n    pullPolicy: \"\"\n    version: \"\"\n  inbound:\n    server:\n      http2:\n        keepAliveInterval: 100s\n        keepAliveTimeout: 100s\n  inboundConnectTimeout: 100ms\n  inboundDiscoveryCacheUnusedTimeout: 90s\n  livenessProbe:\n    initialDelaySeconds: 10\n    timeoutSeconds: 1\n  logFormat: plain\n  logHTTPHeaders: \"off\"\n  logLevel: warn,linkerd=debug,hickory=error,linkerd_proxy_http::client[{headers}]=on\n  metrics:\n    hostnameLabels: false\n  nativeSidecar: false\n  opaquePorts: 25,587,3306,4444,5432,6379,9300,11211\n  outbound:\n    server:\n      http2:\n        keepAliveInterval: 200s\n        keepAliveTimeout: 200s\n  outboundConnectTimeout: 1000ms\n  outboundDiscoveryCacheUnusedTimeout: 5s\n  outboundTransportMode: transport-header\n  ports:\n    admin: 4191\n    control: 4190\n    inbound: 4143\n    outbound: 4140\n  readinessProbe:\n    initialDelaySeconds: 2\n    timeoutSeconds: 1\n  requireIdentityOnInboundPorts: \"\"\n  resources:\n    cpu:\n      limit: \"\"\n      request: \"\"\n    ephemeral-storage:\n      limit: \"\"\n      request: \"\"\n    memory:\n      limit: \"\"\n      request: \"\"\n  runtime:\n    workers:\n      maximumCPURatio: null\n      minimum: 1\n  shutdownGracePeriod: \"\"\n  startupProbe:\n    failureThreshold: 120\n    initialDelaySeconds: 0\n    periodSeconds: 1\n  uid: 2102\n  waitBeforeExitSeconds: 0\nproxyInit:\n  closeWaitTimeoutSecs: 0\n  ignoreInboundPorts: 4567,4568\n  ignoreOutboundPorts: 4567,4568\n  image:\n    name: ghcr.io/buoyantio/proxy-init\n    pullPolicy: \"\"\n    version: enterprise-2.18.0\n  iptablesMode: legacy\n  kubeAPIServerPorts: 443,6443\n  logFormat: \"\"\n  logLevel: \"\"\n  privileged: false\n  runAsGroup: 65534\n  runAsRoot: false\n  runAsUser: 65534\n  skipSubnets: \"\"\n  xtMountPath:\n    mountPath: /run\n    name: linkerd-proxy-init-xtables-lock\nproxyInjector:\n  caBundle: \"\"\n  crtPEM: \"\"\n  externalSecret: false\n  injectCaFrom: \"\"\n  injectCaFromSecret: \"\"\n  livenessProbe:\n    timeoutSeconds: 1\n  namespaceSelector:\n    matchExpressions:\n    - key: config.linkerd.io/admission-webhooks\n      operator: NotIn\n      values:\n      - disabled\n    - key: kubernetes.io/metadata.name\n      operator: NotIn\n      values:\n      - kube-system\n      - cert-manager\n  objectSelector:\n    matchExpressions:\n    - key: linkerd.io/control-plane-component\n      operator: DoesNotExist\n    - key: linkerd.io/cni-resource\n      operator: DoesNotExist\n  podAnnotations: {}\n  readinessProbe:\n    timeoutSeconds: 1\n  timeoutSeconds: 10\nrevisionHistoryLimit: 10\nruntimeClassName: \"\"\nspValidator:\n  livenessProbe:\n    timeoutSeconds: 1\n  readinessProbe:\n    timeoutSeconds: 1\nwebhookFailurePolicy: Ignore\n"

References

From Trust Anchors to SPIFFE IDs: Understanding Linkerd’s Automated Identity Pipeline

Ivan Porta — Thu, 19 Jun 2025 07:53:23 +0000

Linkerd automatically enables mTLS for all TCP traffic between meshed pods. To do so, it relies on several certificates that must be in place for the control plane to function correctly. You can supply these certificates during installation or generate them with third-party tools such as cert-manager or trust-manager. The required certificates are the Root Trust Anchor and an Identity Intermediate Issuer Certificate, which work together to issue a unique Leaf Certificate for every meshed workload.

The Root Trust Anchor Certificate

Linkerd’s Root Trust Anchor is a public CA certificate that serves as the ultimate trust point for all service-mesh certificates. It never issues workload certificates directly; instead, it signs intermediate CA certificates, which then issue the workload certificates. This separation lets each clusters (or multiple clusters) can run its own issuer while still validating against the same root anchor, maintaining mesh-wide trust without exposing the root key in day-to-day workflows.

The Root Trust Anchor certificate (containing only the public key) is stored in the ConfigMap named linkerd-identity-trust-roots. Since this ConfigMap holds no private key material, it’s safe to store it in plain view and use it to bootstrap trust for all intermediates and end-entity certificates. A common practice for many enterprises is to leverage their own PKI to generate a new intermediate certificate that chains back to this root.

When a new Linkerd proxy is injected into a workload pod, it receives its Root Trust Anchor certificate through an environment variable and a mounted volume.

linkerd-proxy:
    Container ID:    containerd://f348b4bebec14d557c44951f309e07fac969de2ea93f20e9d1920b4a8e02180e
    Image:           cr.l5d.io/linkerd/proxy:edge-25.5.3
    ...
    Environment:
     ...
      LINKERD2_PROXY_IDENTITY_DIR:                               /var/run/linkerd/identity/end-entity
      LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS:                     <set to the key 'ca-bundle.crt' of config map 'linkerd-identity-trust-roots'>  Optional: false
      LINKERD2_PROXY_IDENTITY_TOKEN_FILE:                        /var/run/secrets/tokens/linkerd-identity-token
      ...
    Mounts:
      /var/run/linkerd/identity/end-entity from linkerd-identity-end-entity (rw)
      /var/run/secrets/tokens from linkerd-identity-token (rw)
...
Volumes:
  trust-roots:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      linkerd-identity-trust-roots
    Optional:  false
  linkerd-identity-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  linkerd-identity-end-entity:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      Memory
    SizeLimit:   <unset>

At startup, the proxy loads the trust-anchor certificate specified by LINKERD2_PROXY_IDENTITY_TRUST_ANCHORS, ensures the directory indicated by LINKERD2_PROXY_IDENTITY_DIR exists, and generates an ECDSA P-256 key pair. The private key is then encoded in PKCS#8 PEM format and written to key.p8 file.

func generateAndStoreKey(p string) (key *ecdsa.PrivateKey, err error) {
    key, err = tls.GenerateKey()
    if err != nil {
        return
    }
    pemb := tls.EncodePrivateKeyP8(key)
    err = os.WriteFile(p, pemb, 0600)
    return

Next, it generates an X.509 CSR whose CN and DNS SAN are set to the proxy’s identity, saving it as csr.der.

func generateAndStoreCSR(p, id string, key *ecdsa.PrivateKey) ([]byte, error) {
    csr := x509.CertificateRequest{
        Subject:  pkix.Name{CommonName: id},
        DNSNames: []string{id},
    }
    csrb, err := x509.CreateCertificateRequest(rand.Reader, &csr, key)
    if err != nil {
        return nil, fmt.Errorf("failed to create CSR: %w", err)
    }
    if err := os.WriteFile(p, csrb, 0600); err != nil {
        return nil, fmt.Errorf("failed to write CSR: %w", err)
    }
    return csrb, nil
}

Finally, it starts the Rust identity client, which reads the ServiceAccount JWT via TokenSource::load(), loads the Root Trust Anchor certificate along with key.p8 and csr.der, and sends the raw CSR in a gRPC CertifyRequest.

let req = tonic::Request::new(api::CertifyRequest {
  token: token.load()?,                   
  identity: name.to_string(),               
  certificate_signing_request: docs.csr_der.clone(),
});
let api::CertifyResponse { leaf_certificate, intermediate_certificates, valid_until } =
  IdentityClient::new(client).certify(req).await?.into_inner();

Here, identity contains the SPIFFE ID (spiffe://<cluster>/ns/<namespace>/sa/<serviceaccount>). The control plane uses this value to issue a certificate whose URI SAN matches that SPIFFE ID, ignoring any SANs present in the CSR itself.

The Identity Intermediate Issuer Certificate

The intermediate issuer certificate is stored in the linkerd-identity-issuer secret within the linkerd namespace. When the Identity service receives a Certificate Signing Request, it first validates the related ServiceAccount token by submitting a TokenReview to the Kubernetes API (authentication.k8s.io/v1/tokenreviews). The request includes:

the ServiceAccount token extracted from the CSR, and
the identity.l5d.io audience (so that only tokens issued specifically for Linkerd are accepted).

If the validation fails or the token is not authentcated the validation fails immediately, otherwise, the API server will go ahead and verify the token’s signature, expiration, issuer, and intended audience.

The Identity service parses the ServiceAccount reference (system:serviceaccount::), verifies that each segment is a valid DNS-1123 label, and constructs a SPIFFE URI in the configured trust domain. It then builds an x509.Certificate template that includes

the public key from the CSR,
a SAN set to the SPIFFE URI, and
a default 24-hour validity period. The certificate is signed with x509.CreateCertificate(rand.Reader, &template, issuerCert, csr.PublicKey, issuerKey) and returned to the proxy. You can observe this workflow by increasing the Identity pod’s log level to debug.

kubectl logs -n linkerd       linkerd-identity-56d78cdd86-8c64w 
Defaulted container "identity" out of: identity, linkerd-proxy, linkerd-init (init)
time="2025-05-21T12:11:32Z" level=info msg="running version enterprise-2.17.1"
time="2025-05-21T12:11:32Z" level=info msg="starting gRPC license client" component=license-client grpc-address="linkerd-enterprise:8082"
time="2025-05-21T12:11:32Z" level=info msg="starting admin server on :9990"
time="2025-05-21T12:11:32Z" level=info msg="Using k8s client with QPS=100.00 Burst=200"
time="2025-05-21T12:11:32Z" level=info msg="POST https://10.247.0.1:443/apis/authorization.k8s.io/v1/selfsubjectaccessreviews 201 Created in 1 milliseconds"
time="2025-05-21T12:11:32Z" level=debug msg="Loaded issuer cert: -----BEGIN CERTIFICATE-----\nMIIBsjCCAVigAwIBAgIQZelMfABi9RPUkaa1fEXfIjAKBggqhkjOPQQDAjAlMSMw\nIQYDVQQDExpyb290LmxpbmtlcmQuY2x1c3Rlci5sb2NhbDAeFw0yNTA1MjExMjEx\nMDJaFw0yNjA1MjExMjExMDJaMCkxJzAlBgNVBAMTHmlkZW50aXR5LmxpbmtlcmQu\nY2x1c3Rlci5sb2NhbDBZMBMGByqGSM49AgEGCCqGSM49AwEHA0IABO52MoQ7mva8\nYPg7abR7rqO3UhE0csDoPgFKoqM54JAfQY9/8rwgKWn3AUvH9NKNNy46Nq0MmPFd\nZgz/qSX3i0WjZjBkMA4GA1UdDwEB/wQEAwIBBjASBgNVHRMBAf8ECDAGAQH/AgEA\nMB0GA1UdDgQWBBTSq+l58FRN+T4ZSwqPyX9EFJmysTAfBgNVHSMEGDAWgBQpPJRY\nnNGBgGrC7LAnIDcwXkIHVjAKBggqhkjOPQQDAgNIADBFAiA7bw59dCwkhQ9CSyUN\nLR4/U7nt2mFV519zCtvD5cJmjgIhAKhPME9EJVtN28L6ZpaYSWbnSTyih1aL/b7m\neqW0acqg\n-----END CERTIFICATE-----\n"
time="2025-05-21T12:11:32Z" level=debug msg="Issuer has been updated"
time="2025-05-21T12:11:32Z" level=info msg="starting gRPC server on :8080"
time="2025-05-21T12:11:37Z" level=debug msg="Validating token for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local"
time="2025-05-21T12:11:37Z" level=info msg="POST https://10.247.0.1:443/apis/authentication.k8s.io/v1/tokenreviews 201 Created in 2 milliseconds"
time="2025-05-21T12:11:37Z" level=info msg="issued certificate for linkerd-identity.linkerd.serviceaccount.identity.linkerd.cluster.local until 2025-05-22 12:11:57 +0000 UTC: a7048ff55002e726894ad92eccfd6738fcbc72b496d58ef3071a73c866c8e311"

The Proxy Leaf Certificate

After the proxy receives the certificate, it loads it into its in-memory store and immediately uses it for mTLS. It automatically renews the certificate when roughly 70 % of its TTL has elapsed, generating a new CSR to rotate the certificate.

fn refresh_in(config: &Config, expiry: SystemTime) -> Duration {
    match expiry.duration_since(SystemTime::now()).ok().map(|d| d * 7 / 10) // 70% duration
    {
        None => config.min_refresh,
        Some(lifetime) if lifetime < config.min_refresh => config.min_refresh,
        Some(lifetime) if config.max_refresh < lifetime => config.max_refresh,
        Some(lifetime) => lifetime,
    }
}

The overall flow is the following:

References:

End-to-End Distributed Tracing: Integrating Linkerd with Splunk Observability Cloud

Ivan Porta — Thu, 08 May 2025 03:43:57 +0000

Observability is critical at multiple layers of an organization. For IT operations teams, whose primary focus is maintaining system uptime and reliability, it provides real-time visibility into system performance, sends alerts when anomalies occur, and offers critical data to quickly diagnose issues. Meanwhile, at the executive level, observability supports strategic decision-making by visualizing KPIs related to customer journeys via a user friendly dashboards. For example, a retail company might track the cost of abandoned shopping carts or the average customer spend. According to a 2024 Dynatrace survey, 77% of technology leaders report that more non-IT teams are now involved in decisions driven by observability insights.

Data is at the core of observability and monitoring, typically originating from logs, metrics, and traces. As applications add new functionality and user numbers grow, the data volume can skyrocket, making analysis increasingly complex. The challenge becomes even more pronounced when organizations shift from monolithic architectures to microservices-based systems, where an application is composed of numerous loosely coupled services where a single user request may trigger a series of calls across multiple back-end services, each contributing to overall performance and reliability. In fact, 79% of technology leaders report that cloud-native technology stacks generate so much data that it exceeds human capacity to manage.

Additionally, microservices often communicate via asynchronous or synchronous calls that happen concurrently and independently, complicating troubleshooting efforts. Without distributed tracing, pinpointing delays or failures in this parallel ecosystem would be a daunting task.

Distributed Tracing and Observability Tools

To reduce complexity and avoid manually piecing together data from different monitoring systems, many organizations turn to integrated observability platforms. Vendors such as Dynatrace, Cisco, Datadog, and New Relic have developed or acquired solutions that bring logs, metrics, traces, and infrastructure data under one roof. On the open-source side, tools like Zipkin and Jaeger offer user-friendly interfaces for visualizing distributed traces.

How Distributed Tracing Works

Distributed tracing typically starts by updating the application code so that each incoming request can be tracked as it travels through multiple services. Many implementations use the OpenTracing API, which supports popular languages like Go, Java, Python, JavaScript, Ruby, and PHP. This automatically creates “spans” whenever a request enters a service, eliminating the need for custom tracing logic. For example:

{
  "traceid": "123abc4d5678ef91g234h",
  "spanID": "a12cde34f5gh67",
  "parentSpanID": "a1b2345678c91",
  "operationName": "/API",
  "serviceName": "API",
  "startTime": 1608239395286533,
  "duration": 1000000,
  "logs": [],
  "tags": [
    {
      "http.method": "GET",
      "http.path": "/api"
    }
  ]
}

Spans are then linked through unique trace and span IDs to form a single trace for each request, revealing the end-to-end path across all involved services. If the services communicate via HTTP, the trace information can be passed through HTTP headers, using open-source standards such as B3. B3 uses headers like X-B3-TraceId, X-B3-SpanId, and X-B3-ParentSpanId to carry identifiers from one service to the next.

Spans are sent to the collector, which validates them, applies any necessary transformations, and stores them in back-end storage before rendering them in a UI.

Distributed Tracing and Linkerd

Linkerd supports distributed tracing by emitting trace spans directly from its data-plane proxies, using either OpenCensus or OpenTelemetry. When the Linkerd proxy detects a tracing header in an incoming HTTP request, it automatically creates a span to capture metrics such as the time spent inside the proxy and other relevant metadata.

Integrating Linkerd with Splunk

In this tutorial, you’ll learn how to integrate Linkerd with Splunk, one of the major observability platforms in the Gartner Magic Quadrant. We’ll start by creating a local Kubernetes environment, installing Linkerd, deploying a sample application, and then configuring everything to send traces to Splunk.
We’ll use k3d to create a local Kubernetes cluster:

k3d cluster create training \
  --agents 0 \
  --servers 1 \
  --image rancher/k3s:v1.30.8-k3s1 \
  --network playground \
  --port 8080:80@loadbalancer

Next, install Linkerd using Helm and generate your own mTLS certificates with step:

helm repo add linkerd-buoyant https://helm.buoyant.cloud
helm repo update

helm install linkerd-crds \
  --create-namespace \
  --namespace linkerd \
  linkerd-buoyant/linkerd-enterprise-crds

step certificate create root.linkerd.cluster.local ca.crt ca.key \
  --profile root-ca \
  --no-password \
  --insecure

step certificate create identity.linkerd.cluster.local issuer.crt issuer.key \
  --profile intermediate-ca \
  --not-after 8760h \
  --no-password \
  --insecure \
  --ca ca.crt \
  --ca-key ca.key

helm install linkerd-control-plane \
  --namespace linkerd \
  --set license=$BUOYANT_LICENSE \
  --set-file identityTrustAnchorsPEM=ca.crt \
  --set-file identity.issuer.tls.crtPEM=issuer.crt \
  --set-file identity.issuer.tls.keyPEM=issuer.key \
  linkerd-buoyant/linkerd-enterprise-control-plane

We’ll use the Emojivoto demo application. The source code uses the contrib.go.opencensus.io/exporter/ocagentlibrary to send OpenCensus traces over gRPC to an agent configured via the OC_AGENT_HOST environment variable.

kubectl apply -f https://run.linkerd.io/emojivoto.yml

Finally, let’s enable automatic sidecar injection for the Emojivoto namespace and restart the deployments, so that the pods are going to be injected with the Linkerd proxy.

kubectl annotate ns emojivoto linkerd.io/inject=enabled
kubectl rollout restart deploy -n emojivoto

Linkerd Jaeger configuration

For this demo, we’ll send proxy metrics to the collector over OpenTelemetry and the application traces over OpenCensus. This setup highlights Linkerd’s flexibility in handling different telemetry protocols.

Linkerd Jaeger Extension has three different components:

Jaeger (optional): UI for rendering traces collected by the linkerd-jaeger collector.
Injector: Injects Linkerd proxies with the environment variables that define the protocol and endpoint for sending traces. For instance, webhook.collectorSvcAddr sets LINKERD2_PROXY_TRACE_COLLECTOR_SVC_NAME (the collector endpoint) and webhook.collectorTraceProtocol specifies which tracing protocol (OpenCensus or OpenTelemetry) to use.
Collector (optional): Receives traces from the proxies. Its configuration (in the collector-config ConfigMap) includes the list of exporters, endpoints, protocols, and other attributes/metadata.

By default, Linkerd Jaeger configures the proxy to send metrics via OpenTelemetry on port 55678 and enables only that exporter. Since we want both OpenTelemetry and OpenCensus, we need to customize these settings. Below is an example value.yaml snippet showing how to enable multiple exporters:

jaeger:
  enabled: false
collector:
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      opencensus: {}
      zipkin: {}
      jaeger:
        protocols:
          grpc: {}
          thrift_http: {}
          thrift_compact: {}
          thrift_binary: {}
    processors:
      batch: {}
    extensions:
      health_check: {}
    exporters:
      jaeger:
        endpoint: collector.linkerd-jaeger.svc.cluster.local:14250
        tls:
          insecure: true
      otlp:
        endpoint: collector.linkerd-jaeger.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp, opencensus, zipkin, jaeger]
          processors: [batch]
          exporters: [otlp, jaeger]

Here, we’ve disabled the Jaeger UI because we plan to visualize data in Splunk. Install and configure Linkerd Jaeger via Helm:

helm repo add linkerd-edge https://helm.linkerd.io/edge
helm install linkerd-jaeger \
  --create-namespace \
  --namespace linkerd-jaeger \
  --file value.yaml \
  linkerd-edge/linkerd-jaeger

To send application-level (not proxy) traces via OpenCensus, we will need to set the OC_AGENT_HOST environment variable to the Jaeger collector endpoint:

kubectl -n emojivoto set env --all deploy OC_AGENT_HOST=collector.linkerd-jaeger:55678

If we check the pods we will be able to confirm that the Linkerd injector has updated environment variables for tracing:

kubectl describe pod -n emojivoto deploy/emoji
...
Containers:
  linkerd-proxy:
    Environment:
      LINKERD2_PROXY_TRACE_ATTRIBUTES_PATH:                      /var/run/linkerd/podinfo/labels
      LINKERD2_PROXY_TRACE_COLLECTOR_SVC_ADDR:                   collector.linkerd-jaeger:55678
      LINKERD2_PROXY_TRACE_PROTOCOL:                             opentelemetry
      LINKERD2_PROXY_TRACE_SERVICE_NAME:                         linkerd-proxy
      LINKERD2_PROXY_TRACE_COLLECTOR_SVC_NAME:                   collector.linkerd-jaeger.serviceaccount.identity.linkerd.cluster.local
      LINKERD2_PROXY_TRACE_EXTRA_ATTRIBUTES:                     k8s.pod.uid=$(_pod_uid)
                                                                 k8s.container.name=$(_pod_containerName)

Distributed Tracing with Linkerd and Splunk Observability Cloud

Before being able to send the traces to Splunk we will need to install the OpenTelemetry collector and Splunk agent. To do so:

Go to Data Management, then select Add Integration.

Since our cluster is not running in any Cloud Service Provider, choose Deploy Splunk OpenTelemetry Collector for other environments.

Configure the collector for your setup, and click Next.

Splunk will generate the deployment commands, including an access token and other parameters.

Execute the instructions listed in the platform. This will typically create a DaemonSet to run the Splunk OpenTelemetry Collector on each node:

kubectl get ds 
NAMESPACE     NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
...
default       splunk-otel-collector-agent   1         1         0       1            0           kubernetes.io/os=linux   47s

The OpenTelemetry opeator and k8s Cluster Receiver deployments:

kubectl get deploy -A
NAMESPACE        NAME                                         READY   UP-TO-DATE   AVAILABLE   AGE
...
default          splunk-otel-collector-k8s-cluster-receiver   1/1     1            1           74s
default          splunk-otel-collector-operator               0/1     1            0           74s

And a ConfigMap containing the Splunk OpenTelemetry Collector configuration.

kubectl get cm -A
NAMESPACE         NAME                                                   DATA   AGE
default           kube-root-ca.crt                                       1      16m
default           splunk-otel-collector-otel-agent                       2      2m33s
default           splunk-otel-collector-otel-k8s-cluster-receiver        1      2m33s

The agent will now send information about the cluster health directly to splunk.

By default, Linkerd Jaeger will export traces to its own Jaeger or OpenTelemetry endpoint. To forward these traces to Splunk, update the exporters in your value.yaml (or another Helm values file) to point to the Splunk OpenTelemetry Collector service:

jaeger:
  enabled: false
collector:
  config:
    receivers:
      otlp:
        protocols:
          grpc: {}
          http: {}
      opencensus: {}
      zipkin: {}
      jaeger:
        protocols:
          grpc: {}
          thrift_http: {}
          thrift_compact: {}
          thrift_binary: {}
    processors:
      batch: {}
    extensions:
      health_check: {}
    exporters:
      jaeger:
        endpoint: splunk-otel-collector-agent.default.svc.cluster.local:14250
        tls:
          insecure: true
      otlp:
        endpoint: splunk-otel-collector-agent.default.svc.cluster.local:4317
        tls:
          insecure: true
    service:
      extensions: [health_check]
      pipelines:
        traces:
          receivers: [otlp, opencensus, zipkin, jaeger]
          processors: [batch]
          exporters: [otlp, jaeger]

After updating the Jaeger collector configuration, restart your pods so they pick up the new telemetry settings. Once everything is running, you should see distributed traces from both the Linkerd proxies and the Emojivoto application in Splunk Observability Cloud.

References:

Linkerd Jaeger Helm Chart: https://artifacthub.io/packages/helm/linkerd2-edge/linkerd-jaeger
Linkerd Official Documentation: https://linkerd.io/2.17/tasks/distributed-tracing

Mesh Expansion with Linkerd, AKS, and Azure Virtual Machines

Ivan Porta — Tue, 04 Mar 2025 08:16:54 +0000

Kubernetes adoption continues to grow at an unprecedented pace, especially among larger organizations. According to a recent PortWorx survey of over 500 participants from companies with more than 500 employees, 58% plan to migrate at least some of their VM-managed applications to Kubernetes, while 85% plan to move the majority of their VM workloads to cloud-native platforms. This popularity of container orchestration is driven by scalability, flexibility, operational simplicity, and cost considerations , which make hybrid cloud environments particularly appealing.

At the same time, many enterprises still maintain a significant on-premises footprint, and recent uncertainties due to the Broadcom acquisition of VMware have accelerated the push to modernize traditional VM-based workloads. However, as organizations adopt microservices, they often still need to communicate with legacy services running on-premises. This is where Mesh Expansion comes into play. By extending a service mesh beyond the confines of Kubernetes clusters, Mesh Expansion allows modern microservices to seamlessly interact with traditional on-premises services. In this article, I will show you how to expand your mesh using Linkerd Enterprise, Azure Kubernetes Service (AKS), and a Virtual Machine running in Azure.

Setup the environment

First, let’s deploy all the resources required for this demonstration. We’ll use Terraform to provision an Azure Resource Group, Virtual Networks (VNets), Subnets, a Kubernetes cluster (AKS), and a Linux Virtual Machine.

The following is the related Terraform configuration:

terraform {
  required_version = ">= 0.13"

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = ">= 3.0.0"
    }
  }
}

provider "azurerm" {
  features {}
  subscription_id = "c4de0e1c-1377-4248-9beb-e1f803c76248"
}

# -----------------------------------------------------------
# General
# -----------------------------------------------------------
resource "azurerm_resource_group" "resource_group" {
  name     = "rg-training-krc"
  location = "Korea Central"
}

# -----------------------------------------------------------
# Networking
# -----------------------------------------------------------
resource "azurerm_virtual_network" "virtual_network_kuberentes" {
  name                = "vnet-training-aks-krc"
  address_space       = ["10.224.0.0/16"]
  location            = azurerm_resource_group.resource_group.location
  resource_group_name = azurerm_resource_group.resource_group.name
}

resource "azurerm_virtual_network" "virtual_network_virtual_machine" {
  name                = "vnet-training-vm-krc"
  address_space       = ["10.1.0.0/28"]
  location            = azurerm_resource_group.resource_group.location
  resource_group_name = azurerm_resource_group.resource_group.name
}

resource "azurerm_subnet" "subnet_kuberentes" {
  name                 = "aks-subnet"
  address_prefixes     = ["10.224.1.0/24"]
  resource_group_name  = azurerm_resource_group.resource_group.name
  virtual_network_name = azurerm_virtual_network.virtual_network_kuberentes.name
}

resource "azurerm_subnet" "subnet_virtual_machine" {
  name                 = "vm-subnet"
  address_prefixes     = ["10.1.0.0/29"]
  resource_group_name  = azurerm_resource_group.resource_group.name
  virtual_network_name = azurerm_virtual_network.virtual_network_virtual_machine.name
}

resource "azurerm_virtual_network_peering" "virtual_network_peering_virtual_machine" {
  name                      = "VirtualMachineToAzureKubernetesService"
  resource_group_name       = azurerm_resource_group.resource_group.name
  virtual_network_name      = azurerm_virtual_network.virtual_network_virtual_machine.name
  remote_virtual_network_id = azurerm_virtual_network.virtual_network_kuberentes.id
}

resource "azurerm_virtual_network_peering" "virtual_network_peering_kuberentes" {
  name                      = "KubernetesToVirtualMachine"
  resource_group_name       = azurerm_resource_group.resource_group.name
  virtual_network_name      = azurerm_virtual_network.virtual_network_kuberentes.name
  remote_virtual_network_id = azurerm_virtual_network.virtual_network_virtual_machine.id
}

resource "azurerm_route_table" "route_table" {
  name                = "rt-training-krc"
  location            = azurerm_resource_group.resource_group.location
  resource_group_name = azurerm_resource_group.resource_group.name
}

resource "azurerm_subnet_route_table_association" "route_table_association_virtual_machine" {
  subnet_id      = azurerm_subnet.subnet_virtual_machine.id
  route_table_id = azurerm_route_table.route_table.id
}

resource "azurerm_subnet_route_table_association" "route_table_association_kubernetes" {
  subnet_id      = azurerm_subnet.subnet_kuberentes.id
  route_table_id = azurerm_route_table.route_table.id
}

# -----------------------------------------------------------
# Kubernetes
# -----------------------------------------------------------
resource "azurerm_kubernetes_cluster" "kubernetes_cluster" {
  name                = "aks-training-krc"
  location            = azurerm_resource_group.resource_group.location
  resource_group_name = azurerm_resource_group.resource_group.name
  dns_prefix          = "trainingaks"
  identity {
    type = "SystemAssigned"
  }
  default_node_pool {
    name                         = "default"
    node_count                   = 1
    vm_size                      = "Standard_D2_v2"
    vnet_subnet_id               = azurerm_subnet.subnet_kuberentes.id
  }
}

# -----------------------------------------------------------
# Virtual Machine
# -----------------------------------------------------------
resource "azurerm_network_interface" "network_interface" {
  name                = "nic-training-krc"
  location            = azurerm_resource_group.resource_group.location
  resource_group_name = azurerm_resource_group.resource_group.name
  ip_configuration {
    name                          = "internal"
    subnet_id                     = azurerm_subnet.subnet_virtual_machine.id
    private_ip_address_allocation = "Dynamic"
    public_ip_address_id          = azurerm_public_ip.public_ip.id
  }
}

resource "azurerm_public_ip" "public_ip" {
  name                = "pip-training-krc"
  resource_group_name = azurerm_resource_group.resource_group.name
  location            = azurerm_resource_group.resource_group.location
  allocation_method   = "Static"
}

resource "azurerm_linux_virtual_machine" "virtual_machine" {
  name                            = "vm-training-krc"
  resource_group_name             = azurerm_resource_group.resource_group.name
  location                        = azurerm_resource_group.resource_group.location
  size                            = "Standard_F2"
  admin_username                  = "adminuser"
  admin_password                  = "Password1234!" 
  disable_password_authentication = false
  network_interface_ids = [
    azurerm_network_interface.network_interface.id,
  ]

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Standard_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts"
    version   = "latest"
  }
}

Networking configuration

Virtual Network Peering

Because we have deployed the Virtual Machine in a different Virtual Network from the AKS nodes, we need Virtual Network Peering so they can communicate. Peering allows traffic to flow through the Microsoft private backbone, making the separate virtual networks appear as one. This enables our Virtual Machine to reach AKS nodes by their private IP addresses and vice versa — without routing over the public internet.
The Terraform configuration above creates two VNet peering resources:

VirtualMachineToAzureKubernetesService (connects the VM’s VNet to the AKS VNet)
KubernetesToVirtualMachine (connects the AKS VNet back to the VM’s VNet) Both are necessary because peering is bidirectional. We can test the connectivity by using a privileged debug container on the node to ping the Virtual Machine’s private IP:

$ kubectl get nodes -o wide
NAME                              STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-default-37266765-vmss000000   Ready    <none>   12m   v1.30.9   10.224.1.4    <none>        Ubuntu 22.04.5 LTS   5.15.0-1079-azure   containerd://1.7.25-1

$ kubectl debug node/aks-default-37266765-vmss000000 -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
Creating debugging pod node-debugger-aks-default-37266765-vmss000000-x9zjr with container debugger on node aks-default-37266765-vmss000000.
If you don't see a command prompt, try pressing enter.
/ # ping 10.1.0.4
PING 10.1.0.4 (10.1.0.4): 56 data bytes
64 bytes from 10.1.0.4: seq=0 ttl=64 time=5.452 ms
64 bytes from 10.1.0.4: seq=1 ttl=64 time=2.412 ms
64 bytes from 10.1.0.4: seq=2 ttl=64 time=1.018 ms
64 bytes from 10.1.0.4: seq=3 ttl=64 time=0.879 ms
64 bytes from 10.1.0.4: seq=4 ttl=64 time=1.046 ms
64 bytes from 10.1.0.4: seq=5 ttl=64 time=1.007 ms
--- 10.1.0.4 ping statistics ---
6 packets transmitted, 6 packets received, 0% packet loss
round-trip min/avg/max = 0.879/1.969/5.452 ms

And SSH into the Virtual Machine using its public IP and pinging the AKS node’s private IP:

$ ping  10.224.1.4
PING 10.224.1.4 (10.224.1.4) 56(84) bytes of data.
64 bytes from 10.224.1.4: icmp_seq=1 ttl=64 time=2.03 ms
64 bytes from 10.224.1.4: icmp_seq=2 ttl=64 time=1.30 ms
64 bytes from 10.224.1.4: icmp_seq=3 ttl=64 time=1.14 ms
^C
--- 10.224.1.4 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2003ms
rtt min/avg/max/mdev = 1.142/1.488/2.029/0.387 ms

Route Tables

Even though our Virtual Machine and the Kubernetes nodes can now communicate via private IP addresses, the VM still cannot resolve Kubernetes Cluster IPs or Pod IPs. This is a critical requirement because, once we install the Linkerd proxy, the VM will need to communicate with Linkerd components running inside the Kubernetes cluster , such as linkerd-destination, linkerd-identity, and any target services. These services have internal IPs provided by Kubernetes and rely on CoreDNS (running within the cluster) for name resolution.
To route requests from the VM to Kubernetes services, we can add custom routes in Azure so that any traffic destined for the Kubernetes services or Pod ranges gets forwarded to the AKS node. That node will then use CoreDNS to resolve the service and Pod IPs. In particular, you’ll need routes for the following CIDR:

The resulting routes will be the following:

Once these rules are in place the VM will be able to send traffic to Kubernetes services by using their cluster-internal addresses. If we check the services running in the AKS cluster we will see the kube-dns is at 10.0.0.10.

$ kubectl get svc -A
NAMESPACE     NAME             TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)         AGE
default       kubernetes       ClusterIP   10.0.0.1      <none>        443/TCP         48m
kube-system   kube-dns         ClusterIP   10.0.0.10     <none>        53/UDP,53/TCP   47m
kube-system   metrics-server   ClusterIP   10.0.144.73   <none>        443/TCP         47m

We can then test DNS resolution in the Virtual Machine by specifying kube-dns as our DNS server:

$ nslookup metrics-server.kube-system.svc.cluster.local 10.0.0.10
Server:  10.0.0.10
Address: 10.0.0.10#53

Name: metrics-server.kube-system.svc.cluster.local
Address: 10.0.144.73

To avoid having to specify the DNS server , we can to configure the VM’s netplan file so that 10.0.0.10 is recognized as a primary DNS server. On Ubuntu, netplan configuration files are typically located in /etc/netplan/. The file named 50-cloud-init.yaml is a default or auto-generated configuration that describes how the Ubuntu system should bring up its network interfaces, like eth0, and apply IP addresses, routing, and DNS settings.

$ vim /etc/netplan/50-cloud-init.yaml
network:
  version: 2
  ethernets:
    eth0:
      match:
        macaddress: "00:22:48:f6:f2:98"
        driver: "hv_netvsc"
      dhcp4: true
      nameservers:
        addresses:
          - 10.0.0.10
        search:
          - cluster.local
          - svc.cluster.local
      dhcp4-overrides:
        route-metric: 100
      dhcp6: false
      set-name: "eth0"

After editing, apply the new configuration:

$ sudo netplan apply

Now, DNS resolution on the VM automatically uses 10.0.0.10 for Kubernetes service lookups. You can verify this by running:

$ nslookupp metrics-server.kube-system.svc.cluster.local 10.0.0.10
Server:  10.0.0.10
Address: 10.0.0.10#53

Name: metrics-server.kube-system.svc.cluster.local
Address: 10.0.144.73

Installing Linkerd Enterprise

Now that the networking configuration is completed, we can move forward and install Linkerd. In this demonstration, I will use Helm Charts. If you wanna know more about the different ways to install linkerd, you can read my previous article. How to Install Linkerd Enterprise via CLI, Operator, and Helm Charts

First, we will need to install the Linkerd Custom Resource Defintions.

$ helm upgrade --install linkerd-enterprise-crds linkerd-buoyant/linkerd-enterprise-crds \
  --namespace linkerd \
  --create-namespace \
  --set manageExternalWorkloads=true

Next, we will need to install the Linkerd control plane with the value manageExternalWorkloads set to true.

helm upgrade --install linkerd-control-plane linkerd-buoyant/linkerd-enterprise-control-plane \
  --version 2.17.1 \
  --namespace linkerd \
  --create-namespace \
  --set-file identityTrustAnchorsPEM=./certificates/ca.crt \
  --set-file identity.issuer.tls.crtPEM=./certificates/issuer.crt \
  --set-file identity.issuer.tls.keyPEM=./certificates/issuer.key \
  --set linkerdVersion=enterprise-2.17.1 \
  --set manageExternalWorkloads=true \
  --set license=****

The manageExternalWorkloads set to true will deploy a the linkerd-autoregistration service and deployment.

kubectl get svc -A
NAMESPACE     NAME                        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
default       kubernetes                  ClusterIP   10.0.0.1       <none>        443/TCP         147m
kube-system   kube-dns                    ClusterIP   10.0.0.10      <none>        53/UDP,53/TCP   146m
kube-system   metrics-server              ClusterIP   10.0.144.73    <none>        443/TCP         146m
linkerd       linkerd-autoregistration    ClusterIP   10.0.228.3     <none>        8081/TCP        29s
linkerd       linkerd-dst                 ClusterIP   10.0.129.121   <none>        8086/TCP        4m13s
linkerd       linkerd-dst-headless        ClusterIP   None           <none>        8086/TCP        4m13s
linkerd       linkerd-enterprise          ClusterIP   10.0.56.38     <none>        8082/TCP        4m13s
linkerd       linkerd-identity            ClusterIP   10.0.201.170   <none>        8080/TCP        4m13s
linkerd       linkerd-identity-headless   ClusterIP   None           <none>        8080/TCP        4m13s
linkerd       linkerd-policy              ClusterIP   None           <none>        8090/TCP        4m13s
linkerd       linkerd-policy-validator    ClusterIP   10.0.180.41    <none>        443/TCP         4m13s
linkerd       linkerd-proxy-injector      ClusterIP   10.0.61.156    <none>        443/TCP         4m13s
linkerd       linkerd-sp-validator        ClusterIP   10.0.234.34    <none>        443/TCP         4m13s

Workload Authentication and SPIRE

When a Linkerd proxy starts inside a Kubernetes cluster, it generates a private key and submits a Certificate Signing Request (CSR). This CSR includes the service account token, which the Linkerd identity service uses — together with the Kubernetes API — to validate the proxy’s identity before issuing an x509 certificate. This certificate identifies the proxy in a DNS-like form and sets Subject Alternative Name (SAN) fields accordingly.
Outside of Kubernetes, we don’t have service accounts or a default identity mechanism. That’s where SPIFFE and SPIRE come into play:

SPIFFE defines a standard for identifying and securing workloads.
SPIRE is a production-ready implementation of SPIFFE that many service mesh and infrastructure providers (including Linkerd) can leverage for secure identity management.

SPIRE Architecture

In SPIRE, you run two main components:
Server: Manages and issues identities based on registration entries. It uses these entries to assign the correct SPIFFE ID to each authenticated agent.
Agent: Runs on the same node as the workload and exposes a gRPC API for workloads to request identities. The agent “attests” the workload by checking system or container-level attributes — such as Unix user ID, container image, or other selectors — to ensure the workload is truly what it claims to be. Once validated, the agent issues an x509 SVID (SPIFFE Verifiable Identity Document) containing a URI SAN in the form spiffe://trust-domain-name/path.

Install SPIRE

In production, it’s typical to run the SPIRE server on a dedicated node and have multiple SPIRE agents each running on separate nodes where workloads live. For simplicity, we’ll install both server and agent on a single virtual machine. First we will download and copy SPIRE binaries files to /opt/spire/, which is a common directory for add-on software packages on Linux.

wget https://github.com/spiffe/SPIRE/releases/download/v1.11.2/SPIRE-1.11.2-linux-amd64-musl.tar.gz
tar zvxf SPIRE-1.11.2-linux-amd64-musl.tar.gz
cp -r spire-1.11.2/. /opt/spire/

Since we are going to integrate SPIRE with the Linkerd Certificate Authority (CA), we will need to upload your TrustAnchor certificate and key to the VM:

$ scp ./certificates/ca.key adminuser@20.41.77.179:/home/adminuser/ca.key
adminuser@20.41.77.179's password: ********
ca.key              
                                                                                                   100%  227    24.9KB/s   00:00    
$ scp ./certificates/ca.crt adminuser@20.41.77.179:/home/adminuser/ca.crt
adminuser@20.41.77.179's password: ********
ca.crt      

$ mv /home/adminuser/ca.crt /opt/spire/certs/ca.crt
$ mv /home/adminuser/ca.key /opt/spire/certs/ca.key

Then we are going to create a simple SPIRE server configuration that will bind the server API to 127.0.0.1:8081, the it will use the root.linkerd.cluster.local as the trust domain and load the previously uploaded CA and key from /opt/spire/certs/ca.crt and /opt/spire/certs/ca.key.

cat >/opt/spire/server.cfg <<EOL
server {
    bind_address = "127.0.0.1"
    bind_port = "8081"
    trust_domain = "root.linkerd.cluster.local"
    data_dir = "/opt/spire/data/server"
    log_level = "DEBUG"
    ca_ttl = "168h"
    default_x509_svid_ttl = "48h"
}
plugins {
    DataStore "sql" {
        plugin_data {
            database_type = "sqlite3"
            connection_string = "/opt/spire/data/server/datastore.sqlite3"
        }
    }
    KeyManager "disk" {
        plugin_data {
            keys_path = "/opt/spire/data/server/keys.json"
        }
    }
    NodeAttestor "join_token" {
        plugin_data {}
    }
    UpstreamAuthority "disk" {
        plugin_data {
            cert_file_path = "/opt/spire/certs/ca.crt"
            key_file_path = "/opt/spire/certs/ca.key"
        }
    }
}
EOL

Then, it’s time for the SPIRE agent. In this case, we will need to instruct it to communicate with the server at 127.0.0.1:8081, use the same trust domain, root.linkerd.cluster.local and use the unix plugin for workload attestation.

cat >/opt/spire/agent.cfg <<EOL
agent {
    data_dir = "/opt/spire/data/agent"
    log_level = "DEBUG"
    trust_domain = "root.linkerd.cluster.local"
    server_address = "localhost"
    server_port = 8081
    insecure_bootstrap = true
}
plugins {
   KeyManager "disk" {
        plugin_data {
            directory = "/opt/spire/data/agent"
        }
    }
    NodeAttestor "join_token" {
        plugin_data {}
    }
    WorkloadAttestor "unix" {
        plugin_data {}
    }
}
EOL

We can now start the SPIRE server.

$ ./opt/spire/bin/spire-server run -config ./opt/spire/server.cfg
...
INFO[0000] Using legacy downstream X509 CA TTL calculation by default; this default will change in a future release 
WARN[0000] default_x509_svid_ttl is too high for the configured ca_ttl value. SVIDs with shorter lifetimes may be issued. Please set default_x509_svid_ttl to 28h or less, or the ca_ttl to 288h or more, to guarantee the full default_x509_svid_ttl lifetime when CA rotations are scheduled. 
WARN[0000] Current umask 0022 is too permissive; setting umask 0027 
INFO[0000] Configured                                    admin_ids="[]" data_dir=/opt/spire/data/server launch_log_level=debug version=1.11.2
INFO[0000] Opening SQL database                          db_type=sqlite3 subsystem_name=sql
INFO[0000] Initializing new database                     subsystem_name=sql
INFO[0000] Connected to SQL database                     read_only=false subsystem_name=sql type=sqlite3 version=3.46.1
INFO[0000] Configured DataStore                          reconfigurable=false subsystem_name=catalog
INFO[0000] Configured plugin                             external=false plugin_name=disk plugin_type=KeyManager reconfigurable=false subsystem_name=catalog
INFO[0000] Plugin loaded                                 external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog
INFO[0000] Configured plugin                             external=false plugin_name=join_token plugin_type=NodeAttestor reconfigurable=false subsystem_name=catalog
INFO[0000] Plugin loaded                                 external=false plugin_name=join_token plugin_type=NodeAttestor subsystem_name=catalog
INFO[0000] Configured plugin                             external=false plugin_name=disk plugin_type=UpstreamAuthority reconfigurable=false subsystem_name=catalog
INFO[0000] Plugin loaded                                 external=false plugin_name=disk plugin_type=UpstreamAuthority subsystem_name=catalog
DEBU[0000] Loading journal from datastore                subsystem_name=ca_manager
INFO[0000] There is not a CA journal record that matches any of the local X509 authority IDs  subsystem_name=ca_manager
INFO[0000] Journal loaded                                jwt_keys=0 subsystem_name=ca_manager x509_cas=0
DEBU[0000] Preparing X509 CA                             slot=A subsystem_name=ca_manager
DEBU[0000] There is no active X.509 authority yet. Can't save CA journal in the datastore  subsystem_name=ca_manager
INFO[0000] X509 CA prepared                              expiration="2025-03-10 10:57:17 +0000 UTC" issued_at="2025-03-03 10:57:17.573729853 +0000 UTC" local_authority_id=721ccaf61807f4d9d1fe258476359e740feeb15e self_signed=false slot=A subsystem_name=ca_manager upstream_authority_id=737a7f3dfd9afd9669a777208b012bab53bf1164
INFO[0000] X509 CA activated                             expiration="2025-03-10 10:57:17 +0000 UTC" issued_at="2025-03-03 10:57:17.573729853 +0000 UTC" local_authority_id=721ccaf61807f4d9d1fe258476359e740feeb15e slot=A subsystem_name=ca_manager upstream_authority_id=737a7f3dfd9afd9669a777208b012bab53bf1164
INFO[0000] Creating a new CA journal entry               subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=1 local_authority_id=721ccaf61807f4d9d1fe258476359e740feeb15e subsystem_name=ca_manager
DEBU[0000] Successfully rotated X.509 CA                 subsystem_name=ca_manager trust_domain_id="spiffe://root.linkerd.cluster.local" ttl=604799.402084726
DEBU[0000] Preparing JWT key                             slot=A subsystem_name=ca_manager
WARN[0000] UpstreamAuthority plugin does not support JWT-SVIDs. Workloads managed by this server may have trouble communicating with workloads outside this cluster when using JWT-SVIDs.  plugin_name=disk subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=1 local_authority_id=721ccaf61807f4d9d1fe258476359e740feeb15e subsystem_name=ca_manager
INFO[0000] JWT key prepared                              expiration="2025-03-10 10:57:17.597952274 +0000 UTC" issued_at="2025-03-03 10:57:17.597952274 +0000 UTC" local_authority_id=6Yb52ncPDI4FDZy2unga1133Vne6HS8d slot=A subsystem_name=ca_manager
INFO[0000] JWT key activated                             expiration="2025-03-10 10:57:17.597952274 +0000 UTC" issued_at="2025-03-03 10:57:17.597952274 +0000 UTC" local_authority_id=6Yb52ncPDI4FDZy2unga1133Vne6HS8d slot=A subsystem_name=ca_manager
DEBU[0000] Successfully stored CA journal entry in datastore  ca_journal_id=1 local_authority_id=721ccaf61807f4d9d1fe258476359e740feeb15e subsystem_name=ca_manager
DEBU[0000] Rotating server SVID                          subsystem_name=svid_rotator
DEBU[0000] Signed X509 SVID                              expiration="2025-03-05T10:57:17Z" spiffe_id="spiffe://root.linkerd.cluster.local/spire/server" subsystem_name=svid_rotator
INFO[0000] Building in-memory entry cache                subsystem_name=endpoints
INFO[0000] Completed building in-memory entry cache      subsystem_name=endpoints
INFO[0000] Logger service configured                     launch_log_level=debug
DEBU[0000] Initializing health checkers                  subsystem_name=health
DEBU[0000] Initializing API endpoints                    subsystem_name=endpoints
INFO[0000] Starting Server APIs                          address="127.0.0.1:8081" network=tcp subsystem_name=endpoints
INFO[0000] Starting Server APIs                          address=/tmp/spire-server/private/api.sock network=unix subsystem_name=endpoints

Once it is up and running, we can generate a one-time join token that the agent will use to attest itself to the server.

$ /opt/spire/bin/spire-server token generate -spiffeID spiffe://root.linkerd.cluster.local/agent -output json | jq -r '.value'
5f497c6c-4fa5-45bd-b1ce-9d7770a7761b

Then, we can start the SPIRE agent.

$ /opt/spire/bin/spire-agent run -config /opt/spire/agent.cfg -joinToken "5f497c6c-4fa5-45bd-b1ce-9d7770a7761b"
INFO[0000] Creating spire agent UDS directory            dir=/tmp/spire-agent/public
WARN[0000] Current umask 0022 is too permissive; setting umask 0027 
INFO[0000] Starting agent                                data_dir=/opt/spire/data/agent version=1.11.2
INFO[0000] Configured plugin                             external=false plugin_name=disk plugin_type=KeyManager reconfigurable=false subsystem_name=catalog
INFO[0000] Plugin loaded                                 external=false plugin_name=disk plugin_type=KeyManager subsystem_name=catalog
INFO[0000] Plugin loaded                                 external=false plugin_name=join_token plugin_type=NodeAttestor subsystem_name=catalog
INFO[0000] Configured plugin                             external=false plugin_name=unix plugin_type=WorkloadAttestor reconfigurable=false subsystem_name=catalog
INFO[0000] Plugin loaded                                 external=false plugin_name=unix plugin_type=WorkloadAttestor subsystem_name=catalog
INFO[0000] Bundle is not found                           subsystem_name=attestor
DEBU[0000] No pre-existing agent SVID found. Will perform node attestation  subsystem_name=attestor
INFO[0000] SVID is not found. Starting node attestation  subsystem_name=attestor
WARN[0000] Insecure bootstrap enabled; skipping server certificate verification  subsystem_name=attestor
INFO[0000] Node attestation was successful               reattestable=false spiffe_id="spiffe://root.linkerd.cluster.local/spire/agent/join_token/5f497c6c-4fa5-45bd-b1ce-9d7770a7761b" subsystem_name=attestor
DEBU[0000] Entry created                                 entry=6f0e6ebd-cb2f-48ac-9919-30d6e2820ca8 selectors_added=1 spiffe_id="spiffe://root.linkerd.cluster.local/agent" subsystem_name=cache_manager
DEBU[0000] Renewing stale entries                        cache_type=workload count=1 limit=500 subsystem_name=manager
INFO[0000] Creating X509-SVID                            entry_id=6f0e6ebd-cb2f-48ac-9919-30d6e2820ca8 spiffe_id="spiffe://root.linkerd.cluster.local/agent" subsystem_name=manager
DEBU[0000] SVID updated                                  entry=6f0e6ebd-cb2f-48ac-9919-30d6e2820ca8 spiffe_id="spiffe://root.linkerd.cluster.local/agent" subsystem_name=cache_manager
DEBU[0000] Bundle added                                  subsystem_name=svid_store_cache trust_domain_id=root.linkerd.cluster.local
DEBU[0000] Initializing health checkers                  subsystem_name=health
INFO[0000] Starting Workload and SDS APIs                address=/tmp/spire-agent/public/api.sock network=unix subsystem_name=endpoints

Finally, create a registration entry for the process that will act as the Linkerd proxy outside of Kubernetes. The -selector "unix:uid:998" means any process running under UID 998 on this agent node will receive the SPIFFE ID specified:

/opt/spire/bin/spire-server entry create \
    -spiffeID "spiffe://root.linkerd.cluster.local/proxy-harness" \
    -parentID "spiffe://root.linkerd.cluster.local/agent" \
    -selector "unix:uid:998"

Install the Linkerd Proxy

With the SPIRE agent running and issuing identities, we can now set up Linkerd’s proxy harness on the virtual machine. The harness is a small daemon that installs the Linkerd proxy, configures iptables for traffic redirection, and registers itself with the Linkerd control plane running in the Kubernetes cluster. First we will need to download it with the following command:

wget https://github.com/BuoyantIO/linkerd-buoyant/releases/download/enterprise-2.17.1/linkerd-proxy-harness-enterprise-2.17.1-amd64.deb
apt-get -y install ./linkerd-proxy-harness-enterprise-2.17.1-amd64.deb

Create a Workload Group in Kubernetes

Then, in the Kubernetes cluster, we will need to deploy an ExternalGroup resource. This tells the Linkerd control plane that an external workload (running outside of Kubernetes) is part of the service mesh under the namespace training. The readiness probe ensures that Linkerd can verify when the proxy harness is up and healthy.

$ kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: training
---
apiVersion: workload.buoyant.io/v1alpha1
kind: ExternalGroup
metadata:
  name: training-vm
  namespace: training
spec:
  probes:
  - failureThreshold: 1
    httpGet:
      path: /ready
      port: 80
      scheme: HTTP
      host: 127.0.0.1
    initialDelaySeconds: 3
    periodSeconds: 10
    successThreshold: 1
    timeoutSeconds: 1
  template:
    metadata:
      labels:
        app: training-app
        location: vm
    ports:
    - port: 80
EOF

Next, use harnessctl (installed with the harness package) to point the harness at your Linkerd control plane:

harnessctl set-config \
  --workload-group-name=training-vm \
  --workload-group-namespace=training \
  --control-plane-address=linkerd-autoregistration.linkerd.svc.cluster.local.:8081 \
  --control-plane-identity=linkerd-autoregistration.linkerd.serviceaccount.identity.linkerd.cluster.local
Config updated

Finally, start the daemon:

systemctl start linkerd-proxy-harness

Running journalctl will output the harness logs where we can see it updating the iptables rules to ensure that all the traffic goes through the proxy.

journalctl -u linkerd-proxy-harness -f
...
Mar 03 11:36:23 vm-training-krc systemd[1]: Starting Linkerd proxy harness...
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -D PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="iptables v1.8.7 (legacy): Couldn't load target `PROXY_INIT_REDIRECT':No such file or directory\n\nTry `iptables -h' or 'iptables --help' for more information.\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -D OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="iptables v1.8.7 (legacy): Couldn't load target `PROXY_INIT_OUTPUT':No such file or directory\n\nTry `iptables -h' or 'iptables --help' for more information.\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -F PROXY_INIT_OUTPUT"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="iptables: No chain/target/match by that name.\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -F PROXY_INIT_REDIRECT"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="iptables: No chain/target/match by that name.\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -X PROXY_INIT_OUTPUT"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="iptables: No chain/target/match by that name.\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -X PROXY_INIT_REDIRECT"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="iptables: No chain/target/match by that name.\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy-save -t nat"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="# Generated by iptables-save v1.8.7 on Mon Mar  3 11:36:23 2025\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Mon Mar  3 11:36:23 2025\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy-save -t nat"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="# Generated by iptables-save v1.8.7 on Mon Mar  3 11:36:23 2025\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Mon Mar  3 11:36:23 2025\n"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -N PROXY_INIT_REDIRECT"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4567,4568"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp -j REDIRECT --to-port 4143 -m comment --comment proxy-init/redirect-all-incoming-to-proxy-port"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -N PROXY_INIT_OUTPUT"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -m owner --uid-owner 998 -j RETURN -m comment --comment proxy-init/ignore-proxy-user-id"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -o lo -j RETURN -m comment --comment proxy-init/ignore-loopback"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp --match multiport --dports 4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4567,4568"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp -j REDIRECT --to-port 4140 -m comment --comment proxy-init/redirect-all-outgoing-to-proxy-port"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy -t nat -A OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="/usr/sbin/iptables-legacy-save -t nat"
Mar 03 11:36:23 vm-training-krc harness-init[2131]: time="2025-03-03T11:36:23Z" level=info msg="# Generated by iptables-save v1.8.7 on Mon Mar  3 11:36:23 2025\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_OUTPUT - [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -m comment --comment \"proxy-init/install-proxy-init-prerouting\" -j PROXY_INIT_REDIRECT\n-A OUTPUT -m comment --comment \"proxy-init/install-proxy-init-output\" -j PROXY_INIT_OUTPUT\n-A PROXY_INIT_OUTPUT -m owner --uid-owner 998 -m comment --comment \"proxy-init/ignore-proxy-user-id\" -j RETURN\n-A PROXY_INIT_OUTPUT -o lo -m comment --comment \"proxy-init/ignore-loopback\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 4567,4568 -m comment --comment \"proxy-init/ignore-port-4567,4568\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m comment --comment \"proxy-init/redirect-all-outgoing-to-proxy-port\" -j REDIRECT --to-ports 4140\n-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4567,4568 -m comment --comment \"proxy-init/ignore-port-4567,4568\" -j RETURN\n-A PROXY_INIT_REDIRECT -p tcp -m comment --comment \"proxy-init/redirect-all-incoming-to-proxy-port\" -j REDIRECT --to-ports 4143\nCOMMIT\n# Completed on Mon Mar  3 11:36:23 2025\n"
Mar 03 11:36:23 vm-training-krc systemd[1]: Started Linkerd proxy harness.
Mar 03 11:36:23 vm-training-krc sudo[2160]:     root : PWD=/ ; USER=proxyharness ; COMMAND=/bin/bash -c '\\/bin\\/bash -c \\/var\\/lib\\/linkerd\\/bin\\/harness'
Mar 03 11:36:23 vm-training-krc sudo[2160]: pam_unix(sudo:session): session opened for user proxyharness(uid=998) by (uid=0)
Mar 03 11:36:24 vm-training-krc start-harness.sh[2161]: 2025-03-03T11:36:24.259338Z  INFO harness: Harness admin interface on 127.0.0.1:4192
Mar 03 11:36:24 vm-training-krc start-harness.sh[2161]: 2025-03-03T11:36:24.477490Z  INFO harness: identity used for control: spiffe://root.linkerd.cluster.local/proxy-harness
Mar 03 11:36:24 vm-training-krc start-harness.sh[2161]: 2025-03-03T11:36:24.498363Z  INFO controller{addr=linkerd-autoregistration.linkerd.svc.cluster.local:8081}: linkerd_pool_p2c: Adding endpoint addr=10.0.228.3:8081
Mar 03 11:36:24 vm-training-krc start-harness.sh[2161]: 2025-03-03T11:36:24.552077Z  INFO report_health:controller{addr=linkerd-autoregistration.linkerd.svc.cluster.local:8081}: linkerd_pool_p2c: Adding endpoint addr=10.0.228.3:8081
Mar 03 11:36:24 vm-training-krc start-harness.sh[2164]: [     0.120238s]  INFO ThreadId(01) linkerd2_proxy: release 0.0.0-dev (48069376) by Buoyant, Inc. on 2025-02-04T06:51:38Z
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.276965s]  INFO ThreadId(01) linkerd2_proxy::rt: Using single-threaded proxy runtime
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831749s]  INFO ThreadId(01) linkerd2_proxy: Admin interface on 0.0.0.0:4191
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831858s]  INFO ThreadId(01) linkerd2_proxy: Inbound interface on 0.0.0.0:4143
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831863s]  INFO ThreadId(01) linkerd2_proxy: Outbound interface on 127.0.0.1:4140
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831866s]  INFO ThreadId(01) linkerd2_proxy: Tap DISABLED
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831870s]  INFO ThreadId(01) linkerd2_proxy: SNI is training-vm-7ef4eba0.training.external.identity.linkerd.cluster.local
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831874s]  INFO ThreadId(01) linkerd2_proxy: Local identity is spiffe://root.linkerd.cluster.local/proxy-harness
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.831878s]  INFO ThreadId(01) linkerd2_proxy: Destinations resolved via linkerd-dst-headless.linkerd.svc.cluster.local:8086 (linkerd-destination.linkerd.serviceaccount.identity.linkerd.cluster.local)
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.868536s]  INFO ThreadId(01) policy:controller{addr=linkerd-policy.linkerd.svc.cluster.local:8090}: linkerd_pool_p2c: Adding endpoint addr=10.244.0.200:8090
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.868739s]  INFO ThreadId(01) dst:controller{addr=linkerd-dst-headless.linkerd.svc.cluster.local:8086}: linkerd_pool_p2c: Adding endpoint addr=10.244.0.200:8086
Mar 03 11:36:25 vm-training-krc start-harness.sh[2164]: [     0.889893s]  INFO ThreadId(02) daemon:identity: linkerd_app: Certified identity id=spiffe://root.linkerd.cluster.local/proxy-harness

Meanwhile, in the SPIRE agent logs, we’ll see entries confirming that the harness process (PID 3817 in the example) is attested under the proxyharness user (UID 998). The SPIRE agent issues an x509 SVID for spiffe://root.linkerd.cluster.local/proxy-harness, matching the registration entry we created earlier:

...
INFO[1013] Creating X509-SVID                            entry_id=5c0955f0-335c-4b5b-a3b4-9c0eae649e39 spiffe_id="spiffe://root.linkerd.cluster.local/proxy-harness" subsystem_name=manager
DEBU[1013] SVID updated                                  entry=5c0955f0-335c-4b5b-a3b4-9c0eae649e39 spiffe_id="spiffe://root.linkerd.cluster.local/proxy-harness" subsystem_name=cache_manager
DEBU[1013] PID attested to have selectors                pid=3817 selectors="[type:\"unix\" value:\"uid:998\" type:\"unix\" value:\"user:proxyharness\" type:\"unix\" value:\"gid:999\" type:\"unix\" value:\"group:proxyharness\" type:\"unix\" value:\"supplementary_gid:999\" type:\"unix\" value:\"supplementary_group:proxyharness\"]" subsystem_name=workload_attestor
DEBU[1013] Fetched X.509 SVID                            count=1 method=FetchX509SVID pid=3817 registered=true service=WorkloadAPI spiffe_id="spiffe://root.linkerd.cluster.local/proxy-harness" subsystem_name=endpoints ttl=172799.338898565
DEBU[1013] PID attested to have selectors                pid=3817 selectors="[type:\"unix\" value:\"uid:998\" type:\"unix\" value:\"user:proxyharness\" type:\"unix\" value:\"gid:999\" type:\"unix\" value:\"group:proxyharness\" type:\"unix\" value:\"supplementary_gid:999\" type:\"unix\" value:\"supplementary_group:proxyharness\"]" subsystem_name=workload_attestor
DEBU[1013] Fetched X.509 SVID                            count=1 method=FetchX509SVID pid=3817 registered=true service=WorkloadAPI spiffe_id="spiffe://root.linkerd.cluster.local/proxy-harness" subsystem_name=endpoints ttl=172799.334645563

Testing

With all components running, we’re now ready to verify traffic flow between a workload running on our VM and services in the Kubernetes cluster. First, we need to install Docker on the VM:

sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "${UBUNTU_CODENAME:-$VERSION_CODENAME}") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

Then, start a simple HTTP echo service:

docker run -p 80:80 hashicorp/http-echo:latest echo -text="Welcome from $(hostname)"

To confirms the workload is serving traffic internally on port 80 we can execute the following command:

$ curl localhost:80
Welcome from vm-training-krc

Next, we’ll define a service in the training namespace pointing to the ExternalGroup. We’ll also deploy a test pod that has the Linkerd sidecar injected.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Service
metadata:
  name: training-vm
  namespace: training
spec:
  type: ClusterIP
  selector:
    app: training-app
    location: vm
  ports:
  - port: 80
    protocol: TCP
    name: one
---
apiVersion: v1
kind: Service
metadata:
  name: test-server
spec:
  type: ClusterIP
  selector:
    app: test-server
  ports:
  - port: 80
    protocol: TCP
---
apiVersion: v1
kind: Pod
metadata:
  name: curl-test
  annotations:
    linkerd.io/inject: enabled
spec:
  containers:
  - name: curl
    image: curlimages/curl:latest
    command: ["sleep", "infinity"]
EOF

Once the curl-test pod is running, you can exec into it and issue a request to the VM workload:

$ kubectl exec curl-test -c curl -- curl http://training-vm.training.svc.cluster.local:80
Welcome from vm-training-krc

The “Welcome from vm-training-krc” response confirms that the ExternalGroup and Linkerd proxy harness are working correctly, allowing in-cluster traffic to reach the VM workload. Next, we’ll confirm traffic can flow from the VM to a service in Kubernetes. To do so, we will deploy a simple application in the cluster.

kubectl apply -f - <<EOF
apiVersion: v1
kind: Namespace
metadata:
  name: simple-app
  annotations:
    linkerd.io/inject: enabled
---
apiVersion: v1
kind: Service
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  selector:
    app: simple-app-v1
    version: v1
  ports:
    - port: 80
      targetPort: 5678
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simple-app-v1
  namespace: simple-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simple-app-v1
      version: v1
  template:
    metadata:
      labels:
        app: simple-app-v1
        version: v1
    spec:
      containers:
        - name: http-app
          image: hashicorp/http-echo:latest
          args:
            - "-text=Simple App v1"
          ports:
            - containerPort: 5678
EOF

On the VM, you can send a request to the simple-app-v1.simple-app.svc.cluster.local service using either curl or wget.

$ curl -v http://simple-app-v1.simple-app.svc.cluster.local:80
*   Trying 10.0.57.85:80...
* Connected to simple-app-v1.simple-app.svc.cluster.local (10.0.57.85) port 80 (#0)
> GET / HTTP/1.1
> Host: simple-app-v1.simple-app.svc.cluster.local
> User-Agent: curl/7.81.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< x-app-name: http-echo
< x-app-version: 1.0.0
< date: Mon, 03 Mar 2025 12:11:19 GMT
< content-length: 29
< content-type: text/plain; charset=utf-8
< 
Simple App v1
* Connection #0 to host simple-app-v1.simple-app.svc.cluster.local left intact

The “Simple App v1” response confirms that the VM-based application can reach in-cluster services via Linkerd.

References:

PortWorx Survery: https://www.cncf.io/blog/2024/06/06/the-voice-of-kubernetes-experts-report-2024-the-data-trends-driving-the-future-of-the-enterprise/
Linkerd Open Source Documentation: https://linkerd.io/2-edge/tasks/adding-non-kubernetes-workloads/
Linkerd Enterprise Harness Documentation: https://docs.buoyant.io/buoyant-enterprise-linkerd/latest/tasks/managing-external-workloads/

Mastering Service Mesh with Linkerd

Ivan Porta — Thu, 14 Nov 2024 13:08:54 +0000

As enterprises increasingly adopt microservices architecture to benefit from faster development, independent scalability, easier management, improved fault tolerance, better monitoring, and cost-effectiveness, the demand for robust infrastructure is growing. According to a 2023 Gartner report, 74% of respondents are using microservices, while 23% plan to adopt them.

However, as these services scale, the architecture can become complex, involving multiple clusters across different locations (on-premise and cloud). Managing networking, monitoring, and security in such environments becomes challenging. This is where a service mesh comes into play. In this article, I will dive into the fundamentals of service mesh technology, focusing on one of the leading players, Buoyant’s Linkerd.

If you’re more interested in the market trends, you’ll be pleased to know that the service mesh sector is experiencing rapid growth. This rise is fueled by the adoption of microservices, increasing cyberattacks, and supportive regulations like the EU’s “Path to the Digital Decade” initiative. Additionally, investments in AI are pushing countries like South Korea to leverage service mesh to transform data into intelligence. The service mesh market, valued at approximately USD 0.24 billion in 2022, is projected to grow to between USD 2.32 billion and USD 3.45 billion by 2030, with a compound annual growth rate (CAGR) of 39.7% from 2023 to 2030, according to reports from Next Move Strategy Consulting and Global Info Research.

What is Service Mesh?

A service mesh is an additional infrastructure layer that centralizes the logic governing service-to-service communication, abstracting it away from individual services and managing it at the network layer. This enables developers to focus on building features without worrying about communication failures, retries, or routing, as the service mesh consistently handles these aspects across all services.

Generally speaking, a service mesh provides the following benefits:

Automatic retries and circuit breaking: It ensures reliable communication by handling failures and prevent cascading failures by applying circuit breaking patterns.
Automated service discovery: It automatically detects services within the mesh without requiring manual configuration.
Improved security: Encrypts communication between services by using mTLS.
Fine-grained traffic control: It enables dynamic traffic routing, supporting deployment strategies like blue-green deployments, canary releases, and A/B testing without modifying the application’s cod.
Observability: It enhances monitoring with additional metrics, such as request success rates, failures, and requests per second, collected directly from the proxy.

How does service mesh works?

The architecture of service mesh has two components: a control planes and a data plane.

Data plane

The data plane consists of proxies (following the sidecar pattern) that are deployed alongside each application container instance within the same pod. These proxies intercept all inbound and outbound traffic to the service and, acting as intermediaries, implement features such as circuit breaking, request retries, load balancing, and enforcing mTLS (mutual TLS) for secure communication.

Control Plane

The control plane is responsible for managing and configuring the proxies in the data plane. While it doesn’t handle network packets directly, it orchestrates policies and configurations across the entire mesh. It does so by:

Maintaining a service registry and dynamically updating the list of services as they scale, join, or leave.
Defining and applying policies like traffic routing, security rules, and rate limiting.
Aggregating metrics, logs, and traces from the proxies for monitoring and observability.
Handling certificates and issuing cryptographic identities for mTLS to ensure secure communication between services.

The Linkerd control plane is composed of the following components:

Destination pod: This pod takes care of providing the IP and port of the destination service to a proxy when it’s sending traffic to another service. It also informs the proxy of the TLS identity it should expect on the other end of the connection. Finally, it fetches both policy information — which includes the types of requests allowed by authorization and traffic policies — and service profile information, which defines how the service should behave in terms of retries, per-route metrics, and timeouts.
Identity pod: It acts as a Certificate Authority by issuing signed certificates to proxies that send it a Certificate Signing Request (CSR) during their initialization.
Proxy injector pod: This Kubernetes admission controller checks for the existence of the annotation linkerd.io/inject: enabled when a pod is created. If present, it injects the proxy-init and linkerd-proxy containers into the pod.

A little bit of history of Linkerd

The major players in the service mesh market include Red Hat with OpenShift Service Mesh, HashiCorp with Consul, F5 with NGINX Service Mesh, and Istio, originally developed by Google and later transitioned to the Cloud Native Computing Foundation (CNCF).

We’re focusing on Buoyant’s product, Linkerd, due to its commitment to performance and minimal resource footprint. This focus has made it one of the earliest and most performant service meshes available. Linkerd achieved CNCF Graduated status on July 28, 2021, underscoring its stability and widespread adoption by organizations such as Monzo, Geico, Wells Fargo, and Visa (data from HGData).

Linkerd is an open-source service mesh that was initially released in 2016, originally built on Finagle — a scalable microservice library. Its first proxy, written in Scala, leveraged Java and Scala’s networking features. However, due to the JVM dependency and it’s complex surface ares, as well as some limitations caused by design choices of Scala, Netty, and Finagle, its adoption has some friction to production environment. As a result, in 2017, Buoyant developed a new lightweight proxy written in Rust. At the time, Rust was still an emerging language, but Buoyant chose it over Scala, Go, and C++ for several key reasons:

Predictable performance: Go’s garbage collector can cause latency spikes during collection cycles, which ruled it out. Rust, on the other hand, offers predictable performance with no garbage collection overhead.
** Security:** Many security vulnerabilities, such as buffer overflows (e.g., Heartbleed), stem from unsafe memory management in languages like C and C++. Rust handles memory safety at compile time, significantly reducing the risk of such vulnerabilities.

How does Linkerd do Load Balancing?

When you create a Service in Kubernetes, an associated Endpoint is automatically created based on the selector defined in the service specification. This selector identifies which pods should be part of the service:

$ kubectl get svc -n vastaya application-vastaya-svc -o yaml
apiVersion: v1
kind: Service
...
spec:
  ...
  selector:
    app.kubernetes.io/instance: application
    app.kubernetes.io/name: vastaya

The corresponding Endpoint is populated with the IP addresses of the selected pods:

$ kubectl get endpoints -n vastaya application-vastaya-svc -o yaml
apiVersion: v1
kind: Endpoints
...
subsets:
- addresses:
  - ip: 10.244.1.157
    nodeName: minikube
    targetRef:
      kind: Pod
      name: application-vastaya-dplmt-647b4dbdc-9bnwj
      namespace: vastaya
      uid: e3219642-4428-4bbc-89ec-a892ca571639

$ kubectl get pods -n vastaya -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  ...
  status:
    hostIP: 192.168.49.2
    hostIPs:
    - ip: 192.168.49.2
    phase: Running
    podIP: 10.244.1.157
    podIPs:
    - ip: 10.244.1.157

When a container is sending a request to a service, the proxy intercepts the request and looks up the target IP address in the Kubernetes API, and if the IP corresponds to a Kubernetes Service, it load balances using the EWMA (exponentially weighted moving average) algorithm to find and send requests to the fastest endpoints associated with that Service. Additionally, it applies any existing Service Policy (custom CRD) associated to the service that will enable traffic management capabilities like canary deployments, and more.

How Does the Linkerd Proxy Intercept Packets?

When a packet arrives at the network interface, it passes through the iptables rules, which contain multiple chains. Once a rule is matched, an action is taken. Linkerd redirects the packets to the proxy using the PREROUTING and OUTPUT chains in the nat (Network Address Translation) table. These rules are updated by the linkerd-init container, which, as an init container, runs before the other containers start. It modifies the pod’s iptables, creating new chains and updating existing ones. You can inspect these changes by checking the container’s logs.

Note: The iptables are specific to the pod’s namespace and are separate from the node’s iptables.

$ kubectl -n vastaya logs application-vastaya-dplmt-647b4dbdc-9bnwj linkerd-init
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-09-18T23:45:26Z" level=info msg="# Generated by iptables-save v1.8.10 on Wed Sep 18 23:45:26 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\nCOMMIT\n# Completed on Wed Sep 18 23:45:26 2024\n"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_REDIRECT"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp --match multiport --dports 4190,4191,4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4190,4191,4567,4568"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_REDIRECT -p tcp -j REDIRECT --to-port 4143 -m comment --comment proxy-init/redirect-all-incoming-to-proxy-port"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PREROUTING -j PROXY_INIT_REDIRECT -m comment --comment proxy-init/install-proxy-init-prerouting"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -N PROXY_INIT_OUTPUT"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -j RETURN -m comment --comment proxy-init/ignore-proxy-user-id"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -o lo -j RETURN -m comment --comment proxy-init/ignore-loopback"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp --match multiport --dports 4567,4568 -j RETURN -m comment --comment proxy-init/ignore-port-4567,4568"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A PROXY_INIT_OUTPUT -p tcp -j REDIRECT --to-port 4140 -m comment --comment proxy-init/redirect-all-outgoing-to-proxy-port"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy -t nat -A OUTPUT -j PROXY_INIT_OUTPUT -m comment --comment proxy-init/install-proxy-init-output"
time="2024-09-18T23:45:26Z" level=info msg="/sbin/iptables-legacy-save -t nat"
time="2024-09-18T23:45:26Z" level=info msg="# Generated by iptables-save v1.8.10 on Wed Sep 18 23:45:26 2024\n*nat\n:PREROUTING ACCEPT [0:0]\n:INPUT ACCEPT [0:0]\n:OUTPUT ACCEPT [0:0]\n:POSTROUTING ACCEPT [0:0]\n:PROXY_INIT_OUTPUT - [0:0]\n:PROXY_INIT_REDIRECT - [0:0]\n-A PREROUTING -m comment --comment \"proxy-init/install-proxy-init-prerouting\" -j PROXY_INIT_REDIRECT\n-A OUTPUT -m comment --comment \"proxy-init/install-proxy-init-output\" -j PROXY_INIT_OUTPUT\n-A PROXY_INIT_OUTPUT -m owner --uid-owner 2102 -m comment --comment \"proxy-init/ignore-proxy-user-id\" -j RETURN\n-A PROXY_INIT_OUTPUT -o lo -m comment --comment \"proxy-init/ignore-loopback\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m multiport --dports 4567,4568 -m comment --comment \"proxy-init/ignore-port-4567,4568\" -j RETURN\n-A PROXY_INIT_OUTPUT -p tcp -m comment --comment \"proxy-init/redirect-all-outgoing-to-proxy-port\" -j REDIRECT --to-ports 4140\n-A PROXY_INIT_REDIRECT -p tcp -m multiport --dports 4190,4191,4567,4568 -m comment --comment \"proxy-init/ignore-port-4190,4191,4567,4568\" -j RETURN\n-A PROXY_INIT_REDIRECT -p tcp -m comment --comment \"proxy-init/redirect-all-incoming-to-proxy-port\" -j REDIRECT --to-ports 4143\nCOMMIT\n# Completed on Wed Sep 18 23:45:26 2024\n"

Viewing these iptables rules directly can be tricky. You’ll need to run iptables-legacy as sudo, but networking restrictions may prevent you from doing this inside a container. Instead, you can access them from the node by using nsenter to enter the pod’s network namespace. Here’s how you can do it:

SSH into the node (in this case, using Minikube):

$ minikube ssh

Find the container ID:

$ docker ps | grep application-vastaya-dplmt-647b4dbdc-9bnwj
3f1f370f44e4   de30a4cc9fd9                "/docker-entrypoint.…"   39 minutes ago   Up 39 minutes             k8s_application-vastaya-cntr_application-vastaya-dplmt-647b4dbdc-9bnwj_vastaya_e3219642-4428-4bbc-89ec-a892ca571639_3
85b5f32df37e   4585864a6b91                "/usr/lib/linkerd/li…"   39 minutes ago   Up 39 minutes             k8s_linkerd-proxy_application-vastaya-dplmt-647b4dbdc-9bnwj_vastaya_e3219642-4428-4bbc-89ec-a892ca571639_3
e81bac0a9f9c   registry.k8s.io/pause:3.9   "/pause"                 39 minutes ago   Up 39 minutes             k8s_POD_application-vastaya-dplmt-647b4dbdc-9bnwj_vastaya_e3219642-4428-4bbc-89ec-a892ca571639_3

Get the process ID of the container:

$ docker@minikube:~$ docker inspect --format '{{.State.Pid}}' 3f1f370f44e4  
6859

Use nsenter to access the pod’s network namespace:

$ sudo nsenter -t 6859 -n

Finally, you can view the iptables rules:

$ root@minikube:/home/docker# iptables-legacy -t nat -L
Chain PREROUTING (policy ACCEPT)
target               prot opt source               destination         
PROXY_INIT_REDIRECT  all  --  anywhere             anywhere             /* proxy-init/install-proxy-init-prerouting */

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target             prot opt source               destination         
PROXY_INIT_OUTPUT  all  --  anywhere             anywhere             /* proxy-init/install-proxy-init-output */

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain PROXY_INIT_OUTPUT (1 references)
target     prot opt source               destination         
RETURN     all  --  anywhere             anywhere             owner UID match 2102 /* proxy-init/ignore-proxy-user-id */
RETURN     all  --  anywhere             anywhere             /* proxy-init/ignore-loopback */
RETURN     tcp  --  anywhere             anywhere             multiport dports 4567,4568 /* proxy-init/ignore-port-4567,4568 */
REDIRECT   tcp  --  anywhere             anywhere             /* proxy-init/redirect-all-outgoing-to-proxy-port */ redir ports 4140

Chain PROXY_INIT_REDIRECT (1 references)
target     prot opt source               destination         
RETURN     tcp  --  anywhere             anywhere             multiport dports sieve,4191,4567,4568 /* proxy-init/ignore-port-4190,4191,4567,4568 */
REDIRECT   tcp  --  anywhere

As you can see from the PROXY_INIT_REDIRECT chain, it redirect all incoming traffic to the port 4143, which is the port where the Linkerd proxy container is running.

$ kubectl get pods application-vastaya-dplmt-647b4dbdc-9bnwj -n vastaya -o yaml
apiVersion: v1
kind: Pod
metadata:
  ...
  name: application-vastaya-dplmt-647b4dbdc-9bnwj
spec:
  containers:
    image: cr.l5d.io/linkerd/proxy:edge-24.9.2
    name: linkerd-proxy
    ports:
    - containerPort: 4143
      name: linkerd-proxy
      protocol: TCP
    ...

The proxy processes the traffic and forwards it to the OUTPUT chain while preserving the original destination IP and port using the SO_ORIGINAL_DST socket option. Since the request will be owned by the proxy, the packet will be forwarded to the application.

Install Linkerd

There are two primary ways to install Linkerd as a service mesh: via Helm charts or the Linkerd CLI.

Using CLI

Installing Linkerd via the CLI is straightforward:

$ curl --proto '=https' --tlsv1.2 -sSfL https://run.linkerd.io/install-edge | sh
$ export PATH=$HOME/.linkerd2/bin:$PATH

Once the CLI is installed, you can verify that the environment is set up correctly for the installation:

$ linkerd check --pre

Next, similar to the Helm Chart installation, you’ll need to install the CRDs first, followed by the control plane:

$ linkerd install --crds | kubectl apply -f -
$ linkerd install | kubectl apply -f -

The CLI goes beyond just installation, offering fine-grained control over operations and useful extensions:

$ linkerd
linkerd manages the Linkerd service mesh.Usage:
  linkerd [command]Available Commands:
  authz        List authorizations for a resource
  check        Check the Linkerd installation for potential problems
  completion   Output shell completion code for the specified shell (bash, zsh or fish)
  diagnostics  Commands used to diagnose Linkerd components
  help         Help about any command
  identity     Display the certificate(s) of one or more selected pod(s)
  inject       Add the Linkerd proxy to a Kubernetes config
  install      Output Kubernetes configs to install Linkerd
  install-cni  Output Kubernetes configs to install Linkerd CNI
  jaeger       jaeger manages the jaeger extension of Linkerd service mesh
  multicluster Manages the multicluster setup for Linkerd
  profile      Output service profile config for Kubernetes
  prune        Output extraneous Kubernetes resources in the linkerd control plane
  uninject     Remove the Linkerd proxy from a Kubernetes config
  uninstall    Output Kubernetes resources to uninstall Linkerd control plane
  upgrade      Output Kubernetes configs to upgrade an existing Linkerd control plane
  version      Print the client and server version information
  viz          viz manages the linkerd-viz extension of Linkerd service mesh

Using Helm

Differently than Linekrd CLI, aprerequisite for installing Linkerd via Helm is generating certificates for mutual TLS (mTLS). To handle this, we will use the step CLI for certificate management.

To install the step CLI, run the following:

$ wget <https://dl.smallstep.com/cli/docs-cli-install/latest/step-cli_amd64.deb>
$ sudo dpkg -i step-cli_amd64.deb

Next, generate the required certificates:

$ step certificate create root.linkerd.cluster.local ca.crt ca.key --profile root-ca --no-password --insecure
$ step certificate create identity.linkerd.cluster.local issuer.crt issuer.key --profile intermediate-ca --not-after 8760h --no-password --insecure --ca ca.crt --ca-key ca.key

After creating the certificates, you can proceed with installing Linkerd in two steps using Helm charts. In this demo, I’ll be installing it on a local Minikube instance. Since Minikube uses Docker as the container runtime, the proxy-init container must run with root privileges (--set runAsRoot=true).

First, add the Linkerd Helm repository and update it:

$ helm repo add linkerd-edge https://helm.linkerd.io/edge
$ helm repo update

Then, install the CRDs and control plane:

$ helm install linkerd-crds linkerd-edge/linkerd-crds -n linkerd --create-namespace
$ helm install linkerd-control-plane linkerd-edge/linkerd-control-plane -n linkerd --create-namespace --set-file identityTrustAnchorsPEM=certificates/ca.crt --set-file identity.issuer.tls.crtPEM=certificates/issuer.crt --set-file identity.issuer.tls.keyPEM=certificates/issuer.key --set runAsRoot=true

Mesh your services

Once Linkerd’s CRDs and control plane are running in the cluster, you can start meshing your services. The proxy injector in Linkerd is implemented as a Kubernetes admission webhook, which automatically adds the proxy to new pods if the appropriate annotation is present in the namespace, deployment, or pod itself. Specifically, the annotation linkerd.io/inject: enabled is used to trigger proxy injection.

This injection adds two additional containers to each meshed pod:

linkerd-init: Configures the iptables to forward all incoming and outgoing TCP traffic through the proxy.
linkerd-proxy: The proxy itself, responsible for managing traffic, security, and observability.

Note: If the annotation is added to an existing namespace, deployment, or pod, the pod must be restarted to apply the changes as Kubernetes only triggers the webhook when creating or updating resources.

To manually enable injection for a deployment, use the following command:

$ kubectl annotate deployment -n vastaya application-vastaya-dplmt linkerd.io/inject=enabled
$ kubectl rollout restart -n vastaya deployment/projects-vastaya-dplmt

You can verify the injection by describing the pod:


$ kubectl describe pod -n vastaya application-vastaya-dplmt-647b4dbdc-9bnwj 
Name:             application-vastaya-dplmt-647b4dbdc-9bnwj
Namespace:        vastaya
...
Annotations:      linkerd.io/created-by: linkerd/proxy-injector edge-24.9.2
                  linkerd.io/inject: enabled
                  linkerd.io/proxy-version: edge-24.9.2
                  linkerd.io/trust-root-sha256: f6f154536a867a210de469e735af865c87a3eb61c77442bd9988353b4b632663
                  viz.linkerd.io/tap-enabled: true
Status:           Running
IP:               10.244.1.94
IPs:
  IP:           10.244.1.94
Controlled By:  ReplicaSet/application-vastaya-dplmt-647b4dbdc
Init Containers:
  linkerd-init:
    Container ID:    docker://06d8fedeac3d5d84b76aa2c4bb790f05e747402795247fe0a6087a49abd52e7a
    Image:           cr.l5d.io/linkerd/proxy-init:v2.4.1
    Image ID:        docker-pullable://cr.l5d.io/linkerd/proxy-init@sha256:e4ef473f52c453ea7895e9258738909ded899d20a252744cc0b9459b36f987ca
    Port:            <none>
    Host Port:       <none>
    SeccompProfile:  RuntimeDefault
    Args:
      --ipv6=false
      --incoming-proxy-port
      4143
      --outgoing-proxy-port
      4140
      --proxy-uid
      2102
      --inbound-ports-to-ignore
      4190,4191,4567,4568
      --outbound-ports-to-ignore
      4567,4568
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 18 Sep 2024 13:57:24 +0900
      Finished:     Wed, 18 Sep 2024 13:57:24 +0900
    Ready:          True
    Restart Count:  1
    Environment:    <none>
    Mounts:
      /run from linkerd-proxy-init-xtables-lock (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-kq29n (ro)
Containers:
  linkerd-proxy:
    Container ID:    docker://d91e9cdd9ef707538efc9467bc38ee35c95ed9d655d2d8f8a1b2a2834f910af4
    Image:           cr.l5d.io/linkerd/proxy:edge-24.9.2
    Image ID:        docker-pullable://cr.l5d.io/linkerd/proxy@sha256:43d1086980a64e14d1c3a732b0017efc8a9050bc05352e2dbefa9e954d6d607d
    Ports:           4143/TCP, 4191/TCP
    Host Ports:      0/TCP, 0/TCP
    SeccompProfile:  RuntimeDefault
    State:           Running
      Started:       Wed, 18 Sep 2024 13:57:25 +0900
    Last State:      Terminated
      Reason:        Completed
      Exit Code:     0
      Started:       Fri, 13 Sep 2024 20:02:28 +0900
      Finished:      Fri, 13 Sep 2024 22:32:19 +0900
    Ready:           True
    Restart Count:   1
    Liveness:        http-get http://:4191/live delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:       http-get http://:4191/ready delay=2s timeout=1s period=10s #success=1 #failure=3
    Environment:
      _pod_name:                                                 application-vastaya-dplmt-647b4dbdc-9bnwj (v1:metadata.name)
      _pod_ns:                                                   vastaya (v1:metadata.namespace)
      _pod_nodeName:                                              (v1:spec.nodeName)
      LINKERD2_PROXY_SHUTDOWN_ENDPOINT_ENABLED:                  false
      LINKERD2_PROXY_LOG:                                        warn,linkerd=info,hickory=error,[{headers}]=off,[{request}]=off
      ...
  application-vastaya-cntr:
    Container ID:   docker://e078170a412c392412f8f4fe170cdfcd139f212d2bd31dd6afda5801874b9225
    Image:          application:latest
    Image ID:       docker://sha256:de30a4cc9fd90fb6e51d51881747fb9b8a088d374e897a379c3ef87c848ace11
    Port:           80/TCP
    ...
Volumes:
  linkerd-proxy-init-xtables-lock:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  linkerd-identity-end-entity:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  <unset>
  linkerd-identity-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason          Age   From     Message
  ----    ------          ----  ----     -------
  Normal  SandboxChanged  92s   kubelet  Pod sandbox changed, it will be killed and re-created.
  Normal  Pulled          92s   kubelet  Container image "cr.l5d.io/linkerd/proxy-init:v2.4.1" already present on machine
  Normal  Created         92s   kubelet  Created container linkerd-init
  Normal  Started         92s   kubelet  Started container linkerd-init
  Normal  Pulled          92s   kubelet  Container image "cr.l5d.io/linkerd/proxy:edge-24.9.2" already present on machine
  Normal  Created         91s   kubelet  Created container linkerd-proxy
  Normal  Started         91s   kubelet  Started container linkerd-proxy
  Normal  Pulled          44s   kubelet  Container image "application:latest" already present on machine
  Normal  Created         44s   kubelet  Created container application-vastaya-cntr
  Normal  Started         44s   kubelet  Started container application-vastaya-cntr

Sometimes, you may want to mesh all pods in a namespace but exclude certain ones. You can achieve this by adding the annotation linkerd.io/inject: disabled to the pods you want to exclude.

Alternatively, you can use the Linkerd CLI to inject the proxy directly into existing YAML configurations. For example:

kubectl get -n vastaya deploy -o yaml | linkerd inject - | kubectl apply -f

Metrics

By default, Kubernetes collects metrics related to resource usage, such as memory and CPU, but it doesn’t gather information about the actual requests made between services. One of the key advantages of using Linkerd’s proxy is the additional metrics it collects and makes available. These include detailed data about the traffic flowing through the proxy, such as:

The number of requests the proxy has received.
Latency in milliseconds for each request.
Success and failure rates for service communication.

These metrics provide deeper insights into the behavior of your services and can be crucial for monitoring, troubleshooting, and optimizing performance.

References

**Companies using Linkerd:** [https://discovery.hgdata.com/product/linkerd](https://discovery.hgdata.com/product/linkerd)
**Lesson learned and reason to Linkerd.x2:** [https://www.infoq.com/articles/linkerd-v2-production-adoption/](https://www.infoq.com/articles/linkerd-v2-production-adoption/)
**Reasons why Byoyant choose Rust:** [https://linkerd.io/2020/07/23/under-the-hood-of-linkerds-state-of-the-art-rust-proxy-linkerd2-proxy/](https://linkerd.io/2020/07/23/under-the-hood-of-linkerds-state-of-the-art-rust-proxy-linkerd2-proxy/)
**Business Insight Service Mesh report:** [https://www.businessresearchinsights.com/market-reports/service-mesh-market-100139](https://www.businessresearchinsights.com/market-reports/service-mesh-market-100139)
**Proxy Injection:** [https://linkerd.io/2.16/features/proxy-injection/](https://linkerd.io/2.16/features/proxy-injection/)
**GitHub demo source-code:** [https://github.com/GTRekter/Vastaya](https://github.com/GTRekter/Vastaya)

Zero Downtime Deployments in Kubernetes with Linkerd

Ivan Porta — Thu, 14 Nov 2024 12:57:33 +0000

Releasing applications into production always comes with a sense of nervousness, no matter how stable the application and automation have been in prior environments. The fear of causing unexpected disruptions for critical clients — and potentially driving them away — is a significant risk. As a result, many businesses still rely on manual, after-hours deployments, following long instruction pages that detail every manual step. While this may minimize the immediate business impact, it remains vulnerable to human error. Engineers can get distracted or miss a crucial step, and even a minor oversight can lead to significant issues — something automated procedures are designed to prevent.

On the other hand, some companies fully embrace the “fail-fast” approach. For example, Netflix runs its “Simian Army” in production environments to ensure everything is functioning as expected, with their little monkeys trying to break things. However, reaching this level of confidence requires organizational maturity, and it takes time to get there.

Modern production deployment strategies are evolving to address these challenges through automation, continuous delivery, and the use of advanced tools that implement deployment techniques such as blue-green deployments, canary releases, and progressive rollouts. These strategies not only reduce downtime but also ensure smoother, more reliable transitions for production workloads. Convincing management to adopt these approaches may take time, but a proof of concept (POC) and an adoption plan can help your organization achieve this while saving both engineers and management from sleepless, stressful nights.

In this article, I will explain and demonstrate how to implement modern deployment strategies like canary deployment, A/B testing, and blue-green deployment in Kubernetes environments using Linkerd.

Traffic Management and Linkerd

Kubernetes natively supports traffic management features like timeouts, retries, and mirroring through the Gateway API’s HTTPRoute resource. This resource defines rules and matching conditions to determine which backend services should handle incoming traffic. By using the weight field, you can specify the proportion of requests sent to a particular backend, facilitating traffic splitting across different versions or environments.

Before version 2.14, Linkerd users had to rely on a custom resource definition (CRD) downstream of httproutes.gateway.networking.k8s.io, specifically httproutes.policy.linkerd.io, to instruct the Linkerd proxy on how to route requests. Starting with version 2.14, Linkerd extended its support to the native httproutes.gateway.networking.k8s.io. This means that regardless of which resource you use, the Linkerd proxy will route traffic based on either the Gateway API's or Linkerd's policy HTTPRoute resource. This functionality also applies to gRPC requests.

Note: By default, during installation, Linkerd attempts to install the Gateway API CRDs. However, if they are already present in the cluster, you can instruct Linkerd to skip this step by setting enableHttpRoutes to false in the Helm chart or CLI when installing Linkerd CRDs.

$ kubectl get crds | grep gateway
grpcroutes.gateway.networking.k8s.io       2024-09-25T01:01:18Z
httproutes.gateway.networking.k8s.io       2024-09-25T01:01:18Z

In this demonstration, I’ll use NGINX as the Ingress controller. By default, the NGINX Ingress controller retrieves the Endpoint resources for services specified in the Ingress and forwards traffic directly to the IP addresses of the pods. However, this behavior doesn’t align with HTTPRoute policy, which applies to traffic routed through the service itself. To solve this issue, we need to configure the NGINX Ingress controller to forward traffic to the service rather than directly to the pod endpoints. This can be achieved by adding the annotation nginx.ingress.kubernetes.io/service-upstream: "true" to the Ingress resource.

Additionally, since is the Linkerd proxy that handles the redirection of the traffic to the backend service, we need to inject it into the Ingress controller pod.

The overall traffic flow is the following:

The user sends a request to the application.
The inbound traffic is intercepted by the Linkerd proxy running in the Ingress controller pod and then forwarded to the NGINX Ingress controller for processing.
Due to the annotation nginx.ingress.kubernetes.io/service-upstream: "true", the Ingress controller forwards the traffic to the service defined in the upstream configuration located at /etc/nginx/nginx.conf.
The outbound traffic is intercepted again by the Linkerd proxy, which evaluates the destination based on its in-memory state, which includes discovery results, requests, and connections retrieved from the Linkerd destination service. Unused cached entries are evicted after a certain timeout period.
Once the target is determined, the proxy queries the Linkerd policy service for applicable routing policies and applies them as necessary.
Finally, the Linkerd proxy forwards the request to the backend defined by the policy — in this case, the canary version of the service.

Now that we have an idea of what’s happening behind the scenes, let’s dig into the different types of deployment available and what should be expected in the future.

Canary Deployment

This deployment strategy involves deploying a new version of the service (referred to as the “canary”) alongside the current stable version running in production. A percentage of traffic is then redirected to the canary. By doing this, the development team can quickly test the service with production traffic and identify any issues with a minimal “blast radius” (the number of users affected by the change). During this triage phase, the team also collects key metrics from the service. Based on these results, they can decide to gradually increase traffic to the new version (e.g., 25%, 75%, 100%) or, if necessary, abort the release.

Below is an example of an HTTPRoute configuration using Kubernetes Gateway API to implement a canary deployment where the traffic targeting the service projects-vastaya-svc is split between two services:

projects-vastaya-svc: Receives 10% of the traffic.
projects-canary-vastaya-svc: Receives 90% of the traffic.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: project-vastaya-split
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
  - backendRefs:
    - name: projects-vastaya-svc
      port: 80
      weight: 10
    - name: projects-canary-vastaya-svc
      port: 80
      weight: 90

In the following image, you can see traffic being forwarded to both services by the Linkerd proxy. To visualize the inbound traffic to the services, I used the viz extension in Linkerd and injected Linkerd into both the canary and stable deployments. This allowed me to observe the traffic distribution using the command:

linkerd viz top deploy/projects-canary-vastaya-dplmt -n vastaya

**Note: **Injecting the Linkerd proxy into the destination pods is not required for traffic redirection, but I did it to collect detailed metrics on service performance.

Blue-green Deployment

A Blue-Green deployment is similar to a canary deployment but takes a more drastic approach. Instead of gradually directing an incremental percentage of traffic to the new version, both the old (Blue) and new (Green) versions run in parallel. However, only one version is active and accessible to users at any given time.

The key difference is that the new version (Green) remains inactive and hidden from users while you make any necessary adjustments to ensure it’s stable and reliable. Once you’re confident in the new version’s performance, you swap all traffic over to it in a single, coordinated switch. This approach minimizes downtime and allows for a quick rollback if issues are detected.

In contrast to canary deployments — where users actively access both versions as traffic is incrementally shifted — the Blue-Green strategy keeps the new version isolated until it’s fully ready for production use.

In our case, we’ll implement a Blue-Green deployment by changing the traffic weight from 0 to 1, directing all traffic to the new version. Here’s an example of the HTTPRoute configuration:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-split
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
  - backendRefs:
    - name: projects-vastaya-svc
      port: 80
      weight: 0
    - name: projects-canary-vastaya-svc
      port: 80
      weight: 1

A/B Testing

A/B testing is a method of experimentation that involves running two versions of the same environment to collect metrics like conversion rates, performance, and user engagement. Similar to canary deployments, it allows you to compare different versions of a service, but with a focus on gathering specific data from targeted user groups.

In A/B testing, the second version (the “B” version) targets one or more groups of users defined by predetermined criteria such as location, device type, user behavior, or other factors. This method is widely used in user experience (UX) design. For example, you might notice something appearing on your Netflix dashboard that your friend doesn’t see, or subtle changes in the application’s interface.

In our case, we can achieve this by adding additional filters to our HTTPRoute. In the following configuration we will:

Use the matches section to identify requests coming from users who have their locale set to Korean (Accept-Language: ko.*) and are using Firefox as their web browser (User-Agent: .*Firefox.*).
For these users, traffic is split evenly between the stable service (projects-vastaya-svc) and the canary service (projects-canary-vastaya-svc), each receiving 50% of the traffic.
For all other users, traffic is directed entirely to the stable service.

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-traffic-split
  namespace: vastaya
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
    - matches:
      - headers:
        - name: "User-Agent"
          type: RegularExpression
          value: ".*Firefox.*"
        - name: Accept-Language
          type: RegularExpression
          value: "en-US.*" 
      backendRefs:
        - name: projects-vastaya-svc
          port: 80
          weight: 10
        - name: projects-canary-vastaya-svc
          port: 80
          weight: 90
    - backendRefs:
        - name: projects-vastaya-svc
          port: 80

By implementing this configuration, you can conduct A/B testing by routing 50% of the targeted users to the canary version while the rest continue to use the stable version. This allows you to collect specific metrics and assess the performance of the new version among a defined user segment.

Shadow Deployment (Mirrored Deployment)

In shadow deployment, also known as mirrored deployment, a new version of a service runs in the background and receives a copy of real-world traffic. Users are not impacted because only the response from the main (stable) service is considered; responses from the new version are ignored. This method allows the development team to test the new service against production traffic to observe how it behaves under real-world conditions without affecting users.

As of now, this feature is not fully supported by Linkerd, but the development team is actively working on it. You can track the progress through this GitHub issue: Linkerd Issue #11027.

Once this feature becomes available, you’ll be able to apply the following configuration without setting up a gateway, and the Linkerd proxy will handle the rest. The traffic sent to the service projects-vastaya-svc will be mirrored to projects-canary-vastaya-svc, but only the response from projects-vastaya-svc will be considered by the users.

Here’s an example of the HTTPRoute configuration:

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-traffic-split
  namespace: vastaya
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
  - backendRefs:
    - name: projects-vastaya-svc
      port: 80
      weight: 0
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: projects-canary-vastaya-svc
          port: 80

References

Netflix and Canary Deployments: https://netflixtechblog.com/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69
Linkerd 2.14 Release Notes: https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.0
Ingress configuration with Linkerd: https://linkerd.io/2.16/tasks/using-ingress/#nginx-community-version
Gateway API Traffic Splitting: https://gateway-api.sigs.k8s.io/guides/traffic-splitting/
Netflix A/B Testing: https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15
Proxy Discovery Cache: https://linkerd.io/2.16/tasks/configuring-proxy-discovery-cache/

링커드 활용 쿠버네티스의 무중단 배포

Ivan Porta — Thu, 14 Nov 2024 12:50:41 +0000

애플리케이션 제작을 시작하면 이전 환경에서의 애플리케이션과 자동화의 안정성과 상관없이 초조하기 마련이다. 예상치 못하게 주요 고객을 혼란하게 하는 것– 주요 고객을 떠나가게 할 수 있다는 것 -은 매우 위험하다. 따라서 많은 기업들은 여전히 정규 시간외 수동 배포를 하고 있으며, 모든 수동 단계에 대한 자세한 설명서를 따른다. 이렇게 하면 직접적인 비즈니스 영향을 최소화할 수 있는 반면 인적 오류에 취약하다. 엔지니어는 산만해지거나 주요 단계를 놓칠 수 있고, 작은 실수가 중요한 문제-자동화가 방지하고자 하는 것-를 야기할 수 있다.

반면에 페일페스트(fail-fast) 접근 방법을 적극적으로 수용하는 회사도 있다. 예를 들면, 넷플릭스는 제작 환경에서 시미언 아미(Simian Army)를 실행하여 모든 것이 예상대로 작동하도록 하고, 수작업자는 단계를 나누게 된다. 단, 이러한 신뢰 수준에 도달하려면 조직 성숙도와 시간이 필요하다.

최신 제작 배포 전략은 자동화와 지속적 인도 이외에 블루그린 배포, 카나리 배포 그리고 점진적 롤아웃 등의 배포 기법을 도입한 최신 도구를 사용하여 이러한 문제를 다룬다. 이러한 전략은 중단 시간을 줄이고, 제작 부하를 한층 부드럽고 확실히 이동시켜 준다. 관리자가 이러한 접근 방법을 채택하도록 설득하려면 시간이 걸리지만 개념 증명(POC)과 채택 계획을 통하여 해당 조직이 이러한 목표를 달성하고, 엔지니어와 관리자를 안심시킬 수 있다.

이 논문을 통하여 링커드를 활용하는 쿠버네티스 환경에서 카나리 출시, A/B 시험 그리고 블루그린 배포와 같은 최신 배포 전략을 실행하는 방법을 설명하고 논증한다.

트래픽 관리와 링커드

쿠버네티스는 기본적으로 게이트웨이 API의 HTTPRoute 자원을 통하여 타임아웃, 재시도 그리고 미러링과 같은 트래픽 관리 기능을 지원한다. 이 자원은 규칙과 매칭 조건을 정의하여 인커밍 트래픽을 처리할 백엔드 서비스를 결정한다. weight 필드를 사용하여 특정 백엔드로 송부된 요청의 일부를 명시하면 트래픽이 여러가지 버전이나 환경으로 나뉜다.

버전 2.14 이전에서 링커드 사용자는 httproutes.gateway.networking.k8s.io (구체적으로 말하면 httproutes.policy.linkerd.io) 사용자 지정 리소스 정의 (CRD) 다운스트림을 사용하여 요청을 라우트하는 방법을 링커드 사용자에게 알려야 했다. 링커드는 버전 2.14부터 기본적인 httproutes.gateway.networking.k8s.io로 지원을 확장하였다. 링커드 프록시는 사용 리소스와 상관없이 게이트웨이 API 또는 링커드 정책 HTTPRoute 리소스에 기반한 트래픽을 라우트하게 된다. 이러한 기능성은 gRPC 요청에도 적용된다.

주: 링커드는 설치 시 기본적으로 게이트웨이 API CRD 설치를 시도한다. 단, 해당 CRD가 클러스터에 이미 있으면 링커드 CRD 설치 시 헬름 차트나 CLI에서 enableHttpRoutes를false로 설정하여 이 단계를 생략하도록 링커드에 지시할 수 있다.

$ kubectl get crds | grep gateway
grpcroutes.gateway.networking.k8s.io       2024-09-25T01:01:18Z
httproutes.gateway.networking.k8s.io       2024-09-25T01:01:18Z

이 논증에서는 NGINX를 인그레스 컨트롤러로 사용한다. NGINX 인그레스 컨트롤러는 기본적으로 인그레스에 명시된 서비스에 대한 종점 리소스를 검색하고, 트래픽을 pod의 IP 주소로 직접 보낸다. 단, 이 행위는 서비스 자체를 통하여 라우트된 트래픽에 적용되는 HTTPRoute 정책과 맞지 않는다. 이 문제를 해결하려면 NGINX 인그레스 콘트롤러를 설정하여 트래픽을 pod 종점으로 직접 송부하지 말고 서비스로 송부해야 한다. 인그레스 리소스에 nginx.ingress.kubernetes.io/service-upstream: "true" 표기를 추가하면 된다.

또한 링커드 프록시가 트래픽을 백엔드 서비스로 리디렉션하므로 그것을 인그레스 콘트롤러 pod로 주입해야 한다.

전반적인 트래픽 흐름은 다음과 같다.

사용자는 요청을 애플리케이션으로 송부한다.
인그레스 콘트롤러 pod에서 실행하는 링커드 프록시가 인바운드 트래픽을 가로채어 NGINX 인그레스 컨트롤러로 송부하여 처리한다.
인그레스 컨트롤러는 nginx.ingress.kubernetes.io/service-upstream: "true" 표기로 인하여 트래픽을/etc/nginx/nginx.conf에 위치한 업스트림 설정에 정의된 서비스로 송부한다.
발견 결과, 요청 그리고 링커드 목적지 서비스에서 검색된 연결 등의 인 메모리 상태에 기반한 목적지를 평가하는 링커드 프록시가 아웃바운드 트래픽을 가로챈다. 일정 타임아웃 기간 후에 캐쉬된 미사용 항목을 퇴거시킨다.
프록시는 목표가 결정되면 해당 라우팅 정책에 대한 링커드 정책 서비스를 조회하고 필요 시 해당 정책을 적용한다.
링커드 프록시는 최종적으로 요청을 정책이 정의한 백엔드로 송부한다(이 경우에는 서비스의 카나리 버전).

막후에서 무슨 일이 일어나는지 알았으므로 여러가지 가용 배포 방식과 추후 예상 상황에 대하여 알아보도록 한다.

카나리 배포

이 배포 전략은 제작 실행 중인 기존 안정적 버전과 함께 서비스(”카나리”라 함)의 신 버전을 배포하는 것을 포함한다. 트래픽의 일부를 카나리로 재전송한다. 개발팀은 이를 통하여 제작 트래픽으로 서비스를 신속하게 시험하고 최소한의 “폭발 반경”(변경으로 인하여 영향을 받는 사용자의 수)으로 문제를 확인할 수 있다. 해당 팀은 이러한 선별 단계에서 서비스로부터 주요 측정 기준도 수집한다. 이 결과에 따라 트래픽을 점진적으로 신 버전(예: 25%, 75%, 100%)으로 늘리거나 출시 중단을 결정할 수 있다.

쿠버네티스 게이트웨이 API를 사용하여 projects-vastaya-svc서비스를 대상으로 하는 트래픽이 두 서비스로 분리되는 카나리 배포를 실행하는HTTPRoute 설정의 예는 다음과 같다.

projects-vastaya-svc: 트래픽의 10%를 수신한다.
projects-canary-vastaya-svc: 트래픽의 90%를 수신한다.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: project-vastaya-split
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
  - backendRefs:
    - name: projects-vastaya-svc
      port: 80
      weight: 10
    - name: projects-canary-vastaya-svc
      port: 80
      weight: 90

다음의 이미지에서 링커드 프록시가 트래픽을 두 서비스 모두에게 송부하는 것을 볼 수 있다. 서비스에 대한 인바운드 트래픽을 시각화하기 위하여 링커드의 viz 확장을 사용하고, 링커드를 카나리 배포와 안정적 배포 모두에게 주입하였다. 이를 통하여 해당 명령어를 사용하여 트래픽 분포를 관찰할 수 있었다.

linkerd viz top deploy/projects-canary-vastaya-dplmt -n vastaya

주: 트래픽 리디렉션을 하기 위하여 링커드 프록시를 목적지 pod에 주입할 필요가 없으나 서비스 성능에 대한 자세한 측정 기준을 수집하기 위하여 목적지 pod를 주입하였다.

블루그린 배포

블루그린 접근 방법은 카나리 배포와 유사하지만 더 과감한 접근 방법을 취한다. 트래픽의 증분 일부를 점진적으로 송부하는 대신에 구(블루) 버전과 신(그린) 버전을 모두 병행하여 실행한다. 단, 어떤 시점이든지 한 버전만 활성화되고 사용자 접근을 허용한다.

주요한 차이점은 필수 조정을 하는 동안 신 버전(그린)은 비활성화되고 사용자로부터 숨겨져서 안정성과 신뢰성을 확보한다는 것이다. 신 버전의 성능에 대한 확신이 있을 경우 단일 조정 전환으로 모든 트래픽을 교환한다. 이 접근 방법을 취하면 중단 시간을 최소화하고 문제 발견 시 신속한 롤백을 할 수 있다.

블루그린 전략은 사용자가 트래픽이 점진적으로 바뀔 때 두 버전 모두를 활발하게 접근하는 카나리 배포와 대조적으로 신 버전이 제작용으로 완전히 준비될 때까지 신 버전을 사용하지 않는다.

이 경우에는 트래픽 가중치를 0에서 1로 변경하여 블루그린 배포를 실행하고 모든 트래픽을 신 버전으로 송부하게 된다. HTTPRoute 설정의 예는 다음과 같다.

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-split
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
  - backendRefs:
    - name: projects-vastaya-svc
      port: 80
      weight: 0
    - name: projects-canary-vastaya-svc
      port: 80
      weight: 1

A/B 시험

A/B 시험은 동일한 환경의 두 버전을 실행하여 전환율, 성능 그리고 사용자 참여도와 같은 측정 기준을 수집하는 실험 방법이다. 카나리 배포와 유사하게 서비스의 여러가지 버전을 비교할 수 있지만 대상 사용자 그룹으로부터 특정 자료를 수집하는 데 초점을 둔다.

A/B 시험에서 두 번째 버전(“B” 버전)은 위치, 장치 형태, 사용자 행동 또는 기타 요소와 같은 선결 기준이 정의한 한 개 이상의 사용자 그룹을 대상으로 한다. 이 방법은 사용자 경험(UX) 설계 시 널리 쓰인다. 예를 들면 친구에게는 보이지 않지만 넷플릭스에서는 보이는 것이나 애플리케이션 인터페이스의 미묘한 변화를 인지할 수 있을 것이다.

이 경우에 추가 필터를 HTTPRoute 에 추가하면 가능하다. 다음 설정에서 다음 사항을 실행하게 된다.

matches 섹션을 사용하여 로켈을 한국어(Accept-Language: ko.*) 로 설정하고 파이어폭스를 웹브라우저(User-Agent: .*Firefox.*)로 사용하는 사용자의 요청을 확인한다.
트래픽은 이 사용자에 대해서 안정적 서비스(projects-vastaya-svc)와 카나리 서비스(projects-canary-vastaya-svc)간에 균등하게 나뉘며, 각각은 트래픽의 50% 를 각각 수신한다.
기타 사용자에 대해서는 트래픽을 모두 안정적 서비스로 송부한다.

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-traffic-split
  namespace: vastaya
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
    - matches:
      - headers:
        - name: "User-Agent"
          type: RegularExpression
          value: ".*Firefox.*"
        - name: Accept-Language
          type: RegularExpression
          value: "en-US.*" 
      backendRefs:
        - name: projects-vastaya-svc
          port: 80
          weight: 10
        - name: projects-canary-vastaya-svc
          port: 80
          weight: 90
    - backendRefs:
        - name: projects-vastaya-svc
          port: 80

이러한 설정의 실행을 통하여 대상 사용자의 50%를 카나리 버전으로 라우팅하고 나머지 사용자는 안정적 서비스를 계속 사용하게 하여 A/B 시험을 할 수 있다. 이를 통하여 특정 측정 기준을 수집하고, 지정 사용자 세그먼트에 대한 신 버전의 성능을 평가할 수 있다.

그림자 배포(미러드 배포)

미러드 배포로도 알려진 그림자 배포에서는 서비스의 신 버전이 백그라운드에서 실행하고 현실 트래픽의 복사본을 수신한다. 주(안정적) 서비스의 반응만을 고려하므로 사용자는 영향을 받지 않으며, 신 버전의 반응은 무시된다. 이 방법을 통하여 개발팀은 신 서비스를 제작 트래픽에 대하여 시험하여 신 서비스가 사용자에게 영향을 미치지 않고 현실 조건 하에서 어떻게 움직이는지 관찰할 수 있다.

링커드는 현재 이 기능을 완전하게 지원하지 않지만 개발팀은 해당 사항에 대하여 활발히 작업 중이다. GitHub 판을 통하여 상황을 파악할 수 있다(링커드 판 #11027).

이 기능이 사용 가능 시 게이트웨이를 설정하지 않고 다음 설정을 적용할 수 있으며, 링커드 프록시가 뒤처리를 하게 된다. projects-vastaya-svc 서비스로 송부된 트래픽을projects-canary-vastaya-svc 로 미러링 하지만 사용자는projects-vastaya-svc의 반응만을 고려하게 된다.

HTTPRoute설정의 예는 다음과 같다.

apiVersion: gateway.networking.k8s.io/v1beta1
kind: HTTPRoute
metadata:
  name: project-vastaya-traffic-split
  namespace: vastaya
spec:
  parentRefs:
    - name: projects-vastaya-svc
      group: core
      kind: Service
      namespace: vastaya
      port: 80
  rules:
  - backendRefs:
    - name: projects-vastaya-svc
      port: 80
      weight: 0
    filters:
    - type: RequestMirror
      requestMirror:
        backendRef:
          name: projects-canary-vastaya-svc
          port: 80

참고 자료

**넷플릭스와 카나리 배포:** [https://netflixtechblog.com/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69](https://netflixtechblog.com/automated-canary-analysis-at-netflix-with-kayenta-3260bc7acc69)
**Linkerd 2.14 릴리즈 노트:** [https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.0](https://github.com/linkerd/linkerd2/releases/tag/stable-2.14.0)
**Linkerd를 활용한 인그레스 설정:** [https://linkerd.io/2.16/tasks/using-ingress/#nginx-community-version](https://linkerd.io/2.16/tasks/using-ingress/#nginx-community-version)
**Gateway API 트래픽 분할:** [https://gateway-api.sigs.k8s.io/guides/traffic-splitting/](https://gateway-api.sigs.k8s.io/guides/traffic-splitting/)
**넷플릭스 A/B 테스트:** [https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15](https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15)
**프록시 디스커버리 캐시:** [https://linkerd.io/2.16/tasks/configuring-proxy-discovery-cache/](https://linkerd.io/2.16/tasks/configuring-proxy-discovery-cache/)

Optimize Resilience and Reduce Cross-Zone Expenses Using HAZL

Ivan Porta — Thu, 14 Nov 2024 12:43:34 +0000

Unplanned downtime, whether it’s caused by hardware failures, glitches, or cyberattacks, is every organization’s worst nightmare, no matter its size and sectors. Its can not only cause a lost revenue, but also drops in stock value, hit to customer satisfaction, trust and damage to the company’s reputation. According to a Oxford Economics survey, the downtime costs for Global 2000 companies is estimated around $400B annually, which means $200M per company per year, with an average of of $9,000 or $540,000 per hour.

Also, outages are more common that what you might think. In fact, according to a survey by Acronis in 2022 showed that 76% of companies experienced downtime. And let’s not forget Meta’s massive 2024 outage, which cost an estimated $100 million in lost revenue or the $34 million in missed sales for Amazon in 2021. This proves that, even with resiliency strategies in place, and thousands of engineers dedicated to avoiding downtime, it’s still a problem that need to be properly addressed.

To provide resiliency to their customers, and help them reduce the risk of unplanned downtime, all major cloud providers, for Azure to Tencent Cloud, offer spread their databases, Kubernetes clusters, and other resources across different data centers (or zones) within a region. These zones have independent power, cooling, and networking. So if one zone goes down, the application can keep running on the other zones.

Sounds perfect, right? Well, it’s close, but there’s a catch. Some of these Cloud Service Providers, like AWS, GCP (not Azure), have additional Data Transfer Costs for cross-Availability Zone communication.

What Are Data Transfer Costs for Cross-Availability Zone Communication?

As the name suggests, these costs come from data moving between resources in different Availability Zones, and are usually calculated per gigabyte ($/GB). While it might not seem like much at first glance, let’s look at some example of real-life traffic.

In October 2023, Coupang.com, one of the main e-commerce in South Korea, had around 127.6 million monthly visits. With the average page size at 4.97MB, and about 12 pages visited per session, the monthly traffic easily reach tens of petabytes of data. Even if only half of this traffic involves cross-zone communication, the cost for the transit of data between Availability Zone quickly reach hundreds of thousands of dollars in a single month.

Kubernetes’s native Topology Aware Routing (aka Topology Aware Hints)

Starting in version 1.21, Kubernetes introduced Topology Aware Hints to minimize cross-zone traffic within clusters. This routing strategy is built on EndpointSlices, which were first introduced in version 1.17 to improve the scalability of the traditional Endpoints resource. When a new Service is created, Kubernetes automatically generates EndpointSlices, breaking down the network endpoints into manageable chunks. This reduces the overhead on kube-proxy and overcome the size limitations of objects stored in etcd (max 1.5MB).

EndpointSlices don’t just improve performance, they also carry metadata like the zone information. This metadata is critical for Topology-Aware Routing because it allows Kubernetes to make decisions about routing traffic based on the topology of the cluster. This mecchanism is enabled by the annotation service.kubernetes.io/topology-mode on a Service, and will instructs the kube-proxy to filter the available endpoints according to topology hints provided by the EndpointSlice controller.

In the following example, traffic can be routed based on the zone metadata (koreacentral-1, koreacentral-2, koreacentral-3).

The related EndpointSlice manifest will be the following:

apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  ...
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Service
    name: tasks-vastaya-svc
addressType: IPv4
ports:
- name: http
  port: 80
  protocol: TCP
endpoints:
- addresses:
  - 10.244.3.74
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: aks-hazlpoolha-33634351-vmss000000
  targetRef:
    kind: Pod
    name: tasks-vastaya-dplmt-68cd4dd76c-rblxz
    namespace: vastaya
    uid: 8fddbf95-dac8-420c-b0ab-d5076f9f27e9
  zone: koreacentral-1
- addresses:
  - 10.244.2.181
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: aks-hazlpoolha-33634351-vmss000001
  targetRef:
    kind: Pod
    name: tasks-vastaya-dplmt-68cd4dd76c-cwshq
    namespace: vastaya
    uid: 8c82addd-1123-4810-ad21-0533e8cd15ee
  zone: koreacentral-2
- addresses:
  - 10.244.1.108
  - 10.244.1.110
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: aks-hazlpoolha-33634351-vmss000002
  targetRef:
    kind: Pod
    name: tasks-vastaya-dplmt-68cd4dd76c-dwxg2
    namespace: vastaya
    uid: b5128ae8-6615-41e6-97ec-8db9b81b588e
  zone: koreacentral-3

However, while Topology-Aware Routing helps reduce inter-zone traffic, it has some inherent limitations. Endpoint allocation is relatively static, meaning it doesn’t adapt to real-time conditions like traffic load, network latency, or service health beyond basic readiness and liveness probes. This can lead to imbalanced resource utilization, especially in dynamic environments where local endpoints are overwhelmed while remote ones remain underutilized.

This is where High Availability Zone-aware Load Balancing (HAZL) comes into play.

What is HAZL?

High Availability Zone-aware Load Balancing (HAZL) is a load balancer that leverages Topology-aware Routing, as well as the HTTP and gRPC traffic intercepted by the sidecar proxy running in meshed pods to load balance each request independently, routing to the best available backend based on current conditions. It operates at the request-level, unlike traditional connection-level load balancing, where all requests in a connection are sent to the same backend.

It also monitor the number of in-flight requests (requests waiting for resources or connections) referred as “load.” to the services, and handle the traffic between zones on a per-request basis. If load or latency spikes — signs that the system is under stress or unhealthy — HAZL adds additional endpoints from other zones. On the other hand, when the load decreases, HAZL remove those extra endpoints.

This adaptive approach fill the gaps of Topology-aware Routing allowing to a more controlled and more dynamic management of the cross-zone traffic, providing a balance between reducing latency, ensuring service reliability, and optimizing resource utilization.

HAZL is currently available only for Buoyant Enterprise for Linkerd and not for Linkerd Open Source.

What is Buoyant Enterprise for Linkerd?

Linkerd began its journey in the open-source world in 2016 and has since improved immensely. However, as corporations like Microsoft, Adidas, and Geico started incorporating Linkerd into their architectures, it became necessary to provide enterprise-level services and support that go beyond what is possible with open-source alone. This includes everything from Tailored Proofs of Concept, Software Bills-of-Materials for all components, a dedicated support channels by private support ticketing allowing them to have a direct point of contact instead of relying on public forums, Service Level Agreements, and more.

However, the Buoyant’s commitment to the open-source community is reflected in its pricing model. Anyone can try the enterprise features for non-production traffic, and companies with fewer than 50 employees can use Buoyant Enterprise for Linkerd in production for free, at any scale. Beyond that, there are different pricing tiers depending on the number of meshed pods and the specific features required.

Enough with the theory — let’s get our hands dirty and see HAZL in action.

Demonstration

In this demonstration, I will deploy the following infrastructure from scratch on an AKS cluster using Terraform. Then, I will install Prometheus, Linkerd Enterprise, and use Grafana to collect metrics of the traffic before and after enabling HAZL.

Infrastructure

Let’s start with the infrastructure. The following configuration will deploy:

Azure Kubernetes Cluster: This resource will have a default node pool where we will run Grafana, Prometheus, and the job to simulate traffic. This pool won’t have any availability zones assigned, as we want to keep the requests coming from the same region.
Cluster Node Pool: This pool will host the target services and pods and will have three availability zones, so Azure will automatically distribute the underlying VMs across these availability zones.
** Container Registry:** This is the resource where we will push our container images and from where the cluster will pull them, thanks to a role assignment to the kubelet identity with the AcrPull role.

provider "azurerm" {
  features {}
}

module "naming" {
  source  = "Azure/naming/azurerm"
  suffix = [ "training", "dev", "kr" ]
}

resource "azurerm_resource_group" "resource_group" {
  name     = "hazl-training-resources"
  location = "Korea Central"
}

resource "azurerm_kubernetes_cluster" "kubernetes_cluster" {
  name                = module.naming.kubernetes_cluster.name
  location            = azurerm_resource_group.resource_group.location
  resource_group_name = azurerm_resource_group.resource_group.name
  default_node_pool {
    name       = "default"
    node_count = 2
    vm_size    = "Standard_D2_v2"
    auto_scaling_enabled = true
  }
  identity {
    type = "SystemAssigned"
  }
}

resource "azurerm_kubernetes_cluster_node_pool" "kubernetes_cluster_node_pool" {
  name                  = "hazltrainingnodepool"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.kubernetes_cluster.id
  vm_size               = "Standard_DS2_v2"
  node_count            = 3
  zones                 = ["1", "2", "3"]
}

resource "azurerm_container_registry" "container_registry" {
  name                = module.naming.container_registry.name
  resource_group_name = azurerm_resource_group.resource_group.name
  location            = azurerm_resource_group.resource_group.location
  sku                 = "Premium"
  admin_enabled       = true
}

resource "azurerm_role_assignment" "role_assignment_cluster_container_registry" {
  principal_id                     = azurerm_kubernetes_cluster.kubernetes_cluster.kubelet_identity[0].object_id
  role_definition_name             = "AcrPull"
  scope                            = azurerm_container_registry.container_registry.id
  skip_service_principal_aad_check = true
}

Behind the scenes, Azure will set the topology.kubernetes.io/zone label on each node with the related zone and availability zone.

$ kubectl describe nodes | grep -e "Name:" -e "topology.kubernetes.io/zone"
Name: aks-agentpool-60539364-vmss000001
 topology.kubernetes.io/zone=0
Name: aks-hazlpoolha-33634351-vmss000000
 topology.kubernetes.io/zone=koreacentral-1
Name: aks-hazlpoolha-33634351-vmss000001
 topology.kubernetes.io/zone=koreacentral-2

Now that the infrastructure is in place, it’s time to start deploying the applications that we will use.

Install Buoyant Enterprise for Linkerd (BEL)

There are two ways to install BEL: via the operator or via CRDs and control plane Helm charts. The advantage of using the operator is that it will take care of pulling the configuration for both CRDs and the control plane for you, and manage the installation and upgrades automatically. Both methods require an active account on https://enterprise.buoyant.io/ as it will provide a license to use during the installation of the control plane.

As there is a lot of documentation about the operator online, in this demo, I will use Helm.

First, you will need to install the CRDs chart, which will install all the resource’s definitions necessary for Linkerd to work:

helm upgrade --install linkerd-enterprise-crds linkerd-buoyant/linkerd-enterprise-crds \
  --namespace linkerd \
  --create-namespace

Next, we will need to create a trust anchor certificate that will be used by the identity service to issue certificates and enable mTLS. In this case, I will use the step tool:

step certificate create root.linkerd.cluster.local ./certificates/ca.crt ./certificates/ca.key --profile root-ca --no-password --insecure
step certificate create identity.linkerd.cluster.local ./certificates/issuer.crt ./certificates/issuer.key --profile intermediate-ca --not-after 8760h --no-password --insecure --ca ./certificates/ca.crt --ca-key ./certificates/ca.key

Finally, we can install the control plane Helm chart, which will deploy all the roles, ConfigMaps, services, and components that make up Linkerd:

helm upgrade --install linkerd-enterprise-control-plane linkerd-buoyant/linkerd-enterprise-control-plane \
  --set buoyantCloudEnabled=false \
  --set license=$BUOYANT_LICENSE \
  -f ./helm/linkerd-enterprise/values.yaml \
  --set-file linkerd-control-plane.identityTrustAnchorsPEM=./certificates/ca.crt \
  --set-file linkerd-control-plane.identity.issuer.tls.crtPEM=./certificates/issuer.crt \
  --set-file linkerd-control-plane.identity.issuer.tls.keyPEM=./certificates/issuer.key \
  --namespace linkerd \
  --create-namespace

Install Linkerd-Viz

Linkerd-viz is an open-source extension that installs and auto-configures a Prometheus instance to scrape metrics from Linkerd. Additionally, it provides a dashboard that users can utilize to gain insights about the meshed pods in the cluster. It has a dedicated Helm chart that you can easily install with the following command:

helm upgrade --install linkerd-viz linkerd/linkerd-viz \
  --create-namespace \
  --namespace linkerd-viz

However, this extension only keeps metrics data for a brief window of time (6 hours) and does not persist data across restarts. Therefore, in this demo, we will install our own Prometheus instance and federate it with the Linkerd-viz Prometheus instance to persist the metrics.

Install Prometheus

By default, Prometheus provides a lot of metrics about the cluster and its resources. However, if you want to scrape additional information about the nodes, such as labels, you can modify the Prometheus configuration or install additional packages. To obtain the labels of the nodes and group the metrics by zone , we will install the kube-state-metrics, which will expose this infomations to the prometheus queries.

helm upgrade --install kube-state-metrics prometheus-community/kube-state-metrics \
  --set metricLabelsAllowlist.nodes=[*] \
  --create-namespace \
  --namespace monitoring

Next, install Prometheus using Helm:

helm upgrade --install prometheus prometheus-community/prometheus \
  --create-namespace \
  --namespace monitoring

Finally, we can federate our Prometheus instance with the Linkerd-viz instance so that data are copied from one Prometheus to another. This allows us to access metrics collected at the transport level, such as:

tcp_write_bytes_total: A counter of the total number of sent bytes. This is updated when the connection closes.
tcp_read_bytes_total: A counter of the total number of received bytes. This is updated when the connection closes.

To set up federation, add the following configuration to your Prometheus manifest file:

- job_name: 'linkerd'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['{{.Namespace}}']

  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_container_name
    action: keep
    regex: ^prometheus$

  honor_labels: true
  metrics_path: '/federate'

  params:
    'match[]':
      - '{job="linkerd-proxy"}'
      - '{job="linkerd-controller"}'

Additionally, apply an AuthorizationPolicy that will allow Prometheus to access the Linkerd-viz Prometheus metrics endpoint:

apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: prometheus-admin-federate
  namespace: linkerd-viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: prometheus-admin
  requiredAuthenticationRefs:
    - group: policy.linkerd.io
      kind: NetworkAuthentication
      name: kubelet

If you have done everything correctly, you will be able to see the following target in Prometheus:

Now, all the metrics scraped by the Linkerd-viz Prometheus instance are available in our Prometheus instance.

Install and configure Grafana

Next, let’s install Grafana with the default configuration:

helm upgrade --install grafana grafana/grafana \
  --create-namespace \
  --namespace monitoring

After logging in, we need to add a new data source pointing to the Prometheus server running in the cluster. To do this:

Expand the Connections option from the side pane and click Add new connection.
Click Prometheus and enter the internal DNS endpoint of the Kubernetes cluster (http://prometheus-server.monitoring.svc.cluster.local) in the connection input.
Click the Save & Test button at the bottom to complete the setup.

Next, let’s create a new dashboard that will contain the visualizations we will use to monitor the traffic to and from our nodes using the previously created data source targeting the Prometheus server. The visualizations we need are the following:

CPU Usage per Kubernetes Node: This query will display the CPU usage percentage for each node. In this case, we expect that there will be a peak of CPU utilization in one of the nodes when we will trigger the jobs to simulate the traffic.

100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

TCP Read Bytes Total (Outbound): This query shows the total number of bytes read over TCP connections for outbound traffic in the vastaya namespace, grouped by namespace, pod, instance, destination zone, and source zone. This metrics are collected by the Linkerd proxy.

sum by (namespace, pod, instance, dst_zone, src_zone) (
  tcp_read_bytes_total{direction="outbound", namespace="vastaya"}
)

Simulate the traffic with HAZL disabled

With this setup in place, we can proceed to trigger a job that will create 5 replicas, each one increasing the number of requests to the service every 10 seconds. To ensure that the pods are provisioned in the node pool without the application, we will also set a node affinity.

apiVersion: batch/v1
kind: Job
metadata:
  name: bot-get-project-report
  namespace: vastaya
spec:
  completions: 5          
  parallelism: 5          
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: agentpool
                operator: In
                values:
                - default
      containers:
      - name: project-creator
        image: curlimages/curl:7.78.0 
        command: ["/bin/sh", "-c"] 
        args:
        - |
          API_URL="http://projects.vastaya.svc.cluster.local/1/report"
          get_report() {
            local num_requests=$1
            echo "Getting $num_requests tasks..."
            for i in $(seq 1 $num_requests); do
              (
                echo "Getting task $i..."
                GET_RESPONSE=$(curl -s -X GET "$API_URL")
                echo $GET_RESPONSE
              ) &
            done
            wait
          }
          wait_time=10
          for num_requests in 5000 10000 15000; do
            echo "Running with $num_requests requests..."
            get_report $num_requests
            echo "Waiting for $wait_time seconds before increasing requests..."
            sleep $wait_time
          done
      restartPolicy: Never
  backoffLimit: 1

Since we haven’t enabled HAZL yet, Kubernetes will start directing the requests to pods running in different zones, resulting in an increase of cross-zone traffic.

Enable HAZL

Enabling HAZL with the operator is super easy. All we have to do is update the control plane values with the following command:

helm upgrade --install linkerd-enterprise-control-plane linkerd-buoyant/linkerd-enterprise-control-plane \
  --set buoyantCloudEnabled=false \
  --set license=$BUOYANT_LICENSE \
  --set-file linkerd-control-plane.identityTrustAnchorsPEM=./certificates/ca.crt \
  --set-file linkerd-control-plane.identity.issuer.tls.crtPEM=./certificates/issuer.crt \
  --set-file linkerd-control-plane.identity.issuer.tls.keyPEM=./certificates/issuer.key \
  --set controlPlaneConfig.destinationController.additionalArgs="{ -ext-endpoint-zone-weights }" \
  --set controlPlaneConfig.proxy.additionalEnv[0].name=BUOYANT_BALANCER_LOAD_LOW \
  --set controlPlaneConfig.proxy.additionalEnv[0].value="0.8" \
  --set controlPlaneConfig.proxy.additionalEnv[1].name=BUOYANT_BALANCER_LOAD_HIGH \
  --set controlPlaneConfig.proxy.additionalEnv[1].value="2.0" \
  --namespace linkerd \
  --create-namespace \
  -f ./helm/linkerd-enterprise/values.yaml

Simulate the traffic with HAZL enabled

After enabling HAZL, we recreate the job to simulate the traffic and as you can see, the cross-zone communication has been completely eliminated.

References

AWS Data Transfer Cost: https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region
Linkerd and External Prometheus: https://linkerd.io/2-edge/tasks/external-prometheus/
HAZL Official Documentation: https://docs.buoyant.io/buoyant-enterprise-linkerd/latest/features/hazl/
BEL OPfficial Documentation: https://docs.buoyant.io/buoyant-enterprise-linkerd/latest/installation/enterprise/
Demo Source Code: https://github.com/GTRekter/Vastaya