Unplanned downtime, whether it’s caused by hardware failures, software glitches, or cyberattacks, is every organization’s worst nightmare, no matter its size or sector. It can cause not only lost revenue, but also drops in stock value, hits to customer satisfaction and trust, and damage to the company’s reputation. According to an Oxford Economics survey, downtime costs Global 2000 companies an estimated $400B annually, which works out to about $200M per company per year, or an average of $9,000 per minute ($540,000 per hour).
Outages are also more common than you might think: a 2022 Acronis survey showed that 76% of companies experienced downtime. And let’s not forget Meta’s massive 2024 outage, which cost an estimated $100 million in lost revenue, or the $34 million in missed sales Amazon suffered in 2021. Even with resiliency strategies in place and thousands of engineers dedicated to avoiding downtime, it’s still a problem that needs to be properly addressed.
To provide resiliency to their customers, and help them reduce the risk of unplanned downtime, all major cloud providers, from Azure to Tencent Cloud, spread their databases, Kubernetes clusters, and other resources across different data centers (or zones) within a region. These zones have independent power, cooling, and networking, so if one zone goes down, the application can keep running in the other zones.
Sounds perfect, right? Well, it’s close, but there’s a catch: some cloud service providers, like AWS and GCP (but not Azure), charge additional Data Transfer Costs for cross-Availability-Zone communication.
What Are Data Transfer Costs for Cross-Availability Zone Communication?
As the name suggests, these costs come from data moving between resources in different Availability Zones, and are usually billed per gigabyte ($/GB). While it might not seem like much at first glance, let’s look at an example based on real-life traffic.
In October 2023, Coupang.com, one of the main e-commerce platforms in South Korea, had around 127.6 million monthly visits. With an average page size of 4.97MB and about 12 pages visited per session, that is roughly 127.6M × 12 × 4.97MB ≈ 7.6PB of traffic per month. Even if only half of this traffic involves cross-zone communication, at AWS’s $0.01/GB in each direction the transfer charges alone approach $80,000; and since each request typically hops across several internal services, the actual cross-zone volume (and the bill) can easily reach hundreds of thousands of dollars in a single month.
Kubernetes’s native Topology Aware Routing (aka Topology Aware Hints)
Starting in version 1.21, Kubernetes introduced Topology Aware Hints to minimize cross-zone traffic within clusters. This routing strategy is built on EndpointSlices, which were first introduced in version 1.16 to improve the scalability of the traditional Endpoints resource. When a new Service is created, Kubernetes automatically generates EndpointSlices, breaking the network endpoints down into manageable chunks. This reduces the overhead on kube-proxy and overcomes the size limitations of objects stored in etcd (max 1.5MB).
EndpointSlices don’t just improve performance; they also carry metadata like zone information. This metadata is critical for Topology Aware Routing because it allows Kubernetes to make routing decisions based on the topology of the cluster. The mechanism is enabled by the service.kubernetes.io/topology-mode annotation on a Service, which instructs kube-proxy to filter the available endpoints according to topology hints provided by the EndpointSlice controller.
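For example, here is a minimal sketch of what the demo’s Service could look like with topology-aware routing enabled (the selector label is an assumption for illustration; it has to match the pod labels of your Deployment):
apiVersion: v1
kind: Service
metadata:
  name: tasks-vastaya-svc
  namespace: vastaya
  annotations:
    # Ask Kubernetes to generate topology hints for this Service
    service.kubernetes.io/topology-mode: Auto
spec:
  selector:
    app: tasks-vastaya # assumed label, not taken from the demo repo
  ports:
  - name: http
    port: 80
    protocol: TCP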
In the example above, traffic can be routed based on the zone metadata (koreacentral-1, koreacentral-2, koreacentral-3). The related EndpointSlice manifest looks like the following:
apiVersion: discovery.k8s.io/v1
kind: EndpointSlice
metadata:
  ...
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: Service
    name: tasks-vastaya-svc
addressType: IPv4
ports:
- name: http
  port: 80
  protocol: TCP
endpoints:
- addresses:
  - 10.244.3.74
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: aks-hazlpoolha-33634351-vmss000000
  targetRef:
    kind: Pod
    name: tasks-vastaya-dplmt-68cd4dd76c-rblxz
    namespace: vastaya
    uid: 8fddbf95-dac8-420c-b0ab-d5076f9f27e9
  zone: koreacentral-1
- addresses:
  - 10.244.2.181
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: aks-hazlpoolha-33634351-vmss000001
  targetRef:
    kind: Pod
    name: tasks-vastaya-dplmt-68cd4dd76c-cwshq
    namespace: vastaya
    uid: 8c82addd-1123-4810-ad21-0533e8cd15ee
  zone: koreacentral-2
- addresses:
  - 10.244.1.108
  - 10.244.1.110
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: aks-hazlpoolha-33634351-vmss000002
  targetRef:
    kind: Pod
    name: tasks-vastaya-dplmt-68cd4dd76c-dwxg2
    namespace: vastaya
    uid: b5128ae8-6615-41e6-97ec-8db9b81b588e
  zone: koreacentral-3
However, while Topology-Aware Routing helps reduce inter-zone traffic, it has some inherent limitations. Endpoint allocation is relatively static, meaning it doesn’t adapt to real-time conditions like traffic load, network latency, or service health beyond basic readiness and liveness probes. This can lead to imbalanced resource utilization, especially in dynamic environments where local endpoints are overwhelmed while remote ones remain underutilized.
This is where High Availability Zone-aware Load Balancing (HAZL) comes into play.
What is HAZL?
High Availability Zone-aware Load Balancing (HAZL) is a load balancer that leverages Topology Aware Routing, as well as the HTTP and gRPC traffic intercepted by the sidecar proxy running in meshed pods, to load balance each request independently, routing it to the best available backend based on current conditions. It operates at the request level, unlike traditional connection-level load balancing, where all requests in a connection are sent to the same backend.
It also monitors the number of in-flight requests (requests waiting for resources or connections) to the services, referred to as “load,” and handles the traffic between zones on a per-request basis. If load or latency spikes (a sign that the system is under stress or unhealthy), HAZL adds additional endpoints from other zones. When the load decreases, HAZL removes those extra endpoints.
This adaptive approach fills the gaps of Topology Aware Routing, allowing a more controlled and more dynamic management of cross-zone traffic, and striking a balance between reducing latency, ensuring service reliability, and optimizing resource utilization.
HAZL is currently available only in Buoyant Enterprise for Linkerd, not in Linkerd Open Source.
What is Buoyant Enterprise for Linkerd?
Linkerd began its journey in the open-source world in 2016 and has improved immensely since. However, as corporations like Microsoft, Adidas, and Geico started incorporating Linkerd into their architectures, it became necessary to provide enterprise-level services and support that go beyond what is possible with open source alone. This includes everything from tailored Proofs of Concept, Software Bills of Materials for all components, and dedicated support channels with private ticketing (a direct point of contact instead of public forums), to Service Level Agreements and more.
That said, Buoyant’s commitment to the open-source community is reflected in its pricing model: anyone can try the enterprise features for non-production traffic, and companies with fewer than 50 employees can use Buoyant Enterprise for Linkerd in production for free, at any scale. Beyond that, there are different pricing tiers depending on the number of meshed pods and the specific features required.
Enough with the theory — let’s get our hands dirty and see HAZL in action.
Demonstration
In this demonstration, I will deploy the following infrastructure from scratch on an AKS cluster using Terraform. Then I will install Prometheus and Linkerd Enterprise, and use Grafana to collect metrics about the traffic before and after enabling HAZL.
Infrastructure
Let’s start with the infrastructure. The following configuration will deploy:
- **Azure Kubernetes Cluster:** This resource will have a default node pool where we will run Grafana, Prometheus, and the job to simulate traffic. This pool won’t have any availability zones assigned, as we want all the simulated requests to originate from the same place.
- **Cluster Node Pool:** This pool will host the target services and pods and will span three availability zones, so Azure will automatically distribute the underlying VMs across them.
- **Container Registry:** This is the resource where we will push our container images and from where the cluster will pull them, thanks to a role assignment granting the kubelet identity the AcrPull role.
provider "azurerm" {
features {}
}
module "naming" {
source = "Azure/naming/azurerm"
suffix = [ "training", "dev", "kr" ]
}
resource "azurerm_resource_group" "resource_group" {
name = "hazl-training-resources"
location = "Korea Central"
}
resource "azurerm_kubernetes_cluster" "kubernetes_cluster" {
name = module.naming.kubernetes_cluster.name
location = azurerm_resource_group.resource_group.location
resource_group_name = azurerm_resource_group.resource_group.name
default_node_pool {
name = "default"
node_count = 2
vm_size = "Standard_D2_v2"
auto_scaling_enabled = true
}
identity {
type = "SystemAssigned"
}
}
resource "azurerm_kubernetes_cluster_node_pool" "kubernetes_cluster_node_pool" {
name = "hazltrainingnodepool"
kubernetes_cluster_id = azurerm_kubernetes_cluster.kubernetes_cluster.id
vm_size = "Standard_DS2_v2"
node_count = 3
zones = ["1", "2", "3"]
}
resource "azurerm_container_registry" "container_registry" {
name = module.naming.container_registry.name
resource_group_name = azurerm_resource_group.resource_group.name
location = azurerm_resource_group.resource_group.location
sku = "Premium"
admin_enabled = true
}
resource "azurerm_role_assignment" "role_assignment_cluster_container_registry" {
principal_id = azurerm_kubernetes_cluster.kubernetes_cluster.kubelet_identity[0].object_id
role_definition_name = "AcrPull"
scope = azurerm_container_registry.container_registry.id
skip_service_principal_aad_check = true
}
Behind the scenes, Azure will set the topology.kubernetes.io/zone label on each node, with a value combining the region and the availability zone. Nodes in the default pool, which has no zones assigned, get a placeholder value (0) instead.
$ kubectl describe nodes | grep -e "Name:" -e "topology.kubernetes.io/zone"
Name: aks-agentpool-60539364-vmss000001
topology.kubernetes.io/zone=0
Name: aks-hazlpoolha-33634351-vmss000000
topology.kubernetes.io/zone=koreacentral-1
Name: aks-hazlpoolha-33634351-vmss000001
topology.kubernetes.io/zone=koreacentral-2
Now that the infrastructure is in place, it’s time to start deploying the applications that we will use.
Install Buoyant Enterprise for Linkerd (BEL)
There are two ways to install BEL: via the operator, or via the CRDs and control plane Helm charts. The advantage of using the operator is that it takes care of pulling the configuration for both the CRDs and the control plane for you, and manages installation and upgrades automatically. Both methods require an active account on https://enterprise.buoyant.io/, as it provides the license used during the installation of the control plane.
As there is a lot of documentation about the operator online, in this demo, I will use Helm.
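If you haven’t added the Helm repositories used throughout this demo yet, register them first (the Buoyant repository URL below is the one documented by Buoyant at the time of writing; double-check it against the current docs):
helm repo add linkerd-buoyant https://helm.buoyant.cloud
helm repo add linkerd https://helm.linkerd.io/stable
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update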
First, you will need to install the CRDs chart, which contains all the resource definitions necessary for Linkerd to work:
helm upgrade --install linkerd-enterprise-crds linkerd-buoyant/linkerd-enterprise-crds \
--namespace linkerd \
--create-namespace
Next, we will need to create a trust anchor certificate that will be used by the identity service to issue certificates and enable mTLS. In this case, I will use the step tool:
step certificate create root.linkerd.cluster.local ./certificates/ca.crt ./certificates/ca.key --profile root-ca --no-password --insecure
step certificate create identity.linkerd.cluster.local ./certificates/issuer.crt ./certificates/issuer.key --profile intermediate-ca --not-after 8760h --no-password --insecure --ca ./certificates/ca.crt --ca-key ./certificates/ca.key
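Optionally, you can sanity-check the generated certificates before installing the control plane; step’s inspect subcommand prints the validity window and key usage:
step certificate inspect ./certificates/ca.crt --short
step certificate inspect ./certificates/issuer.crt --short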
Finally, we can install the control plane Helm chart, which will deploy all the roles, ConfigMaps, services, and components that make up Linkerd:
helm upgrade --install linkerd-enterprise-control-plane linkerd-buoyant/linkerd-enterprise-control-plane \
--set buoyantCloudEnabled=false \
--set license=$BUOYANT_LICENSE \
-f ./helm/linkerd-enterprise/values.yaml \
--set-file linkerd-control-plane.identityTrustAnchorsPEM=./certificates/ca.crt \
--set-file linkerd-control-plane.identity.issuer.tls.crtPEM=./certificates/issuer.crt \
--set-file linkerd-control-plane.identity.issuer.tls.keyPEM=./certificates/issuer.key \
--namespace linkerd \
--create-namespace
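Once the chart is installed, you can verify that the control plane is healthy with the linkerd CLI (assuming it is installed locally):
linkerd check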
Install Linkerd-Viz
Linkerd-viz is an open-source extension that installs and auto-configures a Prometheus instance to scrape metrics from Linkerd. It also provides a dashboard that gives insight into the meshed pods in the cluster. It has a dedicated Helm chart that you can install with the following command:
helm upgrade --install linkerd-viz linkerd/linkerd-viz \
--create-namespace \
--namespace linkerd-viz
However, this extension only keeps metrics data for a brief window of time (6 hours) and does not persist data across restarts. Therefore, in this demo, we will install our own Prometheus instance and federate it with the Linkerd-viz Prometheus instance to persist the metrics.
Install Prometheus
By default, Prometheus provides a lot of metrics about the cluster and its resources. However, if you want to scrape additional information about the nodes, such as labels, you can modify the Prometheus configuration or install additional packages. To obtain the node labels and group the metrics by zone, we will install kube-state-metrics, which exposes this information to Prometheus queries.
helm upgrade --install kube-state-metrics prometheus-community/kube-state-metrics \
--set 'metricLabelsAllowlist[0]=nodes=[*]' \
--create-namespace \
--namespace monitoring
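Once kube-state-metrics is running, you can confirm that the zone label is exposed with a query like the following in Prometheus (kube-state-metrics sanitizes label keys, so topology.kubernetes.io/zone becomes label_topology_kubernetes_io_zone):
kube_node_labels{label_topology_kubernetes_io_zone!=""}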
Next, install Prometheus using Helm:
helm upgrade --install prometheus prometheus-community/prometheus \
--create-namespace \
--namespace monitoring
Finally, we can federate our Prometheus instance with the Linkerd-viz instance so that data is copied from one Prometheus to the other. This gives us access to metrics collected at the transport level, such as:
- tcp_write_bytes_total: A counter of the total number of sent bytes. This is updated when the connection closes.
- tcp_read_bytes_total: A counter of the total number of received bytes. This is updated when the connection closes.
To set up federation, add the following configuration to your Prometheus manifest file:
- job_name: 'linkerd'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: ['{{.Namespace}}']
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_container_name
    action: keep
    regex: ^prometheus$
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
    - '{job="linkerd-proxy"}'
    - '{job="linkerd-controller"}'
Additionally, apply an AuthorizationPolicy that allows Prometheus to access the Linkerd-viz Prometheus metrics endpoint:
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: prometheus-admin-federate
  namespace: linkerd-viz
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: prometheus-admin
  requiredAuthenticationRefs:
  - group: policy.linkerd.io
    kind: NetworkAuthentication
    name: kubelet
If you have done everything correctly, you will see the new linkerd target listed on the Prometheus Targets page.
Now, all the metrics scraped by the Linkerd-viz Prometheus instance are available in our Prometheus instance.
Install and configure Grafana
Next, let’s install Grafana with the default configuration:
helm upgrade --install grafana grafana/grafana \
--create-namespace \
--namespace monitoring
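To reach the UI, retrieve the admin password generated by the chart and port-forward the Grafana service; these commands mirror the chart’s default post-install notes:
kubectl get secret --namespace monitoring grafana -o jsonpath="{.data.admin-password}" | base64 --decode; echo
kubectl port-forward --namespace monitoring svc/grafana 3000:80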
After logging in, we need to add a new data source pointing to the Prometheus server running in the cluster. To do this:
- Expand the Connections option from the side pane and click Add new connection.
- Click Prometheus and enter the cluster-internal DNS name of the Prometheus server (http://prometheus-server.monitoring.svc.cluster.local) in the connection input.
- Click the Save & Test button at the bottom to complete the setup.
Next, let’s create a new dashboard that will contain the visualizations we will use to monitor the traffic to and from our nodes using the previously created data source targeting the Prometheus server. The visualizations we need are the following:
- CPU Usage per Kubernetes Node: This query displays the CPU usage percentage for each node. In this case, we expect a peak in CPU utilization on one of the nodes when we trigger the jobs to simulate the traffic.
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)
- TCP Read Bytes Total (Outbound): This query shows the total number of bytes read over TCP connections for outbound traffic in the vastaya namespace, grouped by namespace, pod, instance, destination zone, and source zone. These metrics are collected by the Linkerd proxy (see also the rate-based variant after the query below).
sum by (namespace, pod, instance, dst_zone, src_zone) (
  tcp_read_bytes_total{direction="outbound", namespace="vastaya"}
)
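Since tcp_read_bytes_total is a monotonically increasing counter, a rate-based variant grouped by zone pair can be easier to read when watching traffic shift in real time:
sum by (src_zone, dst_zone) (
  rate(tcp_read_bytes_total{direction="outbound", namespace="vastaya"}[5m])
)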
Simulate the traffic with HAZL disabled
With this setup in place, we can trigger a job that creates 5 replicas, each increasing the number of requests to the service every 10 seconds. To ensure that the pods are scheduled in the default node pool, away from the target application, we also set a node affinity.
apiVersion: batch/v1
kind: Job
metadata:
  name: bot-get-project-report
  namespace: vastaya
spec:
  completions: 5
  parallelism: 5
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: agentpool
                operator: In
                values:
                - default
      containers:
      - name: project-creator
        image: curlimages/curl:7.78.0
        command: ["/bin/sh", "-c"]
        args:
        - |
          API_URL="http://projects.vastaya.svc.cluster.local/1/report"
          get_report() {
            local num_requests=$1
            echo "Getting $num_requests tasks..."
            for i in $(seq 1 $num_requests); do
              (
                echo "Getting task $i..."
                GET_RESPONSE=$(curl -s -X GET "$API_URL")
                echo $GET_RESPONSE
              ) &
            done
            wait
          }
          wait_time=10
          for num_requests in 5000 10000 15000; do
            echo "Running with $num_requests requests..."
            get_report $num_requests
            echo "Waiting for $wait_time seconds before increasing requests..."
            sleep $wait_time
          done
      restartPolicy: Never
  backoffLimit: 1
Since we haven’t enabled HAZL yet, Kubernetes will direct the requests to pods running in different zones, resulting in an increase in cross-zone traffic.
Enable HAZL
Enabling HAZL is super easy. All we have to do is update the control plane values with the following command:
helm upgrade --install linkerd-enterprise-control-plane linkerd-buoyant/linkerd-enterprise-control-plane \
--set buoyantCloudEnabled=false \
--set license=$BUOYANT_LICENSE \
--set-file linkerd-control-plane.identityTrustAnchorsPEM=./certificates/ca.crt \
--set-file linkerd-control-plane.identity.issuer.tls.crtPEM=./certificates/issuer.crt \
--set-file linkerd-control-plane.identity.issuer.tls.keyPEM=./certificates/issuer.key \
--set controlPlaneConfig.destinationController.additionalArgs="{ -ext-endpoint-zone-weights }" \
--set controlPlaneConfig.proxy.additionalEnv[0].name=BUOYANT_BALANCER_LOAD_LOW \
--set controlPlaneConfig.proxy.additionalEnv[0].value="0.8" \
--set controlPlaneConfig.proxy.additionalEnv[1].name=BUOYANT_BALANCER_LOAD_HIGH \
--set controlPlaneConfig.proxy.additionalEnv[1].value="2.0" \
--namespace linkerd \
--create-namespace \
-f ./helm/linkerd-enterprise/values.yaml
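For reference, the HAZL-related --set flags above translate to the following snippet in the values file; this is a direct transcription, shown only for readability:
controlPlaneConfig:
  destinationController:
    additionalArgs:
    - -ext-endpoint-zone-weights
  proxy:
    additionalEnv:
    - name: BUOYANT_BALANCER_LOAD_LOW
      value: "0.8"
    - name: BUOYANT_BALANCER_LOAD_HIGH
      value: "2.0"
The low and high watermarks tell the proxy when to start pulling in endpoints from other zones and when to shrink back to zone-local ones, matching the load-based behavior described earlier.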
Simulate the traffic with HAZL enabled
After enabling HAZL, we recreate the job to simulate the traffic, and as the Grafana dashboard shows, the cross-zone communication has been completely eliminated.
References
- AWS Data Transfer Cost: https://aws.amazon.com/ec2/pricing/on-demand/#Data_Transfer_within_the_same_AWS_Region
- Linkerd and External Prometheus: https://linkerd.io/2-edge/tasks/external-prometheus/
- HAZL Official Documentation: https://docs.buoyant.io/buoyant-enterprise-linkerd/latest/features/hazl/
- BEL Official Documentation: https://docs.buoyant.io/buoyant-enterprise-linkerd/latest/installation/enterprise/
- Demo Source Code: https://github.com/GTRekter/Vastaya