Cláudio Filipe Lima Rapôso

Posted on May 31

Surviving Global Vendor Outages: Federated Cellular Architecture with EKS, AKS, and Istio

#azure #aws #terraform #python

Multi-region architectures built on a single cloud provider inherently rely on that vendor's global control planes. When a severe degradation strikes an underlying identity service or networking fabric within that provider, every regional partition can fail at once. Relying exclusively on Amazon Web Services (AWS) or Microsoft Azure caps the maximum theoretical availability of a platform at the operational integrity of that single vendor. A federated multicloud cellular architecture reduces this risk. By orchestrating isolated Kubernetes partitions across Amazon EKS and Azure AKS with a cross-cloud service mesh, engineering teams construct a routing matrix designed to survive a global outage at either vendor. This topology isolates fault domains at the cloud provider level, using dynamic BGP routing and proxy-based mutual TLS to establish a resilient, vendor-agnostic fabric. High-throughput workloads can then keep executing when an availability zone, a region, or an entire provider becomes unavailable.

Prerequisites

Deploying a federated multicloud mesh requires solid experience with advanced networking and container orchestration. The infrastructure state requires Terraform version 1.7.0 or higher, initialized with the hashicorp/aws provider version 5.40.0 and the hashicorp/azurerm provider version 3.90.0. The DestinationRule later in this guide is applied through the kubernetes_manifest resource, so the hashicorp/kubernetes provider must also be configured. For automating identity plane injection, Python 3.12 is required along with the kubernetes Python client version 29.0.0. The architecture relies on Istio version 1.21.0 or higher configured for multi-primary, multi-network deployments. Administrative access to provision AWS Transit Gateways, Azure Virtual Network Gateways, and BGP Autonomous System Numbers (ASNs) is mandatory.

Step-by-Step Implementation

Establishing the Cross-Cloud BGP Backbone

We construct the network bridge between the cloud providers by provisioning an IPsec Site-to-Site VPN that peers an AWS Transit Gateway directly with an Azure Virtual Network Gateway. The architectural justification for this overlay network is confidentiality and controlled routing. Sending cross-cloud cellular traffic over the public internet without a tunnel would expose internal telemetry to interception and leave path selection to external networks. The IPsec tunnel still traverses the internet, so latency is not guaranteed, but every packet is encrypted and authenticated with pre-shared keys, and the two sides exchange private prefixes only with each other. Enabling dynamic BGP routing over this connection makes the network self-healing. If a tunnel or an advertised AWS prefix becomes unreachable, for example during an Availability Zone network partition, BGP withdraws the affected routes. Once the routes converge, the Azure AKS cells redirect traffic to the surviving AWS subnets without manual operator intervention.

resource "aws_customer_gateway" "azure_gateway" {
  bgp_asn    = 65515
  ip_address = azurerm_public_ip.azure_vpn_ip.ip_address
  type       = "ipsec.1"
  tags = {
    Name = "Azure-AKS-Boundary"
  }
}

resource "aws_vpn_connection" "multicloud_mesh_vpn" {
  customer_gateway_id = aws_customer_gateway.azure_gateway.id
  transit_gateway_id  = aws_ec2_transit_gateway.mesh_tgw.id
  type                = aws_customer_gateway.azure_gateway.type
  static_routes_only  = false

  tunnel1_preshared_key = var.high_entropy_ipsec_key
  tunnel2_preshared_key = var.high_entropy_ipsec_key
}

resource "azurerm_local_network_gateway" "aws_tgw_local" {
  name                = "aws-tgw-boundary"
  resource_group_name = azurerm_resource_group.multicloud.name
  location            = azurerm_resource_group.multicloud.location
  gateway_address     = aws_vpn_connection.multicloud_mesh_vpn.tunnel1_address

  bgp_settings {
    asn                 = 64512
    # This gateway models the AWS side, so peer with the AWS (vgw) inside address, not the cgw one.
    bgp_peering_address = aws_vpn_connection.multicloud_mesh_vpn.tunnel1_vgw_inside_address
  }
}
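
AWS assigns each tunnel a /30 inside CIDR from 169.254.0.0/16 and takes the first usable address for its own side, leaving the second for the customer gateway, which is Azure in this topology. The following is a minimal sketch of how the two BGP peering addresses follow from that CIDR, which can help when checking which address belongs on which side. The helper name derive_bgp_peers and the example CIDR are assumptions made for illustration, not values from the configuration above.

```python
import ipaddress


def derive_bgp_peers(inside_cidr: str) -> tuple[str, str]:
    """Return (aws_side, azure_side) BGP addresses for one tunnel's /30 inside CIDR."""
    # strict=True rejects CIDRs with host bits set, e.g. 169.254.21.1/30.
    network = ipaddress.ip_network(inside_cidr, strict=True)
    link_local = ipaddress.ip_network("169.254.0.0/16")
    if network.prefixlen != 30 or not network.subnet_of(link_local):
        raise ValueError(f"{inside_cidr} is not a /30 inside 169.254.0.0/16")
    # A /30 has exactly two usable hosts: AWS takes the first, the peer the second.
    aws_side, azure_side = network.hosts()
    return str(aws_side), str(azure_side)


if __name__ == "__main__":
    # Hypothetical inside CIDR; substitute the value reported for your tunnel.
    aws_ip, azure_ip = derive_bgp_peers("169.254.21.0/30")
    print(f"AWS peer: {aws_ip}, Azure peer: {azure_ip}")
```

The first value is what the Azure local network gateway should peer with, and the second is the address the Azure gateway itself should use for BGP on that tunnel.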

How do we establish cryptographic trust between microservices operating in AWS EKS and Azure AKS when each vendor utilizes mutually exclusive certificate authorities for their internal managed identities?

Federating Identity via Unified Root CA

We establish cryptographic trust by moving the workload identity plane away from the cloud vendors entirely, deploying a multi-primary Istio service mesh anchored by a unified, offline Root Certificate Authority (CA). The key design decision is to decouple zero-trust workload authentication from AWS IAM and Microsoft Entra ID (formerly Azure AD), because an AWS IAM role cannot natively authenticate against an Azure workload identity. From the shared Root CA, we issue a separate intermediate certificate for the Istio control plane in each cluster, one for EKS and one for AKS. When the Envoy sidecar proxy in AWS EKS initiates a connection to the Envoy proxy in Azure AKS, both proxies present workload certificates signed by their respective intermediates. Because both intermediates chain back to the same offline Root CA, each proxy can validate the other's certificate and the mutual TLS (mTLS) handshake succeeds. This keeps workload identity independent of the vendor-specific identity silos.

import base64
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def encode_pem(pem_bytes: bytes) -> str:
    # Kubernetes Secret data values must be base64-encoded strings.
    return base64.b64encode(pem_bytes).decode('utf-8')

def inject_intermediate_ca(cluster_context: str, cert_path: str, key_path: str, root_cert_path: str) -> None:
    # Select the kubeconfig context of the cluster that receives this intermediate CA.
    config.load_kube_config(context=cluster_context)
    v1 = client.CoreV1Api()

    with open(cert_path, 'rb') as cert_file, open(key_path, 'rb') as key_file, open(root_cert_path, 'rb') as root_file:
        cert_pem = cert_file.read()
        key_pem = key_file.read()
        root_pem = root_file.read()

    # Istio's plug-in CA tooling builds cert-chain.pem as the intermediate followed by the root.
    chain_pem = cert_pem.rstrip(b"\n") + b"\n" + root_pem

    secret_manifest = client.V1Secret(
        metadata=client.V1ObjectMeta(
            name="cacerts",
            namespace="istio-system"
        ),
        type="Opaque",
        data={
            "ca-cert.pem": encode_pem(cert_pem),
            "ca-key.pem": encode_pem(key_pem),
            "root-cert.pem": encode_pem(root_pem),
            "cert-chain.pem": encode_pem(chain_pem)
        }
    )

    try:
        v1.create_namespaced_secret(namespace="istio-system", body=secret_manifest)
        print(f"Successfully injected intermediate CA into {cluster_context}")
    except ApiException as e:
        if e.status == 409:
            # The secret already exists, so overwrite it with the new material.
            v1.replace_namespaced_secret(name="cacerts", namespace="istio-system", body=secret_manifest)
            print(f"Updated existing intermediate CA in {cluster_context}")
        else:
            # Chain the original exception so the API error details are preserved.
            raise RuntimeError(f"Failed to inject CA: {e.reason}") from e

if __name__ == "__main__":
    inject_intermediate_ca("aws-eks-admin", "certs/aws-ca-cert.pem", "certs/aws-ca-key.pem", "certs/root-cert.pem")
    inject_intermediate_ca("azure-aks-admin", "certs/azure-ca-cert.pem", "certs/azure-ca-key.pem", "certs/root-cert.pem")
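
Because the cross-cloud handshake only succeeds when both clusters anchor to the same root, a pre-flight comparison of the root certificate in each cacerts secret can catch a mismatched file before any traffic fails. The sketch below assumes it receives the data mapping of each secret exactly as the Kubernetes API returns it (base64-encoded values). The helper names root_fingerprint and same_trust_root are hypothetical, and the check only compares root certificate bytes. It does not verify that each intermediate was actually signed by that root.

```python
import base64
import hashlib

PEM_HEADER = "-----BEGIN CERTIFICATE-----"
PEM_FOOTER = "-----END CERTIFICATE-----"


def root_fingerprint(secret_data: dict[str, str]) -> str:
    """SHA-256 hex digest of the DER bytes of root-cert.pem in a cacerts secret."""
    pem = base64.b64decode(secret_data["root-cert.pem"]).decode("ascii").strip()
    if not (pem.startswith(PEM_HEADER) and pem.endswith(PEM_FOOTER)):
        raise ValueError("root-cert.pem is not a PEM certificate")
    body = pem[len(PEM_HEADER):-len(PEM_FOOTER)]
    if PEM_HEADER in body:
        raise ValueError("root-cert.pem contains more than one certificate")
    # Remove line wrapping so differently wrapped copies hash identically.
    der = base64.b64decode("".join(body.split()))
    return hashlib.sha256(der).hexdigest()


def same_trust_root(eks_secret_data: dict[str, str], aks_secret_data: dict[str, str]) -> bool:
    """True when both secrets carry byte-identical root certificates."""
    return root_fingerprint(eks_secret_data) == root_fingerprint(aks_secret_data)


def fake_secret(der: bytes) -> dict[str, str]:
    """Build placeholder secret data for the demo; this is not a real certificate."""
    pem = f"{PEM_HEADER}\n{base64.b64encode(der).decode('ascii')}\n{PEM_FOOTER}\n"
    return {"root-cert.pem": base64.b64encode(pem.encode("ascii")).decode("ascii")}


if __name__ == "__main__":
    shared = fake_secret(b"placeholder-root")
    other = fake_secret(b"different-root")
    print(same_trust_root(shared, shared))
    print(same_trust_root(shared, other))
```

In practice the two mappings would come from reading the cacerts secret in each cluster context, for example with read_namespaced_secret from the same kubernetes client used above.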

When an EKS cell experiences CPU starvation and begins dropping requests, how does the service mesh autonomously redirect the active request payload to a healthy AKS cell without surfacing the HTTP 503 error to the end user?

Locality Load Balancing and Cross-Cloud Fallback

The mesh steers traffic away from failing endpoints by combining Istio locality load balancing with outlier detection. Cellular architecture dictates that network traffic should remain within its origin cell to minimize blast radius and reduce cross-network egress data transfer costs. However, when the primary AWS EKS cell degrades, the mesh must act as a responsive failover mechanism. We configure a DestinationRule that names the Azure region hosting the AKS cell as the failover locality for the AWS region. The same rule configures outlier detection, which Istio requires for locality failover to take effect, to watch for consecutive HTTP 5xx errors. If an AWS pod begins failing, the local Envoy proxy ejects that specific pod from the load balancing pool. Once the EKS locality runs out of healthy endpoints, Envoy routes new requests over the BGP IPsec backbone to the healthy Azure AKS locality. The requests that trigger an ejection still fail unless a retry policy, for example on a VirtualService, re-sends them, so pair this rule with retries if the HTTP 503 must never reach the client. With retries in place, the client application experiences a slight latency increase but receives a successful HTTP 200 response, and the AWS infrastructure failure is masked.

resource "kubernetes_manifest" "multicloud_failover_policy" {
  manifest = {
    "apiVersion" = "networking.istio.io/v1beta1"
    "kind"       = "DestinationRule"
    "metadata" = {
      "name"      = "transaction-service-routing"
      "namespace" = "enterprise-core"
    }
    "spec" = {
      "host" = "transaction-service.enterprise-core.svc.cluster.local"
      "trafficPolicy" = {
        "outlierDetection" = {
          "consecutive5xxErrors" = 3
          "interval"             = "10s"
          "baseEjectionTime"     = "30s"
          "maxEjectionPercent"   = 100
        }
        "loadBalancer" = {
          "localityLbSetting" = {
            "enabled" = true
            "failover" = [
              {
                "from" = "us-east-1"
                "to"   = "eastus"
              }
            ]
          }
        }
      }
    }
  }
}
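
To make the interaction between ejection and locality failover concrete, here is a deliberately simplified model in Python. It is a toy, not Envoy's algorithm: it ignores the scan interval, never re-admits endpoints after baseEjectionTime, and switches localities all at once, whereas Envoy shifts traffic gradually as the share of healthy primary endpoints falls. The class and pod names are hypothetical, and only the threshold of 3 and the two locality names mirror the DestinationRule above.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    locality: str
    consecutive_5xx: int = 0
    ejected: bool = False


@dataclass
class ToyLocalityBalancer:
    endpoints: list[Endpoint]
    primary: str
    fallback: str
    threshold: int = 3  # mirrors consecutive5xxErrors

    def record(self, endpoint: Endpoint, status: int) -> None:
        """Count consecutive 5xx responses and eject the endpoint at the threshold."""
        if 500 <= status <= 599:
            endpoint.consecutive_5xx += 1
            if endpoint.consecutive_5xx >= self.threshold:
                endpoint.ejected = True
        else:
            # Any non-5xx response resets the streak.
            endpoint.consecutive_5xx = 0

    def healthy(self, locality: str) -> list[Endpoint]:
        """Endpoints in the locality that have not been ejected."""
        return [e for e in self.endpoints if e.locality == locality and not e.ejected]

    def pick_locality(self) -> str:
        """Stay in the primary locality while it has at least one healthy endpoint."""
        if self.healthy(self.primary):
            return self.primary
        if self.healthy(self.fallback):
            return self.fallback
        raise RuntimeError("no healthy endpoints in any locality")


if __name__ == "__main__":
    pods = [
        Endpoint("eks-pod-a", "us-east-1"),
        Endpoint("eks-pod-b", "us-east-1"),
        Endpoint("aks-pod-a", "eastus"),
    ]
    balancer = ToyLocalityBalancer(pods, primary="us-east-1", fallback="eastus")
    for pod in pods[:2]:
        for _ in range(3):
            balancer.record(pod, 503)
    print(balancer.pick_locality())
```

The model also shows why maxEjectionPercent is set to 100 in the rule: with a lower ceiling, some failing EKS endpoints would stay in the pool and the locality would never count as fully unhealthy.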

When automated failover successfully masks a cross-cloud reroute, how do platform operators identify the silent degradation of the primary AWS network before the Azure fallback partition reaches total capacity exhaustion?

Common Troubleshooting

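Platform operators identify silent degradation by watching proxy telemetry, such as a rising share of requests served from the Azure locality or a growing number of ejected EKS endpoints, though misconfigurations can mask these signals. When establishing the cross-cloud BGP backbone, the BGP session may remain in the Active state rather than transitioning to Established. This means the peers cannot complete a BGP connection, most often because the IPsec tunnel underneath never came up after a failed IKE phase 1 or phase 2 proposal negotiation. Verify that the encryption algorithms and Diffie-Hellman groups in the tunnel options of the AWS VPN connection exactly match the custom IPsec policy applied to the Azure Virtual Network Gateway connection. If the tunnel is up, check that the ASNs and BGP peering addresses configured on each side mirror each other.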
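
For the silent degradation signal itself, one workable indicator is the fraction of requests that originate in a cell but are served by another cluster. The sketch below assumes request counts have already been aggregated per (source cluster, destination cluster) pair, for example from mesh request metrics labeled that way. How your telemetry is labeled is an assumption to verify, and the cluster names, the 0.2 threshold, and the function names are all hypothetical.

```python
def cross_cloud_share(request_counts: dict[tuple[str, str], int], source: str) -> float:
    """Fraction of requests from `source` that were served by a different cluster."""
    total = sum(n for (src, _dst), n in request_counts.items() if src == source)
    if total == 0:
        return 0.0
    remote = sum(
        n for (src, dst), n in request_counts.items() if src == source and dst != source
    )
    return remote / total


def is_silently_degraded(
    request_counts: dict[tuple[str, str], int], source: str, threshold: float = 0.2
) -> bool:
    """Flag the source cell when more than `threshold` of its traffic leaves the cell."""
    return cross_cloud_share(request_counts, source) > threshold


if __name__ == "__main__":
    # Hypothetical counts over one observation window.
    counts = {
        ("eks-cell-1", "eks-cell-1"): 700,
        ("eks-cell-1", "aks-cell-1"): 300,
    }
    print(f"{cross_cloud_share(counts, 'eks-cell-1'):.0%}")
    print(is_silently_degraded(counts, "eks-cell-1"))
```

In a healthy cell the share should sit near zero, so even a modest threshold gives operators time to react before the fallback cell absorbs the full load.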

If the BGP backbone is healthy but cross-cloud requests return HTTP 503 Service Unavailable with the UF (Upstream Connection Failure) and URX (Upstream Retry Limit Exceeded) response flags in the Istio proxy access logs, a failing mTLS handshake is a likely cause. This frequently occurs when the intermediate CA certificates injected into the istio-system namespace have expired or when the SPIFFE IDs of the workloads do not match the expected trust domain. Inspect the Envoy sidecar logs for TLS error: Secret is not supplied by SDS and verify that the Istiod control plane actually loaded the cacerts secret at startup. Restarting Istiod after creating or replacing the secret is a common remedy.
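
When triaging these 503s, it helps to decode the response flags field of the access log rather than reading it by eye. The following sketch maps a handful of Envoy response flags to short descriptions. The table is intentionally partial, the wording of each description is a paraphrase, and the function name explain_flags is hypothetical.

```python
# Partial table of Envoy access-log response flags relevant to cross-cloud 503s.
ENVOY_RESPONSE_FLAGS = {
    "UF": "upstream connection failure (commonly seen with failed TLS handshakes)",
    "URX": "upstream retry limit or maximum connect attempts reached",
    "UH": "no healthy upstream hosts in the cluster",
    "UO": "upstream overflow (circuit breaker tripped)",
    "UT": "upstream request timeout",
    "NR": "no route configured for the request",
}


def explain_flags(flag_field: str) -> list[str]:
    """Expand a comma-separated response-flags field such as 'UF,URX'."""
    field = flag_field.strip()
    if field in ("", "-"):
        # Envoy logs '-' when no response flag was set.
        return []
    return [
        f"{flag}: {ENVOY_RESPONSE_FLAGS.get(flag, 'not covered by this partial table')}"
        for flag in (part.strip() for part in field.split(","))
    ]


if __name__ == "__main__":
    for line in explain_flags("UF,URX"):
        print(line)
```

A UH flag points at outlier detection having ejected every endpoint rather than at certificates, so separating the two cases early narrows the search considerably.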

Conclusion

Federating Kubernetes clusters across AWS and Azure with an Istio service mesh delivers a highly resilient execution environment. By bridging EKS and AKS with a dynamic BGP overlay network and enforcing identity through a vendor-agnostic Root CA, architectures can survive the loss of a cloud provider's regional control plane, and even of the provider itself. Organizations adopting this cellular model should next implement GitOps methodologies using tools like Argo CD. This keeps application deployment states continuously synchronized across the multicloud matrix, preventing configuration drift from compromising the Azure fallback paths during critical failover events.

