Cláudio Filipe Lima Rapôso

Posted on Jul 3

Multicloud Service Mesh Federation: Orchestrating Cross-Cloud Zero-Trust Traffic via Istio and SPIFFE/SPIRE

#azure #aws #terraform #serverless

Bridging internal microservices across distinct cloud providers introduces severe identity verification and traffic encryption challenges. This technical guide details the implementation of a federated service mesh topology spanning Amazon EKS and Azure AKS. By anchoring workload identity to the SPIFFE/SPIRE framework and orchestrating cross-cloud mTLS via Istio Multi-Primary gateways, engineering teams establish an authenticated, low-latency communication plane that eliminates the risks of shared secrets or public network exposure.

Introduction

Modern distributed systems frequently spread workloads across multiple cloud ecosystems to capitalize on specialized managed services or enforce strict regional redundancy. However, establishing secure, low-latency communication between a microservice running in Amazon Elastic Kubernetes Service (Amazon EKS) and a dependent workload in Azure Kubernetes Service (Azure AKS) exposes fundamental flaws in traditional networking patterns. Relying on public API gateways adds latency and expands the external attack surface, while maintaining permanent site-to-site VPNs or dedicated private circuits incurs significant operational complexity and still does not provide application-layer security. This guide challenges these network-centric approaches and the assumption behind them: that network perimeter security equates to workload authorization.

To achieve a true zero-trust posture across cloud boundaries, organizations must decouple security from the underlying network topology by implementing a federated service mesh. By integrating Istio Multi-Primary architectures on separate networks with the Secure Production Identity Framework For Everyone (SPIFFE) and its reference implementation, the SPIFFE Runtime Environment (SPIRE), engineering teams can establish a unified cryptographic identity plane. Workloads in AWS and Azure receive short-lived, verifiable identities that allow them to initiate mutually authenticated TLS (mTLS) connections directly across provider boundaries, ensuring that every cross-cloud request is encrypted, authenticated, and authorized regardless of the transit network.

Prerequisites

Deploying a federated multicloud service mesh demands exhaustive familiarity with Kubernetes networking primitives, mutual TLS mechanics, and cryptographic identity generation. The infrastructure layer is automated utilizing Terraform version 1.7.0 or higher, leveraging the HashiCorp AWS, AzureRM, and Helm providers. The Kubernetes environments require Amazon EKS and Azure AKS running version 1.29 or later, with Istio version 1.21 or higher installed via Helm. Identity federation necessitates the deployment of SPIRE version 1.9.0 within both clusters, utilizing OpenID Connect (OIDC) federation to bootstrap the initial trust relationships between the SPIRE servers and the cloud-native identity providers.

Step-by-Step Implementation

Establishing the Unified SPIFFE/SPIRE Identity Plane

The foundational requirement of a cross-cloud zero-trust architecture is the complete elimination of shared cryptographic keys and long-lived static credentials. If an EKS pod must present an AWS IAM access key to prove its identity to an AKS consumer, the system inherits both the operational overhead of rotating that key and the risk of it leaking. We resolve this identity challenge by deploying SPIRE as the root of trust in both cloud environments.

On the EKS side, SPIRE Agents validate the identity of local pods using the Kubernetes workload attestor, checking attributes such as the namespace, service account name, and pod UID, and the SPIRE Server issues SVIDs only to workloads that match a registered entry. The SPIRE deployment in Azure AKS performs the same attestation for its local workloads. Crucially, we federate the two SPIRE installations by establishing a SPIFFE trust bundle exchange. Each SPIRE server periodically retrieves the trust bundle (the public signing certificates and keys) of its counterpart, so workloads in AWS can validate SVIDs issued in the Azure trust domain, and vice versa.

# Helm deployment of the SPIRE Server on Amazon EKS, federated with the Azure trust domain.
# Value keys differ between chart versions; check them against the chart's values.yaml.
resource "helm_release" "spire_server_aws" {
  name             = "spire"
  repository       = "https://spiffe.github.io/helm-charts-hardened"
  chart            = "spire"
  namespace        = "spire-system"
  create_namespace = true

  set {
    name  = "server.configuration.trustDomain"
    value = "aws.enterprise.internal"
  }

  set {
    name  = "server.configuration.federation.azure.trustDomain"
    value = "azure.enterprise.internal"
  }

  set {
    name  = "server.configuration.federation.azure.bundleUrl"
    value = "https://spire.azure.enterprise.internal/v1/fedbundle"
  }
}
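Where the SPIRE Controller Manager is running in the cluster, the same federation relationship can also be declared with Kubernetes custom resources. The following is a minimal sketch under that assumption, not the configuration used above: the resource names and the `spiffe.io/spire-managed-identity` pod label are illustrative, the `https_web` profile assumes the Azure bundle endpoint serves a publicly trusted certificate, and the bundle URL is the one from the Terraform example.

```yaml
# Declares the Azure trust domain as a federation partner of the AWS SPIRE server.
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterFederatedTrustDomain
metadata:
  name: azure            # illustrative name
spec:
  trustDomain: azure.enterprise.internal
  bundleEndpointURL: https://spire.azure.enterprise.internal/v1/fedbundle
  bundleEndpointProfile:
    type: https_web      # assumption: endpoint uses a Web PKI certificate
---
# Registers workloads and asks SPIRE to hand them the Azure bundle as well.
apiVersion: spire.spiffe.io/v1alpha1
kind: ClusterSPIFFEID
metadata:
  name: mesh-workloads   # illustrative name
spec:
  # Matches the ns/<namespace>/sa/<service-account> layout that Istio expects.
  spiffeIDTemplate: "spiffe://{{ .TrustDomain }}/ns/{{ .PodMeta.Namespace }}/sa/{{ .PodSpec.ServiceAccountName }}"
  podSelector:
    matchLabels:
      spiffe.io/spire-managed-identity: "true"   # illustrative opt-in label
  federatesWith:
  - azure.enterprise.internal
```

The `federatesWith` list is what makes the Azure trust bundle available to the selected workloads; without it, a pod receives only its own trust domain's bundle and cannot verify Azure-issued SVIDs. A mirrored pair of resources would be needed in the AKS cluster.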

Once the SPIRE infrastructure is federated and capable of issuing globally verifiable identities, how do we integrate this identity plane with the Istio service mesh to handle automated certificate rotation?

Configuring Istio to Utilize SPIRE-Issued Workload Certificates

We inject the SPIRE-generated cryptographic identities directly into the Istio data plane by modifying the Envoy sidecar configuration to communicate with the SPIRE Agent via the Envoy Secret Discovery Service (SDS) API. By default, Istio utilizes its own internal certificate authority (Istiod) to issue workload certificates. In a multicloud deployment, this creates a split-brain scenario where the AWS Envoy proxies cannot validate certificates generated by the Azure Istiod control plane without complex, manual root CA synchronization.

By pointing Envoy at the SPIRE Agent's Unix domain socket instead of the native Istio CA, the proxy bypasses Istiod for credential issuance. When a pod starts, the SPIRE Agent on its node attests the workload, obtains a short-lived X.509 SVID carrying the workload's unique SPIFFE ID (such as spiffe://aws.enterprise.internal/ns/core/sa/payment-service), and streams it into Envoy's memory via SDS, rotating it before it expires.

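# IstioOperator configuration that makes Envoy obtain its identity from the SPIRE Agent SDS socket.
# Assumes the SPIFFE CSI driver (csi.spiffe.io) is installed alongside the SPIRE Agent.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
metadata:
  namespace: istio-system
  name: istio-spire-integration
spec:
  meshConfig:
    trustDomain: aws.enterprise.internal
  values:
    sidecarInjectorWebhook:
      templates:
        # Custom injection template; pods opt in with the annotation
        # inject.istio.io/templates: "sidecar,spire"
        spire: |
          spec:
            containers:
            - name: istio-proxy
              volumeMounts:
              # istio-proxy uses the socket in this directory as its SDS source
              - name: workload-socket
                mountPath: /run/secrets/workload-spiffe-uds
                readOnly: true
            volumes:
            - name: workload-socket
              csi:
                driver: "csi.spiffe.io"
                readOnly: true
  components:
    # Gateways need the same workload-socket volume mounted to receive SPIRE identities
    ingressGateways:
    - name: istio-ingressgateway
      enabled: true
    egressGateways:
    - name: istio-egressgateway
      enabled: true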
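On the workload side, a pod typically has to opt in before its sidecar uses the SPIRE socket. The sketch below assumes a custom injection template named `spire` has been defined in the Istio installation (the pattern described in Istio's SPIRE integration guide) and that SPIRE registers pods by the `spiffe.io/spire-managed-identity` label. The image reference, label, and port are hypothetical.

```yaml
# Hypothetical payment-service Deployment opting in to SPIRE-issued identity.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: core
spec:
  replicas: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
        # Assumed selector label used by SPIRE to register this pod
        spiffe.io/spire-managed-identity: "true"
      annotations:
        # Apply the default sidecar template plus the assumed "spire" template
        inject.istio.io/templates: "sidecar,spire"
    spec:
      # The service account name becomes the sa/ segment of the SPIFFE ID
      serviceAccountName: payment-service
      containers:
      - name: payment-service
        image: registry.example.com/payment-service:1.0.0  # hypothetical image
        ports:
        - containerPort: 8080
```

With this in place, the workload's identity would resolve to spiffe://aws.enterprise.internal/ns/core/sa/payment-service, matching the example ID used earlier.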

With the data plane proxies utilizing federated SPIRE certificates, how do we configure the cross-cloud routing layers to safely pass this mTLS traffic over the public internet or private interconnects without breaking TLS SNI matching?

Engineering Cross-Cloud Traffic Routing via Istio East-West Gateways

We secure and route cross-cloud traffic by deploying dedicated Istio East-West gateways within both clusters, configuring them to execute SNI-based passthrough routing. Exposing internal cluster endpoints directly to the external network perimeter introduces profound security vulnerabilities. Instead, we route all outbound cross-cloud requests through an Egress Gateway in the source cluster, which directs the traffic across the network perimeter to an Ingress Gateway acting as an East-West entry point in the destination cluster.

The gateways are configured to read the Server Name Indication (SNI) header within the TLS handshake, matching the target service's fully qualified domain name (such as order-service.core.svc.azure.enterprise.internal). Crucially, the gateways do not terminate the TLS connection; they act as Layer 4 reverse proxies, passing the encrypted bytes directly to the target pod where the final mTLS decryption and SPIFFE identity validation occur.

# Istio Gateway configuration for East-West cross-cloud routing
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: cross-cloud-gateway
  namespace: istio-system
spec:
  selector:
    istio: eastwestgateway
  servers:
  - port:
      number: 15443
      name: tls
      protocol: TLS
    tls:
      mode: AUTO_PASSTHROUGH
    hosts:
    - "*.azure.enterprise.internal"
---
# DestinationRule enforcing mTLS and targeting the Azure infrastructure gateway
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: route-to-azure-mesh
  namespace: core
spec:
  host: "*.azure.enterprise.internal"
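  trafficPolicy:
    tls:
      # ISTIO_MUTUAL requires every other TLS field, including sni, to be left empty;
      # Istio sets the SNI that the AUTO_PASSTHROUGH gateway routes on.
      mode: ISTIO_MUTUAL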
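Routing and encryption alone do not authorize a caller. The destination cluster still has to state which remote identities may reach a service. The following sketch, intended for the AKS cluster, requires mTLS in the core namespace and admits only the AWS payment-service identity to order-service. The `app: order-service` label and the resource names are assumptions; the trust domain, namespace, and service account come from the examples above.

```yaml
# Reject any plaintext traffic to workloads in the core namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: core
spec:
  mtls:
    mode: STRICT
---
# Allow only the AWS-side payment-service identity to call order-service.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-aws-payment-service   # illustrative name
  namespace: core
spec:
  selector:
    matchLabels:
      app: order-service            # assumed workload label
  action: ALLOW
  rules:
  - from:
    - source:
        # Principal is the SPIFFE ID without the spiffe:// prefix
        principals:
        - "aws.enterprise.internal/ns/core/sa/payment-service"
```

Because an ALLOW policy that selects a workload denies every request that matches none of its rules, other callers of order-service would each need their own principal entry.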

Common Troubleshooting

The most common failure mode in federated multicloud service meshes is misaligned trust bundles between the two SPIRE servers. If a microservice in AWS attempts to connect to an Azure workload and the Envoy sidecar logs show certificate validation errors such as CERTIFICATE_VERIFY_FAILED during the TLS handshake, the first thing to check is whether the SPIRE servers have exchanged their trust bundles. Operators must verify that the endpoint exposing the SPIRE federation bundle (the /v1/fedbundle path in this deployment) is reachable across the cloud perimeter and presents a certificate the remote server can validate. If automated synchronization is blocked by strict firewall rules, operators can fall back to an out-of-band pipeline, for example driven from Terraform, that periodically exports the AWS SPIRE server's trust bundle (spire-server bundle show -format spiffe) and loads it into the Azure SPIRE server (spire-server bundle set -format spiffe -id spiffe://aws.enterprise.internal), and does the same in the opposite direction.

Another serious operational complication involves MTU limits on cross-cloud paths. VPN tunnels and virtual private gateways add encapsulation overhead, so the effective path MTU is often below the standard 1500-byte Maximum Transmission Unit (MTU). The TLS handshake messages that carry SPIFFE certificate chains are large enough to fill full-size TCP segments, and when those segments exceed the path MTU and Path MTU Discovery is broken (for example because ICMP is filtered), they are silently dropped. The symptom is distinctive: ping tests and the TCP three-way handshake succeed because their packets are small, but the mTLS negotiation stalls and times out. To remediate this, engineers clamp the TCP Maximum Segment Size (MSS) to a conservative value such as 1300 bytes, either in the underlying Kubernetes CNI or node network configuration or through an explicit EnvoyFilter that sets socket options, so that handshake segments cross the multicloud path without being fragmented or dropped.
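As a rough check on the 1300-byte figure, assume IPv4 and TCP headers without options (20 bytes each). The actual encapsulation overhead depends on the tunnel technology in use and is not specified here.

```latex
\begin{aligned}
\text{MSS}_{\text{default}} &= 1500 - 20 - 20 = 1460 \text{ bytes} \\
\text{packet size at MSS } 1300 &= 1300 + 20 + 20 = 1340 \text{ bytes} \\
\text{headroom for encapsulation} &= 1500 - 1340 = 160 \text{ bytes}
\end{aligned}
```

Under these assumptions, a 1300-byte MSS leaves 160 bytes of headroom on a 1500-byte link for tunnel and encryption headers.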

Conclusion

Federating Istio and SPIFFE/SPIRE across Amazon EKS and Azure AKS establishes a robust, network-agnostic communication plane built on zero-trust principles. Because authentication and authorization are decoupled from network topology, compromising an intermediate router, load balancer, or cloud network configuration does not by itself let an attacker spoof a workload identity or read transactional payloads in transit. As the platform scales, engineering teams should evaluate WebAssembly (Wasm) plugins within the Envoy proxies. This approach allows compliance and data-masking policies to be injected dynamically at the mesh edge, applying security controls before a payload ever crosses the cloud boundary.


