Architecting Zero Trust Multicloud Identity: Federating SPIFFE/SPIRE Across AWS and Azure

#azure #aws #terraform #python

Rigid network segregation and localized perimeter security topologies fail when attempting to unify communication between microservices distributed across multiple cloud providers. When infrastructure engineers extend communication buses between Amazon Web Services (AWS) and Microsoft Azure, implicit trust based on CIDR blocks exposes the traffic to severe interception vulnerabilities. Relying solely on IPsec tunnels or dedicated connections does not mitigate the risk of lateral movement if a single container is compromised. Security in multicloud environments requires a federated Service Mesh founded strictly on the Zero Trust paradigm. By implementing an identity federation based on the Secure Production Identity Framework for Everyone (SPIFFE) specification managed by SPIRE, organizations replace network address validation with cryptographic workload attestation. This topology ensures that every microservice possesses a universally verifiable identity, enabling the execution of end-to-end mutual TLS (mTLS) across isolated cloud boundaries without relying on vulnerable network perimeters.

Prerequisites

Constructing a federated identity infrastructure requires compliance with modern infrastructure as code specifications and container management paradigms. The orchestration environment demands Amazon Elastic Kubernetes Service (EKS) and Azure Kubernetes Service (AKS) clusters running version 1.29 or higher. Infrastructure automation utilizes Terraform version 1.7.0 or later, integrated with the HashiCorp AWS Provider version 5.40.0 and the AzureRM Provider version 3.90.0. The identity federation layer requires the deployment of SPIRE (SPIFFE Runtime Environment) version 1.9.0, supplemented by the pyspiffe library version 0.3.0 for Python 3.12 applications requiring direct integration with the Workload API.

Step-by-Step Implementation

Provisioning the Infrastructure of Mutual Trust

Anchoring a Zero Trust identity across multiple cloud providers relies on infrastructure that can validate the platform and cryptographic properties of workloads at runtime. Instead of issuing manual, long-lived certificates, we deploy SPIRE Server instances in both clouds and configure them to federate. The AWS server obtains its AWS credentials through the IAM Roles for Service Accounts (IRSA) mechanism and uses them to attest nodes, while the Azure server validates workloads via Azure Workload Identity. The two instances periodically exchange their trust bundles, which are published as JSON Web Key Sets (JWKS). This exchange allows a cluster running in Azure to autonomously verify that an SVID presented by a container originating from AWS was signed by a legitimate authority. We provision the foundational AWS attestation role with Terraform so that the SPIRE Server has precisely the permissions required to query the EC2 and Auto Scaling APIs.

resource "aws_iam_role" "spire_server_attestation" {
  name = "spire-server-node-attestor-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRoleWithWebIdentity"
        Effect = "Allow"
        Principal = {
          Federated = var.aws_eks_oidc_provider_arn
        }
        Condition = {
          StringEquals = {
            "${var.aws_eks_oidc_url}:sub" = "system:serviceaccount:spire:spire-server"
          }
        }
      }
    ]
  })
}

resource "aws_iam_policy" "spire_node_resolver" {
  name        = "spire-node-resolver-policy"
  description = "Permits SPIRE Server to validate EC2 instance metadata for EKS nodes"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect   = "Allow"
        Action   = [
          "ec2:DescribeInstances",
          "autoscaling:DescribeAutoScalingGroups"
        ]
        Resource = "*"
      }
    ]
  })
}

resource "aws_iam_role_policy_attachment" "spire_attestor_bind" {
  role       = aws_iam_role.spire_server_attestation.name
  policy_arn = aws_iam_policy.spire_node_resolver.arn
}

Once the foundational attestation infrastructure is established within the cloud providers, how do we translate this validation into explicit SPIRE configurations to enable the exchange of cryptographic keys between AWS and Azure?

Configuring SPIFFE Identity Federation

We translate mutual trust by declaring structured configuration manifests that define the strict boundaries of the trust domains. The SPIRE Server located in AWS manages the aws.enterprise.internal domain, while Azure governs azure.enterprise.internal. Each SPIRE Server configuration must explicitly point to the HTTPS bundle endpoint that exposes the remote domain's public keys. This alignment guarantees that when a financial processing microservice running in AKS attempts to communicate with an inventory microservice in EKS, the cryptographic signature of the X.509 SPIFFE Verifiable Identity Document (SVID) can be validated locally without a high-latency round trip back to the origin cloud. We deploy this configuration deterministically using a Terraform Kubernetes ConfigMap resource, keeping the federation rules version controlled and reproducible.

resource "kubernetes_config_map" "spire_server_config" {
  metadata {
    name      = "spire-server"
    namespace = "spire"
  }

  data = {
    "server.conf" = <<-EOT
      server {
          bind_address = "0.0.0.0"
          bind_port = "8081"
          trust_domain = "aws.enterprise.internal"
          data_dir = "/run/spire/data"
          log_level = "INFO"

          # Federation settings are declared inside the server block,
          # with one federates_with entry per remote trust domain
          federation {
              federates_with "azure.enterprise.internal" {
                  bundle_endpoint_url = "https://spire-bundle.azure.enterprise.internal/spiffe/v1/bundle"
                  bundle_endpoint_profile "https_spiffe" {
                      endpoint_spiffe_id = "spiffe://azure.enterprise.internal/spire/server"
                  }
              }
          }
      }
      plugins {
          DataStore "sql" {
              plugin_data {
                  database_type = "postgres"
                  connection_string = "host=spire-db.internal user=spire dbname=spire sslmode=disable"
              }
          }
          NodeAttestor "aws_iid" {
              plugin_data {
                  cluster_name = "production-eks-cluster"
              }
          }
          KeyManager "memory" {
              plugin_data {}
          }
      }
    EOT
  }
}
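
For orientation, the document served by a federation bundle endpoint is a JWKS in which each key is tagged with its intended use. The sketch below relies only on the standard library and a hand-written placeholder document (the x5c and kid values are dummy strings, not real key material) to show how a consumer might separate X.509 authorities from JWT signing keys. The function name summarize_bundle and the sample values are assumptions made for illustration.

```python
import json

# Hand-written stand-in for a bundle endpoint response; values are placeholders.
SAMPLE_BUNDLE = json.dumps({
    "spiffe_sequence": 12,
    "spiffe_refresh_hint": 300,
    "keys": [
        {"use": "x509-svid", "kty": "EC", "crv": "P-256", "x5c": ["PLACEHOLDER_BASE64_DER"]},
        {"use": "jwt-svid", "kty": "EC", "crv": "P-256", "kid": "placeholder-key-id"},
    ],
})


def summarize_bundle(raw: str) -> dict:
    """Count the authorities in a SPIFFE bundle document by their declared use."""
    doc = json.loads(raw)
    keys = doc.get("keys", [])
    return {
        "sequence": doc.get("spiffe_sequence"),
        "refresh_hint_seconds": doc.get("spiffe_refresh_hint"),
        "x509_authorities": sum(1 for k in keys if k.get("use") == "x509-svid"),
        "jwt_authorities": sum(1 for k in keys if k.get("use") == "jwt-svid"),
    }


summary = summarize_bundle(SAMPLE_BUNDLE)
print(summary)
```

A summary like this can be logged on each side of the federation to confirm that both servers see the same sequence number for a given trust domain.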

With the trust domains federated and the servers communicating reliably, how do the applications utilize these dynamic identities at runtime to ensure traffic isolation through rigorous authorization policies?

Direct Workload API Integration for mTLS

The practical enforcement of Zero Trust is achieved by integrating the application directly with the local SPIRE Agent via the SPIFFE Workload API. While sidecar proxies like Envoy are common, highly secure or performance-critical Python microservices can fetch their cryptographic identities directly from the local UNIX domain socket provided by SPIRE. The Python application invokes the pyspiffe library, which communicates with the local agent to retrieve the X.509 SVID and the federated trust bundle. When the AWS application initiates an HTTPS request to the Azure endpoint, it loads its SVID into the TLS context. During the handshake, the Azure microservice validates the certificate chain against the federated trust bundle, checks the expiration of the certificate, and cross-references the SPIFFE ID contained within the Subject Alternative Name (SAN) against a strict access control list. If the received identifier is spiffe://aws.enterprise.internal/ns/finance/sa/payment-processor, and the local policy only permits connections from the audit domain, the receiving service immediately terminates the connection.

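import http.client
import os
import ssl
import tempfile
from urllib.parse import urlsplit

from cryptography.hazmat.primitives import serialization
from pyspiffe.spiffe_id.spiffe_id import SpiffeId
from pyspiffe.workloadapi.default_x509_source import DefaultX509Source

# The SPIFFE Workload API UNIX socket is mounted by the SPIRE Agent
os.environ["SPIFFE_ENDPOINT_SOCKET"] = "unix:///run/spire/sockets/agent.sock"
AZURE_TARGET_ENDPOINT = "https://transactions.azure.enterprise.internal/api/v1/process"
EXPECTED_AZURE_SPIFFE_ID = SpiffeId.parse("spiffe://azure.enterprise.internal/ns/production/sa/api-gateway")
PEM = serialization.Encoding.PEM


def _get(accessor):
    # Assumption: depending on the pyspiffe release, accessors such as cert_chain
    # or trust_domain are methods or properties, so accept both. Verify the import
    # paths and accessor names used here against the installed version.
    return accessor() if callable(accessor) else accessor


def build_mtls_context(x509_source) -> ssl.SSLContext:
    # Fetch the dynamic SVID and trust bundle from the local SPIRE Agent
    svid = x509_source.get_x509_svid()
    bundle = x509_source.get_bundle_for_trust_domain(_get(EXPECTED_AZURE_SPIFFE_ID.trust_domain))

    # Trust only the federated Azure authorities, never the system CA store
    ca_pem = "".join(c.public_bytes(PEM).decode() for c in _get(bundle.x509_authorities))
    context = ssl.create_default_context(cadata=ca_pem)
    # X.509-SVIDs carry a URI SAN rather than a DNS name, so hostname checking is
    # replaced by the explicit SPIFFE ID comparison performed after the handshake
    context.check_hostname = False
    context.verify_mode = ssl.CERT_REQUIRED

    # load_cert_chain() only accepts file paths, so stage the PEM material briefly
    chain_pem = b"".join(c.public_bytes(PEM) for c in _get(svid.cert_chain))
    key_pem = _get(svid.private_key).private_bytes(
        PEM, serialization.PrivateFormat.PKCS8, serialization.NoEncryption()
    )
    with tempfile.TemporaryDirectory() as tmp:
        cert_path = os.path.join(tmp, "svid.pem")
        key_path = os.path.join(tmp, "svid.key")
        with open(cert_path, "wb") as cert_file:
            cert_file.write(chain_pem)
        with open(key_path, "wb") as key_file:
            key_file.write(key_pem)
        context.load_cert_chain(certfile=cert_path, keyfile=key_path)
    return context


def execute_federated_mtls_request(payload: bytes) -> None:
    x509_source = DefaultX509Source()
    try:
        context = build_mtls_context(x509_source)
    finally:
        x509_source.close()

    url = urlsplit(AZURE_TARGET_ENDPOINT)
    conn = http.client.HTTPSConnection(url.hostname, url.port or 443, context=context, timeout=10)
    try:
        # Complete the handshake first, then validate the peer's SPIFFE ID from
        # the SAN before any payload leaves the process
        conn.connect()
        san_entries = conn.sock.getpeercert().get("subjectAltName", ())
        peer_spiffe_ids = [val for key, val in san_entries if key == "URI"]
        if str(EXPECTED_AZURE_SPIFFE_ID) not in peer_spiffe_ids:
            raise PermissionError(f"Unauthorized peer identity: {peer_spiffe_ids}")

        conn.request("POST", url.path, body=payload, headers={"Content-Type": "application/json"})
        response = conn.getresponse()
        print(f"Federated request successful. Status: {response.status}")
    except (OSError, http.client.HTTPException) as e:
        # ssl.SSLError and PermissionError are both subclasses of OSError
        print(f"mTLS connection failed or identity rejected: {e}")
    finally:
        conn.close()


if __name__ == "__main__":
    execute_federated_mtls_request(b'{"transaction_value": 5000}')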
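
The access control check described above for the receiving side can be sketched with the standard library alone. This is a minimal illustration, not the pyspiffe API: the helper names parse_spiffe_id and is_authorized, and the allowlisted audit identity, are hypothetical. It takes the URI SAN values extracted from the peer certificate and accepts the connection only when exactly one SPIFFE ID is present and it appears in the allowlist.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: (trust domain, path) pairs permitted to connect.
ALLOWED_IDENTITIES = {
    ("aws.enterprise.internal", "/ns/audit/sa/ledger-reader"),
}


def parse_spiffe_id(raw: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust domain, path), rejecting malformed input."""
    parts = urlsplit(raw)
    if parts.scheme != "spiffe" or not parts.netloc or parts.query or parts.fragment:
        raise ValueError(f"not a valid SPIFFE ID: {raw!r}")
    return parts.netloc, parts.path


def is_authorized(uri_sans: list[str]) -> bool:
    """Return True only if the peer presents exactly one allowlisted SPIFFE ID."""
    # An X.509-SVID carries exactly one URI SAN; anything else is rejected.
    if len(uri_sans) != 1:
        return False
    try:
        identity = parse_spiffe_id(uri_sans[0])
    except ValueError:
        return False
    return identity in ALLOWED_IDENTITIES


if __name__ == "__main__":
    print(is_authorized(["spiffe://aws.enterprise.internal/ns/finance/sa/payment-processor"]))
    print(is_authorized(["spiffe://aws.enterprise.internal/ns/audit/sa/ledger-reader"]))
```

Matching on the full trust domain and path, rather than on a prefix, keeps a workload in one namespace from inheriting the permissions of another.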

What occurs operationally in the production environment when the federation keys fall out of synchronization due to severe cross-cloud network routing instability?

Common Troubleshooting

The desynchronization of trust bundles between federated domains is the primary cause of inter-cloud mTLS connection failures, typically manifesting as TLS handshake errors that report an unknown CA (for example, OpenSSL's "tlsv1 alert unknown ca"). If this degradation occurs, inspect the SPIRE Server logs for resolution or fetch failures on the external federation endpoint. If the public network route or the VPN experiences packet loss, the SPIRE Server will stop updating the partner cloud's keys. To mitigate downtime caused by short timeout windows, adjust the bundle_endpoint_refresh_interval parameter in the federation manifest to a minimum of 15 minutes, and ensure that both clusters can reliably resolve the DNS names of the SPIRE servers' public endpoints.
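
One way to notice a stalled federation before handshakes start failing is to alert when the last successful bundle refresh is overdue. The following sketch assumes the monitoring system already records that timestamp and knows the expected refresh interval; the function name bundle_is_stale and the tolerance of three missed intervals are illustrative choices, not SPIRE settings.

```python
from datetime import datetime, timedelta, timezone


def bundle_is_stale(last_refresh: datetime, refresh_interval: timedelta,
                    now: datetime, missed_intervals: int = 3) -> bool:
    """Flag a federated bundle whose refresh is overdue by several intervals."""
    return now - last_refresh > refresh_interval * missed_intervals


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    interval = timedelta(minutes=5)
    # 10 minutes elapsed: within tolerance. 20 minutes elapsed: overdue.
    print(bundle_is_stale(last, interval, last + timedelta(minutes=10)))
    print(bundle_is_stale(last, interval, last + timedelta(minutes=20)))
```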

Another frequent operational error is the rejection, during node attestation, of newly provisioned compute nodes within Amazon EKS or Azure AKS autoscaling groups. When a new node scales up, the SPIRE Agent installed as a DaemonSet presents the node's identity evidence, and the SPIRE Server calls the cloud provider's API to validate the properties of the machine. If the IAM policy or the Managed Identity associated with the SPIRE Server lacks explicit permission for the instance description actions (for example, ec2:DescribeInstances), attestation fails and the workloads on that node never receive their SPIFFE identities. Systematically review the IAM roles attached to the SPIRE Server infrastructure, confirming that the list and describe actions are allowed and are not blocked by excessively restrictive resource scopes.
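
A quick offline check of the rendered policy JSON can catch a missing or malformed statement before a node ever fails attestation. The sketch below is an assumption-laden helper rather than an AWS tool: the name missing_actions is hypothetical, and it deliberately ignores wildcards, Deny statements, and Resource or Condition scoping, so an empty result is necessary but not sufficient.

```python
import json

REQUIRED_ACTIONS = {"ec2:DescribeInstances", "autoscaling:DescribeAutoScalingGroups"}


def missing_actions(policy_json: str, required: set[str]) -> set[str]:
    """Return the required actions that no Allow statement lists explicitly."""
    doc = json.loads(policy_json)
    statements = doc.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    granted: set[str] = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue  # also skips statements whose Effect is malformed
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        granted.update(actions)
    return required - granted


if __name__ == "__main__":
    policy = json.dumps({
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": ["ec2:DescribeInstances"], "Resource": "*"}],
    })
    print(missing_actions(policy, REQUIRED_ACTIONS))
```

Running the same check against a policy whose Effect field does not contain the string "Allow" reports every required action as missing, which makes that class of mistake visible in continuous integration.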

Conclusion

Federating cryptographic identities via SPIFFE/SPIRE resolves the latent fragility of managing multicloud security policies based exclusively on traditional network perimeters. By decoupling identity from the underlying AWS and Azure infrastructure, the architecture guarantees end-to-end cryptographic isolation and enforces an authentic Zero Trust posture. As the operational mesh expands, organizations should integrate Open Policy Agent (OPA) tools directly coupled with the workload execution path. This evolution enables detailed context evaluation of requests in real time, allowing the validation of regulatory payload parameters at the exact moment the cryptographic identity is inspected at the transport layer.
