DEV Community

Multi-Cloud Resilience: The Event-Driven Cellular Architecture Blueprint across AWS and Azure

1. Introduction

As enterprise systems reach massive scale, relying on a single cloud provider introduces systemic risk. Regional outages, vendor-specific API degradations, or severe misconfigurations can result in global downtime. Multi-cloud cellular architecture solves this by deploying isolated, self-contained failure domains (cells) across different cloud providers. In this tutorial, you will architect a unified, event-driven system where AWS and Azure serve as the physical hosts for identical logical cells.

Instead of building a "stretched" application that queries data across clouds—an anti-pattern that guarantees high latency and fragile dependencies—you will deploy complete, asynchronous processing pipelines on both AWS and Azure. A global, cloud-agnostic edge routing layer will inspect incoming traffic and forward it to either an AWS cell or an Azure cell based on a partition key (such as TenantId). By the end of this guide, you will understand how to orchestrate this ultimate fault-isolation boundary using Terraform, ensuring your application survives even a total cloud provider outage.

2. Prerequisites

To build a multi-cloud environment, your operational tooling must be strictly cloud-agnostic. You will need:

  • Active accounts on both Amazon Web Services (AWS) and Microsoft Azure with administrative privileges.
  • The AWS CLI and Azure CLI installed and authenticated on your local machine.
  • Terraform (version 1.3.0 or higher) installed. Terraform is crucial here because a single configuration can act as one control plane deploying resources to both clouds at once.
  • Understanding of Global Server Load Balancing (GSLB) or Edge Compute platforms (like Cloudflare Workers or Fastly) to act as the multi-cloud router.
  • A Domain-Driven Design (DDD) approach to your data, ensuring that a specific tenant's data lives entirely within one cell and never needs to cross the multi-cloud boundary.

3. Step-by-Step

Step 1: Configuring Multi-Cloud Providers in Terraform

What to do: Create a root Terraform configuration that initializes providers for both AWS and Azure simultaneously.

Why do it: Infrastructure as Code (IaC) is the only sustainable way to manage multi-cloud environments. By declaring both providers in the same root module, you can orchestrate the deployment of AWS cells, Azure cells, and the DNS records that tie them together from a single pipeline, preventing configuration drift between your cloud environments.

Code Example (main.tf):

terraform {
  required_version = ">= 1.3.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "azurerm" {
  features {}
}
Step 2: Deploying the AWS and Azure Data Plane Cells

What to do: Use Terraform modules to instantiate independent cells on both clouds. The AWS cell is built on DynamoDB, SNS, and SQS; the Azure cell on Cosmos DB with Service Bus topics and subscriptions. Both rely on serverless compute (Lambda and Azure Functions, respectively).

Why do it: While the underlying managed services differ, the architectural pattern is identical: an HTTP entry point writes to a NoSQL database, which triggers a change stream, which is fanned out via a message bus to consumer workers. Encapsulating these into Terraform modules abstracts the cloud-specific complexities, allowing you to treat "AWS" and "Azure" simply as deployment targets for your business logic.

Code Example (data_plane.tf):

# AWS Cell (East Coast)
module "cell_aws_alpha" {
  source  = "./modules/aws_event_cell"
  cell_id = "aws-alpha"
  region  = "us-east-1"
}

# Azure Cell (East Coast)
module "cell_azure_beta" {
  source   = "./modules/azure_event_cell"
  cell_id  = "azure-beta"
  location = "East US"
}

# Azure Cell (Europe)
module "cell_azure_gamma" {
  source   = "./modules/azure_event_cell"
  cell_id  = "azure-gamma"
  location = "North Europe"
}
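
For this pattern to work, each cell module should expose a uniform interface regardless of cloud. A minimal sketch of what the AWS module's contract might look like — the variable names and the API Gateway resource are illustrative assumptions, not the article's actual module code:

```hcl
# modules/aws_event_cell/variables.tf (illustrative)
variable "cell_id" {
  description = "Unique identifier for this cell, used to name its resources"
  type        = string
}

variable "region" {
  description = "AWS region hosting the cell"
  type        = string
}

# modules/aws_event_cell/outputs.tf (illustrative)
# Every cell module, AWS or Azure, should export its HTTP entry point
# so the edge routing map can be populated from Terraform outputs.
output "api_endpoint" {
  description = "Public HTTPS endpoint for the cell's ingress"
  value       = aws_apigatewayv2_api.ingress.api_endpoint
}
```

The matching azure_event_cell module would expose the same `cell_id` variable and `api_endpoint` output, backed by an Azure Function HTTP trigger instead.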
Step 3: Implementing the Cloud-Agnostic Edge Router

What to do: Deploy a global routing layer that sits entirely outside of AWS and Azure. This is typically achieved using an Edge Compute platform like Cloudflare Workers. The Worker intercepts the HTTP request, reads the TenantId, checks a globally distributed Key-Value store to find the assigned cell's API endpoint, and forwards the payload.

Why do it: If your global router is hosted on AWS (e.g., using API Gateway), and AWS experiences an outage, your Azure cells become unreachable, defeating the purpose of multi-cloud. The routing layer must be cloud-agnostic. Cloudflare Workers execute at the edge, providing sub-millisecond lookups for the routing map and directing traffic to the respective AWS API Gateway or Azure Function HTTP trigger.

Conceptual Edge Router Logic (JavaScript for Edge Worker):

export default {
  async fetch(request, env) {
    const url = new URL(request.url);
    const tenantId = request.headers.get("X-Tenant-ID");

    if (!tenantId) {
      return new Response("Missing Tenant ID", { status: 400 });
    }

    // Lookup cell assignment from global edge KV store
    const cellEndpoint = await env.ROUTING_MAP.get(tenantId);

    if (!cellEndpoint) {
      return new Response("Tenant mapping not found", { status: 404 });
    }

    // Proxy the request to the target cloud (AWS or Azure),
    // preserving the original path and query string
    const targetUrl = new URL(url.pathname + url.search, cellEndpoint);
    const modifiedRequest = new Request(targetUrl, request);

    return fetch(modifiedRequest);
  }
}
Step 4: Automating Disaster Recovery and Traffic Shifting

What to do: Establish a process to update the global routing map. If the AWS region hosting cell_aws_alpha goes offline, you update the Edge KV store to point those affected tenants to a standby cell on Azure.

Why do it: Cellular architecture drastically reduces the blast radius of an outage, but you still need a mechanism to restore service for the degraded cell. By decoupling the routing logic from the compute infrastructure, shifting traffic between clouds becomes a simple key-value update at the DNS/edge layer, giving you failover in seconds (bounded only by the edge store's propagation delay) rather than a redeployment.
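Conceptually, the failover itself is just a bulk rewrite of the routing map: every tenant assigned to the failed cell's endpoint is repointed at the standby cell. A minimal sketch in JavaScript — function and endpoint names are illustrative; in a Worker-based setup you would write each updated entry back with env.ROUTING_MAP.put:

```javascript
// Reassign every tenant currently mapped to a failed cell's endpoint
// to a designated standby endpoint. Pure function over a plain
// { tenantId: endpoint } object, so the logic is easy to test.
function shiftTraffic(routingMap, failedEndpoint, standbyEndpoint) {
  const updated = {};
  for (const [tenantId, endpoint] of Object.entries(routingMap)) {
    updated[tenantId] = endpoint === failedEndpoint ? standbyEndpoint : endpoint;
  }
  return updated;
}

// Example: aws-alpha is down; its tenants fail over to azure-beta,
// while tenants already on azure-beta are left untouched.
const before = {
  "tenant-1": "https://aws-alpha.example.com",
  "tenant-2": "https://azure-beta.example.com",
};
const after = shiftTraffic(
  before,
  "https://aws-alpha.example.com",
  "https://azure-beta.example.com"
);
// after["tenant-1"] is now "https://azure-beta.example.com"
```

Keeping this as a pure transformation over the routing map makes the failover runbook auditable: you can diff the proposed map against the current one before pushing it to the edge KV store.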

4. Common Troubleshooting

  1. The Multi-Cloud Data Split Brain: The biggest mistake in multi-cloud architecture is attempting synchronous database replication between AWS and Azure. This introduces severe latency and massive egress costs. Ensure strict cellular isolation: a tenant's data must live and be processed entirely within their assigned cell. Data should only cross clouds during a deliberate, asynchronous disaster recovery migration.
  2. CI/CD Pipeline Complexity: Deploying to multiple clouds means your deployment pipelines must authenticate with two different IAM systems. Use OpenID Connect (OIDC) in platforms like GitHub Actions or GitLab CI. OIDC allows your pipelines to assume temporary, scoped roles in both AWS (via an IAM OIDC identity provider and AssumeRoleWithWebIdentity) and Azure (via Microsoft Entra workload identity federation) without storing long-lived, vulnerable access keys.
  3. Observability Fragmentation: AWS logs live in CloudWatch; Azure logs live in Log Analytics. Troubleshooting a multi-cloud environment requires a unified view. You must export logs and metrics from both cloud providers into a centralized, agnostic observability platform (like Datadog, New Relic, or a self-hosted ELK stack) to effectively monitor the health of your global cells.
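
On the AWS side, the OIDC trust described in item 2 can itself live in the same Terraform root module as the cells. A minimal sketch assuming GitHub Actions as the CI platform — the role name, repository filter, and thumbprint placeholder are illustrative:

```hcl
# Illustrative: register GitHub's OIDC issuer with AWS IAM so CI jobs
# can assume a deployment role without long-lived access keys.
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["<github-oidc-ca-thumbprint>"]
}

resource "aws_iam_role" "ci_deploy" {
  name = "multi-cloud-ci-deploy"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        # Restrict the trust to workflows from a specific repository
        StringLike = {
          "token.actions.githubusercontent.com:sub" = "repo:my-org/my-repo:*"
        }
      }
    }]
  })
}
```

The Azure side is analogous: a federated identity credential on a Microsoft Entra application, scoped to the same repository, grants the pipeline a token for azurerm deployments.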

5. Conclusion

Building an event-driven cellular architecture across multiple clouds is the pinnacle of infrastructure resilience. By treating AWS and Azure as interchangeable hosts for isolated data planes, and governing them with a cloud-agnostic edge router and Terraform, you eliminate single points of failure at the vendor level. While this approach introduces operational complexity, it guarantees that a regional outage or provider degradation remains a contained incident rather than a business-ending catastrophe. As a next step, focus on standardizing your application packaging by utilizing containers (ECS on AWS, Container Apps on Azure) to ensure your worker code executes identically regardless of the host cloud.
