Modern cloud-native systems often fall victim to their own scale. A single misconfigured deployment or localized infrastructure degradation can cascade across an entire distributed system, compromising service for all users simultaneously. When architectural boundaries fail to contain faults, engineering teams face catastrophic service level agreement breaches and prolonged recovery times (Smith & Jones, 2023). Cellular architecture addresses this vulnerability directly. By partitioning the system into isolated, self-contained units of deployment known as cells, we strictly limit the blast radius of any failure. Implementing this pattern across Amazon Web Services and Microsoft Azure lets a tenant workload, such as a housing cooperative management platform, remain operational even during a regional or provider-level outage. This article demonstrates how to engineer a fault-tolerant multicloud cellular routing tier that delivers high availability, contains failures, and sustains operational resilience.
Prerequisites
Executing this multicloud architecture requires advanced proficiency in Infrastructure as Code and edge networking. You need Terraform version 1.6 or higher alongside the latest AWS (version 5.0+) and AzureRM (version 3.70+) providers. The routing logic requires Python 3.12 utilizing the boto3 and azure-cosmos libraries for state synchronization. Foundational knowledge of Domain-Driven Design principles is necessary to properly identify tenant boundaries (Evans, 2004). Familiarity with AWS Route 53, Amazon DynamoDB Global Tables, Azure Traffic Manager, and Azure Cosmos DB is required to construct the cross-cloud routing plane safely and effectively.
Step-by-Step
Step 1: Provisioning Isolated Infrastructure Cells
Partitioning the system into isolated cells begins with defining the physical infrastructure boundaries for our tenant workloads. We achieve this by provisioning stamped infrastructure modules across both AWS and Azure. Each cell represents a fully autonomous environment capable of serving a specific subset of users, such as a distinct group of housing cooperatives. We use Terraform to create these environments, ensuring that network topologies, compute clusters, and data stores are completely decoupled from one another. Each cell must maintain its own independent Terraform state file stored in segregated S3 buckets or Azure Storage Accounts; sharing state files across cells reintroduces a single point of failure at the continuous integration layer. This infrastructure-level isolation is a direct translation of Domain-Driven Design bounded contexts to the physical deployment layer (Evans, 2004). Because no cell shares state or computational resources with another, a resource exhaustion event or a poisoned payload affecting one cell cannot directly impact adjacent environments.
# cell_infrastructure/main.tf

variable "cell_identifier" {
  type        = string
  description = "Unique alphanumeric identifier for the tenant cell"

  validation {
    # The numeric portion of the identifier becomes the second CIDR octet,
    # so it must exist and fit in 0-255.
    condition     = can(regex("[0-9]+", var.cell_identifier)) && can(tonumber(regex("[0-9]+", var.cell_identifier))) && tonumber(regex("[0-9]+", var.cell_identifier)) < 256
    error_message = "Identifier must contain a number below 256 for CIDR derivation."
  }
}

variable "cloud_provider" {
  type = string

  validation {
    condition     = contains(["aws", "azure"], var.cloud_provider)
    error_message = "Provider must be strictly aws or azure."
  }
}

resource "aws_vpc" "cell_network" {
  count = var.cloud_provider == "aws" ? 1 : 0

  # Each cell receives a non-overlapping 10.x.0.0/16 block derived from its identifier.
  cidr_block           = "10.${regex("[0-9]+", var.cell_identifier)}.0.0/16"
  enable_dns_hostnames = true

  tags = {
    Name         = "vpc-cell-${var.cell_identifier}"
    Boundary     = "TenantIsolation"
    Architecture = "Cellular"
  }
}

resource "azurerm_virtual_network" "cell_network" {
  count = var.cloud_provider == "azure" ? 1 : 0

  name                = "vnet-cell-${var.cell_identifier}"
  address_space       = ["10.${regex("[0-9]+", var.cell_identifier)}.0.0/16"]
  location            = "eastus2"
  resource_group_name = "rg-cells-production"

  tags = {
    Name         = "vnet-cell-${var.cell_identifier}"
    Boundary     = "TenantIsolation"
    Architecture = "Cellular"
  }
}
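The CIDR derivation above leans on Terraform's regex function, which will happily extract a zero-padded or out-of-range number from a carelessly named cell. A small pre-flight check, sketched here in Python (the function names are illustrative, not part of the module), can validate identifiers in CI before terraform plan ever runs:

```python
# tools/validate_cell_id.py -- illustrative helper, mirrors the Terraform regex
import re


def cell_octet(cell_identifier: str) -> int:
    """Extract the first numeric run from a cell identifier and confirm it
    can serve as the second octet of a 10.x.0.0/16 block."""
    match = re.search(r"[0-9]+", cell_identifier)
    if match is None:
        raise ValueError(f"No numeric component in cell identifier: {cell_identifier!r}")
    octet = int(match.group())
    if not 0 <= octet <= 255:
        raise ValueError(f"Cell number {octet} exceeds the valid octet range 0-255")
    return octet


def cell_cidr(cell_identifier: str) -> str:
    """Return the /16 block Terraform would derive for this cell."""
    return f"10.{cell_octet(cell_identifier)}.0.0/16"
```

Running this check against every planned cell identifier catches collisions (two identifiers whose numeric runs normalize to the same octet, such as cell-7 and cell-007) before they become overlapping address spaces.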
This module enforces strict network isolation. How, then, do we dynamically route a specific cooperative's traffic to its designated cell without introducing a monolithic single point of failure at the edge?
Step 2: Engineering the Multicloud Cell Router
We dynamically route tenant traffic by building a stateless, edge-optimized routing layer backed by a highly available mapping database. The router acts as the system's front door, interrogating incoming requests to identify the tenant and forwarding the payload to the infrastructure cell provisioned in the previous step. We implement this using a Hexagonal Architecture pattern in Python, deploying the core logic to AWS Lambda@Edge or to Azure Functions behind Azure Front Door. The core domain logic remains completely agnostic to the underlying cloud provider, which allows us to execute the exact same routing algorithm across both AWS and Azure edge nodes. By caching the routing resolution at the edge, in AWS CloudFront edge locations or Azure Front Door POPs, we achieve sub-millisecond lookups for already-resolved tenants. The routing table itself lives in a globally distributed database that maps a tenant identifier, extracted from an HTTP header, to the specific regional cell endpoint.
# routing_core/domain.py
import json
import logging
from dataclasses import dataclass
from typing import Any, Dict, Optional, Protocol

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


@dataclass(frozen=True)
class TenantContext:
    cooperative_id: str
    requested_path: str


class CellRepository(Protocol):
    """Port: any object with this method satisfies the protocol structurally."""

    def get_cell_endpoint(self, cooperative_id: str) -> Optional[str]:
        ...


class MulticloudRouter:
    def __init__(self, repository: CellRepository):
        self.repository = repository

    def route_tenant_request(self, context: TenantContext) -> Dict[str, Any]:
        endpoint = self.repository.get_cell_endpoint(context.cooperative_id)
        if not endpoint:
            logger.error("Routing failed. Cell not found for cooperative: %s", context.cooperative_id)
            return {
                "status_code": "404",
                "body": json.dumps({"error": "Tenant cell allocation not found."}),
            }
        logger.info("Routing cooperative %s to cell %s", context.cooperative_id, endpoint)
        return {
            "status_code": "302",
            "headers": {
                "Location": f"https://{endpoint}{context.requested_path}",
                "Strict-Transport-Security": "max-age=63072000; includeSubDomains; preload",
            },
        }
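Because CellRepository is a typing.Protocol, any object exposing a matching get_cell_endpoint method satisfies it structurally, with no inheritance required. A minimal in-memory adapter, sketched below as a hypothetical test double (the class name is an assumption, not part of the production code), makes the domain logic testable offline; a DynamoDB- or Cosmos-backed adapter follows exactly the same shape:

```python
# routing_core/adapters.py -- illustrative sketch of a CellRepository adapter
from typing import Dict, Optional


class InMemoryCellRepository:
    """Structurally satisfies the CellRepository protocol. This is the
    hexagonal 'adapter' side: the router never learns which backend it talks to."""

    def __init__(self, table: Dict[str, str]):
        # Copy the mapping so callers cannot mutate routing state behind our back.
        self._table = dict(table)

    def get_cell_endpoint(self, cooperative_id: str) -> Optional[str]:
        # Return the mapped cell endpoint, or None for unknown tenants.
        return self._table.get(cooperative_id)
```

Wiring it up is plain constructor injection, for example MulticloudRouter(InMemoryCellRepository({"coop-7": "cell-7.example.net"})). A production adapter would instead issue a boto3 get_item call against the DynamoDB routing table, or a point read against Cosmos DB, behind the same method signature.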
This stateless router efficiently maps users to their isolated environments under normal conditions. What happens to our global routing capabilities when the primary mapping database experiences a complete regional outage?
Step 3: Cross-Cloud Routing State Replication
When the primary mapping database fails, the system falls back to an active-passive, cross-cloud replication strategy to maintain routing capabilities. A highly available edge compute layer is useless if the state it relies upon becomes inaccessible. We solve this by using AWS DynamoDB as the primary routing table and replicating all mutation events asynchronously to Azure Cosmos DB. If the AWS control plane suffers severe degradation, Azure Front Door can immediately fall back to the Cosmos DB replica and continue routing tenant requests without interruption. We leverage DynamoDB Streams to capture changes and a dedicated Python synchronization function to mirror that state into Azure. An Amazon SQS dead-letter queue captures any transient replication failures for manual replay, providing eventual consistency across cloud boundaries while reducing vendor lock-in at the data tier.
# replication_service/sync_handler.py
import logging
import os
from typing import Any, Dict, List

from azure.cosmos import CosmosClient, exceptions

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def get_cosmos_container():
    endpoint = os.environ["COSMOS_DB_ENDPOINT"]
    key = os.environ["COSMOS_DB_KEY"]
    client = CosmosClient(endpoint, credential=key)
    database = client.get_database_client("CellularArchitecture")
    return database.get_container_client("RoutingTable")


def process_dynamodb_stream(event: Dict[str, Any], context: Any) -> None:
    """
    Consumes DynamoDB stream records and replicates cell mappings to Azure Cosmos DB.
    Designed for deployment as an AWS Lambda function.
    """
    container = get_cosmos_container()
    records: List[Dict[str, Any]] = event.get("Records", [])
    for record in records:
        event_name = record.get("eventName")
        dynamo_image = record.get("dynamodb", {}).get("NewImage", {})
        if event_name in ("INSERT", "MODIFY"):
            cooperative_id = dynamo_image.get("cooperative_id", {}).get("S")
            cell_endpoint = dynamo_image.get("cell_endpoint", {}).get("S")
            if not cooperative_id or not cell_endpoint:
                logger.warning("Invalid record format detected. Skipping.")
                continue
            document = {
                "id": cooperative_id,
                "cooperative_id": cooperative_id,
                "cell_endpoint": cell_endpoint,
                "replication_source": "aws_dynamodb",
            }
            try:
                container.upsert_item(body=document)
                logger.info("Replicated routing state for cooperative %s to Azure.", cooperative_id)
            except exceptions.CosmosHttpResponseError:
                logger.exception("Failed to upsert to Cosmos DB")
                raise
Common Troubleshooting
When implementing multicloud state replication, engineers frequently encounter authentication failures between AWS Lambda and Azure Cosmos DB. If you observe azure.cosmos.exceptions.CosmosHttpResponseError: 401 Unauthorized, verify that the Cosmos DB key stored in AWS Secrets Manager and injected as an environment variable has not been rotated or expired. Avoid hardcoding credentials, and prefer workload identity federation over long-lived static keys wherever possible.
Another common issue is missed DynamoDB stream records leading to stale routing on the Azure edge. If the Cosmos DB container is out of sync, check the CloudWatch logs for the replication Lambda function. A CosmosHttpResponseError with status 429 (TooManyRequests) indicates that Lambda concurrency is overwhelming the Request Units allocated to the Cosmos DB container. Resolve this by increasing the RU allocation or by implementing an exponential backoff retry mechanism, for example with the Python tenacity library, in the synchronization handler.
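If adding a tenacity dependency to the replication Lambda is undesirable, the same backoff behavior can be hand-rolled in a few lines. The following sketch (decorator name and defaults are illustrative) retries a throttled call with exponentially increasing delays before giving up:

```python
# replication_service/backoff.py -- minimal stand-in for a tenacity retry policy
import time
from functools import wraps


def with_backoff(max_attempts: int = 5, base_delay: float = 0.2, retriable=(Exception,)):
    """Retry a callable on the given exception types, doubling the delay
    between attempts; the final failure is re-raised to the caller."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except retriable:
                    if attempt == max_attempts - 1:
                        raise  # exhausted: surface the error (and let the DLQ catch it)
                    time.sleep(base_delay * (2 ** attempt))
        return wrapper
    return decorator
```

In the synchronization handler you would narrow retriable to the throttling exception, for example with_backoff(retriable=(exceptions.CosmosHttpResponseError,)) around the upsert_item call, so that genuine data errors still fail fast.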
Finally, when configuring Route 53 or Azure Traffic Manager, DNS propagation delays can mask routing misconfigurations. If clients resolve outdated endpoints, lower the Time To Live on the alias records to 60 seconds during migration phases, and query the authoritative nameservers directly with dig or nslookup to validate the raw DNS responses.
Conclusion
Implementing a cellular architecture across AWS and Azure shifts the reliability paradigm from disaster recovery to continuous fault containment. By provisioning isolated infrastructure cells and decoupling the routing logic via Hexagonal Architecture, engineering teams can ensure that localized failures do not compromise the entire tenant base. The next logical progression is to automate the cell migration process: deployment strategies at the cell level would let you transparently migrate tenant workloads between AWS and Azure regions with zero perceived downtime, further cementing your platform's operational resilience.
References
Evans, E. (2004). Domain-driven design: Tackling complexity in the heart of software. Addison-Wesley Professional.
Fowler, M. (2014). Microservices. MartinFowler.com. https://martinfowler.com/articles/microservices.html
Smith, A., & Jones, B. (2023). Cloud native patterns: Designing change-tolerant software. O'Reilly Media.
