DEV Community

Cláudio Filipe Lima Rapôso

Azure Cellular Architecture: Scaling with Cosmos DB Change Feed and Service Bus

1. Introduction

Scaling cloud-native systems often results in a massive, highly interconnected web of services where a single corrupted payload or regional brownout can trigger a global outage. Cellular Architecture mitigates this by decomposing the system into completely independent, isolated failure domains known as "cells." In this tutorial, you will build a highly resilient event-driven architecture using Microsoft Azure, combining the cellular pattern with an asynchronous workflow (Azure Cosmos DB, Azure Functions, and Azure Service Bus).

Instead of relying on a single monolithic data and compute plane, you will deploy multiple identical infrastructure stamps. A thin edge routing layer will inspect incoming requests and route them to the appropriate cell based on a partition key (such as a TenantId). If a localized infrastructure degradation or a "poison pill" event impacts one cell, the blast radius is strictly contained within that specific boundary, ensuring maximum availability for tenants residing in other cells. This pattern represents the ultimate fault isolation strategy for mission-critical enterprise systems on Azure.

2. Prerequisites

To successfully deploy this architecture, your environment must be equipped with the following:

  • An active Azure Subscription with Owner or User Access Administrator privileges to provision compute, database, messaging, and Role-Based Access Control (RBAC) resources.
  • The Azure CLI installed and authenticated locally (az login).
  • Terraform (version 1.3.0 or higher) installed and accessible from your terminal.
  • A solid understanding of the separation between a Control Plane (global routing state) and a Data Plane (the actual processing cells).
  • Familiarity with Domain-Driven Design (DDD) to accurately define the partition keys that will dictate cell placement without creating cross-boundary data dependencies.

3. Step-by-Step

Step 1: Defining the Data Plane Cell Module

What to do: Create a reusable Terraform module that encapsulates the entire event-driven pipeline (Resource Group, Cosmos DB, Service Bus, Storage, and Azure Functions). This module represents a single, fully isolated cell.

Why do it: The core tenet of cellular architecture is absolute repeatability and isolation. By packaging the data plane into a module, every cell becomes an identical, deterministic stamp of infrastructure. Provisioning a dedicated Cosmos DB account and Service Bus namespace per cell guarantees that a noisy neighbor or an overloaded queue in "Cell Alpha" consumes zero compute or IOPS from "Cell Beta."

Screenshot/Example: Save this configuration as modules/azure_cell/main.tf.

variable "cell_id" {
  description = "Unique identifier for the cell (e.g., alpha, beta)"
  type        = string
}

variable "location" {
  description = "Azure region for the cell"
  type        = string
  default     = "East US"
}

# Dedicated Resource Group for the Cell
resource "azurerm_resource_group" "cell_rg" {
  name     = "rg-cell-${var.cell_id}"
  location = var.location
}

# Isolated Cosmos DB Account for the Cell
resource "azurerm_cosmosdb_account" "cell_db" {
  name                = "cosmos-data-cell-${var.cell_id}"
  location            = azurerm_resource_group.cell_rg.location
  resource_group_name = azurerm_resource_group.cell_rg.name
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = azurerm_resource_group.cell_rg.location
    failover_priority = 0
  }
}

# Isolated Service Bus Namespace for the Cell
resource "azurerm_servicebus_namespace" "cell_sb" {
  name                = "sb-namespace-cell-${var.cell_id}"
  location            = azurerm_resource_group.cell_rg.location
  resource_group_name = azurerm_resource_group.cell_rg.name
  sku                 = "Standard"
}

resource "azurerm_servicebus_topic" "cell_topic" {
  name         = "processing-topic"
  namespace_id = azurerm_servicebus_namespace.cell_sb.id
}

resource "azurerm_servicebus_subscription" "cell_sub" {
  name               = "processing-subscription"
  topic_id           = azurerm_servicebus_topic.cell_topic.id
  max_delivery_count = 5 # Built-in poison pill protection
}

# Note: In a complete module, you would also define the azurerm_linux_function_app 
# resources here for the Producer (Change Feed Trigger) and Consumer (Service Bus Trigger),
# utilizing SystemAssigned managed identities for secure communication within the cell.

Step 2: Instantiating the Independent Cells

What to do: In your root Terraform configuration, instantiate the cell module multiple times to create your physical failure domains.

Why do it: Deploying multiple instances generates the actual physical boundaries. You can assign high-tier enterprise tenants to their own dedicated cells or group smaller tenants into shared cells. Because the modules are parameterized, you can easily deploy cells across different Azure regions to combine cellular isolation with geographic high availability.

Screenshot/Example: Add this to your root main.tf.

module "cell_alpha" {
  source   = "./modules/azure_cell"
  cell_id  = "alpha"
  location = "East US 2"
}

module "cell_beta" {
  source   = "./modules/azure_cell"
  cell_id  = "beta"
  location = "West US 2"
}

module "cell_gamma" {
  source   = "./modules/azure_cell"
  cell_id  = "gamma"
  location = "North Europe"
}
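For the control plane and router to discover each cell, the module also needs to expose its key identifiers. A minimal sketch of a `modules/azure_cell/outputs.tf` (the output names here are illustrative, not part of the module as shown above) might be:

```hcl
# modules/azure_cell/outputs.tf (sketch): expose what the control plane
# needs in order to register this cell in the tenant-to-cell mapping.
output "cell_id" {
  value = var.cell_id
}

output "cosmosdb_endpoint" {
  value = azurerm_cosmosdb_account.cell_db.endpoint
}

output "servicebus_namespace_id" {
  value = azurerm_servicebus_namespace.cell_sb.id
}
```

Root-level code can then reference, for example, `module.cell_alpha.cosmosdb_endpoint` when seeding the routing table.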

Step 3: Building the Control Plane Mapping

What to do: Provision a highly available, globally distributed Cosmos DB account that acts as the Control Plane data store. This database maps your tenant identifiers to their assigned Cell ID.

Why do it: The routing layer must dynamically know where to forward incoming requests. This global mapping table is queried by the edge router. It must be decoupled from the data plane and highly available; if the routing map goes down, traffic cannot reach the perfectly healthy data plane cells behind it.

Screenshot/Example: Create a control_plane.tf file.

resource "azurerm_resource_group" "routing_rg" {
  name     = "rg-global-routing"
  location = "East US"
}

resource "azurerm_cosmosdb_account" "routing_db" {
  name                = "cosmos-global-tenant-map"
  location            = azurerm_resource_group.routing_rg.location
  resource_group_name = azurerm_resource_group.routing_rg.name
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  consistency_policy { consistency_level = "Eventual" } # Optimize for fast reads

  geo_location {
    location          = "East US"
    failover_priority = 0
  }

  geo_location {
    location          = "West Europe"
    failover_priority = 1
  }
}

resource "azurerm_cosmosdb_sql_database" "map_db" {
  name                = "RoutingDB"
  resource_group_name = azurerm_resource_group.routing_rg.name
  account_name        = azurerm_cosmosdb_account.routing_db.name
}

resource "azurerm_cosmosdb_sql_container" "tenant_map" {
  name                  = "TenantCellMap"
  resource_group_name   = azurerm_resource_group.routing_rg.name
  account_name          = azurerm_cosmosdb_account.routing_db.name
  database_name         = azurerm_cosmosdb_sql_database.map_db.name
  partition_key_path    = "/tenantId"
}
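An item in the TenantCellMap container could then take the following shape. Only `/tenantId` is dictated by the container's partition key; the other fields are an illustrative assumption of what the router would need:

```json
{
  "id": "tenant-contoso",
  "tenantId": "tenant-contoso",
  "cellId": "alpha",
  "cosmosEndpoint": "https://cosmos-data-cell-alpha.documents.azure.com:443/",
  "tier": "enterprise"
}
```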

Step 4: Implementing the Thin Cell Router

What to do: Deploy an Azure Function App with an HTTP trigger acting as the "Cell Router". This function extracts the TenantId from the incoming HTTP payload, queries the global Cosmos DB mapping table to resolve the destination cell, and then forwards the payload to that specific cell's Cosmos DB.

Why do it: The cell router is the critical entry point and must remain computationally "thin." Complex business logic belongs strictly inside the data plane cells, not the router. By keeping the router simple, you minimize its execution time and failure modes. Once the router successfully writes the data to the target cell's Cosmos DB, the localized event-driven pipeline (Change Feed -> Producer Function -> Service Bus -> Consumer Function) takes over autonomously.

Screenshot/Example: Create a router.tf file.

# Storage Account and App Service Plan for the Router Function
resource "azurerm_storage_account" "router_storage" {
  name                     = "stglobalrouterfunc"
  resource_group_name      = azurerm_resource_group.routing_rg.name
  location                 = azurerm_resource_group.routing_rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_service_plan" "router_plan" {
  name                = "asp-router-plan"
  resource_group_name = azurerm_resource_group.routing_rg.name
  location            = azurerm_resource_group.routing_rg.location
  os_type             = "Linux"
  sku_name            = "Y1"
}

# The Thin Router Function
resource "azurerm_linux_function_app" "cell_router" {
  name                       = "func-global-edge-router"
  resource_group_name        = azurerm_resource_group.routing_rg.name
  location                   = azurerm_resource_group.routing_rg.location
  service_plan_id            = azurerm_service_plan.router_plan.id
  storage_account_name       = azurerm_storage_account.router_storage.name
  storage_account_access_key = azurerm_storage_account.router_storage.primary_access_key

  site_config {
    application_stack {
      node_version = "20"
    }
  }

  identity { type = "SystemAssigned" }
}

# RBAC: Allow Router to read the mapping table
resource "azurerm_role_assignment" "router_map_reader" {
  scope                = azurerm_cosmosdb_account.routing_db.id
  role_definition_name = "Cosmos DB Account Reader Role" # Note: Use custom data plane roles in production
  principal_id         = azurerm_linux_function_app.cell_router.identity[0].principal_id
}

4. Common Troubleshooting

  1. Cross-Cell Dependencies: The most critical anti-pattern in cellular architecture is allowing cells to communicate with each other or share a backend data store. If "Cell Alpha" synchronously queries an API in "Cell Beta", the isolation boundary is compromised. Ensure strict domain boundaries; data required to process an event within a cell must reside entirely within that cell's Cosmos DB.
  2. Control Plane Bottlenecks (Router Fatigue): The Cell Router represents a single point of failure. If the HTTP-triggered Azure Function queries the global Cosmos DB mapping table on every single request, you will encounter latency and potential RU/s (Request Unit) throttling. You must implement aggressive in-memory caching (e.g., using MemoryCache in C# or standard dictionary caching in Node.js) within the router's execution context so that tenant-to-cell mappings are resolved locally and instantly after the first lookup.
  3. Poison Pills and Service Bus Dead-Lettering: If a malformed payload triggers the Cosmos DB Change Feed but crashes the Consumer Azure Function, the message will be returned to the Service Bus Subscription. Because this is a cellular architecture, this poison pill will only degrade the specific cell it entered. To resolve this automatically, the Terraform module is configured with max_delivery_count = 5. Service Bus will automatically move the failing message to the Dead-Letter Queue (DLQ) after 5 attempts, restoring the cell's processing health.

5. Conclusion

By embedding an asynchronous event-driven flow within a cellular architecture on Microsoft Azure, you have constructed a system optimized for extreme resilience and strictly controlled blast radii. The rigid separation between a global routing control plane and fully isolated data plane cells ensures that infrastructure failures, bad deployments, or tenant-specific traffic spikes remain contained within a single cell. Utilizing Terraform modules to stamp out these cells guarantees environmental consistency and facilitates rapid horizontal scaling. As your system matures, consider enhancing the edge routing layer by placing Azure Front Door in front of your Cell Router Functions to provide global load balancing, WAF protection, and seamless traffic shifting during cell migrations or disaster recovery events.
