DEV Community

Cláudio Filipe Lima Rapôso

Practical Guide: Building a Cell-Based Architecture on Azure with Terraform and Python

1. Introduction

As cloud applications scale to serve global audiences, relying on a single centralized infrastructure stack introduces critical vulnerabilities. A localized failure or a noisy neighbor can trigger a systemic outage. Cell-Based Architecture mitigates this by partitioning the system into isolated, identical, and self-contained units called "cells." By routing specific tenants or users to dedicated cells, you constrain the blast radius of any degradation strictly to that cell, preserving the availability of the rest of the system.

By the end of this tutorial, you will be able to design and provision a cellular architecture on Microsoft Azure. We will utilize Azure Front Door as the global ingress, Azure Functions (Python) for dynamic traffic routing and compute, and Azure Cosmos DB for isolated state management. Mastering this pattern on Azure not only hardens your current workloads but also reinforces foundational multicloud principles, as the concepts of decoupled routing and state isolation are directly transferable across different cloud providers in a unified enterprise landscape.

2. Prerequisites

To execute the configurations and code in this tutorial, ensure you have the following tools and access levels:

  • An active Microsoft Azure subscription with Owner or Contributor permissions to create Resource Groups, Cosmos DB accounts, Azure Functions, and Azure Front Door.
  • Terraform CLI (version 1.0 or higher) installed locally for provisioning the Infrastructure as Code (IaC).
  • Python (version 3.9 or higher) installed locally, along with the Azure Functions Core Tools for local development and packaging.
  • Azure CLI installed and authenticated (az login) on your local environment.
  • Fundamental understanding of partitioning concepts and Terraform HCL syntax.

3. Step-by-Step

Before diving into the infrastructure code, let's visualize the execution flow. The sequence diagram below details how a global request is securely intercepted, evaluated, and forwarded to a strictly isolated cellular stack within Azure.

(Sequence diagram: the client calls Azure Front Door, which forwards the request to the Global Router Function; the router resolves the tenant's cell and proxies the request to that cell's Function and Cosmos DB.)

3.1 Defining the Cell Blueprint (Terraform Module)

What to do: Create a reusable Terraform module representing a single, isolated "Cell." This includes a dedicated Storage Account, Application Service Plan, Azure Function (Worker), and a Cosmos DB database.

Why do it: The primary rule of cellular architecture is absolute consistency across environments. By encapsulating the infrastructure in a Terraform module, you ensure that every generated cell is an exact replica, preventing configuration drift and simplifying horizontal scaling.

Example:
Create a directory named modules/cell and add a main.tf file.

# modules/cell/main.tf
variable "location" { type = string }
variable "resource_group_name" { type = string }
variable "cell_id" { type = string }

resource "azurerm_cosmosdb_account" "cell_db" {
  name                = "cosmos-cell-${var.cell_id}"
  location            = var.location
  resource_group_name = var.resource_group_name
  offer_type          = "Standard"
  kind                = "GlobalDocumentDB"

  consistency_policy {
    consistency_level = "Session"
  }

  geo_location {
    location          = var.location
    failover_priority = 0
  }
}

resource "azurerm_cosmosdb_sql_database" "cell_sqldb" {
  name                = "app-state"
  resource_group_name = var.resource_group_name
  account_name        = azurerm_cosmosdb_account.cell_db.name
}

resource "azurerm_storage_account" "cell_storage" {
  name                     = "stcell${var.cell_id}"
  resource_group_name      = var.resource_group_name
  location                 = var.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_service_plan" "cell_plan" {
  name                = "plan-cell-${var.cell_id}"
  location            = var.location
  resource_group_name = var.resource_group_name
  os_type             = "Linux"
  sku_name            = "Y1" # Consumption plan
}

resource "azurerm_linux_function_app" "cell_worker" {
  name                       = "func-worker-${var.cell_id}"
  location                   = var.location
  resource_group_name        = var.resource_group_name
  service_plan_id            = azurerm_service_plan.cell_plan.id
  storage_account_name       = azurerm_storage_account.cell_storage.name
  storage_account_access_key = azurerm_storage_account.cell_storage.primary_access_key

  site_config {
    application_stack {
      python_version = "3.11"
    }
  }

  app_settings = {
    "CELL_ID"             = var.cell_id
    "COSMOS_DB_ENDPOINT"  = azurerm_cosmosdb_account.cell_db.endpoint
  }
}

output "function_default_hostname" {
  value = azurerm_linux_function_app.cell_worker.default_hostname
}

3.2 Stamping Out Multiple Cells

What to do: In your root Terraform configuration, define the base resource group and iterate over a collection of identifiers to deploy multiple cells simultaneously.

Why do it: This translates the theoretical module into physical, isolated environments. Using a for_each loop allows you to effortlessly scale from two cells to fifty cells simply by updating a local variable, abstracting the complexity of managing parallel infrastructures.

Example:
In your root directory, create main.tf.

# main.tf
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
  }
}

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "rg" {
  name     = "rg-cellular-architecture"
  location = "East US"
}

locals {
  cells = ["alpha", "beta"]
}

module "isolated_cells" {
  source   = "./modules/cell"
  for_each = toset(local.cells)

  cell_id             = each.key
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
}

3.3 Developing the Cell Router in Python

What to do: Write the Azure Function code in Python that will act as the global ingress router. It must inspect incoming HTTP requests, determine the appropriate tenant mapping, and proxy the request to the correct cell.

Why do it: The router abstracts the internal topology from the client. Applications interact with a single API endpoint, oblivious to the fact that their request is being routed to cell-alpha or cell-beta. This dynamic mapping is what allows you to perform live migrations of tenants between cells to balance load without downtime.

Example:
Create the Python code for your router function (__init__.py inside your Azure Function folder).

import logging
import json
import os
import urllib.request
import azure.functions as func

# In production, fetch this from Cosmos DB (Tenant Registry)
CELL_ENDPOINTS = {
    "alpha": os.environ.get("CELL_ALPHA_URL", "https://func-worker-alpha.azurewebsites.net"),
    "beta": os.environ.get("CELL_BETA_URL", "https://func-worker-beta.azurewebsites.net")
}

def get_target_cell(tenant_id: str) -> str:
    # Mocking a Cosmos DB registry lookup
    # A real implementation would query the global metadata database
    if tenant_id.startswith("A"):
        return "alpha"
    return "beta"

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Global Router processing a request.')

    try:
        req_body = req.get_json()
        tenant_id = req_body.get('tenant_id')

        if not tenant_id:
            return func.HttpResponse(
                "Missing partition key: tenant_id", 
                status_code=400
            )

        target_cell = get_target_cell(tenant_id)
        target_url = f"{CELL_ENDPOINTS[target_cell]}/api/process"

        # Proxy the request to the isolated cell
        data = json.dumps(req_body).encode('utf-8')
        proxy_req = urllib.request.Request(
            target_url, 
            data=data,
            headers={'Content-Type': 'application/json'}
        )

        with urllib.request.urlopen(proxy_req) as response:
            cell_response = response.read().decode('utf-8')

        return func.HttpResponse(
            json.dumps({
                "status": "success",
                "x_routed_to": target_cell,
                "data": json.loads(cell_response)
            }),
            mimetype="application/json",
            status_code=200
        )

    except Exception as e:
        logging.error(f"Routing error: {str(e)}")
        return func.HttpResponse("Internal Server Error", status_code=500)

3.4 Provisioning the Global Routing Layer

What to do: Deploy the central routing Azure Function and configure an Azure Front Door profile to sit in front of it.

Why do it: Azure Front Door acts as a secure, globally distributed entry point. It absorbs DDoS attacks at the edge, provides WAF capabilities, and ensures that the Global Router Function is protected from direct, unauthenticated internet exposure.

Example:
Append the routing infrastructure to your root main.tf.

# Shared storage for the Global Router
resource "azurerm_storage_account" "router_storage" {
  # Storage account names are globally unique across Azure; adjust if taken.
  name                     = "stglobalrouter"
  resource_group_name      = azurerm_resource_group.rg.name
  location                 = azurerm_resource_group.rg.location
  account_tier             = "Standard"
  account_replication_type = "LRS"
}

resource "azurerm_service_plan" "router_plan" {
  name                = "plan-global-router"
  location            = azurerm_resource_group.rg.location
  resource_group_name = azurerm_resource_group.rg.name
  os_type             = "Linux"
  sku_name            = "Y1"
}

resource "azurerm_linux_function_app" "global_router" {
  name                       = "func-global-router-gateway"
  location                   = azurerm_resource_group.rg.location
  resource_group_name        = azurerm_resource_group.rg.name
  service_plan_id            = azurerm_service_plan.router_plan.id
  storage_account_name       = azurerm_storage_account.router_storage.name
  storage_account_access_key = azurerm_storage_account.router_storage.primary_access_key

  site_config {
    application_stack {
      python_version = "3.11"
    }
  }

  app_settings = {
    "CELL_ALPHA_URL" = "https://${module.isolated_cells["alpha"].function_default_hostname}"
    "CELL_BETA_URL"  = "https://${module.isolated_cells["beta"].function_default_hostname}"
  }
}
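Note that the configuration above deploys the Global Router Function but does not itself create the Front Door profile described in "What to do." A minimal sketch of the Standard-tier Front Door resources (resource and route names here are illustrative) could look like this:

```hcl
# Illustrative Front Door (Standard tier) in front of the Global Router.
resource "azurerm_cdn_frontdoor_profile" "fd" {
  name                = "fd-cellular-gateway"
  resource_group_name = azurerm_resource_group.rg.name
  sku_name            = "Standard_AzureFrontDoor"
}

resource "azurerm_cdn_frontdoor_endpoint" "fd_endpoint" {
  name                     = "fde-cellular-gateway"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.fd.id
}

resource "azurerm_cdn_frontdoor_origin_group" "fd_og" {
  name                     = "og-global-router"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.fd.id

  load_balancing {}
}

resource "azurerm_cdn_frontdoor_origin" "fd_origin" {
  name                           = "origin-global-router"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.fd_og.id
  host_name                      = azurerm_linux_function_app.global_router.default_hostname
  origin_host_header             = azurerm_linux_function_app.global_router.default_hostname
  certificate_name_check_enabled = true
}

resource "azurerm_cdn_frontdoor_route" "fd_route" {
  name                          = "route-global-router"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.fd_endpoint.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.fd_og.id
  cdn_frontdoor_origin_ids      = [azurerm_cdn_frontdoor_origin.fd_origin.id]
  supported_protocols           = ["Http", "Https"]
  patterns_to_match             = ["/*"]
  forwarding_protocol           = "HttpsOnly"
  link_to_default_domain        = true
}
```

To fully lock down the ingress, you would additionally restrict the Function App so it only accepts traffic carrying your Front Door ID (for example via access restrictions on the `X-Azure-FDID` header); that hardening step is omitted here for brevity.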

Run terraform init, terraform plan, and terraform apply to deploy the entire multi-cell architecture.

4. Common Troubleshooting

Transitioning to a cellular architecture requires a shift in how you manage state and traffic. Here are common issues you may encounter:

  1. Cosmos DB Throttling (HTTP 429) in a Specific Cell:

    • Problem: One cell begins rejecting requests, while others operate normally. This usually indicates a "noisy neighbor" – a tenant whose workload has suddenly spiked, exhausting the Request Units (RUs) provisioned for that specific cell's database.
    • Solution: Check the cell's RU consumption metrics in Azure Monitor to confirm which tenant is responsible. If a tenant has outgrown the shared cell, you must execute a live migration: update the Global Tenant Registry (Metadata DB) to point that specific tenant_id to a newly provisioned, dedicated cell, seamlessly redirecting their traffic.
  2. Latency Overhead at the Routing Layer:

    • Problem: Requests take significantly longer because they must pass through Front Door, the Router Function, and finally the Cell Function.
    • Solution: The router logic must be aggressively optimized. Implement caching at the Global Router level using Azure Cache for Redis so the function doesn't need to query Cosmos DB for the tenant mapping on every single request.
  3. Cold Starts in Azure Functions:

    • Problem: The first request to a specific cell takes several seconds to execute.
    • Solution: Since we used the Consumption Plan (Y1) for cost-effectiveness in this tutorial, cold starts are expected. For production workloads, switch the Application Service Plan to Premium (EP1 or higher) to keep instances pre-warmed, ensuring consistent low-latency responses.

5. Conclusion

Building a Cell-Based Architecture on Azure transforms monolithic vulnerabilities into contained, manageable units. By utilizing Terraform, we established a reproducible baseline for isolated environments and implemented a Python routing layer to dynamically proxy traffic.

This decoupling ensures that massive traffic spikes or bad deployments are contained within a single cell. As you mature this architecture, focus on automating the "Tenant Migration" process—moving live data between Cosmos DB instances without downtime—and standardizing your Terraform modules to ensure this pattern can be rapidly adopted across multiple cloud environments.
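The tenant-migration flow mentioned above can be sketched conceptually. This is a simplified illustration, not production code: plain dicts stand in for the per-cell Cosmos DB containers and the global tenant registry, and `migrate_tenant` is a hypothetical helper name.

```python
def migrate_tenant(tenant_id: str, target_cell: str,
                   registry: dict, source_db: dict, target_db: dict) -> None:
    # 1. Backfill: copy the tenant's documents into the target cell.
    #    (A production version would also dual-write incoming documents to
    #    both cells during this copy window to avoid losing writes.)
    for doc_id, doc in source_db.items():
        if doc["tenant_id"] == tenant_id:
            target_db[doc_id] = dict(doc)

    # 2. Cutover: repoint the tenant in the global registry so the router
    #    directs new requests to the target cell.
    registry[tenant_id] = target_cell

    # 3. Cleanup: drop the tenant's documents from the source cell.
    for doc_id in [k for k, d in source_db.items()
                   if d["tenant_id"] == tenant_id]:
        del source_db[doc_id]
```

The key design point is the ordering: data is copied before the registry cutover, so the target cell is already consistent when traffic arrives, and the source is cleaned up only afterwards.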
