Cláudio Filipe Lima Rapôso

Posted on Jun 8

Architecting Multicloud CQRS: Decoupling Read and Write Domains Across AWS and Azure

#azure #aws #terraform #python

Monolithic relational databases choke under the read-heavy demands of globally distributed enterprise applications. When thousands of concurrent clients query complex aggregations, the primary transactional store experiences severe CPU contention, risking the integrity of write operations. Attempting to solve this by scaling up the primary database only defers the bottleneck and inextricably couples read traffic to the write domain's availability. If the single database degrades, the entire platform becomes unresponsive. Command Query Responsibility Segregation (CQRS) resolves this structural flaw by physically separating the write model from the read model. By orchestrating an event-driven data pipeline across Amazon Managed Streaming for Apache Kafka (MSK) and Microsoft Azure Event Hubs, engineering teams can project asynchronous materialized views optimized strictly for localized read performance. This multicloud topology guarantees that aggressive query patterns never impact transactional throughput, ensuring absolute resilience and low latency data access across disparate vendor networks.

Prerequisites

Implementing cross-cloud CQRS demands mastery of event sourcing, stream processing, and distributed network engineering. The infrastructure provisioning layer requires Terraform version 1.7.0 or higher, utilizing the HashiCorp AWS Provider version 5.40.0 and the AzureRM Provider version 3.90.0. The projection workers demand Python 3.12, supplemented by the confluent-kafka library version 2.3.0 and azure-cosmos version 4.5.1. A unified identity plane utilizing OpenID Connect (OIDC) is necessary for secure, cross-boundary authentication. Administrative access to configure Virtual Network peering or dedicated IPsec VPN tunnels between the AWS Virtual Private Cloud and the Azure Virtual Network is mandatory to guarantee encrypted transit.

Step-by-Step Implementation

Architecting the Immutable Write Log with Amazon MSK

Establishing a robust CQRS write model requires an append-only, immutable ledger to record every domain state transition securely. We provision Amazon MSK to act as the central nervous system for the write domain. The architectural justification resides in the necessity for strict event ordering and indefinite retention capabilities. When a domain service processes a command, it does not mutate a traditional database table. Instead, it publishes a domain event to a specific Kafka topic. MSK guarantees that these events are replicated across three distinct Availability Zones, providing exceptional durability. By isolating the write operations to this streaming platform, the system achieves massive write throughput entirely unencumbered by complex SQL joins, index recalculations, or table locks.

resource "aws_msk_cluster" "enterprise_event_store" {
  cluster_name           = "enterprise-write-domain"
  kafka_version          = "3.5.1"
  number_of_broker_nodes = 3

  broker_node_group_info {
    instance_type = "kafka.m5.large"
    client_subnets = [
      aws_subnet.private_az1.id,
      aws_subnet.private_az2.id,
      aws_subnet.private_az3.id
    ]
    security_groups = [aws_security_group.msk_internal.id]
    storage_info {
      ebs_storage_info {
        volume_size = 1000
      }
    }
  }

  client_authentication {
    sasl {
      iam = true
    }
  }

  tags = {
    Architecture = "CQRS"
    Domain       = "WriteModel"
  }
}

How do we replicate these highly sensitive, immutable events to an external cloud boundary without exposing the internal MSK brokers to the unpredictable latency and security threats of the public internet?

Bridging the Multicloud Boundary via MirrorMaker 2

We bridge the isolated network boundaries by deploying Apache Kafka MirrorMaker 2 within an Amazon Elastic Container Service (ECS) Fargate cluster, pushing events asynchronously to Azure Event Hubs over a dedicated IPsec tunnel. The core architectural reasoning for this component is mitigating vendor lock-in while establishing a secondary read locality. Event Hubs natively supports the Kafka protocol, allowing MirrorMaker 2 to treat the Azure destination exactly like a standard Kafka cluster. By executing this replication process asynchronously, the AWS write domain remains completely insulated from any performance degradation occurring within the Azure network. The primary write operation acknowledges success immediately upon persisting to MSK. Subsequently, MirrorMaker 2 acts as an autonomous consumer, fetching batches of events and handling the eventual consistency delivery to the Azure environment securely.

resource "azurerm_eventhub_namespace" "multicloud_replica" {
  name                = "enterprise-read-replica"
  location            = azurerm_resource_group.core.location
  resource_group_name = azurerm_resource_group.core.name
  sku                 = "Standard"
  capacity            = 4

  tags = {
    Architecture = "CQRS"
    Domain       = "ReadModel"
  }
}

resource "azurerm_eventhub" "transaction_events" {
  name                = "transactions"
  namespace_name      = azurerm_eventhub_namespace.multicloud_replica.name
  resource_group_name = azurerm_resource_group.core.name
  partition_count     = 32
  message_retention   = 7
}

If the event stream successfully traverses the multicloud boundary, how do the Azure computing units process this continuous influx of raw event data into highly optimized, localized read models without falling behind the throughput velocity?

Hydrating Materialized Views with Hexagonal Consumers

We construct the optimized read models by deploying horizontally scalable Python consumers that interpret the event stream and upsert document structures into Azure Cosmos DB. The architectural imperative here is the strict separation of concerns utilizing Hexagonal Architecture. The Python consumer acts purely as a driving adapter, polling the Event Hub via the Kafka protocol. It passes the raw event payload into a core domain projection service. This domain service computes the current state, pre-calculating the exact data shapes required by the frontend client interfaces, and passes the result to an outbound port connected to Cosmos DB. This methodology ensures the read database contains heavily denormalized, ready-to-serve JSON documents. When a client performs a read operation, Cosmos DB executes a point-read query in single-digit milliseconds, entirely bypassing runtime computational overhead.

import json
import os
from confluent_kafka import Consumer, KafkaError
from azure.cosmos import CosmosClient
from typing import Dict, Any

EVENTHUB_CONN_STR = os.environ["EVENTHUB_KAFKA_CONN_STR"]
COSMOS_ENDPOINT = os.environ["COSMOS_ENDPOINT"]
COSMOS_KEY = os.environ["COSMOS_KEY"]

cosmos_client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)
database = cosmos_client.get_database_client("ReadModels")
container = database.get_container_client("TransactionViews")

kafka_conf = {
    'bootstrap.servers': EVENTHUB_CONN_STR,
    'security.protocol': 'SASL_SSL',
    'sasl.mechanisms': 'PLAIN',
    'sasl.username': '$ConnectionString',
    'sasl.password': os.environ["EVENTHUB_SASL_KEY"],
    'group.id': 'projection-worker-group-v1',
    'auto.offset.reset': 'earliest'
}

def project_transaction_view(event_payload: Dict[str, Any]) -> None:
    view_id = event_payload.get("account_id")
    amount = float(event_payload.get("amount", 0))

    try:
        current_view = container.read_item(item=view_id, partition_key=view_id)
        current_view["total_balance"] += amount
        current_view["transaction_count"] += 1
        current_view["last_updated"] = event_payload.get("timestamp")
        container.upsert_item(body=current_view)
    except Exception:
        new_view = {
            "id": view_id,
            "account_id": view_id,
            "total_balance": amount,
            "transaction_count": 1,
            "last_updated": event_payload.get("timestamp")
        }
        container.create_item(body=new_view)

def start_projection_worker():
    consumer = Consumer(kafka_conf)
    consumer.subscribe(['transactions'])

    print("Started Cosmos DB projection worker...")
    try:
        while True:
            msg = consumer.poll(1.0)
            if msg is None:
                continue
            if msg.error():
                if msg.error().code() != KafkaError._PARTITION_EOF:
                    print(f"Consumer error: {msg.error()}")
                continue

            payload = json.loads(msg.value().decode('utf-8'))
            project_transaction_view(payload)

    except KeyboardInterrupt:
        pass
    finally:
        consumer.close()

if __name__ == "__main__":
    start_projection_worker()

What structural recovery mechanism exists when a faulty code deployment introduces a malformed projection schema, silently corrupting the materialized view inside Cosmos DB for several hours before detection?

Replaying the Stream for Schema Evolution and Recovery

We rectify structural corruption by treating the materialized view as entirely ephemeral and executing a deterministic event replay from the primary immutable log. The fundamental advantage of event sourcing is that application state derives entirely from the history of events. When data corruption occurs, or when a new business requirement dictates a different data structure, developers deploy a corrected version of the Python consumer logic bound to a new group.id configuration. This new consumer group ignores the corrupted offset markers and begins reading the event log from the absolute beginning (auto.offset.reset = 'earliest'). It rapidly projects the entire historical dataset into a fresh, isolated Cosmos DB container using the corrected schema logic. Once the new projection catches up to the current stream head, the routing layer seamlessly transitions client read traffic to the new container. The corrupted container is subsequently dropped, allowing for massive structural changes and recovery operations with zero downtime.

Common Troubleshooting

When establishing cross-cloud Kafka replication, authentication failures between MirrorMaker 2 and Azure Event Hubs are frequent. If the MirrorMaker logs display SASL authentication failed, ensure that the connection string provided to the destination configuration utilizes the $ConnectionString literal as the username, and the primary key of the Event Hubs Shared Access Policy as the password. Event Hubs enforces strict Kafka protocol validation, and utilizing misconfigured SASL/PLAIN mechanisms will result in silent connection drops.

Another critical issue surfaces as extreme consumer lag within the Python projection workers during high-throughput ingestion spikes. If Cosmos DB begins returning HTTP 429 TooManyRequests errors, the Event Hub consumer will continuously block and retry, causing the offset lag to increase exponentially. This indicates that the Request Unit (RU) capacity provisioned on the Cosmos DB container is insufficient for the write velocity of the event stream. You must scale the container throughput dynamically using Azure Autoscale rules, or implement aggressive batching within the Python azure-cosmos client to commit multiple materialized views in a single transactional batch operation, significantly reducing the API call frequency.

Conclusion

Orchestrating a cross-cloud CQRS architecture using Amazon MSK and Azure Event Hubs establishes a deeply resilient data ecosystem. By segregating the immutable write operations from highly concurrent read demands, engineering teams can guarantee transactional integrity while providing localized, sub-millisecond query performance across vendor networks. Moving forward, teams scaling this pattern should implement Schema Registry integrations on the AWS origin. Enforcing strict Protobuf or Avro schemas before events enter the MSK topics ensures that schema evolution is mathematically validated, preventing malformed payloads from ever propagating across the multicloud boundary.

References

Kleppmann, M. (2017). Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O'Reilly Media.

Stopford, B. (2018). Designing event-driven systems: Concepts and patterns for streaming services with Apache Kafka. O'Reilly Media.

DEV Community