Data residency requirements and regional compliance laws such as GDPR and LGPD often force architectures to fragment data across multiple cloud providers and geographic boundaries. When a centralized database becomes a legal or performance bottleneck, engineering teams frequently resort to manual data duplication or brittle synchronization scripts, which leads to inconsistent state, increased latency for cross-border users, and a much larger security attack surface. A robust answer to this complexity is a multicloud data sharding layer orchestrated through Hexagonal Architecture. By decoupling the domain entity from the underlying storage mechanism, the application can route queries to specific cloud-native databases based on tenant metadata. This approach keeps a Brazilian housing cooperative's data within Azure's Brazil South region while a European counterpart resides in AWS Frankfurt, all from a single, unified codebase that strictly adheres to data sovereignty mandates.
Prerequisites
Building this sharding plane requires Terraform 1.8.0 or higher to manage state across both cloud providers, with the AWS provider (version 5.45+) and AzureRM provider (version 3.95+) initialized. The application layer requires Python 3.12 with SQLAlchemy 2.0 for database abstraction and pydantic for strict data validation. Advanced knowledge of Domain-Driven Design (DDD) is essential to identify the correct sharding keys within your bounded contexts. You will also need active Service Principals for Azure and IAM Roles for AWS with permissions to manage RDS and Azure SQL instances.
Step-by-Step
Defining Geographic Sharding Boundaries
The first step in a multicloud sharding strategy is the physical provisioning of isolated database instances that serve as regional shards. We use Terraform to define these resources, configuring each database with provider-specific defaults: Amazon RDS for PostgreSQL on the AWS side and Azure SQL Database on the Azure side. Architectural integrity is maintained by ensuring that no shard shares a network or credentials with another. Each shard is tagged with a unique identifier that corresponds to a geographic region or regulatory jurisdiction. This physical isolation prevents "noisy neighbor" effects and ensures that a regional outage in one cloud provider does not cascade into a global failure. A standardized naming convention makes the shards discoverable at application runtime.
# database_shards/main.tf
variable "shard_configs" {
  type = map(object({
    provider = string
    region   = string
    tier     = string
  }))
}

resource "aws_db_instance" "shard_aws" {
  for_each = { for k, v in var.shard_configs : k => v if v.provider == "aws" }

  identifier        = "db-shard-${each.key}"
  engine            = "postgres"
  instance_class    = each.value.tier
  allocated_storage = 20
  db_name           = "tenant_data"

  # Master credentials are required; let RDS generate and store the password.
  username                    = "shard_admin"
  manage_master_user_password = true

  publicly_accessible    = false
  skip_final_snapshot    = true
  vpc_security_group_ids = [aws_security_group.db_access.id]

  tags = {
    ShardID    = each.key
    Regulatory = "GDPR_Compliant"
  }
}

resource "azurerm_mssql_database" "shard_azure" {
  for_each = { for k, v in var.shard_configs : k => v if v.provider == "azure" }

  name        = "db-shard-${each.key}"
  server_id   = azurerm_mssql_server.primary.id # server defined elsewhere in this module
  collation   = "SQL_Latin1_General_CP1_CI_AS"
  max_size_gb = 10
  read_scale  = false
  sku_name    = "S0"

  tags = {
    ShardID    = each.key
    Regulatory = "LGPD_Compliant"
  }
}
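Because every shard carries a ShardID tag, the application can discover shard endpoints at startup instead of hardcoding them. The helper below is a minimal sketch of that discovery step for the AWS side using boto3; the ShardID tag key mirrors the Terraform above, but the helper itself is an assumption rather than part of the provisioning code. An equivalent lookup can filter Azure resources by the same tag.

# infrastructure/discovery.py -- hypothetical startup helper; assumes the IAM
# role from the prerequisites and the ShardID tag applied by Terraform above.
import boto3

def discover_aws_shards(region: str) -> dict[str, str]:
    """Return a mapping of ShardID tag values to RDS endpoint addresses."""
    rds = boto3.client("rds", region_name=region)
    shards: dict[str, str] = {}
    for instance in rds.describe_db_instances()["DBInstances"]:
        tags = {t["Key"]: t["Value"] for t in instance.get("TagList", [])}
        if "ShardID" in tags and instance.get("Endpoint"):
            shards[tags["ShardID"]] = instance["Endpoint"]["Address"]
    return shards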
This infrastructure provides the storage backbone for our distributed data. How does the application decide which cloud to query for a specific tenant without hardcoding environment-specific logic into the core domain?
Constructing Hexagonal Storage Adapters
We maintain architectural purity by implementing the Hexagonal Architecture pattern, where the domain logic interacts only with a storage port (an abstract interface). We define a StoragePort protocol in Python that specifies the required data operations, such as saving or retrieving a housing cooperative's records. We then develop specific adapters for AWS and Azure that implement this protocol using cloud-native drivers. The application uses a factory pattern to inject the correct adapter at runtime based on the tenant's regional metadata. This decoupling ensures that the core business logic remains agnostic of the underlying database engine or cloud provider. If a specific region requires a migration from AWS to Azure, only the adapter configuration changes while the core domain remains untouched, preserving the stability of the business rules.
# domain/ports.py
from typing import Protocol, Dict, Any
from abc import abstractmethod

class CooperativeRepositoryPort(Protocol):
    @abstractmethod
    def save(self, data: Dict[str, Any]) -> bool:
        ...

    @abstractmethod
    def get_by_id(self, cooperative_id: str) -> Dict[str, Any]:
        ...
# infrastructure/adapters.py
from typing import Any, Dict

import sqlalchemy
from sqlalchemy.orm import sessionmaker

class PostgresShardAdapter:
    """Implements CooperativeRepositoryPort against a PostgreSQL shard on AWS."""

    def __init__(self, connection_string: str):
        self.engine = sqlalchemy.create_engine(connection_string)
        self.Session = sessionmaker(bind=self.engine)

    def save(self, data: Dict[str, Any]) -> bool:
        with self.Session() as session:
            session.execute(sqlalchemy.text(
                "INSERT INTO cooperatives (id, name, region) VALUES (:id, :name, :region)"
            ), data)
            session.commit()
            return True

    def get_by_id(self, cooperative_id: str) -> Dict[str, Any]:
        with self.Session() as session:
            row = session.execute(sqlalchemy.text(
                "SELECT id, name, region FROM cooperatives WHERE id = :id"
            ), {"id": cooperative_id}).mappings().first()
            return dict(row) if row else {}

class AzureSqlShardAdapter:
    """Implements CooperativeRepositoryPort against an Azure SQL shard."""

    def __init__(self, connection_string: str):
        self.engine = sqlalchemy.create_engine(connection_string)
        self.Session = sessionmaker(bind=self.engine)

    def save(self, data: Dict[str, Any]) -> bool:
        with self.Session() as session:
            session.execute(sqlalchemy.text(
                "INSERT INTO cooperatives (id, name, region) VALUES (:id, :name, :region)"
            ), data)
            session.commit()
            return True

    def get_by_id(self, cooperative_id: str) -> Dict[str, Any]:
        with self.Session() as session:
            row = session.execute(sqlalchemy.text(
                "SELECT id, name, region FROM cooperatives WHERE id = :id"
            ), {"id": cooperative_id}).mappings().first()
            return dict(row) if row else {}
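The prerequisites list pydantic for strict data validation, which fits naturally at the edge of the domain before any adapter is invoked. The sketch below assumes a Cooperative model whose fields mirror the columns in the INSERT statements above; the model itself is illustrative, not part of the adapters.

# domain/models.py -- illustrative validation layer; the field set mirrors the
# columns used by the adapters' INSERT statements and is otherwise an assumption.
from pydantic import BaseModel, Field

class Cooperative(BaseModel):
    id: str = Field(min_length=1)
    name: str = Field(min_length=1)
    region: str  # drives shard placement, e.g. "br-south" or "eu-central"

def to_record(payload: dict) -> dict:
    # Raises pydantic.ValidationError before malformed data reaches any shard.
    return Cooperative(**payload).model_dump()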
The use of ports and adapters ensures that the domain logic is protected from infrastructure churn. If a shard becomes unavailable due to a provider-wide outage, how does the system maintain partial availability for other regions while attempting a recovery?
Implementing Resilient Shard Resolution
Resilience in a sharded multicloud environment depends on a robust shard resolution mechanism that utilizes a globally distributed metadata store. We implement a ShardResolver that queries a lightweight mapping table, often stored in a highly available global service such as AWS DynamoDB Global Tables or Azure Cosmos DB. This resolver identifies the target cloud, the specific region, and the connection credentials for a given cooperative ID. To ensure high performance, these mappings are cached in memory using a TTL-based strategy. When a request enters the system, the resolver determines the correct shard and provides the corresponding adapter to the domain service. This architecture supports partial failure; if the AWS Frankfurt shard is offline, only the European tenants are affected while Brazilian tenants on Azure continue to operate without degradation. This granular isolation is the primary benefit of the cellular sharding pattern.
# application/services.py
from typing import Any, Dict, Optional

from domain.ports import CooperativeRepositoryPort
from infrastructure.adapters import AzureSqlShardAdapter, PostgresShardAdapter

class ShardResolver:
    def __init__(self, metadata_store: Dict[str, Dict[str, str]]):
        # In production this would be a DynamoDB or Cosmos DB lookup;
        # a plain dict keeps the example self-contained.
        self.metadata_store = metadata_store

    def resolve_adapter(self, cooperative_id: str) -> Optional[CooperativeRepositoryPort]:
        shard_info = self.metadata_store.get(cooperative_id)
        if not shard_info:
            return None
        # Select the adapter that matches the tenant's cloud provider.
        if shard_info["provider"] == "aws":
            return PostgresShardAdapter(shard_info["dsn"])
        return AzureSqlShardAdapter(shard_info["dsn"])

class CooperativeService:
    def __init__(self, resolver: ShardResolver):
        self.resolver = resolver

    def create_cooperative(self, cooperative_id: str, payload: Dict[str, Any]) -> bool:
        adapter = self.resolver.resolve_adapter(cooperative_id)
        if adapter:
            return adapter.save(payload)
        raise ConnectionError(f"No shard found for ID {cooperative_id}")
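Wiring these pieces together, and adding the TTL-based caching described above, might look like the following sketch. The shard map, DSNs, and cache settings are placeholders; the example uses the cachetools library for the in-memory TTL cache.

# application/bootstrap.py -- hypothetical wiring; the shard map and DSNs are
# placeholders, and cachetools supplies the TTL cache described in the text.
from cachetools import TTLCache

from application.services import CooperativeService, ShardResolver

# In production this map is read from DynamoDB Global Tables or Cosmos DB.
SHARD_MAP = {
    "coop-br-001": {"provider": "azure", "dsn": "mssql+pyodbc://<elided>"},
    "coop-eu-042": {"provider": "aws", "dsn": "postgresql+psycopg2://<elided>"},
}

class CachedResolver:
    """Duck-types as ShardResolver but memoizes lookups for five minutes."""

    def __init__(self, inner: ShardResolver, ttl: int = 300):
        self.inner = inner
        self.cache = TTLCache(maxsize=1024, ttl=ttl)

    def resolve_adapter(self, cooperative_id: str):
        if cooperative_id not in self.cache:
            self.cache[cooperative_id] = self.inner.resolve_adapter(cooperative_id)
        return self.cache[cooperative_id]

service = CooperativeService(CachedResolver(ShardResolver(SHARD_MAP)))
# With real DSNs in place, a write routes to the tenant's home shard:
# service.create_cooperative("coop-eu-042", {"id": "coop-eu-042", ...})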
This resolution logic ensures that traffic is always directed to the legally and geographically correct shard. How can we optimize the cross-shard reporting process when executives require a global view of data that is physically segregated across two different cloud providers?
Common Troubleshooting
A frequent issue in multicloud sharding is exhaustion of connection pools in the application layer. Since each shard requires a distinct connection pool, an application pod connecting to fifty shards can consume excessive memory and file descriptors. To solve this, implement a dynamic pool recycler that closes idle connections to infrequently accessed shards, or multiplex connections through a managed proxy such as AWS RDS Proxy on the AWS side and an external connection-pooling layer in front of the Azure shards.
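A hedged sketch of such a pool recycler, using only SQLAlchemy's built-in pool controls, is shown below; the pool sizes and idle threshold are illustrative values to tune per workload, not recommendations.

# infrastructure/pooling.py -- illustrative per-shard pool limits plus an
# idle-engine reaper; all thresholds here are assumptions.
import time
import sqlalchemy

class EngineRegistry:
    def __init__(self, max_idle_seconds: int = 600):
        self.max_idle_seconds = max_idle_seconds
        self._engines: dict[str, tuple[sqlalchemy.Engine, float]] = {}

    def get(self, shard_id: str, dsn: str) -> sqlalchemy.Engine:
        entry = self._engines.get(shard_id)
        if entry is None:
            # Small per-shard pools stop fifty shards from exhausting
            # memory and file descriptors in a single pod.
            engine = sqlalchemy.create_engine(
                dsn, pool_size=2, max_overflow=3, pool_recycle=300
            )
        else:
            engine = entry[0]
        self._engines[shard_id] = (engine, time.monotonic())
        return engine

    def reap_idle(self) -> None:
        # Dispose pools for shards that have been untouched too long.
        now = time.monotonic()
        for shard_id, (engine, last_used) in list(self._engines.items()):
            if now - last_used > self.max_idle_seconds:
                engine.dispose()
                del self._engines[shard_id]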
Another critical failure point is the inconsistency between the global shard metadata and the actual state of the database shards. If a shard is migrated from AWS to Azure but the metadata store is not updated atomically, the application will experience "dead-end" routing errors. We mitigate this by using a two-phase commit or a saga pattern for shard migrations, ensuring that the metadata is updated only after the target database is fully synchronized and validated.
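As one example of the final step in such a saga, assuming the shard map lives in DynamoDB Global Tables, a conditional write can flip the routing entry only if it still points at the source shard; the table and attribute names here are hypothetical.

# migrations/swap.py -- hypothetical last step of a shard-migration saga;
# the "shard_map" table and its attributes are assumptions.
import boto3

def commit_shard_move(cooperative_id: str, old_dsn: str,
                      new_dsn: str, new_provider: str) -> None:
    table = boto3.resource("dynamodb").Table("shard_map")
    # The condition rejects the swap if another process already moved this
    # tenant, preventing the "dead-end" routing errors described above.
    table.update_item(
        Key={"cooperative_id": cooperative_id},
        UpdateExpression="SET #dsn = :new, #prov = :p",
        ConditionExpression="#dsn = :old",
        ExpressionAttributeNames={"#dsn": "dsn", "#prov": "provider"},
        ExpressionAttributeValues={":new": new_dsn, ":old": old_dsn, ":p": new_provider},
    )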
Lastly, latency spikes can occur if the application is running in AWS but the metadata store is only in Azure. Always implement local caching of shard metadata within the application process and utilize a globally replicated database for the shard map to ensure the resolver always reads from the nearest regional endpoint.
Conclusion
Multicloud data sharding through Hexagonal Architecture provides a sophisticated framework for managing data sovereignty and architectural resilience. By isolating storage concerns into specific adapters and utilizing a global resolver, you ensure that your platform can scale across providers without sacrificing domain clarity or compliance. The next advanced step is to implement cross-shard distributed queries using a federated engine like Presto or Trino. This allows for analytical reporting across AWS and Azure shards without moving large volumes of sensitive data, providing a unified view of a geographically dispersed system.