Distributed systems operating across Amazon Web Services and Microsoft Azure often encounter critical failures when attempting to manage shared global state. In a multicloud environment, where a command service in AWS and a worker process in Azure must coordinate access to a shared resource, such as a high-value financial ledger or a limited inventory pool, traditional local concurrency controls are insufficient. Relying on eventual consistency in these scenarios introduces the risk of race conditions, where simultaneous writes lead to data corruption or double-spending events. These failures result in direct revenue loss and a fundamental compromise of data integrity. The definitive architectural solution is the implementation of a Cross-Cloud Distributed Lock Manager (DLM) utilizing a high-consistency semaphore store. By using a centralized authority with conditional update capabilities, engineering teams can enforce mutual exclusion across cloud boundaries, ensuring that only one process operates on a protected resource at any given time (Kleppmann, 2017).
Prerequisites
To implement this distributed locking mechanism, you need Terraform 1.7 or later with the AWS provider (version 5.30+) and the AzureRM provider (version 3.80+). The implementation uses Python 3.12 with the boto3 library for interacting with Amazon DynamoDB and azure-cosmos for enforcing fenced writes on the Azure side. You must have established cross-cloud network connectivity, either via a Site-to-Site VPN or a third-party interconnect, and configured IAM Roles for Service Accounts (IRSA) on the AWS side and Managed Identities on the Azure side. Deep familiarity with optimistic concurrency control and the CAP theorem is essential for understanding the trade-offs involved in consensus-based locking (Brewer, 2000).
Step-by-Step
Step 1: Provisioning the Global Lock Table
The foundation of a multicloud distributed lock is a single, authoritative source of truth that supports atomic, conditional writes. We utilize Amazon DynamoDB with Global Tables to act as this semaphore store because it provides a strongly consistent read option and the attribute_not_exists conditional expression, which is required for atomic lock acquisition. One caveat: strong consistency and conditional writes apply within a single region, while Global Tables replication is asynchronous, so all lock acquisitions must target one designated writer region; the replicas exist for read availability and disaster recovery, not for concurrent acquisition. This table serves as the global registry where lock identifiers are mapped to the owning process and an expiration timestamp. We use Terraform to define this infrastructure, ensuring the table is replicated across regions to maintain availability. The primary key is the resource identifier, and we enable the Time to Live (TTL) feature to provide an infrastructure-level safety net for expired leases (TTL deletions can lag, which is why the conditional expression, not TTL, is the primary expiry check). This centralized approach is an architectural necessity because consensus cannot be reliably achieved across cloud boundaries without a shared, high-availability state store.
# infrastructure/lock_store.tf
resource "aws_dynamodb_table" "global_lock_manager" {
name = "cross-cloud-distributed-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "resource_id"
stream_enabled = true
stream_view_type = "NEW_AND_OLD_IMAGES"
attribute {
name = "resource_id"
type = "S"
}
ttl {
attribute_name = "lease_expiry"
enabled = true
}
tags = {
Domain = "DistributedConsensus"
Architecture = "Multicloud-DLM"
}
# Ensure replication for multicloud availability
replica {
region_name = "us-east-1"
}
replica {
region_name = "eu-west-1"
}
}
This infrastructure provides the atomic primitive needed to store the lock state. How do we prevent a scenario where a worker process in Azure acquires a lock but subsequently crashes, leaving the resource permanently inaccessible to other workers in AWS?
Step 2: Implementing the Lease Pattern with Heartbeats
We solve the problem of abandoned locks by implementing the Lease pattern, where a lock is not a permanent state but a time-bound grant. The worker process must acquire the lock for a specific duration and continuously renew it through a heartbeat mechanism. In our Python implementation, we use a Hexagonal Architecture approach to define a LockManager that executes a conditional put_item call. If the resource_id already exists and the current time is less than the lease_expiry, the acquisition fails, preventing concurrent access. If the acquisition succeeds, the worker receives a lease. We pair the acquisition logic with a background thread that periodically extends the lease_expiry timestamp in DynamoDB (a sketch of this renewal loop follows the acquisition code below). This heartbeat ensures that as long as the worker is healthy and the network is functional, the lock remains held, but if the process terminates, the lease naturally expires, allowing other nodes to compete for the resource (Vernon, 2013).
# distributed_locking/lock_manager.py
import time
from dataclasses import dataclass
from typing import Optional

import boto3
from botocore.exceptions import ClientError


@dataclass
class LockLease:
    resource_id: str
    owner_id: str
    lease_duration: int
    fencing_token: int


class DynamoDBLockManager:
    def __init__(self, table_name: str, owner_id: str):
        self.dynamodb = boto3.resource('dynamodb')
        self.table = self.dynamodb.Table(table_name)
        self.owner_id = owner_id

    def acquire_lock(self, resource_id: str, duration: int = 30) -> Optional[LockLease]:
        """
        Attempts to acquire a lease. Implements the 'fencing token' concept
        by issuing a new, higher token on every successful acquisition.
        Returns None if the resource is currently held.
        """
        now = int(time.time())
        expiry = now + duration
        token = int(time.time() * 1000)  # Millisecond epoch as token
        try:
            # Atomic conditional write: succeed only if the item doesn't exist
            # OR the existing lease has already expired.
            self.table.put_item(
                Item={
                    'resource_id': resource_id,
                    'owner_id': self.owner_id,
                    'lease_expiry': expiry,
                    'fencing_token': token,
                },
                ConditionExpression="attribute_not_exists(resource_id) OR lease_expiry < :now",
                ExpressionAttributeValues={':now': now},
            )
            return LockLease(resource_id, self.owner_id, duration, token)
        except ClientError as e:
            if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                print(f"Lock acquisition failed for {resource_id}. Resource busy.")
                return None
            raise
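The listing above covers acquisition; the renewal loop described earlier runs alongside it in a background thread. Below is a minimal sketch, assuming a LeaseHeartbeat wrapper (a name introduced here for illustration) that extends lease_expiry only while this process still owns the lock:
# distributed_locking/heartbeat.py (illustrative sketch; LeaseHeartbeat is an
# assumed helper, not part of the original listing)
import threading
import time
from botocore.exceptions import ClientError

class LeaseHeartbeat:
    """Renews a held lease in the background until stopped or fenced out."""
    def __init__(self, manager, lease, interval: int = 10):
        self.manager = manager   # a DynamoDBLockManager instance
        self.lease = lease       # the LockLease returned by acquire_lock
        self.interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        while not self._stop.wait(self.interval):
            try:
                # Extend the lease only if this process still owns it; a failed
                # condition means another worker took over after expiry.
                self.manager.table.update_item(
                    Key={'resource_id': self.lease.resource_id},
                    UpdateExpression="SET lease_expiry = :new_expiry",
                    ConditionExpression="owner_id = :me",
                    ExpressionAttributeValues={
                        ':new_expiry': int(time.time()) + self.lease.lease_duration,
                        ':me': self.manager.owner_id,
                    },
                )
            except ClientError as e:
                if e.response['Error']['Code'] == 'ConditionalCheckFailedException':
                    self._stop.set()  # Ownership lost; stop renewing.
                else:
                    raise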
The lease pattern effectively handles process failures by allowing time-based recovery. What occurs if a process in Azure experiences a long garbage collection pause, its lease expires, another process in AWS acquires the lock, and then the original Azure process wakes up and attempts to complete its write?
Step 3: Enforcing Integrity with Fencing Tokens
To prevent the "delayed write" problem caused by execution pauses or network lag, we must implement Fencing Tokens. A fencing token is a monotonically increasing number generated by the lock service on every successful acquisition. In our implementation, we use a millisecond-precision epoch as the token; note that a wall-clock token is only approximately monotonic under clock skew, and a stricter alternative is a per-resource atomic counter incremented with a DynamoDB ADD update expression. When a worker performs a write to the protected resource (e.g., updating a record in Azure Cosmos DB), it must include this token. The database then enforces a rule where it only accepts a write if the provided token is greater than the token of the last successful write. This ensures that even if an old worker attempts a write after its lease has expired and been reacquired, the write will be rejected by the data store. This mechanism provides a final layer of safety that guards against the limitations of distributed time and process scheduling.
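Enforcement lives in the data store rather than in the lock manager. The following is a minimal sketch against Azure Cosmos DB, assuming a last_fencing_token attribute on the protected document (the function name and document layout are illustrative, not a prescribed API); an ETag precondition makes the read-compare-write cycle atomic:
# ledger_writer/fenced_write.py (illustrative sketch; the container layout and
# last_fencing_token attribute are assumptions, not part of the original design)
from azure.core import MatchConditions

def fenced_update(container, item_id: str, partition_key: str,
                  fencing_token: int, new_balance: float) -> None:
    """Apply a write only if our token is newer than the last accepted one.
    `container` is an azure.cosmos ContainerClient for the protected data."""
    doc = container.read_item(item=item_id, partition_key=partition_key)
    if doc.get('last_fencing_token', 0) >= fencing_token:
        raise PermissionError("Stale lease: a newer lock holder has already written")
    doc['balance'] = new_balance
    doc['last_fencing_token'] = fencing_token
    # If another writer modified the document between our read and this replace,
    # Cosmos DB rejects the call with HTTP 412; re-read and re-check the token.
    container.replace_item(
        item=doc['id'],
        body=doc,
        etag=doc['_etag'],
        match_condition=MatchConditions.IfNotModified,
    )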
Common Troubleshooting
A frequent issue in multicloud locking is clock skew between AWS and Azure environments. If the system clock on an Azure VM is significantly behind the clocks of the workers writing to DynamoDB, a lease might be perceived as expired by the DLM while the worker still believes it is valid. DynamoDB does not expose a server-side clock to compare against, so the practical mitigations are to keep all workers synchronized against the same NTP source and to implement a "clock drift safety margin" by subtracting a few seconds from the lease duration on the client side.
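A minimal sketch of that safety margin, assuming the worker records its own acquisition timestamp (the constant and helper below are illustrative):
# Sketch: treat the lease as expiring early on the client side.
CLOCK_DRIFT_MARGIN_SECONDS = 3  # tune to the worst skew observed between clouds

def effective_deadline(acquired_at: float, lease_duration: int) -> float:
    """The moment this worker must stop touching the protected resource."""
    return acquired_at + lease_duration - CLOCK_DRIFT_MARGIN_SECONDS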
Another common failure point is throttling on the DynamoDB lock table. Even in PAY_PER_REQUEST mode, a highly contested lock lives under a single partition key, so hundreds of microservices polling the same item can trigger ProvisionedThroughputExceededException errors. Instead of aggressive polling, implement a truncated exponential backoff strategy: if a lock acquisition fails, the worker should wait for a randomized interval before retrying, preventing a thundering herd problem that exhausts the table's per-partition throughput (see the sketch below).
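A sketch of such a retry loop, reusing acquire_lock from Step 2 (the attempt counts and delays are illustrative defaults, not prescribed values):
import random
import time

def acquire_with_backoff(manager, resource_id: str, max_attempts: int = 8,
                         base_delay: float = 0.1, max_delay: float = 5.0):
    """Truncated exponential backoff with full jitter around acquire_lock."""
    for attempt in range(max_attempts):
        lease = manager.acquire_lock(resource_id)
        if lease is not None:
            return lease
        # Full jitter: pick a random delay within the truncated exponential window.
        time.sleep(random.uniform(0, min(max_delay, base_delay * (2 ** attempt))))
    return None  # Caller decides whether to escalate or give up.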
Finally, ensure that the Azure worker identities have the dynamodb:PutItem, dynamodb:UpdateItem, and dynamodb:GetItem permissions via an OIDC-federated IAM role (UpdateItem is needed for heartbeat renewals). A common error is an AccessDenied response when the worker calls sts:AssumeRoleWithWebIdentity, caused by a misconfigured trust relationship on the AWS IAM role.
Conclusion
Implementing distributed locking across AWS and Azure is a fundamental requirement for maintaining data consistency in high-stakes multicloud architectures. By combining atomic conditional writes in DynamoDB, time-bound leases with heartbeats, and fencing tokens at the storage layer, you create a robust defense against race conditions and process failures. This pattern ensures that shared resources remain consistent regardless of the execution environment. As a next step, consider implementing the Redlock algorithm across three distinct regions or cloud providers to further eliminate the single-point-of-failure risk associated with a single cloud's control plane.
References
Brewer, E. A. (2000). Towards robust distributed systems. Proceedings of the Nineteenth Annual ACM Symposium on Principles of Distributed Computing.
Kleppmann, M. (2017). Designing data-intensive applications: The big ideas behind reliable, scalable, and maintainable systems. O'Reilly Media.
Lampson, B. W. (1996). How to build a highly available system using consensus. In Distributed Algorithms (pp. 1-17). Springer.
Vernon, V. (2013). Implementing domain-driven design. Addison-Wesley Professional.