Cluster Stickiness for Large Distributed Applications

#cloudcomputing #clustering #scaling #partitions

Overview

Every software product company begins its journey from humble origins. However, success often brings its own set of challenges. As the organization grows, so does the user base, often at an exponential rate. This surge in users can strain the system, causing previously smooth database queries to slow down or even timeout. Such issues can severely impact user experience and, consequently, lead to revenue losses.

At this crucial juncture, Software Architects and leaders face the daunting task of devising effective scaling strategies. The goal is to ensure that data retrieval remains as swift as it was during the early stages, despite the burgeoning user base. Drawing from my professional experience in navigating such exponential scaling events, this article delves into best practices for overcoming these challenges by establishing cluster level stickiness and localizing user generated traffic to specific clusters and avoiding global database lookups. As an added bonus, this approach also helps with the compliance with data residency regulations enforced by governments across the world.

The Problem

Figure 1: Simple DB setup

In the beginning, a simple setup like the one shown in Figure 1 is usually enough to get started. As the number of users start increasing and the application starts going global, the DB fetch starts getting slower. This can be easily solved by creating indexes on specific columns within tables in the database until the DB size hits a certain limit. Beyond that, updating very large indexes and querying them starts to become a resource intensive and slow activity.

The Solution

The obvious solution to this problem is to partition the database. A good partitioning strategy for databases containing data about users from across the globe is to partition the database into region and/or user type based clusters, where each cluster contains the same database structure as all the clusters but contains only a subset of all the data. This can be bundled with a metadata store that serves as an index for the data housed in each cluster. This will look something like this,

Figure 2: Clustered Architecture

While this is an effective strategy for partitioning the data, it has a limitation. For every request that the load balancer receives, it will need to query the metadata store to identify the correct target cluster for the request and route it appropriately. This again is a resource intensive activity.

To address this, we can start by defining a simple structure to the metadata store. It can be an index of (username, cluster_id) pairs. This can be populated at the time the user registers to use the software or the first time the user’s data is written to a partition in a cluster. Once the user’s cluster has been identified, a first party cookie with the name of the cluster can be dropped onto the user’s browser and this can be used by the load balancer for subsequent requests to directly route the user’s requests to the correct cluster without looking up the metadata store. Figure 3 illustrates these sequences of actions.

Figure 3: Assigning user to a cluster and looking up the user from the cluster

Limitations

While implementing this solution speeds up the queries and improves the user experience, it also comes with several limitations. The prominent limitations are listed below,

Increased Maintenance Overhead: With a simple setup, the software architecture is straightforward and it is easy to modify, fix and replace. However the cost of maintenance goes up as the architecture gets more complex, due to the need to now maintain multiple database servers per partition and groups of servers per cluster.
Partition Rebalancing and Cluster Re-assignment: The partitions corresponding to locations where the software application is most popular tend to grow large in comparison to other partitions. As a result, the users from the popular areas start receiving the slowest responses. To avoid this, new local partitions will need to be created and the existing records are required to be rebalanced. As a part of this effort the metadata store will also need to be updated. It is natural for the user experience to be inconsistent or even broken during these rebalancing efforts. Similarly, when the cluster determination criteria for a user changes, for example, if a user moves from one country to another, the user will need to be moved to a different cluster corresponding to the new country. This also requires updating the metadata store and leads to additional processing power consumption.
Need for Experts: While a simple architecture can easily be maintained by most software engineers, experts with advanced database skills and advanced software architectural skills will be required to maintain complex architectures. Finding and retaining them could be challenging for organizations.

Conclusion

This article intentionally focuses on a specific aspect of database cluster setup for the sake of clarity and brevity. It acknowledges that other crucial elements, implementing replication and fault tolerance mechanisms, addressing replication lags, and employing caching strategies, are equally vital but inherently nuanced and architecture-specific.

Having experienced the need for rapid scaling firsthand in my career, I've crafted this article to offer a concise blueprint for swiftly developing the infrastructure necessary for scaling. Just as there are multiple ways to bake a cake, there exist various scaling strategies. The method outlined herein reflects my own repeatable approach, which I believe can be effectively utilized by Software Architects worldwide.

DEV Community

Cluster Stickiness for Large Distributed Applications

Overview

The Problem

The Solution

Limitations

Conclusion

Top comments (0)

Read next

Kubernetes PODs with global IPv6

AWS Artifact for Security Compliance Reports

Introducing k8s-pod-cpu-stressor: Simplify CPU Load Testing in Kubernetes

Unpopular opinion about "Moving back to on-prem"