DEV Community

Alex Rivers
Alex Rivers

Posted on

Consistent Hashing: The Complete Developer Guide

Consistent Hashing: The Complete Developer Guide

Consistent Hashing: The Complete Developer Guide

====================================================================

What Is It?


Imagine you're at a restaurant with a huge buffet. You want to find your favorite dish, but the buffet is so big that you can't possibly scan every single item. A clever solution would be to have a map of the buffet, where each section corresponds to a specific range of dishes. When you want to find a dish, you look at the map, find the section that corresponds to your dish, and head straight to that section. This way, you don't have to scan the entire buffet.

Consistent hashing works similarly. Instead of a physical buffet, we have a large set of data (e.g., a distributed database or a cache). We want to map this data to a set of servers (or nodes) in a way that allows us to efficiently find the data we need. Consistent hashing is an algorithm that maps data to nodes in a way that minimizes the number of nodes that need to be updated when the system changes (e.g., when a node is added or removed).

Technically, consistent hashing is a hashing technique that maps a set of keys (e.g., data) to a set of nodes (e.g., servers) in a way that satisfies the following properties:

  • Consistency: The hash function maps each key to a node in a consistent manner, meaning that the same key will always map to the same node.
  • Load balancing: The hash function distributes the keys across the nodes in a balanced manner, meaning that each node receives approximately the same number of keys.
  • Scalability: The hash function can handle changes to the set of nodes (e.g., adding or removing nodes) without requiring a complete rebalancing of the keys.

How It Works Under the Hood


Consistent hashing uses a combination of hash functions and a data structure called a ring (or circle). The ring is a circular data structure that represents the range of possible hash values. Each node is assigned a range of hash values on the ring, and each key is hashed to a value on the ring. The node responsible for a key is the node whose range includes the hash value of the key.

Here's a step-by-step overview of the process:

  1. Node placement: Each node is assigned a range of hash values on the ring. This is typically done by hashing the node's ID (e.g., its IP address) to a value on the ring.
  2. Key hashing: Each key is hashed to a value on the ring using a hash function.
  3. Node selection: The node responsible for a key is the node whose range includes the hash value of the key.

To illustrate this process, let's consider a simple example with three nodes (A, B, and C) and three keys (1, 2, and 3). We'll use a simple hash function that maps each key to a value on the ring.

  +---------------+
  |  Node A  | 0-33  |
  +---------------+
  |  Node B  | 34-66  |
  +---------------+
  |  Node C  | 67-100 |
  +---------------+
Enter fullscreen mode Exit fullscreen mode

Suppose we want to store the key 1 using consistent hashing. We hash the key 1 to a value on the ring, say 20. Since 20 falls within the range of Node A (0-33), we store the key 1 on Node A.

Now, let's add a new node (D) to the system. We need to update the ranges of the existing nodes to accommodate the new node.

  +---------------+
  |  Node A  | 0-25  |
  +---------------+
  |  Node B  | 26-50  |
  +---------------+
  |  Node C  | 51-75  |
  +---------------+
  |  Node D  | 76-100 |
  +---------------+
Enter fullscreen mode Exit fullscreen mode

We need to rebalance the keys across the nodes. Since the key 1 is still hashed to 20, it still falls within the range of Node A. However, the key 2 (hashed to 40) now falls within the range of Node B, and the key 3 (hashed to 80) now falls within the range of Node D.

Real-World Use Cases


1. Distributed Caching

Consistent hashing is widely used in distributed caching systems, such as Memcached and Redis. When a new node is added to the system, the cache is rebalanced across the nodes using consistent hashing.

import hashlib

def consistent_hash(key, nodes):
    hash_value = int(hashlib.md5(key.encode()).hexdigest(), 16)
    for node in nodes:
        node_hash = int(hashlib.md5(node.encode()).hexdigest(), 16)
        if hash_value <= node_hash:
            return node
    return nodes[0]

nodes = ['Node A', 'Node B', 'Node C']
key = 'my_key'
node = consistent_hash(key, nodes)
print(f"Key {key} is stored on {node}")
Enter fullscreen mode Exit fullscreen mode

2. Load Balancing

Consistent hashing can be used to distribute incoming requests across multiple servers in a load balancing system.

import hashlib

def consistent_hash(ip, servers):
    hash_value = int(hashlib.md5(ip.encode()).hexdigest(), 16)
    for server in servers:
        server_hash = int(hashlib.md5(server.encode()).hexdigest(), 16)
        if hash_value <= server_hash:
            return server
    return servers[0]

servers = ['Server A', 'Server B', 'Server C']
ip = '192.168.1.100'
server = consistent_hash(ip, servers)
print(f"IP {ip} is routed to {server}")
Enter fullscreen mode Exit fullscreen mode

3. Distributed Databases

Consistent hashing is used in distributed databases, such as Apache Cassandra and Amazon DynamoDB, to map data to nodes in a way that minimizes the number of nodes that need to be updated when the system changes.

When To Use It vs When NOT To


Use Case Description Consistent Hashing
Distributed caching Cache data across multiple nodes
Load balancing Distribute incoming requests across multiple servers
Distributed databases Map data to nodes in a way that minimizes updates
Small datasets Data fits in memory on a single node
Static data Data does not change frequently
High latency High latency between nodes

Common Misconceptions


1. Consistent Hashing is a Hash Function

Consistent hashing is not a hash function itself, but rather a technique that uses hash functions to map keys to nodes.

2. Consistent Hashing Requires a Fixed Number of Nodes

Consistent hashing can handle changes to the number of nodes, including adding or removing nodes.

3. Consistent Hashing is Only Used in Distributed Systems

While consistent hashing is widely used in distributed systems, it can also be used in non-distributed systems to improve load balancing and caching.

Performance Characteristics


  • Time complexity: O(log n), where n is the number of nodes.
  • Space complexity: O(n), where n is the number of nodes.

Related Concepts


TL;DR


  • Consistent hashing is a technique that maps keys to nodes in a way that minimizes the number of nodes that need to be updated when the system changes.
  • It uses a combination of hash functions and a ring data structure to achieve load balancing and scalability.
  • Consistent hashing is widely used in distributed systems, including caching, load balancing, and distributed databases.
  • It has a time complexity of O(log n) and a space complexity of O(n), where n is the number of nodes.

Recommended Tools


Recommended Tools

Top comments (0)