First of all, what is hashing?
Hashing is a technique or process of mapping keys and values into a hash table by using a hash function. It is done for faster access to elements. The efficiency of the mapping depends on the efficiency of the hash function used.
Let a hash function H(x) map the value x to the index x % 10 in an array. For example, if the list of values is [11, 12, 13, 14, 15], they will be stored at positions {1, 2, 3, 4, 5} in the array (or hash table), respectively.
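As a minimal sketch of this example in Python (the name H comes from the example above):

```python
def H(x):
    """The example hash function: map a value to index x % 10."""
    return x % 10

table = [None] * 10  # a fixed-size array acting as the hash table
for value in [11, 12, 13, 14, 15]:
    table[H(value)] = value

print(table)  # [None, 11, 12, 13, 14, 15, None, None, None, None]
```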
Scaling Out: Distributed Hashing
Now that we have discussed hashing, we’re ready to look into distributed hashing.
In some situations, it may be necessary or desirable to split a hash table into several parts, hosted by different servers. One of the main motivations for this is to bypass the memory limitations of using a single computer, allowing for the construction of arbitrarily large hash tables (given enough servers).
In such a scenario, the objects (and their keys) are distributed among several servers, hence names like sharded database or distributed cache.
Such setups consist of a pool of caching servers that host many key/value pairs and are used to provide fast access to data originally stored (or computed) elsewhere. For example, to reduce the load on a database server and at the same time improve performance, an application can be designed to first fetch data from the cache servers, and only if it’s not present there—a situation known as cache miss—resort to the database, running the relevant query and caching the results with an appropriate key, so that it can be found next time it’s needed.
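In Python, that fetch-then-fall-back logic might look like this rough sketch; the dictionary standing in for the cache pool and the `query_database` function are hypothetical placeholders, not a real API:

```python
import time

cache = {}  # stands in for the pool of caching servers

def query_database(key):
    """Simulates an expensive query against the origin database."""
    time.sleep(0.1)  # pretend this is slow
    return f"value-for-{key}"

def get(key):
    if key in cache:                # cache hit: fast path
        return cache[key]
    value = query_database(key)     # cache miss: resort to the database
    cache[key] = value              # cache the result under the same key
    return value

print(get("user:42"))  # miss: goes to the database
print(get("user:42"))  # hit: served from the cache
```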
But how does the distribution take place? What criteria are used to determine which keys to host on which servers?
The simplest way is to take the hash modulo of the number of servers. That is, server = hash(key) mod N, where N is the size of the pool. To store or retrieve a key, the client first computes the hash, applies a modulo N operation, and uses the resulting index to contact the appropriate server (probably by using a lookup table of IP addresses). Note that the hash function used for key distribution must be the same one across all clients, but it need not be the same one used internally by the caching servers.
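A sketch of this scheme in Python, assuming MD5 as the shared hash function (any deterministic hash that all clients agree on would do):

```python
import hashlib

def server_for(key, num_servers):
    """Pick a server index via hash(key) mod N. Every client must use
    the same hash function; MD5 here is just an illustrative choice."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_servers

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # hypothetical lookup table
print(servers[server_for("user:42", len(servers))])
```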
The Rehashing Problem
This distribution scheme is simple, intuitive, and works fine. That is, until the number of servers changes. What happens if one of the servers crashes or becomes unavailable? Keys need to be redistributed to account for the missing server, of course. The same applies if one or more new servers are added to the pool; keys need to be redistributed to include the new servers. This is true for any distribution scheme, but the problem with our simple modulo distribution is that when the number of servers changes, most hashes modulo N will change, so most keys will need to be moved to a different server. So, even if a single server is removed or added, most keys will likely end up on a different server.
From our previous example, if we removed server C, we’d have to rehash all the keys using hash modulo 2 instead of hash modulo 3.
Note that all key locations changed, not only the ones from server C.
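We can check this empirically by reusing the `server_for` sketch from above:

```python
# Count how many keys change servers when the pool shrinks from 3 to 2.
keys = [f"key-{i}" for i in range(1000)]
moved = sum(server_for(k, 3) != server_for(k, 2) for k in keys)
print(f"{moved / len(keys):.0%} of keys changed servers")  # roughly 2/3
```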
In the typical use case we mentioned before (caching), this would mean that, all of a sudden, the keys won’t be found because they won’t yet be present at their new location.
So, most queries will result in misses, and the original data will likely need to be retrieved again from the source and rehashed, placing a heavy load on the origin server(s) (typically a database). This may very well degrade performance severely and possibly crash the origin servers.
Consistent Hashing
The solution is consistent hashing. We need a distribution scheme that does not depend directly on the number of servers, so that, when adding or removing servers, the number of keys that need to be relocated is minimized. Consistent hashing is such a scheme; it was first described in an academic paper from 1997.
Consistent Hashing is a distributed hashing scheme that operates independently of the number of servers or objects in a distributed hash table by assigning them a position on an abstract circle, or hash ring. This allows servers and objects to scale without affecting the overall system.
Imagine we mapped the hash output range onto the points of a circle. That means that the minimum possible hash value, zero, would correspond to an angle of zero, the maximum possible value (some big integer we’ll call INT_MAX) would correspond to an angle of 2𝝅 radians (or 360 degrees), and all other hash values would linearly fit somewhere in between. So, we could take a key, compute its hash, and find out where it lies on the circle’s edge. Assuming an INT_MAX of 10^10, to find the server responsible for a key we compute the angle for its hash value and determine which server’s range contains that angle.
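Here is a minimal Python sketch of such a hash ring, assuming MD5 to place both servers and keys on the ring and a single point per server (production implementations usually add many "virtual nodes" per server to even out the distribution):

```python
import bisect
import hashlib

def ring_hash(key):
    """Map a key (or server name) to an integer position on the ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers=()):
        self._points = []   # sorted ring positions of the servers
        self._servers = {}  # ring position -> server name
        for s in servers:
            self.add(s)

    def add(self, server):
        pos = ring_hash(server)
        bisect.insort(self._points, pos)
        self._servers[pos] = server

    def remove(self, server):
        pos = ring_hash(server)
        self._points.remove(pos)
        del self._servers[pos]

    def get(self, key):
        """Walk clockwise from the key's position to the first server."""
        pos = ring_hash(key)
        i = bisect.bisect(self._points, pos) % len(self._points)
        return self._servers[self._points[i]]
```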
We can also perform a few operations on our consistent hash ring (a usage sketch follows below):
Adding a node: Suppose we have nodes A, B, and C, and we add a new node D to the ring by computing its hash. Only the keys whose values lie between C and D will be redistributed: they will no longer point to A, they will now point to D, and this avoids rearranging all nodes.
Removing a node: Suppose we remove node C from our ring. Only the keys whose values lie between B and C will be redistributed: they will no longer point to C, they will now point to A.
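Using the ring sketch from above, we can watch how few keys move when a node joins:

```python
ring = ConsistentHashRing(["A", "B", "C"])
keys = [f"key-{i}" for i in range(1000)]
before = {k: ring.get(k) for k in keys}

ring.add("D")  # only the keys on one arc of the ring should move
moved = sum(before[k] != ring.get(k) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # a fraction, not ~100%
```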
What happens if a machine leaves?
With consistent hashing we're assuming that machines can leave and join over time. If a machine leaves, won't we lose data?
To avoid this, we'll usually have machines act as backups for each other. One strategy is to have each machine replicate the data stored on the machine behind it, giving us a backup copy.
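As an illustrative sketch of one such strategy (not a standard API), the backup copy of a key can live on the next distinct server clockwise from its primary:

```python
import bisect

def primary_and_backup(ring, key):
    """One possible replication strategy: return the primary server for a
    key, plus the next server clockwise, which holds the backup copy."""
    pos = ring_hash(key)
    i = bisect.bisect(ring._points, pos) % len(ring._points)
    j = (i + 1) % len(ring._points)  # successor replicates its predecessor
    return ring._servers[ring._points[i]], ring._servers[ring._points[j]]

ring = ConsistentHashRing(["A", "B", "C"])
print(primary_and_backup(ring, "key-7"))  # e.g. ('B', 'C')
```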
Again, this is the sort of implementation detail you'll want to mention to your interviewer, even if you won't draw out the entire implementation.
Conclusion
Consistent Hashing is one of the most important algorithms to help us horizontally scale and manage any distributed system. The algorithm not only works in sharded systems but also finds application in load balancing, data partitioning, managing server-based sticky sessions, routing algorithms, and more. A lot of databases owe their scale, performance, and ability to handle humongous load to Consistent Hashing.