
Mustafa Veysi Soyvural

I Built Consistent Hashing From Scratch in Go — Here's What I Learned

Consistent Hashing with Virtual Nodes in Go

If you've ever added a server to a cache cluster and watched your database melt, you already know the problem consistent hashing solves. You just might not know it by name.

I built a full implementation from scratch in Go to understand it deeply. This post walks through what I learned — the problem, the fix, and the gotchas nobody tells you about.

The five-minute version

You have 5 cache servers. You route keys with hash(key) % 5. Life is good.

Then traffic spikes and you add a 6th server. Now it's hash(key) % 6. Sounds harmless, right?

Here's what actually happens:

Before:  hash("user:1001") % 5 = 3  → Server C
After:   hash("user:1001") % 6 = 1  → Server A  ← moved!

That key was sitting happily on Server C. Now every client thinks it's on Server A, where it doesn't exist. Cache miss. The request hits your database.

And it's not just one key. ~83% of all keys remap. With 100K cached items, that's 83,000 simultaneous cache misses hammering your database. People call this a "thundering herd" or "cache stampede." I call it a bad Monday.

What consistent hashing actually does

Instead of hash % N, imagine putting everything on a circle — a ring of positions from 0 to 2^32 − 1 that wraps back around to 0.

Nodes get hashed onto the ring. Keys get hashed onto the ring. To find which node owns a key, you start at the key's position and walk clockwise until you bump into a node.

                    0
                    │
           Node-C ──┤
                    │
     3×2^30 ────────┼──────── 2^30
                    │
                    ├── "user:1001" → walks clockwise → Node-A
                    │
           Node-A ──┤
                    │
                   2^31
                    │
           Node-B ──┤

When you add a node, it only takes over the arc between itself and the previous node on the ring. Everything else stays put. The math works out to roughly K/N keys moving (K = total keys, N = number of nodes after the addition). So with 100K keys and 6 nodes, about 17K keys move instead of 83K.
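The clockwise walk boils down to a binary search over a sorted slice of positions. Here is a minimal sketch with my own names (no virtual nodes, locking, or error handling yet; the real implementation adds all three):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring is the bare-bones version: sorted positions plus a
// position-to-node map.
type Ring struct {
	positions []uint32          // node positions, kept sorted
	owner     map[uint32]string // position → node name
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Add hashes the node onto the ring and re-sorts the positions.
func (r *Ring) Add(node string) {
	p := hash32(node)
	r.owner[p] = node
	r.positions = append(r.positions, p)
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
}

// Get walks clockwise from the key's position to the first node:
// binary search for the first position >= the key's hash.
func (r *Ring) Get(key string) string {
	p := hash32(key)
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= p })
	if i == len(r.positions) { // wrapped past the top of the ring
		i = 0
	}
	return r.owner[r.positions[i]]
}

func main() {
	r := &Ring{owner: map[uint32]string{}}
	for _, n := range []string{"node-A", "node-B", "node-C"} {
		r.Add(n)
	}
	fmt.Println(r.Get("user:1001"))
}
```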

I measured this in my implementation:

Strategy          │ Keys Remapped (5→6 nodes, 100K keys)
──────────────────│─────────────────────────────────────
Modulo (hash%N)   │  83,803  (83.8%)
Consistent Hash   │  15,723  (15.7%)

That's not a marginal improvement. That's the difference between "the site went down" and "nobody noticed."

The virtual node trick

There's a catch with the basic approach. If you have 5 nodes, you have 5 points on the ring. And 5 points on a circle don't spread evenly — some nodes end up owning huge arcs while others get almost nothing.

I measured this too. With 5 nodes and 50K keys, one node got 32,014 keys while another got 1,855. That's a 17:1 ratio. Terrible.

The fix is virtual nodes. Instead of putting each physical node at one position, you put it at 150 positions. Node-A becomes Node-A#0, Node-A#1, ... Node-A#149, each hashed to a different spot on the ring.

Here's what that does to the distribution:

Virtual Nodes │ Std Deviation │ Worst Node (ratio to fair share)
──────────────│───────────────│─────────────────────────────────
      1       │     11,353    │  3.20x
     50       │      4,601    │  1.83x
    150       │      2,824    │  1.47x
    500       │        976    │  1.17x
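You can reproduce numbers in this ballpark yourself. A self-contained sketch, assuming FNV-1a and labels like "node-A#17" for virtual nodes; your exact counts will depend on the hash and the labels, but the trend is the same:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// distribute builds a ring with vnodes virtual nodes per physical node,
// routes `keys` synthetic keys, and returns how many each node received.
func distribute(nodes []string, vnodes, keys int) map[string]int {
	var positions []uint32
	owner := map[uint32]string{}
	for _, n := range nodes {
		for i := 0; i < vnodes; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i)) // node-A#0 … node-A#149
			owner[p] = n
			positions = append(positions, p)
		}
	}
	sort.Slice(positions, func(i, j int) bool { return positions[i] < positions[j] })

	counts := map[string]int{}
	for k := 0; k < keys; k++ {
		p := hash32(fmt.Sprintf("key:%d", k))
		i := sort.Search(len(positions), func(i int) bool { return positions[i] >= p })
		if i == len(positions) { // wrap around past the top of the ring
			i = 0
		}
		counts[owner[positions[i]]]++
	}
	return counts
}

func main() {
	nodes := []string{"node-A", "node-B", "node-C", "node-D", "node-E"}
	for _, v := range []int{1, 150} {
		fmt.Printf("vnodes=%d: %v\n", v, distribute(nodes, v, 50_000))
	}
}
```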

Around 150 is a common sweet spot in production systems: good enough distribution without too much memory overhead.

What happens when a node dies

This is where it gets interesting for real-world systems.

I built a cache simulator on top of the hash ring — basically a miniature distributed Memcached. Five "Redis nodes," 10,000 keys. Then I killed one.

redis-1     2,197 keys  ██████████████████████
redis-2     1,731 keys  █████████████████
redis-3     1,559 keys  ███████████████
redis-4     1,730 keys  █████████████████
redis-5     2,783 keys  ████████████████████████████

redis-3 goes down!

Cache hit rate: 84.4%
Cache misses:   1,559 (exactly the keys redis-3 had)

Exactly 1,559 misses — the precise number of keys that were on redis-3. Zero keys from other nodes were affected. That's the whole point: the blast radius of a failure is limited to the failed node.

In production, you'd replicate each key to the next 2-3 nodes on the ring, so even that node's keys survive. But that's a story for another post.

Things I got wrong the first time

Hash collisions between virtual nodes. With 150 vnodes per physical node and a 32-bit hash space, collisions are rare but not impossible. My first version silently overwrote one node's position with another's, which corrupted the ring in subtle ways. The fix: check before inserting and skip collisions.
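The fix is a one-line guard. A sketch with my own helper name, not the repo's exact API:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// addPositions hashes vnode labels into owner, skipping any position
// that is already taken so a rare 32-bit collision can't silently
// overwrite another node's slot and corrupt the ring.
func addPositions(owner map[uint32]string, node string, vnodes int) {
	for i := 0; i < vnodes; i++ {
		p := hash32(fmt.Sprintf("%s#%d", node, i))
		if _, taken := owner[p]; taken {
			continue // drop this vnode rather than steal the position
		}
		owner[p] = node
	}
}

func main() {
	owner := map[uint32]string{}
	addPositions(owner, "node-A", 150)
	addPositions(owner, "node-B", 150)
	fmt.Println(len(owner)) // at most 300; fewer only if a collision was skipped
}
```

Losing one vnode out of 150 barely dents the distribution; corrupting the owner map breaks routing for every key on that arc.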

Locking granularity. The ring needs a sync.RWMutex because reads (key lookups) vastly outnumber writes (node additions/removals). My first version used a regular Mutex and paid for it under concurrent load.

Silent error swallowing. When the ring is empty, GetNode returns an error. My first version of the cache simulator just ignored that error and kept going, silently dropping writes. In a real system, that's how you get data loss that takes hours to diagnose.
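The last two fixes can be sketched together: a sync.RWMutex so concurrent lookups don't serialize behind each other, and an explicit error for the empty ring that callers are forced to handle. Names here are mine, not necessarily the repo's API:

```go
package main

import (
	"errors"
	"fmt"
	"hash/fnv"
	"sort"
	"sync"
)

var ErrEmptyRing = errors.New("ring has no nodes")

type Ring struct {
	mu        sync.RWMutex
	positions []uint32
	owner     map[uint32]string
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

func (r *Ring) AddNode(node string) {
	r.mu.Lock() // writes are rare: take the exclusive lock
	defer r.mu.Unlock()
	p := hash32(node)
	r.owner[p] = node
	r.positions = append(r.positions, p)
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
}

func (r *Ring) GetNode(key string) (string, error) {
	r.mu.RLock() // reads dominate: many lookups can hold this at once
	defer r.mu.RUnlock()
	if len(r.positions) == 0 {
		return "", ErrEmptyRing // never swallow this in the caller
	}
	p := hash32(key)
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= p })
	if i == len(r.positions) {
		i = 0
	}
	return r.owner[r.positions[i]], nil
}

func main() {
	r := &Ring{owner: map[uint32]string{}}
	if _, err := r.GetNode("user:1001"); err != nil {
		fmt.Println("lookup failed:", err)
	}
	r.AddNode("node-A")
	n, _ := r.GetNode("user:1001")
	fmt.Println(n)
}
```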

Testing the hard stuff

Unit tests are straightforward. The interesting part is testing the mathematical guarantees under chaos.

I wrote five "catastrophic tests" that simulate production nightmares. Here are three of them:

Mass failure: Remove 5 of 10 nodes simultaneously. Verify that only the dead nodes' keys remapped. Result: 0 keys from surviving nodes moved. Perfect isolation.

Rapid churn: 20 random add/remove cycles. Each individual operation should remap roughly K/N keys. Result: every cycle stayed within 1.5x of the theoretical bound.

Concurrent chaos: 20 goroutines doing lookups while 5 goroutines add and remove nodes for 2 seconds straight, with Go's race detector on. Result: 2.4M reads, zero data races, 0% error rate.

These aren't pass/fail tests — they're property-based tests that verify the algorithm's mathematical guarantees hold under pressure.

The implementation

The full code is on GitHub: soyvural/consistent-hashing

It's about 800 lines of Go with zero external dependencies. The core is a sorted slice of uint32 positions with binary search for O(log n) lookups. Hash function is pluggable via an interface — ships with FNV-1a, MD5, and CRC32.
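The pluggable-hash idea looks roughly like this. This is a sketch of the pattern, not necessarily the repo's exact interface name or signature:

```go
package main

import (
	"fmt"
	"hash/crc32"
	"hash/fnv"
)

// Hasher abstracts the ring's hash function so implementations can be
// swapped without touching the ring logic.
type Hasher interface {
	Sum32(key string) uint32
}

// FNV1a is fast and dependency-free; a good default.
type FNV1a struct{}

func (FNV1a) Sum32(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

// CRC32 matches what some Memcached clients use for routing.
type CRC32 struct{}

func (CRC32) Sum32(key string) uint32 {
	return crc32.ChecksumIEEE([]byte(key))
}

func main() {
	for _, h := range []Hasher{FNV1a{}, CRC32{}} {
		fmt.Printf("%T: %d\n", h, h.Sum32("user:1001"))
	}
}
```

Swapping the hash changes where every key lands, so it's a build-time choice, not something to change on a live ring.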

# try it yourself
git clone https://github.com/soyvural/consistent-hashing.git
cd consistent-hashing
go run cmd/demo/main.go

The demo prints all the numbers I quoted in this post. You can tweak the vnode count, swap hash functions, and see exactly how the distribution changes.

Where consistent hashing shows up

Once you know what to look for, it's everywhere:

  • Memcached clients use it to route keys to servers
  • Amazon DynamoDB uses it for data partitioning (described in their 2007 Dynamo paper)
  • Apache Cassandra uses it for its token ring
  • Load balancers use it for sticky sessions that survive backend scaling

The common thread: any time you need to spread work across a changing set of workers without reshuffling everything.

The full source code is at github.com/soyvural/consistent-hashing. Stars welcome if you found this useful.
