Ever tried to scale an application beyond a single server? Then you know the pain:
Which server stores which data?
How do I keep replicas in sync?
What happens when a node crashes?
How do I rebalance when adding nodes?
I spent months building distributed systems and kept reimplementing the same coordination logic. So I extracted it into ClusterKit — a Go library that handles all the cluster coordination while you keep full control over storage and replication.
What Problem Does This Solve?
Let's say you're building a key-value store. On a single server, it's simple:
func Set(key, value string) {
    db[key] = value // Done!
}
But when you need to scale to multiple servers, suddenly you need:
Partitioning — Which server stores "user:123"?
Replication — How do I keep 3 copies of the data?
Routing — Where should clients send requests?
Failover — What if the primary server dies?
Rebalancing — How do I redistribute data when adding servers?
You could use Redis Cluster or Cassandra, but that locks you into their storage model. Or you could build it from scratch... and spend months on infrastructure instead of your application.
ClusterKit solves this by handling ONLY the coordination layer.
The Core Concept
ClusterKit tells you WHERE data should go. You decide HOW to store it.
Here's the complete API (seriously, this is all of it):
// Find which partition owns a key
partition, _ := ck.GetPartition("user:123")
// Get the primary node for this partition
primary := ck.GetPrimary(partition)
// Get replica nodes
replicas := ck.GetReplicas(partition)
// Am I responsible for this data?
isPrimary := ck.IsPrimary(partition)
isReplica := ck.IsReplica(partition)
// Get all nodes (primary + replicas)
nodes := ck.GetNodes(partition)
// What's my node ID?
myID := ck.GetMyNodeID()
That's it. Seven methods. No complex config files, no YAML manifests, no XML schemas.
Show Me Code!
Let's build a distributed key-value store in ~60 lines:
package main

import (
    "fmt"
    "sync"

    "github.com/skshohagmiah/clusterkit"
)

type KVStore struct {
    ck   *clusterkit.ClusterKit
    data map[string]string
    mu   sync.RWMutex
}

func (kv *KVStore) Set(key, value string) error {
    // 1. Ask ClusterKit where this key belongs
    partition, err := kv.ck.GetPartition(key)
    if err != nil {
        return err
    }

    // 2. Am I the primary for this partition?
    if kv.ck.IsPrimary(partition) {
        // YES - Store locally
        kv.mu.Lock()
        kv.data[key] = value
        kv.mu.Unlock()
        fmt.Printf("✅ Stored %s (I'm primary)\n", key)

        // Replicate to backup servers
        replicas := kv.ck.GetReplicas(partition)
        for _, replica := range replicas {
            go kv.replicateTo(replica, key, value)
        }
        return nil
    }

    // 3. Am I a replica?
    if kv.ck.IsReplica(partition) {
        // YES - Just store it
        kv.mu.Lock()
        kv.data[key] = value
        kv.mu.Unlock()
        fmt.Printf("✅ Stored %s (I'm replica)\n", key)
        return nil
    }

    // 4. I'm neither - forward to primary
    primary := kv.ck.GetPrimary(partition)
    fmt.Printf("⏩ Forwarding to %s\n", primary.ID)
    return kv.forwardTo(primary, key, value)
}

func (kv *KVStore) Get(key string) (string, error) {
    partition, err := kv.ck.GetPartition(key)
    if err != nil {
        return "", err
    }

    // Can read from primary or replica
    if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
        kv.mu.RLock()
        defer kv.mu.RUnlock()
        value, exists := kv.data[key]
        if !exists {
            return "", fmt.Errorf("key not found")
        }
        return value, nil
    }

    // Forward to primary
    primary := kv.ck.GetPrimary(partition)
    return kv.readFrom(primary, key)
}
What just happened?
ClusterKit handles partition assignment via consistent hashing
Your app decides whether to store, replicate, or forward
You control the storage engine (could be PostgreSQL, Redis, files, etc.)
You control the replication protocol (HTTP, gRPC, TCP, whatever!)
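The example calls replicateTo, forwardTo, and readFrom without defining them, because that transport layer is exactly the part ClusterKit leaves to you. Here's a minimal sketch of what those helpers could look like over plain HTTP. Treat it as illustration only: the /internal/set and /internal/get routes, the JSON payload, and the node.Addr field are my assumptions, not part of the library (check the Node struct for the real way to get a node's address). Extra imports needed: bytes, encoding/json, io, log, net/http, net/url.
// Illustrative HTTP transport for the helpers the KVStore example calls.
func (kv *KVStore) replicateTo(node *clusterkit.Node, key, value string) {
    body, _ := json.Marshal(map[string]string{"key": key, "value": value})
    resp, err := http.Post("http://"+node.Addr+"/internal/set", "application/json", bytes.NewReader(body))
    if err != nil {
        log.Printf("replication to %s failed: %v", node.ID, err)
        return
    }
    resp.Body.Close()
}

func (kv *KVStore) forwardTo(node *clusterkit.Node, key, value string) error {
    body, _ := json.Marshal(map[string]string{"key": key, "value": value})
    resp, err := http.Post("http://"+node.Addr+"/internal/set", "application/json", bytes.NewReader(body))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("forward to %s failed: %s", node.ID, resp.Status)
    }
    return nil
}

func (kv *KVStore) readFrom(node *clusterkit.Node, key string) (string, error) {
    resp, err := http.Get("http://" + node.Addr + "/internal/get?key=" + url.QueryEscape(key))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    value, err := io.ReadAll(resp.Body)
    return string(value), err
}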
Getting Started is Stupid Simple
Bootstrap node (first server):
ck, _ := clusterkit.NewClusterKit(clusterkit.Options{
    NodeID:   "node-1",
    HTTPAddr: ":8080",
})
ck.Start()
Additional nodes (join the cluster):
ck, _ := clusterkit.NewClusterKit(clusterkit.Options{
    NodeID:   "node-2",
    HTTPAddr: ":8081",
    JoinAddr: "localhost:8080", // Bootstrap node
})
ck.Start()
That's it. Your cluster is running with:
✅ Automatic partition assignment
✅ Primary/replica designation
✅ Health monitoring
✅ Auto-rebalancing
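Once Start() returns, the seven methods from earlier all work against the live topology. A quick check you can drop into your app right after startup (the key and the printed labels are just for illustration):
// Sanity-check the topology after the node joins the cluster.
partition, err := ck.GetPartition("user:123")
if err != nil {
    log.Fatal(err)
}

fmt.Println("my node ID:        ", ck.GetMyNodeID())
fmt.Println("owning partition:  ", partition)
fmt.Println("primary node:      ", ck.GetPrimary(partition).ID)
fmt.Println("am I the primary?  ", ck.IsPrimary(partition))
fmt.Println("number of replicas:", len(ck.GetReplicas(partition)))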
The Killer Feature: Sub-100ms Failover
Here's where ClusterKit shines. Traditional distributed systems have terrible failover:
Primary fails → Wait 30s for timeout → Leader election → Update routing → Resume
30 seconds of downtime! 🔥
ClusterKit uses client-side retry with replica fallback:
Client → Primary ❌ → Replica ✅ (instant!)
When a primary fails:
Client detects error immediately
Client retries on replica (no delay)
Replica accepts write and returns success
Topology refreshes in background
Life goes on
Result: <100ms failover instead of 30 seconds. That's a 300x improvement! 🚀
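Conceptually, the client-side part is just a loop. Here's a sketch, where writeToNode stands in for whatever transport your client uses (it isn't a ClusterKit API):
// Sketch of client-side retry with replica fallback.
func setWithFailover(ck *clusterkit.ClusterKit, key, value string) error {
    partition, err := ck.GetPartition(key)
    if err != nil {
        return err
    }

    // 1. Try the primary first.
    if err := writeToNode(ck.GetPrimary(partition), key, value); err == nil {
        return nil
    }

    // 2. Primary unreachable: fall back to the replicas right away,
    //    with no election or timeout to wait for.
    for _, replica := range ck.GetReplicas(partition) {
        if err := writeToNode(replica, key, value); err == nil {
            return nil // topology refreshes in the background
        }
    }
    return fmt.Errorf("no node accepted the write for %q", key)
}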
Handling Data Migration
When nodes join or leave, partitions need to move. ClusterKit provides a hook:
ck.OnPartitionChange(func(partitionID string, copyFromNodes []*Node, copyToNode *Node) {
    if copyToNode.ID != myNodeID {
        return // Not for me
    }

    log.Printf("📦 Migrating partition %s\n", partitionID)

    // Merge data from ALL source nodes
    mergedData := make(map[string]string)
    for _, sourceNode := range copyFromNodes {
        data := fetchFrom(sourceNode, partitionID)

        // Your conflict resolution strategy
        for key, value := range data {
            mergedData[key] = value // Last-write-wins
            // OR: Use version numbers, timestamps, etc.
        }
    }

    storeData(mergedData)
    log.Printf("✅ Migrated %d keys\n", len(mergedData))
})
Why multiple source nodes?
During failover, different replicas might have received different writes. ClusterKit gives you ALL nodes that had the data so you can merge them and prevent data loss.
This is crucial for eventual consistency scenarios!
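If plain last-write-wins on the raw value isn't enough, a common refinement is to carry a timestamp (or version number) with each value and merge on that. A sketch, where the VersionedValue type is mine and not something ClusterKit provides:
// Illustrative timestamp-based merge for the migration hook above.
type VersionedValue struct {
    Value     string
    UpdatedAt int64 // unix nanos, set by whoever wrote the value
}

func merge(sources []map[string]VersionedValue) map[string]VersionedValue {
    merged := make(map[string]VersionedValue)
    for _, data := range sources {
        for key, incoming := range data {
            current, exists := merged[key]
            if !exists || incoming.UpdatedAt > current.UpdatedAt {
                merged[key] = incoming // newest write wins
            }
        }
    }
    return merged
}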
Three Replication Strategies Included
The library includes three complete, runnable examples:
1. Client-Side SYNC (Quorum-Based)
Best for: Banking, inventory, critical data
Client writes to all nodes (primary + replicas) and waits for quorum (2/3):
// Client writes to ALL nodes in parallel
client.Set("key", "value")
// Internally:
// - Write to primary + replicas
// - Wait for 2/3 nodes to confirm
// - Return success only if quorum reached
Consistency: Strong (data on 2+ nodes before success)
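The quorum path isn't magic. A rough sketch of a client-side quorum write, assuming GetNodes returns the primary plus replicas (as the API section says) and using a hypothetical writeToNode transport helper; the repo's sync example is the real reference:
// Sketch of a quorum write: fan out to all owners, succeed on 2 of 3 acks.
func quorumSet(ck *clusterkit.ClusterKit, key, value string) error {
    partition, err := ck.GetPartition(key)
    if err != nil {
        return err
    }

    nodes := ck.GetNodes(partition) // primary + replicas
    quorum := len(nodes)/2 + 1      // e.g. 2 out of 3

    acks := make(chan error, len(nodes))
    for _, node := range nodes {
        node := node // capture for the goroutine
        go func() { acks <- writeToNode(node, key, value) }()
    }

    succeeded := 0
    for range nodes {
        if err := <-acks; err == nil {
            succeeded++
            if succeeded >= quorum {
                return nil // quorum reached
            }
        }
    }
    return fmt.Errorf("quorum not reached for %q (%d/%d acks)", key, succeeded, quorum)
}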
2. Client-Side ASYNC (Primary-First)
Best for: Streaming, analytics, high throughput
Client writes to primary only, returns immediately:
// Client writes to primary, returns FAST
client.Set("key", "value") // ~1ms latency
// Internally:
// - Write to primary (blocking - but fast!)
// - Return success immediately
// - Replicate to replicas in background
Consistency: Eventual (primary has data now, replicas catch up)
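The async path is the same idea minus the waiting: block only on the primary, then let the replicas catch up in the background. Again a sketch, with the same hypothetical writeToNode helper:
// Sketch of a primary-first async write.
func asyncSet(ck *clusterkit.ClusterKit, key, value string) error {
    partition, err := ck.GetPartition(key)
    if err != nil {
        return err
    }

    // Blocking write to the primary only - this is the ~1ms part.
    if err := writeToNode(ck.GetPrimary(partition), key, value); err != nil {
        return err
    }

    // Fire-and-forget replication to the replicas.
    for _, replica := range ck.GetReplicas(partition) {
        go writeToNode(replica, key, value)
    }
    return nil
}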
3. Server-Side Routing
Best for: Web/mobile apps, simple HTTP clients
Client sends to ANY node. Server handles routing:
# No SDK needed - just HTTP!
curl -X POST http://any-node:8080/kv/set \
-d '{"key":"test","value":"hello"}'
# Server automatically:
# - Routes to primary if needed
# - Handles replication
# - Returns result
Consistency: Configurable (server decides)
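Server-side routing boils down to the same decision the KVStore example already makes: any node can accept the request, because Set either stores locally or forwards to the primary. A sketch of such a handler; the /kv/set route matches the curl above, the rest of the wiring is illustrative (imports: encoding/json, net/http):
// Sketch of a server-side routing handler: any node accepts the write.
func handleSet(kv *KVStore) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        var req struct{ Key, Value string }
        if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
            http.Error(w, err.Error(), http.StatusBadRequest)
            return
        }

        // Set (from the KVStore example) stores locally when this node owns
        // the key and forwards to the primary otherwise.
        if err := kv.Set(req.Key, req.Value); err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }
        w.WriteHeader(http.StatusOK)
    }
}

// Wiring it up:
// http.HandleFunc("/kv/set", handleSet(store))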
Production-Ready Features
Despite the simple API, ClusterKit is built for production:
Raft Consensus — Built on HashiCorp's battle-tested Raft
Write-Ahead Log — All state changes are logged
Snapshots — Periodic snapshots for fast recovery
Persistence — Survives restarts
HTTP API — RESTful cluster info endpoint
Docker & Kubernetes — Deployment examples included
When Should You Use ClusterKit?
Perfect for:
Building custom distributed caches
Creating specialized KV stores
Distributed message queues
Session stores across multiple servers
Time-series databases with custom partitioning
Any system where you need custom distribution logic
Not ideal for:
Single-node applications (obviously!)
When Redis Cluster/Cassandra already fits perfectly
If you don't want to handle replication yourself
Real-World Use Cases
Distributed Cache:
// Cache layer that scales horizontally
partition, _ := ck.GetPartition(sessionID)
if ck.IsPrimary(partition) {
    redis.Set(sessionID, sessionData)
}
Multi-Region Queue:
// Partition queue by region
partition, _ := ck.GetPartition(region)
if ck.IsPrimary(partition) {
    queue.Enqueue(message)
}
Sharded Analytics:
// Partition events by date
partition, _ := ck.GetPartition(date)
if ck.IsPrimary(partition) {
    timescaleDB.Insert(event)
}
Running the Examples
# Install
go get github.com/skshohagmiah/clusterkit
# Run SYNC example (strong consistency)
cd example/sync && ./run.sh
# Run ASYNC example (high throughput)
cd example/async && ./run.sh
# Run Server-Side example (simple clients)
cd example/server-side && ./run.sh
Each example starts a 6-10 node cluster and runs 1000 operations showing:
Cluster formation
Data distribution
Automatic replication
Node failure handling
Data migration
The Philosophy
ClusterKit follows the Unix philosophy: do one thing and do it well.
It doesn't try to be:
A database (you choose: PostgreSQL, MySQL, MongoDB)
A cache (you choose: Redis, Memcached, in-memory)
A message queue (you choose: your protocol)
It's just the coordination layer. This means:
✅ You understand every piece of your system
✅ You can debug issues easily
✅ You can optimize for your use case
✅ You're not locked into vendor features
✅ You can swap storage engines anytime
Performance
In my benchmarks (6-node cluster, 1000 ops):
SYNC (Quorum):
Write latency: ~5ms (waits for 2/3 nodes)
Read latency: ~1ms (local read)
Consistency: Strong
ASYNC (Primary-First):
Write latency: ~1ms (primary only)
Read latency: ~1ms (local read)
Consistency: Eventual
Server-Side:
Write latency: ~3ms (includes routing)
Read latency: ~2ms (includes routing)
Consistency: Configurable
Docker Deployment
docker-compose.yml
version: '3.8'

services:
  node-1:
    build: .
    environment:
      - NODE_ID=node-1
      - HTTP_PORT=8080
    ports:
      - "8080:8080"

  node-2:
    build: .
    environment:
      - NODE_ID=node-2
      - HTTP_PORT=8080
      - JOIN_ADDR=node-1:8080
    ports:
      - "8081:8080"
docker-compose up -d
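Each node's entrypoint just maps those environment variables onto Options. Here's a sketch of what that main could look like; the env var names match the compose file above, while the rest of the wiring (including leaving JoinAddr empty on the bootstrap node) is my assumption:
// Illustrative entrypoint: build Options from the compose file's env vars.
package main

import (
    "log"
    "os"

    "github.com/skshohagmiah/clusterkit"
)

func main() {
    opts := clusterkit.Options{
        NodeID:   os.Getenv("NODE_ID"),
        HTTPAddr: ":" + os.Getenv("HTTP_PORT"),
        JoinAddr: os.Getenv("JOIN_ADDR"), // unset on the bootstrap node (assumption)
    }

    ck, err := clusterkit.NewClusterKit(opts)
    if err != nil {
        log.Fatal(err)
    }
    ck.Start()

    select {} // keep the node running
}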
What's Next?
I'm actively working on:
[ ] Dynamic partition count adjustment
[ ] Metrics and observability hooks
[ ] More replication examples (gRPC, NATS)
[ ] Admin UI for cluster visualization
[ ] Better documentation
Try It Yourself
go get github.com/skshohagmiah/clusterkit
Links:
GitHub: https://github.com/skshohagmiah/clusterkit
Examples: https://github.com/skshohagmiah/clusterkit/tree/main/example
Documentation: Check the README
Conclusion
Building distributed systems doesn't have to be rocket science. With ClusterKit, you get:
Simple API (7 methods + 1 hook)
Full control over storage and replication
Production-ready consensus and failover
Minimal configuration (2 required fields)
Complete working examples
If you've been wanting to build something distributed but the complexity scared you away, give ClusterKit a shot. Start small with the examples, then build something awesome.
Got questions? Drop them in the comments! 👇
And if you find this useful, give it a ⭐ on GitHub!
Tags: #go #distributedsystems #opensource #backend #clustering