I Built a Go Library That Makes Distributed Systems Actually Simple

Ever tried to scale an application beyond a single server? Then you know the pain:

Which server stores which data?

How do I keep replicas in sync?

What happens when a node crashes?

How do I rebalance when adding nodes?

I spent months building distributed systems and kept reimplementing the same coordination logic. So I extracted it into ClusterKit — a Go library that handles all the cluster coordination while you keep full control over storage and replication.

What Problem Does This Solve?

Let's say you're building a key-value store. On a single server, it's simple:

func Set(key, value string) {
    db[key] = value  // Done!
}

But when you need to scale to multiple servers, suddenly you need:

Partitioning — Which server stores "user:123"?

Replication — How do I keep 3 copies of the data?

Routing — Where should clients send requests?

Failover — What if the primary server dies?

Rebalancing — How do I redistribute data when adding servers?
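That last one is sneakier than it looks: the obvious hash(key) % numServers scheme reshuffles almost every key the moment you add a server. Here is a quick standalone sketch (plain Go, nothing to do with ClusterKit) that shows the damage, and why consistent hashing exists:

package main

import (
    "fmt"
    "hash/fnv"
)

// Naive partitioning: hash the key and take it modulo the server count.
func owner(key string, servers int) int {
    h := fnv.New32a()
    h.Write([]byte(key))
    return int(h.Sum32()) % servers
}

func main() {
    moved := 0
    for i := 0; i < 1000; i++ {
        key := fmt.Sprintf("user:%d", i)
        // Growing from 3 to 4 servers changes the owner of most keys.
        if owner(key, 3) != owner(key, 4) {
            moved++
        }
    }
    fmt.Printf("%d of 1000 keys would move when growing 3 -> 4 servers\n", moved)
}

With modulo partitioning roughly three quarters of the keys change owners; consistent hashing keeps that close to the theoretical minimum.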

You could use Redis Cluster or Cassandra, but that locks you into their storage model. Or you could build it from scratch... and spend months on infrastructure instead of your application.

ClusterKit solves this by handling ONLY the coordination layer.

The Core Concept

ClusterKit tells you WHERE data should go. You decide HOW to store it.

Here's the complete API (seriously, this is all of it):

// Find which partition owns a key
partition, _ := ck.GetPartition("user:123")

// Get the primary node for this partition
primary := ck.GetPrimary(partition)

// Get replica nodes
replicas := ck.GetReplicas(partition)

// Am I responsible for this data?
isPrimary := ck.IsPrimary(partition)
isReplica := ck.IsReplica(partition)

// Get all nodes (primary + replicas)
nodes := ck.GetNodes(partition)

// What's my node ID?
myID := ck.GetMyNodeID()

That's it. Seven methods. No complex config files, no YAML manifests, no XML schemas.

Show Me Code!

Let's build a distributed key-value store in ~60 lines:

package main

import (
    "fmt"
    "sync"

    "github.com/skshohagmiah/clusterkit"
)

type KVStore struct {
    ck   *clusterkit.ClusterKit
    data map[string]string
    mu   sync.RWMutex
}

func (kv *KVStore) Set(key, value string) error {
    // 1. Ask ClusterKit where this key belongs
    partition, err := kv.ck.GetPartition(key)
    if err != nil {
        return err
    }

    // 2. Am I the primary for this partition?
    if kv.ck.IsPrimary(partition) {
        // YES - Store locally
        kv.mu.Lock()
        kv.data[key] = value
        kv.mu.Unlock()

        fmt.Printf("✅ Stored %s (I'm primary)\n", key)

        // Replicate to backup servers
        replicas := kv.ck.GetReplicas(partition)
        for _, replica := range replicas {
            go kv.replicateTo(replica, key, value)
        }
        return nil
    }

    // 3. Am I a replica?
    if kv.ck.IsReplica(partition) {
        // YES - Just store it
        kv.mu.Lock()
        kv.data[key] = value
        kv.mu.Unlock()

        fmt.Printf("✅ Stored %s (I'm replica)\n", key)
        return nil
    }

    // 4. I'm neither - forward to primary
    primary := kv.ck.GetPrimary(partition)
    fmt.Printf("⏩ Forwarding to %s\n", primary.ID)
    return kv.forwardTo(primary, key, value)
}

func (kv *KVStore) Get(key string) (string, error) {
    partition, err := kv.ck.GetPartition(key)
    if err != nil {
        return "", err
    }

    // Can read from primary or replica
    if kv.ck.IsPrimary(partition) || kv.ck.IsReplica(partition) {
        kv.mu.RLock()
        defer kv.mu.RUnlock()

        value, exists := kv.data[key]
        if !exists {
            return "", fmt.Errorf("key not found")
        }
        return value, nil
    }

    // Forward to primary
    primary := kv.ck.GetPrimary(partition)
    return kv.readFrom(primary, key)
}

What just happened?

ClusterKit handles partition assignment via consistent hashing

Your app decides whether to store, replicate, or forward

You control the storage engine (could be PostgreSQL, Redis, files, etc.)

You control the replication protocol (HTTP, gRPC, TCP, whatever!)
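The example leaves replicateTo, forwardTo, and readFrom up to you. Here is a minimal sketch of what they could look like over plain HTTP. The /kv/set and /kv/get endpoints, the JSON payload, and the nodeAddr lookup are illustrative assumptions, not part of ClusterKit; it also assumes the node values returned by GetPrimary/GetReplicas expose an ID field, as the example above suggests (extra imports: bytes, encoding/json, fmt, io, log, net/http, net/url).

// Hypothetical helpers completing the example above. The endpoints and the
// address lookup are assumptions for illustration, not ClusterKit APIs.

// nodeAddrs maps node IDs to the HTTP addresses your deployment advertises
// (hard-coded here only to keep the sketch short).
var nodeAddrs = map[string]string{"node-1": "localhost:8080", "node-2": "localhost:8081"}

func nodeAddr(n *clusterkit.Node) string { return nodeAddrs[n.ID] }

// replicateTo pushes a write to one replica; failures are only logged,
// so replication is best-effort in this sketch.
func (kv *KVStore) replicateTo(node *clusterkit.Node, key, value string) {
    payload, _ := json.Marshal(map[string]string{"key": key, "value": value})
    resp, err := http.Post("http://"+nodeAddr(node)+"/kv/set", "application/json", bytes.NewReader(payload))
    if err != nil {
        log.Printf("replication to %s failed: %v", node.ID, err)
        return
    }
    resp.Body.Close()
}

// forwardTo sends the write to the primary and reports its result.
func (kv *KVStore) forwardTo(node *clusterkit.Node, key, value string) error {
    payload, _ := json.Marshal(map[string]string{"key": key, "value": value})
    resp, err := http.Post("http://"+nodeAddr(node)+"/kv/set", "application/json", bytes.NewReader(payload))
    if err != nil {
        return err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return fmt.Errorf("set on %s returned %s", node.ID, resp.Status)
    }
    return nil
}

// readFrom fetches a key from the node that owns it.
func (kv *KVStore) readFrom(node *clusterkit.Node, key string) (string, error) {
    resp, err := http.Get("http://" + nodeAddr(node) + "/kv/get?key=" + url.QueryEscape(key))
    if err != nil {
        return "", err
    }
    defer resp.Body.Close()
    if resp.StatusCode != http.StatusOK {
        return "", fmt.Errorf("get on %s returned %s", node.ID, resp.Status)
    }
    body, err := io.ReadAll(resp.Body)
    return string(body), err
}

Because the transport is yours, swapping HTTP for gRPC or raw TCP only touches these three helpers, not the coordination logic.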

Getting Started is Stupid Simple

Bootstrap node (first server):

ck, _ := clusterkit.NewClusterKit(clusterkit.Options{
    NodeID:   "node-1",
    HTTPAddr: ":8080",
})
ck.Start()

Additional nodes (join the cluster):

ck, _ := clusterkit.NewClusterKit(clusterkit.Options{
    NodeID:   "node-2",
    HTTPAddr: ":8081",
    JoinAddr: "localhost:8080",  // Bootstrap node
})
ck.Start()

That's it. Your cluster is running with:

✅ Automatic partition assignment

✅ Primary/replica designation

✅ Health monitoring

✅ Auto-rebalancing
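Once the second node has joined, you can sanity-check the topology from either process with the same seven methods shown above (a quick sketch; it assumes the fmt and log imports and that node values expose an ID field):

// Quick topology check using only the methods shown earlier.
partition, err := ck.GetPartition("user:123")
if err != nil {
    log.Fatal(err)
}
fmt.Println("my node ID:   ", ck.GetMyNodeID())
fmt.Println("partition:    ", partition)
fmt.Println("primary:      ", ck.GetPrimary(partition).ID)
for _, replica := range ck.GetReplicas(partition) {
    fmt.Println("replica:      ", replica.ID)
}
fmt.Println("am I primary? ", ck.IsPrimary(partition))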

The Killer Feature: Sub-100ms Failover

Here's where ClusterKit shines. Traditional distributed systems have terrible failover:

Primary fails → Wait 30s for timeout → Leader election → Update routing → Resume

30 seconds of downtime! 🔥

ClusterKit uses client-side retry with replica fallback:

Client → Primary ❌ → Replica ✅ (instant!)

When a primary fails:

Client detects error immediately

Client retries on replica (no delay)

Replica accepts write and returns success

Topology refreshes in background

Life goes on

Result: <100ms failover instead of 30 seconds. That's a 300x improvement! 🚀
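The bundled examples ship their own clients, but the retry logic itself is small. Here is a rough sketch of the pattern, not the library's client: the per-node transport is passed in as a hypothetical writeTo function so the sketch stays storage-agnostic, and the node/handle type names are assumed from the earlier example.

// Sketch of client-side retry with replica fallback. writeTo is whatever
// transport your client uses to send one write to one node.
func writeWithFallback(ck *clusterkit.ClusterKit, key, value string,
    writeTo func(*clusterkit.Node, string, string) error) error {

    partition, err := ck.GetPartition(key)
    if err != nil {
        return err
    }

    // Try the primary first, then fall back to each replica in turn.
    nodes := append([]*clusterkit.Node{ck.GetPrimary(partition)}, ck.GetReplicas(partition)...)

    var lastErr error
    for _, node := range nodes {
        lastErr = writeTo(node, key, value)
        if lastErr == nil {
            return nil // failover cost is one failed request, not a 30s timeout
        }
    }
    return fmt.Errorf("all nodes for key %q failed: %w", key, lastErr)
}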

Handling Data Migration

When nodes join or leave, partitions need to move. ClusterKit provides a hook:

ck.OnPartitionChange(func(partitionID string, copyFromNodes []*Node, copyToNode *Node) {
    if copyToNode.ID != myNodeID {
        return  // Not for me
    }

    log.Printf("📦 Migrating partition %s\n", partitionID)

    // Merge data from ALL source nodes
    mergedData := make(map[string]string)

    for _, sourceNode := range copyFromNodes {
        data := fetchFrom(sourceNode, partitionID)

        // Your conflict resolution strategy
        for key, value := range data {
            mergedData[key] = value  // Last-write-wins
            // OR: Use version numbers, timestamps, etc.
        }
    }

    storeData(mergedData)
    log.Printf("✅ Migrated %d keys\n", len(mergedData))
})

Why multiple source nodes?

During failover, different replicas might have received different writes. ClusterKit gives you ALL nodes that had the data so you can merge them and prevent data loss.

This is crucial for eventual consistency scenarios!
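The last-write-wins line in the hook glosses over how "last" is decided. A common approach is to store a timestamp or version counter next to each value and keep the newest copy during the merge. A minimal sketch of that idea follows; the VersionedValue type is illustrative, not part of ClusterKit, and it assumes the time import.

// VersionedValue pairs a value with the time it was written.
type VersionedValue struct {
    Value     string
    Timestamp time.Time
}

// mergeNewest keeps, for every key, the copy with the latest timestamp
// across all source nodes.
func mergeNewest(snapshots []map[string]VersionedValue) map[string]VersionedValue {
    merged := make(map[string]VersionedValue)
    for _, snapshot := range snapshots {
        for key, candidate := range snapshot {
            current, exists := merged[key]
            if !exists || candidate.Timestamp.After(current.Timestamp) {
                merged[key] = candidate
            }
        }
    }
    return merged
}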

Three Replication Strategies Included

The library includes three complete, runnable examples:

1. Client-Side SYNC (Quorum-Based)

Best for: Banking, inventory, critical data

Client writes to all nodes (primary + replicas) and waits for quorum (2/3):

// Client writes to ALL nodes in parallel
client.Set("key", "value")

// Internally:
// - Write to primary + replicas
// - Wait for 2/3 nodes to confirm
// - Return success only if quorum reached

Consistency: Strong (data on 2+ nodes before success)
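Under the hood a quorum write is just "send to every node for the partition in parallel, count acknowledgements, succeed at a majority". A sketch of that counting logic, again with the per-node transport left as a hypothetical writeTo function and the type names assumed from the earlier examples:

// Sketch of a quorum write: succeed once a majority of the partition's
// nodes acknowledge. writeTo is a hypothetical per-node transport.
func quorumSet(ck *clusterkit.ClusterKit, key, value string,
    writeTo func(*clusterkit.Node, string, string) error) error {

    partition, err := ck.GetPartition(key)
    if err != nil {
        return err
    }

    nodes := ck.GetNodes(partition) // primary + replicas
    quorum := len(nodes)/2 + 1      // e.g. 2 of 3

    acks := make(chan error, len(nodes))
    for _, node := range nodes {
        go func(n *clusterkit.Node) { acks <- writeTo(n, key, value) }(node)
    }

    confirmed, failed := 0, 0
    for range nodes {
        if err := <-acks; err == nil {
            confirmed++
        } else {
            failed++
        }
        if confirmed >= quorum {
            return nil // majority reached: the write is on 2+ nodes
        }
        if failed > len(nodes)-quorum {
            break // a majority can no longer be reached
        }
    }
    return fmt.Errorf("quorum not reached for key %q", key)
}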

2. Client-Side ASYNC (Primary-First)

Best for: Streaming, analytics, high throughput

Client writes to primary only, returns immediately:

// Client writes to primary, returns FAST
client.Set("key", "value")  // ~1ms latency

// Internally:
// - Write to primary (blocking - but fast!)
// - Return success immediately
// - Replicate to replicas in background

Consistency: Eventual (primary has data now, replicas catch up)

3. Server-Side Routing

Best for: Web/mobile apps, simple HTTP clients

Client sends to ANY node. Server handles routing:

# No SDK needed - just HTTP!
curl -X POST http://any-node:8080/kv/set \
  -d '{"key":"test","value":"hello"}'

# Server automatically:
# - Routes to primary if needed
# - Handles replication
# - Returns result

Consistency: Configurable (server decides)
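On the server side, routing is just the Set method from the earlier key-value example behind an HTTP handler that every node serves. A rough sketch, assuming the same JSON body as the curl call above; the handler itself is illustrative, not ClusterKit code (imports: encoding/json, net/http).

// Sketch of a server-side entry point: any node accepts the request, and the
// Set method from the earlier example decides whether to store locally,
// replicate, or forward to the primary.
func (kv *KVStore) handleSet(w http.ResponseWriter, r *http.Request) {
    var req struct {
        Key   string `json:"key"`
        Value string `json:"value"`
    }
    if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    if err := kv.Set(req.Key, req.Value); err != nil {
        http.Error(w, err.Error(), http.StatusInternalServerError)
        return
    }
    w.WriteHeader(http.StatusOK)
}

// Registered on every node, e.g.:
//   http.HandleFunc("/kv/set", kv.handleSet)
//   log.Fatal(http.ListenAndServe(":8080", nil))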

Production-Ready Features

Despite the simple API, ClusterKit is built for production:

Raft Consensus — Built on HashiCorp's battle-tested Raft

Write-Ahead Log — All state changes are logged

Snapshots — Periodic snapshots for fast recovery

Persistence — Survives restarts

HTTP API — RESTful cluster info endpoint

Docker & Kubernetes — Deployment examples included

When Should You Use ClusterKit?

Perfect for:

Building custom distributed caches

Creating specialized KV stores

Distributed message queues

Session stores across multiple servers

Time-series databases with custom partitioning

Any system where you need custom distribution logic

Not ideal for:

Single-node applications (obviously!)

Workloads where Redis Cluster or Cassandra already fits perfectly

Teams that would rather not handle replication themselves

Real-World Use Cases

Distributed Cache:

// Cache layer that scales horizontally
partition, _ := ck.GetPartition(sessionID)
if ck.IsPrimary(partition) {
    redis.Set(sessionID, sessionData)
}

Multi-Region Queue:

// Partition queue by region
partition, _ := ck.GetPartition(region)
if ck.IsPrimary(partition) {
    queue.Enqueue(message)
}

Sharded Analytics:

// Partition events by date
partition, _ := ck.GetPartition(date)
if ck.IsPrimary(partition) {
    timescaleDB.Insert(event)
}

Running the Examples

# Install
go get github.com/skshohagmiah/clusterkit

# Run SYNC example (strong consistency)
cd example/sync && ./run.sh

# Run ASYNC example (high throughput)
cd example/async && ./run.sh

# Run Server-Side example (simple clients)
cd example/server-side && ./run.sh

Each example starts a 6-10 node cluster and runs 1000 operations showing:

Cluster formation

Data distribution

Automatic replication

Node failure handling

Data migration

The Philosophy

ClusterKit follows the Unix philosophy: do one thing and do it well.

It doesn't try to be:

A database (you choose: PostgreSQL, MySQL, MongoDB)

A cache (you choose: Redis, Memcached, in-memory)

A message queue (you choose: your protocol)

It's just the coordination layer. This means:

✅ You understand every piece of your system

✅ You can debug issues easily

✅ You can optimize for your use case

✅ You're not locked into vendor features

✅ You can swap storage engines anytime

Performance

In my benchmarks (6-node cluster, 1000 ops):

SYNC (Quorum):

Write latency: ~5ms (waits for 2/3 nodes)

Read latency: ~1ms (local read)

Consistency: Strong

ASYNC (Primary-First):

Write latency: ~1ms (primary only)

Read latency: ~1ms (local read)

Consistency: Eventual

Server-Side:

Write latency: ~3ms (includes routing)

Read latency: ~2ms (includes routing)

Consistency: Configurable

Docker Deployment

docker-compose.yml

version: '3.8'

services:
  node-1:
    build: .
    environment:
      - NODE_ID=node-1
      - HTTP_PORT=8080
    ports:
      - "8080:8080"

  node-2:
    build: .
    environment:
      - NODE_ID=node-2
      - HTTP_PORT=8080
      - JOIN_ADDR=node-1:8080
    ports:
      - "8081:8080"

docker-compose up -d

What's Next?

I'm actively working on:

[ ] Dynamic partition count adjustment

[ ] Metrics and observability hooks

[ ] More replication examples (gRPC, NATS)

[ ] Admin UI for cluster visualization

[ ] Better documentation

Try It Yourself

go get github.com/skshohagmiah/clusterkit

Links:

GitHub: https://github.com/skshohagmiah/clusterkit

Examples: https://github.com/skshohagmiah/clusterkit/tree/main/example

Documentation: Check the README

Conclusion

Building distributed systems doesn't have to be rocket science. With ClusterKit, you get:

Simple API (7 methods + 1 hook)

Full control over storage and replication

Production-ready consensus and failover

Minimal configuration (2 required fields)

Complete working examples

If you've been wanting to build something distributed but the complexity scared you away, give ClusterKit a shot. Start small with the examples, then build something awesome.

Got questions? Drop them in the comments! 👇

And if you find this useful, give it a ⭐ on GitHub!

Tags: #go #distributedsystems #opensource #backend #clustering
