Yashaswi Kumar Mishra

lowkey: Building a Distributed Lock Service That Actually Works

Yes, sometimes the payments processing cron job runs twice. In the same batch, for the same customer. Unfortunately, that customer gets charged twice.

How did this happen? Two instances of the same server each ran the cron job. "But we did have locks, right?" You did. It was probably Redis Redlock. It worked perfectly in development. But production isn't kind to assumptions. A network partition occurs, two instances each think they're in charge, the poor customer gets charged twice and raises angry support tickets, and you're awake at 3 AM.

This is why I built lowkey - a distributed lock service that understands Murphy's Law: If it can go wrong, it will.

What is lowkey?

lowkey makes three promises:

  1. Only ONE - Strong consistency via Raft consensus

  2. Fencing tokens - Simple math prevents stale writes

  3. Fast - 3.24ms average latency (vs. etcd's typical 5-10ms)

Let's talk about why distributed locks are genuinely hard, and how to stop charging the same customer twice.

The Fundamental Problem of Time

You cannot really trust time in distributed systems.

Your process thinks one second has passed. In reality it might have been thirty. You paused for a GC cycle, the OS scheduled another process, or the network was slow. It doesn't matter which: your lease expired while you were out, and when your process wakes up it still believes only a second has gone by.

Time is a Lie

This is called a process pause. It isn't theoretical, and it causes production chaos practically every day.
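To make it concrete, here is the kind of check that looks safe but isn't: a client guarding a write with nothing but its own timer (a sketch; chargeCustomer is a placeholder, not lowkey code):

// UNSAFE: trusts the client's local view of elapsed time.
func doProtectedWork(acquiredAt time.Time, ttl time.Duration) {
    if time.Since(acquiredAt) < ttl {
        // A GC pause or OS preemption can land right here.
        // By the time the next line runs, the lease may have expired
        // and another client may already hold the lock.
        chargeCustomer()
    }
}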

So how do we solve it?

The CAP Theorem Choice

CAP theorem: Consistency, Availability, Partition Tolerance.

The theorem states that when a network partition happens, a distributed system can only guarantee one of the other two properties:

  • Consistency (C)

    Every node sees the same data at the same time. Once a write succeeds, all future reads see it.

  • Availability (A)

    Every request receives a response, even if that response might not be the latest value.

  • Partition tolerance (P)

    The system continues operating even when the network splits and messages are dropped or delayed.

For distributed locks, this isn't a choice. It's a requirement: You MUST pick CP (Consistency + Partition Tolerance).

Here's why Availability is the wrong choice:

In an AP system like Redis Redlock, if a network partition occurs:

  • node A can’t see node B

  • node B can’t see node A

Both nodes are still alive. Both think they’re healthy. Both say:

“I am available”

This is the critical moment. It's called a split-brain situation.

Availability is preserved only by lying about the global state.

Availability means "always respond, even if wrong". For locks, wrong equals disaster.

lowkey, by contrast, is a CP system.

In a CP system, nodes don’t pretend they’re alone in the universe. They are explicitly aware that:

  • other nodes exist

  • agreement is required

  • proceeding without quorum is dangerous

So when a network partition happens, the system splits into two very unequal worlds.

The Three Pillars: Leases, Fencing Tokens, and Raft

Leases

A lease is like a parking meter. You pay for 2 hours. After 2 hours, your time expires. You lose the spot. Automatic.

The lease is the main entity in lowkey. Every owner (node) has to start a lease before it can acquire locks. Every lock is associated with a lease.

// Client creates a lease
lease := client.CreateLease(10 * time.Second) // TTL
// lease.ID == 100

// Client uses this lease for all locks
lock := client.Acquire("job-name", lease.ID)

// Client must send keepalive every ~3 seconds (TTL/3)
// If keepalive stops -> lease expires -> locks released

Without leases, a crashed client holds its locks forever and the system gets stuck. With leases, the system recovers automatically.

The flow looks like this.

Time: 0s  → Client creates lease (expires at 10s)
Time: 3s  → Client: keepalive → expires at 13s
Time: 6s  → Client: keepalive → expires at 16s
Time: 9s  → Client: keepalive → expires at 19s
--- CLIENT CRASHES ---
Time: 12s → (no keepalive)
Time: 15s → (no keepalive)
Time: 19s → Lease expires
          → Server auto-releases all locks
          → Other clients can proceed
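Under the hood, a keepalive is just a ticker firing at roughly TTL/3. Here's a minimal sketch of what such a client-side loop could look like; the RenewLease call and interface are illustrative assumptions, and lowkey's SDK runs an equivalent loop for you automatically:

// Illustrative only: the real SDK starts an equivalent loop for you.
type LeaseRenewer interface {
    RenewLease(ctx context.Context, leaseID uint64) error
}

func keepAlive(ctx context.Context, c LeaseRenewer, leaseID uint64, ttl time.Duration) {
    ticker := time.NewTicker(ttl / 3)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if err := c.RenewLease(ctx, leaseID); err != nil {
                // If renewals keep failing past the TTL, the server
                // expires the lease and auto-releases its locks.
                log.Printf("keepalive failed: %v", err)
            }
        }
    }
}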

Fencing Tokens

Imagine a deli counter with numbered tickets. Once ticket #44 has been served, ticket #43 can no longer be honored. The number proves freshness.

Fencing tokens are the same idea: monotonically increasing numbers that prove "I'm not a zombie client from the past".

Why are fencing tokens important? Because a paused or partitioned client can wake up still believing it holds the lock and try to write. The token must be validated by the protected resource (database, API, file, etc.): the lock service hands out the token, but the resource enforces it.

In lowkey:

// Client gets token when acquiring lock
lock := client.Acquire("migration-job")
token := lock.Token()  // e.g., 42

// Client includes token in EVERY protected operation
database.Execute(query, token: token)

// Database validates
func (db *Database) Execute(query string, token uint64) error {
    if token < db.lastSeenToken {
        return errors.New("stale token - rejected")
    }

    // Execute query
    db.lastSeenToken = token
    return nil
}

Raft Consensus

Leases and Fencing tokens solve client-side failures. What about server-side failures? If you have a single lock server and it crashes, your entire system is down. No locks translates to no progress.

The solution: Multiple servers with consensus

But here's the catch: with multiple servers, a new problem appears. Who's in charge?

Without consensus, you run straight into the split-brain problem we discussed earlier.

This is why we need consensus - a protocol that ensures all servers agree on:

  1. Who is the leader (only one can hand out locks)

  2. What the current state is (which locks are held)

  3. What operations have been committed (which token numbers are valid)

Why Raft specifically?

There are other consensus algorithms (Paxos, ZAB, Viewstamped Replication). We chose Raft because:

  1. Understandable - Designed explicitly to be easier to understand than Paxos

  2. Proven - Used in production by etcd, Consul, CockroachDB

  3. Battle-tested - HashiCorp's implementation has years of production use

  4. Leader-based - Simpler than leaderless protocols, easier to reason about failures

Here's how leader-based consensus helps us:

Network partition: [Server 1] X [Server 2, Server 3]

Server 1 (minority):
  → Tries to become leader
  → Can't get majority votes (1 out of 3)
  → Stays as follower
  → Rejects all write requests
  → READ-ONLY mode

Server 2, 3 (majority):
  → Server 2 becomes leader (2 out of 3 votes)
  → Can commit writes (has quorum)
  → Hands out locks with monotonic tokens
  → System continues

When partition heals:
  → Server 1 realizes it's behind
  → Syncs from Server 2 (the true leader)
  → Consistency restored

Raft is like a jury reaching a verdict. Even if some jurors leave, the remaining majority must agree.

This is CP in action: sacrifice availability (minority can't write) for consistency (no split-brain).
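The safety here comes from simple majority arithmetic: two disjoint partitions can never both hold a strict majority, so at most one side can elect a leader. An illustrative sketch:

// hasQuorum reports whether votes form a strict majority of the cluster.
// In a 3-node cluster, 2 votes pass and 1 does not.
func hasQuorum(votes, clusterSize int) bool {
    return votes > clusterSize/2
}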

lowkey Architecture

lowkey is built in 4 layers.

Layer 1: FSM (Finite State Machine)

The FSM is the "brain" that manages state:

type FSM struct {
    locks  map[string]*Lock   // "migration-job" → {owner, token, lease}
    leases map[uint64]*Lease  // 100 → {owner, expires_at}

    fencingCounter uint64     // Monotonically increasing
    clock *monotime.Clock     // NOT wall clock!
}

Design decision #1: Monotonic time

We don't use time.Now(). Why? System clocks can jump backwards (NTP syncs, manual changes, and more). Instead:

// Monotonic clock - only moves forward
type Clock struct {
    start time.Time  // Fixed reference point
}

func (c *Clock) Elapsed() Duration {
    return Duration(time.Since(c.start))  // Monotonic!
}

func (c *Clock) ExpiresAt(ttl Duration) Duration {
    return c.Elapsed() + ttl
}

This gives us a clock immune to system time changes. It only knows "time since server started", which only goes forward.

Design decision #2: Fencing counter is sacred

Every lock acquisition increments the fencing counter. This is strictly monotonic - never decreases, never repeats:

func (f *FSM) AcquireLock(lockName, ownerID string, leaseID uint64) (uint64, error) {
    // Validate lease exists and isn't expired
    lease := f.leases[leaseID]
    if lease == nil || lease.IsExpired(f.clock.Elapsed()) {
        return 0, ErrLeaseExpired
    }

    // Check if lock is available
    if lock, held := f.locks[lockName]; held && lock.LeaseID != leaseID {
        return 0, ErrLockAlreadyHeld
    }

    // THIS IS CRITICAL: increment before assigning
    f.fencingCounter++

    f.locks[lockName] = &Lock{
        Name:         lockName,
        OwnerID:      ownerID,
        FencingToken: f.fencingCounter,  // Unique, monotonic
        LeaseID:      leaseID,
    }

    return f.fencingCounter, nil
}

Layer 2: Raft Consensus

We use HashiCorp's Raft implementation - battle-tested, production-proven.

The critical optimization: Not everything needs consensus.

Wait, lease renewal doesn't go through Raft?

Operations that need Raft:

  • CreateLease (affects state)

  • AcquireLock (affects state)

  • ReleaseLock (affects state)

Operations that DON'T need Raft:

  • RenewLease (just extends time)

Here's the trick:

// RenewLeaseLocal - leader-only operation
func (n *Node) RenewLeaseLocal(leaseID uint64) (time.Duration, error) {
    // Safety check: must be leader
    if !n.IsLeader() {
        return 0, fmt.Errorf("not leader")
    }

    // Update expiration time in FSM (no Raft consensus)
    return n.fsm.RenewLeaseLocal(leaseID)
}

This is safe because:

  1. Only the leader processes renewals

  2. If the leader crashes, clients reconnect to the new leader

  3. If a client can't reach the new leader within the TTL, its lease expires

Benchmark impact: This made heartbeats 10-100x faster. Raft consensus is expensive - use it only when necessary.
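Conceptually, the FSM side of a renewal just pushes the lease's expiry forward on the monotonic clock. A simplified sketch (the type and field names here are mine, not lowkey's actual code):

// A lease tracked as offsets on the server's monotonic clock.
type leaseState struct {
    TTL       time.Duration
    ExpiresAt time.Duration // monotonic offset at which the lease dies
}

// renewLocal extends a lease without a Raft round trip.
// elapsed is the server's current monotonic offset (clock.Elapsed()).
func renewLocal(l *leaseState, elapsed time.Duration) error {
    if l == nil || elapsed >= l.ExpiresAt {
        return errors.New("lease expired")
    }
    l.ExpiresAt = elapsed + l.TTL // same math as clock.ExpiresAt(ttl)
    return nil
}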

Layer 3: Server (gRPC + HTTP)

The server validates leadership and routes requests:

func (s *Server) AcquireLock(ctx context.Context, req *pb.AcquireLockRequest) (*pb.AcquireLockResponse, error) {
    // Apply command through Raft
    result, err := s.node.Apply(types.AcquireLockCmd{
        LockName: req.LockName,
        OwnerID:  req.OwnerId,
        LeaseID:  req.LeaseId,
    })

    if err != nil {
        return nil, toGRPCError(err)
    }

    token := result.(fsm.AcquireLockResponse).FencingToken

    return &pb.AcquireLockResponse{
        FencingToken: token,  // Critical: return token to client
    }, nil
}

Why Protocol Buffers (Protobuf)?

We chose Protobuf over JSON for the API layer. Here's why:

1. Performance - Binary encoding beats JSON

Message size for AcquireLockRequest:
- Protobuf: ~50 bytes
- JSON:     ~120 bytes
→ 2.4x smaller payload

Serialization speed:
- Protobuf: ~50 ns/op
- JSON:     ~500 ns/op
→ 10x faster serialization

2. Type safety - Compile-time guarantees

// lock.proto - the schema IS the contract
message AcquireLockRequest {
  string lock_name = 1;
  string owner_id = 2;
  uint64 lease_id = 3;
}

With JSON, you find bugs at runtime:

// JSON - runtime error
{"lock_name": 123}  // oops, should be string!

// Protobuf - compile-time error
req.LockName = 123  // won't compile, type mismatch

3. Schema evolution - Backward compatible

// Version 1
message Lock {
  string name = 1;
  uint64 token = 2;
}

// Version 2 - add optional field, old clients still work
message Lock {
  string name = 1;
  uint64 token = 2;
  int64 expires_at = 3;  // new field, backwards compatible
}

4. Multi-language support - One schema, all languages

Generate clients automatically:

protoc --go_out=. lock.proto        # Go (with the protoc-gen-go plugin)
protoc --python_out=. lock.proto    # Python
protoc --java_out=. lock.proto      # Java
# ...and more languages via their own protoc plugins

5. gRPC integration - Native support

Protobuf is the native serialization for gRPC. You get:

  • Bidirectional streaming (for heartbeats)

  • HTTP/2 multiplexing (multiple RPCs on one connection)

  • Built-in authentication and load balancing

But wait, what about JSON users?

We support both! Using grpc-gateway:

  • Client sends JSON to /v1/lock/acquire

  • grpc-gateway translates the request to Protobuf

  • The gRPC server processes the Protobuf message

  • grpc-gateway translates the response back to JSON

  • Client receives JSON
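For example, a plain HTTP client could call the gateway directly. A sketch only: the port, path mapping, and JSON field spelling below are assumptions for illustration, not taken from lowkey's docs:

// Hypothetical JSON call to the grpc-gateway endpoint.
payload := []byte(`{"lock_name": "migration-job", "owner_id": "worker-1", "lease_id": "100"}`)
resp, err := http.Post(
    "http://localhost:8080/v1/lock/acquire", // gateway address is an assumption
    "application/json",
    bytes.NewReader(payload),
)
if err != nil {
    log.Fatal(err)
}
defer resp.Body.Close()
// The JSON response carries the same fencing token the gRPC API returns.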

Best of both worlds:

  • Internal efficiency (Protobuf)

  • External flexibility (JSON)

The tradeoff:

Protobuf requires schema management and code generation. But for a distributed lock service where every millisecond matters, the performance and type safety are worth it.

For lowkey specifically:

  • 3.24ms average latency - every microsecond counts

  • Strong typing prevents token confusion (uint64, not string)

  • Schema evolution allows adding metrics without breaking clients

Layer 4: Client SDK

The SDK makes it easy:

// Create client
c, _ := client.NewClient("localhost:9000", "worker-1")
defer c.Stop()

// Start lease (automatic heartbeats every TTL/3)
c.Start(ctx, 10*time.Second)

// Acquire lock
lock, err := c.Acquire(ctx, "migration-job")
if err != nil {
    log.Printf("Another instance is running")
    return
}
defer lock.Release(ctx)

// Get fencing token
token := lock.Token()

// Use token in all protected operations
database.Migrate(token)

The SDK handles:

  • Automatic heartbeats in background
  • Error handling and logging
  • Reconnection on leader changes
  • Clean shutdown

Benchmarks: The Numbers

Benchmarked on AMD Ryzen 7 5800HS (8 cores / 16 threads), single-node cluster:

lowkey Benchmarks:
BenchmarkSequential-16       4460 ops     3.24ms/op
BenchmarkParallel-16        19911 ops     0.60ms/op
BenchmarkContention-16      10000 ops     1.40ms/op

What these numbers mean:

  • 3.24ms sequential - Baseline Raft consensus latency for lock acquisition
  • 0.60ms parallel - With no lock contention, throughput is 5.4x higher
  • 1.40ms contention - Realistic scenario with competing clients

Why these numbers:

  1. Optimized heartbeats → Leader-only renewal (no Raft overhead)
  2. HashiCorp Raft → Battle-tested, highly optimized
  3. Go → Concurrent, low-latency runtime
  4. Smart design → Only critical ops need consensus

The key insight: parallel throughput is dramatically higher (5.4x over the sequential baseline, and 2.3x even under contention) because heartbeats bypass Raft and never bottleneck the system. This is the heartbeat optimization paying off.

Latency Percentiles: What Users Actually Experience

Average latency (3.24ms) tells part of the story. But what about tail latency? The 1% of requests that take longer?

For distributed locks, tail latency matters:

  • p50 (median) - Half of requests are faster
  • p90 - 90% of requests are faster (most users)
  • p99 - 99% of requests are faster (outliers)
  • p99.9 - 99.9% of requests are faster (worst case)
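For reference, percentiles like these are usually computed by sorting the raw latency samples and picking the value at the matching rank; a generic sketch, not lowkey's benchmark code:

// percentile returns the p-th percentile (0 < p <= 100) of the samples,
// using the nearest-rank method on a sorted copy.
func percentile(samples []time.Duration, p float64) time.Duration {
    sorted := append([]time.Duration(nil), samples...)
    if len(sorted) == 0 {
        return 0
    }
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
    if rank < 0 {
        rank = 0
    }
    return sorted[rank]
}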

lowkey percentile benchmarks (1000 samples, sequential):

Percentile Latency Benchmarks:
p50  (median):     2.87ms
p90  (90th):       4.12ms
p95  (95th):       4.58ms
p99  (99th):       5.94ms
p99.9 (99.9th):    8.21ms

What this tells us:

Percentile   Latency   Insight
p50          2.87ms    Half of lock acquisitions complete in under 3ms
p90          4.12ms    90% of users see sub-5ms latency
p99          5.94ms    Even outliers stay under 6ms
p99.9        8.21ms    Worst case is still under 10ms

Why this matters for production:

Scenario: API endpoint acquires lock before processing

With lowkey p99 = 5.94ms:
  → 99% of requests add <6ms overhead
  → Predictable, low tail latency
  → Good user experience

With slower system (p99 = 50ms):
  → 1% of requests add 50ms
  → Visible lag for users
  → Unpredictable performance

The distribution is tight:

  • p50 → p99 spread: 3.07ms (2.87ms to 5.94ms)
  • No huge outliers (p99.9 is only 8.21ms)
  • Raft consensus is predictable, not spiky

Takeaway: lowkey's tight percentile distribution means even your p99 users get a predictable, low-latency experience. Consensus is expensive, but at least it's predictable.

Run the percentiles yourself:

make bench-percentiles

Design Decisions Recap

  1. CP over AP → Consistency is non-negotiable for locks
  2. Monotonic time → Wall clocks are lies, monotonic clocks are truth
  3. Lease-based locks → Auto-cleanup on client failure
  4. Fencing tokens → Math prevents stale writes
  5. Raft consensus → Proven, understood, battle-tested
  6. Leader-only heartbeats → 10-100x performance improvement
  7. gRPC + HTTP → Flexibility for all languages

What I Learned Building This

Time is broken in distributed systems:

  • Wall clocks jump backward (NTP)
  • Processes pause unpredictably (GC)
  • Network delays are unbounded

→ Solution: Monotonic time + fencing tokens

You can't cheat the CAP theorem:

  • AP (Redis Redlock) → Split-brain under partition
  • CP (lowkey) → Only majority can make progress

→ For locks, CP is the only correct choice

Consensus is expensive:

  • Raft consensus: ~3-5ms per operation
  • Leader-only ops: ~0.5ms per operation

→ Use consensus sparingly, optimize the rest

Fencing tokens are mandatory:

  • Timeouts alone cannot prevent stale writes
  • Process pauses are real and unpredictable

→ The protected resource MUST validate tokens

Try It Yourself

# Clone the repository
git clone https://github.com/pixperk/lowkey.git
cd lowkey

# Build the binary
go build -o lowkey cmd/lowkey/main.go

# Start the server
./lowkey --bootstrap --data-dir ./data

# Use the SDK
go get github.com/pixperk/lowkey/pkg/client

Full documentation, examples, and benchmarks on GitHub.

The Bottom Line

Distributed locks are genuinely hard. Here's why:

  1. Time is unreliable (wall clocks jump, processes pause)
  2. Networks are unreliable (partitions happen)
  3. Processes are unreliable (crashes, GC pauses)

But they're solvable with the right primitives:

  1. Raft consensus → Strong consistency (CP in CAP)
  2. Fencing tokens → Mathematical safety against stale writes
  3. Monotonic time → Immune to clock drift
  4. Leases → Automatic cleanup on failures

lowkey proves you can build a distributed lock service that's both correct (no split-brain, no stale writes) and fast (3.24ms, faster than etcd).

The real lesson? Distributed systems require paranoia.

Assume clocks lie. Assume networks partition. Assume processes pause at the worst possible moment. Then design systems that work anyway.

That's lowkey.
