Yes, sometimes the payments-processing cron job runs twice: same batch, same customer. The customer gets charged twice.
How does that happen? Two instances of the same service both ran the job. “But we did have locks, right?” You did. It was probably Redis Redlock. It worked perfectly in development, but production isn't kind to assumptions. A network partition occurs, two instances each think they're in charge, the poor customer gets charged twice and files an angry support ticket, and you're awake at 3 AM.
This is why I built lowkey - a distributed lock service that understands Murphy's Law: If it can go wrong, it will.
What is lowkey?
lowkey makes three promises:
Only ONE - Strong consistency via Raft consensus
Fencing tokens - Simple math prevents stale writes
Fast - 3.24ms average latency (vs. etcd's typical 5-10ms)
Let's talk about why distributed locks are genuinely hard, and how to stop charging the same customer twice.
The Fundamental Problem of Time
You cannot really trust time in distributed systems.
Your process thinks one second has passed. In reality it might have been 30 seconds, or longer. Maybe you paused for a GC cycle, maybe the OS scheduled something else, maybe the network was slow. It doesn't matter: the lease expired while you were out, and when your process wakes up it still believes only a second has gone by.
This is called a process pause. It isn't theoretical; it causes real production chaos all the time.
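Here's a minimal sketch of the pattern that bites people (illustrative, not lowkey's API): the client checks its local view of the TTL before writing, but nothing stops a pause between the check and the write.
package main

import (
    "fmt"
    "time"
)

func chargeCustomer() { fmt.Println("charging the customer... possibly twice") }

func main() {
    ttl := 10 * time.Second
    acquiredAt := time.Now() // pretend we just acquired a lock with this TTL

    // ... do some batch work ...

    if time.Since(acquiredAt) < ttl {
        // The check passed, but the process can pause *right here* for longer
        // than the remaining TTL (GC, scheduler, VM stall). By the time the
        // write actually lands, another client may already hold the lock.
        chargeCustomer()
    }
}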
So how do we solve it?
The CAP Theorem Choice
CAP stands for Consistency, Availability, and Partition tolerance. The theorem states that when a network partition happens, a distributed system can only guarantee one of the first two:
Consistency (C): Every node sees the same data at the same time. Once a write succeeds, all future reads see it.
Availability (A): Every request receives a response, even if that response might not be the latest value.
Partition tolerance (P): The system continues operating even when the network splits and messages are dropped or delayed.
For distributed locks, this isn't a choice. It's a requirement: You MUST pick CP (Consistency + Partition Tolerance).
Here's why Availability is the wrong choice:
In an AP system like Redis Redlock, if a network partition occurs:
node A can’t see node B
node B can’t see node A
Both nodes are still alive. Both think they’re healthy. Both say:
“I am available”
This is the critical moment. It's called a split-brain situation.
Availability is being preserved by lying about the global state.
Availability means "always respond, even if wrong". For locks, wrong equals disaster.
lowkey, by contrast, is a CP system.
In a CP system, nodes don’t pretend they’re alone in the universe. They are explicitly aware that:
other nodes exist
agreement is required
proceeding without quorum is dangerous
So when a network partition happens, the system splits into two very unequal worlds: the side with a majority keeps making progress, while the minority side refuses writes.
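A quick sketch of the quorum arithmetic this relies on (this is the standard Raft majority rule):
// hasQuorum reports whether a group of 'votes' servers out of 'clusterSize'
// is allowed to make progress (elect a leader, commit writes).
func hasQuorum(votes, clusterSize int) bool {
    return votes >= clusterSize/2+1
}

// With 3 servers and a partition of {Server 1} vs {Server 2, Server 3}:
//   hasQuorum(1, 3) == false → the minority side goes read-only
//   hasQuorum(2, 3) == true  → the majority side keeps handing out locks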
The Three Pillars: Leases, Fencing Tokens, and Raft
Leases
A lease is like a parking meter. You pay for 2 hours. After 2 hours, your time expires. You lose the spot. Automatic.
The lease is the main entity in lowkey. Every owner (client node) has to start a lease before it can acquire locks, and every lock is associated with a lease.
// Client creates a lease (conceptual API; the real SDK calls are shown later)
lease := client.CreateLease(10 * time.Second) // lease_id = 100

// Client uses this lease for all of its locks
lock := client.Acquire("job-name", lease.ID)

// Client must send a keepalive every ~3 seconds (TTL/3).
// If keepalives stop -> lease expires -> locks released.
Without leases, a crashed client holds locks forever. System gets stuck. With leases, system recovers automatically.
The flow looks like this.
Time: 0s  → Client creates lease (expires at 10s)
Time: 3s  → Client: keepalive → expires at 13s
Time: 6s  → Client: keepalive → expires at 16s
Time: 9s  → Client: keepalive → expires at 19s
--- CLIENT CRASHES ---
Time: 12s → (no keepalive)
Time: 15s → (no keepalive)
Time: 19s → Lease expires → Server auto-releases all locks → Other clients can proceed
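Under the hood, the client side of this flow is just a ticker firing at TTL/3. A simplified sketch (the renew callback stands in for the actual RPC; the real SDK runs this loop for you in the background):
import (
    "context"
    "time"
)

// keepAlive renews the lease every TTL/3 until the context is cancelled
// or a renewal fails (for example, the leader became unreachable).
func keepAlive(ctx context.Context, leaseID uint64, ttl time.Duration, renew func(uint64) error) {
    ticker := time.NewTicker(ttl / 3)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return // clean shutdown: stop renewing, the lease will lapse
        case <-ticker.C:
            if err := renew(leaseID); err != nil {
                return // can't renew: let the lease expire so others can proceed
            }
        }
    }
}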
Fencing Tokens
Imagine a deli counter with numbered tickets. Once ticket #44 has been served, a latecomer waving ticket #43 gets turned away. The number proves freshness.
Fencing tokens are the same idea: monotonically increasing numbers that prove "I'm not a zombie client from the past".
Why are fencing tokens important? Because leases alone don't protect you from a paused client: a client can acquire a lock, pause past its lease expiry, and wake up still believing it holds the lock while another client has already taken over. The fencing token is how the rest of the system tells the stale holder apart from the current one.
Here's how that protection works:
The token must be validated by the protected resource (database, API, file, etc.). The lock service provides the token, but the resource enforces it.
In lowkey:
// Client gets a token when acquiring a lock
lock := client.Acquire("migration-job")
token := lock.Token() // e.g., 42

// Client includes the token in EVERY protected operation
database.Execute(query, token)

// The database validates it
func (db *Database) Execute(query string, token uint64) error {
    if token < db.lastSeenToken {
        return errors.New("stale token - rejected")
    }
    // Execute the query
    db.lastSeenToken = token
    return nil
}
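To make that check concrete, here's a tiny runnable walk-through of the zombie-client scenario. The Database struct is repeated from the sketch above so the snippet runs standalone; all names are illustrative:
package main

import (
    "errors"
    "fmt"
)

type Database struct {
    lastSeenToken uint64
}

func (db *Database) Execute(query string, token uint64) error {
    if token < db.lastSeenToken {
        return errors.New("stale token - rejected")
    }
    db.lastSeenToken = token
    return nil
}

func main() {
    db := &Database{}

    // Client A acquired the lock with token 42, then paused past its lease.
    // Client B acquired the lock afterwards with token 43 and writes first.
    fmt.Println(db.Execute("UPDATE accounts SET ...", 43)) // <nil>

    // Client A wakes up, still believing it holds the lock, and writes with 42.
    fmt.Println(db.Execute("UPDATE accounts SET ...", 42)) // stale token - rejected
}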
Raft Consensus
Leases and fencing tokens solve client-side failures. What about server-side failures? If you have a single lock server and it crashes, your entire system is down: no locks, no progress.
The solution: Multiple servers with consensus
But here's the catch: multiple servers create a new problem. Who's in charge?
Without consensus, you get the split-brain problem we discussed earlier.
This is why we need consensus - a protocol that ensures all servers agree on:
Who is the leader (only one can hand out locks)
What the current state is (which locks are held)
What operations have been committed (which token numbers are valid)
Why Raft specifically?
There are other consensus algorithms (Paxos, ZAB, Viewstamped Replication). We chose Raft because:
Understandable - Designed explicitly to be easier to understand than Paxos
Proven - Used in production by etcd, Consul, CockroachDB
Battle-tested - HashiCorp's implementation has years of production use
Leader-based - Simpler than leaderless protocols, easier to reason about failures
Here's how leader-based consensus helps us:
Network partition: [Server 1] X [Server 2, Server 3]

Server 1 (minority):
→ Tries to become leader
→ Can't get majority votes (1 out of 3)
→ Stays as follower
→ Rejects all write requests
→ READ-ONLY mode

Server 2, 3 (majority):
→ Server 2 becomes leader (2 out of 3 votes)
→ Can commit writes (has quorum)
→ Hands out locks with monotonic tokens
→ System continues

When partition heals:
→ Server 1 realizes it's behind
→ Syncs from Server 2 (the true leader)
→ Consistency restored
Raft is like a jury reaching a verdict. Even if some jurors leave, the remaining majority must agree.
This is CP in action: sacrifice availability (minority can't write) for consistency (no split-brain).
lowkey Architecture
lowkey is built in 4 layers.
Layer 1: FSM (Finite State Machine)
The FSM is the "brain" that manages state:
type FSM struct {
    locks          map[string]*Lock  // "migration-job" → {owner, token, lease}
    leases         map[uint64]*Lease // 100 → {owner, expires_at}
    fencingCounter uint64            // Monotonically increasing
    clock          *monotime.Clock   // NOT wall clock!
}
Design decision #1: Monotonic time
We don't use time.Now(). Why? System clocks can jump backwards (NTP syncs, manual changes, and plenty of other reasons). Instead:
// Monotonic clock - only moves forward
type Clock struct {
    start time.Time // Fixed reference point
}

func (c *Clock) Elapsed() Duration {
    return Duration(time.Since(c.start)) // Monotonic!
}

func (c *Clock) ExpiresAt(ttl Duration) Duration {
    return c.Elapsed() + ttl
}
This gives us a clock immune to system time changes. It only knows "time since server started", which only goes forward.
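Concretely, lease expirations can then live in "elapsed since server start" space rather than as wall-clock timestamps. A simplified sketch (not the exact struct in the codebase) that matches the IsExpired(f.clock.Elapsed()) call you'll see in the FSM code below:
// Expirations are offsets on the monotonic clock, never wall-clock times.
type Lease struct {
    OwnerID   string
    ExpiresAt Duration // e.g. 19s after server start
}

func (l *Lease) IsExpired(now Duration) bool {
    return now >= l.ExpiresAt // 'now' comes from clock.Elapsed()
}

func (l *Lease) Renew(now, ttl Duration) {
    l.ExpiresAt = now + ttl // each keepalive pushes the deadline forward
}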
Design decision #2: Fencing counter is sacred
Every lock acquisition increments the fencing counter. This is strictly monotonic - never decreases, never repeats:
func (f *FSM) AcquireLock(lockName, ownerID string, leaseID uint64) (uint64, error) {
    // Validate lease exists and isn't expired
    lease := f.leases[leaseID]
    if lease == nil || lease.IsExpired(f.clock.Elapsed()) {
        return 0, ErrLeaseExpired
    }

    // Check if lock is available
    if lock, held := f.locks[lockName]; held && lock.LeaseID != leaseID {
        return 0, ErrLockAlreadyHeld
    }

    // THIS IS CRITICAL: increment before assigning
    f.fencingCounter++
    f.locks[lockName] = &Lock{
        Name:         lockName,
        OwnerID:      ownerID,
        FencingToken: f.fencingCounter, // Unique, monotonic
        LeaseID:      leaseID,
    }
    return f.fencingCounter, nil
}
Layer 2: Raft Consensus
We use HashiCorp's Raft implementation - battle-tested, production-proven.
The critical optimization: Not everything needs consensus.
Wait, lease renewal doesn't go through Raft?
Operations that need Raft:
CreateLease (affects state)
AcquireLock (affects state)
ReleaseLock (affects state)
Operations that DON'T need Raft:
RenewLease (just extends time)
Here's the trick:
// RenewLeaseLocal - leader-only operation
func (n *Node) RenewLeaseLocal(leaseID uint64) (time.Duration, error) {
    // Safety check: must be leader
    if !n.IsLeader() {
        return 0, fmt.Errorf("not leader")
    }
    // Update expiration time in FSM (no Raft consensus)
    return n.fsm.RenewLeaseLocal(leaseID)
}
This is safe because:
Only the leader processes renewals
If the leader crashes, clients reconnect to the new leader
If a client can't reach the new leader within the TTL, its lease expires
Benchmark impact: This made heartbeats 10-100x faster. Raft consensus is expensive - use it only when necessary.
Layer 3: Server (gRPC + HTTP)
The server validates leadership and routes requests:
func (s *Server) AcquireLock(ctx context.Context, req *pb.AcquireLockRequest) (*pb.AcquireLockResponse, error) {
    // Apply command through Raft
    result, err := s.node.Apply(types.AcquireLockCmd{
        LockName: req.LockName,
        OwnerID:  req.OwnerId,
        LeaseID:  req.LeaseId,
    })
    if err != nil {
        return nil, toGRPCError(err)
    }

    token := result.(fsm.AcquireLockResponse).FencingToken
    return &pb.AcquireLockResponse{
        FencingToken: token, // Critical: return token to client
    }, nil
}
Why Protocol Buffers (Protobuf)?
We chose Protobuf over JSON for the API layer. Here's why:
1. Performance - Binary encoding beats JSON
Message size for AcquireLockRequest:
- Protobuf: ~50 bytes
- JSON: ~120 bytes
→ 2.4x smaller payload
Serialization speed:
- Protobuf: ~50 ns/op
- JSON: ~500 ns/op
→ 10x faster serialization
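If you want to sanity-check those numbers on your own machine, a rough micro-benchmark looks like this (the figures above are approximate; pb is the package generated from lock.proto, and your results will vary):
import (
    "encoding/json"
    "testing"

    "google.golang.org/protobuf/proto"
)

func BenchmarkProtoMarshal(b *testing.B) {
    req := &pb.AcquireLockRequest{LockName: "migration-job", OwnerId: "worker-1", LeaseId: 100}
    for i := 0; i < b.N; i++ {
        if _, err := proto.Marshal(req); err != nil {
            b.Fatal(err)
        }
    }
}

func BenchmarkJSONMarshal(b *testing.B) {
    req := map[string]any{"lock_name": "migration-job", "owner_id": "worker-1", "lease_id": 100}
    for i := 0; i < b.N; i++ {
        if _, err := json.Marshal(req); err != nil {
            b.Fatal(err)
        }
    }
}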
2. Type safety - Compile-time guarantees
// lock.proto - the schema IS the contract
message AcquireLockRequest {
string lock_name = 1;
string owner_id = 2;
uint64 lease_id = 3;
}
With JSON, you find bugs at runtime:
// JSON - runtime error
{"lock_name": 123} // oops, should be string!
// Protobuf - compile-time error
req.LockName = 123 // won't compile, type mismatch
3. Schema evolution - Backward compatible
// Version 1
message Lock {
string name = 1;
uint64 token = 2;
}
// Version 2 - add optional field, old clients still work
message Lock {
string name = 1;
uint64 token = 2;
int64 expires_at = 3; // new field, backwards compatible
}
4. Multi-language support - One schema, all languages
Generate clients automatically (assuming the matching protoc plugin for each language is installed):
protoc --go_out=. lock.proto # Go
protoc --python_out=. lock.proto # Python
protoc --java_out=. lock.proto # Java
protoc --rust_out=. lock.proto # Rust
5. gRPC integration - Native support
Protobuf is the native serialization for gRPC. You get:
Bidirectional streaming (for heartbeats)
HTTP/2 multiplexing (multiple RPCs on one connection)
Built-in authentication and load balancing
But wait, what about JSON users?
We support both! Using grpc-gateway:
Client sends JSON to /v1/lock/acquire
grpc-gateway translates to Protobuf
gRPC server processes Protobuf
grpc-gateway translates response back to JSON
Client receives JSON
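For example, a plain HTTP client can acquire a lock through the gateway without touching gRPC at all. A sketch (the gateway port here is just an example; the JSON field names follow the default proto3 JSON mapping of AcquireLockRequest):
import (
    "bytes"
    "log"
    "net/http"
)

func acquireViaJSON() {
    // Note: uint64 fields are encoded as strings in the proto3 JSON mapping.
    body := []byte(`{"lockName": "migration-job", "ownerId": "worker-1", "leaseId": "100"}`)

    resp, err := http.Post("http://localhost:8080/v1/lock/acquire", "application/json", bytes.NewReader(body))
    if err != nil {
        log.Fatal(err)
    }
    defer resp.Body.Close()
    // On success the response body carries the fencing token, e.g. {"fencingToken": "42"}.
}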
Best of both worlds:
Internal efficiency (Protobuf)
External flexibility (JSON)
The tradeoff:
Protobuf requires schema management and code generation. But for a distributed lock service where every millisecond matters, the performance and type safety are worth it.
For lowkey specifically:
3.24ms average latency - every microsecond counts
Strong typing prevents token confusion (uint64, not string)
Schema evolution allows adding metrics without breaking clients
Layer 4: Client SDK
The SDK makes it easy:
// Create client
c, _ := client.NewClient("localhost:9000", "worker-1")
defer c.Stop()

// Start lease (automatic heartbeats every TTL/3)
c.Start(ctx, 10*time.Second)

// Acquire lock
lock, err := c.Acquire(ctx, "migration-job")
if err != nil {
    log.Printf("Another instance is running")
    return
}
defer lock.Release(ctx)

// Get fencing token
token := lock.Token()

// Use token in all protected operations
database.Migrate(token)
The SDK handles:
- Automatic heartbeats in background
- Error handling and logging
- Reconnection on leader changes
- Clean shutdown
Benchmarks: The Numbers
Benchmarked on an AMD Ryzen 7 5800HS (8 cores, 16 threads), single-node cluster:
lowkey Benchmarks:
BenchmarkSequential-16 4460 ops 3.24ms/op
BenchmarkParallel-16 19911 ops 0.60ms/op
BenchmarkContention-16 10000 ops 1.40ms/op
What these numbers mean:
- 3.24ms sequential - Baseline Raft consensus latency for lock acquisition
- 0.60ms parallel - With no lock contention, throughput is 5.4x higher
- 1.40ms contention - Realistic scenario with competing clients
Why these numbers:
- Optimized heartbeats → Leader-only renewal (no Raft overhead)
- HashiCorp Raft → Battle-tested, highly optimized
- Go → Concurrent, low-latency runtime
- Smart design → Only critical ops need consensus
The key insight from the benchmarks: with no lock contention, parallel throughput (0.60ms) is 5.4x higher than the sequential baseline (3.24ms), and even the realistic contention scenario (1.40ms) is 2.3x faster. This shows the heartbeat optimization works - heartbeats don't bottleneck the system.
Latency Percentiles: What Users Actually Experience
Average latency (3.24ms) tells part of the story. But what about tail latency? The 1% of requests that take longer?
For distributed locks, tail latency matters:
- p50 (median) - Half of requests are faster
- p90 - 90% of requests are faster (most users)
- p99 - 99% of requests are faster (outliers)
- p99.9 - 99.9% of requests are faster (worst case)
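These are easy to compute yourself from raw timings: sort the samples and index in. A minimal sketch (a simplified version of what the percentile benchmark does):
import (
    "math"
    "sort"
    "time"
)

// percentile returns the p-th percentile (0 < p <= 100) of the samples,
// using the nearest-rank method on a sorted copy.
func percentile(samples []time.Duration, p float64) time.Duration {
    sorted := append([]time.Duration(nil), samples...)
    sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
    rank := int(math.Ceil(p/100*float64(len(sorted)))) - 1
    if rank < 0 {
        rank = 0
    }
    return sorted[rank]
}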
lowkey percentile benchmarks (1000 samples, sequential):
Percentile Latency Benchmarks:
p50 (median): 2.87ms
p90 (90th): 4.12ms
p95 (95th): 4.58ms
p99 (99th): 5.94ms
p99.9 (99.9th): 8.21ms
What this tells us:
| Percentile | Latency | Insight |
|---|---|---|
| p50 | 2.87ms | Half of lock acquisitions complete in under 3ms |
| p90 | 4.12ms | 90% of users see sub-5ms latency |
| p99 | 5.94ms | Even outliers stay under 6ms |
| p99.9 | 8.21ms | Worst case is still under 10ms |
Why this matters for production:
Scenario: API endpoint acquires lock before processing
With lowkey p99 = 5.94ms:
→ 99% of requests add <6ms overhead
→ Predictable, low tail latency
→ Good user experience
With slower system (p99 = 50ms):
→ 1% of requests add 50ms
→ Visible lag for users
→ Unpredictable performance
The distribution is tight:
- p50 → p99 spread: 3.07ms (2.87ms to 5.94ms)
- No huge outliers (p99.9 is only 8.21ms)
- Raft consensus is predictable, not spiky
Takeaway: lowkey's tight percentile distribution means your p99 users don't get a worse experience. Consensus is expensive, but at least it's predictable.
Run the percentiles yourself:
make bench-percentiles
Design Decisions Recap
- CP over AP → Consistency is non-negotiable for locks
- Monotonic time → Wall clocks are lies, monotonic clocks are truth
- Lease-based locks → Auto-cleanup on client failure
- Fencing tokens → Math prevents stale writes
- Raft consensus → Proven, understood, battle-tested
- Leader-only heartbeats → 10-100x performance improvement
- gRPC + HTTP → Flexibility for all languages
What I Learned Building This
Time is broken in distributed systems:
- Wall clocks jump backward (NTP)
- Processes pause unpredictably (GC)
- Network delays are unbounded → Solution: Monotonic time + fencing tokens
You can't cheat the CAP theorem:
- AP (Redis Redlock) → Split-brain under partition
- CP (lowkey) → Only majority can make progress → For locks, CP is the only correct choice
Consensus is expensive:
- Raft consensus: ~3-5ms per operation
- Leader-only ops: ~0.5ms per operation → Use consensus sparingly, optimize the rest
Fencing tokens are mandatory:
- Timeouts alone cannot prevent stale writes
- Process pauses are real and unpredictable → The protected resource MUST validate tokens
Try It Yourself
# Clone the repository
git clone https://github.com/pixperk/lowkey.git
cd lowkey

# Build the binary
go build -o lowkey cmd/lowkey/main.go

# Start the server
./lowkey --bootstrap --data-dir ./data

# Use the SDK
go get github.com/pixperk/lowkey/pkg/client
Full documentation, examples, and benchmarks on GitHub.
The Bottom Line
Distributed locks are genuinely hard. Here's why:
- Time is unreliable (wall clocks jump, processes pause)
- Networks are unreliable (partitions happen)
- Processes are unreliable (crashes, GC pauses)
But they're solvable with the right primitives:
- Raft consensus → Strong consistency (CP in CAP)
- Fencing tokens → Mathematical safety against stale writes
- Monotonic time → Immune to clock drift
- Leases → Automatic cleanup on failures
lowkey proves you can build a distributed lock service that's both correct (no split-brain, no stale writes) and fast (3.24ms, faster than etcd).
The real lesson? Distributed systems require paranoia.
Assume clocks lie. Assume networks partition. Assume processes pause at the worst possible moment. Then design systems that work anyway.
That's lowkey.