If you are selling the same seat to two different people, you do not get to call it “a small race condition.”
At that point, your system starts losing its shape the moment real traffic shows up.
On the Chimera-State side, this is exactly what we ran into while building GigaScale. Once our distributed reservation engine built on Go, gRPC, and Redis started taking in thousands of requests per second, the problem stopped being just about performance. The real problem was making sure that, when multiple hands reached for the same seat at the same time, only one of them got through.
When the old reflex breaks
In a microservices world, the classic RDBMS reflex is always the same: “let’s lock the row inside a transaction.” On paper, that sounds reasonable. In production, with requests pouring in over gRPC, going back to the database for every reservation attempt just to chase a row-level lock does not protect the system. It chokes it.
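For contrast, here is roughly what that reflex looks like in code. This is a minimal sketch assuming a hypothetical seats table and plain database/sql; it is not taken from the GigaScale codebase:
package reservation

import (
	"context"
	"database/sql"
	"errors"
)

// reserveWithRowLock is the classic "lock the row inside a transaction" approach.
func reserveWithRowLock(ctx context.Context, db *sql.DB, seatID, userID string) error {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Every concurrent attempt on this seat now queues up behind the row-level lock.
	var status string
	if err := tx.QueryRowContext(ctx,
		"SELECT status FROM seats WHERE id = $1 FOR UPDATE", seatID,
	).Scan(&status); err != nil {
		return err
	}
	if status != "available" {
		return errors.New("seat already taken")
	}
	if _, err := tx.ExecContext(ctx,
		"UPDATE seats SET status = 'reserved', user_id = $2 WHERE id = $1",
		seatID, userID,
	); err != nil {
		return err
	}
	return tx.Commit()
}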
As traffic ramps up, the database stops being just a persistence layer and turns into a narrow passage everyone is trying to force themselves through. Lock waits get longer, transaction times swell, contention rises, and then you are left staring at the dashboard asking why throughput fell off a cliff.
What we actually had to solve was not “how do we lock the same row harder?” It was guaranteeing that only one request could own the same seat at a given moment.
The cluster side
We did not hand this system over to a single Redis instance and hope for the best. We ran a 6-node Redis Cluster in the background. That meant the locking mechanism was not sitting on one machine waiting to become a single point of pain. It was running on a real distributed layer, sharded and managed as a cluster.
That detail matters, because in the field, “we use Redis locks” sometimes really means “we wired a fragile shortcut to one Redis box.” On our side, there was a cluster behind it, so the solution was tested under actual traffic and actual distributed behavior.
var RedisClient redis.UniversalClient // shared Redis connection for the reservation service

func InitRedisCluster() {
	clusterAddrs := []string{
		"redis-node-1:6379", "redis-node-2:6379", "redis-node-3:6379",
		"redis-node-4:6379", "redis-node-5:6379", "redis-node-6:6379",
	}
	RedisClient = redis.NewClusterClient(&redis.ClusterOptions{
		Addrs:        clusterAddrs,
		MaxRedirects: 8,     // follow MOVED/ASK redirects as hash slots move between nodes
		ReadOnly:     false, // keep reads on the masters so lock checks never hit a stale replica
	})
	// log.Println("Redis Cluster created successfully")
}
Acquire is easy, release is where things go bad
The lock acquisition side was simple, but brutally effective. Before reserving the seat, we wrote a UUID token into the Redis key for that seat. The modern pattern here is SET key value NX EX ttl; NX makes Redis reject the write if the key already exists, and EX makes sure the lock does not stay around forever if something goes sideways. Redis also points to the same idea in its distributed locks guidance: the lock key should carry a unique random value.
The nice part is this: when two requests crash into the same seat at the same time, only one gets the lock because the write is atomic. The other one gets dropped immediately. No gray area. You either got the lock or you did not.
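As a quick standalone illustration of that collision (not the production code; the single-node address, key name, and TTL are made up for the example), two goroutines firing SetNX at the same key will split cleanly: only one comes back true.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

func main() {
	client := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ctx := context.Background()

	var wg sync.WaitGroup
	for i := 1; i <= 2; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			token := uuid.New().String()
			// Equivalent to: SET seat:lock:A12 <token> NX EX 5
			ok, err := client.SetNX(ctx, "seat:lock:A12", token, 5*time.Second).Result()
			fmt.Printf("request %d: acquired=%v err=%v\n", id, ok, err)
		}(i)
	}
	wg.Wait()
}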
But the real mistake is not on the acquire side. It is on release. A lot of examples in this space talk about distributed locking and then casually unlock with a plain DEL. That is unnecessary risk in production.
Say your process gets the lock and starts doing work. Then the TTL expires. Another request comes in and acquires the same key with its own token. If you now fire a delayed DEL, you are not deleting your own lock anymore. You are deleting someone else’s. And the race condition is back on the table.
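The dangerous release is literally a one-liner, which is exactly why it keeps showing up. A sketch against a bare go-redis client, just to show its shape:
// Unsafe: deletes whatever currently lives at the key,
// even if the lock changed owners after our TTL expired.
func unsafeRelease(ctx context.Context, client redis.UniversalClient, key string) error {
	return client.Del(ctx, key).Err()
}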
That is why we did not use a naked DEL on release. We used token-checked Lua. Redis Lua scripting runs atomically on the server side, which means the check and the delete happen as one uninterrupted operation. Redis also documents EVAL for this exact kind of server-side flow.
So the logic is simple: if the value stored in the key matches my UUID, delete it; otherwise, leave it alone. Since the check and delete happen in one flow, the “I accidentally deleted someone else’s lock after ownership changed” risk disappears.
package redislock

import (
	"context"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// Locker wraps a Redis client (single node or cluster) with token-based locking.
type Locker struct {
	client redis.UniversalClient
}

func NewLocker(client redis.UniversalClient) *Locker {
	return &Locker{client: client}
}

// Acquire tries to take the lock with SET key token NX EX ttl.
// It returns the ownership token and whether the lock was actually acquired.
func (l *Locker) Acquire(ctx context.Context, key string, ttl time.Duration) (string, bool, error) {
	token := uuid.New().String()
	acquired, err := l.client.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return "", false, err
	}
	return token, acquired, nil
}

// releaseScript deletes the key only if it still holds our token,
// so we can never delete a lock that changed owners after our TTL expired.
const releaseScript = `
if redis.call("get", KEYS[1]) == ARGV[1] then
	return redis.call("del", KEYS[1])
else
	return 0
end
`

// Release runs the check-and-delete atomically on the Redis side.
func (l *Locker) Release(ctx context.Context, key, token string) error {
	err := l.client.Eval(ctx, releaseScript, []string{key}, token).Err()
	if err != nil && err != redis.Nil {
		return err
	}
	return nil
}
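And here is roughly where this sits in the request path. ReserveSeat, SeatStore, the seat:lock: key prefix, and the 5-second TTL are hypothetical stand-ins, not the actual GigaScale handler; assume context, time, and the redislock package above are imported:
// SeatStore stands in for whatever actually persists the reservation.
type SeatStore interface {
	Reserve(ctx context.Context, seatID, userID string) error
}

// ReserveSeat returns the HTTP status code the handler would map the outcome to.
func ReserveSeat(ctx context.Context, locker *redislock.Locker, store SeatStore, seatID, userID string) (int, error) {
	key := "seat:lock:" + seatID

	token, acquired, err := locker.Acquire(ctx, key, 5*time.Second)
	if err != nil {
		return 500, err
	}
	if !acquired {
		// Someone else owns this seat right now; this is the 409 Conflict path.
		return 409, nil
	}
	// Token-checked release: we can only ever delete the lock we created.
	defer locker.Release(ctx, key, token)

	if err := store.Reserve(ctx, seatID, userID); err != nil {
		return 500, err
	}
	return 200, nil
}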
What happened under test
At some point, theory stops mattering. We left only 3 seats available in the system, then hammered the reservation endpoint with k6 using 200 VUs (virtual users) for 10 straight seconds.
That test fired 5,718 HTTP requests. In the background, k6 ran 22,872 checks. We got exactly three 200 OK responses, because there were only 3 seats that could actually be sold.
Of the rest, 5,606 requests slammed into the lock wall and got 409 Conflict. Another 106 hit the rate limit and got 429 Too Many Requests. The part I liked most was this: zero 500 Internal Server Error responses. The service did not panic, and it did not throw away consistency under pressure.
On top of that, average response time stayed around 144 ms. In other words, this mechanism did not just prevent bad reservations. It did the job under load without pushing the system into a bottleneck.
What came out of all this is not some romantic lesson. It is pretty blunt. Trying to solve concurrency problems in the application layer by doing backflips, stacking mutexes on top of mutexes, and pushing even more load onto the database is usually just fake heroism.
If you are going to build a distributed lock, the rule is simple: use token-based SET … NX EX when acquiring, and use token-validated Lua when releasing.
The real point here is not acquiring the lock. Anyone can write that part. The real test is releasing it safely. The ones who stay alive in production are the ones disciplined enough not to delete someone else’s lock on release.