The Game That Broke Its Own Leaderboard

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Veltrix.conf is a 48-hour multi-city hybrid event with a digital treasure hunt running on a single React SPA and a lightweight Go API. The hunt assigns points for QR codes scanned in the real world; the backend records every scan as a mutable score and recalculates a leaderboard every 30 seconds. We originally gave each player a monotonically increasing integer ID from a PostgreSQL sequence. The sequence was cached in the Go app with a 1-second refresh so we didnt hammer the database on every request.

What we did not anticipate was that our real-time leaderboard job, written in Python, sometimes restarted mid-window due to Kubernetes preemptible nodes. When it restarted, it re-seeded its in-memory cache from the same sequence. Within 300 milliseconds, two pods issued the same nextval(), both believing they had reserved ID 42. Back-to-back inserts collided, the unique constraint fired, and the entire hunt paused until the next leaderboard job cycle.

Worse, the sequence gap showed up on the public leaderboard page, because we used the ID for tie-breaking. Players saw identical scores with the same ID and assumed cheating.

What We Tried First (And Why It Failed)

Our first fix was to move the sequence behind a dedicated ID-service written in Rust. Every request became a network hop to get an ID. Latency rose from 0.4 ms to 2.4 ms under load; at 200 RPS that added 480 ms/s of network time and spiked p95 latency from 8 ms to 42 ms. The Rust service also introduced three new failure modes: TCP backoff, connection leaks under backpressure, and a 15-second rolling restart when jemalloc triggered. The game slowed down more than it sped up.

We then tried client-side UUID v4 generation. The Go API would accept any string up to 36 characters as a player identifier, trusting the client. Within two hours, a player discovered they could impersonate another by reusing a UUID generated on a different device. The anticheat team flagged 47 duplicate identities before we shut it down.

Finally, we rewrote the leaderboard job to use a single PostgreSQL window function that computed ranks in memory without any mutable state outside the database. It worked—until we hit 8,000 players and the window function ran for 1.8 seconds, blocking the entire event leaderboard table and timing out the API. We had traded identity collisions for blocking queries.

The Architecture Decision

We settled on PostgreSQL identity columns with local caching, but changed two critical details:

We abandoned the shared sequence. Instead, each Kubernetes pod seeded its own local counter starting at 100,000 plus the pod ordinal. The counter lived for the life of the pod; once a pod restarted, its base jumped by 100,000. This gave each pod a private 100,000-ID runway and eliminated duplicate IDs across restarts.
We split the leaderboard into two tables: player_scores_raw for raw events and player_scores_leaderboard for the public view. The raw table used the pod-local counter as the primary key. The leaderboard materialized view recomputed ranks every 30 seconds using a window function that only touched the latest score per player. We tuned the materialized view with:

CREATE MATERIALIZED VIEW player_scores_leaderboard AS
SELECT
 player_id,
 SUM(points) as total_points,
 RANK() OVER (ORDER BY SUM(points) DESC) as rank
FROM player_scores_raw
GROUP BY player_id;

We added a unique partial index on player_scores_raw(player_id) WHERE event_time > NOW() - INTERVAL '24 hours' to keep the view refresh fast.

We paid the cost: the raw table grew at 1.2 million rows per hour during peak hours, and autovacuum struggled to keep up. We set maintenance_work_mem to 2 GB and ran vacuum full nightly during off-hours. The leaderboard materialized view refreshed in 1.2 seconds under load and never blocked writes.

What The Numbers Said After

After the change, error rate on leaderboard writes dropped from 0.12% to 0.0008%. P95 latency for a scan request stayed at 12 ms; p99 rose to 35 ms but stayed under our 50 ms SLO. The Rust ID-service was decommissioned; network hops shrank by 3,000 per hour.

Storage grew 18% faster, but we gained a 7x improvement in write amplification because we stopped replaying the same events through multiple layers. The PostgreSQL autovacuum graph showed 12% lower CPU usage after we tuned maintenance_work_mem and added a nightly vacuum full job.

Player complaints about duplicate IDs vanished. The anticheat team closed 47 fraud cases in the first week, all non-reproducible after the UUID experiment stopped.

What I Would Do Differently

I would have chosen a synthetic shard key at the beginning instead of trying to retrofit identity isolation. A simple tenant_id column with values 0 and 1, seeded from the event region, would have given us 2 private counters without any magic numbers. It would have cost one extra byte per row and saved us three days of firefighting.

I also would not have trusted the client with any identifier that could be replayed. Even with JWTs and short expiry, one misconfigured load balancer can let stale tokens circulate. From now on, the API owns the identity envelope; clients can suggest a display name, but the backend issues a UUID v7 with a 10-minute expiry and stores it in a short-lived Redis cache. The cache is ephemeral; if it disappears, the client re-authenticates and gets a new token.

Finally, I would have moved to TimescaleDB hypertables for the raw event stream from day one. The compression would have cut