The Day Veltrix Scaled to 1,200 RPS and Crashed Because We Read the Docs

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

Last year we built a treasure-hunt game server using Veltrix 3.7 to handle real-time event scoring, user progress tracking, and global leaderboard updates. Our initial traffic model assumed 150 concurrent players, but within four weeks we hit Black Friday sales and 3,000 simultaneous rooms, roughly 1,200 requests per second at peak. The server warmed up for about six minutes, then every database connection began timing out with a too-many-connections error (the server was capped at 100 connections). Veltrix's official docs did not mention how its configuration layer, veltrix.toml, maps to actual database pool sizing, nor did they warn that the default pgbouncer.ini silently creates a new pool for every HTTP worker instead of a single shared pool. We assumed the docs described the real behavior. They did not.
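To make the per-worker pooling problem concrete, here is a minimal sketch of the connection arithmetic. The pod count, workers-per-pod count, pool size, and function names are all illustrative assumptions (the post does not state the worker layout); only the cap of 100 connections comes from the incident described above.

```python
def per_worker_connections(pods: int, workers_per_pod: int, pool_size: int) -> int:
    """Backend connections requested when every HTTP worker owns a private pool."""
    return pods * workers_per_pod * pool_size


def exceeds_limit(connections: int, server_limit: int) -> bool:
    """True when the database cannot grant all requested connections."""
    return connections > server_limit


# Hypothetical deployment: 3 pods x 8 HTTP workers, each with a pool of 20.
demand = per_worker_connections(pods=3, workers_per_pod=8, pool_size=20)
print(demand)                      # 480
print(exceeds_limit(demand, 100))  # True
```

Under these assumed numbers the demand is several times the server cap, which is the shape of failure a single shared pool avoids: a shared pool's demand stays fixed no matter how many workers sit in front of it.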

What We Tried First (And Why It Failed)

We started by copying the sample veltrix.toml that ships with the engine. It contained these two sections:

[database]
pool_size = 20

[cache]
workers = 4

We tried increasing pool_size to 100, which immediately exhausted the Linux open-file limit (nofile=65535, of which only 49152 remained available after Postgres was added). Then we switched to a session-mode pgbouncer pool configured statically in pgbouncer.ini:

[databases]
hunt = host=127.0.0.1 port=5432 dbname=hunt

[pgbouncer]
default_pool_size = 50
max_client_conn = 2000

This change introduced a new error: every Postgres unique_violation (SQLSTATE 23505) on the leaderboard table started surfacing as HTTP 500 instead of HTTP 409, because, as far as we could tell, the actual SQLSTATE was being swallowed at the pgbouncer layer. We lost 18% of correct responses until we configured server_reset_query = DISCARD ALL in pgbouncer.ini to force a connection reset after each leaderboard write. None of these behaviors appear in the Veltrix documentation; we only discovered them by running a 30-minute wrk2 load test against a single-node Postgres 15.4 with pg_stat_statements enabled.
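The 500-versus-409 symptom is easier to see with a small sketch of the mapping an HTTP handler might apply. This is not Veltrix's API: the function name and table are hypothetical, and mapping 53300 to 503 is my own illustrative choice. The two SQLSTATE values themselves are standard PostgreSQL error codes.

```python
from http import HTTPStatus
from typing import Optional

# Standard PostgreSQL SQLSTATE codes.
UNIQUE_VIOLATION = "23505"
TOO_MANY_CONNECTIONS = "53300"

# Hypothetical mapping table; the 503 entry is an illustrative assumption.
SQLSTATE_TO_HTTP = {
    UNIQUE_VIOLATION: HTTPStatus.CONFLICT,                 # 409
    TOO_MANY_CONNECTIONS: HTTPStatus.SERVICE_UNAVAILABLE,  # 503
}


def http_status_for(sqlstate: Optional[str]) -> HTTPStatus:
    """Map a SQLSTATE to an HTTP status; unknown or missing codes become 500."""
    if sqlstate is None:
        return HTTPStatus.INTERNAL_SERVER_ERROR
    return SQLSTATE_TO_HTTP.get(sqlstate, HTTPStatus.INTERNAL_SERVER_ERROR)


print(int(http_status_for("23505")))  # 409
print(int(http_status_for(None)))     # 500
```

If the SQLSTATE never reaches the handler, a mapping like this has nothing to match on and falls through to 500, which matches the behavior we observed.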

The Architecture Decision

We decided to point all HTTP workers at a single shared pool of PostgreSQL connections managed by PgCat instead of pgbouncer. PgCat is a Rust-based connection pooler that supports prepared-statement caching across workers and exposes a /metrics endpoint on port 6432. We rewrote veltrix.toml to disable internal pooling:

[database]
pool_mode = "off"
pool_size = 0

[cache]
workers = 1

We then configured PgCat with these exact settings:

[pools.default]
user = "hunt_user"
database = "hunt"
pool_size = 256
statement_timeout = 2000
prepared_statements = true

The tradeoff was architectural complexity: we now run one more container (pgcat:latest) and lose the ability to auto-scale Veltrix pods independently from the pool size. However, the single pool reduced TCP handshake overhead from ~1,200 new connections per second to ~256 persistent ones, and the prepared-statement cache cut leaderboard UPDATE latency from 14 ms to 3 ms at 1,000 RPS.
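As a rough sanity check on those latency numbers, Little's law estimates how many connections are busy on average at a given request rate. This is a back-of-the-envelope sketch, assuming one leaderboard UPDATE per request and steady arrivals. It describes average load, not burst peaks, so it is not a recommendation for the pool size. The function name is illustrative.

```python
def avg_busy_connections(requests_per_second: float, latency_ms: float) -> float:
    """Little's law: average in-flight queries = arrival rate x time per query."""
    return requests_per_second * latency_ms / 1000.0


before = avg_busy_connections(1000, 14)  # 14 ms per UPDATE without the cache
after = avg_busy_connections(1000, 3)    # 3 ms per UPDATE with the cache
print(before)  # 14.0
print(after)   # 3.0
```

The shorter each statement holds a connection, the fewer connections are occupied at once, which is part of why a capped shared pool could keep up.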

What The Numbers Said After

After migration we ran a 60-minute load test with 5,000 concurrent rooms targeting 2,000 RPS. Key metrics from Prometheus and PgCat's /metrics:

Average PostgreSQL query duration: 8.2 ms (down from 22 ms)
P99 latency for leaderboard read: 142 ms (down from 412 ms)
Database connections held steady at 256 (capped by PgCat)
No unique_violation errors surfaced to clients as HTTP 500, because PgCat preserves SQLSTATE
Memory usage per Veltrix pod dropped from 1.4 GB to 890 MB (freed by disabling internal pool)
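For readers who prefer relative numbers, the before/after figures above work out as follows. This is a small worked calculation, assuming 1.4 GB is treated as 1400 MB. The helper name is illustrative.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100.0


# (before, after) pairs taken from the metrics listed above.
metrics = {
    "avg query duration (ms)": (22.0, 8.2),
    "leaderboard read P99 (ms)": (412.0, 142.0),
    "memory per pod (MB)": (1400.0, 890.0),  # assumes 1.4 GB = 1400 MB
}

for name, (old, new) in metrics.items():
    print(f"{name}: {pct_reduction(old, new):.1f}% lower")
# avg query duration (ms): 62.7% lower
# leaderboard read P99 (ms): 65.5% lower
# memory per pod (MB): 36.4% lower
```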

We also saw a curious side effect: the Veltrix engine's internal cache workers parameter became irrelevant because all cache traffic now goes through a single Redis 7.0 cluster with 9 ms P99. That's not in the docs either.

What I Would Do Differently

I would have benchmarked pgbouncer with prepared statements off from day one. The out-of-the-box pgbouncer does not cache prepared statements, so every UPDATE on the leaderboard re-created the same statement, wasting 4–6 ms per write under load. If we had tested that earlier, we might have been able to keep pgbouncer, with the server_reset_query = DISCARD ALL setting we already had plus a fix for the prepared-statement churn, and avoid the PgCat migration.

I would also refuse to let any service, even a vendor one, auto-create database pools per worker. That pattern belongs in 2012. Finally, I would embed a load-test warning in our onboarding runbook: any system claiming linear scale must provide concrete numbers for connection reuse and prepared-statement caching, not just an SLA.
