Treasure Hunt Engine: Why We Blew Up Our Config Schema at 10k QPS

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

At 5k QPS the connection pool had the default Lettuce size of 10 and the timeout was 5 seconds. The first thing I noticed was that the pool was exhausted under load spikes because the engine spawns a new coroutine for every incoming event. That forced Lettuce to create a new connection instead of reusing one, and each new connection triggered a DNS lookup that averaged 200 ms when CoreDNS scaled up. The ConnectTimeoutException stacktrace was the last straw before we throttled internally at 400 ms p99.
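
To make the exhaustion mechanism concrete, here is a minimal sketch of a bounded pool under a spike. It is a toy built on a Semaphore, not Lettuce's pool implementation; the class name BoundedPoolSketch and the 50 ms wait are assumptions chosen so the example finishes quickly.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy stand-in for a bounded connection pool; not Lettuce's implementation. */
public class BoundedPoolSketch {
    private final Semaphore permits;
    private final long maxWaitMillis;

    public BoundedPoolSketch(int maxTotal, long maxWaitMillis) {
        this.permits = new Semaphore(maxTotal);
        this.maxWaitMillis = maxWaitMillis;
    }

    /** Returns true if a slot was obtained within maxWaitMillis, false if the pool stayed exhausted. */
    public boolean borrow() throws InterruptedException {
        return permits.tryAcquire(maxWaitMillis, TimeUnit.MILLISECONDS);
    }

    /** Hands a slot back so a waiting caller can proceed. */
    public void giveBack() {
        permits.release();
    }

    public static void main(String[] args) throws InterruptedException {
        // maxTotal = 10 as in the inherited config; the 50 ms wait is illustrative only
        BoundedPoolSketch pool = new BoundedPoolSketch(10, 50);
        int served = 0;
        // Simulate a spike of 15 events that all hold their connection
        for (int i = 0; i < 15; i++) {
            if (pool.borrow()) {
                served++;
            }
        }
        System.out.println("served=" + served + " rejected=" + (15 - served));
        // prints served=10 rejected=5
    }
}
```

Every caller beyond the tenth waits the full timeout before giving up, which is how a 5 second maxWaitMillis turns a short spike into a long tail.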

The real problem wasn't Redis latency; it was the config schema we inherited from the Veltrix platform team. Every service used the same HOCON file:

redis {
  endpoints = ["redis://${REDIS_HOST}:6379"]
  pool {
    maxTotal = 10
    maxIdle = 5
    minIdle = 1
    maxWaitMillis = 5000
  }
}

The template didn't allow per-environment overrides, and it rendered ${REDIS_HOST} as the literal string $(REDIS_HOST) instead of substituting the value, which forced a DNS lookup at runtime. That was the 200 ms lookup.

What We Tried First (And Why It Failed)

My first impulse was to patch LettuceClientConfiguration to read the HOCON at runtime with Typesafe Config and build the endpoints list on the fly.

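Config config = ConfigFactory.load(); // com.typesafe.config.ConfigFactory
// endpoints is a list of URI strings, so a path like "redis.endpoints.0.host" does not resolve
String endpoint = config.getStringList("redis.endpoints").get(0);
String redisHost = java.net.URI.create(endpoint).getHost();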

The Typesafe parser added 8 ms to cold-start and 2 ms to every event under backpressure. We hit the same timeout because the DNS resolution still happened on every connection creation.
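
For anyone who wants to reproduce the lookup cost, a minimal timing sketch follows. The class name DnsTimingSketch is hypothetical, and the default host is only a placeholder. Note that the JVM caches successful lookups for a configurable period (the networkaddress.cache.ttl security property), so a second call in the same process may be served from that cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsTimingSketch {
    /** Resolves the host once and returns the elapsed wall-clock time in milliseconds. */
    public static double lookupMillis(String host) throws UnknownHostException {
        long start = System.nanoTime();
        InetAddress.getByName(host);
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pass the Redis hostname as the first argument; "localhost" is a placeholder
        String host = args.length > 0 ? args[0] : "localhost";
        System.out.printf("first lookup:  %.2f ms%n", lookupMillis(host));
        System.out.printf("second lookup: %.2f ms%n", lookupMillis(host));
    }
}
```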

Next I tried setting redis.endpoints = ["redis://redis.prod.svc.cluster.local"] in HOCON, thinking the k8s internal DNS would be faster. The DNS name resolved to an external IP because the service had not been made headless with clusterIP: None (a spec field, not an annotation), so the endpoint controller never published the Pod IPs into DNS. The connection went to the external load balancer and incurred 40–60 ms extra latency per lookup.

Then we tuned Lettuce pool to 100 connections by editing the HOCON:

pool {
  maxTotal = 100
  maxIdle = 50
  minIdle = 20
  maxWaitMillis = 2000
}

The pool exhaustion stopped, but the 8 ms config parse and the DNS flakiness remained. The p99 jumped from 400 ms to 480 ms after the pool resize, so we were still losing the latency battle.

The Architecture Decision

We scrapped HOCON for a minimal YAML file that embedded the DNS name directly and let the operator inject it via the downward API:

redis:
  endpoints:
    - redis://${REDIS_SERVICE_HOST}:${REDIS_SERVICE_PORT}
  pool:
    maxTotal: 100
    maxIdle: 50
    minIdle: 20
    maxWaitMillis: 2000

We then deployed a tiny sidecar called config-watch that subscribed to ConfigMap changes and wrote the file to an in-memory filesystem watched by the engine. No HOCON parsing at runtime, no Typesafe overhead.
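
The engine side of that handoff can be sketched as a poll-and-compare loop. The post does not describe how the engine actually watched the file, so this is an assumption: ConfigFilePoller is a hypothetical class that re-reads the file and reports whether its content changed since the last poll.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ConfigFilePoller {
    private final Path file;
    private String lastContent = null;

    public ConfigFilePoller(Path file) {
        this.file = file;
    }

    /** Re-reads the file; returns true only when the content differs from the previous poll. */
    public boolean poll() throws IOException {
        String current = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        boolean changed = !current.equals(lastContent);
        lastContent = current;
        return changed;
    }

    /** Content seen by the most recent poll, or null before the first poll. */
    public String current() {
        return lastContent;
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the file the sidecar writes
        Path tmp = Files.createTempFile("redis-config", ".yaml");
        Files.write(tmp, "maxTotal: 10".getBytes(StandardCharsets.UTF_8));
        ConfigFilePoller poller = new ConfigFilePoller(tmp);
        System.out.println(poller.poll()); // true: first read
        System.out.println(poller.poll()); // false: unchanged
        Files.write(tmp, "maxTotal: 100".getBytes(StandardCharsets.UTF_8));
        System.out.println(poller.poll()); // true: content changed
    }
}
```

Comparing content rather than modification time avoids missed updates on filesystems with coarse timestamps, at the cost of re-reading a small file on every poll.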

To eliminate the external DNS bounce, we added a PodDisruptionBudget with minAvailable: 1 and annotated the Redis headless service:

apiVersion: v1
kind: Service
metadata:
  name: redis
  annotations:
    service.alpha.kubernetes.io/tolerate-unready-endpoints: "true"
spec:
  clusterIP: None

The endpoint controller now populated the Pod IPs directly into the DNS A records. The 200 ms DNS lookup dropped to 8 ms average.

We also switched Lettuce to async connect with connection pooling disabled for writes:

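LettuceConnectionFactory factory = new LettuceConnectionFactory();
// Validate connections when they are requested
factory.setValidateConnection(true);
factory.setFastFail(true);
// Do not share one native connection; each getConnection() call gets a dedicated one
factory.setShareNativeConnection(false);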

This meant every coroutine got its own connection slot from the pool, avoiding the maxWaitMillis wait (2000 ms after the resize) when the pool was exhausted.

What The Numbers Said After

After the YAML swap and headless service change, p99 latency fell from 480 ms to 180 ms under 10k QPS. Redis connection pool usage stayed flat at 85–90 percent, and the ConnectTimeoutException stacktrace vanished from logs.

We measured the config-watch sidecar at 12 MB RSS, and it added 3 ms to cold-start, which was acceptable because restarts were rare. The sidecar itself was written in Go and shipped in a distroless image, so the attack surface stayed minimal.

On the metrics side, Prometheus showed Lettuce connection pool size stabilizing:

lettuce_connection_pool_total{state="active"} 100
lettuce_connection_pool_total{state="idle"} 50

Redis hit rate stayed above 97 percent, computed as keyspace_hits / (keyspace_hits + keyspace_misses).
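
As a worked example of that ratio, a small helper is sketched below. The counter values are hypothetical, picked to land on 97 percent; real values come from the keyspace_hits and keyspace_misses fields of Redis INFO stats.

```java
public class HitRate {
    /** keyspace_hits / (keyspace_hits + keyspace_misses); returns 0.0 when there were no lookups. */
    public static double of(long hits, long misses) {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double) hits / total;
    }

    public static void main(String[] args) {
        // Hypothetical counters: 970 hits and 30 misses
        System.out.println(of(970, 30)); // prints 0.97
    }
}
```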

What I Would Do Differently

I would never let the platform team own the base HOCON template again. The template encouraged copy-paste and discouraged overrides. Instead, we should have enforced a JSON schema for every config file and generated the YAML from that schema in CI. The schema would have caught the ${REDIS_HOST} typo at build time.
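
A build-time check along those lines could be as small as the sketch below. It is not a JSON Schema validator, only a much simpler stand-in that fails when a rendered config still contains a ${NAME} or $(NAME) token; the class name PlaceholderCheck and the sample strings are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlaceholderCheck {
    // Matches ${NAME} and $(NAME) tokens (and mismatched bracket pairs) left in rendered config
    private static final Pattern UNRESOLVED =
            Pattern.compile("\\$[{(][A-Za-z_][A-Za-z0-9_]*[})]");

    /** Returns every unresolved placeholder token found in the rendered config text. */
    public static List<String> findUnresolved(String rendered) {
        List<String> found = new ArrayList<>();
        Matcher m = UNRESOLVED.matcher(rendered);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String bad = "redis://$(REDIS_HOST):6379";
        String good = "redis://redis.prod.svc.cluster.local:6379";
        System.out.println(findUnresolved(bad));  // prints [$(REDIS_HOST)]
        System.out.println(findUnresolved(good)); // prints []
    }
}
```

In CI this would run against the fully rendered file for each environment, failing the build when the returned list is non-empty.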

Second, I would have insisted on a dedicated headless Redis cluster per environment instead of sharing one. The shared cluster was a cost-cutting measure that added 15 ms of extra hop when namespacing keys, and the headless service annotation was forgotten until prod broke. Dedicated clusters would have cost an extra 300 USD/month but saved us two outages.

Finally, I would not have