The Cache That Bled — How We Turned Veltrix Event Config From Silent Killer to Silent Savior

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

We assumed Postgres could keep up with event ordering. We were wrong. The real performance killer wasnt the database—it was our event configuration model.

Veltrix tracked three things per event: user_id, event_type, and timestamp. We used a GIN index on (user_id, event_type) and relied on Postgres default synchronous commit. At 300k events/sec, the index became a hotspot. WAL writes saturated the disk. P99 latency jumped to 1.2s during spikes. Engineers blamed Kafka consumer lag, but the root cause was a 30-line configuration file we never reviewed.

The silent killer wasnt the traffic—it was the default values. We inherited them from an old prototype.

What We Tried First (And Why It Failed)

First fix: shard the table by user_id. Split 20M rows into 1024 shards using hash(user_id) % 1024. The sharding code ran in a separate service, so we could hot-swap strategies. We measured:

Baseline: 300k events/sec, 1.2s p99
After sharding: 450k events/sec, 800ms p99

Thats better, but still unacceptable. The sharding code introduced a race condition: two in-flight events with the same user_id could land on different shards, breaking event ordering guarantees wed promised customers.

Next, we tried batching in the producer. Set batch.size=16384, linger.ms=200. Reduced Kafka broker load, but increased producer memory usage by 400MB per instance. At 1.2M events/sec, that meant 1.2GB extra RAM per pod—no way we could afford that in production.

We also tried async replication on Postgres. Set synchronous_commit=off. Reduced index contention, but now event ordering became probabilistic. Customers reported duplicate events in their dashboards. We rolled it back within hours.

The pattern was clear: every fix solved one problem while creating another. We needed a configuration model that matched our ordering guarantees, not one we inherited from a prototype.

The Architecture Decision

We migrated to a two-tier cache model: local LRU + shared Redis Cluster with Lua scripts for ordering.

Local LRU ran in-process with each event producer. Default size: 10,000 events. We used the go-redis library with redis-ring for consistent hashing. The Lua script ensured events were appended in order:

local last = redis.call('GET', KEYS[1])
if not last or tonumber(ARGV[1]) > last then
 redis.call('SET', KEYS[1], ARGV[1])
 redis.call('LPUSH', KEYS[2], ARGV[2])
end

The script cost 120 microseconds per event—cheap, but not free. We measured:

P99 latency before: 780ms
P99 latency after: 42ms

The tradeoff: memory. Each producer instance now used 60MB for the LRU and 10MB for the Redis connection pool. Thats 70MB per pod, but we gained ordering guarantees and reduced Postgres load by 85%.

We also changed the configuration values:

event.ordering.strategy=lua_script
postgres.sync=remote_apply
max_in_flight_per_user=100
redis.cluster.nodes=redis-cluster-0:6379,redis-cluster-1:6379,...

We documented the migration in a runbook named config-immigration.md. It took three engineers two days to roll out. The first deployment failed because a typo in the Redis cluster address caused a 15-minute outage. We fixed it by pinning the cluster address in the config file.

What The Numbers Said After

After six weeks in production:

1.2M events/sec sustained
P50: 8ms, P95: 22ms, P99: 42ms
Postgres write load: 18k tps (down from 140k)
Kafka consumer lag: 1.2s peak (down from 3.4s)
Memory per pod: 320MB (up from 200MB, acceptable for our cluster)

The most surprising metric: customer-reported duplicate events dropped to zero. Our Lua script eliminated race conditions that sharding couldnt fix.

The configuration file is now 84 lines, documented, and reviewed in every PR. No defaults. Every value is intentional.

What I Would Do Differently

I would have architected the ordering guarantee before writing the first line of code.

We started with a prototype that assumed eventual consistency was fine. It wasnt. Our customers needed strict event ordering to power their analytics dashboards. We should have written the ordering contract first, then built the system around it.

Second, I would have replaced Postgres with ScyllaDB from day one. Scyllas shard-per-core model and async writes would have handled our event volume without the cache bleed. But we inherited the stack, so we made it work.

Finally, I would never trust a default configuration file again. Every value must be intentional, documented, and reviewed. We learned that the hard way when a 30-line file nearly killed the system.