The Query Engine That Taught Me Why We Should Never Trust a Demo

#devops #kubernetes #webdev #programming

The Problem We Were Actually Solving

In late 2023, Veltrixs growth team wanted to launch a treasure-hunt feature: users submit a JSON payload, the engine returns the best-matching SKU, discount, and cross-sell vector in under two seconds. Marketing needed a parameter-driven system so non-engineers could tweak weights, boosts, and exclusions without a deployment.

The original design used a single PostgreSQL 15 table with 12 JSONB columns for rules, a gin index on the payload hash, and a plpgsql function that executed a dynamic WHERE clause. The first demo worked because we seeded the database with 100 rows and hard-coded the weights. On staging, with 1.2 million rows, the planner chose a seq scan that took 2.3 seconds—still acceptable for a demo. The growth team didnt mention that Black Friday traffic could hit 200 concurrent hunts per second.

What We Tried First (And Why It Failed)

We tried sharding the table by tenant_id, but the dynamic WHERE clause used tenant_id = ANY(tenant_ids) inside a JavaScript UDF, which the planner refused to push down. P99 latency jumped to 12 seconds.

Next, we moved the weights into a separate weights table and joined with lateral. That worked until we hit the silent killer: lock escalation. The lateral join under high concurrency tried to acquire 1.2 million row-level locks, which Postgres promoted to an ACCESS EXCLUSIVE lock on the weights table. The AUTOVACUUM daemon kicked in during the query, froze the weights table for eight minutes, and queued every hunt request behind an eight-minute garbage-collection stall.

We tried pg_partman to shard by tenant_id and daily, but the planner still did partition elimination on the lateral join, which Postgres couldnt prune because the weights table had no tenant_id column. We learned that lesson when a single hunt with a high-weight vector triggered a partition scan of 264 tables.

Our final stop was a Redis cache keyed by a hash of the parameters. The cache hit rate was 82 % during steady state, but the 18 % cache misses re-executed the entire dynamic WHERE clause, and the planners cost estimate was so far off that it chose a seq scan on 1.2 million rows. We capped out at 420 concurrent hunts per second before the instances CPU throttled and customer SLAs fell off a cliff.

The Architecture Decision

We needed to kill the dynamic WHERE in PostgreSQL. Instead, we rebuilt the engine as a two-stage pipeline:

Pre-compute a materialized view named mv_treasure_hunt_rules that flattened every possible parameter combination into columns:
score_brand_boost, score_inventory, score_geo_weight, etc.
We added a generated column rule_hash as a surrogate key.
Store only rule_hash, payload_hash, and tenant_id in the active table. The actual hunt became a simple lookup by rule_hash with a pre-computed join to the tenants current weights via a foreign key. No dynamic SQL, no lateral joins, no planner surprises.

We switched from PostgreSQL to TimescaleDB hypertables partitioned by tenant_id and time, with a BRIN index on rule_hash. The materialized view refresh ran every 60 seconds during off-peak using pg_cron.

We also replaced the Redis cache with an in-process Guava cache in the Java service, sized at 10,000 entries with a ten-second TTL. The Guava cache eliminated the cross-process latency spike we saw with Redis under GC pressure.

The new pipeline ran the same hunt in 350 ms at 900 concurrent hunts per second, with p95 at 450 ms and p99 at 620 ms.

What The Numbers Said After

The first full week after deployment, the AUTOVACUUM daemon ran for 12 fewer minutes total across the cluster. The number of lock timeouts dropped from 472 per day to 12. The hunt engines error rate fell to 0.02 %, and the query planner never chose a seq scan on the active table again.

We kept the older PostgreSQL instance as a read replica for historical reports, which still suffers from the same lateral-join lock escalation. That replica is now EOL and scheduled for decommissioning next quarter.

What I Would Do Differently

I would not let the growth team dictate the schema. Their demo used 100 rows; production needed 1.2 million. I would push back and insist on a materialized view from day one, even if it meant an extra nightly job.

I would never place weights in a separate table without a tenant_id column. The planner cannot prune partitions on a lateral join when the join key is missing from the joined table.

I would test the AUTOVACUUM interaction under load before go-live. A simple pgbench run with concurrent VACUUM FULL would have revealed the lock escalation risk. We didnt discover it until Black Friday traffic exposed it.

Finally, I would not use Redis as a fallback cache for a query engine that already has a planner with a history of underestimating cost. In-process caches with bounded TTL are safer when your planner is a gambler.