Veltrix Events Were Sending Our PagerDuty Into Cardiac Arrest

#webdev #programming #architecture #systems

The Problem We Were Actually Solving

The Veltrix event pipeline was originally designed to handle user-triggered actions: button clicks, form submissions, that kind of thing. At 200 events/second it ran fine, but once we hit 2 000 events/second the tail latency on GET /events/{entityId} climbed past 800 ms. The real pain point wasnt throughput—it was the way our consumers treated the endpoint. They assumed they could poll every 500 ms for new events, pull 200 entities in one call, and get sub-second responses. Our first-generation Postgres schema had a single events table with a JSONB payload column, an index on (entity_id, created_at), and a GIN index on tags for filtering. In hindsight, that JSONB column was a premature optimisation; every SELECT pulled 4 KB of JSON even when the consumer only needed the event id and type. The query planner ignored the GIN index because the cost of deserialising each JSONB blob outweighed the selectivity gain. Result: 40 % of our p95 latency was spent in tsl_heap_getpage and 25 % in jsonb_deconstruct. Meanwhile, the Postgres connection pool at 120 connections ran out of idle_in_transaction sessions every night at 03:17 because our Node consumers opened transactions but never closed them when the upstream service returned 503.

What We Tried First (And Why It Failed)

We splurged on an Elasticsearch cluster thinking it would solve the filtering problem. We wrote a nightly ETL job that ingested the entire event table, parsed the JSONB, and re-indexed by entity_id, event_type, and custom tags. The reindexing took 47 minutes and consumed 450 GB of heap before we hit OOM on node-1. We tuned the circuit breaker and saw index time drop to 12 minutes, but our 95th percentile latency still spiked to 1.4 s because the shards were on magnetic disks in us-east-1a. Switching to gp3 volumes shaved 300 ms, but the real killer was the mapping explosion: we had 5 000 dynamic mappings after five weeks because product kept adding new event types. Elasticsearch 7.15 couldnt keep up with the shard-split storms; we saw CoordinatingNodeAllocationExceeded every time a shard moved. The operations team spent two weeks tweaking cluster.allocate.timeout and indices.recovery.max_bytes_per_sec before they finally stabilised, but by then the business had lost confidence and demanded a rollback.

The Architecture Decision

We replaced the monolith with three services:

EventWriter: single writer to Postgres 15 with logical decoding to Debezium. We sharded the events table by entity_id % 32 to keep tables under 250 GB. The Debezium connector publishes to a Kafka topic called events.raw with 64 partitions.
EventCache: a Go service that consumes events.raw and builds a materialised view per entity. The view is an in-memory sharded map (entity → ring buffer of last 1 000 events). We use a RocksDB write-ahead log so cache restarts are under 5 s. The service exposes gRPC endpoints GetEvents(entityId, sinceCursor, limit) and StreamEvents(entityId, cursor).
EventIndex: a ClickHouse cluster (3 nodes, 2 TB NVMe each) that subscribes to events.raw via Kafka and runs a two-level merge tree: one partition per day, one sort key on (entity_id, created_at). We denormalised everything—no joins, just wide rows. Consumers that need complex filtering (date range, tag intersection) hit EventIndex; real-time UI clients hit EventCache.

We paid the price of eventual consistency: a new event can take up to 200 ms to appear in EventCache and up to 15 s to appear in EventIndex. For our product, that latency is acceptable because the actual use case is a treasure hunt where users only see the last event every 5 s anyway. We measured the end-to-end p99 at 280 ms, which was under our SLO of 500 ms.

What The Numbers Said After

Post-deployment week 1:

P95 latency for /events/{entityId} dropped from 820 ms to 120 ms.
Postgres CPU utilisation fell from 65 % to 18 % because EventWriter only writes, no reads.
Kafka lag stayed zero; Debezium connector processed 2.1 M events/minute at 0.8 ms/event.
ClickHouse ingestion lag peaked at 12 s during back-pressure events but never exceeded 30 s.
Our PagerDuty alerts for high latency went from 14 per week to 0 for the first month.

What I Would Do Differently

If I rebuilt this today, I would skip the ClickHouse tier entirely. Our filtering needs never exceeded simple per-entity streams plus a global offset. The ClickHouse cluster cost us $4 200/month in us-west-2 alone, and 90 % of queries were single-entity fetches that were already cached by EventCache. Instead, I would push the entire materialised view into EventCache and let consumers poll a single in-memory shard. For the remaining 10 % who need analytics, I would stream events.raw directly to a Snowflake table through Kafka Connect and let SQL do the work. The hardware budget saved would pay for an extra availability zone, which is where our next outage came from when us-west-2a melted during an AWS heatwave.

DEV Community

Veltrix Events Were Sending Our PagerDuty Into Cardiac Arrest

Top comments (0)