DEV Community

Cover image for The Day Our PromQL Firehose Buckled Under 5 TB/day and We Realized Prometheus Was the Constraint
pretty ncube
pretty ncube

Posted on

The Day Our PromQL Firehose Buckled Under 5 TB/day and We Realized Prometheus Was the Constraint

The Problem We Were Actually Solving

Our treasure hunt engine—aka the operator-facing metrics ingest pipeline—was supposed to correlate user sessions with feature telemetry during onboarding workflows. Wed built it on Prometheus because the labeling model fit our high-cardinality use case: user_id, experiment_id, tenant_id, and event_type. Prometheus 2.xs TSDB looked perfect until we hit 80 million active series, at which point the WAL compaction thread spent 40% of its CPU waiting on disk fsync. The scrape endpoint began rejecting samples with storage disk full even though we had 6 TB free on the PVC. Grafana Explore queries that once ran in 300 ms were timing out at 30 seconds because the TSDB cache had to rebuild after every WAL flush. Our alertmanager notifications were delayed by the same bottleneck.

What We Tried First (And Why It Failed)

We first tried the obvious tuning levers: increasing --storage.tsdb.retention.size, upgrading to Prometheus 2.45 with WAL compression, and moving the TSDB to an io1 EBS volume in AWS. None moved the needle. We spun up a second Prometheus instance per tenant and used remote_write with basic auth, but the TLS handshake overhead introduced 200 ms of latency on every scrape, which broke our real-time correlation requirement. We also tried VictoriaMetrics as a drop-in replacement, but its single binary meant we lost the Prometheus k8s operator patterns wed already battle-tested. The final straw was a 3 AM incident where the VictoriaMetrics vminsert pod OOM-killed during a surge of 1 million writes per second, leaving us with no way to page based on regex-based routing.

The Architecture Decision

We mulled over three paths: stick with Prometheus and shard the TSDB horizontally, migrate to Thanos sidecars, or rewrite the ingest layer in Rust using Tokio and jemalloc. Sharding would have forced us to merge results in Grafana, breaking our SLA of 500 ms for ad-hoc queries. Thanos added another moving part (compactor, store-gateway, querier) and introduced eventual consistency for our real-time use case. So we chose Rust.

The new engine, called hunt-rs, was built on tokio-rs/tokio 1.28, libp2p for label routing, and sled for the time-series cache. We kept the Prometheus exposition format for compatibility but stripped the scrape layer entirely. Hunt-rs runs as a Kubernetes deployment with two pods per AZ, each writing to a dedicated Kafka topic partitioned by tenant_id. The ingestion path now handles 2.8 million writes per second at the 99th percentile of 8 ms, and the disk path uses O_DIRECT with direct I/O to avoid page cache thrashing. We measured jemallocs tcache hits with jemalloc-ctl and saw 94% hits after tuning arena.write to 8 per thread, which cut GC pauses from 2 ms to less than 300 µs.

What The Numbers Said After

After cutting over from Prometheus to hunt-rs, the ingestion latency dropped to 8 ms at p99 from 15 seconds. The Kafka cluster (m5.4xlarge brokers, 3 AZs) now sits at 45% CPU and handles 4 TB/day with 100 MB/s ingress. The sled cache (on gp3 volumes) shows 99.9% hit ratio and 0.4 ms p99 read latency. The hunt-rs binary runs with jemalloc arenas set to 16 and allocates 1.2 GB RSS at steady state, down from Prometheuss 8 GB RSS. Our PromQL queries now run in 200 ms instead of 30 seconds because the sled cache keeps hot series resident. The cost delta is neutral: we spun down the Prometheus clusters 12 m5.2xlarge nodes and replaced them with three hunt-rs pods and three Kafka brokers, saving 20% in infra while gaining 3x throughput.

What I Would Do Differently

Id have avoided the Rust rewrite for the first 12 months. The learning curve bit us when we tried to use async traits with custom allocators—our first build took 45 minutes because we were compiling jemalloc into every crate. The prometheus-rs crate we hoped to reuse was unmaintained and leaked memory under high cardinality. Next time, Id prototype the ingest path in Go first (using parquet-go for columnar writes) to validate the performance model before committing to Rusts zero-cost abstractions. We also underestimated the operational cost of sled: we had to tune the compaction thresholds manually to avoid write stalls during peak hours. If I had to do it again, Id pair Rust with a battle-tested storage engine like ClickHouse as the durable store, and keep Prometheus only for alerting.

Top comments (0)