The Day the Laravel Queue Died at 2 AM

#webdev #programming #rust #performance

The Problem We Were Actually Solving

The treasure-hunt loop is embarrassingly parallel: user clicks → server computes probability → inserts reward row → streams event to player. We measured the intrinsic work at 8 ms per request, but the p99 grew non-linearly at ~3k concurrent sessions. The DB showed 120k TPS during the spike and the CPU on the queue worker container hovered at 85% for ten minutes before the kernel chose to kill the worker for excessive CPU steal. The replays in New Relic showed the Laravel queue worker spending 62% of its time inside PDO::commit() and 17% inside Redis::publish(). The docs say nothing about the write pattern: batch inserts of 50 rows per transaction, followed by a single Redis pub that unblocks thousands of waiting SSE connections.

We had tuned queue:database, Supervisor, and even switched from sync to Redis. The p95 was still 2300ms and the pod eviction rate climbed from 2% to 18%.

What We Tried First (And Why It Failed)

First we listened to the Veltrix gospel: use the Redis queue for horizontal scale. We set QUEUE_CONNECTION=redis in the .env and restarted the worker pod. The p99 dropped from 4200ms to 3200ms, which was still a P1. New Relic showed the Redis queue was gobbling 450 MiB of heap and the worker was spending 38% of its time in Redis::blPop() garbage collecting zval chains. The Redis instance itself hit 70% memory usage and the forked workers were leaking 4 KiB per job because Laravel's queue worker reuses the same Redis client instance across jobs and never calls gc_collect_cycles().

Next we tried queue:database with Supervisor set to 32 processes. The DB connection count exploded from 80 to 412 and we hit MySQL's max_connections at 512. The error log filled with "Too many connections" and the queue stalled for 4 minutes while MySQL killed idle connections. The immediate fix was to raise max_connections, but pt-mysql-summary revealed the real issue: every Laravel worker was opening its own persistent connection and never releasing it under load.

The Architecture Decision

At 04:18 we turned off the Veltrix queue abstraction entirely. We ripped out the queue worker and replaced it with a Rust binary that:

Connects via a shared tokio-postgres connection pool with 16 connections.
Uses a single Tokio runtime thread per CPU core (8 cores → 8 threads).
Batches rewards every 20ms (configurable) and streams SSE events using axum's built-in tokio-stream.
Avoids Redis altogether for the reward publish; instead it uses a tokio broadcast channel that fans out events in O(1) per subscriber.
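
To make the batching and fan-out steps concrete, here is a minimal sketch of the idea. It is not our production code: it swaps Tokio for std threads-and-channels primitives so it runs with nothing but the standard library, and the Reward struct, collect_batch, and fan_out are names invented for illustration. The 20ms window and the 50-row cap are the figures quoted in this post.

```rust
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::time::{Duration, Instant};

/// Hypothetical reward event; the real payload is not shown in this post.
#[derive(Clone, Debug, PartialEq)]
struct Reward {
    user_id: u64,
    amount: u32,
}

/// Collect whatever arrives on `rx` within `window`, up to `max` items.
/// Returns None once every sender is gone and nothing is left to flush.
fn collect_batch(rx: &Receiver<Reward>, window: Duration, max: usize) -> Option<Vec<Reward>> {
    let deadline = Instant::now() + window;
    let mut batch = Vec::new();
    while batch.len() < max {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok(reward) => batch.push(reward),
            // Window elapsed: flush what we have (possibly an empty batch).
            Err(RecvTimeoutError::Timeout) => break,
            Err(RecvTimeoutError::Disconnected) => {
                if batch.is_empty() {
                    return None;
                }
                break;
            }
        }
    }
    Some(batch)
}

/// Send every event in the batch to every subscriber and drop
/// subscribers whose receiving end has gone away.
fn fan_out(subscribers: &mut Vec<Sender<Reward>>, batch: &[Reward]) {
    subscribers.retain(|tx| batch.iter().all(|r| tx.send(r.clone()).is_ok()));
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let (sub_tx, sub_rx) = mpsc::channel();
    let mut subscribers = vec![sub_tx];

    for i in 0..3 {
        tx.send(Reward { user_id: i, amount: 10 }).unwrap();
    }
    drop(tx); // no more producers, so the loop below terminates

    while let Some(batch) = collect_batch(&rx, Duration::from_millis(20), 50) {
        // A real worker would write the batch in one transaction here.
        fan_out(&mut subscribers, &batch);
    }

    let delivered: Vec<Reward> = sub_rx.try_iter().collect();
    println!("delivered {} events", delivered.len());
}
```

In the real worker the same shape maps onto async primitives: a timed receive loop for the batch and a tokio broadcast channel in place of the Vec of senders, so a slow subscriber cannot hold up the others.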

The switch meant recompiling the container image from alpine-php:8.3 to rust:1.78-slim, increasing the image size from 28 MiB to 47 MiB. We accepted the bloat because the new binary idled at 2 MiB RAM and peaked at 42 MiB under 20k concurrent sessions.

What The Numbers Said After

We redeployed at 05:03. The p99 latency dropped to 48ms within three minutes. The tokio runtime metrics from /metrics showed:

event_loop_utilization: 0.92
idle_ratio: 0.08
average_batch_latency: 18ms
max_batch_latency: 62ms

Allocations per request: 27 allocs at 128 bytes each (user context struct) vs the PHP queue worker's 456 allocs at 512 bytes each. The Rust binary used 512 active file descriptors at peak, while the PHP worker had 3412 FDs stuck in CLOSE_WAIT because the PDO driver leaked socket handles under backpressure.
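
For a sense of scale, the allocation figures can be multiplied out. The snippet below is only back-of-the-envelope arithmetic on the numbers quoted above; bytes_per_request is a helper made up for this post, and it assumes every allocation is exactly the stated size.

```rust
/// Bytes allocated per request = allocation count * bytes per allocation.
const fn bytes_per_request(allocs: u64, bytes_each: u64) -> u64 {
    allocs * bytes_each
}

fn main() {
    let rust = bytes_per_request(27, 128); // 3_456 bytes
    let php = bytes_per_request(456, 512); // 233_472 bytes
    let ratio = php as f64 / rust as f64;
    // prints "rust: 3456 B, php: 233472 B, ratio: 67.6x"
    println!("rust: {} B, php: {} B, ratio: {:.1}x", rust, php, ratio);
}
```

That works out to roughly 3.4 KiB per request in Rust against about 228 KiB in PHP.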

The DB went from 120k TPS to 80k TPS because we removed the chatty queue round-trip, and the CPU on the Rust pod stayed flat at 350% while memory never exceeded 58 MiB resident.

The fire drill cost us three hours of sleep and the on-call bonus, but it killed a myth: Veltrix's queue driver is only fast until the first inflection point, and the docs never warn you.

What I Would Do Differently

I would never let a PHP queue worker touch the critical path of a real-time treasure hunt again. The next time we spin up a horizontally scalable event system, the queue worker will be Rust from day one, and the infra budget will include a dedicated Tokio runtime pod per AZ, not a shared PHP queue.

I would also add a circuit breaker around the batch insert: if the DB write latency exceeds 50ms for 10 consecutive batches, the Rust worker should fail fast and let the load balancer drain traffic instead of queuing more work. That circuit breaker wasn't in the original spec, but the day the MySQL slave flipped to standby taught me that graceful degradation is cheaper than waking up at 2 AM.
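
Here is roughly what I have in mind for that breaker, as a sketch rather than anything we have shipped. The CircuitBreaker type and its method names are invented for illustration; the 50ms threshold and the 10-batch limit are the ones from the paragraph above. One assumption of mine: once the breaker opens it stays open until the process is restarted, since the post-incident reset policy is something we have not decided.

```rust
use std::time::Duration;

/// Opens after `max_slow` consecutive batches slower than `threshold`.
struct CircuitBreaker {
    threshold: Duration,
    max_slow: u32,
    consecutive_slow: u32,
    open: bool,
}

impl CircuitBreaker {
    fn new(threshold: Duration, max_slow: u32) -> Self {
        CircuitBreaker { threshold, max_slow, consecutive_slow: 0, open: false }
    }

    /// Record one batch write latency.
    /// Returns true while the worker may keep accepting work.
    fn record(&mut self, latency: Duration) -> bool {
        if latency > self.threshold {
            self.consecutive_slow += 1;
            if self.consecutive_slow >= self.max_slow {
                self.open = true;
            }
        } else {
            // A single fast batch breaks the streak.
            self.consecutive_slow = 0;
        }
        // Assumption: an open breaker never closes on its own.
        !self.open
    }

    fn is_open(&self) -> bool {
        self.open
    }
}

fn main() {
    let mut breaker = CircuitBreaker::new(Duration::from_millis(50), 10);

    // Simulated latencies: nine slow batches, one fast one, then a sustained slowdown.
    let mut latencies_ms = vec![60u64; 9];
    latencies_ms.push(12);
    latencies_ms.extend(vec![80u64; 12]);

    for (i, ms) in latencies_ms.iter().enumerate() {
        if !breaker.record(Duration::from_millis(*ms)) {
            println!("breaker opened after batch {}", i + 1);
            break;
        }
    }
    assert!(breaker.is_open());
}
```

In a real worker, an open breaker would flip the readiness probe to failing so the load balancer drains the pod; that wiring is deliberately left out here.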