Gabriel Anhaia

Posted on May 24

Laravel Queues at 1M Jobs/Day: The 4 Bottlenecks Nobody Documents

#laravel #queues #redis #performance

Book: Decoupled PHP — Clean and Hexagonal Architecture for Applications That Outlive the Framework

You ship the queue on a Tuesday. Redis, two workers, a handful of jobs per second. Everything is green. Six months later the product is doing a million jobs a day and the dashboard turns red on a Friday at 3pm. Memory keeps climbing on one worker. Another worker is idle. Latency on dispatch is suddenly 40ms. The on-call engineer is staring at redis-cli LLEN queues:default and watching the number not move.

The thing nobody writes down is that the four bottlenecks hit in a specific order. Add workers, you trip the serializer. Fix the serializer, memory growth eats you. Fix memory, the failed_jobs table becomes the slowest write in your system. You fix them in that order or you waste a quarter chasing the wrong one.

Here's the order, with code you can paste.

The Laravel queue model in 60 seconds

A Laravel job is a PHP object. Queue::push($job) serializes the object, wraps it in a JSON payload, and RPUSHes it onto a Redis list (queues:default by default, plus any configured Redis prefix). A worker process runs php artisan queue:work, which pops from the head of that list (LPOP, or a blocking BLPOP when block_for is set), deserializes the payload, and runs handle(). The worker exits after --max-jobs or --max-time, gets restarted by supervisor, repeats.

Three places where work happens: dispatch (serialize + push), broker (Redis stores the string), worker (pop + deserialize + execute). All three can be the bottleneck. You don't know which one until you measure.

A real job looks like this: a ProcessOrderJob that pulls a 12 KB JSON blob from the queue:

<?php

namespace App\Jobs;

use App\Models\Order;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class ProcessOrderJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $tries = 3;
    public int $timeout = 90;

    public function __construct(public Order $order) {}

    public function handle(): void
    {
        // pricing, fulfillment, email. The usual.
    }
}

SerializesModels is the trait that turns the Order instance into {class: Order, id: 4421} on push and re-fetches it on pop. That is the second bottleneck in disguise, and we'll come back to it.
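To make the swap concrete, here is a plain-PHP sketch of the same idea. It is not Laravel's implementation: OrderStore, Order::find(), and ProcessOrder are hypothetical stand-ins, and PHP's __serialize/__unserialize hooks play the role the trait plays in the framework.

```php
<?php
declare(strict_types=1);

// Hypothetical stand-in for a database table: id => row.
final class OrderStore
{
    /** @var array<int, array{total: int, notes: string}> */
    public static array $rows = [];
}

final class Order
{
    public function __construct(
        public int $id,
        public int $total,
        public string $notes,
    ) {}

    public static function find(int $id): self
    {
        $row = OrderStore::$rows[$id];
        return new self($id, $row['total'], $row['notes']);
    }
}

final class ProcessOrder
{
    public function __construct(public Order $order) {}

    // On push: keep only the identifier, drop the loaded data.
    public function __serialize(): array
    {
        return ['order_id' => $this->order->id];
    }

    // On pop: one extra lookup per job to rebuild the object.
    public function __unserialize(array $data): void
    {
        $this->order = Order::find($data['order_id']);
    }
}

OrderStore::$rows[4421] = ['total' => 1999, 'notes' => str_repeat('x', 10_000)];
$job = new ProcessOrder(Order::find(4421));
$wire = serialize($job);                 // small: identifier only
OrderStore::$rows[4421]['total'] = 2500; // the row changes while the job waits
$restored = unserialize($wire);          // sees the current row, not the old one
```

The wire format stays tiny no matter how large the order is, but every pop pays for a lookup, and the worker sees whatever the row looks like at pop time rather than at dispatch time.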

Bottleneck 1: Worker count vs CPU

The Laravel docs say queue:work runs a worker. They don't say how many. Most teams ship with two or three because that's what the tutorial has. Above roughly 10k jobs per hour, two workers is wrong.

The math is simple. If your average job takes 200ms (network call, two DB writes, a Stripe webhook), one worker handles 5 jobs/second, so 18,000 jobs/hour. Two workers cap out around 36k/hour. A million jobs/day is roughly 41,700 jobs/hour averaged (about 11.6 jobs/second), and real traffic is peaky. Your peak hour is often 4x the average. So a 1M/day app needs to survive about 167k jobs/hour, or 46 jobs/second. Two workers can't even keep up with the average; at 200ms per job the peak needs around ten.
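The arithmetic is worth rerunning with your own numbers. A minimal sketch, assuming each worker runs one job at a time and every job takes the average duration; workersNeeded() is a hypothetical helper, not a Laravel function. A day has 86,400 seconds, so 1M jobs/day averages about 11.6 jobs/second.

```php
<?php
declare(strict_types=1);

/**
 * Workers needed to keep up with peak load, assuming each worker runs
 * one job at a time and every job takes $avgJobMs on average.
 */
function workersNeeded(int $jobsPerDay, float $avgJobMs, float $peakFactor): int
{
    $peakJobsPerSecond = $jobsPerDay / 86_400 * $peakFactor;
    $jobsPerWorkerPerSecond = 1000 / $avgJobMs;

    return (int) ceil($peakJobsPerSecond / $jobsPerWorkerPerSecond);
}

$averagePerHour = 1_000_000 / 24;                 // about 41,667 jobs/hour
$workers = workersNeeded(1_000_000, 200.0, 4.0);  // 10
```

Ten is the floor, not the target: it leaves no headroom for slow jobs, retries, or a worker restarting mid-peak.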

The naive fix is to crank workers to 32 and call it done. That trips a different limit: each worker is a PHP process holding around 60-80 MB of bootstrapped Laravel. 32 workers is 2 GB of memory before any job runs. On a 4-vCPU box, 32 workers also mean 8x oversubscription. They thrash.

The rule that works: start at workers = vCPU count × 2 for IO-bound jobs, workers = vCPU count for CPU-bound jobs. Most Laravel jobs are IO-bound (DB, HTTP, mail). On an 8-vCPU box, that's 16 workers. Scale horizontally beyond that. Add a second machine before adding a 17th worker on the first.
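The rule of thumb fits in a few lines. This is a sketch with hypothetical helper names; the 70 MB per worker is an assumed midpoint of the 60-80 MB range, so measure your own bootstrapped worker before trusting the memory figure.

```php
<?php
declare(strict_types=1);

/** Starting worker count per machine: 2x vCPUs if IO-bound, 1x if CPU-bound. */
function startingWorkers(int $vcpus, bool $ioBound): int
{
    return $ioBound ? $vcpus * 2 : $vcpus;
}

/** Rough resident memory for idle workers, in MB (70 MB each is an assumption). */
function workerMemoryMb(int $workers, int $mbPerWorker = 70): int
{
    return $workers * $mbPerWorker;
}

/** Machines needed when each machine is capped at its starting count. */
function machinesNeeded(int $totalWorkers, int $vcpusPerMachine, bool $ioBound): int
{
    return (int) ceil($totalWorkers / startingWorkers($vcpusPerMachine, $ioBound));
}
```

For example, startingWorkers(8, true) gives 16, and machinesNeeded(17, 8, true) gives 2: the 17th worker goes on a second box.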

The configuration that matters lives in supervisor.conf, not in Laravel:

[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/artisan queue:work redis --queue=high,default,low --sleep=0 --tries=3 --max-time=3600 --max-jobs=1000
autostart=true
autorestart=true
user=www-data
numprocs=16
redirect_stderr=true
stdout_logfile=/var/log/worker.log
stopwaitsecs=3600

Two things people miss. --sleep=0 because at this volume a worker should never sleep; there's always something to pop. --max-jobs=1000 because PHP leaks memory in long-running processes (more on this in bottleneck 3). The worker dies cleanly after 1000 jobs and supervisor restarts it.

Bottleneck 2: Redis serialization cost (igbinary vs php-serialize)

Around 30k jobs/hour with a payload heavier than ["id" => 42], dispatch latency starts showing up on the dashboard. Not the worker side. The dispatch side. Queue::push() starts taking 4ms, then 8ms, then 15ms when the queue is hot.

The culprit is PHP's default serialize(). It's text-based, allocates a lot, and for any payload that carries data rather than identifiers, you're serializing a tree. A job that carries an Order plus three OrderItem relations plus a Customer as a data snapshot (a DTO or array, or a model queued without SerializesModels) produces a 14 KB serialized string. Push 100 jobs/second and that's 1.4 MB/s of allocation churn before Redis sees a byte.

igbinary is a binary serializer. It's been a PHP extension for over a decade and it ships in most Docker base images. On the same payload, output drops from 14 KB to about 4.2 KB, roughly 70% smaller. Serialize time drops from around 180μs to 65μs, deserialize from 240μs to 85μs. Both numbers measured on PHP 8.3 with a realistic Eloquent payload.

At 1M jobs/day that's a few minutes of CPU saved per machine per day and about 9 GB of Redis bandwidth. Not enormous in absolute terms. But it also cuts the size of each entry in Redis by roughly 70%, which means MEMORY USAGE queues:default drops by close to the same proportion and your Redis fits comfortably on the same instance instead of needing the next tier up.
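Numbers like these depend heavily on payload shape, so it's worth measuring your own. A small sketch, assuming a made-up order-shaped array (samplePayload() is hypothetical); it always measures serialize() and only measures igbinary if the extension is loaded.

```php
<?php
declare(strict_types=1);

/** Build a nested array shaped roughly like an order with line items. */
function samplePayload(int $items): array
{
    $lines = [];
    for ($i = 0; $i < $items; $i++) {
        $lines[] = ['sku' => "SKU-$i", 'qty' => 1 + $i % 3, 'price_cents' => 1999];
    }
    return [
        'order_id' => 4421,
        'customer' => ['id' => 7, 'email' => 'a@example.com'],
        'items' => $lines,
    ];
}

/** Returns [encoded size in bytes, microseconds per call] for an encoder. */
function measure(callable $encode, array $payload, int $rounds = 1000): array
{
    $bytes = '';
    $start = hrtime(true);
    for ($i = 0; $i < $rounds; $i++) {
        $bytes = $encode($payload);
    }
    $elapsedNs = hrtime(true) - $start;
    return [strlen($bytes), $elapsedNs / $rounds / 1000];
}

$payload = samplePayload(50);
$results = ['serialize' => measure('serialize', $payload)];
if (function_exists('igbinary_serialize')) {
    $results['igbinary'] = measure('igbinary_serialize', $payload);
}
foreach ($results as $name => [$size, $micros]) {
    printf("%-10s %6d bytes  %8.2f us/call\n", $name, $size, $micros);
}
```

Swap samplePayload() for a real job's data captured from production and run it on the same PHP build your workers use.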

Switching is two steps. Install the extension:

apt-get install -y php8.3-igbinary
# or, in a Dockerfile based on the official php image (igbinary is a PECL extension):
# RUN pecl install igbinary && docker-php-ext-enable igbinary

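Then in config/queue.php. Check this against your Laravel version first: the stock queue builds job payloads with PHP's serialize() directly, so the serializer key below only takes effect if a custom connector or payload hook reads it: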

'connections' => [
    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => env('REDIS_QUEUE', 'default'),
        'retry_after' => 90,
        'block_for' => null,
        'serializer' => 'igbinary', // the actual change
    ],
],

The gotcha: in-flight jobs already on the queue were serialized with the old format. Drain the queue before switching, or write a one-time migration worker that handles both. If a worker calls igbinary_unserialize() on a php-serialized string, it fails on the missing igbinary header; if an old worker calls unserialize() on igbinary data, you get unserialize(): Error at offset 0 of N bytes. Either way the job lands in failed_jobs. Drain first, switch second.
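If draining isn't an option, the migration worker needs to tell the two formats apart. A sketch of that check, under one assumption worth verifying against your igbinary build: igbinary output starts with a 4-byte big-endian format version (1 or 2), which a php-serialized string never does. decodeEither() is a hypothetical helper.

```php
<?php
declare(strict_types=1);

/**
 * Decode a payload that may be in either format during a migration window.
 * Assumption: igbinary output starts with a 4-byte big-endian format
 * version (1 or 2); php-serialized strings start with a type letter.
 */
function decodeEither(string $raw): mixed
{
    $header = substr($raw, 0, 4);
    $looksLikeIgbinary = $header === "\x00\x00\x00\x01" || $header === "\x00\x00\x00\x02";

    if ($looksLikeIgbinary) {
        if (!function_exists('igbinary_unserialize')) {
            throw new RuntimeException('igbinary payload but extension not loaded');
        }
        return igbinary_unserialize($raw);
    }

    // allowed_classes => false keeps this sketch from instantiating
    // arbitrary classes; a real worker would list its job classes here.
    $value = @unserialize($raw, ['allowed_classes' => false]);
    if ($value === false && $raw !== serialize(false)) {
        throw new UnexpectedValueException('payload is neither igbinary nor php-serialized');
    }
    return $value;
}
```

Run the dual-format worker until the old entries are gone, then remove the fallback so a corrupt payload fails loudly again.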

Bottleneck 3: Long-running job memory growth (Octane + queues)

Around the time you fix serialization, you start running Laravel Octane to speed up HTTP. Then someone notices the queue workers also benefit from skipping the bootstrap on every job. They flip queue:work to run inside Octane. Memory on workers climbs from 80 MB at boot to 600 MB after 30 minutes. The OOM killer takes over.

The collision is that Octane keeps the framework state in memory between requests, and queue:work already keeps PHP alive between jobs. Stack them and any singleton, any static cache, any Eloquent global scope leaks across job boundaries until something fills up. The classic offender is Laravel's query log. DB::enableQueryLog() accidentally left on in a service provider grows with every query, forever.

Three rules that hold at scale:

Don't run queue:work inside Octane. The bootstrap-skip benefit (15ms per job) doesn't pay for the memory pressure. Run workers as plain artisan queue:work under supervisor.
Set --max-jobs=1000 and --max-time=3600. Whichever hits first triggers a clean restart. Memory resets to 80 MB. PHP's garbage collector is not your friend here. Long-running processes accumulate fragmentation that GC can't release.
Inside the job, call app('db')->disconnect() after any large query, and gc_collect_cycles() after any large object you no longer need. Both feel old-school. Both work.
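The third rule is easier to trust once you've seen a cycle hold memory. A plain-PHP sketch, with a hypothetical Node class standing in for any pair of objects that reference each other (a model and its relation back-reference, for instance):

```php
<?php
declare(strict_types=1);

final class Node
{
    public ?Node $peer = null;
    public string $blob;

    public function __construct()
    {
        $this->blob = str_repeat('x', 100_000);
    }
}

/** Create pairs of objects that point at each other, then drop them. */
function makeCycles(int $pairs): void
{
    for ($i = 0; $i < $pairs; $i++) {
        $a = new Node();
        $b = new Node();
        $a->peer = $b;
        $b->peer = $a; // refcount never reaches zero on its own
    }
}

gc_enable();
gc_collect_cycles();               // start from a clean root buffer
$before = memory_get_usage();
makeCycles(50);                    // about 10 MB now unreachable but still allocated
$held = memory_get_usage() - $before;
$collected = gc_collect_cycles();  // returns how many values the collector freed
$afterGc = memory_get_usage() - $before;
```

PHP only runs the cycle collector on its own once enough candidate roots pile up, so a job that builds a few large cyclic graphs can sit on that memory for a long time. Calling gc_collect_cycles() after the big object goes out of scope releases it immediately.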

The diagnostic that finds the leak:

// assumes `use Illuminate\Support\Facades\Log;` at the top of the job class
public function handle(): void
{
    $before = memory_get_usage(true);

    // do the work

    $after = memory_get_usage(true);
    $delta = $after - $before;

    if ($delta > 5 * 1024 * 1024) {
        Log::warning('job leaked memory', [
            'job' => self::class,
            'delta_mb' => round($delta / 1024 / 1024, 2),
            'order_id' => $this->order->id,
        ]);
    }
}

Watch the warning log for two days. You'll find the leaker.

Bottleneck 4: failed_jobs table writes

This one nobody sees coming. At 1M jobs/day with a 1% failure rate, that's 10k failed jobs/day landing in a single MySQL or Postgres table. Each row holds the full serialized payload (the same one you spent bottleneck 2 shrinking) plus the exception trace. Rows average 30-80 KB. Writes use the table's default settings, which means the primary key on id, a unique index on uuid, and nothing on queue or failed_at.

The table grows. After a quarter, you have 900k rows averaging 50 KB each. A 45 GB table. The first thing you notice is that SELECT * FROM failed_jobs WHERE queue = 'default' from Horizon's UI takes 12 seconds. Soon after, the insert itself starts dominating: a failing job takes 80ms to fail because the write contends with the read query Horizon is making every 5 seconds.
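The growth is plain multiplication, which makes it easy to project for your own failure rate. A sketch with hypothetical helper names; sizes are decimal GB and ignore index and storage-engine overhead, and the 30-day window in the last line is just an example retention period.

```php
<?php
declare(strict_types=1);

/** Rows accumulated over $days when nothing is pruned. */
function failedRows(int $jobsPerDay, float $failureRate, int $days): int
{
    return (int) round($jobsPerDay * $failureRate * $days);
}

/** Table size in GB (decimal), ignoring index and storage overhead. */
function tableSizeGb(int $rows, int $avgRowKb): float
{
    return $rows * $avgRowKb / 1_000_000;
}

$rows = failedRows(1_000_000, 0.01, 90);   // 900000
$sizeGb = tableSizeGb($rows, 50);          // 45.0
$with30DayWindow = tableSizeGb(failedRows(1_000_000, 0.01, 30), 50); // 15.0
```

Retention alone caps the table at a third of the size; the row size is the other lever.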

Four fixes, in order of impact:

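-- 1. add the indexes Laravel doesn't ship
CREATE INDEX idx_failed_jobs_failed_at ON failed_jobs (failed_at);
-- queue is a TEXT column in the default migration, so MySQL needs a prefix length;
-- on Postgres, index (queue, failed_at) without the prefix
CREATE INDEX idx_failed_jobs_queue_failed ON failed_jobs (queue(64), failed_at);

-- 2. prune old rows in a scheduled task
-- MySQL syntax; on Postgres use NOW() - INTERVAL '30 days'
DELETE FROM failed_jobs WHERE failed_at < NOW() - INTERVAL 30 DAY;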

The Laravel scheduler entry:

// app/Console/Kernel.php (Laravel 10 and earlier; on 11+ use Schedule::command() in routes/console.php)
$schedule->command('queue:prune-failed --hours=720')->daily();

The third fix is moving the table off the primary database entirely. A separate failed_jobs connection pointed at a low-tier MySQL instance keeps write contention away from your app DB:

// config/queue.php
'failed' => [
    'driver' => env('QUEUE_FAILED_DRIVER', 'database-uuids'),
    'database' => 'failed_jobs_db', // separate connection
    'table' => 'failed_jobs',
],

The fourth fix is the one most teams skip: stop storing the full payload. Implement FailedJobProviderInterface with a provider that stores a hash of the payload plus a reference to S3 where the full blob lives. Failed-job inspection still works for any tool that reads the table's metadata columns, but the rows drop from 50 KB to 400 bytes.
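The core of that provider is one small transformation. Here is a framework-free sketch of it: compactFailedRow(), the column names, and the blob key layout are all hypothetical, and $putBlob stands in for whatever S3 client wrapper you use. A real provider would call something like this from its log() method.

```php
<?php
declare(strict_types=1);

/**
 * Build the slim row and hand the full payload to a blob store.
 * $putBlob is a hypothetical callable (key, contents) that would wrap
 * an S3 client in a real app.
 */
function compactFailedRow(
    string $uuid,
    string $queue,
    string $payload,
    string $exception,
    callable $putBlob
): array {
    $blobKey = "failed-jobs/{$uuid}.json";
    $putBlob($blobKey, $payload);

    return [
        'uuid' => $uuid,
        'queue' => $queue,
        'payload_sha256' => hash('sha256', $payload),
        'payload_ref' => $blobKey,
        // First line of the trace is usually enough to group failures.
        'exception' => substr(explode("\n", $exception, 2)[0], 0, 200),
        'failed_at' => date('Y-m-d H:i:s'),
    ];
}

$blobs = [];
$row = compactFailedRow(
    'a3f1c2d4-0000-4000-8000-000000000001',
    'default',
    str_repeat('{"order":4421}', 3500), // about 49 KB
    "RuntimeException: card declined\n#0 /app/Jobs/ProcessOrderJob.php(51)",
    function (string $key, string $contents) use (&$blobs): void {
        $blobs[$key] = $contents;        // in-memory stand-in for S3
    },
);
```

The trade-off is on the retry path: anything that re-dispatches a failed job has to fetch the blob back first, and the hash lets it confirm the blob is the one that failed.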

The Horizon supervisor config that holds at scale

Horizon sits on top of queue:work and gives you the dashboard, auto-balancing, and metrics. The config that survives 1M jobs/day looks like this. config/horizon.php:

'environments' => [
    'production' => [
        'high-priority' => [
            'connection' => 'redis',
            'queue' => ['high'],
            'balance' => 'simple',
            'minProcesses' => 4,
            'maxProcesses' => 16,
            'memory' => 256,
            'tries' => 3,
            'timeout' => 60,
            'nice' => 0,
        ],
        'default' => [
            'connection' => 'redis',
            'queue' => ['default'],
            'balance' => 'auto',
            'minProcesses' => 8,
            'maxProcesses' => 48,
            'balanceMaxShift' => 4,
            'balanceCooldown' => 3,
            'memory' => 256,
            'tries' => 3,
            'timeout' => 120,
        ],
        'low' => [
            'connection' => 'redis',
            'queue' => ['low', 'reports'],
            'balance' => 'simple',
            'minProcesses' => 2,
            'maxProcesses' => 8,
            'memory' => 384,
            'tries' => 2,
            'timeout' => 600,
        ],
    ],
],

Three things to flag. balance => 'auto' is what you want on the busy supervisor. Horizon shifts workers between queues based on backlog. balanceMaxShift => 4 caps how aggressively it scales (the default 1 is too slow at this volume). memory => 256 matches the --max-jobs restart cadence. Horizon kills the worker if it crosses 256 MB, which is the early warning that bottleneck 3 is back.
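To see why the shift size matters, estimate how long the default supervisor takes to go from its floor to its ceiling. A rough sketch, assuming Horizon adds at most balanceMaxShift processes once per balanceCooldown seconds; secondsToScale() is a hypothetical helper and the result is a lower bound, not a measurement.

```php
<?php
declare(strict_types=1);

/**
 * Lower bound on seconds to grow from $min to $max processes when at most
 * $maxShift processes are added per $cooldown seconds.
 */
function secondsToScale(int $min, int $max, int $maxShift, int $cooldown): int
{
    $steps = (int) ceil(($max - $min) / $maxShift);
    return $steps * $cooldown;
}

$withShift4 = secondsToScale(8, 48, 4, 3); // 30
$withShift1 = secondsToScale(8, 48, 1, 3); // 120
```

Two minutes to reach full capacity is a long time when the backlog is growing at dozens of jobs per second; thirty seconds is tolerable.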

The nice => 0 on high-priority is intentional. Default is 0 already, but writing it down makes the next engineer look up what nice values are and consider whether the low queue should run at nice => 10 to deprioritize report generation when the box is hot.

When to leave Redis

Around 5M jobs/day, Redis itself becomes the bottleneck. Single-threaded, in-memory, with persistence amplification on every push. You'll see INFO replication showing replica lag climbing and client-side 95th-percentile command latencies in the 8-15ms range (redis-cli --latency, which reports min, max, and average, will show the same climb). You can shard with Redis Cluster, but at that point the operational cost of a clustered Redis is similar to switching brokers.

Three real options:

SQS is the laziest migration if you're on AWS. Laravel ships an SQS driver. Throughput is effectively unlimited, persistence is managed, you pay per request ($0.40 per million). The gotchas: 256 KB message size limit (your fat payloads break), at-least-once delivery means duplicate jobs will run (you need idempotency on every handler), and the FIFO queues that fix duplicate delivery cap at 3000 messages/second with batching. For most apps SQS plus an idempotency key column on the job is the cheapest path off Redis.

RabbitMQ is the right answer if you need actual queue semantics: priorities, dead letter exchanges, fan-out, per-message TTL. The vladimir-yuldashev/laravel-queue-rabbitmq package is solid. You'll pay for it in operational complexity: running RabbitMQ in HA mode means quorum queues, which means thinking about Erlang. Throughput tops out around 50k-80k messages/second per node, which is more than enough but not free.

NATS (with JetStream for persistence) is the newer option. The basis-company/nats PHP client is mature enough for production as of 2026. It's fast, over 100k messages/second per node on commodity hardware, and lighter operationally than RabbitMQ. The trade-off is ecosystem: Horizon doesn't speak NATS, so you lose the dashboard and write your own observability. Worth it for high-throughput, less worth it if your team values the Laravel-native tooling.

The decision rule: SQS if you're on AWS and can make jobs idempotent. RabbitMQ if you need routing semantics. NATS if you're optimizing for throughput and willing to write the monitoring yourself. Stay on Redis if you're under 5M/day and the four bottlenecks above are tuned. Redis is fine, and the migration cost is real.
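Since the SQS path hinges on idempotent handlers, here is the smallest version of that guard. It's a sketch: IdempotentRunner is hypothetical, and the in-memory array stands in for a table with a unique index on the key.

```php
<?php
declare(strict_types=1);

/**
 * Runs a handler at most once per idempotency key.
 * The in-memory array stands in for a table with a unique index on the key.
 */
final class IdempotentRunner
{
    /** @var array<string, true> */
    private array $seen = [];

    public function run(string $key, callable $handler): bool
    {
        if (isset($this->seen[$key])) {
            return false; // duplicate delivery: skip the side effects
        }
        $handler();
        $this->seen[$key] = true; // record only after the work succeeded
        return true;
    }
}

$charges = 0;
$runner = new IdempotentRunner();
$charge = function () use (&$charges): void {
    $charges++;
};
$first = $runner->run('order-4421-charge', $charge);  // runs the handler
$second = $runner->run('order-4421-charge', $charge); // same key, skipped
```

With a real database, two workers can receive the same message at once, so the check and the record need to be atomic: insert the key first inside the same transaction as the work and let the unique index reject the duplicate.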

If this was useful

This post is about the queue layer, but the deeper habit is the same one that runs through clean and hexagonal architecture: keep your domain (the job's actual work) decoupled from the framework's plumbing (Horizon, Redis, the failed-jobs table). When the queue broker changes (and at this scale it will), only the adapter changes, never the use case. That separation is what Decoupled PHP is about: the architectural layer your Laravel codebase reaches for after it outgrows the framework defaults.

What's the first bottleneck that bit your team at scale, and which fix here would have caught it earlier?