
Dewald Hugo

Posted on • Originally published at origin-main.com

Laravel Horizon in Production: Configuring AI Queue Workloads That Actually Hold

Laravel Horizon in production looks deceptively simple until your first LLM inference job times out silently and your users start receiving empty responses. Standard queue jobs (sending emails, processing images, syncing records) complete in milliseconds. AI inference jobs do not. A cold claude-sonnet-4-6 call with a dense system prompt can run for 45 seconds. A gemini-2.5-pro batch summarisation job can breach two minutes under load. Horizon’s defaults were not built for this, and the failure modes are nasty: jobs that disappear without landing in failed_jobs, rate limit retries that exhaust the tries budget in under 30 seconds, and expensive inference work discarded mid-completion.

This guide is part of the AI Deployment & Production Operations module, which covers the full surface area of running Laravel AI applications in production. If you are still wiring up the surrounding deployment infrastructure, the complete production deployment guide is the right starting point.

What follows covers the three layers where AI queue workloads require deliberate configuration: supervisor setup, job class design, and operational monitoring.

Why AI Jobs Break Standard Horizon Assumptions

Standard Horizon configuration assumes workers cycle through jobs in seconds. The defaults reflect that: a 60-second timeout, a minimal retry budget with no backoff configured, and supervisor settings tuned for throughput. Those assumptions collapse the moment you start queuing LLM inference.

Three failure modes come up repeatedly.

Silent timeout kills. Horizon’s default timeout of 60 seconds is aggressive for AI inference. A gpt-4o call with a large context window can sit at 50 seconds before returning its first token. Add network variance and the worker process receives a SIGKILL mid-call. No exception is logged. The job does not land in failed_jobs. It just vanishes. This is the most common support ticket pattern we see from teams that have not tuned Horizon for AI: “jobs are disappearing.”

Rate limit mishandling. Provider 429 responses from OpenAI, Anthropic, and Google are not errors in the traditional sense. They are expected, temporary, and recoverable. Retrying immediately burns through the tries budget in seconds. Without a backoff array defined on the job, Laravel uses zero delay between retries by default. A job hitting a rate limit five times in 15 seconds has failed just as permanently as one that hit a genuine error.

Partial output loss. AI jobs often do useful work before failing. A document summarisation job might process 80% of its input before hitting a context limit. Standard job failure handling discards that state entirely. For expensive inference workloads on long documents, that is a measurable cost.
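One mitigation is to checkpoint progress as the job runs so that a retry resumes rather than restarts. A minimal sketch in plain PHP, assuming a hypothetical chunked summarisation flow in which `$state` would be persisted (for example, to a column on the document) after every chunk; all names here are illustrative, not from a specific library:

```php
<?php

/**
 * Summarise $chunks, resuming from a previously persisted $state.
 * $summarise stands in for the expensive inference call; in a real
 * job, $state would be written to durable storage after each chunk.
 */
function summariseWithCheckpoints(array $chunks, callable $summarise, array &$state): array
{
    $start     = $state['processed'] ?? 0;   // resume point from a prior attempt
    $summaries = $state['summaries'] ?? [];

    foreach (array_slice($chunks, $start) as $offset => $chunk) {
        $summaries[] = $summarise($chunk);

        // Checkpoint: a SIGKILL or release() now loses at most one chunk
        // of completed inference work instead of all of it.
        $state = ['processed' => $start + $offset + 1, 'summaries' => $summaries];
    }

    return $summaries;
}
```

On retry, the job reloads the persisted state and skips straight to the first unprocessed chunk, so the 80%-done document from the example above only pays for the remaining 20%.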

The fix requires changes at three levels: supervisor configuration, job class design, and monitoring.

Configuring Laravel Horizon in Production for AI Workloads

Install Horizon if you have not already:

composer require laravel/horizon
php artisan horizon:install

The critical configuration lives in config/horizon.php. The default supervisor configuration is intentionally generic. For AI workloads, you need a dedicated supervisor pool with materially different settings.

// config/horizon.php

'environments' => [
    'production' => [

        'supervisor-ai-inference' => [
            'connection'          => 'redis',
            'queue'               => ['ai-high', 'ai-default', 'ai-low'],
            'balance'             => 'auto',
            'autoScalingStrategy' => 'time',
            'minProcesses'        => 2,
            'maxProcesses'        => 12,
            'balanceMaxShift'     => 2,
            'balanceCooldown'     => 5,
            'timeout'             => 300,  // 5 minutes — covers streaming completions
            'sleep'               => 3,
            'tries'               => 5,
            'nice'                => 0,
        ],

        'supervisor-default' => [
            'connection'  => 'redis',
            'queue'       => ['default', 'notifications', 'mail'],
            'balance'     => 'simple',
            'minProcesses'=> 1,
            'maxProcesses'=> 8,
            'timeout'     => 60,
            'sleep'       => 3,
            'tries'       => 3,
        ],
    ],
],

A few of these decisions are worth explaining.

autoScalingStrategy: time scales workers based on queue wait time rather than queue size. For AI workloads, queue size is a poor signal: three jobs waiting sounds manageable, but if each takes 90 seconds, you are looking at a 4-minute tail for the last user. Time-based scaling catches this earlier.

timeout: 300 gives generous headroom for streaming completions and large context calls. This is not a ceiling you should approach routinely; it is a safety net. If jobs are regularly running past 120 seconds, that is a prompt engineering problem, not a timeout problem.

balanceCooldown: 5 prevents the auto-balancer from thrashing worker counts during a burst of short AI calls followed by a trough. Default is 3 seconds, which is too reactive for inference workloads.
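The three-tier queue list also means dispatch sites choose priority explicitly: the supervisor drains ai-high before touching ai-default or ai-low. A minimal sketch of the dispatch side (the job class and variables follow the examples later in this guide):

```php
use App\Jobs\GenerateAIInsightJob;

// Interactive, user-facing inference: front of the line.
GenerateAIInsightJob::dispatch($document->id, $prompt)->onQueue('ai-high');

// Overnight batch summarisation: tolerates waiting behind interactive work.
GenerateAIInsightJob::dispatch($document->id, $prompt)->onQueue('ai-low');
```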

Supervisord server configuration is equally important. The stopwaitsecs value must exceed Horizon’s timeout value, or the process manager will kill a running Horizon worker before it finishes a long inference job during deployments:

[program:laravel-horizon]
process_name=%(program_name)s
command=php /var/www/html/artisan horizon
autostart=true
autorestart=true
user=www-data
redirect_stderr=true
stdout_logfile=/var/www/html/storage/logs/horizon.log
stopwaitsecs=360

Set stopwaitsecs to at least timeout + 60. We have seen rolling deployments silently truncate in-flight inference calls because this was left at the default 10 seconds.

Designing the Job Class: Timeout, Retry, and Rate Limit Handling

The supervisor configuration sets the outer boundary. The job class defines behaviour within it. For AI inference jobs, three properties are non-negotiable: $timeout, $tries, and $backoff.

<?php

namespace App\Jobs;

use App\Models\Document;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\Middleware\RateLimited;
use Illuminate\Support\Facades\Log;

class GenerateAIInsightJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    /**
     * Hard kill threshold. Horizon's supervisor timeout is the outer wall;
     * this property is the job's own declaration to the queue system.
     * Set it below the supervisor timeout to allow graceful error handling.
     */
    public int $timeout = 240;

    /**
     * Total attempts before the job is moved to failed_jobs.
     * 5 attempts with exponential backoff covers transient provider outages.
     */
    public int $tries = 5;

    /**
     * Seconds to wait before each retry: the first retry waits 30s,
     * the second 60s, and so on down the array.
     */
    public array $backoff = [30, 60, 120, 180, 240];

    public function __construct(
        private readonly int    $documentId,
        private readonly string $prompt,
        private readonly string $model = 'claude-sonnet-4-6',
    ) {}

    public function middleware(): array
    {
        return [new RateLimited('ai-inference')];
    }

    public function handle(): void
    {
        $document = Document::findOrFail($this->documentId);

        try {
            $response = \Anthropic::messages()->create([
                'model'      => $this->model,
                'max_tokens' => 2048,
                'messages'   => [
                    ['role' => 'user', 'content' => $this->prompt],
                ],
            ]);

            $document->update([
                'ai_insight'       => $response->content[0]->text,
                'insight_model'    => $this->model,
                'insight_token_count' => $response->usage->inputTokens + $response->usage->outputTokens,
            ]);

        } catch (\Throwable $e) {
            if ($this->isRateLimitException($e)) {
                // Release back to the queue with the appropriate backoff delay
                // instead of throwing. The release still consumes an attempt,
                // but it does not record an exception or a failure.
                $this->release($this->backoff[$this->attempts() - 1] ?? 240);
                return;
            }

            Log::error('AI insight generation failed', [
                'document_id' => $this->documentId,
                'attempt'     => $this->attempts(),
                'error'       => $e->getMessage(),
            ]);

            throw $e;
        }
    }

    public function failed(\Throwable $exception): void
    {
        // Preserve whatever partial work exists rather than nulling it.
        Document::where('id', $this->documentId)->update([
            'ai_insight_status' => 'failed',
            'ai_insight_error'  => $exception->getMessage(),
        ]);

        Log::critical('AI insight job exhausted all retries', [
            'document_id' => $this->documentId,
            'model'       => $this->model,
        ]);
    }

    public function retryUntil(): \DateTime
    {
        // Absolute deadline. Even with $tries remaining, the job will not
        // retry after this point. Critical for time-sensitive inference pipelines.
        return now()->addHours(3);
    }

    private function isRateLimitException(\Throwable $e): bool
    {
        return str_contains($e->getMessage(), '429')
            || str_contains($e->getMessage(), 'rate_limit')
            || str_contains($e->getMessage(), 'Too Many Requests');
    }
}

The $this->release() pattern guarded by isRateLimitException() is the correct approach for provider rate limits. Throwing the exception records a failure against the job and triggers a retry cycle. Calling release() puts the job back on the queue with a delay instead. Note that a released job still consumes an attempt from the $tries budget, so size $tries generously or lean on retryUntil(), which takes precedence over $tries when both are defined. Rate limits are not job failures; they are scheduling signals.

retryUntil() is the safety valve. Without it, a job sitting in exponential backoff across five attempts could theoretically retry for hours after the result is no longer needed. Set this to match the actual business requirement.
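A related lever is Laravel's $maxExceptions property: a job can be allowed many attempts in total while failing fast once a set number of unhandled exceptions have been thrown, which cleanly separates rate-limit releases from genuine errors. A sketch of the two properties together (the specific values are illustrative):

```php
// Allow up to 10 processing attempts in total — releases for
// rate limits consume attempts from this budget too.
public int $tries = 10;

// ...but fail permanently after 3 thrown (non-release) exceptions,
// so a genuinely broken prompt does not burn all 10 attempts.
public int $maxExceptions = 3;
```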

Registering the Rate Limiter

The RateLimited middleware on the job references a named rate limiter. Register it in your AppServiceProvider:

// app/Providers/AppServiceProvider.php

use Illuminate\Cache\RateLimiting\Limit;
use Illuminate\Support\Facades\RateLimiter;

public function boot(): void
{
    RateLimiter::for('ai-inference', function (object $job) {
        // Adjust this to match your provider tier.
        // Anthropic Tier 2: ~1,000 RPM. OpenAI Tier 3: ~5,000 RPM.
        // Start conservative and increase as you verify throughput.
        return Limit::perMinute(60)->by('global');
    });
}

For multi-tenant applications with per-tenant provider keys, scope the limiter by tenant:

RateLimiter::for('ai-inference', function (object $job) {
    $tenantId = $job->tenantId ?? 'global';
    return Limit::perMinute(30)->by("tenant:{$tenantId}");
});
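For the closure above to see the tenant, the job has to carry it as a readable property named tenantId. A minimal sketch (the class name is hypothetical; the property name is the assumption the limiter closure relies on):

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\Middleware\RateLimited;

class SummariseTenantDocumentJob implements ShouldQueue
{
    use Dispatchable, Queueable;

    public function __construct(
        public readonly int $tenantId,   // public so $job->tenantId resolves in the limiter
        public readonly int $documentId,
    ) {}

    public function middleware(): array
    {
        return [new RateLimited('ai-inference')];
    }
}
```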

If you are building the broader governance layer around token budgets and per-tenant limits, the AI middleware article covering token tracking covers the HTTP request layer equivalent of this pattern.

Job Duration: Why This Visualisation Matters

The chart below illustrates why AI inference jobs cannot share a supervisor pool with standard queue jobs. Standard jobs cluster almost entirely below 200ms. AI inference jobs distribute across a 5-to-90-second range, with a meaningful tail. A shared 60-second timeout kills the tail of the AI distribution silently.

[Figure] Bar chart comparing job duration distribution between standard queue jobs and AI inference jobs, illustrating why a shared 60-second timeout is insufficient for AI workloads.

Monitoring What Actually Matters for AI Queues

The Horizon dashboard gives you throughput, wait time, and recent job runtime out of the box. For standard workloads, those three numbers tell you most of what you need. For AI inference workloads, the signal you need most is not surfaced by default: the ratio of jobs that exit via SIGKILL versus jobs that complete normally.
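One operational detail worth stating: Horizon only aggregates those dashboard metrics when its snapshot command runs on a schedule; without it, the Metrics tab stays empty. In routes/console.php:

```php
use Illuminate\Support\Facades\Schedule;

// Horizon computes throughput, runtime, and wait-time metrics from
// periodic snapshots; five minutes is the interval the Horizon docs use.
Schedule::command('horizon:snapshot')->everyFiveMinutes();
```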

The table below outlines which Horizon metrics carry real weight for AI workloads, and the thresholds worth building alerts around.

| Metric | Default threshold | AI workload threshold | Why it differs |
| --- | --- | --- | --- |
| Job wait time | Alert at 30s | Alert at 120s | AI jobs are slower; short wait-time spikes are normal |
| Job runtime (p95) | Alert at 10s | Alert at 90s | Long completions are expected; watch the tail, not the mean |
| Failed job rate | Alert at 5% | Alert at 2% | Inference is expensive; failures cost more than compute |
| Retry rate | Not monitored | Alert at 15% | High retries indicate a rate limit or model instability problem |
| Queue depth (ai-high) | Alert at 50 | Alert at 10 | High-priority AI jobs should process near-immediately |

For the silent timeout kill problem, there is no native Horizon alert. The symptom is a job that leaves no trace in failed_jobs despite not completing. You can detect this indirectly by tracking job dispatch counts against completion counts in your application layer:

// Dispatch side — record intent
Cache::increment('jobs:dispatched:total');
GenerateAIInsightJob::dispatch($document->id, $prompt)->onQueue('ai-default');

// Inside the job's handle(), after the work succeeds — record completion
Cache::increment('jobs:completed:total');

A growing gap between those two counters, without corresponding growth in failed_jobs, is the fingerprint of silent SIGKILLs. Add a scheduled command to routes/console.php that alerts when the gap exceeds a threshold:

// routes/console.php

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Schedule;

Schedule::call(function () {
    $dispatched  = Cache::get('jobs:dispatched:total', 0);
    $completed   = Cache::get('jobs:completed:total', 0);
    $failed      = DB::table('failed_jobs')
                     ->where('queue', 'like', 'ai-%')
                     ->where('failed_at', '>=', now()->subHour())
                     ->count();

    $missing = $dispatched - $completed - $failed;

    if ($missing > 5) {
        Log::critical('AI jobs disappearing without trace', [
            'dispatched' => $dispatched,
            'completed'  => $completed,
            'failed'     => $failed,
            'missing'    => $missing,
        ]);
    }
})->everyFiveMinutes();

[Production Pitfall] The silent SIGKILL is by far the most dangerous failure mode in AI queue workloads, because it produces no actionable output. Teams routinely run in this state for weeks without realising it, attributing the missing outputs to “the AI being slow.” Check your stopwaitsecs and timeout alignment before anything else. If stopwaitsecs in your supervisord config is lower than Horizon’s timeout, every deployment is silently truncating in-flight inference calls.

If you are building the broader observability layer around your AI architecture, the governance and telemetry patterns in the production AI architecture guide cover how to centralise this kind of cross-cutting operational signal across providers.

Failed Job Strategy for LLM Inference

When a job does land in failed_jobs, the default response is to re-dispatch it manually via php artisan queue:retry and hope the error was transient. For AI inference, that is rarely sufficient. Inference failures tend to cluster around specific causes (provider outages, malformed prompts, context window overflows, or inference parameters that produce invalid output), and each warrants a different response.

Structure your failed job handling to capture enough context to triage correctly:

public function failed(\Throwable $exception): void
{
    $reason = match (true) {
        str_contains($exception->getMessage(), 'context_length_exceeded') => 'context_overflow',
        str_contains($exception->getMessage(), '429')                     => 'rate_limit_exhausted',
        str_contains($exception->getMessage(), 'invalid_request_error')   => 'bad_prompt',
        default                                                           => 'unknown',
    };

    Document::where('id', $this->documentId)->update([
        'ai_insight_status' => 'failed',
        'ai_failure_reason' => $reason,
        'ai_insight_error'  => $exception->getMessage(),
    ]);

    Log::error('AI inference job failed permanently', [
        'document_id' => $this->documentId,
        'model'       => $this->model,
        'reason'      => $reason,
        'attempts'    => $this->attempts(),
    ]);
}

The reason field is the important addition. context_overflow failures need prompt truncation logic, not a retry. bad_prompt failures need a developer looking at the prompt template, not an automated re-queue. Retrying them blindly burns provider quota for no benefit.

For agentic pipelines where the inference job is one step in a multi-step chain, the failure handling becomes more complex. The question of what to do with downstream jobs that depend on the failed step, and how to validate the partial output that does exist, is covered in depth in the agentic workflow schema validation guide. The core principle applies here too: validate what you have before deciding whether a retry is warranted.

One final note on pairing the Horizon dashboard with Laravel Telescope: Telescope’s job watcher captures the full job payload, exception stack trace, and timing for every failed job. For AI workloads, the job payload includes the prompt, which makes post-mortem analysis significantly faster. Enable the job watcher in non-production environments at minimum, and consider enabling it in production with payload scrubbing for anything containing PII. See the Laravel Horizon documentation for tag-based filtering, which lets you isolate AI job failures without noise from other queue traffic. Anthropic’s rate limit documentation is the authoritative reference for per-tier RPM and token-per-minute limits if you are calibrating your RateLimiter::for('ai-inference') ceiling.
