Hafiz

Posted on May 27 • Originally published at hafiz.dev

Laravel AI SDK Silently Kills Your Horizon Queue (And How to Fix It in 4 Config Changes)

#laravel #laravelhorizon #queues #laravelaisdk


Your Horizon was healthy. Email jobs, notifications, small processing tasks, all running fine with zero issues. You added the AI SDK, wrapped a few agent calls in background jobs, and deployed to production. Within a week you noticed failed jobs nobody reported, one user received the same AI analysis twice, and on a Tuesday morning when five report jobs queued simultaneously, your email queue froze for eight minutes.

Nothing threw an exception. No alerting fired. Horizon's dashboard showed a handful of failed jobs and then everything looked normal again. The failure mode is invisible because the worker doesn't crash. It gets killed mid-execution, Horizon records the failure silently, and there's no stack trace and no message, nothing to debug.

The problem isn't the AI SDK. It's that Horizon's default configuration was designed for jobs measured in milliseconds. AI API calls take 30 to 120 seconds. Four specific defaults silently conflict with that reality, and they don't all live in config/horizon.php. They're spread across config/horizon.php, config/queue.php, and your job classes, in places you'd only check if you knew to look.

If you're new to Laravel's queue system and want to understand the fundamentals before tuning Horizon specifically, the Laravel queue jobs guide covers job structure, retries, and failure handling from the ground up.

Why AI jobs break Horizon's defaults

Horizon's default supervisor configuration assumes jobs are fast. A worker polls the queue, picks up a job, executes it in a second or two, picks up the next one. The defaults reflect this: a 60-second timeout before the worker is killed, a retry_after value in the 90-second range before a job is considered stuck, and no backoff delay between retries.

AI SDK jobs break every one of these assumptions. A single Agent::run() call that uses tools, generates structured output, or hands off to a sub-agent can legitimately take 45 to 90 seconds before it returns. When you layer in the sub-agent patterns that chain multiple providers, job duration climbs further.

Here's what actually happens when that 90-second AI job runs on a default Horizon setup:

View the interactive diagram on hafiz.dev

The SIGKILL at step four is silent. Horizon marks the job as failed with no visible error in the dashboard UI beyond the failure count incrementing. There's no stack trace, no message from the AI provider, no PHP exception in your logs. Most developers spend hours looking for an application error that doesn't exist, because the job never threw one.

The double-processing at step seven is the worst part. Laravel's official Horizon documentation warns about it explicitly: "the timeout value should always be at least a few seconds shorter than the retry_after value... otherwise, your jobs may be processed twice." But almost nobody reads that warning until after a user reports seeing duplicate results.
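To make the interaction between the two values concrete, here is a small framework-free PHP sketch of the timing described above. It is a simplified model, not Horizon's actual code: the function name simulateAttempt and its return keys are hypothetical, and it encodes only two rules from this section (a worker is killed once a job outlives timeout, and a job becomes eligible for another worker once retry_after elapses while it is still running).

```php
<?php

declare(strict_types=1);

/**
 * Simplified model of a single job attempt (an illustration, not Horizon internals).
 *
 * @return array{killed: bool, ran_for: int, overlapping_retry: bool}
 */
function simulateAttempt(int $jobSeconds, int $timeout, int $retryAfter): array
{
    // The worker is killed once the job outlives the timeout.
    $killed = $jobSeconds > $timeout;

    // The attempt occupies the worker until whichever limit comes first.
    $ranFor = min($jobSeconds, $timeout);

    // If the attempt is still running when retry_after expires,
    // a second worker can pick up the same job.
    $overlappingRetry = $ranFor > $retryAfter;

    return [
        'killed' => $killed,
        'ran_for' => $ranFor,
        'overlapping_retry' => $overlappingRetry,
    ];
}

// Default values: a 90-second AI job is killed at the 60-second mark.
var_dump(simulateAttempt(90, 60, 90));

// Timeout raised to 300 but retry_after left at 90: the job survives,
// yet it is still running when it becomes eligible for a second worker.
var_dump(simulateAttempt(120, 300, 90));
```

The second call is the dangerous one: nothing is killed, nothing fails, and the same job can still run twice.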

Four changes fix all of this cleanly.

Change 1: Set a supervisor timeout that reflects AI job duration

Horizon's default supervisor timeout is 60 seconds. That's the value in config/horizon.php under each supervisor's configuration. When a job runs longer than this, the worker process receives SIGKILL and is forcefully terminated.

// config/horizon.php: what ships by default
'supervisor-1' => [
    'connection' => 'redis',
    'queue' => ['default'],
    'balance' => 'auto',
    'processes' => 10,
    'timeout' => 60, // kills any job taking longer than 60 seconds
    'tries' => 1,
],

For AI jobs, raise this to something that reflects the realistic upper bound of your slowest agent call. 300 seconds (five minutes) is a sensible ceiling for most production AI workloads:

'supervisor-ai' => [
    'connection' => 'redis',
    'queue' => ['ai'],
    'balance' => 'auto',
    'processes' => 3,
    'timeout' => 300,
    'tries' => 2,
    'memory' => 512,
],

The Horizon docs note a critical constraint: "always ensure the Horizon timeout is greater than any job-level timeout, otherwise jobs may be terminated mid-execution." So if you define $timeout on the job class itself, the supervisor timeout must exceed it by a few seconds.

class RunAiReportJob implements ShouldQueue
{
    // Job-level timeout: must be < supervisor timeout
    public int $timeout = 270;
}

Setting the job-level timeout slightly below the supervisor timeout means termination happens at the job level first. Once the job is out of attempts (or right away, if the job sets $failOnTimeout to true), that marks the job as failed and triggers the failed() method, which gives you a cleanup hook. A SIGKILL from the supervisor sends a process signal that skips failed() entirely, leaving any in-progress state (partially written database rows, uncleaned temp files, open API sessions) without cleanup. For AI jobs that write intermediate results or update progress columns, this distinction matters.
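As a rough illustration of that cleanup hook, here is a minimal sketch of a job that treats a timeout as a failure and tidies up in failed(). The $reportId property, the log message, and the cleanup comment are hypothetical placeholders. $failOnTimeout is the standard Laravel job property, used here on the assumption that you want a timeout to fail the job even when attempts remain.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;
use Throwable;

class RunAiReportJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Job-level timeout: kept below the 300-second supervisor timeout.
    public int $timeout = 270;

    // Treat a timeout as a failure instead of leaving the job for another attempt.
    public bool $failOnTimeout = true;

    // $reportId is a hypothetical identifier for the work being done.
    public function __construct(public int $reportId)
    {
    }

    public function handle(): void
    {
        // Your AI SDK call here
    }

    public function failed(?Throwable $exception): void
    {
        // Hypothetical cleanup: reset progress columns, delete temp files, and so on.
        Log::warning('AI report job failed', [
            'report_id' => $this->reportId,
            'reason' => $exception?->getMessage(),
        ]);
    }
}
```

Whether to fail on the first timeout or allow another attempt is a judgment call. For expensive agent calls, failing fast is usually cheaper than timing out twice.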

Change 2: Fix retry_after to prevent double processing

This is the one that causes duplicate user output, and it lives in config/queue.php, not in config/horizon.php. That's why most developers miss it.

The retry_after value defines how many seconds Laravel waits before assuming a reserved job is stuck and releasing it back onto the queue. The default for the Redis connection is 90 seconds. After you raise your supervisor timeout to 300, that 90-second retry_after means any job running longer than 90 seconds gets re-queued while the original is still executing.

// config/queue.php: the default Redis connection
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    'retry_after' => 90, // danger: shorter than the AI supervisor timeout
    'block_for' => null,
],

Laravel's docs say this directly: timeout must be shorter than retry_after. So if your supervisor timeout is 300 seconds, retry_after needs to be at least 300 plus a buffer. Add 60 seconds as the minimum buffer:

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    'retry_after' => 360, // supervisor timeout (300) + 60s buffer
    'block_for' => null,
],

This is the change most likely to already be causing silent problems in your app. The double-processing scenario has a specific pattern: a user triggers an AI analysis, gets the result, then gets the same result again 90 seconds later with no explanation. If you've had users report duplicate outputs and assumed it was a UI bug or a double-click, it may have been this. The first attempt was still running when retry_after expired, so the job was released back onto the queue, picked up by a second worker, and run again from the beginning.

If you have separate Redis connections for different queues, each connection needs its own retry_after value matched to the highest supervisor timeout that uses it. A dedicated redis-ai connection with its own retry_after of 360 is cleaner than raising the value for every queue across the board.
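A dedicated connection could look roughly like the fragment below. The connection name redis-ai comes from the paragraph above. Reusing the same underlying Redis connection ('default') and leaving the original redis connection at 90 seconds are assumptions for this sketch.

```php
// config/queue.php: a second queue connection just for AI jobs (sketch)
'connections' => [
    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => env('REDIS_QUEUE', 'default'),
        'retry_after' => 90, // fast jobs keep the short value
        'block_for' => null,
    ],

    'redis-ai' => [
        'driver' => 'redis',
        'connection' => 'default', // assumption: same Redis server
        'queue' => 'ai',
        'retry_after' => 360, // supervisor timeout (300) + 60s buffer
        'block_for' => null,
    ],
],

// config/horizon.php: the AI supervisor must read from that connection
'supervisor-ai' => [
    'connection' => 'redis-ai',
    'queue' => ['ai'],
    'timeout' => 300,
],

// Dispatching: the job has to be pushed onto the same connection
RunAiReportJob::dispatch($report)->onConnection('redis-ai')->onQueue('ai');
```

The trade-off is one more thing to keep in sync: a job dispatched on the wrong connection will sit in a queue that no AI worker is watching.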

Change 3: Add exponential backoff for rate limit failures

AI API providers rate-limit aggressively. OpenAI, Anthropic, and Gemini all have per-minute token limits. When a job hits a rate limit, it throws an exception and fails. Without backoff configuration, Horizon retries immediately. Under sustained load, a burst of AI jobs exhausting the rate limit causes every retry to also hit the rate limit, causing another retry, in a tight loop that burns through your API budget and fills the failed jobs list.

The fix is exponential backoff, which Horizon supports at both the supervisor level and the job class level.

At the supervisor level:

'supervisor-ai' => [
    'connection' => 'redis',
    'queue' => ['ai'],
    'balance' => 'auto',
    'processes' => 3,
    'timeout' => 300,
    'tries' => 3,
    'backoff' => [30, 60, 120], // 30s before retry 1, 60s before retry 2; 120s only applies if tries is 4 or more
    'memory' => 512,
],

Or on the job class directly, which takes precedence over supervisor config:

class RunAiReportJob implements ShouldQueue
{
    public int $timeout = 270;
    public int $tries = 3;

    // Exponential backoff: 30s after first failure, 60s after second
    // (the 120s step is only reached if $tries is raised above 3)
    public array $backoff = [30, 60, 120];
}

The job class approach is more explicit and survives config changes without you needing to remember to update the supervisor. It's the pattern to prefer when different AI jobs have different retry requirements. A cheap text classification job can retry faster than an expensive multi-tool agent call.

One thing worth noting: tries counts total attempts, not retries. A job with $tries = 3 gets three chances total: the original execution plus two retries. Set it low enough that an actually broken job doesn't exhaust your API quota before hitting the failed jobs list. Three is usually the right number for AI jobs with rate limit risks.
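The relationship between tries and backoff is easy to get wrong, so here is a small standalone sketch that lists the delays a job would actually see. The function name retryDelays is hypothetical. The lookup rule (use the entry for the current attempt, fall back to the last entry when the list runs out) reflects my understanding of how Laravel picks a delay, so treat it as an assumption and verify it against your framework version.

```php
<?php

declare(strict_types=1);

/**
 * List the delay (in seconds) before each retry.
 *
 * @param  int|array<int, int>  $backoff  a single delay or a list of delays
 * @return array<int, int>
 */
function retryDelays(int $tries, int|array $backoff): array
{
    $steps = array_values((array) $backoff);

    if ($steps === []) {
        $steps = [0]; // no backoff configured: retry immediately
    }

    $delays = [];

    // A failure on attempt N schedules retry N. The final attempt has no retry.
    for ($attempt = 1; $attempt < $tries; $attempt++) {
        $delays[] = $steps[$attempt - 1] ?? $steps[count($steps) - 1];
    }

    return $delays;
}

print_r(retryDelays(3, [30, 60, 120])); // two retries, so the 120s step is never used
print_r(retryDelays(4, [30, 60, 120])); // three retries, all three steps used
print_r(retryDelays(5, [30, 60]));      // list runs out, last value repeats
```

With $tries = 3, a three-element backoff array has one entry that never fires. Either raise tries to 4 or trim the array to two values so the config says what it does.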

Change 4: Move AI jobs to a dedicated queue and supervisor

Even with the timeout and retry_after fixed, AI jobs and fast jobs sharing the same supervisor still block each other. Here's why: if you have three workers in supervisor-1 and three 90-second AI jobs land simultaneously, all three workers are occupied for 90 seconds. Email jobs, notifications, and payment webhooks sit queued and waiting until one of those workers finishes.

The solution is a dedicated queue for AI work and a supervisor that only handles it, completely isolated from your default queue.

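// In your job class
class RunAiReportJob implements ShouldQueue
{
    use Queueable;

    public int $timeout = 270;
    public int $tries = 3;
    public array $backoff = [30, 60, 120];

    public function __construct()
    {
        // The Queueable trait already declares $queue, so redeclaring it
        // as a typed property is a fatal error. Set the queue name here.
        $this->onQueue('ai');
    }
}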

Or when dispatching:

RunAiReportJob::dispatch($report)->onQueue('ai');

Then in config/horizon.php, run two supervisors with different constraints:

'environments' => [
    'production' => [
        // Fast jobs: tight timeout, many workers
        'supervisor-default' => [
            'connection' => 'redis',
            'queue' => ['default', 'high'],
            'balance' => 'auto',
            'processes' => 10,
            'timeout' => 60,
            'tries' => 3,
            'backoff' => 3,
            'memory' => 256,
        ],

        // AI jobs: long timeout, fewer workers, more memory
        'supervisor-ai' => [
            'connection' => 'redis',
            'queue' => ['ai'],
            'balance' => 'auto',
            'processes' => 3,
            'timeout' => 300,
            'tries' => 3,
            'backoff' => [30, 60, 120],
            'memory' => 512,
        ],
    ],
],

The result:

View the interactive diagram on hafiz.dev

Three workers for AI is a starting point. The right number depends on your API rate limits and job volume. Three concurrent AI jobs consuming tokens simultaneously can hit per-minute limits quickly, so scaling the worker count upward requires monitoring your provider's rate limit headroom first.
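For a back-of-the-envelope ceiling on the worker count, a sketch like the following can help. Everything in it is an assumption for illustration: the function name maxAiWorkers, the idea that each job consumes a roughly constant number of tokens, and the example numbers, which are made up and not taken from any provider's real limits.

```php
<?php

declare(strict_types=1);

/**
 * Estimate how many always-busy workers fit under a tokens-per-minute limit.
 * One worker finishes 60 / $avgJobSeconds jobs per minute, so it consumes
 * $tokensPerJob * 60 / $avgJobSeconds tokens per minute.
 */
function maxAiWorkers(int $tokensPerMinuteLimit, int $tokensPerJob, int $avgJobSeconds): int
{
    if ($tokensPerMinuteLimit <= 0 || $tokensPerJob <= 0 || $avgJobSeconds <= 0) {
        throw new InvalidArgumentException('All inputs must be positive.');
    }

    // Integer arithmetic avoids floating-point rounding surprises.
    $workers = intdiv($tokensPerMinuteLimit * $avgJobSeconds, $tokensPerJob * 60);

    // Never return zero: one worker with backoff is still workable.
    return max(1, $workers);
}

// Hypothetical numbers: 200k tokens per minute, 20k tokens per job, 60-second jobs.
echo maxAiWorkers(200_000, 20_000, 60), PHP_EOL;
```

This is an upper bound under steady load, not a target. Bursts, retries, and other parts of your app sharing the same API key all eat into the headroom, so start well below the estimate.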

You might also want to set balance to simple for the AI supervisor rather than auto. The auto strategy dynamically scales workers based on queue depth, but it has a side effect: Horizon considers workers "hanging" when scaling down and will force-kill them after the supervisor timeout if they're mid-execution. For long-running AI jobs, simple balance with a fixed process count is more predictable. You control the concurrency explicitly, and Horizon doesn't try to adjust it based on queue depth in a way that could interrupt active jobs.
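If you go that route, the AI supervisor would look something like this. It is the same block as above with only the balance value changed, and the worker count of three is still just the starting point discussed earlier.

```php
// config/horizon.php: AI supervisor with a fixed worker count (sketch)
'supervisor-ai' => [
    'connection' => 'redis',
    'queue' => ['ai'],
    'balance' => 'simple', // fixed process count, no scale-down of busy workers
    'processes' => 3,
    'timeout' => 300,
    'tries' => 3,
    'backoff' => [30, 60, 120],
    'memory' => 512,
],
```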

For a deeper look at structuring queue topology across multiple job types, the Laravel queue route guide covers centralising queue assignments cleanly.

The complete production config

Here's everything together as a reference:

// config/queue.php: raise retry_after to cover the longest AI job
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('REDIS_QUEUE', 'default'),
    'retry_after' => 360, // supervisor timeout (300) + 60s buffer
    'block_for' => null,
],

// config/horizon.php: two supervisors, two concerns
'environments' => [
    'production' => [
        'supervisor-default' => [
            'connection' => 'redis',
            'queue' => ['default', 'high'],
            'balance' => 'auto',
            'processes' => 10,
            'timeout' => 60,
            'tries' => 3,
            'backoff' => 3,
            'memory' => 256,
        ],
        'supervisor-ai' => [
            'connection' => 'redis',
            'queue' => ['ai'],
            'balance' => 'auto',
            'processes' => 3,
            'timeout' => 300,
            'tries' => 3,
            'backoff' => [30, 60, 120],
            'memory' => 512,
        ],
    ],
],

// app/Jobs/RunAiReportJob.php: job-level timeouts and backoff
class RunAiReportJob implements ShouldQueue
{
    use Queueable;

    public int $timeout = 270;  // slightly below supervisor timeout
    public int $tries = 3;
    public array $backoff = [30, 60, 120];

    public function __construct()
    {
        // Queueable already declares $queue, so set the name here
        // instead of redeclaring it as a typed property.
        $this->onQueue('ai');
    }

    public function handle(): void
    {
        // Your AI SDK call here
    }

    public function failed(?Throwable $exception): void
    {
        // Log, notify, or clean up on final failure
    }
}

After deploying, run php artisan horizon:terminate so Horizon shuts down gracefully and your process monitor (Supervisor, for example) restarts it with the new config. The queue:restart signal alone doesn't reload Horizon's supervisor configuration.
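Because the three values live in different files, a small sanity check can catch a mismatch before it reaches production. The sketch below is one way to express the ordering rule from this article (job timeout below supervisor timeout, supervisor timeout below retry_after). The function name timeoutOrderingProblems is hypothetical, and it is plain PHP so it can run in a unit test or a deploy script.

```php
<?php

declare(strict_types=1);

/**
 * Return a list of human-readable problems, or an empty array if the
 * ordering job timeout < supervisor timeout < retry_after holds.
 *
 * @return array<int, string>
 */
function timeoutOrderingProblems(int $jobTimeout, int $supervisorTimeout, int $retryAfter): array
{
    $problems = [];

    if ($jobTimeout >= $supervisorTimeout) {
        $problems[] = "Job timeout ({$jobTimeout}s) must be lower than the supervisor timeout ({$supervisorTimeout}s).";
    }

    if ($supervisorTimeout >= $retryAfter) {
        $problems[] = "Supervisor timeout ({$supervisorTimeout}s) must be lower than retry_after ({$retryAfter}s).";
    }

    return $problems;
}

// The values from the complete config above pass cleanly.
var_dump(timeoutOrderingProblems(270, 300, 360));

// Raising the supervisor timeout without touching retry_after does not.
var_dump(timeoutOrderingProblems(270, 300, 90));
```

Inside a Laravel app you could feed it config('horizon.environments.production.supervisor-ai.timeout') and config('queue.connections.redis.retry_after'), assuming your config keys match the ones used in this article.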

FAQ

Why does Horizon kill jobs at 60 seconds when I never set a timeout?

Horizon inherits the timeout from the queue worker defaults. If you don't explicitly set timeout in the supervisor config, it falls back to the worker default of 60 seconds. This is documented but easy to miss, because the installed config/horizon.php doesn't always include timeout in every supervisor block. Absence isn't zero, it's the default.

Can I just set a very high timeout everywhere instead of creating separate supervisors?

You can, but it's the wrong call. A 300-second timeout on your email supervisor means a stuck email job (due to a mail server timeout, for example) occupies a worker for five minutes instead of one. Separate supervisors let you apply constraints appropriate to each job type. The email jobs don't need to know about AI job timeouts, and vice versa.

Does the AI SDK dispatch jobs automatically, or do I need to wrap calls manually?

You wrap them manually. The Laravel AI SDK runs synchronously by default. Dispatching an agent call as a background job is an explicit architectural choice you make by wrapping the Agent::run() call in a ShouldQueue class. The SDK doesn't push to queues automatically.

What happens to an AI job that exceeds $timeout on the job class?

If $timeout is set on the job class and the supervisor timeout is higher, the worker stops the job at the job-level limit and records a TimeoutExceededException. When that was the final attempt, or when the job sets $failOnTimeout to true, the job is marked as failed and failed() runs. This is the preferred failure mode: you get the cleanup hook, the exception is logged, and the attempt is counted normally. A SIGKILL from the supervisor timeout skips all of that.

How do I verify the timeout and retry_after relationship is correct after deploying?

Run a test job that deliberately sleeps for longer than your old retry_after but less than your job timeout, then watch Horizon's dashboard. If the job appears twice in Recent Jobs, your retry_after is still too low. You can also inspect Redis directly: redis-cli LRANGE queues:ai 0 -1 lists jobs waiting on the ai queue, and redis-cli ZRANGE queues:ai:reserved 0 -1 lists jobs currently reserved by a worker. Both key names will carry your Redis prefix if one is configured.
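A probe job for that test might look like the sketch below. The class name SleepProbeJob, the default of 240 seconds, and the log messages are all hypothetical. Pick a sleep duration that is longer than your old retry_after and shorter than the job timeout.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Facades\Log;

class SleepProbeJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public int $timeout = 270;

    // Two attempts allowed, so an unwanted second pick-up shows up as a second run.
    public int $tries = 2;

    public function __construct(public int $seconds = 240)
    {
        $this->onQueue('ai');
    }

    public function handle(): void
    {
        Log::info('Sleep probe started', ['attempt' => $this->attempts()]);

        // Stand-in for a slow AI call.
        sleep($this->seconds);

        Log::info('Sleep probe finished', ['attempt' => $this->attempts()]);
    }
}
```

Dispatch it once with SleepProbeJob::dispatch(). A single "started" and "finished" pair in the log means the values are in the right order. A second "started" line with attempt 2 before the first "finished" means retry_after is still too low.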

One thing to remember

Most of the pain from adding AI SDK to an existing Laravel app isn't in the code. It's in configuration assumptions that were correct before and quietly broke afterward. None of these four changes are complex. They're just easy to not know about until something goes wrong in production.

The timeout and retry_after relationship is the most dangerous of the four. It's in the official docs, it's a known issue with a clear warning, and it still catches experienced developers because the two values live in different config files with no cross-reference in the UI. You'd have to know to check one when you change the other.

The queue isolation change has the biggest ongoing benefit. Once your AI jobs run in their own supervisor, you can tune that supervisor independently. You can scale the worker count up during peak AI usage without affecting the process count for your other queues. You can set different memory limits, different balancing strategies, and different alerting thresholds. Fast jobs and slow jobs just have different operational needs, and the config reflects that.

If you're running AI workloads on Horizon right now without dedicated supervisors, go check your retry_after value before anything else. If it's under 300, you may already be double-processing jobs and not seeing it in your logs.

Building something with the AI SDK and want a review of the queue architecture before it hits production? Get in touch.
