DEV Community

Ameer Hamza


The N+1 Problem in AI Wrappers: Scaling Laravel + OpenAI

The AI gold rush is here, and every developer is building an "AI wrapper." You spin up a Laravel app, pull in the OpenAI PHP client, wire up a controller, and boom—you have a product. It works perfectly on your local machine. It works perfectly for your first 10 users.

Then, you hit the front page of Hacker News.

Suddenly, your application grinds to a halt. Your logs are screaming 429 Too Many Requests. Your OpenAI API bill is skyrocketing because you're regenerating the same responses for different users. Your PHP-FPM workers are exhausted, hanging indefinitely while waiting for OpenAI's servers to respond.

You've just encountered the AI equivalent of the N+1 query problem.

In traditional web development, the N+1 problem occurs when you query the database in a loop instead of eager loading. In the AI era, the N+1 problem happens when you treat third-party LLM APIs like local, synchronous database calls.
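As a refresher, here is what the classic database version looks like (an illustrative Eloquent sketch, assuming an `author` relation on `Article`):

```php
<?php

// ❌ N+1: one query for the articles, plus one extra query per article
$articles = Article::all();
foreach ($articles as $article) {
    echo $article->author->name; // lazy-loads the relation on every iteration
}

// ✅ Eager loading: two queries total, regardless of how many articles exist
$articles = Article::with('author')->get();
foreach ($articles as $article) {
    echo $article->author->name; // relation is already in memory
}
```

The AI version of this mistake swaps "query in a loop" for "blocking API call per request," and the fix follows the same spirit: batch, defer, and reuse.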

In this deep dive, we'll explore the architectural pitfalls of naive AI integrations in Laravel and how to build a robust, queue-driven, and heavily cached AI pipeline that scales without bankrupting your API quota.

The Naive Approach: Synchronous API Calls

Let's look at how most developers initially integrate OpenAI into their Laravel controllers.

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use OpenAI\Laravel\Facades\OpenAI;
use App\Models\Article;

class ArticleSummaryController extends Controller
{
    public function store(Request $request, Article $article)
    {
        // ❌ BAD: Synchronous, blocking API call
        $response = OpenAI::chat()->create([
            'model' => 'gpt-4-turbo',
            'messages' => [
                ['role' => 'system', 'content' => 'Summarize the following article.'],
                ['role' => 'user', 'content' => $article->content],
            ],
        ]);

        $summary = $response->choices[0]->message->content;

        $article->update(['summary' => $summary]);

        return response()->json(['summary' => $summary]);
    }
}

Why this is a disaster waiting to happen:

  1. Blocking the Worker: PHP is synchronous. If OpenAI takes 15 seconds to generate the summary, that PHP-FPM worker is locked for 15 seconds. If you have 50 workers and 50 concurrent users request a summary, your entire application goes down. No one can even load the homepage.
  2. No Retry Mechanism: Network requests fail. OpenAI goes down. If the API returns a 500 or a 429 (Rate Limit), the user gets a generic error, and the data is lost.
  3. Zero Caching: If 100 users ask for the summary of the same article, you pay OpenAI 100 times.

Step 1: Moving to a Queue-Driven Architecture

The golden rule of AI integration: Never make an LLM API call in the HTTP request lifecycle.

Instead, we need to dispatch a job to the queue and return a response to the user immediately. We can use Laravel's broadcasting or polling to notify the frontend when the AI is done.

Let's refactor our controller to dispatch a job.

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use App\Jobs\GenerateArticleSummary;
use App\Models\Article;

class ArticleSummaryController extends Controller
{
    public function store(Request $request, Article $article)
    {
        // ✅ GOOD: Dispatch to queue and return immediately
        GenerateArticleSummary::dispatch($article, $request->user());

        return response()->json([
            'message' => 'Summary generation started.',
            'status_url' => route('articles.summary.status', $article)
        ], 202);
    }
}
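The status_url in that 202 response implies a polling endpoint. A minimal sketch (the controller name and the nullable summary column are assumptions carried over from the surrounding code):

```php
<?php

namespace App\Http\Controllers;

use App\Models\Article;

class ArticleSummaryStatusController extends Controller
{
    public function show(Article $article)
    {
        // The queued job writes the summary when it finishes, so a null
        // summary means generation is still pending (or has failed).
        return response()->json([
            'status'  => $article->summary !== null ? 'completed' : 'pending',
            'summary' => $article->summary,
        ]);
    }
}
```

The frontend polls this route every few seconds until the status flips to completed, or you can skip polling entirely with broadcasting, as shown next.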

Now, let's build the job. This is where the magic happens.

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use OpenAI\Laravel\Facades\OpenAI;
use App\Models\Article;
use App\Models\User;
use Illuminate\Support\Facades\Log;
use Throwable;

class GenerateArticleSummary implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $tries = 3;
    public $backoff = [10, 30, 60]; // Staggered backoff: 10s, 30s, 60s between retries

    public function __construct(
        public Article $article,
        public User $user
    ) {}

    public function handle(): void
    {
        try {
            $response = OpenAI::chat()->create([
                'model' => 'gpt-4-turbo',
                'messages' => [
                    ['role' => 'system', 'content' => 'Summarize the following article.'],
                    ['role' => 'user', 'content' => $this->article->content],
                ],
            ]);

            $summary = $response->choices[0]->message->content;

            $this->article->update(['summary' => $summary]);

            // Notify the user via WebSockets
        // broadcast(new SummaryGenerated($this->article, $this->user));

        } catch (\OpenAI\Exceptions\ErrorException $e) {
            if ($e->getErrorCode() === 'rate_limit_exceeded') {
                Log::warning('OpenAI Rate Limit Hit. Releasing job.');
                $this->release(60); // Wait 60 seconds before retrying
                return;
            }

            throw $e;
        }
    }

    public function failed(Throwable $exception): void
    {
        Log::error("Failed to generate summary for Article {$this->article->id}: {$exception->getMessage()}");
        // Notify user of failure
    }
}

Key Improvements:

  • Staggered Backoff: If the job fails, it waits 10 seconds, then 30, then 60 before retrying, via the $backoff property.
  • Rate Limit Handling: We specifically catch OpenAI's rate limit exception and release the job back to the queue with a 60-second delay, instead of burning through our retries instantly. (Note that released jobs still count toward $tries, so consider raising $tries or setting $maxExceptions if rate limits are frequent.)
  • Non-Blocking: The user gets a 202 Accepted response instantly. The heavy lifting happens in the background.
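The commented-out broadcast in the job assumes a SummaryGenerated event. A minimal sketch using a private channel (the channel name is illustrative):

```php
<?php

namespace App\Events;

use App\Models\Article;
use App\Models\User;
use Illuminate\Broadcasting\InteractsWithSockets;
use Illuminate\Broadcasting\PrivateChannel;
use Illuminate\Contracts\Broadcasting\ShouldBroadcast;
use Illuminate\Foundation\Events\Dispatchable;
use Illuminate\Queue\SerializesModels;

class SummaryGenerated implements ShouldBroadcast
{
    use Dispatchable, InteractsWithSockets, SerializesModels;

    public function __construct(
        public Article $article,
        public User $user
    ) {}

    // Deliver only to the user who requested the summary.
    public function broadcastOn(): PrivateChannel
    {
        return new PrivateChannel('users.' . $this->user->id);
    }
}
```

With Laravel Echo listening on that channel, the frontend gets the summary the moment the job finishes, with no polling.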

Step 2: Intelligent Caching to Save Your Quota

If your app allows users to ask questions or generate content based on static inputs, you must cache the results. At low temperatures, LLM output is stable enough that identical prompts can safely share a single cached response.

Let's implement a caching layer using Laravel's Cache facade. We'll hash the prompt to create a unique cache key.

namespace App\Services;

use Illuminate\Support\Facades\Cache;
use OpenAI\Laravel\Facades\OpenAI;

class OpenAIService
{
    public function generateCachedResponse(string $systemPrompt, string $userPrompt): string
    {
        // Create a unique fingerprint for this exact request
        $cacheKey = 'openai_response_' . md5($systemPrompt . $userPrompt);

        return Cache::remember($cacheKey, now()->addDays(30), function () use ($systemPrompt, $userPrompt) {
            $response = OpenAI::chat()->create([
                'model' => 'gpt-4-turbo',
                'temperature' => 0.2, // Lower temperature for more deterministic caching
                'messages' => [
                    ['role' => 'system', 'content' => $systemPrompt],
                    ['role' => 'user', 'content' => $userPrompt],
                ],
            ]);

            return $response->choices[0]->message->content;
        });
    }
}

By hashing the combined system and user prompts, we ensure that if any user asks the exact same question, we serve the response from the cache (Redis, for example) in milliseconds instead of paying OpenAI and waiting ten seconds.
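One refinement worth noting: the key above ignores the model and prompt template, so a model upgrade or prompt tweak would keep serving stale entries for up to 30 days. Folding those into the fingerprint sidesteps this (a sketch; the function and version names are illustrative):

```php
<?php

// Build a versioned cache fingerprint. Changing the model or bumping the
// prompt version automatically misses old entries instead of serving
// output generated under the previous configuration.
function openAiCacheKey(
    string $model,
    string $promptVersion,
    string $systemPrompt,
    string $userPrompt
): string {
    return 'openai_response_' . md5(implode('|', [$model, $promptVersion, $systemPrompt, $userPrompt]));
}

// A model upgrade yields a different key, forcing fresh generation.
$old = openAiCacheKey('gpt-4-turbo', 'v1', 'Summarize the article.', 'Some text');
$new = openAiCacheKey('gpt-4o', 'v1', 'Summarize the article.', 'Some text');
```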

Step 3: Semantic Caching with Vector Databases

Exact string matching (MD5 hashing) is great, but what if User A asks "How do I scale Laravel?" and User B asks "What is the best way to scale a Laravel app?"

These are semantically identical, but their MD5 hashes will be completely different. This is where Semantic Caching comes in.

Instead of caching by exact string match, we embed the user's query into a vector, search our vector database (like Pinecone, Weaviate, or pgvector) for similar past queries, and return the cached response if the similarity score is high enough (e.g., > 0.95).

Here is a conceptual implementation using Laravel and a hypothetical Vector DB client:

namespace App\Services;

use OpenAI\Laravel\Facades\OpenAI;
use App\Services\VectorDatabase;

class SemanticCacheService
{
    public function __construct(protected VectorDatabase $vectorDb) {}

    public function ask(string $question): string
    {
        // 1. Generate an embedding for the user's question
        $embeddingResponse = OpenAI::embeddings()->create([
            'model' => 'text-embedding-3-small',
            'input' => $question,
        ]);

        $vector = $embeddingResponse->embeddings[0]->embedding;

        // 2. Search the vector database for similar past questions
        $similarPastQuery = $this->vectorDb->search('cached_queries', $vector, limit: 1);

        // 3. If we find a match with > 95% similarity, return the cached answer
        if ($similarPastQuery && $similarPastQuery->score > 0.95) {
            return $similarPastQuery->metadata['answer'];
        }

        // 4. Otherwise, ask the LLM
        $llmResponse = OpenAI::chat()->create([
            'model' => 'gpt-4-turbo',
            'messages' => [['role' => 'user', 'content' => $question]],
        ]);

        $answer = $llmResponse->choices[0]->message->content;

        // 5. Store the new question and answer in the vector database for future users
        $this->vectorDb->insert('cached_queries', [
            'vector' => $vector,
            'metadata' => [
                'question' => $question,
                'answer' => $answer
            ]
        ]);

        return $answer;
    }
}

This approach drastically reduces API costs for applications like AI customer support bots or documentation assistants, where users frequently ask variations of the same questions.
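For reference, the similarity score used above is typically cosine similarity between the two embedding vectors. In plain PHP (OpenAI embeddings are unit-normalized, so the dot product alone gives the same ranking, but the full formula is shown for clarity):

```php
<?php

// Cosine similarity between two equal-length vectors: the dot product
// divided by the product of the vector magnitudes. Ranges from -1
// (opposite) through 0 (orthogonal) to 1 (same direction).
function cosineSimilarity(array $a, array $b): float
{
    $dot = 0.0;
    $normA = 0.0;
    $normB = 0.0;
    foreach ($a as $i => $v) {
        $dot   += $v * $b[$i];
        $normA += $v * $v;
        $normB += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($normA) * sqrt($normB));
}

// Parallel vectors score 1.0; orthogonal vectors score 0.0.
$same = cosineSimilarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]);
$diff = cosineSimilarity([1.0, 0.0], [0.0, 1.0]);
```

A 0.95 threshold is a starting point, not gospel: tune it against real query pairs, since too low a cutoff will return wrong answers to subtly different questions.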

Step 4: Circuit Breakers for API Outages

When OpenAI goes down (and it will), your queues will quickly fill up with failing jobs. If you have 10,000 jobs in the queue and they all start failing and retrying, you'll exhaust your worker resources and then hammer the API the moment it recovers, inviting a fresh round of rate limiting.

You need a Circuit Breaker.

A circuit breaker monitors for consecutive failures. If the failure rate crosses a threshold, the circuit "trips" (opens), and subsequent requests are immediately rejected or delayed without even trying to hit the API. After a cooldown period, it allows a "half-open" state to test if the API is back.

We can implement a simple circuit breaker in our Laravel job using the Cache:

namespace App\Jobs\Middleware;

use Closure;
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Log;
use Throwable;

class OpenAICircuitBreaker
{
    public function handle(object $job, Closure $next): void
    {
        if (Cache::has('openai_circuit_open')) {
            Log::warning('Circuit breaker open. Releasing job.');
            $job->release(300); // Delay for 5 minutes
            return;
        }

        try {
            $next($job);
            // On success, reset the failure counter
            Cache::forget('openai_consecutive_failures');
        } catch (Throwable $e) {
            $failures = Cache::increment('openai_consecutive_failures');

            if ($failures >= 10) {
                // Trip the circuit for 5 minutes; the cache expiry acts
                // as the cooldown before traffic is allowed through again
                Cache::put('openai_circuit_open', true, 300);
                Log::critical('OpenAI Circuit Breaker TRIPPED!');
            }

            throw $e;
        }
    }
}

You then attach this middleware to your job:

public function middleware(): array
{
    return [new \App\Jobs\Middleware\OpenAICircuitBreaker];
}

Conclusion

Building an AI wrapper is easy. Scaling it is hard. By treating LLM APIs with the same architectural respect as slow, external microservices, you can build resilient applications that survive traffic spikes and API outages.

Key Takeaways:

  • Never block the HTTP request: Always use queues for LLM calls.
  • Handle rate limits gracefully: Catch 429 errors and retry with staggered backoff.
  • Cache aggressively: Use exact string matching for static prompts and semantic caching for user queries.
  • Protect your workers: Implement circuit breakers to prevent queue stampedes during API outages.

By implementing these patterns, you'll ensure your Laravel application remains blazing fast, your workers stay healthy, and your OpenAI bill stays manageable.

Discussion Prompt

Have you encountered the "AI N+1 problem" in your own applications? What caching strategies have you found most effective for reducing LLM API costs? Let me know in the comments!


About the Author: Ameer Hamza is a Top-Rated Full-Stack Developer with 7+ years of experience building SaaS platforms, eCommerce solutions, and AI-powered applications. He specializes in Laravel, Vue.js, React, Next.js, and AI integrations — with 50+ projects shipped and a 100% job success rate. Check out his portfolio at ameer.pk to see his latest work, or reach out for your next development project.

Top comments (1)

Stas:

Sort of useless: $cacheKey = 'openai_response_' . md5($systemPrompt . $userPrompt); - user prompts usually are never the same unless it is a pre-defined select-one-of choice.
However, surprisingly useful article. I can relate to most of the points (except aggressive caching).

P.S. Funny enough, I found myself trying to do the same thing this week but using the Laravel AI kit. Eventually ended up with a custom workaround because the very hyped Laravel native AI kit has poor implementation and can't be used for background processing.

P.P.S. Background processing for non-urgent AI requests is the best killer feature - most providers support discounted rates for delayed/batched processing. For example, OpenAI has 50% off on "flex" requests, but from my experience, they are done within a few minutes (1-5 typically).