Dewald Hugo

Posted on • Originally published at origin-main.com

RAG-less Architecture in Laravel: Long-Context Caching with Gemini

For the past few years, RAG has been presented as the only serious approach to grounding AI on large private datasets. Build a vector database, chunk your documents, embed the chunks, retrieve the closest matches, inject them into the prompt. Repeat for every query. It works, and for certain use cases it is genuinely the right call.

But for others — particularly repeated, deep-analysis tasks on the same corpus — RAG is an expensive workaround for a context window problem that the models have largely outgrown.

Google’s Gemini 2.5 Flash and 2.5 Pro both ship with a 1-million-token context window. That is approximately 750,000 words, or a mid-sized Laravel monolith with room to spare. More importantly, the Gemini API now supports Explicit Context Caching: you upload that corpus once, pay full input price once, and reference the cached context on every subsequent query at a steep discount to the standard input rate (90% at the Flash long-context tier used throughout this guide). For workloads that involve many questions against the same codebase, document set, or knowledge base, this is a fundamentally different cost model — and a fundamentally different architecture.

This guide walks you through building a GeminiContextCacheService in Laravel 11/12, binding it to the Service Container, and building a production-ready audit endpoint around it. We will cover the full cache lifecycle, including the storage cost trap that most tutorials skip entirely.

RAG vs. Long-Context Caching with Gemini: The Architectural Trade-Off

Before touching code, we need to be honest about what each approach actually is. Neither is universally superior.

RAG trades context completeness for cost predictability. You pay only for the retrieved chunks, not the full corpus — which is excellent when your dataset is enormous (hundreds of GB) and any single query only needs a narrow slice of it. The problem is that retrieval is fundamentally lossy. A semantic search might pull the right controller method and miss the BaseModel that overrides save(). It will find the public interface and skip the private helper that silently transforms the input. That retrieval gap is not a bug you can patch; it is the architecture.

Long-context caching with Gemini trades retrieval precision for corpus completeness. The model sees everything. There is no retrieval step, no chunking pipeline, no embedding model to maintain. The trade-off is cost structure: the first query is expensive (you pay to ingest the full context), and you carry a per-hour storage fee while the cache lives. Beyond the break-even point — which arrives faster than you expect — repeated queries cost significantly less than the RAG equivalent.

The Cost Curve: Why Caching Is the Financial Lever

Let us put real numbers on this. Using Gemini 2.5 Flash, which offers the best price-to-context ratio for this workload:

  • Standard input rate (long context >128K tokens): $0.30/MTok
  • Cached input rate: $0.03/MTok (90% reduction at Flash long-context tier)
  • Cache storage: billed per hour per MTok while the cache lives

For a 500K-token codebase audit with 20 questions:

Without caching: Each query sends 500K tokens to the API. That is $0.15 per question × 20 = $3.00 total.

With caching: Initial cache creation: $0.15 (one-time). Every question then references the cache at $0.015. Twenty questions: $0.15 + ($0.015 × 20) = $0.45 total — plus a small storage fee for however long the cache lives.

The break-even is query 2. Every question after that is 90% cheaper than the RAG alternative that re-sends the same context.
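The arithmetic above generalises into a quick back-of-envelope helper. This is a sketch using the example rates quoted in this section, not live pricing, and it ignores the per-hour storage fee:

```php
<?php

// Illustrative cost model. Rates are this article's example figures
// ($0.30/MTok standard long-context input, $0.03/MTok cached input);
// check Google's pricing page for current numbers.
function auditCostUsd(float $corpusMTok, int $questions, bool $cached): float
{
    $standardRate = 0.30; // $/MTok
    $cachedRate   = 0.03; // $/MTok

    if (! $cached) {
        // Every question re-sends the full corpus at the standard rate.
        return $corpusMTok * $standardRate * $questions;
    }

    // One full-price ingestion, then every question hits the cache.
    return ($corpusMTok * $standardRate) + ($corpusMTok * $cachedRate * $questions);
}

// 500K-token corpus (0.5 MTok), 20 questions:
printf("Uncached: $%.2f\n", auditCostUsd(0.5, 20, false)); // Uncached: $3.00
printf("Cached:   $%.2f\n", auditCostUsd(0.5, 20, true));  // Cached:   $0.45
```

The cached curve starts above the uncached one (ingestion plus the first question costs more than a single standard query) and crosses below it at the second question, which is the break-even noted above.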

When RAG Still Wins

We should be direct about this before anyone rips out a working vector pipeline.

If your corpus exceeds 1M tokens — a multi-year ticket archive, a large documentation set, a multi-repo monorepo — the long-context window is not a solution. You physically cannot fit the data. RAG is the architecture. Our existing guide on Laravel Embeddings, Vector Databases, and RAG covers pgvector and the full embedding pipeline if that is your situation.

Similarly, if your corpus changes frequently — user-generated content, live database records — cache invalidation becomes expensive. Re-creating a 1M-token cache on every data change defeats the cost benefit.

Long-context caching targets a specific sweet spot: a corpus that is large enough to matter (above RAG’s retrieval quality threshold), small enough to fit in 1M tokens, and stable enough that you will run many queries before the next update. B2B codebase auditing, legal document analysis, product catalog deep-dives — these are the use cases.

Implementation: The Stateful Expert Pattern

Prerequisites

Add your Gemini API key to .env:

GEMINI_API_KEY=your_key_here
GEMINI_MODEL=gemini-2.5-flash
GEMINI_CACHE_TTL=7200

Then register the configuration values in config/services.php:

'gemini' => [
    'api_key'   => env('GEMINI_API_KEY'),
    'base_url'  => 'https://generativelanguage.googleapis.com/v1beta',
    'model'     => env('GEMINI_MODEL', 'gemini-2.5-flash'),
    'cache_ttl' => (int) env('GEMINI_CACHE_TTL', 7200),
],

The minimum content size for Gemini context caching is 32,768 tokens — roughly 25,000 words. You cannot cache a handful of files. You need to be pushing a full module, a full codebase, or a substantial document set. Keep that constraint in mind when designing your corpus preparation step.
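A cheap pre-flight estimate helps you respect both bounds before touching the API. The 4-characters-per-token ratio below is a rough heuristic for English prose and source code, not an exact count; the API's own token counter is the authority:

```php
<?php

// Heuristic size check for a candidate corpus. Both bounds are hard
// limits: explicit caching needs >= 32,768 tokens, and the corpus
// must fit the 1M-token context window.
function estimateTokens(string $corpus): int
{
    // ~4 characters per token is a common rule of thumb, not a guarantee.
    return (int) ceil(strlen($corpus) / 4);
}

function fitsCacheWindow(string $corpus): bool
{
    $tokens = estimateTokens($corpus);

    return $tokens >= 32_768 && $tokens <= 1_000_000;
}
```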

If you have not yet set up your Gemini client or worked through model selection on your Laravel stack, our Integrating Gemini into the Laravel AI SDK guide covers the foundational setup decisions before you start hitting the API directly.

Step 1: The Context Caching Service

This service owns two responsibilities: cache lifecycle (create, retrieve, delete) and query dispatch. We keep them in one class because the cache name produced at creation is required by every query — splitting them would force two services to share state for no real gain.

<?php

namespace App\Services;

use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;
use RuntimeException;

class GeminiContextCacheService
{
    private string $apiKey;
    private string $baseUrl;
    private string $model;
    private int    $ttl;

    public function __construct()
    {
        $this->apiKey  = config('services.gemini.api_key');
        $this->baseUrl = config('services.gemini.base_url');
        $this->model   = config('services.gemini.model');
        $this->ttl     = config('services.gemini.cache_ttl');
    }

    /**
     * Return an existing live cache name from Redis, or create a new one.
     * The cache key is your application's identifier (e.g. "audit:project-42").
     * The Gemini cache name is what the API issues back — it looks like
     * "cachedContents/abc123xyz" and is what you reference in generation requests.
     */
    public function getOrCreateCache(
        string $cacheKey,
        string $systemPrompt,
        array  $documents
    ): string {
        $geminiCacheName = Cache::get("gemini_cache:{$cacheKey}");

        if ($geminiCacheName) {
            Log::debug('Gemini cache hit', ['key' => $cacheKey]);
            return $geminiCacheName;
        }

        return $this->createCache($cacheKey, $systemPrompt, $documents);
    }

    private function createCache(
        string $cacheKey,
        string $systemPrompt,
        array  $documents
    ): string {
        // Each document becomes a separate part in the cached content
        $parts = array_map(
            fn(string $doc) => ['text' => $doc],
            $documents
        );

        $response = Http::withHeaders([
            'x-goog-api-key' => $this->apiKey,
            'Content-Type'   => 'application/json',
        ])->post("{$this->baseUrl}/cachedContents", [
            'model'            => "models/{$this->model}",
            'displayName'      => $cacheKey,
            'systemInstruction' => [
                'parts' => [['text' => $systemPrompt]],
            ],
            'contents' => [
                ['role' => 'user', 'parts' => $parts],
            ],
            'ttl' => "{$this->ttl}s",
        ]);

        if ($response->failed()) {
            Log::error('Gemini cache creation failed', [
                'status' => $response->status(),
                'body'   => $response->json(),
                'key'    => $cacheKey,
            ]);

            throw new RuntimeException(
                'Failed to create Gemini context cache: ' . $response->body()
            );
        }

        $cacheName = $response->json('name');

        // Store with a 60-second buffer before the Gemini TTL expires
        // so we never reference a cache the API has already deleted
        Cache::put(
            "gemini_cache:{$cacheKey}",
            $cacheName,
            $this->ttl - 60
        );

        Log::info('Gemini context cache created', [
            'name' => $cacheName,
            'key'  => $cacheKey,
        ]);

        return $cacheName;
    }

    /**
     * Run a user question against an existing cached context.
     */
    public function query(string $cacheName, string $question): string
    {
        $response = Http::withHeaders([
            'x-goog-api-key' => $this->apiKey,
            'Content-Type'   => 'application/json',
        ])->post("{$this->baseUrl}/models/{$this->model}:generateContent", [
            'cachedContent' => $cacheName,
            'contents'      => [
                ['role' => 'user', 'parts' => [['text' => $question]]],
            ],
        ]);

        if ($response->failed()) {
            if ($response->status() === 429) {
                throw new RuntimeException(
                    'Gemini rate limit exceeded. Implement exponential backoff before retrying.'
                );
            }

            Log::error('Gemini query failed', [
                'status' => $response->status(),
                'cache'  => $cacheName,
            ]);

            throw new RuntimeException('Gemini query failed: ' . $response->body());
        }

        $text = $response->json('candidates.0.content.parts.0.text');

        if ($text === null) {
            throw new RuntimeException(
                'Unexpected Gemini response structure: ' . $response->body()
            );
        }

        return $text;
    }

    /**
     * Explicitly delete a cache when the session ends.
     * Do not leave caches running — storage costs accumulate per hour.
     */
    public function deleteCache(string $cacheKey): void
    {
        $geminiCacheName = Cache::get("gemini_cache:{$cacheKey}");

        if (! $geminiCacheName) {
            return;
        }

        Http::withHeaders(['x-goog-api-key' => $this->apiKey])
            ->delete("{$this->baseUrl}/{$geminiCacheName}");

        Cache::forget("gemini_cache:{$cacheKey}");

        Log::info('Gemini context cache deleted', ['key' => $cacheKey]);
    }
}

[Architect’s Note] The 60-second TTL buffer on the Redis key is not optional. If your Laravel cache expiry aligns exactly with Gemini’s TTL, a query that arrives in the final seconds will find a valid Redis key but hit a deleted Gemini cache. The API returns a 404, your query() method throws, and your user gets a 503. Build the buffer in. If you are concerned about token governance and want to log every token consumed by cached vs. standard queries separately, our Laravel AI Middleware: Token Tracking & Rate Limiting guide covers the instrumentation layer you would wrap around this service.

Step 2: Binding to the Service Container

Register the service as a singleton in AppServiceProvider. Because the service reads from config in its constructor, a singleton is correct here — we want one instance per request cycle, not one per injection.

<?php

namespace App\Providers;

use App\Services\GeminiContextCacheService;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function register(): void
    {
        $this->app->singleton(GeminiContextCacheService::class);
    }
}

Laravel’s Service Container handles the rest. Any controller, job, or command that type-hints GeminiContextCacheService receives the same instance within a given request.

Step 3: The Audit Controller

<?php

namespace App\Http\Controllers;

use App\Services\GeminiContextCacheService;
use Illuminate\Http\JsonResponse;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Log;
use RuntimeException;

class CodeAuditController extends Controller
{
    public function __construct(
        private readonly GeminiContextCacheService $gemini
    ) {}

    public function audit(Request $request): JsonResponse
    {
        $validated = $request->validate([
            'project_id' => ['required', 'string', 'max:100'],
            'question'   => ['required', 'string', 'max:2000'],
        ]);

        try {
            $documents = $this->loadProjectDocuments($validated['project_id']);

            $systemPrompt  = 'You are a senior Laravel architect reviewing a production codebase. '
                . 'When answering questions, cite specific files and line numbers. '
                . 'Flag security risks, legacy patterns, and architectural concerns explicitly.';

            $cacheName = $this->gemini->getOrCreateCache(
                cacheKey:     "audit:{$validated['project_id']}",
                systemPrompt: $systemPrompt,
                documents:    $documents,
            );

            $answer = $this->gemini->query($cacheName, $validated['question']);

            return response()->json(['answer' => $answer]);

        } catch (RuntimeException $e) {
            Log::error('Code audit failed', [
                'project_id' => $validated['project_id'],
                'error'      => $e->getMessage(),
            ]);

            return response()->json(
                ['error' => 'Audit service temporarily unavailable.'],
                503
            );
        }
    }

    private function loadProjectDocuments(string $projectId): array
    {
        $path = storage_path("app/projects/{$projectId}/codebase.txt");

        if (! file_exists($path)) {
            throw new RuntimeException(
                "Codebase not found for project: {$projectId}"
            );
        }

        // Return as array — each element becomes a separate Part in the cached content.
        // Split large codebases into logical chunks (per-module) if approaching 1M tokens.
        return [file_get_contents($path)];
    }
}

Register the route with Sanctum auth and a conservative throttle. Twenty questions per minute is generous for a codebase audit endpoint — adjust to match your billing ceiling.

// routes/api.php
use App\Http\Controllers\CodeAuditController;

Route::middleware(['auth:sanctum', 'throttle:20,1'])
    ->post('/audit', [CodeAuditController::class, 'audit']);

Securing this endpoint properly is not optional — an unauthenticated audit endpoint is a direct vector for API cost abuse. Our Laravel Sanctum API Authentication guide covers token scoping and rate-limit configuration if you need to tighten this down further.

Cache Lifecycle Management

This is the section most tutorials omit. Skip it and you will find $800 in unexpected Gemini storage charges at the end of the month.

Gemini charges per MTok per hour for every active cache. The pricing varies by model — check Google’s official API pricing page for current rates. The pattern you want is: create the cache when the audit session starts, delete it when the session ends, and run a scheduled cleanup job to catch anything that leaked through.

The cleanup command:

<?php

namespace App\Console\Commands;

use Carbon\Carbon;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\Http;
use Illuminate\Support\Facades\Log;

class CleanExpiredGeminiCaches extends Command
{
    protected $signature   = 'gemini:clean-caches';
    protected $description = 'Delete expired or stale Gemini context caches';

    public function handle(): int
    {
        $apiKey  = config('services.gemini.api_key');
        $baseUrl = config('services.gemini.base_url');

        $response = Http::withHeaders(['x-goog-api-key' => $apiKey])
            ->get("{$baseUrl}/cachedContents");

        if ($response->failed()) {
            $this->error('Failed to list caches: ' . $response->body());
            return self::FAILURE;
        }

        // NOTE: this list endpoint is paginated; if you run many caches,
        // follow nextPageToken instead of reading only the first page.
        $caches = $response->json('cachedContents', []);

        foreach ($caches as $cache) {
            $expireTime = Carbon::parse($cache['expireTime']);

            if ($expireTime->isPast()) {
                Http::withHeaders(['x-goog-api-key' => $apiKey])
                    ->delete("{$baseUrl}/{$cache['name']}");

                Log::info('Deleted stale Gemini cache', ['name' => $cache['name']]);
                $this->info("Deleted: {$cache['name']}");
            }
        }

        return self::SUCCESS;
    }
}

Schedule it in bootstrap/app.php:

// Requires: use Illuminate\Console\Scheduling\Schedule; at the top of bootstrap/app.php
->withSchedule(function (Schedule $schedule) {
    $schedule->command('gemini:clean-caches')->hourly();
})

[Edge Case Alert] If a queue worker crashes mid-audit before deleteCache() is called, the Gemini cache lives until its TTL expires and storage charges accumulate. The scheduled cleanup command is your safety net, not a replacement for explicit deletion. Consider also dispatching a DeleteGeminiCacheJob via defer() at the end of each audit session as a belt-and-suspenders measure.
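One possible shape for that job: a sketch, assuming the service from Step 1. The class name comes from the note above; the retry policy is illustrative.

```php
<?php

namespace App\Jobs;

use App\Services\GeminiContextCacheService;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Queue\Queueable;

class DeleteGeminiCacheJob implements ShouldQueue
{
    use Queueable;

    // Retry a few times: a failed delete means paid storage until the TTL fires.
    public int $tries = 3;

    public function __construct(
        private readonly string $cacheKey
    ) {}

    public function handle(GeminiContextCacheService $gemini): void
    {
        // deleteCache() is a no-op when the Redis key is already gone,
        // so retries and double-dispatch are both safe.
        $gemini->deleteCache($this->cacheKey);
    }
}
```

Dispatch it when the audit session ends; as the note above suggests, wrapping the dispatch in defer() pushes it past the response cycle so the user never waits on cleanup.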

Use Case: The Persistent Auditor in Practice

Here is where the architecture earns its keep. You are auditing a legacy Laravel 6 application for a client. The codebase is 400K tokens — controllers, models, migrations, the lot.

The RAG approach: You ask about UserController::store(). The retrieval pulls three relevant chunks. The answer looks correct. But the client’s BaseModel has an overridden save() method that intercepts every Eloquent write and transforms certain fields. RAG did not retrieve BaseModel — it was not semantically close enough to the query. The AI gives you an answer that describes the surface behaviour and misses the hack underneath. The audit is wrong.

The cached-context approach: You upload the full codebase. When you ask about UserController::store(), Gemini has already read BaseModel. It answers that the apparent save logic is intercepted downstream, identifies the override, and flags it as a maintenance risk. The context is complete. The answer is complete.

This is what we mean by Architectural Integrity. RAG optimises for retrieval cost. Long-context caching optimises for answer correctness. For an audit product where a wrong answer has professional liability implications, correctness wins.

The cost math, using our earlier figures: 20 audit questions on a 500K-token codebase cost $3.00 without caching and under $0.50 with caching — plus storage for however long the session runs. For a B2B audit priced at $200 per report, the infrastructure cost is negligible either way. But the quality difference is not.

For production agentic pipelines — where the “questions” are not human prompts but tool calls from an orchestrated agent — the cost advantage compounds quickly. Our guide on Hardening Laravel Agentic Workflows: Schema Validation Against LLM Hallucinations covers the validation layer you will want around those agent outputs, regardless of which AI architecture you choose.

Preparing Your Corpus for the Cache

The code examples above load a pre-processed codebase.txt file. In production, you need a pipeline that generates this file from your actual source. A simple Artisan command handles it:

<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\File;
use Symfony\Component\Finder\Finder;

class PrepareCodebaseCorpus extends Command
{
    protected $signature   = 'corpus:prepare {project_id} {path}';
    protected $description = 'Concatenate a project codebase into a single corpus file';

    public function handle(): int
    {
        $projectId = $this->argument('project_id');
        $path      = $this->argument('path');

        $finder = (new Finder())
            ->files()
            ->in($path)
            ->ignoreDotFiles(false) // .env.example is a dot file; Finder skips those by default
            ->name(['*.php', '*.json', '*.yaml', '*.env.example'])
            ->notPath(['vendor', 'node_modules', 'storage', '.git'])
            ->sortByName();

        $corpus  = '';
        $counter = 0;

        foreach ($finder as $file) {
            $relativePath = $file->getRelativePathname(); // relative to the ->in() root
            $corpus .= "\n\n// FILE: {$relativePath}\n";
            $corpus .= $file->getContents();
            $counter++;
        }

        $outputDir  = storage_path("app/projects/{$projectId}");
        $outputPath = $outputDir . '/codebase.txt';

        File::ensureDirectoryExists($outputDir);
        File::put($outputPath, $corpus);

        $sizeKb = round(strlen($corpus) / 1024, 1);
        $this->info("Corpus prepared: {$counter} files, {$sizeKb} KB → {$outputPath}");

        return self::SUCCESS;
    }
}

Run it before starting an audit session: php artisan corpus:prepare project-42 /path/to/legacy-app. The file headers (// FILE: app/Models/BaseModel.php) give Gemini the file path context it needs to cite locations in its answers.
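Before paying for ingestion, you can replace that KB figure with an exact token count via the API's countTokens endpoint. A sketch, assuming a $projectId variable in scope and the config keys from earlier; countTokens and its totalTokens response field are part of the Gemini REST API:

```php
use Illuminate\Support\Facades\Http;

$corpus = file_get_contents(storage_path("app/projects/{$projectId}/codebase.txt"));

// Exact token count for the prepared corpus. Run this before createCache().
$response = Http::withHeaders([
    'x-goog-api-key' => config('services.gemini.api_key'),
])->post(
    config('services.gemini.base_url')
        . '/models/' . config('services.gemini.model') . ':countTokens',
    [
        'contents' => [
            ['role' => 'user', 'parts' => [['text' => $corpus]]],
        ],
    ]
);

$tokens = $response->json('totalTokens');

// Refuse to cache anything outside the 32,768 to 1M window.
if ($tokens < 32_768 || $tokens > 1_000_000) {
    throw new RuntimeException("Corpus is {$tokens} tokens, outside the cacheable range.");
}
```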

[Efficiency Gain] Strip comments and docblocks from PHP files before creating the corpus. A typical Laravel application loses 20–30% of its token count after comment stripping, which can push a borderline 1.2M-token codebase under the 1M limit and save meaningfully on cache creation costs. PHP’s token_get_all() makes this straightforward; a simple token filter removes T_COMMENT and T_DOC_COMMENT tokens before concatenation.
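A minimal version of that filter, run over each file's contents before concatenation. One caveat: a line comment's token includes its trailing newline, so stripping it can merge two code lines; that is harmless for the model but visible in the corpus.

```php
<?php

// Strip comments and docblocks from PHP source using the tokenizer.
// Whitespace tokens are kept so the remaining code stays readable.
function stripComments(string $phpSource): string
{
    $output = '';

    foreach (token_get_all($phpSource) as $token) {
        if (is_array($token)) {
            [$id, $text] = $token;

            // Drop comments and docblocks; everything else passes through verbatim.
            if ($id === T_COMMENT || $id === T_DOC_COMMENT) {
                continue;
            }

            $output .= $text;
            continue;
        }

        $output .= $token; // single-character tokens: ; { } ( ) etc.
    }

    return $output;
}
```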

Conclusion: Context Is the New Database

RAG will not disappear. For certain corpus shapes — enormous, dynamic, multi-tenant — it remains the correct architecture. But it has been treated as the default answer for too long, applied reflexively to problems that long-context models can now solve more cleanly and more accurately.

Explicit context caching in Gemini changes the trade-off. The vector database, the chunking strategy, the embedding model, the retrieval tuning — for a well-bounded corpus, all of that overhead collapses into a single cached upload and a TTL you manage in Redis. The model sees the whole codebase. The answers reflect the whole codebase.

The architecture we have built here — GeminiContextCacheService bound as a singleton, TTL-buffered cache keys in Redis, explicit lifecycle management, a corpus preparation command — is production-ready. Wire up your Sanctum-protected audit endpoint, run the corpus preparation pipeline, and you have a stateful expert that knows your codebase better than any retrieval pipeline can.

For official context caching documentation and current pricing, refer directly to Google’s Gemini API Context Caching guide — pricing has shifted frequently in 2026 and the source of truth is Google’s pricing page, not any third-party aggregator.
