Sepehr Mohseni

Building AI-native backends – RAG pipelines, function calling, prompt versioning, LLM observability

Two months ago, our internal knowledge base chatbot confidently told a support rep that our refund policy was “14 days, no questions asked.” Our real policy is 30 days with approval for larger amounts.

A $2,000 refund was processed based on that hallucination.

That was the moment we stopped treating LLM features like “smart text boxes” and started treating them like unreliable distributed systems that require real engineering.

This article is not about demos. It’s about what you have to build after the demo works.


The Reality of AI Backends

Traditional backends are deterministic.

Same input → same output.

AI backends are probabilistic.

Same input → slightly different output depending on context, model variance, and prompt structure.

This means:

  • You cannot trust outputs
  • You cannot trust retrieval
  • You cannot trust prompts
  • You cannot trust tool calls
  • You must observe everything

A production AI backend ends up looking like this:

API
  ↓
AI Orchestrator
  ├─ Guardrails
  ├─ Router
  ├─ Rate limits
  ├─ RAG pipeline
  ├─ Function execution
  └─ Direct generation
  ↓
Observability + Evals

If you skip any of these layers, you will eventually ship a hallucination that costs money.
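
Tied together, that orchestration layer can start as a single class. A minimal sketch, assuming hypothetical AiRequest, Guardrails, and Router classes; RagResponder, HybridRetriever, ToolExecutor, and LLM are the services shown later in this article.

class AiOrchestrator
{
    public function handle(AiRequest $request)
    {
        // Guardrails first: block prompt injection, PII leaks, and off-topic
        // requests before any tokens are spent.
        app(Guardrails::class)->validate($request);

        RateLimiter::hit("ai:user:{$request->user->id}", 60);

        // The router decides which path the request takes.
        return match (app(Router::class)->classify($request)) {
            'rag' => app(RagResponder::class)->answer(
                $request->question,
                app(HybridRetriever::class)->search($request->question)
            ),
            'tool' => app(ToolExecutor::class)->execute(
                $request->tool, $request->args, $request->user
            ),
            default => app(LLM::class)->chat($request->messages),
        };
    }
}

Every request pays the guardrail and rate-limit toll before the router decides whether it needs retrieval, a tool, or a plain completion.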


RAG Is Not “Chunk, Embed, Query”

The tutorial version of RAG is:

split text → embed → vector search → pass to LLM

That works in a notebook. It fails in production.

Real RAG needs:

  • Proper ingestion pipeline
  • Semantic chunking
  • Change detection
  • Hybrid retrieval (vector + keyword)
  • Reranking
  • Ongoing evaluation of retrieval quality

Ingestion in Laravel

Ingestion is a queued job, not a script.

You will see the same documents over and over. You re-embed only what changed.

class IngestDocuments
{
    public function handle(SourceInterface $source)
    {
        $documents = $source->fetch();

        foreach ($documents as $doc) {
            $hash = sha1($doc->content);

            if (Cache::get("doc_hash_{$doc->id}") === $hash) {
                continue;
            }

            $chunks = (new SemanticChunker())->chunk($doc->content);
            $embeddings = app(EmbeddingService::class)->embed($chunks);

            app(VectorStore::class)->upsert($doc->id, $chunks, $embeddings);

            Cache::put("doc_hash_{$doc->id}", $hash, now()->addDay());
        }
    }
}

The biggest quality improvement you will see is semantic chunking instead of fixed token splits.
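
"Semantic" here just means splitting on document structure (paragraphs, headings) and packing related text together, instead of cutting every N tokens. A minimal sketch of a SemanticChunker along those lines, assuming plain-text content and a rough character budget:

class SemanticChunker
{
    public function __construct(private int $maxChars = 1500) {}

    /** @return string[] */
    public function chunk(string $content): array
    {
        $chunks = [];
        $current = '';

        // Split on blank lines so headings and paragraphs stay intact,
        // then pack paragraphs into chunks up to the character budget.
        foreach (preg_split("/\n\s*\n/", $content) as $paragraph) {
            $paragraph = trim($paragraph);

            if ($paragraph === '') {
                continue;
            }

            if ($current !== '' && strlen($current) + strlen($paragraph) > $this->maxChars) {
                $chunks[] = $current;
                $current = '';
            }

            $current .= ($current === '' ? '' : "\n\n") . $paragraph;
        }

        if ($current !== '') {
            $chunks[] = $current;
        }

        return $chunks;
    }
}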


Hybrid Retrieval Is Mandatory

Vector search misses exact matches like order IDs, SKUs, emails.

Keyword search misses meaning.

You need both.

class HybridRetriever
{
    public function search(string $query, int $limit = 8)
    {
        $vector = app(VectorStore::class)->search($query, $limit * 2);
        $keyword = app(KeywordSearch::class)->search($query, $limit * 2);

        return $this->mergeAndRank($vector, $keyword, $limit);
    }
}

Most hallucinations in RAG systems are actually retrieval failures, not model failures.
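
The mergeAndRank step carries most of the weight. Reciprocal rank fusion is a common way to combine the two result lists without tuning score weights; a minimal sketch, assuming each result is an associative array with an id and content:

class HybridRetriever
{
    // ... search() as above ...

    // Reciprocal rank fusion: each document scores 1 / (k + rank) in every
    // list it appears in; the constant k dampens the impact of top ranks.
    private function mergeAndRank(array $vector, array $keyword, int $limit, int $k = 60): array
    {
        $scores = [];
        $byId = [];

        foreach ([$vector, $keyword] as $results) {
            foreach (array_values($results) as $rank => $result) {
                $id = $result['id'];
                $scores[$id] = ($scores[$id] ?? 0) + 1 / ($k + $rank + 1);
                $byId[$id] = $result;
            }
        }

        arsort($scores);

        return array_map(
            fn ($id) => $byId[$id],
            array_slice(array_keys($scores), 0, $limit)
        );
    }
}

Exact-match hits from keyword search and semantic hits from vector search both survive the merge, which is the whole point.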


Generation With Grounded Context

What you pass to the model matters more than the model.

class RagResponder
{
    public function answer(string $question, array $chunks)
    {
        $context = collect($chunks)
            ->pluck('content')
            ->join("\n\n");

        $prompt = Prompt::load('rag-answer', 'v3');

        $response = app(LLM::class)->chat([
            ['role' => 'system', 'content' => $prompt->system],
            ['role' => 'user', 'content' => $prompt->fill([
                'context' => $context,
                'question' => $question,
            ])],
        ], temperature: 0.2, json: true);

        return $response;
    }
}

Low temperature. Structured output. Explicit context.

You are trying to reduce creativity, not increase it.
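
For reference, the stored rag-answer prompt does not need to be clever. An illustrative version (the real v3 prompt is not shown in this article):

// Illustrative only: the actual 'rag-answer' v3 prompt is not part of this article.
$ragAnswerV3 = [
    'name' => 'rag-answer',
    'version' => 'v3',
    'system' => 'You are a support assistant. Answer ONLY from the provided context. '
        . 'If the context does not contain the answer, say you do not know. '
        . 'Return JSON: {"answer": string, "sources": string[]}.',
    'template' => "Context:\n{context}\n\nQuestion:\n{question}",
];

Whatever the exact wording, the properties that matter are the same: answer only from context, admit ignorance, return a parseable shape.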


Function Calling Without Guardrails Will Burn You

Letting an LLM trigger backend actions without controls is equivalent to letting users hit internal APIs directly.

Every tool call must go through:

  • Authorization
  • Rate limiting
  • Audit logging
  • Optional approval

class ToolExecutor
{
    public function execute(string $tool, array $args, User $user)
    {
        $definition = ToolRegistry::get($tool);

        Gate::authorize($definition->ability, $user);

        if ($definition->needsApproval && !$user->isAdmin()) {
            throw new AuthorizationException();
        }

        // hit() alone only records the attempt; check the limit before executing.
        if (RateLimiter::tooManyAttempts("tool:{$tool}", 30)) {
            throw new ThrottleRequestsException('Tool call rate limit exceeded.');
        }

        RateLimiter::hit("tool:{$tool}", 60);

        $result = call_user_func($definition->handler, $args);

        AuditLog::create([
            'user_id' => $user->id,
            'tool' => $tool,
            'args' => $args,
            'result' => $result,
        ]);

        return $result;
    }
}

Refunds, account changes, billing operations — these must never be “just a function call.”
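
The ToolRegistry the executor pulls from is just a map of names to definitions. A minimal sketch, assuming a small DTO with the fields the executor reads:

class ToolDefinition
{
    public function __construct(
        public string $name,
        public string $ability,       // Gate ability checked before execution
        public bool $needsApproval,   // dangerous tools require an approver
        public \Closure $handler,     // the actual backend action
    ) {}
}

class ToolRegistry
{
    /** @var array<string, ToolDefinition> */
    private static array $tools = [];

    public static function register(ToolDefinition $tool): void
    {
        self::$tools[$tool->name] = $tool;
    }

    public static function get(string $name): ToolDefinition
    {
        return self::$tools[$name]
            ?? throw new \InvalidArgumentException("Unknown tool: {$name}");
    }
}

Registering a refund tool then means declaring its gate ability and flagging it as needing approval, not wiring the model straight to a controller.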


Prompts Are Code

Prompts change behavior more than code does.

They must be:

  • Versioned
  • Stored
  • Reviewed
  • Rolled out gradually

class Prompt extends Model
{
    protected $casts = ['variables' => 'array'];

    // Matches the Prompt::load('rag-answer', 'v3') call in RagResponder above.
    public static function load(string $name, string $version): self
    {
        return PromptManager::load($name, $version);
    }
}

class PromptManager
{
    public static function load(string $name, string $version): Prompt
    {
        return Prompt::where(compact('name', 'version'))->firstOrFail();
    }
}

Never hardcode prompts in PHP files.

You will want to change them without redeploying.
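
Gradual rollout can be as simple as hashing the user into a bucket and serving a candidate version to a small slice of traffic. A sketch, assuming a hypothetical prompts.rollouts config entry:

class PromptRollout
{
    // Serve the candidate prompt version to a fixed percentage of users,
    // keyed by a stable hash so each user always sees the same version.
    public static function versionFor(string $name, User $user): string
    {
        $rollout = config("prompts.rollouts.{$name}", [
            'stable' => 'v3',
            'candidate' => 'v4',
            'percent' => 10,
        ]);

        $bucket = crc32("{$name}:{$user->id}") % 100;

        return $bucket < $rollout['percent']
            ? $rollout['candidate']
            : $rollout['stable'];
    }
}

Then PromptManager::load($name, PromptRollout::versionFor($name, $user)) picks the version per request, and a bad prompt only reaches a fraction of users.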


Observability Is Not Optional

You need to log, trace, and evaluate:

  • The user query
  • Retrieved chunks
  • Final prompt sent
  • Model output
  • Tokens and latency

Without this, you cannot debug hallucinations.

You also need automated evaluations that periodically ask:

“Is this answer actually grounded in the provided context?”

That’s how you catch issues before users do.
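
A trace row per request is enough to start, plus a scheduled job that samples recent traces and asks a judge model the grounding question. A minimal sketch, assuming a hypothetical llm_traces table and the same LLM interface used earlier:

class LlmTrace extends Model
{
    protected $guarded = [];
    protected $casts = ['chunks' => 'array', 'meta' => 'array'];
}

class TracedRagResponder
{
    public function answer(string $question, array $chunks)
    {
        $start = microtime(true);

        $response = app(RagResponder::class)->answer($question, $chunks);

        // Persist everything needed to replay and debug this answer later.
        // Token counts and the rendered prompt belong here too, if your
        // LLM client exposes them.
        LlmTrace::create([
            'question' => $question,
            'chunks'   => collect($chunks)->pluck('content')->all(),
            'output'   => json_encode($response),
            'meta'     => ['latency_ms' => (int) ((microtime(true) - $start) * 1000)],
        ]);

        return $response;
    }
}

// Scheduled job: sample recent traces and ask a judge model whether the
// answer is actually supported by the retrieved chunks.
class EvaluateGroundedness
{
    public function handle()
    {
        LlmTrace::latest()->limit(50)->get()->each(function (LlmTrace $trace) {
            $verdict = app(LLM::class)->chat([
                ['role' => 'system', 'content' => 'Reply with only "yes" or "no".'],
                ['role' => 'user', 'content' =>
                    "Context:\n" . implode("\n\n", $trace->chunks) .
                    "\n\nAnswer:\n{$trace->output}\n\nIs the answer fully grounded in the context?"],
            ], temperature: 0.0);

            $trace->update(['meta' => ($trace->meta ?? []) + ['grounded' => $verdict]]);
        });
    }
}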


Caching and Cost Control

LLM calls are expensive and slow.

Cache deterministic calls by hashing inputs.

class CachedLLM
{
    public function chat(array $payload)
    {
        $key = hash('sha256', json_encode($payload));

        return Cache::remember($key, 3600, fn () =>
            app(LLM::class)->chat($payload)
        );
    }
}

Track cost daily and hard-stop if you exceed budget.
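
"Hard-stop" can be a literal check before every call. A sketch, assuming a per-day spend counter in cache and a hypothetical estimateCostCents() helper wired to your provider's pricing:

class BudgetedLLM
{
    private const DAILY_BUDGET_CENTS = 5000; // $50/day

    public function chat(array $payload)
    {
        $key = 'llm_spend_cents:' . now()->format('Y-m-d');

        if ((int) Cache::get($key, 0) >= self::DAILY_BUDGET_CENTS) {
            throw new \RuntimeException('Daily LLM budget exceeded; refusing call.');
        }

        $response = app(CachedLLM::class)->chat($payload);

        Cache::increment($key, $this->estimateCostCents($payload, $response));

        return $response;
    }

    private function estimateCostCents(array $payload, $response): int
    {
        // Stand-in: plug in real token counts and your provider's pricing here.
        return 1;
    }
}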


What Actually Prevents Incidents

After enough production incidents, you realize the real safeguards are:

  • Hybrid retrieval
  • Strict prompts
  • Guarded tool execution
  • Full tracing
  • Automated evals
  • Aggressive caching

Not model choice. Not fancy agents. Not frameworks.

Just engineering discipline applied to a probabilistic system.


Final Takeaway

  • A demo AI feature looks like magic.

  • A production AI system looks like a paranoid, over-engineered backend.

And that’s exactly what it needs to be.
