shashank ms

Posted on Jun 20

Building Recommender Systems with LLMs

#aiinfrastructure #oxlo #ai

Traditional recommender systems rely on matrix factorization or two-tower neural networks trained on static interaction logs. Large language models introduce a different primitive: they can reason over unstructured item metadata, user reviews, and long behavioral sequences in natural language. This shifts part of the problem from sparse ID-based learning to dense, semantic understanding, but it also changes the cost structure of inference. When a single recommendation request carries a user history of thousands of tokens, the cost under token-based billing grows linearly with input length. Oxlo.ai flips this model by charging one flat cost per API request regardless of prompt length, which makes long-context reranking and agentic recommendation loops economically viable at scale.

Why LLMs for Recommendation

Collaborative filtering struggles with cold-start items and sparse interactions. LLMs mitigate this by encoding domain knowledge at pretraining time. A model such as Llama 3.3 70B or DeepSeek R1 671B MoE can infer that a user who reads about distributed systems might also be interested in consensus algorithms, even if no co-occurrence exists in the training log. The trade-off is latency and cost, so the architecture must separate cheap candidate generation from expensive reasoning.

Architecture Patterns

Most production LLM-based recommenders use a retrieve-then-rank pipeline. The retrieval stage narrows a catalog of millions down to hundreds. The ranking stage uses an LLM to order those candidates by relevance. Common variants include:

Embedding retrieval: Use an embedding model to encode items and queries into a shared vector space, then search with FAISS or ScaNN.
LLM reranking: Prompt a chat model with user context and candidate item descriptions, then parse a structured relevance score or ranked list.
Generative retrieval: The model directly generates item titles or IDs from a catalog it has seen in context.
Agentic recommendation: The LLM calls tools, such as vector search or inventory APIs, in a multi-turn loop before producing a final suggestion.
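
One practical wrinkle with generative retrieval is that a generated title may not match any catalog entry exactly. The sketch below shows one way to map generated titles back to real items using fuzzy string matching from the Python standard library. The helper name ground_titles, the sample titles, and the 0.8 cutoff are all assumptions for illustration, not part of any particular provider's API.

```python
import difflib

def ground_titles(generated: list[str], catalog_titles: list[str],
                  cutoff: float = 0.8) -> list[str]:
    """Map model-generated titles to real catalog titles.

    Titles with no sufficiently close catalog match are dropped, which
    filters out items the model made up.
    """
    # Compare case-insensitively, but return the catalog's original spelling.
    lowered = {t.lower(): t for t in catalog_titles}
    grounded = []
    for title in generated:
        match = difflib.get_close_matches(
            title.lower(), list(lowered), n=1, cutoff=cutoff
        )
        if match and lowered[match[0]] not in grounded:
            grounded.append(lowered[match[0]])
    return grounded

# Hypothetical catalog and model output, for illustration only.
catalog = [
    "Designing Data-Intensive Applications",
    "The Raft Consensus Algorithm",
    "CSS Grid in Depth",
]
generated = [
    "the raft consensus algorithm",
    "Designing Data Intensive Applications",
    "Quantum Cooking",  # not in the catalog, so it is dropped
]
grounded = ground_titles(generated, catalog)
```

For large catalogs, an embedding lookup would likely scale better than pairwise string comparison; the string-matching version is only meant to make the grounding step concrete.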

Candidate Generation with Embeddings

Oxlo.ai offers BGE-Large and E5-Large through a fully OpenAI-compatible embeddings endpoint. Because Oxlo.ai uses request-based pricing, batching hundreds of item descriptions into a single embeddings request avoids the per-token accumulation you would see on token-based providers like Together AI or Fireworks AI. The resulting vectors can be stored in any vector database for low-latency nearest-neighbor search.

import os
import openai
import numpy as np

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

def embed_catalog(items: list[str], model: str = "bge-large"):
    """Batch-encode item descriptions. One request regardless of batch size."""
    resp = client.embeddings.create(model=model, input=items)
    return np.array([d.embedding for d in resp.data])

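# Example: encode a batch of product descriptions (e.g. 500) in one request.
# product_descriptions is placeholder data; substitute your own catalog.
product_descriptions = [
    "Wireless noise-cancelling headphones",
    "Mechanical keyboard with hot-swappable switches",
]
vectors = embed_catalog(product_descriptions)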

After indexing, retrieval is a pure vector search problem. The LLM only enters the pipeline at the ranking stage, which keeps costs predictable.
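
For small catalogs, the vector search itself can be sketched with plain NumPy before reaching for FAISS or ScaNN. The function below is a minimal brute-force version under stated assumptions: top_k_cosine is a hypothetical helper name, and the embeddings are normalized inside the function so that the dot product equals cosine similarity.

```python
import numpy as np

def top_k_cosine(query_vec: np.ndarray, item_vecs: np.ndarray, k: int = 20) -> np.ndarray:
    """Return the indices of the k items most similar to the query, best first."""
    # L2-normalize so that the dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    items = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    scores = items @ q

    # Never ask for more items than the catalog holds.
    k = max(1, min(k, len(scores)))

    # argpartition selects the top k in linear time but leaves them unordered,
    # so sort just those k by descending score.
    top = np.argpartition(scores, -k)[-k:]
    return top[np.argsort(scores[top])[::-1]]
```

Brute-force search is linear in catalog size per query, so an approximate nearest-neighbor index becomes worthwhile once the catalog reaches millions of items.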

Reranking and Reasoning

For the ranking stage, you need structured output and reasoning. Oxlo.ai supports JSON mode and function calling across its chat models, so you can prompt Llama 3.3 70B or Kimi K2.6 to return a sorted list of item indices with explanation traces. If your pipeline requires deep reasoning over code or technical documentation, DeepSeek R1 671B MoE and DeepSeek V4 Flash are both available, with the latter offering a 1M context window.

def rerank(user_history: str, candidates: list[str], model: str = "llama-3.3-70b"):
    candidate_block = "\n".join(
        f"Item {i}: {desc}" for i, desc in enumerate(candidates)
    )
    prompt = (
        f"Given the user history below, rank the candidate items by relevance. "
        f"Return strictly JSON with keys 'ranking' (list of item indices) and 'reasoning'.\n\n"
        f"User history:\n{user_history}\n\n"
        f"Candidates:\n{candidate_block}"
    )

    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
        stream=False
    )
    return resp.choices[0].message.content

Because Oxlo.ai bills per request, a reranking prompt with a 100K token user history and fifty detailed product descriptions costs the same as a one-sentence greeting. On token-based platforms such as OpenRouter, Replicate, or Anyscale, that same call is billed for every input token, so its cost grows with the length of the history.
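
The comparison can be made concrete with a little arithmetic. The sketch below compares the two billing models and computes the input length at which they break even. Every price in it is a hypothetical placeholder chosen for illustration, not a real rate from Oxlo.ai or any other provider; check current pricing pages before drawing conclusions.

```python
def per_token_cost(input_tokens: int, output_tokens: int,
                   usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one call when input and output tokens are billed per million."""
    return (input_tokens * usd_per_m_input
            + output_tokens * usd_per_m_output) / 1_000_000

def per_request_cost(n_requests: int, usd_per_request: float) -> float:
    """Cost when every request is billed at one flat rate."""
    return n_requests * usd_per_request

def break_even_input_tokens(usd_per_request: float, output_tokens: int,
                            usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Input length at which one per-token call costs the same as one flat request."""
    output_cost = output_tokens * usd_per_m_output / 1_000_000
    return max(0.0, (usd_per_request - output_cost) * 1_000_000 / usd_per_m_input)

# HYPOTHETICAL prices, for illustration only.
USD_PER_M_INPUT = 0.50
USD_PER_M_OUTPUT = 1.50
USD_PER_REQUEST = 0.01

token_billed = per_token_cost(100_000, 500, USD_PER_M_INPUT, USD_PER_M_OUTPUT)
flat_billed = per_request_cost(1, USD_PER_REQUEST)
threshold = break_even_input_tokens(
    USD_PER_REQUEST, 500, USD_PER_M_INPUT, USD_PER_M_OUTPUT
)
```

With these placeholder numbers, prompts longer than the threshold are cheaper under flat pricing and shorter prompts are cheaper under per-token pricing, so the right choice depends on your own prompt-length distribution.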

Long Context and Agentic Pipelines

Modern recommenders are moving beyond single-turn ranking. Agentic systems maintain a multi-turn conversation with tool use: querying a vector DB, checking inventory, filtering by price, and then refining suggestions. Models such as GLM 5, Minimax M2.5, and Kimi K2.6 support long-horizon agentic tasks and tool use. With Oxlo.ai, each tool-calling turn is one flat request. You can pass entire conversation histories, catalog pages, and API documentation into the context window without the cost escalation typical of metered inference.

DeepSeek V4 Flash is particularly useful here. Its 1M context window and efficient MoE architecture let you dump a large product catalog into the prompt for generative retrieval or in-context learning, while the flat request pricing keeps the experiment reproducible on a fixed budget.
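
The shape of such an agentic loop can be sketched offline. In the example below, the tools, the catalog contents, and the action format (a dict with either "tool" and "args" or "final") are all hypothetical, and scripted_model is a stub standing in for a real chat model so the loop runs without network access. In a real deployment, call_model would wrap a chat-completions request with tool definitions.

```python
import json
from typing import Callable

# Hypothetical tools: stand-ins for a vector search and an inventory API.
def search_catalog(query: str) -> list[str]:
    catalog = {
        "raft": "Raft paper walkthrough",
        "paxos": "Paxos made simple",
        "css": "CSS grid guide",
    }
    return [title for key, title in catalog.items() if key in query.lower()]

def in_stock(title: str) -> bool:
    return title != "Paxos made simple"

TOOLS: dict[str, Callable] = {"search_catalog": search_catalog, "in_stock": in_stock}

def run_agent(call_model: Callable[[list[dict]], dict],
              user_request: str, max_turns: int = 5):
    """Run tool calls until the model returns a final answer.

    Returns (final_answer, number_of_model_requests). The answer is None
    if the turn budget runs out first.
    """
    messages = [{"role": "user", "content": user_request}]
    for turn in range(1, max_turns + 1):
        action = call_model(messages)  # one model request per turn
        if "final" in action:
            return action["final"], turn
        result = TOOLS[action["tool"]](**action["args"])
        # Record the call and its result so the next turn can see both.
        messages.append({"role": "assistant", "content": json.dumps(action)})
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None, max_turns

def scripted_model(messages: list[dict]) -> dict:
    """Stub policy: search first, then check stock, then answer."""
    tool_results = [json.loads(m["content"]) for m in messages if m["role"] == "tool"]
    if len(tool_results) == 0:
        return {"tool": "search_catalog", "args": {"query": "raft and paxos"}}
    if len(tool_results) == 1:
        return {"tool": "in_stock", "args": {"title": tool_results[0][0]}}
    return {"final": [tool_results[0][0]] if tool_results[1] else []}

answer, n_requests = run_agent(scripted_model, "Suggest something on consensus.")
```

The request count returned by run_agent is the number that matters under per-request billing, and the max_turns cap bounds it per recommendation.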

End-to-End Example

Below is a minimal retrieve-then-rank pipeline using Oxlo.ai for both embeddings and reranking. The retrieval stage uses BGE-Large to find the top 20 candidates. The ranking stage uses Llama 3.3 70B with JSON mode to return the top 5.

import json

class LLMRecommender:
    def __init__(self, embed_model="bge-large", rank_model="llama-3.3-70b"):
        self.embed_model = embed_model
        self.rank_model = rank_model
        self.catalog = []      # list of item dicts
        self.vectors = None    # np.array of embeddings

    def index(self, items: list[dict]):
        """Items must have a 'description' field."""
        self.catalog = items
        texts = [i["description"] for i in items]
        self.vectors = embed_catalog(texts, self.embed_model)
        # build FAISS index here if desired

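    def recommend(self, user_history: str, top_k: int = 5) -> list[dict]:
        # 1. Retrieve up to 20 candidates by dot-product similarity, best first
        #    (equals cosine similarity only if the embeddings are L2-normalized)
        q_vec = embed_catalog([user_history], self.embed_model)[0]
        scores = self.vectors @ q_vec
        n_candidates = min(20, len(scores))  # catalogs may hold fewer than 20 items
        top_idx = np.argsort(scores)[::-1][:n_candidates]
        candidates = [self.catalog[i] for i in top_idx]

        # 2. LLM rerank
        result_json = rerank(
            user_history,
            [c["description"] for c in candidates],
            self.rank_model
        )
        result = json.loads(result_json)
        # Skip any index the model returned that falls outside the candidate list
        valid = [
            i for i in result["ranking"]
            if isinstance(i, int) and 0 <= i < len(candidates)
        ]
        return [candidates[i] for i in valid[:top_k]]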

This architecture separates concerns. Embedding inference is cheap and stateless. Ranking inference is expensive but bounded, and because Oxlo.ai charges per request, the cost is easy to predict: 1,000 recommendations are exactly 1,000 ranking requests, plus 1,000 query-embedding requests in this sketch, since each call to recommend also embeds the user history.
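
Even in JSON mode, a model can return indices that are duplicated, out of range, or of the wrong type, and the response may occasionally fail to parse at all. A more defensive parser might look like the sketch below. The helper name parse_ranking and the choice to fall back to retrieval order are assumptions for illustration.

```python
import json

def parse_ranking(raw_json: str, n_candidates: int, top_k: int) -> list[int]:
    """Return up to top_k valid, de-duplicated candidate indices.

    Falls back to retrieval order (0, 1, 2, ...) if the response is unusable.
    """
    fallback = list(range(min(top_k, n_candidates)))
    try:
        ranking = json.loads(raw_json).get("ranking", [])
    except (json.JSONDecodeError, AttributeError, TypeError):
        # Not JSON, not a JSON object, or not a string at all.
        return fallback
    if not isinstance(ranking, list):
        return fallback

    seen: set[int] = set()
    clean: list[int] = []
    for idx in ranking:
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_index = isinstance(idx, int) and not isinstance(idx, bool)
        if is_index and 0 <= idx < n_candidates and idx not in seen:
            seen.add(idx)
            clean.append(idx)
    return clean[:top_k] or fallback
```

Falling back to retrieval order keeps the endpoint returning something reasonable when the ranking call misbehaves; logging how often the fallback fires is a cheap signal of prompt or model regressions.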

Evaluation and Metrics

Offline metrics such as NDCG@K and HitRate@K still apply, but LLM-based systems add a new dimension: prompt sensitivity. You should A/B test prompt templates as aggressively as you test model weights. For online evaluation, you can use an LLM-as-a-judge pattern where a separate model, such as Qwen 3 32B, scores recommendation lists for diversity, relevance, and safety. Oxlo.ai's flat pricing makes this double-inference setup practical; the judge model runs at the same per-request cost regardless of how verbose the candidate explanations are.
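
For reference, both offline metrics fit in a few lines. The sketch below uses the linear-gain form of DCG, where the item at position i (starting from 1) contributes rel_i / log2(i + 1); some libraries use an exponential gain of 2^rel - 1 instead, so treat the exact formula as an assumption to align with your existing evaluation code.

```python
import math

def hit_rate_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """1.0 if any relevant item appears in the top k, else 0.0.

    Average this over users to get HitRate@K.
    """
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0

def ndcg_at_k(ranked: list[str], relevance: dict[str, float], k: int) -> float:
    """NDCG@K with linear gain; items missing from `relevance` count as 0."""
    dcg = sum(
        relevance.get(item, 0.0) / math.log2(pos + 1)
        for pos, item in enumerate(ranked[:k], start=1)
    )
    # Ideal DCG: the same formula applied to the best possible ordering.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(pos + 1) for pos, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Running the same held-out users through two prompt templates and comparing mean NDCG@K is one simple way to quantify the prompt sensitivity mentioned above.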

Infrastructure and Cost

Production recommender systems need low latency and cannot afford model cold starts, meaning the delay while a model is loaded onto a GPU (as distinct from the cold-start items discussed earlier). Oxlo.ai serves popular models without cold starts, so latency stays stable under traffic spikes. Streaming responses let you flush partial results to the client while the model finishes reasoning, which improves perceived latency for long agentic chains.

If you are currently hosting on a token-based provider, Oxlo.ai's Enterprise plan offers dedicated GPUs and a guaranteed 30% reduction in your current spend. For most long-context recsys workloads, the savings can be significantly larger because a single request can subsume tens of thousands of input tokens. See https://oxlo.ai/pricing for plan details.

Conclusion

LLMs are not a drop-in replacement for collaborative filtering, but they are a powerful augmentation for semantic retrieval, reasoning-based ranking, and agentic recommendation. The main bottleneck in production is not model quality but cost control at scale. Oxlo.ai's request-based pricing removes the penalty for long user histories and large candidate sets, making it a natural inference backend for modern recommender systems. If your pipeline currently pays per token for multi-stage retrieval and ranking, moving the reasoning layer to Oxlo.ai is a straightforward way to flatten your cost curve without rewriting client code.
