Building Recommender Systems with LLMs


Recommender systems have traditionally relied on matrix factorization, two-tower networks, or gradient-boosted trees to map users to items. Large language models introduce a different axis of capability: they can reason over unstructured text, interpret implicit signals in user histories, and re-rank candidates using zero-shot instructions. The practical challenge is not whether an LLM can rank items, but how to do it economically when a single request may contain thousands of tokens of user context, product descriptions, and system prompts.

Why LLMs Change the Retrieval-Ranking Pipeline

Classical recommenders excel at memorizing collaborative patterns, but they struggle with cold-start items, sparse metadata, and cross-domain reasoning. An LLM can ingest a user’s last twenty interactions, a product specification sheet, and a natural language business rule in a single prompt. This reduces the engineering overhead of building a hand-crafted feature pipeline for every new content type.

The typical architecture remains two-stage. First, a fast retrieval model narrows a catalog of millions down to hundreds of candidates. Second, a heavy re-ranker orders the final slate. LLMs usually serve as the re-ranker, though embedding models fine-tuned for recommendation can also handle retrieval.
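
The two-stage flow described above can be sketched as a small function that takes the retriever and re-ranker as plain callables. This is a minimal illustration, not a production design: the names `recommend`, `Retriever`, `Reranker`, and the toy stand-ins are hypothetical, and the default sizes are placeholders.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases for illustration; not part of any library.
Retriever = Callable[[str, int], List[str]]            # (user_context, k) -> candidate ids
Reranker = Callable[[str, Sequence[str]], List[str]]   # (user_context, candidates) -> ordered ids


def recommend(user_context: str, retrieve: Retriever, rerank: Reranker,
              k_retrieve: int = 200, k_final: int = 5) -> List[str]:
    """Run cheap retrieval first, then the expensive re-ranker on the survivors."""
    candidates = retrieve(user_context, k_retrieve)
    if not candidates:
        return []
    return rerank(user_context, candidates)[:k_final]


# Toy stand-ins so the sketch runs without any model or network call.
def toy_retrieve(user_context: str, k: int) -> List[str]:
    catalog = ["a1", "a2", "a3", "a4"]
    return catalog[:k]


def toy_rerank(user_context: str, candidates: Sequence[str]) -> List[str]:
    # Reverse the order, purely to show that stage two controls the final slate.
    return list(reversed(candidates))


slate = recommend("searched: rain gear", toy_retrieve, toy_rerank,
                  k_retrieve=3, k_final=2)
```

Keeping the two stages behind simple callables makes it easy to swap an embedding search or an LLM call in for either toy function without touching the surrounding code.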

Stage One: Embedding-Based Retrieval with Oxlo.ai

For retrieval, you need dense embeddings that place semantically similar items close together in vector space. Oxlo.ai offers BGE-Large and E5-Large embedding models through a fully OpenAI-compatible endpoint. Because Oxlo.ai charges a flat rate per request rather than per token, embedding large batches of item descriptions or long user profiles does not inflate your bill.

The following Python snippet embeds a small catalog; storing the resulting vectors in an index is left to your vector database. You can point your existing OpenAI SDK client at Oxlo.ai by changing the base URL and API key.

import openai
import numpy as np

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

items = [
    {"id": "a1", "text": "Waterproof hiking boots with Gore-Tex lining..."},
    {"id": "a2", "text": "Ultra-light trail running shoe for mixed terrain..."},
    # ... catalog of thousands
]

def embed(texts):
    """Return one embedding vector per input text."""
    res = client.embeddings.create(
        model="bge-large",  # or e5-large
        input=texts
    )
    return [r.embedding for r in res.data]

# Batch embedding under flat per-request pricing
vectors = embed([i["text"] for i in items])
# Store vectors in your vector DB (Milvus, Pinecone, pgvector, etc.)
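
Once the vectors exist, retrieval reduces to a nearest-neighbor search. For a small catalog you can do this in memory with NumPy before reaching for a vector database. The sketch below is an assumption-laden illustration: `top_k_cosine` is a hypothetical helper, the three-dimensional vectors are toy data standing in for real embeddings, and it assumes no embedding is all zeros.

```python
import numpy as np


def top_k_cosine(query_vec, item_vecs, item_ids, k=50):
    """Return (item_id, score) pairs for the k items most cosine-similar to the query."""
    q = np.asarray(query_vec, dtype=np.float32)
    m = np.asarray(item_vecs, dtype=np.float32)
    # L2-normalize both sides so the dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    m = m / np.linalg.norm(m, axis=1, keepdims=True)
    scores = m @ q
    # Sort by descending score and keep the first k positions.
    order = np.argsort(-scores)[:k]
    return [(item_ids[i], float(scores[i])) for i in order]


# Toy vectors stand in for real embeddings (assumption for illustration only).
toy_ids = ["a1", "a2", "a3"]
toy_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
hits = top_k_cosine([1.0, 0.1, 0.0], toy_vecs, toy_ids, k=2)
```

In practice the query vector would come from embedding the user profile or recent session text with the same model used for the items. A brute-force scan like this is fine for thousands of items; at millions, an approximate index in your vector database is the usual choice.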

Stage Two: LLM Re-Ranking with Structured Output

After retrieval, pass the candidate set to an LLM along with the user history and a ranking instruction. You want the model to return a structured list so you can parse it deterministically. Oxlo.ai supports JSON mode across its chat models, including Llama 3.3 70B and Qwen 3 32B.

Because a re-ranking prompt can grow very long, token-based pricing creates unpredictable costs. With Oxlo.ai, one request costs the same whether you send a terse prompt or a detailed user session history plus full item descriptions. This predictability matters when you are serving millions of recommendations per day.

import json

user_history = "\n".join([
    "Purchased: Merino wool base layer, size M",
    "Viewed: Lightweight alpine tents (3 min)",
    "Searched: 'rain gear for pacific northwest'"
])

candidates = [
    {"id": "a1", "title": "Hardshell jacket, eVent fabric"},
    {"id": "a2", "title": "Down sleeping bag 15F"},
    # ... top 50 retrieved
]

candidate_lines = "\n".join(
    f'{c["id"]}: {c["title"]}' for c in candidates
)

prompt = f"""You are a ranking engine for an outdoor retailer.
Given the user history and candidate items, return the top 5 items in JSON format.

User history:
{user_history}

Candidates:
{candidate_lines}

Respond with JSON: {{"ranked_ids": ["..."]}}"""

res = client.chat.completions.create(
    model="llama-3.3-70b",  # or qwen3-32b
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0.1
)

ranking = json.loads(res.choices[0].message.content)
print(ranking["ranked_ids"])
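
JSON mode helps the reply parse, but it does not guarantee that every returned ID was actually in the candidate list, or that the list has no duplicates. A defensive parsing step is worth adding before the slate reaches users. The following is a minimal sketch: `parse_ranking` is a hypothetical helper, and falling back to retrieval order is one possible policy, not the only reasonable one.

```python
import json


def parse_ranking(raw_content, candidate_ids, top_n=5):
    """Parse the model's JSON reply, keeping only known, unique candidate ids."""
    fallback = list(candidate_ids)[:top_n]  # retrieval order as a safe default
    try:
        payload = json.loads(raw_content)
    except (json.JSONDecodeError, TypeError):
        return fallback
    ranked = payload.get("ranked_ids") if isinstance(payload, dict) else None
    if not isinstance(ranked, list):
        return fallback

    allowed = set(candidate_ids)
    seen = set()
    cleaned = []
    for item_id in ranked:
        # Drop ids the model invented, non-string entries, and repeats.
        if isinstance(item_id, str) and item_id in allowed and item_id not in seen:
            seen.add(item_id)
            cleaned.append(item_id)

    # Pad with unranked candidates so the slate is full whenever possible.
    for item_id in candidate_ids:
        if len(cleaned) >= top_n:
            break
        if item_id not in seen:
            seen.add(item_id)
            cleaned.append(item_id)
    return cleaned[:top_n]


safe_ids = parse_ranking('{"ranked_ids": ["a2", "zz", "a2", "a1"]}',
                         ["a1", "a2", "a3"], top_n=2)
```

With the re-ranking call above, you would pass `res.choices[0].message.content` and the list of candidate IDs instead of the literal strings used here.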

Cost Predictability at Scale

Recommender workloads are inherently long-context. A user session may contain dozens of events, and each candidate item may include a lengthy description, attributes, and reviews. Under token-based billing, your spend scales linearly with that volume. Oxlo.ai uses request-based pricing: one flat cost per API call regardless of prompt length. For re-ranking stages that ingest rich context, this can be significantly cheaper than token-based alternatives. See https://oxlo.ai/pricing for current plan details.
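
To see where flat pricing starts to pay off, you can compute the prompt length at which a per-request price equals the token-based cost. The sketch below uses made-up prices in arbitrary currency units; none of the numbers reflect Oxlo.ai's or any other provider's actual rates, and both function names are hypothetical.

```python
def token_cost(prompt_tokens, completion_tokens, price_in_per_1k, price_out_per_1k):
    """Cost of one call under token-based billing."""
    return (prompt_tokens / 1000 * price_in_per_1k
            + completion_tokens / 1000 * price_out_per_1k)


def break_even_prompt_tokens(flat_price, completion_tokens,
                             price_in_per_1k, price_out_per_1k):
    """Prompt length above which a flat per-request price is the cheaper option."""
    remaining = flat_price - completion_tokens / 1000 * price_out_per_1k
    return max(0.0, remaining / price_in_per_1k * 1000)


# Hypothetical prices in arbitrary units, chosen only to make the arithmetic visible.
FLAT_PRICE = 2.0          # per request
PRICE_IN_PER_1K = 0.5     # per 1,000 prompt tokens
PRICE_OUT_PER_1K = 1.5    # per 1,000 completion tokens

threshold = break_even_prompt_tokens(FLAT_PRICE, 200, PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
long_prompt_cost = token_cost(8000, 200, PRICE_IN_PER_1K, PRICE_OUT_PER_1K)
```

Plugging in your own measured prompt lengths and the published prices of the providers you are comparing tells you whether your re-ranking prompts sit above or below the break-even point.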

When to Use LLMs versus Classical Recommenders

LLM re-ranking is most valuable when:

Item metadata is text-heavy and irregular,
Business rules change frequently and are easier to express in natural language than in code,
You need zero-shot adaptation to new categories without retraining a collaborative model,
User context extends beyond item IDs into free-form behavior.

For high-traffic, low-latency feeds where the candidate set is static and user behavior is dense, classical matrix factorization or two-tower models still win on raw throughput. Many production systems use a hybrid: classical retrieval followed by LLM re-ranking.

Getting Started on Oxlo.ai

Oxlo.ai provides 60 free requests per day across more than 16 models, including DeepSeek V3.2 on the free tier, so you can prototype a re-ranking pipeline without an upfront commitment. The platform is a drop-in replacement for the OpenAI SDK, which means you can test Oxlo.ai by changing two lines in your existing client configuration: the base URL and the API key. There are no cold starts on popular models, so latency remains consistent even when traffic spikes.