Recommender systems have traditionally relied on matrix factorization, two-tower networks, or gradient-boosted trees to map users to items. Large language models introduce a different axis of capability: they can reason over unstructured text, interpret implicit signals in user histories, and re-rank candidates using zero-shot instructions. The practical challenge is not whether an LLM can rank items, but how to do it economically when a single request may contain thousands of tokens of user context, product descriptions, and system prompts.
Why LLMs Change the Retrieval-Ranking Pipeline
Classical recommenders excel at memorizing collaborative patterns, but they struggle with cold-start items, sparse metadata, and cross-domain reasoning. An LLM can ingest a user's last twenty interactions, a product specification sheet, and a natural-language business rule in a single prompt. This eliminates the engineering overhead of hand-crafted feature pipelines for every new content type.
The typical architecture remains two-stage. First, a fast retrieval model narrows a catalog of millions down to hundreds of candidates. Second, a heavy re-ranker orders the final slate. LLMs usually serve as the re-ranker, though embedding models fine-tuned for recommendation can also handle retrieval.
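The two-stage flow can be sketched as a small function that takes the retriever and the re-ranker as plug-in callables. This is a minimal illustration, not a production design: `recommend`, `toy_retrieve`, `toy_rerank`, and the default sizes are hypothetical names and values chosen for this example, and the toy functions stand in for a real vector search and a real LLM call.

```python
from typing import Callable, Sequence


def recommend(
    user_context: str,
    retrieve: Callable[[str, int], Sequence[str]],
    rerank: Callable[[str, Sequence[str]], Sequence[str]],
    n_candidates: int = 200,
    slate_size: int = 10,
) -> list[str]:
    """Two-stage recommendation: cheap retrieval, then a costlier re-rank."""
    # Stage one: narrow the catalog to a manageable candidate set.
    candidates = retrieve(user_context, n_candidates)
    if not candidates:
        return []
    # Stage two: order only the retrieved candidates, then cut to the slate.
    ranked = rerank(user_context, candidates)
    return list(ranked)[:slate_size]


# Toy stand-ins so the sketch runs end to end (hypothetical, for illustration).
CATALOG = ["tent", "rain jacket", "hiking boots", "sleeping bag", "trail shoes"]


def toy_retrieve(user_context: str, k: int) -> list[str]:
    # A real retriever would run a vector search; this returns the first k items.
    return CATALOG[:k]


def toy_rerank(user_context: str, candidates: Sequence[str]) -> list[str]:
    # Items whose words appear in the user context move to the front;
    # sorted() is stable, so the remaining items keep their retrieval order.
    def misses(item: str) -> bool:
        return not any(word in user_context for word in item.split())

    return sorted(candidates, key=misses)


slate = recommend("looking for rain gear", toy_retrieve, toy_rerank, slate_size=3)
```

Keeping the two stages behind plain callables makes it straightforward to swap the re-ranker (classical model or LLM) without touching retrieval.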
Stage One: Embedding-Based Retrieval with Oxlo.ai
For retrieval, you need dense embeddings that place semantically similar items close together in vector space. Oxlo.ai offers BGE-Large and E5-Large embedding models through a fully OpenAI-compatible endpoint. Because Oxlo.ai charges a flat rate per request rather than per token, embedding large batches of item descriptions or long user profiles does not inflate your bill.
The following Python snippet indexes a small catalog. You can point your existing OpenAI SDK client at Oxlo.ai by changing the base URL and API key.
import openai
import numpy as np

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

items = [
    {"id": "a1", "text": "Waterproof hiking boots with Gore-Tex lining..."},
    {"id": "a2", "text": "Ultra-light trail running shoe for mixed terrain..."},
    # ... catalog of thousands
]

def embed(texts):
    res = client.embeddings.create(
        model="bge-large",  # or e5-large
        input=texts
    )
    return [r.embedding for r in res.data]

# Batch embedding under flat per-request pricing
vectors = embed([i["text"] for i in items])
# Store vectors in your vector DB (Milvus, Pinecone, pgvector, etc.)
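Once the item vectors exist, retrieval amounts to finding the items closest to a query vector, such as an embedding of the user's recent activity. The sketch below does this by brute-force cosine similarity in plain Python. It is only an illustration of what a vector database does for you with approximate nearest-neighbor indexes at scale; `cosine` and `top_k` are hypothetical helper names, and the tiny two-dimensional vectors are made up for the example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Treat a zero vector as having no similarity to anything.
    return dot / norm if norm else 0.0


def top_k(query_vec, item_vecs, item_ids, k=50):
    """Return the ids of the k items most similar to the query vector."""
    scored = [(cosine(query_vec, vec), item_id)
              for vec, item_id in zip(item_vecs, item_ids)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Made-up 2-D vectors; real embeddings have hundreds of dimensions or more.
demo_ids = ["a", "b", "c"]
demo_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nearest = top_k([1.0, 0.0], demo_vecs, demo_ids, k=2)
```

Brute force is fine for a few thousand items; beyond that, a dedicated vector index is the usual choice.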
Stage Two: LLM Re-Ranking with Structured Output
After retrieval, pass the candidate set to an LLM along with the user history and a ranking instruction. You want the model to return a structured list so you can parse it deterministically. Oxlo.ai supports JSON mode across its chat models, including Llama 3.3 70B and Qwen 3 32B.
Because a re-ranking prompt can grow very long, token-based pricing creates unpredictable costs. With Oxlo.ai, one request costs the same whether you send a terse prompt or a detailed user session history plus full item descriptions. This predictability matters when you are serving millions of recommendations per day.
import json

user_history = "\n".join([
    "Purchased: Merino wool base layer, size M",
    "Viewed: Lightweight alpine tents (3 min)",
    "Searched: 'rain gear for pacific northwest'"
])

candidates = [
    {"id": "a1", "title": "Hardshell jacket, eVent fabric"},
    {"id": "a2", "title": "Down sleeping bag 15F"},
    # ... top 50 retrieved
]

# One "id: title" line per candidate keeps the prompt compact
candidate_lines = "\n".join(
    f'{c["id"]}: {c["title"]}' for c in candidates
)

prompt = f"""You are a ranking engine for an outdoor retailer.
Given the user history and candidate items, return the top 5 items in JSON format.
User history:
{user_history}
Candidates:
{candidate_lines}
Respond with JSON: {{"ranked_ids": ["..."]}}"""

res = client.chat.completions.create(
    model="llama-3.3-70b",  # or qwen3-32b
    messages=[{"role": "user", "content": prompt}],
    response_format={"type": "json_object"},
    temperature=0.1
)

# JSON mode returns a JSON string in the message content
ranking = json.loads(res.choices[0].message.content)
print(ranking["ranked_ids"])
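JSON mode makes the response parseable, but it does not guarantee that the IDs in it are valid. A model may repeat an ID, invent one that was never in the candidate set, or return fewer items than requested. One possible guard, sketched below under the assumption that the response follows the `ranked_ids` shape from the prompt above, is to filter the output against the candidates and backfill from retrieval order. `parse_ranking` is a hypothetical helper name, not part of any SDK.

```python
import json


def parse_ranking(raw, candidate_ids, top_n=5):
    """Parse a model response and return up to top_n valid, unique ids."""
    try:
        payload = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        payload = {}  # malformed or missing content: fall back entirely
    ranked_ids = payload.get("ranked_ids", []) if isinstance(payload, dict) else []
    if not isinstance(ranked_ids, list):
        ranked_ids = []

    allowed = set(candidate_ids)
    seen = set()
    result = []
    # Keep only ids that were actually offered, dropping duplicates.
    for item_id in ranked_ids:
        if isinstance(item_id, str) and item_id in allowed and item_id not in seen:
            seen.add(item_id)
            result.append(item_id)
    # Backfill from retrieval order if the model returned too few valid ids.
    for item_id in candidate_ids:
        if len(result) >= top_n:
            break
        if item_id not in seen:
            seen.add(item_id)
            result.append(item_id)
    return result[:top_n]


cleaned = parse_ranking('{"ranked_ids": ["a2", "zz", "a2", "a1"]}',
                        ["a1", "a2", "a3"], top_n=3)
```

Falling back to retrieval order means a bad model response degrades the slate to the first-stage ranking rather than failing the request.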
Cost Predictability at Scale
Recommender workloads are inherently long-context. A user session may contain dozens of events, and each candidate item may include a lengthy description, attributes, and reviews. Under token-based billing, your spend scales linearly with that volume. Oxlo.ai uses request-based pricing: one flat cost per API call regardless of prompt length. For re-ranking stages that ingest rich context, this can be significantly cheaper than token-based alternatives. See https://oxlo.ai/pricing for current plan details.
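Whether flat per-request pricing comes out cheaper depends on your prompt lengths and on the actual rates involved, so it is worth working out the break-even point for your own workload. The sketch below shows the arithmetic. Every price in it is a made-up placeholder, not a quote from Oxlo.ai or any other provider, and the function names are hypothetical; substitute the real figures from the relevant pricing pages.

```python
def token_cost(prompt_tokens, completion_tokens, usd_per_1k_in, usd_per_1k_out):
    """Cost of one call under per-token billing."""
    return (prompt_tokens / 1000 * usd_per_1k_in
            + completion_tokens / 1000 * usd_per_1k_out)


def break_even_prompt_tokens(usd_per_request, completion_tokens,
                             usd_per_1k_in, usd_per_1k_out):
    """Prompt length above which a flat per-request price is the cheaper option."""
    # Subtract the output-token cost, then convert the rest into input tokens.
    remaining = usd_per_request - completion_tokens / 1000 * usd_per_1k_out
    return max(0.0, remaining / usd_per_1k_in * 1000)


# PLACEHOLDER prices for illustration only; these are not real rates.
USD_PER_1K_IN = 0.001
USD_PER_1K_OUT = 0.002
USD_PER_REQUEST = 0.005

per_token = token_cost(4000, 100, USD_PER_1K_IN, USD_PER_1K_OUT)
threshold = break_even_prompt_tokens(USD_PER_REQUEST, 100,
                                     USD_PER_1K_IN, USD_PER_1K_OUT)
```

With these placeholder numbers, prompts longer than the computed threshold favor the flat rate and shorter ones favor per-token billing, which is why the comparison matters most for context-heavy re-ranking calls.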
When to Use LLMs versus Classical Recommenders
LLM re-ranking is most valuable when:
- Item metadata is text-heavy and irregular,
- Business rules change frequently and are easier to express in natural language than in code,
- You need zero-shot adaptation to new categories without retraining a collaborative model,
- User context extends beyond item IDs into free-form behavior.
For high-traffic, low-latency feeds where the candidate set is static and user behavior is dense, classical matrix factorization or two-tower models still win on raw throughput. Many production systems use a hybrid: classical retrieval, LLM re-ranking.
Getting Started on Oxlo.ai
Oxlo.ai provides 60 free requests per day across more than 16 models, including DeepSeek V3.2 on the free tier, so you can prototype a re-ranking pipeline without an upfront commitment. The platform is a drop-in replacement for the OpenAI SDK, which means you can test Oxlo.ai by changing two lines in your existing client configuration: the base URL and the API key. There are no cold starts on popular models, so latency remains consistent even when traffic spikes.
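One way to keep that two-line change out of application code is to read both values from the environment. The helper below is a small sketch: the function name and the `OXLO_BASE_URL` and `OXLO_API_KEY` variable names are assumptions made for this example, and the default base URL is the one used in the snippets earlier in this article.

```python
import os


def oxlo_client_kwargs(env=os.environ):
    """Build the two settings that differ from a default OpenAI client."""
    return {
        # Hypothetical variable names; pick whatever fits your deployment.
        "base_url": env.get("OXLO_BASE_URL", "https://api.oxlo.ai/v1"),
        "api_key": env["OXLO_API_KEY"],  # raises KeyError if the key is unset
    }


# Usage with the OpenAI SDK (not executed here):
#   client = openai.OpenAI(**oxlo_client_kwargs())
```

Because the rest of the client code is unchanged, switching back to another OpenAI-compatible endpoint is a matter of changing the two environment variables.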