Introduction to Retrieval-Augmented Generation (RAG) with LLMs

#learnai #oxlo #ai

We will build a minimal RAG pipeline that ingests raw text documents, retrieves the most relevant chunks for a user query, and generates a grounded answer with Llama 3.3 70B on Oxlo.ai. This pattern is useful for internal knowledge bases, support bots, or any workflow where hallucination is expensive.

What you'll need

Python 3.10 or newer, the OpenAI Python SDK, NumPy, and scikit-learn. Grab an API key from the Oxlo.ai portal before you start.

pip install openai numpy scikit-learn

Step 1: Prepare and chunk documents

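I split the source text into overlapping chunks so context at chunk boundaries is not lost. A 300-word chunk size with 50 words of overlap is a reasonable starting point for technical docs. The sample documents below are much shorter than 300 words, so each one ends up as a single chunk.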

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

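def chunk_text(text, chunk_size=300, overlap=50):
    """Split text into word-based chunks that share `overlap` words."""
    # A stride of zero or less would make range() fail or step backwards
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
    return chunks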

documents = [
    "Oxlo.ai is a developer-first AI inference platform. It offers request-based pricing with one flat cost per API request regardless of prompt length. This makes it significantly cheaper than token-based providers for long-context workloads.",
    "The platform supports 45+ models including Llama 3.3 70B, DeepSeek R1 671B MoE, Qwen 3 32B, and Kimi K2.6. All endpoints are fully OpenAI SDK compatible with no cold starts.",
    "Embeddings models available on Oxlo.ai include BGE-Large and E5-Large. These can be used for semantic search, clustering, and retrieval-augmented generation pipelines.",
]

chunks = []
for doc in documents:
    chunks.extend(chunk_text(doc))
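
To see how the sliding window behaves, the sketch below applies the same stride logic to word indices with deliberately small numbers. The function name `window_spans` and the sample sizes are illustrative assumptions, not part of the pipeline.

```python
def window_spans(n_words, chunk_size, overlap):
    """Return (start, end) word indices for each chunk, end exclusive."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [(i, min(i + chunk_size, n_words)) for i in range(0, n_words, step)]


# 10 words, chunks of 4 with 1 word of overlap: the stride is 3.
spans = window_spans(10, chunk_size=4, overlap=1)
print(spans)  # [(0, 4), (3, 7), (6, 10), (9, 10)]
```

Note the last span, (9, 10): it is a short tail already covered by the previous chunk. Dropping such tails is a possible refinement once the corpus grows.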

Step 2: Generate embeddings

I send the chunks to the Oxlo.ai embeddings endpoint using BGE-Large. The returned vectors capture semantic meaning so similar concepts sit close together in vector space.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def get_embeddings(texts, model="bge-large"):
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

chunk_embeddings = get_embeddings(chunks)
embedding_matrix = np.array(chunk_embeddings)
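
For a handful of chunks a single request is enough. With a larger corpus it may be safer to send the texts in batches, and the sketch below shows one way to do that. The helper names `batched` and `embed_in_batches`, the `embed_fn` parameter, and the batch size of 64 are assumptions for illustration; nothing here states an actual request-size limit for Oxlo.ai.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    size: int = 64,  # assumed value, not a documented limit
) -> List[List[float]]:
    """Call `embed_fn` once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedder so the sketch runs offline: one number per text.
def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(t))] for t in batch]


print(embed_in_batches(["a", "bb", "ccc"], fake_embed, size=2))
# [[1.0], [2.0], [3.0]]
```

In the pipeline, `get_embeddings` would take the place of `fake_embed`.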

Step 3: Index the vectors

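I normalize each row of the embedding matrix to unit length, so cosine similarity between two rows reduces to a dot product. Note that scikit-learn's cosine_similarity, used in the next step, normalizes its inputs internally, so this step is not strictly required there. It matters if you later replace that call with a plain matrix product to drop the scikit-learn dependency.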

norms = np.linalg.norm(embedding_matrix, axis=1, keepdims=True)
normalized_embeddings = embedding_matrix / norms
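
The claim that a dot product of unit-length vectors equals cosine similarity can be checked by hand. The sketch below does so in plain Python on two small made-up vectors; the helper names are illustrative and the NumPy code above does the same thing over the whole matrix at once.

```python
import math


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]


def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


a = [3.0, 4.0]
b = [4.0, 3.0]

# Cosine similarity from the definition: a.b / (|a| * |b|)
cosine = dot(a, b) / (norm(a) * norm(b))

# Same value from a plain dot product of the unit-length vectors
unit_dot = dot(normalize(a), normalize(b))

print(round(cosine, 6), round(unit_dot, 6))  # 0.96 0.96
```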

Step 4: Retrieve relevant context

I embed the user query with the same model, then score every chunk with cosine similarity. The top three chunks are injected into the generation prompt.

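def retrieve_chunks(query, k=3):
    # Embed the query with the same model used for the chunks
    query_embedding = get_embeddings([query])[0]
    query_vec = np.array(query_embedding).reshape(1, -1)
    query_vec = query_vec / np.linalg.norm(query_vec)
    # Score every chunk against the query; [0] takes the single result row
    scores = cosine_similarity(query_vec, normalized_embeddings)[0]
    # Indices of the k highest scores, best first
    top_indices = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in top_indices]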
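
The ranking step does not depend on the embedding call, so it can be checked with made-up numbers. This sketch mirrors the `np.argsort(scores)[::-1][:k]` line in plain Python; `top_k` and the score values are hypothetical.

```python
def top_k(scores, k=3):
    """Return (index, score) pairs for the k highest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Hypothetical similarity scores for five chunks.
scores = [0.12, 0.91, 0.40, 0.73, 0.05]
print(top_k(scores, k=3))  # [(1, 0.91), (3, 0.73), (2, 0.4)]

# Asking for more items than exist simply returns all of them, ranked.
print(len(top_k(scores, k=10)))  # 5
```

The same holds for `retrieve_chunks`: with only three chunks in the index, `k=3` returns every chunk, so the scores are what distinguish relevant from irrelevant context.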

Step 5: Build the generator

I format the retrieved chunks into the system prompt and call Llama 3.3 70B through the Oxlo.ai chat completions endpoint. The model is instructed to answer only from the provided context.

SYSTEM_PROMPT = """You are a precise technical assistant. Answer the user's question using only the context provided below. If the context does not contain the answer, say that you do not have enough information. Do not make up facts.

Context:
{context}
"""

def ask(question):
    retrieved = retrieve_chunks(question, k=3)
    context = "\n\n".join(
        [f"Chunk {i+1} (score: {score:.3f}):\n{text}" for i, (text, score) in enumerate(retrieved)]
    )
    system = SYSTEM_PROMPT.format(context=context)

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content, retrieved
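
Before spending an API call, it can help to preview what the model will actually see. The sketch below isolates the context formatting from `ask` into a hypothetical helper, `build_context`, and runs it on made-up retrieval results.

```python
def build_context(retrieved):
    """Format (text, score) pairs as numbered, scored chunks for the prompt."""
    return "\n\n".join(
        f"Chunk {i + 1} (score: {score:.3f}):\n{text}"
        for i, (text, score) in enumerate(retrieved)
    )


# Hypothetical retrieval results, used only to preview the layout.
sample = [("First passage.", 0.9), ("Second passage.", 0.5)]
print(build_context(sample))
```

An empty retrieval result produces an empty context string, which is the case the system prompt's "not enough information" instruction is meant to cover.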

Run it

The script below ties retrieval and generation together. When I ask about embedding models on Oxlo.ai, the pipeline returns a grounded answer along with the similarity score of each source chunk.

if __name__ == "__main__":
    question = "Which embedding models does Oxlo.ai support?"
    answer, sources = ask(question)

    print("=== Answer ===")
    print(answer)

    print("\n=== Retrieved Chunks ===")
    for text, score in sources:
        print(f"[{score:.3f}] {text[:200]}...")

Example output (chunk text abridged; your scores may differ):

=== Answer ===
Oxlo.ai supports BGE-Large and E5-Large embedding models. These can be used for semantic search, clustering, and retrieval-augmented generation pipelines.

=== Retrieved Chunks ===
[0.912] Embeddings models available on Oxlo.ai include BGE-Large and E5-Large. These can be used for semantic search, clustering, and retrieval-augmented generation pipelines...
[0.734] Oxlo.ai is a developer-first AI inference platform. It offers request-based pricing with one flat cost per API request regardless of prompt length...
[0.621] The platform supports 45+ models including Llama 3.3 70B, DeepSeek R1 671B MoE, Qwen 3 32B, and Kimi K2.6...

Next steps

Swap the in-memory numpy index for a persistent vector database like pgvector or ChromaDB. You can also switch the generator to DeepSeek R1 671B MoE or Kimi K2.6 on Oxlo.ai to test whether advanced reasoning improves multi-hop answers.