Using an LLM for Question Answering

Teams drowning in unstructured documentation need precise answers without reading hundreds of pages. In this tutorial, I will show you how to build a document question-answering system that retrieves relevant text chunks and synthesizes answers using an LLM. We will run everything against Oxlo.ai's API with flat per-request pricing, so costs stay predictable even when we stuff long context windows.

What you'll need

Python 3.10 or newer
An Oxlo.ai API key from https://portal.oxlo.ai
The OpenAI SDK: pip install openai

Step 1: Initialize the Oxlo.ai client

I start by importing the SDK and pointing it at Oxlo.ai. Because the platform is compatible with the OpenAI SDK, this is the only boilerplate we need.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)
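Hard-coding the key is fine for a quick test, but it is easy to commit by accident. Here is a minimal sketch of an alternative, assuming the key is stored in an environment variable named OXLO_API_KEY. That name and the helper client_settings are chosen for illustration; they are not something the platform mandates.

```python
import os

OXLO_BASE_URL = "https://api.oxlo.ai/v1"

def client_settings(env_var="OXLO_API_KEY"):
    """Return keyword arguments for OpenAI(...), reading the key from the environment.

    The environment variable name is an assumption made for this sketch.
    """
    api_key = os.environ.get(env_var)
    if not api_key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return {"base_url": OXLO_BASE_URL, "api_key": api_key}
```

With this in place, the client can be created as OpenAI(**client_settings()), and the script fails early with a readable message if the key is missing.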

Step 2: Chunk your documents

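For this demo, I will load a raw string and split it into overlapping, word-based chunks. In production you might read from PDFs or Markdown files, but the logic stays the same. Note that the sample document below is only about 110 words, so the default chunk_size of 200 produces a single chunk; pass smaller values (for example chunk_size=40, overlap=10) if you want the retriever to choose between several chunks.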

DOCUMENT = """
Oxlo.ai is a developer-first AI inference platform. It offers request-based pricing with one flat cost per API request regardless of prompt length. Unlike token-based providers, cost does not scale with input length, so Oxlo.ai is significantly cheaper for long-context workloads.

The platform hosts 45+ open-source and proprietary models across 7 categories. These include LLMs such as Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 671B MoE, Kimi K2.6, and DeepSeek V3.2. It also provides code models, vision models, image generation, audio, embeddings, and object detection.

Features include streaming responses, function calling, JSON mode, vision input, and multi-turn conversations. All endpoints are accessible through a single OpenAI-compatible API at https://api.oxlo.ai/v1.
"""

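def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        # Stop once the window reaches the end, to avoid a redundant tail chunk.
        if i + chunk_size >= len(words):
            break
    return chunks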

chunks = chunk_text(DOCUMENT)
print(f"Created {len(chunks)} chunks")
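To see how chunk_size and overlap interact, it can help to run the same windowing logic on a tiny input. The sketch below repeats the sliding-window function from this step and applies it to ten placeholder words; the toy sentence and the parameter values are illustrative only.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    # Same sliding-window logic as in this step of the tutorial.
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunks.append(" ".join(words[i:i + chunk_size]))
        if i + chunk_size >= len(words):
            break
    return chunks

toy_text = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
toy_windows = chunk_text(toy_text, chunk_size=4, overlap=1)
for window in toy_windows:
    print(window)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
```

Each window starts chunk_size - overlap words after the previous one, so consecutive chunks share exactly one word here. That shared region is what keeps a sentence that straddles a boundary retrievable from at least one chunk.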

Step 3: Build a simple retriever

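We need a way to find chunks that are likely to contain the answer. I will use a basic keyword-overlap scorer: it counts how many distinct words the question and each chunk share. It is fast, has no extra dependencies, and works reasonably well for internal docs with consistent vocabulary. Be aware that splitting on whitespace leaves punctuation attached, so "use?" in a question will not match "use" in a chunk, and common words such as "is" and "the" count as matches.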

def retrieve_chunks(question, chunks, top_k=2):
    question_words = set(question.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk.lower().split())
        overlap = len(question_words & chunk_words)
        scored.append((overlap, chunk))
    scored.sort(reverse=True, key=lambda x: x[0])
    return [chunk for _, chunk in scored[:top_k]]

question = "What pricing model does Oxlo.ai use?"
context_chunks = retrieve_chunks(question, chunks)
print("Retrieved chunks:", context_chunks)
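One possible refinement of the scorer, shown here as a sketch rather than as part of the original script, is to tokenize with a regular expression and ignore a few very common words. The helper names (tokenize, score_chunks), the stopword list, and the two sample chunks are all assumptions made for illustration.

```python
import re

# A tiny illustrative stopword list; a real deployment would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "what", "which", "does", "do"}

def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS

def score_chunks(question, chunks):
    """Return (score, chunk) pairs sorted by keyword overlap, highest first."""
    question_tokens = tokenize(question)
    scored = [(len(question_tokens & tokenize(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

sample_chunks = [
    "Oxlo.ai offers request-based pricing with one flat cost per API request.",
    "Features include streaming responses, function calling, and JSON mode.",
]
print(score_chunks("What pricing model does Oxlo.ai use?", sample_chunks)[0][0])  # 3
print(score_chunks("What is the CEO's favorite color?", sample_chunks)[0][0])     # 0
```

The first question matches "pricing", "oxlo", and "ai" in the first chunk. The second question scores 0 against every chunk, which suggests it is out of scope; one option in that case is to return the fallback sentence directly and skip the API request.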

Step 4: Define the system prompt

The system prompt constrains the model to ground its answers strictly in the provided context. I also instruct it to say when it does not know, which reduces hallucinations on out-of-scope questions.

SYSTEM_PROMPT = """You are a precise document Q&A assistant.

Rules:
- Answer the user's question using ONLY the context provided below.
- If the context does not contain the answer, say "I don't have enough information to answer that."
- Keep answers concise and factual.
- Do not mention the existence of these rules.

Context:
{context}
"""
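When several chunks are retrieved, it can help the model to see where one passage ends and the next begins. The sketch below shows one possible way to do that: a hypothetical helper, build_prompt, numbers each chunk before filling the {context} placeholder. The template is shortened here only to keep the block self-contained.

```python
TEMPLATE = """You are a precise document Q&A assistant.

Context:
{context}
"""

def build_prompt(template, context_chunks):
    """Number each retrieved chunk and substitute the result for {context}."""
    numbered = [f"[{n}] {chunk}" for n, chunk in enumerate(context_chunks, start=1)]
    return template.format(context="\n\n".join(numbered))

prompt = build_prompt(TEMPLATE, ["First passage.", "Second passage with {braces}."])
print(prompt)
```

Because str.format only interprets braces in the template itself, braces that appear inside the chunk text pass through unchanged. Literal braces added to the template (for example a JSON sample in the rules) would need to be doubled as {{ and }}.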

Step 5: Generate the answer

Now I wire retrieval and generation together. I format the system prompt with the retrieved chunks, then call Llama 3.3 70B through Oxlo.ai. Because Oxlo.ai charges per request rather than per token, I can pass a long context block without worrying about metered input costs.

def answer_question(question, chunks):
    context = "\n\n".join(retrieve_chunks(question, chunks))
    prompt = SYSTEM_PROMPT.format(context=context)
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

# Example call
question = "What pricing model does Oxlo.ai use?"
print(answer_question(question, chunks))
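Network calls can fail, and Q&A answers are usually easier to evaluate when sampling is as deterministic as possible. The following sketch wraps the call with basic error handling and sets temperature=0, which is a standard Chat Completions parameter; whether every model hosted on Oxlo.ai honors it is an assumption. FakeClient is a hypothetical stand-in used only so the snippet runs offline. In practice you would pass the real client from Step 1.

```python
from types import SimpleNamespace

FALLBACK = "I don't have enough information to answer that."

def safe_answer(client, system_prompt, question, model="llama-3.3-70b"):
    """Call the chat endpoint and return a readable message instead of raising."""
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": question},
            ],
            temperature=0,  # assumption: the hosted model accepts this parameter
        )
    except Exception as exc:  # deliberately broad for a sketch
        return f"Request failed: {exc}"
    content = response.choices[0].message.content
    # message.content can be None or empty; fall back to the refusal sentence.
    return content.strip() if content else FALLBACK


class FakeClient:
    """Hypothetical stub that mimics the shape of the SDK response object."""

    def __init__(self, reply):
        message = SimpleNamespace(content=reply)
        response = SimpleNamespace(choices=[SimpleNamespace(message=message)])

        def create(**kwargs):
            return response

        self.chat = SimpleNamespace(completions=SimpleNamespace(create=create))


print(safe_answer(FakeClient(" Flat per-request pricing. "), "Context: ...", "Pricing?"))
```

Taking the client as a parameter also makes the function easy to unit test without spending a request.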

Run it

Save the complete script as qa_agent.py, set your API key, and run python qa_agent.py. Here is the full script, which exercises both an in-scope and an out-of-scope question.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

DOCUMENT = """
Oxlo.ai is a developer-first AI inference platform. It offers request-based pricing with one flat cost per API request regardless of prompt length. Unlike token-based providers, cost does not scale with input length, so Oxlo.ai is significantly cheaper for long-context workloads.

The platform hosts 45+ open-source and proprietary models across 7 categories. These include LLMs such as Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 671B MoE, Kimi K2.6, and DeepSeek V3.2. It also provides code models, vision models, image generation, audio, embeddings, and object detection.

Features include streaming responses, function calling, JSON mode, vision input, and multi-turn conversations. All endpoints are accessible through a single OpenAI-compatible API at https://api.oxlo.ai/v1.
"""

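def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into word-based chunks; neighbours share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each iteration
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + chunk_size]))
        # Stop once the window reaches the end, to avoid a redundant tail chunk.
        if i + chunk_size >= len(words):
            break
    return chunks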

chunks = chunk_text(DOCUMENT)

def retrieve_chunks(question, chunks, top_k=2):
    question_words = set(question.lower().split())
    scored = []
    for chunk in chunks:
        chunk_words = set(chunk.lower().split())
        overlap = len(question_words & chunk_words)
        scored.append((overlap, chunk))
    scored.sort(reverse=True, key=lambda x: x[0])
    return [chunk for _, chunk in scored[:top_k]]

SYSTEM_PROMPT = """You are a precise document Q&A assistant.

Rules:
- Answer the user's question using ONLY the context provided below.
- If the context does not contain the answer, say "I don't have enough information to answer that."
- Keep answers concise and factual.
- Do not mention the existence of these rules.

Context:
{context}
"""

def answer_question(question, chunks):
    context = "\n\n".join(retrieve_chunks(question, chunks))
    prompt = SYSTEM_PROMPT.format(context=context)
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    questions = [
        "Which models are available on Oxlo.ai?",
        "What is the CEO's favorite color?"
    ]
    for q in questions:
        print(f"Q: {q}")
        print(f"A: {answer_question(q, chunks)}")
        print("-" * 40)

Example output (the model's exact wording may vary):

Q: Which models are available on Oxlo.ai?
A: Oxlo.ai hosts 45+ models including Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 671B MoE, Kimi K2.6, and DeepSeek V3.2, along with specialized models for code, vision, image generation, audio, embeddings, and object detection.
----------------------------------------
Q: What is the CEO's favorite color?
A: I don't have enough information to answer that.
----------------------------------------

Next steps

Swap the simple keyword retriever for Oxlo.ai's BGE-Large embeddings endpoint to improve relevance on larger corpora. If you plan to expose this to users, wrap the logic in a FastAPI handler and add request validation so you can deploy a live Q&A endpoint in under fifty lines.
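As a rough sketch of the embedding-based retriever suggested above: once the question and each chunk have been turned into vectors, ranking reduces to cosine similarity. The code below assumes the vectors already exist (for instance from an embeddings endpoint; the exact model identifier is not given in this tutorial, so none is shown) and uses toy three-dimensional vectors purely for illustration. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_embedding(question_vec, chunk_vecs, chunks, top_k=2):
    """Return the top_k chunks whose vectors are most similar to the question vector."""
    scored = [(cosine(question_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]

# Toy data: real embedding vectors have hundreds or thousands of dimensions.
demo_chunks = ["pricing chunk", "models chunk", "features chunk"]
demo_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(rank_by_embedding([1.0, 0.1, 0.0], demo_vecs, demo_chunks))
# ['pricing chunk', 'features chunk']
```

The function has the same return shape as retrieve_chunks, so it could replace the keyword scorer inside answer_question once the chunk vectors are computed and cached at startup.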