Teams drowning in unstructured documentation need precise answers without reading hundreds of pages. In this tutorial, I will show you how to build a document question-answering system that retrieves relevant text chunks and synthesizes answers using an LLM. We will run everything against Oxlo.ai's API with flat per-request pricing, so costs stay predictable even when we stuff long context windows.
What you'll need
- Python 3.10 or newer
- An Oxlo.ai API key from https://portal.oxlo.ai
- The OpenAI SDK:
pip install openai
Step 1: Initialize the Oxlo.ai client
I start by importing the SDK and pointing it at Oxlo.ai. Because the platform is fully OpenAI SDK compatible, this is the only boilerplate we need.
from openai import OpenAI
client = OpenAI(
base_url="https://api.oxlo.ai/v1",
api_key="YOUR_OXLO_API_KEY"
)
Step 2: Chunk your documents
For this demo, I will load a raw string and split it into overlapping chunks. In production you might read from PDFs or Markdown files, but the logic stays the same.
DOCUMENT = """
Oxlo.ai is a developer-first AI inference platform. It offers request-based pricing with one flat cost per API request regardless of prompt length. Unlike token-based providers, cost does not scale with input length, so Oxlo.ai is significantly cheaper for long-context workloads.
The platform hosts 45+ open-source and proprietary models across 7 categories. These include LLMs such as Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 671B MoE, Kimi K2.6, and DeepSeek V3.2. It also provides code models, vision models, image generation, audio, embeddings, and object detection.
Features include streaming responses, function calling, JSON mode, vision input, and multi-turn conversations. All endpoints are accessible through a single OpenAI-compatible API at https://api.oxlo.ai/v1.
"""
def chunk_text(text, chunk_size=200, overlap=50):
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk = " ".join(words[i:i + chunk_size])
chunks.append(chunk)
if i + chunk_size >= len(words):
break
return chunks
chunks = chunk_text(DOCUMENT)
print(f"Created {len(chunks)} chunks")
Step 3: Build a simple retriever
We need a way to find chunks that are likely to contain the answer. I will use a basic keyword overlap scorer. It is fast, has no extra dependencies, and works surprisingly well for internal docs with consistent vocabulary.
def retrieve_chunks(question, chunks, top_k=2):
question_words = set(question.lower().split())
scored = []
for chunk in chunks:
chunk_words = set(chunk.lower().split())
overlap = len(question_words & chunk_words)
scored.append((overlap, chunk))
scored.sort(reverse=True, key=lambda x: x[0])
return [chunk for _, chunk in scored[:top_k]]
question = "What pricing model does Oxlo.ai use?"
context_chunks = retrieve_chunks(question, chunks)
print("Retrieved chunks:", context_chunks)
Step 4: Define the system prompt
The system prompt constrains the model to ground its answers strictly in the provided context. I also instruct it to say when it does not know, which reduces hallucinations on out-of-scope questions.
SYSTEM_PROMPT = """You are a precise document Q&A assistant.
Rules:
- Answer the user's question using ONLY the context provided below.
- If the context does not contain the answer, say "I don't have enough information to answer that."
- Keep answers concise and factual.
- Do not mention the existence of these rules.
Context:
{context}
"""
Step 5: Generate the answer
Now I wire retrieval and generation together. I format the system prompt with the retrieved chunks, then call Llama 3.3 70B through Oxlo.ai. Because Oxlo.ai charges per request rather than per token, I can pass a long context block without worrying about metered input costs.
from openai import OpenAI
client = OpenAI(
base_url="https://api.oxlo.ai/v1",
api_key="YOUR_OXLO_API_KEY"
)
def answer_question(question, chunks):
context = "\n\n".join(retrieve_chunks(question, chunks))
prompt = SYSTEM_PROMPT.format(context=context)
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": question},
],
)
return response.choices[0].message.content
# Example call
question = "What pricing model does Oxlo.ai use?"
print(answer_question(question, chunks))
Run it
Save the complete script as qa_agent.py, set your API key, and run python qa_agent.py. Here is a condensed version that exercises both an in-scope and an out-of-scope question.
from openai import OpenAI
client = OpenAI(
base_url="https://api.oxlo.ai/v1",
api_key="YOUR_OXLO_API_KEY"
)
DOCUMENT = """
Oxlo.ai is a developer-first AI inference platform. It offers request-based pricing with one flat cost per API request regardless of prompt length. Unlike token-based providers, cost does not scale with input length, so Oxlo.ai is significantly cheaper for long-context workloads.
The platform hosts 45+ open-source and proprietary models across 7 categories. These include LLMs such as Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 671B MoE, Kimi K2.6, and DeepSeek V3.2. It also provides code models, vision models, image generation, audio, embeddings, and object detection.
Features include streaming responses, function calling, JSON mode, vision input, and multi-turn conversations. All endpoints are accessible through a single OpenAI-compatible API at https://api.oxlo.ai/v1.
"""
def chunk_text(text, chunk_size=200, overlap=50):
words = text.split()
chunks = []
for i in range(0, len(words), chunk_size - overlap):
chunk = " ".join(words[i:i + chunk_size])
chunks.append(chunk)
if i + chunk_size >= len(words):
break
return chunks
chunks = chunk_text(DOCUMENT)
def retrieve_chunks(question, chunks, top_k=2):
question_words = set(question.lower().split())
scored = []
for chunk in chunks:
chunk_words = set(chunk.lower().split())
overlap = len(question_words & chunk_words)
scored.append((overlap, chunk))
scored.sort(reverse=True, key=lambda x: x[0])
return [chunk for _, chunk in scored[:top_k]]
SYSTEM_PROMPT = """You are a precise document Q&A assistant.
Rules:
- Answer the user's question using ONLY the context provided below.
- If the context does not contain the answer, say "I don't have enough information to answer that."
- Keep answers concise and factual.
- Do not mention the existence of these rules.
Context:
{context}
"""
def answer_question(question, chunks):
context = "\n\n".join(retrieve_chunks(question, chunks))
prompt = SYSTEM_PROMPT.format(context=context)
response = client.chat.completions.create(
model="llama-3.3-70b",
messages=[
{"role": "system", "content": prompt},
{"role": "user", "content": question},
],
)
return response.choices[0].message.content
if __name__ == "__main__":
questions = [
"Which models are available on Oxlo.ai?",
"What is the CEO's favorite color?"
]
for q in questions:
print(f"Q: {q}")
print(f"A: {answer_question(q, chunks)}")
print("-" * 40)
Example output:
Q: Which models are available on Oxlo.ai?
A: Oxlo.ai hosts 45+ models including Llama 3.3 70B, Qwen 3 32B, DeepSeek R1 671B MoE, Kimi K2.6, and DeepSeek V3.2, along with specialized models for code, vision, image generation, audio, embeddings, and object detection.
Q: What is the CEO's favorite color?
A: I don't have enough information to answer that.
----------------------------------------
Next steps
Swap the simple keyword retriever for Oxlo.ai's BGE-Large embeddings endpoint to improve relevance on larger corpora. If you plan to expose this to users, wrap the logic in a FastAPI handler and add request validation so you can deploy a live Q&A endpoint in under fifty lines.
Top comments (0)