I am going to build a document-grounded question-answering agent that reads a provided knowledge base and answers questions with inline citations. This is useful for internal support bots, research assistants, or any workflow where answers must be tied to source material rather than to model hallucinations.
What you'll need
- Python 3.10 or newer
- An Oxlo.ai API key from https://portal.oxlo.ai
- The OpenAI SDK:
pip install openai
Because Oxlo.ai bills per request rather than per token, stuffing the context window with long source documents does not inflate cost. See https://oxlo.ai/pricing for details.
Step 1: Connect to Oxlo.ai
First, I verify the client can reach Oxlo.ai and get a basic completion. I am using Llama 3.3 70B because it follows instructions reliably for QA tasks.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "What is the capital of France?"},
    ],
)
print(response.choices[0].message.content)
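Hardcoding the key is fine for a first connection test, but a minimal sketch like the one below reads it from an environment variable instead. The variable name OXLO_API_KEY and the helper load_api_key are assumptions made for illustration, not something Oxlo.ai prescribes.

```python
import os


def load_api_key(env=None, var_name="OXLO_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `var_name` is a hypothetical variable name chosen for this sketch.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Demo with an in-memory mapping so the snippet runs without a real secret.
print(load_api_key({"OXLO_API_KEY": "demo-key"}))  # prints demo-key
```

With this in place, the client could be created with api_key=load_api_key() rather than a literal string.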
Step 2: Define the system prompt
The system prompt forces the model to cite its sources and refuse to guess when the context does not contain the answer. I keep this in its own variable so I can reuse it.
SYSTEM_PROMPT = """You are a precise question-answering agent.
Answer using ONLY the provided source documents.
Cite each claim with a bracketed reference like [Source 1] or [Source 2].
If the documents do not contain the answer, say: "The provided documents do not contain this information."
Do not use outside knowledge."""
Step 3: Inject source documents
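Now I format the user message so it contains the knowledge base followed by the question. This is the simplest form of retrieval-augmented generation: there is no retrieval step, and the whole knowledge base goes into the prompt. On Oxlo.ai the cost stays flat no matter how much text I include, as long as the prompt fits in the model's context window.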
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
SYSTEM_PROMPT = """You are a precise question-answering agent.
Answer using ONLY the provided source documents.
Cite each claim with a bracketed reference like [Source 1] or [Source 2].
If the documents do not contain the answer, say: "The provided documents do not contain this information."
Do not use outside knowledge."""
documents = [
    "Source 1: Oxlo.ai offers flat per-request pricing for open-source LLMs. Unlike token-based providers, cost does not scale with prompt length.",
    "Source 2: The platform supports over 45 models including Llama 3.3 70B, DeepSeek R1 671B, and Kimi K2.6. It is fully OpenAI SDK compatible.",
    "Source 3: Oxlo.ai provides endpoints for chat, embeddings, image generation, audio transcription, and text-to-speech.",
]
context = "\n\n".join(documents)
question = "What models does Oxlo.ai support?"
user_message = f"Documents:\n{context}\n\nQuestion: {question}"
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ],
)
print(response.choices[0].message.content)
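To keep a long knowledge base inside a context limit, one option is to trim the document list before joining it. The sketch below keeps whole documents, in order, until a character budget is used up. The helper name fit_to_budget and the default of 12,000 characters are assumptions for illustration, and characters are only a crude proxy for tokens.

```python
def fit_to_budget(documents, max_chars=12_000):
    """Keep whole documents, in order, until the character budget is used up."""
    kept, used = [], 0
    for doc in documents:
        cost = len(doc) + 2  # +2 for the "\n\n" separator used when joining
        if used + cost > max_chars:
            break
        kept.append(doc)
        used += cost
    return kept


# Three 60-character documents against a 130-character budget: two fit.
docs = ["Source 1: " + "a" * 50, "Source 2: " + "b" * 50, "Source 3: " + "c" * 50]
context = "\n\n".join(fit_to_budget(docs, max_chars=130))
print(context.count("Source"))  # prints 2
```

Dropping whole documents rather than cutting one mid-sentence keeps every remaining source citable in full.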
Step 4: Build a reusable agent class
To make this reusable, I wrap the logic in a small Python class. It accepts a list of documents, numbers them automatically, formats the prompt, and returns the answer.
from openai import OpenAI


class DocumentQA:
    def __init__(self, api_key: str, model: str = "llama-3.3-70b"):
        self.client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=api_key)
        self.model = model
        self.system_prompt = (
            "You are a precise question-answering agent. "
            "Answer using ONLY the provided source documents. "
            "Cite each claim with a bracketed reference like [Source 1]. "
            'If the documents do not contain the answer, say: '
            '"The provided documents do not contain this information." '
            "Do not use outside knowledge."
        )

    def ask(self, documents: list[str], question: str) -> str:
        # Number the documents so the model can cite them as [Source N]
        context = "\n\n".join(
            f"Source {i + 1}: {doc}" for i, doc in enumerate(documents)
        )
        user_message = f"Documents:\n{context}\n\nQuestion: {question}"
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": self.system_prompt},
                {"role": "user", "content": user_message},
            ],
        )
        return response.choices[0].message.content


# Example usage
if __name__ == "__main__":
    qa = DocumentQA(api_key="YOUR_OXLO_API_KEY")
    docs = [
        "Oxlo.ai uses request-based pricing, which can be significantly cheaper than token-based billing for long-context workloads.",
        "Supported endpoints include chat/completions, embeddings, images/generations, audio/transcriptions, and audio/speech.",
        "Flagship models include Qwen 3 32B for multilingual reasoning, DeepSeek V4 Flash with 1M context, and GLM 5 for agentic tasks.",
    ]
    answer = qa.ask(docs, "What pricing model does Oxlo.ai use?")
    print(answer)
Step 5: Handle out-of-scope questions
A good QA agent must refuse to hallucinate. I test the same agent against a question that is not covered by the documents to verify it follows the refusal instruction.
# continuing from the previous example
bad_question = "Who founded Oxlo.ai?"
refusal = qa.ask(docs, bad_question)
print(refusal)
Expected output (the exact wording can vary between runs):
The provided documents do not contain this information.
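The refusal check above can also be done in code rather than by eye. The following is a minimal sketch that looks for the refusal sentence from the system prompt, ignoring case and whitespace differences. The helper name is_refusal is an assumption for illustration, and a model that rephrases the refusal would slip past this check.

```python
REFUSAL = "The provided documents do not contain this information."


def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal sentence, ignoring case and spacing."""
    normalized = " ".join(answer.lower().split())
    return REFUSAL.lower() in normalized


print(is_refusal("The provided documents do not contain this information."))  # prints True
print(is_refusal("Oxlo.ai uses request-based pricing [Source 1]."))  # prints False
```

A caller could use this to route unanswered questions to a human or to a fallback search instead of showing the refusal text directly.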
Run it
Save the full script as qa_agent.py, replace YOUR_OXLO_API_KEY with your key from the Oxlo.ai portal, and run python qa_agent.py. You should see an answer with citations for in-scope questions and a polite refusal for out-of-scope ones.
Example output for the pricing question:
Oxlo.ai uses request-based pricing, which means you pay a flat cost per API request regardless of prompt length [Source 1]. This can be significantly cheaper than token-based billing for long-context workloads [Source 1].
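Because the citations follow a fixed [Source N] pattern, they can be checked after the fact. Here is a rough sketch, with hypothetical helper names cited_sources and invalid_citations, that extracts the cited numbers and flags any that do not correspond to a supplied document. It only recognizes the exact [Source N] form, so a combined reference such as [Source 1, Source 2] would be missed.

```python
import re

# Matches the bracketed form requested in the system prompt, e.g. [Source 2]
CITATION = re.compile(r"\[Source (\d+)\]")


def cited_sources(answer: str) -> set[int]:
    """Return the set of source numbers cited in the answer."""
    return {int(n) for n in CITATION.findall(answer)}


def invalid_citations(answer: str, num_documents: int) -> set[int]:
    """Return cited numbers that fall outside 1..num_documents."""
    return {n for n in cited_sources(answer) if not 1 <= n <= num_documents}


sample = "Pricing is per request [Source 1]. It was founded in 1999 [Source 7]."
print(sorted(cited_sources(sample)))  # prints [1, 7]
print(sorted(invalid_citations(sample, num_documents=3)))  # prints [7]
```

An answer with no citations at all, or with out-of-range ones, is a reasonable signal to retry or to flag the response for review.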
Next steps
Replace the hardcoded document list with a vector database like Chroma or pgvector so the agent retrieves only the most relevant chunks instead of sending the entire corpus. You could also swap the model to qwen-3-32b on Oxlo.ai if you need to answer questions across multilingual documents.
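Before wiring up a vector database, the "retrieve only the most relevant chunks" idea can be tried with plain keyword overlap. The sketch below is a stand-in, not how Chroma or pgvector rank results: it scores each document by the number of words it shares with the question. The names tokenize and top_k_documents are assumptions for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_documents(documents: list[str], question: str, k: int = 2) -> list[str]:
    """Rank documents by how many question words they share; keep the top k."""
    q = tokenize(question)
    scored = [(len(q & tokenize(doc)), i, doc) for i, doc in enumerate(documents)]
    # Highest score first; ties keep their original order
    scored.sort(key=lambda item: (-item[0], item[1]))
    # Drop documents that share no words with the question
    return [doc for score, _, doc in scored[:k] if score > 0]


docs = [
    "Oxlo.ai uses request-based pricing.",
    "Supported endpoints include chat and embeddings.",
    "Flagship models include Qwen 3 32B.",
]
print(top_k_documents(docs, "Which endpoints are supported?"))
# prints ['Supported endpoints include chat and embeddings.']
```

Note that DocumentQA.ask numbers whatever list it receives, so after retrieval [Source 1] refers to the first retrieved document rather than its position in the full corpus.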