We are going to build a domain-specific language model that answers questions about an internal Python API. Instead of training weights, we assemble a system prompt, tool definitions, and structured output logic to create custom model behavior from scratch. All inference runs on Oxlo.ai, where flat per-request pricing keeps costs predictable even when we add multi-turn tool loops.
What you'll need
- Python 3.10+
- The openai package: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai
- Plan details are at https://oxlo.ai/pricing
Step 1: Connect to Oxlo.ai and verify the client
Before writing logic, confirm the client can reach the endpoint. I use llama-3.3-70b as the backbone because it follows instructions reliably.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Reply with exactly: Connection OK"},
    ],
)
print(response.choices[0].message.content)
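Hardcoding the key is fine for a first smoke test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead; the variable name OXLO_API_KEY and the helper load_api_key are assumptions made for illustration, not something Oxlo.ai prescribes:

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key stored in env_var, or raise a clear error.

    The variable name is a hypothetical choice for this tutorial.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage, once the variable is exported in your shell:
# client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key())
```

Failing early with a readable message tends to be easier to debug than an authentication error returned by the first request.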
Step 2: Define the model persona with a system prompt
This prompt is the core of our custom model. It sets boundaries, mandates citations, and defines when to call the documentation tool.
SYSTEM_PROMPT = """You are PyDoc-LM, a technical documentation language model.
You answer questions about the internal 'widgets' Python package only.
Rules:
1. If you need documentation, call the get_docs tool with the exact module name.
2. Every factual claim must include a citation in the format [source: module.name].
3. If the answer is not in the documentation, say "I don't have that context".
4. Be concise. Use fenced code blocks for examples."""
print("Persona loaded:", len(SYSTEM_PROMPT), "characters")
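A system prompt only requests the citation format; it does not guarantee it. One way to check rules 2 and 3 on the model's reply is a small regex validator. This is a sketch: the helper names extract_citations and follows_rules are hypothetical, and the pattern assumes module names are dotted Python identifiers.

```python
import re

# Matches citations such as [source: widgets.core]
CITATION_RE = re.compile(r"\[source:\s*([A-Za-z_][\w.]*)\]")

# The refusal sentence required by rule 3 of the system prompt.
REFUSAL = "I don't have that context"


def extract_citations(answer: str) -> list[str]:
    """Return every module name cited in the answer, in order of appearance."""
    return CITATION_RE.findall(answer)


def follows_rules(answer: str) -> bool:
    """True if the answer either cites a source or is the explicit refusal."""
    return REFUSAL in answer or bool(extract_citations(answer))


print(extract_citations("Use BaseWidget(name). [source: widgets.core]"))
```

A check like this could run on every reply and trigger a retry or a log entry when it fails.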
Step 3: Give it tools to look up documentation
We simulate a docstore with a dictionary and register it as a function schema. Oxlo.ai supports function calling on llama-3.3-70b, so the model can request context as needed.
DOCS_DB = {
    "widgets.core": "widgets.core provides the BaseWidget class. Initialize with BaseWidget(name, timeout=30).",
    "widgets.http": "widgets.http.build_client() returns an async HTTP client with built-in retries and connection pooling.",
}

def get_docs(module: str) -> str:
    return DOCS_DB.get(module, "No documentation found for that module.")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_docs",
            "description": "Retrieve documentation for a widgets module.",
            "parameters": {
                "type": "object",
                "properties": {
                    "module": {
                        "type": "string",
                        "description": "Module name, e.g. widgets.core"
                    }
                },
                "required": ["module"],
            },
        },
    }
]
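The schema tells the model what arguments to send, but the arguments still arrive as a raw JSON string that may be malformed or incomplete. A minimal sketch of a defensive dispatcher follows; TOOL_REGISTRY and dispatch_tool are hypothetical names, and the small DOCS_DB here is a stand-in so the block runs on its own.

```python
import json

# Stand-in for the docstore above so this block is self-contained.
DOCS_DB = {"widgets.core": "widgets.core provides the BaseWidget class."}


def get_docs(module: str) -> str:
    return DOCS_DB.get(module, "No documentation found for that module.")


# Hypothetical registry: tool name -> local Python callable.
TOOL_REGISTRY = {"get_docs": get_docs}


def dispatch_tool(name: str, raw_arguments: str) -> str:
    """Run a registered tool and always return a string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return f"Error: unknown tool '{name}'."
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "Error: tool arguments were not valid JSON."
    if not isinstance(args, dict):
        return "Error: tool arguments must be a JSON object."
    try:
        return str(func(**args))
    except TypeError as exc:
        # Raised when required arguments are missing or unexpected ones are passed.
        return f"Error: bad arguments for '{name}': {exc}"


print(dispatch_tool("get_docs", '{"module": "widgets.core"}'))
```

Returning the error as text, rather than raising, lets the model see what went wrong and possibly retry with corrected arguments.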
Step 4: Handle tool calls and generate grounded answers
Build the conversation loop. If the model emits a tool call, execute it locally and push the result back into the message history.
import json
def ask(question: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    # First request: the model may answer directly or ask for documentation.
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        # Echo the assistant turn so each tool result can reference its call id.
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
        })
        for tc in msg.tool_calls:
            if tc.function.name == "get_docs":
                args = json.loads(tc.function.arguments)
                result = get_docs(args["module"])
            else:
                # OpenAI-compatible APIs generally expect a tool message for
                # every tool call, so answer unexpected names too.
                result = f"Unknown tool: {tc.function.name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })
        # Second request: the model answers with the documentation in context.
        response = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=messages,
        )
        return response.choices[0].message.content
    return msg.content
print(ask("How do I initialize a BaseWidget?"))
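The function above handles a single round of tool calls. If a question needs documentation from more than one lookup round, the same pattern can be wrapped in a bounded loop. The sketch below assumes messages and replies are plain dicts and takes the model call as a parameter, so it can be exercised with a scripted stand-in; run_tool_loop and fake_model are hypothetical names, and in real use call_model would wrap client.chat.completions.create.

```python
import json
from typing import Callable


def run_tool_loop(
    call_model: Callable[[list[dict]], dict],
    run_tool: Callable[[str, dict], str],
    messages: list[dict],
    max_rounds: int = 3,
) -> str:
    """Keep answering tool calls until the model replies with plain text."""
    for _ in range(max_rounds):
        reply = call_model(messages)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply.get("content") or ""
        messages.append(reply)
        for tc in tool_calls:
            args = json.loads(tc["function"]["arguments"])
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": run_tool(tc["function"]["name"], args),
            })
    raise RuntimeError(f"No final answer after {max_rounds} tool rounds.")


def fake_model(messages: list[dict]) -> dict:
    # Scripted stand-in for the API: request docs once, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "", "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {"name": "get_docs",
                         "arguments": '{"module": "widgets.core"}'},
        }]}
    return {"role": "assistant",
            "content": "Use BaseWidget(name). [source: widgets.core]"}


answer = run_tool_loop(
    fake_model,
    lambda name, args: "docs for " + args["module"],
    [{"role": "user", "content": "How do I initialize a BaseWidget?"}],
)
print(answer)
```

The max_rounds cap keeps a model that repeatedly requests tools from looping forever; each round is one additional request.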
Step 5: Structure the output with JSON mode
For programmatic use, force the final answer into a schema. I switch to deepseek-v3.2 here because it handles JSON mode and coding contexts well, and it is available on Oxlo.ai's free tier for experimentation.
def ask_json(question: str) -> dict:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]
    # First request: let the model decide whether it needs documentation.
    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )
    msg = response.choices[0].message
    if msg.tool_calls:
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
        })
        for tc in msg.tool_calls:
            if tc.function.name == "get_docs":
                args = json.loads(tc.function.arguments)
                result = get_docs(args["module"])
            else:
                # OpenAI-compatible APIs generally expect a tool message for
                # every tool call, so answer unexpected names too.
                result = f"Unknown tool: {tc.function.name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })
    # Final request: always ask for the structured JSON answer.
    messages.append({
        "role": "system",
        "content": "Now respond in JSON with keys: answer (string), citations (list), code_example (string or null)."
    })
    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=messages,
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
print(json.dumps(ask_json("How do I initialize a BaseWidget?"), indent=2))
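JSON mode asks for syntactically valid JSON, but it does not by itself enforce the keys or types requested in the prompt. A minimal sketch of a shape check to run on the parsed result; validate_answer is a hypothetical helper, and treating citations as a list of strings is an assumption, since the prompt only says "list".

```python
import json


def validate_answer(payload: object) -> dict:
    """Check the parsed reply against the three keys requested in the prompt."""
    if not isinstance(payload, dict):
        raise ValueError("Expected a JSON object.")
    answer = payload.get("answer")
    citations = payload.get("citations")
    code_example = payload.get("code_example")
    if not isinstance(answer, str):
        raise ValueError("'answer' must be a string.")
    # Assumption: every citation is a string such as "source: widgets.core".
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        raise ValueError("'citations' must be a list of strings.")
    if code_example is not None and not isinstance(code_example, str):
        raise ValueError("'code_example' must be a string or null.")
    # Return only the known keys so extra fields do not leak downstream.
    return {"answer": answer, "citations": citations, "code_example": code_example}


sample = json.loads('{"answer": "I don\'t have that context", "citations": [], "code_example": null}')
print(validate_answer(sample))
```

Callers can catch ValueError and retry the request or fall back to a plain-text answer.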
Run it
Wire everything into a short test harness. Because Oxlo.ai charges per request rather than per token, these multi-turn loops cost the same whether the user pastes a one-line question or a thousand-word specification.
if __name__ == "__main__":
    tests = [
        "How do I initialize a BaseWidget?",
        "What does widgets.http do?",
        "Can I set a custom timeout?",
    ]
    for q in tests:
        print(f"\nQ: {q}")
        # Pretty-print so the output is JSON rather than a Python dict repr.
        print("A:", json.dumps(ask_json(q), indent=2))
Example output:
Q: How do I initialize a BaseWidget?
A: {
  "answer": "Use BaseWidget(name, timeout=30) to initialize an instance.",
  "citations": [
    "source: widgets.core"
  ],
  "code_example": "from widgets.core import BaseWidget\nw = BaseWidget(name='demo', timeout=30)"
}

Q: What does widgets.http do?
A: {
  "answer": "widgets.http.build_client returns an async HTTP client with retries and connection pooling.",
  "citations": [
    "source: widgets.http"
  ],
  "code_example": "from widgets.http import build_client\nclient = await build_client()"
}

Q: Can I set a custom timeout?
A: {
  "answer": "I don't have that context",
  "citations": [],
  "code_example": null
}
Next steps
Replace DOCS_DB with a real vector database such as Chroma or pgvector, and retrieve chunks dynamically before calling the model. If you need vision support or longer context windows, swap deepseek-v3.2 for kimi-k2.6. For multilingual agent workflows, qwen-3-32b is a strong drop-in alternative. Because Oxlo.ai uses flat per-request pricing, adding retrieval steps or increasing prompt length will not inflate your bill the way token-based providers do.
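To make the retrieval idea concrete before committing to a vector database, here is a toy sketch that ranks documentation chunks by keyword overlap with the question. It is a stand-in for embedding-based search, not the Chroma or pgvector API; retrieve, tokenize, and CHUNKS are hypothetical names.

```python
import re

# Same two entries as DOCS_DB, treated here as retrievable chunks.
CHUNKS = {
    "widgets.core": "widgets.core provides the BaseWidget class. "
                    "Initialize with BaseWidget(name, timeout=30).",
    "widgets.http": "widgets.http.build_client() returns an async HTTP client "
                    "with built-in retries and connection pooling.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question: str, chunks: dict[str, str], top_k: int = 1) -> list[str]:
    """Return the names of the top_k chunks sharing the most tokens with the question."""
    q_tokens = tokenize(question)
    scored = [(len(q_tokens & tokenize(text)), name) for name, text in chunks.items()]
    # Drop chunks with no overlap, then sort by score (descending) and name.
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


print(retrieve("How do I initialize a BaseWidget?", CHUNKS))
```

A real vector store would replace the token-overlap score with embedding similarity, but the surrounding flow of retrieving first and then placing the chunks in the prompt stays the same.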