Building a Language Model from Scratch


We are going to build a domain-specific language model that answers questions about an internal Python API. Instead of training weights, we assemble a system prompt, tool definitions, and structured output logic to create custom model behavior from scratch. All inference runs on Oxlo.ai, where flat per-request pricing keeps costs predictable even when we add multi-turn tool loops.

What you'll need

Python 3.10+
pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai
Plan details are at https://oxlo.ai/pricing

Step 1: Connect to Oxlo.ai and verify the client

Before writing logic, confirm the client can reach the endpoint. I use llama-3.3-70b as the backbone because it follows instructions reliably.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "user", "content": "Reply with exactly: Connection OK"},
    ],
)

print(response.choices[0].message.content)
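
Hardcoding the key is fine for a first smoke test, but it is easy to commit by accident. A minimal sketch of reading it from the environment instead follows. The variable name OXLO_API_KEY and the helper load_api_key are assumptions made for this tutorial, not names prescribed by Oxlo.ai.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    The variable name is a hypothetical convention chosen for this tutorial.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

The client line would then read OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key()).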

Step 2: Define the model persona with a system prompt

This prompt is the core of our custom model. It sets boundaries, mandates citations, and defines when to call the documentation tool.

SYSTEM_PROMPT = """You are PyDoc-LM, a technical documentation language model.
You answer questions about the internal 'widgets' Python package only.
Rules:
1. If you need documentation, call the get_docs tool with the exact module name.
2. Every factual claim must include a citation in the format [source: module.name].
3. If the answer is not in the documentation, say "I don't have that context".
4. Be concise. Use fenced code blocks for examples."""

print("Persona loaded:", len(SYSTEM_PROMPT), "characters")
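
A prompt rule is only a request, so it can help to check the citation rule in code as well. The sketch below assumes citations appear literally as [source: module.name], as rule 2 asks. The helper names extract_citations and follows_citation_rule are hypothetical.

```python
import re

# Matches the citation format mandated by rule 2, e.g. [source: widgets.core]
CITATION_RE = re.compile(r"\[source:\s*([A-Za-z_][\w.]*)\]")


def extract_citations(answer: str) -> list[str]:
    """Return the module names cited in an answer, in order of appearance."""
    return CITATION_RE.findall(answer)


def follows_citation_rule(answer: str) -> bool:
    """True if the answer cites at least one source or is the mandated refusal."""
    if "I don't have that context" in answer:
        return True
    return bool(extract_citations(answer))
```

This only checks that a citation is present, not that each individual claim is backed by one, so treat it as a coarse guard rather than a proof of grounding.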

Step 3: Give it tools to look up documentation

We simulate a docstore with a dictionary and expose a lookup function to the model through a function schema. Oxlo.ai supports function calling on llama-3.3-70b, so the model can request context as needed.

DOCS_DB = {
    "widgets.core": "widgets.core provides the BaseWidget class. Initialize with BaseWidget(name, timeout=30).",
    "widgets.http": "widgets.http.build_client() returns an async HTTP client with built-in retries and connection pooling.",
}

def get_docs(module: str) -> str:
    return DOCS_DB.get(module, "No documentation found for that module.")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_docs",
            "description": "Retrieve documentation for a widgets module.",
            "parameters": {
                "type": "object",
                "properties": {
                    "module": {
                        "type": "string",
                        "description": "Module name, e.g. widgets.core"
                    }
                },
                "required": ["module"],
            },
        },
    }
]
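
Models occasionally request a tool that does not exist or send malformed arguments. One way to keep the loop robust is a small dispatcher that always returns a string, sketched below. TOOL_REGISTRY and dispatch_tool are hypothetical names, and the docstore is repeated here only so the block runs on its own.

```python
import json

# Copy of the tutorial's simulated docstore, repeated so this block is standalone.
DOCS_DB = {
    "widgets.core": "widgets.core provides the BaseWidget class. Initialize with BaseWidget(name, timeout=30).",
    "widgets.http": "widgets.http.build_client() returns an async HTTP client with built-in retries and connection pooling.",
}


def get_docs(module: str) -> str:
    return DOCS_DB.get(module, "No documentation found for that module.")


# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_docs": get_docs}


def dispatch_tool(name: str, raw_arguments: str) -> str:
    """Run a registered tool and always return a string for the tool message."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return f"Error: unknown tool '{name}'."
    try:
        args = json.loads(raw_arguments or "{}")
    except json.JSONDecodeError:
        return "Error: tool arguments were not valid JSON."
    if not isinstance(args, dict):
        return "Error: tool arguments must be a JSON object."
    try:
        return func(**args)
    except TypeError as exc:
        # Typically raised when a required argument is missing or an extra one is sent.
        return f"Error: bad arguments for '{name}': {exc}"
```

Returning the error as text lets the model see what went wrong and try again, instead of crashing the conversation.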

Step 4: Handle tool calls and generate grounded answers

Build the conversation loop. If the model emits a tool call, execute it locally and push the result back into the message history.

import json

def ask(question: str) -> str:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

    msg = response.choices[0].message

    if msg.tool_calls:
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
        })

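        for tc in msg.tool_calls:
            # Every tool call needs a matching tool message, even for unknown tools.
            if tc.function.name == "get_docs":
                args = json.loads(tc.function.arguments)
                result = get_docs(args.get("module", ""))
            else:
                result = f"Unknown tool: {tc.function.name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })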

        response = client.chat.completions.create(
            model="llama-3.3-70b",
            messages=messages,
        )
        return response.choices[0].message.content

    return msg.content

print(ask("How do I initialize a BaseWidget?"))
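
The ask function handles a single round of tool calls. If the model needs to look up several modules in sequence, a bounded loop is one option. The sketch below assumes a create callable that behaves like client.chat.completions.create and a dispatch callable that maps a tool name and raw JSON arguments to a string. Both parameter names, and run_tool_loop itself, are hypothetical.

```python
from typing import Any, Callable


def run_tool_loop(
    create: Callable[..., Any],
    messages: list[dict],
    tools: list[dict],
    dispatch: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Call the model until it answers without tool calls or the budget runs out."""
    for _ in range(max_rounds):
        response = create(messages=messages, tools=tools, tool_choice="auto")
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content or ""
        # Echo the assistant turn so each tool result can be matched by id.
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [
                {
                    "id": tc.id,
                    "type": "function",
                    "function": {
                        "name": tc.function.name,
                        "arguments": tc.function.arguments,
                    },
                }
                for tc in msg.tool_calls
            ],
        })
        for tc in msg.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": dispatch(tc.function.name, tc.function.arguments),
            })
    # Budget exhausted: request a final answer without offering tools.
    response = create(messages=messages)
    return response.choices[0].message.content or ""
```

With the real client, create could be functools.partial(client.chat.completions.create, model="llama-3.3-70b"). The max_rounds cap matters under per-request pricing, since every extra round is another billed request.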

Step 5: Structure the output with JSON mode

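For programmatic use, ask for the final answer as JSON with a fixed set of keys. JSON mode typically guarantees only syntactically valid JSON, not a particular schema, so the keys are spelled out in a follow-up instruction. I switch to deepseek-v3.2 here because it handles JSON mode and coding contexts well, and it is available on Oxlo.ai's free tier for experimentation.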

def ask_json(question: str) -> dict:
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": question},
    ]

    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=messages,
        tools=tools,
        tool_choice="auto",
    )

    msg = response.choices[0].message

    if msg.tool_calls:
        messages.append({
            "role": "assistant",
            "content": msg.content or "",
            "tool_calls": [tc.model_dump() for tc in msg.tool_calls],
        })

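        for tc in msg.tool_calls:
            # Every tool call needs a matching tool message, even for unknown tools.
            if tc.function.name == "get_docs":
                args = json.loads(tc.function.arguments)
                result = get_docs(args.get("module", ""))
            else:
                result = f"Unknown tool: {tc.function.name}"
            messages.append({
                "role": "tool",
                "tool_call_id": tc.id,
                "content": result,
            })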

    messages.append({
        "role": "system",
        "content": "Now respond in JSON with keys: answer (string), citations (list), code_example (string or null)."
    })

    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=messages,
        response_format={"type": "json_object"},
    )

    return json.loads(response.choices[0].message.content)

print(json.dumps(ask_json("How do I initialize a BaseWidget?"), indent=2))
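
Since the keys are requested in a prompt rather than enforced, a small shape check before handing the result to other code can catch surprises. This is a sketch under stated assumptions: validate_answer is a hypothetical helper, and treating citations as a list of strings is my reading of "citations (list)".

```python
def validate_answer(payload: object) -> dict:
    """Check the three keys requested in the JSON instruction and return the payload."""
    if not isinstance(payload, dict):
        raise ValueError("Expected a JSON object.")
    if not isinstance(payload.get("answer"), str):
        raise ValueError("'answer' must be a string.")
    citations = payload.get("citations")
    # Assumption: each citation is a plain string such as "source: widgets.core".
    if not isinstance(citations, list) or not all(isinstance(c, str) for c in citations):
        raise ValueError("'citations' must be a list of strings.")
    # A missing code_example is treated the same as an explicit null.
    code = payload.get("code_example")
    if code is not None and not isinstance(code, str):
        raise ValueError("'code_example' must be a string or null.")
    return payload
```

Wrapping the return value of ask_json in validate_answer turns a silently malformed reply into an explicit ValueError that the caller can retry on.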

Run it

Wire everything into a short test harness. Each ask_json call sends two requests: one that may trigger a tool call and one that produces the JSON answer. Because Oxlo.ai charges per request rather than per token, those requests cost the same whether the user pastes a one-line question or a thousand-word specification.

if __name__ == "__main__":
    tests = [
        "How do I initialize a BaseWidget?",
        "What does widgets.http do?",
        "Can I set a custom timeout?",
    ]

    for q in tests:
        print(f"\nQ: {q}")
        print("A:", json.dumps(ask_json(q), indent=2))

Example output (exact wording will vary from run to run):

Q: How do I initialize a BaseWidget?
A: {
  "answer": "Use BaseWidget(name, timeout=30) to initialize an instance.",
  "citations": ["source: widgets.core"],
  "code_example": "from widgets.core import BaseWidget\nw = BaseWidget(name='demo', timeout=30)"
}

Q: What does widgets.http do?
A: {
  "answer": "widgets.http.build_client returns an async HTTP client with retries and connection pooling.",
  "citations": ["source: widgets.http"],
  "code_example": "from widgets.http import build_client\nclient = await build_client()"
}

Q: Can I set a custom timeout?
A: {
  "answer": "I don't have that context",
  "citations": [],
  "code_example": null
}

Next steps

Replace DOCS_DB with a real vector database such as Chroma or pgvector, and retrieve chunks dynamically before calling the model. If you need vision support or longer context windows, swap deepseek-v3.2 for kimi-k2.6. For multilingual agent workflows, qwen-3-32b is a strong drop-in alternative. Because Oxlo.ai uses flat per-request pricing, longer prompts from retrieved chunks will not inflate your bill the way they do with token-based providers, although each extra model round trip still counts as a request.
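
Before reaching for a vector database, the retrieve-then-prompt flow can be prototyped with plain keyword overlap. The sketch below is a stand-in for embedding search, not an implementation of Chroma or pgvector. The names tokenize and retrieve are hypothetical, and the docstore is repeated so the block runs on its own.

```python
import re

# Copy of the tutorial's simulated docstore, repeated so this block is standalone.
DOCS_DB = {
    "widgets.core": "widgets.core provides the BaseWidget class. Initialize with BaseWidget(name, timeout=30).",
    "widgets.http": "widgets.http.build_client() returns an async HTTP client with built-in retries and connection pooling.",
}


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question: str, docs: dict[str, str], top_k: int = 1) -> list[tuple[str, str]]:
    """Rank documents by how many tokens they share with the question."""
    q_tokens = tokenize(question)
    scored = [
        (len(q_tokens & tokenize(name + " " + text)), name, text)
        for name, text in docs.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    # Drop documents that share no tokens with the question.
    return [(name, text) for score, name, text in scored[:top_k] if score > 0]
```

The retrieved text can be prepended to the user message as context, which replaces the get_docs round trip with a single model request. Keyword overlap misses synonyms and inflections, which is the gap an embedding-based store is meant to close.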