Introduction to LLM Inference Engines

We are going to build a minimal LLM inference engine in Python that routes each user prompt to the Oxlo.ai model best suited to the task. If you have ever wondered how platforms schedule requests across different models, this will demystify the core mechanics and leave you with a working tool you can extend. Because Oxlo.ai uses flat per-request pricing, cost does not scale with input length, so routing long prompts to large models can cost less than it would with a provider that bills per token.

What you'll need

Python 3.10 or newer
An Oxlo.ai API key from https://portal.oxlo.ai
The OpenAI SDK: pip install openai

Step 1: Configure the Oxlo.ai client

I start by instantiating the client. Oxlo.ai exposes a fully OpenAI-compatible API, so the official SDK works without changes. I only need to point the base URL at Oxlo.ai and swap in my key.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY", "YOUR_OXLO_API_KEY")
)
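One caveat with the snippet above: if OXLO_API_KEY is unset, the client is created with the placeholder string and the problem only shows up on the first request. Here is a minimal sketch of a fail-fast alternative. The helper name load_api_key is my own and is not part of the OpenAI SDK or Oxlo.ai.

```python
import os
from typing import Mapping


def load_api_key(env: Mapping[str, str] = os.environ, name: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for this tutorial; `env` is injectable so it can be tested.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the engine.")
    return key
```

To use it, pass api_key=load_api_key() to OpenAI(...) instead of the os.environ.get call.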

Step 2: Define the model registry

An inference engine needs to know which backends are available. I keep a simple registry that maps task categories to Oxlo.ai model IDs. This is where the engine decides what weights to target.

MODEL_REGISTRY = {
    "coding": "qwen-3-32b",
    "reasoning": "llama-3.3-70b",
    "general": "deepseek-v3.2",
    "vision": "kimi-k2.6",
}
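A typo in a model ID only surfaces when a request is routed to it, so it can help to check the registry once at startup. The sketch below is a pure function that compares the registry against a list of known IDs. The name missing_models is hypothetical, and fetching the list through client.models.list() assumes the provider implements the OpenAI-style models endpoint, which I have not verified for Oxlo.ai.

```python
from typing import Iterable


def missing_models(registry: dict[str, str], available_ids: Iterable[str]) -> dict[str, str]:
    """Return the registry entries whose model ID is not in available_ids."""
    available = set(available_ids)
    return {cat: model for cat, model in registry.items() if model not in available}


# Hypothetical usage, assuming the provider serves the OpenAI-style /models endpoint:
# available = [m.id for m in client.models.list()]
# problems = missing_models(MODEL_REGISTRY, available)
# if problems:
#     raise SystemExit(f"Unknown model IDs in registry: {problems}")
```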

Step 3: Classify the task

Before routing, the engine needs to understand what the user is asking. I run a lightweight classification call against a fast model. This is the first stage of our inference pipeline. Here is the system prompt I use for the classifier.

CLASSIFIER_SYSTEM_PROMPT = """You are a routing layer inside an LLM inference engine.
Analyze the user's request and classify it into exactly one category: coding, reasoning, general, or vision.
Respond with only the category label and no punctuation."""

Now I wrap that in a function.

def classify_task(user_message: str) -> str:
    response = client.chat.completions.create(
        model="deepseek-v3.2",
        messages=[
            {"role": "system", "content": CLASSIFIER_SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        max_tokens=10,
    )
    # content can be None, and models sometimes add quotes or a trailing period
    raw = response.choices[0].message.content or ""
    category = raw.strip().strip("\"'.").lower()
    # Anything outside the registry falls back to "general"
    return category if category in MODEL_REGISTRY else "general"

Step 4: Execute inference with streaming

Once the engine knows the target model, it runs the real inference call. I enable streaming so the user sees tokens as they are generated, which is how production inference engines keep perceived latency low.

def run_inference(category: str, user_message: str):
    # Use the same fallback as classify_task so unknown categories go to "general"
    model = MODEL_REGISTRY.get(category, MODEL_REGISTRY["general"])

    SYSTEM_PROMPT = "You are a helpful assistant. Answer concisely and accurately."

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        stream=True,
    )

    print(f"\n[Engine routed to {model}]\n")
    for chunk in response:
        # Some chunks can arrive without choices; skip them
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            print(content, end="", flush=True)
    print("\n")
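The streaming loop in run_inference prints each delta and then discards it. If you also want the full reply afterwards, for logging or for tests, the same loop can accumulate the pieces. Below is a sketch that runs without the network. collect_stream and fake_chunk are hypothetical names, and the fake chunks only imitate the attribute path (choices[0].delta.content) that the loop above reads.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Concatenate the text deltas of a chat-completions stream into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        content = chunk.choices[0].delta.content
        if content:  # deltas without text have content=None
            parts.append(content)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a stream chunk."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


demo = [fake_chunk("Hel"), SimpleNamespace(choices=[]), fake_chunk(None), fake_chunk("lo")]
print(collect_stream(demo))  # prints "Hello"
```

In the real engine you would pass the response object from client.chat.completions.create(..., stream=True) instead of demo.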

Step 5: Wire the engine into a CLI

Finally, I connect the classifier and executor in a loop so the engine can handle continuous requests.

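def main():
    print("Tiny Inference Engine running. Type 'exit' to quit.")
    while True:
        try:
            user_input = input("\nPrompt> ").strip()
        except (EOFError, KeyboardInterrupt):
            break  # Ctrl-D or Ctrl-C ends the session cleanly
        if user_input.lower() in {"exit", "quit"}:
            break
        if not user_input:
            continue  # skip blank lines instead of spending a request on them

        category = classify_task(user_input)
        run_inference(category, user_input)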

if __name__ == "__main__":
    main()

Run it

Save the script as engine.py, export your key, and run it. Here is a sample session where the engine routes a coding question to Qwen 3 32B and a general question to DeepSeek V3.2.

$ export OXLO_API_KEY="sk-oxlo.ai-..."
$ python engine.py
Tiny Inference Engine running. Type 'exit' to quit.

Prompt> Write a Python function to merge two sorted lists

[Engine routed to qwen-3-32b]

def merge_sorted(a, b):
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] < b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged

Prompt> Explain the halting problem

[Engine routed to deepseek-v3.2]

The halting problem asks whether there exists a general algorithm that can determine, for any arbitrary program and input, whether the program will eventually halt or continue to run forever. Alan Turing proved that no such general algorithm can exist...

Wrap-up

You now have a working inference router that demonstrates task classification, model selection, and streaming execution. Two concrete next steps: add a JSON-mode schema validator to the classifier so it returns structured routing metadata, or cache the routing decision in a local dictionary to avoid repeated classification calls for identical prompts. If you want to run this in production without rewriting your stack, Oxlo.ai handles the underlying infrastructure, and flat per-request pricing makes the cost of multi-model routing predictable.
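Both next steps can be sketched in a few lines. Treat this as a starting point under stated assumptions. The JSON shape {"category": "..."} is one I picked for illustration, parse_routing_json and make_cached_classifier are hypothetical names, and the cache is unbounded, so a long-running process would want a size limit (functools.lru_cache is a standard-library alternative).

```python
import json
from typing import Callable

# Mirrors the keys of MODEL_REGISTRY in this tutorial.
VALID_CATEGORIES = {"coding", "reasoning", "general", "vision"}


def parse_routing_json(raw: str, default: str = "general") -> dict:
    """Validate a classifier reply of the assumed form {"category": "..."}."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"category": default}
    if not isinstance(data, dict):
        return {"category": default}
    category = str(data.get("category", "")).strip().lower()
    if category not in VALID_CATEGORIES:
        return {"category": default}
    return {"category": category}


def make_cached_classifier(classify: Callable[[str], str]) -> Callable[[str], str]:
    """Wrap a classifier so identical prompts are classified only once."""
    cache: dict[str, str] = {}

    def cached(prompt: str) -> str:
        key = prompt.strip()
        if key not in cache:
            cache[key] = classify(key)  # only reached on a cache miss
        return cache[key]

    return cached
```

In main you would build cached_classify = make_cached_classifier(classify_task) once before the loop and call cached_classify(user_input) inside it.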