Building Virtual Assistants with LLMs: A Step-by-Step Guide


Virtual assistants built on large language models have moved from demos to production systems. Whether you are automating customer support, scheduling, or internal tooling, the architecture is straightforward: speech or text enters, a model reasons, and actions execute through external tools. What is less obvious is how your choice of inference provider shapes cost and responsiveness once conversations grow long or agentic loops multiply. Oxlo.ai offers a developer-first inference platform with flat per-request pricing, which can make the cost of long-context assistant workloads more predictable than it is under token-based pricing.

Architecture of a Modern LLM Assistant

A typical assistant stack has five layers:

Input: Text, voice, or images. For voice, Oxlo.ai hosts Whisper Large v3, Turbo, and Medium for transcription via the audio/transcriptions endpoint.
Reasoning: The LLM core that interprets intent and plans actions. This is where Oxlo.ai fits.
Memory: Conversation history, user profiles, or retrieved documents.
Tools: External functions the model can call, such as calendar lookups or database queries.
Output: Text, structured JSON, or speech. Oxlo.ai offers Kokoro 82M text-to-speech through the audio/speech endpoint.
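
To make the five layers concrete, here is a minimal sketch that wires them together as plain Python callables. Everything in it is an assumption made for illustration: the Assistant class, the handle method, and the shape of the reply dictionary (a "tool" key with "args", or a "content" key) are hypothetical and are not part of any Oxlo.ai or OpenAI API. In a real system the transcribe, reason, and speak callables would wrap calls to the corresponding endpoints.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Assistant:
    """Hypothetical container for the five layers; each layer is a callable."""
    transcribe: Callable[[bytes], str]        # input layer: audio -> text
    reason: Callable[[list], dict]            # reasoning layer: messages -> reply
    tools: dict                               # tools layer: name -> callable
    speak: Callable[[str], bytes]             # output layer: text -> audio
    history: list = field(default_factory=list)  # memory layer

    def handle(self, audio: bytes) -> bytes:
        # Input: turn the incoming audio into a user message.
        text = self.transcribe(audio)
        self.history.append({"role": "user", "content": text})

        # Reasoning: ask the model what to do with the conversation so far.
        reply = self.reason(self.history)

        # Tools: if the reply names a tool, run it and ask the model again.
        if reply.get("tool"):
            result = self.tools[reply["tool"]](**reply.get("args", {}))
            self.history.append({"role": "tool", "content": result})
            reply = self.reason(self.history)

        # Memory and output: store the answer, then synthesize speech.
        self.history.append({"role": "assistant", "content": reply["content"]})
        return self.speak(reply["content"])
```

Keeping each layer behind a simple callable means you can swap a stub for a real endpoint, or one model for another, without touching the control flow.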

Selecting the Inference Backend

The backend determines latency, model choice, and pricing. Oxlo.ai provides 45+ open-source and proprietary models across 7 categories through an API that is fully compatible with the OpenAI SDK. There are no cold starts on popular models, so requests do not wait for a model to load before the response begins.

For general assistant tasks, Llama 3.3 70B is a strong default. If your users speak multiple languages or you need agent workflows, Qwen 3 32B is purpose-built for multilingual reasoning. Kimi K2.6 targets advanced reasoning and agentic coding with a 131K context window, and DeepSeek R1 671B MoE is available for deep reasoning or complex coding. GLM 5, a 744B MoE model, targets long-horizon agentic tasks, while Minimax M2.5 excels at coding and tool use. For low-cost experimentation, DeepSeek V3.2 is available on a free tier and handles coding and reasoning.

Because Oxlo.ai charges one flat cost per API request regardless of prompt length, assistant workloads with long system prompts, few-shot examples, or lengthy conversation history avoid the ballooning costs common with token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale.
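
The arithmetic behind that claim is worth seeing once. The sketch below assumes a simplified billing model: every turn adds a fixed number of tokens, the full history is resent on each request, and only input tokens are counted. The function names and all prices passed to them are hypothetical placeholders, not published rates from Oxlo.ai or any other provider.

```python
def cumulative_input_tokens(turns: int, tokens_per_turn: int,
                            system_tokens: int = 0) -> int:
    """Total input tokens sent when the full history is resent each turn."""
    # On turn k the request carries the system prompt plus k turns of history.
    return sum(system_tokens + k * tokens_per_turn for k in range(1, turns + 1))


def token_based_cost(turns: int, tokens_per_turn: int,
                     price_per_million_tokens: float,
                     system_tokens: int = 0) -> float:
    """Input cost under per-token pricing (price is a placeholder)."""
    total = cumulative_input_tokens(turns, tokens_per_turn, system_tokens)
    return total * price_per_million_tokens / 1_000_000


def flat_cost(turns: int, price_per_request: float) -> float:
    """Cost under flat per-request pricing (price is a placeholder)."""
    return turns * price_per_request


# Doubling the conversation length more than triples the input tokens,
# while the number of requests only doubles.
print(cumulative_input_tokens(20, 500, system_tokens=1000))  # 125000
print(cumulative_input_tokens(40, 500, system_tokens=1000))  # 450000
```

Under these assumptions the token-based total grows roughly with the square of the number of turns, while the flat total grows linearly. Which model is cheaper for a given workload still depends on the actual prices.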

Building the Core Chat Loop with Tool Use

Most assistants need to do more than chat. They must look up calendars, query databases, or trigger actions. Oxlo.ai supports function calling, streaming, JSON mode, and multi-turn conversations out of the box.

Below is a minimal Python example using the OpenAI SDK pointed at Oxlo.ai. It defines a single tool and lets the model decide whether to call it.

import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_calendar_events",
            "description": "Fetch calendar events for a date",
            "parameters": {
                "type": "object",
                "properties": {
                    "date": {"type": "string", "description": "ISO 8601 date"}
                },
                "required": ["date"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful assistant with access to a calendar."},
    {"role": "user", "content": "What meetings do I have tomorrow?"}
]

response = client.chat.completions.create(
    model="your-preferred-model",  # Llama 3.3 70B, Qwen 3 32B, Kimi K2.6, etc.
    messages=messages,
    tools=tools,
    tool_choice="auto",
    stream=False
)

print(response.choices[0].message)

If the model returns a tool call, your application executes the function, appends the result to messages, and sends a second request. Each request is one flat charge on Oxlo.ai, so the cost of an iterative agent loop scales with the number of requests rather than with the growing context.
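
A minimal sketch of that second step follows, assuming the OpenAI SDK response shape, in which message.tool_calls is either None or a list of calls carrying an id, a function name, and JSON-encoded arguments. The get_calendar_events body, the TOOL_REGISTRY dictionary, and the run_tool_calls helper are hypothetical names introduced for illustration; a real handler would query your calendar service.

```python
import json


def get_calendar_events(date: str) -> str:
    # Hypothetical stand-in for a real calendar lookup.
    return json.dumps({"date": date, "events": []})


# Maps the tool names declared in `tools` to local Python functions.
TOOL_REGISTRY = {"get_calendar_events": get_calendar_events}


def run_tool_calls(assistant_message, registry=TOOL_REGISTRY):
    """Execute each requested tool and return the matching 'tool' messages."""
    results = []
    for call in assistant_message.tool_calls or []:
        # Arguments arrive as a JSON string and must be decoded first.
        args = json.loads(call.function.arguments or "{}")
        handler = registry.get(call.function.name)
        if handler is None:
            content = json.dumps({"error": f"unknown tool {call.function.name}"})
        else:
            content = handler(**args)
        # tool_call_id ties each result back to the call that requested it.
        results.append(
            {"role": "tool", "tool_call_id": call.id, "content": content}
        )
    return results


# Usage with `client`, `messages`, `tools`, and `response` from the example
# above (needs network access and a valid key, so it is left commented out):
#
# assistant_message = response.choices[0].message
# if assistant_message.tool_calls:
#     messages.append(assistant_message)
#     messages.extend(run_tool_calls(assistant_message))
#     followup = client.chat.completions.create(
#         model="your-preferred-model", messages=messages, tools=tools
#     )
#     print(followup.choices[0].message.content)
```

Returning an error payload for an unknown tool name, rather than raising, lets the model see the failure and recover on the next request.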

Managing Memory and Context Windows

An assistant without memory loses the thread of a conversation after a single turn. You can either keep the conversation history in the context window or summarize it externally and inject the summary back into the prompt. Oxlo.ai hosts models that accommodate either strategy. Kimi K2.6 provides a 131K context window, and DeepSeek V4 Flash supports 1M context, letting you keep extensive history in a single request. Because Oxlo.ai does not scale cost with input length, filling the context window with prior turns, documentation, or user profiles does not increase the per-request