Building an intelligent virtual assistant requires more than wrapping a chat interface around a large language model. A production system needs structured reasoning, reliable tool use, persistent memory, and often multimodal input such as voice. In this guide, we will walk through a practical architecture, implementation patterns for function calling and memory, and deployment considerations that keep latency low and costs predictable.
Architecture of an LLM-Powered Virtual Assistant
A robust assistant typically follows a loop: ingest user input, enrich it with context from memory, reason over available tools, execute actions, and synthesize a response. For voice-enabled assistants, you also need speech-to-text at the ingress and text-to-speech at the egress.
Oxlo.ai provides a unified API surface for this entire pipeline. You can route speech through audio/transcriptions (Whisper Large v3, Turbo, or Medium), run reasoning and tool selection through any of the 45+ chat models, and generate voice output via audio/speech (Kokoro 82M). Because the platform is fully OpenAI SDK compatible, you can address all three stages with the same client configuration.
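The loop described above can be sketched as a small class with one pluggable callable per stage. This is only an illustration of the control flow, not a prescribed design: the `AssistantLoop` name, its fields, and the stub lambdas in the usage lines are all hypothetical, and a real system would back each stage with model and API calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AssistantLoop:
    """One pass of the loop: ingest -> enrich -> reason -> act -> synthesize."""

    recall: Callable[[str], list[str]]              # memory lookup for context
    reason: Callable[[str, list[str]], list[dict]]  # returns planned tool calls
    tools: dict[str, Callable[..., object]]         # tool name -> implementation
    synthesize: Callable[[str, list[object]], str]  # builds the final reply

    def run(self, user_input: str) -> str:
        context = self.recall(user_input)
        plan = self.reason(user_input, context)
        # Each planned step names a tool and its keyword arguments.
        results = [self.tools[step["tool"]](**step["args"]) for step in plan]
        return self.synthesize(user_input, results)


# Stub stages stand in for real memory, model, and tool backends.
loop = AssistantLoop(
    recall=lambda text: ["user prefers celsius"],
    reason=lambda text, ctx: [{"tool": "get_weather", "args": {"location": "Berlin"}}],
    tools={"get_weather": lambda location: f"22 C and sunny in {location}"},
    synthesize=lambda text, results: " ".join(str(r) for r in results),
)
reply = loop.run("What is the weather in Berlin?")
```

Keeping each stage behind a plain callable makes it possible to test the orchestration without network access and to swap a stage later without touching the others.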
Selecting the Right Model
Not every assistant needs the largest model. Match the model to the task:
- General purpose and reliable instruction following: Llama 3.3 70B is a strong flagship choice for open-domain assistants.
- Multilingual agents and complex workflows: Qwen 3 32B excels at multilingual reasoning and agentic execution.
- Deep reasoning and coding tasks: DeepSeek R1 671B MoE or DeepSeek V3.2 handle complex logic, while Kimi K2.6 offers advanced agentic coding, vision, and a 131K context window.
- Long-horizon agentic tasks: GLM 5, a 744B MoE, is built for extended planning and tool orchestration.
Because Oxlo.ai exposes all of these through a single endpoint with OpenAI SDK compatibility, you can prototype with one model and swap to another by changing a single string in your client configuration.
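One way to keep that swap to a single string is a small lookup table from task type to model identifier. The sketch below is an assumption-heavy illustration: only `llama-3.3-70b` appears in this guide, the other identifiers are placeholders that would need to be replaced with the IDs from the provider's model list, and `pick_model` is a hypothetical helper.

```python
# Task -> model identifier. Only "llama-3.3-70b" is taken from this guide;
# the remaining IDs are assumed placeholders, not confirmed identifiers.
MODEL_BY_TASK = {
    "general": "llama-3.3-70b",
    "multilingual": "qwen-3-32b",   # assumed ID
    "reasoning": "deepseek-r1",     # assumed ID
    "long_horizon": "glm-5",        # assumed ID
}


def pick_model(task: str, default: str = "general") -> str:
    """Return the model for a task, falling back to the default task's model."""
    return MODEL_BY_TASK.get(task, MODEL_BY_TASK[default])
```

The result can be passed directly as the `model` argument of a chat completion request, so changing a model for a whole class of requests means editing one table entry.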
Implementing Tool Use and Function Calling
The defining feature of an assistant is its ability to act. Function calling lets the model emit structured JSON to invoke external APIs, query databases, or control hardware. Oxlo.ai supports function calling and JSON mode across its major chat models.
Below is a minimal Python example using the OpenAI SDK pointed at Oxlo.ai. The assistant can check a weather endpoint and search a knowledge base.
import json

import openai

# The OpenAI SDK pointed at Oxlo.ai's OpenAI-compatible endpoint.
client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

# JSON Schema descriptions of the tools the model is allowed to call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["location"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_docs",
            "description": "Search internal documentation",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                },
                "required": ["query"],
            },
        },
    },
]


def handle_tool_call(tool_call):
    """Run the requested tool locally and return a JSON-serializable result."""
    name = tool_call.function.name
    # Arguments arrive as a JSON string, e.g. '{"location": "Berlin"}'.
    args = json.loads(tool_call.function.arguments)
    if name == "get_weather":
        # Replace with a real API call that uses args["location"]
        return {"temperature": 22, "unit": "celsius", "condition": "sunny"}
    elif name == "search_docs":
        # Replace with real retrieval logic that uses args["query"]
        return {"results": ["Doc A", "Doc B"]}
    return {"error": f"unknown tool: {name}"}


messages = [
    {"role": "system", "content": "You are a helpful assistant with access to weather and documentation."},
    {"role": "user", "content": "What is the weather in Berlin and do we have docs on SSO?"},
]

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)
message = response.choices[0].message

if message.tool_calls:
    # Execute tool calls and send results back
    messages.append(message)
    for tc in message.tool_calls:
        result = handle_tool_call(tc)
        messages.append({
            "role": "tool",
            "tool_call_id": tc.id,
            "name": tc.function.name,
            "content": json.dumps(result),
        })
    final = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=messages,
        tools=tools,
    )
    print(final.choices[0].message.content)
else:
    # The model answered directly without calling a tool.
    print(message.content)
This pattern keeps the model in control of orchestration while your code remains the authority over external state. For stricter output shapes, enable JSON mode on models that support it.
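Keeping your code as the authority also means treating model-supplied arguments as untrusted input. The following is a minimal sketch of a dispatch table that rejects unknown tool names and malformed arguments instead of raising. The `TOOL_REGISTRY`, `tool`, and `dispatch` names are hypothetical, and the weather data is a placeholder.

```python
import json
from typing import Callable

# Maps tool names (as advertised to the model) to local implementations.
TOOL_REGISTRY: dict[str, Callable[..., dict]] = {}


def tool(fn: Callable[..., dict]) -> Callable[..., dict]:
    """Decorator that registers a function under its own name."""
    TOOL_REGISTRY[fn.__name__] = fn
    return fn


@tool
def get_weather(location: str, unit: str = "celsius") -> dict:
    # Placeholder data; a real implementation would call a weather API.
    return {"location": location, "temperature": 22, "unit": unit}


def dispatch(name: str, raw_arguments: str) -> dict:
    """Run a tool by name, returning an error dict instead of raising."""
    fn = TOOL_REGISTRY.get(name)
    if fn is None:
        return {"error": f"unknown tool: {name}"}
    try:
        args = json.loads(raw_arguments or "{}")
        if not isinstance(args, dict):
            raise TypeError("arguments must be a JSON object")
        return fn(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        # Covers malformed JSON and missing or unexpected parameters.
        return {"error": f"invalid arguments for {name}: {exc}"}
```

Returning the error as the tool result, rather than crashing, gives the model a chance to correct its call on the next turn.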
Managing Memory and Conversation State
Stateless APIs require you to manage conversation history yourself. For short sessions, pass the full message list on every request. As conversations grow toward the model's context limit, two strategies help keep response quality from degrading.
Conversation summarization. When the context exceeds a threshold, summarize the oldest turns into a single system message. This keeps the semantic intent without linear token growth.
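A minimal sketch of that compaction step follows. It makes several simplifying assumptions: the threshold is a message count rather than a token count, messages are plain dicts containing only system, user, and assistant turns (tool-call messages would need to be kept together with their results), and `summarize` is a hypothetical callable that would wrap a model request in practice.

```python
from typing import Callable

Message = dict[str, str]


def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    max_messages: int = 12,
    keep_recent: int = 6,
) -> list[Message]:
    """Replace the oldest dialogue turns with a single summary system message."""
    if len(messages) <= max_messages:
        return messages
    system = [m for m in messages if m["role"] == "system"]
    dialogue = [m for m in messages if m["role"] != "system"]
    old, recent = dialogue[:-keep_recent], dialogue[-keep_recent:]
    if not old:
        return messages
    summary = {
        "role": "system",
        "content": "Summary of earlier conversation: " + summarize(old),
    }
    # Original system prompts first, then the summary, then the untouched tail.
    return system + [summary] + recent
```

Calling `compact_history` before each request keeps the message list bounded while the most recent turns stay verbatim.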
Retrieval-augmented memory. Convert user facts, preferences, and prior decisions into embeddings and store them in a vector database. At inference time, retrieve the top-k relevant chunks and inject them into the system prompt. Oxlo.ai offers dedicated embedding endpoints through BGE-Large and E5-Large, so you can generate vectors without managing a separate provider.
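The retrieval step can be illustrated with an in-memory store and cosine similarity. This is a sketch rather than a production design: `MemoryStore` and `toy_embed` are hypothetical, the toy embedder only counts keywords so the example runs offline, and a real deployment would use a vector database plus an embedding model.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class MemoryStore:
    """Keeps (text, vector) pairs in memory and ranks them against a query."""

    def __init__(self, embed: Callable[[str], list[float]]):
        # In production, embed could wrap client.embeddings.create(...)
        # with an embedding model ID from the provider's list (assumed).
        self.embed = embed
        self.items: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.items.append((text, self.embed(text)))

    def top_k(self, query: str, k: int = 3) -> list[str]:
        q = self.embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


# Toy keyword-count embedder so the example runs without a network call.
VOCAB = ["weather", "sso", "billing"]


def toy_embed(text: str) -> list[float]:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


store = MemoryStore(toy_embed)
store.add("User asked about sso setup")
store.add("User prefers weather in celsius")
store.add("The billing contact is the finance team")
relevant = store.top_k("how do I configure sso", k=1)
```

The retrieved strings would then be joined and injected into the system prompt before the chat request is sent.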
For assistants with long system prompts, extensive few-shot examples, or large retrieved contexts, context length can inflate quickly. Oxlo.ai uses request-based pricing, meaning one flat cost per API request regardless of how much text you send. This makes long-context memory strategies economically viable in production.
Adding Voice Capabilities
Voice expands an assistant from a text widget into a hands-free interface. The pipeline is straightforward: transcribe user audio, process the text through your reasoning layer, and synthesize the response.
For speech-to-text, Oxlo.ai offers Whisper Large v3, Turbo, and Medium through the audio/transcriptions endpoint. For text-to-speech, Kokoro 82M is available via audio/speech. Both follow the same OpenAI SDK patterns, so integrating voice is a matter of adding two extra client calls.
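# Reuses the client configured in the function-calling example.
# The context manager closes the audio file once the upload finishes.
with open("user_prompt.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",
        file=audio_file,
    )

# Pass transcript.text into your assistant pipeline,
# then synthesize the final response via audio/speech.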
Because there are no cold starts on popular models, the round-trip from speech input to speech output remains consistent even under variable load.
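The full round trip, including synthesis, might look like the sketch below. It assumes the audio/speech endpoint behaves like the OpenAI SDK's `client.audio.speech.create`, whose response exposes the audio bytes through `read()`. The `voice_round_trip` helper is hypothetical, and the text-to-speech model ID and voice name are left as required arguments because this guide does not specify them.

```python
from pathlib import Path
from typing import Callable


def voice_round_trip(
    client,
    audio_path: str,
    out_path: str,
    answer: Callable[[str], str],
    tts_model: str,
    voice: str,
    stt_model: str = "whisper-large-v3-turbo",
) -> str:
    """Transcribe audio, get a text reply, synthesize it, and save the audio."""
    with open(audio_path, "rb") as audio_file:
        transcript = client.audio.transcriptions.create(
            model=stt_model,
            file=audio_file,
        )
    # `answer` stands in for the reasoning layer (chat request plus tools).
    reply_text = answer(transcript.text)
    speech = client.audio.speech.create(
        model=tts_model,  # caller-supplied; ID not given in this guide
        voice=voice,      # caller-supplied; available voices are an assumption
        input=reply_text,
    )
    Path(out_path).write_bytes(speech.read())
    return reply_text
```

Passing the client in as a parameter keeps the helper testable with a stub object and avoids binding it to one configuration.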
Deployment and Cost Optimization
Virtual assistants often suffer from unpredictable inference costs. Token-based billing means every word in your system prompt, every turn of conversation, and every retrieved document adds to the bill. Because each request resends the accumulated history, an agent that maintains a long context window or iterates over multiple tool calls pays for the same tokens again on every call, so cost grows faster than linearly with the number of turns.
Oxlo.ai departs from token-based billing with flat per-request pricing. Whether you send a one-line greeting or a 50,000-token payload with full documentation and conversation history, the cost is the same per API call. This predictability is useful for:
- Assistants with verbose system prompts and safety guardrails.
- Multi-turn support sessions where context grows over time.
- Agentic workflows that pack large tool schemas and few-shot examples into every request.
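A quick calculation shows why context-heavy sessions behave so differently under the two billing models. The sketch below counts input tokens only and assumes every turn adds a fixed number of tokens to the history. The function name and the example figures are illustrative, not measurements or prices.

```python
def cumulative_input_tokens(system_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """Total input tokens sent over a session when each request resends the full history."""
    total = 0
    for turn in range(1, turns + 1):
        # Request number `turn` carries the system prompt plus all turns so far.
        total += system_tokens + tokens_per_turn * turn
    return total


# Illustrative figures: a 1,000-token system prompt and 200 tokens added per turn.
ten_turns = cumulative_input_tokens(1000, 200, 10)     # 21,000 tokens over 10 requests
twenty_turns = cumulative_input_tokens(1000, 200, 20)  # 62,000 tokens over 20 requests
```

Doubling the session length roughly triples the tokens sent in this example, while the number of requests only doubles. Under token-based billing the bill follows the first figure; under per-request billing it follows the second.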
Pricing details are available at https://oxlo.ai/pricing. The platform also offers a free tier with 60 requests per day and access to 16+ models, which is sufficient for early prototyping.
Putting It All Together
An intelligent virtual assistant is a system of interlocking components: transcription, reasoning, tool execution, memory, and synthesis. Oxlo.ai consolidates the infrastructure for these components behind a single, OpenAI-compatible API with request-based pricing. If you are building assistants that rely on long context, frequent tool calls, or extended conversations, the flat per-request model removes the cost penalty typically associated with rich context windows. You can prototype on the free tier, scale through Pro or Premium plans, and move to dedicated Enterprise infrastructure when needed.