Virtual assistants live or die by their ability to parse ambiguous user input into structured intent and entities. Large language models have replaced rigid NLU pipelines because they handle implicit context, coreference, and varied phrasing in a single model call. This article shows how to build a robust language understanding layer for assistants using modern LLMs, with concrete patterns you can deploy today.
Decomposing Language Understanding for Assistants
A production assistant typically needs four things from its language understanding layer: intent classification, slot filling, entity resolution, and dialog state tracking. Instead of maintaining separate models for each task, a capable LLM can perform all four in one shot when prompted with a clear schema.
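As a rough sketch of what a clear schema might look like, the four outputs can be gathered into one typed record. The class and field names here (`NLUResult`, `resolved_entities`, `dialog_state`) are illustrative assumptions, not part of any Oxlo.ai or OpenAI API.

```python
from dataclasses import dataclass, field
from typing import Optional


# Hypothetical container for the four outputs named above.
@dataclass
class NLUResult:
    intent: str                                  # intent classification
    slots: dict[str, Optional[str]]              # slot filling (None = not provided)
    resolved_entities: dict[str, str] = field(default_factory=dict)       # entity resolution
    dialog_state: dict[str, Optional[str]] = field(default_factory=dict)  # dialog state tracking

    def missing_slots(self) -> list[str]:
        """Slots the assistant still has to ask the user for."""
        return [name for name, value in self.slots.items() if value is None]


result = NLUResult(
    intent="book_flight",
    slots={"origin": "Austin", "destination": "Tokyo", "date": None, "cabin_class": "business"},
)
print(result.missing_slots())  # ['date']
```

Keeping all four outputs in one record makes it easier to ask the model for them in a single response and to hand the result to the dialog policy as one object.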
The choice of model depends on your latency and complexity requirements. For fast intent classification and slot extraction, Llama 3.3 70B offers a strong balance of speed and accuracy. If your assistant handles multilingual users, Qwen 3 32B is purpose-built for multilingual reasoning and agent workflows. When users ask ambiguous questions that require deep reasoning to disambiguate, DeepSeek R1 671B MoE or Kimi K2.6 provide advanced chain-of-thought reasoning without external orchestration.
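One way to encode this guidance is a small routing helper. The sketch below assumes two hypothetical flags, `multilingual` and `needs_deep_reasoning`, that your application sets upstream. The model identifiers are the ones used in this article's code samples and may need adjusting to whatever your account exposes.

```python
def pick_model(multilingual: bool = False, needs_deep_reasoning: bool = False) -> str:
    """Map coarse request traits to a model identifier (illustrative routing only)."""
    if needs_deep_reasoning:
        # Ambiguous requests that need step-by-step disambiguation.
        return "deepseek-r1-671b"
    if multilingual:
        # Non-English or mixed-language users.
        return "qwen3-32b"
    # Default: fast intent classification and slot extraction.
    return "llama-3.3-70b"


print(pick_model())                    # llama-3.3-70b
print(pick_model(multilingual=True))   # qwen3-32b
```

Reasoning takes priority over the multilingual flag here. That ordering is a design choice for the sketch, not a recommendation from the model vendors.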
Prompt Engineering for Intent and Slots
The simplest approach is a system prompt that defines the available intents, slot types, and output rules, followed by the raw user utterance. Keep the system prompt static so you can reuse it efficiently across requests.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

system_prompt = """You are the NLU engine for a travel assistant.
Classify the user intent into one of: [book_flight, cancel_booking, check_status].
Extract slots: origin, destination, date, cabin_class.
Respond in JSON. If a slot is missing, set its value to null."""

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "I need to fly from Austin to Tokyo next Monday in business."}
    ],
    response_format={"type": "json_object"}
)
print(response.choices[0].message.content)
This pattern works with any Oxlo.ai chat model. Because Oxlo.ai is fully OpenAI SDK compatible, you can drop this into an existing assistant backend without changing your HTTP client.
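The printed content is still a string, so the backend has to parse and check it before acting. The helper below is a minimal sketch of that step. It assumes the model returns an `intent` key and either a nested `slots` object or top-level slot keys, since the prompt above does not fix the exact key layout. `parse_nlu_output` is a hypothetical name.

```python
import json

ALLOWED_INTENTS = {"book_flight", "cancel_booking", "check_status"}
SLOT_NAMES = ("origin", "destination", "date", "cabin_class")


def parse_nlu_output(raw: str) -> dict:
    """Parse the model's JSON string and normalize it to {'intent', 'slots'}."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    intent = data.get("intent")
    if intent not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent: {intent!r}")
    # Assumption: slots may be nested under "slots" or sit at the top level.
    slots_in = data.get("slots", data)
    if not isinstance(slots_in, dict):
        raise ValueError("slots must be a JSON object")
    # Missing slots become None, mirroring the prompt's null rule.
    slots = {name: slots_in.get(name) for name in SLOT_NAMES}
    return {"intent": intent, "slots": slots}


sample = '{"intent": "book_flight", "slots": {"origin": "Austin", "destination": "Tokyo"}}'
parsed = parse_nlu_output(sample)
print(parsed["slots"]["date"])  # None
```

In the request flow you would pass `response.choices[0].message.content` as `raw` and treat a `ValueError` as a signal to retry or fall back to a clarifying question.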
Structured Output and JSON Mode
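Assistants cannot act on free text. They need structured data. Oxlo.ai supports JSON mode and function calling. JSON mode keeps the output syntactically valid JSON, while function calling lets you describe the expected arguments as a JSON Schema. Neither guarantees that every field you need is present, so for stricter adherence include a partial JSON skeleton in the system prompt, explicitly list required fields, and validate the result before acting on it.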
If your dialog policy expects a fixed set of slots, combine JSON mode with a required field list in the prompt. Models like DeepSeek V3.2 and MiniMax M2.5 tend to follow coding and reasoning instructions closely, which usually translates to better schema adherence during slot extraction.
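A minimal sketch of the skeleton-plus-required-fields idea follows. `build_schema_prompt` and `missing_fields` are hypothetical helpers, and the flat field layout is an assumption, so adapt it to whatever shape your dialog policy expects.

```python
import json

REQUIRED_FIELDS = ["intent", "origin", "destination", "date", "cabin_class"]


def build_schema_prompt(base_prompt: str, required: list[str]) -> str:
    """Append a JSON skeleton and an explicit required-field list to a system prompt."""
    skeleton = json.dumps({name: None for name in required}, indent=2)
    return (
        f"{base_prompt}\n"
        f"Required fields: {', '.join(required)}.\n"
        "Return exactly this JSON shape, replacing null wherever a value is known:\n"
        f"{skeleton}"
    )


def missing_fields(payload: dict, required: list[str]) -> list[str]:
    """Names of required keys the model left out entirely (null values still count as present)."""
    return [name for name in required if name not in payload]


prompt = build_schema_prompt("You are the NLU engine for a travel assistant.", REQUIRED_FIELDS)
print(missing_fields({"intent": "book_flight", "origin": "Austin"}, REQUIRED_FIELDS))
# ['destination', 'date', 'cabin_class']
```

Generating the skeleton from the same list you validate against keeps the prompt and the check from drifting apart.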
Managing Multi-Turn Context and State
Conversational assistants accumulate history. A user might say, "Book a flight to Seattle," then follow up with, "Make it refundable," referring back to the previous intent. Carrying full conversation history in every request is the easiest way to handle coreference, but it inflates token counts on token-based providers.
Oxlo.ai uses request-based pricing, so the cost per turn stays flat regardless of how much context you include. This means you can pass the full message array on every turn without engineering complex summarization layers just to save money. Popular models on Oxlo.ai have no cold starts, so the first request after an idle period is not penalized. Longer prompts still take longer to process, and the history must fit within the model's context window.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Book a flight to Seattle."},
    {"role": "assistant", "content": '{"intent": "book_flight", "slots": {"destination": "Seattle"}}'},
    {"role": "user", "content": "Make it refundable."}
]

response = client.chat.completions.create(
    model="qwen3-32b",
    messages=messages,
    response_format={"type": "json_object"}
)
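The model's answer to the follow-up usually covers only what changed, so the backend still has to fold it into the running dialog state. A minimal sketch of that merge, assuming each turn is parsed into a dict with `intent` and `slots` keys and that `refundable` is a slot your schema defines (it is not in the travel prompt shown earlier):

```python
def merge_turn(state: dict, turn: dict) -> dict:
    """Fold one parsed NLU result into the accumulated dialog state."""
    merged_slots = dict(state.get("slots", {}))
    for name, value in turn.get("slots", {}).items():
        # A null slot means "not mentioned this turn", so keep the earlier value.
        if value is not None:
            merged_slots[name] = value
    return {
        "intent": turn.get("intent") or state.get("intent"),
        "slots": merged_slots,
    }


state = {"intent": "book_flight", "slots": {"destination": "Seattle"}}
turn = {"intent": "book_flight", "slots": {"destination": None, "refundable": True}}
print(merge_turn(state, turn))
# {'intent': 'book_flight', 'slots': {'destination': 'Seattle', 'refundable': True}}
```

Treating null as "unchanged" means a user cannot clear a slot this way. If you need that, reserve an explicit sentinel value for it.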
Tool Use and Function Calling
Some slots cannot be filled from text alone. If a user says, "Reorder my usual," the assistant must call a user profile API to resolve "usual" into a concrete product ID. Oxlo.ai supports function calling on its chat endpoints, letting the model decide when to invoke external tools.
tools = [{
    "type": "function",
    "function": {
        "name": "get_user_last_order",
        "description": "Returns the user's most recent order items.",
        "parameters": {
            "type": "object",
            "properties": {},
            "required": []
        }
    }
}]

response = client.chat.completions.create(
    model="deepseek-r1-671b",
    messages=[{"role": "user", "content": "Reorder my usual."}],
    tools=tools,
    tool_choice="auto"
)

if response.choices[0].message.tool_calls:
    # Execute the tool, append the result, and call the model again
    pass
Using tool use for entity resolution keeps your prompts shorter and your business logic outside the model.
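The "execute the tool, append the result" step can be sketched as a small dispatcher that turns one tool call into a `tool` role message in the OpenAI chat format. `get_user_last_order` below is a hypothetical stub returning made-up data, standing in for a real profile API. `TOOL_REGISTRY` and `tool_result_message` are also illustrative names.

```python
import json


def get_user_last_order() -> dict:
    # Hypothetical stand-in for a real profile/order API call; the data is made up.
    return {"items": [{"product_id": "SKU-123", "quantity": 1}]}


# Maps tool names advertised to the model onto local Python callables.
TOOL_REGISTRY = {"get_user_last_order": get_user_last_order}


def tool_result_message(tool_call_id: str, name: str, arguments_json: str) -> dict:
    """Run the named tool and wrap its result as a 'tool' message for the next request."""
    if name not in TOOL_REGISTRY:
        raise KeyError(f"model requested unknown tool: {name}")
    kwargs = json.loads(arguments_json or "{}")
    result = TOOL_REGISTRY[name](**kwargs)
    return {"role": "tool", "tool_call_id": tool_call_id, "content": json.dumps(result)}


msg = tool_result_message("call_1", "get_user_last_order", "{}")
print(msg["role"], msg["tool_call_id"])  # tool call_1
```

With the OpenAI SDK you would append the assistant message containing the tool calls to the message list, then one such message per call (using `call.id`, `call.function.name`, and `call.function.arguments`), and send the list back to `client.chat.completions.create`. Whether a given Oxlo.ai model follows this flow exactly is an assumption worth verifying.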
Speech and Vision Inputs
Modern assistants are multimodal. Oxlo.ai offers Whisper Large v3, Whisper Turbo, and Whisper Medium for audio transcription, plus Kokoro 82M for text-to-speech. You can transcribe user speech first, then feed the text into your NLU pipeline.
# Use a context manager so the file handle is closed after the upload
with open("user_command.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file
    )
# Feed transcription.text into the chat completions pipeline
For assistants that accept image uploads, vision models such as Kimi VL A3B and Gemma 3 27B can parse screenshots or photos to extract entities. A user could photograph a receipt and ask, "Add these expenses to my report," and the vision model can extract line items as structured slots.
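In the OpenAI chat format, an image travels as an `image_url` content part next to the text. The sketch below builds such a message from raw bytes using a base64 data URL. It assumes the vision model you pick accepts this OpenAI-style payload, which is worth checking against the provider's documentation. `image_message` is a hypothetical helper.

```python
import base64


def image_message(image_bytes: bytes, question: str, mime: str = "image/jpeg") -> dict:
    """Build a user message that pairs a text question with an inline image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # Assumption: the endpoint accepts base64 data URLs for images.
            {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
        ],
    }


# Placeholder bytes; in practice read the uploaded receipt photo from disk or the request body.
receipt_msg = image_message(b"abc", "Add these expenses to my report.")
print(receipt_msg["content"][1]["image_url"]["url"])  # data:image/jpeg;base64,YWJj
```

The resulting dict goes into the `messages` list in place of a plain-text user message, alongside the same system prompt and JSON mode settings used for text.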
Why Request-Based Pricing Matters for Assistants
Virtual assistants are inherently long-context workloads. Every turn appends messages to the conversation history, and system prompts for NLU often include lengthy schemas and examples. On token-based providers, costs scale linearly with that growth.
Oxlo.ai charges one flat cost per API request. Unlike token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale, your bill does not grow as you add more context turns or longer system prompts. For assistant workloads, this request-based model can be significantly cheaper. See the exact rates on the Oxlo.ai pricing page.
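To see how the two billing models diverge, here is a back-of-the-envelope sketch. It assumes the full history is resent every turn, that each turn adds a fixed number of tokens, and it ignores output tokens. The function names are illustrative, and any prices you pass in are your own inputs, not rates quoted by Oxlo.ai or any other provider.

```python
def total_input_tokens(turns: int, system_tokens: int, tokens_per_turn: int) -> int:
    """Input tokens across a conversation when the whole history is resent each turn."""
    # Turn k sends the system prompt plus k turns' worth of accumulated messages.
    return sum(system_tokens + k * tokens_per_turn for k in range(1, turns + 1))


def token_based_cost(turns: int, system_tokens: int, tokens_per_turn: int,
                     price_per_1k_tokens: float) -> float:
    """Conversation cost under per-token billing (input tokens only)."""
    return total_input_tokens(turns, system_tokens, tokens_per_turn) / 1000 * price_per_1k_tokens


def request_based_cost(turns: int, price_per_request: float) -> float:
    """Conversation cost under flat per-request billing."""
    return turns * price_per_request


print(total_input_tokens(10, 500, 100))  # 10500
print(total_input_tokens(20, 500, 100))  # 31000
```

Under these assumptions, doubling a conversation from 10 to 20 turns roughly triples the billed input tokens, while the request-based total only doubles. Plug in real rates from each provider's pricing page before drawing conclusions for your workload.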
Getting Started with Oxlo.ai
Oxlo.ai exposes an OpenAI-compatible API, so the OpenAI SDK works against it as-is. Set base_url to https://api.oxlo.ai/v1, swap in your Oxlo.ai API key and model names, and bring the rest of your existing assistant code without modifications. The free tier includes 60 requests per day across 16+ models, with a 7-day full-access trial so you can test long-context behavior before committing.
For production assistants, the Pro and Premium plans offer 1,000 and 5,000 requests per day respectively, with priority queue access on Premium. If you are migrating from a token-based provider, the Enterprise plan guarantees 30% savings over your current bill with dedicated GPU options.
Start by testing the prompt patterns above against Llama 3.3 70B or Qwen 3 32B on the free tier, then scale to reasoning models like Kimi K2.6 or DeepSeek V4 Flash as your dialog complexity grows.