Customer service teams are replacing rigid decision trees with large language models that can parse intent, retrieve policy documents, and execute refunds in a single conversation. Building this class of virtual assistant requires more than a prompt and a chat endpoint. You need dependable tool use, reliable multi-turn memory, and a cost structure that does not punish long support transcripts. Oxlo.ai provides an inference platform built for exactly these workloads: flat per-request pricing, full OpenAI SDK compatibility, and a broad model catalog including long-context and agentic options.
Architecture of a Customer Service LLM Assistant
A production virtual assistant usually breaks down into four layers. First, an ingestion pipeline converts knowledge bases, FAQs, and past tickets into vector embeddings. Second, a retrieval layer surfaces the top-k relevant documents for each user query. Third, an orchestration layer maintains conversation state, handles function calling, and decides whether to escalate to a human. Fourth, the inference layer is where the model generates the response.
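The retrieval layer is the easiest of the four to sketch in isolation. The snippet below is a minimal illustration, assuming the embeddings have already been computed and are held in memory as plain lists of floats. The names `cosine_similarity` and `top_k` are illustrative, and a production system would more likely delegate this step to a vector database.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]
```

The returned indices point back into whatever list holds the original document text, which the orchestration layer can then place in the prompt.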
Because customer conversations can span dozens of turns and include lengthy CRM transcripts, the inference layer must support large context windows without exploding costs. On token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, and Anyscale, every additional line of conversation increases the bill, which discourages the rich context that actually makes an assistant accurate. Oxlo.ai uses request-based pricing, so the cost stays flat whether you send a one-line greeting or a full ticket history with ten prior turns. This matters for agentic workflows where the model may loop through tool calls and accumulate context automatically.
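The arithmetic behind this comparison can be sketched as follows. This is a simplified model, not a pricing calculator: it assumes every request resends the whole conversation so far, it ignores output tokens, and any prices you pass in are placeholders you supply rather than real rates from any provider.

```python
def token_based_cost(turn_tokens, price_per_1k_tokens):
    """Total input cost when each turn resends the whole history.

    turn_tokens[i] is the number of tokens added at turn i. Request i
    therefore sends the sum of all turns up to and including i.
    """
    total = 0.0
    history = 0
    for added in turn_tokens:
        history += added
        total += history / 1000 * price_per_1k_tokens
    return total

def request_based_cost(num_requests, price_per_request):
    """Total cost when every request costs the same, whatever its size."""
    return num_requests * price_per_request
```

Under this model, three turns of 1,000 tokens each are billed as 1,000 + 2,000 + 3,000 input tokens on a token-based plan, while a request-based plan bills three requests.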
Implementation with the OpenAI SDK and Oxlo.ai
Oxlo.ai is fully OpenAI SDK compatible, so you can point your existing client at https://api.oxlo.ai/v1 and keep your current tooling. Below is a minimal example that defines two tools, get_order_status and initiate_refund, and lets the model decide when to call them.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("OXLO_API_KEY"),
    base_url="https://api.oxlo.ai/v1"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Retrieve the current status of a customer order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "initiate_refund",
            "description": "Start a refund process for an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "reason": {"type": "string"}
                },
                "required": ["order_id", "reason"]
            }
        }
    }
]

def run_assistant(conversation_history):
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=conversation_history,
        tools=tools,
        tool_choice="auto",
        temperature=0.2
    )
    return response.choices[0].message

# Example conversation
history = [
    {"role": "system", "content": "You are a helpful customer support agent. Use the available tools to assist users."},
    {"role": "user", "content": "I need a refund for order ORD-9281. It arrived damaged."}
]

msg = run_assistant(history)
if msg.tool_calls:
    for call in msg.tool_calls:
        print(f"Tool called: {call.function.name} with args {call.function.arguments}")
        # Execute business logic, append result to history, and re-run
else:
    print(msg.content)
The loop is straightforward: if the model returns a tool_calls payload, execute your business logic, append the result to history, and call the chat endpoint again. Each round trip is a separate request, so under Oxlo.ai's per-request pricing the cost of an agentic loop depends on how many times you call the endpoint, not on how large the conversation history has grown.
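A fuller version of that loop might look like the sketch below. The function `run_tool_loop`, the `handlers` mapping, and the `max_rounds` cap are illustrative names rather than part of any SDK. The client is passed in as a parameter, and the message shapes follow the OpenAI chat completions format for assistant tool calls and `tool` role replies.

```python
import json

def run_tool_loop(client, history, tools, handlers,
                  model="llama-3.3-70b", max_rounds=5):
    """Call the chat endpoint until the model stops requesting tools.

    handlers maps a tool name to a Python callable holding your
    (hypothetical) business logic. Returns the final assistant text,
    or None if max_rounds is reached first.
    """
    for _ in range(max_rounds):
        response = client.chat.completions.create(
            model=model, messages=history, tools=tools, tool_choice="auto"
        )
        msg = response.choices[0].message
        if not msg.tool_calls:
            return msg.content
        # Record the assistant turn that requested the tools.
        history.append({
            "role": "assistant",
            "content": msg.content,
            "tool_calls": [
                {
                    "id": call.id,
                    "type": "function",
                    "function": {
                        "name": call.function.name,
                        "arguments": call.function.arguments,
                    },
                }
                for call in msg.tool_calls
            ],
        })
        # Run each requested tool and append its result for the next round.
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = handlers[call.function.name](**args)
            history.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
    return None
```

The `max_rounds` cap is a design choice to stop a model that keeps requesting tools. This sketch raises `KeyError` on an unknown tool name and does not validate arguments, both of which production code would want to handle, especially before anything as consequential as a refund.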
Selecting the Right Model on Oxlo.ai
Oxlo.ai hosts more than 45 models across seven categories. For customer service, these are the most relevant starting points:
- Llama 3.3 70B: A strong general-purpose flagship for tool use and multi-turn chat.
- Qwen 3 32B: Ideal if your user base is multilingual or if you need agent workflows across languages.
- DeepSeek R1 671B MoE: Use this when the assistant must perform deep reasoning, such as interpreting complex return policies or debugging customer-reported errors.
- Kimi K2.6: Offers advanced reasoning, agentic coding, vision, and a 131K context window. Useful if your assistant needs to analyze screenshots or handle long document dumps.
- DeepSeek V4 Flash: An efficient MoE with a 1M context window and near state-of-the-art open-source reasoning. Excellent for loading entire ticket histories or knowledge bases into the prompt.
- DeepSeek V3.2: Available on the free tier, this model handles coding and reasoning well enough for prototyping.
For the retrieval layer, Oxlo.ai also provides embedding endpoints via BGE-Large and E5-Large, so you can keep the entire pipeline on one API key.
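Embedding a batch of documents could be wrapped as below. This is a sketch under two assumptions: that the embedding endpoint accepts the standard OpenAI SDK `embeddings.create` call, and that you look up the exact model identifier in the Oxlo.ai catalog, since the identifier strings for BGE-Large and E5-Large are not given here. The helper name `embed_texts` is illustrative.

```python
def embed_texts(client, texts, model):
    """Embed a batch of strings and return one vector per input, in order.

    client is an OpenAI SDK client pointed at the provider's base URL.
    model is the embedding model identifier (an assumption: check the
    provider's catalog for the exact string).
    """
    response = client.embeddings.create(model=model, input=texts)
    # Sort by index so the vectors line up with the input order.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

The resulting vectors can be stored alongside the source documents and compared against a query embedding produced by the same model.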
Handling Memory and Long Context
The biggest operational challenge in support automation is memory. Customers reference prior tickets, order numbers, and policies from weeks ago. Summarization is one approach, but it loses detail. A better approach is to include the full transcript and relevant documents directly in the prompt, which is only economical if your provider does not charge by the token for input length.
Token-based providers scale costs linearly with prompt size. For a busy support queue, that turns long context from a feature into a liability. Oxlo.ai flips this: because the price is flat per request, you can pass in full CRM records, prior conversations, and retrieved policy PDFs without worrying about metered input tokens. The platform also offers streaming responses, JSON mode for structured outputs, and no cold starts on popular models, so latency stays consistent even under load.
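Passing full context mostly comes down to assembling the message list carefully. The helper below is one possible shape for that step, with `build_messages` and its parameters as illustrative names. It folds the retrieved documents into a single system message, on the assumption that some chat templates handle one system message more reliably than several.

```python
def build_messages(system_prompt, documents, transcript, user_message):
    """Assemble a chat payload that carries the full context.

    documents: list of (title, text) pairs from the retrieval layer.
    transcript: prior turns as {"role": ..., "content": ...} dicts.
    """
    system = system_prompt
    if documents:
        context = "\n\n".join(f"[{title}]\n{text}" for title, text in documents)
        system += "\n\nReference material:\n\n" + context
    messages = [{"role": "system", "content": system}]
    messages.extend(transcript)  # full prior conversation, unsummarized
    messages.append({"role": "user", "content": user_message})
    return messages
```

Even with flat pricing, the combined prompt still has to fit within the chosen model's context window, so very large histories may need a model with a longer window.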
If you are building a prototype, the Oxlo.ai free tier includes 60 requests per day across more than 16 models with a 7-day full-access trial. For production, the Pro and Premium plans offer 1,000 and 5,000 requests per day respectively, with priority queue access on Premium.
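Because these plans are capped by requests per day, it can help to track usage on the client side so an agentic loop does not exhaust the quota unnoticed. The class below is a minimal sketch with illustrative names. It resets on the local calendar date, which is an assumption: the provider's actual reset time is not stated here and may differ.

```python
import datetime

class DailyRequestBudget:
    """Track requests against a per-day cap (for example 60, 1000, or 5000)."""

    def __init__(self, limit):
        self.limit = limit
        self.day = None
        self.used = 0

    def try_acquire(self, today=None):
        """Return True and count the request if budget remains for today."""
        today = today or datetime.date.today()
        if today != self.day:
            # First request of a new day: reset the counter.
            self.day, self.used = today, 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

Calling `try_acquire()` before each chat request gives the orchestration layer a chance to queue the work or escalate to a human instead of hitting a quota error mid-conversation.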
Deploying to Production
A virtual assistant is only as good as its ability to reason over rich context and interact with real systems. That requires an inference backend that supports function calling, multi-turn state, and long prompts without unpredictable bills. Oxlo.ai is built for that combination: drop-in OpenAI SDK compatibility, a deep catalog of open-source and proprietary models, and request-based pricing that can be 10-100x cheaper than token-based alternatives for long-context workloads. If you are evaluating inference providers for your next customer service deployment, start at oxlo.ai/pricing and run your existing prompts against the API in minutes.