Customer service automation has moved past rigid decision trees and keyword matching. Modern support pipelines need LLMs that can parse long ticket threads, query internal knowledge bases, and invoke tools like refund APIs or CRM lookups, all while keeping latency low and costs predictable. For teams building these systems, the inference layer is where architecture decisions directly impact the bottom line.
Why LLMs Fail at Support, and How to Fix It
Most customer service failures stem from three constraints: context windows that truncate email chains, models that hallucinate policy answers, and chatbots that cannot take action. A support ticket often arrives with 10,000 tokens of history, including previous exchanges, order IDs, and screenshots. If your model lacks long-context capacity or your pricing penalizes every additional token, you either lose information or bleed budget.
The fix is an agentic architecture. The LLM should classify intent, retrieve verified answers from a knowledge base, and call functions to modify orders or schedule callbacks. This requires models that support function calling, JSON mode, and streaming, plus an inference backend that does not punish you for stuffing the full thread into the prompt.
Architecture Pattern: Agentic Tool Use with Function Calling
A practical support agent needs to do more than generate text. It needs to execute. Below is a minimal Python example using the OpenAI SDK pointed at Oxlo.ai. The agent has access to two tools: get_order_status and initiate_refund. The model decides which to call based on the user message.
import json
import openai

# Point the OpenAI SDK at the Oxlo.ai endpoint.
client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

# Tool schemas the model may choose from; each one describes a function's JSON arguments.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Retrieve the current status of an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "initiate_refund",
            "description": "Start a refund for an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"},
                    "reason": {"type": "string"},
                },
                "required": ["order_id", "reason"],
            },
        },
    },
]

messages = [
    {"role": "system", "content": "You are a support agent. Be concise. Use tools when needed."},
    {"role": "user", "content": "I want a refund for order #88291. The item arrived damaged."},
]

# tool_choice="auto" lets the model decide whether to answer directly or request a tool.
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

# The message holds either text content or a tool_calls list for your application to execute.
print(response.choices[0].message)
Because Oxlo.ai is compatible with the OpenAI SDK, switching from another provider means changing the base_url and API key; the rest of the client code stays the same. The platform supports streaming responses, multi-turn conversations, and vision input for cases where users attach photos of damaged goods.
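The example above stops at printing the model's message. When the model picks a tool, that message carries a tool_calls list, and the application is responsible for running the function and returning the result. The following is a minimal sketch of that step, assuming the standard OpenAI SDK message shape (tool_calls entries with id, function.name, and function.arguments as a JSON string). The two handler bodies and the TOOL_REGISTRY name are hypothetical stand-ins for real order and refund services.

```python
import json

# Hypothetical local handlers; a real system would call the order and refund services.
def get_order_status(order_id: str) -> dict:
    return {"order_id": order_id, "status": "delivered"}

def initiate_refund(order_id: str, reason: str) -> dict:
    return {"order_id": order_id, "refund": "started", "reason": reason}

# Maps tool names from the schema to the Python callables that implement them.
TOOL_REGISTRY = {
    "get_order_status": get_order_status,
    "initiate_refund": initiate_refund,
}

def run_tool_calls(message, registry=TOOL_REGISTRY):
    """Execute every tool call on an assistant message and return role="tool" messages."""
    results = []
    for call in getattr(message, "tool_calls", None) or []:
        handler = registry.get(call.function.name)
        if handler is None:
            output = {"error": f"unknown tool: {call.function.name}"}
        else:
            try:
                # Arguments arrive as a JSON string and may be malformed.
                args = json.loads(call.function.arguments)
                output = handler(**args)
            except (json.JSONDecodeError, TypeError) as exc:
                output = {"error": str(exc)}
        results.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(output),
        })
    return results
```

To finish the turn, append the assistant message and the returned tool messages to the conversation and call client.chat.completions.create again so the model can write the customer-facing reply. Errors are returned to the model as tool output rather than raised, which lets it apologize or ask for a corrected order ID.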
The Hidden Cost of Long Context Windows
Customer service prompts are inherently long. A single ticket can include the original complaint, three back-and-forth replies, internal notes, and a pasted section from the returns policy. Under token-based pricing, every one of those tokens adds cost, and agentic workflows often require multiple round trips.
Oxlo.ai uses request-based pricing: one flat cost per API request regardless of prompt length. For long-context and agentic workloads, this can be 10-100x cheaper than token-based alternatives because cost does not scale with input length. You can include the full ticket thread and knowledge base context on every turn without watching the meter run. See the pricing page for plan details.
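To make the scaling argument concrete, here is a small arithmetic sketch comparing the two pricing shapes. Every rate in it is a hypothetical placeholder chosen for illustration, not a real price from Oxlo.ai or any other provider; substitute figures from the relevant pricing pages before drawing conclusions.

```python
# All rates below are HYPOTHETICAL placeholders, not published prices.
HYPOTHETICAL_INPUT_RATE_PER_M = 3.0    # dollars per million input tokens
HYPOTHETICAL_OUTPUT_RATE_PER_M = 15.0  # dollars per million output tokens
HYPOTHETICAL_PRICE_PER_REQUEST = 0.005 # dollars per API request

def token_priced_cost(input_tokens, output_tokens, in_rate_per_m, out_rate_per_m, round_trips=1):
    """Cost when each round trip re-sends the prompt and is billed per token."""
    per_call = (input_tokens / 1_000_000) * in_rate_per_m
    per_call += (output_tokens / 1_000_000) * out_rate_per_m
    return per_call * round_trips

def request_priced_cost(price_per_request, round_trips=1):
    """Cost when each round trip is billed a flat amount, whatever the prompt length."""
    return price_per_request * round_trips

if __name__ == "__main__":
    # Compare a short prompt, a typical ticket thread, and a very long thread.
    for prompt_tokens in (500, 10_000, 50_000):
        per_token = token_priced_cost(
            prompt_tokens, 300,
            HYPOTHETICAL_INPUT_RATE_PER_M, HYPOTHETICAL_OUTPUT_RATE_PER_M,
            round_trips=3,
        )
        flat = request_priced_cost(HYPOTHETICAL_PRICE_PER_REQUEST, round_trips=3)
        print(f"{prompt_tokens:>6} prompt tokens: per-token ${per_token:.4f}, per-request ${flat:.4f}")
```

The per-token figure grows with both prompt length and the number of round trips, while the per-request figure grows only with round trips. Where the two lines cross depends entirely on the actual rates.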
Model Selection for Support Workloads
Oxlo.ai hosts 45+ models across seven categories. For customer service automation, these are the most relevant:
- General triage and chat: Llama 3.3 70B handles high-volume, straightforward interactions with low latency.
- Multilingual support: Qwen 3 32B is built for multilingual reasoning and agent workflows, useful if your user base spans regions.
- Complex reasoning and policy interpretation: DeepSeek R1 671B MoE and Kimi K2.6 excel at chain-of-thought reasoning and agentic coding. Use these when a ticket requires reading a long policy document and determining an exception.
- Vision: Kimi VL A3B and Gemma 3 27B process screenshots of errors or damaged products attached to tickets.
- Embeddings: BGE-Large and E5-Large power the retrieval layer of your RAG pipeline for knowledge-base lookups.
- Code-specific tasks: If your support involves debugging user scripts, Qwen 3 Coder 30B or Oxlo.ai Coder Fast are available.
There are no cold starts on popular models, so your agent responds immediately even under variable load.
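For the embeddings entry in the list above, the retrieval step can be sketched as a cosine-similarity ranking over pre-computed article vectors. This is an illustrative sketch, not a description of any particular RAG stack: embed() uses the OpenAI SDK's client.embeddings.create call and assumes the provider serves that endpoint, the model identifier is left to the caller, and top_k and the (title, vector) layout are hypothetical names introduced here.

```python
import math

def embed(client, texts, model):
    """Embed a list of strings; assumes an OpenAI-compatible embeddings endpoint."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, articles, k=3):
    """Return the titles of the k articles most similar to the query.

    articles is a list of (title, vector) pairs computed ahead of time.
    """
    ranked = sorted(articles, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [title for title, _ in ranked[:k]]
```

A linear scan like this is adequate for a few thousand knowledge-base articles; larger corpora usually move the ranking into a vector index. The retrieved article text is then placed in the prompt alongside the ticket thread.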
Implementation: OpenAI SDK Drop-In
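Beyond function calling, you will likely want structured outputs for routing. JSON mode constrains the model to return a valid JSON object, which is useful for escalation logic. The keys and values still come from your prompt rather than an enforced schema, so validate the parsed result before acting on it.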
response = client.chat.completions.create(
    model="kimi-k2-6",
    messages=[
        {"role": "system", "content": "Analyze the ticket and return JSON with keys: sentiment, urgency, category, should_escalate."},
        {"role": "user", "content": "Your product deleted my entire project and your documentation is useless."},
    ],
    response_format={"type": "json_object"},
)
# json.loads turns the JSON text into a Python dict (JSON true becomes Python True).
result = json.loads(response.choices[0].message.content)
# Example result: {"sentiment": "angry", "urgency": "high", "category": "data_loss", "should_escalate": True}
Streaming is supported out of the box. For chat UIs, set stream=True to deliver tokens as they are generated, reducing perceived latency for the end user.
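A minimal sketch of consuming a streamed reply follows, assuming the standard OpenAI SDK chunk shape in which each chunk exposes choices[0].delta.content. The helper names collect_stream and stream_reply are hypothetical; the model name is the one used in the earlier example.

```python
def collect_stream(chunks, on_token=lambda text: print(text, end="", flush=True)):
    """Forward each text delta to on_token and return the full reply."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or an empty delta (for example role-only or final chunks).
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            on_token(delta)
            parts.append(delta)
    return "".join(parts)

def stream_reply(client, messages, model="llama-3.3-70b"):
    """Request a streamed completion and return the assembled text."""
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
    )
    return collect_stream(stream)
```

In a chat UI, on_token would push each fragment to the browser (for example over a WebSocket or server-sent events) instead of printing it.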
Routing and Escalation with JSON Mode
A production support pipeline usually has multiple tiers. Tier 1 handles refunds and password resets. Tier 2 handles bugs and account recovery. Tier 3 is human. Use JSON mode to build a router that classifies incoming tickets and assigns them to the correct handler.
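The tiering described above can be sketched as a small routing function over the classifier's JSON output. The tier assignments follow the paragraph; the category strings, the TIER_BY_CATEGORY table, and the rule that anything unrecognized goes to a human are assumptions made for illustration.

```python
# Hypothetical category labels; align them with whatever your classifier prompt emits.
TIER_BY_CATEGORY = {
    "refund": 1,
    "password_reset": 1,
    "bug": 2,
    "account_recovery": 2,
}

HUMAN_TIER = 3

def route_ticket(classification: dict) -> int:
    """Map a parsed classification dict to a support tier (1, 2, or 3)."""
    # An explicit escalation flag always wins.
    if classification.get("should_escalate") is True:
        return HUMAN_TIER
    category = classification.get("category")
    # Missing, malformed, or unknown categories fall back to a human.
    if not isinstance(category, str):
        return HUMAN_TIER
    return TIER_BY_CATEGORY.get(category, HUMAN_TIER)
```

Defaulting to the human tier means a malformed or unexpected model response costs an agent's time rather than triggering an automated action on the wrong ticket.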
The advantage of doing this on Oxlo.ai is that you can pass the entire conversation history, the user profile, and relevant knowledge-base articles into a single request. Because pricing is per request, the cost is the same whether you send 500 tokens or 50,000 tokens. This lets you be generous with context, which improves routing accuracy and reduces false escalations.
Putting It Together
Building customer service automation that actually works requires more than a chat interface. It requires long context, tool use, structured outputs, and an inference backend that makes those patterns economically viable. Oxlo.ai provides the models and the pricing structure to support this: flat per-request pricing, OpenAI SDK compatibility, and a broad catalog including reasoning, vision, and embedding models. If your current provider charges per token and your support tickets keep getting longer, moving the inference layer to Oxlo.ai is a straightforward way to cut costs and keep context intact.