Customer service is one of the highest-impact production workloads for large language models. A well-built conversational AI system can resolve tier-one tickets, escalate complex issues, and maintain coherence across long support threads. The gap between a prototype and a production-ready deployment comes down to four decisions: model selection, memory management, tool integration, and a pricing structure that stays predictable as context grows.
Architecture Overview
A robust conversational AI stack has four layers. The interaction layer accepts input through chat, email, or voice interfaces. The orchestration layer manages conversation state, memory, and routing logic. The inference layer runs the LLM and produces the response. The operations layer provides observability, guardrails, and cost tracking. Omitting any layer produces a brittle system that degrades as conversations grow more complex or when users switch topics mid-thread.
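To make the layering concrete, here is a minimal sketch of how the four layers might hand a single turn to one another. Every name in it (`Conversation`, `Metrics`, `handle_turn`, `echo_model`) is hypothetical and chosen for illustration; the inference layer is a stand-in function rather than a real model call.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Orchestration-layer state: the running message history for one thread."""
    messages: list = field(default_factory=list)


@dataclass
class Metrics:
    """Operations-layer counters; a real system would export these to monitoring."""
    requests: int = 0
    blocked: int = 0


def interaction_layer(raw_text):
    """Normalize text arriving from chat, email, or a voice transcript."""
    return " ".join(raw_text.split())


def operations_guardrail(text, metrics):
    """Toy guardrail: reject empty input and count the rejection."""
    if not text:
        metrics.blocked += 1
        return False
    return True


def handle_turn(raw_text, convo, infer, metrics):
    """Orchestrate one turn: normalize, check, call the model, record state."""
    text = interaction_layer(raw_text)
    if not operations_guardrail(text, metrics):
        return "Sorry, I did not catch that. Could you rephrase?"
    convo.messages.append({"role": "user", "content": text})
    reply = infer(convo.messages)  # inference layer (any callable)
    metrics.requests += 1
    convo.messages.append({"role": "assistant", "content": reply})
    return reply


def echo_model(messages):
    """Stand-in for the inference layer so the sketch runs offline."""
    return f"You said: {messages[-1]['content']}"


if __name__ == "__main__":
    convo, metrics = Conversation(), Metrics()
    print(handle_turn("  where is   my order? ", convo, echo_model, metrics))
    print(metrics)
```

Passing the model as a plain callable keeps the orchestration logic testable without network access, and swapping `echo_model` for a real client call does not change the other layers.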
Model Selection for Customer Service
Support bots need models that handle multilingual users, follow precise instructions, and reason over long conversation histories.
Oxlo.ai offers several models suited for this. Llama 3.3 70B serves as a reliable general-purpose flagship for standard FAQs and account inquiries. If your user base spans multiple languages, Qwen 3 32B provides multilingual reasoning and agent workflow capabilities. For complex tickets requiring analysis of lengthy account histories or technical documentation, Kimi K2.6 offers advanced reasoning and a 131K context window. GLM 5, built on a 744B MoE architecture, handles long-horizon agentic tasks such as multi-step troubleshooting. With over 45 models across seven categories, Oxlo.ai lets you route simple queries to fast models and complex escalations to larger reasoning models without managing separate provider contracts.
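The routing idea can be sketched as a small function that picks a model per turn. This is only an illustration under stated assumptions: apart from `llama-3.3-70b`, which appears in this article's example, the model identifiers below are guesses at naming and should be checked against the provider's model list. The character threshold and the language heuristic are likewise arbitrary placeholders.

```python
# Only "llama-3.3-70b" is taken from this article; the other IDs are assumed.
MODEL_GENERAL = "llama-3.3-70b"
MODEL_MULTILINGUAL = "qwen-3-32b"   # assumed identifier
MODEL_LONG_CONTEXT = "kimi-k2.6"    # assumed identifier

LONG_HISTORY_CHARS = 20_000  # arbitrary cutoff for "lengthy" histories


def looks_non_english(text):
    """Crude heuristic: more than 30% of the letters are non-ASCII.

    This misses Latin-script languages such as Spanish; a real system
    would use a language-detection library instead.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    non_ascii = sum(1 for ch in letters if ord(ch) > 127)
    return non_ascii / len(letters) > 0.3


def choose_model(history, user_message):
    """Pick a model ID from history size and the language of the new message."""
    total_chars = sum(len(m["content"]) for m in history) + len(user_message)
    if total_chars > LONG_HISTORY_CHARS:
        return MODEL_LONG_CONTEXT
    if looks_non_english(user_message):
        return MODEL_MULTILINGUAL
    return MODEL_GENERAL


if __name__ == "__main__":
    print(choose_model([], "Where is my order?"))
```

The returned string would be passed as the `model` argument of the chat completion call, so routing stays a one-line change at the call site.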
Context Management and Memory
Support conversations are rarely single-turn. Customers reference order numbers from earlier messages, and agentic systems iterate through internal knowledge bases. This means your prompt grows with every turn.
On token-based platforms, cost scales directly with input length, so teams aggressively truncate history to control spend. In contrast to token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, and Anyscale, Oxlo.ai uses request-based pricing: one flat cost per API request regardless of prompt length. For customer service workloads with long system prompts, few-shot examples, or extensive conversation history, this can be 10-100x cheaper than token-based alternatives, with the saving growing as prompts get longer. You can preserve full context for better resolution rates without watching token meters. See https://oxlo.ai/pricing for plan details.
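The arithmetic behind this comparison can be sketched as follows. The function names are hypothetical, and the prices in the usage lines are placeholders chosen only to show the calculation; they are not real rates from any provider. The sketch also counts input tokens only and assumes every request resends the system prompt plus all earlier turns.

```python
def token_based_cost(system_tokens, tokens_per_turn, turns, price_per_million_tokens):
    """Input cost when each request resends the system prompt and all prior turns.

    tokens_per_turn is the average number of tokens one turn adds to the history.
    """
    total_input_tokens = 0
    for turn in range(1, turns + 1):
        # Request number `turn` carries the system prompt plus `turn` turns of history.
        total_input_tokens += system_tokens + tokens_per_turn * turn
    return total_input_tokens * price_per_million_tokens / 1_000_000


def request_based_cost(turns, price_per_request):
    """Flat cost: one charge per request, independent of prompt length."""
    return turns * price_per_request


if __name__ == "__main__":
    # Placeholder numbers, not real pricing.
    per_token = token_based_cost(
        system_tokens=2000, tokens_per_turn=300, turns=10, price_per_million_tokens=1.0
    )
    per_request = request_based_cost(turns=10, price_per_request=0.001)
    print(f"token-based: {per_token:.4f}  request-based: {per_request:.4f}")
```

Because the history is resent on every turn, the token-based total grows roughly quadratically with the number of turns, while the request-based total grows linearly. Plugging in your own prices and conversation lengths shows where the crossover sits for your workload.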
Tool Use and Function Calling
A support bot that only generates text has limited value. Production systems need to query orders, schedule callbacks, and create tickets. Function calling turns natural language into structured business actions.
Oxlo.ai supports function calling, JSON mode, and streaming responses across its chat models. The platform is fully OpenAI SDK compatible, so you can point your existing Python or Node.js client to https://api.oxlo.ai/v1 without rewriting your stack. There are no cold starts on popular models, which keeps latency low during traffic spikes.
import json

import openai

# Point the standard OpenAI client at the Oxlo.ai endpoint.
client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY",
)

# Tool schemas the model may call.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Retrieve the current status of a customer order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Escalate the conversation to a human support agent",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string"}
                },
                "required": ["reason"],
            },
        },
    },
]


def process_support_turn(history, user_message):
    """Run one support turn. Appends to `history` in place."""
    history.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=history,
        tools=tools,
        tool_choice="auto",
    )
    message = response.choices[0].message

    # No tool call: the model answered directly.
    if not message.tool_calls:
        return {"action": "reply", "content": message.content}

    # The assistant message that requested the tools must precede
    # the tool results in the history sent back to the model.
    history.append(message)
    for tool_call in message.tool_calls:
        name = tool_call.function.name
        args = json.loads(tool_call.function.arguments)
        if name == "get_order_status":
            # Stubbed lookup; replace with a real order-system query.
            result = {"order_id": args["order_id"], "status": "shipped"}
            history.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(result),
            })
        elif name == "escalate_to_human":
            # Hand off immediately; no further model call is needed.
            return {"action": "escalate", "reason": args["reason"]}

    # Second call lets the model turn the tool results into a reply.
    final = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=history,
        tools=tools,
    )
    return {"action": "reply", "content": final.choices[0].message.content}
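As the number of tools grows, the `if`/`elif` chain in the example above becomes hard to maintain. One common alternative is a dispatch table, sketched below. `TOOL_HANDLERS` and `run_tool` are hypothetical names, and `get_order_status` is the same stub as in the example rather than a real order lookup. Returning errors as JSON instead of raising is a design assumption: it lets the model see the failure and recover.

```python
import json


def get_order_status(order_id):
    """Stub that mirrors the hard-coded result used in the example."""
    return {"order_id": order_id, "status": "shipped"}


# Map tool names (as declared in the tool schemas) to Python callables.
TOOL_HANDLERS = {
    "get_order_status": get_order_status,
}


def run_tool(name, raw_arguments):
    """Execute a tool call and return a JSON string for the tool message."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(raw_arguments)
        return json.dumps(handler(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON, non-object arguments, or wrong parameter names.
        return json.dumps({"error": f"invalid arguments: {exc}"})


if __name__ == "__main__":
    print(run_tool("get_order_status", '{"order_id": "A1"}'))
    print(run_tool("cancel_order", "{}"))
```

Inside the loop over `message.tool_calls`, the string returned by `run_tool(tool_call.function.name, tool_call.function.arguments)` would become the `content` of the tool message.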
Guardrails and Structured Output
Production customer service systems cannot hallucinate refund policies or expose private data. You need guardrails at the prompt, model, and application layers.
Use system prompts to define tone, boundaries, and available tools. Oxlo.ai supports JSON mode, which constrains the model to return syntactically valid JSON for downstream validation. Valid JSON is not the same as correct JSON, so combine this with a Pydantic or JSON Schema validator in your application layer to ensure outputs conform to your business rules before any action executes. For sensitive escalations, always require human approval rather than autonomous execution.
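An application-layer check might look like the sketch below. It uses only the standard library as a stand-in for a Pydantic or JSON Schema validator, and the action schema (`reply`, `refund`, `escalate`, plus an `amount` field) is invented for illustration. It is not a format defined by Oxlo.ai or by this article.

```python
import json

# Hypothetical set of actions the application is willing to execute.
ALLOWED_ACTIONS = {"reply", "refund", "escalate"}


def validate_action(raw_output):
    """Parse model output and check it against simple business rules.

    Returns (True, payload) on success or (False, error_message) on failure.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(data, dict):
        return False, "expected a JSON object"

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        return False, f"unknown action: {action!r}"

    if action == "refund":
        amount = data.get("amount")
        is_number = isinstance(amount, (int, float)) and not isinstance(amount, bool)
        if not is_number or amount <= 0:
            return False, "refund needs a positive numeric amount"
        # Sensitive action: never execute without a human sign-off.
        data["requires_human_approval"] = True

    return True, data


if __name__ == "__main__":
    print(validate_action('{"action": "refund", "amount": 20}'))
    print(validate_action('{"action": "delete_account"}'))
```

Running the validator before any side effect means a malformed or out-of-policy response fails closed: the application can re-prompt the model or route the thread to a human instead of acting on it.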
End-to-End Implementation
Putting these pieces together, a minimal but production-ready loop looks like this. It maintains conversation state, handles tool execution, and returns responses to the user.
import openai
import json

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Retrieve the current status of a customer order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"]
            }
        }
    }
]

system_prompt = (
    "You are a helpful customer support agent. Use the available tools to "
    "check order status or escalate when the user is frustrated. Be concise."
)
def