Building Conversational AI Systems for Customer Support with LLMs

#aiinfrastructure #oxlo #ai

Building conversational AI for customer support requires more than prompt engineering. Production systems need to maintain state across multi-turn dialogues, invoke external tools to query knowledge bases, and handle context windows that swell as customers describe complex issues. Per-token pricing can make these workloads expensive, especially when agentic loops or long conversation histories push input lengths into the tens of thousands of tokens. Teams shipping reliable support automation benefit from a predictable cost structure and a broad model catalog.

Architecture Patterns for Support Bots

A robust support architecture typically combines retrieval-augmented generation, function calling, and persistent state management. The LLM acts as a reasoning layer, not a database. Customer context, order history, and policy documents live in external systems that the model accesses through structured tool calls. Multi-turn conversations require the application layer to append each exchange to a growing message history, which rapidly increases token count. This is where infrastructure choices directly impact both latency and cost.
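To make the growing-history problem concrete, here is a minimal sketch of application-side conversation state. The `Conversation` class, the four-characters-per-token estimate, and the 8,000-token budget are all illustrative assumptions, not part of any SDK; a real system would use the model's tokenizer and persist turns in a database.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about four characters per token."""
    return max(1, len(text) // 4)


@dataclass
class Conversation:
    system_prompt: str
    max_history_tokens: int = 8000  # hypothetical budget; tune per model
    turns: list = field(default_factory=list)

    def add(self, role: str, content: str) -> None:
        """Append one exchange to the stored history."""
        self.turns.append({"role": role, "content": content})

    def history_tokens(self) -> int:
        """Estimated size of the full stored history."""
        return sum(estimate_tokens(t["content"]) for t in self.turns)

    def to_messages(self) -> list:
        """Return the system prompt plus the newest turns that fit the budget.

        The most recent turn is always kept, even if it alone exceeds the budget.
        """
        kept, used = [], 0
        for turn in reversed(self.turns):
            cost = estimate_tokens(turn["content"])
            if kept and used + cost > self.max_history_tokens:
                break
            kept.append(turn)
            used += cost
        kept.reverse()
        return [{"role": "system", "content": self.system_prompt}] + kept
```

Dropping the oldest turns is only one option. Summarizing older turns, or keeping the full history when the model's context window and the pricing model allow it, are reasonable alternatives.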

Selecting the Right Model Stack

Not every support query needs the same capability. Oxlo.ai offers 45+ models across seven categories, letting you route tasks efficiently.

General triage and routing: Llama 3.3 70B provides fast, high-quality responses for common questions.
Complex reasoning and coding: DeepSeek R1 671B MoE and Kimi K2.6 handle advanced troubleshooting, log analysis, and chain-of-thought reasoning.
Multilingual support: Qwen 3 32B excels at multilingual reasoning and agent workflows for global user bases.
Long-context summarization: DeepSeek V4 Flash supports a 1M-token context window, while Kimi K2.6 offers a 131K-token context window. Both are useful for summarizing lengthy ticket threads without truncation.
Agentic orchestration: GLM 5 and Minimax M2.5 manage long-horizon tasks and tool-heavy workflows.
Vision inputs: If customers upload screenshots, Kimi VL A3B and Gemma 3 27B process image inputs alongside text.

This breadth lets you match the model to the ticket, rather than over-provisioning a single large endpoint for every request.
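One way to picture this routing is a small rule-based function that maps ticket attributes to a model ID. The sketch below is illustrative only: the `Ticket` fields, the keyword rules, the 100,000-token threshold, and every model ID except `llama-3.3-70b` are placeholders, so check the provider's catalog for the real identifiers.

```python
from dataclasses import dataclass

# Placeholder model IDs (assumptions), except "llama-3.3-70b".
MODEL_BY_CATEGORY = {
    "triage": "llama-3.3-70b",
    "reasoning": "REASONING_MODEL_ID",
    "multilingual": "MULTILINGUAL_MODEL_ID",
    "long_context": "LONG_CONTEXT_MODEL_ID",
    "vision": "VISION_MODEL_ID",
}

# Hypothetical signals that a ticket needs deeper troubleshooting.
REASONING_KEYWORDS = ("traceback", "stack trace", "error log", "exception")


@dataclass
class Ticket:
    text: str
    language: str = "en"
    has_image: bool = False
    thread_tokens: int = 0  # estimated size of the existing ticket thread


def pick_category(ticket: Ticket) -> str:
    """Apply simple rules in priority order; fall back to general triage."""
    if ticket.has_image:
        return "vision"
    if ticket.thread_tokens > 100_000:
        return "long_context"
    if ticket.language != "en":
        return "multilingual"
    lowered = ticket.text.lower()
    if any(keyword in lowered for keyword in REASONING_KEYWORDS):
        return "reasoning"
    return "triage"


def pick_model(ticket: Ticket) -> str:
    """Resolve the category to a model ID for the chat completion call."""
    return MODEL_BY_CATEGORY[pick_category(ticket)]
```

In practice the classification step could itself be a cheap LLM call. Static rules are shown here only because they are easy to test and reason about.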

Why Request-Based Pricing Fits Support Workloads

Support conversations are inherently long. A customer might paste an error log, reference previous messages, or upload attachments. Under token-based pricing, every additional word in the prompt increases cost. For agentic systems that iterate over tool results, expenses compound quickly.

Oxlo.ai uses flat per-request pricing. One API call costs the same regardless of whether the prompt is 500 tokens or 50,000 tokens. For long-context and agentic support workloads, this can be significantly cheaper than token-based alternatives such as Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale. There are no cold starts on popular models, so latency stays consistent as volumes scale. You can view plan details at the Oxlo.ai pricing page.
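The compounding effect can be worked through with simple arithmetic. In an agentic loop, each step resends the whole history, so input tokens grow with every tool result. The sketch below compares the two billing models under stated assumptions: all prices are hypothetical inputs rather than any provider's actual rates, and output tokens are ignored for simplicity.

```python
def agent_loop_input_tokens(base_prompt_tokens: int,
                            tokens_added_per_step: int,
                            steps: int) -> int:
    """Total input tokens when every step resends the growing history."""
    return sum(base_prompt_tokens + i * tokens_added_per_step
               for i in range(steps))


def token_based_cost(total_input_tokens: int,
                     usd_per_million_tokens: float) -> float:
    """Input cost under per-token billing (price is a hypothetical input)."""
    return total_input_tokens / 1_000_000 * usd_per_million_tokens


def per_request_cost(steps: int, usd_per_request: float) -> float:
    """Cost under flat per-request billing (price is a hypothetical input)."""
    return steps * usd_per_request


# A 5,000-token prompt that grows by 2,000 tokens per tool result over 5 steps:
# 5000 + 7000 + 9000 + 11000 + 13000 = 45000 input tokens in total.
total = agent_loop_input_tokens(5000, 2000, 5)
```

Which model comes out cheaper depends entirely on the actual rates and on how long the prompts get. The per-token total grows roughly quadratically with the number of steps, while the per-request total grows linearly.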

Implementing Tool Use with the OpenAI SDK

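Oxlo.ai is fully compatible with the OpenAI SDK, so you can point an existing client at the Oxlo.ai API without rewriting application logic. Below is a minimal example that defines two tools for a support bot, sends a customer's order-status question, and prints the model's reply. With tool_choice set to "auto", that reply may be a request to call check_order_status rather than a final answer; executing the tool and returning its result to the model is left to the application.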

import openai
import json

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "check_order_status",
            "description": "Retrieve the current status of a customer order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string"}
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Hand off the conversation to a human agent",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {"type": "string"}
                },
                "required": ["reason"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful support agent. Use tools to answer questions. Escalate if the user is frustrated."},
    {"role": "user", "content": "Where is my order #48291? It was supposed to arrive yesterday and I am getting worried."}
]

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=messages,
    tools=tools,
    tool_choice="auto"
)

# With tool_choice="auto", the reply contains either plain text or one or more tool calls.
print(response.choices[0].message)

The same pattern works for any Oxlo.ai chat model that supports function calling, including Qwen 3, DeepSeek V3.2, and Kimi K2.6. Because the endpoint is OpenAI-compatible, you can reuse existing evaluation frameworks, observability hooks, and retry logic.
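The example above stops at the model's first reply. If that reply contains tool calls, the application has to run them and send the results back. Below is a minimal sketch of that dispatch step. The two handler functions are stubs with made-up return values, standing in for real order and ticketing systems, and `run_tool_call` is a hypothetical helper, not part of the SDK.

```python
import json


def check_order_status(order_id: str) -> dict:
    # Stub (assumption): a real handler would query the order system.
    return {"order_id": order_id, "status": "in_transit"}


def escalate_to_human(reason: str) -> dict:
    # Stub (assumption): a real handler would open a handoff in the helpdesk.
    return {"escalated": True, "reason": reason}


# Map the tool names declared in the request to local Python functions.
TOOL_HANDLERS = {
    "check_order_status": check_order_status,
    "escalate_to_human": escalate_to_human,
}


def run_tool_call(name: str, arguments_json: str, tool_call_id: str) -> dict:
    """Execute one tool call and wrap the result as a "tool" role message."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        result = {"error": f"unknown tool: {name}"}
    else:
        try:
            # The model supplies arguments as a JSON string; it may be malformed.
            result = handler(**json.loads(arguments_json))
        except (json.JSONDecodeError, TypeError) as exc:
            result = {"error": str(exc)}
    return {
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": json.dumps(result),
    }
```

With the OpenAI SDK, each entry in `response.choices[0].message.tool_calls` exposes `id`, `function.name`, and `function.arguments`. A typical loop appends the assistant message to `messages`, appends one `run_tool_call(...)` result per tool call, and then calls `client.chat.completions.create` again so the model can produce the final answer. Returning errors as tool results, rather than raising, gives the model a chance to recover or escalate.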