Customer support agents built on large language models have moved past simple FAQ bots. A production system today combines retrieval, tool use, and multi-turn reasoning to resolve tickets without human intervention. In this tutorial, you will build an autonomous support agent in Python that can search a knowledge base, look up orders, and escalate complex issues. We will use Oxlo.ai as the inference backend because its request-based pricing and tool-supporting models remove the cost penalties usually tied to long conversation threads.
Architecture of a Modern Support Agent
A reliable support agent needs three components working together. First, a knowledge base stores articles and past resolutions as vector embeddings so the agent can retrieve relevant context. Second, a reasoning layer powered by an LLM interprets user intent, decides when to retrieve facts, and chooses whether to call external tools. Third, an action layer executes those tool calls against internal APIs and returns structured results to the reasoning layer.
Oxlo.ai covers each layer. For embeddings, you can use BGE-Large or E5-Large. For reasoning, models such as Llama 3.3 70B, Qwen 3 32B, or Kimi K2.6 support function calling, multi-turn conversations, and long context windows. If you need to transcribe voice messages, Whisper Large v3 is available through the same API.
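To make the three-layer split concrete, here is a minimal, provider-agnostic sketch. The `SupportAgent` class and the stub functions are hypothetical names invented for illustration; they are not part of Oxlo.ai or the OpenAI SDK, and the stubs stand in for the real retrieval, LLM, and API calls built later in this tutorial.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class SupportAgent:
    """Hypothetical container that wires the three layers together."""
    retrieve: Callable[[str], List[str]]       # knowledge base: query -> article texts
    reason: Callable[[str, List[str]], dict]   # reasoning layer: answer or tool request
    tools: Dict[str, Callable[..., dict]]      # action layer: tool name -> handler

    def handle(self, user_message: str) -> str:
        context = self.retrieve(user_message)
        decision = self.reason(user_message, context)
        if decision.get("tool"):
            # The reasoning layer asked for an action; run it in the action layer.
            result = self.tools[decision["tool"]](**decision.get("args", {}))
            return f"{decision['tool']} -> {result}"
        return decision["answer"]


# Stub layers so the sketch runs without any network access.
def stub_retrieve(query):
    return ["Standard shipping takes 5 to 7 business days."]


def stub_reason(message, context):
    # A real reasoning layer would be an LLM call; this is a keyword check.
    if "order" in message.lower():
        return {"tool": "lookup_order", "args": {"order_id": "ORD-12345"}}
    return {"answer": context[0]}


agent = SupportAgent(
    retrieve=stub_retrieve,
    reason=stub_reason,
    tools={"lookup_order": lambda order_id: {"order_id": order_id, "status": "shipped"}},
)
```

Keeping the layers behind plain callables like this makes it easy to swap a stub for a real backend one layer at a time.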
Project Setup and SDK Configuration
Oxlo.ai is compatible with the OpenAI SDK, so you can use the official Python client and change only the base URL. Install the SDK and store your API key in the OXLO_API_KEY environment variable.
pip install openai
Then instantiate the client and point it at the Oxlo.ai endpoint.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)
Building the Knowledge Base with Embeddings
Start by turning your support articles into embeddings. The snippet below sends the text of each article to Oxlo.ai and stores the resulting vectors in a simple dictionary. In production, swap this for a vector database.
import numpy as np

articles = {
    "returns": "You can return items within 30 days with the original receipt.",
    "shipping": "Standard shipping takes 5 to 7 business days.",
}

def get_embedding(text):
    resp = client.embeddings.create(
        model="BGE-Large",
        input=[text]
    )
    return resp.data[0].embedding

# Embed every article once, up front
kb = {k: get_embedding(v) for k, v in articles.items()}

def retrieve(query, top_k=1):
    q_emb = get_embedding(query)
    # Cosine similarity: dot product divided by the product of the vector norms
    scores = {
        k: np.dot(q_emb, v) / (np.linalg.norm(q_emb) * np.linalg.norm(v))
        for k, v in kb.items()
    }
    # Highest-scoring articles first
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
Because Oxlo.ai offers dedicated embedding endpoints, you do not need a separate provider for retrieval.
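The ranking step can be checked offline before any embedding request is made. The sketch below is a vectorized variant of the same cosine-similarity ranking, run on small hand-made vectors; `top_k_cosine` and the toy data are hypothetical and only illustrate the math, not real embedding output.

```python
import numpy as np


def top_k_cosine(query_vec, doc_matrix, doc_ids, top_k=1):
    """Rank documents by cosine similarity to the query vector."""
    q = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_matrix, dtype=float)
    # Normalize to unit length so that a dot product equals cosine similarity.
    q = q / np.linalg.norm(q)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    scores = docs @ q
    # argsort is ascending, so reverse it to get the best matches first.
    order = np.argsort(scores)[::-1][:top_k]
    return [(doc_ids[i], float(scores[i])) for i in order]


# Toy 3-dimensional "embeddings", made up for illustration only.
toy_ids = ["returns", "shipping"]
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
ranked = top_k_cosine([0.1, 0.9, 0.8], toy_vectors, toy_ids, top_k=2)
```

Here the query vector points mostly along the second and third axes, so "shipping" ranks ahead of "returns". Stacking the vectors into one matrix also avoids a Python-level loop once the knowledge base grows.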
Defining Tools for Order Management and Escalation
Next, define the actions your agent can take. We will create two tools: one to look up an order by ID and another to escalate to a human agent. The LLM receives these schemas and decides when to invoke them.
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order",
            "description": "Retrieve the status and tracking URL for an order",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {
                        "type": "string",
                        "description": "The order identifier, e.g., ORD-12345"
                    }
                },
                "required": ["order_id"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "escalate_to_human",
            "description": "Hand off the conversation to a human support representative",
            "parameters": {
                "type": "object",
                "properties": {
                    "reason": {
                        "type": "string",
                        "description": "Why the escalation is needed"
                    }
                },
                "required": ["reason"]
            }
        }
    }
]
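The schemas only describe the tools; something still has to run them. One possible way to do that is a dispatch table keyed by tool name, sketched below. The handler bodies are placeholders, and `TOOL_HANDLERS` and `execute_tool` are hypothetical names; a real system would call internal order and ticketing APIs instead.

```python
import json


def lookup_order(order_id):
    # Placeholder: a real handler would query an internal order API.
    return {"order_id": order_id, "status": "shipped"}


def escalate_to_human(reason):
    # Placeholder: a real handler would open a ticket in a human queue.
    return {"status": "escalated", "reason": reason}


# Maps the tool names declared in the schemas to Python callables.
TOOL_HANDLERS = {
    "lookup_order": lookup_order,
    "escalate_to_human": escalate_to_human,
}


def execute_tool(name, raw_arguments):
    """Run one tool call and always return a JSON string for the model."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    try:
        args = json.loads(raw_arguments or "{}")
        # Raises TypeError on missing, unexpected, or non-object arguments.
        result = handler(**args)
    except (json.JSONDecodeError, TypeError) as exc:
        return json.dumps({"error": f"invalid arguments: {exc}"})
    return json.dumps(result)
```

Returning errors as JSON rather than raising gives the model a chance to correct a malformed call on its next turn, for example `execute_tool("lookup_order", '{"order_id": "ORD-12345"}')`.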
Implementing the Agent Loop
The core of the agent is a loop that calls the chat completions endpoint, checks for tool calls, executes them, and feeds the results back to the model. We will use Llama 3.3 70B as the reasoning engine because it handles tool use reliably and responds quickly.
import json

def run_agent(user_message):
    messages = [
        {
            "role": "system",
            "content": (
                "You are a helpful support agent. Answer questions using the knowledge base. "
                "If the user asks about an order, use lookup_order. "
                "If the user is angry or the issue is complex, use escalate_to_human."
            )
        },
        {"role": "user", "content": user_message}
    ]
    # Retrieve context and append it to the system message
    hits = retrieve(user_message)
    context = "\n".join([f"{k}: {articles[k]}" for k, _ in hits])
    messages[0]["content"] += f"\n\nRelevant articles:\n{context}"

    while True:
        response = client.chat.completions.create(
            model="Llama 3.3 70B",
            messages=messages,
            tools=tools,
            tool_choice="auto"
        )
        choice = response.choices[0]
        if choice.finish_reason == "tool_calls":
            # Keep the assistant's tool request in the history
            messages.append(choice.message)
            for tc in choice.message.tool_calls:
                fn_name = tc.function.name
                # Parsed arguments; the stubbed results below do not use them yet
                args = json.loads(tc.function.arguments)
                if fn_name == "lookup_order":
                    # Stubbed result; replace with a call to your order API
                    result = {"status": "shipped", "tracking_url": "https://track.example.com/123"}
                elif fn_name == "escalate_to_human":
                    # Stubbed result; replace with a call to your ticketing system
                    result = {"status": "escalated", "queue": "billing"}
                else:
                    result = {"error": "unknown tool"}
                # Return the result to the model, linked to the originating call
                messages.append({
                    "role": "tool",
                    "tool_call_id": tc.id,
                    "name": fn_name,
                    "content": json.dumps(result)
                })
        else:
            return choice.message.content
This loop continues until the model returns a text response instead of a tool call. The tool role messages provide the exact JSON the model needs to generate an accurate final answer.
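Because the loop ends only when the model stops requesting tools, a production version usually needs a step limit. The sketch below shows one way to add it, using plain dictionaries and a fake model so it runs offline. `run_bounded`, `fake_model`, `fake_execute`, and `looping_model` are hypothetical, and the message shapes are simplified compared with the SDK objects.

```python
import json


def run_bounded(call_model, execute, messages, max_steps=5):
    """Run the tool loop, giving up after max_steps model calls."""
    for _ in range(max_steps):
        reply = call_model(messages)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]  # plain text answer: the loop is done
        messages.append({"role": "assistant", "content": None, "tool_calls": tool_calls})
        for tc in tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": json.dumps(execute(tc["name"], tc["arguments"])),
            })
    # Step budget exhausted: fall back instead of looping forever.
    return "I could not resolve this automatically, so I am handing it to a human agent."


def fake_model(messages):
    # Stand-in for the LLM: request one tool call, then answer.
    if any(m["role"] == "tool" for m in messages):
        return {"content": "Your order has shipped."}
    return {"tool_calls": [{"id": "call_1", "name": "lookup_order",
                            "arguments": {"order_id": "ORD-12345"}}]}


def looping_model(messages):
    # Stand-in for a misbehaving model that never stops calling tools.
    return {"tool_calls": [{"id": "call_x", "name": "lookup_order", "arguments": {}}]}


def fake_execute(name, arguments):
    return {"status": "shipped"}
```

With `fake_model` the loop finishes after one tool round trip; with `looping_model` it stops at the step limit and returns the fallback message.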
Managing Long Context and Conversation History
Production support threads often contain dozens of turns. On token-based platforms, every additional message increases cost. Oxlo.ai uses request-based pricing, so you pay one flat cost per API request regardless of how many tokens are in the prompt. For agentic support workflows that accumulate long transcripts, this can reduce inference costs significantly compared to token-based billing.
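The difference in how the two billing models scale can be worked through with simple arithmetic. In the sketch below, every price and size is a made-up placeholder, not an actual rate from Oxlo.ai or any other provider, and output tokens are ignored for simplicity. The point is the shape of the growth: when each request resends the whole transcript, billed prompt tokens grow roughly quadratically with the number of turns, while a per-request fee grows linearly.

```python
def token_billed_cost(turns, tokens_per_turn, price_per_1k_tokens):
    """Prompt cost when request i resends all i turns accumulated so far."""
    total_prompt_tokens = sum(i * tokens_per_turn for i in range(1, turns + 1))
    return total_prompt_tokens / 1000 * price_per_1k_tokens


def request_billed_cost(turns, price_per_request):
    """Cost when every request has the same flat price."""
    return turns * price_per_request


# Hypothetical numbers, chosen only to illustrate the scaling.
TOKENS_PER_TURN = 200
PRICE_PER_1K_TOKENS = 0.002
PRICE_PER_REQUEST = 0.005

for turns in (10, 20, 40):
    by_token = token_billed_cost(turns, TOKENS_PER_TURN, PRICE_PER_1K_TOKENS)
    by_request = request_billed_cost(turns, PRICE_PER_REQUEST)
    print(f"{turns} turns: token-billed {by_token:.3f}, request-billed {by_request:.3f}")
```

Doubling the number of turns roughly quadruples the token-billed figure but only doubles the request-billed one. Which model is cheaper for a given workload depends on the real prices and thread lengths, so plug in your own numbers.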
Oxlo.ai also provides models with large context windows for keeping entire conversations in memory. Kimi K2.6 supports 131K tokens, and DeepSeek V4 Flash supports 1M tokens, which lets you feed lengthy ticket histories or large retrieved documents without aggressive truncation. For details on plan limits, see the Oxlo.ai pricing page.
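Even with a large context window, it helps to have a guard that keeps the transcript under a chosen budget. The following is a minimal sketch that assumes the first message is the system prompt and that every message has string content. The four-characters-per-token estimate is a rough heuristic, not a real tokenizer, and `estimate_tokens` and `trim_history` are hypothetical helpers.

```python
def estimate_tokens(text):
    # Rough heuristic (assumption): about 4 characters per token for English text.
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens):
    """Keep the system message plus the most recent messages that fit the budget."""
    system, rest = messages[0], messages[1:]
    budget = max_tokens - estimate_tokens(system["content"])
    kept = []
    # Walk backwards so the newest turns are kept first.
    for msg in reversed(rest):
        cost = estimate_tokens(msg["content"] or "")
        if cost > budget:
            break
        kept.append(msg)
        budget -= cost
    # Restore chronological order.
    return [system] + kept[::-1]


demo_history = [
    {"role": "system", "content": "s" * 40},      # about 10 tokens
    {"role": "user", "content": "a" * 400},       # about 100 tokens
    {"role": "assistant", "content": "b" * 40},   # about 10 tokens
    {"role": "user", "content": "c" * 40},        # about 10 tokens
]
trimmed = trim_history(demo_history, max_tokens=35)
```

In this demo the oldest, longest user message is dropped and the two most recent messages survive. Note that this simple version can separate a tool call from its result, so a production version would trim whole call-and-result groups together.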
Deployment Considerations
When moving to production, enable streaming responses so users see text as it generates rather than waiting for the full completion. Oxlo.ai supports standard SSE streaming through the OpenAI SDK. If you need structured output for downstream ticket routing, use JSON mode to constrain the model response to valid JSON.
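With the OpenAI Python SDK, passing `stream=True` to `client.chat.completions.create` returns an iterable of chunks whose text fragments arrive in `chunk.choices[0].delta.content`. The sketch below shows one way to consume such a stream. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only imitate the chunk shape so the example runs without a network call.

```python
from types import SimpleNamespace


def _print_fragment(text):
    # Print without a newline so fragments join into continuous text.
    print(text, end="", flush=True)


def collect_stream(chunks, on_text=_print_fragment):
    """Forward each text fragment as it arrives and return the full reply."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry no choices; skip them
        fragment = chunk.choices[0].delta.content
        if fragment:  # content can be None, e.g. on role-only or final chunks
            on_text(fragment)
            parts.append(fragment)
    return "".join(parts)


def fake_chunk(text):
    # Imitates the attribute layout of a streamed chat completion chunk.
    return SimpleNamespace(
        choices=[SimpleNamespace(delta=SimpleNamespace(content=text))]
    )


demo_chunks = [fake_chunk("Your order "), fake_chunk(None), fake_chunk("has shipped.")]
```

In a real deployment, `on_text` would push each fragment to the user's chat widget, and the returned string would be appended to the conversation history.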
Audio inputs are also common in support. You can route voice messages through Whisper Large v3 on Oxlo.ai and feed the transcript into the same agent loop. Because there are no cold starts on popular models, the first request after a quiet period does not wait for a model to load.
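A transcription step might look like the sketch below, which uses the OpenAI SDK's `client.audio.transcriptions.create` method. The model identifier `"whisper-large-v3"` is an assumption, so check the Oxlo.ai model catalog for the exact name. The fake client exists only so the example runs offline; with a real client you would pass a file opened in binary mode, such as `open("voicemail.wav", "rb")`.

```python
import io
from types import SimpleNamespace


def transcribe_voice_message(client, audio_file, model="whisper-large-v3"):
    """Send a binary audio file object for transcription and return the text.

    The default model name is an assumed identifier, not a confirmed one.
    """
    result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Fake client that mimics only the call path used above (illustration only).
fake_client = SimpleNamespace(
    audio=SimpleNamespace(
        transcriptions=SimpleNamespace(
            create=lambda model, file: SimpleNamespace(text="Where is my order?")
        )
    )
)

demo_transcript = transcribe_voice_message(fake_client, io.BytesIO(b"fake-audio-bytes"))
```

The returned string can then be passed to `run_agent` exactly like a typed message.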
You now have a working pattern for an AI support agent: embed knowledge, define tools, and run a loop that reasons over context and acts on behalf of the user. Oxlo.ai fits this stack naturally. Its OpenAI-compatible SDK means you can adopt it without rewriting client code, its request-based pricing keeps long conversations affordable, and its model catalog covers everything from embeddings to transcription to reasoning. Start with the free tier to prototype your agent.