Today we are building a support ticket triage agent that reads a customer message and returns structured JSON with category, priority, and a draft reply. This is a practical introduction to LLM inference patterns, system prompts, JSON mode, and streaming, all running against Oxlo.ai's OpenAI-compatible API. If you support users at scale, this automates the first line of classification.
What you'll need
- Python 3.10 or newer
- The OpenAI SDK: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai
I will use llama-3.3-70b because it is a strong general-purpose flagship on Oxlo.ai, but you can swap in qwen-3-32b or kimi-k2.6 without changing any other code.
Step 1: Make a basic inference call
First, I verify the client can reach Oxlo.ai and return a plain text completion. I keep the message list minimal: a system instruction and a raw customer message.
from openai import OpenAI

# Point the OpenAI client at Oxlo.ai's OpenAI-compatible endpoint.
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

customer_message = "My account was double-charged on January 12th and I still haven't received the refund. This is urgent."

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": customer_message},
    ],
)
print(response.choices[0].message.content)
Step 2: Design the system prompt
Free-form text is hard to script against. I define a strict system prompt that instructs the model to act as a classifier and to stay within the required schema. Keeping this in its own constant makes it easy to iterate without touching business logic.
SYSTEM_PROMPT = """You are a support ticket triage agent.
Analyze the customer message and produce a JSON object with exactly these keys:
- category: one of Billing, Technical, Account, or General
- priority: one of Low, Medium, High, or Critical
- draft_reply: a concise, professional response under 100 words
Do not include markdown formatting, explanations, or anything outside the JSON object."""
Step 3: Enforce JSON mode
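Many OpenAI-compatible providers support an explicit JSON mode. On Oxlo.ai, passing response_format={"type": "json_object"} asks the API to constrain generation to syntactically valid JSON. JSON mode of this kind covers syntax only, not the keys or allowed values named in the system prompt, so those still deserve a check. It does remove brittle regex parsing and cuts downstream errors.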
import json

customer_message = "My account was double-charged on January 12th and I still haven't received the refund. This is urgent."

# Reuses `client` from Step 1 and SYSTEM_PROMPT from Step 2.
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": customer_message},
    ],
    response_format={"type": "json_object"},
)
result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
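Since JSON mode of this kind generally guarantees syntax rather than your schema, the parsed object can still arrive with a missing key or an off-list category. Here is a minimal sketch of a validator, assuming the key names and allowed values listed in SYSTEM_PROMPT. The helper name validate_triage and the constant names are hypothetical, introduced only for this example, and are not part of the OpenAI SDK or Oxlo.ai's API.

```python
import json

# Allowed values copied by hand from SYSTEM_PROMPT; keep the two in sync.
ALLOWED_CATEGORIES = ("Billing", "Technical", "Account", "General")
ALLOWED_PRIORITIES = ("Low", "Medium", "High", "Critical")
REQUIRED_KEYS = {"category", "priority", "draft_reply"}


def validate_triage(raw: str) -> dict:
    """Parse a model response and check it against the triage schema.

    Raises ValueError on any problem (json.JSONDecodeError is a subclass of it).
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if data["category"] not in ALLOWED_CATEGORIES:
        raise ValueError(f"unknown category: {data['category']!r}")
    if data["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError(f"unknown priority: {data['priority']!r}")
    if not isinstance(data["draft_reply"], str) or not data["draft_reply"].strip():
        raise ValueError("draft_reply must be a non-empty string")
    return data
```

In the snippet above, you could call validate_triage(response.choices[0].message.content) in place of the bare json.loads call.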
Step 4: Stream tokens
For an agent that runs inside a web service, waiting for the full response feels slow. I enable streaming so tokens arrive as they are generated. The snippet below accumulates chunks and parses the JSON only after the stream ends.
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": customer_message},
    ],
    response_format={"type": "json_object"},
    stream=True,
)
content = ""
for chunk in response:
    if not chunk.choices:  # skip any chunk that carries no choices
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        content += delta
# Parse only after the stream ends; a partial JSON document is not valid JSON.
result = json.loads(content)
print(json.dumps(result, indent=2))
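In a web service you usually want to forward each fragment to the client as it arrives while still keeping the full text for parsing. The sketch below separates those concerns with a small generator. The name iter_deltas is hypothetical, and the fake_chunk objects are stand-ins that only mimic the attribute shape the loop above reads (choices[0].delta.content), so the example runs without a network call.

```python
import json
from types import SimpleNamespace
from typing import Iterable, Iterator


def iter_deltas(stream: Iterable) -> Iterator[str]:
    """Yield non-empty text fragments from a chat completion stream."""
    for chunk in stream:
        if not chunk.choices:  # tolerate chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            yield delta


def fake_chunk(text):
    """Stand-in with the same attribute shape as a streamed chunk (local testing only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


# Simulated stream: one JSON object split across chunks, plus chunks without text.
fake_stream = [
    fake_chunk('{"category": "Bil'),
    fake_chunk('ling", "priority": "High", '),
    SimpleNamespace(choices=[]),
    fake_chunk('"draft_reply": "On it."}'),
    fake_chunk(None),
]

parts = []
for piece in iter_deltas(fake_stream):
    parts.append(piece)  # in a real service, also forward `piece` to the client here

result = json.loads("".join(parts))
print(result["category"])
```

With a real call, you would pass the object returned by client.chat.completions.create(..., stream=True) to iter_deltas instead of fake_stream.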
Step 5: Handle long context
Real tickets often contain ten-message threads. On token-based providers, a long input can make a single inference call expensive. Oxlo.ai uses flat per-request pricing, so the cost stays the same whether the prompt is 200 tokens or 20,000. I pass the entire thread as a single user message to show how inference scales.
long_thread = """Subject: Integration failing after update
Customer: The API returns 403 after yesterday's deploy.
Support: Are you using the new v2 header?
Customer: Yes, but only on the staging endpoint. Production still uses v1.
Support: v1 was deprecated. Please rotate your key and retry.
Customer: I rotated it and now both environments return 403. Full logs attached..."""

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": long_thread},
    ],
    response_format={"type": "json_object"},
)
result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
Because Oxlo.ai charges per request rather than per token, long-context triage like this can work out cheaper than on token-based providers, where cost grows with the length of the thread. You can see the exact pricing at https://oxlo.ai/pricing.
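In practice a thread usually arrives from the ticketing system as a list of messages, not a pre-formatted string. Here is a small sketch of flattening one into the single user message used above. The (speaker, text) tuple shape and the helper name format_thread are assumptions for illustration. The optional max_chars argument is a crude character cap for models with a limited context window, not a token count.

```python
from typing import Optional


def format_thread(subject: str, turns: list[tuple[str, str]], max_chars: Optional[int] = None) -> str:
    """Flatten a ticket thread into one string, oldest message first.

    If max_chars is set, the subject line is always kept and the oldest turns
    are dropped until the text fits (or no turns remain).
    """
    header = f"Subject: {subject}"
    lines = [f"{speaker}: {text.strip()}" for speaker, text in turns]
    if max_chars is not None:
        # Drop the oldest turns first: the latest messages usually describe the current state.
        while lines and len("\n".join([header, *lines])) > max_chars:
            lines.pop(0)
    return "\n".join([header, *lines])


turns = [
    ("Customer", "The API returns 403 after yesterday's deploy."),
    ("Support", "Are you using the new v2 header?"),
    ("Customer", "Yes, but only on the staging endpoint."),
]
thread_text = format_thread("Integration failing after update", turns)
print(thread_text)
```

The resulting thread_text can be passed as the user message content in the same way as long_thread.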
Step 6: Package the agent
I now wrap everything into a clean function that accepts a message string and returns a validated dict. This is the version I would actually import into a FastAPI route or Celery task.
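def triage_ticket(message: str, model: str = "llama-3.3-70b") -> dict:
    """Classify one customer message and return the parsed, validated result."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        response_format={"type": "json_object"},
    )
    result = json.loads(response.choices[0].message.content)
    # Valid JSON is not necessarily the right shape, so check the expected keys.
    required = {"category", "priority", "draft_reply"}
    if not isinstance(result, dict) or not required.issubset(result):
        raise ValueError(f"Unexpected triage response: {result!r}")
    return result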
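Model output is not deterministic, so a call that fails to parse once may well succeed on a second attempt. One hedged way to harden the function is a small retry wrapper. The name triage_with_retry is hypothetical, and the wrapper takes any callable, so it can wrap triage_ticket without changing it. It retries only on ValueError, which covers json.JSONDecodeError; network or API errors are left to propagate.

```python
from typing import Callable


def triage_with_retry(triage_fn: Callable[[str], dict], message: str, attempts: int = 3) -> dict:
    """Call triage_fn(message), retrying when it raises ValueError.

    json.JSONDecodeError is a subclass of ValueError, so malformed JSON is retried too.
    Other exceptions (for example network errors) are not caught here.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return triage_fn(message)
        except ValueError as exc:
            last_error = exc  # remember the failure and try again
    raise RuntimeError(f"triage failed after {attempts} attempts") from last_error


# Usage with the function defined above (requires a live API key, so left commented out):
# out = triage_with_retry(triage_ticket, "My account was double-charged.")
```

Retrying costs one extra request per failed attempt, so a low default such as three attempts is a reasonable starting point to tune from.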
Run it
Here is a short script that triages two tickets and prints the results. The second ticket is intentionally vague to test how the model handles edge cases.
if __name__ == "__main__":
    tickets = [
        "My account was double-charged on January 12th and I still haven't received the refund. This is urgent.",
        "I am not sure which plan I should pick. Can you help me decide?",
    ]
    for ticket in tickets:
        out = triage_ticket(ticket)
        print(f"Ticket: {ticket[:50]}...")  # show only the first 50 characters
        print(json.dumps(out, indent=2))
        print()
Example output (the model's wording will vary between runs):
Ticket: My account was double-charged on January 12th and ...
{
  "category": "Billing",
  "priority": "High",
  "draft_reply": "I apologize for the double charge. I have escalated this to our billing team and you should see the refund within 2 business days."
}

Ticket: I am not sure which plan I should pick. Can you he...
{
  "category": "General",
  "priority": "Low",
  "draft_reply": "I would be happy to help you choose the right plan. Could you tell me how many team members you have and your expected usage volume?"
}
Next steps
Replace the static SYSTEM_PROMPT with a dynamic prompt loaded from a template engine like Jinja2 so you can A/B test triage strategies without redeploying code. After that, add function calling so the agent can query your CRM or refund API directly through Oxlo.ai's tool-use support.
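To make the templating idea concrete without adding a dependency, the sketch below uses string.Template from the Python standard library as a stand-in; it is not Jinja2, though the same structure carries over. The variant names, the VARIANTS table, and render_prompt are hypothetical, and the template text mirrors the SYSTEM_PROMPT from Step 2.

```python
from string import Template

PROMPT_TEMPLATE = Template(
    "You are a support ticket triage agent.\n"
    "Analyze the customer message and produce a JSON object with exactly these keys:\n"
    "- category: one of $categories\n"
    "- priority: one of $priorities\n"
    "- draft_reply: a concise, professional response under $max_words words\n"
    "Do not include markdown formatting, explanations, or anything outside the JSON object."
)

# Hypothetical variants for an A/B test; only the reply length differs here.
VARIANTS = {
    "A": {"max_words": 100},
    "B": {"max_words": 60},
}


def render_prompt(variant: str) -> str:
    """Render the system prompt for one named variant."""
    params = VARIANTS[variant]
    return PROMPT_TEMPLATE.substitute(
        categories="Billing, Technical, Account, or General",
        priorities="Low, Medium, High, or Critical",
        max_words=params["max_words"],
    )


print(render_prompt("B"))
```

Loading VARIANTS from a config file or feature-flag service instead of a module-level dict is what would let you switch strategies without redeploying.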