We'll build a production-ready text classification pipeline that categorizes support tickets by topic and sentiment in a single structured call. This replaces brittle keyword classifiers and scales to any taxonomy you define without retraining.
What you'll need
- Python 3.10 or newer
- The OpenAI SDK: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai
Step 1: Initialize the Oxlo.ai client
I always start by verifying the connection with a lightweight request before burning requests on heavier workloads. Oxlo.ai exposes a fully OpenAI-compatible endpoint, so the client setup only needs a base URL and an API key.
from openai import OpenAI
import json
import time
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)
# Quick connectivity check
response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5
)
print(response.choices[0].message.content)
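Hardcoding the key is fine for a first run, but a script that gets shared or committed is safer reading it from the environment. Here is a minimal sketch, assuming a hypothetical environment variable named OXLO_API_KEY; the name is chosen for illustration and is not something Oxlo.ai mandates.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is unset.

    The default variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, the client can be built with api_key=load_api_key() instead of the literal string.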
Step 2: Lock down the taxonomy and system prompt
The prompt is the contract. I force JSON mode so downstream code never has to parse free text. The categories and sentiment labels are explicit, and I require a confidence score so I can flag low-trust predictions for human review.
SYSTEM_PROMPT = """You are a text classification engine.
Analyze the user's message and return a JSON object with exactly these keys:
- category: one of [Billing, Technical, Account, General]
- sentiment: one of [Positive, Neutral, Negative]
- confidence: a float between 0.0 and 1.0
- reasoning: one sentence explaining the classification
Respond with valid JSON only. No markdown fences, no commentary."""
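A prompt-level contract is only a request, so it can help to check the reply against the same taxonomy in code. The sketch below mirrors the labels from the prompt; the helper name validate_prediction and the two constants are hypothetical and not part of any SDK.

```python
# Labels copied from the system prompt; keep them in sync if the taxonomy changes.
ALLOWED_CATEGORIES = ("Billing", "Technical", "Account", "General")
ALLOWED_SENTIMENTS = ("Positive", "Neutral", "Negative")


def validate_prediction(pred: dict) -> list:
    """Return a list of contract violations; an empty list means the output is valid."""
    problems = []
    if pred.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unexpected category: {pred.get('category')!r}")
    if pred.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append(f"unexpected sentiment: {pred.get('sentiment')!r}")
    confidence = pred.get("confidence")
    # bool is a subclass of int, so it has to be excluded explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence is not a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence}")
    if not isinstance(pred.get("reasoning"), str):
        problems.append("reasoning is missing or not a string")
    return problems
```

Any prediction that comes back with a non-empty list can be sent to the same manual review queue as a low-confidence one.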
Step 3: Build the structured classifier
I wrap the call in a small function that enforces JSON mode and parses the reply into a dict. I use Llama 3.3 70B here because it follows structured instructions reliably and has no cold starts on Oxlo.ai.
def classify_text(text: str, model: str = "llama-3.3-70b") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
        max_tokens=256
    )
    raw = response.choices[0].message.content
    return json.loads(raw)
# Test on a single ticket
sample = "I was charged twice last month and the refund still hasn't shown up."
result = classify_text(sample)
print(json.dumps(result, indent=2))
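Calling json.loads on the reply directly assumes the model always returns a bare JSON object. As a hedged sketch, the hypothetical helper below (parse_model_json is not an SDK function) tolerates an empty reply or a stray Markdown fence, which some models emit even when told not to.

```python
import json
from typing import Optional

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable


def parse_model_json(raw: Optional[str]) -> dict:
    """Parse a model reply into a dict, tolerating an optional Markdown code fence."""
    if not raw or not raw.strip():
        raise ValueError("model returned an empty response")
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        # and the closing fence line if present.
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip().startswith(FENCE):
            lines = lines[:-1]
        text = "\n".join(lines)
    parsed = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(parsed, dict):
        raise ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed
```

Swapping this in for the final json.loads(raw) call would keep the function signature unchanged. Since json.JSONDecodeError is a subclass of ValueError, callers can catch ValueError alone.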
Step 4: Prepare a labeled benchmark dataset
To run a comparative study, I need a small ground-truth set. I created five support tickets with known expected labels. This lets us measure accuracy across models rather than eyeballing outputs.
BENCHMARK = [
    {"text": "I was charged twice last month and the refund still hasn't shown up.", "expected_category": "Billing", "expected_sentiment": "Negative"},
    {"text": "How do I export my data to CSV? The docs mention an API but I can't find the button.", "expected_category": "Technical", "expected_sentiment": "Neutral"},
    {"text": "Your team resolved my outage in under ten minutes. Incredible support.", "expected_category": "Technical", "expected_sentiment": "Positive"},
    {"text": "Can I add a second admin to my account without upgrading?", "expected_category": "Account", "expected_sentiment": "Neutral"},
    {"text": "Just wanted to say the new dashboard looks great!", "expected_category": "General", "expected_sentiment": "Positive"},
]
Step 5: Run the comparative benchmark across models
Now I run the same workload through three different Oxlo.ai models: Llama 3.3 70B for general instruction following, Qwen 3 32B for multilingual reasoning, and DeepSeek V3.2 for coding-adjacent structured outputs. Because Oxlo.ai uses per-request pricing, running this benchmark on long tickets costs the same flat rate regardless of prompt length.
MODELS = ["llama-3.3-70b", "qwen-3-32b", "deepseek-v3.2"]
def evaluate_model(model: str) -> dict:
    correct = 0
    total = len(BENCHMARK)
    latencies = []
    for item in BENCHMARK:
        start = time.time()
        try:
            pred = classify_text(item["text"], model=model)
            latency = time.time() - start
            latencies.append(latency)
            cat_match = pred.get("category") == item["expected_category"]
            sent_match = pred.get("sentiment") == item["expected_sentiment"]
            if cat_match and sent_match:
                correct += 1
            print(f"[{model}] {pred} | latency={latency:.2f}s")
        except Exception as e:
            print(f"[{model}] ERROR: {e}")
    return {
        "model": model,
        "accuracy": correct / total,
        "avg_latency": sum(latencies) / len(latencies) if latencies else 0
    }
results = [evaluate_model(m) for m in MODELS]
print("\n--- Comparative Results ---")
for r in results:
    print(f"{r['model']}: accuracy={r['accuracy']:.0%}, avg_latency={r['avg_latency']:.2f}s")
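The benchmark counts a ticket as correct only when both category and sentiment match, so a model that gets the category right but misses the sentiment scores the same as one that misses both. Here is a small sketch of per-field scoring, assuming predictions have already been collected into a list. The helper name score_predictions is hypothetical.

```python
def score_predictions(predictions: list, expected: list) -> dict:
    """Compute category, sentiment, and joint accuracy over paired predictions."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    total = len(expected)
    if total == 0:
        return {"category": 0.0, "sentiment": 0.0, "joint": 0.0}
    cat_hits = sent_hits = joint_hits = 0
    for pred, item in zip(predictions, expected):
        cat_ok = pred.get("category") == item["expected_category"]
        sent_ok = pred.get("sentiment") == item["expected_sentiment"]
        cat_hits += int(cat_ok)
        sent_hits += int(sent_ok)
        joint_hits += int(cat_ok and sent_ok)
    return {
        "category": cat_hits / total,
        "sentiment": sent_hits / total,
        "joint": joint_hits / total,
    }
```

The joint value uses the same both-fields-match rule as the accuracy figure above. The two per-field values show which half of the task a model struggles with.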
Step 6: Add a confidence threshold filter
In production, I do not blindly trust every label. I add a confidence gate: anything below 0.85 gets a second opinion from a different model family and is flagged for manual review. This is where the per-request pricing on Oxlo.ai pays off: adding retry logic or ensemble calls does not inflate costs the way token-based billing would.
def classify_with_fallback(text: str, threshold: float = 0.85) -> dict:
    primary = classify_text(text, model="llama-3.3-70b")
    if primary.get("confidence", 0) < threshold:
        # Secondary opinion from a different model family
        secondary = classify_text(text, model="kimi-k2.6")
        primary["fallback_review"] = secondary
        primary["review_flag"] = True
    else:
        primary["review_flag"] = False
    return primary
# Test a borderline case
borderline = "The thing with the stuff isn't working right."
print(json.dumps(classify_with_fallback(borderline), indent=2))
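Step 6 mentions retry logic without showing it. One possible shape is a generic wrapper with exponential backoff, sketched below. The name with_retries and its parameters are hypothetical, and the sleep function is injectable so the wrapper can be tested without waiting.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn until it succeeds, waiting base_delay * 2**n after the n-th failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")  # the loop always returns or raises
```

A call such as with_retries(lambda: classify_text(ticket)) would wrap the classifier. This sketch retries on every exception, so real use would probably narrow the except clause to transient network or rate-limit errors.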
Run it
Putting it all together, I run the full pipeline on a fresh ticket stream. Here is the main entrypoint and the output I see on my end.
if __name__ == "__main__":
    new_tickets = [
        "My SAML SSO stopped working after the weekend deploy.",
        "Love the new dark mode, but can you add keyboard shortcuts?",
        "Invoice #9922 has the wrong tax rate applied.",
    ]
    for ticket in new_tickets:
        out = classify_text(ticket)
        # Treat a missing confidence as 0 so it is flagged, matching the gate in Step 6
        flag = "REVIEW" if out.get("confidence", 0) < 0.85 else "OK"
        print(f"[{flag}] {out['category']} | {out['sentiment']} | {ticket[:50]}...")
Example output:
[OK] Technical | Negative | My SAML SSO stopped working after the weeken...
[OK] General | Positive | Love the new dark mode, but can you add keyb...
[OK] Billing | Neutral | Invoice #9922 has the wrong tax rate applied....
Wrap-up
You now have a working classifier that benchmarks multiple models on Oxlo.ai and gates low-confidence predictions. Two concrete next steps: wire this into a FastAPI endpoint so your support stack can call it in real time, or expand the taxonomy and use the flagged low-confidence samples to fine-tune a smaller specialist model. For pricing details on running this at scale, see https://oxlo.ai/pricing.