shashank ms

Posted on Jun 22

LLM for Text Classification and Sentiment Analysis: A Comparative Study

#learnai #oxlo #ai

We'll build a production-ready text classification pipeline that categorizes support tickets by topic and sentiment in a single structured call. This replaces brittle keyword classifiers and scales to any taxonomy you define without retraining.

What you'll need

Python 3.10 or newer
The OpenAI SDK: pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai

Step 1: Initialize the Oxlo.ai client

I always start by verifying the connection with a lightweight model before burning requests on heavier workloads. Oxlo.ai exposes a fully OpenAI-compatible endpoint, so the client setup is a one-liner.

from openai import OpenAI
import json
import time

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

# Quick connectivity check
response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=5
)
print(response.choices[0].message.content)
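Hardcoding the key is fine for a first run, but it is easy to commit by accident. One option is to read it from the environment instead. This is a minimal sketch: the variable name `OXLO_API_KEY` and the helper `load_api_key` are assumptions for illustration, not something Oxlo.ai prescribes.

```python
import os
from collections.abc import Mapping


def load_api_key(env: Mapping[str, str] = os.environ,
                 name: str = "OXLO_API_KEY") -> str:
    """Return the API key stored under `name`, or fail with a clear message.

    `name` is a hypothetical variable name chosen for this example.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

With this in place, the client setup above could pass `api_key=load_api_key()` instead of a string literal.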

Step 2: Lock down the taxonomy and system prompt

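The prompt is the contract. It asks for JSON only, and the classifier in Step 3 pairs it with JSON mode so downstream code never has to parse free text. The categories and sentiment labels are explicit, and I require a confidence score so I can flag low-trust predictions for human review. That score is self-reported by the model rather than a calibrated probability, so treat any threshold on it as a value to tune.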

SYSTEM_PROMPT = """You are a text classification engine.
Analyze the user's message and return a JSON object with exactly these keys:
- category: one of [Billing, Technical, Account, General]
- sentiment: one of [Positive, Neutral, Negative]
- confidence: a float between 0.0 and 1.0
- reasoning: one sentence explaining the classification

Respond with valid JSON only. No markdown fences, no commentary."""
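A prompt can state the contract, but only code can check it. The sketch below validates a parsed reply against the four keys listed in the prompt. The function name `validate_prediction` and the two constant tuples are hypothetical helpers added for illustration; the allowed values are copied from the prompt above.

```python
ALLOWED_CATEGORIES = ("Billing", "Technical", "Account", "General")
ALLOWED_SENTIMENTS = ("Positive", "Neutral", "Negative")


def validate_prediction(pred: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the reply is usable."""
    problems = []
    if pred.get("category") not in ALLOWED_CATEGORIES:
        problems.append(f"unknown category: {pred.get('category')!r}")
    if pred.get("sentiment") not in ALLOWED_SENTIMENTS:
        problems.append(f"unknown sentiment: {pred.get('sentiment')!r}")

    confidence = pred.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        problems.append("confidence is not a number")
    elif not 0.0 <= confidence <= 1.0:
        problems.append(f"confidence out of range: {confidence}")

    if not isinstance(pred.get("reasoning"), str):
        problems.append("reasoning is missing or not a string")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to log everything that went wrong with a single reply.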

Step 3: Build the structured classifier

I wrap the call in a small function that requests JSON mode through response_format and parses the reply into a dict. I use Llama 3.3 70B here because it follows structured instructions reliably and has no cold starts on Oxlo.ai.

def classify_text(text: str, model: str = "llama-3.3-70b") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0.1,
        max_tokens=256
    )
    
    raw = response.choices[0].message.content
    return json.loads(raw)

# Test on a single ticket
sample = "I was charged twice last month and the refund still hasn't shown up."
result = classify_text(sample)
print(json.dumps(result, indent=2))
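The classifier above hands the raw reply straight to `json.loads`, so a single malformed reply raises an exception. In case a model ignores the instructions and wraps its answer in a markdown fence, a slightly more tolerant parser could look like the sketch below. `parse_json_reply` is a hypothetical helper, and the fence-stripping is an assumption about a common failure mode, not documented Oxlo.ai behavior.

```python
import json
import re

# Build the fence marker programmatically to keep this snippet easy to embed.
_FENCE = "`" * 3
_FENCE_RE = re.compile(rf"^{_FENCE}(?:json)?\s*|\s*{_FENCE}$", re.IGNORECASE)


def parse_json_reply(raw: str) -> dict:
    """Parse a model reply as a JSON object, tolerating a surrounding markdown fence."""
    cleaned = _FENCE_RE.sub("", raw.strip())
    data = json.loads(cleaned)  # raises json.JSONDecodeError on malformed input
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return data
```

Swapping `json.loads(raw)` for `parse_json_reply(raw)` keeps the happy path identical and turns non-object replies into an explicit error.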

Step 4: Prepare a labeled benchmark dataset

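To run a comparative study, I need a small ground-truth set. I created five support tickets with known expected labels, which lets me measure accuracy across models rather than eyeball outputs. Five tickets are enough to smoke-test the pipeline but far too few to rank models with any statistical confidence, so grow the set before drawing conclusions.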

BENCHMARK = [
    {"text": "I was charged twice last month and the refund still hasn't shown up.", "expected_category": "Billing", "expected_sentiment": "Negative"},
    {"text": "How do I export my data to CSV? The docs mention an API but I can't find the button.", "expected_category": "Technical", "expected_sentiment": "Neutral"},
    {"text": "Your team resolved my outage in under ten minutes. Incredible support.", "expected_category": "Technical", "expected_sentiment": "Positive"},
    {"text": "Can I add a second admin to my account without upgrading?", "expected_category": "Account", "expected_sentiment": "Neutral"},
    {"text": "Just wanted to say the new dashboard looks great!", "expected_category": "General", "expected_sentiment": "Positive"},
]
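Before spending requests on a benchmark, it can help to confirm that every expected label is actually in the taxonomy and to see how the labels are distributed. A minimal sketch, assuming items shaped like the entries above; `check_benchmark` and the two constant tuples are hypothetical names.

```python
from collections import Counter

CATEGORIES = ("Billing", "Technical", "Account", "General")
SENTIMENTS = ("Positive", "Neutral", "Negative")


def check_benchmark(items: list[dict]) -> dict:
    """Fail fast on labels outside the taxonomy and report label counts."""
    for i, item in enumerate(items):
        if item["expected_category"] not in CATEGORIES:
            raise ValueError(f"item {i}: bad category {item['expected_category']!r}")
        if item["expected_sentiment"] not in SENTIMENTS:
            raise ValueError(f"item {i}: bad sentiment {item['expected_sentiment']!r}")
    return {
        "categories": Counter(item["expected_category"] for item in items),
        "sentiments": Counter(item["expected_sentiment"] for item in items),
    }
```

Calling `check_benchmark(BENCHMARK)` on the five tickets above reports Technical twice, every other category once, and only a single Negative example, which is a reminder of how thin the coverage is.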

Step 5: Run the comparative benchmark across models

Now I run the same workload through three different Oxlo.ai models: Llama 3.3 70B for general instruction following, Qwen 3 32B for multilingual reasoning, and DeepSeek V3.2 for coding-adjacent structured outputs. Because Oxlo.ai uses per-request pricing, running this benchmark on long tickets costs the same flat rate regardless of prompt length.

MODELS = ["llama-3.3-70b", "qwen-3-32b", "deepseek-v3.2"]

def evaluate_model(model: str) -> dict:
    correct = 0
    total = len(BENCHMARK)
    latencies = []
    
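    for item in BENCHMARK:
        # perf_counter is monotonic, which makes it safer than time.time() for timing
        start = time.perf_counter()
        try:
            pred = classify_text(item["text"], model=model)
            latency = time.perf_counter() - start
            latencies.append(latency)

            # A ticket counts as correct only when both labels match the ground truth
            cat_match = pred.get("category") == item["expected_category"]
            sent_match = pred.get("sentiment") == item["expected_sentiment"]
            if cat_match and sent_match:
                correct += 1

            print(f"[{model}] {pred} | latency={latency:.2f}s")
        except Exception as e:
            # Failed calls stay in the denominator, so they count as misses
            print(f"[{model}] ERROR: {e}")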
    
    return {
        "model": model,
        "accuracy": correct / total,
        "avg_latency": sum(latencies) / len(latencies) if latencies else 0
    }

results = [evaluate_model(m) for m in MODELS]

print("\n--- Comparative Results ---")
for r in results:
    print(f"{r['model']}: accuracy={r['accuracy']:.0%}, avg_latency={r['avg_latency']:.2f}s")
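The benchmark above counts a ticket as correct only when both labels match, which hides whether a model is failing on category or on sentiment. A small sketch of a per-field breakdown follows. `score_fields` is a hypothetical helper, and it assumes the predictions were collected in the same order as the ground-truth items.

```python
def score_fields(predictions: list[dict], gold: list[dict]) -> dict:
    """Compute category, sentiment, and joint accuracy over paired predictions."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    n = len(gold)
    if n == 0:
        return {"category": 0.0, "sentiment": 0.0, "joint": 0.0}

    cat_hits = sent_hits = joint_hits = 0
    for pred, item in zip(predictions, gold):
        cat_ok = pred.get("category") == item["expected_category"]
        sent_ok = pred.get("sentiment") == item["expected_sentiment"]
        cat_hits += cat_ok      # True counts as 1, False as 0
        sent_hits += sent_ok
        joint_hits += cat_ok and sent_ok

    return {
        "category": cat_hits / n,
        "sentiment": sent_hits / n,
        "joint": joint_hits / n,
    }
```

The joint figure corresponds to the accuracy reported by the benchmark loop when no calls fail; the other two show which label is dragging it down.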

Step 6: Add a confidence threshold filter

In production, I do not blindly trust every label. I add a confidence gate: anything below 0.85 gets a second opinion from a different model family and is flagged for manual review. This is where the per-request pricing on Oxlo.ai pays off: adding retry logic or ensemble calls does not inflate costs the way token-based billing would.

def classify_with_fallback(text: str, threshold: float = 0.85) -> dict:
    primary = classify_text(text, model="llama-3.3-70b")
    
    if primary.get("confidence", 0) < threshold:
        # Secondary opinion from a different model family
        secondary = classify_text(text, model="kimi-k2.6")
        primary["fallback_review"] = secondary
        primary["review_flag"] = True
    else:
        primary["review_flag"] = False
    
    return primary

# Test a borderline case
borderline = "The thing with the stuff isn't working right."
print(json.dumps(classify_with_fallback(borderline), indent=2))
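The function above always flags a ticket once the primary model falls below the threshold, and leaves it to the reviewer to compare the two opinions. One possible refinement, offered as an alternative policy rather than the pipeline's actual behavior, is to skip the review when a confident secondary model agrees on both labels. `needs_review` is a hypothetical helper and assumes both dicts carry numeric confidence values.

```python
from typing import Optional


def needs_review(primary: dict, secondary: Optional[dict],
                 threshold: float = 0.85) -> bool:
    """Decide whether a human should look at the ticket.

    Policy (an assumption for this sketch): a confident primary passes;
    otherwise the ticket passes only if a confident secondary opinion
    agrees on both category and sentiment.
    """
    if primary.get("confidence", 0) >= threshold:
        return False
    if secondary is None:
        return True
    same_labels = (
        primary.get("category") == secondary.get("category")
        and primary.get("sentiment") == secondary.get("sentiment")
    )
    return not (same_labels and secondary.get("confidence", 0) >= threshold)
```

Whether agreement between two models is enough to bypass a human is a product decision; checking it against flagged samples you have already reviewed is a sensible first step.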

Run it

Putting it all together, I run the full pipeline on a fresh ticket stream. Here is the main entrypoint and the output I see on my end.

if __name__ == "__main__":
    new_tickets = [
        "My SAML SSO stopped working after the weekend deploy.",
        "Love the new dark mode, but can you add keyboard shortcuts?",
        "Invoice #9922 has the wrong tax rate applied.",
    ]
    
    for ticket in new_tickets:
        # Use the gated classifier so low-confidence tickets get a second opinion
        out = classify_with_fallback(ticket)
        flag = "REVIEW" if out["review_flag"] else "OK"
        print(f"[{flag}] {out.get('category')} | {out.get('sentiment')} | {ticket[:50]}...")

Example output:

[OK] Technical | Negative | My SAML SSO stopped working after the weekend depl...
[OK] General | Positive | Love the new dark mode, but can you add keyboard s...
[OK] Billing | Neutral | Invoice #9922 has the wrong tax rate applied....

Wrap-up

You now have a working classifier that benchmarks multiple models on Oxlo.ai and gates low-confidence predictions. Two concrete next steps: wire this into a FastAPI endpoint so your support stack can call it in real time, or expand the taxonomy and use the flagged low-confidence samples to fine-tune a smaller specialist model. For pricing details on running this at scale, see https://oxlo.ai/pricing.
