shashank ms

Posted on Jun 19

Introduction to LLM Inference: Basics and Best Practices

#learnai #oxlo #ai

Today we are building a support ticket triage agent that reads a customer message and returns structured JSON with category, priority, and a draft reply. This is a practical introduction to LLM inference patterns, system prompts, JSON mode, and streaming, all running against Oxlo.ai's OpenAI-compatible API. If you support users at scale, this automates the first line of classification.

What you'll need

Python 3.10 or newer
The OpenAI SDK: pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai

I will use llama-3.3-70b because it is a strong general-purpose flagship on Oxlo.ai, but you can swap in qwen-3-32b or kimi-k2.6 without changing any other code.

Step 1: Make a basic inference call

First, I verify the client can reach Oxlo.ai and return a plain text completion. I keep the message list minimal: a system instruction and a raw customer message.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

customer_message = "My account was double-charged on January 12th and I still haven't received the refund. This is urgent."

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful support assistant."},
        {"role": "user", "content": customer_message},
    ],
)

print(response.choices[0].message.content)
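Hardcoding the key in source is fine for a first test but easy to leak through version control. A minimal sketch of reading it from the environment instead; the variable name OXLO_API_KEY and the helper load_api_key are my own choices for illustration, not something Oxlo.ai prescribes.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With this in place, the client line above would become OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key()).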

Step 2: Design the system prompt

Free-form text is hard to script against. I define a strict system prompt that tells the model to act as a classifier and to stick to the required schema. A prompt is an instruction rather than a guarantee, so the output still deserves a check downstream. Keeping the prompt in its own constant makes it easy to iterate without touching business logic.

SYSTEM_PROMPT = """You are a support ticket triage agent.
Analyze the customer message and produce a JSON object with exactly these keys:
- category: one of Billing, Technical, Account, or General
- priority: one of Low, Medium, High, or Critical
- draft_reply: a concise, professional response under 100 words
Do not include markdown formatting, explanations, or anything outside the JSON object."""
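If the allowed categories and priorities are likely to change, one option is to generate the prompt from constants so the prompt text and any later checks share a single source of truth. This is a sketch of that idea; CATEGORIES, PRIORITIES, one_of, and build_system_prompt are hypothetical names introduced here.

```python
CATEGORIES = ("Billing", "Technical", "Account", "General")
PRIORITIES = ("Low", "Medium", "High", "Critical")


def one_of(values: tuple) -> str:
    """Render ('A', 'B', 'C') as 'A, B, or C'."""
    return ", ".join(values[:-1]) + ", or " + values[-1]


def build_system_prompt(categories=CATEGORIES, priorities=PRIORITIES, max_words=100) -> str:
    """Build the triage prompt from the allowed values."""
    return (
        "You are a support ticket triage agent.\n"
        "Analyze the customer message and produce a JSON object with exactly these keys:\n"
        f"- category: one of {one_of(categories)}\n"
        f"- priority: one of {one_of(priorities)}\n"
        f"- draft_reply: a concise, professional response under {max_words} words\n"
        "Do not include markdown formatting, explanations, or anything outside the JSON object."
    )
```

Called with its defaults, build_system_prompt() produces the same text as the SYSTEM_PROMPT constant above.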

Step 3: Enforce JSON mode

Many providers support an explicit JSON mode. On Oxlo.ai, passing response_format={"type": "json_object"} constrains generation to syntactically valid JSON. JSON mode of this kind guarantees syntax, not your schema, so the keys and allowed values still come from the system prompt. It does remove brittle regex parsing and cuts downstream errors.

import json

customer_message = "My account was double-charged on January 12th and I still haven't received the refund. This is urgent."

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": customer_message},
    ],
    response_format={"type": "json_object"},
)

result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))
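Valid JSON is not necessarily the JSON you asked for: the model can still return an unexpected category or omit a key. A minimal validation sketch, assuming the three keys and value sets from the system prompt; parse_triage and ALLOWED are hypothetical names for illustration.

```python
import json

ALLOWED = {
    "category": {"Billing", "Technical", "Account", "General"},
    "priority": {"Low", "Medium", "High", "Critical"},
}
REQUIRED_KEYS = {"category", "priority", "draft_reply"}


def parse_triage(raw: str) -> dict:
    """Parse the model output and check it against the triage schema."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key, allowed in ALLOWED.items():
        value = data[key]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"unexpected {key}: {value!r}")
    reply = data["draft_reply"]
    if not isinstance(reply, str) or not reply.strip():
        raise ValueError("draft_reply must be a non-empty string")
    return data
```

Swapping json.loads(...) for parse_triage(...) in the snippet above turns a silent schema drift into an exception you can log or retry on.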

Step 4: Stream tokens

For an agent that runs inside a web service, waiting for the full response feels slow. I enable streaming so tokens arrive as they are generated. A partial JSON object cannot be parsed, so the snippet below accumulates chunks and parses the JSON only after the stream ends.

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": customer_message},
    ],
    response_format={"type": "json_object"},
    stream=True,
)

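content = ""
for chunk in response:
    # Some providers emit chunks with no choices (for example a final usage chunk).
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:  # delta is None on role-only and final chunks
        content += delta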

result = json.loads(content)
print(json.dumps(result, indent=2))
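Accumulating silently gives the caller no benefit from streaming. One way to get that benefit is to forward each token to a callback (a websocket send, a server-sent event, a progress log) while still collecting the full text for parsing. The sketch below assumes chunks shaped like the OpenAI SDK's streaming objects; collect_stream and fake_chunk are hypothetical helpers, and the fake stream stands in for a real response so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(chunks, on_token=None) -> str:
    """Join streamed deltas into one string, optionally forwarding each token."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
            if on_token is not None:
                on_token(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in object with the same attribute path as a real chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk('{"category": '),
    fake_chunk(None),
    SimpleNamespace(choices=[]),
    fake_chunk('"Billing"}'),
]
print(collect_stream(fake_stream))  # prints {"category": "Billing"}
```

With a real stream you would call collect_stream(response, on_token=...) and pass the returned string to json.loads.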

Step 5: Handle long context

Real tickets often contain threads of ten or so messages. On token-based providers, a long input can make a single inference call expensive. Oxlo.ai uses flat per-request pricing, so the cost stays the same whether the prompt is 200 tokens or 20,000. I pass the entire thread as a single user message so the model sees the full history when it classifies the ticket.

long_thread = """Subject: Integration failing after update
Customer: The API returns 403 after yesterday's deploy.
Support: Are you using the new v2 header?
Customer: Yes, but only on the staging endpoint. Production still uses v1.
Support: v1 was deprecated. Please rotate your key and retry.
Customer: I rotated it and now both environments return 403. Full logs attached..."""

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": long_thread},
    ],
    response_format={"type": "json_object"},
)

result = json.loads(response.choices[0].message.content)
print(json.dumps(result, indent=2))

Because Oxlo.ai charges per request rather than per token, long-context triage like this can work out cheaper than on token-based providers, depending on your thread lengths and request volume. You can see the exact pricing at https://oxlo.ai/pricing.
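The thread above is a hand-written string. If your helpdesk stores conversations as structured turns, a small helper can flatten them into the same layout before the call. This is a sketch under that assumption; format_thread and the (speaker, text) tuple shape are hypothetical.

```python
def format_thread(subject: str, turns: list[tuple[str, str]]) -> str:
    """Flatten a subject line and (speaker, text) turns into one prompt string."""
    lines = [f"Subject: {subject}"]
    lines += [f"{speaker}: {text.strip()}" for speaker, text in turns]
    return "\n".join(lines)


thread = format_thread(
    "Integration failing after update",
    [
        ("Customer", "The API returns 403 after yesterday's deploy."),
        ("Support", "Are you using the new v2 header?"),
    ],
)
print(thread)
```

The resulting string can be passed as the user message exactly as long_thread is above.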

Step 6: Package the agent

I now wrap everything into a clean function that accepts a message string and returns the parsed result as a dict. This is the version I would actually import into a FastAPI route or Celery task.

def triage_ticket(message: str, model: str = "llama-3.3-70b") -> dict:
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": message},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
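In a route or task queue, a single malformed response should not crash the worker. A minimal retry sketch: it takes the model call as a plain callable so it can be exercised without network access. triage_with_retry and call_model are hypothetical names, and two attempts is an arbitrary default.

```python
import json
from typing import Callable


def triage_with_retry(call_model: Callable[[str], str], message: str, attempts: int = 2) -> dict:
    """Call the model until it returns a JSON object, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        raw = call_model(message)
        try:
            data = json.loads(raw)
        except (json.JSONDecodeError, TypeError) as exc:  # TypeError covers a None body
            last_error = exc
            continue
        if isinstance(data, dict):
            return data
        last_error = ValueError("model returned JSON that is not an object")
    raise RuntimeError(f"no valid JSON object after {attempts} attempts") from last_error
```

In practice call_model would be a small function that runs client.chat.completions.create with the system prompt and returns response.choices[0].message.content.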

Run it

Here is a short script that triages two tickets and prints the results. The second ticket is intentionally vague to test how the model handles edge cases.

if __name__ == "__main__":
    tickets = [
        "My account was double-charged on January 12th and I still haven't received the refund. This is urgent.",
        "I am not sure which plan I should pick. Can you help me decide?",
    ]

    for ticket in tickets:
        out = triage_ticket(ticket)
        print(f"Ticket: {ticket[:50]}...")
        print(json.dumps(out, indent=2))
        print()

Example output:

Ticket: My account was double-charged on January 12th and ...
{
  "category": "Billing",
  "priority": "High",
  "draft_reply": "I apologize for the double charge. I have escalated this to our billing team and you should see the refund within 2 business days."
}

Ticket: I am not sure which plan I should pick. Can you he...
{
  "category": "General",
  "priority": "Low",
  "draft_reply": "I would be happy to help you choose the right plan. Could you tell me how many team members you have and your expected usage volume?"
}

Next steps

Replace the static SYSTEM_PROMPT with a dynamic prompt loaded from a template engine like Jinja2 so you can A/B test triage strategies without redeploying code. After that, add function calling so the agent can query your CRM or refund API directly through Oxlo.ai's tool-use support.
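For the function-calling step, the shape below follows the OpenAI-style tools format, which I am assuming Oxlo.ai accepts because its API is OpenAI-compatible. The lookup_refund_status tool, its parameter, and the stubbed return value are all hypothetical; a real handler would query your own billing system.

```python
import json


def lookup_refund_status(charge_date: str) -> dict:
    """Hypothetical stub standing in for a real billing lookup."""
    return {"charge_date": charge_date, "status": "pending"}


# Tool description in the OpenAI-style function-calling format.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "lookup_refund_status",
            "description": "Look up the refund status for a charge made on a given date.",
            "parameters": {
                "type": "object",
                "properties": {
                    "charge_date": {"type": "string", "description": "Date of the charge"},
                },
                "required": ["charge_date"],
            },
        },
    }
]

HANDLERS = {"lookup_refund_status": lookup_refund_status}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Run the local handler for a tool call and return its result as JSON."""
    if name not in HANDLERS:
        raise KeyError(f"unknown tool: {name}")
    args = json.loads(arguments_json)
    return json.dumps(HANDLERS[name](**args))
```

You would pass tools=TOOLS to client.chat.completions.create, then feed each returned tool call's function name and arguments string through dispatch_tool_call and send the result back as a tool message.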

DEV Community