DEV Community

shashank ms
Building an LLM-Powered Chatbot for Business

Business chatbots have moved beyond simple FAQ retrieval. Modern implementations handle multi-turn reasoning, tool orchestration, and long-document analysis. The difference between a prototype and a production system usually comes down to inference architecture: how you manage context, latency, and cost at scale.

Architecture Overview

A production chatbot needs three layers: a stateful conversation manager, a reasoning engine, and a tool layer for external actions. The conversation manager handles user sessions and history. The reasoning engine processes context and decides when to respond or invoke tools. The tool layer connects to CRMs, databases, or internal APIs.
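One way to picture how the three layers fit together is the minimal sketch below. All class and function names are hypothetical, the reasoning engine is a keyword-matching stand-in for a real LLM call, and state lives in memory rather than in a database.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Message = Dict[str, str]


@dataclass
class ConversationManager:
    """Keeps per-session history (in memory here; a real system would persist it)."""
    sessions: Dict[str, List[Message]] = field(default_factory=dict)

    def history(self, session_id: str) -> List[Message]:
        return self.sessions.setdefault(session_id, [])


@dataclass
class ToolLayer:
    """Maps tool names to plain Python callables that return text."""
    tools: Dict[str, Callable[..., str]] = field(default_factory=dict)

    def call(self, name: str, **kwargs) -> str:
        return self.tools[name](**kwargs)


class ReasoningEngine:
    """Stand-in for the LLM: decides whether to reply or invoke a tool."""

    def decide(self, history: List[Message]) -> dict:
        last = history[-1]["content"].lower()
        if "status" in last:
            # Hypothetical tool request; a real model would produce this itself.
            return {"tool": "get_account_status", "args": {"account_id": "acct_123"}}
        return {"reply": "How can I help?"}


def handle_message(session_id, text, manager, engine, tools) -> str:
    """Route one user message through all three layers."""
    history = manager.history(session_id)
    history.append({"role": "user", "content": text})
    decision = engine.decide(history)
    if "tool" in decision:
        reply = tools.call(decision["tool"], **decision["args"])
    else:
        reply = decision["reply"]
    history.append({"role": "assistant", "content": reply})
    return reply
```

Keeping the layers behind separate objects means the in-memory store or the stand-in engine can be swapped for a database or a real model without touching the routing function.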

For the reasoning engine, you need an inference provider that supports streaming, function calling, and large context windows without unpredictable scaling costs. This is where your choice of backend directly impacts both user experience and operating budget.

Selecting a Model and Inference Provider

Most providers bill by the token, so long system prompts, retrieved documents, and multi-turn history raise the cost of every request in proportion to input length. Oxlo.ai uses request-based pricing: one flat cost per API call regardless of input length. For business chatbots that process lengthy support tickets, legal documents, or extended agentic workflows, this structure makes costs predictable, and for input-heavy workloads it can be significantly cheaper than token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale.
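To make the difference concrete, here is a rough cost model. It assumes the full history is resent on every turn, ignores output-token charges, and takes prices as parameters because the figures used with it are placeholders, not real rates from any provider. Since the input on each request grows linearly with the turn number, the total billed over a whole conversation grows roughly quadratically.

```python
def input_tokens_per_request(system_tokens, user_tokens, reply_tokens, turns):
    """Input tokens sent on each turn when the full history is resent."""
    sizes = []
    history = system_tokens
    for _ in range(turns):
        history += user_tokens    # the new user message joins the history
        sizes.append(history)     # this request sends everything so far
        history += reply_tokens   # the reply is stored for the next turn
    return sizes


def token_based_cost(sizes, price_per_million_input):
    """Input-side cost under per-token billing (price is a placeholder)."""
    return sum(sizes) * price_per_million_input / 1_000_000


def request_based_cost(turns, price_per_request):
    """Cost under flat per-request billing (price is a placeholder)."""
    return turns * price_per_request


sizes = input_tokens_per_request(system_tokens=500, user_tokens=100,
                                 reply_tokens=200, turns=3)
print(sizes)  # [600, 900, 1200]
```

Plugging in your own prices and typical conversation shape shows where the break-even point falls for a given workload.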

Oxlo.ai offers 45+ models across seven categories, all accessible through a fully OpenAI-compatible API. For general business chat, Llama 3.3 70B works well as a reliable flagship. If your bot operates across languages or executes agentic workflows, Qwen 3 32B is a strong alternative. For deep reasoning over complex internal documentation, DeepSeek R1 671B MoE or Kimi K2.6 provide advanced chain-of-thought capabilities.

Setting Up the API Client

Because the Oxlo.ai API is OpenAI-compatible, the official OpenAI SDK works against it unchanged: existing Python or Node.js code needs only a new base URL and API key.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY")
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful business assistant."},
        {"role": "user", "content": "Summarize our Q3 revenue trends."}
    ],
    stream=True
)

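# Print tokens as they arrive; skip chunks that carry no choices or no text.
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)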

No custom client logic is required. The same pattern works for chat completions, embeddings, image generation, and audio endpoints.

Implementing Multi-Turn Memory

Business chatbots need to maintain state across sessions. Store conversation history in Redis or PostgreSQL, then append new messages to the array on each request.

conversation_history = [
    {"role": "system", "content": "You are a support agent for a B2B SaaS platform."}
]

def chat_turn(user_message, history):
    history.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=history,
        max_tokens=1024
    )

    assistant_message = response.choices[0].message.content
    history.append({"role": "assistant", "content": assistant_message})
    return assistant_message, history

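With token-based providers, each additional turn costs more than the last because the full history is resent as input on every request. On Oxlo.ai, the cost remains flat per request even as the conversation history grows, which makes long support sessions economically predictable. The history still has to fit within the model's context window, so very long sessions eventually need truncation or summarization on any provider.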
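For the storage side, the sketch below persists history per session. It uses Python's built-in sqlite3 purely as a stand-in for PostgreSQL or Redis so that it runs without extra services; the table layout and function names are assumptions, not part of any Oxlo.ai API.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the message store. ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " session_id TEXT NOT NULL,"
        " role TEXT NOT NULL,"
        " content TEXT NOT NULL)"
    )
    return conn


def append_message(conn, session_id, role, content):
    """Append one message to a session's history."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
            (session_id, role, content),
        )


def load_history(conn, session_id):
    """Return the session's messages, oldest first, in chat-completions format."""
    rows = conn.execute(
        "SELECT role, content FROM messages WHERE session_id = ? ORDER BY id",
        (session_id,),
    ).fetchall()
    return [{"role": role, "content": content} for role, content in rows]
```

The list returned by load_history can be passed directly as the history argument of chat_turn, with each new user and assistant message written back through append_message.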

Adding Tool Use for Business Actions

Function calling lets your chatbot query CRMs, check inventory, or schedule meetings. Oxlo.ai supports function calling on compatible models.

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_account_status",
            "description": "Retrieve the billing status for a customer",
            "parameters": {
                "type": "object",
                "properties": {
                    "account_id": {"type": "string"}
                },
                "required": ["account_id"]
            }
        }
    }
]

response = client.chat.completions.create(
    model="qwen3-32b",
    messages=conversation_history,
    tools=tools,
    tool_choice="auto"
)

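If the model returns a tool call, execute it in your backend, append both the assistant message containing the tool call and a tool-role message carrying the result (matched by tool_call_id) to the message array, and send a follow-up request to generate the user-facing response.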
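That round trip might look like the sketch below. It follows the OpenAI chat-completions tool-calling message format; the get_account_status body and the TOOL_REGISTRY dictionary are hypothetical placeholders for your own backend, and the client is passed in as a parameter so the function works with any OpenAI-compatible client.

```python
import json


def get_account_status(account_id: str) -> dict:
    """Hypothetical backend lookup; replace with a real CRM or billing query."""
    return {"account_id": account_id, "status": "active"}


# Maps the tool names declared in the tools schema to Python callables.
TOOL_REGISTRY = {"get_account_status": get_account_status}


def run_with_tools(client, model, messages, tools, registry=TOOL_REGISTRY):
    """Send one request, and a follow-up if the model asked for tools."""
    response = client.chat.completions.create(
        model=model, messages=messages, tools=tools, tool_choice="auto"
    )
    message = response.choices[0].message
    if not message.tool_calls:
        return message.content

    # The assistant turn that requested the tools must precede the results.
    messages.append(message)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        result = registry[call.function.name](**args)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,  # ties the result to the request
            "content": json.dumps(result),
        })

    follow_up = client.chat.completions.create(
        model=model, messages=messages, tools=tools
    )
    return follow_up.choices[0].message.content
```

This version handles a single round of tool calls. A production loop would likely repeat until the model stops requesting tools, cap the number of rounds, and handle unknown tool names or malformed arguments.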

Handling Long Context Efficiently

Enterprise chatbots often ingest large documents: contracts, technical manuals, or conversation exports. Models like DeepSeek V4 Flash support 1M-token context windows, and Kimi K2.6 handles 131K-token contexts with vision and reasoning. On a token-based platform, sending a full contract as context on every request can be expensive. Because Oxlo.ai charges per request rather than per token, you can pass entire documents into the prompt without the cost scaling with document length. This is particularly useful for RAG pipelines where you want to include large retrieved chunks for accuracy.
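Flat pricing does not remove the context window itself, so it can help to check whether a document is likely to fit before sending it. The sketch below uses the window sizes quoted above. The model ID strings, the four-characters-per-token heuristic, and the reserved budgets are all assumptions; real token counts depend on each model's tokenizer.

```python
# Approximate context windows in tokens, as quoted in this article.
# The model ID strings are illustrative guesses, not confirmed identifiers.
CONTEXT_WINDOWS = {
    "kimi-k2.6": 131_000,
    "deepseek-v4-flash": 1_000_000,
}


def estimate_tokens(text, chars_per_token=4.0):
    """Very rough estimate; use the model's tokenizer for exact counts."""
    return int(len(text) / chars_per_token) + 1


def fits(document, model, reserved_for_output=4096, prompt_overhead=500):
    """True if the document likely fits, leaving room for the prompt and reply."""
    budget = CONTEXT_WINDOWS[model] - reserved_for_output - prompt_overhead
    return estimate_tokens(document) <= budget


def pick_model(document, preference=("kimi-k2.6", "deepseek-v4-flash")):
    """Return the first preferred model the document fits into, else None."""
    for model in preference:
        if fits(document, model):
            return model
    return None
```

When pick_model returns None, the document needs to be split or retrieved in chunks rather than passed whole.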

Deployment Patterns

For production, deploy your chatbot service behind a FastAPI or Express.js backend. Use streaming to improve perceived latency. Oxlo.ai has no cold starts on popular models, so the first request after idle time does not wait for a model to load, which is critical for interactive business applications.
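On the backend, streaming usually means relaying the model's chunks to the browser as they arrive. A minimal, framework-agnostic sketch is a generator that converts chat-completion chunks into Server-Sent Events. The JSON payload shape and the closing [DONE] marker are conventions chosen here for illustration, not something Oxlo.ai prescribes.

```python
import json


def sse_events(chunks):
    """Convert streamed chat-completion chunks into Server-Sent Events strings."""
    for chunk in chunks:
        if not chunk.choices:
            continue  # some chunks carry metadata only
        text = chunk.choices[0].delta.content
        if text:
            payload = json.dumps({"text": text})
            yield "data: " + payload + "\n\n"
    # Tell the client the stream is complete.
    yield "data: [DONE]\n\n"
```

In FastAPI, a generator like this can be returned from an endpoint as StreamingResponse(sse_events(stream), media_type="text/event-stream"), where stream is the result of a chat completions call made with stream=True.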

If you need embeddings for a RAG pipeline, Oxlo.ai provides BGE-Large and E5-Large through the same API. For image analysis, Gemma 3 27B or Kimi VL A3B handle vision inputs via the chat completions endpoint.
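For the retrieval step, a minimal sketch is shown below: embed the texts through the OpenAI-compatible embeddings endpoint, then rank stored vectors by cosine similarity. The model identifier passed to embed is left to the caller because the exact ID strings for BGE-Large or E5-Large are not given here, and a real deployment would likely use a vector database instead of a linear scan.

```python
import math


def embed(client, model, texts):
    """Embed a list of texts via an OpenAI-compatible embeddings endpoint."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar documents."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The indices returned by top_k point back into your document list, and the matching chunks are then placed in the prompt ahead of the user's question.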

Cost Optimization

The primary cost driver for business chatbots is usually input tokens: system prompts, retrieved context, and conversation history. Token-based pricing penalizes long inputs. Oxlo.ai's request-based model removes this variable. You can see the exact structure on the Oxlo.ai pricing page.

For bots with predictable volume, the Pro plan includes 1,000 requests per day across all models. For high-volume deployments, Premium offers 5,000 requests per day with priority queueing. Enterprise plans provide dedicated GPUs and unlimited usage.
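A quick way to size a plan is to translate the daily request quota into conversations. The sketch below uses the quotas quoted above and assumes that every API call, including the follow-up request after a tool call, counts as one request against the quota; check the pricing page for how requests are actually counted.

```python
# Daily request quotas as quoted in this article.
DAILY_REQUESTS = {"pro": 1_000, "premium": 5_000}


def requests_per_conversation(turns, tool_turn_fraction=0.0):
    """One request per turn, plus one follow-up for each turn that calls a tool."""
    return turns * (1 + tool_turn_fraction)


def conversations_per_day(plan, turns, tool_turn_fraction=0.0):
    """How many whole conversations of this shape fit in the plan's daily quota."""
    per_conversation = requests_per_conversation(turns, tool_turn_fraction)
    return int(DAILY_REQUESTS[plan] // per_conversation)


print(conversations_per_day("pro", turns=8))                              # 125
print(conversations_per_day("pro", turns=8, tool_turn_fraction=0.25))     # 100
```

If the estimate falls below your expected daily conversation count, that points toward the next plan up or toward trimming unnecessary follow-up requests.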

Conclusion

Building a business chatbot requires more than a prompt template. You need reliable tool use, stateful memory, and an inference backend that does not punish long context. Oxlo.ai provides the model variety and OpenAI-compatible API you need, with a pricing structure that aligns cost to business value rather than token count.