Using LLMs for Text Generation

#aiinfrastructure #oxlo #ai

Large language models have become the default interface for automated text generation, but moving from prototype to production requires more than a prompt. You need controlled sampling, structured output, low-latency streaming, and a pricing model that does not penalize long contexts. Oxlo.ai provides a developer-first inference platform with flat per-request pricing and full OpenAI SDK compatibility, so you can switch your text generation pipeline without rewriting client code.

Choosing a Model for Text Generation

Oxlo.ai hosts 45+ open-source and proprietary models across seven categories. For general text generation, Llama 3.3 70B serves as a reliable flagship. If you need deep reasoning or complex coding, DeepSeek R1 671B MoE and DeepSeek V4 Flash offer strong performance, with V4 Flash supporting up to 1M tokens of context. Qwen 3 32B excels at multilingual reasoning and agent workflows, while Kimi K2.6 provides advanced reasoning with a 131K context window. All models are available through a single endpoint with no cold starts.

Basic Completion with the Chat Completions API

Because Oxlo.ai is fully OpenAI SDK compatible, you can point your existing client to the Oxlo.ai base URL and start generating text immediately.

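import os

from openai import OpenAI

# Point the standard OpenAI client at the Oxlo.ai endpoint.
client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    # Read the key from the environment rather than hard-coding it.
    # OXLO_API_KEY is an arbitrary variable name chosen for this example.
    api_key=os.environ["OXLO_API_KEY"],
)

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "Write a Python function that validates email addresses."}],
    temperature=0.7,  # moderate randomness
    max_tokens=512    # upper bound on the length of the reply
)

# The generated text is in the first choice's message.
print(response.choices[0].message.content)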

Controlling Randomness with Temperature and Top_p

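Text quality depends heavily on sampling parameters. Temperature rescales the token probability distribution before sampling: low values such as 0.2 concentrate probability on the likeliest tokens and produce focused, more repeatable output for technical tasks, while 0.7 to 0.9 yield more varied, creative text. Only a temperature of 0 approaches greedy decoding, and even then output is not guaranteed to be identical across runs. Top_p, or nucleus sampling, restricts sampling to the smallest set of tokens whose cumulative probability reaches the threshold. In practice, tune one of the two and leave the other at its default; if you adjust both, change temperature first and use top_p as a secondary filter. Avoid pushing both to their extremes simultaneously.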
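To make the two parameters concrete, here is a minimal sketch of what they do to a probability distribution. It is a toy model of the idea, not the sampler any particular provider runs: the function names `apply_temperature` and `top_p_filter` and the example logits are made up for illustration.

```python
import math

def apply_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize the survivors."""
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    kept, cumulative = [], 0.0
    for index, p in ranked:
        kept.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    mass = sum(p for _, p in kept)
    return {index: p / mass for index, p in kept}

# Hypothetical scores for four candidate tokens.
logits = [2.0, 1.0, 0.5, -1.0]

for t in (0.2, 0.7, 1.0):
    probs = apply_temperature(logits, t)
    print(f"temperature={t}:", [round(p, 3) for p in probs])

# Nucleus filtering applied to the temperature-1.0 distribution.
print(top_p_filter(apply_temperature(logits, 1.0), top_p=0.9))
```

Running it shows the top token absorbing more of the probability mass as temperature drops, and top_p discarding the low-probability tail before sampling would take place.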

Structured Output with JSON Mode

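Production applications rarely consume raw prose. Oxlo.ai supports JSON mode and function calling. JSON mode constrains the model to emit syntactically valid JSON, while function calling lets you describe the expected arguments with a schema. JSON mode on its own does not guarantee particular fields, so validate the result before using it. Both are useful for extracting entities, generating configuration files, or returning structured arguments to downstream tools.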

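response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[
        # OpenAI's JSON mode requires the prompt itself to ask for JSON;
        # stating the format explicitly is good practice on compatible APIs too.
        {"role": "system", "content": "Respond with a single JSON object."},
        {
            "role": "user",
            "content": "Extract the meeting date, attendees, and action items from the following text: ..."
        },
    ],
    response_format={"type": "json_object"},  # request syntactically valid JSON
    temperature=0.2  # low temperature for consistent extraction
)

print(response.choices[0].message.content)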
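Because JSON mode only promises well-formed JSON, it is worth checking the shape of the result before passing it downstream. The sketch below shows one way to do that with the standard library. The field names in `REQUIRED_KEYS`, the helper `parse_meeting`, and the sample payload are all hypothetical; in real use the input string would come from `response.choices[0].message.content`.

```python
import json

# Hypothetical field names; align these with whatever your prompt asks for.
REQUIRED_KEYS = {"meeting_date", "attendees", "action_items"}

def parse_meeting(raw):
    """Parse model output and check that the expected keys are present."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"model did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data

# Made-up stand-in for the model's reply.
sample = (
    '{"meeting_date": "2025-01-15", '
    '"attendees": ["Ana", "Raj"], '
    '"action_items": ["Send notes"]}'
)

meeting = parse_meeting(sample)
print(meeting["attendees"])
```

Raising a `ValueError` on a bad payload gives the caller a single place to retry the request or fall back, instead of failing later on a missing key.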

Streaming Responses for Real-Time UX

Latency matters in user-facing applications. Enabling streaming returns tokens as they are generated, letting you render partial output instead of waiting for the full response. To consume it, iterate over the stream and read the text fragment in each chunk's delta.

stream = client.chat.completions.create(
    model="qwen3-32b",
    messages=[{"role": "user", "content": "Explain recursion in Python."}],
    stream=True
)

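for chunk in stream:
    # Some chunks carry no choices or an empty delta, so guard before reading.
    if chunk.choices and chunk.choices[0].delta.content:
        # flush=True pushes each fragment to the terminal immediately.
        print(chunk.choices[0].delta.content, end="", flush=True)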
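Most applications need the complete text as well as the live fragments, for example to store the reply in the conversation history. The sketch below is one way to collect it. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake chunks only mimic the `choices[0].delta.content` shape used above so the example runs without a network call.

```python
from types import SimpleNamespace

def collect_stream(stream, on_token=None):
    """Consume a chat completion stream and return the full text.

    on_token, if given, is called with each fragment as it arrives.
    """
    parts = []
    for chunk in stream:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        text = chunk.choices[0].delta.content
        if text:  # the delta can be None or empty
            parts.append(text)
            if on_token:
                on_token(text)
    return "".join(parts)

def fake_chunk(text):
    """Build an object shaped like a streamed chunk (for illustration only)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

fake_stream = [
    fake_chunk("Recursion "),
    fake_chunk(None),               # empty delta
    SimpleNamespace(choices=[]),    # chunk without choices
    fake_chunk("calls itself."),
]

full_text = collect_stream(
    fake_stream,
    on_token=lambda t: print(t, end="", flush=True),
)
print()
```

With a real client you would pass the object returned by `client.chat.completions.create(..., stream=True)` in place of `fake_stream`.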

Multi-Turn Conversations and System Prompts

Maintaining context across exchanges requires passing the full message history on each request, because the API keeps no state between calls. A system prompt, placed first in the messages array, sets the model's behavior for the whole conversation. Keep the history pruned to the model's context window, or summarize older turns to prevent degradation. Oxlo.ai models such as DeepSeek V4 Flash and Kimi K2.6 support extensive context lengths, making them suitable for long-running dialogues and document analysis.
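Pruning can be sketched as a small helper that keeps the system prompt and as many of the most recent turns as fit a token budget. This is an illustration under loud assumptions: `estimate_tokens` uses a rough four-characters-per-token heuristic rather than a real tokenizer, and both function names are made up for this example.

```python
def estimate_tokens(text):
    """Very rough heuristic: about four characters per token.
    This is an assumption, not a real tokenizer."""
    return max(1, len(text) // 4)

def prune_history(messages, max_tokens):
    """Keep system messages plus the most recent turns that fit max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    turns = [m for m in messages if m["role"] != "system"]
    budget = max_tokens - sum(estimate_tokens(m["content"]) for m in system)

    kept = []
    for message in reversed(turns):  # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if cost > budget:
            break  # stop at the first turn that no longer fits
        kept.append(message)
        budget -= cost
    return system + list(reversed(kept))  # restore chronological order

history = [
    {"role": "system", "content": "You are a concise Python tutor."},
    {"role": "user", "content": "x" * 400},       # long older question
    {"role": "assistant", "content": "y" * 400},  # long older answer
    {"role": "user", "content": "What is a base case?"},
]

pruned = prune_history(history, max_tokens=120)
print([m["role"] for m in pruned])
```

Dropping whole turns from the oldest end keeps the logic simple. A production version would likely count tokens with the model's own tokenizer and might replace the dropped turns with a summary message.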

Long-Context and Agentic Workloads

Agentic pipelines and retrieval-augmented generation often inject thousands of tokens into the prompt. On token-based providers such as Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale, these workloads incur proportional costs that scale with every additional document chunk. Oxlo.ai uses request-based pricing: one flat cost per API request regardless of prompt length. For long-context and agentic use cases, this can be significantly cheaper because your bill does not grow as you add more context.

Cost Considerations for Production

Predictable pricing simplifies capacity planning. Oxlo.ai offers a Free tier at $0 per month with 60 requests per day and access to 16+ models, including DeepSeek V3.2. Paid plans scale to Pro at $80 per month for 1,000 requests per day, Premium at $350 per month for 5,000 requests per day with priority queue access, and custom Enterprise tiers with dedicated GPUs. Because the platform bills per request rather than per token, you can send long system prompts, few-shot examples, or full document contexts without watching metered costs accumulate. See the Oxlo.ai pricing page for current plan details.
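The trade-off can be estimated with a few lines of arithmetic. The sketch below derives an effective per-request cost from the plan figures above, assuming a 30-day month and a fully used daily quota, and compares it with a per-token bill. `HYPOTHETICAL_RATE` is a placeholder and not a quote from any provider, and the comparison ignores output tokens, so substitute real numbers before drawing conclusions.

```python
def per_request_cost(monthly_price, requests_per_day, days=30):
    """Effective cost per request if the daily quota is fully used."""
    return monthly_price / (requests_per_day * days)

def per_token_cost(prompt_tokens, price_per_million_tokens):
    """Input cost of a single request on a per-token plan."""
    return prompt_tokens / 1_000_000 * price_per_million_tokens

# Plan figures taken from the text above.
pro = per_request_cost(80, 1_000)
premium = per_request_cost(350, 5_000)

# Placeholder USD price per million input tokens, NOT a real provider rate.
HYPOTHETICAL_RATE = 0.50

# Prompt size at which the per-token bill equals the flat Pro cost.
break_even_tokens = pro / HYPOTHETICAL_RATE * 1_000_000

print(f"Pro:     ${pro:.5f} per request")
print(f"Premium: ${premium:.5f} per request")
print(f"Break-even prompt size at the placeholder rate: {break_even_tokens:.0f} tokens")

for tokens in (1_000, 10_000, 100_000):
    metered = per_token_cost(tokens, HYPOTHETICAL_RATE)
    print(f"{tokens:>7} prompt tokens: ${metered:.5f} metered vs ${pro:.5f} flat")
```

The effective per-request figure rises if you use less than the full quota, so the flat plan favors workloads with steady volume and large prompts.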