Using LLMs for Speech Generation: A Comprehensive Guide

#aiinfrastructure #oxlo #ai

Speech generation has moved beyond rigid, rules-based synthesizers. Modern pipelines now use large language models to author, structure, and contextualize spoken content before it ever reaches a text-to-speech engine. Whether you are building voice agents, audiobook pipelines, or automated commentary systems, the most robust architecture pairs an LLM for reasoning with a dedicated audio model for rendering. Oxlo.ai provides both layers through a single OpenAI-compatible API, backed by request-based pricing that stays flat even when your prompts grow.

Native Audio LLMs vs. Orchestrated Pipelines

Researchers are training multimodal LLMs that emit audio tokens directly, but for production workloads the dominant pattern remains orchestration. In this design, an LLM generates or refines text, then a specialized model converts that text to speech. This separation of concerns lets you optimize each stage independently: the LLM handles personality, memory, and tool use, while the TTS model handles phonetics, pacing, and voice stability.

Oxlo.ai supports this pattern natively. You can run reasoning workloads on models such as Qwen 3 32B, Llama 3.3 70B, or DeepSeek V3.2, then hand the resulting text to Kokoro 82M through the audio/speech endpoint. Both stages use the same API key and base URL, so your client code does not need to juggle multiple providers.
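The separation of concerns described above can be sketched without committing to any provider. This is a minimal illustration rather than Oxlo.ai's implementation: `SpeechPipeline`, `write_script`, and `render_audio` are hypothetical names, and the two stages are plain callables so that stubs can stand in for the LLM and TTS calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechPipeline:
    """Two-stage orchestration: an LLM stage writes text, a TTS stage renders it."""
    write_script: Callable[[str], str]    # hypothetical: prompt -> spoken text
    render_audio: Callable[[str], bytes]  # hypothetical: spoken text -> audio bytes

    def run(self, prompt: str) -> bytes:
        script = self.write_script(prompt).strip()
        if not script:
            # Fail early instead of sending an empty string to the TTS stage.
            raise ValueError("LLM stage returned no text to synthesize")
        return self.render_audio(script)


# Stand-in stages so the sketch runs without network access.
def fake_llm(prompt: str) -> str:
    return f"You asked: {prompt}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechPipeline(write_script=fake_llm, render_audio=fake_tts)
audio = pipeline.run("Where is order 9912?")
```

Because each stage is just a function, either one can be swapped, mocked in tests, or tuned without touching the other.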

High-Fidelity TTS with Kokoro on Oxlo.ai

Kokoro 82M is a lightweight yet expressive text-to-speech model. Because it runs on Oxlo.ai with no cold starts, you can treat speech generation as a synchronous step inside a larger agent loop rather than a batch job.

The audio/speech endpoint is compatible with the OpenAI SDK. A minimal Python call looks like this:

import openai

client = openai.OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

response = client.audio.speech.create(
    model="kokoro-82m",
    voice="af",
    input="The system is now online. All diagnostics passed."
)

# Save the returned audio to disk (stream_to_file is deprecated on this response type)
response.write_to_file("output.mp3")

You can swap voices by changing the voice parameter, and because Oxlo.ai charges per request rather than per token, a ten-word prompt and a thousand-word script cost the same to synthesize. See https://oxlo.ai/pricing for current plan details.

LLM-Driven Speech Workflows

The real power emerges when you let an LLM decide what to say. For example, a customer-support voice agent might need to check an order status, summarize it, and then render the summary as speech. The LLM stage can enforce tone constraints, inject user-specific variables, and format the output for optimal pronunciation.

Below is a compact pattern that chains chat completion to speech synthesis on Oxlo.ai:

# Stage 1: Generate the spoken text
chat = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[
        {"role": "system", "content": "You are a calm, precise support agent. Reply in one sentence."},
        {"role": "user", "content": "Where is order 9912?"}
    ]
)
script = chat.choices[0].message.content

# Stage 2: Synthesize
speech = client.audio.speech.create(
    model="kokoro-82m",
    voice="af",
    input=script
)
speech.write_to_file("response.mp3")

Because the chat model and the audio model sit behind the same base URL, you do not need to manage separate authentication flows or SDK versions.
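Chat models often return markdown, which a TTS model may read aloud literally. One way to format the output for pronunciation is a small cleanup step between Stage 1 and Stage 2. The helper below is a rough sketch: `prepare_for_speech` is a hypothetical name, and the rules cover only a few common markdown constructs.

```python
import re


def prepare_for_speech(text: str) -> str:
    """Strip common markdown and collapse whitespace so a TTS model gets plain prose."""
    # Drop fenced code blocks entirely; they rarely make sense when spoken.
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)
    # Turn [label](url) links into just the label.
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Remove heading and blockquote markers at the start of a line.
    text = re.sub(r"^[ \t]{0,3}(#{1,6}|>)[ \t]*", "", text, flags=re.MULTILINE)
    # Remove emphasis and inline-code markers.
    text = re.sub(r"[*_`]+", "", text)
    # Collapse newlines and runs of spaces into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = prepare_for_speech(
    "**Order 9912** shipped.\n\nTrack it [here](https://example.com)."
)
```

In the chained example, this would be applied to `chat.choices[0].message.content` before the text is passed as `input` to the speech call. Note that the emphasis rule also strips underscores inside identifiers, so it may need tightening for technical content.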

Using Transcription to Feed Context

Speech generation is often one half of a conversation. To build a voice interface that listens and responds, you first need speech-to-text. Oxlo.ai hosts Whisper Large v3, Whisper Turbo, and Whisper Medium on the audio/transcriptions endpoint.

A complete listen-think-speak loop looks like this:

# 1. Transcribe user audio (the with-block closes the file handle afterwards)
with open("user_query.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file
    )

# 2. Reason
reply = client.chat.completions.create(
    model="qwen3-32b",
    messages=[
        {"role": "system", "content": "Be concise."},
        {"role": "user", "content": transcript.text}
    ]
)

# 3. Speak
client.audio.speech.create(
    model="kokoro-82m",
    voice="af",
    input=reply.choices[0].message.content
).write_to_file("reply.mp3")

This three-stage pipeline runs entirely on Oxlo.ai. The transcript, system prompt, and conversation history can grow quite long in agentic scenarios. On token-based providers, that growth directly increases cost. Oxlo.ai’s flat per-request pricing removes that penalty, which makes iterative voice agents economically viable.
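To see how conversation history grows across turns, consider the following sketch of a session object that re-sends the full message list on every turn. `VoiceSession` and `think` are hypothetical names, and the reasoning stage is stubbed so the example runs offline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class VoiceSession:
    """Keeps multi-turn chat history for a listen-think-speak loop (hypothetical helper)."""

    def __init__(self, system_prompt: str, think: Callable[[List[Message]], str]):
        self.messages: List[Message] = [{"role": "system", "content": system_prompt}]
        self.think = think  # hypothetical: full message list -> reply text

    def turn(self, user_text: str) -> str:
        # Append the transcribed user utterance, then reason over the whole history.
        self.messages.append({"role": "user", "content": user_text})
        reply = self.think(self.messages)
        # Store the reply so the next turn sees it as context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Stub reasoning stage: reports how many messages it received.
def fake_think(messages: List[Message]) -> str:
    return f"I have seen {len(messages)} messages."


session = VoiceSession("Be concise.", fake_think)
first = session.turn("Where is order 9912?")
second = session.turn("When will it arrive?")
```

In a real loop, `think` would wrap the chat completion call and return `choices[0].message.content`, `user_text` would come from `transcript.text`, and each reply would be passed to the speech endpoint. The message list sent to the model gets longer with every turn, which is the growth the paragraph above refers to.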

Why Request-Based Pricing Matters for Audio

The cost of audio workloads is easy to underestimate. A short utterance might require a massive system prompt, few-shot examples, or a long transcript for context. Under token-based billing, you pay for every input token before the model generates a single word of speech. For agents that maintain multi-turn memory or process long-form narration, those input tokens accumulate quickly.
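A rough arithmetic sketch shows how quickly input tokens accumulate. It assumes the full history is re-sent on every turn and that each turn adds a fixed number of tokens. The figures are illustrative placeholders, not measurements or any provider's actual usage, and `total_input_tokens` is a hypothetical helper.

```python
def total_input_tokens(base_tokens: int, growth_per_turn: int, turns: int) -> int:
    """Sum of input tokens billed when the full history is re-sent on every turn."""
    # Turn k (0-indexed) sends the base prompt plus k turns of accumulated history.
    return sum(base_tokens + k * growth_per_turn for k in range(turns))


# Illustrative figures only: a 2,000-token system prompt, 300 tokens added per turn.
short_chat = total_input_tokens(base_tokens=2_000, growth_per_turn=300, turns=5)
long_chat = total_input_tokens(base_tokens=2_000, growth_per_turn=300, turns=20)

print(short_chat, long_chat)             # billed input tokens for 5 and 20 turns
print(round(long_chat / short_chat, 2))  # token growth factor between the two
```

With these made-up figures, going from 5 turns to 20 raises billed input tokens from 13,000 to 97,000, about 7.5 times as many, while the number of requests only quadruples.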

Oxlo.ai uses request-based pricing: one flat cost per API call regardless of prompt length. For speech-generation pipelines