Building Speech-to-Text Systems with LLMs

#aiinfrastructure #oxlo #ai

Speech-to-text pipelines have moved beyond simple transcription. Modern systems combine audio models with large language models to clean up output, add speaker labels, structure timestamps, and extract entities. For developers building these stacks, the underlying inference platform determines both latency and cost, especially when you are processing long interviews, podcasts, or agentic voice workflows.

Architecture of an LLM-Enhanced STT Pipeline

A production speech-to-text system usually has three stages. First, an audio preprocessing stage handles normalization, voice activity detection, and chunking. Second, a transcription model converts audio to raw text. Third, an LLM post-processor refines the transcript, optionally injecting structure or context. This hybrid approach can outperform transcription alone because the LLM corrects domain-specific terminology, formats the output, and merges fragmented utterances into coherent paragraphs.
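
The three stages can be sketched as a small composable pipeline. This is only an illustration of the structure: the `SttPipeline` class and its stage signatures are hypothetical, and the stand-in lambdas replace what would be a real VAD, a Whisper call, and an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stage signatures; real stages would wrap a VAD, Whisper, and an LLM.
Chunker = Callable[[bytes], List[bytes]]
Transcriber = Callable[[bytes], str]
PostProcessor = Callable[[str], str]


@dataclass
class SttPipeline:
    chunk: Chunker
    transcribe: Transcriber
    post_process: PostProcessor

    def run(self, audio: bytes) -> str:
        # Stage 1: split the audio into chunks.
        chunks = self.chunk(audio)
        # Stage 2: transcribe each chunk independently and join the text.
        raw_text = " ".join(self.transcribe(c) for c in chunks)
        # Stage 3: refine the concatenated transcript.
        return self.post_process(raw_text)


# Stand-in stages so the sketch runs without any model.
demo = SttPipeline(
    chunk=lambda audio: [audio[i:i + 4] for i in range(0, len(audio), 4)],
    transcribe=lambda c: c.decode("ascii"),
    post_process=lambda text: text.strip().capitalize() + ".",
)
```

Keeping each stage behind a plain callable makes it easy to swap the transcription model or the post-processing model without touching the rest of the pipeline.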

Transcription Layer: Whisper on Oxlo.ai

For the transcription stage, OpenAI's Whisper remains a practical default. Oxlo.ai hosts Whisper Large v3, Whisper Turbo, and Whisper Medium behind an OpenAI-compatible /audio/transcriptions endpoint. Oxlo.ai advertises no cold starts on popular models, so the first request after an idle period should return at full speed.

You can call it with the standard OpenAI SDK by switching the base URL and API key.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key="YOUR_OXLO_API_KEY"
)

with open("earnings_call.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["word"]
    )

With timestamp_granularities set to ["word"], the verbose_json format returns word-level timestamps, which are essential for aligning LLM-generated summaries with the original audio timeline.
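
As a rough illustration of how word-level timestamps can be used, the sketch below merges words into segments wherever the pause between them is short. It assumes each word is a dict with `word`, `start`, and `end` keys, as in the OpenAI verbose_json response; whether Oxlo.ai returns exactly that shape is an assumption, and `group_words` and its `max_gap` threshold are hypothetical.

```python
from typing import Dict, List


def group_words(words: List[Dict], max_gap: float = 0.8) -> List[Dict]:
    """Merge word-level timestamps into segments, splitting on long pauses."""
    segments: List[Dict] = []
    for w in words:
        if segments and w["start"] - segments[-1]["end"] <= max_gap:
            # Short pause: extend the current segment.
            seg = segments[-1]
            seg["text"] += " " + w["word"]
            seg["end"] = w["end"]
        else:
            # First word or long pause: start a new segment.
            segments.append({"start": w["start"], "end": w["end"], "text": w["word"]})
    return segments


# Hand-written sample data standing in for a real response.
words = [
    {"word": "Revenue", "start": 0.0, "end": 0.4},
    {"word": "grew", "start": 0.5, "end": 0.8},
    {"word": "Next", "start": 2.5, "end": 2.8},
    {"word": "question", "start": 2.9, "end": 3.4},
]
segments = group_words(words)
```

With the Python SDK the words arrive as objects rather than dicts, so a real caller would first convert them, for example by reading the `word`, `start`, and `end` attributes of each item.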

Chunking Strategies for Long-Form Audio

Whisper models operate on 30-second audio windows, so hour-long recordings are usually split before transcription. A common pattern is to use voice activity detection to cut the audio at silences into chunks of up to 30 seconds, transcribe each chunk independently, and then concatenate the results. The challenge is that concatenation introduces errors at chunk boundaries and loses context that spans them.
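
One way to turn voice activity detection output into chunks is to pack consecutive speech intervals greedily until the next one would exceed the limit. The sketch below assumes the VAD has already produced a sorted list of (start, end) pairs in seconds; `pack_intervals` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) of detected speech


def pack_intervals(speech: List[Interval], max_len: float = 30.0) -> List[Interval]:
    """Greedily merge consecutive speech intervals into chunks of at most max_len seconds."""
    chunks: List[Interval] = []
    for start, end in speech:
        # A single interval longer than max_len is cut into fixed-size pieces.
        while end - start > max_len:
            chunks.append((start, start + max_len))
            start += max_len
        if chunks and end - chunks[-1][0] <= max_len:
            # Extending the current chunk keeps it within the limit.
            chunks[-1] = (chunks[-1][0], end)
        else:
            # Otherwise start a new chunk, so the cut falls in the silence between intervals.
            chunks.append((start, end))
    return chunks


speech = [(0.0, 12.0), (13.5, 27.0), (28.0, 41.0), (45.0, 80.0)]
chunks = pack_intervals(speech)
```

Cutting in silence reduces the chance of splitting a word, but long uninterrupted speech still has to be cut mid-utterance, which is where boundary errors tend to come from.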

This is where the post-processing LLM becomes critical. You can feed the entire concatenated transcript, which may run to tens of thousands of tokens, into a long-context model to smooth transitions and resolve coreferences. On token-based providers, that post-processing step becomes expensive because cost scales with input length. Oxlo.ai uses request-based pricing with one flat cost per API request regardless of prompt length, so long transcripts do not trigger runaway costs. For transcription-heavy and agentic workloads, this pricing model can be significantly cheaper than token-based alternatives.
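
A minimal sketch of building that smoothing prompt follows. It labels each chunk with its start offset so the model can see where the seams are. The function names and the prompt wording are illustrative assumptions, not a recommended or tested prompt.

```python
from typing import List, Tuple


def format_offset(seconds: float) -> str:
    """Render an offset in seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def build_stitch_prompt(chunks: List[Tuple[float, str]]) -> str:
    """Join per-chunk transcripts, labelling each one with its start offset."""
    body = "\n".join(
        f"[{format_offset(offset)}] {text.strip()}" for offset, text in chunks
    )
    return (
        "The transcript below was produced in independent chunks. "
        "Each line starts with the chunk's offset. Repair words split across "
        "chunk boundaries and keep the offsets unchanged.\n\n" + body
    )


prompt = build_stitch_prompt([(0.0, "We closed the quar"), (30.0, "ter ahead of plan. ")])
```

The resulting string can be sent as the user message of a chat completion request.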

Post-Processing with LLMs

After transcription, an LLM can perform tasks that audio models alone handle poorly. Examples include redacting personally identifiable information, converting timestamps to structured JSON, generating speaker diarization hypotheses from contextual cues, and correcting industry jargon.

Oxlo.ai offers 45+ models across seven categories, including general-purpose flagships like Llama 3.3 70B and DeepSeek V4 Flash, which supports one-million-token context. You can call these through the same OpenAI SDK client using JSON mode or function calling to enforce output schemas.

completion = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a post-processing assistant. Fix transcription errors and return strict JSON."},
        {"role": "user", "content": f"Transcript:\n{transcript.text}\n\nReturn speaker segments with corrected text."}
    ],
    response_format={"type": "json_object"}
)

Because the input transcript can be very long, bounded only by the model's context window, running this on Oxlo.ai avoids the per-token metering that makes large post-processing jobs unpredictable on providers like Together AI, Fireworks AI, OpenRouter, Replicate, or Anyscale.
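
JSON mode is meant to yield syntactically valid JSON, but it does not by itself guarantee any particular schema, so it is worth checking the reply before using it. The sketch below assumes the model was asked for an object with a `segments` array whose items have string `speaker` and `text` fields; that schema and the `parse_segments` helper are illustrative assumptions.

```python
import json
from typing import Any, Dict, List


def parse_segments(raw_json: str) -> List[Dict[str, Any]]:
    """Parse the model's reply and check the shape assumed by this sketch."""
    data = json.loads(raw_json)  # raises ValueError (JSONDecodeError) on bad syntax
    segments = data.get("segments") if isinstance(data, dict) else None
    if not isinstance(segments, list):
        raise ValueError("expected a top-level 'segments' array")
    for i, seg in enumerate(segments):
        if not isinstance(seg, dict):
            raise ValueError(f"segment {i} is not an object")
        if not isinstance(seg.get("speaker"), str) or not isinstance(seg.get("text"), str):
            raise ValueError(f"segment {i} needs string 'speaker' and 'text' fields")
    return segments


# Hand-written reply standing in for completion.choices[0].message.content.
reply = '{"segments": [{"speaker": "CFO", "text": "Revenue grew eight percent."}]}'
checked = parse_segments(reply)
```

On a validation failure, a caller might retry the request or fall back to the raw transcript.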

Cost Engineering with Request-Based Pricing

Speech-to-text systems are inherently long-context workloads. A single hour of audio can produce 10,000 to 15,000 tokens of raw text, and agentic pipelines may iterate over that text multiple times. Under token-based billing, every correction pass and summarization loop adds to the bill in proportion to the length of the transcript.

Oxlo.ai flips this by charging one flat cost per API request regardless of prompt length. For transcription and subsequent LLM refinement, that means a heavy post-processing job costs the same as a short greeting. Request-based pricing can be 10-100x cheaper than token-based pricing for long-context workloads, depending on prompt length and the rates being compared. You can explore the exact tiers on the Oxlo.ai pricing page, which includes a free plan with 60 requests per day and a 7-day full-access trial for prototyping.
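
The comparison comes down to simple arithmetic, sketched below. Both rates are placeholders chosen only to make the example run; they are not Oxlo.ai's or any other provider's actual prices. The sketch also counts input tokens only and ignores output tokens.

```python
def token_based_cost(tokens_per_pass: int, passes: int, usd_per_million: float) -> float:
    """Input-token cost when every pass re-sends the full transcript."""
    return tokens_per_pass * passes * usd_per_million / 1_000_000


def request_based_cost(passes: int, usd_per_request: float) -> float:
    """Flat cost: one request per pass, independent of prompt length."""
    return passes * usd_per_request


def break_even_tokens(usd_per_request: float, usd_per_million: float) -> float:
    """Prompt size above which a flat request is cheaper than token billing."""
    return usd_per_request / usd_per_million * 1_000_000


# Placeholder rates for illustration only; not any provider's real prices.
USD_PER_MILLION_TOKENS = 0.60
USD_PER_REQUEST = 0.002

hour_of_audio = 12_000  # tokens, within the 10,000-15,000 range quoted above
per_token = token_based_cost(hour_of_audio, passes=3, usd_per_million=USD_PER_MILLION_TOKENS)
per_request = request_based_cost(passes=3, usd_per_request=USD_PER_REQUEST)
threshold = break_even_tokens(USD_PER_REQUEST, USD_PER_MILLION_TOKENS)

print(f"token-based: ${per_token:.4f}, request-based: ${per_request:.4f}")
print(f"break-even prompt size: {threshold:.0f} tokens")
```

The ratio between the two figures depends entirely on the rates plugged in, so it is worth rerunning the calculation with current prices and your own typical transcript length.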

Putting It Together: A Minimal Example

The following script ties transcription and LLM cleanup into a single pipeline. It assumes you have an Oxlo.ai API key and the Python OpenAI SDK installed.

import json
from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def transcribe_and_structure(audio_path: str) -> dict:
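    # Step 1: Transcribe with Whisper. response_format="text" returns a plain
    # string, so no timestamps are available to the later steps.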
    with open(audio_path, "rb") as f:
        raw = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=f,
            response_format="text"
        )
    
    # Step 2: Structure with an LLM
    prompt = (
        "Given the following transcript, produce a JSON object with keys: "
        "title, speakers (array), and segments (array of objects with start_time, end_time, text). "
        "If times are unknown, use null.\n\nTranscript:\n" + raw
    )
    
    structured = client.chat.completions.create(
        model="qwen3-32b",
        messages=[
            {"role": "system", "content": "You format transcripts into strict JSON."},
            {"role": "user", "content": prompt}
        ],
        response_format={"type": "json_object"}
    )
    
    return json.loads(structured.choices[0].message.content)

if __name__ == "__main__":
    result = transcribe_and_structure("interview.wav")
    print(json.dumps(result, indent=2))
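
The minimal example requests response_format="text", which carries no timing information, so the model can only return null for start_time and end_time. If timestamps matter, one option is to request verbose_json instead and pass timestamped lines to the LLM. The sketch below assumes each segment is a dict with `start`, `end`, and `text` keys, as in the OpenAI verbose_json response; `render_timed_transcript` is a hypothetical helper.

```python
from typing import Dict, List


def render_timed_transcript(segments: List[Dict]) -> str:
    """Render segments as '[start-end] text' lines for use in an LLM prompt."""
    return "\n".join(
        f"[{s['start']:.2f}-{s['end']:.2f}] {s['text'].strip()}" for s in segments
    )


# Hand-written sample data standing in for a real response.
sample = [
    {"start": 0.0, "end": 4.25, "text": " Welcome back."},
    {"start": 4.25, "end": 9.5, "text": " Thanks for having me."},
]
timed = render_timed_transcript(sample)
```

The rendered string would replace the plain transcript in the prompt, giving the model explicit times to copy into each segment object.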

This pattern works because Oxlo.ai is compatible with the OpenAI SDK, so you do not need custom clients or adapter layers. Whether you are building a voice agent, a podcast search engine, or a meeting summarizer, the same request-based pricing applies, which keeps costs predictable even as your context lengths grow.