Building Voice Assistants with LLMs: A Step-by-Step Guide

I built a voice assistant that transcribes recorded audio, reasons over the text with an LLM, and speaks the answer back using text-to-speech. The entire pipeline runs on Oxlo.ai, so one API key covers speech recognition, reasoning, and voice synthesis, and request-based pricing means the cost does not grow with input length. In this guide, I walk through the exact code, from WAV file to spoken reply.

What you'll need

Python 3.10 or newer
The OpenAI SDK: pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai
A 16 kHz mono WAV file named user_prompt.wav for testing
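
Before sending anything to the API, it can help to confirm that the test file really is 16 kHz mono. The following is a minimal sketch using only Python's standard wave module; the helper names describe_wav and is_16k_mono are hypothetical and not part of any SDK, and the check only works for uncompressed PCM WAV files.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, channels, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / float(rate)
    return rate, channels, duration


def is_16k_mono(path):
    """True when the file matches the 16 kHz mono format listed above."""
    rate, channels, _ = describe_wav(path)
    return rate == 16000 and channels == 1
```

Calling is_16k_mono("user_prompt.wav") before Step 2 gives an early warning if the recording needs to be resampled or downmixed first.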

Step 1: Initialize the Oxlo.ai client

I start by creating an OpenAI-compatible client pointed at Oxlo.ai. This single client will handle transcription, chat, and speech endpoints.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

# Verify the connection with a minimal request
test = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "say ok"}],
)
print("Client ready:", test.choices[0].message.content)
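
Hardcoding the key is fine for a quick local test, but reading it from the environment avoids committing it by accident. This is a small sketch, not an Oxlo.ai requirement: the variable name OXLO_API_KEY and the helper load_api_key are assumptions chosen for illustration.

```python
import os


def load_api_key(env_var="OXLO_API_KEY"):
    """Read the API key from an environment variable (the name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage with the client from this step:
# client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key())
```

Failing fast with a clear message here is easier to debug than an authentication error returned by the first API call.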

Step 2: Transcribe audio with Whisper

I send the user's WAV file to Oxlo.ai's Whisper endpoint and get back plain text. Oxlo.ai hosts Whisper Large v3 with no cold starts, so the first request is as fast as any other.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

# Open the file in a with-block so it is closed even if the request fails
with open("user_prompt.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

user_message = transcription.text
print("User said:", user_message)
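
A recording that contains only silence or noise can come back as an empty or whitespace-only transcript, and there is little point in sending that to the LLM. The sketch below shows one possible guard; normalize_transcript and the min_chars threshold are hypothetical choices, not part of the SDK.

```python
def normalize_transcript(text, min_chars=2):
    """Collapse whitespace and return None when nothing usable was transcribed."""
    cleaned = " ".join((text or "").split())
    if len(cleaned) < min_chars:
        return None
    return cleaned
```

With this in place, the later steps can be skipped whenever normalize_transcript(transcription.text) returns None.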

Step 3: Define the assistant personality

Voice assistants fail when they ramble. I constrain the model with a system prompt that asks for brief, pronunciation-friendly output.

SYSTEM_PROMPT = """You are an Oxlo.ai voice assistant.
Keep every answer to one or two short sentences.
Never use markdown, lists, or symbols that are hard to speak aloud."""
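
A system prompt steers the model but does not guarantee it will never emit markdown. As a safety net, the reply can be cleaned before it reaches the text-to-speech step. This is a rough sketch under that assumption; clean_for_speech is a hypothetical helper, and its regular expressions are deliberately blunt (for example, they also strip underscores inside identifiers).

```python
import re


def clean_for_speech(text):
    """Strip common markdown symbols so the text reads naturally aloud."""
    # Remove emphasis markers, backticks, heading marks, and quote marks
    text = re.sub(r"[*_`#>]+", "", text)
    # Remove list bullets or "1." style numbering at the start of a line
    text = re.sub(r"^\s*(?:[-+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Collapse newlines and repeated spaces into single spaces
    return re.sub(r"\s+", " ", text).strip()
```

For example, clean_for_speech("**Oslo** is the capital.\n- It is in Norway.") returns "Oslo is the capital. It is in Norway."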

Step 4: Generate a response with Llama 3.3 70B

I pass the transcript and system prompt to Llama 3.3 70B. Because Oxlo.ai uses request-based pricing, a long voice transcript does not inflate the cost of this call, which matters when users speak in long paragraphs.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ],
)

assistant_text = response.choices[0].message.content
print("Assistant:", assistant_text)

Step 5: Synthesize speech with Kokoro

Finally, I send the assistant's text to Oxlo.ai's Kokoro 82M endpoint and write the returned audio bytes to a WAV file.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

speech = client.audio.speech.create(
    model="kokoro-82m",
    voice="af_bella",  # replace with your preferred Oxlo.ai voice id
    input=assistant_text,
)

with open("assistant_response.wav", "wb") as f:
    f.write(speech.content)

print("Saved assistant_response.wav")
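
The snippet above saves whatever bytes come back under a .wav name. OpenAI-compatible speech endpoints often default to MP3 unless a response_format is requested, and which container Oxlo.ai's Kokoro endpoint returns by default is not confirmed here, so it may be worth checking the header before trusting the extension. The sketch below is a heuristic based on well-known file signatures; sniff_audio_format is a hypothetical helper.

```python
def sniff_audio_format(data):
    """Guess the audio container from the first bytes (heuristic, not exhaustive)."""
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    if data[:3] == b"ID3":
        return "mp3"  # MP3 file that starts with an ID3 tag
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"  # MPEG frame sync; ADTS AAC can also match this pattern
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    return "unknown"
```

If sniff_audio_format(speech.content) does not return "wav", either rename the output file to match or try passing response_format="wav" to client.audio.speech.create, assuming the endpoint honors that parameter.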

Run it

Here is the complete script I run from the terminal. For local tests I paste my API key directly into the script, and I use a recording of a short question such as "What is the capital of Norway?" as my test WAV.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

SYSTEM_PROMPT = """You are an Oxlo.ai voice assistant.
Keep every answer to one or two short sentences.
Never use markdown, lists, or symbols that are hard to speak aloud."""

def voice_assist(input_wav, output_wav):
    # 1. Transcribe
    with open(input_wav, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=f,
        )
    user_msg = transcript.text
    print("User said:", user_msg)

    # 2. Reason
    chat = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    reply = chat.choices[0].message.content
    print("Assistant:", reply)

    # 3. Speak
    speech = client.audio.speech.create(
        model="kokoro-82m",
        voice="af_bella",  # use a supported voice from your Oxlo.ai dashboard
        input=reply,
    )
    with open(output_wav, "wb") as f:
        f.write(speech.content)
    print(f"Audio written to {output_wav}")

if __name__ == "__main__":
    voice_assist("user_prompt.wav", "assistant_response.wav")

Example terminal output:

User said: What is the capital of Norway?
Assistant: The capital of Norway is Oslo.
Audio written to assistant_response.wav

Wrap-up and next steps

You now have a working voice assistant that runs entirely on Oxlo.ai. Two concrete ways to push it further: add a microphone input loop for real-time conversation, or switch the LLM step to deepseek-v3.2 or kimi-k2.6 when you need stronger reasoning or coding help. For cost details on running this pipeline at scale, see https://oxlo.ai/pricing.
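
A real-time loop needs the assistant to remember earlier turns, which the single-shot script does not do. The sketch below shows one way to keep a rolling history in the chat messages format used in this guide. The helpers record_turn and build_messages and the max_turns limit are hypothetical; microphone capture itself is left out because it depends on the audio library you choose.

```python
def record_turn(history, user_text, assistant_text):
    """Append one completed exchange to the running history list."""
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})


def build_messages(system_prompt, history, user_text, max_turns=4):
    """Return the messages list for the next chat call.

    Only the most recent max_turns exchanges are kept so the request
    stays small during a long conversation.
    """
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    return (
        [{"role": "system", "content": system_prompt}]
        + recent
        + [{"role": "user", "content": user_text}]
    )
```

Inside a loop, pass build_messages(SYSTEM_PROMPT, history, user_msg) as the messages argument of client.chat.completions.create, then call record_turn(history, user_msg, reply) once the reply has been spoken.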