I built a fully spoken voice assistant that transcribes raw audio, reasons over the text with an LLM, and speaks the answer back using text-to-speech. The entire pipeline runs on Oxlo.ai, so one API key covers speech recognition, reasoning, and voice synthesis, and because pricing is per request, cost does not grow with input length. In this guide, I walk through the exact code, from WAV file to spoken reply.
What you'll need
- Python 3.10 or newer
- The OpenAI SDK: pip install openai
- An Oxlo.ai API key from https://portal.oxlo.ai
- A 16 kHz mono WAV file named user_prompt.wav for testing
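Before sending anything to the API, it can help to confirm that the test file really is 16 kHz mono. The sketch below uses Python's standard wave module. check_wav is a hypothetical helper name of my own, not part of the Oxlo.ai or OpenAI SDKs, and it only reads uncompressed PCM WAV files.

```python
import wave


def check_wav(path: str, expected_rate: int = 16000, expected_channels: int = 1) -> list[str]:
    """Return a list of problems found in the WAV header; an empty list means it matches."""
    problems = []
    # wave.open raises wave.Error if the file is not an uncompressed PCM WAV
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(
                f"sample rate is {wav.getframerate()} Hz, expected {expected_rate} Hz"
            )
        if wav.getnchannels() != expected_channels:
            problems.append(
                f"file has {wav.getnchannels()} channel(s), expected {expected_channels}"
            )
    return problems
```

Calling check_wav("user_prompt.wav") and printing the result before Step 2 makes a mismatched recording obvious early.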
Step 1: Initialize the Oxlo.ai client
I start by creating an OpenAI-compatible client pointed at Oxlo.ai. This single client will handle transcription, chat, and speech endpoints.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
# Verify the connection with a minimal request
test = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[{"role": "user", "content": "say ok"}],
)
print("Client ready:", test.choices[0].message.content)
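If you would rather not paste the key into the source file, one option is to read it from an environment variable. This is a minimal sketch: the variable name OXLO_API_KEY and the helper load_api_key are my own assumptions, not something Oxlo.ai prescribes.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Read the API key from the environment and fail early if it is missing."""
    # env_var is a hypothetical variable name chosen for this example
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before running.")
    return key
```

With that helper in place, the client line can pass api_key=load_api_key() instead of the placeholder string.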
Step 2: Transcribe audio with Whisper
I send the user's WAV file to Oxlo.ai's Whisper endpoint and get back plain text. Oxlo.ai hosts Whisper Large v3 with no cold starts, so the first request is as fast as any other.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
# The with block closes the file even if the request raises
with open("user_prompt.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )
user_message = transcription.text
print("User said:", user_message)
Step 3: Define the assistant personality
Voice assistants fail when they ramble. I lock behavior with a system prompt that forces brevity and pronunciation-friendly output.
SYSTEM_PROMPT = """You are an Oxlo.ai voice assistant.
Keep every answer to one or two short sentences.
Never use markdown, lists, or symbols that are hard to speak aloud."""
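A system prompt is only a request, so the model may still slip in the occasional asterisk or bullet. As a belt-and-braces measure, a small cleanup pass can run on the reply before it goes to text-to-speech. This is a rough sketch with a hypothetical helper name, clean_for_speech, and its rules are deliberately blunt: it also strips underscores inside words, for example.

```python
import re


def clean_for_speech(text: str) -> str:
    """Strip common markdown markers so the text reads cleanly aloud."""
    # Remove emphasis, code, heading, and quote markers wherever they appear
    text = re.sub(r"[*_`#>]+", "", text)
    # Remove "-", "+", or "1." style list markers at the start of a line
    text = re.sub(r"^\s*(?:[-+]|\d+\.)\s+", "", text, flags=re.MULTILINE)
    # Collapse newlines and repeated spaces into single spaces
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

For instance, clean_for_speech("**Oslo** is the capital.") gives "Oslo is the capital."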
Step 4: Generate a response with Llama 3.3 70B
I pass the transcript and system prompt to Llama 3.3 70B. Because Oxlo.ai uses request-based pricing, a long voice transcript does not inflate the cost of this call, which matters when users speak in long paragraphs.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ],
)
assistant_text = response.choices[0].message.content
print("Assistant:", assistant_text)
Step 5: Synthesize speech with Kokoro
Finally, I send the assistant's text to Oxlo.ai's Kokoro 82M endpoint and write the returned audio bytes to a WAV file.
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
speech = client.audio.speech.create(
    model="kokoro-82m",
    voice="af_bella",  # replace with your preferred Oxlo.ai voice id
    input=assistant_text,
)
with open("assistant_response.wav", "wb") as f:
    f.write(speech.content)
print("Saved assistant_response.wav")
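One thing worth checking is the container format of the returned bytes. OpenAI's own speech endpoint returns MP3 unless a response_format is requested, and I have not confirmed what Oxlo.ai returns by default, so the .wav extension above is an assumption. The sketch below is a heuristic check on the leading bytes; sniff_audio_format is a hypothetical helper, not an SDK function.

```python
def sniff_audio_format(data: bytes) -> str:
    """Guess the audio container from its leading bytes (heuristic only)."""
    # WAV: a RIFF header with the form type "WAVE" at offset 8
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP3 files often begin with an ID3v2 tag
    if data[:3] == b"ID3":
        return "mp3"
    # Otherwise look for an MPEG audio frame sync (11 set bits).
    # ADTS AAC shares this pattern, so treat the result as a hint.
    if len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0:
        return "mp3"
    if data[:4] == b"OggS":
        return "ogg"
    if data[:4] == b"fLaC":
        return "flac"
    return "unknown"
```

Running sniff_audio_format(speech.content) before writing the file tells you whether the extension matches what was actually returned.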
Run it
Here is the complete script I run from the terminal. For local tests I keep my API key hard-coded in the script, and my test WAV is a recording of a short question such as "What is the capital of Norway?".
from openai import OpenAI
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")
SYSTEM_PROMPT = """You are an Oxlo.ai voice assistant.
Keep every answer to one or two short sentences.
Never use markdown, lists, or symbols that are hard to speak aloud."""
def voice_assist(input_wav, output_wav):
    # 1. Transcribe
    with open(input_wav, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-large-v3",
            file=f,
        )
    user_msg = transcript.text
    print("User said:", user_msg)
    # 2. Reason
    chat = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_msg},
        ],
    )
    reply = chat.choices[0].message.content
    print("Assistant:", reply)
    # 3. Speak
    speech = client.audio.speech.create(
        model="kokoro-82m",
        voice="af_bella",  # use a supported voice from your Oxlo.ai dashboard
        input=reply,
    )
    with open(output_wav, "wb") as f:
        f.write(speech.content)
    print(f"Audio written to {output_wav}")
if __name__ == "__main__":
    voice_assist("user_prompt.wav", "assistant_response.wav")
Example terminal output:
User said: What is the capital of Norway?
Assistant: The capital of Norway is Oslo.
Audio written to assistant_response.wav
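To try different recordings without editing the script, the file names could come from the command line instead. Here is a sketch using the standard argparse module; parse_args and the -o/--output flag are my own illustrative choices, and the defaults match the file names used above.

```python
import argparse


def parse_args(argv=None):
    """Parse input and output paths, defaulting to the names used in this guide."""
    parser = argparse.ArgumentParser(
        description="Transcribe a WAV file, ask the LLM, and save the spoken reply."
    )
    parser.add_argument("input_wav", nargs="?", default="user_prompt.wav",
                        help="path to the recorded question")
    parser.add_argument("-o", "--output", default="assistant_response.wav",
                        help="where to write the synthesized answer")
    # argv=None makes argparse read sys.argv[1:]
    return parser.parse_args(argv)
```

In the main guard, this would become args = parse_args() followed by voice_assist(args.input_wav, args.output).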
Wrap-up and next steps
You now have a working voice assistant that runs entirely on Oxlo.ai. Two concrete ways to push it further are to add a microphone input loop for real-time conversation, or to switch the LLM step to deepseek-v3.2 or kimi-k2.6 when you need stronger reasoning or coding help. For cost details on running this pipeline at scale, see https://oxlo.ai/pricing.
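For the conversation loop, each turn would need the earlier exchanges passed back to the model. A minimal sketch, assuming the chat endpoint accepts the same role and content message list used in Step 4. build_messages and max_turns are hypothetical names, and the trimming is just one simple way to keep requests small.

```python
def build_messages(system_prompt: str, history: list[dict],
                   user_message: str, max_turns: int = 4) -> list[dict]:
    """Assemble the message list for one turn, keeping only recent history."""
    # One turn is a user message plus an assistant reply, so keep 2 * max_turns entries.
    # The guard matters because history[-0:] would return the whole list.
    recent = history[-2 * max_turns:] if max_turns > 0 else []
    return [
        {"role": "system", "content": system_prompt},
        *recent,
        {"role": "user", "content": user_message},
    ]
```

After each reply, the loop would append the user message and the assistant's answer to history so the next call sees them.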