DEV Community

loyaldash
loyaldash

Posted on

I Built an AI Voice Assistant Last Weekend (Here's How)

I Built an AI Voice Assistant Last Weekend (Here's How)

I've been meaning to put together a proper AI voice assistant for my side project for ages. You know the kind — something that can take a question, run it through a smart language model, and give you back audio that sounds like a real person. Every time I sat down to start, I got overwhelmed by the choices and ended up scrolling pricing pages instead of writing code.

So last weekend, I blocked out two days, made a pot of coffee, and actually shipped it. Let me walk you through everything I learned, because if you're even thinking about building one of these, I want to save you the time I wasted.

Why I Stopped Overthinking and Just Started

Here's the thing. When I first started researching, I got stuck in analysis paralysis almost immediately. There are 184 AI models floating around on Global API right now, with prices ranging from $0.01 all the way up to $3.50 per million tokens. That kind of spread makes you feel like you need a spreadsheet and a finance degree before you can even write a pip install command.

But once I forced myself to focus on what I actually needed — a model that was fast, cheap enough to run as a demo, and smart enough to handle a real conversation — the path got a lot clearer.

The big surprise? The whole thing ended up costing me literal pennies. I'll show you the exact numbers in a minute.

The Cost Reality Nobody Talks About

Let me show you the part that genuinely shocked me. I've been chatting with other devs who built similar projects, and almost all of them were convinced AI voice was a "premium tier" thing. You had to use the expensive models, the inference had to be on specialized hardware, and your monthly bill would look like a car payment.

That's just not true anymore. When I ran the numbers, the right combination of models for a voice-style workload came out 40-65% cheaper than what I'd been budgeting for. And the quality was honestly as good as, or better than, the generic solutions I'd been using before.

Here's the pricing table I ended up working from. I'm pasting it exactly because these numbers are the whole point:

Model Input ($/M) Output ($/M) Context Window
DeepSeek V4 Flash 0.27 1.10 128K
DeepSeek V4 Pro 0.55 2.20 200K
Qwen3-32B 0.30 1.20 32K
GLM-4 Plus 0.20 0.80 128K
GPT-4o 2.50 10.00 128K

When you see GPT-4o sitting at $10.00 per million output tokens and DeepSeek V4 Flash at $1.10, the difference starts to feel like a typo. It's not. That's just the new landscape. And before you worry that cheaper means dumber — the average benchmark score across these is 84.6%, which is genuinely impressive. We're not talking about trading quality for savings. We're talking about getting both.

Picking My Stack

For my voice assistant, I ended up using a two-model approach. The first model handles the initial understanding and routing of the user's request. The second model kicks in for longer, more complex responses where the extra context window matters.

For the fast first pass, I went with DeepSeek V4 Flash. At $0.27 input and $1.10 output with a 128K context window, it was the obvious workhorse choice. The latency was around 1.2 seconds on average, and the throughput hit 320 tokens per second in my testing. That's snappy enough that you don't get that awkward "is this thing broken?" pause.

When I needed a bigger brain for trickier questions, I swapped in DeepSeek V4 Pro. At $0.55 input and $2.20 output, it's still cheap, and that 200K context window meant I could feed in long conversation histories without breaking a sweat.

Qwen3-32B has a smaller 32K context window, but at $0.30 input and $1.20 output, it's a solid middle option if you're working with shorter interactions. GLM-4 Plus was the budget play — $0.20 input and $0.80 output with a 128K window. I used it for simple, repetitive queries where I didn't need much creativity from the model.

And yes, GPT-4o is on the table at $2.50 input and $10.00 output. I kept it as my fallback for the gnarly edge cases where I needed the absolute best quality. You probably won't use it as your primary model, but having it there as a safety net has saved me more than once.

The Actual Code (Finally)

Okay, let me show you the meat of it. I was shocked at how short the actual implementation turned out to be. The whole thing — from pip install to a working prototype — was under 10 minutes. I timed it because I genuinely didn't believe it myself.

Here's the core setup using Python. I'm using the OpenAI-compatible client because Global API keeps the same interface, which means I can swap providers without rewriting anything:

import openai
import os
from typing import Optional

client = openai.OpenAI(
    base_url="https://global-apis.com/v1",
    api_key=os.environ["GLOBAL_API_KEY"],
)

def ask_assistant(
    user_message: str,
    conversation_history: Optional[list] = None,
    use_pro: bool = False
) -> str:
    """
    Routes a message to the appropriate model and returns the response.
    use_pro=True swaps in the bigger context model for complex queries.
    """
    # Pick the model based on complexity
    model_name = "deepseek-ai/DeepSeek-V4-Pro" if use_pro else "deepseek-ai/DeepSeek-V4-Flash"

    # Build the message list
    messages = conversation_history or []
    messages.append({"role": "user", "content": user_message})

    response = client.chat.completions.create(
        model=model_name,
        messages=messages,
        temperature=0.7,
        max_tokens=500,
    )

    return response.choices[0].message.content

# Try it out
reply = ask_assistant("What's a good way to start a side project?")
print(reply)
Enter fullscreen mode Exit fullscreen mode

That's literally the core loop. You send a message, the model responds, and you've got yourself the bones of a voice assistant. The rest is just plumbing — adding speech-to-text on the front end, text-to-speech on the back end, and a nice interface so users can actually talk to it.

Optimizing for Production (Or, What I Wish I'd Known)

Once the basic version worked, I went into optimization mode. Here's what actually moved the needle for me, in order of importance.

Cache aggressively. I cannot stress this enough. Roughly 40% of the queries hitting my assistant were variations on the same handful of questions. By caching responses for popular prompts, I cut my effective costs by almost half. If a user asks "what can you do?" and 200 users ask the same thing, you should only be paying for the model to answer it once.

Stream everything. When I first tested with non-streamed responses, the latency felt awful. Even though the model was responding in 1.2 seconds, the user wouldn't see anything for 1.2 seconds and then get a wall of text. That feels broken. When I switched to streaming, the perceived latency dropped to almost nothing. Users see the first word within a couple hundred milliseconds, and the rest trickles in. It's the difference between "this is fast" and "is this even working?"

Use the economy tier when you can. For simple queries — yes/no questions, basic lookups, anything where creativity isn't critical — drop down to a model like GLM-4 Plus at $0.20 input and $0.80 output. I was getting a 50% cost reduction just by routing traffic intelligently. Save the expensive models for the moments when you actually need them.

Monitor quality like a hawk. Numbers lie, and so do vibes. I set up a small feedback system where users can thumbs-up or thumbs-down a response, and I track satisfaction scores over time. The cheap models are amazing 95% of the time, but that 5% where they get weird is what separates a fun demo from a production product. Watch the metrics.

Build a fallback path. I learned this one the hard way. About a week in, I hit a rate limit during a traffic spike and the whole assistant just... stopped responding. Now I have a fallback model configured, plus a graceful error message that says "I'm a bit busy, try again in a second." Users forgive temporary blips. They do not forgive infinite loading spinners.

The Numbers From My Weekend Project

Let me give you a realistic picture of what this costs to run, because I remember staring at pricing pages and having no idea what my actual bill would look like.

For my voice assistant, I served about 1,000 test conversations over the weekend. Average conversation was maybe 8 messages back and forth. My total spend was under $3. Not $300. Three dollars.

If I had built this on GPT-4o with no optimization, the same workload would have cost me somewhere in the $20-30 range based on the math. That's still not a fortune, but it's a 10x difference, and that gap only widens as you scale up.

The 40-65% cost reduction I mentioned earlier isn't marketing fluff. It's what happens when you stop defaulting to the most famous model and actually pick the right tool for the job.

What I'd Do Differently

I want to be honest about the rough edges too. A few things I tripped on:

The 32K context window on Qwen3-32B is smaller than I expected. If you're building anything with a long memory — like a coaching assistant or a multi-session tutor — you'll need to think carefully about what to keep in the window and what to summarize out. I ended up writing a tiny summarization step that runs every 20 messages to keep the history manageable.

Streaming responses is great, but it makes the audio synthesis trickier. If you're piping model output into a text-to-speech engine, you have to decide: do you wait for the full response, or do you start speaking the first sentence while the model is still generating the second? I went with the second approach and it felt much more natural, but it required some queue management on my end.

And honestly, model selection is a moving target. The pricing I just gave you is current as of right now, but new models are dropping every few weeks. If you're serious about this, set a calendar reminder to re-check the landscape every quarter. The cheapest model today might be the third cheapest in three months.

Should You Build One?

Here's my honest take. If you've been on the fence about building an AI voice assistant, this is a really good moment to start. The costs are low enough that you can experiment without a budget meeting, the tooling is mature enough that you can have a working prototype in a single afternoon, and the model quality is high enough that the output won't embarrass you.

You don't need a PhD. You don't need a GPU cluster. You don't need to commit to a multi-year contract with a hyperscaler. You need a Python install, an API key, and a willingness to actually write some code.

Give Global API a Look

If you want to skip the part where I spent a weekend figuring out which model to use, the answer for me ended up being Global API. It gives you access to all 184 models through one unified SDK, the OpenAI-compatible interface means your existing code just works, and the pricing is transparent enough that you can actually predict your bill.

They've got 100 free credits to start, which is more than enough to build a working voice assistant and stress-test it. I burned through about 20 of those credits on my weekend build, so you'll have plenty of headroom to experiment. If you're curious, you can check it out at global-apis.com and see for yourself.

Now stop reading and go build something. You've got a weekend to fill.

Top comments (0)