Akritah Sahu

Posted on Apr 15

I Built a Voice AI Agent in 72 Hours — Here's Every Decision I'd Make Differently

#llm #agents #python #memoai

A post-mortem on local speech-to-text, LLM intent classification, compound commands, and why I integrated the product of the very company I'm applying to.

LLMs forget everything the moment a session closes. You tell your coding assistant you prefer Python. Next day, it asks again. You paste the same context into ChatGPT for the third time this week. The statelessness isn't a bug — it's a fundamental property of how these models work. But it doesn't have to be a property of the systems built on top of them.

This is the story of building a voice-controlled local AI agent that transcribes speech, understands intent, executes real actions on your machine, and — critically — remembers what it learns about you across sessions. It took 72 hours, broke in exactly the ways I expected and several I didn't, and taught me more about local AI pipelines than six months of tutorials.

The full code is on GitHub →

What the agent actually does

You speak (or type, if you're testing). The agent transcribes your audio locally, sends the text to a local LLM running via Ollama, gets back a structured intent JSON, and routes to one of four tool handlers: create a file, generate and save code, summarize text, or have a conversation.

The pipeline in one line:

Audio → faster-whisper → Ollama/llama3 → intent JSON → executor → output/

But the interesting parts aren't in that line. They're in every decision made along the way.

Decision 1: faster-whisper over original Whisper

The assignment said to use a HuggingFace model. The obvious choice is OpenAI's Whisper via the transformers library. I benchmarked it first.

On my laptop (no dedicated GPU), whisper-base via HuggingFace took 11 seconds to transcribe a 10-second audio clip. That's a non-starter for an interactive voice agent.

faster-whisper uses CTranslate2, a C++ inference engine that applies int8 quantization and kernel fusion. Same model weights, completely different execution path. The same transcription took 1.9 seconds. 5.8× faster on CPU, using half the memory.

Model	Method	Time (10s clip)	Approx. WER
whisper-tiny	HuggingFace	6.2s	~14%
whisper-base	HuggingFace	11.1s	~9%
whisper-base	faster-whisper (int8)	1.9s	~8%
whisper-small	faster-whisper (int8)	4.1s	~5%
whisper-large-v3	Groq API	0.4s	~3%

base with int8 is the sweet spot: fast enough to feel responsive, accurate enough for voice commands (which tend to be short and use common vocabulary). For production I'd upgrade to small or large-v3 via Groq.
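To make the table easier to compare, here is a small sketch that turns the timings above into a real-time factor (processing time divided by audio length) and a speedup ratio. The numbers are copied from the table; the function names are only illustrative.

```python
# Timings copied from the table above: (model, method, seconds for a 10 s clip).
CLIP_SECONDS = 10.0
TIMINGS = [
    ("whisper-tiny", "HuggingFace", 6.2),
    ("whisper-base", "HuggingFace", 11.1),
    ("whisper-base", "faster-whisper (int8)", 1.9),
    ("whisper-small", "faster-whisper (int8)", 4.1),
    ("whisper-large-v3", "Groq API", 0.4),
]


def real_time_factor(processing_seconds: float, audio_seconds: float = CLIP_SECONDS) -> float:
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return processing_seconds / audio_seconds


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    for model, method, secs in TIMINGS:
        print(f"{model:18} {method:24} RTF={real_time_factor(secs):.2f}")
    print(f"whisper-base speedup: {speedup(11.1, 1.9):.1f}x")
```

A real-time factor above 1.0 means the transcription takes longer than the audio itself, which is the case for whisper-base through HuggingFace in these measurements.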

The Groq fallback is worth highlighting. I added it for one reason: not everyone running this will have 4GB of RAM to spare. Setting GROQ_API_KEY in .env flips the STT module to Groq's API automatically — no code changes. The system degrades gracefully rather than refusing to run.

import os

def transcribe_audio(audio_path: str):
    # Use Groq's hosted Whisper when a key is configured; otherwise stay local.
    if os.getenv("GROQ_API_KEY"):
        return _transcribe_groq(audio_path)   # fast cloud path
    return _transcribe_local(audio_path)       # default: local
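For reference, the local path could look roughly like this. It is a minimal sketch, not the repository's actual code: it assumes the faster-whisper package is installed, uses its WhisperModel class with the base model and int8 compute type, and the names transcribe_local_sketch and join_segments are hypothetical.

```python
from typing import Any, Iterable, Optional

_MODEL: Optional[Any] = None  # cached so the weights load only once per process


def _get_model() -> Any:
    """Lazily load whisper-base with int8 weights on CPU."""
    global _MODEL
    if _MODEL is None:
        # Imported lazily because it is a heavy, optional dependency.
        from faster_whisper import WhisperModel
        _MODEL = WhisperModel("base", device="cpu", compute_type="int8")
    return _MODEL


def join_segments(segments: Iterable[Any]) -> str:
    """Concatenate segment texts; faster-whisper yields segments lazily."""
    return " ".join(seg.text.strip() for seg in segments).strip()


def transcribe_local_sketch(audio_path: str, model: Optional[Any] = None) -> str:
    """Transcribe one audio file; pass a model explicitly to reuse or to test."""
    model = model or _get_model()
    segments, _info = model.transcribe(audio_path, beam_size=5)
    return join_segments(segments)
```

Caching the model matters in an interactive app: loading the weights on every call would cost far more than the transcription itself.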

Decision 2: Why keyword matching for intent is a trap

My first instinct was a lookup table:

if "create" in text and "file" in text:
    return "create_file"
elif "write" in text or "code" in text:
    return "write_code"

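This works for the exact examples in any demo. It breaks immediately in real usage.

"Write up a short summary of this article" matches "write" and gets routed to write_code, even though the user wants a summary. "I'd like to generate a sorting function" contains neither "write" nor "code", so it matches nothing. And because these are case-sensitive substring checks, "Write me a retry helper in Python" only matches if the text is lowercased first.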

The real problem: intent is semantic, not lexical. The meaning of a sentence doesn't live in its keywords, it lives in the relationship between its tokens. That's exactly what LLMs are built to understand.
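A quick way to see the failure modes is to run the lookup table against a few realistic utterances. This is a sketch under two assumptions that are not in the snippet above: the text is lowercased first, and anything unmatched falls through to general_chat. The example utterances are made up for illustration.

```python
def keyword_intent(text: str) -> str:
    """The lookup-table classifier, with two assumptions marked below."""
    text = text.lower()  # assumption: normalise case before matching
    if "create" in text and "file" in text:
        return "create_file"
    elif "write" in text or "code" in text:
        return "write_code"
    return "general_chat"  # assumption: fallback label for anything unmatched


# (utterance, the intent a person would mean)
EXAMPLES = [
    ("I'd like to generate a sorting function", "write_code"),
    ("Write up a short summary of this article", "summarize_text"),
    ("Create a profile page component in React", "write_code"),
]


def misclassified():
    """Return (utterance, wanted, got) for every example the rules get wrong."""
    return [
        (utterance, wanted, keyword_intent(utterance))
        for utterance, wanted in EXAMPLES
        if keyword_intent(utterance) != wanted
    ]


if __name__ == "__main__":
    for utterance, wanted, got in misclassified():
        print(f"{utterance!r}: wanted {wanted}, got {got}")
```

The third example is the sneaky one: "profile" contains the substring "file", so a request for a React component is routed to the file-creation handler.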

I switched to a structured JSON prompt sent to Ollama at temperature: 0.1:

Analyze this command and return ONLY a JSON object:
{
  "primary_intent": "write_code | create_file | summarize_text | general_chat",
  "sub_intents": [...],
  "confidence": "high | medium | low",
  "suggested_filename": "...",
  "language": "..."
}

A low temperature makes the output more deterministic and more likely to stay JSON-compliant, though it does not guarantee valid JSON. A high temperature makes it creative and unpredictable, which is exactly wrong for structured output.
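Here is one way the request to Ollama could be assembled, using only the standard library. Treat it as a sketch: the endpoint is Ollama's default local address, the prompt wording is abbreviated, and the "format": "json" option is an addition that the text above does not mention (it asks Ollama to constrain the output to valid JSON). The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

PROMPT_TEMPLATE = """Analyze this command and return ONLY a JSON object with the keys
primary_intent, sub_intents, confidence, suggested_filename, language.

Command: {command}"""


def build_payload(command: str, model: str = "llama3") -> dict:
    """Build the request body for a single, non-streaming completion."""
    return {
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(command=command),
        "stream": False,                  # one JSON response instead of chunks
        "format": "json",                 # assumption: not mentioned in the post
        "options": {"temperature": 0.1},  # low temperature for stable output
    }


def classify(command: str, timeout: float = 30.0) -> dict:
    """Send the prompt to a running Ollama server and parse the intent JSON."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(command)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read())
    # The generated text lives in the "response" field of Ollama's reply.
    return json.loads(body["response"])
```

Calling classify("write a retry helper in Python") requires Ollama to be running locally with the model already pulled; build_payload can be inspected without a server.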

Intent accuracy on a 50-sample test set:

Method	Accuracy	Latency
Keyword rules	74%	< 1ms
llama3 (8B, Ollama)	94%	~2.8s
mistral (7B, Ollama)	91%	~2.3s
codellama (7B)	87%	~2.1s

The 20-point accuracy gap between rules and LLM is the difference between a demo and something usable. The 2.8s latency is worth it.

The rule-based classifier still lives in the codebase as a fallback. If Ollama isn't running, the system downgrades to 74% accuracy rather than crashing. That's the design philosophy throughout: every component has an escape hatch.
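The escape hatch can be expressed as a thin wrapper around the two classifiers. This is a sketch, not the project's code: both classifiers are passed in as plain callables, and the "source" and "fallback_reason" keys are hypothetical additions for debugging.

```python
from typing import Callable


def classify_with_fallback(
    command: str,
    llm_classifier: Callable[[str], dict],
    rule_classifier: Callable[[str], str],
) -> dict:
    """Try the LLM first; fall back to keyword rules if it is unavailable."""
    try:
        result = llm_classifier(command)
        if "primary_intent" not in result:
            raise ValueError("LLM response is missing primary_intent")
        return {**result, "source": "llm"}
    except (OSError, ValueError, KeyError, TypeError) as exc:
        # OSError covers connection failures (urllib's URLError subclasses it);
        # ValueError covers malformed JSON (JSONDecodeError subclasses it).
        return {
            "primary_intent": rule_classifier(command),
            "sub_intents": [],
            "confidence": "low",
            "source": "rules",
            "fallback_reason": str(exc),
        }
```

Marking the fallback result as low confidence lets the UI warn the user that the weaker classifier made the call.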

Decision 3: The compound intent problem nobody mentions

Most intent classifiers assume one intent per utterance. Voice interfaces don't work that way.

"Summarize this article and save it to summary.txt" contains two intents: summarize_text and create_file. "Write a retry function in Python and call it retry.py" contains write_code and an implicit create_file.

The insight is that the JSON schema itself can model this, using sub_intents plus an is_compound flag:

{
  "primary_intent": "summarize_text",
  "sub_intents": ["create_file"],
  "suggested_filename": "summary.txt",
  "is_compound": true
}

The executor then chains the handlers in order: run the primary intent, pass its output to the sub-intent handler. For the summarize-and-save case:

result = _handle_summarize(context, model)
if "create_file" in sub_intents and result["success"]:
    _save_text(result["output"], suggested_filename)

With a clear schema and temperature: 0.1, the LLM usually picks up "and save it" as a sub-intent without extra prompting. The executor chains the results. Clean separation of concerns.
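Generalised a little, the chaining might look like the sketch below. The handler registry, the execute function, and the "saved_to" key are assumptions made for illustration; only the intent names and the output/ directory come from the post.

```python
from pathlib import Path
from typing import Callable, Dict

OUTPUT_DIR = Path("output")


def save_text(text: str, filename: str, out_dir: Path = OUTPUT_DIR) -> Path:
    """Write text under out_dir, creating parent folders as needed."""
    target = out_dir / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text, encoding="utf-8")
    return target


def execute(
    intent_data: dict,
    handlers: Dict[str, Callable[[dict], dict]],
    out_dir: Path = OUTPUT_DIR,
) -> dict:
    """Run the primary intent, then feed its output to the create_file sub-intent."""
    result = handlers[intent_data["primary_intent"]](intent_data)
    if not result.get("success"):
        return result  # never chain on top of a failed primary step
    if "create_file" in intent_data.get("sub_intents", []):
        filename = intent_data.get("suggested_filename") or "output.txt"
        result["saved_to"] = str(save_text(result["output"], filename, out_dir))
    return result
```

Path validation is deliberately left out of this sketch; a real executor would want to reject filenames that resolve outside the output directory.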

Decision 4: Integrating Mem0 — and why it was the obvious choice

Here's the thing about this project: I'm applying to Mem0, a company whose entire product is AI memory. Their core thesis (stated on their homepage, in their YC application, and in every piece of content they publish) is that AI agents should remember what they learn about you across sessions. Using the product of the company I'm applying to, in a project they assigned me, wasn't clever positioning. It was just the right technical choice.

Without memory, the agent is stateless. Every session starts fresh. You tell it you prefer Python, it forgets. You tell it you always use snake_case filenames, it forgets. You use it every day for a week, and it still doesn't know you.

With Mem0:

import os

from mem0 import MemoryClient

client = MemoryClient(api_key=os.getenv("MEM0_API_KEY"))

def save_interaction(command, intent, result):
    client.add([
        {"role": "user",      "content": command},
        {"role": "assistant", "content": f"Intent: {intent}. Output: {result[:200]}"}
    ], user_id="voice-agent-user")

def get_relevant_context(command):
    memories = client.search(command, user_id="voice-agent-user", limit=5)
    return "\n".join(f"- {m['memory']}" for m in memories)

After a few sessions, get_relevant_context("write a search function") returns facts like:

"User prefers Python over JavaScript"
"User uses snake_case for function names"
"User's last code request was for a retry helper in utils/"

That context gets prepended to the intent classification prompt. The agent now writes Python without being asked, uses snake_case without being reminded, and suggests utils/search.py without being told where to put things.
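Prepending the context is plain string assembly. A minimal sketch, assuming each memory is a mapping with a "memory" key as in the snippet above; the prompt wording and both function names are hypothetical.

```python
from typing import Iterable, Mapping


def format_memories(memories: Iterable[Mapping[str, str]]) -> str:
    """Render retrieved memories as a bullet list, skipping empty entries."""
    return "\n".join(f"- {m['memory']}" for m in memories if m.get("memory"))


def build_intent_prompt(command: str, memory_context: str) -> str:
    """Put known user facts ahead of the classification instructions."""
    parts = []
    if memory_context.strip():
        parts.append("Known facts about this user:\n" + memory_context.strip())
    parts.append("Analyze this command and return ONLY a JSON object.")
    parts.append(f"Command: {command}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    context = format_memories([
        {"memory": "User prefers Python over JavaScript"},
        {"memory": "User uses snake_case for function names"},
    ])
    print(build_intent_prompt("write a search function", context))
```

When no memories come back, the prompt simply omits the facts section, so a first-time user gets the same behaviour as the memoryless agent.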

This is the problem Mem0 was built to solve, and it works as advertised. A handful of lines of integration code. Persistent memory that outlives the session. Semantic retrieval that finds relevant facts, not just exact matches.

Decision 5: The human-in-the-loop confirmation

The agent can create files and write code. That means it can overwrite things. I wanted a hard pause before any file write: nothing touches disk without explicit confirmation.

Implementing this in Streamlit is non-trivial because Streamlit reruns the entire script on every interaction. The pattern that works: store the pending operation in st.session_state, rerun, display the confirmation UI, and only execute after an explicit button click.

if intent in {"write_code", "create_file"}:
    st.session_state.pending_confirmation = {
        "intent_data": intent_data,
        "detail": f"output/{filename}"
    }
    st.rerun()

The UI shows: About to write to output/bubble_sort.py. Confirm? with a green Confirm button and a red Cancel button. Nothing executes until the user explicitly chooses.

This isn't just a safety feature. It's the right UX. Voice commands are imprecise. The agent might mishear "create utils/retry.py" as "create utils/write.py". The confirmation step catches that before data is written.
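The park-then-resolve flow can be modelled without Streamlit at all, which also makes it easy to test. In this sketch a plain dict stands in for st.session_state, and request_write and resolve_pending are hypothetical names; the key point is that the pending entry is cleared before anything runs, so a later rerun cannot show the dialog again.

```python
from typing import Callable, MutableMapping, Optional


def request_write(state: MutableMapping, intent_data: dict, filename: str) -> None:
    """Park the operation instead of running it."""
    state["pending_confirmation"] = {
        "intent_data": intent_data,
        "detail": f"output/{filename}",
    }


def resolve_pending(
    state: MutableMapping,
    confirmed: bool,
    run: Callable[[dict], dict],
) -> Optional[dict]:
    """Call on the rerun that follows a Confirm or Cancel click."""
    # pop() clears the entry first, so the dialog cannot reappear on the next rerun.
    pending = state.pop("pending_confirmation", None)
    if pending is None or not confirmed:
        return None
    return run(pending["intent_data"])
```

In the app, the Confirm button would call resolve_pending with confirmed=True and the Cancel button with confirmed=False; both paths leave the state clean.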

The three things I'd build differently

1. Streaming LLM output. Right now the code generation step freezes the UI for 3-8 seconds while the LLM produces the full response. stream: True in the Ollama API call, combined with Streamlit's st.write_stream(), would show tokens appearing in real time. That's the difference between "is it broken?" and "I can see it working."

2. Whisper large-v3 for production. The base model misses technical terms. "Create a memoization function" comes back as "Create a memorization function." For a coding agent, that matters. Large-v3 via Groq costs fractions of a cent per request and has dramatically better accuracy on domain-specific vocabulary.

3. SQLite for action history instead of session state. The session history panel currently lives in st.session_state and resets on every restart. Twenty lines of SQLite would make it persistent: every command, intent, result, and file path, queryable across sessions. Combined with Mem0's semantic memory, you'd have both structured logs and unstructured context — exactly what a production agent needs.
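The SQLite idea from item 3 really is about twenty lines. A rough sketch using the standard library's sqlite3 module; the table layout, the history.db filename, and the function names are all assumptions.

```python
import sqlite3
from datetime import datetime, timezone
from typing import List, Optional, Tuple

SCHEMA = """
CREATE TABLE IF NOT EXISTS actions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    created_at TEXT NOT NULL,
    command TEXT NOT NULL,
    intent TEXT NOT NULL,
    result TEXT,
    file_path TEXT
)
"""


def open_history(db_path: str = "history.db") -> sqlite3.Connection:
    """Open (or create) the history database and make sure the table exists."""
    conn = sqlite3.connect(db_path)
    conn.execute(SCHEMA)
    return conn


def log_action(
    conn: sqlite3.Connection,
    command: str,
    intent: str,
    result: str,
    file_path: Optional[str] = None,
) -> None:
    """Append one executed command to the history."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO actions (created_at, command, intent, result, file_path) "
            "VALUES (?, ?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), command, intent, result, file_path),
        )


def recent_actions(conn: sqlite3.Connection, limit: int = 20) -> List[Tuple]:
    """Return the newest actions first."""
    cursor = conn.execute(
        "SELECT command, intent, result, file_path FROM actions ORDER BY id DESC LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Passing ":memory:" as the path gives a throwaway database, which is handy for tests.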

What the hardest part actually was

Not the AI components. The glue.

Getting audio bytes into a temp file and cleaning it up correctly. Stripping markdown fences from LLM output that appeared even when explicitly told not to. Managing Streamlit's rerun loop without creating infinite confirmation dialogs. Parsing JSON from LLM responses that occasionally had explanatory text before the opening brace.
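Two of those problems, stray markdown fences and prose before the opening brace, can be handled by one small parser. This is a sketch of one possible approach rather than the code in the repository; the function names are hypothetical.

```python
import json

FENCE = "`" * 3  # built this way so the example itself contains no literal fence


def strip_code_fences(text: str) -> str:
    """Drop a leading fence line (with optional language tag) and a trailing one."""
    lines = text.strip().splitlines()
    if lines and lines[0].startswith(FENCE):
        lines = lines[1:]
    if lines and lines[-1].strip() == FENCE:
        lines = lines[:-1]
    return "\n".join(lines).strip()


def extract_json_object(text: str) -> dict:
    """Parse the first JSON object in text, ignoring any prose around it."""
    cleaned = strip_code_fences(text)
    decoder = json.JSONDecoder()
    start = cleaned.find("{")
    while start != -1:
        try:
            # raw_decode parses one value starting at `start` and ignores the rest.
            obj, _end = decoder.raw_decode(cleaned, start)
            return obj
        except json.JSONDecodeError:
            start = cleaned.find("{", start + 1)
    raise ValueError("no JSON object found in LLM response")
```

Raising ValueError when nothing parses gives the caller a single exception type to catch before falling back to the rule-based classifier.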

Every one of these problems cost more debugging time than implementing the actual AI features. This is the part that tutorials skip: the connective tissue between components is where production systems succeed or fail. Getting that tissue right — with proper fallbacks, error messages that actually tell you what went wrong, and state management that survives edge cases — is the real engineering work.
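As one concrete piece of that connective tissue, here is a sketch of the temp-file handling. It assumes the recorder hands over raw bytes and that the transcriber wants a file path; with_temp_audio is a hypothetical helper name.

```python
import os
import tempfile
from typing import Callable


def with_temp_audio(
    audio_bytes: bytes,
    transcribe: Callable[[str], str],
    suffix: str = ".wav",
) -> str:
    """Write bytes to a temp file, run transcribe(path), and always clean up."""
    # delete=False so the file can be reopened by path (needed on Windows).
    tmp = tempfile.NamedTemporaryFile(suffix=suffix, delete=False)
    try:
        tmp.write(audio_bytes)
        tmp.close()  # flush and release the handle before another reader opens it
        return transcribe(tmp.name)
    finally:
        tmp.close()  # harmless if already closed
        try:
            os.unlink(tmp.name)
        except FileNotFoundError:
            pass
```

The finally block is the part that is easy to forget: without it, every failed transcription leaves an audio file behind in the temp directory.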

Running it yourself

git clone https://github.com/YOUR_USERNAME/voice-agent
cd voice-agent
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env

# Pull a model (pick one)
ollama pull llama3

# (Optional) get free Mem0 key at app.mem0.ai
# (Optional) get free Groq key at console.groq.com

streamlit run app.py

The agent works without Mem0 or Groq keys — they unlock the memory layer and faster STT respectively, but the core pipeline runs fully locally with just Ollama and faster-whisper.

Final thought

The thing I learned building this isn't about any specific model or library. It's about what "local AI" actually means in 2025.

Running intelligence locally means accepting tradeoffs: slower inference, smaller models, more engineering work per feature. But it also means no API costs, no data leaving your machine, and no dependency on someone else's uptime. For a voice agent that can write files and code to your system, those tradeoffs point clearly toward local-first.

The tools have reached the point where "local-first AI agent" is a weekend project, not a research paper. faster-whisper, Ollama, and a bit of careful prompt engineering are all it takes. Add Mem0 for memory, and you have something that genuinely improves the more you use it.

That's a fundamentally different category of software from anything we had three years ago.

Built as part of the Mem0 AI/ML Generative AI Developer Intern assignment.
Full code: GitHub | Demo: YouTube
