Athila Ashraf
VOXEN — Voice-Controlled Local AI Agent

I Built a Voice Agent That Listens to You and Actually Does Things — Here's How It Went
So I had this assignment. Build a voice-controlled AI agent that takes audio input, figures out what the user wants, and executes it — create files, write code, summarize text, whatever. Sounds straightforward on paper. It wasn't.
This is me walking through what I built, why I made certain decisions, and the parts that made me want to close my laptop and never open it again.

What the thing actually does
The project is called VOXEN. You give it an audio file (or record directly from your mic), it transcribes what you said, figures out your intent, and then does the action — writes code to a file, creates a folder, summarizes text, or just chats back. Everything shows up in a Streamlit UI.
The pipeline is pretty simple when you draw it out:
Audio → Transcription → Intent Detection → Tool Execution → UI Output
That's it. Four stages. But each one had its own little surprises.
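Drawn as code, the four stages reduce to a thin orchestrator. This is an illustrative sketch, not VOXEN's actual API — the function names and the stand-in implementations below are mine, invented to show the shape of the pipeline:

```python
def run_pipeline(audio_path: str) -> str:
    """Run the four stages in order: audio -> text -> intent -> action."""
    text = transcribe(audio_path)    # stt.py: audio file -> transcript
    intent = detect_intent(text)     # intent.py: transcript -> intent dict
    result = execute_tool(intent)    # tools.py: dispatch on the intent
    return result                    # app.py renders this in the UI

# Minimal stand-ins so the sketch runs end to end (hypothetical, not VOXEN's code):
def transcribe(path):
    return f"summarize {path}"

def detect_intent(text):
    return {"intent": "chat", "details": {"text": text}}

def execute_tool(intent):
    return f"handled {intent['intent']}"
```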

The architecture
I split the project into focused modules rather than dumping everything into one file.
stt.py handles transcription. intent.py handles the classification. tools.py has all the action handlers. app.py is the Streamlit frontend that ties everything together. And there's an output/ folder where all generated files go — hardcoded that restriction in the code so nothing accidentally writes to system paths.
Clean separation made debugging way less painful. When something broke, I knew exactly which file to look at.
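The output/ restriction mentioned above can be enforced with a small path check. A hedged sketch, assuming a pathlib-based approach — `safe_output_path` is a name I'm inventing for illustration, not the project's actual helper:

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside output/, rejecting
    path-traversal escapes like '../../etc/passwd'."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    target = (OUTPUT_DIR / filename).resolve()
    # After resolution, the target must still sit under OUTPUT_DIR.
    if OUTPUT_DIR not in target.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}: {filename}")
    return target
```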

Why I used Groq instead of a local model
Okay, honest answer — my laptop cannot handle running Whisper or a full LLM locally without turning into a space heater. I tried. It was slow enough to be genuinely unusable.
The assignment actually anticipated this. It said if your local machine can't handle it, use an API-based service and document why. So that's what I did — Groq. Their API is fast, the free tier is generous enough for a project like this, and the Whisper endpoint is solid for transcription accuracy.
For the LLM side, I'm using llama3-8b-8192 through Groq. Intent classification and code generation both go through this same model with different system prompts.
The tradeoff is obvious: this isn't "fully local" in the strict sense. But for the scope of this project it works cleanly, and honestly, running even a 7B model locally on a mid-range laptop for a demo is more pain than it's worth.
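The intent-classification call looks roughly like this. It assumes the Groq Python SDK (`pip install groq`), which mirrors the OpenAI client shape; the system prompt text and the function name are mine, written for illustration:

```python
INTENT_SYSTEM_PROMPT = (
    "You are an intent classifier. Reply with ONLY a JSON object of the form "
    '{"intent": "...", "details": {...}}. No prose, no markdown fences.'
)

def classify_intent(client, text: str) -> str:
    """Ask the LLM for an intent JSON string. `client` is a groq.Groq()
    instance (or anything exposing the same .chat.completions.create shape)."""
    resp = client.chat.completions.create(
        model="llama3-8b-8192",
        messages=[
            {"role": "system", "content": INTENT_SYSTEM_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0.1,  # near-deterministic: classification, not creativity
    )
    return resp.choices[0].message.content
```

In real use you'd construct the client once with `client = Groq(api_key=os.environ["GROQ_API_KEY"])` and reuse it for intent detection, code generation, and chat, swapping only the system prompt and temperature.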

The intent classification part
This is the part I spent the most time thinking about. The LLM needs to read the transcribed text and return structured JSON telling me what the user wants. Something like:
```json
{"intent": "write_code", "details": {"filename": "add.py", "language": "python", "content": "add two numbers"}}
```
Simple enough. Except LLMs don't always return clean JSON. Sometimes they wrap it in markdown code blocks. Sometimes they add a little explanation before the JSON. Sometimes they just... don't give valid JSON at all.
My fix was layered — strip markdown fences first, then find the JSON object by looking for the first { and last }, then parse. And if that still fails, fall back to treating the whole thing as a general chat. Not elegant, but it holds up.
The system prompt has to be very specific. I kept the temperature at 0.1 for intent detection — you want deterministic behavior here, not creativity. For code generation, I bumped it to 0.3 because you want some variation but not chaos.
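The layered recovery described above can be sketched in a dozen lines. The function name is illustrative; the three steps — strip fences, slice from the first `{` to the last `}`, fall back to chat — are the ones from the post:

```python
import json
import re

def extract_intent(raw: str) -> dict:
    """Layered JSON recovery for messy LLM output."""
    # 1. Strip ```json ... ``` style markdown fences if the model added them.
    cleaned = re.sub(r"```(?:json)?", "", raw).strip()
    # 2. Slice out the outermost JSON object (ignores any prose around it).
    start, end = cleaned.find("{"), cleaned.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(cleaned[start : end + 1])
        except json.JSONDecodeError:
            pass
    # 3. Last resort: treat the whole reply as a general chat message.
    return {"intent": "chat", "details": {"text": raw}}
```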

Building the UI
I wanted the UI to feel like a proper product, not a student project. So I spent time on the design side — custom CSS inside Streamlit, a gradient hero section, glassmorphism-style result cards, animated SVG logo.
The trickiest part was the input mode toggle. I wanted two big clickable buttons — one for upload, one for mic — that actually look like buttons and not the default Streamlit widgets. Streamlit doesn't really let you do custom HTML buttons that trigger Python logic directly, so I did a workaround: the HTML buttons trigger hidden st.button elements via JavaScript onclick, which then update session_state and rerun. Those actual Streamlit buttons are pushed off-screen with CSS so the user never sees them.
It's a bit hacky. But it works perfectly and the UX is exactly what I wanted.

Challenges, honestly
The JSON parsing thing — already covered above, but it deserves emphasis. The LLM was my biggest source of bugs. Getting it to consistently return clean, parseable JSON took more prompt engineering than I expected.
Microphone behavior across environments — the audiorecorder library works fine locally but has environment-specific issues. On some setups it just doesn't initialize. I wrapped the whole mic section in a try-except that falls back to showing a warning and pointing the user to file upload. Not ideal, but better than a crash.
State management in Streamlit — Streamlit reruns the whole script on every interaction, which means you have to be careful about what you store in session_state. I had a bug early on where uploading a new file wasn't being detected because I wasn't comparing the filename — it would just reuse the cached path. Fixed it by tracking uploaded_file_name in session state and comparing on every upload.
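The fix boils down to a filename comparison on every rerun. A sketch with a plain dict standing in for `st.session_state` (which is dict-like) so the logic is visible without Streamlit; `handle_upload` and `save_fn` are hypothetical names:

```python
def handle_upload(session_state: dict, uploaded_name: str, save_fn) -> str:
    """Re-save the uploaded file only when its name changed since the
    last rerun; otherwise reuse the cached path."""
    if session_state.get("uploaded_file_name") != uploaded_name:
        session_state["uploaded_file_name"] = uploaded_name
        session_state["audio_path"] = save_fn(uploaded_name)
    return session_state["audio_path"]
```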
Temp file cleanup — every audio file gets saved to a temp path for processing. I had to make sure those get deleted after processing. Tiny thing, but if you forget it, you're just accumulating audio files silently.
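A try/finally around the processing step is the simplest way to guarantee that cleanup. A sketch using the standard library's tempfile module; `process_audio` is an illustrative name, not VOXEN's actual function:

```python
import os
import tempfile

def process_audio(audio_bytes: bytes, process) -> str:
    """Write audio to a temp file, run `process(path)` on it, and always
    delete the temp file afterwards, even if processing raises."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        return process(path)
    finally:
        os.remove(path)  # runs on success and on exceptions alike
```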

| Task | Model | Why |
| --- | --- | --- |
| Speech-to-Text | whisper-large-v3 via Groq | Fast, accurate, no local GPU needed |
| Intent Classification | llama3-8b-8192 via Groq | Good instruction following, consistent JSON |
| Code Generation | llama3-8b-8192 via Groq | Same model, different prompt, works well |
| Chat | llama3-8b-8192 via Groq | Conversational, no issues |

What I'd do differently
If I were building v2, I'd add compound command support — something like "write a retry function and save it to utils.py" where the system detects multiple intents and chains the tools. Right now it handles one intent per input. The architecture supports it; the intent JSON just needs to return an array instead of a single object.
I'd also add a confirmation step before file writes. Right now it just executes. Adding a "here's what I'm about to do — confirm?" step would make it feel safer and more intentional.

Final thoughts
This was genuinely fun to build. Voice interfaces have this quality where, when they work, they feel like magic — you say something and something happens. Getting that pipeline to run end to end, watching it transcribe correctly and then actually write valid Python to a file based on what I said — that moment was satisfying in a way that regular CRUD apps just aren't.
The code is on GitHub if you want to look at it. Every module is under 200 lines so it's easy to follow.
If you're building something similar and hit the same JSON parsing nightmare I did — keep the temperature low, be extremely explicit in your system prompt, and always have a fallback. That alone will save you a few hours.

Built for the Mem0 AI/ML & Generative AI Developer Intern Assignment.
