Introduction
Most voice interfaces rely on cloud APIs for every interaction, which raises privacy concerns and introduces latency. For a recent project, I set out to build a fully local voice-controlled AI agent.
The goal? Create a system that can accept audio input, accurately classify the user's intent, execute the appropriate local tools and functions (e.g. write code, create files, summarize, or just chat), display the result of the entire pipeline, and safely execute local system tasks—all without sending a single byte of data to the cloud.
The Architecture
Here’s the breakdown:
Frontend (Streamlit): Handles the UI, audio input and keeps track of the session.
Speech-to-Text: HuggingFace Whisper handles the transcription.
The Brain (brain.py): Acts as the router. It feeds the transcript to an LLM and forces it to pick an intent (e.g., create a file, summarize) instead of just giving a generic response.
Tool Executor (actions.py): Takes that intent and actually gets to work running the Python functions.
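To make the routing concrete, here is a minimal sketch of the transcript → intent → tool pipeline. The function names and the toy keyword-based classifier are placeholders I invented for illustration; in the real brain.py the classification step is an LLM call that returns a JSON intent.

```python
def classify_intent(transcript: str) -> dict:
    # Stand-in for the LLM call in brain.py: the real version prompts
    # the model to return {"intent": ..., "args": ...} as JSON.
    if "file" in transcript.lower():
        return {"intent": "create_file", "args": {"name": "notes.txt"}}
    return {"intent": "chat", "args": {"message": transcript}}

def execute(intent: dict) -> str:
    # Stand-in for the dispatch table in actions.py: map each intent
    # name to the Python function that actually does the work.
    handlers = {
        "create_file": lambda args: f"created {args['name']}",
        "chat": lambda args: f"chat: {args['message']}",
    }
    handler = handlers.get(intent["intent"], handlers["chat"])
    return handler(intent["args"])

print(execute(classify_intent("please make a file for me")))  # created notes.txt
```

The dispatch-table pattern keeps the "brain" and the "hands" decoupled: adding a new tool means adding one handler, not touching the router.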
The Challenges and Fixes
Getting the LLM to spit out pure JSON so Python can read it. I had to write a robust fallback mechanism in Python that aggressively strips out markdown formatting before doing anything, and defaults to a safe "chat" intent if the model hallucinates the formatting.
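A minimal sketch of that fallback parser (the function name is mine, not from the project): strip any markdown code fences the model wrapped around its JSON, try to parse, and fall back to "chat" on failure.

```python
import json
import re

def parse_intent(raw: str) -> dict:
    """Strip markdown fences the LLM may wrap around its JSON,
    then fall back to a safe 'chat' intent if parsing fails."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        data = json.loads(cleaned)
        if isinstance(data, dict) and "intent" in data:
            return data
    except json.JSONDecodeError:
        pass
    # Safe default: treat anything unparseable as plain conversation.
    return {"intent": "chat", "args": {"message": raw}}

print(parse_intent('```json\n{"intent": "create_file", "args": {}}\n```'))
print(parse_intent("Sure! Here you go:"))  # falls back to chat
```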
Giving an AI the ability to write files to your hard drive is inherently dangerous. If a user asked it to "overwrite my system config," it might try. I implemented a strict path-sanitization rule in the actions.py execution layer. All generated files use os.path.basename to strip out directory traversal attempts and are forced into a dedicated output/ sandbox directory.
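The sanitization rule described above can be sketched in a few lines (the helper name and sandbox path are illustrative; the key call, os.path.basename, is the one the post names):

```python
import os

SANDBOX = "output"

def safe_path(requested_name: str) -> str:
    """Confine all AI-generated files to the output/ sandbox."""
    # basename drops '../' traversal attempts and absolute-path prefixes.
    filename = os.path.basename(requested_name)
    if not filename or filename in (".", ".."):
        raise ValueError(f"invalid filename: {requested_name!r}")
    return os.path.join(SANDBOX, filename)

print(safe_path("../../etc/passwd"))  # output/passwd
print(safe_path("notes.txt"))         # output/notes.txt
```

Even if the model is tricked into requesting a path like ../../etc/passwd, the write lands harmlessly inside the sandbox.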
Streamlit reruns the entire script from top to bottom every time the user interacts with a widget. This meant my STT and LLM models were reloading on every click, taking 10 seconds per interaction. I used Streamlit's @st.cache_resource to load the Whisper model and initialize the classes only once. I also used st.session_state to keep a persistent memory log of the conversation.
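The caching idea is easy to demonstrate without Streamlit itself: @st.cache_resource behaves much like memoizing the loader, so repeated calls return the same already-loaded object instead of re-initializing it. A standalone sketch with functools.lru_cache (an analogy I'm drawing, not the app's actual code):

```python
import functools

@functools.lru_cache(maxsize=None)
def load_model(name: str):
    # The expensive load runs once per name; every later call
    # (e.g. every Streamlit rerun) gets the cached instance back.
    print(f"loading {name}...")  # in the real app: Whisper / LLM init
    return object()

a = load_model("whisper-base")
b = load_model("whisper-base")
print(a is b)  # True: same instance, no 10-second reload
```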
Whisper is incredibly sensitive. If you leave the microphone recording and just type on your keyboard, cough, or let a fan blow in the background, Whisper will sometimes hallucinate weird transcripts. (A famous quirk of Whisper is that it sometimes translates dead silence into "Thank you for watching!" because it was trained on YouTube videos.) In a production app, you'd need to add a Voice Activity Detection (VAD) library or an energy threshold so the mic only records when someone is actively speaking.
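The energy-threshold variant is simple to sketch: compute the RMS energy of each audio frame and only pass frames above a threshold to Whisper. The function names and the threshold value here are illustrative assumptions, not tuned values:

```python
import math

def rms(frame):
    """Root-mean-square energy of a frame of 16-bit PCM samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def is_speech(frame, threshold=500.0):
    # Gate: only frames above the energy threshold reach Whisper,
    # so keyboard clicks and fan hum never get transcribed.
    return rms(frame) >= threshold

silence = [3, -2, 4, -1] * 100
speech = [4000, -3500, 3800, -4200] * 100
print(is_speech(silence), is_speech(speech))  # False True
```

A real deployment would more likely use a trained VAD (e.g. Silero VAD or WebRTC VAD), since a fixed energy threshold fails in noisy rooms.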
When you use ChatGPT or Alexa, the response is almost instant because it runs on massive server farms. Running Whisper plus Llama 3 on a standard laptop CPU means you speak, and then you sit there waiting 5 to 15 seconds for the system to transcribe, think, and execute. It breaks the illusion of a fluid conversation. To make it truly snappy, you'd need a dedicated GPU (like an NVIDIA RTX card), or you'd have to stream the LLM's output token-by-token to the UI so the user at least sees it "typing" in real-time.
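The streaming fix maps naturally onto Python generators: the LLM client yields tokens as they are generated, and the UI consumes them as they arrive (Streamlit's st.write_stream accepts exactly this kind of generator). A minimal stand-in with hard-coded tokens, since the real stream comes from the model:

```python
import time

def stream_tokens(prompt: str):
    """Stand-in for a token-by-token LLM stream (e.g. stream=True
    in the local LLM client). Yields tokens as they are 'generated'."""
    for token in ["Local ", "models ", "feel ", "faster ", "when ", "streamed."]:
        time.sleep(0.01)  # simulate per-token generation latency
        yield token

# The UI renders each token on arrival instead of blocking for 5-15s
# on the full response; joining the stream recovers the complete text.
print("".join(stream_tokens("hello")))
```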
Whisper relies on a system-level tool called ffmpeg to process audio files. You can't just pip install it easily; users have to install it on their actual operating system (via Homebrew on Mac, or by downloading the binaries and editing System Environment Variables on Windows). It is the number one reason beginners can't get audio projects to run on their first try. The fix: write extremely clear, OS-specific setup instructions in the README.md so whoever grades your assignment doesn't crash on step one.
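Beyond README instructions, the app itself can fail fast with an actionable message instead of letting Whisper die with a cryptic error. A small startup check (the function name and messages are mine):

```python
import shutil
import sys

def check_ffmpeg() -> bool:
    """Return True if ffmpeg is on PATH; otherwise print install hints."""
    if shutil.which("ffmpeg") is None:
        print(
            "ffmpeg not found. Install it first:\n"
            "  macOS:   brew install ffmpeg\n"
            "  Windows: download the binaries and add bin/ to PATH\n"
            "  Linux:   sudo apt install ffmpeg",
            file=sys.stderr,
        )
        return False
    return True
```

Calling this before loading Whisper turns the "number one beginner blocker" into a one-line, self-explanatory failure.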