From audio input to file creation — a complete walkthrough of the architecture, models, and hard lessons learned
When I set out to build a voice-controlled AI agent, I thought the hard part would be the AI. It wasn't. The hard part was getting every layer of the pipeline — audio, transcription, intent classification, tool execution, and UI — to talk to each other reliably. This article walks through exactly how I did it, what broke, and what I'd do differently.
## What the Agent Does
The agent accepts voice input (either a recorded audio file or live microphone), transcribes it to text, figures out what the user wants, and executes the right action on the local machine. The supported actions are:
- Create a file — creates a new file in a sandboxed output folder
- Write code — generates code with an LLM and saves it to a file
- Summarize text — produces a bullet-point summary of provided content
- General chat — conversational responses to anything else
- Compound commands — multiple of the above in a single utterance
The entire pipeline is displayed in a clean Streamlit UI showing each step: transcription → intent → action → result.
## Architecture Overview
The system is split into four clean modules:
```
Audio Input
     │
     ▼
STT Module (utils/stt.py)
     │  Groq Whisper large-v3
     ▼
Intent Module (utils/intent.py)
     │  Groq LLaMA 3.3 70B → structured JSON
     ▼
Executor Module (utils/executor.py)
     │  file ops / code gen / summarization / chat
     ▼
Streamlit UI (app.py)
```
Each module is independently swappable. The STT module, for example, supports local HuggingFace Whisper, Groq's hosted Whisper, or OpenAI's Whisper — selected automatically based on what's available.
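The backend selection can be sketched as a small factory function. The exact logic in the repo may differ; this is an illustrative version that prefers an explicit `STT_BACKEND` override, then falls back based on which API keys are present:

```python
import os

def pick_stt_backend() -> str:
    """Choose an STT backend: explicit env override first, then availability."""
    explicit = os.getenv("STT_BACKEND")
    if explicit:
        return explicit  # "local", "groq", or "openai"
    if os.getenv("GROQ_API_KEY"):
        return "groq"
    if os.getenv("OPENAI_API_KEY"):
        return "openai"
    return "local"  # no hosted API available: fall back to local HuggingFace Whisper
```

Because each backend exposes the same `transcribe(path) -> str` contract, the rest of the pipeline never needs to know which one was picked.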
## The Models I Chose
### Speech-to-Text: Groq Whisper large-v3
The assignment recommended running a HuggingFace Whisper model locally. I started there. The problem: running `openai/whisper-large-v3` locally requires a CUDA-capable GPU with ~6 GB VRAM. On a standard laptop, CPU inference takes 30–60 seconds per utterance — which completely kills the user experience for a demo.
Groq's hosted Whisper API solves this. It uses the same model weights but runs on Groq's custom LPU hardware, returning transcriptions in under 3 seconds. It has a generous free tier, requires no local GPU, and the API is drop-in compatible with the OpenAI Whisper format.
**Trade-off:** You need an internet connection and a Groq API key. For a local-first setup, the code supports falling back to local HuggingFace inference by setting `STT_BACKEND=local` in the environment config.
### Intent Classification & Generation: Groq LLaMA 3.3 70B
For the LLM, the assignment recommended Ollama for local inference. Again, I tried it. LLaMA 3 70B requires ~40 GB of disk space and significant RAM. On a development machine, this is impractical for a demo submission.
Groq's hosted LLaMA 3.3 70B solves the same problem — same model quality, ~200ms response time, free tier available.
The intent classifier uses a structured JSON prompt that forces the model to return exactly the schema I need:
```json
{
  "intents": ["write_code"],
  "filename": "retry.py",
  "language": "python",
  "summary_source": "none",
  "chat_reply": ""
}
```
Using "response_format": {"type": "json_object"} in the API call ensures the model never returns markdown fences or prose — just clean JSON every time.
## The Intent Classification Design
The most interesting design challenge was intent classification. A naive approach would be to check for keywords ("create", "write", "summarize"). This breaks immediately on real speech — people say "can you make me a Python script" not "write code".
Instead, I wrote a system prompt that gives the LLM the full schema, the allowed intent values, and explicit rules:
- If the user mentions multiple actions, return multiple intents (compound command support)
- Infer the programming language from context
- Suggest a meaningful filename based on what was requested
- If summarization is requested but no extra text is provided, summarize the transcription itself
This approach handles natural, conversational language robustly. "Hey, can you write me something that retries failed HTTP requests and save it as a Python file?" correctly returns `["write_code", "create_file"]` with `filename: "retry.py"` and `language: "python"`.
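The rules above translate into a schema-plus-rules system prompt. The actual prompt in the repo is longer; this is a condensed, illustrative version of the shape:

```python
# Illustrative system prompt: gives the model the schema, the allowed
# intent values, and the classification rules in one place.
INTENT_SYSTEM_PROMPT = """\
You are an intent classifier. Reply with JSON only, using this schema:
{"intents": [...], "filename": "...", "language": "...",
 "summary_source": "...", "chat_reply": "..."}

Allowed intents: create_file, write_code, summarize_text, general_chat.
Rules:
- If the user asks for several actions, include several intents.
- Infer the programming language from context.
- Suggest a meaningful filename for any file the user wants created.
- If summarization is requested with no extra text, summarize the transcription.
"""
```

Keeping the schema and the rules in one prompt makes the classifier's behaviour auditable: every rule the executor depends on is written down in exactly one place.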
## The Executor and Safety Constraints
The executor maps intents to actions. The critical safety constraint: all file I/O is restricted to an output/ folder. This is enforced by:
- Resolving all file paths relative to the `output/` directory
- Sanitizing filenames to strip path separators and traversal sequences (`../`)
- Never accepting absolute paths from the LLM output
```python
import re
from pathlib import Path

def _sanitize(filename: str) -> str:
    safe = re.sub(r"[^\w.\-]", "_", filename)  # replace unsafe characters
    return Path(safe).name  # strips any directory component
```
This means even if the LLM hallucinated a filename like `../../system32/important.dll`, the path separators would be flattened and the executor would write to a harmless `output/.._.._system32_important.dll` instead of escaping the sandbox.
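Combining the sanitizer with the sandbox root looks roughly like this (the function name is mine, not from the repo; note that dots survive the character class, only separators are replaced):

```python
import re
from pathlib import Path

OUTPUT_DIR = Path("output")

def safe_output_path(filename: str) -> Path:
    """Map any LLM-suggested filename to a path inside output/."""
    safe = re.sub(r"[^\w.\-]", "_", filename)  # replace separators and other unsafe chars
    safe = Path(safe).name                     # drop any remaining directory component
    path = (OUTPUT_DIR / safe).resolve()
    # Belt and braces: refuse anything that somehow escaped the sandbox.
    assert path.is_relative_to(OUTPUT_DIR.resolve())
    return path
```

The final `is_relative_to` assertion is redundant with the sanitization, but cheap insurance against a future refactor weakening the regex.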
## The UI
I built the frontend in Streamlit. It displays the four pipeline stages in cards — transcription, detected intents (shown as colour-coded badges), action taken, and final output. A session history panel at the bottom shows all previous interactions in the current session.
Two bonus features worth highlighting:
**Human-in-the-Loop:** Before executing any file operation, the UI shows a confirmation prompt with the suggested filename. The user can review what the agent is about to do and cancel if needed. This is a small addition but makes the agent feel trustworthy rather than dangerous.
**Compound Commands:** If the LLM detects multiple intents in one utterance, the executor runs them sequentially and combines the results. "Write a retry function and summarize how it works" triggers both `write_code` and `summarize_text` in one go.
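Sequential execution of compound commands amounts to a dispatch table walked in order. The handlers below are stand-ins for the real executor functions:

```python
def run_intents(intents: list[str], payload: dict) -> list[str]:
    """Execute detected intents in order and collect their results (sketch)."""
    handlers = {
        "create_file": lambda p: f"created {p.get('filename', 'untitled.txt')}",
        "write_code": lambda p: f"generated code for {p.get('filename', '?')}",
        "summarize_text": lambda p: "summary: ...",
        "general_chat": lambda p: p.get("chat_reply", ""),
    }
    results = []
    for intent in intents:
        # Unknown intents degrade to chat rather than raising.
        handler = handlers.get(intent, handlers["general_chat"])
        results.append(handler(payload))
    return results
```

Running intents in the order the LLM returned them keeps dependent actions sensible: code is generated before the file that should contain it is written.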
## Challenges I Faced
### 1. Environment variables not loading
The biggest time sink was a subtle Python import-order bug. The `STT_BACKEND` and `LLM_BACKEND` variables were being read at module import time (top-level `os.getenv()` calls), before `load_dotenv()` had been called. The fix was moving all environment variable reads inside the functions that use them, so they execute after dotenv has loaded.
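The before-and-after is small but easy to get wrong. A minimal sketch of the fix (the commented-out line is the buggy pattern):

```python
import os

# Buggy: evaluated once at import time, freezing whatever was set
# before load_dotenv() ran, i.e. usually nothing.
# STT_BACKEND = os.getenv("STT_BACKEND", "groq")

def get_stt_backend() -> str:
    """Read the env var lazily, so it runs after load_dotenv() has populated os.environ."""
    return os.getenv("STT_BACKEND", "groq")
```

The same pattern applies to any module-level constant derived from the environment: make it a function (or a cached property) and the import order stops mattering.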
### 2. Model deprecation mid-build
Midway through development, Groq deprecated `llama3-70b-8192`. Any in-flight API calls started returning `model_decommissioned` errors. The fix was switching to `llama-3.3-70b-versatile`, which Groq now recommends as the replacement. Lesson: always read the deprecation docs before submitting.
### 3. Audio format handling
Streamlit's `st.audio_input` (microphone recording) and `st.file_uploader` return different data types. The microphone returns a `BytesIO` object; the uploader returns an `UploadedFile`. Both need to be written to a temporary file on disk before passing to the Whisper API, because Whisper expects a file path or binary file handle, not a stream.
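Normalising both types to a path on disk can be done with one small helper, since both objects expose `getvalue()`. This is a sketch with an invented name, not the repo's exact code:

```python
import tempfile
from io import BytesIO

def to_temp_wav(audio) -> str:
    """Write any file-like audio object to a temp .wav file and return its path."""
    # BytesIO and Streamlit's UploadedFile both support getvalue();
    # fall back to read() for plain file handles.
    data = audio.getvalue() if hasattr(audio, "getvalue") else audio.read()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(data)
        return tmp.name
```

With `delete=False` the file survives the `with` block so the Whisper client can open it; the caller is responsible for cleaning it up afterwards.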
### 4. LLM JSON reliability
Early versions of the intent prompt would occasionally return JSON wrapped in Markdown code fences, or include explanatory prose before the JSON. I fixed this by using Groq's `response_format: json_object` parameter and adding a regex strip as a fallback:

```python
cleaned = re.sub(r"```(?:json)?", "", raw).strip("` \n")
```
---
## What I'd Do Differently
- **Add streaming** — the LLM responses currently wait for the full completion. Streaming the output to the UI would make the agent feel much more responsive.
- **Persistent memory across sessions** — right now history resets when the app restarts. Adding a SQLite or JSON file backend would make the agent genuinely useful over time.
- **Voice output** — completing the loop with text-to-speech so the agent speaks its response back would make this feel like a true voice assistant.
- **Better error recovery** — if transcription produces garbled text, the current system passes it to the LLM which then classifies it as `general_chat`. A confidence score from Whisper could be used to prompt the user to re-record instead.
---
## Final Thoughts
The most valuable thing I learned building this wasn't about AI — it was about pipeline design. Every layer needs clean, well-defined inputs and outputs. When something breaks (and it will), you need to be able to isolate which layer is the problem in seconds, not hours.
The modularity of separating STT, intent, and execution into independent files paid off every single time something went wrong. I could test each component in isolation, swap backends without touching the UI, and add new intents without rewriting the executor.
The full source code is available on GitHub: **[link]**
Video demo: **[link]**
---
*Built with Streamlit, Groq Whisper, and LLaMA 3.3 70B.*