Published by Mansi | April 2026
GitHub: https://github.com/Mansi-2703/VoiceBot
Introduction
Most AI assistants today ship your voice to a remote server, process it in the cloud, and return a response. That works — until you care about privacy, latency, or simply running offline. This project takes a different approach: every component, from speech recognition to code generation, runs entirely on local hardware.
VoiceBot is a voice-controlled AI agent built as part of the Mem0 AI/ML and Generative AI Developer Intern assignment. The goal was to build a system that accepts audio input, classifies the user's intent, executes the appropriate local tool, and presents the full pipeline result in a clean UI. This article documents the architecture decisions, model choices, implementation challenges, and benchmarked performance of the final system.
What the System Does
A user speaks or uploads an audio file. The system transcribes the speech, classifies the intent into one of four categories (create a file, write code, summarize text, or chat), executes the corresponding tool, and displays the result in a Streamlit dashboard.
A single utterance like "Write a Python function to reverse a string and save it to reverse.py" triggers the full pipeline: transcription, compound intent detection, code generation, and file creation — all without a single outbound API call.
Architecture Overview
The system is composed of four sequential modules, each responsible for a single concern.
```plaintext
Audio Input (mic or file)
        |
        v
[ STT Module ]           src/stt.py
  OpenAI Whisper (local)
  Returns: transcript string
        |
        v
[ Intent Classifier ]    src/intent.py
  Mistral 7B via Ollama
  Returns: intent(s) + extracted params + confidence
        |
        v
[ Tool Executor ]        src/tools.py
  create_file / write_code / summarize_text / general_chat
  Returns: structured result dict
        |
        v
[ Streamlit UI ]         app.py
  Renders: transcript, intent badge, action, result
```
The pipeline orchestrator in src/pipeline.py wires these together, manages session history, and handles compound commands — multiple intents detected in a single utterance.
Module 1: Speech-to-Text with OpenAI Whisper
Model choice
The STT layer uses OpenAI's Whisper base model, running entirely locally via the openai-whisper Python package. Whisper is a transformer-based encoder-decoder model trained on 680,000 hours of multilingual audio. The base model is approximately 140 MB and supports 99 languages with automatic language detection.
The assignment permitted an API-based fallback (Groq or OpenAI) for machines that cannot run Whisper efficiently. The final implementation uses local Whisper because the base model runs acceptably on CPU for the audio lengths typical in voice commands, and keeping the STT local is essential to the privacy-first design goal.
Implementation
The transcribe() function in src/stt.py accepts either raw bytes (from a microphone recording) or a file path string (from an uploaded file). It writes bytes to a temporary file when needed, calls whisper.load_model("base"), runs model.transcribe(), and returns a standardised dict:
```plaintext
{
  "transcript": str,
  "source": "mic" | "file",
  "error": str | None
}
```
Error handling catches two failure modes: unintelligible audio (empty or very short transcript) and model loading failures. Both return error in the dict rather than raising exceptions, so the pipeline can surface a user-friendly warning instead of crashing.
Performance
| Hardware | Latency per minute of audio | Notes |
|---|---|---|
| CPU (quad-core) | 15–20 seconds | Acceptable for short commands |
| GPU (RTX 3060) | 2–3 seconds | Near real-time |
For voice commands under 10 seconds, CPU latency is 2–4 seconds, which is acceptable for this use case.
Module 2: Intent Classification with Mistral 7B
Model choice
Intent classification uses Mistral 7B, served locally via Ollama. Mistral 7B is a 7-billion-parameter transformer that outperforms larger models on several reasoning benchmarks while fitting in approximately 4.4 GB of memory in 4-bit quantised form. Ollama provides a local HTTP server that exposes a simple REST API, making it straightforward to call from Python without managing model loading directly.
The alternative considered was a smaller dedicated classifier (e.g., a fine-tuned BERT or DistilBERT). Mistral 7B was chosen instead because it handles the compound command case well — it can decompose "summarize this and save it to summary.txt" into two ordered intents in a single forward pass, without any multi-step prompting chain.
Prompt design
The system prompt instructs the model to return only valid JSON. This is a known reliability pattern for structured outputs from instruction-tuned models: constraining the output format in the system turn dramatically reduces parse failures compared to asking for JSON in the user turn alone.
The output schema is:
```json
{
  "intents": ["write_code", "create_file"],
  "confidence": 0.96,
  "extracted_params": {
    "write_code": {
      "filename": "reverse.py",
      "language": "python",
      "description": "function to reverse a string"
    },
    "create_file": {
      "filename": "reverse.py"
    }
  }
}
```
For single-intent utterances, the `intents` array contains a single element. The pipeline loops over the array and dispatches each intent in order; that loop is the whole of the compound-command implementation.
Fallback behaviour
If JSON parsing fails (rare but possible with long, ambiguous utterances), the classifier defaults to general_chat with the raw transcript passed as the message. This ensures the pipeline always produces a response rather than a silent failure.
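The parse-with-fallback step can be sketched as follows. The function name `parse_intent_response` and the exact fallback shape are assumptions for illustration; the real parser lives in `src/intent.py`.

```python
import json
import re


def parse_intent_response(raw: str, transcript: str) -> dict:
    """Parse the model's JSON reply, falling back to general_chat.

    Sketch only: strips any markdown code fences the model wraps
    around its output, then parses; on any failure, degrades to a
    chat intent so the pipeline always produces a response.
    """
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        parsed = json.loads(cleaned)
        if "intents" in parsed:
            return parsed
    except json.JSONDecodeError:
        pass
    # Fallback: treat the whole utterance as a chat message.
    return {
        "intents": ["general_chat"],
        "confidence": 0.0,
        "extracted_params": {"general_chat": {"message": transcript}},
    }
```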
Performance
| Hardware | Latency per query | Notes |
|---|---|---|
| CPU (quad-core) | 8–12 seconds | Bottleneck on CPU builds |
| GPU (RTX 3060) | 1–2 seconds | Comfortable for interactive use |
Module 3: Tool Execution
The src/tools.py module implements four tools, each returning a standardised result dict with a success boolean and a message string for error surfacing.
create_file
Creates a plain text file inside the output/ directory. Path traversal is prevented with an os.path.abspath check: if the resolved path does not start with the output/ directory's absolute path, the operation is refused and the error is returned to the pipeline. This is a critical security constraint because the intent classifier extracts filenames from raw user speech — an adversarial or malformed input could otherwise write to arbitrary paths.
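The guard can be sketched like this. `safe_output_path` is a hypothetical name; in the project the check sits inside the `create_file` tool itself.

```python
import os
from typing import Optional

OUTPUT_DIR = os.path.abspath("output")


def safe_output_path(filename: str) -> Optional[str]:
    """Resolve a user-supplied filename inside output/, or refuse it.

    Returns the absolute path if it stays inside the sandbox,
    or None when the resolved path escapes it.
    """
    candidate = os.path.abspath(os.path.join(OUTPUT_DIR, filename))
    # Appending os.sep also blocks sibling-prefix tricks such as
    # an "output_evil/" directory matching a bare "output" prefix.
    if not candidate.startswith(OUTPUT_DIR + os.sep):
        return None
    return candidate
```

Note that `os.path.join` discards the base directory when the second argument is an absolute path, which is exactly why the post-join `abspath` comparison is needed rather than trusting the join alone.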
write_code
Sends a code generation prompt to Mistral 7B via Ollama, asking for clean code matching the user's description in the specified language. The generated code is saved under the extracted filename inside output/. The result dict includes a code_preview field (first 200 characters) for display in the UI, while the full code is written to the file.
summarize_text
Passes the extracted text to Mistral 7B with a concise summarisation prompt. The result dict contains the summary string. This tool is typically the first step in a compound command like "summarize this and save it to notes.txt", where the pipeline chains summarize_text output into the content parameter of create_file.
general_chat
Sends the user's message to Mistral 7B with a conversational system prompt. The last three session history entries are included as context, giving the agent memory across exchanges within a session. The result dict contains the response string.
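The context-window construction can be sketched as below. `build_chat_prompt` and the exact prompt layout are assumptions; the real formatting in `src/tools.py` may differ.

```python
def build_chat_prompt(message: str, history: list) -> str:
    """Prepend the last three exchanges as conversational context.

    Sketch: each history entry is assumed to carry a 'transcript'
    and a 'response' key, matching the session history records.
    """
    context_lines = []
    for entry in history[-3:]:  # only the three most recent exchanges
        context_lines.append(f"User: {entry['transcript']}")
        context_lines.append(f"Assistant: {entry['response']}")
    context_lines.append(f"User: {message}")
    return "\n".join(context_lines)
```

Keeping the window at three exchanges bounds the prompt length, which matters for latency on a local 7B model.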
Module 4: Pipeline Orchestration
src/pipeline.py contains the run_pipeline() function, which is the single entry point called by the UI.
```python
def run_pipeline(audio_input: bytes | str) -> dict:
    # 1. Transcribe
    stt_result = transcribe(audio_input)
    if stt_result["error"]:
        return early_error_response(stt_result["error"])

    # 2. Classify
    intent_result = classify_intent(stt_result["transcript"])

    # 3. Execute each intent in order
    results = []
    for intent in intent_result["intents"]:
        params = intent_result["extracted_params"].get(intent, {})
        tool_result = dispatch_tool(intent, params)
        results.append(tool_result)

    # 4. Append to session history
    session_history.append({...})

    return build_response(stt_result, intent_result, results)
```
The session history list holds the last 20 exchanges and is passed to the general_chat tool on each call, enabling conversational continuity.
The return dict contains all keys the UI needs: transcript, intent, intents, confidence, action_taken, actions_taken, result, and error. This flat contract between pipeline and UI avoids the UI needing to know about the internal structure of any individual module.
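The dispatch step can be sketched as a simple registry lookup. The lambdas below are stubs standing in for the real tool functions in `src/tools.py`; only the routing pattern is the point.

```python
def dispatch_tool(intent: str, params: dict) -> dict:
    """Route an intent name to its tool function (illustrative registry).

    Stub tools stand in for create_file / write_code / summarize_text /
    general_chat; each returns the standardised result-dict shape.
    """
    registry = {
        "create_file": lambda p: {"success": True,
                                  "message": f"created {p.get('filename')}"},
        "general_chat": lambda p: {"success": True,
                                   "message": "(chat reply)"},
    }
    tool = registry.get(intent)
    if tool is None:
        # Unknown intents surface as a failed result, never an exception,
        # so the UI can render an error card instead of crashing.
        return {"success": False, "message": f"unknown intent: {intent}"}
    return tool(params)
```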
The Streamlit UI
The UI (app.py) is a two-column Streamlit layout with no third-party UI libraries. The left column contains the audio input widgets and the run button. The right column contains four output cards rendered as raw HTML via st.markdown(..., unsafe_allow_html=True).
Key implementation decisions:
- The output cards bypass Streamlit's default container and expander widgets entirely. Those widgets render a visible header bar that conflicts with the custom card design. Raw HTML divs with inline styles give full control over the visual output.
- The run button uses `type="primary"` with a CSS override via `button[kind="primary"]` selectors to enforce the dark background. Streamlit's theming system does not expose button background color as a first-class config value, so the override must be injected in the global stylesheet block.
- Session state holds all pipeline results across reruns. Streamlit's execution model reruns the entire script on every interaction, so all mutable state must live in `st.session_state`. The pipeline is only called when the run button is clicked, guarded by the `if run_button:` block.
- The sidebar displays the last five session history entries in `[intent] — transcript preview...` format, giving the user a quick audit trail of their session.
Compound Commands
This was the most technically interesting feature to implement. Supporting compound commands required changes at two levels.
At the intent classifier level, the system prompt was updated to return an array of intents rather than a single string, and to extract parameters for each intent independently. The key prompt instruction is: "If the user's command contains multiple distinct actions, return all of them in the intents array in the order they should be executed."
At the pipeline level, the sequential loop over intent_result["intents"] handles execution. The output of one tool can be wired to the input of the next: when summarize_text precedes create_file, the pipeline passes the summary string as the content parameter for file creation, rather than requiring the user to repeat the content.
Example compound command flow:
User: "Summarize this article and save it to summary.txt"
Intent classifier returns:
intents: ["summarize_text", "create_file"]
params:
summarize_text: { text: "..." }
create_file: { filename: "summary.txt" }
Pipeline executes:
Step 1: summarize_text → summary string
Step 2: create_file with content = summary string
Output: output/summary.txt created
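The chaining rule can be sketched as follows. `run_intents` is a hypothetical name and the stub tools are placeholders; the sketch only shows the wiring of one tool's output into the next tool's parameters.

```python
def run_intents(intents, params_by_intent, tools):
    """Execute intents in order, piping a produced summary into create_file.

    Sketch: 'carry' holds the output of a producing tool (here, the
    summary string) and is injected as create_file's content when the
    user did not dictate explicit content.
    """
    results = []
    carry = None
    for intent in intents:
        params = dict(params_by_intent.get(intent, {}))
        if intent == "create_file" and carry is not None and "content" not in params:
            params["content"] = carry  # chain the previous tool's output
        result = tools[intent](params)
        if intent == "summarize_text":
            carry = result.get("summary")
        results.append(result)
    return results
```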
Testing
The test suite contains over 70 test cases across four test files, organised under tests/.
| Module | Test count | Coverage |
|---|---|---|
| test_stt.py | 12 | 85%+ |
| test_intent.py | 20+ | 90%+ |
| test_tools.py | 20+ | 92%+ |
| test_pipeline.py | 15+ | 88%+ |
| Total | 70+ | ~90% |
Unit tests mock external dependencies (Whisper model loading, Ollama HTTP calls) so they run without GPU or Ollama installed. Integration tests run against live services and are marked with @pytest.mark.slow so they can be excluded from fast CI runs via pytest tests/ -m "not slow".
Security tests verify path traversal prevention: attempts to write to ../etc/passwd or absolute paths outside output/ are confirmed to return success: false without writing any file.
The project includes pytest.ini for configuration and requirements-dev.txt for test-only dependencies (pytest, pytest-cov, pytest-mock), keeping test infrastructure separate from the production dependency graph.
Model Benchmarking
The benchmark.py script measures latency and throughput for each model independently and for the full pipeline. Sample results on an RTX 3060:
Speech-to-Text (Whisper base)
Average latency: 3,200 ms
Min / Max: 2,800 ms / 3,600 ms
Samples: 10
Intent Classification (Mistral 7B)
Average latency: 1,500 ms
Min / Max: 1,100 ms / 2,100 ms
Samples: 5
Tool Execution
create_file: 8 ms
write_code: 5,320 ms (includes Mistral generation)
summarize_text: 7,150 ms (includes Mistral generation)
Full Pipeline (GPU, warm models)
Average: 4–8 seconds end-to-end
On CPU-only hardware, full pipeline latency is 25–40 seconds. For a local agent where privacy is the primary constraint, this is acceptable. For production use, GPU acceleration is strongly recommended.
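The measurement loop in a script like benchmark.py can be sketched with the standard library alone. The function name and returned keys are assumptions mirroring the avg/min/max figures reported above.

```python
import statistics
import time


def benchmark(fn, *args, samples: int = 5) -> dict:
    """Time fn over several runs; report avg/min/max in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        fn(*args)
        timings.append((time.perf_counter() - start) * 1000)
    return {
        "avg_ms": statistics.mean(timings),
        "min_ms": min(timings),
        "max_ms": max(timings),
        "samples": samples,
    }
```

In practice the first sample should be discarded or run separately, since model warm-up inflates it relative to the steady state.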
Challenges Faced
Structured JSON output from Mistral. Getting a local model to reliably return only JSON without preamble or explanation required careful system prompt engineering. The final approach uses a strict system prompt, a `format: json` parameter in the Ollama API call where supported, and a try/except parser that strips markdown code fences (the ```json wrapper) before parsing. The fallback to `general_chat` on parse failure means the user always gets a response.
Streamlit's execution model. Streamlit reruns the entire script on every widget interaction, including audio recording. This caused the pipeline to re-execute unexpectedly on certain state changes. The fix was to gate all pipeline logic strictly inside if run_button: and rely on st.session_state for all persistent state, never on module-level variables.
Whisper model warm-up. The first transcription in a session takes 3–5 seconds longer than subsequent ones because the model must be loaded into memory. For a better user experience, the model could be loaded at startup rather than on the first call. This is a known optimisation left for a future iteration.
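The proposed optimisation is a load-once pattern, sketched below. The stub loader stands in for `whisper.load_model("base")` so the caching behaviour is shown without requiring the whisper package; `get_model` and `warm_up` are hypothetical names.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_model(name: str = "base"):
    """Load the model once and reuse it on every subsequent call.

    In src/stt.py this would call whisper.load_model(name); a stub
    dict stands in here so the pattern is testable standalone.
    """
    return {"name": name, "loaded": True}


def warm_up():
    """Call at application startup to absorb the cold-start cost."""
    get_model()
```

Calling `warm_up()` at the top of app.py would move the 3–5 second load into startup, before the first user request arrives.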
Path safety for file operations. The intent classifier extracts filenames from raw natural language, which means user input directly influences file paths. The os.path.abspath comparison guard was added after identifying this risk during security test writing — not before. Writing security tests first would have caught this earlier.
Why Local Models Over Cloud APIs
The cost and privacy advantages are concrete. Running Whisper and Mistral locally has zero ongoing cost after the initial model download. An equivalent cloud setup using OpenAI Whisper API and GPT-4o would cost approximately $0.006 per minute of audio plus $5–15 per million tokens for intent classification and tool calls. For a developer running hundreds of test iterations, local models eliminate a meaningful expense and remove the risk of sending sensitive content to third-party servers.
The tradeoff is hardware requirements. The system needs approximately 8 GB of RAM and 5 GB of disk space, plus a GPU for comfortable interactive latency. On CPU-only machines, the pipeline is functional but slow.
Project Structure
```plaintext
voice-agent/
├── app.py                  Streamlit UI (437 lines)
├── benchmark.py            Performance measurement script
├── requirements.txt        Production dependencies
├── requirements-dev.txt    Test dependencies
├── pytest.ini              Test configuration
├── src/
│   ├── stt.py              Whisper transcription
│   ├── intent.py           Mistral intent classification
│   ├── tools.py            Four tool implementations
│   └── pipeline.py         Orchestration + session memory
├── tests/
│   ├── conftest.py         Pytest fixtures
│   ├── test_stt.py
│   ├── test_intent.py
│   ├── test_tools.py
│   └── test_pipeline.py
└── output/                 All generated files written here
```
Setup in Five Steps
```bash
# 1. Install Ollama from https://ollama.ai, then pull the model
ollama pull mistral

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate

# 3. Install dependencies
pip install -r requirements.txt

# 4. Start the Ollama server (keep this terminal open)
ollama serve

# 5. Run the application
streamlit run app.py
```
What I Would Do Differently
If building this again, I would write the security tests for file operations before writing the create_file tool, not after. The path traversal risk is obvious in retrospect but was not the first thing I thought about when implementing the feature.
I would also add a Whisper model warm-up call at application startup — loading the model into a module-level variable so the first user request does not experience the cold-start penalty.
For the intent classifier, I would explore whether a smaller, fine-tuned model (a DistilBERT classifier trained on 500 examples per intent) could match Mistral 7B's accuracy for the four supported intents at a fraction of the latency. The 8–12 second CPU latency for Mistral is the single biggest usability problem on low-end hardware, and a dedicated small classifier would likely close it to under 1 second.
Conclusion
VoiceBot demonstrates that a production-quality voice agent — with compound command support, session memory, security constraints, and a full test suite — can be built entirely on local hardware using open-source models. The architecture is modular: each of the four components (STT, intent classification, tool execution, UI) can be replaced independently. Swapping Whisper for a faster local model, or Mistral for a fine-tuned classifier, requires changing a single module without touching the others.
The full source code is available at https://github.com/Mansi-2703/VoiceBot.
Built using OpenAI Whisper, Mistral 7B via Ollama, Streamlit, and PyTorch. All models run locally.