If you ask most AI demos to do something useful, they usually stop right before the interesting part.
They can transcribe your speech, explain what they think you meant, and generate a polished response. But they often do not cross the line into safe, visible action on a real machine.
For my Mem0 AI/ML & Generative AI Developer Intern assignment, I wanted to build something more practical: a local-first voice-controlled AI agent that could listen to spoken commands, understand user intent, execute local tools, and expose the whole pipeline in a simple UI.
That project became VoxAgent.
What VoxAgent Does
VoxAgent is a local-first AI agent that supports:
- microphone input
- uploaded audio files
- local speech-to-text
- local intent understanding
- safe file and folder creation
- code generation into files
- text summarization
- general chat
- a UI that shows the full pipeline from audio to action
The key requirement was not just to generate responses, but to actually perform useful tasks while staying within safe local boundaries.
The Core Idea
The architecture is simple:
- Accept voice input
- Convert speech to text
- Classify the user’s intent
- Route the request to a local tool
- Show the transcript, decision, and final output in the UI
That sounds straightforward, but the interesting engineering work was in making the system:
- local-first
- safe by default
- resilient when local models are slow or unavailable
Tech Stack
I used the following stack:
- Streamlit for the frontend
- faster-whisper for local speech-to-text
- Ollama for local intent routing and generation
- Python for orchestration and tool execution
This gave me a stack that was practical to run on a normal laptop without turning the project into a hosted API workflow.
System Architecture
VoxAgent is split into five layers.
1. Audio ingestion
The Streamlit UI accepts audio in two ways:
- direct microphone recording
- uploaded
.wav,.mp3,.m4a, or.oggfiles
This made the app easier to demo and easier to test. If browser microphone behavior is inconsistent, uploaded audio still works.
2. Local speech-to-text
For transcription, I used faster-whisper with CPU-friendly defaults:
- model:
base.en - device:
cpu - compute type:
int8
I initially considered using a larger Whisper setup, but in practice local responsiveness matters more than chasing the largest possible model.
3. Intent planning with Ollama
Once the transcript is available, it is sent to a local Ollama model. Instead of asking for a free-form answer, I ask the model for a strict JSON action plan.
That plan includes:
- intent
- file or folder target if needed
- code generation instruction if needed
- text to summarize if needed
- whether confirmation should be required
This structure makes the agent much easier to reason about than a plain-text routing step.
4. Tool execution
The agent supports four main intents:
create_filewrite_codesummarize_textgeneral_chat
These actions are executed through a local tool layer, not directly through the UI.
5. UI and memory
The interface shows:
- transcribed text
- detected intents
- planned actions
- final results
- execution notes
- backend information
- timing information
The app also stores lightweight session history in JSON so recent interactions remain visible.
Safety Constraints
This part mattered a lot.
Once an AI agent can write files locally, the execution boundary has to be extremely clear. I added three explicit safeguards:
1. All writes are restricted to output/
The agent cannot create files outside the repository’s output/ folder.
2. Path traversal is blocked
Any path containing .., /, or \ is rejected before execution.
3. File actions require confirmation
The UI includes a human-in-the-loop approval checkbox before file creation or code writing happens.
That kept the project aligned with the assignment and prevented the most obvious local-risk failure modes.
What Broke During Real Testing
The first version worked in structure, but real end-to-end testing exposed some weaknesses.
Problem 1: Fallback logic was incomplete
At first, I had a rule-based fallback for intent planning if Ollama was unavailable. But some execution paths still depended on the local LLM later in the pipeline.
That meant the app could recover during planning and still fail during summarization or code generation.
I fixed this by adding a local fallback responder:
- summarization can fall back locally
- chat can fall back locally
- code generation can fall back to a safe template
This made degraded execution much more predictable.
Problem 2: Planning latency was too high
In local testing, planning timeouts could make the app feel frozen even when a fallback path was available.
I improved this by separating:
- planner timeout
- generation timeout
The shorter planning timeout lets the UI fall back faster if the local model stalls, while generation still gets its own bounded budget.
Demo Flows I Verified
I verified the app with actual local audio runs, not just unit tests.
Demo 1: Summarization
Voice input:
Summarize this text. Local AI agents combine speech recognition, reasoning, and tool execution to automate tasks.
Observed behavior:
- transcript recognized correctly
- intent detected as
summarize_text - planner fell back after timeout
- summarization still completed successfully
Final summary:
Local AI agents use speech recognition, reasoning, and tool execution to automate tasks.
Demo 2: Code generation
Voice input:
Create a Python file named retry helper dot py with a retry function.
Observed behavior:
- transcript normalized to
retryhelper.py - intents detected as
create_fileandwrite_code - file created safely inside
output/ - fallback code generation completed successfully
Generated output:
"""Generated in fallback mode because the local LLM was unavailable."""
import time
def retry(operation, attempts=3, delay_seconds=1):
last_error = None
for _ in range(attempts):
try:
return operation()
except Exception as error:
last_error = error
time.sleep(delay_seconds)
raise last_error
Challenges I Faced
Running everything locally without making it painful
Local AI systems sound attractive, but latency and hardware limits show up immediately. The practical solution was to tune for reliability:
- smaller Whisper model
- CPU-friendly quantization
- bounded timeouts
- local fallback behavior
Keeping flexibility without sacrificing safety
Natural language is flexible. File system actions are not.
So the architecture separates:
- transcript and interpretation
- validated action planning
- constrained execution
That separation kept the agent much easier to trust.
Making the system observable
I did not want the UI to behave like a black box. Showing the transcript, action plan, notes, backend used, and timing information made debugging much easier and also made the demo much stronger.
What I Would Improve Next
If I continue iterating on VoxAgent, the next improvements would be:
- richer extraction of filenames and structured parameters
- compound actions like “summarize this and save it to summary.txt”
- stronger local templates for multiple programming languages
- benchmarking different Whisper and Ollama combinations
- better multi-step approval flows before execution
Final Takeaway
The most useful lesson from this project was that a strong local AI agent is not just a model wrapped in a UI.
It is a pipeline with:
- explicit safety boundaries
- predictable fallbacks
- observable execution
- practical local defaults
That is what I tried to build with VoxAgent.
If you want to check out the code, the repository is here:
Top comments (0)