In today's world of massive cloud-API dependencies, I decided to challenge myself: Could I build a fully functional, intelligent, voice-controlled AI agent entirely natively, running on my localized hardware without relying on paid OpenAI or Anthropic endpoints?
The answer is yes! Meet Vexa, an agentic file builder that listens to your voice, understands your intent, and writes code directly to your machine while keeping your data 100% private.
Here is a breakdown of how the architecture came together, my model tiering choices, and the immense engineering hurdles overcome along the way.
The System Architecture
The architecture is split into a robust FastAPI backend and a visually stunning React frontend.
The Ears (Speech-to-Text): When a user speaks into the microphone (or uploads an audio file), the React frontend bundles the audio blob and sends it to the FastAPI backend. There, a locally hosted HuggingFace Whisper (openai/whisper-tiny) pipeline kicks in, running on CUDA when available and falling back to CPU, and quickly returns a clean transcript.
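As a rough sketch of this step (the function names here are illustrative, not Vexa's actual code), a HuggingFace ASR pipeline with the CUDA-to-CPU fallback looks like this:

```python
import torch
from transformers import pipeline


def pick_device() -> int:
    """GPU index 0 if CUDA is available, else -1 for CPU."""
    return 0 if torch.cuda.is_available() else -1


def build_transcriber():
    # Downloads/loads openai/whisper-tiny on first use.
    return pipeline(
        "automatic-speech-recognition",
        model="openai/whisper-tiny",
        device=pick_device(),
    )


def transcribe(asr, audio_path: str) -> str:
    """Return a plain-text transcript for an audio file."""
    return asr(audio_path)["text"].strip()
```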
The Brain (Intent Parsing): The transcript is passed into Ollama, specifically running Microsoft's lightweight Phi-3 model. The prompt instructs the model to act as an agent and strictly classify the intent into exact endpoints: CREATE_FILE, WRITE_CODE, SUMMARIZE_TEXT, or GENERAL_CHAT.
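A minimal sketch of that call, using Ollama's `/api/generate` endpoint with `phi3` (the prompt wording and helper names here are my assumptions, not Vexa's exact code):

```python
import json

import requests  # pip install requests

INTENTS = ["CREATE_FILE", "WRITE_CODE", "SUMMARIZE_TEXT", "GENERAL_CHAT"]

PROMPT_TEMPLATE = (
    "You are an agent router. Classify the user's request into exactly one "
    "intent from {intents} and reply with pure JSON: "
    '{{"intent": "...", "content": "..."}}\n\nRequest: {transcript}'
)


def build_prompt(transcript: str) -> str:
    return PROMPT_TEMPLATE.format(intents=INTENTS, transcript=transcript)


def classify(transcript: str, host: str = "http://localhost:11434") -> dict:
    """Ask a local phi3 model (via Ollama) to classify the transcript."""
    resp = requests.post(
        f"{host}/api/generate",
        json={
            "model": "phi3",
            "prompt": build_prompt(transcript),
            "format": "json",  # ask Ollama to constrain output to JSON
            "stream": False,
        },
        timeout=60,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])
```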
The Hands (Tool Execution): Depending on the extracted intent, the backend relies on an isolated tool_executor.py. If Vexa determines it is supposed to write code, it validates the raw content, performs strict path-traversal safety checks (blocking ../ manipulation), ensures the extension is allowed (.py, .md), and writes the file into a sandboxed backend/output directory.
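The safety checks can be sketched like this (a simplified illustration under my own naming, not the real tool_executor.py): resolve the requested path inside the output directory and reject anything that escapes it or has an unexpected extension.

```python
from pathlib import Path

OUTPUT_DIR = Path("backend/output").resolve()
ALLOWED_EXTENSIONS = {".py", ".md"}


def safe_output_path(filename: str) -> Path:
    """Resolve filename inside OUTPUT_DIR, rejecting traversal attempts."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # Block ../ tricks: the resolved path must stay under OUTPUT_DIR.
    if OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"Path traversal blocked: {filename}")
    if candidate.suffix not in ALLOWED_EXTENSIONS:
        raise ValueError(f"Extension not allowed: {candidate.suffix}")
    return candidate
```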
End-to-End Workflow
- User provides audio input (microphone or file upload)
- Whisper converts audio → text
- Phi-3 classifies intent and extracts structured data
- Tool executor performs the required action
- The UI displays:
  - Transcription
  - Detected intent
  - Action taken
  - Final output
Why I Chose Phi-3 and Whisper
I chose Whisper-tiny because I needed hyper-low latency. In a multimodal desktop agent, speech-to-text transcription cannot be the bottleneck, and the tiny model runs comfortably via PyTorch even on a local CPU.
For the LLM, I chose Ollama's Phi-3. Rather than running heavier Llama-3 8B models, Phi-3 offered immense intelligence at a fraction of the hardware cost, making the Vexa agent highly accessible even on mid-tier laptops while offering exceptional reasoning capabilities.
The Biggest Challenge: Hallucinations vs. JSON Formatting
The biggest challenge was system brittleness caused by the LLM attempting to "chat" when it shouldn't.
Vexa relies entirely on a structured pipeline. The LLM must output pure JSON so the backend can seamlessly extract the target intent and content. However, local models inherently want to act like conversational assistants. Early models would frequently output:
"Sure, I can create that file for you! Here is your JSON: { "intent": ... }"
This extra preamble broke the parser, which expected raw JSON.
The Solution: I enforced "format": "json" in the Ollama API request payload (text_model.py). As a defense-in-depth fallback, I added a regex in the agent service that slices out the content strictly between the first { and the last }. The result: a backend that no longer crashes on formatting inconsistencies.
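The fallback can be sketched like this (a simplified illustration, not Vexa's exact regex): try to parse the output as-is, and if that fails, slice out the first-to-last brace span and parse that instead.

```python
import json
import re


def extract_json(raw: str) -> dict:
    """Parse model output as JSON, stripping conversational preamble."""
    try:
        return json.loads(raw)  # happy path: pure JSON
    except json.JSONDecodeError:
        # Greedy match: from the first { to the last }
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if not match:
            raise ValueError("No JSON object found in model output")
        return json.loads(match.group(0))
```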
Focus on Safety
Building an AI that has the power to write to your file system is inherently dangerous. To counter this, my Tool Executor was fundamentally built on "Default No Overwrite": if a user commands Vexa to write a script to a file that already exists, the system automatically appends suffixes, creating _v2, _v3 variants instead of overwriting.
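A minimal sketch of that behaviour (the helper name is hypothetical): keep incrementing the suffix until a free filename is found.

```python
from pathlib import Path


def next_free_path(path: Path) -> Path:
    """Return path unchanged if free, else the first free _vN variant."""
    if not path.exists():
        return path
    n = 2
    while True:
        candidate = path.with_name(f"{path.stem}_v{n}{path.suffix}")
        if not candidate.exists():
            return candidate
        n += 1
```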
Try It Out
Building Vexa proved that localized agentic workflows are not only possible but incredibly fast, secure, and beautiful. Feel free to drop by my GitHub repo and try talking to Vexa yourself!