suryansh singh

Building VoiceForge AI: A Local Voice-Powered Agent with Compound Intent & Safe Execution

In the era of massive cloud-based AI, building tools that run entirely on your local machine often feels like an afterthought. However, privacy, speed, and zero recurring API costs are incredible advantages for developers building AI agents.

I recently set out to build VoiceForge AI, a locally hosted, voice-powered coding assistant and file manager that utilizes Speech-to-Text (STT) and Large Language Models (LLMs) right on my personal hardware.

Here is a deep dive into the architecture, the local AI models I used, and how I tackled the major engineering challenges, including complex "bonus" features like compound intents and safety-focused human-in-the-loop file execution.

The Core Architecture
The goal of VoiceForge AI is simple: you speak into your microphone, the app transcribes your speech, extracts exactly what you want it to do (intent analysis), and then executes those actions safely within a sandboxed environment.

I chose a Python-heavy stack to tie everything together:

Streamlit (Frontend): To handle microphone input directly in the browser and display a premium, reactive UI.
faster-whisper (STT): To quickly and accurately convert voice bytes into raw text locally without server delays.
Ollama (LLM / Intent Parsing): Utilizing local models to extract JSON data.
Custom Router: A Python class rigidly restricted to writing files inside an internal output/ directory, so my system files are never overwritten.
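That last bullet's sandbox guarantee boils down to a path-containment check. Here is a minimal sketch of the idea (the function name is mine, not from the repo): any requested filename that resolves outside output/ is rejected before a single byte is written.

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_output_path(filename: str) -> Path:
    """Resolve a requested filename inside output/ and reject escapes."""
    candidate = (OUTPUT_DIR / filename).resolve()
    # relative_to raises ValueError if candidate lies outside OUTPUT_DIR,
    # which catches traversal tricks like "../../etc/passwd".
    candidate.relative_to(OUTPUT_DIR)
    return candidate
```

Because resolve() normalizes `..` segments before the containment check, traversal strings fail loudly with a ValueError instead of silently escaping the sandbox.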
Model Choices & Hardware Workarounds
When building a local AI application, your biggest constraint is usually hardware (VRAM). I wanted to ensure this app could run even if you aren't using a high-end computing rig.

  1. Speech-to-Text with Whisper
    Instead of relying on OpenAI's expensive STT API, I used faster-whisper and loaded the small model. For maximum compatibility on CPU-only machines, I forced quantized integer math: compute_type="int8" and device="cpu". Transcription completes in a few seconds, a small latency trade-off for a massive gain in local system compatibility.
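A minimal sketch of that STT setup, assuming faster-whisper is installed (`pip install faster-whisper`; the helper names are mine). The import is deferred so the rest of the app can load even on a machine without the package:

```python
def load_stt_model():
    # Deferred import: requires `pip install faster-whisper`.
    from faster_whisper import WhisperModel
    # int8 on CPU keeps VRAM usage at zero and runs on commodity hardware.
    return WhisperModel("small", device="cpu", compute_type="int8")

def transcribe(model, wav_path: str) -> str:
    """Join all transcribed segments into one raw-text string."""
    segments, _info = model.transcribe(wav_path)
    return " ".join(seg.text.strip() for seg in segments)
```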

  2. Intent Extraction with LLaMA
    I used Ollama running the lightweight llama3.2:1b model. Because small-parameter models sometimes emit broken JSON, I didn't rely purely on the model's structural awareness. Instead, I gave the model a strict system prompt with few-shot examples and implemented a custom _repair_json() regex fallback in Python to salvage poorly formatted outputs.
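The repair fallback might look something like this (the actual _repair_json() in the repo may differ): try a plain JSON parse first, then a regex grab of the outermost braces, then a safe default intent.

```python
import json
import re

def repair_json(raw: str) -> dict:
    """Best-effort parse of possibly malformed LLM output."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Small models often wrap valid JSON in prose or markdown fences,
    # so grab the outermost {...} span and retry.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # Nothing salvageable: fall back to a harmless chat-only intent.
    return {"actions": [{"action": "none"}]}
```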

Achieving the "Bonus" Constraints
A basic AI agent triggers one tool upon request. I wanted VoiceForge AI to act like a true assistant. Here is how I tackled the tough "bonus" requirements internally:

  1. Compound Commands
    The Challenge: A user might say, "Summarize this text, and also write a Python script that calculates factorials." Normal routers break down under dual requests. The Fix: I instructed the LLM to return an actions array rather than a single dictionary. By looping over this JSON list, the router executes every tool required sequentially on a single mic capture.
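The loop itself is simple. A sketch with a hypothetical tools registry (the names are illustrative, not the repo's actual router):

```python
def run_actions(actions, tools):
    """Execute every parsed action in order from a single mic capture."""
    results = []
    for action in actions:
        name = action.get("action", "none")
        handler = tools.get(name)
        if handler is None:
            results.append(f"Unsupported action: {name}")
            continue
        results.append(handler(action))
    return results
```

Because the LLM always returns a list, a single-intent request is just the one-element case of the same code path.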

  2. Human-in-the-Loop (HITL) Execution
    The Challenge: Voice AI should not magically spawn files on a system without user verification — it's a massive security vulnerability. The Fix: Before my ToolRouter executes any file creation logic, Streamlit pauses its execution state using st.session_state.pending_actions. It visually prompts the user with exactly what is going to be written and asks for a manual "Confirm & Execute" button press before committing to the hard drive.
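The pattern, sketched outside Streamlit so it can run standalone (a plain dict stands in for st.session_state, and the helper names are illustrative):

```python
# In the real app these keys live on st.session_state; a plain dict
# lets the staging/confirm pattern run outside Streamlit.
session_state = {"pending_actions": []}

def stage_actions(actions):
    """Park file-writing actions until the user confirms them."""
    session_state["pending_actions"] = actions

def confirm_and_execute(execute_fn):
    """Wired to the Confirm & Execute button's click handler."""
    results = [execute_fn(a) for a in session_state["pending_actions"]]
    session_state["pending_actions"] = []  # nothing left awaiting approval
    return results
```

Nothing touches the disk until confirm_and_execute runs, so a Streamlit rerun between staging and confirmation simply re-renders the preview.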

  3. Graceful Degradation
    The Challenge: What if the audio is entirely silent or unintelligible? What if the LLM hallucinates a command the system doesn’t support? The Fix: If the audio fails, the pipeline immediately stops and surfaces the faster-whisper error string. If the LLM produces a gibberish intent, my logic defaults action to "none", bypassing the file operations and instead returning a graceful chat response.
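The "default to none" guard can be sketched as a small validator run on every parsed intent (the supported-action set here is illustrative):

```python
SUPPORTED_ACTIONS = {"summarize", "write_file", "none"}  # illustrative set

def validate_intent(intent: dict) -> dict:
    """Coerce hallucinated or missing actions to the safe 'none' default."""
    actions = intent.get("actions") or [{"action": "none"}]
    for action in actions:
        if action.get("action") not in SUPPORTED_ACTIONS:
            action["action"] = "none"
    return {"actions": actions}
```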

  4. Memory & Session Tracking
    Streamlit reruns its script completely over every interaction. To keep a persistent session history of previous tasks, failed files, and commands, I built an appendable dictionary log saved inside Streamlit's st.session_state. This populates a stylish tabular history board at the bottom of the dashboard.
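A sketch of such a history log (the field names are mine; as above, a plain dict stands in for st.session_state):

```python
from datetime import datetime

session_state = {"history": []}  # stands in for st.session_state

def log_interaction(transcript, actions, status):
    """Append one row to the persistent history table."""
    session_state["history"].append({
        "time": datetime.now().strftime("%H:%M:%S"),
        "transcript": transcript,
        "actions": ", ".join(a.get("action", "none") for a in actions),
        "status": status,
    })
```

Since st.session_state survives reruns, each list entry maps directly onto a row of the history board at the bottom of the dashboard.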

  5. Advanced Benchmarking
    To verify the hardware workarounds stayed effective, I wrapped both the STT call and the Ollama call in time.time() timers. Then, using Streamlit's custom HTML support, I visualize this benchmarking data with horizontal status bars so the user can easily spot inference bottlenecks!
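A sketch of that timing approach, using time.perf_counter() in place of time.time() for better resolution, plus a helper that converts the durations into percentage widths for the HTML bars (helper names are mine):

```python
import time

def timed(label, fn, timings, *args, **kwargs):
    """Run fn, recording its wall-clock duration under label."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    timings[label] = time.perf_counter() - start
    return result

def bar_widths(timings):
    """Percentage width per stage for the horizontal status bars."""
    total = sum(timings.values()) or 1.0
    return {k: round(100 * v / total, 1) for k, v in timings.items()}
```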

Conclusion
Building a multi-modal AI agent locally has never been easier. By combining faster-whisper for fast audio transcription with lightweight quantized models via Ollama, you can replicate cloud-like utility without ever handing your microphone's output or your hard-earned API credits to a third party.

You can check out the source code here: https://github.com/suryansh0512/voiceforgeAI
