<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhigyan Pal</title>
    <description>The latest articles on DEV Community by Abhigyan Pal (@abhigyan_pal_ec33d468fe46).</description>
    <link>https://dev.to/abhigyan_pal_ec33d468fe46</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3364241%2F41918496-289c-46af-8143-a72f9b621ca3.jpg</url>
      <title>DEV Community: Abhigyan Pal</title>
      <link>https://dev.to/abhigyan_pal_ec33d468fe46</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/abhigyan_pal_ec33d468fe46"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled Local AI Agent</title>
      <dc:creator>Abhigyan Pal</dc:creator>
      <pubDate>Wed, 15 Apr 2026 17:55:05 +0000</pubDate>
      <link>https://dev.to/abhigyan_pal_ec33d468fe46/building-a-voice-controlled-local-ai-agent-302h</link>
      <guid>https://dev.to/abhigyan_pal_ec33d468fe46/building-a-voice-controlled-local-ai-agent-302h</guid>
      <description>&lt;p&gt;I built a voice-controlled AI agent that takes spoken input, converts it to text, classifies intent, executes local tools, and shows the full pipeline in a Gradio UI. The goal was to make the system practical, safe, and easy to debug end-to-end.&lt;/p&gt;

&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;The system follows a simple 4-stage pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Input Layer (UI)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Users can provide commands through microphone recording, audio upload, or text input (for quick testing).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speech-to-Text (STT)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Audio is transcribed using &lt;strong&gt;AssemblyAI&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Intent Understanding (LLM Router)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Transcribed text is sent to a &lt;strong&gt;Groq-hosted Llama 3.3 70B&lt;/strong&gt; model, which returns structured JSON intents such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;create_file&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;write_code&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;summarize&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;general_chat&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tool Execution Layer&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The app routes each detected intent to a tool (a minimal dispatch sketch follows this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;file creation&lt;/li&gt;
&lt;li&gt;code generation and save&lt;/li&gt;
&lt;li&gt;summarization&lt;/li&gt;
&lt;li&gt;chat response&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
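
&lt;p&gt;To make the routing concrete, here is a minimal dispatch sketch. The imports and tool signatures are illustrative placeholders, not the project's actual &lt;code&gt;tools/*&lt;/code&gt; API:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Hypothetical imports; the real implementations live in tools/*.
from tools import create_file, write_code, summarize, general_chat

DISPATCH = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def execute(intents):
    # Run each detected intent in order and collect the results.
    # Iterating over a list of intents is what allows compound
    # commands like "create a file and summarize it".
    results = []
    for intent in intents:
        tool = DISPATCH.get(intent["name"])
        if tool is None:
            results.append("Unknown intent: " + intent["name"])
            continue
        results.append(tool(**intent.get("args", {})))
    return results&lt;/code&gt;&lt;/pre&gt;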

&lt;p&gt;The UI then displays:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;transcribed text&lt;/li&gt;
&lt;li&gt;detected intents&lt;/li&gt;
&lt;li&gt;action(s) taken&lt;/li&gt;
&lt;li&gt;final result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are sandboxed to an &lt;code&gt;output/&lt;/code&gt; directory for safety.&lt;/p&gt;
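
&lt;p&gt;A sketch of that sandboxing idea using only &lt;code&gt;pathlib&lt;/code&gt;; the helper name is mine, not the project's:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename):
    # Resolve the requested path and refuse anything that escapes
    # output/, e.g. "../secrets.txt" style traversal.
    target = (OUTPUT_DIR / filename).resolve()
    if not target.is_relative_to(OUTPUT_DIR):  # Python 3.9+
        raise ValueError("refusing to write outside output/: " + filename)
    return target&lt;/code&gt;&lt;/pre&gt;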

&lt;h2&gt;Models and Why I Chose Them&lt;/h2&gt;

&lt;h3&gt;1) AssemblyAI for STT&lt;/h3&gt;

&lt;p&gt;I initially considered local models (Whisper/wav2vec), but given my hardware and timeline, an API-based STT service was more reliable and faster to integrate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why AssemblyAI:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generous free tier&lt;/li&gt;
&lt;li&gt;strong transcription quality&lt;/li&gt;
&lt;li&gt;simple Python SDK&lt;/li&gt;
&lt;li&gt;avoids local GPU dependency&lt;/li&gt;
&lt;/ul&gt;
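
&lt;p&gt;Integration really is a few lines. A minimal sketch with the AssemblyAI Python SDK, with the API key and audio path as placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("output/command.wav")  # placeholder path
print(transcript.text)&lt;/code&gt;&lt;/pre&gt;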

&lt;h3&gt;2) Groq + Llama 3.3 70B for Intent + Generation&lt;/h3&gt;

&lt;p&gt;For intent classification and text/code generation, I used Groq’s hosted Llama model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Groq:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;fast inference latency&lt;/li&gt;
&lt;li&gt;good structured-output behavior (JSON intent schema)&lt;/li&gt;
&lt;li&gt;strong instruction following for routing + generation&lt;/li&gt;
&lt;li&gt;straightforward integration in Python&lt;/li&gt;
&lt;/ul&gt;
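
&lt;p&gt;A sketch of the intent call, assuming Groq's OpenAI-style Python client with JSON mode; the model id and the intent schema in the prompt are illustrative, not the exact ones from my router:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import json
from groq import Groq

client = Groq(api_key="YOUR_GROQ_KEY")  # placeholder

SYSTEM = (
    "Classify the user's command. Reply with JSON only: "
    '{"intents": [{"name": "...", "args": {}}]}. '
    "Valid names: create_file, write_code, summarize, general_chat."
)

def route(text):
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": text},
        ],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Pinning temperature to 0 keeps the routing as deterministic as possible.&lt;/p&gt;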

&lt;h2&gt;Key Challenges I Faced&lt;/h2&gt;

&lt;h3&gt;1) STT model configuration mismatches&lt;/h3&gt;

&lt;p&gt;A major challenge was AssemblyAI configuration compatibility:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;speech_model&lt;/code&gt; setting I started with was rejected by the API as deprecated&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;speech_models&lt;/code&gt; only accepted specific values (&lt;code&gt;universal-3-pro&lt;/code&gt;, &lt;code&gt;universal-2&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;the enum values in the SDK and the values the server actually accepts did not always line up intuitively&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I resolved this by explicitly setting supported &lt;code&gt;speech_models&lt;/code&gt; values.&lt;/p&gt;

&lt;h3&gt;2) Language drift (Hindi vs English output)&lt;/h3&gt;

&lt;p&gt;During testing, English speech was sometimes transcribed into Hindi (Devanagari) script because of automatic language detection, and the Hindi transcript then cascaded into Hindi LLM responses.&lt;/p&gt;

&lt;p&gt;I fixed this by forcing English in STT config and aligning LLM prompts to respond in English.&lt;/p&gt;
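
&lt;p&gt;Roughly, the STT side of the fix looks like this (a sketch; exact config options depend on SDK version):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import assemblyai as aai

# An explicit language code bypasses automatic language detection,
# so English audio is no longer mis-detected as Hindi.
config = aai.TranscriptionConfig(language_code="en")
transcriber = aai.Transcriber(config=config)&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On the LLM side, appending an explicit instruction such as "Always respond in English." to the system prompt closed the loop.&lt;/p&gt;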

&lt;h3&gt;3) Intent ambiguity in compound commands&lt;/h3&gt;

&lt;p&gt;User prompts like “create a file and write your capabilities” can be interpreted as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;create file + generate code&lt;/li&gt;
&lt;li&gt;create file + write plain text&lt;/li&gt;
&lt;li&gt;chat + file write&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the current intent set does not include a dedicated &lt;code&gt;write_text&lt;/code&gt; intent, the model sometimes chose &lt;code&gt;write_code&lt;/code&gt;, producing code when plain text was expected. This highlighted an important product gap: intent taxonomies must match real user phrasing.&lt;/p&gt;
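
&lt;p&gt;To make the ambiguity concrete, here are two plausible parses of that command under a JSON intent schema like the one sketched above. &lt;code&gt;write_text&lt;/code&gt; in the second parse is the proposed intent, not one the agent currently supports, and the filenames and args are invented:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Parse the model sometimes produces: code where plain text was wanted.
parse_a = {"intents": [
    {"name": "create_file", "args": {"filename": "capabilities.txt"}},
    {"name": "write_code", "args": {"description": "list your capabilities"}},
]}

# Parse the user usually means, with a proposed write_text intent.
parse_b = {"intents": [
    {"name": "create_file", "args": {"filename": "capabilities.txt"}},
    {"name": "write_text", "args": {"content_request": "your capabilities"}},
]}&lt;/code&gt;&lt;/pre&gt;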

&lt;h3&gt;4) Safety vs usability&lt;/h3&gt;

&lt;p&gt;I needed to enable local file actions while minimizing risk. Restricting writes to &lt;code&gt;output/&lt;/code&gt; and adding a confirmation toggle balanced safety with usability.&lt;/p&gt;
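
&lt;p&gt;A console-flavored sketch of that gate; in the app the confirmation is a Gradio toggle rather than &lt;code&gt;input()&lt;/code&gt;, and the function names here are mine:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;FILE_TOUCHING = {"create_file", "write_code"}

def run_tool(name, tool, args, confirm=True):
    # Human-in-the-loop gate: file-touching tools need explicit approval.
    if confirm and name in FILE_TOUCHING:
        answer = input("Run " + name + " with " + repr(args) + "? [y/N] ")
        if answer.strip().lower() != "y":
            return "Cancelled by user."
    return tool(**args)&lt;/code&gt;&lt;/pre&gt;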

&lt;h2&gt;What Worked Well&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Clear modular design (&lt;code&gt;stt.py&lt;/code&gt;, &lt;code&gt;intent.py&lt;/code&gt;, &lt;code&gt;tools/*&lt;/code&gt;, &lt;code&gt;app.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Human-in-the-loop confirmation before file operations&lt;/li&gt;
&lt;li&gt;Compound-intent execution support&lt;/li&gt;
&lt;li&gt;Persistent memory support and reset controls&lt;/li&gt;
&lt;li&gt;Benchmark script to evaluate intent accuracy and generation latency (sketched briefly after this list)&lt;/li&gt;
&lt;/ul&gt;
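
&lt;p&gt;The benchmark is essentially a labeled loop. A tiny sketch, where the test utterances are invented examples and &lt;code&gt;route()&lt;/code&gt; is the router sketched earlier:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import time

CASES = [
    ("create a file called notes.txt", ["create_file"]),
    ("summarize this paragraph for me", ["summarize"]),
    ("hi, what can you do?", ["general_chat"]),
]

def benchmark():
    # Compare detected intent names against labels and time each call.
    correct, latencies = 0, []
    for text, expected in CASES:
        start = time.perf_counter()
        parsed = route(text)
        latencies.append(time.perf_counter() - start)
        names = [i["name"] for i in parsed.get("intents", [])]
        correct += int(names == expected)
    mean = sum(latencies) / len(latencies)
    print("accuracy:", correct, "/", len(CASES),
          "| mean latency:", round(mean, 2), "s")&lt;/code&gt;&lt;/pre&gt;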

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;This project showed that a practical voice agent is less about one “smart model” and more about &lt;strong&gt;pipeline reliability&lt;/strong&gt;: robust STT config, strict intent schema, safe tool boundaries, and transparent UI feedback. The next meaningful improvement would be adding a &lt;code&gt;write_text&lt;/code&gt; intent and richer compound-intent planning so user requests map more naturally to expected outcomes.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>automation</category>
      <category>agents</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
