DEV Community

Srishti Mishra

I Built a Voice-Controlled Local AI Agent That Actually Works — Here's Everything I Learned

What I Built

A voice-controlled AI agent that accepts audio input, classifies the user's intent, executes local tools, and displays the full pipeline in a Streamlit UI.

Try it: GitHub | Demo Video


Architecture

```
Audio (mic / file / text)
        │
[1] STT — Groq Whisper API
        │
[2] Intent Detection — LLaMA 3.3 70B (JSON output)
        │   create_file | write_code | summarize | general_chat | compound
[3] Tool Execution → output/ folder
        │
[4] Streamlit UI
```

Models Used

| Layer | Model | Why |
|---|---|---|
| STT | Groq Whisper (`whisper-large-v3`) | Local Whisper took 12–18s per clip on CPU; Groq does it in under 1s |
| LLM | LLaMA 3.3 70B via Groq | Reliable structured JSON output; local 3B models were inconsistent |

Hardware note: Local openai/whisper-base via HuggingFace works but is too slow for real-time use without a GPU. Groq's free tier is fast enough to feel instant.


Intent Classification

The LLM returns strict JSON:


```json
{
  "intent": "write_code",
  "filename": "retry.py",
  "language": "python"
}
```

For compound commands like "write a retry function and send a leave letter", it returns a tasks array and the agent runs each sub-task sequentially. A keyword-based fallback handles API failures.
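A minimal sketch of the parsing side, assuming the model's reply may carry extra prose around the JSON. The function names and fallback keyword lists here are illustrative, not taken from the repo:

```python
import json
import re

# Hypothetical keyword map for the fallback path; the real agent's
# rules may differ.
FALLBACK_KEYWORDS = {
    "write_code": ["write", "code", "function", "script"],
    "create_file": ["create", "make", "file"],
    "summarize": ["summarize", "summary"],
}

def parse_intent(raw: str) -> dict:
    """Extract the first JSON object from the LLM reply."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group())
        except json.JSONDecodeError:
            pass
    raise ValueError("no valid JSON in reply")

def keyword_fallback(command: str) -> dict:
    """Cheap rule-based classification when the API call fails."""
    lowered = command.lower()
    for intent, words in FALLBACK_KEYWORDS.items():
        if any(w in lowered for w in words):
            return {"intent": intent}
    return {"intent": "general_chat"}
```

The regex-extract step tolerates models that wrap the JSON in commentary; the fallback guarantees the pipeline always produces *some* intent.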


Supported Intents

| Intent | Example |
|---|---|
| create_file | "Make a notes.txt" |
| write_code | "Write a Python retry decorator" |
| summarize | "Summarize this and save it" |
| general_chat | "What is a linked list?" |
| compound | "Write a C++ sort and a leave letter" |
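Mapping these intents to tools boils down to a dispatch table. A sketch with placeholder handlers (names and bodies are illustrative, not from the repo):

```python
# Hypothetical handlers; the real ones write files, call the LLM, etc.
def handle_create_file(task: dict) -> str:
    return f"created {task.get('filename', 'notes.txt')}"

def handle_general_chat(task: dict) -> str:
    return "answering: " + task.get("query", "")

HANDLERS = {
    "create_file": handle_create_file,
    "general_chat": handle_general_chat,
    # write_code, summarize, compound register the same way
}

def dispatch(task: dict) -> str:
    # Unknown intents degrade to chat instead of crashing.
    handler = HANDLERS.get(task.get("intent"), handle_general_chat)
    return handler(task)
```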

Key Challenges

1. Follow-up save commands — "Save that as a text file" has no content of its own. The agent looks backward through chat history to find the last substantive assistant response and writes that to disk.
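The backward scan is straightforward; a sketch, assuming a minimum-length threshold to skip short acknowledgements (the threshold value is a guess):

```python
def last_substantive_reply(history: list, min_len: int = 40):
    """Walk the chat history backwards and return the most recent
    assistant message long enough to be worth saving."""
    for msg in reversed(history):
        if msg["role"] == "assistant" and len(msg["content"]) >= min_len:
            return msg["content"]
    return None  # nothing saveable yet
```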

2. Compound commands — Split on \band\b or commas, run intent detection on each fragment, execute independently.
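The split itself is one regex; a sketch of the fragment step described above:

```python
import re

def split_compound(command: str) -> list:
    """Split a compound command on the word 'and' (word-bounded,
    so words like 'sandbox' survive) or on commas."""
    fragments = re.split(r"\band\b|,", command)
    return [f.strip() for f in fragments if f.strip()]
```

Each fragment then goes through intent detection on its own.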

3. Sandboxing — All file writes are restricted to output/ via _safe_filename() which strips path traversal sequences.


Bonus Features Implemented

  • Compound commands — multiple actions in one input
  • Human-in-the-loop — confirmation toggle before any file operation
  • Graceful degradation — keyword fallback if LLM fails; errors shown in UI
  • Session memory — last 10 messages passed as context on every call
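The session-memory bullet amounts to a fixed-size context window; a minimal sketch, assuming OpenAI-style message dicts:

```python
MAX_CONTEXT = 10  # the "last 10 messages" mentioned above

def build_context(history: list, new_message: str) -> list:
    """Send only the most recent messages plus the new user turn,
    keeping every API call bounded in size."""
    recent = history[-MAX_CONTEXT:]
    return recent + [{"role": "user", "content": new_message}]
```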

Setup


```shell
git clone https://github.com/Srishti-1806/OA_Submission.git
cd OA_Submission
pip install -r requirements.txt
# create a .env file and add GROQ_API_KEY
streamlit run app.py
```

Free Groq API key at console.groq.com.


What's Next

  • True token streaming in the UI
  • Structured outputs / function calling instead of regex-cleaned JSON
  • TTS for voice responses to close the loop
