What I Built
A voice-controlled AI agent that accepts audio input, classifies the user's intent, executes local tools, and displays the full pipeline in a Streamlit UI.
Try it: GitHub | Demo Video
Architecture
```
Audio (mic / file / text)
        ↓
[1] STT — Groq Whisper API
        ↓
[2] Intent Detection — LLaMA 3.3 70B (JSON output)
        ↓
create_file | write_code | summarize | general_chat | compound
        ↓
[3] Tool Execution → output/ folder
        ↓
[4] Streamlit UI
```
Models Used
| Layer | Model | Why |
|---|---|---|
| STT | Groq Whisper (whisper-large-v3) | Local Whisper took 12–18s per clip on CPU; Groq does it in under 1s |
| LLM | LLaMA 3.3 70B via Groq | Reliable structured JSON output; local 3B models were inconsistent |
Hardware note: Local `openai/whisper-base` via Hugging Face works but is too slow for real-time use without a GPU. Groq's free tier is fast enough to feel instant.
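The STT step is a single API call. Here is a minimal sketch, assuming the Groq Python SDK (`pip install groq`), whose transcription endpoint mirrors the OpenAI audio API; the helper name `transcribe` is mine, not from the repo.

```python
# Minimal sketch of the STT step. Assumes the Groq Python SDK's
# OpenAI-style audio endpoint; `transcribe` is an illustrative name.

def transcribe(client, audio_path: str) -> str:
    """Send an audio clip to Groq Whisper and return the transcript text."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=f,
            model="whisper-large-v3",
        )
    return result.text

# Usage (requires GROQ_API_KEY in the environment):
#   from groq import Groq
#   text = transcribe(Groq(), "clip.wav")
```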
Intent Classification
The LLM returns strict JSON:
```json
{
  "intent": "write_code",
  "filename": "retry.py",
  "language": "python"
}
```
For compound commands like "write a retry function and send a leave letter", it returns a tasks array and the agent runs each sub-task sequentially. A keyword-based fallback handles API failures.
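The parse-with-fallback path described above can be sketched as follows. This assumes the LLM reply may wrap its JSON in markdown fences; the keyword table is illustrative, not the repo's actual fallback rules.

```python
import json
import re

# Illustrative keyword fallback used only when JSON parsing fails.
FALLBACK_KEYWORDS = {
    "write_code": ("write", "code", "function", "script"),
    "create_file": ("create", "make", "file"),
    "summarize": ("summarize", "summary"),
}

def parse_intent(raw: str) -> dict:
    """Extract the intent JSON from an LLM reply, with a keyword fallback."""
    # Strip optional ```json ... ``` fences before parsing.
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", raw.strip())
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        lowered = raw.lower()
        for intent, words in FALLBACK_KEYWORDS.items():
            if any(w in lowered for w in words):
                return {"intent": intent}
        return {"intent": "general_chat"}
```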
Supported Intents
| Intent | Example |
|---|---|
| create_file | "Make a notes.txt" |
| write_code | "Write a Python retry decorator" |
| summarize | "Summarize this and save it" |
| general_chat | "What is a linked list?" |
| compound | "Write a C++ sort and a leave letter" |
Key Challenges
1. Follow-up save commands — "Save that as a text file" carries no content of its own. The agent walks backward through chat history, finds the last substantive assistant response, and writes that to disk.
2. Compound commands — Split the input on `\band\b` or commas, run intent detection on each fragment, and execute each independently.
3. Sandboxing — All file writes are restricted to `output/` via `_safe_filename()`, which strips path traversal sequences.
Bonus Features Implemented
- ✅ Compound commands — multiple actions in one input
- ✅ Human-in-the-loop — confirmation toggle before any file operation
- ✅ Graceful degradation — keyword fallback if LLM fails; errors shown in UI
- ✅ Session memory — last 10 messages passed as context on every call
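The session-memory item above amounts to a rolling context window. A minimal sketch, where the window size of 10 matches the text and everything else is illustrative:

```python
# Keep only the last N turns when building each LLM call, so the
# prompt stays bounded while recent context is preserved.
MAX_TURNS = 10

def build_messages(system_prompt: str, history: list[dict]) -> list[dict]:
    """Prepend the system prompt and cap history at the last MAX_TURNS."""
    return [{"role": "system", "content": system_prompt}] + history[-MAX_TURNS:]
```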
Setup
```shell
git clone https://github.com/Srishti-1806/OA_Submission.git
cd OA_Submission
pip install -r requirements.txt
cp .env.example .env   # add GROQ_API_KEY
streamlit run app.py
```
Free Groq API key at console.groq.com.
Next Steps
- True token streaming in the UI
- Structured outputs / function calling instead of regex-cleaned JSON
- TTS for voice responses to close the loop
