Vansh Jangwal
AI Voice Agent Using the Groq API

πŸŽ™οΈ VoiceAgent AI β€” Local AI Agent with Voice Control

A fully functioning, voice-controlled local AI agent built for the Mem0 AI/ML & Generative AI Developer Intern assignment. The system accepts audio input, transcribes it, classifies intent with an LLM, and runs local tools, all presented in a sleek, dark-themed Streamlit UI.

πŸ—οΈ Architecture

The architecture chains four components: audio input, speech-to-text, intent classification, and a tool dispatcher.

┌──────────────────────────────────────────────────────────┐
│                     VoiceAgent AI                        │
│                                                          │
│  ┌──────────┐   ┌──────────┐   ┌───────────┐             │
│  │  Audio   │──▶│  STT     │──▶│  Intent   │             │
│  │  Input   │   │ (Whisper │   │ Classify  │             │
│  │  .wav    │   │  via     │   │ (LLaMA    │             │
│  │  .mp3    │   │  Groq)   │   │  3.3 70B  │             │
│  └──────────┘   └──────────┘   │  via Groq)│             │
│                                └─────┬─────┘             │
│                                      │                   │
│              ┌───────────────────────┼────────────────┐  │
│              │      Tool Dispatcher  │                │  │
│              │                       ▼                │  │
│              │  create_file │ write_code │ summarize  │  │
│              │              │  general_chat           │  │
│              └────────────────────────────────────────┘  │
│                                      │                   │
│                                      ▼                   │
│                            output/ folder                │
│                          (sandboxed, safe)               │
└──────────────────────────────────────────────────────────┘

Audio flows left to right: the recording is transcribed by Whisper, the transcript is classified by LLaMA 3.3 70B, and the resulting intent is routed to the matching tool, with all file output confined to the output/ folder.

Module Breakdown

| File | Purpose |
| --- | --- |
| app.py | Streamlit UI — pipeline display, session history, human-in-the-loop |
| intent_classifier.py | LLaMA 3.3 70B prompt + JSON parsing + graceful fallback |
| tools.py | Tool handlers: create_file, write_code, summarize, general_chat |
| requirements.txt | Minimal dependencies (Streamlit + Groq SDK) |
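To illustrate the graceful-fallback behavior described for intent_classifier.py, here is a minimal sketch; the function and constant names are hypothetical, not the project's actual code:

```python
import json

# Intents the dispatcher knows how to handle (from the tool list above).
VALID_INTENTS = {"create_file", "write_code", "summarize", "general_chat"}

def parse_intent(llm_output: str) -> dict:
    """Parse the LLM's JSON reply; any failure degrades to general_chat."""
    try:
        result = json.loads(llm_output)
        if isinstance(result, dict) and result.get("intent") in VALID_INTENTS:
            return result
    except (json.JSONDecodeError, TypeError):
        pass
    # Graceful fallback: malformed JSON, a non-dict reply, or an unknown intent.
    return {"intent": "general_chat", "args": {}}
```

The key design choice is that no LLM output path raises: every malformed or out-of-vocabulary reply collapses to a safe conversational intent.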

🔑 Hardware Note & Workaround

Why Groq API instead of local models?

My local machine has no dedicated GPU. Running Whisper Large v3 or LLaMA 3.3 70B locally would require at minimum 8 GB of VRAM (Whisper) and 40 GB+ of RAM/VRAM (quantized LLaMA 70B). Substituting much smaller models that do fit locally would degrade accuracy to a prohibitive degree.

Solution: Groq’s free API tier offers:

- Whisper Large v3 for STT — state-of-the-art accuracy, ~2-3 seconds per audio file
- LLaMA 3.3 70B Versatile for intent + code generation — extremely fast (~200 tokens/sec on Groq hardware)

This fully complies with the assignment's hardware workaround policy, and the whole pipeline runs at API speed (3-6 seconds end-to-end).
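For reference, the two Groq calls can be sketched roughly like this with the Groq Python SDK. The helper function names are my own, and the model identifiers are the ones named in this post:

```python
# Sketch of the two Groq API calls (STT + intent LLM).
# Requires `pip install groq` and a GROQ_API_KEY to actually run;
# the import is guarded so the sketch also reads standalone.
try:
    from groq import Groq
except ImportError:
    Groq = None

def transcribe_audio(client, audio_path: str) -> str:
    """Speech-to-text via Whisper Large v3 hosted on Groq."""
    with open(audio_path, "rb") as f:
        resp = client.audio.transcriptions.create(
            file=f, model="whisper-large-v3"
        )
    return resp.text

def classify_intent(client, transcript: str) -> str:
    """Ask LLaMA 3.3 70B Versatile to label the transcript with a JSON intent."""
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system",
             "content": 'Classify the request. Reply only with JSON: {"intent": "...", "args": {}}'},
            {"role": "user", "content": transcript},
        ],
    )
    return resp.choices[0].message.content

# Real usage: client = Groq()  # reads GROQ_API_KEY from the environment
```

Because both functions take the client as a parameter, they can be exercised with a stub in tests and swapped for the real `Groq()` client at runtime.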

✨ Bonus Features Implemented

- Compound Commands — "Summarize and save to file" handles more than one action in a single utterance
- Human-in-the-Loop — checkbox confirmations for any file-writing commands
- Graceful Degradation — when JSON parsing fails, no intent matches the request, or the audio makes no sense, the agent falls back to general_chat with a helpful message
- Session Memory — the full history of actions for the session is displayed in the UI
- Safe Sandbox — all file operations are limited to the output/ folder, with a path-traversal safeguard
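The tool dispatch behind these features can be as simple as a dictionary lookup that falls back to general_chat for unknown intents. A sketch with hypothetical handler stubs (the real handlers live in tools.py):

```python
# Illustrative stand-ins for the real handlers in tools.py.
def create_file(args): return f"created {args.get('filename', 'untitled.txt')}"
def write_code(args): return "code written"
def summarize(args): return "summary ready"
def general_chat(args): return "chat response"

TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def dispatch(intent: str, args: dict) -> str:
    # Unknown intents degrade gracefully to general_chat.
    handler = TOOLS.get(intent, general_chat)
    return handler(args)
```

A compound command like "summarize and save to file" would simply call `dispatch` once per classified action.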

πŸ“ Project Structure

voice-agent-ai/
├── app.py                  # Main Streamlit UI
├── intent_classifier.py    # LLM-based intent classification
├── tools.py                # Tool execution handlers
├── requirements.txt        # Dependencies
├── output/                 # All generated files (gitignored)
└── README.md

🎬 Demo Video

YouTube Unlisted Link — demonstrates:

1. Voice input: "Create a python file with a retry decorator" → write_code intent → file saved
2. Voice input: "What is the difference between RAM and ROM?" → general_chat intent → conversational response

πŸ“ Technical Article

Medium / Dev.to Link — architecture, model selection, Groq's speed advantage, and challenges.

πŸ›‘οΈ Safety

- All file writes are confined to the output/ directory
- os.path.basename() strips directory components, defeating path traversal ("../")
- Human-in-the-loop confirmation before any destructive file operation
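A minimal sketch of that basename-based safeguard (OUTPUT_DIR and the function name are illustrative, not the project's exact code):

```python
import os

OUTPUT_DIR = "output"

def safe_output_path(user_filename: str) -> str:
    """Confine writes to the output/ sandbox.

    os.path.basename() drops every directory component, so a traversal
    attempt like '../../etc/passwd' collapses to the plain filename
    'passwd' inside the sandbox.
    """
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    name = os.path.basename(user_filename)
    return os.path.join(OUTPUT_DIR, name)
```

The trade-off of basename stripping is that legitimate subdirectories are flattened too, which is acceptable here since all generated files live in a single folder.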

πŸ“¦ Dependencies

streamlit>=1.35.0   # UI framework
groq>=0.9.0         # Groq SDK (STT + LLM)

No heavy ML libraries required; the project runs on any machine with Python 3.9+.

Check Out My Work

YouTube: https://youtu.be/GyBar8-K7Wk
