This project was built as part of an AI/ML assignment focused on
building real-world AI agents.
Introduction
As a third-year undergraduate interested in AI systems, I wanted to
explore how we can move beyond chat-based interfaces and build systems
that actually perform real actions.
In this project, I built a voice-controlled AI agent that takes audio
input, understands user intent, and executes tasks like file creation,
code generation, summarization, and general chat.
Problem Statement
Most AI systems today are limited to text-based interaction. Even voice
assistants often act as wrappers over chat models and do not perform
meaningful system-level actions.
The goal of this project was to build an agent that:
- accepts voice input
- understands the intent behind it
- executes real actions on the system safely
System Overview
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output
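The stages above can be sketched as a chain of functions. This is an illustrative sketch, not the project's actual module layout; the function and field names here are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    transcript: str
    intent: str
    action_output: str

def run_pipeline(audio_bytes, transcribe, detect_intent, execute):
    """Chain the pipeline stages; each stage is injected as a callable."""
    transcript = transcribe(audio_bytes)        # Speech-to-Text
    intent = detect_intent(transcript)          # Intent Detection
    output = execute(intent, transcript)        # Tool Execution
    return PipelineResult(transcript, intent, output)  # rendered by the UI
```

Injecting the stages as callables keeps each step (STT provider, LLM, tool runner) swappable and easy to test in isolation.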
Tech Stack
- Speech-to-Text: Groq API
- LLM: Ollama (local)
- UI: Streamlit
- Language: Python
Key Design Decisions
Speech-to-Text (Groq API)
Local STT models are computationally expensive. Using Groq provides fast
and reliable transcription.
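A minimal transcription call could look like the following. The helper signature is an assumption of this write-up; the commented real usage follows Groq's OpenAI-compatible audio API and requires a `GROQ_API_KEY`.

```python
def transcribe_file(client, path, model="whisper-large-v3"):
    """Send an audio file to an STT client with an OpenAI-style audio API."""
    with open(path, "rb") as f:
        result = client.audio.transcriptions.create(model=model, file=f)
    return result.text

# Real usage (requires the groq package and GROQ_API_KEY):
#   from groq import Groq
#   text = transcribe_file(Groq(), "recording.wav")
```

Passing the client in rather than constructing it inside the function keeps the code testable and makes it trivial to swap Groq for OpenAI's STT endpoint.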
LLM (Ollama)
Running the model locally ensures privacy and avoids API costs, though cloud models may offer lower latency.
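One way to get structured intent out of a local LLM is to prompt for JSON and parse defensively. This is a sketch under assumptions: the prompt wording, intent labels, and helper names are illustrative, and the commented usage assumes a running Ollama server with the Python client installed.

```python
import json

INTENT_PROMPT = """Classify the user's request into one of:
create_file, write_code, summarize, chat.
Reply with JSON only, e.g. {{"intent": "write_code"}}.

Request: {text}"""

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def parse_intent(llm_reply):
    """Parse the model's JSON reply, falling back to 'chat' on bad output."""
    try:
        data = json.loads(llm_reply)
    except json.JSONDecodeError:
        return "chat"
    intent = data.get("intent") if isinstance(data, dict) else None
    return intent if intent in VALID_INTENTS else "chat"

# Real usage (assumes a local Ollama server and the ollama package):
#   import ollama
#   reply = ollama.chat(model="llama3",
#                       messages=[{"role": "user",
#                                  "content": INTENT_PROMPT.format(text=user_text)}])
#   intent = parse_intent(reply["message"]["content"])
```

Falling back to `chat` on malformed output means a flaky model reply degrades to a harmless conversation turn instead of an unintended file operation.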
Core Features
- Voice input (mic + file)
- Intent classification
- File creation and code generation
- Summarization and chat
- Human confirmation
- Session memory
- Safe execution in output folder
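The sandboxing idea behind the last feature can be sketched with path resolution: resolve the target and refuse anything that escapes `output/`. The helper name is an assumption, not the project's actual API.

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_write(relative_path, content):
    """Write only inside output/; reject paths that escape the sandbox."""
    target = (OUTPUT_DIR / relative_path).resolve()
    if OUTPUT_DIR not in target.parents:
        raise ValueError(f"refusing to write outside {OUTPUT_DIR}: {relative_path}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return target
```

Resolving before checking is the important part: it defeats `../` traversal that a naive string-prefix check would miss.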
Challenges
- Local STT models were too heavy to run comfortably
- Getting reliably structured intent output from the LLM
- Executing file operations safely
- The latency vs. control trade-off between local and cloud models
Example Flow
User: "write a c++ code to find the max element from an array."
- Transcription
- Intent detection
- Confirmation
- File creation
- UI output
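The confirmation step in the flow above can be gated with a small helper before any tool runs. This is a sketch; the prompt wording and function name are illustrative, and the input function is injected so the gate can be driven by a console or a Streamlit button alike.

```python
def confirm(prompt_text, ask=input):
    """Ask the user before executing an action; only an explicit yes proceeds."""
    reply = ask(f"{prompt_text} [y/N] ").strip().lower()
    return reply in ("y", "yes")

# Example: gate file creation on user approval
# if confirm("Create output/max.cpp with the generated C++ code?"):
#     ...execute the tool...
```

Defaulting to "no" means an empty or ambiguous answer never triggers a file operation.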
Conclusion
This project demonstrates how AI systems can move beyond chat into
real-world action systems.
Project Link
Voice-Controlled Local AI Agent
This project implements the assignment from Mem0_ AI_ML & Generative AI Developer Intern Assignment.pdf: a voice-driven AI agent that accepts audio, transcribes speech, classifies the user's intent, safely executes local actions inside output/, and shows the full pipeline in a Streamlit UI.
Assignment status
Requirement-by-requirement status against the PDF:
- Audio input from microphone: satisfied
- Audio file upload: satisfied
- Speech-to-text: satisfied through OpenAI or Groq API-based STT
- Local or API STT note in README: satisfied
- Intent understanding with LLM: satisfied through Ollama, OpenAI, or Groq
- Minimum supported intents:
  - create file: satisfied
  - write code to new or existing file: satisfied
  - summarize text: satisfied
  - general chat: satisfied
- Tool execution for local file operations: satisfied
  - Create files or folders inside sandboxed output/: satisfied
  - Code generation saved directly to file: satisfied
- Text summarization: satisfied
- UI shows transcription: satisfied
- UI shows detected intent: satisfied
- UI shows action taken: satisfied
- …