A Voice-Controlled AI Agent for Real-World Task Execution

Lakshmi Surya Kalla — Tue, 14 Apr 2026 18:53:17 +0000

I built this project for an AI/ML assignment focused on real-world AI
agents.

Introduction

As a third-year undergraduate interested in AI systems, I wanted to
explore how we can move beyond chat-based interfaces and build systems
that actually perform real actions.

In this project, I built a voice-controlled AI agent that takes audio
input, understands user intent, and executes tasks like file creation,
code generation, summarization, and general chat.

Problem Statement

Most AI systems today are limited to text-based interaction. Even voice
assistants often act as wrappers over chat models and do not perform
meaningful system-level actions.

The goal of this project was to build an agent that:
- accepts voice input
- understands the intent behind it
- executes real actions on the system safely

System Overview

Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI Output
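
The stages above can be sketched as a chain of plain functions. This is a minimal illustration, not the project's actual code: `PipelineResult`, `run_pipeline`, and the three stage callables are hypothetical names, and the stages are stubbed with lambdas so the sketch runs without a microphone, an API key, or a model.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineResult:
    """Everything the UI needs to display for one request (hypothetical type)."""
    transcript: str = ""
    intent: dict = field(default_factory=dict)
    action: str = ""


def run_pipeline(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    detect_intent: Callable[[str], dict],
    execute: Callable[[dict], str],
) -> PipelineResult:
    """Run the stages in order and keep each stage's output for display."""
    result = PipelineResult()
    result.transcript = transcribe(audio)
    result.intent = detect_intent(result.transcript)
    result.action = execute(result.intent)
    return result


# Stand-in stages; a real app would plug in the STT call, the LLM, and the tools.
demo = run_pipeline(
    b"fake-audio",
    transcribe=lambda audio: "create a file called notes.txt",
    detect_intent=lambda text: {"intent": "create_file", "filename": "notes.txt"},
    execute=lambda intent: f"created output/{intent['filename']}",
)
print(demo)
```

Passing the stages in as callables keeps each one swappable, which matters when the STT or LLM backend changes.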

Tech Stack

Speech-to-Text: Groq API
LLM: Ollama (local)
UI: Streamlit
Language: Python

Key Design Decisions

Speech-to-Text (Groq API)

Local STT models are computationally expensive, so transcription goes
through the Groq API, which is fast and reliable.
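
A transcription call through the Groq Python SDK might look roughly like the sketch below. The helper name `transcribe_file` and the model name `whisper-large-v3` are assumptions; the post does not say which model or wrapper the project uses. The client is passed in so the function can be exercised without network access.

```python
from pathlib import Path


def transcribe_file(client, audio_path: str, model: str = "whisper-large-v3") -> str:
    """Send one audio file for transcription and return the recognized text.

    `client` is expected to be a `groq.Groq()` instance. The model name is an
    assumption, not something the project states.
    """
    path = Path(audio_path)
    response = client.audio.transcriptions.create(
        file=(path.name, path.read_bytes()),  # (filename, raw bytes) tuple
        model=model,
    )
    return response.text.strip()


# Usage (needs the `groq` package and a GROQ_API_KEY environment variable):
# from groq import Groq
# print(transcribe_file(Groq(), "command.wav"))
```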

LLM (Ollama)

The LLM runs locally through Ollama, which keeps data on the machine and
avoids API costs, though cloud models may respond with lower latency.
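
For illustration, a local Ollama server can be queried over its HTTP API using only the standard library. This is a sketch under assumptions: the model name `llama3` and the helper names are hypothetical, and the URL is Ollama's default local address, which the project may override.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request; the model name is an assumption."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local_llm(prompt: str, model: str = "llama3", timeout: float = 60.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

Nothing leaves the machine in this setup, which is the privacy benefit; the cost is that response time depends on local hardware.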

Core Features

Voice input (mic + file)
Intent classification
File creation and code generation
Summarization and chat
Human confirmation
Session memory
Safe execution in output folder
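
One common way to keep file operations inside an output folder is to resolve every requested path and reject anything that lands outside the folder. The sketch below shows that idea; `resolve_in_sandbox` and `write_file` are hypothetical helpers, not the project's actual functions.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")


def resolve_in_sandbox(user_path: str, root: Path = OUTPUT_DIR) -> Path:
    """Return an absolute path strictly inside `root`, or raise ValueError."""
    root = root.resolve()
    # Resolving collapses ".." segments and symlinks before the check.
    candidate = (root / user_path).resolve()
    if root not in candidate.parents:
        raise ValueError(f"refusing to touch a path outside {root}: {user_path!r}")
    return candidate


def write_file(user_path: str, content: str, root: Path = OUTPUT_DIR) -> Path:
    """Create parent folders as needed and write the file inside the sandbox."""
    target = resolve_in_sandbox(user_path, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target
```

With this check, a request such as `../secrets.txt` or an absolute path like `/etc/passwd` raises an error instead of being written.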

Challenges

Heavy local STT
Structuring intent reliably
Safe file execution
Latency vs control trade-off
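
On structuring intent reliably: one approach is to ask the model for JSON, then validate the reply and fall back to general chat when it does not parse. The sketch below assumes a JSON reply with `intent` and `filename` keys; those keys and the four intent labels are illustrative names, not the project's actual schema.

```python
import json

# Hypothetical labels for the four intents the agent supports.
ALLOWED_INTENTS = {"create_file", "write_code", "summarize", "chat"}


def parse_intent(raw_reply: str) -> dict:
    """Turn a model reply into a validated intent dict, defaulting to chat."""
    fallback = {"intent": "chat", "filename": None}
    # Models often wrap JSON in prose or code fences, so cut out the outermost braces.
    start, end = raw_reply.find("{"), raw_reply.rfind("}")
    if start == -1 or end <= start:
        return fallback
    try:
        data = json.loads(raw_reply[start : end + 1])
    except json.JSONDecodeError:
        return fallback
    intent = data.get("intent") if isinstance(data, dict) else None
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        return fallback
    filename = data.get("filename")
    return {"intent": intent, "filename": filename if isinstance(filename, str) else None}


print(parse_intent('Sure! {"intent": "write_code", "filename": "max.cpp"}'))
print(parse_intent("I could not understand that."))
```

Falling back to chat means a malformed reply never triggers a file operation.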

Example Flow

User: "write a c++ code to find the max element from an array."

Transcription
Intent detection
Confirmation
File creation
UI output
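
The confirmation step in this flow can be sketched as a gate in front of the write. `confirm_then_write` is a hypothetical helper, and the `confirm` callback stands in for whatever the UI uses to ask the user (for example a Streamlit button); the project's real mechanism may differ.

```python
from pathlib import Path
from typing import Callable


def confirm_then_write(
    target: Path, content: str, confirm: Callable[[str], bool]
) -> str:
    """Write `content` to `target` only if the user approves; report what happened."""
    question = f"Create {target} ({len(content)} characters)?"
    if not confirm(question):
        return f"cancelled: {target.name} was not written"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"created {target}"


# Usage with a console prompt as the confirmation source:
# ask = lambda q: input(q + " [y/N] ").strip().lower() == "y"
# print(confirm_then_write(Path("output/max.cpp"), "// generated code", ask))
```

The returned message is what the UI would show as the action taken.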

Conclusion

This project demonstrates how AI systems can move beyond chat and carry
out real-world actions.

Project Link

suryakalla06 / voice-controlled_local_ai_agent

Voice-Controlled Local AI Agent

This project implements the assignment from Mem0_ AI_ML & Generative AI Developer Intern Assignment.pdf: a voice-driven AI agent that accepts audio, transcribes speech, classifies the user's intent, safely executes local actions inside output/, and shows the full pipeline in a Streamlit UI.

Assignment status

Requirement-by-requirement status against the PDF:

Audio input from microphone: satisfied
Audio file upload: satisfied
Speech-to-text: satisfied through OpenAI or Groq API-based STT
Local or API STT note in README: satisfied
Intent understanding with LLM: satisfied through Ollama, OpenAI, or Groq
Minimum supported intents
- create file: satisfied
- write code to new or existing file: satisfied
- summarize text: satisfied
- general chat: satisfied
Tool execution for local file operations: satisfied
Create files or folders inside sandboxed output/: satisfied
Code generation saved directly to file: satisfied
Text summarization: satisfied
UI shows transcription: satisfied
UI shows detected intent: satisfied
UI shows action taken: satisfied
…


DEV Community: Lakshmi Surya Kalla