
Lakshmi Surya Kalla


A Voice-Controlled AI Agent for Real-World Task Execution

This project was built as part of an AI/ML assignment focused on
building real-world AI agents.

Introduction

As a third-year undergraduate interested in AI systems, I wanted to
explore how we can move beyond chat-based interfaces and build systems
that actually perform real actions.

In this project, I built a voice-controlled AI agent that takes audio
input, understands user intent, and executes tasks like file creation,
code generation, summarization, and general chat.

Problem Statement

Most AI systems today are limited to text-based interaction. Even voice
assistants often act as wrappers over chat models and do not perform
meaningful system-level actions.

The goal of this project was to build an agent that:

  • accepts voice input
  • understands the intent behind it
  • executes real actions on the system safely

System Overview

Audio Input → Speech-to-Text → Intent Detection → Tool Execution → UI
Output
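
The stages above can be sketched as a simple pipeline. The function bodies here are illustrative stand-ins (my names, not the project's modules); the real project calls the Groq STT API and an Ollama-hosted LLM for the first two stages:

```python
def speech_to_text(audio_bytes: bytes) -> str:
    # Placeholder: the real project sends the audio to the Groq STT API.
    return "create a file called notes.txt"

def detect_intent(transcript: str) -> dict:
    # Placeholder keyword rules; the real project asks the LLM for intent.
    text = transcript.lower()
    if "create" in text and "file" in text:
        return {"intent": "create_file", "text": transcript}
    if "code" in text:
        return {"intent": "write_code", "text": transcript}
    if "summar" in text:
        return {"intent": "summarize", "text": transcript}
    return {"intent": "chat", "text": transcript}

def execute(intent: dict) -> str:
    # Placeholder dispatch; the real handlers write inside output/.
    handlers = {
        "create_file": lambda i: f"created file for: {i['text']}",
        "write_code": lambda i: f"generated code for: {i['text']}",
        "summarize": lambda i: f"summary of: {i['text']}",
        "chat": lambda i: f"chat reply to: {i['text']}",
    }
    return handlers[intent["intent"]](intent)

def run_pipeline(audio_bytes: bytes) -> str:
    transcript = speech_to_text(audio_bytes)
    return execute(detect_intent(transcript))
```

The key design point is that each stage only depends on the previous stage's output, so any one of them (for example the STT backend) can be swapped without touching the rest.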

Tech Stack

  • Speech-to-Text: Groq API
  • LLM: Ollama (local)
  • UI: Streamlit
  • Language: Python

Key Design Decisions

Speech-to-Text (Groq API)

Running speech-to-text models locally is computationally expensive for typical hardware. The Groq API provides fast, reliable transcription without that local cost.
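
Groq exposes an OpenAI-compatible transcription endpoint, so the call is a single multipart POST. A hedged sketch (the endpoint path and model name should be verified against Groq's current docs, and a `GROQ_API_KEY` environment variable is assumed):

```python
import os

GROQ_STT_URL = "https://api.groq.com/openai/v1/audio/transcriptions"

def build_request(api_key: str, model: str = "whisper-large-v3"):
    """Assemble endpoint, auth header, and form data for the STT call."""
    headers = {"Authorization": f"Bearer {api_key}"}
    data = {"model": model}
    return GROQ_STT_URL, headers, data

def transcribe(audio_path: str) -> str:
    import requests  # third-party: pip install requests
    url, headers, data = build_request(os.environ["GROQ_API_KEY"])
    with open(audio_path, "rb") as f:
        resp = requests.post(url, headers=headers, data=data,
                             files={"file": f}, timeout=60)
    resp.raise_for_status()
    return resp.json()["text"]
```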

LLM (Ollama)

Ollama runs the model locally, which keeps data private and avoids API costs, though cloud-hosted models may offer lower latency.
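
To get structured intent out of the LLM, a common pattern is to ask for JSON in the prompt and parse the reply defensively. A sketch against Ollama's local REST API (`/api/chat` on port 11434); the intent labels mirror the ones this project supports, but the prompt wording and parsing are my own:

```python
import json
import re

VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def make_prompt(request: str) -> str:
    """Ask the model to answer with nothing but a small JSON object."""
    return (
        "Classify the user's request into one of: create_file, write_code, "
        "summarize, chat. Reply ONLY with JSON of the form "
        '{"intent": "...", "filename": "..."}.'
        "\n\nRequest: " + request
    )

def parse_intent(llm_output: str) -> dict:
    """Extract the first JSON object from the reply; fall back to chat."""
    match = re.search(r"\{.*\}", llm_output, re.DOTALL)
    if match:
        try:
            parsed = json.loads(match.group(0))
            if parsed.get("intent") in VALID_INTENTS:
                return parsed
        except json.JSONDecodeError:
            pass
    return {"intent": "chat"}

def classify(request: str, model: str = "llama3") -> dict:
    import requests  # third-party; assumes an Ollama server on localhost
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "stream": False,
              "messages": [{"role": "user", "content": make_prompt(request)}]},
        timeout=120,
    )
    resp.raise_for_status()
    return parse_intent(resp.json()["message"]["content"])
```

Falling back to chat on anything unparseable means a malformed model reply degrades to a harmless conversation turn rather than an unintended file operation.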

Core Features

  • Voice input (mic + file)
  • Intent classification
  • File creation and code generation
  • Summarization and chat
  • Human confirmation
  • Session memory
  • Safe execution in output folder
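
The last bullet is the safety core: every write is confined to output/ by resolving the requested path and refusing anything that escapes the sandbox. A minimal sketch of that check (function names are mine, not the project's):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(name: str) -> Path:
    """Resolve a user-supplied filename inside output/, refusing escapes."""
    candidate = (OUTPUT_DIR / name).resolve()
    # Rejects '../' traversal, absolute paths, and symlink escapes alike,
    # because the comparison happens after resolve().
    if candidate != OUTPUT_DIR and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"refusing to write outside output/: {name}")
    return candidate

def write_file(name: str, content: str) -> Path:
    path = safe_path(name)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content)
    return path
```

Checking the resolved path rather than the raw string is the important design choice: string filters like "reject `..`" miss absolute paths and symlinks, while `resolve()` normalizes all of them before the containment test.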

Challenges

  • Heavy local STT
  • Structuring intent reliably
  • Safe file execution
  • Latency vs control trade-off

Example Flow

User: "write a c++ code to find the max element from an array."

  1. Transcription
  2. Intent detection
  3. Confirmation
  4. File creation
  5. UI output
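
Step 3, the human confirmation, can be a thin gate between intent detection and execution. A sketch assuming a callable that returns the user's answer (in the real UI this would be a Streamlit button rather than stdin):

```python
def confirm_and_run(intent: dict, action, ask=input) -> str:
    """Describe the pending action and run it only on an explicit yes."""
    prompt = (f"About to run '{intent['intent']}' for: "
              f"{intent.get('text', '')!r}. Proceed? [yes/no] ")
    answer = ask(prompt).strip().lower()
    if answer in {"y", "yes"}:
        return action(intent)
    return "cancelled"
```

Injecting `ask` as a parameter keeps the gate testable and UI-agnostic: the console version passes `input`, while a web UI passes a function backed by its own widgets.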

Conclusion

This project demonstrates how AI systems can move beyond chat into
real-world action systems.

Project Link

Voice-Controlled Local AI Agent

This project implements the assignment from Mem0_ AI_ML & Generative AI Developer Intern Assignment.pdf: a voice-driven AI agent that accepts audio, transcribes speech, classifies the user's intent, safely executes local actions inside output/, and shows the full pipeline in a Streamlit UI.

Assignment status

Requirement-by-requirement status against the PDF:

  • Audio input from microphone: satisfied
  • Audio file upload: satisfied
  • Speech-to-text: satisfied through OpenAI or Groq API-based STT
  • Local or API STT note in README: satisfied
  • Intent understanding with LLM: satisfied through Ollama, OpenAI, or Groq
  • Minimum supported intents
    • create file: satisfied
    • write code to new or existing file: satisfied
    • summarize text: satisfied
    • general chat: satisfied
  • Tool execution for local file operations: satisfied
  • Create files or folders inside sandboxed output/: satisfied
  • Code generation saved directly to file: satisfied
  • Text summarization: satisfied
  • UI shows transcription: satisfied
  • UI shows detected intent: satisfied
  • UI shows action taken: satisfied
