Priya Kanade
From Voice to Action: Building an AI Agent with Speech and LLMs

🎀 Building a Voice AI Agent with LLMs: From Speech to Action

In this project, I built an end-to-end Voice AI Agent that converts speech into text, understands user intent using Large Language Models (LLMs), and performs real-world actions like code generation, file creation, and summarization.

The focus is on combining speech processing, LLM reasoning, and tool execution into a single interactive system.


πŸš€ Problem Statement

Traditional systems require manual input and lack flexibility in understanding natural language commands. The goal of this project was to build an intelligent agent that can:

  • Accept voice input
  • Understand user intent
  • Execute meaningful actions
  • Provide real-time feedback through a UI

πŸ—οΈ System Architecture

The system follows a modular pipeline:
Audio Input β†’ Speech-to-Text β†’ LLM β†’ Agent β†’ Tools β†’ UI
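The pipeline above can be sketched as a simple composition of stages. All function names here are illustrative stand-ins, not the project's actual API; each stage is injected so it can be swapped or mocked independently:

```python
# Illustrative pipeline skeleton: each stage is a plain callable,
# so individual stages can be replaced or tested in isolation.
# All names are hypothetical stand-ins for the real components.

def run_pipeline(audio_path, stt, llm, agent):
    """audio -> text -> structured intent -> executed action."""
    text = stt(audio_path)      # Speech-to-Text stage
    parsed = llm(text)          # LLM returns a dict like {"intent": ...}
    result = agent(parsed)      # Agent dispatches to the right tool
    return {"transcript": text, "parsed": parsed, "result": result}
```

Keeping the stages decoupled like this means the UI only ever sees one dictionary per interaction, which maps directly onto what the Streamlit frontend displays.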

1. Audio Input

  • Supports both microphone recording and file uploads
  • Handles formats like WAV, MP3, and AAC

2. Speech-to-Text (STT)

The audio input is converted into text using a speech recognition model.

This step acts as the entry point for the LLM.
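The article does not name the recognition model, so the sketch below assumes a Whisper-style interface (a model object with a `transcribe` method returning a dict with a `"text"` key) and injects it rather than hard-coding a library:

```python
# Hypothetical STT wrapper. The recognition model is injected so the
# rest of the system does not depend on one specific library; the
# Whisper-style interface here is an assumption.

def speech_to_text(audio_path, model):
    """Return the transcript for an audio file, or '' if nothing was heard."""
    result = model.transcribe(audio_path)
    return (result.get("text") or "").strip()
```

Returning an empty string for silent or unclear audio gives the later stages a uniform signal to trigger their fallback handling.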


3. LLM (Intent Detection + Parsing)

The LLM is responsible for:

  • Understanding the user's request
  • Extracting structured information (intent, filename, etc.)
  • Returning a JSON output

Example:

{
  "intent": "write_code",
  "filename": "binary_search.cpp",
  "code": "..."
}
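Validating that JSON is where graceful degradation starts: a minimal sketch, assuming (as the article does not specify the exact fallback) that malformed or unrecognized output is downgraded to a plain `chat` intent:

```python
import json

# Intents the system knows how to act on (from the article's intent list).
VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def parse_llm_output(raw):
    """Parse the model's JSON reply; fall back to a plain chat intent
    if the output is not valid JSON or names an unknown intent.
    The exact fallback behavior is an assumption."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"intent": "chat", "text": raw}
    if data.get("intent") not in VALID_INTENTS:
        return {"intent": "chat", "text": raw}
    return data
```

Because every code path returns a dict with an `"intent"` key, downstream code never has to special-case a bad model reply.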

4. Agent Layer (Core Logic)

The agent acts as the brain of the system:

  • Parses LLM output
  • Handles errors and fallbacks
  • Decides which tool to execute
  • Supports compound commands
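The "decides which tool to execute" step can be a simple dispatch table from intent name to tool function (the structure below is a sketch, not the project's actual agent):

```python
def make_agent(tools):
    """Build an agent from a dict mapping intent name -> tool callable.
    Unknown intents fall back to the 'chat' tool, so the agent never
    raises on unexpected LLM output."""
    def agent(parsed):
        intent = parsed.get("intent", "chat")
        handler = tools.get(intent, tools["chat"])
        return handler(parsed)
    return agent
```

A dispatch table keeps the agent open for extension: adding a new capability means registering one more entry, with no change to the core logic.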

5. Tools (Execution Layer)

Different tools are used for specific actions:

  • File creation
  • Code generation and saving
  • Text summarization
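Minimal versions of the file-oriented tools might look like this (summarization is omitted since it calls back into the LLM; function names and the `parsed` dict shape are assumptions carried over from the JSON example above):

```python
from pathlib import Path

def create_file(parsed):
    """Create (or overwrite) a text file with the given content."""
    path = Path(parsed["filename"])
    path.write_text(parsed.get("content", ""), encoding="utf-8")
    return f"Created {path}"

def write_code(parsed):
    """Save generated code to the requested filename."""
    path = Path(parsed["filename"])
    path.write_text(parsed.get("code", ""), encoding="utf-8")
    return f"Saved code to {path}"
```

Each tool takes the same parsed dict and returns a human-readable status string, which is what the UI shows as the "action taken".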

6. Frontend (Streamlit UI)

The UI displays:

  • Transcribed text
  • Detected intent
  • Action taken
  • Final output

It also includes:

  • Session history
  • User confirmation for critical actions

πŸ”„ Key Features

πŸŽ™οΈ Voice & Audio Input

  • Supports both microphone recording (local environment) and audio file upload
  • Accepts multiple formats: WAV, MP3, and AAC
  • Automatically converts AAC to WAV for processing
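The conversion decision is a small piece of pure logic; the conversion itself would be delegated to an audio library (e.g. pydub's `AudioSegment.from_file(...).export(..., format="wav")` is one option, though the article doesn't name the library used):

```python
import os

# Formats the STT step is assumed to accept directly.
SUPPORTED_DIRECT = {".wav", ".mp3"}

def needs_wav_conversion(filename):
    """AAC uploads are converted to WAV before transcription;
    WAV and MP3 pass through unchanged."""
    ext = os.path.splitext(filename.lower())[1]
    return ext == ".aac"
```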

🧠 Intent Detection using LLM

  • Uses a Large Language Model to understand user commands
  • Classifies input into actionable intents:
    • create_file
    • write_code
    • summarize
    • chat
  • Extracts structured data like filename and content

βš™οΈ Action Execution Layer

  • Performs real-world tasks based on detected intent:
    • Create files
    • Generate and save code
    • Summarize text
    • Answer general queries

πŸ”„ Compound Command Support

  • Handles multi-step instructions in a single input
  • Example: β€œSummarize this and save it to summary.txt”
  • Executes both summarization and file-saving sequentially
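Sequential execution of a compound command can be sketched as a loop over parsed steps, assuming the LLM splits a multi-step instruction into an ordered list of intents (the list-of-steps shape is an assumption):

```python
def run_steps(steps, agent):
    """Execute a compound command as an ordered list of parsed intents,
    collecting each step's result. 'steps' is assumed to come from the
    LLM splitting e.g. 'summarize this and save it to summary.txt'."""
    results = []
    for parsed in steps:
        results.append(agent(parsed))
    return results
```

Running steps in order matters here: the summary must exist before the file-saving step can write it out.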

πŸ‘€ Human-in-the-Loop Confirmation

  • Asks the user for confirmation before executing file operations
  • Prevents unintended file creation or overwriting
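The confirmation gate can be kept independent of the UI by injecting the prompt mechanism (a Streamlit button in the real app, `input()` on the command line); the intent set and function names below are illustrative:

```python
# Intents that touch the filesystem and therefore require confirmation.
FILE_INTENTS = {"create_file", "write_code"}

def execute_with_confirmation(parsed, run_tool, confirm):
    """Run a tool, but ask first for intents that write files.
    'confirm' is any callable taking a question string and
    returning True/False (UI button, input() prompt, ...)."""
    if parsed.get("intent") in FILE_INTENTS:
        if not confirm(f"Write to {parsed.get('filename', '?')}?"):
            return "Cancelled by user"
    return run_tool(parsed)
```

Non-file intents (like `chat`) skip the prompt entirely, so the confirmation step only interrupts the user when something could be overwritten.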

⚠️ Graceful Error Handling

  • Handles unclear or empty audio inputs
  • Provides fallback responses if LLM output is invalid
  • Ensures system stability without crashes

🧠 Session Memory

  • Stores user interactions within the session
  • Displays conversation history in the UI
  • Improves traceability of actions and results
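Session memory reduces to an append-only record of interactions; in the Streamlit app an object like this would live in `st.session_state` so it survives reruns (the class shape is a sketch, not the project's actual structure):

```python
class SessionHistory:
    """Append-only record of (transcript, intent, result) per interaction."""

    def __init__(self):
        self.entries = []

    def add(self, transcript, intent, result):
        self.entries.append(
            {"transcript": transcript, "intent": intent, "result": result}
        )

    def last(self, n=5):
        """Most recent n interactions, oldest first."""
        return self.entries[-n:]
```

Rendering the history is then just iterating over `last(n)` and printing each entry's transcript, intent, and result in the UI.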
