Priya Kanade
From Voice to Action: Building an AI Agent with Speech and LLMs

🎀 Building a Voice AI Agent with LLMs: From Speech to Action

In this project, I built an end-to-end Voice AI Agent that converts speech into text, understands user intent using Large Language Models (LLMs), and performs real-world actions like code generation, file creation, and summarization.

The focus is on combining speech processing, LLM reasoning, and tool execution into a single interactive system.


πŸš€ Problem Statement

Traditional systems require manual input and lack flexibility in understanding natural language commands. The goal of this project was to build an intelligent agent that can:

  • Accept voice input
  • Understand user intent
  • Execute meaningful actions
  • Provide real-time feedback through a UI

πŸ—οΈ System Architecture

The system follows a modular pipeline:
Audio Input β†’ Speech-to-Text β†’ LLM β†’ Agent β†’ Tools β†’ UI
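The pipeline above can be sketched as a simple composition of stages. All function names here are illustrative stand-ins, not the project's actual API; each stage is injected so it can be swapped or mocked independently:

```python
# Illustrative pipeline skeleton: each stage is a plain callable,
# so individual stages can be replaced or tested in isolation.
# All names are hypothetical stand-ins for the real components.

def run_pipeline(audio_path, stt, llm, agent):
    """audio -> text -> structured intent -> executed action."""
    text = stt(audio_path)      # Speech-to-Text stage
    parsed = llm(text)          # LLM returns a dict like {"intent": ...}
    result = agent(parsed)      # Agent dispatches to the right tool
    return {"transcript": text, "parsed": parsed, "result": result}
```

Keeping the stages decoupled like this means the UI only ever sees one dictionary per interaction, which maps directly onto what the Streamlit frontend displays.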

1. Audio Input

  • Supports both microphone recording and file uploads
  • Handles formats like WAV, MP3, and AAC

2. Speech-to-Text (STT)

The audio input is converted into text using a speech recognition model.

This step acts as the entry point for the LLM.
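The article does not name the recognition model, so the sketch below assumes a Whisper-style interface (a model object with a `transcribe` method returning a dict with a `"text"` key) and injects it rather than hard-coding a library:

```python
# Hypothetical STT wrapper. The recognition model is injected so the
# rest of the system does not depend on one specific library; the
# Whisper-style interface here is an assumption.

def speech_to_text(audio_path, model):
    """Return the transcript for an audio file, or '' if nothing was heard."""
    result = model.transcribe(audio_path)
    return (result.get("text") or "").strip()
```

Returning an empty string for silent or unclear audio gives the later stages a uniform signal to trigger their fallback handling.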


3. LLM (Intent Detection + Parsing)

The LLM is responsible for:

  • Understanding the user's request
  • Extracting structured information (intent, filename, etc.)
  • Returning a JSON output

Example:

{
  "intent": "write_code",
  "filename": "binary_search.cpp",
  "code": "..."
}
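Validating that JSON is where graceful degradation starts: a minimal sketch, assuming (as the article does not specify the exact fallback) that malformed or unrecognized output is downgraded to a plain `chat` intent:

```python
import json

# Intents the system knows how to act on (from the article's intent list).
VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def parse_llm_output(raw):
    """Parse the model's JSON reply; fall back to a plain chat intent
    if the output is not valid JSON or names an unknown intent.
    The exact fallback behavior is an assumption."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return {"intent": "chat", "text": raw}
    if data.get("intent") not in VALID_INTENTS:
        return {"intent": "chat", "text": raw}
    return data
```

Because every code path returns a dict with an `"intent"` key, downstream code never has to special-case a bad model reply.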

4. Agent Layer (Core Logic)

The agent acts as the brain of the system:

  • Parses LLM output
  • Handles errors and fallbacks
  • Decides which tool to execute
  • Supports compound commands
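The "decides which tool to execute" step can be a simple dispatch table from intent name to tool function (the structure below is a sketch, not the project's actual agent):

```python
def make_agent(tools):
    """Build an agent from a dict mapping intent name -> tool callable.
    Unknown intents fall back to the 'chat' tool, so the agent never
    raises on unexpected LLM output."""
    def agent(parsed):
        intent = parsed.get("intent", "chat")
        handler = tools.get(intent, tools["chat"])
        return handler(parsed)
    return agent
```

A dispatch table keeps the agent open for extension: adding a new capability means registering one more entry, with no change to the core logic.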

5. Tools (Execution Layer)

Different tools are used for specific actions:

  • File creation
  • Code generation and saving
  • Text summarization
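Minimal versions of the file-oriented tools might look like this (summarization is omitted since it calls back into the LLM; function names and the `parsed` dict shape are assumptions carried over from the JSON example above):

```python
from pathlib import Path

def create_file(parsed):
    """Create (or overwrite) a text file with the given content."""
    path = Path(parsed["filename"])
    path.write_text(parsed.get("content", ""), encoding="utf-8")
    return f"Created {path}"

def write_code(parsed):
    """Save generated code to the requested filename."""
    path = Path(parsed["filename"])
    path.write_text(parsed.get("code", ""), encoding="utf-8")
    return f"Saved code to {path}"
```

Each tool takes the same parsed dict and returns a human-readable status string, which is what the UI shows as the "action taken".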

6. Frontend (Streamlit UI)

The UI displays:

  • Transcribed text
  • Detected intent
  • Action taken
  • Final output

It also includes:

  • Session history
  • User confirmation for critical actions

πŸ”„ Key Features

πŸŽ™οΈ Voice & Audio Input

  • Supports both microphone recording (local environment) and audio file upload
  • Accepts multiple formats: WAV, MP3, and AAC
  • Automatically converts AAC to WAV for processing
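The conversion decision is a small piece of pure logic; the conversion itself would be delegated to an audio library (e.g. pydub's `AudioSegment.from_file(...).export(..., format="wav")` is one option, though the article doesn't name the library used):

```python
import os

# Formats the STT step is assumed to accept directly.
SUPPORTED_DIRECT = {".wav", ".mp3"}

def needs_wav_conversion(filename):
    """AAC uploads are converted to WAV before transcription;
    WAV and MP3 pass through unchanged."""
    ext = os.path.splitext(filename.lower())[1]
    return ext == ".aac"
```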

🧠 Intent Detection using LLM

  • Uses a Large Language Model to understand user commands
  • Classifies input into actionable intents:
    • create_file
    • write_code
    • summarize
    • chat
  • Extracts structured data like filename and content

βš™οΈ Action Execution Layer

  • Performs real-world tasks based on detected intent:
    • Create files
    • Generate and save code
    • Summarize text
    • Answer general queries

πŸ”„ Compound Command Support

  • Handles multi-step instructions in a single input
  • Example: β€œSummarize this and save it to summary.txt”
  • Executes both summarization and file-saving sequentially
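Sequential execution of a compound command can be sketched as a loop over parsed steps, assuming the LLM splits a multi-step instruction into an ordered list of intents (the list-of-steps shape is an assumption):

```python
def run_steps(steps, agent):
    """Execute a compound command as an ordered list of parsed intents,
    collecting each step's result. 'steps' is assumed to come from the
    LLM splitting e.g. 'summarize this and save it to summary.txt'."""
    results = []
    for parsed in steps:
        results.append(agent(parsed))
    return results
```

Running steps in order matters here: the summary must exist before the file-saving step can write it out.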

πŸ‘€ Human-in-the-Loop Confirmation

  • Asks the user for confirmation before executing file operations
  • Prevents unintended file creation or overwriting
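The confirmation gate can be kept independent of the UI by injecting the prompt mechanism (a Streamlit button in the real app, `input()` on the command line); the intent set and function names below are illustrative:

```python
# Intents that touch the filesystem and therefore require confirmation.
FILE_INTENTS = {"create_file", "write_code"}

def execute_with_confirmation(parsed, run_tool, confirm):
    """Run a tool, but ask first for intents that write files.
    'confirm' is any callable taking a question string and
    returning True/False (UI button, input() prompt, ...)."""
    if parsed.get("intent") in FILE_INTENTS:
        if not confirm(f"Write to {parsed.get('filename', '?')}?"):
            return "Cancelled by user"
    return run_tool(parsed)
```

Non-file intents (like `chat`) skip the prompt entirely, so the confirmation step only interrupts the user when something could be overwritten.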

⚠️ Graceful Error Handling

  • Handles unclear or empty audio inputs
  • Provides fallback responses if LLM output is invalid
  • Ensures system stability without crashes

🧠 Session Memory

  • Stores user interactions within the session
  • Displays conversation history in the UI
  • Improves traceability of actions and results
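Session memory reduces to an append-only record of interactions; in the Streamlit app an object like this would live in `st.session_state` so it survives reruns (the class shape is a sketch, not the project's actual structure):

```python
class SessionHistory:
    """Append-only record of (transcript, intent, result) per interaction."""

    def __init__(self):
        self.entries = []

    def add(self, transcript, intent, result):
        self.entries.append(
            {"transcript": transcript, "intent": intent, "result": result}
        )

    def last(self, n=5):
        """Most recent n interactions, oldest first."""
        return self.entries[-n:]
```

Rendering the history is then just iterating over `last(n)` and printing each entry's transcript, intent, and result in the UI.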
