Introduction
In this project, I built a voice-controlled AI agent that converts spoken commands into executable actions such as generating code, creating files, and summarizing text.
The goal was to combine speech processing, language models, and tool execution into a single pipeline that feels like a real-world AI assistant.
Architecture Overview
The system follows a simple but powerful pipeline:
Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output
Each stage is modular, making the system easy to extend and debug.
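The stages above can be sketched as plain functions chained together. All names here are illustrative, not the project's actual API; the point is that each stage is a swappable callable:

```python
# Minimal sketch of the pipeline: each stage is a plain function,
# so any stage can be replaced or tested in isolation.
# Function names are illustrative, not the project's real API.

def run_pipeline(audio_path, transcribe, detect_intents, execute):
    """Chain the stages: audio -> text -> intents -> results."""
    text = transcribe(audio_path)                   # Speech-to-Text
    intents = detect_intents(text)                  # Intent Detection
    results = [execute(i, text) for i in intents]   # Tool Execution
    return text, intents, results                   # Output for the UI
```

Because every stage is injected, the whole pipeline can be exercised with stubs before any API key is configured.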
Tech Stack
- Speech-to-Text: AssemblyAI
- Language Model: Groq (llama-3.1-8b-instant)
- Frontend: Streamlit
- Backend: Python
How It Works
1. Speech-to-Text
The user uploads an audio file, which is transcribed into text using AssemblyAI.
This step converts unstructured voice input into usable text.
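A hedged sketch of this stage, based on the `assemblyai` Python SDK (exact attribute names may vary between SDK versions; the import is kept inside the function so the rest of the pipeline stays runnable without the package installed):

```python
def transcribe_audio(audio_path: str, api_key: str) -> str:
    """Transcribe an uploaded audio file with AssemblyAI.

    Sketch based on the `assemblyai` SDK; check the current SDK docs
    for exact class and attribute names.
    """
    import assemblyai as aai  # third-party: pip install assemblyai

    aai.settings.api_key = api_key
    transcript = aai.Transcriber().transcribe(audio_path)
    if transcript.error:
        raise RuntimeError(f"Transcription failed: {transcript.error}")
    return transcript.text
```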
2. Intent Detection
The transcribed text is sent to a language model hosted on Groq.
The model analyzes the command and returns structured output like:
```json
{
  "intents": ["write_code", "create_file"],
  "params": {
    "filename": "retry.py",
    "language": "python"
  }
}
```
This allows the system to support multiple actions in a single command.
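Assuming the model returns JSON of that shape, the parsing step might look like this sketch, with a defensive default since models occasionally wrap JSON in extra prose:

```python
import json

def parse_intent_response(raw: str) -> dict:
    """Parse the model's JSON reply into intents + params.

    Falls back to a generic "chat" intent when the reply is not
    valid JSON, rather than crashing the pipeline.
    """
    try:
        data = json.loads(raw)
        return {
            "intents": data.get("intents", ["chat"]),
            "params": data.get("params", {}),
        }
    except json.JSONDecodeError:
        return {"intents": ["chat"], "params": {}}
```

Returning a list of intents (rather than a single one) is what makes compound commands possible later.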
3. Tool Execution
Based on detected intents, the system executes actions such as:
- Generating code
- Creating files
- Summarizing text
- General chat responses
All file operations are restricted to a sandboxed `output/` directory, so a command can never write outside it.
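One way to enforce that restriction is to resolve every requested filename and reject anything that escapes the sandbox; this is a sketch, not the project's exact implementation:

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # assumed location of the sandbox directory

def safe_output_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside output/,
    rejecting traversal attempts like '../secrets.txt'."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    candidate = (OUTPUT_DIR / filename).resolve()
    if OUTPUT_DIR.resolve() not in candidate.parents:
        raise ValueError(f"Refusing to write outside {OUTPUT_DIR}/: {filename}")
    return candidate
```

Resolving the path first means symlink tricks and `..` segments are normalized away before the check runs.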
4. User Interface
The UI is built using Streamlit and shows:
- Transcribed text
- Detected intent
- Execution results
- Session history
Example
Input:
"Create a Python file with a Fibonacci function"
Output:
- Code is generated
- File is created in the output folder
- Results are displayed in the UI
Bonus Features
Compound Commands
The system supports multiple actions in one input:
"Summarize this text and save it to summary.txt"
Human-in-the-Loop
Before file operations, the user is asked to confirm execution.
This adds a layer of safety and control.
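The confirmation gate can be kept independent of the UI by injecting the confirm step as a callable; in the real app this would be a Streamlit button, but the names below are illustrative:

```python
def confirmed_write(path: str, content: str, confirm) -> str:
    """Run a file write only after explicit user confirmation.

    `confirm` is any callable returning True/False -- in the actual UI
    it would be backed by a Streamlit button; injecting it keeps the
    gate itself testable.
    """
    if not confirm(f"Write {len(content)} chars to {path}?"):
        return "cancelled"
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return "written"
```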
Graceful Degradation
If intent detection fails, the system falls back to keyword-based classification instead of crashing.
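A keyword fallback can be as simple as substring matching against a small table; the keyword lists here are illustrative, not the project's exact ones:

```python
# Crude keyword classifier used only when LLM intent detection fails.
# Keyword lists are illustrative, not the project's exact ones.
FALLBACK_KEYWORDS = {
    "write_code": ("code", "function", "script"),
    "create_file": ("file", "save"),
    "summarize": ("summarize", "summary"),
}

def keyword_fallback(text: str) -> list:
    """Return every intent whose keywords appear in the text,
    defaulting to plain chat when nothing matches."""
    lowered = text.lower()
    intents = [intent for intent, words in FALLBACK_KEYWORDS.items()
               if any(w in lowered for w in words)]
    return intents or ["chat"]
```

It is far less accurate than the LLM, but it keeps the agent responsive instead of crashing on an API failure.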
Session Memory
The agent maintains a history of interactions within the session.
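Session memory is modeled on Streamlit's `st.session_state`, a dict-like store that persists across reruns within one session; the sketch below uses a plain dict so the logic is testable without Streamlit:

```python
def record_interaction(state: dict, transcript: str, intents, result) -> None:
    """Append one interaction to the session history.

    `state` stands in for st.session_state; in the real app the same
    call works against the Streamlit store directly.
    """
    state.setdefault("history", []).append(
        {"transcript": transcript, "intents": intents, "result": result}
    )
```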
Challenges Faced
1. Local Model Limitations
Initially, I used local models:
- Whisper (HuggingFace) for speech-to-text
- Ollama for language models
However, this approach led to:
- FFmpeg setup issues on Windows
- High memory usage
- Slow performance on CPU
- Frequent crashes
2. Switching to API-based Models
After exploring developer discussions (including Reddit), I switched to:
- AssemblyAI for STT
- Groq for LLM inference
This significantly improved:
- Speed
- Stability
- Ease of setup
3. Model Deprecation Issues
While I was building on Groq, some of its hosted models were deprecated mid-development.
This required updating model names and adapting quickly to API changes.
4. Output Cleaning
Language models sometimes returned explanations along with code.
This was fixed by enforcing strict prompts and cleaning responses before saving.
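The cleaning step can be sketched as extracting the first fenced code block from the reply, falling back to the raw text; this is an assumed approach, not the project's exact regex:

```python
import re

def clean_code_response(raw: str) -> str:
    """Extract just the code from an LLM reply.

    Models often wrap code in markdown fences and add prose around
    them; prefer the first fenced block, else return the stripped reply.
    """
    match = re.search(r"```(?:\w+)?\n(.*?)```", raw, re.DOTALL)
    if match:
        return match.group(1).strip()
    return raw.strip()
```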
Model Benchmarking
| Component | Model | Speed | Stability |
|---|---|---|---|
| STT | AssemblyAI | Fast | High |
| LLM (Local) | Ollama | Slow | Unstable |
| LLM (API) | Groq | Very Fast | High |
API-based models clearly outperformed local setups in this project.
Key Learnings
- Shipping a reliable system matters more than insisting on local models
- APIs can significantly improve performance and developer experience
- Fallback mechanisms are essential in AI systems
- Debugging agent pipelines requires step-by-step visibility
Future Improvements
- Real-time microphone input
- Persistent memory across sessions
- Streaming responses
- More advanced tool integrations
Conclusion
This project demonstrates how speech, language models, and execution logic can be combined to build a practical AI agent.
It also highlights the tradeoffs between local and API-based approaches, and the importance of choosing the right tools based on system constraints.