<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ishaan-Chaturved1</title>
    <description>The latest articles on DEV Community by Ishaan-Chaturved1 (@ishaanchaturved1).</description>
    <link>https://dev.to/ishaanchaturved1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3878717%2F8cedae12-4561-41c0-b302-7cc7c831d6c7.png</url>
      <title>DEV Community: Ishaan-Chaturved1</title>
      <link>https://dev.to/ishaanchaturved1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ishaanchaturved1"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled AI Agent using AssemblyAI and Groq</title>
      <dc:creator>Ishaan-Chaturved1</dc:creator>
      <pubDate>Tue, 14 Apr 2026 17:25:39 +0000</pubDate>
      <link>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-gg7</link>
      <guid>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-gg7</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, I built a voice-controlled AI agent that converts spoken commands into executable actions such as generating code, creating files, and summarizing text.&lt;/p&gt;

&lt;p&gt;The goal was to combine speech processing, language models, and tool execution into a single pipeline that feels like a real-world AI assistant.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The system follows a simple but powerful pipeline:&lt;/p&gt;

&lt;p&gt;Audio Input → Speech-to-Text → Intent Detection → Tool Execution → Output&lt;/p&gt;

&lt;p&gt;Each stage is modular, making the system easy to extend and debug.&lt;/p&gt;
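
&lt;p&gt;As a rough sketch, the pipeline can be expressed as a chain of swappable functions. The stage bodies below are illustrative stubs, not the project's actual code:&lt;/p&gt;

```python
# Illustrative stubs for the pipeline stages; each one is swappable in isolation.

def transcribe(audio_path):
    # The real project calls AssemblyAI here; stubbed for illustration.
    return "create a python file with a fibonacci function"

def detect_intent(text):
    # The real project calls a Groq-hosted LLM here; stubbed for illustration.
    return {"intents": ["write_code", "create_file"], "params": {}}

def execute(plan):
    # Dispatch each detected intent to its tool.
    return [f"ran {intent}" for intent in plan["intents"]]

def run_pipeline(audio_path):
    text = transcribe(audio_path)
    plan = detect_intent(text)
    return execute(plan)

print(run_pipeline("command.wav"))  # one audio file in, a list of results out
```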




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Speech-to-Text: AssemblyAI&lt;/li&gt;
&lt;li&gt;Language Model: Groq (llama-3.1-8b-instant)&lt;/li&gt;
&lt;li&gt;Frontend: Streamlit&lt;/li&gt;
&lt;li&gt;Backend: Python&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Speech-to-Text
&lt;/h3&gt;

&lt;p&gt;The user uploads an audio file, which is transcribed into text using AssemblyAI.&lt;br&gt;
This step converts unstructured voice input into usable text.&lt;/p&gt;
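
&lt;p&gt;Roughly, the transcription step can look like this with the &lt;code&gt;assemblyai&lt;/code&gt; SDK. The environment variable name and the error-handling helper are illustrative, and the network call is wrapped in a function so nothing runs without an API key:&lt;/p&gt;

```python
import os

def check_transcript(status, text, error=None):
    # Fail loudly on transcription errors instead of passing empty text downstream.
    if status == "error":
        raise RuntimeError(f"transcription failed: {error}")
    return text.strip()

def transcribe_upload(audio_path):
    # Network call; needs an AssemblyAI key, so it is defined but not run here.
    import assemblyai as aai  # pip install assemblyai
    aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]
    transcript = aai.Transcriber().transcribe(audio_path)
    status = "error" if transcript.status == aai.TranscriptStatus.error else "completed"
    return check_transcript(status, transcript.text or "", transcript.error)
```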


&lt;h3&gt;
  
  
  2. Intent Detection
&lt;/h3&gt;

&lt;p&gt;The transcribed text is sent to a language model hosted on Groq.&lt;/p&gt;

&lt;p&gt;The model analyzes the command and returns structured output like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"intents"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"write_code"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"create_file"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"filename"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"retry.py"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the system to support multiple actions in a single command.&lt;/p&gt;
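
&lt;p&gt;Since LLMs occasionally wrap JSON in extra prose, a defensive parser helps. Here is one possible sketch (not the project's exact code):&lt;/p&gt;

```python
import json

def parse_plan(reply):
    # Extract the first JSON object from the reply, tolerating surrounding prose.
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end == -1:
        return {"intents": [], "params": {}}
    try:
        plan = json.loads(reply[start:end + 1])
    except json.JSONDecodeError:
        return {"intents": [], "params": {}}
    # Guarantee the keys downstream code relies on.
    plan.setdefault("intents", [])
    plan.setdefault("params", {})
    return plan
```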




&lt;h3&gt;
  
  
  3. Tool Execution
&lt;/h3&gt;

&lt;p&gt;Based on detected intents, the system executes actions such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating code&lt;/li&gt;
&lt;li&gt;Creating files&lt;/li&gt;
&lt;li&gt;Summarizing text&lt;/li&gt;
&lt;li&gt;General chat responses&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file operations are restricted to a safe &lt;code&gt;output/&lt;/code&gt; directory.&lt;/p&gt;
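
&lt;p&gt;One way to enforce that restriction is to resolve every requested path and reject anything that escapes &lt;code&gt;output/&lt;/code&gt;. A sketch, assuming Python 3.9+ for &lt;code&gt;is_relative_to&lt;/code&gt;:&lt;/p&gt;

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(filename):
    # Resolve the requested name inside output/ and reject escapes such as "../".
    candidate = (OUTPUT_DIR / filename).resolve()
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"refusing to write outside output/: {filename}")
    return candidate
```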




&lt;h3&gt;
  
  
  4. User Interface
&lt;/h3&gt;

&lt;p&gt;The UI is built using Streamlit and shows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Transcribed text&lt;/li&gt;
&lt;li&gt;Detected intent&lt;/li&gt;
&lt;li&gt;Execution results&lt;/li&gt;
&lt;li&gt;Session history&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Input:
&lt;/h3&gt;

&lt;p&gt;"Create a Python file with a Fibonacci function"&lt;/p&gt;

&lt;h3&gt;
  
  
  Output:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Code is generated&lt;/li&gt;
&lt;li&gt;File is created in the output folder&lt;/li&gt;
&lt;li&gt;Results are displayed in the UI&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Bonus Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compound Commands
&lt;/h3&gt;

&lt;p&gt;The system supports multiple actions in one input:&lt;/p&gt;

&lt;p&gt;"Summarize this text and save it to summary.txt"&lt;/p&gt;




&lt;h3&gt;
  
  
  Human-in-the-Loop
&lt;/h3&gt;

&lt;p&gt;Before file operations, the user is asked to confirm execution.&lt;br&gt;
This adds a layer of safety and control.&lt;/p&gt;
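
&lt;p&gt;In the app this is a Streamlit confirmation step; below is a console-style sketch of the same idea, with the prompt function injectable so the gate can be tested or wired to a UI button (names are illustrative):&lt;/p&gt;

```python
def confirm(prompt, ask=input):
    # 'ask' is injectable for testing; in the app it maps to a confirm button.
    answer = ask(f"{prompt} [y/N] ").strip().lower()
    return answer in ("y", "yes")

def guarded_write(path, content, ask=input):
    # Anything other than an explicit yes cancels the file operation.
    if not confirm(f"Write {path}?", ask):
        return "cancelled"
    # A real implementation would write 'content' to 'path' here.
    return f"wrote {path}"
```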




&lt;h3&gt;
  
  
  Graceful Degradation
&lt;/h3&gt;

&lt;p&gt;If intent detection fails, the system falls back to keyword-based classification instead of crashing.&lt;/p&gt;
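
&lt;p&gt;A fallback of this kind can be as simple as keyword matching. The keyword map below is illustrative; the real system's vocabulary may differ:&lt;/p&gt;

```python
# Illustrative keyword-to-intent map used only when the LLM path fails.
KEYWORD_MAP = {
    "write_code": ("code", "function", "script"),
    "create_file": ("file", "save"),
    "summarize": ("summarize", "summary"),
}

def fallback_intents(text):
    lowered = text.lower()
    hits = [intent for intent, words in KEYWORD_MAP.items()
            if any(word in lowered for word in words)]
    return hits or ["chat"]  # default to a plain chat response
```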




&lt;h3&gt;
  
  
  Session Memory
&lt;/h3&gt;

&lt;p&gt;The agent maintains a history of interactions within the session.&lt;/p&gt;
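
&lt;p&gt;In Streamlit this kind of history typically lives in &lt;code&gt;st.session_state&lt;/code&gt;; here is a framework-free sketch of the same idea:&lt;/p&gt;

```python
class SessionMemory:
    """Rolling history of (command, result) turns for the current session."""

    def __init__(self, max_turns=20):
        self.max_turns = max_turns
        self.turns = []

    def add(self, command, result):
        self.turns.append({"command": command, "result": result})
        # Bound memory use by keeping only the most recent turns.
        self.turns = self.turns[-self.max_turns:]

    def transcript(self):
        # Flatten the history into a prompt-friendly string.
        return "\n".join(
            f"User: {t['command']}\nAgent: {t['result']}" for t in self.turns
        )
```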




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Local Model Limitations
&lt;/h3&gt;

&lt;p&gt;Initially, I used local models:&lt;/p&gt;


&lt;ul&gt;
&lt;li&gt;Whisper (via Hugging Face) for speech-to-text&lt;/li&gt;
&lt;li&gt;Ollama for language models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, this approach led to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;FFmpeg setup issues on Windows&lt;/li&gt;
&lt;li&gt;High memory usage&lt;/li&gt;
&lt;li&gt;Slow performance on CPU&lt;/li&gt;
&lt;li&gt;Frequent crashes&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  2. Switching to API-based Models
&lt;/h3&gt;

&lt;p&gt;After exploring developer discussions (including Reddit), I switched to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AssemblyAI for STT&lt;/li&gt;
&lt;li&gt;Groq for LLM inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This significantly improved:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Speed&lt;/li&gt;
&lt;li&gt;Stability&lt;/li&gt;
&lt;li&gt;Ease of setup&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Model Deprecation Issues
&lt;/h3&gt;

&lt;p&gt;While using Groq, some models were deprecated during development.&lt;br&gt;
This required updating model names and adapting quickly to API changes.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Output Cleaning
&lt;/h3&gt;

&lt;p&gt;Language models sometimes returned explanations along with code.&lt;br&gt;
This was fixed by enforcing strict prompts and cleaning responses before saving.&lt;/p&gt;
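
&lt;p&gt;One possible cleaning helper keeps only the fenced portion of a reply. This is illustrative, not the project's exact code; the triple-backtick marker is built with &lt;code&gt;chr(96)&lt;/code&gt; so the snippet can itself sit inside a fenced block:&lt;/p&gt;

```python
FENCE = chr(96) * 3  # triple backtick, built indirectly so this snippet
                     # can itself live inside a fenced code block

def extract_code(reply):
    # Keep only the first fenced block, dropping prose and a language tag line.
    if FENCE not in reply:
        return reply.strip()
    block = reply.split(FENCE)[1]
    lines = block.splitlines()
    if lines and lines[0].strip().isalpha():
        lines = lines[1:]
    return "\n".join(lines).strip()
```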




&lt;h2&gt;
  
  
  Model Benchmarking
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Stability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;STT&lt;/td&gt;
&lt;td&gt;AssemblyAI&lt;/td&gt;
&lt;td&gt;Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM (Local)&lt;/td&gt;
&lt;td&gt;Ollama&lt;/td&gt;
&lt;td&gt;Slow&lt;/td&gt;
&lt;td&gt;Unstable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM (API)&lt;/td&gt;
&lt;td&gt;Groq&lt;/td&gt;
&lt;td&gt;Very Fast&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;API-based models clearly outperformed local setups in this project.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Shipping a reliable system matters more than insisting on local models&lt;/li&gt;
&lt;li&gt;APIs can significantly improve performance and developer experience&lt;/li&gt;
&lt;li&gt;Fallback mechanisms are essential in AI systems&lt;/li&gt;
&lt;li&gt;Debugging agent pipelines requires step-by-step visibility&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Improvements
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Real-time microphone input&lt;/li&gt;
&lt;li&gt;Persistent memory across sessions&lt;/li&gt;
&lt;li&gt;Streaming responses&lt;/li&gt;
&lt;li&gt;More advanced tool integrations&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how speech, language models, and execution logic can be combined to build a practical AI agent.&lt;/p&gt;

&lt;p&gt;It also highlights the tradeoffs between local and API-based approaches, and the importance of choosing the right tools based on system constraints.&lt;/p&gt;




</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Building a Voice-Controlled AI Agent using AssemblyAI and Groq</title>
      <dc:creator>Ishaan-Chaturved1</dc:creator>
      <pubDate>Tue, 14 Apr 2026 13:29:12 +0000</pubDate>
      <link>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-3ao6</link>
      <guid>https://dev.to/ishaanchaturved1/building-a-voice-controlled-ai-agent-using-assemblyai-and-groq-3ao6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this project, I built a voice-controlled AI agent that converts spoken commands into executable actions like generating code and creating files.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;The system follows a modular pipeline:&lt;/p&gt;

&lt;p&gt;Audio → STT → Intent Detection → Tool Execution → Output&lt;/p&gt;




&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;AssemblyAI for speech-to-text&lt;/li&gt;
&lt;li&gt;Groq LLM (llama-3.1-8b-instant) for intent classification&lt;/li&gt;
&lt;li&gt;Streamlit for UI&lt;/li&gt;
&lt;li&gt;Python for backend agent logic&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;User uploads audio&lt;/li&gt;
&lt;li&gt;Audio is transcribed into text&lt;/li&gt;
&lt;li&gt;LLM detects intent (multi-intent supported)&lt;/li&gt;
&lt;li&gt;Agent executes actions&lt;/li&gt;
&lt;li&gt;Output is displayed and files are created&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Challenges Faced
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Ollama instability in the local setup&lt;/li&gt;
&lt;li&gt;Model deprecations in Groq&lt;/li&gt;
&lt;li&gt;Handling multi-intent parsing&lt;/li&gt;
&lt;li&gt;Debugging silent failures in Streamlit&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Learnings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Importance of fallback mechanisms&lt;/li&gt;
&lt;li&gt;API-based models are more stable than local inference&lt;/li&gt;
&lt;li&gt;Proper debugging is critical in agent systems&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Add real-time voice input&lt;/li&gt;
&lt;li&gt;Integrate memory and context&lt;/li&gt;
&lt;li&gt;Add RAG for knowledge-based queries&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how AI agents can combine speech, reasoning, and actions into a seamless user experience.&lt;/p&gt;




</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
    </item>
  </channel>
</rss>
