Introduction
Voice-controlled AI agents have traditionally required expensive cloud APIs, constant internet connectivity, and a willingness to send sensitive audio to third-party servers. This project breaks that mould by assembling an end-to-end voice-to-action pipeline that keeps the heavy inference local. You speak — or upload an audio file — and the system transcribes, understands, routes, and executes without leaving your machine (except for the Groq-hosted Whisper call at the STT stage).
The stack is deliberately minimal yet production-quality:
Whisper Large V3 via the Groq API for near-real-time, high-accuracy speech-to-text
Llama 3 via Ollama as the local reasoning engine for intent classification and response generation
Streamlit as the browser-based frontend with a premium glassmorphism UI
Python tools layer for sandboxed file creation, code generation, summarisation, and general chat
This article walks through every layer of the pipeline — how each component works, how they connect, and the design decisions behind the architecture.
System Architecture
The agent follows a strictly linear pipeline: audio in → text out → intent out → tool execution → UI feedback. There is no shared mutable state between stages, which makes the system easy to reason about and straightforward to extend.
Stage 1 — Audio Input
The frontend supports two input modes:
Microphone recording — Streamlit's st.button triggers a Python call to sounddevice.rec(). The raw PCM buffer is collected at 16 kHz (mono), chosen because Whisper was trained at this sample rate, and saved as a temporary .wav file using scipy.io.wavfile.write.
File upload — Streamlit's st.file_uploader accepts .wav, .mp3, .m4a, and other common audio containers. The bytes are written to a temp file and handled identically to a microphone recording from here on.
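The microphone path can be sketched in a few lines. This is a simplified version, not the project's exact code: `record_clip` and `save_wav` are hypothetical names, `sounddevice` is imported lazily so the module loads without audio hardware, and the stdlib `wave` module stands in for `scipy.io.wavfile.write` to keep the sketch dependency-free.

```python
import tempfile
import wave


def record_clip(seconds: float = 5.0, rate: int = 16_000) -> bytes:
    """Record mono 16-bit PCM from the default microphone (needs sounddevice)."""
    import sounddevice as sd  # lazy import: module still loads without audio hardware
    frames = sd.rec(int(seconds * rate), samplerate=rate, channels=1, dtype="int16")
    sd.wait()  # block until the recording finishes
    return frames.tobytes()


def save_wav(pcm: bytes, rate: int = 16_000) -> str:
    """Write raw 16-bit mono PCM to a temporary .wav file and return its path."""
    path = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)     # mono
        wf.setsampwidth(2)     # 16-bit samples
        wf.setframerate(rate)  # Whisper expects 16 kHz input
        wf.writeframes(pcm)
    return path
```

Either input mode ends at `save_wav`, which is why the two paths are handled identically downstream.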
Stage 2 — Speech-to-Text with Whisper Large V3
Why Whisper Large V3?
OpenAI's Whisper is an encoder-decoder transformer; the original models were trained on 680,000 hours of multilingual, multitask supervised audio data, and V3 extends this with roughly one million hours of weakly labelled audio plus four million hours of pseudo-labelled audio. The Large V3 variant (1.55 billion parameters) achieves the lowest word-error rate in the series and adds improved noise robustness and language identification compared to V2.
Key improvements in V3 over V2:
Reduced hallucinations on silent or near-silent segments
Better handling of code-switching (mixing languages mid-sentence)
Improved punctuation placement, which matters for downstream NLP
128-channel log-Mel spectrogram input (up from 80 channels in V2) for finer frequency resolution
Groq API Integration
Running Whisper Large V3 locally requires a GPU with at least 10 GB VRAM. To keep the local machine requirements low while retaining V3's accuracy, the project routes STT through the Groq API — a hardware-accelerated inference service that returns transcripts in under a second on typical voice clips.
The Human-in-the-Loop Checkpoint
Before the transcript reaches Llama 3, Streamlit renders it in an editable st.text_area. This is a deliberate design choice: even at word-error rates below 5%, domain-specific jargon, proper nouns, and ambient noise can corrupt a word or two. Letting the user correct the transcript before execution prevents hallucination propagation — a transcription error fed into the LLM compounds into a wrong intent and a wrong action.
Stage 3 — Intent Detection with Llama 3 (Local)
Why Llama 3 via Ollama?
Meta's Llama 3 (8B instruction-tuned variant) runs fully on-device via Ollama, which manages model download, quantisation (4-bit by default), and a local REST API that mirrors the OpenAI chat completions format.
Choosing a local LLM for intent detection rather than another cloud API offers three advantages:
1. Privacy — the transcript never leaves the machine after the Groq STT call
2. Latency — no network round-trip; inference on a modern CPU takes 1–3 seconds
3. Cost — zero per-token fees for high-frequency, short-context intent classification
Stage 4 — Tool Execution (tools.py)
Once the intent and payload are extracted, a simple match / if-elif router in app.py calls the corresponding function from tools.py. All output is written to a sandboxed output/ directory, keeping generated files out of the project root.

Each tool is a thin wrapper that formats a prompt for Llama 3, calls ollama.chat(), and returns the result string. The write_code tool additionally saves the code block to disk and returns the file path.
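The router pattern can be sketched with a dispatch dict; the function and file names here are illustrative stand-ins, not the project's actual code, and the real tools call ollama.chat() where this sketch writes a placeholder.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")  # sandboxed directory for all generated artefacts


def route(intent: str, payload: str, tools: dict) -> str:
    """Dispatch an intent to the matching tool; unknown intents fall back to chat."""
    handler = tools.get(intent, tools["chat"])
    return handler(payload)


def write_code(payload: str) -> str:
    """Illustrative stand-in for the real tool: save a code block under output/."""
    OUTPUT_DIR.mkdir(exist_ok=True)
    path = OUTPUT_DIR / "generated.py"
    path.write_text(f"# generated for: {payload}\n")
    return str(path)  # the UI uses the returned path to offer a download


TOOLS = {
    "write_code": write_code,
    "chat": lambda payload: f"(chat) {payload}",
}
```

Falling back to the chat tool for unrecognised intents keeps the router total: every transcript produces some visible response.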
Stage 5 — Streamlit UI
Layout
Streamlit was chosen because it eliminates the client-server boundary for rapid prototyping: the Python process is both the application logic and the web server. The UI uses custom CSS injected via st.markdown(..., unsafe_allow_html=True) to achieve the glassmorphism aesthetic described in the README.
Status indicators rendered at each pipeline stage give users real-time visibility into where the pipeline is, which matters because the Llama 3 inference step can take 2–5 seconds on CPU.
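One way to structure that per-stage feedback is to thread a reporting callback through the pipeline. This is a sketch of the pattern rather than the project's code: `run_pipeline` is a hypothetical name, and in the Streamlit app the callback would be something like a per-stage st.spinner rather than the plain function shown here.

```python
from typing import Callable


def run_pipeline(stages, report: Callable[[str], None]):
    """Run each (label, fn) stage in order, reporting the label before it starts."""
    results, value = [], None
    for label, fn in stages:
        report(label)  # in the app: surface the label via a Streamlit status widget
        value = fn(value)  # each stage consumes the previous stage's output
        results.append(value)
    return results
```

Because the callback is injected, the same pipeline runs unchanged under unit tests (collecting labels into a list) and in the browser (driving a spinner).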
Data Flow — Step by Step
Here is the complete data transformation at each stage for a sample utterance: "Write a Python function that reverses a string"
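An illustrative trace of what each stage produces for that utterance (the values are representative examples written for this walkthrough, not captured output):

```python
# Representative per-stage values for the sample utterance, stage by stage
trace = {
    "stage_1_audio":      "voice.wav (16 kHz mono PCM)",
    "stage_2_transcript": "Write a Python function that reverses a string",
    "stage_3_intent":     {"intent": "write_code",
                           "payload": "a Python function that reverses a string"},
    "stage_4_tool":       "output/generated.py  # e.g. def reverse(s): return s[::-1]",
    "stage_5_ui":         "code block plus download link rendered in Streamlit",
}
```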
Design Decisions and Trade-offs
Cloud STT vs. Fully Local STT
Running Whisper Large V3 locally requires ≥10 GB GPU VRAM and adds significant startup latency. Routing STT through Groq's inference API offers V3-quality transcription at sub-second latency without the hardware requirement. The trade-off is a single cloud dependency per voice interaction — acceptable for most use cases, but replaceable with a local Whisper model for fully air-gapped deployments.
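The Groq call itself is only a few lines. This sketch assumes the official groq Python SDK (which mirrors the OpenAI client shape) and a GROQ_API_KEY in the environment; `transcribe` is a hypothetical wrapper name.

```python
def transcribe(path: str) -> str:
    """Send an audio file to Groq-hosted Whisper Large V3 and return the text."""
    from groq import Groq  # lazy import: requires `pip install groq` plus an API key
    client = Groq()        # reads GROQ_API_KEY from the environment
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(
            file=audio,
            model="whisper-large-v3",
        )
    return result.text
```

Swapping this function for a local Whisper invocation is the single change needed for a fully air-gapped deployment, which is exactly the benefit of isolating the cloud dependency behind one call.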
4-bit Quantised Llama 3 vs. Full Precision
Ollama's default is 4-bit quantisation (Q4_K_M), which reduces the 8B model from ~16 GB to ~4.7 GB. At this compression level, intent classification and short code generation quality are effectively unchanged compared to full-precision inference. For longer code generation or complex reasoning, users can pull llama3:8b-instruct-fp16 at the cost of ~3× the memory.
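The memory figures follow directly from bits per parameter. As a rough back-of-envelope (Q4_K_M averages a little under 5 bits per weight once quantisation scales and metadata are counted):

```python
PARAMS = 8e9  # Llama 3 8B parameter count


def model_size_gb(bits_per_param: float) -> float:
    """Approximate in-memory size of a dense model at a given precision."""
    return PARAMS * bits_per_param / 8 / 1e9  # bits -> bytes -> GB


fp16 = model_size_gb(16.0)  # full half-precision weights
q4km = model_size_gb(4.7)   # effective bits/weight near Ollama's Q4_K_M default
```

Running the numbers gives about 16 GB for fp16 against about 4.7 GB quantised, which is where the roughly 3× memory figure comes from.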
Sandboxed Output Directory
All generated files land in output/, never in the project root. This prevents accidental overwrites of source files during code generation tasks and makes cleanup trivial. A future enhancement could mount output/ as a Docker volume.
Stateless Tool Functions
Each tool function in tools.py is pure in the sense that it takes a string and returns a string (plus a side-effect write to disk). This makes tools individually unit-testable and easy to extend — adding a new intent requires only: (a) adding the intent label to the system prompt, (b) writing a new function in tools.py, and (c) adding a case to the router.
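Under that contract, registering a hypothetical new intent is mechanical. The decorator-registry shown here is one possible way to implement step (c), sketched for illustration rather than taken from the project; step (a), extending the system prompt, happens in the prompt string and is noted only in comments.

```python
TOOLS = {}


def tool(intent: str):
    """Decorator registering a string-in/string-out function under an intent label."""
    def register(fn):
        TOOLS[intent] = fn
        return fn
    return register


@tool("translate")  # step (b): the new function in tools.py
def translate(payload: str) -> str:
    # The real tool would format a prompt and call ollama.chat() here;
    # step (a) is adding "translate" to the intent list in the system prompt.
    return f"[translated] {payload}"


# step (c): with a registry, routing is a dict lookup — no if-elif edit required
result = TOOLS["translate"]("bonjour")
```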
Conclusion
This project demonstrates that a fully functional, voice-controlled AI agent is buildable with open-source components and minimal infrastructure. The architecture is deliberately simple — a five-stage linear pipeline where each stage has a single responsibility and a clear input/output contract. Whisper Large V3 handles the perceptual hard part (speech recognition), Llama 3 handles the semantic hard part (understanding intent), and Streamlit handles the UX hard part (real-time feedback) without requiring a JavaScript build step.
The human-in-the-loop checkpoint between STT and LLM is the system's most important reliability feature: it acknowledges that no transcription model is perfect and puts the user in control before any irreversible action is taken.
The codebase is small enough to read in an afternoon, modular enough to extend in an hour, and principled enough to deploy with confidence.