<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: siddharth shetty</title>
    <description>The latest articles on DEV Community by siddharth shetty (@siddshett).</description>
    <link>https://dev.to/siddshett</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3874916%2F1023b692-f8df-49c2-a133-4c3c7d344587.png</url>
      <title>DEV Community: siddharth shetty</title>
      <link>https://dev.to/siddshett</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siddshett"/>
    <language>en</language>
    <item>
      <title>Audio AI Agent Pipeline</title>
      <dc:creator>siddharth shetty</dc:creator>
      <pubDate>Sun, 12 Apr 2026 13:25:25 +0000</pubDate>
      <link>https://dev.to/siddshett/audio-ai-agent-pipeline-49go</link>
      <guid>https://dev.to/siddshett/audio-ai-agent-pipeline-49go</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Voice-controlled AI agents have traditionally required expensive cloud APIs, constant internet connectivity, and a willingness to send sensitive audio to third-party servers. This project breaks that mould by assembling an end-to-end voice-to-action pipeline that keeps the heavy inference local. You speak — or upload an audio file — and the system transcribes, understands, routes, and executes without leaving your machine (except for the Groq-hosted Whisper call at the STT stage).&lt;br&gt;
The stack is deliberately minimal yet production-quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Whisper Large V3 via the Groq API for near-real-time, high-accuracy speech-to-text&lt;/li&gt;
&lt;li&gt;Llama 3 via Ollama as the local reasoning engine for intent classification and response generation&lt;/li&gt;
&lt;li&gt;Streamlit as the browser-based frontend with a premium glassmorphism UI&lt;/li&gt;
&lt;li&gt;A Python tools layer for sandboxed file creation, code generation, summarisation, and general chat&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article walks through every layer of the pipeline — how each component works, how they connect, and the design decisions behind the architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The agent follows a strictly linear pipeline: audio in → text out → intent out → tool execution → UI feedback. There is no shared mutable state between stages, which makes the system easy to reason about and straightforward to extend.&lt;/p&gt;
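&lt;p&gt;Because each stage is just a function of the previous stage's output, the whole pipeline can be sketched as plain function composition. All names below are illustrative, not the project's actual API:&lt;/p&gt;

```python
# A minimal sketch of the linear pipeline: each stage is a pure function
# of the previous stage's output, with no shared mutable state.
# All names are illustrative; the real project wires Groq, Ollama, and
# Streamlit into these slots.

def run_pipeline(wav_path, stt, classify, execute):
    """audio in -> transcript -> (intent, payload) -> tool result."""
    transcript = stt(wav_path)              # Stage 2: speech-to-text
    intent, payload = classify(transcript)  # Stage 3: intent detection
    return execute(intent, payload)         # Stage 4: tool execution
```

Swapping any stage, for example replacing Groq with a local Whisper model, touches exactly one argument.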

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcxi2a0syigmhzry6tz3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzcxi2a0syigmhzry6tz3.png" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcb676ijrxocu9yq9eqf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbcb676ijrxocu9yq9eqf.png" alt=" " width="800" height="382"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1 — Audio Input
&lt;/h2&gt;

&lt;p&gt;The frontend supports two input modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Microphone recording: Streamlit's st.button triggers a Python call to sounddevice.rec(). The raw PCM buffer is captured at 16 kHz (mono), chosen because Whisper was trained at this sample rate, and saved as a temporary .wav file using scipy.io.wavfile.write.&lt;/li&gt;
&lt;li&gt;File upload: Streamlit's st.file_uploader accepts .wav, .mp3, .m4a, and other common audio containers. The bytes are written to a temp file and handled identically to a microphone recording from there on.&lt;/li&gt;
&lt;/ul&gt;
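&lt;p&gt;The microphone path can be sketched roughly as follows; function names are illustrative, and the sounddevice import is deferred so the module loads on machines without an audio backend:&lt;/p&gt;

```python
# Sketch of the microphone capture path: record at 16 kHz mono and write
# a temporary .wav via scipy, as described above. Names are illustrative.
import tempfile

import numpy as np
from scipy.io import wavfile

SAMPLE_RATE = 16_000  # Whisper models are trained on 16 kHz mono audio


def save_wav(frames: np.ndarray) -> str:
    """Write int16 PCM frames to a temporary .wav file and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    wavfile.write(tmp.name, SAMPLE_RATE, frames)
    return tmp.name


def record_clip(seconds: float = 5.0) -> str:
    """Record mono audio from the default microphone, blocking until done."""
    import sounddevice as sd  # deferred: needs a working audio backend
    frames = sd.rec(int(seconds * SAMPLE_RATE), samplerate=SAMPLE_RATE,
                    channels=1, dtype="int16")
    sd.wait()  # block until the buffer is full
    return save_wav(frames)
```

From here, an uploaded file and a recorded clip are indistinguishable: both are just a temp-file path.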

&lt;h2&gt;
  
  
  Stage 2 — Speech-to-Text with Whisper Large V3
&lt;/h2&gt;

&lt;p&gt;Why Whisper Large V3?&lt;br&gt;
OpenAI's Whisper is an encoder-decoder transformer trained on 680,000 hours of multilingual, multitask supervised audio data. The Large V3 variant (1.55 billion parameters) achieves the lowest word-error rate in the series and adds improved noise robustness and language identification compared to V2.&lt;br&gt;
Key improvements in V3 over V2:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduced hallucinations on silent or near-silent segments&lt;/li&gt;
&lt;li&gt;Better handling of code-switching (mixing languages mid-sentence)&lt;/li&gt;
&lt;li&gt;Improved punctuation placement, which matters for downstream NLP&lt;/li&gt;
&lt;li&gt;128-channel log-Mel spectrogram input (up from 80 Mel filterbanks) for finer frequency resolution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Groq API Integration&lt;br&gt;
Running Whisper Large V3 locally requires a GPU with at least 10 GB of VRAM. To keep local hardware requirements low while retaining V3's accuracy, the project routes STT through the Groq API, a hardware-accelerated inference service that returns transcripts in under a second on typical voice clips.&lt;/p&gt;

&lt;p&gt;The Human-in-the-Loop Checkpoint&lt;br&gt;
Before the transcript reaches Llama 3, Streamlit renders it in an editable st.text_area. This is a deliberate design choice: even at word-error rates below 5%, domain-specific jargon, proper nouns, and ambient noise can corrupt a word or two. Letting the user correct the transcript before execution prevents hallucination propagation: a transcription error fed into the LLM compounds into a wrong intent and a wrong action.&lt;/p&gt;
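&lt;p&gt;The Groq call itself is a thin wrapper. The sketch below follows Groq's OpenAI-compatible Python SDK; in the real app the client would be groq.Groq(api_key=...), and passing it in as a parameter keeps the function testable without a network call:&lt;/p&gt;

```python
# Hedged sketch of the STT stage against Groq's OpenAI-compatible SDK.
# The client is injected rather than constructed here, so the function
# can be unit-tested with a stub instead of a live API key.
def transcribe(client, wav_path: str, model: str = "whisper-large-v3") -> str:
    """Send a local audio file to Groq-hosted Whisper and return the text."""
    with open(wav_path, "rb") as f:
        result = client.audio.transcriptions.create(
            file=(wav_path, f.read()),
            model=model,
        )
    return result.text
```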

&lt;h2&gt;
  
  
  Stage 3 — Intent Detection with Llama 3 (Local)
&lt;/h2&gt;

&lt;p&gt;Why Llama 3 via Ollama?&lt;br&gt;
Meta's Llama 3 (8B instruction-tuned variant) runs fully on-device via Ollama, which manages model download, quantisation (4-bit by default), and a local REST API that mirrors the OpenAI chat completions format.&lt;br&gt;
Choosing a local LLM for intent detection rather than another cloud API offers three advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Privacy: the transcript never leaves the machine after the Groq STT call&lt;/li&gt;
&lt;li&gt;Latency: no network round-trip; inference on a modern CPU takes 1–3 seconds&lt;/li&gt;
&lt;li&gt;Cost: zero per-token fees for high-frequency, short-context intent classification&lt;/li&gt;
&lt;/ol&gt;
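&lt;p&gt;A minimal version of the intent-detection call might look like this. The intent labels and JSON contract are inferred from the tools layer described later and may differ from the project's exact prompt; the chat function is injectable so the routing logic can be tested without a running Ollama instance:&lt;/p&gt;

```python
import json

# Intent labels inferred from the tools described in this article;
# the project's actual system prompt may differ.
SYSTEM_PROMPT = (
    "You are an intent router. Reply with JSON only, for example "
    '{"intent": "write_code", "payload": "reverse a string"}. '
    "Valid intents: write_code, create_file, summarise, chat."
)


def detect_intent(transcript: str, chat=None) -> dict:
    """Classify a transcript into {intent, payload} via local Llama 3."""
    if chat is None:
        import ollama  # deferred so the module imports without Ollama installed
        chat = ollama.chat
    reply = chat(
        model="llama3",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(reply["message"]["content"])
```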

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z66a1m0sxgko8x6h0hy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8z66a1m0sxgko8x6h0hy.png" alt=" " width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 4 — Tool Execution (tools.py)
&lt;/h2&gt;

&lt;p&gt;Once the intent and payload are extracted, a simple match / if-elif router in app.py calls the corresponding function from tools.py. All output is written to a sandboxed output/ directory, keeping generated files out of the project root.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpc1yduf333cdligz5d5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpc1yduf333cdligz5d5.png" alt=" " width="721" height="285"&gt;&lt;/a&gt;&lt;br&gt;
Each tool is a thin wrapper that formats a prompt for Llama 3, calls ollama.chat(), and returns the result string. The write_code tool additionally saves the code block to disk and returns the file path.&lt;/p&gt;
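&lt;p&gt;As a concrete sketch of that pattern, a write_code tool in this style could extract the first fenced block from the model's reply and save it under output/. The fixed file name and the prompt wording here are illustrative, not the project's:&lt;/p&gt;

```python
import re
from pathlib import Path

OUTPUT_DIR = Path("output")  # the sandboxed directory described above
FENCE = "`" * 3              # markdown code-fence marker


def write_code(request: str, chat=None) -> str:
    """Ask the local model for code, save the first fenced block, return the path."""
    if chat is None:
        import ollama  # deferred import, as in the other tools
        chat = ollama.chat
    reply = chat(model="llama3",
                 messages=[{"role": "user",
                            "content": "Reply with a single code block. " + request}])
    text = reply["message"]["content"]
    found = re.search(re.escape(FENCE) + r"(?:\w+)?\n(.*?)" + re.escape(FENCE),
                      text, re.DOTALL)
    code = found.group(1) if found else text  # fall back to the raw reply
    OUTPUT_DIR.mkdir(exist_ok=True)
    path = OUTPUT_DIR / "generated.py"  # illustrative fixed name
    path.write_text(code)
    return str(path)
```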

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys8xm1x4eu1drg0x07oj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fys8xm1x4eu1drg0x07oj.png" alt=" " width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 5 — Streamlit UI
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Layout
&lt;/h2&gt;

&lt;p&gt;Streamlit was chosen because it eliminates the client-server boundary for rapid prototyping: the Python process is both the application logic and the web server. The UI uses custom CSS injected via st.markdown(..., unsafe_allow_html=True) to achieve the glassmorphism aesthetic described in the README.&lt;br&gt;
Surfacing each stage's status in the UI gives users real-time visibility into where the pipeline is, which matters because the Llama 3 inference step can take 2–5 seconds on CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Flow — Step by Step
&lt;/h2&gt;

&lt;p&gt;Here is the complete data transformation at each stage for a sample utterance: "Write a Python function that reverses a string"&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnaiqqg8acqrpqzp8sqj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnnaiqqg8acqrpqzp8sqj.png" alt=" " width="800" height="501"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi7f9vl6lex077pc08kd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgi7f9vl6lex077pc08kd.png" alt=" " width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Design Decisions and Trade-offs
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Cloud STT vs. Fully Local STT
&lt;/h2&gt;

&lt;p&gt;Running Whisper Large V3 locally requires ≥10 GB GPU VRAM and adds significant startup latency. Routing STT through Groq's inference API offers V3-quality transcription at sub-second latency without the hardware requirement. The trade-off is a single cloud dependency per voice interaction — acceptable for most use cases, but replaceable with a local Whisper model for fully air-gapped deployments.&lt;/p&gt;

&lt;h2&gt;
  
  
  4-bit Quantised Llama 3 vs. Full Precision
&lt;/h2&gt;

&lt;p&gt;Ollama's default is 4-bit quantisation (Q4_K_M), which reduces the 8B model from ~16 GB to ~4.7 GB. At this compression level, intent classification and short code generation quality are effectively unchanged compared to full-precision inference. For longer code generation or complex reasoning, users can pull llama3:8b-instruct-fp16 at the cost of ~3× the memory.&lt;/p&gt;
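&lt;p&gt;The memory figures follow from simple per-parameter arithmetic:&lt;/p&gt;

```python
# Back-of-envelope check on the quantisation figures quoted above.
params = 8e9                  # Llama 3 8B

fp16_gb = params * 2 / 1e9    # fp16: 2 bytes per parameter
q4_gb = params * 0.5 / 1e9    # 4-bit: 0.5 bytes per parameter

print(fp16_gb)  # 16.0, matching the ~16 GB full-precision figure
print(q4_gb)    # 4.0; Q4_K_M keeps some blocks at higher precision
                # plus metadata, which is why the file lands near 4.7 GB
```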

&lt;h2&gt;
  
  
  Sandboxed Output Directory
&lt;/h2&gt;

&lt;p&gt;All generated files land in output/, never in the project root. This prevents accidental overwrites of source files during code generation tasks and makes cleanup trivial. A future enhancement could mount output/ as a Docker volume.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stateless Tool Functions
&lt;/h2&gt;

&lt;p&gt;Each tool function in tools.py is pure in the sense that it takes a string and returns a string (plus a side-effect write to disk). This makes tools individually unit-testable and easy to extend — adding a new intent requires only: (a) adding the intent label to the system prompt, (b) writing a new function in tools.py, and (c) adding a case to the router.&lt;/p&gt;
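&lt;p&gt;That unit-testability claim is easy to demonstrate: because the model call is the only dependency, a tool can be exercised with a stub in a few lines. The tool name and prompt here are illustrative:&lt;/p&gt;

```python
# A tool in the tools.py style: string in, string out, model call injected.
def summarise(payload: str, chat=None) -> str:
    """Return a short summary of the payload via the local model."""
    if chat is None:
        import ollama  # deferred so tests need no running Ollama
        chat = ollama.chat
    reply = chat(model="llama3",
                 messages=[{"role": "user",
                            "content": "Summarise in one sentence: " + payload}])
    return reply["message"]["content"]


# Unit test with a stub standing in for ollama.chat: no model required.
def test_summarise():
    stub = lambda model, messages: {"message": {"content": "A short summary."}}
    assert summarise("some long text", chat=stub) == "A short summary."
```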

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project demonstrates that a fully functional, voice-controlled AI agent is buildable with open-source components and minimal infrastructure. The architecture is deliberately simple — a five-stage linear pipeline where each stage has a single responsibility and a clear input/output contract. Whisper Large V3 handles the perceptual hard part (speech recognition), Llama 3 handles the semantic hard part (understanding intent), and Streamlit handles the UX hard part (real-time feedback) without requiring a JavaScript build step.&lt;br&gt;
The human-in-the-loop checkpoint between STT and LLM is the system's most important reliability feature: it acknowledges that no transcription model is perfect and puts the user in control before any irreversible action is taken.&lt;br&gt;
The codebase is small enough to read in an afternoon, modular enough to extend in an hour, and principled enough to deploy with confidence.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
