<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GURRALA SAI HANEESH</title>
    <description>The latest articles on DEV Community by GURRALA SAI HANEESH (@gurrala_saihaneesh_eb299).</description>
    <link>https://dev.to/gurrala_saihaneesh_eb299</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3873081%2F423a217d-150d-428d-9b66-689d55fd7a22.png</url>
      <title>DEV Community: GURRALA SAI HANEESH</title>
      <link>https://dev.to/gurrala_saihaneesh_eb299</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gurrala_saihaneesh_eb299"/>
    <language>en</language>
    <item>
      <title>Building a Voice-Controlled AI Agent with OpenAI Whisper, GPT-4o-mini, and Next.js</title>
      <dc:creator>GURRALA SAI HANEESH</dc:creator>
      <pubDate>Sat, 11 Apr 2026 08:29:13 +0000</pubDate>
      <link>https://dev.to/gurrala_saihaneesh_eb299/building-a-voice-controlled-ai-agent-with-openai-whisper-gpt-4o-mini-and-nextjs-4mh7</link>
      <guid>https://dev.to/gurrala_saihaneesh_eb299/building-a-voice-controlled-ai-agent-with-openai-whisper-gpt-4o-mini-and-nextjs-4mh7</guid>
      <description>&lt;h2&gt;What I Built&lt;/h2&gt;

&lt;p&gt;For the Mem0 Generative AI Developer Intern assignment, I built a voice-controlled local AI agent that accepts audio input, transcribes it, classifies the user's intent, and executes the appropriate tool — all displayed in a real-time Next.js UI.&lt;/p&gt;

&lt;p&gt;The agent supports four intents: creating files, writing code, summarizing text, and general chat. A single voice command can trigger multiple intents sequentially.&lt;/p&gt;
&lt;h2&gt;Architecture&lt;/h2&gt;

&lt;p&gt;Audio Input (mic/upload)&lt;br&gt;
→ Next.js frontend (localhost:3000)&lt;br&gt;
→ FastAPI backend (localhost:8000)&lt;br&gt;
→ core/stt.py — Whisper transcription&lt;br&gt;
→ core/intent_classifier.py — GPT-4o-mini structured output&lt;br&gt;
→ core/dispatcher.py — tool routing + confirmation logic&lt;br&gt;
→ tools/ — file, code, summarize, chat&lt;br&gt;
→ core/memory.py — session history&lt;br&gt;
→ JSON response → UI renders results&lt;/p&gt;
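
&lt;p&gt;In code, the flow reduces to a thin orchestrator. Below is a minimal, self-contained sketch; the function names mirror the module layout above, but the bodies are stubbed placeholders, not the real Whisper/GPT implementations:&lt;/p&gt;

```python
# Sketch of the request flow above. The stt and classifier stages are
# stubbed placeholders so the pipeline shape is visible end to end.

def transcribe(audio_bytes: bytes) -> str:        # stand-in for core/stt.py
    return "create a file called notes.txt"

def classify(transcript: str) -> list[dict]:      # stand-in for core/intent_classifier.py
    return [{"intent": "create_file", "filename": "notes.txt", "content": ""}]

def dispatch(intents: list[dict]) -> list[dict]:  # stand-in for core/dispatcher.py
    return [{"intent": item["intent"], "status": "ok"} for item in intents]

def handle_audio(audio_bytes: bytes) -> dict:
    """End-to-end pipeline: audio -> transcript -> intents -> tool results."""
    transcript = transcribe(audio_bytes)
    intents = classify(transcript)
    results = dispatch(intents)
    return {"transcript": transcript, "results": results}
```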
&lt;h2&gt;Models I Chose and Why&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-Text: OpenAI Whisper API (whisper-1)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I initially attempted local Whisper inference. On my CPU-only Windows machine, transcribing a 5-second audio clip took 45-60 seconds — completely unusable for an interactive agent. The OpenAI Whisper API returns the same quality transcript in 1-2 seconds over the network. The tradeoff is worth it at this scale.&lt;/p&gt;
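
&lt;p&gt;For reference, the hosted call is essentially a one-liner with the official &lt;code&gt;openai&lt;/code&gt; SDK. A sketch (the import is deferred so the function can be defined without an API key in the environment):&lt;/p&gt;

```python
def transcribe_clip(path: str) -> str:
    """Send an audio file to the hosted whisper-1 model and return the text.
    Requires OPENAI_API_KEY to be set when actually called; the import is
    deferred so this sketch stays self-contained."""
    from openai import OpenAI  # pip install openai
    client = OpenAI()
    with open(path, "rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text
```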

&lt;p&gt;&lt;strong&gt;Intent Classification: GPT-4o-mini with Structured Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I used &lt;code&gt;client.beta.chat.completions.parse()&lt;/code&gt; with Pydantic models to get guaranteed JSON conforming to my schema. This eliminated all prompt engineering around output formatting — the model simply fills typed fields.&lt;/p&gt;
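
&lt;p&gt;A minimal sketch of the pattern, assuming the official &lt;code&gt;openai&lt;/code&gt; Python SDK; the two-field schema here is illustrative, the real one carries more fields:&lt;/p&gt;

```python
from pydantic import BaseModel

class Intent(BaseModel):
    intent: str       # e.g. "create_file", "write_code", "summarize_text", "chat"
    confidence: float

class IntentList(BaseModel):
    intents: list[Intent]

def classify(transcript: str) -> IntentList:
    """Ask GPT-4o-mini for intents as schema-conforming JSON.
    Deferred import keeps this definable without an API key."""
    from openai import OpenAI
    client = OpenAI()
    completion = client.beta.chat.completions.parse(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Classify the user's intents."},
            {"role": "user", "content": transcript},
        ],
        response_format=IntentList,  # SDK converts the Pydantic model to a strict schema
    )
    return completion.choices[0].message.parsed
```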
&lt;h2&gt;Model Benchmarking&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;Local (CPU)&lt;/th&gt;
&lt;th&gt;API-based&lt;/th&gt;
&lt;th&gt;Winner&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Whisper transcription (5s audio)&lt;/td&gt;
&lt;td&gt;45-60 seconds&lt;/td&gt;
&lt;td&gt;1-2 seconds&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4o-mini intent classification&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;0.8-1.5 seconds&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end pipeline&lt;/td&gt;
&lt;td&gt;~60 seconds&lt;/td&gt;
&lt;td&gt;2-4 seconds&lt;/td&gt;
&lt;td&gt;API&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For a machine with a CUDA GPU, local Whisper would be competitive. On CPU-only hardware, the API approach is the only viable path for real-time interaction.&lt;/p&gt;
&lt;h2&gt;Bonus Features Implemented&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Compound Commands&lt;/strong&gt;&lt;br&gt;
A single audio input like "Summarize this text and save it to summary.txt" produces two intents: &lt;code&gt;summarize_text&lt;/code&gt; followed by &lt;code&gt;create_file&lt;/code&gt;. The dispatcher processes them sequentially and automatically injects the summary output as the file content.&lt;/p&gt;
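
&lt;p&gt;A simplified sketch of that chaining logic; the tool names and parameter fields are illustrative, not the project's exact API:&lt;/p&gt;

```python
# Sequential dispatch with output injection between intents: when a later
# intent arrives with no content of its own, it receives the previous
# tool's output (e.g. summary -> file body).

def summarize_text(params: dict) -> dict:
    return {"output": "SUMMARY: " + params["text"][:40]}

def create_file(params: dict) -> dict:
    return {"output": f"wrote {len(params['content'])} chars to {params['filename']}"}

TOOLS = {"summarize_text": summarize_text, "create_file": create_file}

def dispatch_all(intents: list[dict]) -> list[dict]:
    results, previous_output = [], None
    for item in intents:
        params = dict(item["params"])
        if previous_output and not params.get("content"):
            params["content"] = previous_output   # inject prior tool output
        result = TOOLS[item["intent"]](params)
        previous_output = result["output"]
        results.append(result)
    return results

results = dispatch_all([
    {"intent": "summarize_text", "params": {"text": "a long article about agents"}},
    {"intent": "create_file", "params": {"filename": "summary.txt", "content": ""}},
])
```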

&lt;p&gt;&lt;strong&gt;2. Human-in-the-Loop&lt;/strong&gt;&lt;br&gt;
Before any file or code write operation, the dispatcher returns a PENDING signal. The frontend shows an amber confirmation panel with the proposed action. Nothing is written until the user explicitly confirms.&lt;/p&gt;
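
&lt;p&gt;A minimal sketch of the confirm-before-write flow; the &lt;code&gt;PENDING&lt;/code&gt; payload shape and the &lt;code&gt;pending_actions&lt;/code&gt; store are illustrative:&lt;/p&gt;

```python
# Write operations return a PENDING payload first and only execute once
# the user explicitly confirms; everything else runs immediately.

WRITE_INTENTS = {"create_file", "write_code"}
pending_actions: dict[str, dict] = {}

def request_action(action_id: str, intent: str, params: dict) -> dict:
    if intent in WRITE_INTENTS:
        pending_actions[action_id] = {"intent": intent, "params": params}
        return {"status": "PENDING", "action_id": action_id,
                "preview": f"{intent}: {params.get('filename', '?')}"}
    return {"status": "EXECUTED", "intent": intent}

def confirm(action_id: str) -> dict:
    action = pending_actions.pop(action_id, None)
    if action is None:
        return {"status": "ERROR", "message": "no such pending action"}
    # ... perform the real write here ...
    return {"status": "EXECUTED", "intent": action["intent"]}

first = request_action("a1", "create_file", {"filename": "notes.txt"})
second = confirm("a1")
```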

&lt;p&gt;&lt;strong&gt;3. Graceful Degradation&lt;/strong&gt;&lt;br&gt;
Every pipeline stage handles failure independently — STT errors, low-confidence intents (routed to chat instead of executing), and tool-level exceptions all return structured error responses. The UI always renders a coherent message.&lt;/p&gt;
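
&lt;p&gt;A sketch of the degradation logic; the 0.6 confidence threshold and the response fields are illustrative:&lt;/p&gt;

```python
# Low-confidence intents fall back to chat instead of executing, and any
# tool exception becomes a structured error response rather than a crash.

def run_tool(intent: str, confidence: float, params: dict, tools: dict) -> dict:
    if not confidence >= 0.6:            # low confidence: degrade to chat
        intent = "chat"
    try:
        return {"ok": True, "intent": intent, "output": tools[intent](params)}
    except Exception as exc:             # tool failure: structured error
        return {"ok": False, "intent": intent, "error": str(exc)}

tools = {
    "chat": lambda p: "Sorry, could you rephrase that?",
    "create_file": lambda p: 1 / 0,      # simulate a tool-level failure
}
fallback = run_tool("create_file", 0.3, {}, tools)   # degraded to chat
failure = run_tool("create_file", 0.9, {}, tools)    # structured error
```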

&lt;p&gt;&lt;strong&gt;4. Session Memory&lt;/strong&gt;&lt;br&gt;
The memory module maintains a rolling action log and the last 6 chat turns. The classifier receives this context on every call, allowing it to resolve references like "save that to a file" against prior session actions.&lt;/p&gt;
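
&lt;p&gt;A minimal sketch of such a memory module, using a &lt;code&gt;deque&lt;/code&gt; with &lt;code&gt;maxlen=6&lt;/code&gt; for the chat turns; the field names are illustrative:&lt;/p&gt;

```python
from collections import deque

class SessionMemory:
    """Rolling action log plus the last N chat turns, handed to the
    classifier as context on every call."""

    def __init__(self, max_turns: int = 6):
        self.actions: list[dict] = []
        self.turns: deque = deque(maxlen=max_turns)  # old turns drop off automatically

    def record_action(self, intent: str, result: str) -> None:
        self.actions.append({"intent": intent, "result": result})

    def record_turn(self, role: str, text: str) -> None:
        self.turns.append({"role": role, "text": text})

    def context(self) -> dict:
        """Snapshot the classifier uses to resolve references like
        'save that to a file' against recent actions."""
        return {"recent_actions": self.actions[-5:], "chat_turns": list(self.turns)}

mem = SessionMemory()
mem.record_action("summarize_text", "SUMMARY: ...")
for i in range(8):                      # 8 turns in, only the last 6 kept
    mem.record_turn("user", f"turn {i}")
```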
&lt;h2&gt;Challenges&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;OpenAI Structured Output Schema Rejection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The biggest technical challenge was this error:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"required" is required to be supplied and to be an array including every key in properties. Extra required key "parameters" supplied.&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;OpenAI's structured output validator rejects &lt;code&gt;dict[str, str]&lt;/code&gt; fields because it cannot generate a strict schema for arbitrary key-value maps. The fix was replacing the free-form dict with explicit flat fields (&lt;code&gt;filename&lt;/code&gt;, &lt;code&gt;content&lt;/code&gt;, &lt;code&gt;language&lt;/code&gt;, &lt;code&gt;description&lt;/code&gt;, &lt;code&gt;text&lt;/code&gt;, &lt;code&gt;message&lt;/code&gt;) in the Pydantic schema, then reconstructing the parameters dict after parsing.&lt;/p&gt;
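
&lt;p&gt;Sketched with Pydantic (the field set matches the list above; the reconstruction helper is illustrative):&lt;/p&gt;

```python
from typing import Optional
from pydantic import BaseModel

# Every parameter becomes an explicit Optional field, which a strict
# schema can express; the free-form dict is rebuilt after parsing.

class IntentCall(BaseModel):
    intent: str
    filename: Optional[str] = None
    content: Optional[str] = None
    language: Optional[str] = None
    description: Optional[str] = None
    text: Optional[str] = None
    message: Optional[str] = None

def to_parameters(call: IntentCall) -> dict:
    """Rebuild the parameters dict from the flat, schema-friendly fields,
    dropping everything the model left unset."""
    flat = call.model_dump(exclude={"intent"})
    return {key: value for key, value in flat.items() if value is not None}

call = IntentCall(intent="create_file", filename="summary.txt", content="...")
params = to_parameters(call)
```

&lt;p&gt;Because each field is &lt;code&gt;Optional&lt;/code&gt;, the strict-schema converter can mark it as a nullable required key, so the validator no longer rejects the model.&lt;/p&gt;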

&lt;p&gt;&lt;strong&gt;Tailwind CSS v4 Migration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The project was scaffolded with Tailwind v4, but Opus generated v3 syntax (&lt;code&gt;@tailwind base/components/utilities&lt;/code&gt;). In v4, all three directives are replaced with a single &lt;code&gt;@import "tailwindcss"&lt;/code&gt;, and content scanning is automatic — no &lt;code&gt;tailwind.config.ts&lt;/code&gt; needed.&lt;/p&gt;
&lt;h2&gt;GitHub Repository&lt;/h2&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/GURRALASAIHANEESH" rel="noopener noreferrer"&gt;
        GURRALASAIHANEESH
      &lt;/a&gt; / &lt;a href="https://github.com/GURRALASAIHANEESH/voice-agent" rel="noopener noreferrer"&gt;
        voice-agent
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Voice-Controlled Local AI Agent&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Project Overview&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;A voice-controlled AI agent that converts spoken commands into executed actions such as file creation, code generation, and text summarization. Built as a submission for the Generative AI Developer Intern assignment at Mem0. The system accepts audio input through the frontend, classifies user intent via structured LLM output, and dispatches to the appropriate tool.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Tech Stack&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;STT:&lt;/strong&gt; OpenAI Whisper API (&lt;code&gt;whisper-1&lt;/code&gt;) — chosen over local Whisper due to
CPU-only hardware constraints; local inference produced unacceptable latency.
Documented here per assignment instructions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM:&lt;/strong&gt; OpenAI GPT-4o-mini with structured output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Next.js 14 (App Router) + TypeScript + Tailwind CSS v4 — &lt;code&gt;localhost:3000&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend:&lt;/strong&gt; FastAPI — &lt;code&gt;localhost:8000&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Audio is captured from a microphone or uploaded file and sent to the OpenAI Whisper API for transcription. The transcript is passed to an OpenAI-backed intent classifier that returns one or more structured intents with parameters and…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/GURRALASAIHANEESH/voice-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;Video Demo&lt;/h2&gt;

&lt;p&gt;  &lt;iframe src="https://www.youtube.com/embed/0Pzach_pAgM"&gt;
  &lt;/iframe&gt;
&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>nextjs</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
