
GURRALA SAI HANEESH


Building a Voice-Controlled AI Agent with OpenAI Whisper, GPT-4o-mini, and Next.js

What I Built

For the Mem0 Generative AI Developer Intern assignment, I built a voice-controlled
local AI agent that accepts audio input, transcribes it, classifies the user's
intent, and executes the appropriate tool — all displayed in a real-time Next.js UI.

The agent supports four intents: creating files, writing code, summarizing text,
and general chat. A single voice command can trigger multiple intents sequentially.

Architecture

Audio Input (mic/upload)
→ Next.js frontend (localhost:3000)
→ FastAPI backend (localhost:8000)
→ core/stt.py — Whisper transcription
→ core/intent_classifier.py — GPT-4o-mini structured output
→ core/dispatcher.py — tool routing + confirmation logic
→ tools/ — file, code, summarize, chat
→ core/memory.py — session history
→ JSON response → UI renders results
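In code, the flow above reduces to a thin orchestration function. Here is a minimal sketch with the three stages injected as callables — the function names and response shape are illustrative, not the project's actual signatures:

```python
def run_pipeline(audio_bytes: bytes, transcribe, classify, dispatch) -> dict:
    """Run one voice command through the full pipeline.

    transcribe/classify/dispatch are passed in as callables so each
    stage stays independently testable and replaceable.
    """
    text = transcribe(audio_bytes)            # core/stt.py
    intents = classify(text)                  # core/intent_classifier.py
    results = [dispatch(i) for i in intents]  # core/dispatcher.py -> tools/
    return {"transcript": text, "results": results}
```

Injecting the stages also makes it trivial to stub out the network calls in tests.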

Models I Chose and Why

Speech-to-Text: OpenAI Whisper API (whisper-1)

I initially attempted local Whisper inference. On my CPU-only Windows machine,
transcribing a 5-second audio clip took 45-60 seconds — completely unusable for
an interactive agent. The OpenAI Whisper API returns the same quality transcript
in 1-2 seconds over the network. The tradeoff is worth it at this scale.

Intent Classification: GPT-4o-mini with Structured Output

I used client.beta.chat.completions.parse() with Pydantic models to get
guaranteed JSON conforming to my schema. This eliminated all prompt engineering
around output formatting — the model simply fills typed fields.

Model Benchmarking

| Operation | Local (CPU) | API-based | Winner |
|---|---|---|---|
| Whisper transcription (5 s audio) | 45-60 seconds | 1-2 seconds | API |
| GPT-4o-mini intent classification | N/A | 0.8-1.5 seconds | API |
| End-to-end pipeline | ~60 seconds | 2-4 seconds | API |

For a machine with a CUDA GPU, local Whisper would be competitive. On CPU-only
hardware, the API approach is the only viable path for real-time interaction.

Bonus Features Implemented

1. Compound Commands
A single audio input like "Summarize this text and save it to summary.txt" produces
two intents: summarize_text followed by create_file. The dispatcher processes
them sequentially and automatically injects the summary output as the file content.
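The chaining can be sketched as follows — a simplified stand-in for the real dispatcher, with tools passed in as plain callables and the response shape illustrative:

```python
def dispatch_all(intents: list[dict], tools: dict) -> list[dict]:
    """Run intents in order, feeding each tool's output into the next intent."""
    results, last_output = [], None
    for intent in intents:
        params = dict(intent.get("parameters", {}))
        # "Summarize ... and save it": inject the summary as the file content
        if intent["intent"] == "create_file" and not params.get("content"):
            params["content"] = last_output or ""
        result = tools[intent["intent"]](params)
        last_output = result.get("output")
        results.append(result)
    return results
```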

2. Human-in-the-Loop
Before any file or code write operation, the dispatcher returns a PENDING signal.
The frontend shows an amber confirmation panel with the proposed action. Nothing
is written until the user explicitly confirms.
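A minimal version of that gate (simplified; the real dispatcher returns richer payloads than this):

```python
WRITE_INTENTS = {"create_file", "write_code"}

def dispatch(intent: dict, confirmed: bool = False) -> dict:
    """Hold write operations in PENDING until the user explicitly confirms."""
    if intent["intent"] in WRITE_INTENTS and not confirmed:
        return {"status": "PENDING", "proposed": intent}
    return {"status": "EXECUTED", "intent": intent["intent"]}
```

The frontend re-submits the same intent with the confirmation flag set once the user clicks through the amber panel.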

3. Graceful Degradation
Every pipeline stage handles failure independently — STT errors, low-confidence
intents (routed to chat instead of executing), and tool-level exceptions all return
structured error responses. The UI always renders a coherent message.
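The same idea as a sketch — the threshold value and response shape here are illustrative, not the project's exact ones:

```python
CONFIDENCE_THRESHOLD = 0.6  # illustrative cutoff

def safe_execute(intent: dict, tools: dict) -> dict:
    """Route low-confidence intents to chat; wrap tool failures as errors."""
    name = intent["intent"]
    if intent.get("confidence", 1.0) < CONFIDENCE_THRESHOLD:
        name = "chat"  # don't execute an action the model isn't sure about
    try:
        return {"status": "ok", "result": tools[name](intent)}
    except Exception as e:
        return {"status": "error", "message": str(e)}
```

Because every branch returns the same envelope, the UI can render any outcome without special-casing failures.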

4. Session Memory
The memory module maintains a rolling action log and the last 6 chat turns.
The classifier receives this context on every call, allowing it to resolve
references like "save that to a file" against prior session actions.
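A compact sketch of such a memory module, using a bounded deque for the chat turns (class and field names are illustrative, not the project's actual API):

```python
from collections import deque

class SessionMemory:
    """Rolling action log plus the last N chat turns, fed to the classifier."""

    def __init__(self, max_turns: int = 6):
        self.actions: list[dict] = []
        self.turns: deque = deque(maxlen=max_turns)  # old turns fall off

    def log_action(self, action: dict) -> None:
        self.actions.append(action)

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def context(self) -> dict:
        """Context passed to the classifier on every call."""
        return {"recent_actions": self.actions[-5:],
                "chat_turns": list(self.turns)}
```

A `deque` with `maxlen` gives the rolling window for free: appending past the limit silently drops the oldest turn.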

Challenges

OpenAI Structured Output Schema Rejection

The biggest technical challenge was this error:
> `"required" is required to be supplied and to be an array including every key in properties. Extra required key "parameters" supplied.`

OpenAI's structured output validator rejects dict[str, str] fields because
it cannot generate a strict schema for arbitrary key-value maps. The fix was
replacing the free-form dict with explicit flat fields (filename, content,
language, description, text, message) in the Pydantic schema, then
reconstructing the parameters dict after parsing.
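The resulting pattern looks roughly like this — field names are taken from the post, while the reconstruction helper is illustrative:

```python
from typing import Literal, Optional

from pydantic import BaseModel

class FlatIntent(BaseModel):
    """Explicit flat fields instead of dict[str, str], which strict mode rejects."""
    intent: Literal["create_file", "write_code", "summarize_text", "chat"]
    filename: Optional[str]
    content: Optional[str]
    language: Optional[str]
    description: Optional[str]
    text: Optional[str]
    message: Optional[str]

def to_parameters(fi: FlatIntent) -> dict:
    """Rebuild the free-form parameters dict after parsing, dropping nulls."""
    return {k: v for k, v in fi.model_dump().items()
            if k != "intent" and v is not None}
```

Every field is nullable rather than optional-with-default, since strict structured output requires all schema keys to be present.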

Tailwind CSS v4 Migration

The project was scaffolded with Tailwind v4, but Opus generated v3 syntax
(@tailwind base/components/utilities). In v4, all three directives are
replaced with a single @import "tailwindcss" and content scanning is
automatic — no tailwind.config.ts needed.
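In stylesheet terms, the migration is a one-line change in the global CSS file:

```css
/* Tailwind v3 (remove): */
/* @tailwind base; @tailwind components; @tailwind utilities; */

/* Tailwind v4 (replace with): */
@import "tailwindcss";
```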

GitHub Repository

Voice-Controlled Local AI Agent

Project Overview

A voice-controlled AI agent that converts spoken commands into executed actions such as file creation, code generation, and text summarization. Built as a submission for the Generative AI Developer Intern assignment at Mem0. The system accepts audio input through the frontend, classifies user intent via structured LLM output, and dispatches to the appropriate tool.

Tech Stack

  • STT: OpenAI Whisper API (whisper-1) — chosen over local Whisper due to CPU-only hardware constraints; local inference produced unacceptable latency. Documented here per assignment instructions.
  • LLM: OpenAI GPT-4o-mini with structured output
  • Frontend: Next.js 14 (App Router) + TypeScript + Tailwind CSS v4 — localhost:3000
  • Backend: FastAPI — localhost:8000


