
GURRALA SAI HANEESH


Building a Voice-Controlled AI Agent with OpenAI Whisper, GPT-4o-mini, and Next.js

What I Built

For the Mem0 Generative AI Developer Intern assignment, I built a voice-controlled
local AI agent that accepts audio input, transcribes it, classifies the user's
intent, and executes the appropriate tool — all displayed in a real-time Next.js UI.

The agent supports four intents: creating files, writing code, summarizing text,
and general chat. A single voice command can trigger multiple intents sequentially.

Architecture

Audio Input (mic/upload)
→ Next.js frontend (localhost:3000)
→ FastAPI backend (localhost:8000)
→ core/stt.py — Whisper transcription
→ core/intent_classifier.py — GPT-4o-mini structured output
→ core/dispatcher.py — tool routing + confirmation logic
→ tools/ — file, code, summarize, chat
→ core/memory.py — session history
→ JSON response → UI renders results
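In code, the flow above reduces to a thin orchestration function. Here is a minimal sketch with the three stages injected as callables — the function names and response shape are illustrative, not the project's actual signatures:

```python
def run_pipeline(audio_bytes: bytes, transcribe, classify, dispatch) -> dict:
    """Run one voice command through the full pipeline.

    transcribe/classify/dispatch are passed in as callables so each
    stage stays independently testable and replaceable.
    """
    text = transcribe(audio_bytes)            # core/stt.py
    intents = classify(text)                  # core/intent_classifier.py
    results = [dispatch(i) for i in intents]  # core/dispatcher.py -> tools/
    return {"transcript": text, "results": results}
```

Injecting the stages also makes it trivial to stub out the network calls in tests.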

Models I Chose and Why

Speech-to-Text: OpenAI Whisper API (whisper-1)

I initially attempted local Whisper inference. On my CPU-only Windows machine,
transcribing a 5-second audio clip took 45-60 seconds — completely unusable for
an interactive agent. The OpenAI Whisper API returns the same quality transcript
in 1-2 seconds over the network. The tradeoff is worth it at this scale.

Intent Classification: GPT-4o-mini with Structured Output

I used client.beta.chat.completions.parse() with Pydantic models to get
guaranteed JSON conforming to my schema. This eliminated all prompt engineering
around output formatting — the model simply fills typed fields.

Model Benchmarking

| Operation | Local (CPU) | API-based | Winner |
|---|---|---|---|
| Whisper transcription (5 s audio) | 45-60 seconds | 1-2 seconds | API |
| GPT-4o-mini intent classification | N/A | 0.8-1.5 seconds | API |
| End-to-end pipeline | ~60 seconds | 2-4 seconds | API |

For a machine with a CUDA GPU, local Whisper would be competitive. On CPU-only
hardware, the API approach is the only viable path for real-time interaction.

Bonus Features Implemented

1. Compound Commands
A single audio input like "Summarize this text and save it to summary.txt" produces
two intents: summarize_text followed by create_file. The dispatcher processes
them sequentially and automatically injects the summary output as the file content.
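The chaining can be sketched as follows — a simplified stand-in for the real dispatcher, with tools passed in as plain callables and the response shape illustrative:

```python
def dispatch_all(intents: list[dict], tools: dict) -> list[dict]:
    """Run intents in order, feeding each tool's output into the next intent."""
    results, last_output = [], None
    for intent in intents:
        params = dict(intent.get("parameters", {}))
        # "Summarize ... and save it": inject the summary as the file content
        if intent["intent"] == "create_file" and not params.get("content"):
            params["content"] = last_output or ""
        result = tools[intent["intent"]](params)
        last_output = result.get("output")
        results.append(result)
    return results
```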

2. Human-in-the-Loop
Before any file or code write operation, the dispatcher returns a PENDING signal.
The frontend shows an amber confirmation panel with the proposed action. Nothing
is written until the user explicitly confirms.
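A minimal version of that gate (simplified; the real dispatcher returns richer payloads than this):

```python
WRITE_INTENTS = {"create_file", "write_code"}

def dispatch(intent: dict, confirmed: bool = False) -> dict:
    """Hold write operations in PENDING until the user explicitly confirms."""
    if intent["intent"] in WRITE_INTENTS and not confirmed:
        return {"status": "PENDING", "proposed": intent}
    return {"status": "EXECUTED", "intent": intent["intent"]}
```

The frontend re-submits the same intent with the confirmation flag set once the user clicks through the amber panel.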

3. Graceful Degradation
Every pipeline stage handles failure independently — STT errors, low-confidence
intents (routed to chat instead of executing), and tool-level exceptions all return
structured error responses. The UI always renders a coherent message.
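The same idea as a sketch — the threshold value and response shape here are illustrative, not the project's exact ones:

```python
CONFIDENCE_THRESHOLD = 0.6  # illustrative cutoff

def safe_execute(intent: dict, tools: dict) -> dict:
    """Route low-confidence intents to chat; wrap tool failures as errors."""
    name = intent["intent"]
    if intent.get("confidence", 1.0) < CONFIDENCE_THRESHOLD:
        name = "chat"  # don't execute an action the model isn't sure about
    try:
        return {"status": "ok", "result": tools[name](intent)}
    except Exception as e:
        return {"status": "error", "message": str(e)}
```

Because every branch returns the same envelope, the UI can render any outcome without special-casing failures.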

4. Session Memory
The memory module maintains a rolling action log and the last 6 chat turns.
The classifier receives this context on every call, allowing it to resolve
references like "save that to a file" against prior session actions.
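A compact sketch of such a memory module, using a bounded deque for the chat turns (class and field names are illustrative, not the project's actual API):

```python
from collections import deque

class SessionMemory:
    """Rolling action log plus the last N chat turns, fed to the classifier."""

    def __init__(self, max_turns: int = 6):
        self.actions: list[dict] = []
        self.turns: deque = deque(maxlen=max_turns)  # old turns fall off

    def log_action(self, action: dict) -> None:
        self.actions.append(action)

    def add_turn(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def context(self) -> dict:
        """Context passed to the classifier on every call."""
        return {"recent_actions": self.actions[-5:],
                "chat_turns": list(self.turns)}
```

A `deque` with `maxlen` gives the rolling window for free: appending past the limit silently drops the oldest turn.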

Challenges

OpenAI Structured Output Schema Rejection

The biggest technical challenge was this error:
> `"required" is required to be supplied and to be an array including every key in properties. Extra required key "parameters" supplied.`

OpenAI's structured output validator rejects dict[str, str] fields because
it cannot generate a strict schema for arbitrary key-value maps. The fix was
replacing the free-form dict with explicit flat fields (filename, content,
language, description, text, message) in the Pydantic schema, then
reconstructing the parameters dict after parsing.
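The resulting pattern looks roughly like this — field names are taken from the post, while the reconstruction helper is illustrative:

```python
from typing import Literal, Optional

from pydantic import BaseModel

class FlatIntent(BaseModel):
    """Explicit flat fields instead of dict[str, str], which strict mode rejects."""
    intent: Literal["create_file", "write_code", "summarize_text", "chat"]
    filename: Optional[str]
    content: Optional[str]
    language: Optional[str]
    description: Optional[str]
    text: Optional[str]
    message: Optional[str]

def to_parameters(fi: FlatIntent) -> dict:
    """Rebuild the free-form parameters dict after parsing, dropping nulls."""
    return {k: v for k, v in fi.model_dump().items()
            if k != "intent" and v is not None}
```

Every field is nullable rather than optional-with-default, since strict structured output requires all schema keys to be present.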

Tailwind CSS v4 Migration

The project was scaffolded with Tailwind v4, but Opus generated v3 syntax
(@tailwind base/components/utilities). In v4, all three directives are
replaced with a single @import "tailwindcss" and content scanning is
automatic — no tailwind.config.ts needed.
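In stylesheet terms, the migration is a one-line change in the global CSS file:

```css
/* Tailwind v3 (remove): */
/* @tailwind base; @tailwind components; @tailwind utilities; */

/* Tailwind v4 (replace with): */
@import "tailwindcss";
```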

GitHub Repository

Voice-Controlled Local AI Agent

Project Overview

A voice-controlled AI agent that converts spoken commands into executed actions such as file creation, code generation, and text summarization. Built as a submission for the Generative AI Developer Intern assignment at Mem0. The system accepts audio input through the frontend, classifies user intent via structured LLM output, and dispatches to the appropriate tool.

Tech Stack

  • STT: OpenAI Whisper API (whisper-1) — chosen over local Whisper due to CPU-only hardware constraints; local inference produced unacceptable latency. Documented here per assignment instructions.
  • LLM: OpenAI GPT-4o-mini with structured output
  • Frontend: Next.js 14 (App Router) + TypeScript + Tailwind CSS v4 — localhost:3000
  • Backend: FastAPI — localhost:8000


