Abstract
This article presents the design and implementation of a voice-controlled AI agent that accepts spoken commands, transcribes them using OpenAI Whisper, classifies intent using a large language model, and executes file system operations autonomously. The system runs entirely on local hardware with optional cloud API fallbacks, making it suitable for offline-first development environments.
1. Introduction
Voice interfaces represent the next frontier in human-computer interaction. While commercial assistants like Siri and Alexa dominate consumer markets, developers lack accessible tools to build custom voice-controlled automation pipelines. This project bridges that gap by combining three mature technologies — automatic speech recognition, large language model reasoning, and a reactive web UI — into a single cohesive agent architecture.
The agent accepts audio from a microphone or file upload, converts it to text, understands the user's intent, and executes one of four actions: creating files, generating and writing code, summarizing text, or answering conversational queries. All file operations are sandboxed to a designated output directory, ensuring safety without sacrificing utility.
2. System Architecture
The system follows a linear pipeline architecture with four distinct stages:
Audio Input → STT → Intent Classification → Tool Execution → UI Rendering
Each stage is implemented as an independent Python module, enabling easy substitution of components. For example, the STT module can switch between local Whisper and the OpenAI Whisper API without affecting downstream components.
Component breakdown:
| Module | Responsibility |
| --- | --- |
| `stt.py` | Audio transcription via Whisper |
| `intent_classifier.py` | LLM-based intent detection |
| `tools.py` | Tool execution |
| `utils.py` | Shared helpers and formatting |
| `app.py` | Streamlit UI and pipeline orchestration |
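Because each stage exposes a narrow contract, the substitution point can be sketched by injecting the stages into a small pipeline object. The names and signatures below are illustrative, not the project's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Pipeline:
    """Each stage is a plain callable, so any one can be swapped out."""
    transcribe: Callable[[bytes], str]   # e.g. local Whisper or the Whisper API
    classify: Callable[[str], dict]      # e.g. GPT-4o-mini, Ollama, or keywords
    execute: Callable[[dict], dict]      # dispatches to one of the four tools

    def run(self, audio: bytes) -> dict:
        text = self.transcribe(audio)
        intent = self.classify(text)
        return self.execute(intent)
```

Swapping the STT backend then means passing a different `transcribe` callable; nothing downstream changes.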
3. Speech-to-Text with Whisper
OpenAI Whisper is a transformer-based automatic speech recognition model trained on 680,000 hours of multilingual audio. The base model used in this project offers a balance between accuracy and speed, requiring approximately 140MB of disk space and running on CPU in 3–8 seconds per utterance.
Key implementation decisions:
Temp file handling on Windows. On Windows, Python's NamedTemporaryFile holds an exclusive lock while the file is open, so other processes (including the ffmpeg process Whisper spawns) cannot read it. The fix is to close the file explicitly before passing its path to Whisper:
```python
import tempfile

tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".wav")
tmp.write(audio_bytes)
tmp.flush()
tmp.close()  # required on Windows: releases the lock so Whisper can open the file
result = model.transcribe(tmp.name, fp16=False)
```
fp16 disabled on CPU:
Whisper's default configuration attempts half-precision floating point inference, which is only supported on CUDA GPUs. Passing fp16=False eliminates the warning and prevents potential inference errors on CPU-only machines.
ffmpeg dependency:
Whisper uses ffmpeg internally for audio decoding, supporting formats including WAV, MP3, M4A, OGG, and FLAC. The system implements an auto-discovery function that searches common Windows installation paths and injects the ffmpeg binary directory into the process PATH at runtime, providing graceful degradation when ffmpeg is installed but not yet on the system PATH.
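A minimal sketch of that auto-discovery logic, assuming a hypothetical candidate list (the project's actual search paths may differ):

```python
import os
import shutil
from typing import Optional

# Hypothetical install locations; the project's search list may differ.
_FFMPEG_CANDIDATES = [
    r"C:\ffmpeg\bin",
    r"C:\Program Files\ffmpeg\bin",
    os.path.expanduser(r"~\scoop\apps\ffmpeg\current\bin"),
]

def ensure_ffmpeg_on_path(candidates=_FFMPEG_CANDIDATES) -> Optional[str]:
    """Return the ffmpeg directory, prepending it to PATH when discovered."""
    found = shutil.which("ffmpeg")
    if found:  # already reachable, nothing to do
        return os.path.dirname(found)
    for directory in candidates:
        if os.path.isfile(os.path.join(directory, "ffmpeg.exe")):
            # Inject into the current process's PATH so Whisper's subprocess
            # call succeeds without requiring a terminal restart.
            os.environ["PATH"] = directory + os.pathsep + os.environ.get("PATH", "")
            return directory
    return None
```

Because the PATH change is made in-process, it survives only for the current session, which is exactly the "installed but not yet on PATH" case described above.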
4. Intent Classification
Intent classification transforms unstructured natural language into a structured action descriptor. The system supports three classification backends with automatic fallback:
4.1 OpenAI GPT-4o-mini
The primary classifier uses GPT-4o-mini with a carefully engineered system prompt that constrains output to a strict JSON schema:
```json
{
  "intent": "write_code",
  "confidence": 0.92,
  "filename": "binary_search.py",
  "language": "python",
  "reasoning": "User requested code generation"
}
```
Temperature is set to 0.1 to maximize determinism in classification decisions.
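A sketch of the call and the schema validation is below. The prompt wording and the intent names other than `write_code` are assumptions (only `write_code` appears in the example schema above); the project's actual prompt and helper names may differ:

```python
import json
import urllib.request

# Hypothetical prompt; the project's engineered prompt is more detailed.
SYSTEM_PROMPT = (
    "Classify the user's command. Reply with ONLY a JSON object with keys "
    "intent, confidence, filename, language, reasoning."
)
# Intent names other than "write_code" are assumed for illustration.
VALID_INTENTS = {"create_file", "write_code", "summarize", "chat"}

def classify_with_gpt(command: str, api_key: str) -> dict:
    """Single classification call at temperature 0.1 for near-determinism."""
    body = json.dumps({
        "model": "gpt-4o-mini",
        "temperature": 0.1,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": command},
        ],
    }).encode()
    req = urllib.request.Request(
        "https://api.openai.com/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        reply = json.loads(resp.read())
    return validate_classification(reply["choices"][0]["message"]["content"])

def validate_classification(raw: str) -> dict:
    """Reject model replies that do not match the expected schema."""
    data = json.loads(raw)
    if data.get("intent") not in VALID_INTENTS:
        raise ValueError(f"unexpected intent: {data.get('intent')!r}")
    data["confidence"] = float(data.get("confidence", 0.0))
    return data
```

Validating the reply at the boundary keeps malformed model output from propagating into the tool execution stage.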
4.2 Ollama (Local LLM)
For fully offline operation, the system integrates with Ollama's REST API. The implementation uses /api/generate rather than /api/chat for single-turn classification, setting temperature: 0.0 and num_predict: 150 to force short, deterministic JSON responses. When Ollama returns malformed JSON — common with smaller models — the system automatically falls back to keyword classification rather than failing hard.
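A minimal sketch of that request, using the standard library rather than the project's actual HTTP client; the prompt text is illustrative:

```python
import json
import urllib.request

def classify_with_ollama(command: str, model: str = "phi3",
                         host: str = "http://localhost:11434") -> dict:
    """Single-turn classification via Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        # Illustrative prompt; the project's classification prompt may differ.
        "prompt": f"Classify this command and reply with only JSON: {command}",
        "stream": False,
        "options": {"temperature": 0.0, "num_predict": 150},
    }
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.loads(resp.read())
    # Smaller models often emit malformed JSON; the caller catches the
    # resulting exception and falls back to keyword classification.
    return json.loads(body["response"])
```

Any network error, timeout, or JSON parse failure surfaces as an exception, which the caller treats as a signal to fall back rather than fail hard.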
4.3 Keyword Fallback
A rule-based classifier serves as the final fallback, matching against curated keyword lists for each intent category. While less accurate than LLM-based classification, it ensures the system remains functional with zero external dependencies.
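The shape of that classifier can be sketched as follows; the keyword lists here are abbreviated illustrations of the project's curated lists:

```python
# Abbreviated keyword lists; the project's curated lists are more extensive.
KEYWORD_INTENTS = {
    "write_code": ["code", "function", "script", "implement", "program"],
    "create_file": ["create", "new file", "make a file", "touch"],
    "summarize": ["summarize", "summary", "tl;dr", "shorten"],
}

def keyword_classify(text: str) -> dict:
    """Rule-based fallback: first intent whose keyword list matches wins."""
    lowered = text.lower()
    for intent, keywords in KEYWORD_INTENTS.items():
        if any(kw in lowered for kw in keywords):
            return {"intent": intent, "confidence": 0.5,
                    "reasoning": "keyword match"}
    # Anything unmatched is treated as a conversational query.
    return {"intent": "chat", "confidence": 0.3,
            "reasoning": "no keyword matched"}
```

Ordering matters: "create a Python function" matches both `write_code` and `create_file`, so the more specific intent is checked first.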
5. Tool Execution Engine
The tool execution engine implements four capabilities behind a unified interface:
5.1 File Creation
Empty files are created in the sandboxed ./output/ directory. Path traversal attacks are prevented by resolving the absolute path and verifying it remains within the output directory boundary:
```python
from pathlib import Path

OUTPUT_DIR = Path("./output")

full_path = (OUTPUT_DIR / filename).resolve()
if not str(full_path).startswith(str(OUTPUT_DIR.resolve())):
    raise ValueError("Unsafe path detected")
```
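One caveat worth noting: a raw string-prefix check can be fooled by a sibling directory whose name shares the prefix (e.g. `output_evil` next to `output`). Comparing path components instead closes that gap; a sketch using `Path.is_relative_to` (Python 3.9+):

```python
from pathlib import Path

OUTPUT_DIR = Path("./output")

def safe_output_path(filename: str) -> Path:
    """Resolve filename inside OUTPUT_DIR, rejecting traversal attempts."""
    full_path = (OUTPUT_DIR / filename).resolve()
    # Compare whole path components rather than raw string prefixes, so a
    # sibling directory such as "output_evil" cannot slip past the check.
    if not full_path.is_relative_to(OUTPUT_DIR.resolve()):
        raise ValueError("Unsafe path detected")
    return full_path
```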
5.2 Code Generation
The code generation tool combines LLM prompting with a smart offline fallback. When an LLM is available, a system prompt instructs the model to return raw code without markdown fences. A post-processing step strips any fences the model includes despite the instruction.
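That post-processing step can be sketched as a small helper (the function name is illustrative):

```python
import re

def strip_code_fences(text: str) -> str:
    """Remove leading/trailing markdown fences the model may emit anyway."""
    text = text.strip()
    # Drop an opening fence such as ```python and a trailing closing ```
    text = re.sub(r"^```[\w+-]*\n", "", text)
    text = re.sub(r"\n```$", "", text)
    return text
```

Text without fences passes through unchanged, so the helper is safe to apply unconditionally to every model reply.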
The offline code generator recognizes common computer science topics and returns complete working implementations. Supported patterns include binary search, linear search, bubble/merge/quick sort, linked lists, stacks, queues, fibonacci sequences, factorial computation, and retry decorators — covering the most common voice command patterns observed during testing.
5.3 Text Summarization
The summarization tool extracts the text payload from the user's speech using regex pattern matching — isolating content after trigger phrases like "summarize this:" — and passes it to the LLM with a summarization-focused system prompt. An offline fallback returns the first three sentences of the input.
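Both the trigger-phrase extraction and the sentence-based fallback can be sketched as below; the trigger pattern here is a simplified assumption, not the project's exact regex:

```python
import re

# Illustrative trigger pattern; the project may match more phrase variants.
_TRIGGER = re.compile(r"(?:summarize|summarise)(?:\s+this)?\s*[:,-]?\s*",
                      re.IGNORECASE)

def extract_payload(command: str) -> str:
    """Return the text after a 'summarize this:'-style trigger phrase."""
    match = _TRIGGER.search(command)
    return command[match.end():].strip() if match else command.strip()

def fallback_summary(text: str, sentences: int = 3) -> str:
    """Offline fallback: the first N sentences of the input."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(parts[:sentences])
```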
5.4 General Chat
Conversational queries are handled by a general-purpose chat function that maintains no conversation history, treating each query as stateless. This design choice simplifies the architecture at the cost of multi-turn conversational coherence, an acceptable tradeoff for a command-oriented agent.
6. User Interface
The Streamlit frontend implements a reactive single-page application pattern. The UI renders in three zones: a persistent sidebar for configuration, a two-column main area for input and results, and a session history panel below.
Render order is critical in Streamlit. The framework executes Python top-to-bottom on every interaction, re-rendering all components. A key architectural constraint emerged during development: confirmation dialogs and result panels must be positioned carefully relative to column declarations to ensure session state writes from button callbacks are visible when result panels render on the subsequent rerun.
The correct execution order is:
- Session state initialization
- Function definitions
- Sidebar rendering
- Toast notifications
- Confirmation dialog (full width, before columns)
- Column layout declaration
- col_left: input widgets
- col_right: results (reads from session state)
- Process button logic (sets session state, triggers rerun)
- History panel
The Spotify-inspired design uses a light green palette (#1db954 accent, #f6fdf6 background) with a contrasting dark forest green sidebar (#0f3d1f). Buttons use pill-shaped border-radius (999px) matching Spotify's design language. The DM Sans typeface provides clean readability at small sizes.
7. Error Handling Strategy
The system implements layered error handling at each pipeline stage:
STT errors provide actionable diagnostic messages including the active Python executable path, enabling users to install packages into the correct environment. Environment mismatch — where packages are installed in a different Python than Streamlit uses — was the most common deployment failure observed.
LLM errors fall back gracefully. Ollama timeouts trigger keyword classification. JSON parse failures from malformed LLM output attempt regex extraction before falling back. OpenAI authentication errors surface specific guidance rather than raw API error messages.
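The JSON-recovery step can be sketched as follows (function name illustrative): try a direct parse, then regex-extract the first brace-delimited block from chatty model output, and return `None` only when both fail:

```python
import json
import re
from typing import Optional

def extract_json(raw: str) -> Optional[dict]:
    """Parse JSON directly; on failure, regex-extract the first {...} block."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        # Models often wrap JSON in prose like "Sure! Here it is: {...}".
        match = re.search(r"\{.*\}", raw, re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                return None
        return None
```

A `None` result is the signal to drop down to the keyword classifier.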
File system errors are caught at the tool level, returning structured error dicts rather than raising exceptions, allowing the UI to display meaningful error cards.
8. Security Considerations
Path traversal prevention. All file operations validate that the resolved absolute path begins with the output directory path, blocking inputs like ../../etc/passwd.
API key handling. Keys are entered through Streamlit's type="password" input, masking them in the UI. They exist only in memory during the session and are never logged or persisted.
Sandboxed execution. The agent creates and reads files but does not execute generated code, eliminating code injection risks from LLM-generated content.
9. Performance Characteristics
| Operation | Typical Duration |
| --- | --- |
| Whisper base (CPU, 5s audio) | 3–8 seconds |
| GPT-4o-mini classification | 1–2 seconds |
| GPT-4o-mini code generation | 3–8 seconds |
| Ollama phi3 classification | 5–15 seconds |
| Ollama phi3 code generation | 15–60 seconds |
| Keyword fallback | < 10 ms |
| File creation | < 50 ms |
The primary performance bottleneck is Whisper's CPU inference. Users with NVIDIA GPUs can enable CUDA acceleration by installing the CUDA-enabled PyTorch build, reducing transcription time to under one second.
10. Deployment
The system runs as a local Streamlit server on port 8501, accessible from the local network via the Network URL. No containerization or reverse proxy is required for development use.
For production deployment, the following additions are recommended:
Authentication — Streamlit Community Cloud supports OAuth; self-hosted deployments should add a login layer
Rate limiting — prevent API cost overruns from repeated requests
Persistent history — replace in-memory session history with SQLite
HTTPS — required before exposing to the public internet
11. Conclusion
This project demonstrates that a functional voice-controlled AI agent can be built from open-source components in under 600 lines of application code. The modular pipeline architecture — STT, classification, execution, UI — provides clear extension points. Developers can swap Whisper for a faster distilled model, replace GPT-4o-mini with a domain-fine-tuned classifier, or add new tools by extending the tools.py module and updating the intent classifier's system prompt.
The most significant engineering challenges encountered were Windows-specific: file locking in temp file handling, ffmpeg PATH propagation across terminal sessions, and Python environment fragmentation between system and virtual environments. Each was resolved through defensive coding patterns that check preconditions, provide actionable error messages, and degrade gracefully rather than crashing.
The result is a practical foundation for voice-driven developer tooling — a category with substantial room for innovation as local LLMs continue to improve in speed and capability.
References
Radford, A. et al. (2022). Robust Speech Recognition via Large-Scale Weak Supervision. OpenAI.
Streamlit Inc. (2024). Streamlit Documentation. streamlit.io
Ollama (2024). Run Large Language Models Locally. ollama.com
OpenAI (2024). Whisper API Reference. platform.openai.com
Project repository includes: app.py, stt.py, intent_classifier.py, tools.py, utils.py, requirements.txt, README.md
🔗 Links
GitHub Repository: https://github.com/Preethii19V/Voice-AI-Agent.git