<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aaditya Kapruwan</title>
    <description>The latest articles on DEV Community by Aaditya Kapruwan (@aaditya_kapruwan).</description>
    <link>https://dev.to/aaditya_kapruwan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3871535%2Ff0b751d9-77ee-4b00-ad64-b9c37d47ae33.jpg</url>
      <title>DEV Community: Aaditya Kapruwan</title>
      <link>https://dev.to/aaditya_kapruwan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aaditya_kapruwan"/>
    <language>en</language>
    <item>
      <title>Voice Agent Project</title>
      <dc:creator>Aaditya Kapruwan</dc:creator>
      <pubDate>Sat, 11 Apr 2026 23:09:56 +0000</pubDate>
      <link>https://dev.to/aaditya_kapruwan/voice-agent-project-46ka</link>
      <guid>https://dev.to/aaditya_kapruwan/voice-agent-project-46ka</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff514m8f1e7kwnyk11pqc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff514m8f1e7kwnyk11pqc.png" alt="Local Voice Coding Agent" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;Building a Local-First AI Agent: Coding with Your Voice&lt;/h1&gt;

&lt;p&gt;I built a local-first AI agent that turns spoken words into real-time actions on your machine—whether it's coding or general file management.&lt;/p&gt;




&lt;h2&gt;The Vision&lt;/h2&gt;

&lt;p&gt;Most coding tools today are cloud-dependent. I wanted something that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Respects privacy (data stays local)&lt;/li&gt;
&lt;li&gt;Has low latency&lt;/li&gt;
&lt;li&gt;Enables hands-free workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal was simple: a lightweight local system capable of handling tasks like saving files, deleting directories, or summarizing documents without sending data to external services.&lt;/p&gt;




&lt;h2&gt;The Tech Stack&lt;/h2&gt;

&lt;p&gt;Building this system required stitching together multiple components that initially didn’t integrate smoothly. To make them work cohesively, I:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dockerized services for isolation and reliability&lt;/li&gt;
&lt;li&gt;Used JSON as a standard communication format between components&lt;/li&gt;
&lt;/ul&gt;
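&lt;p&gt;As a rough sketch of that JSON contract (the field names here are my illustration, not the project's actual schema), each service could wrap its result in a small envelope and validate what it receives before acting on it:&lt;/p&gt;

```python
import json

# Illustrative JSON envelope passed between services; the field names
# ("source", "kind", "payload") are placeholders, not the real schema.
def make_message(source, kind, payload):
    return json.dumps({"source": source, "kind": kind, "payload": payload})

def read_message(raw):
    # Fail loudly on a malformed envelope instead of passing bad data along.
    msg = json.loads(raw)
    for field in ("source", "kind", "payload"):
        if field not in msg:
            raise ValueError("missing field: " + field)
    return msg
```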

&lt;h3&gt;Components&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Frontend:&lt;/strong&gt; Streamlit
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backend (STT):&lt;/strong&gt; FastAPI + Faster-Whisper (Dockerized)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI Layer:&lt;/strong&gt; Ollama (local models for intent detection and code generation)
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Action Layer:&lt;/strong&gt; Custom Python functions for system operations
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;4-Layer Architecture&lt;/h2&gt;

&lt;p&gt;The system is divided into four layers to ensure separation of concerns and maintainability.&lt;/p&gt;

&lt;h3&gt;Frontend (Streamlit)&lt;/h3&gt;

&lt;p&gt;Handles mic recording, file uploads, and displays action logs.&lt;/p&gt;

&lt;h3&gt;STT Service (FastAPI + Whisper)&lt;/h3&gt;

&lt;p&gt;Runs in a dedicated Docker container and converts audio into text.&lt;/p&gt;

&lt;h3&gt;AI Layer (Ollama)&lt;/h3&gt;

&lt;p&gt;Processes text to detect intent and generate code or actions.&lt;/p&gt;
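&lt;p&gt;A minimal sketch of what that step might look like against Ollama's HTTP API, assuming a local server on the default port. The intent names and prompt are placeholders rather than the project's real ones, and anything the model returns outside the allowed set falls back to plain chat:&lt;/p&gt;

```python
import json
import urllib.request

# Placeholder intent set; the project's real categories live in intent.py.
ALLOWED_INTENTS = {"chat", "create_file", "delete_file", "generate_code", "summarize"}

def build_prompt(text):
    return (
        "Classify the request into one of: "
        + ", ".join(sorted(ALLOWED_INTENTS))
        + '. Reply only with JSON like {"intent": "..."}. Request: '
        + text
    )

def parse_intent(raw):
    # Treat anything malformed or unexpected as plain chat rather than crashing.
    try:
        intent = json.loads(raw).get("intent")
    except (json.JSONDecodeError, AttributeError):
        return "chat"
    return intent if intent in ALLOWED_INTENTS else "chat"

def classify(text, model="llama3", url="http://localhost:11434/api/generate"):
    # Ollama's generate endpoint; assumes a local server on the default port.
    body = json.dumps({"model": model, "prompt": build_prompt(text), "stream": False})
    req = urllib.request.Request(url, data=body.encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_intent(json.loads(resp.read())["response"])
```

&lt;p&gt;Validating the model's reply on the way in is what keeps a hallucinated intent from ever reaching the action layer.&lt;/p&gt;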

&lt;h3&gt;Action Layer&lt;/h3&gt;

&lt;p&gt;Executes safe file operations within a controlled output directory.&lt;/p&gt;
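&lt;p&gt;One way such a guardrail can be sketched (an illustration, not the project's actual code): resolve every requested path and refuse anything that lands outside the output directory:&lt;/p&gt;

```python
from pathlib import Path

# Illustrative guardrail: all writes are confined to output/.
OUTPUT_DIR = Path("output").resolve()

def safe_path(name):
    # Resolve ".." and symlink-free tricks, then check containment.
    candidate = (OUTPUT_DIR / name).resolve()
    if OUTPUT_DIR != candidate and OUTPUT_DIR not in candidate.parents:
        raise ValueError("path escapes output/: " + name)
    return candidate
```

&lt;p&gt;Resolving first and comparing against the resolved base is what defeats inputs like &lt;code&gt;../secret.txt&lt;/code&gt;.&lt;/p&gt;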




&lt;h2&gt;The Flow&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Voice → Whisper STT → Text → Ollama → Intent → Action → File Output
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
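&lt;p&gt;The same flow can be sketched as one function with each stage injected, so the real services (Whisper STT, Ollama) can be swapped for stubs when testing; the stage names are my own:&lt;/p&gt;

```python
# Sketch of the pipeline: each stage is a callable, so real services can
# be replaced by stubs in tests without touching the orchestration logic.
def run_pipeline(audio_bytes, transcribe, classify, execute):
    text = transcribe(audio_bytes)   # Whisper STT container
    intent = classify(text)          # Ollama intent detection
    return execute(intent, text)     # action layer writes inside output/
```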


&lt;h1&gt;Overcoming Challenges&lt;/h1&gt;

&lt;h2&gt;1. The Streamlit "Hang"&lt;/h2&gt;

&lt;p&gt;Streamlit reruns the script on every interaction. Initially, stopping a recording caused the UI to crash or feel stuck. I solved this by deduplicating recordings with session state and a content hash, so each clip is processed only once:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;if mic_audio is not None:
    recorded = mic_audio.getvalue()
    if recorded:
        recorded_digest = hashlib.sha1(recorded).hexdigest()
        # Only process if the audio is new
        if recorded_digest != st.session_state.mic_audio_digest:
            st.session_state.mic_audio_bytes = recorded
            st.session_state.mic_audio_ready = True
            st.session_state.mic_audio_digest = recorded_digest
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;h2&gt;2. Service Reliability&lt;/h2&gt;

&lt;p&gt;To prevent the UI from hanging when a backend service was down, I added defensive health checks for all Dockerized components.&lt;/p&gt;
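&lt;p&gt;A sketch of such a check (the function name is mine, not the project's): probe the service with a short timeout and treat any failure as "down" instead of letting the exception bubble into the UI:&lt;/p&gt;

```python
import urllib.request

# Illustrative health probe: never raise and never block past the timeout,
# so a dead backend degrades the UI gracefully instead of hanging it.
def service_healthy(url, timeout=2.0):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False
```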

&lt;h1&gt;Lessons Learned&lt;/h1&gt;

&lt;p&gt;Building AI features isn't just about the model; it’s about reliability. My biggest improvements came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strict API contracts&lt;/li&gt;
&lt;li&gt;Defensive programming&lt;/li&gt;
&lt;li&gt;Safe execution boundaries (sandboxing)&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;Future Plans&lt;/h1&gt;

&lt;p&gt;I’m looking into integrating Gemma 4 models for better task following and more complex conversation handling.&lt;/p&gt;

&lt;h1&gt;Explore the Code&lt;/h1&gt;

&lt;p&gt;You can check out the full source code and setup instructions here:&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/fister12" rel="noopener noreferrer"&gt;
        fister12
      &lt;/a&gt; / &lt;a href="https://github.com/fister12/voice-agent" rel="noopener noreferrer"&gt;
        voice-agent
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Voice Agent&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;Local-first voice assistant for coding and text workflows.&lt;/p&gt;

&lt;p&gt;It combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streamlit UI for input, status, and results&lt;/li&gt;
&lt;li&gt;FastAPI + faster-whisper STT service in Docker&lt;/li&gt;
&lt;li&gt;Ollama for intent classification and generation&lt;/li&gt;
&lt;li&gt;Safe action executor that only writes inside output/&lt;/li&gt;
&lt;li&gt;Persistent memory and action history for continuity&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Features&lt;/h2&gt;
&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Audio input from file upload and microphone recording (when supported by Streamlit)&lt;/li&gt;
&lt;li&gt;Typed command fallback when audio is unavailable&lt;/li&gt;
&lt;li&gt;Intent routing to file creation, code generation, summarization, chat, and compound multi-step actions&lt;/li&gt;
&lt;li&gt;Guardrails to prevent path traversal outside output/&lt;/li&gt;
&lt;li&gt;Persistent SQLite memory in output/memory.db&lt;/li&gt;
&lt;li&gt;Action audit log in output/action_log.jsonl&lt;/li&gt;
&lt;li&gt;Benchmark runner with JSONL result logging and dashboard snapshot&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Architecture&lt;/h2&gt;
&lt;/div&gt;

&lt;p&gt;Main components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;app.py: Streamlit UI and orchestration flow&lt;/li&gt;
&lt;li&gt;stt.py: client for STT HTTP API&lt;/li&gt;
&lt;li&gt;stt_service/app.py: Whisper transcription API (FastAPI)&lt;/li&gt;
&lt;li&gt;intent.py: intent classification + LLM helpers&lt;/li&gt;
&lt;li&gt;tools/actions.py: safe action execution and logging&lt;/li&gt;
&lt;li&gt;memory_store.py: SQLite memory retrieval and storage&lt;/li&gt;
&lt;li&gt;benchmark.py: repeatable intent/STT benchmarking&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Request flow:&lt;/p&gt;


&lt;ol&gt;&lt;li&gt;…&lt;/li&gt;&lt;/ol&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/fister12/voice-agent" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;I would love to hear your feedback or suggestions for improvement!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>fastapi</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
