Building a Local-First AI Agent: Coding with Your Voice
I built a local-first AI agent that turns spoken words into real-time actions on your machine—whether it's coding or general file management.
The Vision
Most coding tools today are cloud-dependent. I wanted something that:
- Respects privacy (data stays local)
- Has low latency
- Enables hands-free workflows
The goal was simple: a lightweight local system capable of handling tasks like saving files, deleting directories, or summarizing documents without sending data to external services.
The Tech Stack
Building this system required stitching together multiple components that initially didn’t integrate smoothly. To make them work cohesively, I:
- Dockerized services for isolation and reliability
- Used JSON as a standard communication format between components
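As a concrete illustration, a transcription response passed from the STT service to the frontend could look like the following. The field names here are illustrative, not the project's actual schema:

```python
import json

# Hypothetical STT response payload -- field names are illustrative,
# not necessarily the project's actual schema.
stt_response = {
    "text": "create a file named notes dot txt",
    "language": "en",
    "duration_sec": 2.4,
}

# Every service boundary exchanges plain JSON strings,
# so each component can be swapped or restarted independently.
payload = json.dumps(stt_response)
decoded = json.loads(payload)
print(decoded["text"])  # the transcript the AI layer consumes
```

Keeping the contract this small makes failures easy to spot: if a component emits anything that fails `json.loads`, the caller can reject it immediately.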
Components
- Frontend: Streamlit
- Backend (STT): FastAPI + Faster-Whisper (Dockerized)
- AI Layer: Ollama (local models for intent detection and code generation)
- Action Layer: Custom Python functions for system operations
4-Layer Architecture
The system is divided into four layers to ensure separation of concerns and maintainability.
Frontend (Streamlit)
Handles mic recording, file uploads, and displays action logs.
STT Service (FastAPI + Whisper)
Runs in a dedicated Docker container and converts audio into text.
AI Layer (Ollama)
Processes text to detect intent and generate code or actions.
Action Layer
Executes safe file operations within a controlled output directory.
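The key safety property of the action layer is that every requested path is resolved and checked against the output directory before anything is written or deleted. A minimal sketch of that guard (the function name and layout are my own, not necessarily the project's):

```python
from pathlib import Path

# All file operations are confined to this directory
OUTPUT_DIR = Path("output").resolve()

def safe_path(relative: str) -> Path:
    """Resolve a requested path and refuse anything outside output/."""
    candidate = (OUTPUT_DIR / relative).resolve()
    # resolve() collapses "..", so traversal attempts land outside OUTPUT_DIR
    if not candidate.is_relative_to(OUTPUT_DIR):
        raise ValueError(f"Refusing path outside output/: {relative}")
    return candidate
```

A request like `safe_path("../etc/passwd")` raises instead of escaping the sandbox, while `safe_path("notes.txt")` resolves normally.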
The Flow
Voice → Whisper STT → Text → Ollama → Intent → Action → File Output
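The flow above can be sketched as a short orchestration function. Each step below is a stand-in stub for the real service call (the stub bodies and the intent shape are hypothetical, for illustration only):

```python
def transcribe(audio_bytes: bytes) -> str:
    """Stand-in for the Dockerized Whisper STT call (hypothetical stub)."""
    return audio_bytes.decode("utf-8")  # real code would POST to the STT API

def classify_intent(text: str) -> dict:
    """Stand-in for Ollama intent detection (hypothetical stub)."""
    action = "create_file" if text.startswith("save") else "chat"
    return {"action": action, "text": text}

def execute(intent: dict) -> str:
    """Stand-in for the action layer; real code writes only inside output/."""
    return f"ran {intent['action']}"

def handle_voice_command(audio_bytes: bytes) -> str:
    # Voice -> STT -> text -> intent -> action, exactly as in the flow above
    text = transcribe(audio_bytes)
    intent = classify_intent(text)
    return execute(intent)

print(handle_voice_command(b"save my notes"))  # -> ran create_file
```

Because each stage only sees the output of the previous one, any layer can be replaced (a different STT model, a different LLM) without touching the rest.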
Overcoming Challenges
1. The Streamlit "Hang"
Streamlit reruns the script on every interaction. Initially, stopping a recording caused the UI to crash or feel stuck. I solved this by deduplicating recordings with session state and a content hash, so each rerun processes a given clip only once:
```python
import hashlib
import streamlit as st

# Initialize the digest on first run so the comparison below never fails
if "mic_audio_digest" not in st.session_state:
    st.session_state.mic_audio_digest = None

if mic_audio is not None:
    recorded = mic_audio.getvalue()
    if recorded:
        recorded_digest = hashlib.sha1(recorded).hexdigest()
        # Only process if the audio is new
        if recorded_digest != st.session_state.mic_audio_digest:
            st.session_state.mic_audio_bytes = recorded
            st.session_state.mic_audio_ready = True
            st.session_state.mic_audio_digest = recorded_digest
```
2. Service Reliability
To prevent the UI from hanging when a backend service was down, I added defensive health checks for all Dockerized components.
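A health check like this can be done with a short, strictly time-boxed HTTP request before the UI attempts a real call. A minimal sketch using only the standard library (the `/health` route is an assumption; adjust to whatever endpoint your service exposes):

```python
import urllib.request

def service_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the service answers quickly with HTTP 200.

    The short timeout is the whole point: a dead container should make
    the UI degrade gracefully, not hang waiting on a socket.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # covers connection refused, DNS failure, and timeouts
        return False
```

In the Streamlit app, a failed check can disable the record button and show a "service offline" banner instead of letting a request block the rerun loop.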
Lessons Learned
Building AI features isn't just about the model; it's about reliability. My biggest improvements came from:
- Strict API contracts
- Defensive programming
- Safe execution boundaries (sandboxing)
Future Plans
I’m looking into integrating newer Gemma models for better instruction following and more complex conversation handling.
Explore the Code
You can check out the full source code and setup instructions here:
Voice Agent
Local-first voice assistant for coding and text workflows.
It combines:
- Streamlit UI for input, status, and results
- FastAPI + faster-whisper STT service in Docker
- Ollama for intent classification and generation
- Safe action executor that only writes inside output/
- Persistent memory and action history for continuity
Features
- Audio input from file upload and microphone recording (when supported by Streamlit)
- Typed command fallback when audio is unavailable
- Intent routing to file creation, code generation, summarization, chat, and compound multi-step actions
- Guardrails to prevent path traversal outside output/
- Persistent SQLite memory in output/memory.db
- Action audit log in output/action_log.jsonl
- Benchmark runner with JSONL result logging and dashboard snapshot
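The append-only JSONL audit log is worth a quick sketch: one JSON object per line means the log can be appended atomically and grepped or replayed later without parsing a whole document. The record fields below are illustrative, not the project's exact schema:

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("output/action_log.jsonl")

def log_action(action: str, detail: dict) -> None:
    """Append one action record per line (field names are illustrative)."""
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    record = {"ts": time.time(), "action": action, **detail}
    # "a" mode keeps the log append-only; each line is a complete JSON object
    with LOG_PATH.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Reading the log back is just `json.loads` per line, which is also what makes it convenient for the benchmark runner's JSONL results.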
Architecture
Main components:
- app.py: Streamlit UI and orchestration flow
- stt.py: client for STT HTTP API
- stt_service/app.py: Whisper transcription API (FastAPI)
- intent.py: intent classification + LLM helpers
- tools/actions.py: safe action execution and logging
- memory_store.py: SQLite memory retrieval and storage
- benchmark.py: repeatable intent/STT benchmarking
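For continuity across sessions, the SQLite-backed memory can be as simple as one table of conversation turns. A minimal sketch under that assumption (the schema and function names are mine, not necessarily what memory_store.py uses):

```python
import sqlite3

def init_memory(db_path: str = "output/memory.db") -> sqlite3.Connection:
    """Open (or create) the persistent memory DB; schema is illustrative."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS memory ("
        "  id INTEGER PRIMARY KEY,"
        "  role TEXT NOT NULL,"
        "  content TEXT NOT NULL)"
    )
    return conn

def remember(conn: sqlite3.Connection, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO memory (role, content) VALUES (?, ?)", (role, content)
    )
    conn.commit()

def recall(conn: sqlite3.Connection, limit: int = 5):
    # Most recent turns first, so the agent can prepend them as context
    return conn.execute(
        "SELECT role, content FROM memory ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
```

Because SQLite is a single local file, this fits the local-first constraint: memory survives restarts without any external service.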
Request flow:
- …
I would love to hear your feedback or suggestions for improvement!
