What I built
Chronos is a voice-controlled local AI agent that listens to your voice, understands what you want, and actually does it — on your machine, with no internet, no API bills, and no data leaving your system.
You speak a command like "write a Python file called calculator with add and subtract functions" — and within seconds, the file exists on your disk with working code inside it.
It supports six actions: creating files, writing code, summarizing text, running Python scripts, launching files, and general conversation. Everything runs locally on an RTX 4050 laptop GPU.
The Architecture
Mic / Audio File
↓
Whisper Medium (GPU) — Speech to Text
↓
qwen2.5-coder:7b via Ollama (GPU; other models can be swapped in) — Intent Detection
↓
Python Tool Executor — Actual Action
↓
Streamlit UI — Live Results
The pipeline is intentionally linear. Each stage completes before the next starts. This matters for VRAM management, which I'll explain below.
Why local models?
Four reasons:
Privacy — your voice commands, your code, your files never leave your machine. No cloud provider sees any of it.
Speed — no network latency. The bottleneck is local inference, not round trips to a remote API.
Cost — zero API bills. Run it a thousand times a day, costs nothing.
Control — you choose the model, the hardware, the behavior. No rate limits, no deprecations, no terms of service changes breaking your workflow overnight.
The interesting problem — VRAM management
This was the first real engineering constraint I hit.
My RTX 4050 has 6GB VRAM. Whisper medium needs roughly 3GB to load. qwen2.5-coder:7b quantized needs around 4.5GB. Together that's 7.5GB — more than the card has.
The solution was sequential execution. Whisper runs first, transcribes the audio, then releases GPU memory. qwen2.5 loads after Whisper is done. Since they never run simultaneously, 6GB is sufficient for both.
This is why the pipeline is strictly linear — it's not just clean architecture, it's a hardware constraint turned into a design decision.
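The load-use-release pattern can be sketched in plain Python. The stage bodies below are stand-ins (the real app loads Whisper and calls Ollama; those calls are replaced with placeholders here); the point is the ordering and the explicit teardown between stages:

```python
import gc

def run_pipeline(audio_path: str) -> dict:
    # Stage 1: load Whisper, transcribe, then free its VRAM.
    # In the real app this would be: model = whisper.load_model("medium")
    model = {"name": "whisper-medium"}          # stand-in for the loaded model
    transcript = f"transcribed({audio_path})"   # stand-in for model.transcribe(...)
    del model                                   # drop the only reference...
    gc.collect()                                # ...and force collection
    # with torch you would also call torch.cuda.empty_cache() here

    # Stage 2: only now does the LLM load, so the two models
    # never occupy VRAM at the same time.
    intent = {"intent": "general_chat", "content": transcript}  # stand-in for the Ollama call
    return intent

result = run_pipeline("command.wav")
```

The same structure holds whatever the stages are: each one finishes and releases its memory before the next one loads.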
Prompt engineering for intent detection
The hardest part wasn't the models. It was making qwen reliably return structured JSON instead of conversational text.
My first attempts returned things like:
"Sure! Based on what you said, I think you want to create a file. Here's what I'd suggest..."
Completely unusable for programmatic parsing. I needed exactly this:
{"intent": "create_file", "filename": "notes.txt", "content": "shopping list", "description": null}
Three things fixed it:
- Explicit system prompt — I told the model in the first line: "Respond ONLY with a JSON object. No explanation. No markdown. No extra text." Repetition matters. Models respond to emphasis.
- Few-shot examples — I included four complete examples in the system prompt showing exact input/output pairs for each intent. The model learned the pattern from examples, not just instructions.
- Fallback handling — even with a tight prompt, qwen occasionally returns malformed JSON. I wrapped the parser in a try/except that falls back to general_chat instead of crashing. The app never breaks; it just has a conversation instead:

```python
try:
    result = json.loads(raw)
except json.JSONDecodeError:
    result = {"intent": "general_chat", "content": transcribed_text}
```

Graceful degradation is not optional in production AI systems.
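The prompt side can be sketched the same way. The wording and field names below are illustrative, not the repo's actual prompt; what matters is the structure — a hard "JSON only" instruction followed by complete input/output pairs:

```python
import json

# Assumed system prompt text; the real repo's wording may differ
SYSTEM_PROMPT = (
    "Respond ONLY with a JSON object. No explanation. No markdown. No extra text.\n"
    "Fields: intent, filename, content, description. Use null when a field does not apply."
)

# Few-shot pairs: exact input/output examples teach the format better than rules alone
FEW_SHOT = [
    ("create a file called notes with a shopping list",
     {"intent": "create_file", "filename": "notes.txt",
      "content": "shopping list", "description": None}),
    ("what is machine learning",
     {"intent": "general_chat", "filename": None,
      "content": None, "description": "what is machine learning"}),
]

def build_messages(user_text: str) -> list:
    """Assemble the chat messages sent to the model via Ollama."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for prompt, reply in FEW_SHOT:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": json.dumps(reply)})
    messages.append({"role": "user", "content": user_text})
    return messages

messages = build_messages("write a Python file called calculator")
```

The few-shot replies are serialized with `json.dumps` so the model only ever sees valid JSON in the assistant role.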
Tech Stack
- Whisper medium — speech to text (GPU)
- qwen2.5-coder:7b via Ollama — intent detection and code generation (GPU)
- Python — tool executor
- Streamlit — UI
- audio-recorder-streamlit — mic input
Challenges I Actually Hit
VRAM conflict — solved with sequential execution as described above.
qwen returning markdown in code — when asked to generate code, qwen wrapped responses in triple backticks, so the generated file literally had a `` ```python `` line at the top. Fixed by stripping any line starting with `` ``` `` from the response before writing to disk.
Streamlit widget rendering — Streamlit re-renders the entire script on every interaction. If you put a widget inside a custom HTML div using st.markdown, Streamlit renders the widget outside the div entirely. Learned this the hard way: all widgets must be direct children of Streamlit columns, never inside HTML wrappers.
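The fence-stripping fix above is a one-liner in spirit. A hedged sketch — the repo may implement it differently, and the function name here is mine:

```python
def strip_code_fences(response: str) -> str:
    """Drop markdown fence lines (```python, ```) before writing code to disk."""
    lines = response.splitlines()
    kept = [line for line in lines if not line.lstrip().startswith("```")]
    return "\n".join(kept)

raw = "```python\ndef add(a, b):\n    return a + b\n```"
clean = strip_code_fences(raw)  # -> "def add(a, b):\n    return a + b"
```

Filtering whole lines rather than substrings avoids mangling legitimate code that merely contains backticks inside a string.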
Mic recorder iframe — the audio-recorder-streamlit component renders inside an iframe. CSS from the parent page cannot reach inside iframes due to browser security. The black background of the mic recorder couldn't be overridden with external CSS. Worked around it by wrapping the component in a white div and accepting the limitation.
Model Benchmarking
I tested three models for intent detection:
llama3:8b — failed immediately. Ollama returned an explicit error: `llama3:latest does not support tools`. Not viable for structured output.
mistral:7b — interesting split performance. For text tasks — summarization, explanation, conversational answers — mistral is genuinely impressive. It writes fluent, well-structured responses with good reasoning. Ask it to explain machine learning and you get a clear, readable answer. However, for structured output like JSON intent detection, it was inconsistent: it occasionally mixed conversational text into the JSON response, breaking the parser roughly 20% of the time. Great model, wrong job.
qwen2.5-coder:7b — most reliable by a significant margin and dominant for anything code related. Its code-focused training makes it naturally better at treating output as structured data rather than conversation. JSON format compliance was near perfect with the right system prompt, and code generation was clean, syntactically correct, and well structured. Inference speed was also the fastest of the three on my hardware. For any task where the output must be machine-parseable, a code-trained model significantly outperforms general instruction-tuned models.
What I'd Improve With More Time
Compound commands — handling multiple intents in one audio input. "Summarize this text and save it to summary.txt" currently only handles one action. Parsing compound commands requires chaining tool calls.
Persistent session memory — current memory resets on app restart. A local SQLite database would maintain conversation history across sessions, making the agent genuinely stateful.
Streaming output — currently the full response appears at once after processing. Streaming tokens from qwen to the UI would feel much more responsive.
Model switching at runtime — a dropdown in the UI to switch between qwen2.5, mistral, and llama3 without restarting the app. The architecture supports it, just needs a UI control.
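Of these, persistent memory is the most mechanical to add. A minimal sketch using sqlite3 from the standard library — the table and column names are hypothetical, not from the repo:

```python
import sqlite3

def open_memory(path: str = "chronos_memory.db") -> sqlite3.Connection:
    """Open (or create) the conversation store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS turns ("
        "  id INTEGER PRIMARY KEY AUTOINCREMENT,"
        "  role TEXT NOT NULL,"      # 'user' or 'assistant'
        "  content TEXT NOT NULL)"
    )
    return conn

def remember(conn: sqlite3.Connection, role: str, content: str) -> None:
    conn.execute("INSERT INTO turns (role, content) VALUES (?, ?)", (role, content))
    conn.commit()

def recall(conn: sqlite3.Connection, limit: int = 20) -> list:
    # Most recent turns, returned oldest-first, ready to prepend to the next prompt
    rows = conn.execute(
        "SELECT role, content FROM turns ORDER BY id DESC LIMIT ?", (limit,)
    ).fetchall()
    return rows[::-1]

conn = open_memory(":memory:")  # in-memory here; a file path survives restarts
remember(conn, "user", "write a calculator")
remember(conn, "assistant", "created calculator.py")
history = recall(conn)
```

Swapping `":memory:"` for a real file path is all it takes to make the history outlive an app restart.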
Try It Yourself
GitHub: https://github.com/ParzivalZ73/Chronos
Demo: https://youtu.be/UmMykbFlc6k
Requires Ollama, Python 3.10+, and an NVIDIA GPU with at least 5GB VRAM.