DEV Community

Rupesh-Max-na-Ore
Voice Controlled Local AI Agent

#ai

Building a Voice-Controlled Local AI Agent That Actually Does Things

Most AI apps today stop at conversation. You ask something, and the system replies. Useful—but passive.

What if your AI could listen, understand, plan, and act?

In this project, I built a voice-controlled local AI agent that doesn’t just respond—it executes real tasks like:

  • summarizing text and saving it to files
  • generating runnable Python code
  • combining multiple instructions into a single workflow
  • explaining results interactively

This article breaks down how it works and the challenges behind making it reliable.


The Core Idea

Instead of treating language as output, we treat it as input for execution.

A spoken command like:

“Summarize this text and save it to a file, then explain it”

gets transformed into a sequence of actions:

  1. Generate summary
  2. Save it to disk
  3. Explain it in chat

So the system becomes less like a chatbot and more like a task executor driven by language.


System Architecture (Simplified)

The pipeline looks like this:

  1. Speech → Text
  2. Text → Intent(s)
  3. Intent(s) → Execution Plan
  4. Plan → Tool Execution
  5. Results → UI + Files

Each step is simple individually, but coordinating them is where things get interesting.


1. Speech-to-Text: Not as Simple as It Looks

Even random noise (a fan, traffic, static) can produce a transcription.

That means the system will often receive valid-looking but meaningless text.

Solution:

  • Add a confidence score
  • Block or warn on low-confidence input

This is critical for demos: it keeps the system from executing nonsense actions on background noise.
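The gating step can be sketched in a few lines. This is a minimal illustration, not the project's actual code: it assumes the speech-to-text engine (or a wrapper around it) hands back a confidence value normalized to the range 0 to 1, and the names `Transcript`, `gate_transcript`, and `MIN_CONFIDENCE` are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical threshold; tune it against recordings from your own microphone.
MIN_CONFIDENCE = 0.6


@dataclass
class Transcript:
    text: str
    confidence: float  # assumed to be normalized to [0.0, 1.0]


def gate_transcript(t: Transcript, min_confidence: float = MIN_CONFIDENCE) -> tuple[bool, str]:
    """Return (ok, message). Only ok=True transcripts should reach intent detection."""
    if not t.text.strip():
        return False, "No speech detected."
    if t.confidence < min_confidence:
        return False, f"Low-confidence transcription ({t.confidence:.2f}); nothing was executed."
    return True, t.text
```

When the gate returns `False`, the message is meant for the UI only, and the rest of the pipeline is skipped.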


2. Intent Detection: The Brain of the System

The system maps text into structured intents like:

  • summarize
  • write_code
  • create_file
  • chat

For example:

“Create a Python file and explain it”

becomes:

  • write_code
  • create_file
  • chat

Key Design Decision

I used a rule-based classifier with light LLM support, not a fully LLM-driven parser.

Why?

  • Predictability matters when executing actions
  • Pure LLM parsing caused inconsistent behavior
  • Rules handle core commands reliably
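As a rough sketch of the rule-based part, keyword patterns can be mapped to intents and ordered by where they first appear in the command. The patterns below are illustrative guesses, not the rules used in the project, and the LLM fallback is left out entirely.

```python
import re

# Hypothetical keyword rules; the project's actual rules are not shown in the article.
INTENT_RULES = {
    "summarize": re.compile(r"\b(summari[sz]e|summary)\b", re.I),
    "write_code": re.compile(r"\b(code|script|python|function)\b", re.I),
    "create_file": re.compile(r"\b(file|save|write to)\b", re.I),
    "chat": re.compile(r"\b(explain|tell me|what is|why)\b", re.I),
}


def detect_intents(text: str) -> list[str]:
    """Return matched intents, ordered by where each first appears in the text."""
    hits = []
    for intent, pattern in INTENT_RULES.items():
        match = pattern.search(text)
        if match:
            hits.append((match.start(), intent))
    hits.sort()
    intents = [intent for _, intent in hits]
    # Fall back to plain conversation when no rule fires.
    return intents or ["chat"]


print(detect_intents("Create a Python file and explain it"))
# ['write_code', 'create_file', 'chat']
```

Ordering by position is a simplification: it assumes the user states steps in the order they should run, which a command like "save it after summarizing" would violate.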

3. Execution Planning (Implicit but Powerful)

There’s no heavy planner yet. Instead, the system uses intent order as the plan.

So:

summarize → create_file → chat

naturally becomes:

  1. Generate summary
  2. Save summary
  3. Explain summary

Simple, but surprisingly effective.


4. Router: The Execution Engine

This is the core of the system.

It:

  • takes intents one by one
  • maintains a shared context
  • calls the appropriate tools
  • passes results forward

Example context:

  • summary
  • generated code
  • last saved file

This allows chaining like:

summarize → save → explain

without recomputing anything.
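A minimal sketch of such a router, assuming each tool is a function that reads from and writes to a shared dictionary. The tool bodies are stand-ins rather than the project's real implementations, and names like `route` and `TOOLS` are hypothetical.

```python
from typing import Callable

Context = dict[str, str]
Tool = Callable[[Context], str]


def summarize(ctx: Context) -> str:
    # Stand-in for an LLM call: keep only the first sentence.
    ctx["summary"] = ctx["input_text"].split(".")[0].strip() + "."
    return "Summary generated."


def create_file(ctx: Context) -> str:
    # Stand-in for disk I/O: record what would be written instead of writing it.
    ctx["last_saved_content"] = ctx.get("code") or ctx.get("summary", "")
    ctx["last_saved_file"] = "output/summary.txt"
    return f"Saved to {ctx['last_saved_file']}"


def chat(ctx: Context) -> str:
    # Explain the summary if one exists, otherwise the raw input.
    return f"Explanation of: {ctx.get('summary', ctx['input_text'])}"


TOOLS: dict[str, Tool] = {"summarize": summarize, "create_file": create_file, "chat": chat}


def route(intents: list[str], input_text: str) -> tuple[Context, list[str]]:
    """Run each intent's tool in order, threading one context through all of them."""
    ctx: Context = {"input_text": input_text}
    messages = []
    for intent in intents:
        messages.append(TOOLS[intent](ctx))
    return ctx, messages
```

Because every tool writes its result into `ctx`, the save step and the explain step both see the summary produced by the first step.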


5. Tools: Modular and Clean

Each capability is implemented as a separate module:

  • summarizer
  • code generator
  • file operations
  • chat

This separation makes debugging much easier.


6. Code Generation: Trickier Than Expected

Generating code sounds easy—until you try to save and run it.

Problems encountered:

  • the LLM wraps its output in ```python fences
  • it adds explanations inside the code
  • it produces incomplete scripts

Fixes:

  • strict prompts: “output only code”
  • post-processing to remove markdown
  • enforce executable structure

Now the system generates clean .py files that run directly.
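The post-processing step might look something like the sketch below. It is one possible approach, not the project's code: `extract_code` and `is_valid_python` are hypothetical helpers, and `ast.parse` only confirms that the text is syntactically valid Python, not that it runs correctly.

```python
import ast
import re

FENCE = "`" * 3  # markdown code fence, built here to avoid a literal fence in the source
FENCE_RE = re.compile(FENCE + r"[\w+-]*\n(.*?)" + FENCE, re.DOTALL)


def extract_code(llm_output: str) -> str:
    """Strip markdown fences and any chatter around them from an LLM response."""
    match = FENCE_RE.search(llm_output)
    code = match.group(1) if match else llm_output
    return code.strip() + "\n"


def is_valid_python(code: str) -> bool:
    """Cheap structural check: does the text parse as Python?"""
    try:
        ast.parse(code)
    except SyntaxError:
        return False
    return True
```

A truncated script usually fails the parse check, which gives the system a chance to retry or report the problem before anything is written to disk.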


7. File Saving: Small Bug, Big Impact

Initially, outputs like this were getting saved:


```python
Saved code to output/generated.py
print("Hello World")
```

Which breaks execution.

Solution:

  • strictly separate UI messages from file content
  • save only the raw artifact (code or summary)
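One way to enforce that separation is to have the save function return the status message as a separate field rather than mixing it into the content. A small sketch, with `SaveResult` and `save_artifact` as hypothetical names:

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass
class SaveResult:
    path: Path
    ui_message: str  # shown in the chat UI, never written to disk


def save_artifact(content: str, path: Path) -> SaveResult:
    """Write only the raw artifact; return the status message separately."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(content, encoding="utf-8")
    return SaveResult(path=path, ui_message=f"Saved code to {path}")
```

The caller displays `ui_message` and never passes it back into `save_artifact`, so the file on disk contains the code and nothing else.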

8. Context Passing: The Key Insight

This is what makes multi-step tasks work.

Without shared context, the system would:

  • generate a summary
  • then forget it when saving
  • then fail to explain it

With context, everything flows correctly across steps.


Major Challenges

1. Intent Ambiguity

Natural language is messy.
“Explain it” — what is “it”?

Solution:

  • resolve “it” to the best available output in the shared context, in priority order: code > summary > raw text
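That priority rule is simple to express over a context dictionary. The sketch below assumes the context uses the keys `code`, `summary`, and `input_text`; those key names and the function `resolve_referent` are assumptions for illustration.

```python
from typing import Optional

# Priority order taken from the rule above: code, then summary, then raw text.
REFERENT_PRIORITY = ("code", "summary", "input_text")


def resolve_referent(ctx: dict[str, str]) -> Optional[tuple[str, str]]:
    """Return (key, value) for the first non-empty context entry, or None."""
    for key in REFERENT_PRIORITY:
        value = ctx.get(key, "")
        if value.strip():
            return key, value
    return None
```

Returning `None` when nothing is available lets the chat tool ask the user what they meant instead of explaining an empty string.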

2. Multi-Step Coordination

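Order matters a lot.

If the save step runs before the summary exists, for example, there is nothing in the context to write yet. Wrong order means empty files or broken logic.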


3. LLM Control

LLMs love adding extra text.

I had to:

  • constrain prompts heavily
  • clean outputs programmatically

4. Silent Failures

One broken step can break everything downstream.

So I added:

  • defensive checks
  • fallback handling
  • clear error messages
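A sketch of what such checks could look like around a sequence of steps. This version simply stops at the first failure and reports it; a real fallback (retrying, or degrading to plain chat) is not shown, and `run_steps` is a hypothetical name.

```python
from typing import Callable


def run_steps(steps: list[tuple[str, Callable[[dict], str]]], ctx: dict) -> list[str]:
    """Run steps in order; stop at the first failure and report it clearly."""
    messages = []
    for name, step in steps:
        try:
            result = step(ctx)
        except Exception as exc:
            # Surface the error instead of letting later steps run on bad state.
            messages.append(f"[{name}] failed: {exc}. Remaining steps were skipped.")
            break
        if not result or not result.strip():
            # An empty result is treated as a failure rather than passed downstream.
            messages.append(f"[{name}] produced no output. Remaining steps were skipped.")
            break
        messages.append(f"[{name}] ok")
    return messages
```

Treating empty output as a failure is what turns a silent problem into a visible one: the save step never gets the chance to write an empty file.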

5. Noise Handling

Even garbage audio produces text.

So:

  • low confidence → no execution
  • show warning instead

What This System Really Is

The best way to think about it:

It’s a compiler that turns human language into actions.

Instead of compiling code → machine instructions, we compile:

  • speech → intent
  • intent → actions
  • actions → real outputs

What’s Next

Some natural extensions:

  • better intent parsing using structured LLM output
  • persistent memory across sessions
  • graph-based planning instead of linear steps
  • more tools (APIs, browser, databases)

Final Thought

The real shift here is subtle but important:

We’re moving from:

AI that talks

to:

AI that acts

And once systems start acting reliably, they stop being assistants—and start becoming agents.
