DEV Community

Abhigyan Pal

Building a Voice-Controlled Local AI Agent

I built a voice-controlled AI agent that takes spoken input, converts it to text, classifies intent, executes local tools, and shows the full pipeline in a Gradio UI. The goal was to make the system practical, safe, and easy to debug end-to-end.

Architecture

The system follows a simple 4-stage pipeline:

  1. Input Layer (UI)

    Users can provide commands through microphone recording, audio upload, or text input (for quick testing).

  2. Speech-to-Text (STT)

    Audio is transcribed using AssemblyAI.

  3. Intent Understanding (LLM Router)

    Transcribed text is sent to a Groq-hosted Llama 3.3 70B model, which returns structured JSON intents such as:

    • create_file
    • write_code
    • summarize
    • general_chat
  4. Tool Execution Layer

    The app routes each detected intent to a tool:

    • file creation
    • code generation and save
    • summarization
    • chat response
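The routing step above can be sketched as a plain dispatch table mapping intent labels to handlers. The handler names and return values here are illustrative, not the project's actual tools/* functions:

```python
# Minimal intent dispatcher sketch. Handler bodies are placeholders;
# the real project wires these labels to its tools/* modules.

def create_file(args: dict) -> str:
    return f"created {args.get('filename', 'untitled.txt')}"

def write_code(args: dict) -> str:
    return f"generated code for: {args.get('task', '')}"

def summarize(args: dict) -> str:
    return f"summary of: {args.get('text', '')[:30]}"

def general_chat(args: dict) -> str:
    return f"chat reply to: {args.get('message', '')}"

# Map each intent label from the LLM router to its tool.
TOOLS = {
    "create_file": create_file,
    "write_code": write_code,
    "summarize": summarize,
    "general_chat": general_chat,
}

def execute(intents: list[dict]) -> list[str]:
    """Run each detected intent in order; unknown labels fall back to chat."""
    results = []
    for intent in intents:
        handler = TOOLS.get(intent["intent"], general_chat)
        results.append(handler(intent.get("args", {})))
    return results
```

Falling back to `general_chat` for unknown labels keeps a misrouted command from crashing the pipeline.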

The UI then displays:

  • transcribed text
  • detected intents
  • action(s) taken
  • final result

All file operations are sandboxed to an output/ directory for safety.
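A minimal version of that sandbox check, using pathlib (the real project's helper may differ in details):

```python
from pathlib import Path

OUTPUT_DIR = Path("output").resolve()

def safe_path(name: str) -> Path:
    """Resolve a user-supplied filename inside output/, rejecting escapes
    like '../secrets.txt' or absolute paths."""
    candidate = (OUTPUT_DIR / name).resolve()
    if OUTPUT_DIR != candidate and OUTPUT_DIR not in candidate.parents:
        raise ValueError(f"path escapes sandbox: {name}")
    return candidate
```

Resolving before comparing is the important part: it normalizes `..` segments so traversal attempts are caught even when the raw string starts inside `output/`.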

Models and Why I Chose Them

1) AssemblyAI for STT

I initially considered local models (Whisper/wav2vec), but for this machine and timeline, API-based STT was more reliable and faster to integrate.

Why AssemblyAI:

  • generous free tier
  • strong transcription quality
  • simple Python SDK
  • avoids local GPU dependency

2) Groq + Llama 3.3 70B for Intent + Generation

For intent classification and text/code generation, I used Groq’s hosted Llama model.

Why Groq:

  • fast inference latency
  • good structured-output behavior (JSON intent schema)
  • strong instruction following for routing + generation
  • straightforward integration in Python
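The router's reply is only useful if it parses cleanly, so defensive JSON handling pays off. A sketch of that parsing step (the Groq API call itself is omitted; the fallback intent label mirrors the taxonomy above):

```python
import json

def parse_intents(raw_reply: str) -> list[dict]:
    """Parse the LLM's JSON intent output, tolerating markdown fences and
    falling back to general_chat when the reply is not valid JSON."""
    text = raw_reply.strip()
    # Models sometimes wrap JSON in markdown code fences; strip them.
    if text.startswith("```"):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[4:]
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Unparseable reply: treat the whole thing as a chat turn.
        return [{"intent": "general_chat", "args": {"message": raw_reply}}]
    # Accept either a single intent object or a list of them.
    return data if isinstance(data, list) else [data]
```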

Key Challenges I Faced

1) STT model configuration mismatches

A major challenge was AssemblyAI configuration compatibility:

  • speech_model, as I first configured it, was rejected as deprecated by the API
  • speech_models required specific values (universal-3-pro, universal-2)
  • the SDK's enum values did not always match the values the server actually accepted

I resolved this by explicitly pinning the config to supported speech_models values.
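In the Python SDK that fix amounts to setting the model explicitly instead of relying on defaults. A sketch, using one of the values listed above (check the SDK's current docs for the accepted set, and note the API key is a placeholder):

```python
import assemblyai as aai

aai.settings.api_key = "YOUR_ASSEMBLYAI_KEY"  # placeholder

# Pin the model explicitly rather than relying on defaults,
# using a server-accepted value noted above.
config = aai.TranscriptionConfig(speech_model="universal-2")
transcriber = aai.Transcriber(config=config)
# transcript = transcriber.transcribe("command.wav")
```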

2) Language drift (Hindi vs English output)

During testing, English speech was sometimes transcribed in Hindi script due to language auto-detection. This cascaded into Hindi LLM responses.

I fixed this by forcing English in STT config and aligning LLM prompts to respond in English.
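Concretely, that fix lives in two places: the STT config and the LLM prompt. A sketch of both (the system prompt wording is illustrative, not the project's exact prompt):

```python
import assemblyai as aai

# Disable auto-detection and force English transcription.
config = aai.TranscriptionConfig(
    language_detection=False,
    language_code="en",
)

# Matching guard on the LLM side (illustrative system prompt excerpt):
SYSTEM_PROMPT = "Always respond in English, regardless of the input language."
```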

3) Intent ambiguity in compound commands

User prompts like “create a file and write your capabilities” can be interpreted as:

  • create file + generate code
  • create file + write plain text
  • chat + file write

Because the current intent set does not include a dedicated write_text intent, the model sometimes chose write_code, producing code when plain text was expected. This highlighted an important product gap: intent taxonomies must match real user phrasing.
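One lightweight mitigation, pending a proper write_text intent, is a post-routing heuristic that downgrades write_code when the request contains no code-related wording. The keyword list and the write_text label are hypothetical here:

```python
# Illustrative keyword set; tune against real user phrasing.
CODE_KEYWORDS = {"code", "script", "function", "program", "class", "python"}

def refine_intent(intent: str, user_text: str) -> str:
    """Downgrade write_code to write_text when the user never asked for code."""
    if intent == "write_code":
        words = set(user_text.lower().split())
        if not words & CODE_KEYWORDS:
            return "write_text"
    return intent
```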

4) Safety vs usability

I needed to enable local file actions while minimizing risk. Restricting writes to output/ and adding a confirmation toggle balanced safety with usability.
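The confirmation toggle reduces to a small gate in front of any file-writing tool. A sketch; the project's actual Gradio wiring differs:

```python
from typing import Callable

def run_file_action(action: Callable[[], str],
                    require_confirmation: bool,
                    confirmed: bool) -> str:
    """Gate destructive actions behind an explicit user confirmation toggle."""
    if require_confirmation and not confirmed:
        return "pending: confirm to proceed"
    return action()
```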

What Worked Well

  • Clear modular design (stt.py, intent.py, tools/*, app.py)
  • Human-in-the-loop confirmation before file operations
  • Compound-intent execution support
  • Persistent memory support and reset controls
  • Benchmark script to evaluate intent accuracy and generation latency
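A benchmark like the one above can be as simple as timing each routing call and comparing predicted intents against a labeled set. A sketch, with the router and labels supplied by the caller:

```python
import time
from typing import Callable

def benchmark(router: Callable[[str], str],
              cases: list[tuple[str, str]]) -> dict:
    """Measure intent accuracy and mean routing latency over labeled cases."""
    correct, latencies = 0, []
    for text, expected in cases:
        start = time.perf_counter()
        predicted = router(text)
        latencies.append(time.perf_counter() - start)
        correct += predicted == expected
    return {
        "accuracy": correct / len(cases),
        "mean_latency_s": sum(latencies) / len(latencies),
    }
```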

Conclusion

This project showed that a practical voice agent is less about one “smart model” and more about pipeline reliability: robust STT config, strict intent schema, safe tool boundaries, and transparent UI feedback. The next meaningful improvement would be adding a write_text intent and richer compound-intent planning so user requests map more naturally to expected outcomes.
