DEV Community

Monika Pralayakaveri
Building a Context-Aware Local Voice Agent without the Bloat

The flow of this post follows:

  1. Introduction
  2. Architecture: The Lean Three-Layer Design
  3. Model Selection
  4. The Speech-to-Text (STT) and Text-to-Speech (TTS) Flow
  5. Bypassing LangChain for JSON-Array Routing
  6. The Safety Sandbox (Directory Isolation)
  7. Challenges faced and resolved
     ○ Solving Hallucination with Context Injection
     ○ Security & Stability: Human-in-the-Loop Logic
  8. Conclusion

1. Introduction

Building AI agents has become heavily tied to massive orchestration libraries like LangChain, LlamaIndex, AutoGen, or CrewAI.

While these frameworks are incredible for enterprise scalability, they often introduce unnecessary latency, complex abstractions, and steep learning curves for simpler tasks.

For this project (an assignment), the goal was to build a local, voice-controlled AI assistant capable of manipulating the local file system (creating, reading, rewriting, and deleting files) based on spoken intents like:
○ Create a file.
○ Write code to a new or existing file.
○ Summarize text.
○ General chat.
Instead of wrapping it in external libraries, we stripped it down to raw LLM reasoning and built it from scratch in plain Python. Here is a breakdown of the architecture, the models deployed, and the unique challenges faced.

2. Architecture: The Lean Three-Layer Design

The system operates linearly over three distinct layers:

Architecture Flow

1. The Interface Layer:

A Streamlit web UI that handles chat logic, microphone audio capture, and file uploads.

2. The LLM Intent Router (agent.py):

A pure Python execution loop that takes a strict JSON schema prompt and categorizes user requests into mapped intent lists.
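To make the routing concrete, here is a minimal sketch of what a strict JSON-schema router can look like. The intent names, `ROUTER_PROMPT`, and `parse_actions` are illustrative assumptions, not the actual contents of agent.py:

```python
import json

# Intents the router is allowed to emit; anything else is rejected.
# (Illustrative names -- the real tool set lives in tools.py.)
VALID_INTENTS = {"create_file", "read_file", "write_file", "delete_file",
                 "rename_file", "summarize_text", "chat"}

ROUTER_PROMPT = """You are an intent router. Respond ONLY with JSON of the form
{"actions": [{"intent": "<one of the allowed intents>", ...parameters...}]}.
Never add prose outside the JSON object."""

def parse_actions(raw_llm_output: str) -> list[dict]:
    """Parse the model's JSON reply into a validated list of action dicts."""
    payload = json.loads(raw_llm_output)
    actions = payload.get("actions", [])
    for action in actions:
        if action.get("intent") not in VALID_INTENTS:
            raise ValueError(f"Unknown intent: {action.get('intent')!r}")
    return actions
```

Validating against a fixed intent set means a malformed or hallucinated reply fails loudly at the parsing stage instead of reaching the file system.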

3. The Tool Executors (tools.py & audio.py):

Secured Python functions that hold the tool-specific logic and isolate all file operations to a safe .output/ sandbox.

3. Model Selection

I relied entirely on the Groq LPU Engine for this deployment:

Reasoning / Routing: llama-3.3-70b-versatile

I needed a model that would not break the JSON structure. The 70-billion-parameter model adheres to the schema far more reliably, even when navigating compound tool queries; using an 8B model occasionally resulted in hallucinated parameters.

Transcription: whisper-large-v3-turbo

Extremely resilient to different mic hardware setups and exceptionally fast.

4. The Speech-to-Text (STT) and Text-to-Speech (TTS) Flow

  • Used Streamlit’s native st.audio_input to capture .wav buffers directly from the user, rather than standing up complex WebRTC/WebSocket streams like LiveKit.

STT:

These bytes go through Groq’s hosted whisper-large-v3-turbo model.
The result is near-instantaneous transcription without consuming heavy local GPU resources.

TTS:

Once the LLM generates the response or action confirmation, we pipe the output string through gTTS (Google Text-to-Speech) and dynamically update a Streamlit st.audio component in the UI to read the status back to the user.

5. Bypassing LangChain for JSON-Array Routing

Most AI developers tend to reach for LangChain's @tool decorator.

However, for a set of 7 localized tools, we found that simply prompting a highly capable model to return an array of JSON action objects was significantly faster.

Because the system allows "Compound Commands" (e.g. "Create a file called index.html and then summarize this text"), we instructed the LLM to return an array:

"actions": [
    {"intent": "create_file", "filename": "index.html"},
    {"intent": "summarize_text", "text_to_summarize": "..."} // Sequential independent execution
]
Enter fullscreen mode Exit fullscreen mode

By isolating the execution loop in native Python if/elif statements, we achieved flawless multi-level orchestration without the latency of an external framework.
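That execution loop can be sketched in a few lines. The handler functions below are stand-ins for the real implementations in tools.py, and the intent names are assumptions carried over from the routing example:

```python
# Placeholder handlers -- the real logic lives in tools.py.
def create_file(filename):
    return f"created {filename}"

def summarize_text(text_to_summarize):
    return f"summary of {len(text_to_summarize)} chars"

def execute_actions(actions: list[dict]) -> list[str]:
    """Run each routed action sequentially with plain if/elif branching."""
    results = []
    for action in actions:
        intent = action["intent"]
        if intent == "create_file":
            results.append(create_file(action["filename"]))
        elif intent == "summarize_text":
            results.append(summarize_text(action["text_to_summarize"]))
        else:
            results.append(f"unsupported intent: {intent}")
    return results
```

Because each branch is just a Python function call, compound commands fall out for free: the loop simply walks the array in order.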

6. The Safety Sandbox (Directory Isolation)

I used Directory Isolation to prevent accidental deletions or modifications of my core project files.

Giving an AI agent access to my terminal is like giving a toddler a chainsaw: it is dangerous. To prevent the agent from accidentally deleting System32 or messing with my source code, I built a Restricted Execution Environment.
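The core of such an environment is a path check that every tool must pass through. This is a minimal sketch, assuming the sandbox directory is named output/ and a helper called safe_path (both names are illustrative):

```python
from pathlib import Path

# The only directory the agent is allowed to touch (illustrative name).
SANDBOX = Path("output").resolve()

def safe_path(filename: str) -> Path:
    """Resolve a user-supplied filename inside the sandbox, rejecting escapes."""
    candidate = (SANDBOX / filename).resolve()
    # Path.is_relative_to (Python 3.9+) blocks ../ traversal and absolute paths.
    if not candidate.is_relative_to(SANDBOX):
        raise PermissionError(f"{filename!r} escapes the sandbox")
    return candidate
```

Every tool in tools.py can then call this helper before touching the disk, so even a hallucinated path like `../../etc/passwd` is rejected before any I/O happens.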

7. Challenges faced and resolved

Challenge 1: The Hallucination of File Extensions (Fuzzy Context)

The Problem:

When a user says "Delete the dummy file," LLMs often hallucinate extensions, confidently guessing dummy.py even if the file is just named dummy.

The Solution:

I implemented Dynamic Context Injection: the backend silently crawls the /output folder and injects the actual file list into the hidden system prompt. This gives the agent "omnipresence" over the sandbox, letting the LLM string-match the user's spoken phrasing directly to real files on disk.
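A lightweight version of this can be built from the standard library alone. The function names below are hypothetical, and the fuzzy matching via difflib is one possible way to resolve spoken names to real files, not necessarily the exact approach used in the project:

```python
import difflib
from pathlib import Path

def build_system_prompt(base_prompt: str, output_dir: str = "output") -> str:
    """Inject the real file listing into the hidden system prompt."""
    folder = Path(output_dir)
    files = sorted(p.name for p in folder.glob("*")) if folder.exists() else []
    return base_prompt + "\nFiles currently on disk: " + ", ".join(files)

def match_file(spoken_name: str, files: list):
    """Map the user's spoken file name to a real file, ignoring extensions."""
    if spoken_name in files:          # exact match wins outright
        return spoken_name
    # Match against extension-less stems so "dummy" can resolve to "dummy.py".
    by_stem = {Path(f).stem: f for f in files}
    hits = difflib.get_close_matches(spoken_name, list(by_stem) + files,
                                     n=1, cutoff=0.6)
    return by_stem.get(hits[0], hits[0]) if hits else None
```

With the file list in the prompt and a deterministic matcher as a fallback, the model no longer has to guess extensions it cannot see.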

Challenge 2: Human in the Loop (Destructive Interception)

The Problem:

Granting an AI direct permission to rename or delete files is inherently reckless and dangerous for local file systems.

The Solution:

I decoupled the intent engine from immediate execution by intercepting "dangerous" actions (like delete_file) and returning a pending_action payload. This halts execution, displays the raw JSON plan in the UI, and requires the user to click ✅ Approve or ❌ Reject, ensuring full transparency before any changes touch the disk.
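The interception itself is a small gate in front of the dispatcher. This sketch assumes a set of destructive intent names and a gate_action helper (both illustrative); in the real app the pending payload would be held in Streamlit session state until a button is clicked:

```python
# Intents that must never run without explicit user approval (assumed names).
DESTRUCTIVE_INTENTS = {"delete_file", "rename_file"}

def gate_action(action: dict) -> dict:
    """Intercept destructive intents and defer them for user approval."""
    if action["intent"] in DESTRUCTIVE_INTENTS:
        # The UI renders Approve / Reject buttons for this pending payload
        # (e.g. stored in st.session_state until the user decides).
        return {"status": "pending_action", "plan": action}
    return {"status": "execute", "plan": action}
```

Only actions that come back with status "execute" flow into the if/elif loop; everything else waits for a human click.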

8. Conclusion

By refusing to rely on bloated frameworks for internal routing, this architecture shows that pure context injection and strict JSON prompting are entirely sufficient for handling localized, secure, multimodal tasks.
