DEV Community

Suraj Kaushik

Building VoiceAgent: From Speech to Safe Action

Introduction

Voice interfaces feel natural to humans, but the software that acts on spoken commands requires structure, validation, and control.

VoiceAgent was built to bridge that gap — a system that takes voice input, understands intent, and executes actions safely.

This article focuses on the architecture, design choices, and challenges behind building the system.


System Architecture

The system follows a structured pipeline:

Voice → Text → Intent → Validation → Approval → Action

Each stage plays a critical role in ensuring both functionality and safety.
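
As a rough illustration of how the stages chain together, here is a minimal sketch. Every name in it (Request, detect_intent, validate, run_pipeline) and all of the stub logic are hypothetical stand-ins rather than VoiceAgent's actual code, and the speech-to-text stage is left out so the example starts from text. The point is only that any stage can stop the pipeline before an action runs.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Request:
    """State carried through the pipeline (hypothetical structure)."""
    text: str = ""
    intent: str = ""
    params: Optional[dict] = None
    error: Optional[str] = None


def detect_intent(req: Request) -> Request:
    # Stub: a real system would call a language model here.
    if req.text.lower().startswith("create file"):
        req.intent = "create_file"
        req.params = {"filename": req.text.split()[-1]}
    else:
        req.intent = "chat"
        req.params = {}
    return req


def validate(req: Request) -> Request:
    # Stub: reject anything that looks like a path rather than a bare name.
    name = (req.params or {}).get("filename", "")
    if "/" in name or ".." in name:
        req.error = "invalid filename"
    return req


def run_pipeline(text: str, approve: Callable[[Request], bool]) -> str:
    req = Request(text=text)
    for stage in (detect_intent, validate):
        req = stage(req)
        if req.error:
            return f"rejected: {req.error}"
    # Only non-chat intents need the user's approval.
    if req.intent != "chat" and not approve(req):
        return "cancelled by user"
    return f"executed: {req.intent}"
```

Passing the approval step in as a function keeps the pipeline testable without a live prompt.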


1. Speech-to-Text (Whisper)

For transcription, I used a local Whisper model.

Why Whisper?

  • High accuracy for speech recognition
  • Works offline (no API dependency)
  • No per-request cost, since inference runs locally

Key Consideration

Handling audio input required:

  • Converting audio to float32 format
  • Normalizing amplitude
  • Resampling to 16 kHz for consistent input
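
The three steps above can be sketched as follows. This is a dependency-free illustration, not the project's code: it assumes the input is a sequence of 16-bit PCM sample values, treats "normalizing amplitude" as peak normalization, and resamples with plain linear interpolation. The function name preprocess is hypothetical, and a real pipeline would more likely use NumPy and a proper resampling library with anti-aliasing.

```python
from array import array


def preprocess(samples, src_rate, dst_rate=16_000):
    """Convert 16-bit PCM samples to normalized float32 at dst_rate."""
    # 1. Convert int16 values to floats in [-1.0, 1.0).
    floats = [s / 32768.0 for s in samples]

    # 2. Peak-normalize so the loudest sample has magnitude 1.0.
    peak = max((abs(x) for x in floats), default=0.0)
    if peak > 0:
        floats = [x / peak for x in floats]

    # 3. Resample with linear interpolation (simple, no anti-aliasing filter).
    if src_rate != dst_rate and floats:
        n_out = max(1, round(len(floats) * dst_rate / src_rate))
        step = src_rate / dst_rate
        resampled = []
        for i in range(n_out):
            pos = i * step
            lo = min(int(pos), len(floats) - 1)
            hi = min(lo + 1, len(floats) - 1)
            frac = pos - lo
            resampled.append(floats[lo] * (1 - frac) + floats[hi] * frac)
        floats = resampled

    # array typecode "f" stores single-precision (float32) values.
    return array("f", floats)
```

For example, one second of 48 kHz audio passed through preprocess(samples, 48_000) comes back as 16,000 float32 samples.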

2. Intent Detection (Groq + LLM)

Once the transcript is ready, it is passed to a language model via Groq.

Why Groq?

  • Fast inference speed
  • Free tier available
  • Reliable for structured prompting

Approach

Instead of free-form output, I enforced structured JSON responses:

{
  "intent": "...",
  "params": {...},
  "reasoning": "..."
}

This ensured:

  • Predictability
  • Easier parsing
  • Better control over execution
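
To show why a fixed schema makes parsing and control easier, here is a minimal sketch of a strict parser for that response shape. The intent names in ALLOWED_INTENTS are hypothetical (the article does not list the real ones), as are IntentResult and parse_intent.

```python
import json
from dataclasses import dataclass

# Hypothetical intent names; the source does not list the real ones.
ALLOWED_INTENTS = {"create_file", "write_code", "chat"}


@dataclass
class IntentResult:
    intent: str
    params: dict
    reasoning: str


def parse_intent(raw: str) -> IntentResult:
    """Parse the model's reply and reject anything outside the schema."""
    # json.loads raises json.JSONDecodeError (a ValueError) on malformed input.
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")

    intent = data.get("intent")
    params = data.get("params", {})
    reasoning = data.get("reasoning", "")

    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    if not isinstance(params, dict):
        raise ValueError("params must be an object")
    if not isinstance(reasoning, str):
        raise ValueError("reasoning must be a string")

    return IntentResult(intent, params, reasoning)
```

Because unknown intents are rejected here, the code that runs actions only ever sees values from a known, closed set.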

3. Validation Layer

Before executing any action, the system performs strict validation:

  • Filename sanitization
  • Allowed file extensions only
  • File size limits
  • Prevention of overwriting existing files

This layer ensures that the system remains safe and controlled.
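
The four checks could look roughly like this. It is a sketch under stated assumptions: the allowed extensions and the size limit are made-up values, and sanitize_filename and validate_write are hypothetical names, not functions from the project.

```python
import re
from pathlib import Path

# Hypothetical limits; the source does not state the real values.
ALLOWED_EXTENSIONS = {".txt", ".md", ".py", ".json"}
MAX_BYTES = 100_000


def sanitize_filename(name: str) -> str:
    """Drop directory parts and replace unusual characters with underscores."""
    base = name.replace("\\", "/").split("/")[-1]
    base = re.sub(r"[^A-Za-z0-9._-]", "_", base).lstrip(".")
    if not base:
        raise ValueError("empty filename after sanitization")
    return base


def validate_write(name: str, content: str, out_dir: Path) -> Path:
    """Return the target path if every check passes, else raise ValueError."""
    safe = sanitize_filename(name)
    if Path(safe).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {safe}")
    if len(content.encode("utf-8")) > MAX_BYTES:
        raise ValueError("content exceeds size limit")
    target = out_dir / safe
    if target.exists():
        raise ValueError(f"refusing to overwrite {safe}")
    return target
```

Raising on the first failed check keeps the caller simple: either it gets a path that is safe to write, or it gets a reason to show the user.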


4. Human-in-the-Loop

For file-related actions, execution is not automatic.

The system pauses and asks for user confirmation.

This prevents unintended or harmful actions and adds an extra safety layer.
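
A confirmation gate of this kind can be very small. The sketch below is one possible shape, assuming a text prompt; confirm and run_with_approval are hypothetical names, and the real project may collect approval through a UI instead.

```python
from typing import Callable


def confirm(description: str, ask: Callable[[str], str] = input) -> bool:
    """Return True only on an explicit yes; anything else counts as a refusal."""
    answer = ask(f"About to {description}. Proceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


def run_with_approval(description: str, action: Callable[[], str],
                      ask: Callable[[str], str] = input) -> str:
    """Run `action` only after the user approves it."""
    if not confirm(description, ask):
        return "cancelled"
    return action()
```

Defaulting to "no" on empty or unrecognized input means a stray keypress cannot approve an action by accident.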


5. Execution Engine

Once approved, the system executes the action:

  • File creation
  • Code writing
  • Text responses

All file operations are restricted to a local output/ directory.
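
One common way to enforce a directory restriction like this is to resolve the final path and check that it still sits under the output root. The sketch below assumes that approach; resolve_inside and create_file are hypothetical names, and the project may implement the restriction differently.

```python
from pathlib import Path

OUTPUT_DIR = Path("output")


def resolve_inside(filename: str, root: Path = OUTPUT_DIR) -> Path:
    """Resolve filename under root and reject paths that escape it."""
    root = root.resolve()
    target = (root / filename).resolve()
    try:
        target.relative_to(root)  # raises ValueError if target is outside root
    except ValueError:
        raise ValueError(f"path escapes {root.name}/: {filename}") from None
    if target == root:
        raise ValueError("no filename given")
    return target


def create_file(filename: str, content: str, root: Path = OUTPUT_DIR) -> Path:
    target = resolve_inside(filename, root)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" fails with FileExistsError instead of overwriting.
    with open(target, "x", encoding="utf-8") as fh:
        fh.write(content)
    return target
```

Resolving before checking matters: a name such as ../notes.txt looks harmless as a string but points outside the directory once resolved.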


Challenges Faced

1. Audio Handling

Handling both microphone input and file uploads required a unified processing pipeline. Different formats and sampling rates had to be normalized.


2. Transcription Noise

Speech models can produce unexpected output, such as hallucinated or repeated phrases, when the audio is unclear. This was addressed by normalizing the audio and using controlled inference settings.
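
The article does not say which settings were used, so the following is only a plausible example of "controlled inference settings", assuming the openai-whisper package and English input. WHISPER_OPTIONS is a hypothetical name, and the values are illustrative rather than tuned.

```python
# Assumed settings for openai-whisper's transcribe(); not taken from the project.
WHISPER_OPTIONS = {
    "language": "en",                     # skip automatic language detection
    "temperature": 0.0,                   # greedy decoding; a single value disables temperature fallback
    "condition_on_previous_text": False,  # do not feed the previous segment's text back as a prompt
    "no_speech_threshold": 0.6,           # segments with a higher no-speech probability may be skipped
    "fp16": False,                        # decode in float32 (fp16 is not supported on CPU)
}

# Usage (requires `pip install openai-whisper`):
#   import whisper
#   model = whisper.load_model("base")
#   result = model.transcribe(audio, **WHISPER_OPTIONS)
#   text = result["text"].strip()
```

Keeping the options in one dictionary makes it easy to apply the same settings to microphone input and uploaded files.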


3. Safe Execution

Allowing an AI system to create files introduces risk. The solution was a combination of:

  • Validation
  • Restricted directories
  • User confirmation

4. Structured LLM Output

Ensuring consistent JSON output from the model required careful prompt design and fallback handling.
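
Fallback handling for this kind of output often means two things: recovering JSON that the model wrapped in extra prose, and defaulting to a harmless intent when nothing usable comes back. The sketch below shows that idea; the "chat" default and the names fallback and extract_json are assumptions, not the project's actual logic.

```python
import json


def fallback() -> dict:
    # Hypothetical safe default: treat the reply as plain conversation.
    return {"intent": "chat", "params": {}, "reasoning": "fallback: unparseable model output"}


def extract_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from a model reply."""
    candidates = [raw]
    # Models sometimes wrap JSON in prose; also try the outermost braces.
    start, end = raw.find("{"), raw.rfind("}")
    if 0 <= start < end:
        candidates.append(raw[start:end + 1])

    for text in candidates:
        try:
            data = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict) and isinstance(data.get("intent"), str):
            data.setdefault("params", {})
            data.setdefault("reasoning", "")
            return data
    return fallback()
```

Falling back to a conversational intent is the conservative choice here, because it never touches the file system.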


Key Design Decisions

  • Use local Whisper to avoid API costs and enable offline capability
  • Use Groq for fast and efficient inference
  • Enforce structured JSON output for reliability
  • Add human confirmation for safety
  • Restrict file operations to a single local output/ directory

Conclusion

VoiceAgent is not just about converting speech to text.

It is about building a system that:

  • Understands
  • Validates
  • Executes

— all while keeping the user in control.

This project highlights that in AI systems, safety and structure are just as important as intelligence.


Links

GitHub: https://github.com/Suraj308/VoiceAgent
Demo Video: https://youtu.be/gGnH3v7BVdQ
