Suraj Kaushik

Posted on Apr 13

Building VoiceAgent: From Speech to Safe Action

#agents #ai #machinelearning #architecture

Introduction

Voice interfaces feel natural to humans, but systems require structure, validation, and control.

VoiceAgent was built to bridge that gap: it takes voice input, understands intent, and executes actions safely.

This article focuses on the architecture, design choices, and challenges behind building the system.

System Architecture

The system follows a structured pipeline:

Voice → Text → Intent → Validation → Approval → Action

Each stage plays a critical role in ensuring both functionality and safety.
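As a rough illustration of how these stages fit together, here is a minimal sketch. The function name `run_pipeline` and the idea of passing each stage in as a callable are assumptions made for the example, not the project's actual code. The point it shows is that validation and approval both sit in front of execution, so a failure at either gate stops the action.

```python
from typing import Any, Callable, Dict


def run_pipeline(
    audio: Any,
    transcribe: Callable[[Any], str],
    detect_intent: Callable[[str], Dict[str, Any]],
    validate: Callable[[Dict[str, Any]], None],
    approve: Callable[[Dict[str, Any]], bool],
    execute: Callable[[Dict[str, Any]], str],
) -> str:
    """Run the stages in order (hypothetical orchestration, for illustration only)."""
    text = transcribe(audio)            # Voice -> Text
    intent = detect_intent(text)        # Text -> Intent
    try:
        validate(intent)                # Intent -> Validation (raises ValueError on failure)
    except ValueError as err:
        return f"rejected: {err}"
    if not approve(intent):             # Validation -> Approval
        return "cancelled by user"
    return execute(intent)              # Approval -> Action
```

Because every stage is injected, each one can be swapped for a stub when testing the others.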

1. Speech-to-Text (Whisper)

For transcription, I used a local Whisper model.

Why Whisper?

High accuracy for speech recognition
Works offline (no API dependency)
No per-request API cost

Key Consideration

Handling audio input required:

Converting audio to float32 format
Normalizing amplitude
Resampling to 16 kHz for consistent input
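The three steps above can be sketched as follows. This is a dependency-free illustration, assuming 16-bit PCM integer samples as input and using plain linear interpolation for resampling. The name `prepare_audio` is hypothetical, and a real pipeline would more likely use NumPy and a proper resampling filter.

```python
from typing import List, Sequence

TARGET_RATE = 16_000  # Whisper models expect 16 kHz audio


def prepare_audio(samples: Sequence[int], source_rate: int) -> List[float]:
    """Convert int16 PCM samples to peak-normalized floats at 16 kHz (illustrative)."""
    if not samples:
        return []

    # 1. Convert 16-bit integers to floats in [-1.0, 1.0).
    floats = [s / 32768.0 for s in samples]

    # 2. Peak-normalize so the loudest sample has magnitude 1.0.
    peak = max(abs(x) for x in floats)
    if peak > 0:
        floats = [x / peak for x in floats]

    # 3. Resample to 16 kHz with linear interpolation (assumption: fine for a sketch).
    if source_rate == TARGET_RATE:
        return floats
    out_len = max(1, round(len(floats) * TARGET_RATE / source_rate))
    step = source_rate / TARGET_RATE
    last = len(floats) - 1
    resampled = []
    for i in range(out_len):
        pos = i * step
        left = min(int(pos), last)
        right = min(left + 1, last)
        frac = pos - int(pos)
        resampled.append(floats[left] * (1 - frac) + floats[right] * frac)
    return resampled
```

Linear interpolation does no anti-aliasing, so it is only a stand-in here for whatever resampler the real system uses.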

2. Intent Detection (Groq + LLM)

Once the transcript is ready, it is passed to a language model served by Groq.

Why Groq?

Fast inference speed
Free tier available
Reliable for structured prompting

Approach

Instead of free-form output, I enforced structured JSON responses:

{
  "intent": "...",
  "params": {...},
  "reasoning": "..."
}

This ensured:

Predictability
Easier parsing
Better control over execution
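A structured response only helps if the code checks it before trusting it. Here is a minimal sketch of that check, assuming the three fields shown above. The intent names in `ALLOWED_INTENTS` and the function name `parse_intent` are invented for the example and may not match the project.

```python
import json
from typing import Any, Dict

# Assumption: these intent names are illustrative, not the project's actual list.
ALLOWED_INTENTS = {"create_file", "write_code", "chat"}


def parse_intent(raw: str) -> Dict[str, Any]:
    """Parse the model's reply and check that it has the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")

    intent = data.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")

    params = data.get("params", {})
    if not isinstance(params, dict):
        raise ValueError("params must be a JSON object")

    return {
        "intent": intent,
        "params": params,
        "reasoning": str(data.get("reasoning", "")),
    }
```

Rejecting unknown intents up front means the execution code only ever sees values it was written to handle.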

3. Validation Layer

Before executing any action, the system performs strict validation:

Filename sanitization
Allowed file extensions only
File size limits
Prevention of overwriting existing files

This layer rejects unsafe requests before anything touches the filesystem.
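The four checks listed above might look roughly like this. It is a sketch under stated assumptions: the extension whitelist, the size cap, and the name `validate_write` are all illustrative, and the real project may use different values and rules.

```python
import re
from pathlib import Path

# Assumptions: the whitelist and size cap below are illustrative values.
ALLOWED_EXTENSIONS = {".txt", ".md", ".py", ".json"}
MAX_BYTES = 100_000


def validate_write(filename: str, content: str, output_dir: Path) -> Path:
    """Return a safe target path inside output_dir, or raise ValueError."""
    # Filename sanitization: keep only the last path component,
    # replace unexpected characters, and strip leading dots.
    name = Path(filename.replace("\\", "/")).name
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name).lstrip(".")
    if not name:
        raise ValueError("empty filename after sanitization")

    # Allowed file extensions only.
    if Path(name).suffix.lower() not in ALLOWED_EXTENSIONS:
        raise ValueError(f"extension not allowed: {name}")

    # File size limit, measured in encoded bytes.
    if len(content.encode("utf-8")) > MAX_BYTES:
        raise ValueError("content exceeds size limit")

    # Never overwrite an existing file.
    target = output_dir / name
    if target.exists():
        raise ValueError(f"refusing to overwrite {name}")
    return target
```

Dropping directory components first means a request like `../../notes.txt` becomes plain `notes.txt` instead of escaping the output folder.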

4. Human-in-the-Loop

For file-related actions, execution is not automatic.

The system pauses and asks for user confirmation.

This prevents unintended or harmful actions and adds an extra safety layer.
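A confirmation gate can be very small. The sketch below assumes a terminal prompt; the actual project may ask through a UI instead, and the name `confirm_action` is hypothetical. The design choice worth noting is the default: anything other than an explicit yes counts as a no.

```python
from typing import Callable


def confirm_action(description: str, ask: Callable[[str], str] = input) -> bool:
    """Ask the user to approve an action; only an explicit yes returns True."""
    answer = ask(f"About to {description}. Proceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing the prompt function in as `ask` keeps the gate testable without a real keyboard.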

5. Execution Engine

Once approved, the system executes the action:

File creation
Code writing
Text responses

All operations are restricted to a local output/ directory.
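One common way to enforce a directory restriction like this is to resolve the final path and check that it still sits under the allowed root. The following is a sketch of that technique, not the project's code; `resolve_in_output` and `write_file` are hypothetical names.

```python
from pathlib import Path


def resolve_in_output(relative_name: str, output_dir: Path) -> Path:
    """Resolve a path and make sure it stays strictly inside output_dir."""
    root = output_dir.resolve()
    target = (root / relative_name).resolve()
    # After resolving "..", symlinks, and absolute paths, root must be an ancestor.
    if root not in target.parents:
        raise PermissionError(f"{relative_name!r} escapes {root}")
    return target


def write_file(relative_name: str, content: str, output_dir: Path = Path("output")) -> Path:
    """Create a new file under output_dir (illustrative helper)."""
    target = resolve_in_output(relative_name, output_dir)
    target.parent.mkdir(parents=True, exist_ok=True)
    # Mode "x" raises FileExistsError rather than overwriting an existing file.
    with open(target, "x", encoding="utf-8") as fh:
        fh.write(content)
    return target
```

Checking the resolved path rather than the raw string is what catches inputs such as `../evil.txt` or an absolute path.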

Challenges Faced

1. Audio Handling

Handling both microphone input and file uploads required a unified processing pipeline. Different formats and sampling rates had to be normalized.

2. Transcription Noise

Speech models can produce unexpected outputs when audio is unclear. This was addressed using normalization and controlled inference settings.

3. Safe Execution

Allowing an AI system to create files introduces risk. The solution was a combination of:

Validation
Restricted directories
User confirmation

4. Structured LLM Output

Ensuring consistent JSON output from the model required careful prompt design and fallback handling.
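Fallback handling could take a form like the sketch below: try a strict parse first, then look for a JSON object buried in surrounding text, and finally fall back to a harmless default. Falling back to a "chat" intent is an assumption made for the example, as are the names `fallback_intent` and `extract_json`.

```python
import json
from typing import Any, Dict


def fallback_intent() -> Dict[str, Any]:
    """Assumed policy: an unparseable reply becomes a harmless chat intent."""
    return {"intent": "chat", "params": {}, "reasoning": "could not parse model output"}


def extract_json(raw: str) -> Dict[str, Any]:
    """Best-effort extraction of a JSON object from a model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Models sometimes wrap the JSON in prose; try the outermost braces.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return fallback_intent()
        try:
            data = json.loads(raw[start:end + 1])
        except json.JSONDecodeError:
            return fallback_intent()
    # A valid JSON value that is not an object (list, string, number) is also rejected.
    return data if isinstance(data, dict) else fallback_intent()
```

The fallback never triggers a file action, so a malformed reply degrades to a text response instead of an unsafe operation.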

Key Design Decisions

Use local Whisper to avoid API costs and enable offline capability
Use Groq for fast and efficient inference
Enforce structured JSON output for reliability
Add human confirmation for safety
Restrict execution to a single local output/ directory

Conclusion

VoiceAgent is not just about converting speech to text.

It is about building a system that:

Understands
Validates
Executes

It does all of this while keeping the user in control.

This project highlights that in AI systems, safety and structure are just as important as intelligence.

Links

GitHub: https://github.com/Suraj308/VoiceAgent
Demo Video: https://youtu.be/gGnH3v7BVdQ
