Shagun Dubey
Creating an Offline AI Voice Agent Using Whisper and Ollama

Introduction

Voice-based artificial intelligence (AI) systems have become a common feature of modern devices. For this project, I built an AI Voice Agent that listens to a user's speech, infers their intent, and carries out smart operations such as writing code, summarizing text, and creating files.

A core goal of this work was to make the agent run fully offline, without relying on paid APIs.

System Architecture

The architecture consists of five stages:

Voice Input → Speech-to-Text → Intent Detection → Action Execution → Output

Voice Input

Voice input arrives either as a live recording or as an uploaded audio file, through a web interface built with the Streamlit framework.

Speech-to-Text (STT)

The captured audio is transcribed by OpenAI's Whisper model, which converts speech to text. FFmpeg handles audio format conversion.

Intent Detection

Intent detection uses a rule-based system. A single command can carry several intents at once (e.g., “Create a file and generate some code”).
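A rule-based detector in this spirit can be as simple as a keyword table that maps a transcript to a list of intents, so one command can trigger several actions; the keyword rules below are illustrative, not the project's exact ones:

```python
INTENT_KEYWORDS = {
    "create_file": ["create a file", "make a file", "new file"],
    "generate_code": ["generate", "write code", "code for"],
    "summarize": ["summarize", "summary"],
}

def detect_intents(text: str) -> list[str]:
    text = text.lower()
    intents = [
        intent
        for intent, keywords in INTENT_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]
    # Fall back to plain chat when no rule matches.
    return intents or ["chat"]

print(detect_intents("Create a file and generate some code"))
# → ['create_file', 'generate_code']
```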

Action Execution

Based on the detected intents, the system executes actions such as:

  • Creating files
  • Generating Python code
  • Summarizing and explaining text
  • Chatting
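One way to wire detected intents to these actions is a dispatch table, so new actions can be added without touching the control flow; the handler bodies here are placeholders for the real implementations:

```python
def create_file(command): return f"[file created for: {command}]"
def generate_code(command): return f"[code generated for: {command}]"
def summarize(command): return f"[summary of: {command}]"
def chat(command): return f"[chat reply to: {command}]"

ACTIONS = {
    "create_file": create_file,
    "generate_code": generate_code,
    "summarize": summarize,
    "chat": chat,
}

def execute(intents, command):
    # Run every detected intent in order and collect the results.
    return [ACTIONS[i](command) for i in intents]
```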

Local LLM Integration

To execute the tasks above, the system uses Ollama to run the llama3 large language model locally. No external API is necessary.
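Ollama exposes a local HTTP API (by default on localhost:11434), so the integration can be sketched with nothing but the standard library; the prompt handling here is a minimal example, not the project's exact code:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "llama3") -> dict:
    # /api/generate accepts a JSON body; stream=False returns one response.
    return {"model": model, "prompt": prompt, "stream": False}

def query_ollama(prompt: str, model: str = "llama3") -> str:
    # Requires a running `ollama serve` with the model pulled locally.
    payload = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```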

Output Layer

Results are shown to users in the UI and, when needed, saved to files.

Technologies Used

  • Python - Programming language
  • Streamlit - UI design
  • Whisper - Speech-to-text conversion
  • Ollama (llama3) - Local LLM for code generation and explanations
  • FFmpeg - Audio file preprocessing

Key Functionalities

  • Voice Command (Recording/Uploading files)
  • Multiple Intent Detection & Execution
  • Offline AI functionality by leveraging a local LLM
  • Code Generation
  • Confirmation Method before performing File Operations
  • Session History
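The confirmation step listed above can be reduced to a small gate: destructive file operations run only after an explicit yes. In the Streamlit UI this would be a button; it is sketched here as a plain callable for clarity:

```python
def confirm_and_run(description, action, confirmer=input):
    # Ask before performing the file operation; default answer is "no".
    answer = confirmer(f"About to {description}. Proceed? [y/N] ")
    if answer.strip().lower() in ("y", "yes"):
        return action()
    return None
```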

What Makes a Local LLM Better?

Instead of relying on third-party services to run models, this project uses Ollama. This approach has several advantages:

  • No API fees
  • Works offline
  • Enhanced data security
  • Better control

Challenges Encountered

  • Issues With Audio

Early transcription attempts failed because of audio formatting issues; recordings had to be converted to a consistent format before being passed to Whisper.
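One common fix for errors like these is to normalize every recording to 16 kHz mono WAV, the input Whisper works best with, before transcription; this sketch assumes ffmpeg is installed and on PATH:

```python
import subprocess

def ffmpeg_args(src: str, dst: str) -> list[str]:
    # Resample to 16 kHz, downmix to mono, overwrite the output if present.
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def normalize_audio(src: str, dst: str = "normalized.wav") -> str:
    subprocess.run(ffmpeg_args(src, dst), check=True, capture_output=True)
    return dst
```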

  • Incompatible Dependencies and Versions

Incompatible versions of dependencies such as NumPy and pandas caused repeated installation and runtime problems.

  • Uploading Files via Git

Pushing to GitHub initially failed because large files from the virtual environment were being tracked; adding a proper .gitignore solved the problem.
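A minimal .gitignore along these lines keeps bulky, regenerable artifacts out of the repository (the exact entries are illustrative, not the project's actual file):

```gitignore
# Virtual environment and caches
venv/
__pycache__/

# Generated audio artifacts
*.wav
*.mp3
```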

  • Connecting Local LLM

Moving to a local LLM meant setting up Ollama and communicating with it over its local HTTP API.

Learning Points

  • Properly managing Python environments
  • Building a real-world audio processing pipeline
  • Integrating local AI models into applications
  • Designing modular, scalable AI systems

Conclusion

This project shows how to design and build an end-to-end AI solution that runs entirely locally, with no paid APIs. By combining speech recognition, intent detection, and a local LLM, the system handles voice commands effectively.

Project Repository

GitHub Project Repository: Click here to view project
