Shagun Dubey
Creating an Offline AI Voice Agent Using Whisper and Ollama

Introduction

Voice-based artificial intelligence (AI) systems have become a common feature of modern devices. For this project, I built an AI Voice Agent that listens to a user's speech, infers their intent, and carries out smart operations such as writing code, summarizing text, and creating files.

A core goal of this work was to make the agent run fully offline, without relying on paid APIs.

System Architecture

The architecture consists of five stages:

Voice Input → Speech-to-Text → Intent Detection → Action Execution → Output

Voice Input

Voice input arrives either as a live recording or as an uploaded audio file, through a web interface built with the Streamlit framework.

Speech-to-Text (STT)

The captured audio is transcribed by OpenAI's Whisper model, which converts speech to text. FFmpeg handles audio format conversion.

Intent Detection

Intent detection uses a rule-based system. A single command can carry several intents at once (e.g., “Create a file and generate some code”).
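A rule-based detector in this spirit can be as simple as a keyword table that maps a transcript to a list of intents, so one command can trigger several actions; the keyword rules below are illustrative, not the project's exact ones:

```python
INTENT_KEYWORDS = {
    "create_file": ["create a file", "make a file", "new file"],
    "generate_code": ["generate", "write code", "code for"],
    "summarize": ["summarize", "summary"],
}

def detect_intents(text: str) -> list[str]:
    text = text.lower()
    intents = [
        intent
        for intent, keywords in INTENT_KEYWORDS.items()
        if any(kw in text for kw in keywords)
    ]
    # Fall back to plain chat when no rule matches.
    return intents or ["chat"]

print(detect_intents("Create a file and generate some code"))
# → ['create_file', 'generate_code']
```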

Action Execution

Based on the detected intents, the system executes actions such as:

  • Creating files
  • Generating Python code
  • Summarizing and explaining text
  • Chatting
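One way to wire detected intents to these actions is a dispatch table, so new actions can be added without touching the control flow; the handler bodies here are placeholders for the real implementations:

```python
def create_file(command): return f"[file created for: {command}]"
def generate_code(command): return f"[code generated for: {command}]"
def summarize(command): return f"[summary of: {command}]"
def chat(command): return f"[chat reply to: {command}]"

ACTIONS = {
    "create_file": create_file,
    "generate_code": generate_code,
    "summarize": summarize,
    "chat": chat,
}

def execute(intents, command):
    # Run every detected intent in order and collect the results.
    return [ACTIONS[i](command) for i in intents]
```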

Local LLM Integration

To execute the tasks above, the system uses Ollama to run the llama3 large language model locally. No external API is necessary.
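Ollama exposes a local HTTP API (by default on localhost:11434), so the integration can be sketched with nothing but the standard library; the prompt handling here is a minimal example, not the project's exact code:

```python
import json
import urllib.request

def build_request(prompt: str, model: str = "llama3") -> dict:
    # /api/generate accepts a JSON body; stream=False returns one response.
    return {"model": model, "prompt": prompt, "stream": False}

def query_ollama(prompt: str, model: str = "llama3") -> str:
    # Requires a running `ollama serve` with the model pulled locally.
    payload = json.dumps(build_request(prompt, model)).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```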

Output Layer

Results are shown to users in the UI and, when needed, saved to files.

Technologies Used

  • Python - Programming language
  • Streamlit - UI design
  • Whisper - Speech-to-text conversion
  • Ollama (llama3) - Local LLM for code generation and explanations
  • FFmpeg - Audio file preprocessing

Key Functionalities

  • Voice Command (Recording/Uploading files)
  • Multiple Intent Detection & Execution
  • Offline AI functionality by leveraging a local LLM
  • Code Generation
  • Confirmation Method before performing File Operations
  • Session History
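The confirmation step listed above can be reduced to a small gate: destructive file operations run only after an explicit yes. In the Streamlit UI this would be a button; it is sketched here as a plain callable for clarity:

```python
def confirm_and_run(description, action, confirmer=input):
    # Ask before performing the file operation; default answer is "no".
    answer = confirmer(f"About to {description}. Proceed? [y/N] ")
    if answer.strip().lower() in ("y", "yes"):
        return action()
    return None
```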

What Makes a Local LLM Better?

Instead of relying on third-party services to run models, this project uses Ollama. This approach has several advantages:

  • No API fees
  • Works offline
  • Enhanced data security
  • Better control

Challenges Encountered

  • Issues With Audio

Early transcription attempts failed because of audio formatting issues; recordings had to be converted to a consistent format before being passed to Whisper.
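One common fix for errors like these is to normalize every recording to 16 kHz mono WAV, the input Whisper works best with, before transcription; this sketch assumes ffmpeg is installed and on PATH:

```python
import subprocess

def ffmpeg_args(src: str, dst: str) -> list[str]:
    # Resample to 16 kHz, downmix to mono, overwrite the output if present.
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def normalize_audio(src: str, dst: str = "normalized.wav") -> str:
    subprocess.run(ffmpeg_args(src, dst), check=True, capture_output=True)
    return dst
```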

  • Incompatible Dependencies and Versions

Incompatible versions of dependencies such as NumPy and pandas caused repeated installation and runtime problems.

  • Uploading Files via Git

Pushing to GitHub initially failed because large files from the virtual environment were being tracked; adding a proper .gitignore solved the problem.
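A minimal .gitignore along these lines keeps bulky, regenerable artifacts out of the repository (the exact entries are illustrative, not the project's actual file):

```gitignore
# Virtual environment and caches
venv/
__pycache__/

# Generated audio artifacts
*.wav
*.mp3
```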

  • Connecting Local LLM

Moving to a local LLM meant setting up Ollama and communicating with it over its local HTTP API.

Learning Points

  • Properly managing Python environments
  • Building a real-world audio processing pipeline
  • Integrating local AI models into applications
  • Designing modular, scalable AI systems

Conclusion

This project shows how to design and build an end-to-end AI solution that runs entirely locally, with no paid APIs. By combining speech recognition, intent detection, and a local LLM, the system handles voice commands effectively.

Project Repository

GitHub Project Repository: Click here to view project
