OpenClaw went viral this year because of its simplicity in allowing users to communicate with AI agents running on their own hardware. It wasn’t a new model, a new architecture, or a new agentic protocol. Instead, it demonstrated a new way people could work with AI agents using technology they were already familiar with.
OpenClaw allows users to communicate with their agents through chat apps such as Telegram and WhatsApp. A user can simply pick up their device and send a text message to the AI agent. This ease of use is what attracted many users to OpenClaw.
But can this interaction be simplified even further? Yes, it can. One way to do this is by turning your OpenClaw agent into a voice AI agent. Since OpenClaw already works through chat apps, and most of these apps support voice notes, voice interaction becomes a natural extension of the system.
In this article, we will show you how to set up OpenClaw as a voice AI agent. We will also demonstrate how to bring your own speech-to-text model and integrate it with OpenClaw. The model we will use is Universal-3 Pro, and we will explore how its prompting capabilities can be used to create a more customized voice interaction experience.
OpenClaw in a nutshell
There is a lot of confusion surrounding OpenClaw. Even the name itself creates confusion. So in this section, we will do a quick breakdown of what OpenClaw is and how it works.
What is OpenClaw?
You can think of OpenClaw as a gateway between your chat app and your AI agent.
The chat app can be Telegram, WhatsApp, or Slack. The AI agent can be powered by cloud-based Large Language Models (LLMs) such as those provided by Anthropic or OpenAI, or even by a locally hosted model.
The AI agent also has access to a computer system. This could be your personal computer (though this is not advisable), a Mac Mini, a Raspberry Pi, or a cloud server.
The OpenClaw setup consists of the following:
- The chat application serves as the user interface
- OpenClaw acts as the orchestrator
- The AI agent is the brain
- The computer serves as the universal tool
What makes OpenClaw different?
Agents are not a new concept. Agents that have access to a computer are not new either, and chatting with AI is certainly not new.
However, two things make OpenClaw stand out.
The first is the medium of communication. Unlike many chatbots that require you to use a separate app or a dedicated website, OpenClaw allows you to communicate with your agent through the chat apps you already use.
The second difference is that the OpenClaw agent is more proactive. It is not just another chat session. The agent can maintain memory, send reminders about tasks it is working on, and interact with the computer it has access to.
Since the agent has access to the system, it can perform actions such as reading files, editing files, and running commands. In many ways, OpenClaw feels like giving a personal computer to an AI assistant.
Setting up OpenClaw
When it comes to installing OpenClaw, there are several options to choose from. The easiest is to install it on your personal computer, but this is not advisable: the AI agent gets full control over your machine, and security experts have warned about several vulnerabilities in OpenClaw.
That said, running it on your personal computer is the fastest way to experiment. Another option is to dedicate a computer to OpenClaw, such as a Mac Mini or a Raspberry Pi. You can also run OpenClaw inside a Docker container, so that it is sandboxed.
If you’re on Mac or Linux, you can install OpenClaw with this one-liner:
curl -fsSL https://openclaw.ai/install.sh | bash
Then set it up by running:
openclaw onboard --install-daemon
This will prompt you to configure your model. The --install-daemon flag sets up OpenClaw as a background service, so it runs automatically whenever your device starts.
Once setup is complete, you can confirm everything is running with:
openclaw gateway status
For other installation methods, refer to the official OpenClaw installation guide.
Setting up a channel for communication
When OpenClaw is installed, the next thing you need to do is set up the channel of communication. This is essentially the chat app you wish to use to communicate with OpenClaw from.
All channels support communication via text, but since we are building a voice agent, we want one that also supports other media types, such as audio. Telegram is perfect for this: it offers the easiest setup of all the channels, and you can send voice notes to OpenClaw through it.
Go through the Telegram setup guide in the OpenClaw documentation.
OpenClaw’s media understanding capabilities
OpenClaw’s media understanding capabilities allow it to process more than just text. When it receives a media file, such as an image or audio, it can use one of its model providers to transform it into a format the agent can understand.
For example, if OpenClaw receives a voice note from a channel like Telegram, it will use a speech-to-text (STT) model to convert the audio into text before passing it to the LLM. Similarly, if it receives an image, it can summarize the content using an image model and send that information to the agent. In this article, we are focusing on the audio understanding aspect of OpenClaw.
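To make the routing step concrete, here is a simplified sketch of the idea: incoming media is converted to text before it reaches the LLM. The function names below are hypothetical illustrations, not OpenClaw's actual internals.

```python
# Hypothetical sketch of how an OpenClaw-style gateway routes media.
# These function names are illustrative, not from OpenClaw's codebase.

def transcribe_audio(path: str) -> str:
    # In OpenClaw this would call the configured STT provider.
    return f"[transcript of {path}]"

def describe_image(path: str) -> str:
    # In OpenClaw this would call an image-understanding model.
    return f"[description of {path}]"

def to_agent_text(media_type: str, path: str) -> str:
    """Convert an incoming media file into text the LLM can consume."""
    if media_type == "audio":
        return transcribe_audio(path)
    if media_type == "image":
        return describe_image(path)
    raise ValueError(f"Unsupported media type: {media_type}")

print(to_agent_text("audio", "voice_note.ogg"))
```

The agent itself never sees the raw bytes; it only receives the text produced by this step, which is why swapping the STT model changes what the agent "hears".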
By default, OpenClaw supports a limited set of STT providers, including OpenAI, Mistral's Voxtral, and Deepgram. In this guide, we'll go a step further by integrating a custom STT model, extending OpenClaw beyond its built-in options.
Bring your own speech-to-text model
There are several ways to extend OpenClaw’s capabilities, one of which is via plugins. While writing a plugin to perform media understanding is possible, it is often overkill. OpenClaw already provides a built-in way to extend media understanding using a custom script.
With a custom script, you simply tell OpenClaw that whenever it receives an audio file, it should run the script. The script processes the audio and returns the transcribed text. All the heavy lifting is handled by OpenClaw; you just need to write the script and configure openclaw.json.
Since we get to write the script, we can choose any STT model provider. In this guide, we will use AssemblyAI.
Step 1: Set Up Your Environment
It is best to create a dedicated Python environment first. Then, install the AssemblyAI SDK:
pip install assemblyai
Next, create an AssemblyAI API key and store it in an environment variable:
export ASSEMBLYAI_API_KEY="your_api_key_here"
For global access, it is recommended to add this line to your .bashrc or .zshrc file.
Step 2: Create the Transcription Script
Create a Python file called main.py and add the following:
import argparse
import os
import sys

import assemblyai as aai


def main():
    # 1. Set up the argument parser
    parser = argparse.ArgumentParser(
        description="Transcribe an audio file using AssemblyAI."
    )
    # Add positional argument for the audio file path
    parser.add_argument(
        "audio_file",
        type=str,
        help="Path to the audio file you want to transcribe (e.g., ./voice_note.ogg)",
    )
    # Add optional argument for the API key
    parser.add_argument(
        "--api-key",
        type=str,
        help="Your AssemblyAI API key (can also be set via the ASSEMBLYAI_API_KEY env variable)",
        default=None,
    )
    args = parser.parse_args()

    # 2. Configure the API key
    api_key = args.api_key or os.environ.get("ASSEMBLYAI_API_KEY")
    if not api_key:
        print("Error: API key is missing.")
        print("Please set the ASSEMBLYAI_API_KEY environment variable or pass it via --api-key.")
        sys.exit(1)
    aai.settings.api_key = api_key

    # 3. Configure and run the transcription
    config = aai.TranscriptionConfig(
        speech_models=["universal-3-pro"],
        language_detection=True,
        prompt="Transcribe the audio; make sure to include fillers and stutters in the transcript.",
    )

    print(f"Transcribing '{args.audio_file}'... Please wait.")
    try:
        transcript = aai.Transcriber(config=config).transcribe(args.audio_file)
        if transcript.status == "error":
            raise RuntimeError(f"Transcription failed: {transcript.error}")
        print("\n--- Transcript ---")
        print(transcript.text)
        print("------------------\n")
    except Exception as e:
        print(f"\nAn error occurred: {e}")
        sys.exit(1)


if __name__ == "__main__":
    main()
This script takes the path to an audio file, transcribes it using AssemblyAI, and prints the result.
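If you want to sanity-check the argument handling before wiring the script into OpenClaw, the parsing and API-key fallback logic can be exercised in isolation. The snippet below rebuilds the same parser and feeds it fake command lines; nothing here touches the network.

```python
import argparse
import os

# Rebuild the same parser the script uses.
parser = argparse.ArgumentParser(description="Transcribe an audio file using AssemblyAI.")
parser.add_argument("audio_file", type=str)
parser.add_argument("--api-key", type=str, default=None)

# No --api-key on the command line: the script falls back to the env variable.
os.environ["ASSEMBLYAI_API_KEY"] = "env-key"
args = parser.parse_args(["./voice_note.ogg"])
api_key = args.api_key or os.environ.get("ASSEMBLYAI_API_KEY")
print(api_key)  # env-key

# An explicit --api-key wins over the environment.
args = parser.parse_args(["./voice_note.ogg", "--api-key", "cli-key"])
api_key = args.api_key or os.environ.get("ASSEMBLYAI_API_KEY")
print(api_key)  # cli-key
```

This mirrors the precedence in the script: a `--api-key` argument overrides the `ASSEMBLYAI_API_KEY` environment variable.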
Step 3: Configure OpenClaw
Next, integrate the script with OpenClaw by editing openclaw.json:
"tools":{
"media": {
"audio": {
"enabled": true,
"models": [
{
"type": "cli",
"command": "python",
"args": ["/PATH/TO/SCRIPT/main.py", "{{MediaPath}}"]
}
]
}
}
}
This configuration tells OpenClaw to enable audio understanding, but instead of using a default provider, it will run your custom script.
Tip: If you are using a Python virtual environment, set the command to the full path of your environment’s Python binary. You can find it with:
which python
Step 4: Restart OpenClaw
After setup, restart OpenClaw to apply the changes:
openclaw daemon restart
With this setup, you now have full control over OpenClaw’s audio understanding capabilities.
You can find the complete implementation in the GitHub repository.
Why Use Universal-3 Pro with OpenClaw?
A common question when working with OpenClaw’s media capabilities is: why switch to a different STT model? After all, don’t all STT models just convert speech to text?
The answer is no. Different STT models have different strengths and trade-offs, for example:
- Speed: Some models prioritize fast transcription, making them suitable for real-time applications.
- Accuracy (WER): Others focus on achieving a low Word Error Rate, improving transcription quality.
- Domain specialization: Certain models are optimized for specific areas such as medicine, legal, or customer support.
- Customization: Some models allow fine-tuning or prompting to handle unique names, jargon, or phrases.
- Deployment preference: Developers may prefer local models for privacy, control, or cost reasons.
In this article, we use AssemblyAI’s Universal-3 Pro because of its powerful prompting capabilities. For example, my name is Eteimorde. It is not an English name and rarely appears in standard datasets.
While building my personal voice AI agent with OpenClaw, I noticed that default STT models consistently misheard my name. To solve this, I used Universal-3 Pro’s keyterm prompting feature to explicitly define my name as an important term:
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    language_detection=True,
    keyterms_prompt=["Eteimorde"]
)
Additional Capabilities of Universal-3 Pro via Prompting
Universal-3 Pro provides advanced features that can be easily leveraged through prompting. You can customize the behavior of the model by updating the prompt in the transcription configuration:
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    language_detection=True,
    prompt="YOUR_PROMPT_GOES_HERE"
)
Using prompting, the model can perform the following tasks:
- Verbatim transcription and disfluencies: Preserve natural speech patterns such as filler words, repetitions, and self-corrections.
- Audio event tagging: Mark non-speech sounds like laughter, music, applause, or background noise.
- Crosstalk labeling: Identify overlapping speech, interruptions, and multiple speakers talking at once.
- Numbers and measurements formatting: Control how numbers, percentages, and measurements are represented.
- Context-aware clues: Improve transcription for domain-specific terms, names, and jargon by providing relevant hints in the prompt.
- Speaker attribution: Detect and label different speakers in a conversation.
- PII redaction: Tag personally identifiable information such as names, addresses, and contact details, useful for limiting what the agent can access.
By using prompting, these capabilities allow your OpenClaw voice agent to become more accurate, context-aware, and personalized, going beyond the default transcription behavior.
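Several of these behaviors can be requested in a single prompt. The configuration below is only a sketch: the prompt wording is our own and can be adjusted freely, while `speech_models` and `prompt` follow the same configuration shape shown earlier.

```python
import assemblyai as aai

# A sketch of a combined prompt; adjust the wording to your own needs.
config = aai.TranscriptionConfig(
    speech_models=["universal-3-pro"],
    language_detection=True,
    prompt=(
        "Transcribe verbatim, keeping fillers and self-corrections. "
        "Tag non-speech events such as [laughter] or [music]. "
        "Label each speaker, and replace names and addresses with [PII]."
    ),
)
```

Swapping this config into the transcription script from Step 2 is the only change needed; OpenClaw keeps calling the script the same way.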
Conclusion
OpenClaw makes it easy to run AI agents through chat apps you already use, and adding voice capabilities takes the interaction to a whole new level. By integrating your own speech-to-text models, such as Universal-3 Pro, you unlock features beyond OpenClaw’s built-in media understanding.
Its prompting capabilities allow users to customize how the model transcribes audio, accurately recognize custom keyterms, and leverage features like verbatim transcription to preserve natural speech and audio event tagging to capture non-speech context such as background noise or laughter.
With this setup, your OpenClaw agent behaves more like a true personal assistant. It can remember context, send proactive reminders, and leverage system tools to perform tasks. Voice interaction, combined with Universal-3 Pro’s advanced prompting features, transforms the agent from a simple chat companion into a more robust, seamless, and highly personalized experience.