Zuluana
Building an AI Voice Assistant in 1 Minute (Command Line)

Today, I decided to build an AI Voice Assistant.

My goal was to convert my voice to text, pass it through an LLM, and stream the reply back as audio, all within a few seconds in the macOS Terminal.

I was able to accomplish this quickly with help from GPT-4o.

Setup

We'll build this using 3 OpenAI models:

  1. Whisper: Speech -> Text
  2. GPT: LLM to Process Text
  3. TTS: Text -> Speech

If you don't already have API keys, you can get them here: https://openai.com/api

Before starting, you'll need to export your OpenAI API Key for the commands to work.

export OPENAI_API_KEY=sk-...
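The commands in this post rely on sox, curl, and jq being installed, and on the key being exported in the current shell. As a minimal sketch of a preflight check (missing_tools and require_api_key are made-up helper names, not part of any tool):

```shell
# Hypothetical helper: print each required tool that is not on PATH.
# Returns non-zero if at least one tool is missing.
missing_tools() {
  local tool rc=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      rc=1
    fi
  done
  return "$rc"
}

# Hypothetical helper: fail early when the API key has not been exported.
require_api_key() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
}

# Usage:
# missing_tools sox curl jq && require_api_key || echo "Fix the items above first"
```

If sox or jq turns up as missing, installing them with Homebrew (brew install sox jq) is one option, assuming Homebrew is what you use.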

If you don't want to use OpenAI models, there are plenty of alternatives (Open-Whisper, LM Studio, Piper, Claude, etc.).

The Minute

Over the next minute, you can paste these commands into your macOS Terminal:

Record Your Request

sox -d -q test.wav trim 0 3

This runs SoX (Sound eXchange), a tool for recording and processing audio.

  • The -d option, given in the input position, records from the default audio device.
  • The -q option enables quiet mode (to suppress progress output).
  • The recording is saved as test.wav.
  • trim 0 3 keeps 3 seconds of audio starting at offset 0, so the recording stops after 3 seconds.

Convert to Text

TRANSCRIPTION=$(curl -s -X POST https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file=@test.wav \
  -F model=whisper-1 \
  | jq -r .text)

This uploads test.wav to OpenAI's Whisper model and stores the returned text in TRANSCRIPTION.
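If the request fails (an invalid key, for example), the response has no text field and jq -r .text simply prints null. Here is a rough sketch of a guard, assuming failures come back as a JSON object with an error.message field; extract_field is a hypothetical helper name:

```shell
# Hypothetical helper: read a JSON response on stdin and print one field.
# Returns non-zero and reports the error message (if any) when the field is absent.
extract_field() {
  local filter="$1" body
  body=$(cat)
  # "// empty" yields no output for a missing field, which makes jq -e exit non-zero.
  if ! printf '%s' "$body" | jq -er "$filter // empty" 2>/dev/null; then
    printf 'Request failed: %s\n' \
      "$(printf '%s' "$body" | jq -r '.error.message // "unexpected response"' 2>/dev/null)" >&2
    return 1
  fi
}

# Usage: replace the final "| jq -r .text" stage of the command above with
#   | extract_field '.text'
```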

Process the Text

REPLY=$(curl -s -X POST https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-3.5-turbo\",
    \"messages\": [
      { \"role\": \"system\", \"content\": \"You are a helpful assistant. Keep responses short.\" },
      { \"role\": \"user\", \"content\": \"$TRANSCRIPTION\" }
    ]
  }" | jq -r '.choices[0].message.content')

This sends the transcription to gpt-3.5-turbo, with a system prompt asking for short replies, and stores the answer in REPLY.
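Because $TRANSCRIPTION is pasted straight into the JSON string, a transcription that contains a double quote or a line break would produce an invalid request body. One way around that is to let jq (already used above) build the body. This is only a sketch, and chat_payload is a hypothetical helper name:

```shell
# Hypothetical helper: build the chat request body with jq so that quotes,
# backslashes, and newlines in the user text are escaped correctly.
chat_payload() {
  jq -n --arg text "$1" '{
    model: "gpt-3.5-turbo",
    messages: [
      {role: "system", content: "You are a helpful assistant. Keep responses short."},
      {role: "user", content: $text}
    ]
  }'
}

# Usage (same request as above, with the body generated by the helper):
# REPLY=$(curl -s -X POST https://api.openai.com/v1/chat/completions \
#   -H "Authorization: Bearer $OPENAI_API_KEY" \
#   -H "Content-Type: application/json" \
#   -d "$(chat_payload "$TRANSCRIPTION")" | jq -r '.choices[0].message.content')
```

The same approach should carry over to the speech request, where the reply text is even more likely to contain quotes or line breaks.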

Stream the Reply

curl -s -X POST https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"tts-1\",
    \"input\": \"$REPLY\",
    \"voice\": \"fable\",
    \"response_format\": \"pcm\",
    \"sample_rate\": 24000
  }" | sox -t raw -b16 -e signed-integer -r24000 -c1 -L - -d

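This uses OpenAI's TTS API to convert the output of GPT back into speech. The audio is streamed to sox as raw PCM, and the sox flags (16-bit signed little-endian samples, 24000 Hz, mono) describe that format so it can be played on the default output device.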
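The sox flags have to match what the API sends. Going by the flags in the command, that is 16-bit mono samples at 24000 Hz, which works out to 48000 bytes per second. As a small worked example of that arithmetic (pcm_duration_ms is a made-up helper name), the length of a saved PCM reply can be estimated from its size:

```shell
# Hypothetical helper: estimate the playback length of raw PCM audio from its size.
# Assumes the format described by the sox flags in the command above:
# 24000 samples/s * 2 bytes per 16-bit sample * 1 channel = 48000 bytes/s.
pcm_duration_ms() {
  local bytes="$1"
  local bytes_per_second=$((24000 * 2 * 1))
  echo $(( bytes * 1000 / bytes_per_second ))
}

# Example: a 96000-byte reply lasts 2000 ms.
pcm_duration_ms 96000
```

If the sample rate or bit depth passed to sox did not match the data, the reply would play back at the wrong speed or as noise, which is why those flags are spelled out.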

Done!

You can add all of this to a single shell script to make it easier to run:

assist.sh

#!/bin/bash

# Record WAV: fixed 3 second clip
echo "🎙️  Recording 3 second clip..."
sox -d -q test.wav trim 0 3

# Transcribe with Whisper
echo "📝 Transcribing..."
TRANSCRIPTION=$(curl -s -X POST https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file=@test.wav \
  -F model=whisper-1 \
  | jq -r .text)

# Print what was transcribed
echo "🗣️  You said: \"$TRANSCRIPTION\""

# Chat with GPT
REPLY=$(curl -s -X POST https://api.openai.com/v1/chat/completions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"gpt-3.5-turbo\",
    \"messages\": [
      { \"role\": \"system\", \"content\": \"You are a helpful assistant. Keep responses short.\" },
      { \"role\": \"user\", \"content\": \"$TRANSCRIPTION\" }
    ]
  }" | jq -r '.choices[0].message.content')

# Print reply
echo "🤖 AI reply: \"$REPLY\""

# TTS: stream back and play
echo "🔊 Speaking reply..."
curl -s -X POST https://api.openai.com/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"tts-1\",
    \"input\": \"$REPLY\",
    \"voice\": \"fable\",
    \"response_format\": \"pcm\",
    \"sample_rate\": 24000
  }" | sox -t raw -b16 -e signed-integer -r24000 -c1 -L - -d -q

# Final message
echo "✅ Done."

Then, to run it:

chmod +x ./assist.sh
./assist.sh

Conclusion

This is a quick AI assistant you can run with ./assist.sh from the command line (or by typing "assist" if you add an alias or put the script on your PATH).

You can extend yours to use the sox "silence" effect to record until you stop speaking, or to run in a loop that waits for a hot-key, etc.
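As a sketch of the first idea: the sox silence effect can start a recording once sound is detected and stop it after a pause. The thresholds and durations below are illustrative guesses that would need tuning for your microphone and room, and record_until_silence is a made-up name:

```shell
# Hypothetical helper: record until the speaker pauses instead of a fixed 3 seconds.
#   silence 1 0.1 1%   start once sound above 1% has lasted 0.1 s
#           1 1.5 1%   stop after 1.5 s of audio below 1%
record_until_silence() {
  local out="${1:-test.wav}"
  sox -d -q "$out" silence 1 0.1 1% 1 1.5 1%
}

# Usage: swap this in for the "sox -d -q test.wav trim 0 3" line in assist.sh
# record_until_silence test.wav
```

In a noisy room the stop condition may never trigger, so Ctrl-C remains the fallback.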

I've extended mine to run within an Express server for better control, and to stream both input and output for embedded devices.

Let me know if you have any questions!

Happy Hacking,
Zuluana
