End-to-End Voice Recognition with Python

grace · August 16th, 2024 · 2 min read

There are several approaches for adding speech recognition capabilities to a Python application. In this article, I’d like to introduce a new paradigm for adding purpose-made & context-aware voice assistants into Python apps using the Picovoice platform.

Picovoice enables developers to create voice experiences similar to Alexa and Google Assistant within existing Python apps. Unlike cloud-based alternatives, Picovoice is:

  • Private and secure — no voice data leaves the app
  • Accurate — focused on the domain of interest
  • Cross-platform — Linux, macOS, Windows, Raspberry Pi, …
  • Reliable and zero-latency — eliminates unpredictable network delays

In what follows, I’ll introduce Picovoice by building a voice-enabled alarm clock using the Picovoice SDK, Picovoice Console, and the Tkinter GUI framework. The code is open-source and available on Picovoice’s GitHub repository.

1 — Install Picovoice
Install Picovoice from a terminal:

pip3 install picovoice

2 — Create an Instance of Picovoice

Picovoice is an end-to-end voice recognition platform with wake word detection and intent inference capabilities. Picovoice uses the Porcupine Wake Word engine for voice activation and the Rhino Speech-to-Intent engine for inferring intent from follow-on voice commands. For example, when a user says:

Picovoice, set an alarm for 2 hours and 31 seconds.

Porcupine detects the utterance of the Picovoice wake word. Then Rhino infers the user’s intent from the follow-on command and provides a structured inference:

{
  is_understood: true,
  intent: setAlarm,
  slots: {
    hours: 2,
    seconds: 31
  }
}

Create an instance of Picovoice by providing your AccessKey, paths to the Porcupine keyword and Rhino context files, and callbacks for wake word detection and inference completion:

from picovoice import Picovoice

keyword_path = ...  # path to Porcupine wake word file (.PPN)

def wake_word_callback():
  pass

context_path = ...  # path to Rhino context file (.RHN)

def inference_callback(inference):
  print(inference.is_understood)
  if inference.is_understood:
    print(inference.intent)
    for k, v in inference.slots.items():
      print(f"{k} : {v}")

access_key = "${YOUR_ACCESS_KEY}"  # obtained from Picovoice Console (see step 3)

pv = Picovoice(
  access_key=access_key,
  keyword_path=keyword_path,
  wake_word_callback=wake_word_callback,
  context_path=context_path,
  inference_callback=inference_callback)

Several pre-trained Porcupine and Rhino models are available on their GitHub repositories [1][2]. For this demo, we use the pre-trained Picovoice wake word model for Porcupine and the pre-trained Alarm context for Rhino. You can also create custom wake words and contexts using Picovoice Console.

3 — Get your Free AccessKey

Sign up for Picovoice Console to get your AccessKey. It is free. The AccessKey is used for authentication and authorization when using the Picovoice SDK.
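
A minimal sketch of wiring the key in, assuming you store it in an environment variable named PICOVOICE_ACCESS_KEY (the variable name is my choice, not part of the SDK):

import os

# Read the AccessKey from an environment variable so it stays out of source control.
access_key = os.environ["PICOVOICE_ACCESS_KEY"]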

4 — Process Audio with Picovoice

Once the engine is instantiated, it can process a stream of audio. Simply pass frames of audio to the engine:

pv.process(audio_frame)
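
Each call to .process expects one frame of single-channel, 16-bit PCM audio; the required frame size and sample rate are exposed as properties on the engine. A quick check, as a sketch:

# The engine tells you how much audio it expects per call:
print(pv.frame_length)  # number of samples per frame (e.g. 512)
print(pv.sample_rate)   # required sample rate in Hz (e.g. 16000)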

5 — Read audio from the Microphone

Install pvrecorder (pip3 install pvrecorder). Then start capturing audio:

from pvrecorder import PvRecorder

# `-1` selects the default input audio device; the frame length must match
# what the Picovoice engine expects.
recorder = PvRecorder(frame_length=pv.frame_length, device_index=-1)
recorder.start()

Read frames of audio from the recorder and pass them to Picovoice’s .process method:

pcm = recorder.read()
pv.process(pcm)

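Putting this step together, a minimal capture loop might look like the sketch below; the try/finally cleanup is my addition, not from the original demo:

try:
    while True:
        # Read one frame from the microphone and hand it to the engine;
        # the callbacks fire when the wake word or a command is detected.
        pcm = recorder.read()
        pv.process(pcm)
finally:
    recorder.stop()
    recorder.delete()
    pv.delete()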

6 — Create a Cross-Platform GUI using Tkinter

Tkinter is the standard GUI framework shipped with Python. Create a window, add a label to it that shows the remaining time, and start the event loop:

import tkinter as tk

def on_close():
    # Release engine resources and close the window.
    pv.delete()
    window.destroy()

window = tk.Tk()
time_label = tk.Label(window, text='00 : 00 : 00')
time_label.pack()

window.protocol('WM_DELETE_WINDOW', on_close)

window.mainloop()
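
To keep the countdown current, the Tkinter event loop can re-render the label on a timer. A minimal sketch, assuming a hypothetical remaining_time() helper that returns the whole seconds left on the alarm (not part of the original demo):

def refresh_label():
    # `remaining_time()` is a hypothetical helper returning whole seconds left.
    seconds = remaining_time()
    hh, rest = divmod(seconds, 3600)
    mm, ss = divmod(rest, 60)
    time_label.config(text=f'{hh:02d} : {mm:02d} : {ss:02d}')
    window.after(100, refresh_label)  # re-schedule on the GUI thread

refresh_label()  # call once before window.mainloop()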

7 — Putting it Together

The complete demo is about 200 lines of code covering the GUI, audio recording, and voice recognition. I also run audio processing on a separate thread to avoid blocking the main GUI thread, as sketched below.
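
A minimal sketch of that pattern, assuming the recorder and pv objects from the earlier steps (the function and variable names here are mine, not the demo’s):

import threading

is_running = True

def process_audio():
    # Runs off the main thread so Tkinter's event loop stays responsive.
    while is_running:
        pcm = recorder.read()
        pv.process(pcm)

audio_thread = threading.Thread(target=process_audio, daemon=True)
audio_thread.start()

window.mainloop()  # the GUI keeps running on the main thread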

If you have technical questions or suggestions, please open an issue on Picovoice’s GitHub repository. If you wish to modify or improve this demo, feel free to submit a pull request.
