<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: grace </title>
    <description>The latest articles on DEV Community by grace  (@gracezzhang).</description>
    <link>https://dev.to/gracezzhang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1938744%2F25a56e56-9d38-434a-9244-07f472b811db.png</url>
      <title>DEV Community: grace </title>
      <link>https://dev.to/gracezzhang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gracezzhang"/>
    <language>en</language>
    <item>
      <title>Suppress Noise in 3 Lines of Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 21:03:42 +0000</pubDate>
      <link>https://dev.to/gracezzhang/suppress-noise-in-3-lines-of-python-1bg</link>
      <guid>https://dev.to/gracezzhang/suppress-noise-in-3-lines-of-python-1bg</guid>
      <description>&lt;p&gt;August 22, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;Learn how to suppress background acoustic noise using Picovoice &lt;a href="https://picovoice.ai/docs/quick-start/koala-python/" rel="noopener noreferrer"&gt;Koala Noise Suppression&lt;/a&gt; Python SDK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://picovoice.ai/platform/koala/" rel="noopener noreferrer"&gt;Koala Noise Suppression&lt;/a&gt; performs speech enhancement locally, keeping your voice data private (i.e. GDPR and HIPAA-compliant by design). Furthermore, by running on the device, Koala Noise Suppression guarantees real-time processing with minimum latency. The SDK runs on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson. Koala Noise Suppression can also run on Android, iOS, and web browsers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Noise Suppression, Noise Cancellation, and Speech Enhancement all refer to the same technology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Install Noise Suppression Python SDK&lt;/strong&gt;&lt;br&gt;
Install the SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pvkoala
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sign up for Picovoice Console&lt;/strong&gt;&lt;br&gt;
Sign up for (or log in to) &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;. It is free, and no credit card is required. Copy your AccessKey to the clipboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Line 1&lt;/strong&gt;&lt;br&gt;
Import the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvkoala
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 2&lt;/strong&gt;&lt;br&gt;
Create an instance of the noise cancellation object with your AccessKey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handle = pvkoala.create(access_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 3&lt;/strong&gt;&lt;br&gt;
Suppress noise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enhanced_pcm = handle.process(pcm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Koala Noise Suppression processes incoming audio in frames. The length of each frame can be obtained via handle.frame_length. Koala Noise Suppression operates on single-channel, 16 kHz audio.&lt;/p&gt;
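To illustrate the frame-based flow, here is a pure-Python sketch (no Koala calls) of splitting an audio buffer into fixed-size frames; frame_length stands in for the value reported by handle.frame_length:

```python
# Pure-Python sketch (no Koala calls): split a buffer of 16-bit samples
# into full frames of frame_length samples each. A trailing partial
# frame is dropped here for simplicity.
def chunk_into_frames(pcm, frame_length):
    num_frames = len(pcm) // frame_length
    return [pcm[i * frame_length:(i + 1) * frame_length] for i in range(num_frames)]

# each full frame would be passed to handle.process(frame) in turn
frames = chunk_into_frames(list(range(10)), frame_length=4)
# frames is [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real application the remainder samples would be buffered until the next read rather than dropped.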

&lt;p&gt;It only takes 90 seconds to suppress noise and enhance speech!&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gq_-OD9SsmQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;&lt;br&gt;
The source code for a fully working demo built with the Koala Noise Suppression Python SDK is available on &lt;a href="https://github.com/Picovoice/koala/tree/main/demo/python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For more information, check out the &lt;a href="https://picovoice.ai/platform/koala/" rel="noopener noreferrer"&gt;Koala Noise Suppression product page&lt;/a&gt; or refer to the &lt;a href="https://picovoice.ai/docs/quick-start/koala-python/" rel="noopener noreferrer"&gt;Koala Noise Suppression Python SDK quick start guide&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>OctoTube: Voice Search for YouTube</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:59:56 +0000</pubDate>
      <link>https://dev.to/gracezzhang/octotube-voice-search-for-youtube-1f3j</link>
      <guid>https://dev.to/gracezzhang/octotube-voice-search-for-youtube-1f3j</guid>
      <description>&lt;p&gt;Have you ever been in a situation where you are going back and forth in a YouTube video searching for a specific phrase? No more. There is a little script that can search any video (even without transcription) lightning-fast and point you to the exact second the phrase occurs. Enter OctoTube!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get Started&lt;/strong&gt;&lt;br&gt;
Clone the Octopus GitHub repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone --recurse-submodules https://github.com/Picovoice/octopus.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this from the root of the repository to install Python dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install -r demo/youtube/requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get an AccessKey from &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;. It is free.&lt;/p&gt;

&lt;p&gt;Find a YouTube video you would like to search, then run this from the root of the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 demo/youtube/octotube.py \
--access-key ${ACCESS_KEY} \
--url ${YOUTUBE_VIDEO_URL} \
--phrases ${SEARCH_PHRASE0} ${SEARCH_PHRASE1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get something like the below (yes, I watch too much &lt;a href="https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)" rel="noopener noreferrer"&gt;Silicon Valley&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;indexed 3024 seconds of audio in 54.36 seconds
searched 3024 seconds of audio for 1 phrases in 0.01013 seconds
pied piper &amp;gt;&amp;gt;&amp;gt;
[0.5] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=784
[1.0] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=840
[1.0] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=2355
[1.0] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=2940
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
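The timestamped links in the output above can be reproduced from raw match results. A small sketch, assuming matches arrive as (score, start-second) pairs; the video ID and values are taken from the sample output:

```python
from urllib.parse import urlencode

# Sketch of turning (score, start_sec) matches into timestamped YouTube
# links; urlencode joins the v and t query parameters with an ampersand.
def match_links(video_id, matches):
    links = []
    for score, start_sec in matches:
        query = urlencode({"v": video_id, "t": int(start_sec)})
        links.append("[%.1f] https://www.youtube.com/watch?%s" % (score, query))
    return links

links = match_links("Lt6PPiTTwbE", [(0.5, 784), (1.0, 840)])
```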



&lt;p&gt;Notice that indexing is the bulk of the processing time. The good news is once the video is indexed, it is super fast to search for more (similar to how the Google search engine works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;searched 3024 seconds of audio for 1 phrases in 0.00655 seconds
jian yang &amp;gt;&amp;gt;&amp;gt;
[0.3] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=1332
[0.7] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=2478
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
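The index-once, search-many behaviour can be pictured with a toy inverted index (an analogy only; Octopus indexes audio directly rather than text):

```python
# Toy inverted index: pay the indexing cost once, then every search is a
# cheap dictionary lookup. Analogy only; Octopus works on acoustic data.
def build_index(words_with_times):
    index = {}
    for word, second in words_with_times:
        index.setdefault(word, []).append(second)
    return index

index = build_index([("pied", 784), ("piper", 785), ("pied", 840)])
hits = index.get("pied", [])
# hits is [784, 840]; searching for another phrase reuses the same index
```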



&lt;p&gt;&lt;strong&gt;How Does it Work?&lt;/strong&gt;&lt;br&gt;
OctoTube uses the Picovoice Speech-to-Index engine (also known as &lt;a href="https://picovoice.ai/platform/octopus/" rel="noopener noreferrer"&gt;Octopus&lt;/a&gt;). Octopus directly indexes audio without relying on a text representation (&lt;a href="https://picovoice.ai/blog/direct-speech-indexing/" rel="noopener noreferrer"&gt;Learn more&lt;/a&gt;). Octopus runs on Android, iOS, Ubuntu, macOS, Windows, and even modern web browsers.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/s-HkScIDmCU"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Start Building&lt;/strong&gt;&lt;br&gt;
Go to Octopus’s &lt;a href="https://github.com/Picovoice/octopus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and start building your applications with Octopus!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Speaker Diarization in Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:56:25 +0000</pubDate>
      <link>https://dev.to/gracezzhang/speaker-diarization-in-python-235i</link>
      <guid>https://dev.to/gracezzhang/speaker-diarization-in-python-235i</guid>
      <description>&lt;p&gt;August 22, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Speaker diarization is the process of dividing an audio stream into distinct segments based on speaker identity. In simpler terms, it answers the question, "Who spoke when?"&lt;/p&gt;
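Once "who spoke when" is known, per-speaker statistics fall out directly. A minimal sketch, with made-up (speaker, start, end) segments standing in for real diarization output:

```python
# Minimal sketch: total speaking time per speaker from diarization-style
# (speaker, start_sec, end_sec) segments. The segment values are made up.
def talk_time(segments):
    totals = {}
    for speaker, start_sec, end_sec in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end_sec - start_sec)
    return totals

totals = talk_time([("A", 0.0, 3.0), ("B", 3.0, 4.5), ("A", 4.5, 6.0)])
# totals is {"A": 4.5, "B": 1.5}
```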

&lt;p&gt;Previously, we introduced you to some of the &lt;a href="https://picovoice.ai/blog/top-speaker-diarization-apis-and-sdks/" rel="noopener noreferrer"&gt;Top Speaker Diarization&lt;/a&gt; APIs and SDKs currently available in the market. In this article, we'll dive into practical demonstrations of three Python-based speaker diarization frameworks, showcasing their capabilities through a straightforward speaker diarization task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pyannote.audio&lt;/strong&gt;&lt;br&gt;
Getting started with &lt;a href="https://github.com/pyannote/pyannote-audio" rel="noopener noreferrer"&gt;pyannote.audio&lt;/a&gt;  for speaker diarization is straightforward. Follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the pyannote.audio package using pip:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pyannote.audio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Obtain your authentication token to download pretrained models by visiting their &lt;a href="https://huggingface.co/pyannote/speaker-diarization" rel="noopener noreferrer"&gt;Hugging Face pages&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the following Python code to perform speaker diarization on an audio file:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyannote.audio import Pipeline

# Replace "${ACCESS_TOKEN_GOES_HERE}" with your authentication token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="${ACCESS_TOKEN_GOES_HERE}")

# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
diarization = pipeline("${AUDIO_FILE_PATH}")

for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f'Speaker "{speaker}" - "{segment}"')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This code will perform speaker diarization and print out the identified speakers along with their corresponding segments in the audio file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA NeMo&lt;/strong&gt;&lt;br&gt;
To perform speaker diarization using &lt;a href="https://github.com/NVIDIA/NeMo" rel="noopener noreferrer"&gt;NVIDIA NeMo&lt;/a&gt;, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install dependencies:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y libsndfile1 ffmpeg
pip3 install Cython
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Install NeMo:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/NVIDIA/NeMo.git@r1.20.0#egg=nemo_toolkit[all]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Download the config file for the inference from the &lt;a href="https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/inference" rel="noopener noreferrer"&gt;NeMo GitHub repository&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate and store the manifest file by running the following code:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os

from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

INPUT_FILE = '/PATH/TO/AUDIO_FILE.wav'
MANIFEST_FILE = '/PATH/TO/MANIFEST_FILE.json'

meta = {
    'audio_filepath': INPUT_FILE,
    'offset': 0,
    'duration': None,
    'label': 'infer',
    'text': '-',
    'num_speakers': None,
    'rttm_filepath': None,
    'uem_filepath': None
}
with open(MANIFEST_FILE, 'w') as fp:
    json.dump(meta, fp)
    fp.write('\n')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Replace /PATH/TO/AUDIO_FILE.wav with the path to your audio file and /PATH/TO/MANIFEST_FILE.json with the desired path for your manifest file.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Load the config file and define a ClusteringDiarizer object:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OUTPUT_DIR = '/PATH/TO/OUTPUT_DIR'
MODEL_CONFIG = '/PATH/TO/CONFIG_FILE.yaml'

config = OmegaConf.load(MODEL_CONFIG)
config.diarizer.manifest_filepath = MANIFEST_FILE
config.diarizer.out_dir = OUTPUT_DIR
config.diarizer.oracle_vad = False
config.diarizer.clustering.parameters.oracle_num_speakers = False

sd_model = ClusteringDiarizer(cfg=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Replace /PATH/TO/OUTPUT_DIR and /PATH/TO/CONFIG_FILE.yaml with the desired paths for your output directory and config file, respectively.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Perform speaker diarization on the audio file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sd_model.diarize()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The output of the speaker diarization will be stored in the OUTPUT_DIR directory as a &lt;a href="https://github.com/nryant/dscore#rttm" rel="noopener noreferrer"&gt;Rich Transcription Time Marked (RTTM)&lt;/a&gt; file.&lt;/p&gt;
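An RTTM file is plain text with one SPEAKER record per line. Here is a parsing sketch, assuming the standard ten-field layout described in the dscore README; the sample record is made up:

```python
# Sketch of reading a SPEAKER record from an RTTM file. Field layout
# (per the dscore README): type, file, channel, onset, duration, ortho,
# subtype, speaker name, confidence, lookahead. Unused fields hold the
# literal placeholder token "NA" wrapped in angle brackets.
NA = chr(60) + "NA" + chr(62)

def parse_rttm_line(line):
    fields = line.split()
    return {
        "file": fields[1],
        "onset": float(fields[3]),
        "duration": float(fields[4]),
        "speaker": fields[7],
    }

example = " ".join(["SPEAKER", "meeting", "1", "5.00", "2.30", NA, NA, "speaker_0", NA, NA])
record = parse_rttm_line(example)
# record["speaker"] is "speaker_0"; the segment runs from 5.00 for 2.30 seconds
```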

&lt;p&gt;&lt;strong&gt;Simple Diarizer&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/cvqluu/simple_diarizer" rel="noopener noreferrer"&gt;Simple Diarizer&lt;/a&gt;  is a speaker diarization library that utilizes pretrained models from &lt;a href="https://speechbrain.github.io/" rel="noopener noreferrer"&gt;SpeechBrain&lt;/a&gt; . To get started with simple_diarizer, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the package using pip:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install simple_diarizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Define a Diarizer object:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from simple_diarizer.diarizer import Diarizer

diarization = Diarizer(embed_model='xvec', cluster_method='sc')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;Perform speaker diarization on an audio file by either passing the number of speakers:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = diarization.diarize("${AUDIO_FILE_PATH}", num_speakers=NUM_SPEAKERS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or by passing a threshold value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;segments = diarization.diarize("${AUDIO_FILE_PATH}", threshold=THRESHOLD)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The speaker information and timing details, including the start and end times of each segment, are stored in the segments variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falcon Speaker Diarization&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/Picovoice/falcon/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization&lt;/a&gt;  is an on-device speaker diarization engine powered by deep learning. To get started with Falcon Speaker Diarization, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the package using pip:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pvfalcon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Sign up for &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; for free and copy your AccessKey. It handles authentication and authorization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create an instance of the engine:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvfalcon

# Replace "${ACCESS_KEY}" with your Picovoice Console AccessKey
falcon = pvfalcon.create(access_key="${ACCESS_KEY}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="4"&gt;
&lt;li&gt;Perform speaker diarization on an audio file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = falcon.process_file("${AUDIO_FILE_PATH}")
for segment in segments:
    print(
        "{speaker_tag=%d start_sec=%.2f end_sec=%.2f}"
        % (segment.speaker_tag, segment.start_sec, segment.end_sec)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The segments variable represents an array of segments, each of which includes the segment's timing and speaker information.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/UJmXXHfP-NQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;For more information about Falcon Speaker Diarization, check out the &lt;a href="https://picovoice.ai/platform/falcon/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization&lt;/a&gt; product page or refer to the &lt;a href="https://picovoice.ai/docs/quick-start/falcon-python/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization Python SDK quick start&lt;/a&gt; guide.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Real-time Speaker Identification in Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:48:18 +0000</pubDate>
      <link>https://dev.to/gracezzhang/real-time-speaker-identification-in-python-44b9</link>
      <guid>https://dev.to/gracezzhang/real-time-speaker-identification-in-python-44b9</guid>
      <description>&lt;p&gt;August 22, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Speaker Recognition (or Speaker Identification) analyzes distinctive voice characteristics to identify and verify speakers. It is the technology behind voice authentication, speaker-based personalization, and speaker spotting. However, many applications of Speaker Recognition suffer from the high latency of cloud-based services, leading to poor user experience. That is where Picovoice's &lt;a href="https://picovoice.ai/platform/eagle/" rel="noopener noreferrer"&gt;Eagle Speaker Recognition SDK&lt;/a&gt; comes in, offering on-device Speaker Recognition without sacrificing accuracy. What's more, &lt;a href="https://picovoice.ai/platform/eagle/" rel="noopener noreferrer"&gt;Eagle Speaker Recognition&lt;/a&gt; makes it so easy, you can add Speaker Recognition to your app in just a few lines of Python.&lt;/p&gt;

&lt;p&gt;Speaker Recognition typically requires two steps. The first step is speaker Enrollment, where a speaker's voice is registered using a short clip of audio to produce a Speaker Profile. The second step is Recognition, where the Speaker Profile is used to detect when that speaker is speaking given an audio stream.&lt;/p&gt;

&lt;p&gt;Let's see how to use the &lt;a href="https://picovoice.ai/platform/eagle/" rel="noopener noreferrer"&gt;Eagle Speaker Recognition&lt;/a&gt; Python SDK / API to implement a speaker recognition app!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
Install &lt;a href="https://pypi.org/project/pveagle/" rel="noopener noreferrer"&gt;pveagle&lt;/a&gt; using pip. We will be using &lt;a href="https://pypi.org/project/pvrecorder/" rel="noopener noreferrer"&gt;pvrecorder&lt;/a&gt; for cross-platform audio capture, so install that as well:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pveagle pvrecorder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Lastly, you will need a Picovoice AccessKey, which can be obtained with a free &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enroll a speaker&lt;/strong&gt;&lt;br&gt;
Import pveagle and create an instance of the EagleProfiler class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pveagle

access_key = "{YOUR_ACCESS_KEY}"
try:
    eagle_profiler = pveagle.create_profiler(access_key=access_key)
except pveagle.EagleError as e:
    # Handle error
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, import pvrecorder and create an instance of the recorder as well. Use the EagleProfiler's .min_enroll_samples as the frame_length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pvrecorder import PvRecorder

DEFAULT_DEVICE_INDEX = -1
recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle_profiler.min_enroll_samples)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it's time to enroll a speaker. The .enroll() function takes in frames of audio and provides feedback on the audio quality and Enrollment percentage. Use the percentage value to know when Enrollment is done and another speaker can be enrolled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder.start()

enroll_percentage = 0.0
while enroll_percentage &amp;lt; 100.0:
    audio_frame = recorder.read()
    enroll_percentage, feedback = eagle_profiler.enroll(audio_frame)

recorder.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Enrollment reaches 100%, export the speaker profile to use in the next step, Speaker Recognition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;speaker_profile = eagle_profiler.export()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The speaker_profile object can be saved and reused; see the &lt;a href="https://picovoice.ai/docs/api/eagle-python/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for more details. Profiles can be made for additional users by calling the .reset() function on the EagleProfiler, and repeating the .enroll() step.&lt;/p&gt;
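To carry a profile between runs, it must be serialized and written to disk. How the profile object becomes bytes is SDK-specific (check the Eagle Python docs for the actual serialization API); the sketch below shows only the persistence step around it:

```python
from pathlib import Path

# Plain file I/O around a serialized speaker profile. The pveagle calls
# that produce and consume the bytes are deliberately left out; only
# the save/restore step is shown.
def save_profile(profile_bytes, path):
    Path(path).write_bytes(profile_bytes)

def load_profile(path):
    return Path(path).read_bytes()
```

On the next run, the restored bytes would be handed back to pveagle when creating the recognizer, so enrollment does not have to be repeated.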

&lt;p&gt;Once profiles have been created for all speakers, don't forget to clean up used resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder.delete()
eagle_profiler.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perform recognition&lt;/strong&gt;&lt;br&gt;
Import pveagle and create an instance of the Eagle class, using the speaker profiles created by the Enrollment step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pveagle

access_key = "{YOUR_ACCESS_KEY}"
profiles = [speaker_profile_1, speaker_profile_2]
try:
    eagle = pveagle.create_recognizer(
        access_key=access_key,
        speaker_profiles=profiles)
except pveagle.EagleError as e:
    # Handle error
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now set up pvrecorder to use with Eagle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle.frame_length)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass audio frames into the eagle.process() function to get back speaker scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while True:
    audio_frame = recorder.read()
    scores = eagle.process(audio_frame)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
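Each call returns one score per enrolled profile, in enrollment order. A sketch of picking the most likely speaker from such a list; the names and score values are made up:

```python
# Pick the highest-scoring enrolled speaker. scores has one entry per
# profile, in the order the profiles were passed to the recognizer.
def best_speaker(names, scores):
    best_name, best_score = max(zip(names, scores), key=lambda pair: pair[1])
    return best_name, best_score

name, score = best_speaker(["alice", "bob"], [0.1, 0.8])
# name is "bob", score is 0.8
```

In practice an application would also compare the winning score against a minimum threshold before claiming a match, so that unknown voices are not forced onto an enrolled speaker.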



&lt;p&gt;When finished, don't forget to clean up used resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eagle.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Putting It All Together&lt;/strong&gt;&lt;br&gt;
Here is an example program bringing together everything that has been shown so far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pveagle
from pvrecorder import PvRecorder

DEFAULT_DEVICE_INDEX = -1
access_key = "{YOUR_ACCESS_KEY}"

# Step 1: Enrollment
try:
    eagle_profiler = pveagle.create_profiler(access_key=access_key)
except pveagle.EagleError as e:
    pass

enroll_recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle_profiler.min_enroll_samples)

enroll_recorder.start()

enroll_percentage = 0.0
while enroll_percentage &amp;lt; 100.0:
    audio_frame = enroll_recorder.read()
    enroll_percentage, feedback = eagle_profiler.enroll(audio_frame)

enroll_recorder.stop()

speaker_profile = eagle_profiler.export()

enroll_recorder.delete()
eagle_profiler.delete()

# Step 2: Recognition
try:
    eagle = pveagle.create_recognizer(
        access_key=access_key,
        speaker_profiles=[speaker_profile])
except pveagle.EagleError as e:
    pass

recognizer_recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle.frame_length)

recognizer_recorder.start()

try:
    while True:
        audio_frame = recognizer_recorder.read()
        scores = eagle.process(audio_frame)
        print(scores)
except KeyboardInterrupt:
    pass

recognizer_recorder.stop()

recognizer_recorder.delete()
eagle.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It just takes 2 minutes to get it up and running:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Fbt3Swkh7HM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps&lt;/strong&gt;&lt;br&gt;
See the &lt;a href="https://github.com/Picovoice/eagle/tree/main/demo/python" rel="noopener noreferrer"&gt;GitHub Python Demo&lt;/a&gt; for a more complete example, including how to handle Enrollment feedback, save Speaker Profiles to disk, and use files as the audio input. You can also view the &lt;a href="https://picovoice.ai/docs/api/eagle-python/" rel="noopener noreferrer"&gt;Python API docs&lt;/a&gt; for details on the package.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Adding Speaker Diarization to OpenAI Whisper using Picovoice Falcon</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:43:06 +0000</pubDate>
      <link>https://dev.to/gracezzhang/adding-speaker-diarization-to-openai-whisper-using-picovoice-falcon-5c20</link>
      <guid>https://dev.to/gracezzhang/adding-speaker-diarization-to-openai-whisper-using-picovoice-falcon-5c20</guid>
      <description>&lt;p&gt;August 22, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;OpenAI Whisper Speech-to-Text is a locally executable speech recognition model that comes in various sizes, allowing users to choose a model that suits their device's specifications. Unfortunately, Whisper lacks speaker diarization, a crucial feature for applications that require speaker identification (e.g. discerning speakers in a meeting scenario).&lt;/p&gt;

&lt;p&gt;This article guides you through the process of integrating &lt;a href="https://picovoice.ai/platform/falcon/" rel="noopener noreferrer"&gt;Picovoice Falcon Speaker Diarization&lt;/a&gt; with OpenAI Whisper in Python. Adding speaker diarization will result in a more user-friendly, dialogue-style transcription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
Start by installing the necessary packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install -U openai-whisper
pip3 install -U pvfalcon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Both Falcon Speaker Diarization and Whisper Speech-to-Text run on CPU and do not require a GPU. While Whisper may be slow on CPU, utilizing a GPU can improve its runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Speech Recognition with Whisper&lt;/strong&gt;&lt;br&gt;
Let's begin by utilizing Whisper for speech recognition. The code snippet below demonstrates how to transcribe speech using Whisper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import whisper

model = whisper.load_model(${WHISPER_MODEL})
result = model.transcribe(${AUDIO_FILE_PATH})
transcript_segments = result["segments"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, ${WHISPER_MODEL} refers to one of the available &lt;a href="https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages" rel="noopener noreferrer"&gt;Whisper models&lt;/a&gt; , and ${AUDIO_FILE_PATH} is the path to the audio file. Since our goal is a dialogue-style transcription, we'll focus on extracting segments from the result, each representing a part of the transcript with its corresponding timestamp.&lt;/p&gt;
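Each Whisper segment is a dict carrying "start", "end", and "text" keys. A small sketch of rendering such segments as a timestamped transcript; the sample segment is made up:

```python
# Render Whisper-style segments (dicts with "start", "end", "text") as
# a timestamped transcript. The sample segment below is made up.
def render(segments):
    lines = []
    for seg in segments:
        lines.append("[%6.2f - %6.2f] %s" % (seg["start"], seg["end"], seg["text"].strip()))
    return "\n".join(lines)

print(render([{"start": 0.0, "end": 2.4, "text": " Hello there."}]))
# prints: [  0.00 -   2.40] Hello there.
```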

&lt;p&gt;&lt;strong&gt;Speaker Diarization with Falcon&lt;/strong&gt;&lt;br&gt;
Next, let's perform speaker diarization using Falcon. The following code snippet illustrates how to apply Falcon for this purpose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvfalcon

falcon = pvfalcon.create(access_key=${ACCESS_KEY})
speaker_segments = falcon.process_file(${AUDIO_FILE_PATH})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, ${ACCESS_KEY} is your access key obtained from the &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;. The process method result is a list of speaker segments, similar to Whisper's segments but with speaker_tag fields indicating the speaker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrating Whisper and Falcon Speaker Diarization&lt;/strong&gt;&lt;br&gt;
By combining OpenAI Whisper for speech recognition and Picovoice Falcon Speaker Diarization for speaker diarization, we aim to create a dialogue-style transcription. To achieve this, we'll define a simple score to measure the overlap between Whisper and Falcon Speaker Diarization segments. The following code snippet demonstrates how to calculate this score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def segment_score(transcript_segment, speaker_segment):
    transcript_segment_start = transcript_segment["start"]
    transcript_segment_end = transcript_segment["end"]
    speaker_segment_start = speaker_segment.start_sec
    speaker_segment_end = speaker_segment.end_sec

    overlap = min(transcript_segment_end, speaker_segment_end) - max(transcript_segment_start, speaker_segment_start)
    overlap_ratio = overlap / (transcript_segment_end - transcript_segment_start)
    return overlap_ratio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
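&lt;p&gt;To make the score concrete, here is a worked example with hand-made segments; a namedtuple stands in for Falcon's segment objects. A transcript segment spanning 1.0-3.0 s and a speaker segment spanning 2.0-5.0 s overlap for 1.0 s, giving a score of 0.5:&lt;/p&gt;

```python
from collections import namedtuple

# Stand-in for Falcon's segment type, which exposes start_sec/end_sec fields.
SpeakerSegment = namedtuple("SpeakerSegment", "speaker_tag start_sec end_sec")


def segment_score(transcript_segment, speaker_segment):
    # Fraction of the transcript segment covered by the speaker segment.
    overlap = min(transcript_segment["end"], speaker_segment.end_sec) - max(
        transcript_segment["start"], speaker_segment.start_sec)
    return overlap / (transcript_segment["end"] - transcript_segment["start"])


t_segment = {"start": 1.0, "end": 3.0}   # Whisper-style segment
s_segment = SpeakerSegment(1, 2.0, 5.0)  # Falcon-style segment

score = segment_score(t_segment, s_segment)
print(score)  # 1.0 s of overlap over a 2.0 s transcript segment -> 0.5
```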



&lt;p&gt;Utilizing this score, we can find the best-matching Falcon Speaker Diarization segment for each Whisper segment. The code snippet below demonstrates this process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for t_segment in transcript_segments:
    max_score = 0
    best_s_segment = None
    for s_segment in speaker_segments:
        score = segment_score(t_segment, s_segment)
        if score &amp;gt; max_score:
            max_score = score
            best_s_segment = s_segment

    print(f"Speaker {best_s_segment.speaker_tag}: {t_segment['text']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This is a basic approach for merging the two segment lists, intended for demonstration purposes. Results can be further enhanced with a more sophisticated matching algorithm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Putting everything together results in the script below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvfalcon
import whisper

model = whisper.load_model(${WHISPER_MODEL})
result = model.transcribe(${AUDIO_FILE_PATH})
transcript_segments = result["segments"]

falcon = pvfalcon.create(access_key=${ACCESS_KEY})
speaker_segments = falcon.process_file(${AUDIO_FILE_PATH})


def segment_score(transcript_segment, speaker_segment):
    transcript_segment_start = transcript_segment["start"]
    transcript_segment_end = transcript_segment["end"]
    speaker_segment_start = speaker_segment.start_sec
    speaker_segment_end = speaker_segment.end_sec

    overlap = min(transcript_segment_end, speaker_segment_end) - max(transcript_segment_start, speaker_segment_start)
    overlap_ratio = overlap / (transcript_segment_end - transcript_segment_start)
    return overlap_ratio


for t_segment in transcript_segments:
    max_score = 0
    best_s_segment = None
    for s_segment in speaker_segments:
        score = segment_score(t_segment, s_segment)
        if score &amp;gt; max_score:
            max_score = score
            best_s_segment = s_segment

    if best_s_segment is not None:
        print(f"Speaker {best_s_segment.speaker_tag}: {t_segment['text']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected result follows a format similar to the output below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Speaker 1:  Hey, has the task been completed?
Speaker 2:  I don't know anything about it.
Speaker 3:  Well, we're in the process of working on it. 
Speaker 3:  There's a bit of a delay because we're waiting on someone else to complete their part.
Speaker 1:  Waiting again? This is taking longer than expected. 
Speaker 1:  Can we get an update on the timeline?
Speaker 3:  I understand the urgency. 
Speaker 3:  I've followed up with the person responsible, and they've assured me they're working on it. 
Speaker 3:  We should have a clearer timeline by the end of the day.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
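&lt;p&gt;The matching loop above assigns each transcript segment to the single best-overlapping speaker segment. A small refinement is to sum overlap per speaker tag, which helps when one speaker owns several short segments inside a transcript window. The snippet below sketches this with hand-made stand-in data (plain dicts) rather than real Whisper and Falcon output:&lt;/p&gt;

```python
from collections import defaultdict

# Stand-in segments; real ones come from Whisper (dicts with "start"/"end")
# and Falcon (objects with start_sec/end_sec/speaker_tag).
transcript_segment = {"start": 0.0, "end": 4.0, "text": " Hello there."}
speaker_segments = [
    {"speaker_tag": 1, "start_sec": 0.0, "end_sec": 1.0},
    {"speaker_tag": 1, "start_sec": 1.2, "end_sec": 3.0},
    {"speaker_tag": 2, "start_sec": 3.0, "end_sec": 4.0},
]

# Accumulate total overlap duration per speaker instead of
# keeping only the single best-matching segment.
overlap_by_speaker = defaultdict(float)
for s in speaker_segments:
    overlap = min(transcript_segment["end"], s["end_sec"]) - max(
        transcript_segment["start"], s["start_sec"])
    if overlap > 0:
        overlap_by_speaker[s["speaker_tag"]] += overlap

best_speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
print(f"Speaker {best_speaker}:{transcript_segment['text']}")
```

Speaker 1 accumulates roughly 2.8 s of overlap against 1.0 s for speaker 2, so the segment is attributed to speaker 1.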



&lt;p&gt;It only takes a minute to add speaker diarization to Whisper using Falcon:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/hbqtq_FydeM"&gt;
&lt;/iframe&gt;
 &lt;br&gt;
For more in-depth information on the Falcon Speaker Diarization Python SDK, delve into the &lt;a href="https://picovoice.ai/docs/quick-start/falcon-python/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. For those seeking a seamless solution that effortlessly combines speech recognition and speaker diarization, consider exploring &lt;a href="https://picovoice.ai/docs/quick-start/leopard-python/" rel="noopener noreferrer"&gt;Picovoice Leopard Speech-to-Text&lt;/a&gt;. Leopard Speech-to-Text, recognized for its &lt;a href="https://picovoice.ai/docs/benchmark/stt/#core-hour-1" rel="noopener noreferrer"&gt;lightweight and fast performance&lt;/a&gt;, internally incorporates &lt;a href="https://picovoice.ai/platform/falcon/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization&lt;/a&gt;, resulting in optimized outcomes. It streamlines the transcription process, enabling you to effortlessly obtain speaker information through a single function call.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Speech-to-Text with React.js</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Mon, 19 Aug 2024 18:23:54 +0000</pubDate>
      <link>https://dev.to/gracezzhang/speech-to-text-with-reactjs-1hmc</link>
      <guid>https://dev.to/gracezzhang/speech-to-text-with-reactjs-1hmc</guid>
      <description>&lt;p&gt;August 19, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;Speech-to-text is a technology that converts spoken words into written text. Integrating speech-to-text (STT) technology into an application can bring significant benefits, such as enhancing user experience, accessibility, and overall functionality.&lt;/p&gt;

&lt;p&gt;In this article, we will walk you through the process of integrating speech-to-text into a React application using Picovoice's &lt;a href="https://picovoice.ai/platform/leopard/" rel="noopener noreferrer"&gt;Leopard Speech-to-Text&lt;/a&gt; engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prerequisites&lt;/strong&gt;&lt;br&gt;
Sign up for a free &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; account. Once you've created an account, copy your AccessKey on the main dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create a React Project:&lt;/strong&gt;&lt;br&gt;
If you don't already have a React project, start by creating one with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx create-react-app leopard-react
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Install Dependencies:&lt;/strong&gt;&lt;br&gt;
Install &lt;a href="https://www.npmjs.com/package/@picovoice/leopard-react" rel="noopener noreferrer"&gt;@picovoice/leopard-react&lt;/a&gt;  and &lt;a href="https://www.npmjs.com/package/@picovoice/web-voice-processor" rel="noopener noreferrer"&gt;@picovoice/web-voice-processor&lt;/a&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @picovoice/leopard-react @picovoice/web-voice-processor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Leopard Model&lt;/strong&gt;&lt;br&gt;
In order to initialize Leopard, you will need a model file. Download one of the default &lt;a href="https://github.com/Picovoice/leopard/tree/master/lib/common" rel="noopener noreferrer"&gt;model files&lt;/a&gt; for your desired language and place it in the /public directory of your project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create Components&lt;/strong&gt;&lt;br&gt;
Create a file within /src called VoiceWidget.js and paste the below into it. The code uses Leopard's hook to perform speech-to-text. Remember to replace ${ACCESS_KEY} with your AccessKey obtained from the Picovoice Console and ${MODEL_FILE_PATH} with the path to your model file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import React from "react";
import { useLeopard } from "@picovoice/leopard-react";

export default function VoiceWidget() {
  const {
    result,
    isLoaded,
    error,
    init,
    processFile,
    startRecording,
    stopRecording,
    isRecording,
  } = useLeopard();

  const initEngine = async () =&amp;gt; {
    await init(
      "${ACCESS_KEY}",
      { publicPath: "${MODEL_FILE_PATH}" },
      { enableAutomaticPunctuation: true }
    );
  };

  const toggleRecord = async () =&amp;gt; {
    if (isRecording) {
      await stopRecording();
    } else {
      await startRecording();
    }
  };

  return (
    &amp;lt;div&amp;gt;
      {error &amp;amp;&amp;amp; &amp;lt;p className="error-message"&amp;gt;{error.toString()}&amp;lt;/p&amp;gt;}
      &amp;lt;br /&amp;gt;
      &amp;lt;button onClick={initEngine} disabled={isLoaded}&amp;gt;Initialize Leopard&amp;lt;/button&amp;gt;
      &amp;lt;br /&amp;gt;
      &amp;lt;br /&amp;gt;
      &amp;lt;label htmlFor="audio-file"&amp;gt;Choose audio file to transcribe:&amp;lt;/label&amp;gt;
      &amp;lt;input
        id="audio-file"
        type="file"
        accept="audio/*"
        disabled={!isLoaded}
        onChange={async (e) =&amp;gt; {
          if (!!e.target.files?.length) {
            await processFile(e.target.files[0])
          }
        }}
      /&amp;gt;
      &amp;lt;br /&amp;gt;
      &amp;lt;label htmlFor="audio-record"&amp;gt;Record audio to transcribe:&amp;lt;/label&amp;gt;
      &amp;lt;button id="audio-record" disabled={!isLoaded} onClick={toggleRecord}&amp;gt;
        {isRecording ? "Stop Recording" : "Start Recording"}
      &amp;lt;/button&amp;gt; 
      &amp;lt;h3&amp;gt;Transcript:&amp;lt;/h3&amp;gt;
      &amp;lt;p&amp;gt;{result?.transcript}&amp;lt;/p&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modify App.js to display the VoiceWidget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import VoiceWidget from "./VoiceWidget";

function App() {
  return (
    &amp;lt;div className="App"&amp;gt;
      &amp;lt;h1&amp;gt;
        Leopard React Demo
      &amp;lt;/h1&amp;gt;
      &amp;lt;VoiceWidget /&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}

export default App;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the development server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it's running, navigate to localhost:3000 and click the "Initialize Leopard" button. Once Leopard has initialized, upload an audio file or record audio to see the transcription.&lt;/p&gt;

&lt;p&gt;It takes less than 90 seconds to get it up and running!&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/CtnYBHT-IxY"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Additional Languages&lt;/strong&gt;&lt;br&gt;
Leopard supports many more languages aside from English. To use models in other languages, refer to the &lt;a href="https://picovoice.ai/docs/quick-start/leopard-react/#non-english-languages" rel="noopener noreferrer"&gt;Leopard Speech-to-Text React quick start&lt;/a&gt; guide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;&lt;br&gt;
The source code for the complete demo with Leopard React is available on its &lt;a href="https://github.com/Picovoice/leopard/tree/master/lib/common" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Record Audio from a Web Browser</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Mon, 19 Aug 2024 18:19:12 +0000</pubDate>
      <link>https://dev.to/gracezzhang/how-to-record-audio-from-a-web-browser-5bag</link>
      <guid>https://dev.to/gracezzhang/how-to-record-audio-from-a-web-browser-5bag</guid>
      <description>&lt;p&gt;August 19th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Recording audio from a web browser is more challenging than it might seem at first glance. While the browser's abstraction from the hardware it's running on has its benefits, it can make it difficult to communicate with certain peripherals - e.g. a user's connected microphone. Luckily for us modern developers, the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API" rel="noopener noreferrer"&gt;Web Audio API&lt;/a&gt;  and the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/MediaStream" rel="noopener noreferrer"&gt;MediaStream API&lt;/a&gt;  came along over a decade ago and solved many of these problems.&lt;/p&gt;

&lt;p&gt;The Web Audio API is a powerful tool for manipulating audio in the browser. It allows developers to analyze, synthesize, and manipulate audio in real-time using some simple JavaScript. The MediaStream API allows developers to open streams of media content from many sources, including the microphone. In this article, we will look at how to use the Web Audio API and the MediaStream API to capture microphone audio in any modern web browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting up a basic HTML page&lt;/strong&gt;&lt;br&gt;
First, let's create a basic HTML page that we can use to control audio capture from the microphone. Create a new file called index.html and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
  &amp;lt;title&amp;gt;Microphone Capture Demo&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
  &amp;lt;button id="start-button"&amp;gt;Start Capture&amp;lt;/button&amp;gt;
  &amp;lt;button id="stop-button"&amp;gt;Stop Capture&amp;lt;/button&amp;gt;
  &amp;lt;script src="main.js"&amp;gt;&amp;lt;/script&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Capturing audio from the microphone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have our HTML page, let's create the main JavaScript file to capture microphone audio. Create a new file called main.js and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const startButton = document.getElementById('start-button');
const stopButton = document.getElementById('stop-button');


let audioContext;
let micStreamAudioSourceNode;
let audioWorkletNode;


startButton.addEventListener('click', async () =&amp;gt; {
  // Check if the browser supports the required APIs
  if (!window.AudioContext || 
      !window.MediaStreamAudioSourceNode || 
      !window.AudioWorkletNode) {
    alert('Your browser does not support the required APIs');
    return;
  }


  // Request access to the user's microphone
  const micStream = await navigator
      .mediaDevices
      .getUserMedia({ audio: true });


  // Create an audio source node from the microphone stream
  audioContext = new AudioContext();
  micStreamAudioSourceNode = audioContext
      .createMediaStreamSource(micStream);


  // Create and connect AudioWorkletNode 
  // for processing the audio stream
  await audioContext
      .audioWorklet
      .addModule("my-audio-processor.js");
  audioWorkletNode = new AudioWorkletNode(
      audioContext,
      'my-audio-processor');
  micStreamAudioSourceNode.connect(audioWorkletNode);
});


stopButton.addEventListener('click', () =&amp;gt; {
  // Close audio stream
  micStreamAudioSourceNode.disconnect();
  audioContext.close();
});

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this code, we are able to capture microphone audio using the Web Audio API and the MediaStream API. When the user clicks the Start Capture button, we request access to the user's microphone and create an AudioContext. Once we know we have access to the microphone audio, we then create an audio processing graph using a MediaStreamAudioSourceNode to capture the audio and an AudioWorkletNode to process it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Web Audio API and MediaStream API are supported on Google Chrome, Firefox, Safari, Microsoft Edge and Opera. A host of mobile web browsers are also supported.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Processing the captured audio data&lt;/strong&gt;&lt;br&gt;
Now that we have set up the basic infrastructure for capturing microphone audio, we can start processing the real-time audio data. To do this, we will need to define the behaviour of the AudioWorkletNode with an AudioWorkletProcessor implementation of our own.&lt;/p&gt;

&lt;p&gt;Create a new file called my-audio-processor.js and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class MyAudioProcessor extends AudioWorkletProcessor {
    process(inputs, outputs, parameters) {
        // Get the input audio data from the first channel
        const inputData = inputs[0][0];

        // Do something with the audio data
        // ...

        return true;
    }
}

registerProcessor('my-audio-processor', MyAudioProcessor);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the process function that we've defined, we can access the input audio data and perform various operations on it. For example, we can use the Web Audio API's AnalyserNode to analyze the frequency spectrum or buffer the audio to send to a speech recognition engine.&lt;/p&gt;

&lt;p&gt;With this final addition, we can now capture real-time microphone audio from the HTML page we created earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capturing audio from the browser on Easy Mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, you might be thinking, "this approach seems complicated and limited (i.e. can't choose the sample rate of the incoming audio, audio processing on the main thread seems bad, etc.)", and you would be right. That's why we created the &lt;a href="https://picovoice.ai/docs/audio-recording-software/" rel="noopener noreferrer"&gt;Picovoice Audio Recorders&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Picovoice, we ran into a multitude of challenges getting audio from the web browser for speech recognition. We require specific audio properties for our speech recognition engines, and - since our audio processing happens all in the browser - we want the processing to happen on a worker thread. We found ourselves building out a complex array of utility functions to help, which we eventually merged into an open-source library: Picovoice &lt;a href="https://picovoice.ai/docs/quick-start/voiceprocessor-web/" rel="noopener noreferrer"&gt;Web Voice Processor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With Web Voice Processor imported, our main.js file would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { WebVoiceProcessor } from '@picovoice/web-voice-processor';

const startButton = document.getElementById('start-button');
const stopButton = document.getElementById('stop-button');

const engine = {
    onmessage: function(e) {
        switch (e.data.command) {
            case 'process':
                const inputData = e.data.inputFrame;
                // do something with the audio
                break;
        }
    }
}

startButton.addEventListener('click', async () =&amp;gt; {
    // Once WebVoiceProcessor has at least one engine
    // subscribed, audio capture begins
    WebVoiceProcessor.subscribe(engine);
});

stopButton.addEventListener('click', () =&amp;gt; {
    // Once WebVoiceProcessor no longer has engines
    // subscribed, audio capture stops
    WebVoiceProcessor.unsubscribe(engine);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to simplifying the audio capture process, Web Voice Processor adds options for resampling the input audio, selecting the audio device to record with and running audio processing on a Worker Thread.&lt;/p&gt;

&lt;p&gt;It takes less than 90 seconds to start recording audio from a web browser:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_rG87Tf_sWQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explore&lt;/strong&gt;&lt;br&gt;
The Web Voice Processor is open-source and available on GitHub. There is also a demo in the repository that explores more of the features of the library.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Python Wake Word Detection Tutorial — Picovoice</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Mon, 19 Aug 2024 18:12:39 +0000</pubDate>
      <link>https://dev.to/gracezzhang/python-wake-word-detection-tutorial-picovoice-5gm</link>
      <guid>https://dev.to/gracezzhang/python-wake-word-detection-tutorial-picovoice-5gm</guid>
      <description>&lt;p&gt;August 19th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;A Wake Word Engine is a tiny algorithm that detects utterances of a given Wake Phrase within a stream of audio. There are good articles that focus on how to build a Wake Word Model using TensorFlow or PyTorch. These are invaluable for educational purposes. But training a production-ready Wake Word Model requires significant effort for data curation and expertise to simulate real-world environments during training.&lt;/p&gt;

&lt;p&gt;Picovoice &lt;a href="https://picovoice.ai/platform/porcupine/" rel="noopener noreferrer"&gt;Porcupine Wake Word Engine&lt;/a&gt; uses Transfer Learning to eliminate the need for data collection per model. Porcupine enables you to train custom wake words instantly without requiring you to gather any data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Porcupine Python SDK runs on Linux (x86_64), macOS (x86_64 / arm64), Windows (amd64), Raspberry Pi (Zero, 2, 3, 4), NVIDIA Jetson Nano, and BeagleBone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below we learn how to use Porcupine Python SDK for Wake Word Detection and train production-ready Custom Wake Words within seconds using &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Porcupine can also run on modern Web browsers using its JavaScript SDK and on several Arm Cortex-M microcontrollers using its C SDK.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Install Porcupine Python SDK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install the SDK using PIP from a terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pvporcupine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sign up for Picovoice Console&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sign up for &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; for free and copy your AccessKey, which handles authentication and authorization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Porcupine SDK ships with a few built-in Wake Word Models such as Alexa, Hey Google, OK Google, Hey Siri, and Jarvis. Check the list of built-in models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in Keyword Models&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvporcupine

for keyword in pvporcupine.KEYWORDS:    
  print(keyword)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Initialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When initializing Porcupine, you can use one of the built-in Wake Word Models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;porcupine = pvporcupine.create(        
  access_key=access_key,        
  keywords=[keyword_one, keyword_two])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can provide Custom Keyword Models (more on this below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;porcupine = pvporcupine.create(        
  access_key=access_key,        
  keyword_paths=[keyword_path_one, keyword_path_two])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;*&lt;em&gt;Processing&lt;br&gt;
*&lt;/em&gt;&lt;br&gt;
Porcupine takes in audio in chunks (frames). .frame_length property gives the size of each frame. Porcupine accepts 16 kHz audio with 16-bit samples. For each frame, Porcupine returns a number representing the detected keyword. -1 indicates no detection. Positive indices correspond to keyword detections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;keyword_index = porcupine.process(audio_frame)
if keyword_index &amp;gt;= 0:
    # Logic to handle keyword detection events
    pass

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
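&lt;p&gt;As a concrete (if simplified) illustration of this framing requirement, the sketch below chunks a buffer of PCM samples into Porcupine-sized frames. The frame_length value of 512 and the placeholder samples are assumptions for the example; in practice, use porcupine.frame_length and real microphone audio:&lt;/p&gt;

```python
# Sketch: splitting a PCM buffer into fixed-size frames, as Porcupine expects.
# frame_length = 512 is an assumed value; use porcupine.frame_length in practice.
frame_length = 512
pcm = [0] * 1600  # placeholder for 16-bit samples read from a microphone or WAV

frames = [
    pcm[i:i + frame_length]
    for i in range(0, len(pcm) - frame_length + 1, frame_length)
]
print(len(frames))  # 3 full frames; the trailing 64 samples are too few for a frame
```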



&lt;p&gt;&lt;strong&gt;Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When done, be sure to release resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;porcupine.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create Custom Wake Words&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Often you want to use Custom Wake Word Models with your project. Branded Wake Word Models are essential for enterprise products. Otherwise, you are pushing Amazon, Google, and Apple's brand, not yours! You can create Custom Wake Word Models using Picovoice Console in seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in to &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go to the Porcupine Page&lt;/li&gt;
&lt;li&gt;Select the target language (e.g. English, Japanese, Spanish, etc.)&lt;/li&gt;
&lt;li&gt;Select the platform you want to optimize the model for (e.g. Raspberry Pi, macOS, Windows, etc.)&lt;/li&gt;
&lt;li&gt;Type in the wake phrase. A good wake phrase should have a few &lt;a href="https://picovoice.ai/docs/tips/choosing-a-wake-word/" rel="noopener noreferrer"&gt;linguistic properties&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click the train button. Your model will be ready momentarily (a file with the .ppn suffix). You can download this file for on-device inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Working Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All that is left is to wire up the audio recording. Then we have an end-to-end wake word solution. Install &lt;a href="https://picovoice.ai/blog/how-to-record-audio-using-python/" rel="noopener noreferrer"&gt;PvRecorder Python SDK&lt;/a&gt; using PIP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pvrecorder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following code snippet records audio from the default microphone on the device and processes recorded audio using Porcupine to detect the utterances of selected keywords. Altogether, we need less than 20 lines of code!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(access_key=access_key, keywords=keywords)
recorder = PvRecorder(device_index=-1, frame_length=porcupine.frame_length)

try:
    recorder.start()
    while True:
        keyword_index = porcupine.process(recorder.read())
        if keyword_index &amp;gt;= 0:
            print(f"Detected {keywords[keyword_index]}")
except KeyboardInterrupt:
    recorder.stop()
finally:
    porcupine.delete()
    recorder.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/bynDnS7wOUM?start=1"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Text-to-Speech in Python: On-Device Solutions</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Fri, 16 Aug 2024 21:09:33 +0000</pubDate>
      <link>https://dev.to/gracezzhang/text-to-speech-in-python-on-device-solutions-o62</link>
      <guid>https://dev.to/gracezzhang/text-to-speech-in-python-on-device-solutions-o62</guid>
      <description>&lt;p&gt;August 16th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Text-to-Speech (TTS) technology, also known as Speech Synthesis, converts text into human-like speech. The rise of deep learning has led to major advancements in TTS quality and naturalness, but at the cost of increased computational requirements. Most big tech companies offer cloud-based TTS APIs, like Google Text-to-Speech, Amazon Polly, or Microsoft Text-to-Speech, and new companies with similar offerings have emerged, such as ElevenLabs, or Coqui Studio. While convenient, these services require an internet connection, raise privacy concerns, and are prone to network outages. On-device solutions allow for more flexibility and privacy by synthesizing speech directly on the user's device. However, few options exist for on-device TTS. This article explores three open-source Python libraries and &lt;a href="https://picovoice.ai/docs/api/orca-python/" rel="noopener noreferrer"&gt;Picovoice Orca Text-to-Speech&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Best-in-class Voice AI!&lt;br&gt;
Build compliant and low-latency AI apps using Python without sending user data to 3rd party servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;PyTTSx3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pyttsx3/" rel="noopener noreferrer"&gt;PyTTSx3&lt;/a&gt; is a Python library that utilizes the popular eSpeak speech synthesis engine on Linux (NSSpeechSynthesizer is used on MacOS and SAPI5 on Windows). Getting started is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install pyTTSx3:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyttsx3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save synthesized speech to a file in Python:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyttsx3

engine = pyttsx3.init()
engine.save_to_file(text='Hello World', filename='PATH/TO/OUTPUT.wav')
engine.runAndWait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While simple to use, eSpeak's voice quality is robotic compared to more modern TTS systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coqui TTS&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/coqui-ai/TTS" rel="noopener noreferrer"&gt;Coqui TTS&lt;/a&gt; is the open-source repository of Coqui Studio. Developers can leverage Coqui's pretrained models or train custom voices. To synthesize speech, follow the steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Coqui TTS:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install TTS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;List available models in Python:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from TTS.api import TTS

TTS().list_models()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Choose a model name and save synthesized speech to a file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tts = TTS("CHOSEN/MODEL/NAME")
tts.tts_to_file(text="Hello World", output_path="PATH/TO/OUTPUT.wav")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Coqui offers high-quality voices with natural prosody, at the cost of larger model sizes and longer processing times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mimic3 from Mycroft&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mycroft is a free and open-source virtual assistant that offers a TTS system called &lt;a href="https://github.com/MycroftAI/mimic3/" rel="noopener noreferrer"&gt;Mimic3&lt;/a&gt;. This framework currently lacks a pure Python API, so we will use Python's subprocess:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Mycroft:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mycroft-mimic3-tts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Synthesize speech and save file to directory OUTPUT/DIR:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess 

args = [    
  "mimic3",    
  "\"Hello World\"",    
  "--output-dir", "OUTPUT/DIR"]

try:    
  subprocess.check_call(args)

except subprocess.CalledProcessError as e:  
  # Handle error    
  pass

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
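&lt;p&gt;The subprocess call can be wrapped in a small helper that fails early when the &lt;code&gt;mimic3&lt;/code&gt; executable is not on the PATH. A minimal sketch; &lt;code&gt;build_mimic3_args&lt;/code&gt; and &lt;code&gt;synthesize&lt;/code&gt; are our names, not part of Mimic3:&lt;/p&gt;

```python
import shutil
import subprocess

def build_mimic3_args(text, output_dir):
    # subprocess passes each argument verbatim, so the text needs no shell quoting
    return ["mimic3", text, "--output-dir", output_dir]

def synthesize(text, output_dir):
    # Fail early with a clear message if the mimic3 CLI is not installed
    if shutil.which("mimic3") is None:
        raise RuntimeError("mimic3 executable not found on PATH")
    subprocess.check_call(build_mimic3_args(text, output_dir))
```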


&lt;p&gt;For prototyping on-device TTS, Mimic3 from Mycroft provides a balance of quality and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orca Text-to-Speech&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://picovoice.ai/platform/orca/" rel="noopener noreferrer"&gt;Picovoice Orca Text-to-Speech&lt;/a&gt; leverages state-of-the-art Text-to-Speech (TTS) models to provide high-quality voices, while still being small and efficient.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install &lt;a href="https://picovoice.ai/docs/quick-start/orca-python/" rel="noopener noreferrer"&gt;Orca Text-to-Speech Python SDK&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pvorca
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Import Orca and create an Orca instance.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvorca 
orca = pvorca.create(access_key="${ACCESS_KEY}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Sign up or log in to &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; to copy your AccessKey, then replace ${ACCESS_KEY} with it.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Synthesize your desired text with
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;orca.synthesize(text="${TEXT}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For more information refer to the &lt;a href="https://picovoice.ai/docs/api/orca-python/" rel="noopener noreferrer"&gt;Orca Text-to-Speech Python SDK&lt;/a&gt; Documentation.&lt;/p&gt;
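&lt;p&gt;The synthesized audio can be written to disk with only the standard library. Below is a sketch for saving 16-bit mono PCM to a WAV file; it assumes &lt;code&gt;orca.synthesize&lt;/code&gt; returns a list of 16-bit samples and that &lt;code&gt;orca.sample_rate&lt;/code&gt; gives the output rate (check the linked documentation for the exact return type):&lt;/p&gt;

```python
import struct
import wave

def save_pcm_to_wav(pcm, sample_rate, path):
    """Write a list of 16-bit PCM samples to a mono WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("%dh" % len(pcm), *pcm))

# Hypothetical usage with Orca:
#   pcm = orca.synthesize(text="Hello World")
#   save_pcm_to_wav(pcm, orca.sample_rate, "hello.wav")
```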

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On-device TTS removes privacy concerns and internet requirements, and minimizes latency. With Python solutions like pyttsx3, Coqui TTS, and Mimic3, developers have several options for synthesizing speech directly on devices based on their needs. However, each solution comes with drawbacks such as poor voice quality, large resource requirements, or the lack of a flexible API. Another alternative is &lt;a href="https://picovoice.ai/platform/orca/" rel="noopener noreferrer"&gt;Orca Text-to-Speech&lt;/a&gt;, which combines state-of-the-art neural TTS with efficiency, allowing developers to synthesize high-quality speech even on a Raspberry Pi.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uUk1PZJfIns"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>End-to-End Voice Recognition with Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Fri, 16 Aug 2024 16:30:34 +0000</pubDate>
      <link>https://dev.to/gracezzhang/end-to-end-voice-recognition-with-python-1b1</link>
      <guid>https://dev.to/gracezzhang/end-to-end-voice-recognition-with-python-1b1</guid>
      <description>&lt;p&gt;August 16th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;There are several approaches for adding speech recognition capabilities to a Python application. In this article, I’d like to introduce a new paradigm for adding purpose-made &amp;amp; context-aware voice assistants into Python apps using the &lt;a href="https://picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Picovoice enables developers to create voice experiences similar to Alexa and Google for existing Python apps. Different from cloud-based alternatives, Picovoice is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Private and secure — no voice data leaves the app&lt;/li&gt;
&lt;li&gt;Accurate — focuses on the domain of interest&lt;/li&gt;
&lt;li&gt;Cross-platform — Linux, macOS, Windows, Raspberry Pi, …&lt;/li&gt;
&lt;li&gt;Reliable and zero-latency — eliminates unpredictable network delays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In what follows, I’ll introduce Picovoice by building a voice-enabled alarm clock using &lt;a href="https://picovoice.ai/docs/picovoice/" rel="noopener noreferrer"&gt;Picovoice SDK&lt;/a&gt;, &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;, and &lt;a href="https://docs.python.org/3/library/tkinter.html" rel="noopener noreferrer"&gt;Tkinter&lt;/a&gt; GUI framework. The code is open-source and available on Picovoice’s GitHub &lt;a href="https://github.com/Picovoice/picovoice/tree/master/demo/tkinter" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 — Install Picovoice&lt;/strong&gt;&lt;br&gt;
Install Picovoice from a terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install picovoice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2 — Create an Instance of Picovoice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Picovoice is an end-to-end voice recognition platform with wake word detection and intent inference capabilities. Picovoice uses the &lt;a href="https://picovoice.ai/platform/porcupine/" rel="noopener noreferrer"&gt;Porcupine Wake Word&lt;/a&gt; engine for voice activation and the &lt;a href="https://picovoice.ai/platform/rhino/" rel="noopener noreferrer"&gt;Rhino Speech-to-Intent&lt;/a&gt; engine for inferring intent from follow-on voice commands. For example, when a user says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Picovoice, set an alarm for 2 hours and 31 seconds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Porcupine detects the utterance of the Picovoice wake word. Then Rhino infers the user’s intent from the follow-on command and provides a structured inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  is_understood: true,
  intent: setAlarm,
  slots: {
    hours: 2,
    seconds: 31
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create an instance of Picovoice by providing paths to Porcupine and Rhino models and callbacks for wake word detection and inference completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from picovoice import Picovoice

keyword_path = ...  # path to Porcupine wake word file (.PPN)

def wake_word_callback():
  pass

context_path = ...  # path to Rhino context file (.RHN)

def inference_callback(inference):
  print(inference.is_understood)
  if inference.is_understood:
    print(inference.intent)
    for k, v in inference.slots.items():
      print(f"{k} : {v}")

pv = Picovoice(
  access_key="${YOUR_ACCESS_KEY}",
  keyword_path=keyword_path,
  wake_word_callback=wake_word_callback,
  context_path=context_path,
  inference_callback=inference_callback)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
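&lt;p&gt;Inside inference_callback, the slot values can be turned into a countdown duration for the alarm. A minimal sketch; the slot names follow the Alarm context above, and the int() conversion is an assumption that also covers SDKs that deliver slot values as strings:&lt;/p&gt;

```python
def slots_to_seconds(slots):
    """Convert Alarm-context slots (hours/minutes/seconds) to total seconds.

    int() handles values delivered either as numbers or as strings.
    """
    units = {"hours": 3600, "minutes": 60, "seconds": 1}
    return sum(int(value) * units[name] for name, value in slots.items() if name in units)

# "Picovoice, set an alarm for 2 hours and 31 seconds."
# → slots_to_seconds({"hours": "2", "seconds": "31"}) == 7231
```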



&lt;p&gt;Several pre-trained Porcupine and Rhino models are available on their GitHub repositories [1][2]. For this demo, we use the pre-trained Picovoice Porcupine model and the pre-trained Alarm Rhino model. Developers can also create custom models using &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3 — Get your Free AccessKey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sign up for &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; to get your AccessKey. It is free. AccessKey is used for authentication and authorization when using Picovoice SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4 — Process Audio with Picovoice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the engine is instantiated it can process a stream of audio. Simply pass frames of audio to the engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pv.process(audio_frame)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5 — Read audio from the Microphone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install &lt;a href="https://pypi.org/project/pvrecorder/" rel="noopener noreferrer"&gt;pvrecorder&lt;/a&gt;. Then, read the audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pvrecoder import PvRecoder
# `-1` is the default input audio device.
recorder = PvRecoder(device_index=-1)
recorder.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read frames of audio from the recorder and pass them to Picovoice’s .process method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcm = recorder.read()
pv.process(pcm)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
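&lt;p&gt;Steps 4 and 5 combine into a simple pump loop. A minimal, hardware-free sketch, where read_frame and process stand in for recorder.read and pv.process (the function and parameter names are ours):&lt;/p&gt;

```python
def audio_loop(read_frame, process, should_stop):
    """Keep feeding microphone frames into the recognizer until told to stop.

    read_frame() returns one frame of audio samples; process(frame) hands it
    to the recognition engine; should_stop() lets the caller end the loop.
    Returns the number of frames processed.
    """
    frames = 0
    while not should_stop():
        process(read_frame())
        frames += 1
    return frames
```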



&lt;p&gt;&lt;strong&gt;6 — Create a Cross-Platform GUI using Tkinter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tkinter is the standard GUI framework shipped with Python. Create a frame (window), add a label showing the remaining time to it, and launch the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;window = tk.Tk()
time_label = tk.Label(window, text='00 : 00 : 00')
time_label.pack()

window.protocol('WM_DELETE_WINDOW', on_close)

window.mainloop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7 — Putting it Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are about 200 lines of code for GUI, audio recording, and voice recognition. I also created a separate thread for audio processing to avoid blocking the main GUI thread.&lt;/p&gt;
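&lt;p&gt;The separate audio thread can be sketched with Python's standard threading module. Here read_frame and process again stand in for recorder.read and pv.process, and stop_event would be set by the window's close handler; all names are illustrative:&lt;/p&gt;

```python
import threading

def start_audio_thread(read_frame, process, stop_event):
    """Run the read/process loop off the main (GUI) thread.

    Keeps Tkinter's mainloop responsive while audio is being processed.
    stop_event is a threading.Event set by the GUI's close handler.
    """
    def worker():
        while not stop_event.is_set():
            process(read_frame())

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```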

&lt;p&gt;If you have technical questions or suggestions please open a GitHub issue on Picovoice’s GitHub &lt;a href="https://github.com/Picovoice/picovoice" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. If you wish to modify or improve this demo, feel free to submit a pull request.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/sIfSuOjnmVU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Record Audio using Python — Picovoice</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Fri, 16 Aug 2024 15:43:04 +0000</pubDate>
      <link>https://dev.to/gracezzhang/how-to-record-audio-using-python-picovoice-ff8</link>
      <guid>https://dev.to/gracezzhang/how-to-record-audio-using-python-picovoice-ff8</guid>
<description>&lt;p&gt;August 16th, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;Recording audio from a microphone using Python is tricky! Why? Because Python doesn't provide a standard library for it. Existing third-party libraries (e.g. PyAudio) are not cross-platform and have external dependencies. We learned this the hard way as we needed microphone recording functionality in our voice recognition demos and created &lt;a href="https://picovoice.ai/docs/audio-recording-software/" rel="noopener noreferrer"&gt;Picovoice Audio Recorders.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hence we created &lt;a href="https://picovoice.ai/docs/quick-start/pvrecorder-python/" rel="noopener noreferrer"&gt;PvRecorder&lt;/a&gt;, a cross-platform library that supports Linux, macOS, Windows, Raspberry Pi, NVIDIA Jetson, and BeagleBone. PvRecorder has SDKs for Python, Node.js, .NET, Go, and Rust.&lt;/p&gt;

&lt;p&gt;Below we learn how to record audio in Python using PvRecorder. The Python SDK captures audio suitable for speech recognition, meaning the audio captured is already 16 kHz and 16-bit. PvRecorder Python SDK runs on Linux, macOS, Windows, Raspberry Pi, NVIDIA Jetson, and BeagleBone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install PvRecorder using PIP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pvrecorder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Find Available Microphones&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A computer can have multiple microphones. For example, a laptop has a built-in microphone and might have a headset attached to it. The first step is to find the microphone we want to record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pvrecorder import PvRecorder

for index, device in enumerate(PvRecorder.get_audio_devices()):    
  print(f"[{index}] {device}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the above on a Dell XPS laptop gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0] Monitor of sof-hda-dsp HDMI3/DP3 Output
[1] Monitor of sof-hda-dsp HDMI2/DP2 Output
[2] Monitor of sof-hda-dsp HDMI1/DP1 Output
[3] Monitor of sof-hda-dsp Speaker + Headphones
[4] sof-hda-dsp Headset Mono Microphone + Headphones Stereo Microphone
[5] sof-hda-dsp Digital Microphone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take note of the index of your target microphone. We pass this to the constructor of PvRecorder. When unsure, pass -1 to the constructor to use the default microphone.&lt;/p&gt;
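&lt;p&gt;To avoid hard-coding an index, a device can also be selected by a fragment of its name. A small sketch; the helper name is ours, and the device list would come from PvRecorder.get_audio_devices():&lt;/p&gt;

```python
def find_device_index(devices, name_fragment):
    """Return the index of the first device whose name contains name_fragment
    (case-insensitive), or -1 (the default device) when there is no match."""
    for index, device in enumerate(devices):
        if name_fragment.lower() in device.lower():
            return index
    return -1

# Hypothetical usage:
#   index = find_device_index(PvRecorder.get_audio_devices(), "Digital Microphone")
```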

&lt;p&gt;&lt;strong&gt;Record Audio&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, create an instance of PvRecorder. You need to provide a device_index (see above) and a frame_length. frame_length is the number of audio samples you wish to receive at each read. We set it to 512 (32 milliseconds of 16 kHz audio). Then call .start() to start recording. Once recording, keep calling .read() in a loop to receive audio. Invoke .stop() to stop recording and then .delete() to release resources when done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder = PvRecorder(device_index=-1, frame_length=512)

try:    recorder.start()

  while True:        
    frame = recorder.read()        
    # Do something ...

except KeyboardInterrupt:    
  recorder.stop()
finally:    
  recorder.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Save Audio to File&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can do whatever you wish with the audio captured in the snippet above: detect wake words, recognize voice commands, transcribe speech to text, index audio for search, or save it to a file. The code snippet below shows how to save the audio in WAVE format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder = PvRecorder(device_index=-1, frame_length=512)
audio = []

try:    
  recorder.start()

  while True:        
    frame = recorder.read()        
    audio.extend(frame)
except KeyboardInterrupt:    
  recorder.stop()    
  with wave.open(path, 'w') as f:        
      f.setparams((1, 2, 16000, 512, "NONE", "NONE"))        
      f.writeframes(struct.pack("h" * len(audio), *audio))
finally:    
  recorder.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2kM_AoVktMc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
