<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: grace </title>
    <description>The latest articles on DEV Community by grace  (@gracezzhang).</description>
    <link>https://dev.to/gracezzhang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1938744%2F25a56e56-9d38-434a-9244-07f472b811db.png</url>
      <title>DEV Community: grace </title>
      <link>https://dev.to/gracezzhang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/gracezzhang"/>
    <language>en</language>
    <item>
      <title>Suppress Noise in 3 Lines of Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 21:03:42 +0000</pubDate>
      <link>https://dev.to/gracezzhang/suppress-noise-in-3-lines-of-python-1bg</link>
      <guid>https://dev.to/gracezzhang/suppress-noise-in-3-lines-of-python-1bg</guid>
      <description>&lt;p&gt;August 22, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;Learn how to suppress background acoustic noise using Picovoice &lt;a href="https://picovoice.ai/docs/quick-start/koala-python/" rel="noopener noreferrer"&gt;Koala Noise Suppression&lt;/a&gt; Python SDK.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://picovoice.ai/platform/koala/" rel="noopener noreferrer"&gt;Koala Noise Suppression&lt;/a&gt; performs speech enhancement locally, keeping your voice data private (i.e. GDPR and HIPAA-compliant by design). Furthermore, by running on the device, Koala Noise Suppression guarantees real-time processing with minimum latency. The SDK runs on Linux, macOS, Windows, Raspberry Pi, and NVIDIA Jetson. Koala Noise Suppression can also run on Android, iOS, and web browsers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Noise Suppression, Noise Cancellation, and Speech Enhancement all refer to the same technology.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Install Noise Suppression Python SDK&lt;/strong&gt;&lt;br&gt;
Install the SDK:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pvkoala
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sign up for Picovoice Console&lt;/strong&gt;&lt;br&gt;
Sign up for (or log in to) &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;. It is free, and no credit card is required. Copy your AccessKey to the clipboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Line 1&lt;/strong&gt;&lt;br&gt;
Import the package:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvkoala
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 2&lt;/strong&gt;&lt;br&gt;
Create an instance of the noise cancellation object with your AccessKey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;handle = pvkoala.create(access_key)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line 3&lt;/strong&gt;&lt;br&gt;
Suppress noise:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;enhanced_pcm = handle.process(pcm)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Koala Noise Suppression processes incoming audio in frames. The length of each frame can be obtained via handle.frame_length. Koala Noise Suppression operates on single-channel, 16 kHz audio.&lt;/p&gt;
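To illustrate the frame-based flow, here is a pure-Python sketch (no Koala calls) of splitting an audio buffer into fixed-size frames; frame_length stands in for the value reported by handle.frame_length:

```python
# Pure-Python sketch (no Koala calls): split a buffer of 16-bit samples
# into full frames of frame_length samples each. A trailing partial
# frame is dropped here for simplicity.
def chunk_into_frames(pcm, frame_length):
    num_frames = len(pcm) // frame_length
    return [pcm[i * frame_length:(i + 1) * frame_length] for i in range(num_frames)]

# each full frame would be passed to handle.process(frame) in turn
frames = chunk_into_frames(list(range(10)), frame_length=4)
# frames is [[0, 1, 2, 3], [4, 5, 6, 7]]
```

In a real application the remainder samples would be buffered until the next read rather than dropped.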

&lt;p&gt;It only takes 90 seconds to suppress noise and enhance speech!&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/gq_-OD9SsmQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;&lt;br&gt;
The source code for a fully working demo built with the Koala Noise Suppression Python SDK is available on &lt;a href="https://github.com/Picovoice/koala/tree/main/demo/python" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For more information, check out the &lt;a href="https://picovoice.ai/platform/koala/" rel="noopener noreferrer"&gt;Koala Noise Suppression product page&lt;/a&gt; or refer to the &lt;a href="https://picovoice.ai/docs/quick-start/koala-python/" rel="noopener noreferrer"&gt;Koala Noise Suppression Python SDK quick start guide&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>OctoTube: Voice Search for YouTube</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:59:56 +0000</pubDate>
      <link>https://dev.to/gracezzhang/octotube-voice-search-for-youtube-1f3j</link>
      <guid>https://dev.to/gracezzhang/octotube-voice-search-for-youtube-1f3j</guid>
      <description>&lt;p&gt;Have you ever been in a situation where you are going back and forth in a YouTube video searching for a specific phrase? No more. There is a little script that can search any video (even without transcription) lightning-fast and point you to the exact second the phrase occurs. Enter OctoTube!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Get Started&lt;/strong&gt;&lt;br&gt;
Clone the Octopus GitHub repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone --recurse-submodules https://github.com/Picovoice/octopus.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run this from the root of the repository to install Python dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install -r demo/youtube/requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Get an AccessKey from &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;. It is free.&lt;/p&gt;

&lt;p&gt;Find a YouTube video you would like to search, then run this from the root of the repository:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python3 demo/youtube/octotube.py \
--access-key ${ACCESS_KEY} \
--url ${YOUTUBE_VIDEO_URL} \
--phrases ${SEARCH_PHRASE0} ${SEARCH_PHRASE1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should get something like the below (yes, I watch too much &lt;a href="https://en.wikipedia.org/wiki/Silicon_Valley_(TV_series)" rel="noopener noreferrer"&gt;Silicon Valley&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;indexed 3024 seconds of audio in 54.36 seconds
searched 3024 seconds of audio for 1 phrases in 0.01013 seconds
pied piper &amp;gt;&amp;gt;&amp;gt;
[0.5] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=784
[1.0] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=840
[1.0] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=2355
[1.0] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=2940
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
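The timestamped links in the output above can be reproduced from raw match results. A small sketch, assuming matches arrive as (score, start-second) pairs; the video ID and values are taken from the sample output:

```python
from urllib.parse import urlencode

# Sketch of turning (score, start_sec) matches into timestamped YouTube
# links; urlencode joins the v and t query parameters with an ampersand.
def match_links(video_id, matches):
    links = []
    for score, start_sec in matches:
        query = urlencode({"v": video_id, "t": int(start_sec)})
        links.append("[%.1f] https://www.youtube.com/watch?%s" % (score, query))
    return links

links = match_links("Lt6PPiTTwbE", [(0.5, 784), (1.0, 840)])
```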



&lt;p&gt;Notice that indexing is the bulk of the processing time. The good news is once the video is indexed, it is super fast to search for more (similar to how the Google search engine works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;searched 3024 seconds of audio for 1 phrases in 0.00655 seconds
jian yang &amp;gt;&amp;gt;&amp;gt;
[0.3] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=1332
[0.7] https://www.youtube.com/watch?v=Lt6PPiTTwbE&amp;amp;t=2478
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
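The index-once, search-many behaviour can be pictured with a toy inverted index (an analogy only; Octopus indexes audio directly rather than text):

```python
# Toy inverted index: pay the indexing cost once, then every search is a
# cheap dictionary lookup. Analogy only; Octopus works on acoustic data.
def build_index(words_with_times):
    index = {}
    for word, second in words_with_times:
        index.setdefault(word, []).append(second)
    return index

index = build_index([("pied", 784), ("piper", 785), ("pied", 840)])
hits = index.get("pied", [])
# hits is [784, 840]; searching for another phrase reuses the same index
```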



&lt;p&gt;&lt;strong&gt;How Does it Work?&lt;/strong&gt;&lt;br&gt;
OctoTube uses the Picovoice Speech-to-Index engine (also known as &lt;a href="https://picovoice.ai/platform/octopus/" rel="noopener noreferrer"&gt;Octopus&lt;/a&gt;). Octopus directly indexes audio without relying on a text representation (&lt;a href="https://picovoice.ai/blog/direct-speech-indexing/" rel="noopener noreferrer"&gt;Learn more&lt;/a&gt;). Octopus runs on Android, iOS, Ubuntu, macOS, Windows, and even modern web browsers.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/s-HkScIDmCU"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Start Building&lt;/strong&gt;&lt;br&gt;
Go to Octopus’s &lt;a href="https://github.com/Picovoice/octopus" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; and start building your applications with Octopus!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Speaker Diarization in Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:56:25 +0000</pubDate>
      <link>https://dev.to/gracezzhang/speaker-diarization-in-python-235i</link>
      <guid>https://dev.to/gracezzhang/speaker-diarization-in-python-235i</guid>
      <description>&lt;p&gt;August 22, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Speaker diarization is the process of dividing an audio stream into distinct segments based on speaker identity. In simpler terms, it answers the question, "Who spoke when?"&lt;/p&gt;
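Once "who spoke when" is known, per-speaker statistics fall out directly. A minimal sketch, with made-up (speaker, start, end) segments standing in for real diarization output:

```python
# Minimal sketch: total speaking time per speaker from diarization-style
# (speaker, start_sec, end_sec) segments. The segment values are made up.
def talk_time(segments):
    totals = {}
    for speaker, start_sec, end_sec in segments:
        totals[speaker] = totals.get(speaker, 0.0) + (end_sec - start_sec)
    return totals

totals = talk_time([("A", 0.0, 3.0), ("B", 3.0, 4.5), ("A", 4.5, 6.0)])
# totals is {"A": 4.5, "B": 1.5}
```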

&lt;p&gt;Previously, we introduced you to some of the &lt;a href="https://picovoice.ai/blog/top-speaker-diarization-apis-and-sdks/" rel="noopener noreferrer"&gt;Top Speaker Diarization&lt;/a&gt; APIs and SDKs currently available in the market. In this article, we'll dive into practical demonstrations of three Python-based speaker diarization frameworks, showcasing their capabilities through a straightforward speaker diarization task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;pyannote.audio&lt;/strong&gt;&lt;br&gt;
Getting started with &lt;a href="https://github.com/pyannote/pyannote-audio" rel="noopener noreferrer"&gt;pyannote.audio&lt;/a&gt;  for speaker diarization is straightforward. Follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the pyannote.audio package using pip:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pyannote.audio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Obtain your authentication token to download pretrained models by visiting their &lt;a href="https://huggingface.co/pyannote/speaker-diarization" rel="noopener noreferrer"&gt;Hugging Face pages&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Use the following Python code to perform speaker diarization on an audio file:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyannote.audio import Pipeline

# Replace "${ACCESS_TOKEN_GOES_HERE}" with your authentication token
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="${ACCESS_TOKEN_GOES_HERE}")

# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
diarization = pipeline("${AUDIO_FILE_PATH}")

for segment, _, speaker in diarization.itertracks(yield_label=True):
    print(f'Speaker "{speaker}" - "{segment}"')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This code will perform speaker diarization and print out the identified speakers along with their corresponding segments in the audio file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NVIDIA NeMo&lt;/strong&gt;&lt;br&gt;
To perform speaker diarization using &lt;a href="https://github.com/NVIDIA/NeMo" rel="noopener noreferrer"&gt;NVIDIA NeMo&lt;/a&gt;, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install dependencies:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apt-get update &amp;amp;&amp;amp; apt-get install -y libsndfile1 ffmpeg
pip3 install Cython
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Install NeMo:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/NVIDIA/NeMo.git@r1.20.0#egg=nemo_toolkit[all]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;&lt;p&gt;Download the config file for the inference from the &lt;a href="https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/inference" rel="noopener noreferrer"&gt;NeMo GitHub repository&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Generate and store the manifest file by running the following code:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
import os

from nemo.collections.asr.models import ClusteringDiarizer
from omegaconf import OmegaConf

INPUT_FILE = '/PATH/TO/AUDIO_FILE.wav'
MANIFEST_FILE = '/PATH/TO/MANIFEST_FILE.json'

meta = {
    'audio_filepath': INPUT_FILE,
    'offset': 0,
    'duration': None,
    'label': 'infer',
    'text': '-',
    'num_speakers': None,
    'rttm_filepath': None,
    'uem_filepath': None
}
with open(MANIFEST_FILE, 'w') as fp:
    json.dump(meta, fp)
    fp.write('\n')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Replace /PATH/TO/AUDIO_FILE.wav with the path to your audio file and /PATH/TO/MANIFEST_FILE.json with the desired path for your manifest file.&lt;/p&gt;

&lt;ol start="5"&gt;
&lt;li&gt;Load the config file and define a ClusteringDiarizer object:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OUTPUT_DIR = '/PATH/TO/OUTPUT_DIR'
MODEL_CONFIG = '/PATH/TO/CONFIG_FILE.yaml'

config = OmegaConf.load(MODEL_CONFIG)
config.diarizer.manifest_filepath = MANIFEST_FILE
config.diarizer.out_dir = OUTPUT_DIR
config.diarizer.oracle_vad = False
config.diarizer.clustering.parameters.oracle_num_speakers = False

sd_model = ClusteringDiarizer(cfg=config)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Replace /PATH/TO/OUTPUT_DIR and /PATH/TO/CONFIG_FILE.yaml with the desired paths for your output directory and config file, respectively.&lt;/p&gt;

&lt;ol start="6"&gt;
&lt;li&gt;Perform speaker diarization on the audio file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sd_model.diarize()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The output of the speaker diarization will be stored in the OUTPUT_DIR directory as a &lt;a href="https://github.com/nryant/dscore#rttm" rel="noopener noreferrer"&gt;Rich Transcription Time Marked (RTTM)&lt;/a&gt; file.&lt;/p&gt;
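An RTTM file is plain text with one SPEAKER record per line. Here is a parsing sketch, assuming the standard ten-field layout described in the dscore README; the sample record is made up:

```python
# Sketch of reading a SPEAKER record from an RTTM file. Field layout
# (per the dscore README): type, file, channel, onset, duration, ortho,
# subtype, speaker name, confidence, lookahead. Unused fields hold the
# literal placeholder token "NA" wrapped in angle brackets.
NA = chr(60) + "NA" + chr(62)

def parse_rttm_line(line):
    fields = line.split()
    return {
        "file": fields[1],
        "onset": float(fields[3]),
        "duration": float(fields[4]),
        "speaker": fields[7],
    }

example = " ".join(["SPEAKER", "meeting", "1", "5.00", "2.30", NA, NA, "speaker_0", NA, NA])
record = parse_rttm_line(example)
# record["speaker"] is "speaker_0"; the segment runs from 5.00 for 2.30 seconds
```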

&lt;p&gt;&lt;strong&gt;Simple Diarizer&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/cvqluu/simple_diarizer" rel="noopener noreferrer"&gt;Simple Diarizer&lt;/a&gt;  is a speaker diarization library that utilizes pretrained models from &lt;a href="https://speechbrain.github.io/" rel="noopener noreferrer"&gt;SpeechBrain&lt;/a&gt; . To get started with simple_diarizer, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the package using pip:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install simple_diarizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Define a Diarizer object:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from simple_diarizer.diarizer import Diarizer

diarization = Diarizer(embed_model='xvec', cluster_method='sc')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="3"&gt;
&lt;li&gt;Perform speaker diarization on an audio file by either passing the number of speakers:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = diarization.diarize("${AUDIO_FILE_PATH}", num_speakers=NUM_SPEAKERS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Or by passing a threshold value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;segments = diarization.diarize("${AUDIO_FILE_PATH}", threshold=THRESHOLD)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The speaker information and timing details, including the start and end times of each segment, are stored in the segments variable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Falcon Speaker Diarization&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/Picovoice/falcon/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization&lt;/a&gt;  is an on-device speaker diarization engine powered by deep learning. To get started with Falcon Speaker Diarization, follow these steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install the package using pip:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pvfalcon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Sign up for &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; for free and copy your AccessKey. It handles authentication and authorization.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create an instance of the engine:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvfalcon

# Replace "${ACCESS_KEY}" with your Picovoice Console AccessKey
falcon = pvfalcon.create(access_key="${ACCESS_KEY}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="4"&gt;
&lt;li&gt;Perform speaker diarization on an audio file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Replace "${AUDIO_FILE_PATH}" with the path to your audio file
segments = falcon.process_file("${AUDIO_FILE_PATH}")
for segment in segments:
    print(
        "{speaker_tag=%d start_sec=%.2f end_sec=%.2f}"
        % (segment.speaker_tag, segment.start_sec, segment.end_sec)
    )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The segments variable represents an array of segments, each of which includes the segment's timing and speaker information.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/UJmXXHfP-NQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;For more information about Falcon Speaker Diarization, check out the &lt;a href="https://picovoice.ai/platform/falcon/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization&lt;/a&gt; product page or refer to the &lt;a href="https://picovoice.ai/docs/quick-start/falcon-python/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization Python SDK quick start&lt;/a&gt; guide.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Real-time Speaker Identification in Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:48:18 +0000</pubDate>
      <link>https://dev.to/gracezzhang/real-time-speaker-identification-in-python-44b9</link>
      <guid>https://dev.to/gracezzhang/real-time-speaker-identification-in-python-44b9</guid>
      <description>&lt;p&gt;August 22, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Speaker Recognition (or Speaker Identification) analyzes distinctive voice characteristics to identify and verify speakers. It is the technology behind voice authentication, speaker-based personalization, and speaker spotting. However, many applications of Speaker Recognition suffer from the high latency of cloud-based services, leading to poor user experience. That is where Picovoice's &lt;a href="https://picovoice.ai/platform/eagle/" rel="noopener noreferrer"&gt;Eagle Speaker Recognition SDK&lt;/a&gt; comes in, offering on-device Speaker Recognition without sacrificing accuracy. What's more, &lt;a href="https://picovoice.ai/platform/eagle/" rel="noopener noreferrer"&gt;Eagle Speaker Recognition&lt;/a&gt; makes it so easy, you can add Speaker Recognition to your app in just a few lines of Python.&lt;/p&gt;

&lt;p&gt;Speaker Recognition typically requires two steps. The first step is speaker Enrollment, where a speaker's voice is registered using a short clip of audio to produce a Speaker Profile. The second step is Recognition, where the Speaker Profile is used to detect when that speaker is speaking given an audio stream.&lt;/p&gt;

&lt;p&gt;Let's see how to use the &lt;a href="https://picovoice.ai/platform/eagle/" rel="noopener noreferrer"&gt;Eagle Speaker Recognition&lt;/a&gt; Python SDK / API to implement a speaker recognition app!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
Install &lt;a href="https://pypi.org/project/pveagle/" rel="noopener noreferrer"&gt;pveagle&lt;/a&gt; using pip. We will be using &lt;a href="https://pypi.org/project/pvrecorder/" rel="noopener noreferrer"&gt;pvrecorder&lt;/a&gt; for cross-platform audio capture, so install that as well:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pveagle pvrecorder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Lastly, you will need a Picovoice AccessKey, which can be obtained with a free &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; account.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enroll a speaker&lt;/strong&gt;&lt;br&gt;
Import pveagle and create an instance of the EagleProfiler class:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pveagle

access_key = "{YOUR_ACCESS_KEY}"
try:
    eagle_profiler = pveagle.create_profiler(access_key=access_key)
except pveagle.EagleError as e:
    # Handle error
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, import pvrecorder and create an instance of the recorder as well. Use the EagleProfiler's .min_enroll_samples as the frame_length:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pvrecorder import PvRecorder

DEFAULT_DEVICE_INDEX = -1
recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle_profiler.min_enroll_samples)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now it's time to enroll a speaker. The .enroll() function takes in frames of audio and provides feedback on the audio quality and Enrollment percentage. Use the percentage value to know when Enrollment is done and another speaker can be enrolled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder.start()

enroll_percentage = 0.0
while enroll_percentage &amp;lt; 100.0:
    audio_frame = recorder.read()
    enroll_percentage, feedback = eagle_profiler.enroll(audio_frame)

recorder.stop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once Enrollment reaches 100%, export the speaker profile to use in the next step, Speaker Recognition:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;speaker_profile = eagle_profiler.export()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The speaker_profile object can be saved and reused; see the &lt;a href="https://picovoice.ai/docs/api/eagle-python/" rel="noopener noreferrer"&gt;docs&lt;/a&gt; for more details. Profiles can be made for additional users by calling the .reset() function on the EagleProfiler, and repeating the .enroll() step.&lt;/p&gt;
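To carry a profile between runs, it must be serialized and written to disk. How the profile object becomes bytes is SDK-specific (check the Eagle Python docs for the actual serialization API); the sketch below shows only the persistence step around it:

```python
from pathlib import Path

# Plain file I/O around a serialized speaker profile. The pveagle calls
# that produce and consume the bytes are deliberately left out; only
# the save/restore step is shown.
def save_profile(profile_bytes, path):
    Path(path).write_bytes(profile_bytes)

def load_profile(path):
    return Path(path).read_bytes()
```

On the next run, the restored bytes would be handed back to pveagle when creating the recognizer, so enrollment does not have to be repeated.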

&lt;p&gt;Once profiles have been created for all speakers, don't forget to clean up used resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder.delete()
eagle_profiler.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Perform recognition&lt;/strong&gt;&lt;br&gt;
Import pveagle and create an instance of the Eagle class, using the speaker profiles created by the Enrollment step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pveagle

access_key = "{YOUR_ACCESS_KEY}"
profiles = [speaker_profile_1, speaker_profile_2]
try:
    eagle = pveagle.create_recognizer(
        access_key=access_key,
        speaker_profiles=profiles)
except pveagle.EagleError as e:
    # Handle error
    pass
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now set up pvrecorder to use with Eagle:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle.frame_length)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pass audio frames into the eagle.process() function to get back speaker scores:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;while True:
    audio_frame = recorder.read()
    scores = eagle.process(audio_frame)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
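Each call returns one score per enrolled profile, in enrollment order. A sketch of picking the most likely speaker from such a list; the names and score values are made up:

```python
# Pick the highest-scoring enrolled speaker. scores has one entry per
# profile, in the order the profiles were passed to the recognizer.
def best_speaker(names, scores):
    best_name, best_score = max(zip(names, scores), key=lambda pair: pair[1])
    return best_name, best_score

name, score = best_speaker(["alice", "bob"], [0.1, 0.8])
# name is "bob", score is 0.8
```

In practice an application would also compare the winning score against a minimum threshold before claiming a match, so that unknown voices are not forced onto an enrolled speaker.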



&lt;p&gt;When finished, don't forget to clean up used resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;eagle.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Putting It All Together&lt;/strong&gt;&lt;br&gt;
Here is an example program bringing together everything that has been shown so far:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pveagle
from pvrecorder import PvRecorder

DEFAULT_DEVICE_INDEX = -1
access_key = "{YOUR_ACCESS_KEY}"

# Step 1: Enrollment
try:
    eagle_profiler = pveagle.create_profiler(access_key=access_key)
except pveagle.EagleError as e:
    pass

enroll_recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle_profiler.min_enroll_samples)

enroll_recorder.start()

enroll_percentage = 0.0
while enroll_percentage &amp;lt; 100.0:
    audio_frame = enroll_recorder.read()
    enroll_percentage, feedback = eagle_profiler.enroll(audio_frame)

enroll_recorder.stop()

speaker_profile = eagle_profiler.export()

enroll_recorder.delete()
eagle_profiler.delete()

# Step 2: Recognition
try:
    eagle = pveagle.create_recognizer(
        access_key=access_key,
        speaker_profiles=[speaker_profile])
except pveagle.EagleError as e:
    pass

recognizer_recorder = PvRecorder(
    device_index=DEFAULT_DEVICE_INDEX,
    frame_length=eagle.frame_length)

recognizer_recorder.start()

try:
    while True:
        audio_frame = recognizer_recorder.read()
        scores = eagle.process(audio_frame)
        print(scores)
except KeyboardInterrupt:
    pass

recognizer_recorder.stop()

recognizer_recorder.delete()
eagle.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It just takes 2 minutes to get it up and running:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/Fbt3Swkh7HM"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next Steps&lt;/strong&gt;&lt;br&gt;
See the &lt;a href="https://github.com/Picovoice/eagle/tree/main/demo/python" rel="noopener noreferrer"&gt;GitHub Python Demo&lt;/a&gt; for a more complete example, including how to handle Enrollment feedback, save Speaker Profiles to disk, and use files as the audio input. You can also view the &lt;a href="https://picovoice.ai/docs/api/eagle-python/" rel="noopener noreferrer"&gt;Python API docs&lt;/a&gt; for details on the package.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Adding Speaker Diarization to OpenAI Whisper using Picovoice Falcon</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Thu, 22 Aug 2024 20:43:06 +0000</pubDate>
      <link>https://dev.to/gracezzhang/adding-speaker-diarization-to-openai-whisper-using-picovoice-falcon-5c20</link>
      <guid>https://dev.to/gracezzhang/adding-speaker-diarization-to-openai-whisper-using-picovoice-falcon-5c20</guid>
      <description>&lt;p&gt;August 22, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;OpenAI Whisper Speech-to-Text is a locally executable speech recognition model that comes in various sizes, allowing users to choose a model that suits their device's specifications. Unfortunately, Whisper lacks speaker diarization, a crucial feature for applications that require speaker identification (e.g. discerning speakers in a meeting scenario).&lt;/p&gt;

&lt;p&gt;This article guides you through the process of integrating &lt;a href="https://picovoice.ai/platform/falcon/" rel="noopener noreferrer"&gt;Picovoice Falcon Speaker Diarization&lt;/a&gt; with OpenAI Whisper in Python. Adding speaker diarization will result in a more user-friendly, dialogue-style transcription.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setup&lt;/strong&gt;&lt;br&gt;
Start by installing the necessary packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install -U openai-whisper
pip3 install -U pvfalcon
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Both Falcon Speaker Diarization and Whisper Speech-to-Text run on CPU and do not require a GPU. While Whisper may be slow on CPU, utilizing a GPU can improve its runtime.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Speech Recognition with Whisper&lt;/strong&gt;&lt;br&gt;
Let's begin by utilizing Whisper for speech recognition. The code snippet below demonstrates how to transcribe speech using Whisper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import whisper

model = whisper.load_model(${WHISPER_MODEL})
result = model.transcribe(${AUDIO_FILE_PATH})
transcript_segments = result["segments"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, ${WHISPER_MODEL} refers to one of the available &lt;a href="https://github.com/openai/whisper?tab=readme-ov-file#available-models-and-languages" rel="noopener noreferrer"&gt;Whisper models&lt;/a&gt; , and ${AUDIO_FILE_PATH} is the path to the audio file. Since our goal is a dialogue-style transcription, we'll focus on extracting segments from the result, each representing a part of the transcript with its corresponding timestamp.&lt;/p&gt;
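Each Whisper segment is a dict carrying "start", "end", and "text" keys. A small sketch of rendering such segments as a timestamped transcript; the sample segment is made up:

```python
# Render Whisper-style segments (dicts with "start", "end", "text") as
# a timestamped transcript. The sample segment below is made up.
def render(segments):
    lines = []
    for seg in segments:
        lines.append("[%6.2f - %6.2f] %s" % (seg["start"], seg["end"], seg["text"].strip()))
    return "\n".join(lines)

print(render([{"start": 0.0, "end": 2.4, "text": " Hello there."}]))
# prints: [  0.00 -   2.40] Hello there.
```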

&lt;p&gt;&lt;strong&gt;Speaker Diarization with Falcon&lt;/strong&gt;&lt;br&gt;
Next, let's perform speaker diarization using Falcon. The following code snippet illustrates how to apply Falcon for this purpose:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvfalcon

falcon = pvfalcon.create(access_key=${ACCESS_KEY})
speaker_segments = falcon.process_file(${AUDIO_FILE_PATH})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, ${ACCESS_KEY} is your access key obtained from the &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;. The process method result is a list of speaker segments, similar to Whisper's segments but with speaker_tag fields indicating the speaker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Integrating Whisper and Falcon Speaker Diarization&lt;/strong&gt;&lt;br&gt;
By combining OpenAI Whisper for speech recognition and Picovoice Falcon Speaker Diarization for speaker diarization, we aim to create a dialogue-style transcription. To achieve this, we'll define a simple score to measure the overlap between Whisper and Falcon Speaker Diarization segments. The following code snippet demonstrates how to calculate this score:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def segment_score(transcript_segment, speaker_segment):
    transcript_segment_start = transcript_segment["start"]
    transcript_segment_end = transcript_segment["end"]
    speaker_segment_start = speaker_segment.start_sec
    speaker_segment_end = speaker_segment.end_sec

    overlap = min(transcript_segment_end, speaker_segment_end) - max(transcript_segment_start, speaker_segment_start)
    overlap_ratio = overlap / (transcript_segment_end - transcript_segment_start)
    return overlap_ratio
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
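&lt;p&gt;To make the score concrete, here is a worked example with hand-made segments; a namedtuple stands in for Falcon's segment objects. A transcript segment spanning 1.0-3.0 s and a speaker segment spanning 2.0-5.0 s overlap for 1.0 s, giving a score of 0.5:&lt;/p&gt;

```python
from collections import namedtuple

# Stand-in for Falcon's segment type, which exposes start_sec/end_sec fields.
SpeakerSegment = namedtuple("SpeakerSegment", "speaker_tag start_sec end_sec")


def segment_score(transcript_segment, speaker_segment):
    # Fraction of the transcript segment covered by the speaker segment.
    overlap = min(transcript_segment["end"], speaker_segment.end_sec) - max(
        transcript_segment["start"], speaker_segment.start_sec)
    return overlap / (transcript_segment["end"] - transcript_segment["start"])


t_segment = {"start": 1.0, "end": 3.0}   # Whisper-style segment
s_segment = SpeakerSegment(1, 2.0, 5.0)  # Falcon-style segment

score = segment_score(t_segment, s_segment)
print(score)  # 1.0 s of overlap over a 2.0 s transcript segment -> 0.5
```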



&lt;p&gt;Utilizing this score, we can find the best-matching Falcon Speaker Diarization segment for each Whisper segment. The code snippet below demonstrates this process:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for t_segment in transcript_segments:
    max_score = 0
    best_s_segment = None
    for s_segment in speaker_segments:
        score = segment_score(t_segment, s_segment)
        if score &amp;gt; max_score:
            max_score = score
            best_s_segment = s_segment

    print(f"Speaker {best_s_segment.speaker_tag}: {t_segment['text']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;This is a basic approach for merging the two segment lists, intended for demonstration purposes. Results can be further enhanced with a more sophisticated matching algorithm.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Putting everything together results in the script below:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvfalcon
import whisper

model = whisper.load_model(${WHISPER_MODEL})
result = model.transcribe(${AUDIO_FILE_PATH})
transcript_segments = result["segments"]

falcon = pvfalcon.create(access_key=${ACCESS_KEY})
speaker_segments = falcon.process_file(${AUDIO_FILE_PATH})


def segment_score(transcript_segment, speaker_segment):
    transcript_segment_start = transcript_segment["start"]
    transcript_segment_end = transcript_segment["end"]
    speaker_segment_start = speaker_segment.start_sec
    speaker_segment_end = speaker_segment.end_sec

    overlap = min(transcript_segment_end, speaker_segment_end) - max(transcript_segment_start, speaker_segment_start)
    overlap_ratio = overlap / (transcript_segment_end - transcript_segment_start)
    return overlap_ratio


for t_segment in transcript_segments:
    max_score = 0
    best_s_segment = None
    for s_segment in speaker_segments:
        score = segment_score(t_segment, s_segment)
        if score &amp;gt; max_score:
            max_score = score
            best_s_segment = s_segment

    if best_s_segment is not None:
        print(f"Speaker {best_s_segment.speaker_tag}: {t_segment['text']}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected result follows a format similar to the output below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Speaker 1:  Hey, has the task been completed?
Speaker 2:  I don't know anything about it.
Speaker 3:  Well, we're in the process of working on it. 
Speaker 3:  There's a bit of a delay because we're waiting on someone else to complete their part.
Speaker 1:  Waiting again? This is taking longer than expected. 
Speaker 1:  Can we get an update on the timeline?
Speaker 3:  I understand the urgency. 
Speaker 3:  I've followed up with the person responsible, and they've assured me they're working on it. 
Speaker 3:  We should have a clearer timeline by the end of the day.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
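&lt;p&gt;The matching loop above assigns each transcript segment to the single best-overlapping speaker segment. A small refinement is to sum overlap per speaker tag, which helps when one speaker owns several short segments inside a transcript window. The snippet below sketches this with hand-made stand-in data (plain dicts) rather than real Whisper and Falcon output:&lt;/p&gt;

```python
from collections import defaultdict

# Stand-in segments; real ones come from Whisper (dicts with "start"/"end")
# and Falcon (objects with start_sec/end_sec/speaker_tag).
transcript_segment = {"start": 0.0, "end": 4.0, "text": " Hello there."}
speaker_segments = [
    {"speaker_tag": 1, "start_sec": 0.0, "end_sec": 1.0},
    {"speaker_tag": 1, "start_sec": 1.2, "end_sec": 3.0},
    {"speaker_tag": 2, "start_sec": 3.0, "end_sec": 4.0},
]

# Accumulate total overlap duration per speaker instead of
# keeping only the single best-matching segment.
overlap_by_speaker = defaultdict(float)
for s in speaker_segments:
    overlap = min(transcript_segment["end"], s["end_sec"]) - max(
        transcript_segment["start"], s["start_sec"])
    if overlap > 0:
        overlap_by_speaker[s["speaker_tag"]] += overlap

best_speaker = max(overlap_by_speaker, key=overlap_by_speaker.get)
print(f"Speaker {best_speaker}:{transcript_segment['text']}")
```

Speaker 1 accumulates roughly 2.8 s of overlap against 1.0 s for speaker 2, so the segment is attributed to speaker 1.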



&lt;p&gt;It only takes a minute to add speaker diarization to Whisper using Falcon:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/hbqtq_FydeM"&gt;
&lt;/iframe&gt;
 &lt;br&gt;
For more in-depth information on the Falcon Speaker Diarization Python SDK, delve into the &lt;a href="https://picovoice.ai/docs/quick-start/falcon-python/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;. For those seeking a seamless solution that effortlessly combines speech recognition and speaker diarization, consider exploring &lt;a href="https://picovoice.ai/docs/quick-start/leopard-python/" rel="noopener noreferrer"&gt;Picovoice Leopard Speech-to-Text&lt;/a&gt;. Leopard Speech-to-Text, recognized for its &lt;a href="https://picovoice.ai/docs/benchmark/stt/#core-hour-1" rel="noopener noreferrer"&gt;lightweight and fast performance&lt;/a&gt;, internally incorporates &lt;a href="https://picovoice.ai/platform/falcon/" rel="noopener noreferrer"&gt;Falcon Speaker Diarization&lt;/a&gt;, resulting in optimized outcomes. It streamlines the transcription process, enabling you to effortlessly obtain speaker information through a single function call.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Speech-to-Text with React.js</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Mon, 19 Aug 2024 18:23:54 +0000</pubDate>
      <link>https://dev.to/gracezzhang/speech-to-text-with-reactjs-1hmc</link>
      <guid>https://dev.to/gracezzhang/speech-to-text-with-reactjs-1hmc</guid>
      <description>&lt;p&gt;August 19, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;Speech-to-text is a technology that converts spoken words into written text. Integrating speech-to-text (STT) technology into an application can bring significant benefits, such as enhancing user experience, accessibility, and overall functionality.&lt;/p&gt;

&lt;p&gt;In this article, we will walk you through the process of integrating speech-to-text into a React application using Picovoice's &lt;a href="https://picovoice.ai/platform/leopard/" rel="noopener noreferrer"&gt;Leopard Speech-to-Text&lt;/a&gt; engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Prerequisites&lt;/strong&gt;&lt;br&gt;
Sign up for a free &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; account. Once you've created an account, copy your AccessKey on the main dashboard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Create a React Project:&lt;/strong&gt;&lt;br&gt;
If you don't already have a React project, start by creating one with the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx create-react-app leopard-react
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Install Dependencies:&lt;/strong&gt;&lt;br&gt;
Install &lt;a href="https://www.npmjs.com/package/@picovoice/leopard-react" rel="noopener noreferrer"&gt;@picovoice/leopard-react&lt;/a&gt;  and &lt;a href="https://www.npmjs.com/package/@picovoice/web-voice-processor" rel="noopener noreferrer"&gt;@picovoice/web-voice-processor&lt;/a&gt; :&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install @picovoice/leopard-react @picovoice/web-voice-processor
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;4. Leopard Model&lt;/strong&gt;&lt;br&gt;
In order to initialize Leopard, you will need a model file. Download one of the default &lt;a href="https://github.com/Picovoice/leopard/tree/master/lib/common" rel="noopener noreferrer"&gt;model files&lt;/a&gt; for your desired language and place it in the /public directory of your project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create Components&lt;/strong&gt;&lt;br&gt;
Create a file within /src called VoiceWidget.js and paste the below into it. The code uses Leopard's hook to perform speech-to-text. Remember to replace ${ACCESS_KEY} with your AccessKey obtained from the Picovoice Console and ${MODEL_FILE_PATH} with the path to your model file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import React from "react";
import { useLeopard } from "@picovoice/leopard-react";

export default function VoiceWidget() {
  const {
    result,
    isLoaded,
    error,
    init,
    processFile,
    startRecording,
    stopRecording,
    isRecording,
  } = useLeopard();

  const initEngine = async () =&amp;gt; {
    await init(
      "${ACCESS_KEY}",
      { publicPath: "${MODEL_FILE_PATH}" },
      { enableAutomaticPunctuation: true }
    );
  };

  const toggleRecord = async () =&amp;gt; {
    if (isRecording) {
      await stopRecording();
    } else {
      await startRecording();
    }
  };

  return (
    &amp;lt;div&amp;gt;
      {error &amp;amp;&amp;amp; &amp;lt;p className="error-message"&amp;gt;{error.toString()}&amp;lt;/p&amp;gt;}
      &amp;lt;br /&amp;gt;
      &amp;lt;button onClick={initEngine} disabled={isLoaded}&amp;gt;Initialize Leopard&amp;lt;/button&amp;gt;
      &amp;lt;br /&amp;gt;
      &amp;lt;br /&amp;gt;
      &amp;lt;label htmlFor="audio-file"&amp;gt;Choose audio file to transcribe:&amp;lt;/label&amp;gt;
      &amp;lt;input
        id="audio-file"
        type="file"
        accept="audio/*"
        disabled={!isLoaded}
        onChange={async (e) =&amp;gt; {
          if (!!e.target.files?.length) {
            await processFile(e.target.files[0])
          }
        }}
      /&amp;gt;
      &amp;lt;br /&amp;gt;
      &amp;lt;label htmlFor="audio-record"&amp;gt;Record audio to transcribe:&amp;lt;/label&amp;gt;
      &amp;lt;button id="audio-record" disabled={!isLoaded} onClick={toggleRecord}&amp;gt;
        {isRecording ? "Stop Recording" : "Start Recording"}
      &amp;lt;/button&amp;gt; 
      &amp;lt;h3&amp;gt;Transcript:&amp;lt;/h3&amp;gt;
      &amp;lt;p&amp;gt;{result?.transcript}&amp;lt;/p&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Modify App.js to display the VoiceWidget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import VoiceWidget from "./VoiceWidget";

function App() {
  return (
    &amp;lt;div className="App"&amp;gt;
      &amp;lt;h1&amp;gt;
        Leopard React Demo
      &amp;lt;/h1&amp;gt;
      &amp;lt;VoiceWidget /&amp;gt;
    &amp;lt;/div&amp;gt;
  );
}

export default App;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Start the development server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm run start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When it's running, navigate to localhost:3000 and click the "Initialize Leopard" button. Once Leopard has initialized, upload an audio file or record audio to see the transcription.&lt;/p&gt;

&lt;p&gt;It takes less than 90 seconds to get it up and running!&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/CtnYBHT-IxY"&gt;
&lt;/iframe&gt;
&lt;br&gt;
&lt;strong&gt;Additional Languages&lt;/strong&gt;&lt;br&gt;
Leopard supports many more languages aside from English. To use models in other languages, refer to the &lt;a href="https://picovoice.ai/docs/quick-start/leopard-react/#non-english-languages" rel="noopener noreferrer"&gt;Leopard Speech-to-Text React quick start&lt;/a&gt; guide.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Code&lt;/strong&gt;&lt;br&gt;
The source code for the complete demo with Leopard React is available on its &lt;a href="https://github.com/Picovoice/leopard/tree/master/lib/common" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Record Audio from a Web Browser</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Mon, 19 Aug 2024 18:19:12 +0000</pubDate>
      <link>https://dev.to/gracezzhang/how-to-record-audio-from-a-web-browser-5bag</link>
      <guid>https://dev.to/gracezzhang/how-to-record-audio-from-a-web-browser-5bag</guid>
      <description>&lt;p&gt;August 19th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Recording audio from a web browser is more challenging than it might seem at first glance. While the browser's abstraction from the hardware it's running on has its benefits, it can make it difficult to communicate with certain peripherals - e.g. a user's connected microphone. Luckily for us modern developers, the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Web_Audio_API" rel="noopener noreferrer"&gt;Web Audio API&lt;/a&gt;  and the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/MediaStream" rel="noopener noreferrer"&gt;MediaStream API&lt;/a&gt;  came along over a decade ago and solved many of these problems.&lt;/p&gt;

&lt;p&gt;The Web Audio API is a powerful tool for manipulating audio in the browser. It allows developers to analyze, synthesize, and manipulate audio in real-time using some simple JavaScript. The MediaStream API allows developers to open streams of media content from many sources, including the microphone. In this article, we will look at how to use the Web Audio API and the MediaStream API to capture microphone audio in any modern web browser.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Setting up a basic HTML page&lt;/strong&gt;&lt;br&gt;
First, let's create a basic HTML page that we can use to control audio capture from the microphone. Create a new file called index.html and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!DOCTYPE html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
  &amp;lt;title&amp;gt;Microphone Capture Demo&amp;lt;/title&amp;gt;
&amp;lt;/head&amp;gt;
&amp;lt;body&amp;gt;
  &amp;lt;button id="start-button"&amp;gt;Start Capture&amp;lt;/button&amp;gt;
  &amp;lt;button id="stop-button"&amp;gt;Stop Capture&amp;lt;/button&amp;gt;
  &amp;lt;script src="main.js"&amp;gt;&amp;lt;/script&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Capturing audio from the microphone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now that we have our HTML page, let's create the main JavaScript file to capture microphone audio. Create a new file called main.js and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;const startButton = document.getElementById('start-button');
const stopButton = document.getElementById('stop-button');


let audioContext;
let micStreamAudioSourceNode;
let audioWorkletNode;


startButton.addEventListener('click', async () =&amp;gt; {
  // Check if the browser supports the required APIs
  if (!window.AudioContext || 
      !window.MediaStreamAudioSourceNode || 
      !window.AudioWorkletNode) {
    alert('Your browser does not support the required APIs');
    return;
  }


  // Request access to the user's microphone
  const micStream = await navigator
      .mediaDevices
      .getUserMedia({ audio: true });


  // Create an audio source node from the microphone stream
  audioContext = new AudioContext();
  micStreamAudioSourceNode = audioContext
      .createMediaStreamSource(micStream);


  // Create and connect AudioWorkletNode 
  // for processing the audio stream
  await audioContext
      .audioWorklet
      .addModule("my-audio-processor.js");
  audioWorkletNode = new AudioWorkletNode(
      audioContext,
      'my-audio-processor');
  micStreamAudioSourceNode.connect(audioWorkletNode);
});


stopButton.addEventListener('click', () =&amp;gt; {
  // Close audio stream
  micStreamAudioSourceNode.disconnect();
  audioContext.close();
});

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this code, we are able to capture microphone audio using the Web Audio API and the MediaStream API. When the user clicks the Start Capture button, we request access to the user's microphone and create an AudioContext. Once we know we have access to the microphone audio, we then create an audio processing graph using a MediaStreamAudioSourceNode to capture the audio and an AudioWorkletNode to process it.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Web Audio API and MediaStream API are supported on Google Chrome, Firefox, Safari, Microsoft Edge and Opera. A host of mobile web browsers are also supported.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Processing the captured audio data&lt;/strong&gt;&lt;br&gt;
Now that we have set up the basic infrastructure for capturing microphone audio, we can start processing the real-time audio data. To do this, we will need to define the behaviour of the AudioWorkletNode with an AudioWorkletProcessor implementation of our own.&lt;/p&gt;

&lt;p&gt;Create a new file called my-audio-processor.js and add the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class MyAudioProcessor extends AudioWorkletProcessor {
    process(inputs, outputs, parameters) {
        // Get the input audio data from the first channel
        const inputData = inputs[0][0];

        // Do something with the audio data
        // ...

        return true;
    }
}

registerProcessor('my-audio-processor', MyAudioProcessor);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the process function that we've defined, we can access the input audio data and perform various operations on it. For example, we can use the Web Audio API's AnalyserNode to analyze the frequency spectrum or buffer the audio to send to a speech recognition engine.&lt;/p&gt;

&lt;p&gt;With this final addition, we can now capture real-time microphone audio from the HTML page we created earlier.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capturing audio from the browser on Easy Mode&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, you might be thinking, "this approach seems complicated and limited (i.e. can't choose the sample rate of the incoming audio, audio processing on the main thread seems bad, etc.)", and you would be right. That's why we created the &lt;a href="https://picovoice.ai/docs/audio-recording-software/" rel="noopener noreferrer"&gt;Picovoice Audio Recorders&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At Picovoice, we ran into a multitude of challenges getting audio from the web browser for speech recognition. We require specific audio properties for our speech recognition engines, and - since our audio processing happens all in the browser - we want the processing to happen on a worker thread. We found ourselves building out a complex array of utility functions to help, which we eventually merged into an open-source library: Picovoice &lt;a href="https://picovoice.ai/docs/quick-start/voiceprocessor-web/" rel="noopener noreferrer"&gt;Web Voice Processor&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;With Web Voice Processor imported, our main.js file would look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import { WebVoiceProcessor } from '@picovoice/web-voice-processor';

const startButton = document.getElementById('start-button');
const stopButton = document.getElementById('stop-button');

const engine = {
    onmessage: function(e) {
        switch (e.data.command) {
            case 'process':
                const inputData = e.data.inputFrame;
                // do something with the audio
                break;
        }
    }
}

startButton.addEventListener('click', async () =&amp;gt; {
    // Once WebVoiceProcessor has at least one engine
    // subscribed, audio capture begins
    WebVoiceProcessor.subscribe(engine);
});

stopButton.addEventListener('click', () =&amp;gt; {
    // Once WebVoiceProcessor no longer has engines
    // subscribed, audio capture stops
    WebVoiceProcessor.unsubscribe(engine);
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In addition to simplifying the audio capture process, Web Voice Processor adds options for resampling the input audio, selecting the audio device to record with and running audio processing on a Worker Thread.&lt;/p&gt;

&lt;p&gt;It takes less than 90 seconds to start recording audio from a web browser:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/_rG87Tf_sWQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explore&lt;/strong&gt;&lt;br&gt;
The Web Voice Processor is open-source and available on GitHub. There is also a demo in the repository that explores more of the features of the library.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Python Wake Word Detection Tutorial — Picovoice</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Mon, 19 Aug 2024 18:12:39 +0000</pubDate>
      <link>https://dev.to/gracezzhang/python-wake-word-detection-tutorial-picovoice-5gm</link>
      <guid>https://dev.to/gracezzhang/python-wake-word-detection-tutorial-picovoice-5gm</guid>
      <description>&lt;p&gt;August 19th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;A Wake Word Engine is a tiny algorithm that detects utterances of a given Wake Phrase within a stream of audio. There are good articles that focus on how to build a Wake Word Model using TensorFlow or PyTorch. These are invaluable for educational purposes. But training a production-ready Wake Word Model requires significant effort for data curation and expertise to simulate real-world environments during training.&lt;/p&gt;

&lt;p&gt;Picovoice &lt;a href="https://picovoice.ai/platform/porcupine/" rel="noopener noreferrer"&gt;Porcupine Wake Word Engine&lt;/a&gt; uses Transfer Learning to eliminate the need for data collection per model. Porcupine enables you to train custom wake words instantly without requiring you to gather any data.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Porcupine Python SDK runs on Linux (x86_64), macOS (x86_64 / arm64), Windows (amd64), Raspberry Pi (Zero, 2, 3, 4), NVIDIA Jetson Nano, and BeagleBone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Below we learn how to use Porcupine Python SDK for Wake Word Detection and train production-ready Custom Wake Words within seconds using &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Porcupine can also run on modern Web browsers using its JavaScript SDK and on several Arm Cortex-M microcontrollers using its C SDK.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Install Porcupine Python SDK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install the SDK using PIP from a terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pvporcupine
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sign up for Picovoice Console&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sign up for &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; for free and copy your AccessKey, which handles authentication and authorization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Usage&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Porcupine SDK ships with a few built-in Wake Word Models such as Alexa, Hey Google, OK Google, Hey Siri, and Jarvis. Check the list of built-in models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in Keyword Models&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvporcupine

for keyword in pvporcupine.KEYWORDS:    
  print(keyword)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Initialization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When initializing Porcupine, you can use one of the built-in Wake Word Models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;porcupine = pvporcupine.create(        
  access_key=access_key,        
  keywords=[keyword_one, keyword_two])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or you can provide Custom Keyword Models (more on this below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;porcupine = pvporcupine.create(        
  access_key=access_key,        
  keyword_paths=[keyword_path_one, keyword_path_two])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;*&lt;em&gt;Processing&lt;br&gt;
*&lt;/em&gt;&lt;br&gt;
Porcupine takes in audio in chunks (frames). .frame_length property gives the size of each frame. Porcupine accepts 16 kHz audio with 16-bit samples. For each frame, Porcupine returns a number representing the detected keyword. -1 indicates no detection. Positive indices correspond to keyword detections.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;keyword_index = porcupine.process(audio_frame)
if keyword_index &amp;gt;= 0:
    # Logic to handle keyword detection events
    pass

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
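&lt;p&gt;As a concrete (if simplified) illustration of this framing requirement, the sketch below chunks a buffer of PCM samples into Porcupine-sized frames. The frame_length value of 512 and the placeholder samples are assumptions for the example; in practice, use porcupine.frame_length and real microphone audio:&lt;/p&gt;

```python
# Sketch: splitting a PCM buffer into fixed-size frames, as Porcupine expects.
# frame_length = 512 is an assumed value; use porcupine.frame_length in practice.
frame_length = 512
pcm = [0] * 1600  # placeholder for 16-bit samples read from a microphone or WAV

frames = [
    pcm[i:i + frame_length]
    for i in range(0, len(pcm) - frame_length + 1, frame_length)
]
print(len(frames))  # 3 full frames; the trailing 64 samples are too few for a frame
```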



&lt;p&gt;&lt;strong&gt;Cleanup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When done, be sure to release resources:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;porcupine.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create Custom Wake Words&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Often you want to use Custom Wake Word Models with your project. Branded Wake Word Models are essential for enterprise products. Otherwise, you are pushing Amazon, Google, and Apple's brand, not yours! You can create Custom Wake Word Models using Picovoice Console in seconds:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Log in to &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go to the Porcupine Page&lt;/li&gt;
&lt;li&gt;Select the target language (e.g. English, Japanese, Spanish, etc.)&lt;/li&gt;
&lt;li&gt;Select the platform you want to optimize the model for (e.g. Raspberry Pi, macOS, Windows, etc.)&lt;/li&gt;
&lt;li&gt;Type in the wake phrase. A good wake phrase should have a few &lt;a href="https://picovoice.ai/docs/tips/choosing-a-wake-word/" rel="noopener noreferrer"&gt;linguistic properties&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Click the train button. Your model will be ready momentarily (a file with the .ppn suffix). You can download this file for on-device inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Working Example&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;All that is left is to wire up the audio recording. Then we have an end-to-end wake word solution. Install &lt;a href="https://picovoice.ai/blog/how-to-record-audio-using-python/" rel="noopener noreferrer"&gt;PvRecorder Python SDK&lt;/a&gt; using PIP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pvrecorder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The following code snippet records audio from the default microphone on the device and processes recorded audio using Porcupine to detect the utterances of selected keywords. Altogether, we need less than 20 lines of code!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvporcupine
from pvrecorder import PvRecorder

porcupine = pvporcupine.create(access_key=access_key, keywords=keywords)
recorder = PvRecorder(device_index=-1, frame_length=porcupine.frame_length)

try:
    recorder.start()
    while True:
        keyword_index = porcupine.process(recorder.read())
        if keyword_index &amp;gt;= 0:
            print(f"Detected {keywords[keyword_index]}")
except KeyboardInterrupt:
    recorder.stop()
finally:
    porcupine.delete()
    recorder.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/bynDnS7wOUM?start=1"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Text-to-Speech in Python: On-Device Solutions</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Fri, 16 Aug 2024 21:09:33 +0000</pubDate>
      <link>https://dev.to/gracezzhang/text-to-speech-in-python-on-device-solutions-o62</link>
      <guid>https://dev.to/gracezzhang/text-to-speech-in-python-on-device-solutions-o62</guid>
      <description>&lt;p&gt;August 16th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;Text-to-Speech (TTS) technology, also known as Speech Synthesis, converts text into human-like speech. The rise of deep learning has led to major advancements in TTS quality and naturalness, but at the cost of increased computational requirements. Most big tech companies offer cloud-based TTS APIs, like Google Text-to-Speech, Amazon Polly, or Microsoft Text-to-Speech, and new companies with similar offerings have emerged, such as ElevenLabs, or Coqui Studio. While convenient, these services require an internet connection, raise privacy concerns, and are prone to network outages. On-device solutions allow for more flexibility and privacy by synthesizing speech directly on the user's device. However, few options exist for on-device TTS. This article explores three open-source Python libraries and &lt;a href="https://picovoice.ai/docs/api/orca-python/" rel="noopener noreferrer"&gt;Picovoice Orca Text-to-Speech&lt;/a&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚀 Best-in-class Voice AI!&lt;br&gt;
Build compliant and low-latency AI apps using Python without sending user data to 3rd party servers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;PyTTSx3&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/pyttsx3/" rel="noopener noreferrer"&gt;PyTTSx3&lt;/a&gt; is a Python library that utilizes the popular eSpeak speech synthesis engine on Linux (NSSpeechSynthesizer is used on MacOS and SAPI5 on Windows). Getting started is straightforward:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install pyTTSx3:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pyttsx3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Save synthesized speech to a file in Python:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pyttsx3

engine = pyttsx3.init()
engine.save_to_file(text='Hello World', filename='PATH/TO/OUTPUT.wav')
engine.runAndWait()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While simple to use, eSpeak's voice quality is robotic compared to more modern TTS systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Coqui TTS&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://github.com/coqui-ai/TTS" rel="noopener noreferrer"&gt;Coqui TTS&lt;/a&gt; is the open-source repository of Coqui Studio. Developers can leverage Coqui's pretrained models or train custom voices. To synthesize speech, follow the steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Coqui TTS:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install TTS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;List available models in Python:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from TTS.api import TTS

TTS().list_models()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol&gt;
&lt;li&gt;Choose a model name and save synthesized speech to a file:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tts = TTS("CHOSEN/MODEL/NAME")
tts.tts_to_file(text="Hello World", output_path="PATH/TO/OUTPUT.wav")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Coqui offers high-quality voices with natural prosody, at the cost of larger model sizes and longer processing times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mimic3 from Mycroft&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mycroft is a free and open-source virtual assistant that offers a TTS system called &lt;a href="https://github.com/MycroftAI/mimic3/" rel="noopener noreferrer"&gt;Mimic3&lt;/a&gt;. This framework currently lacks a pure Python API, so we will use Python's subprocess:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Mycroft:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install mycroft-mimic3-tts

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Synthesize speech and save file to directory OUTPUT/DIR:
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import subprocess 

args = [    
  "mimic3",    
  "\"Hello World\"",    
  "--output-dir", "OUTPUT/DIR"]

try:    
  subprocess.check_call(args)

except subprocess.CalledProcessError as e:  
  # Handle error    
  pass

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
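&lt;p&gt;The subprocess call can be wrapped in a small helper that fails early when the &lt;code&gt;mimic3&lt;/code&gt; executable is not on the PATH. A minimal sketch; &lt;code&gt;build_mimic3_args&lt;/code&gt; and &lt;code&gt;synthesize&lt;/code&gt; are our names, not part of Mimic3:&lt;/p&gt;

```python
import shutil
import subprocess

def build_mimic3_args(text, output_dir):
    # subprocess passes each argument verbatim, so the text needs no shell quoting
    return ["mimic3", text, "--output-dir", output_dir]

def synthesize(text, output_dir):
    # Fail early with a clear message if the mimic3 CLI is not installed
    if shutil.which("mimic3") is None:
        raise RuntimeError("mimic3 executable not found on PATH")
    subprocess.check_call(build_mimic3_args(text, output_dir))
```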


&lt;p&gt;For prototyping on-device TTS, Mimic3 from Mycroft provides a balance of quality and performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Orca Text-to-Speech&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://picovoice.ai/platform/orca/" rel="noopener noreferrer"&gt;Picovoice Orca Text-to-Speech&lt;/a&gt; leverages state-of-the-art Text-to-Speech (TTS) models to provide high-quality voices, while still being small and efficient.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install &lt;a href="https://picovoice.ai/docs/quick-start/orca-python/" rel="noopener noreferrer"&gt;Orca Text-to-Speech Python SDK&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install pvorca
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;Import Orca and create an Orca instance.
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pvorca 
orca = pvorca.create(access_key="${ACCESS_KEY}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Sign up or log in to &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; to copy your AccessKey, then replace ${ACCESS_KEY} with it.&lt;/p&gt;

&lt;ol start="3"&gt;
&lt;li&gt;Synthesize your desired text with
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;orca.synthesize(text="${TEXT}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For more information refer to the &lt;a href="https://picovoice.ai/docs/api/orca-python/" rel="noopener noreferrer"&gt;Orca Text-to-Speech Python SDK&lt;/a&gt; Documentation.&lt;/p&gt;
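&lt;p&gt;The synthesized audio can be written to disk with only the standard library. Below is a sketch for saving 16-bit mono PCM to a WAV file; it assumes &lt;code&gt;orca.synthesize&lt;/code&gt; returns a list of 16-bit samples and that &lt;code&gt;orca.sample_rate&lt;/code&gt; gives the output rate (check the linked documentation for the exact return type):&lt;/p&gt;

```python
import struct
import wave

def save_pcm_to_wav(pcm, sample_rate, path):
    """Write a list of 16-bit PCM samples to a mono WAV file."""
    with wave.open(path, "wb") as f:
        f.setnchannels(1)           # mono
        f.setsampwidth(2)           # 16-bit samples
        f.setframerate(sample_rate)
        f.writeframes(struct.pack("%dh" % len(pcm), *pcm))

# Hypothetical usage with Orca:
#   pcm = orca.synthesize(text="Hello World")
#   save_pcm_to_wav(pcm, orca.sample_rate, "hello.wav")
```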

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;On-device TTS removes privacy concerns and internet requirements, and minimizes latency. With Python solutions like pyttsx3, Coqui TTS, and Mimic3, developers have several options for synthesizing speech directly on devices based on their needs. However, each solution comes with drawbacks such as poor voice quality, large resource requirements, or the lack of a flexible API. Another alternative is &lt;a href="https://picovoice.ai/platform/orca/" rel="noopener noreferrer"&gt;Orca Text-to-Speech&lt;/a&gt;, which combines state-of-the-art neural TTS with efficiency, allowing developers to synthesize high-quality speech even on a Raspberry Pi.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/uUk1PZJfIns"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>End-to-End Voice Recognition with Python</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Fri, 16 Aug 2024 16:30:34 +0000</pubDate>
      <link>https://dev.to/gracezzhang/end-to-end-voice-recognition-with-python-1b1</link>
      <guid>https://dev.to/gracezzhang/end-to-end-voice-recognition-with-python-1b1</guid>
      <description>&lt;p&gt;August 16th, 2024 · 2 min read&lt;/p&gt;

&lt;p&gt;There are several approaches for adding speech recognition capabilities to a Python application. In this article, I’d like to introduce a new paradigm for adding purpose-made &amp;amp; context-aware voice assistants into Python apps using the &lt;a href="https://picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Picovoice enables developers to create voice experiences similar to Alexa and Google for existing Python apps. Different from cloud-based alternatives, Picovoice is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Private and secure — no voice data leaves the app&lt;/li&gt;
&lt;li&gt;Accurate — focuses on the domain of interest&lt;/li&gt;
&lt;li&gt;Cross-platform — Linux, macOS, Windows, Raspberry Pi, …&lt;/li&gt;
&lt;li&gt;Reliable and zero-latency — eliminates unpredictable network delays&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In what follows, I’ll introduce Picovoice by building a voice-enabled alarm clock using &lt;a href="https://picovoice.ai/docs/picovoice/" rel="noopener noreferrer"&gt;Picovoice SDK&lt;/a&gt;, &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;, and &lt;a href="https://docs.python.org/3/library/tkinter.html" rel="noopener noreferrer"&gt;Tkinter&lt;/a&gt; GUI framework. The code is open-source and available on Picovoice’s GitHub &lt;a href="https://github.com/Picovoice/picovoice/tree/master/demo/tkinter" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1 — Install Picovoice&lt;/strong&gt;&lt;br&gt;
Install Picovoice from a terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install picovoice
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2 — Create an Instance of Picovoice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Picovoice is an end-to-end voice recognition platform with wake word detection and intent inference capabilities. Picovoice uses the &lt;a href="https://picovoice.ai/platform/porcupine/" rel="noopener noreferrer"&gt;Porcupine Wake Word&lt;/a&gt; engine for voice activation and the &lt;a href="https://picovoice.ai/platform/rhino/" rel="noopener noreferrer"&gt;Rhino Speech-to-Intent&lt;/a&gt; engine for inferring intent from follow-on voice commands. For example, when a user says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Picovoice, set an alarm for 2 hours and 31 seconds.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Porcupine detects the utterance of the Picovoice wake word. Then Rhino infers the user’s intent from the follow-on command and provides a structured inference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  is_understood: true,
  intent: setAlarm,
  slots: {
    hours: 2,
    seconds: 31
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create an instance of Picovoice by providing paths to Porcupine and Rhino models and callbacks for wake word detection and inference completion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from picovoice import Picovoice

keyword_path = ...  # path to Porcupine wake word file (.PPN)

def wake_word_callback():
  pass

context_path = ...  # path to Rhino context file (.RHN)

def inference_callback(inference):
  print(inference.is_understood)
  if inference.is_understood:
    print(inference.intent)
    for k, v in inference.slots.items():
      print(f"{k} : {v}")

pv = Picovoice(
  access_key="${YOUR_ACCESS_KEY}",
  keyword_path=keyword_path,
  wake_word_callback=wake_word_callback,
  context_path=context_path,
  inference_callback=inference_callback)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
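&lt;p&gt;Inside inference_callback, the slot values can be turned into a countdown duration for the alarm. A minimal sketch; the slot names follow the Alarm context above, and the int() conversion is an assumption that also covers SDKs that deliver slot values as strings:&lt;/p&gt;

```python
def slots_to_seconds(slots):
    """Convert Alarm-context slots (hours/minutes/seconds) to total seconds.

    int() handles values delivered either as numbers or as strings.
    """
    units = {"hours": 3600, "minutes": 60, "seconds": 1}
    return sum(int(value) * units[name] for name, value in slots.items() if name in units)

# "Picovoice, set an alarm for 2 hours and 31 seconds."
# → slots_to_seconds({"hours": "2", "seconds": "31"}) == 7231
```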



&lt;p&gt;Several pre-trained Porcupine and Rhino models are available on their GitHub repositories [1][2]. For this demo, we use the pre-trained Picovoice Porcupine model and the pre-trained Alarm Rhino model. Developers can also create custom models using &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3 — Get your Free AccessKey&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Sign up for &lt;a href="https://console.picovoice.ai/" rel="noopener noreferrer"&gt;Picovoice Console&lt;/a&gt; to get your AccessKey. It is free. AccessKey is used for authentication and authorization when using Picovoice SDK.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4 — Process Audio with Picovoice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the engine is instantiated it can process a stream of audio. Simply pass frames of audio to the engine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pv.process(audio_frame)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5 — Read audio from the Microphone&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install &lt;a href="https://pypi.org/project/pvrecorder/" rel="noopener noreferrer"&gt;pvrecorder&lt;/a&gt;. Then, read the audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pvrecoder import PvRecoder
# `-1` is the default input audio device.
recorder = PvRecoder(device_index=-1)
recorder.start()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Read frames of audio from the recorder and pass them to Picovoice’s .process method:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pcm = recorder.read()
pv.process(pcm)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
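&lt;p&gt;Steps 4 and 5 combine into a simple pump loop. A minimal, hardware-free sketch, where read_frame and process stand in for recorder.read and pv.process (the function and parameter names are ours):&lt;/p&gt;

```python
def audio_loop(read_frame, process, should_stop):
    """Keep feeding microphone frames into the recognizer until told to stop.

    read_frame() returns one frame of audio samples; process(frame) hands it
    to the recognition engine; should_stop() lets the caller end the loop.
    Returns the number of frames processed.
    """
    frames = 0
    while not should_stop():
        process(read_frame())
        frames += 1
    return frames
```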



&lt;p&gt;&lt;strong&gt;6 — Create a Cross-Platform GUI using Tkinter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Tkinter is the standard GUI framework shipped with Python. Create a frame (window), add a label showing the remaining time to it, and launch the app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;window = tk.Tk()
time_label = tk.Label(window, text='00 : 00 : 00')
time_label.pack()

window.protocol('WM_DELETE_WINDOW', on_close)

window.mainloop()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;7 — Putting it Together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There are about 200 lines of code for GUI, audio recording, and voice recognition. I also created a separate thread for audio processing to avoid blocking the main GUI thread.&lt;/p&gt;
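&lt;p&gt;The separate audio thread can be sketched with Python's standard threading module. Here read_frame and process again stand in for recorder.read and pv.process, and stop_event would be set by the window's close handler; all names are illustrative:&lt;/p&gt;

```python
import threading

def start_audio_thread(read_frame, process, stop_event):
    """Run the read/process loop off the main (GUI) thread.

    Keeps Tkinter's mainloop responsive while audio is being processed.
    stop_event is a threading.Event set by the GUI's close handler.
    """
    def worker():
        while not stop_event.is_set():
            process(read_frame())

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread
```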

&lt;p&gt;If you have technical questions or suggestions please open a GitHub issue on Picovoice’s GitHub &lt;a href="https://github.com/Picovoice/picovoice" rel="noopener noreferrer"&gt;repository&lt;/a&gt;. If you wish to modify or improve this demo, feel free to submit a pull request.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/sIfSuOjnmVU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How to Record Audio using Python — Picovoice</title>
      <dc:creator>grace </dc:creator>
      <pubDate>Fri, 16 Aug 2024 15:43:04 +0000</pubDate>
      <link>https://dev.to/gracezzhang/how-to-record-audio-using-python-picovoice-ff8</link>
      <guid>https://dev.to/gracezzhang/how-to-record-audio-using-python-picovoice-ff8</guid>
<description>&lt;p&gt;August 16th, 2024 · 1 min read&lt;/p&gt;

&lt;p&gt;Recording audio from a microphone using Python is tricky! Why? Because Python doesn't provide a standard library for it. Existing third-party libraries (e.g. PyAudio) are not cross-platform and have external dependencies. We learned this the hard way as we needed microphone recording functionality in our voice recognition demos and created &lt;a href="https://picovoice.ai/docs/audio-recording-software/" rel="noopener noreferrer"&gt;Picovoice Audio Recorders.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hence we created &lt;a href="https://picovoice.ai/docs/quick-start/pvrecorder-python/" rel="noopener noreferrer"&gt;PvRecorder&lt;/a&gt;, a cross-platform library that supports Linux, macOS, Windows, Raspberry Pi, NVIDIA Jetson, and BeagleBone. PvRecorder has SDKs for Python, Node.js, .NET, Go, and Rust.&lt;/p&gt;

&lt;p&gt;Below we learn how to record audio in Python using PvRecorder. The Python SDK captures audio suitable for speech recognition, meaning the audio captured is already 16 kHz and 16-bit. PvRecorder Python SDK runs on Linux, macOS, Windows, Raspberry Pi, NVIDIA Jetson, and BeagleBone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Install PvRecorder using PIP:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip3 install pvrecorder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Find Available Microphones&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A computer can have multiple microphones. For example, a laptop has a built-in microphone and might have a headset attached to it. The first step is to find the microphone we want to record.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pvrecorder import PvRecorder

for index, device in enumerate(PvRecorder.get_audio_devices()):    
  print(f"[{index}] {device}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running the above on a Dell XPS laptop gives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[0] Monitor of sof-hda-dsp HDMI3/DP3 Output
[1] Monitor of sof-hda-dsp HDMI2/DP2 Output
[2] Monitor of sof-hda-dsp HDMI1/DP1 Output
[3] Monitor of sof-hda-dsp Speaker + Headphones
[4] sof-hda-dsp Headset Mono Microphone + Headphones Stereo Microphone
[5] sof-hda-dsp Digital Microphone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Take note of the index of your target microphone. We pass this to the constructor of PvRecorder. When unsure, pass -1 to the constructor to use the default microphone.&lt;/p&gt;
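&lt;p&gt;To avoid hard-coding an index, a device can also be selected by a fragment of its name. A small sketch; the helper name is ours, and the device list would come from PvRecorder.get_audio_devices():&lt;/p&gt;

```python
def find_device_index(devices, name_fragment):
    """Return the index of the first device whose name contains name_fragment
    (case-insensitive), or -1 (the default device) when there is no match."""
    for index, device in enumerate(devices):
        if name_fragment.lower() in device.lower():
            return index
    return -1

# Hypothetical usage:
#   index = find_device_index(PvRecorder.get_audio_devices(), "Digital Microphone")
```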

&lt;p&gt;&lt;strong&gt;Record Audio&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, create an instance of PvRecorder. You need to provide a device_index (see above) and a frame_length. frame_length is the number of audio samples you wish to receive at each read. We set it to 512 (32 milliseconds of 16 kHz audio). Then call .start() to start recording. Once recording, keep calling .read() in a loop to receive audio. Invoke .stop() to stop recording and then .delete() to release resources when done.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder = PvRecorder(device_index=-1, frame_length=512)

try:    recorder.start()

  while True:        
    frame = recorder.read()        
    # Do something ...

except KeyboardInterrupt:    
  recorder.stop()
finally:    
  recorder.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Save Audio to File&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can do whatever you wish with the audio captured in the snippet above: detect wake words, recognize voice commands, transcribe speech to text, index audio for search, or save it to a file. The code snippet below shows how to save the audio in WAVE format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recorder = PvRecorder(device_index=-1, frame_length=512)
audio = []

try:    
  recorder.start()

  while True:        
    frame = recorder.read()        
    audio.extend(frame)
except KeyboardInterrupt:    
  recorder.stop()    
  with wave.open(path, 'w') as f:        
      f.setparams((1, 2, 16000, 512, "NONE", "NONE"))        
      f.writeframes(struct.pack("h" * len(audio), *audio))
finally:    
  recorder.delete()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/2kM_AoVktMc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
