<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Karan Shingde</title>
    <description>The latest articles on DEV Community by Karan Shingde (@karan842).</description>
    <link>https://dev.to/karan842</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1138376%2F8653d447-78a1-4df5-823d-37062336ba03.jpg</url>
      <title>DEV Community: Karan Shingde</title>
      <link>https://dev.to/karan842</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/karan842"/>
    <language>en</language>
    <item>
      <title>Build an Audio-Driven Speaker Recognition System Using Open-Source Technologies — Resemblyzer and QdrantDB.</title>
      <dc:creator>Karan Shingde</dc:creator>
      <pubDate>Thu, 18 Jan 2024 16:58:40 +0000</pubDate>
      <link>https://dev.to/karan842/build-an-audio-driven-speaker-recognition-system-using-open-source-technologies-resemblyzer-and-qdrantdb-4ono</link>
      <guid>https://dev.to/karan842/build-an-audio-driven-speaker-recognition-system-using-open-source-technologies-resemblyzer-and-qdrantdb-4ono</guid>
      <description>&lt;h2&gt;
  
  
  Introduction:
&lt;/h2&gt;

&lt;p&gt;In this article, we are going to explore how to match the voice of a speaker against an existing set of voices. You can think of it as a biometric system that uses the human voice instead of physical traits such as fingerprints or iris patterns. To achieve this, we will use vector embeddings and open-source technologies.&lt;/p&gt;

&lt;p&gt;This type of technology is used in Google Assistant and Siri. When you set up a new device, such as an Android phone, Google asks for voice samples so it can capture your vocal patterns for security reasons. This ensures that only you can activate Google Assistant by saying “Ok, Google”.&lt;/p&gt;

&lt;p&gt;Before we get into the details, let’s first understand what vector embeddings are and how they are used for audio.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vector Embeddings for Audio
&lt;/h2&gt;

&lt;p&gt;Vector embeddings are a way to represent objects, such as words, sentences or, in our case, audio data, as vectors in a mathematical space. Audio data can be represented as vectors, where different aspects of the audio (features like frequency, amplitude, etc.) are mapped to specific positions in the vector.&lt;/p&gt;

&lt;p&gt;In the context of audio data, machine learning models can be trained to learn these embeddings. The model analyzes the patterns and characteristics of the audio data to generate meaningful vector representations. Once the model is trained, it can encode audio data by transforming it into a vector representation. This vector now captures important information about the audio’s content and characteristics.&lt;/p&gt;

&lt;p&gt;Similar audio content will have vectors that are close together in the embedding space. This allows for tasks like audio similarity comparison, where you can quickly identify how similar two audio clips are by measuring the distance between their respective embeddings. To generate vector embeddings, we will use an open-source tool called Resemblyzer and store those vectors in Qdrant DB.&lt;/p&gt;
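
&lt;p&gt;As a quick illustration of measuring that distance, here is a minimal cosine-similarity sketch in NumPy. The toy vectors are made up for demonstration and are not real audio embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a, b):
    # 1.0 means the vectors point in the same direction; values near 0 mean unrelated
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = np.array([0.1, 0.9, 0.0])
v2 = np.array([0.2, 0.8, 0.1])
print(round(cosine_similarity(v1, v2), 3))  # ~0.984: the two toy vectors are very similar
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;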

&lt;blockquote&gt;
&lt;p&gt;We have a set of audio clips of some famous personalities: Cristiano Ronaldo, Donald Trump, and Homer Simpson (yes, he is famous).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Resemblyzer: An Overview
&lt;/h2&gt;

&lt;p&gt;Resemblyzer lets us derive a high-level representation of a voice through a pretrained deep learning model. It simplifies the life of developers by letting them convert audio clips into vectors with just a few lines of code, without having to build or train a neural network themselves. See the &lt;a href="https://github.com/resemble-ai/Resemblyzer"&gt;official GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Install Resemblyzer for Python (3.5+)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;resemblyzer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;I’m using Google Colab with a free T4 GPU for this task. You can also use CPU, but it may take a long time. &lt;a href="https://drive.google.com/drive/folders/1803t9CxJDiwtNvPif16S4chnLxkYZkEY?usp=sharing"&gt;Click here to get audio data&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# import necessary libraries
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;resemblyzer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;preprocess_wav&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VoiceEncoder&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;  &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt; 
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt; 
&lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="n"&gt;Ipython&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;display&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Audio&lt;/span&gt; 
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;groupby&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;heapq&lt;/span&gt;

&lt;span class="c1"&gt;# run sample audio
&lt;/span&gt;&lt;span class="n"&gt;audio_sample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Audio&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;‘&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;folder&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Trump&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mp3&lt;/span&gt;&lt;span class="err"&gt;’&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;autoplay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_sample&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;a href="http://trump.mp/"&gt;PLAY: Trump.mp3&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Ronaldo'&lt;/span&gt;, &lt;span class="s1"&gt;'Ronaldo2'&lt;/span&gt;, &lt;span class="s1"&gt;'Homer'&lt;/span&gt;, &lt;span class="s1"&gt;'Homer2'&lt;/span&gt;, &lt;span class="s1"&gt;'Trump'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
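
&lt;p&gt;The output above is the list of speaker names. The article does not show how the file paths and names were collected, so here is one possible sketch, assuming the clips live in the &lt;em&gt;train&lt;/em&gt; folder used earlier and each file is named after its speaker (the folder name is an assumption based on the sample paths in this article).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from pathlib import Path

# gather every mp3 in the training folder (order depends on the filesystem)
wav_fpaths = list(Path('path-to-audio-folder', 'train').glob('*.mp3'))
# speaker name = file name without the extension, e.g. 'Trump.mp3' becomes 'Trump'
speakers = [fpath.stem for fpath in wav_fpaths]
print(speakers)  # e.g. ['Ronaldo', 'Ronaldo2', 'Homer', 'Homer2', 'Trump'] when the folder exists
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;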



&lt;p&gt;Now for the important part: we will use Resemblyzer to convert the audio clips into vector embeddings with just a few lines of code. First, preprocess the waveforms for all audio clips.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wavs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;preprocess_wav&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wav_fpaths&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Preprocessing&lt;/span&gt; &lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wav_fpaths&lt;/span&gt;&lt;span class="p"&gt;)))),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;object&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;speaker_wavs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;indiices&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;speaker&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indices&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wavs&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;speakers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])}&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;speaker_wavs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s1"&gt;'Ronaldo'&lt;/span&gt;: array&lt;span class="o"&gt;([&lt;/span&gt;array&lt;span class="o"&gt;([&lt;/span&gt; 0.00045622, &lt;span class="nt"&gt;-0&lt;/span&gt;.00088888,  0.00016845, ..., &lt;span class="nt"&gt;-0&lt;/span&gt;.00079568,
               &lt;span class="nt"&gt;-0&lt;/span&gt;.00718354, &lt;span class="nt"&gt;-0&lt;/span&gt;.01011641], &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float32&lt;span class="o"&gt;)&lt;/span&gt;               &lt;span class="o"&gt;]&lt;/span&gt;,
       &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;object&lt;span class="o"&gt;)&lt;/span&gt;,
 &lt;span class="s1"&gt;'Ronaldo2'&lt;/span&gt;: array&lt;span class="o"&gt;([&lt;/span&gt;array&lt;span class="o"&gt;([&lt;/span&gt; 0.0025312 ,  0.00321749,  0.00460094, ..., &lt;span class="nt"&gt;-0&lt;/span&gt;.01093079,
               &lt;span class="nt"&gt;-0&lt;/span&gt;.01293177, &lt;span class="nt"&gt;-0&lt;/span&gt;.01618683], &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float32&lt;span class="o"&gt;)&lt;/span&gt;               &lt;span class="o"&gt;]&lt;/span&gt;,
       &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;object&lt;span class="o"&gt;)&lt;/span&gt;,
 &lt;span class="s1"&gt;'Homer'&lt;/span&gt;: array&lt;span class="o"&gt;([&lt;/span&gt;array&lt;span class="o"&gt;([&lt;/span&gt;0., 0., 0., ..., 0., 0., 0.], &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float32&lt;span class="o"&gt;)]&lt;/span&gt;, &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;object&lt;span class="o"&gt;)&lt;/span&gt;,
 &lt;span class="s1"&gt;'Homer2'&lt;/span&gt;: array&lt;span class="o"&gt;([&lt;/span&gt;array&lt;span class="o"&gt;([&lt;/span&gt; 1.33051715e-14,  3.98843861e-14, &lt;span class="nt"&gt;-3&lt;/span&gt;.70518893e-15, ...,
                5.39025990e-04, &lt;span class="nt"&gt;-5&lt;/span&gt;.10490616e-04, &lt;span class="nt"&gt;-5&lt;/span&gt;.79551968e-04], &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float32&lt;span class="o"&gt;)]&lt;/span&gt;,
       &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;object&lt;span class="o"&gt;)&lt;/span&gt;,
 &lt;span class="s1"&gt;'Trump'&lt;/span&gt;: array&lt;span class="o"&gt;([&lt;/span&gt;array&lt;span class="o"&gt;([&lt;/span&gt;&lt;span class="nt"&gt;-0&lt;/span&gt;.0165875 ,  0.03297266, &lt;span class="nt"&gt;-0&lt;/span&gt;.01565401, ..., &lt;span class="nt"&gt;-0&lt;/span&gt;.03698713,
               &lt;span class="nt"&gt;-0&lt;/span&gt;.03372933, &lt;span class="nt"&gt;-0&lt;/span&gt;.02938525], &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;float32&lt;span class="o"&gt;)&lt;/span&gt;               &lt;span class="o"&gt;]&lt;/span&gt;,
       &lt;span class="nv"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;object&lt;span class="o"&gt;)}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the above code, we converted these sound waves into numerical representations with a few lines of code and without using any neural network.&lt;/p&gt;

&lt;p&gt;Now, convert these numerical representations into embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# compute the embeddings
&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VoiceEncoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;utterance_embeds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;embed_utterance&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;utterance_embeds&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[[&lt;/span&gt;0.         0.         0.0173962  ... 0.         0.04333723 0.00142971]
 &lt;span class="o"&gt;[&lt;/span&gt;0.         0.00967959 0.00503905 ... 0.04058945 0.09630667 0.0495304 &lt;span class="o"&gt;]&lt;/span&gt;
 &lt;span class="o"&gt;[&lt;/span&gt;0.15830468 0.         0.01373593 ... 0.         0.         0.        &lt;span class="o"&gt;]&lt;/span&gt;
 &lt;span class="o"&gt;[&lt;/span&gt;0.18647183 0.         0.11558624 ... 0.         0.         0.        &lt;span class="o"&gt;]&lt;/span&gt;
 &lt;span class="o"&gt;[&lt;/span&gt;0.         0.11265804 0.         ... 0.         0.         0.14819394]]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a matrix of floats in which each row represents one audio clip.&lt;/p&gt;

&lt;p&gt;For this task we are using Qdrant DB as our primary vector database, so we need to convert this representation into a suitable format: a list of dictionaries, where each dictionary has an &lt;em&gt;id&lt;/em&gt; key and a &lt;em&gt;vector&lt;/em&gt; key. The id will be an incrementing integer.&lt;/p&gt;

&lt;p&gt;To retrieve similar vectors later, each vector must be assigned a unique id.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create an empty list to hold the embeddings in the desired format
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="c1"&gt;# Iterate through each embedding in the array
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;utterance_embeds&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
   &lt;span class="c1"&gt;# Create a dictionary with “id” and “vector” keys
&lt;/span&gt;   &lt;span class="n"&gt;embedding_dict&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt; &lt;span class="c1"&gt;# Start IDs from 1
&lt;/span&gt;   &lt;span class="c1"&gt;# Append the dictionary to the embeddings list
&lt;/span&gt;   &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  QdrantDB: An Overview
&lt;/h2&gt;

&lt;p&gt;Qdrant DB is one of the most popular vector databases out there. Using Qdrant DB, developers can store embeddings and retrieve them seamlessly. Here is the &lt;a href="https://qdrant.tech/documentation/"&gt;official documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To start with Qdrant DB, &lt;a href="https://qdrant.to/cloud"&gt;sign up&lt;/a&gt; for their cloud service, whose free tier allows up to 1GB per cluster. Get your API key and store it somewhere safe; you cannot view it again after it is created.&lt;/p&gt;

&lt;p&gt;For Python, Qdrant DB provides its own client library, qdrant_client, which is very easy to use and requires only a few lines of code. Let’s set up Qdrant DB.&lt;/p&gt;

&lt;p&gt;Install &lt;em&gt;&lt;strong&gt;qdrant_client&lt;/strong&gt;&lt;/em&gt; via pip:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;qdrant_client
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt;
&lt;span class="n"&gt;qdrant_uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;paste-your-db-uri&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;# Paste your URI
&lt;/span&gt;&lt;span class="n"&gt;qdrant_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;paste-your-api-key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="c1"&gt;# Paste your API KEY
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s create a collection in the database; a collection here is analogous to a collection in MongoDB.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create a collection
&lt;/span&gt;&lt;span class="n"&gt;vectors_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;VectorParams&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# requires for embeddings from resemblyzer
&lt;/span&gt;  &lt;span class="n"&gt;distance&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;qdrant_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Distance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;COSINE&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, after initializing QdrantDB, we will upsert (or add) embeddings from Resemblyzer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Upsert embeddings
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;upsert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;my-collection&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Up to this point, we have stored encoded versions of our audio samples in Qdrant DB. Now we will test the system with a new voice whose speaker already has a record in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speaker Recognition:
&lt;/h2&gt;

&lt;p&gt;Recognizing a speaker from a new voice clip comes down to finding the similarity between the new voice and the set of voices already stored. For example, take a new clip of Cristiano Ronaldo’s voice and check whether it is recognized. We already have Ronaldo’s voice in the database.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I’m taking the iconic short speech by Ronaldo, which he gave after &lt;br&gt;
 winning the UCL:&lt;br&gt;
“&lt;strong&gt;Muchas gracias afición esto para vosotros. Siuuuuuuuuu!”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="http://siuu.mp/"&gt;PLAY: Siuu.mp3&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Convert the new voice into embeddings&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;test_wav&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;preprocess_wav&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;drive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;MyDrive&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;audio_data_colab&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;Siuu&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mp3&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Create a voice encoder object
&lt;/span&gt;&lt;span class="n"&gt;test_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed_utterance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_wav&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Search related embeddings
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;my&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;[&lt;/span&gt;ScoredPoint&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2, &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.6956655, &lt;span class="nv"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;={}&lt;/span&gt;, &lt;span class="nv"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None&lt;span class="o"&gt;)&lt;/span&gt;,
 ScoredPoint&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1, &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.6705738, &lt;span class="nv"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;={}&lt;/span&gt;, &lt;span class="nv"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None&lt;span class="o"&gt;)&lt;/span&gt;,
 ScoredPoint&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5, &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.56731033, &lt;span class="nv"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;={}&lt;/span&gt;, &lt;span class="nv"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None&lt;span class="o"&gt;)&lt;/span&gt;,
 ScoredPoint&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;3, &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.535391, &lt;span class="nv"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;={}&lt;/span&gt;, &lt;span class="nv"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None&lt;span class="o"&gt;)&lt;/span&gt;,
 ScoredPoint&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4, &lt;span class="nv"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0, &lt;span class="nv"&gt;score&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0.42906034, &lt;span class="nv"&gt;payload&lt;/span&gt;&lt;span class="o"&gt;={}&lt;/span&gt;, &lt;span class="nv"&gt;vector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None, &lt;span class="nv"&gt;shard_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;None&lt;span class="o"&gt;)]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the above results, you can see that ids 1 and 2, which are associated with the Ronaldo clips (we assigned these ids in the embedding code), score highest. The top score is about 70%, which is reasonable given that we have very little data and the clips are only 3–4 seconds long on average. You can add more data and try this out.&lt;/p&gt;

&lt;p&gt;To get the top two similar results, just run the following code (you could also take the top one or top three, deciding by majority vote in the top-three case).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get the top two results based on scores, handling potential ties
&lt;/span&gt;&lt;span class="n"&gt;top_two_results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;heapq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;nlargest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Extract and align IDs, considering potential ties
&lt;/span&gt;&lt;span class="n"&gt;top_two_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_two_results&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="c1"&gt;# Remove duplicates
&lt;/span&gt;
&lt;span class="c1"&gt;# Get corresponding names, checking for valid IDs
&lt;/span&gt;&lt;span class="n"&gt;top_two_names&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;aligned_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;top_two_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="n"&gt;aligned_id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;speakers&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;top_two_names&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;speakers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;aligned_id&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

 &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="n"&gt;Invalid&lt;/span&gt; &lt;span class="n"&gt;ID&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;aligned_id&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="n"&gt;encountered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Top&lt;/span&gt; &lt;span class="n"&gt;two&lt;/span&gt; &lt;span class="n"&gt;speakers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;top_two_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;Top two speakers: &lt;span class="o"&gt;[&lt;/span&gt;&lt;span class="s1"&gt;'Ronaldo'&lt;/span&gt;, &lt;span class="s1"&gt;'Ronaldo2'&lt;/span&gt;&lt;span class="o"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yes! It’s a match. We have successfully verified the new voice with the existing set of voices.&lt;/p&gt;
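
&lt;p&gt;In a real application, you would usually turn these search results into an accept/reject decision with a score threshold, so that unknown voices are rejected instead of being matched to the nearest stored speaker. Here is a hypothetical helper sketching that idea; the 0.6 threshold is an assumption for this tiny dataset, not a value prescribed by Resemblyzer or Qdrant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from collections import namedtuple

def recognize(results, speakers, threshold=0.6):
    # results: ScoredPoint-like objects with a 1-based .id and a similarity .score
    best = max(results, key=lambda r: r.score)
    if best.score &gt;= threshold:
        return speakers[best.id - 1]  # map the 1-based point id back to a name
    return None  # below the threshold: treat as an unknown speaker

# stand-in results mimicking the scores printed earlier
Point = namedtuple('Point', ['id', 'score'])
fake_results = [Point(2, 0.6956), Point(1, 0.6705), Point(5, 0.5673)]
print(recognize(fake_results, ['Ronaldo', 'Ronaldo2', 'Homer', 'Homer2', 'Trump']))  # Ronaldo2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;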

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we implemented audio-driven speaker recognition with just a few lines of code, using open-source technologies such as &lt;strong&gt;Resemblyzer&lt;/strong&gt; and &lt;strong&gt;Qdrant DB&lt;/strong&gt;. Resemblyzer is one of the easiest ways to work with audio data and encode it into embeddings; there is no need to build a neural network or transformer architecture yourself. Qdrant DB, on the other hand, provides an efficient way to store and retrieve embeddings.&lt;/p&gt;

&lt;p&gt;Thanks for reading this article!&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/"&gt;Qdrant Documentation — Qdrant&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/resemble-ai/Resemblyzer"&gt;resemble-ai/Resemblyzer: A python package to analyze and compare voices with deep learning (github.com)&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>genai</category>
    </item>
  </channel>
</rss>
