<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PRAJAK SEN</title>
    <description>The latest articles on DEV Community by PRAJAK SEN (@prajak002).</description>
    <link>https://dev.to/prajak002</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F953459%2F70541f0b-5e20-4a08-861a-ea64ec163197.jpeg</url>
      <title>DEV Community: PRAJAK SEN</title>
      <link>https://dev.to/prajak002</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prajak002"/>
    <language>en</language>
    <item>
      <title>Building a Vector Database from Scratch in Python</title>
      <dc:creator>PRAJAK SEN</dc:creator>
      <pubDate>Wed, 16 Apr 2025 09:30:31 +0000</pubDate>
      <link>https://dev.to/prajak002/building-a-vector-database-from-scratch-in-python-5eg</link>
      <guid>https://dev.to/prajak002/building-a-vector-database-from-scratch-in-python-5eg</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Goal: Create a basic vector database in Python to store sentence vectors and perform similarity searches using cosine similarity.&lt;/li&gt;
&lt;li&gt;Use Case: Useful in NLP and machine learning for tasks like semantic search and information retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Workflow Steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Tokenization &amp;amp; Vocabulary Creation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tokenize each sentence (split into words, convert to lowercase).&lt;/li&gt;
&lt;li&gt;Build a vocabulary: Collect all unique tokens from the sentences.&lt;/li&gt;
&lt;/ul&gt;
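
&lt;p&gt;A minimal sketch of this step, assuming a token is any run of lowercase letters, digits, or apostrophes (the helper name &lt;code&gt;tokenize&lt;/code&gt; is hypothetical, not part of the walkthrough code). Dropping punctuation keeps "mango," and "mango" from becoming two separate vocabulary entries:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

def tokenize(sentence):
    # Lowercase first, then keep runs of letters, digits, or apostrophes.
    return re.findall(r"[a-z0-9']+", sentence.lower())

sentences = ["I eat mango", "mango, apple, oranges are fruits"]

# The vocabulary is the set of unique tokens across all sentences.
vocabulary = set()
for sentence in sentences:
    vocabulary.update(tokenize(sentence))

print(sorted(vocabulary))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here "mango" appears in both sentences but is stored once, so the vocabulary holds seven words.&lt;/p&gt;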

&lt;h3&gt;
  
  
  B. Assign Indices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Map each word in the vocabulary to a unique integer index.&lt;/li&gt;
&lt;/ul&gt;
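
&lt;p&gt;One way to sketch the mapping, assuming a reproducible ordering is wanted: sort the vocabulary before enumerating it. Iterating a Python set of strings directly can produce a different order from one run to the next, which would silently change which index each word gets. The three-word vocabulary below is illustrative only:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;vocabulary = {"mango", "apple", "fruit"}

# Sorting first gives every word the same index on every run.
word_to_index = {word: i for i, word in enumerate(sorted(vocabulary))}

print(word_to_index)  # {'apple': 0, 'fruit': 1, 'mango': 2}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;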

&lt;h3&gt;
  
  
  C. Vectorization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For each sentence:

&lt;ul&gt;
&lt;li&gt;Create a zero vector whose length equals the vocabulary size.&lt;/li&gt;
&lt;li&gt;For each token in the sentence, increment the corresponding index in the vector.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
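
&lt;p&gt;The counting step can be sketched as follows. The function name &lt;code&gt;vectorize&lt;/code&gt; and the small index map are assumptions made for illustration. Words outside the vocabulary are skipped here, which is one reasonable choice rather than the only one:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

word_to_index = {"apple": 0, "fruit": 1, "mango": 2}

def vectorize(tokens, word_to_index):
    # One slot per vocabulary word, counting how often each token occurs.
    vector = np.zeros(len(word_to_index))
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:  # skip words outside the vocabulary
            vector[index] += 1
    return vector

print(vectorize(["mango", "mango", "fruit", "kiwi"], word_to_index))  # [0. 1. 2.]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;"mango" occurs twice, so its slot holds 2. "kiwi" is not in the vocabulary and contributes nothing.&lt;/p&gt;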

&lt;h3&gt;
  
  
  D. Store Vectors
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add each sentence vector to the VectorStore with the sentence as the key.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  E. Similarity Search
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Convert the query sentence into a vector using the same vocabulary and process.&lt;/li&gt;
&lt;li&gt;Compute cosine similarity between the query vector and all stored vectors.&lt;/li&gt;
&lt;li&gt;Retrieve the top-N most similar sentences.&lt;/li&gt;
&lt;/ul&gt;
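
&lt;p&gt;Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below scores a query against a few stored vectors and keeps the best matches. The names &lt;code&gt;cosine_similarity&lt;/code&gt; and &lt;code&gt;top_n&lt;/code&gt; and the two-dimensional sample vectors are hypothetical. Returning 0.0 for an all-zero vector is an assumption made to avoid dividing by zero:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|); defined here as 0.0 for a zero vector.
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)

def top_n(query_vector, stored, n=2):
    # Score every stored vector, then keep the n highest scores.
    scored = [(key, cosine_similarity(query_vector, vec)) for key, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

stored = {
    "a": np.array([1.0, 0.0]),
    "b": np.array([1.0, 1.0]),
    "c": np.array([0.0, 1.0]),
}
print(top_n(np.array([1.0, 0.0]), stored))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The query points in the same direction as "a" (similarity 1.0), sits at 45 degrees to "b" (about 0.7071), and is orthogonal to "c" (0.0), so "a" and "b" are returned in that order.&lt;/p&gt;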




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Example Code Walkthrough&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Example sentences
&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I eat mango&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mango is my favorite fruit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mango, apple, oranges are fruits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fruits are good for health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Tokenization and vocabulary creation
&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
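    &lt;span class="c1"&gt;# Strip surrounding punctuation so "mango," and "mango" become the same token
&lt;/span&gt;    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.,!?;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sort the vocabulary so each word gets the same index on every run
&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;))}&lt;/span&gt;

&lt;span class="c1"&gt;# Vectorization
&lt;/span&gt;&lt;span class="n"&gt;sentence_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.,!?;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;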
    &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;sentence_vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;

&lt;span class="c1"&gt;# VectorStore class (simplified)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VectorStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_similar_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="c1"&gt;# Cosine similarity; an all-zero vector scores 0.0 instead of dividing by zero
&lt;/span&gt;            &lt;span class="n"&gt;norm_product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;norm_product&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;norm_product&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Store vectors
&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence_vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query
&lt;/span&gt;&lt;span class="n"&gt;query_sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mango is the best fruit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
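&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.,!?;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# drop surrounding punctuation&lt;/span&gt;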
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;similar_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_similar_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query Sentence:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similar Sentences:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;similar_sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Similarity = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;4. Key Concepts Illustrated&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tokenization&lt;/td&gt;
&lt;td&gt;Splitting sentences into lowercase words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vocabulary Creation&lt;/td&gt;
&lt;td&gt;Collecting all unique tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vectorization&lt;/td&gt;
&lt;td&gt;Creating frequency-based vectors for each sentence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storing in VectorStore&lt;/td&gt;
&lt;td&gt;Adding vectors to a custom Python class&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Similarity Search&lt;/td&gt;
&lt;td&gt;Using cosine similarity to find and rank similar sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This approach demonstrates the fundamentals of vector databases: vectorization, storage, and similarity search.&lt;/li&gt;
&lt;li&gt;The design is simple but forms the basis for more advanced, scalable vector database systems used in real-world AI applications[1][2].&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By following these steps, you can build a basic vector database in Python that stores bag-of-words sentence vectors and retrieves the closest matches with a brute-force cosine similarity search[1][2].&lt;/p&gt;

&lt;p&gt;Citations:&lt;br&gt;
[1] &lt;a href="https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/f9801498-79a7-4a63-b350-9249d6d88e00/paste-1.txt" rel="noopener noreferrer"&gt;https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/f9801498-79a7-4a63-b350-9249d6d88e00/paste-1.txt&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/858808ac-20b3-4d09-a3cc-05162a8c6374/paste-2.txt" rel="noopener noreferrer"&gt;https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/858808ac-20b3-4d09-a3cc-05162a8c6374/paste-2.txt&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://www.datastax.com/guides/python-vector-databases" rel="noopener noreferrer"&gt;https://www.datastax.com/guides/python-vector-databases&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye" rel="noopener noreferrer"&gt;https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://www.youtube.com/watch?v=QLsBsWLvz-k" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=QLsBsWLvz-k&lt;/a&gt;&lt;br&gt;
[6] &lt;a href="https://dev.to/sebastiandevelops/understanding-vector-databases-a-beginners-guide-20nj"&gt;https://dev.to/sebastiandevelops/understanding-vector-databases-a-beginners-guide-20nj&lt;/a&gt;&lt;br&gt;
[7] &lt;a href="https://www.pluralsight.com/resources/blog/ai-and-data/langchain-local-vector-database-tutorial" rel="noopener noreferrer"&gt;https://www.pluralsight.com/resources/blog/ai-and-data/langchain-local-vector-database-tutorial&lt;/a&gt;&lt;br&gt;
[8] &lt;a href="https://myscale.com/blog/mastering-vector-database-implementation-in-python-tips/" rel="noopener noreferrer"&gt;https://myscale.com/blog/mastering-vector-database-implementation-in-python-tips/&lt;/a&gt;&lt;br&gt;
[9] &lt;a href="https://www.youtube.com/watch?v=9fScWrfmICc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=9fScWrfmICc&lt;/a&gt;&lt;br&gt;
[10] &lt;a href="https://www.youtube.com/watch?v=c1ggPsErF9s" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=c1ggPsErF9s&lt;/a&gt;&lt;br&gt;
[11] &lt;a href="https://hackernoon.com/vector-databases-basics-of-vector-search-and-langchain-package-in-python" rel="noopener noreferrer"&gt;https://hackernoon.com/vector-databases-basics-of-vector-search-and-langchain-package-in-python&lt;/a&gt;&lt;br&gt;
[12] &lt;a href="https://dev.to/mehmetakar/scaling-vector-search-for-ai-powered-applications-2pho"&gt;https://dev.to/mehmetakar/scaling-vector-search-for-ai-powered-applications-2pho&lt;/a&gt;&lt;br&gt;
[13] &lt;a href="https://www.datacamp.com/code-along/vector-databases-for-data-science-with-weaviate-in-python" rel="noopener noreferrer"&gt;https://www.datacamp.com/code-along/vector-databases-for-data-science-with-weaviate-in-python&lt;/a&gt;&lt;br&gt;
[14] &lt;a href="https://www.youtube.com/watch?v=OU3m34zVKbY" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=OU3m34zVKbY&lt;/a&gt;&lt;br&gt;
[15] &lt;a href="https://www.youtube.com/watch?v=DIs6DmyGS-M" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=DIs6DmyGS-M&lt;/a&gt;&lt;br&gt;
[16] &lt;a href="https://www.youtube.com/watch?v=d6JFZF4gclo" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=d6JFZF4gclo&lt;/a&gt;&lt;br&gt;
[17] &lt;a href="https://myscale.com/blog/python-vector-databases-revolutionize-data-storage/" rel="noopener noreferrer"&gt;https://myscale.com/blog/python-vector-databases-revolutionize-data-storage/&lt;/a&gt;&lt;br&gt;
[18] &lt;a href="https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep"&gt;https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep&lt;/a&gt;&lt;br&gt;
[19] &lt;a href="https://realpython.com/learning-paths/database-access-in-python/" rel="noopener noreferrer"&gt;https://realpython.com/learning-paths/database-access-in-python/&lt;/a&gt;&lt;br&gt;
[20] &lt;a href="https://realpython.com/chromadb-vector-database/" rel="noopener noreferrer"&gt;https://realpython.com/chromadb-vector-database/&lt;/a&gt;&lt;br&gt;
[21] &lt;a href="https://pypi.org/project/vectordb/" rel="noopener noreferrer"&gt;https://pypi.org/project/vectordb/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building a Vector Database from Scratch in Python</title>
      <dc:creator>PRAJAK SEN</dc:creator>
      <pubDate>Wed, 16 Apr 2025 09:30:30 +0000</pubDate>
      <link>https://dev.to/prajak002/building-a-vector-database-from-scratch-in-python-4n57</link>
      <guid>https://dev.to/prajak002/building-a-vector-database-from-scratch-in-python-4n57</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;1. Overview&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Goal: Create a basic vector database in Python to store sentence vectors and perform similarity searches using cosine similarity.&lt;/li&gt;
&lt;li&gt;Use Case: Useful in NLP and machine learning for tasks like semantic search and information retrieval.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;2. Workflow Steps&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Tokenization &amp;amp; Vocabulary Creation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Tokenize each sentence (split into words, convert to lowercase).&lt;/li&gt;
&lt;li&gt;Build a vocabulary: Collect all unique tokens from the sentences.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  B. Assign Indices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Map each word in the vocabulary to a unique integer index.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  C. Vectorization
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;For each sentence:

&lt;ul&gt;
&lt;li&gt;Create a zero vector whose length equals the vocabulary size.&lt;/li&gt;
&lt;li&gt;For each token in the sentence, increment the corresponding index in the vector.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h3&gt;
  
  
  D. Store Vectors
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Add each sentence vector to the VectorStore with the sentence as the key.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  E. Similarity Search
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Convert the query sentence into a vector using the same vocabulary and process.&lt;/li&gt;
&lt;li&gt;Compute cosine similarity between the query vector and all stored vectors.&lt;/li&gt;
&lt;li&gt;Retrieve the top-N most similar sentences.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3. Example Code Walkthrough&lt;/strong&gt;
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="c1"&gt;# Example sentences
&lt;/span&gt;&lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;I eat mango&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mango is my favorite fruit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mango, apple, oranges are fruits&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fruits are good for health&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Tokenization and vocabulary creation
&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
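    &lt;span class="c1"&gt;# Strip surrounding punctuation so "mango," and "mango" become the same token
&lt;/span&gt;    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.,!?;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;
    &lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Sort the vocabulary so each word gets the same index on every run
&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;))}&lt;/span&gt;

&lt;span class="c1"&gt;# Vectorization
&lt;/span&gt;&lt;span class="n"&gt;sentence_vectors&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.,!?;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;()]&lt;/span&gt;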
    &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="n"&gt;sentence_vectors&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;

&lt;span class="c1"&gt;# VectorStore class (simplified)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;VectorStore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;add_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_similar_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vector_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="c1"&gt;# Cosine similarity; an all-zero vector scores 0.0 instead of dividing by zero
&lt;/span&gt;            &lt;span class="n"&gt;norm_product&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;linalg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;norm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;norm_product&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;norm_product&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;vector_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;reverse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Store vectors
&lt;/span&gt;&lt;span class="n"&gt;vector_store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;VectorStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentence_vectors&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query
&lt;/span&gt;&lt;span class="n"&gt;query_sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Mango is the best fruit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;query_vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vocabulary&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query_sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;word_to_index&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="n"&gt;similar_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_similar_vectors&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_vector&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_results&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Output
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query Sentence:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similar Sentences:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;similar_sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: Similarity = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;4. Key Concepts Illustrated&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tokenization&lt;/td&gt;
&lt;td&gt;Splitting sentences into lowercase words&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vocabulary Creation&lt;/td&gt;
&lt;td&gt;Collecting all unique tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vectorization&lt;/td&gt;
&lt;td&gt;Creating frequency-based vectors for each sentence&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storing in VectorStore&lt;/td&gt;
&lt;td&gt;Adding vectors to a custom Python class&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Similarity Search&lt;/td&gt;
&lt;td&gt;Using cosine similarity to find and rank similar sentences&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
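
&lt;p&gt;To make the steps in the table concrete, here is a minimal worked sketch on two made-up sentences. The sentences and the helper names (tokenize, vectorize, cosine_similarity) are assumptions chosen for illustration, not part of the code above, and the sketch uses only the standard library rather than NumPy so it can run on its own.&lt;/p&gt;

```python
import math
from collections import Counter

# Hypothetical example sentences, chosen only for illustration.
sentences = ["I like mango", "Mango is sweet"]


def tokenize(text):
    # Tokenization: lowercase the text and split on whitespace.
    return text.lower().split()


# Vocabulary creation: sorted unique tokens, each mapped to an index.
vocabulary = sorted({tok for s in sentences for tok in tokenize(s)})
word_to_index = {word: i for i, word in enumerate(vocabulary)}


def vectorize(text):
    # Vectorization: count how often each vocabulary word occurs.
    # Tokens outside the vocabulary are ignored.
    counts = Counter(tokenize(text))
    return [counts.get(word, 0) for word in vocabulary]


def cosine_similarity(a, b):
    # Similarity: dot product divided by the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with an all-zero vector as 0
    return dot / (norm_a * norm_b)


v1 = vectorize(sentences[0])
v2 = vectorize(sentences[1])
print(vocabulary)  # ['i', 'is', 'like', 'mango', 'sweet']
print(v1, v2)      # [1, 0, 1, 1, 0] [0, 1, 0, 1, 1]
print(round(cosine_similarity(v1, v2), 4))  # 0.3333
```

&lt;p&gt;The two sentences share only the word "mango", so the dot product is 1 and each vector has norm √3, which gives a similarity of 1/3.&lt;/p&gt;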




&lt;h2&gt;
  
  
  &lt;strong&gt;5. Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;This approach demonstrates the fundamentals of vector databases: vectorization, storage, and similarity search.&lt;/li&gt;
&lt;li&gt;The design is simple (each query is compared against every stored vector), but it forms the basis for the more advanced, scalable vector database systems used in real-world AI applications[1][2].&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
By following these steps, you can build a basic vector database in Python that stores text as frequency-based vectors and retrieves the most similar sentences with a cosine similarity search[1][2].&lt;/p&gt;

&lt;p&gt;Citations:&lt;br&gt;
[1] &lt;a href="https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/f9801498-79a7-4a63-b350-9249d6d88e00/paste-1.txt" rel="noopener noreferrer"&gt;https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/f9801498-79a7-4a63-b350-9249d6d88e00/paste-1.txt&lt;/a&gt;&lt;br&gt;
[2] &lt;a href="https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/858808ac-20b3-4d09-a3cc-05162a8c6374/paste-2.txt" rel="noopener noreferrer"&gt;https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/858808ac-20b3-4d09-a3cc-05162a8c6374/paste-2.txt&lt;/a&gt;&lt;br&gt;
[3] &lt;a href="https://www.datastax.com/guides/python-vector-databases" rel="noopener noreferrer"&gt;https://www.datastax.com/guides/python-vector-databases&lt;/a&gt;&lt;br&gt;
[4] &lt;a href="https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye" rel="noopener noreferrer"&gt;https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye&lt;/a&gt;&lt;br&gt;
[5] &lt;a href="https://www.youtube.com/watch?v=QLsBsWLvz-k" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=QLsBsWLvz-k&lt;/a&gt;&lt;br&gt;
[6] &lt;a href="https://dev.to/sebastiandevelops/understanding-vector-databases-a-beginners-guide-20nj"&gt;https://dev.to/sebastiandevelops/understanding-vector-databases-a-beginners-guide-20nj&lt;/a&gt;&lt;br&gt;
[7] &lt;a href="https://www.pluralsight.com/resources/blog/ai-and-data/langchain-local-vector-database-tutorial" rel="noopener noreferrer"&gt;https://www.pluralsight.com/resources/blog/ai-and-data/langchain-local-vector-database-tutorial&lt;/a&gt;&lt;br&gt;
[8] &lt;a href="https://myscale.com/blog/mastering-vector-database-implementation-in-python-tips/" rel="noopener noreferrer"&gt;https://myscale.com/blog/mastering-vector-database-implementation-in-python-tips/&lt;/a&gt;&lt;br&gt;
[9] &lt;a href="https://www.youtube.com/watch?v=9fScWrfmICc" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=9fScWrfmICc&lt;/a&gt;&lt;br&gt;
[10] &lt;a href="https://www.youtube.com/watch?v=c1ggPsErF9s" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=c1ggPsErF9s&lt;/a&gt;&lt;br&gt;
[11] &lt;a href="https://hackernoon.com/vector-databases-basics-of-vector-search-and-langchain-package-in-python" rel="noopener noreferrer"&gt;https://hackernoon.com/vector-databases-basics-of-vector-search-and-langchain-package-in-python&lt;/a&gt;&lt;br&gt;
[12] &lt;a href="https://dev.to/mehmetakar/scaling-vector-search-for-ai-powered-applications-2pho"&gt;https://dev.to/mehmetakar/scaling-vector-search-for-ai-powered-applications-2pho&lt;/a&gt;&lt;br&gt;
[13] &lt;a href="https://www.datacamp.com/code-along/vector-databases-for-data-science-with-weaviate-in-python" rel="noopener noreferrer"&gt;https://www.datacamp.com/code-along/vector-databases-for-data-science-with-weaviate-in-python&lt;/a&gt;&lt;br&gt;
[14] &lt;a href="https://www.youtube.com/watch?v=OU3m34zVKbY" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=OU3m34zVKbY&lt;/a&gt;&lt;br&gt;
[15] &lt;a href="https://www.youtube.com/watch?v=DIs6DmyGS-M" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=DIs6DmyGS-M&lt;/a&gt;&lt;br&gt;
[16] &lt;a href="https://www.youtube.com/watch?v=d6JFZF4gclo" rel="noopener noreferrer"&gt;https://www.youtube.com/watch?v=d6JFZF4gclo&lt;/a&gt;&lt;br&gt;
[17] &lt;a href="https://myscale.com/blog/python-vector-databases-revolutionize-data-storage/" rel="noopener noreferrer"&gt;https://myscale.com/blog/python-vector-databases-revolutionize-data-storage/&lt;/a&gt;&lt;br&gt;
[18] &lt;a href="https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep"&gt;https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep&lt;/a&gt;&lt;br&gt;
[19] &lt;a href="https://realpython.com/learning-paths/database-access-in-python/" rel="noopener noreferrer"&gt;https://realpython.com/learning-paths/database-access-in-python/&lt;/a&gt;&lt;br&gt;
[20] &lt;a href="https://realpython.com/chromadb-vector-database/" rel="noopener noreferrer"&gt;https://realpython.com/chromadb-vector-database/&lt;/a&gt;&lt;br&gt;
[21] &lt;a href="https://pypi.org/project/vectordb/" rel="noopener noreferrer"&gt;https://pypi.org/project/vectordb/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
