Building a Vector Database from Scratch in Python

1. Overview

Goal: Create a basic vector database in Python to store sentence vectors and perform similarity searches using cosine similarity.
Use Case: Useful in NLP and machine learning for tasks like semantic search and information retrieval.

2. Workflow Steps

A. Tokenization & Vocabulary Creation

Tokenize each sentence (split into words, convert to lowercase).
Build a vocabulary: Collect all unique tokens from the sentences.
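
As a minimal sketch of this step, assuming punctuation should not become part of a token. The `tokenize` helper and its regular expression are illustrative choices, not something the walkthrough prescribes:

```python
import re

def tokenize(sentence):
    """Lowercase a sentence and return its word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9']+", sentence.lower())

sentences = ["I eat mango", "mango, apple, oranges are fruits"]

# Collect every distinct token seen across all sentences.
vocabulary = set()
for sentence in sentences:
    vocabulary.update(tokenize(sentence))

print(sorted(vocabulary))
# ['apple', 'are', 'eat', 'fruits', 'i', 'mango', 'oranges']
```

A plain `sentence.lower().split()` would instead treat "mango," and "mango" as two different tokens.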

B. Assign Indices

Map each word in the vocabulary to a unique integer index.
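
One way to build this mapping is sketched below on a small hypothetical vocabulary. Sorting before enumerating is an assumption added here for reproducibility: Python sets of strings do not guarantee the same iteration order from one run to the next.

```python
# Hypothetical four-word vocabulary, for illustration only.
vocabulary = {"mango", "eat", "i", "fruit"}

# Enumerate the sorted words so each one gets a stable position.
word_to_index = {word: index for index, word in enumerate(sorted(vocabulary))}

print(word_to_index)
# {'eat': 0, 'fruit': 1, 'i': 2, 'mango': 3}
```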

C. Vectorization

For each sentence:
- Create a zero vector whose length equals the vocabulary size.
- For each token in the sentence, increment the vector element at that token's index.
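
The two steps above can be sketched as a small function. The `vectorize` name and the four-word mapping are hypothetical, and plain Python lists stand in for NumPy arrays so the snippet runs without dependencies:

```python
# Hypothetical word-to-index mapping, for illustration only.
word_to_index = {"eat": 0, "fruit": 1, "i": 2, "mango": 3}

def vectorize(tokens, word_to_index):
    """Return a word-count vector; tokens outside the vocabulary are skipped."""
    vector = [0] * len(word_to_index)
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:
            vector[index] += 1
    return vector

print(vectorize(["i", "eat", "mango", "mango"], word_to_index))
# [1, 0, 1, 2]
```

A repeated word raises its count, so "mango" appearing twice gives 2 at index 3.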

D. Store Vectors

Add each sentence vector to the VectorStore with the sentence as the key.
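
A minimal sketch of what keying by sentence implies, using a plain dictionary and a hypothetical `add_vector` function (the vectors assume a three-word vocabulary: eat, i, mango):

```python
store = {}

def add_vector(store, key, vector):
    """Insert the vector under `key`, overwriting any earlier entry for that key."""
    store[key] = vector

add_vector(store, "I eat mango", [1, 1, 1])
add_vector(store, "I eat mango", [1, 1, 1])  # same key: overwrites, no duplicate
add_vector(store, "mango eat I", [1, 1, 1])  # different key, identical vector

print(len(store))
# 2
```

Adding the same sentence twice keeps a single entry, while two sentences with the same words in a different order are stored separately but share one vector, because word counts ignore word order.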

E. Similarity Search

Convert the query sentence into a vector using the same vocabulary and process, skipping any tokens that are not in the vocabulary.
Compute cosine similarity between the query vector and all stored vectors.
Retrieve the top-N most similar sentences.
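
Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below applies it to three hypothetical stored vectors and keeps the top two. The names `cosine_similarity`, `stored`, and `top_2`, and the choice to return 0.0 for an all-zero vector, are assumptions made for illustration:

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical stored vectors over a three-word vocabulary.
stored = {
    "doc a": [1, 1, 0],
    "doc b": [0, 1, 1],
    "doc c": [0, 0, 1],
}
query = [1, 1, 0]

# Score every stored vector, then keep the two highest scores.
scores = [(key, cosine_similarity(query, vector)) for key, vector in stored.items()]
top_2 = heapq.nlargest(2, scores, key=lambda pair: pair[1])

for key, score in top_2:
    print(f"{key}: {score:.4f}")
```

Every stored vector is scored on each query, so the cost grows linearly with the number of stored sentences.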

3. Example Code Walkthrough

import re

import numpy as np

# Example sentences
sentences = [
    "I eat mango",
    "mango is my favorite fruit",
    "mango, apple, oranges are fruits",
    "fruits are good for health",
]

def tokenize(sentence):
    # Lowercase and keep only word characters, so "mango," and "mango"
    # become the same token
    return re.findall(r"\w+", sentence.lower())

# Tokenization and vocabulary creation
vocabulary = set()
for sentence in sentences:
    vocabulary.update(tokenize(sentence))

# Sort so the word-to-index mapping is the same on every run
word_to_index = {word: i for i, word in enumerate(sorted(vocabulary))}

# Vectorization
sentence_vectors = {}
for sentence in sentences:
    vector = np.zeros(len(vocabulary))
    for token in tokenize(sentence):
        vector[word_to_index[token]] += 1
    sentence_vectors[sentence] = vector

# VectorStore class (simplified)
class VectorStore:
    def __init__(self):
        self.vector_data = {}

    def add_vector(self, vector_id, vector):
        self.vector_data[vector_id] = vector

    def find_similar_vectors(self, query_vector, num_results=2):
        results = []
        query_norm = np.linalg.norm(query_vector)
        for vector_id, vector in self.vector_data.items():
            denominator = query_norm * np.linalg.norm(vector)
            # An all-zero vector has no direction: score it 0 rather than divide by zero
            similarity = np.dot(query_vector, vector) / denominator if denominator else 0.0
            results.append((vector_id, similarity))
        # Highest similarity first
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:num_results]

# Store vectors
vector_store = VectorStore()
for sentence, vector in sentence_vectors.items():
    vector_store.add_vector(sentence, vector)

# Query
query_sentence = "Mango is the best fruit"
query_vector = np.zeros(len(vocabulary))
for token in tokenize(query_sentence):
    # Words that are not in the vocabulary ("the", "best") are skipped
    if token in word_to_index:
        query_vector[word_to_index[token]] += 1

similar_sentences = vector_store.find_similar_vectors(query_vector, num_results=2)

# Output
print("Query Sentence:", query_sentence)
print("Similar Sentences:")
for sentence, similarity in similar_sentences:
    print(f"{sentence}: Similarity = {similarity:.4f}")
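
The two scores this script prints can be checked by hand. Only three query tokens (mango, is, fruit) are in the vocabulary, so the query vector has length sqrt(3). The sketch below redoes that arithmetic for the two best matches; the variable names are illustrative and not part of the walkthrough:

```python
import math

# In-vocabulary query tokens: mango, is, fruit (each appears once).
query_norm = math.sqrt(3)

# "mango is my favorite fruit": 5 distinct tokens, 3 shared with the query.
best = 3 / (query_norm * math.sqrt(5))

# "I eat mango": 3 distinct tokens, 1 shared with the query.
second = 1 / (query_norm * math.sqrt(3))

print(f"{best:.4f}")    # 0.7746
print(f"{second:.4f}")  # 0.3333
```

Because "fruit" and "fruits" are different tokens, "fruits are good for health" shares no words with the query and scores 0.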

4. Key Concepts Illustrated

Step	Description
Tokenization	Splitting sentences into lowercase words
Vocabulary Creation	Collecting all unique tokens
Vectorization	Creating frequency-based vectors for each sentence
Storing in VectorStore	Adding vectors to a custom Python class
Similarity Search	Using cosine similarity to find and rank similar sentences

5. Conclusion

This approach demonstrates the fundamentals of vector databases: vectorization, storage, and similarity search.
The design is simple but forms the basis for more advanced, scalable vector database systems used in real-world AI applications[1][2].

Summary:

By following these steps, you can build a basic vector database in Python that stores text as word-count vectors and retrieves the closest matches by running a cosine similarity search over every stored vector[1][2].

Citations:
[1] https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/f9801498-79a7-4a63-b350-9249d6d88e00/paste-1.txt
[2] https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/858808ac-20b3-4d09-a3cc-05162a8c6374/paste-2.txt
[3] https://www.datastax.com/guides/python-vector-databases
[4] https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye
[5] https://www.youtube.com/watch?v=QLsBsWLvz-k
[6] https://dev.to/sebastiandevelops/understanding-vector-databases-a-beginners-guide-20nj
[7] https://www.pluralsight.com/resources/blog/ai-and-data/langchain-local-vector-database-tutorial
[8] https://myscale.com/blog/mastering-vector-database-implementation-in-python-tips/
[9] https://www.youtube.com/watch?v=9fScWrfmICc
[10] https://www.youtube.com/watch?v=c1ggPsErF9s
[11] https://hackernoon.com/vector-databases-basics-of-vector-search-and-langchain-package-in-python
[12] https://dev.to/mehmetakar/scaling-vector-search-for-ai-powered-applications-2pho
[13] https://www.datacamp.com/code-along/vector-databases-for-data-science-with-weaviate-in-python
[14] https://www.youtube.com/watch?v=OU3m34zVKbY
[15] https://www.youtube.com/watch?v=DIs6DmyGS-M
[16] https://www.youtube.com/watch?v=d6JFZF4gclo
[17] https://myscale.com/blog/python-vector-databases-revolutionize-data-storage/
[18] https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep
[19] https://realpython.com/learning-paths/database-access-in-python/
[20] https://realpython.com/chromadb-vector-database/
[21] https://pypi.org/project/vectordb/
