1. Overview
- Goal: Create a basic vector database in Python to store sentence vectors and perform similarity searches using cosine similarity.
- Use Case: Useful in NLP and machine learning for tasks like semantic search and information retrieval.
2. Workflow Steps
A. Tokenization & Vocabulary Creation
- Tokenize each sentence (split into words, convert to lowercase).
- Build a vocabulary: Collect all unique tokens from the sentences.
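As a minimal sketch of this step, the snippet below compares a plain `split()` with a regex-based tokenizer and then builds the vocabulary. The `tokenize` helper and its regex are assumptions made for illustration; the step itself only asks for lowercasing and splitting. A plain split leaves punctuation attached, so `mango,` and `mango` would end up as different vocabulary entries.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "mango, apple, oranges are fruits"

# A plain whitespace split keeps the commas attached to the words.
naive_tokens = sentence.lower().split()
# The regex tokenizer (hypothetical helper) drops the punctuation.
clean_tokens = tokenize(sentence)

# Vocabulary: the set of unique tokens across all sentences.
sentences = ["I eat mango", sentence]
vocabulary = set()
for s in sentences:
    vocabulary.update(tokenize(s))

print(naive_tokens)
print(clean_tokens)
print(sorted(vocabulary))
```

Whichever tokenizer is used, the same one has to be applied to stored sentences and to queries, otherwise their tokens will not line up.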
B. Assign Indices
- Map each word in the vocabulary to a unique integer index.
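A small sketch of the index mapping, using a hypothetical five-word vocabulary. Sorting before enumerating is an assumption added here: Python sets have no guaranteed iteration order, so sorting keeps each word's index stable from one run to the next.

```python
# Hypothetical vocabulary, as it might come out of the previous step.
vocabulary = {"mango", "fruit", "eat", "is", "i"}

# Sort first so the mapping is deterministic; iterating a set directly
# can yield a different order from one run to the next.
word_to_index = {word: i for i, word in enumerate(sorted(vocabulary))}

# Reverse lookup, handy when inspecting vectors by hand.
index_to_word = {i: word for word, i in word_to_index.items()}

print(word_to_index)
```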
C. Vectorization
- For each sentence:
  - Create a zero vector whose length equals the vocabulary size.
  - For each token in the sentence, increment the vector entry at that token's index.
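The counting step can be sketched on a tiny example. The `vectorize` function and the five-word mapping below are hypothetical names chosen for illustration; skipping unknown tokens is an assumption that matters mainly for queries.

```python
import numpy as np

# Hypothetical vocabulary with fixed indices.
word_to_index = {"eat": 0, "fruit": 1, "i": 2, "is": 3, "mango": 4}


def vectorize(tokens, word_to_index):
    """Return a count vector; tokens missing from the vocabulary are skipped."""
    vector = np.zeros(len(word_to_index))
    for token in tokens:
        index = word_to_index.get(token)
        if index is not None:
            vector[index] += 1
    return vector


# "mango" appears twice, so its slot (index 4) holds 2.
vector = vectorize(["i", "eat", "mango", "mango"], word_to_index)
print(vector)  # [1. 0. 1. 0. 2.]
```

Each position counts how often one vocabulary word occurs, so word order is lost; this is the usual bag-of-words trade-off.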
D. Store Vectors
- Add each sentence vector to the VectorStore with the sentence as the key.
E. Similarity Search
- Convert the query sentence into a vector using the same tokenization and vocabulary; query tokens that are not in the vocabulary are skipped.
- Compute cosine similarity between the query vector and all stored vectors.
- Retrieve the top-N most similar sentences.
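The search step can be sketched as follows. Cosine similarity is the dot product of two vectors divided by the product of their lengths. The document names (`doc_a`, `doc_b`, `doc_c`) and their vectors are made up for illustration, and returning 0.0 for an all-zero vector is an assumption added to avoid dividing by zero.

```python
import numpy as np


def cosine_similarity(a, b):
    """cos(a, b) = (a . b) / (|a| * |b|); returns 0.0 if either vector is all zeros."""
    denominator = np.linalg.norm(a) * np.linalg.norm(b)
    if denominator == 0:
        return 0.0
    return float(np.dot(a, b) / denominator)


# Hypothetical stored count vectors over a 4-word vocabulary.
stored = {
    "doc_a": np.array([1.0, 1.0, 0.0, 0.0]),
    "doc_b": np.array([0.0, 1.0, 1.0, 0.0]),
    "doc_c": np.array([0.0, 0.0, 0.0, 1.0]),
}
query = np.array([1.0, 1.0, 0.0, 0.0])

# Score every stored vector, then keep the N best.
scores = [(name, cosine_similarity(query, vec)) for name, vec in stored.items()]
scores.sort(key=lambda pair: pair[1], reverse=True)
top_n = scores[:2]
print(top_n)
```

Here `doc_a` matches the query exactly (similarity about 1.0), `doc_b` shares one of two words (about 0.5), and `doc_c` shares none (0.0).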
3. Example Code Walkthrough
import re

import numpy as np

# Example sentences
sentences = [
    "I eat mango",
    "mango is my favorite fruit",
    "mango, apple, oranges are fruits",
    "fruits are good for health",
]


def tokenize(text):
    # Lowercase, then keep runs of letters, digits, and apostrophes so that
    # "mango," and "mango" map to the same token
    return re.findall(r"[a-z0-9']+", text.lower())


# Tokenization and vocabulary creation
vocabulary = set()
for sentence in sentences:
    vocabulary.update(tokenize(sentence))

# Sort so that each word gets the same index on every run
word_to_index = {word: i for i, word in enumerate(sorted(vocabulary))}

# Vectorization: one word-count vector per sentence
sentence_vectors = {}
for sentence in sentences:
    vector = np.zeros(len(vocabulary))
    for token in tokenize(sentence):
        vector[word_to_index[token]] += 1
    sentence_vectors[sentence] = vector


# VectorStore class (simplified)
class VectorStore:
    def __init__(self):
        self.vector_data = {}

    def add_vector(self, vector_id, vector):
        self.vector_data[vector_id] = vector

    def find_similar_vectors(self, query_vector, num_results=2):
        results = []
        query_norm = np.linalg.norm(query_vector)
        for vector_id, vector in self.vector_data.items():
            denominator = query_norm * np.linalg.norm(vector)
            # An all-zero vector gets similarity 0 instead of a division by zero
            similarity = np.dot(query_vector, vector) / denominator if denominator else 0.0
            results.append((vector_id, similarity))
        results.sort(key=lambda x: x[1], reverse=True)
        return results[:num_results]


# Store vectors
vector_store = VectorStore()
for sentence, vector in sentence_vectors.items():
    vector_store.add_vector(sentence, vector)

# Query: words outside the vocabulary ("the", "best") are ignored
query_sentence = "Mango is the best fruit"
query_vector = np.zeros(len(vocabulary))
for token in tokenize(query_sentence):
    if token in word_to_index:
        query_vector[word_to_index[token]] += 1
similar_sentences = vector_store.find_similar_vectors(query_vector, num_results=2)

# Output
print("Query Sentence:", query_sentence)
print("Similar Sentences:")
for sentence, similarity in similar_sentences:
    print(f"{sentence}: Similarity = {similarity:.4f}")
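The loop in `find_similar_vectors` computes one dot product per stored vector. One common alternative, sketched here as an assumption and not something the walkthrough does, is to stack the stored vectors into a matrix and score them all with a single matrix-vector product. The function name `top_n_cosine` and the sample data are hypothetical.

```python
import numpy as np


def top_n_cosine(ids, matrix, query_vector, num_results=2):
    """Rank the rows of `matrix` by cosine similarity to `query_vector`."""
    row_norms = np.linalg.norm(matrix, axis=1)
    denominators = row_norms * np.linalg.norm(query_vector)
    dot_products = matrix @ query_vector
    # Leave similarity at 0 wherever a norm is zero, avoiding division by zero.
    similarities = np.divide(
        dot_products,
        denominators,
        out=np.zeros_like(dot_products, dtype=float),
        where=denominators > 0,
    )
    # Negate so that argsort returns the highest similarities first.
    best = np.argsort(-similarities, kind="stable")[:num_results]
    return [(ids[i], float(similarities[i])) for i in best]


ids = ["doc_a", "doc_b", "doc_c"]
matrix = np.array([
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],  # all-zero row: similarity stays 0
])
query_vector = np.array([1.0, 1.0, 0.0])

results = top_n_cosine(ids, matrix, query_vector)
print(results)
```

This is still an exhaustive scan over every stored vector; it only moves the loop into NumPy.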
4. Key Concepts Illustrated
| Step | Description |
|---|---|
| Tokenization | Splitting sentences into lowercase words |
| Vocabulary Creation | Collecting all unique tokens |
| Vectorization | Creating frequency-based vectors for each sentence |
| Storing in VectorStore | Adding vectors to a custom Python class |
| Similarity Search | Using cosine similarity to find and rank similar sentences |
5. Conclusion
- This approach demonstrates the fundamentals of vector databases: vectorization, storage, and similarity search.
- The design is simple but forms the basis for more advanced, scalable vector database systems used in real-world AI applications[1][2].
Summary:
By following these steps, you can build a basic vector database in Python that stores text as word-count vectors and retrieves it with cosine similarity searches[1][2]. Each query is compared against every stored vector, which is fine for small collections.
Citations:
[1] https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/f9801498-79a7-4a63-b350-9249d6d88e00/paste-1.txt
[2] https://ppl-ai-file-upload.s3.amazonaws.com/web/direct-files/56506619/858808ac-20b3-4d09-a3cc-05162a8c6374/paste-2.txt
[3] https://www.datastax.com/guides/python-vector-databases
[4] https://www.linkedin.com/pulse/vector-databases-demystified-part-2-building-your-own-adie-kaye
[5] https://www.youtube.com/watch?v=QLsBsWLvz-k
[6] https://dev.to/sebastiandevelops/understanding-vector-databases-a-beginners-guide-20nj
[7] https://www.pluralsight.com/resources/blog/ai-and-data/langchain-local-vector-database-tutorial
[8] https://myscale.com/blog/mastering-vector-database-implementation-in-python-tips/
[9] https://www.youtube.com/watch?v=9fScWrfmICc
[10] https://www.youtube.com/watch?v=c1ggPsErF9s
[11] https://hackernoon.com/vector-databases-basics-of-vector-search-and-langchain-package-in-python
[12] https://dev.to/mehmetakar/scaling-vector-search-for-ai-powered-applications-2pho
[13] https://www.datacamp.com/code-along/vector-databases-for-data-science-with-weaviate-in-python
[14] https://www.youtube.com/watch?v=OU3m34zVKbY
[15] https://www.youtube.com/watch?v=DIs6DmyGS-M
[16] https://www.youtube.com/watch?v=d6JFZF4gclo
[17] https://myscale.com/blog/python-vector-databases-revolutionize-data-storage/
[18] https://dev.to/vivekalhat/building-a-tiny-vector-store-from-scratch-59ep
[19] https://realpython.com/learning-paths/database-access-in-python/
[20] https://realpython.com/chromadb-vector-database/
[21] https://pypi.org/project/vectordb/