🤖 100 Days of Generative AI - Day 6 - What is a Vector Database? 🤖

#ai #llm #gpt3 #vectordatabase

Whenever I read about vector databases, I often find overly complicated explanations of the topic. Here are my two cents, explaining it in the simplest possible terms.

✅ What is a vector?
In simple language, a vector represents a list of numbers.
For example, let's say an apple costs $1, weighs 160 grams, and has a quality rating of 8. In vector form, it is represented by [1, 160, 8]. As you can see, each number in the list represents some aspect or characteristic of the apple. Similarly, if a banana costs $0.5, weighs 110 grams, and has a quality rating of 7, it is represented by [0.5, 110, 7].
Let's take one more item: chocolate costs $2, weighs 210 grams, and has a quality rating of 5, so it is represented by [2, 210, 5].
Now, if we want the model to find items similar to "apple," it looks for the vectors closest to the apple's vector. Measured by straight-line distance, that is the banana. Comparing vectors this way is how models find related words or concepts, and it is one of the reasons Large Language Models (LLMs) like GPT can perform tasks like question answering, generating text, or finding similar text.
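
To make "closer" concrete, here is a minimal sketch in Python that compares the three vectors above using straight-line (Euclidean) distance. The function names `euclidean_distance` and `nearest` are my own, made up for illustration, and real systems often use other measures such as cosine similarity.

```python
import math

# Toy feature vectors from the example: [price in $, weight in grams, quality rating]
items = {
    "apple": [1, 160, 8],
    "banana": [0.5, 110, 7],
    "chocolate": [2, 210, 5],
}

def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(name, items):
    """Return the name of the item whose vector is closest to `name`'s vector."""
    query = items[name]
    others = {k: v for k, v in items.items() if k != name}
    return min(others, key=lambda k: euclidean_distance(query, others[k]))

print(nearest("apple", items))  # banana
```

In this toy data the weight is much larger than the other numbers, so it dominates the distance. That is one reason real pipelines usually scale or normalize the values first.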

✅ But why do we need them?
We need vectors because they help to represent words or phrases in a way that computers can understand. As computers don't understand words, we convert them into numbers, which computers do understand.
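
As a rough illustration of turning words into numbers, here is a toy word-count approach (often called bag-of-words). This is my own simplified sketch, not how LLMs do it: they use learned embeddings with hundreds or thousands of dimensions. The helper names `build_vocabulary` and `to_vector` are hypothetical.

```python
def build_vocabulary(sentences):
    """Assign each distinct lowercase word a position in the vector."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_vector(sentence, vocabulary):
    """Count how often each vocabulary word appears in the sentence."""
    vector = [0] * len(vocabulary)
    for w in sentence.lower().split():
        if w in vocabulary:
            vector[vocabulary[w]] += 1
    return vector

sentences = ["I like apple", "I like banana"]
vocab = build_vocabulary(sentences)  # positions: apple, banana, i, like

print(to_vector("I like apple", vocab))   # [1, 0, 1, 1]
print(to_vector("I like banana", vocab))  # [0, 1, 1, 1]
```

The two sentences end up with vectors that differ only in the positions for "apple" and "banana," which is the kind of pattern a computer can compare.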

✅ Now the next question is, where can we store them?
We can store these vectors in a vector database: a type of database designed to store, search, and manage data represented as vectors.
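
To show the idea of "store, search, and manage," here is a tiny in-memory sketch in Python. The class `ToyVectorStore` and its methods are hypothetical names I made up for this post, not the API of any real product. It checks every stored vector on each search, whereas real vector databases typically rely on specialized indexes to stay fast on large datasets.

```python
import math

class ToyVectorStore:
    """A minimal in-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._vectors = {}  # item id -> vector

    def add(self, item_id, vector):
        """Store (or overwrite) the vector for an item."""
        self._vectors[item_id] = list(vector)

    def delete(self, item_id):
        """Remove an item if it exists."""
        self._vectors.pop(item_id, None)

    def search(self, query, k=1):
        """Return the ids of the k stored vectors closest to the query."""
        scored = sorted(
            (math.dist(query, vector), item_id)
            for item_id, vector in self._vectors.items()
        )
        return [item_id for _, item_id in scored[:k]]

store = ToyVectorStore()
store.add("apple", [1, 160, 8])
store.add("banana", [0.5, 110, 7])
store.add("chocolate", [2, 210, 5])

print(store.search([1, 150, 8], k=2))  # ['apple', 'banana']
```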

✅ Some of the popular vector databases are:
✔️ Pinecone: It's a vector database service that helps you find similar items in large datasets, like finding similar images, texts, etc.
✔️ FAISS (Facebook AI Similarity Search): It is a library developed by Facebook to perform vector similarity searches. It's more of a backend library than a full-fledged database, as it doesn't manage data persistence, which is a typical feature of a database.
✔️ ChromaDB: It is an open-source vector database that stores data and retrieves it by similarity, for use cases such as recommendation systems or semantic search.

👉 Note: This is an oversimplified explanation where I didn't mention terms like text chunking for simplicity. I will cover that in a future post.

📚 If you want to learn more about this topic, please check out my book, Building an LLMOps Pipeline Using Hugging Face:
https://pratimuniyal.gumroad.com/l/BuildinganLLMOpsPipelineUsingHuggingFace

Top comments (1)

Kevin • Aug 29 '24

This article is really solid! We've also done some digging into vector databases and found a few key factors to consider when choosing one:

Scalability: As your data grows, your database needs to keep up. A good vector database should handle increasing data volumes without missing a beat. The best ones, like Milvus and Pinecone, are built to scale horizontally, meaning they can spread the load across multiple servers to keep things running smoothly as your data expands.
Performance: Speed matters, especially if you're working with real-time applications. High-performance vector databases use smart indexing and in-memory processing to make sure queries run fast, even with massive datasets. For example, Facebook's Faiss is known for its lightning-fast vector search, which is a game-changer for large-scale data.
Integration: Your vector database should play nice with the rest of your tech stack. Seamless integration with data sources, pipelines, and analytics tools is crucial. Many vector databases, like Milvus, offer APIs and SDKs in popular languages like Python, Java, and Go, making it easy to connect everything together. Plus, they often work well with machine learning libraries like TensorFlow and PyTorch, so deploying models is a breeze.

For more insights into the power of vector databases in AI and machine learning, I recommend checking out this article by my colleague Jatin Malhotra: scalablepath.com/back-end/vector-d...