Whenever I read about vector databases, I find the explanations overly complicated. Here are my two cents, explaining the topic in the simplest possible terms.
✅ What is a vector?
In simple language, a vector is a list of numbers.
For example, let's say an apple costs $1, weighs 160 grams, and has a quality rating of 8. Then, in vector form, it will be represented by [1, 160, 8]. As you can see, each number in the list represents some aspect or characteristic of the apple. Similarly, if a banana costs $0.5, weighs 110 grams, and has a quality rating of 7, it will be represented by [0.5, 110, 7].
Let's take one more item: chocolate costs $2, weighs 210 grams, and has a quality rating of 5; it will be represented by [2, 210, 5].
Now, if we want the model to find words similar to "apple," it will look for the vectors that are closest to the apple vector, for example by measuring the distance between them. In this case the closest one is banana, and that is how models find related words or concepts. Representing things as numbers in this way is one of the reasons Large Language Models (LLMs) like GPT can perform tasks like question answering, generating text, or finding similar text.
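To make the "closer to apple" idea concrete, here is a minimal sketch in Python that compares the three vectors from the example. It assumes "closer" means the smallest straight-line (Euclidean) distance, which is only one of several possible measures; the names `items` and `nearest` are made up for this illustration.

```python
import math

# Vectors from the example: [price in $, weight in grams, quality rating]
items = {
    "apple": [1, 160, 8],
    "banana": [0.5, 110, 7],
    "chocolate": [2, 210, 5],
}


def nearest(query_name, items):
    """Return the name of the item whose vector is closest to the query item."""
    query = items[query_name]
    # Compare against every item except the query itself.
    others = {name: vec for name, vec in items.items() if name != query_name}
    # math.dist computes the Euclidean distance between two points.
    return min(others, key=lambda name: math.dist(query, others[name]))


print(nearest("apple", items))  # prints "banana"
```

With these numbers the result is close: banana and chocolate are both about 50 units away from apple, because the weight in grams dominates the other two values. Real systems usually work with learned vectors (embeddings) rather than hand-picked features like these.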
✅ But why do we need them?
We need vectors because they help to represent words or phrases in a way that computers can understand. As computers don't understand words, we convert them into numbers, which computers do understand.
✅ Now the next question is, where can we store them?
We can store these vectors in a vector database: a type of database designed to store, search, and manage data represented as vectors.
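As a rough picture of what "store, search, and manage" means, here is a toy in-memory sketch in Python. `ToyVectorStore` and its methods are hypothetical names invented for this post, not the API of any real product, and the search simply scans every vector, whereas real vector databases typically rely on specialised indexes to stay fast on large datasets.

```python
import math


class ToyVectorStore:
    """Hypothetical in-memory stand-in for a vector database (illustration only)."""

    def __init__(self):
        self._vectors = {}  # maps an item id to its vector

    def add(self, item_id, vector):
        """Store (or overwrite) the vector for an item."""
        self._vectors[item_id] = list(vector)

    def delete(self, item_id):
        """Remove an item if it exists."""
        self._vectors.pop(item_id, None)

    def search(self, query, k=1):
        """Return the ids of the k stored vectors closest to `query`, nearest first."""
        ranked = sorted(
            self._vectors,
            key=lambda item_id: math.dist(query, self._vectors[item_id]),
        )
        return ranked[:k]


store = ToyVectorStore()
store.add("apple", [1, 160, 8])
store.add("banana", [0.5, 110, 7])
store.add("chocolate", [2, 210, 5])

# Look up the two stored items most similar to a new, unseen vector.
results = store.search([1, 150, 8], k=2)
print(results)  # prints ['apple', 'banana']
```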
✅ Some of the popular vector databases are:
✔️ Pinecone: It's a vector database service that helps you find similar items in large datasets, like finding similar images, texts, etc.
✔️ FAISS (Facebook AI Similarity Search): It is a library developed by Facebook to perform vector similarity searches. It's more of a backend library than a full-fledged database, as it doesn't manage data persistence, which is a typical feature of a database.
✔️ ChromaDB: It is an open-source vector database used to store and manage data based on similarity, such as in recommendation systems or semantic search.
👉 Note: This is an oversimplified explanation where I didn't mention terms like text chunking for simplicity. I will cover that in a future post.
📚 If you want to learn more about this topic, please check out my book, Building an LLMOps Pipeline Using Hugging Face:
https://pratimuniyal.gumroad.com/l/BuildinganLLMOpsPipelineUsingHuggingFace
Top comments (1)
This article is really solid! We've also done some digging into vector databases and found a few key factors to consider when choosing one:
For more insights into the power of vector databases in AI and machine learning, I recommend checking out this article by my colleague Jatin Malhotra: scalablepath.com/back-end/vector-d...