Kostas Pardalis

Originally published at devtools.wtf

WTF is a Vector Database?

It’s obviously a database, right? 😄 But how is it different from the databases you’ve heard of until now, like MySQL or PostgreSQL?

Let’s start with the basics. Trust me, by the end of this you will have a much better understanding of WTF a Vector Database is!

It’s a database

I’ll perform some plagiarism here, but it’s better to hear what a database is from someone who knows it much better than me.

“[01 Course Intro & Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University.

Vector databases fit this definition: they organize inter-related data that models some aspect of the real world! They are not yet a core component of most computer applications, but if the AI revolution lives up to its current hype, they might become one.

Databases usually come packaged as Database Management Systems (DBMS). You have probably heard this term already, and it’s important to keep in mind the difference between a database and a DBMS.

A set of CSV files in your file system can definitely be a database. It absolutely follows the above definition. It can contain information that is inter-related and that models some aspect of the real-world.

Database to DBMS

But what turns a database into a DBMS is what makes databases hard in general. A DBMS includes functionality for:

  1. Ensuring Data Integrity
  2. Data manipulation and access, e.g. adding new data
  3. Durability, i.e. what if the database crashes?

If we add to the above functionality APIs that let generic software interact with the database to store and process data, we have the definition of a DBMS.

Data Models & Databases

Let’s see what CMU-DB and Prof. Pavlo have to say about data models.

“[01 Course Intro & Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University.

And most importantly, let’s see some examples of data models.

“[01 Course Intro & Relational Model - Intro to database systems (15-445/645)](https://15445.courses.cs.cmu.edu/fall2022/slides/01-introduction.pdf)” Andy Pavlo, Carnegie Mellon University.

The course is about relational databases, but you might have noticed that there’s a mention of vectors in there!

This is important here because it gives us the first concrete definition of what a vector database is.

💡 A vector database is a DBMS that supports a Vector Data Model. In other words, it’s a DBMS that uses vectors to describe the data in a database.

As we will see, it’s pretty easy to add support for vectors to most existing relational databases. What makes vector databases a different breed is their native support for specific vector operations that are important for Machine Learning and AI.

What is a vector?

Let’s take a trip down memory lane. Hopefully the following definition brings back some good memories from your youth.

💡 A vector is a mathematical object that represents a quantity that has both magnitude and direction.

This is the definition of a vector that most people have encountered at some point in their life.

If we go a little bit deeper into Wikipedia, we will also find the following general definition of vectors.

💡 In mathematics and physics, a vector is a term that refers colloquially to some quantities that cannot be expressed by a single number (a scalar), or to elements of some vector spaces.

To the above definition let’s add the one that refers to what a vector is in computer science.

💡 In computer science, an array is a data structure consisting of a collection of elements (values or variables), each identified by at least one array index or key.

The above definition refers to an “array”, but in this context “array” and “vector” are used interchangeably.

We will talk more about vector spaces, features and all the cool stuff of AI a bit later, but for now the above definitions are what matter.

First, you have to forget what you learned about vectors at school. We are not talking about Euclidean vectors here, so magnitude and direction are not important.

What is important is the way we plan to represent the world in our database.

💡 We use quantities that cannot be expressed by a single number. Instead, we care about elements of some kind of vector space, and we represent these values in a computer as a collection of values, each identified by a key or index.

The above tells us how we want to represent the world and how to store this information in a way that a machine can process.

Why do we need vectors?

tldr - vectors can allow machines to understand how things like text, photos and video are related to one another

So far we’ve been a bit too technical, offering definitions that might make things a bit clearer, but we haven’t talked at all about why we even care about vectors. What is wrong with what traditional relational databases already offer?

It all started with our need to represent rich text documents not just syntactically but also semantically.

The idea is that we can represent a document as a vector of identifiers, such as index terms. Together, these vectors define a vector space, which also happens to be an algebraic model.

Because of that, we hope that we can use the mathematical tools of algebraic vector spaces to do interesting things like figuring out how similar two documents are!
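To make this concrete, here is a minimal sketch of the idea in Python. It assumes the simplest possible "identifiers": raw word counts over a shared vocabulary. The documents, the `term_vector` helper and the `cosine_similarity` helper are made up for illustration, and real search libraries use more elaborate weighting than plain counts.

```python
import math
from collections import Counter


def term_vector(text, vocabulary):
    """Turn a document into a vector of word counts, one slot per vocabulary term."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Three toy documents (illustrative only).
docs = {
    "a": "the cat sat on the mat",
    "b": "the cat lay on the rug",
    "c": "stock prices fell sharply today",
}

# The shared vocabulary fixes which word each vector position stands for.
vocabulary = sorted({word for text in docs.values() for word in text.lower().split()})
vectors = {name: term_vector(text, vocabulary) for name, text in docs.items()}

sim_ab = cosine_similarity(vectors["a"], vectors["b"])  # documents sharing several words
sim_ac = cosine_similarity(vectors["a"], vectors["c"])  # documents sharing no words
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

Documents "a" and "b" share several words and score well above zero, while "a" and "c" share none and score zero. Note that this only captures word overlap, not meaning.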

This idea is not new. Do you know this guy?

By Tim Bray (talk) - I created this work entirely by myself., CC BY-SA 3.0,

In case you don’t, this is Doug Cutting, the author of Apache Lucene, which was open sourced in 1999. Lucene is probably the first and most well-known library for indexing and searching text, and it implements the “vector space model” we talked about.

Vectors and vector spaces are a powerful way to represent information so that we can perform searches and comparisons beyond what the standard scalar operators allow.

But hopefully you are already wondering why we care about vector databases today, even though Lucene and the vector space model have existed since the ’90s. Also, is Lucene a vector database?

To understand why, we need to talk about a few more things first. But before we do that, let’s summarize.

💡 Vectors are useful because we can turn rich information into vectors in an algebraic model in which we can apply standard algebraic operations like comparisons and measurements. These operations can then be used for information retrieval.

Embeddings

After Lucene in 1999, it took us about another 14 years to take the next step in information retrieval.

Welcome to 2013 and to the work of Tomas Mikolov at Google, called Word2Vec.

Word2Vec is a technique that uses neural networks to learn word associations from a large corpus of text. These neural networks generate what the NLP literature usually calls word embeddings, which are representations of words.

The representation is a real-valued vector that encodes the meaning of the word in such a way that words that are closer in the vector space are expected to be similar in meaning.

I hope you see how often the term “vector” is being used.

The beauty of these algorithms is that after we have created these embeddings or vectors, we can use mathematical functions like cosine similarity to measure the semantic similarity of words.
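For reference, this is the standard definition of cosine similarity. The symbols are notation introduced here, not taken from the original text: $\mathbf{a}$ and $\mathbf{b}$ are two embeddings with $n$ dimensions each.

$$
\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
$$

The result ranges from -1 to 1. A value close to 1 means the two vectors point in nearly the same direction, which for word embeddings is read as "similar in meaning".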

It’s also important to note that these embeddings are real-valued vectors.

Enter Transformers

Today, Word2Vec is no longer the state of the art in generating embeddings. Instead, we use Transformers, which are deep learning models. Models like GPT are based on Transformers.

Regardless of the model used though, the output remains the same. Our information is represented as a real-valued vector and we can still use math to retrieve semantic information from our data.

💡 Embeddings are representations of words as real-valued vectors, which can then be used with standard algebraic tools to extract semantic information, e.g. to compare two words semantically.

Let’s put everything together

In 2023 we have some amazing technologies that can take rich information as input, e.g. a novel, and turn it into a new representation which we can query using machines.

To do that, these technologies turn the information into real-valued vectors.

To work with this information we now need efficient systems to store and process these real-valued vectors and do that at scale.

That’s exactly what a vector database is.

💡 A Vector Database is a DBMS that can efficiently store real-valued vectors of arbitrary dimensions and perform operations on them like applying the cosine-similarity function. On top of that a Vector Database has to also offer all the functionalities commonly found in a DBMS like durability, integrity and manipulation of the data by the user.
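As a rough illustration of the "store vectors and run cosine similarity over them" half of this definition, here is a toy sketch in Python. Everything in it is hypothetical: the `ToyVectorStore` class, its method names and the three-dimensional vectors are invented for this example and are not real embeddings or any real product's API. It keeps everything in memory and scans every row on each search, so it deliberately lacks the durability and integrity guarantees that would make it a DBMS.

```python
import math


class ToyVectorStore:
    """Hypothetical in-memory store: keys mapped to fixed-dimension real-valued vectors."""

    def __init__(self, dimensions):
        self.dimensions = dimensions
        self._rows = {}  # key -> list of floats

    def _check(self, vector):
        # A tiny stand-in for data integrity: reject vectors of the wrong size.
        if len(vector) != self.dimensions:
            raise ValueError(f"expected {self.dimensions} dimensions, got {len(vector)}")

    def upsert(self, key, vector):
        """Insert a new vector or overwrite the one stored under the same key."""
        self._check(vector)
        self._rows[key] = [float(x) for x in vector]

    def delete(self, key):
        """Remove a key if present; do nothing otherwise."""
        self._rows.pop(key, None)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(y * y for y in b))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    def search(self, query, k=3):
        """Return up to k (key, score) pairs, most similar first, by scanning every row."""
        self._check(query)
        scored = [(self._cosine(query, vec), key) for key, vec in self._rows.items()]
        scored.sort(reverse=True)
        return [(key, score) for score, key in scored[:k]]


# Made-up 3-dimensional vectors standing in for real embeddings.
store = ToyVectorStore(dimensions=3)
store.upsert("king", [0.9, 0.8, 0.1])
store.upsert("queen", [0.85, 0.82, 0.15])
store.upsert("banana", [0.1, 0.05, 0.95])

results = store.search([0.9, 0.8, 0.1], k=2)
print(results)
```

Scanning every stored vector is fine for a handful of rows but does not scale, which is one reason dedicated systems exist for this job.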

Now let’s see what the unique characteristics of a Vector Database are and how one is built.

Check The Next Article in the Series for how to build one!
