DEV Community

Jina AI

How to Use Every Vector Database in Python with DocArray

Back in the day, pre-Google, the Internet was mostly text. Whether it was news updates, sports scores, blog posts or emails, ASCII and Unicode were the way to go.

Aaah, the good old days. Just pure ASCII as God intended.

But nowadays, data is increasingly complex and multimodal, and most of it arrives in unstructured forms such as images, videos, text, and 3D meshes. Gone are the days of being limited to 26 letters and 10 digits (or more for other character sets). Now there’s much more stuff to deal with.

Just think about your favorite YouTube videos, Spotify songs, or game NPCs.

Typical databases can’t handle these kinds of multimodal data. They can only store and process structured data (like simple text strings or numbers). This really limits our ability to extract business insights and value from a huge chunk of the 21st century's data.

Lucky for us, recent advancements in machine learning techniques and approximate nearest neighbor search have made it possible to better utilize unstructured data:

  • Deep learning models and representation learning effectively represent complex data as vector embeddings.

  • Vector databases use those vector embeddings to store and analyze unstructured data.

What are vector databases?

A vector database is a type of database that can index and retrieve data using vectors, similar to how a traditional database uses keys or text to search for items using an index.

A vector database uses a vector index to enable fast retrieval and insertion by a vector, and also offers typical database features such as CRUD operations, filtering, and scalability.

This gives us the best of both worlds - we get the CRUDiness of traditional databases, coupled with the ability to store complex, unstructured data like images, videos, and 3D meshes.
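To make "retrieval by a vector" concrete, here is a minimal sketch in plain Python. It is not how any of the databases in this post work internally: real vector databases use approximate nearest neighbor indexes, while this toy scans every stored vector. The `ToyVectorStore` class, its method names, and the three-dimensional example embeddings are all made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical in-memory store: CRUD by ID plus search by vector."""

    def __init__(self):
        self._vectors = {}

    def upsert(self, item_id, vector):
        # Create or update: a traditional database operation.
        self._vectors[item_id] = list(vector)

    def delete(self, item_id):
        self._vectors.pop(item_id, None)

    def search(self, query, limit=3):
        # Brute-force scan: score every stored vector against the query,
        # then return the IDs of the most similar ones first.
        scored = [
            (cosine_similarity(query, vector), item_id)
            for item_id, vector in self._vectors.items()
        ]
        scored.sort(reverse=True)
        return [item_id for _, item_id in scored[:limit]]


store = ToyVectorStore()
# Made-up embeddings; in practice a deep learning model produces these.
store.upsert("red_shoe", [0.9, 0.1, 0.0])
store.upsert("red_boot", [0.8, 0.3, 0.1])
store.upsert("blue_lamp", [0.0, 0.2, 0.9])

nearest = store.search([1.0, 0.0, 0.0], limit=2)
print(nearest)
```

The query is itself a vector, and the answer is whichever stored items point in the most similar direction. A real vector index gives the same kind of answer without comparing against every item.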

So, vector databases are great, right? What’s even more awesome is having a library to use them all while being capable of handling unstructured data at the same time! One unstructured data library to rule them all!

We are, of course, talking about DocArray. Let’s see what this project is all about.

DocArray's universal Pythonic API to all vector databases

As the description on the project home page suggests, DocArray is a library for nested, unstructured and multimodal data.

This means that if you want to process unstructured data and represent it as vectors, DocArray is perfect for you.

DocArray is also a universal entrypoint for many vector databases.

For the remainder of this post, we’ll be using DocArray to index and search data in the Amazon Berkeley Objects Dataset. This dataset contains product items with accompanying images and metadata such as brand, country, and color, and represents the inventory of an e-commerce website.

Although a traditional database can perform filtering on metadata, it is unable to search image data or other unstructured data formats. That’s why we’re using a vector database!

We’ll start by loading a subset of the Amazon Berkeley Objects Dataset, provided in CSV format, into DocArray, and then computing vector embeddings.

Sample images from the dataset

Then, we'll use DocArray with each database to perform search and insertion operations using vectors.

We’ll use the following databases via DocArray in Python:

  • Milvus - cloud-native vector database with storage and computation separated by design

  • Weaviate - vector search engine that stores both objects and vectors and can be accessed through REST or GraphQL

  • Qdrant - vector database written in Rust and designed to be fast and reliable under high loads

  • Redis - in-memory key-value database that supports different kinds of data structures with vector search capabilities

  • ElasticSearch - distributed, RESTful search engine with Approximate Nearest Neighbor search capabilities

  • OpenSearch - open-source search software based on Apache Lucene originally forked from ElasticSearch

  • AnnLite - a Python library for fast Approximate Nearest Neighbor Search with filtering capabilities

For each database, we’ll:

  • Set up the database and install requirements

  • Index the data in the vector database

  • Perform a vector search operation with filtering

  • Display the search results
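The four steps above can be sketched without any database at all. The snippet below is only an illustration of the workflow, not DocArray's API or that of any listed database: a plain list stands in for the database, and `index_item`, `filtered_search`, the product IDs, the two-dimensional embeddings, and the metadata values are all hypothetical. The metadata keys (`color`, `country`) mirror the fields the dataset is described as having.

```python
import math

# Step 1 (setup): a plain list stands in for a database connection.
catalog = []


def index_item(item_id, embedding, **metadata):
    """Step 2 (index): store the embedding together with its metadata."""
    catalog.append({"id": item_id, "embedding": embedding, "tags": metadata})


def filtered_search(query, filters, limit=2):
    """Step 3 (search): keep items matching every filter, then rank the
    survivors by Euclidean distance to the query vector."""
    candidates = [
        item
        for item in catalog
        if all(item["tags"].get(key) == value for key, value in filters.items())
    ]
    candidates.sort(key=lambda item: math.dist(query, item["embedding"]))
    return candidates[:limit]


# Made-up products and embeddings, for illustration only.
index_item("chair-1", [0.1, 0.9], color="Black", country="US")
index_item("chair-2", [0.2, 0.8], color="White", country="US")
index_item("sofa-1", [0.3, 0.7], color="Black", country="US")
index_item("lamp-1", [0.9, 0.1], color="Black", country="DE")

results = filtered_search([0.15, 0.85], {"color": "Black", "country": "US"})

# Step 4 (display): print the matches, closest first.
for item in results:
    print(item["id"], item["tags"])
```

Here `chair-2` is the second-closest vector to the query, but the color filter excludes it, so it never appears in the results. Combining metadata filters with vector similarity in this way is what "vector search with filtering" means in the steps above.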

In the next few chapters, we'll show you how to prepare the data, generate embeddings, prepare a search Document, and index the data. Read the whole article.
