<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Milvus</title>
    <description>The latest articles on DEV Community by Milvus (@milvusio).</description>
    <link>https://dev.to/milvusio</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F402974%2F3f873ca6-2ee6-4a4e-a98f-03e62e076cfd.jpg</url>
      <title>DEV Community: Milvus</title>
      <link>https://dev.to/milvusio</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/milvusio"/>
    <language>en</language>
    <item>
      <title>Join Hacktoberfest 2021 with Milvus!</title>
      <dc:creator>Milvus</dc:creator>
      <pubDate>Wed, 13 Oct 2021 10:09:20 +0000</pubDate>
      <link>https://dev.to/milvusio/join-hacktoberfest-2021-with-milvus-3k3n</link>
      <guid>https://dev.to/milvusio/join-hacktoberfest-2021-with-milvus-3k3n</guid>
      <description>&lt;p&gt;How's your #Hacktoberfest so far?&lt;/p&gt;

&lt;p&gt;If you have not started yet, it's still not too late. Join the event with the Milvus community and win some awesome prizes!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LET'S GET STARTED!👇🏻&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to participate
&lt;/h2&gt;

&lt;p&gt;Sign up for the event on the &lt;a href="https://hacktoberfest.digitalocean.com/"&gt;Hacktoberfest website&lt;/a&gt; and visit the &lt;a href="https://milvus.io/hacktoberfest-2021?utm_source=Dev.to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=Hacktoberfest_dev.to&amp;amp;utm_id=Hacktoberfest+dev"&gt;Milvus Hacktoberfest 2021 landing page&lt;/a&gt;. If you’re new to the open source world, no worries, we’ve got you covered. Here are some useful resources for beginners:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://milvus.io/community/making_your_first_contributions.md"&gt;Make your first contribution&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/chrissiemhrk/git-commit-message-5e21?utm_medium=email&amp;amp;utm_source=hacktoberfest&amp;amp;utm_campaign=preptember_&amp;amp;utm_content=participants"&gt;How to create a good commit message&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below we’ve prepared a number of issues across different specialties to kick off the event, but you are free to wander around the Milvus project and take on any existing issue, or open a new one you find interesting. Just don’t forget to mention Hacktoberfest in your PR title to make sure it counts toward your Hacktoberfest participation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code 👩🏻‍💻
&lt;/h3&gt;

&lt;p&gt;Code contribution is the traditional way to contribute to an open source project, and we happily welcome it - the more the merrier! We’ve labeled a series of issues by difficulty as “beginner”, “intermediate”, and “advanced” (jackpot if you solve one of those!), so there is something for contributors of all skill levels. Click the issue links to learn more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Documentation 📚
&lt;/h3&gt;

&lt;p&gt;Help improve our documentation! We’re always looking to improve our website and associated docs. If you are a newbie contributor, this is a great place to get your feet wet with Milvus. We’ve prepared a list of newbie-friendly Hacktoberfest issues ready to be tackled.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Milvus technical documentation&lt;/li&gt;
&lt;li&gt;Milvus Go API reference documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Community content 📝
&lt;/h3&gt;

&lt;p&gt;The Milvus community is more than just code! In addition to documentation and direct code contributions, there are lots of other areas to show off your creativity. Some examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Showcases of a project you’ve built with Milvus&lt;/li&gt;
&lt;li&gt;Blog posts about interesting use cases or topics related to vector database technology&lt;/li&gt;
&lt;li&gt;Video content&lt;/li&gt;
&lt;li&gt;Website design&lt;/li&gt;
&lt;li&gt;Artwork&lt;/li&gt;
&lt;li&gt;Anything else you can think of!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Bootcamp
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://github.com/milvus-io/bootcamp"&gt;Milvus Bootcamp&lt;/a&gt; is a repository of sample projects that demonstrate some of the possibilities of Milvus. You can contribute to the bootcamp by creating GitHub issues or pull requests for bug fixes, improvement suggestions, or other changes. &lt;/p&gt;

&lt;p&gt;We’re also looking for new projects to add to the Bootcamp. If you have a project idea you’d like to build on, even if it is just a concept, we encourage you to post a proposal in the “Hacktoberfest” category to stir up some discussion.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prizes
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7se2uzuhs81hozqbqsd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7se2uzuhs81hozqbqsd.jpeg" alt="Alt Text" width="800" height="397"&gt;&lt;/a&gt;&lt;br&gt;
This year we are going big! Besides the limited-edition Hacktoberfest T-shirt, contributing to the Milvus project between October 1 and October 31 also earns you additional swag and prizes sponsored by Zilliz.&lt;/p&gt;

&lt;p&gt;To be eligible for prizes, you need to have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;1 merged PR to receive a sticker pack &amp;amp; a digital Milvus contributor badge (for you to showcase on your LinkedIn profile).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;2+ merged PRs to receive a sticker pack, a digital Milvus contributor badge, and a Milvus T-shirt (only for the first 50 participants).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But that’s not all. You also have the chance to bring home:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A Logitech Pro X Keyboard (USD $149 value) and a physical Hacktoberfest badge if recognized as 🏆 Top Contributor 🏆 (Best Quality PR).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A DJI Drone (USD $799 value) if you have at least 1 merged PR on an issue tagged #advanced and are awarded the 🏆🏆 Grand Prize 🏆🏆 (Most Difficult PR).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;📌 The above-mentioned prizes may be replaced with alternatives of equivalent value should shipping restrictions occur. Zilliz reserves the right of final interpretation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Claim your prize
&lt;/h2&gt;

&lt;p&gt;Once you’ve submitted your PR, don’t forget to come back and fill in your information &lt;a href="https://milvus.io/hacktoberfest-2021?utm_source=Dev.to&amp;amp;utm_medium=blog&amp;amp;utm_campaign=Hacktoberfest_dev.to&amp;amp;utm_id=Hacktoberfest+dev"&gt;here&lt;/a&gt; before October 31 to claim your prizes.&lt;/p&gt;
&lt;h2&gt;
  
  
  Contact us
&lt;/h2&gt;

&lt;p&gt;💬 &lt;a href="https://discuss.milvus.io/t/join-hacktoberfest-2021-with-us/72"&gt;Milvus Discussion Forum&lt;/a&gt;: You can DM @kateshaowanjou or @Renshi8 for any questions you might have about the event or the Forum.&lt;br&gt;
👥 Follow us on Twitter and if your contribution is highlighted during the event you could earn special swag!&lt;/p&gt;

&lt;p&gt;&lt;iframe class="tweet-embed" id="tweet-1440720658044911619-868" src="https://platform.twitter.com/embed/Tweet.html?id=1440720658044911619"&gt;
&lt;/iframe&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>opensource</category>
      <category>hacktoberfest</category>
      <category>go</category>
    </item>
    <item>
      <title>4 Steps to Building a Video Search System</title>
      <dc:creator>Milvus</dc:creator>
      <pubDate>Sat, 12 Sep 2020 07:18:45 +0000</pubDate>
      <link>https://dev.to/milvusio/4-steps-to-building-a-video-search-system-2la3</link>
      <guid>https://dev.to/milvusio/4-steps-to-building-a-video-search-system-2la3</guid>
<description>&lt;p&gt;As its name suggests, searching for videos by image is the process of retrieving, from a repository, videos that contain frames similar to the input image. One of the key steps is to turn videos into embeddings, which is to say, extract the key frames and convert their features into vectors.&lt;/p&gt;

&lt;p&gt;Now, let's dive into how to build a video search system.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. System overview
&lt;/h2&gt;

&lt;p&gt;The following diagram illustrates the typical workflow of such a video search system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpj4i5qeioinmnbw1fvlz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpj4i5qeioinmnbw1fvlz.png" alt="Alt Text" width="700" height="150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When importing videos, we use the OpenCV library to cut each video into frames, extract vectors of the key frames using the VGG image feature extraction model, and then insert the extracted vectors (embeddings) into Milvus. We use Minio to store the original videos and Redis to store the mappings between videos and vectors.&lt;/p&gt;

&lt;p&gt;When searching for videos, we use the same VGG model to convert the input image into a feature vector and search Milvus for the most similar vectors. The system then retrieves the corresponding videos from Minio, according to the mappings in Redis, and displays them on its interface.&lt;/p&gt;
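&lt;p&gt;The import and search flows described above can be sketched end to end in plain Python. This is only an illustrative stand-in: the &lt;code&gt;embed&lt;/code&gt; function below plays the role of the VGG extractor, and two dictionaries play the roles of Milvus (the vector index) and Redis (the vector-to-video mapping); the real system uses OpenCV, Milvus, Redis, and Minio as described.&lt;/p&gt;

```python
import math
import random

# Stand-in for the VGG feature extractor: a deterministic pseudo-random
# vector per frame. In the real system this would be a CNN forward pass.
def embed(frame_id, dim=8):
    rng = random.Random(frame_id)
    return [rng.random() for _ in range(dim)]

# Cosine similarity between two vectors.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# "Import" phase: embed ten key frames for each of two videos.
index = {}           # vector_id to embedding   (plays the role of Milvus)
frame_to_video = {}  # vector_id to video name  (plays the role of Redis)
for vid in range(2):
    for frame in range(10):
        vector_id = vid * 10 + frame
        index[vector_id] = embed(vector_id)
        frame_to_video[vector_id] = "video_%d" % vid

# "Search" phase: embed the query image, find the most similar stored
# frame, then map it back to its source video.
query = embed(7)  # pretend the query image matches frame 7 exactly
best_id = max(index, key=lambda i: cosine(index[i], query))
print(frame_to_video[best_id])  # frame 7 belongs to video_0
```

&lt;p&gt;Because the query vector matches a stored frame of the first video, the lookup chain (vector match, then ID-to-video mapping) returns that video.&lt;/p&gt;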

&lt;h2&gt;
  
  
  2. Data preparation
&lt;/h2&gt;

&lt;p&gt;In this article, we use about 100,000 GIF files from Tumblr as a sample dataset for building an end-to-end video search solution. You can use your own video repository.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Deployment
&lt;/h2&gt;

&lt;p&gt;The code for building the video retrieval system in this article is on &lt;a href="https://github.com/JackLCL/search-video-demo"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Build Docker images.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The video retrieval system requires five Docker images: Milvus v0.7.1, Redis, Minio, the front-end interface, and the back-end API. You need to build the front-end interface and back-end API images yourself, while the other three can be pulled directly from Docker Hub.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Get the video search code
$ git clone -b 0.10.0 https://github.com/JackLCL/search-video-demo.git

# Build front-end interface docker and api docker images
$ cd search-video-demo &amp;&amp; make all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Configure the environment.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here we use &lt;strong&gt;docker-compose.yml&lt;/strong&gt; to manage the above-mentioned five containers. See the following table for the configuration of &lt;strong&gt;docker-compose.yml&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cGApuqc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/823/1%2A_32w0lRonezlR5uSw5tROw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cGApuqc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/823/1%2A_32w0lRonezlR5uSw5tROw.png" alt="Image for post" width="800" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The IP address 192.168.1.38 in the table above is the address of the server used to build the video retrieval system in this article. You need to update it to your own server address.&lt;/p&gt;

&lt;p&gt;You need to manually create storage directories for Milvus, Redis, and Minio, and then add the corresponding paths in &lt;strong&gt;docker-compose.yml&lt;/strong&gt;. In this example, we created the following directories:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/mnt/redis/data /mnt/minio/data /mnt/milvus/db
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can configure Milvus, Redis, and Minio in &lt;strong&gt;docker-compose.yml&lt;/strong&gt; as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RMkLEF7m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/949/0%2AMmTsZvCEPrMWRxDy" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RMkLEF7m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/949/0%2AMmTsZvCEPrMWRxDy" alt="Image for post" width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Start the system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Use the modified &lt;strong&gt;docker-compose.yml&lt;/strong&gt; to start up the five docker containers to be used in the video retrieval system:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, you can run &lt;strong&gt;docker-compose ps&lt;/strong&gt; to check whether the five docker containers have started up properly. The following screenshot shows a typical interface after a successful startup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9HbJCPvT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1600/0%2AoYyHwm-tzmVr0RF0" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9HbJCPvT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1600/0%2AoYyHwm-tzmVr0RF0" alt="Image for post" width="800" height="121"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you have successfully built a video search system, though the database contains no videos yet.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Import videos.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;deploy&lt;/strong&gt; directory of the system repository contains &lt;strong&gt;import_data.py&lt;/strong&gt;, a script for importing videos. You only need to update the path to the video files and the import interval before running the script.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3f6aQcd9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1476/0%2Apql2U37984XvOStJ" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3f6aQcd9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1476/0%2Apql2U37984XvOStJ" alt="Image for post" width="800" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;data_path: The path to the videos to import.&lt;/p&gt;

&lt;p&gt;time.sleep(0.5): The interval at which the system imports videos. The server used to build this video search system has 96 CPU cores, so an interval of 0.5 seconds is recommended. Set a greater interval if your server has fewer CPU cores; otherwise, the import process will overload the CPU and create zombie processes.&lt;/p&gt;

&lt;p&gt;Run &lt;strong&gt;import_data.py&lt;/strong&gt; to import videos.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ cd deploy
$ python3 import_data.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once the videos are imported, you are all set with your own video search system!&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Interface display
&lt;/h2&gt;

&lt;p&gt;Open your browser and enter 192.168.1.38:8001 (replacing the IP address with your own server address) to see the interface of the video search system, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uSoywoPd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1600/0%2AQSAUMVC3jqAYFyhl" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uSoywoPd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1600/0%2AQSAUMVC3jqAYFyhl" alt="Image for post" width="800" height="391"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Toggle the gear switch in the top right to view all videos in the repository.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F648035qjbhxk0k8cnpyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F648035qjbhxk0k8cnpyk.png" alt="Alt Text" width="700" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the upload box on the top left to input a target image. As shown below, the system returns videos containing the most similar frames.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fiawp3no60mu0o5gp4h1x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fiawp3no60mu0o5gp4h1x.png" alt="Alt Text" width="700" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, have fun building a video search system!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>docker</category>
    </item>
    <item>
      <title>An Open-Source Tool for Accelerating New Drug Discovery</title>
      <dc:creator>Milvus</dc:creator>
      <pubDate>Mon, 06 Jul 2020 08:47:56 +0000</pubDate>
      <link>https://dev.to/milvusio/an-open-source-tool-for-accelerating-new-drug-discovery-3dhj</link>
      <guid>https://dev.to/milvusio/an-open-source-tool-for-accelerating-new-drug-discovery-3dhj</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Drug discovery, as the source of medical innovation, is an important part of new medical research and development. It begins with target selection and confirmation. When fragments or lead compounds are discovered, similar compounds are usually searched in internal or commercial compound libraries to establish structure-activity relationships (SAR) and compound availability, in order to evaluate the potential of the lead compounds to be optimized into candidate compounds.&lt;/p&gt;

&lt;p&gt;In order to discover available compounds in the fragment space from billion-scale compound libraries, chemical fingerprints are usually retrieved for substructure search and similarity search. However, the traditional solution is time-consuming and error-prone when it comes to billion-scale high-dimensional chemical fingerprints. &lt;/p&gt;

&lt;p&gt;Some potential compounds may also be lost in the process. This article discusses using Milvus, a similarity search engine for massive-scale vectors, with RDKit to build a system for high-performance chemical structure similarity search.&lt;/p&gt;

&lt;p&gt;Compared with traditional methods, Milvus has faster search speed and broader coverage. By processing chemical fingerprints, Milvus can perform substructure search, similarity search, and exact search in chemical structure libraries in order to discover potentially available medicine.&lt;/p&gt;

&lt;h2&gt;
  
  
  System Overview
&lt;/h2&gt;

&lt;p&gt;The system uses RDKit to generate chemical fingerprints, and Milvus to perform chemical structure similarity search. Refer to &lt;a href="https://github.com/milvus-io/bootcamp/blob/master/EN_solutions/mols_search/README.md"&gt;https://github.com/milvus-io/bootcamp/blob/master/EN_solutions/mols_search/README.md&lt;/a&gt; to learn more about the system.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0zBSbuH8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629492-1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0zBSbuH8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629492-1.png" alt="Milvus architecture" width="800" height="287"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Generating Chemical Fingerprints
&lt;/h3&gt;

&lt;p&gt;Chemical fingerprints are usually used for substructure search and similarity search. The following image shows a fingerprint represented as a sequential list of bits. Each bit represents an element, atom pair, or functional group. The chemical structure is &lt;code&gt;C1C(=O)NCO1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IQTCR1uZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629493-2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IQTCR1uZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629493-2.png" alt="Patterns in molecule" width="800" height="421"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can use RDKit to generate Morgan fingerprints, which take a radius around each atom and hash the chemical substructures within that radius into a bit vector. Specify different values for the radius and the number of bits to acquire chemical fingerprints for different chemical structures. The chemical structures are represented in SMILES format.&lt;/p&gt;

&lt;p&gt;In Python:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# Parse the SMILES string into an RDKit molecule
mols = Chem.MolFromSmiles(smiles)

# Generate a 512-bit Morgan fingerprint with radius 2
mbfp = AllChem.GetMorganFingerprintAsBitVect(mols, 2, nBits=512)

# Convert the bit vector to its text representation
mvec = DataStructs.BitVectToFPSText(mbfp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h3&gt;
  
  
  2. Searching Chemical Structures
&lt;/h3&gt;

&lt;p&gt;We can then import the Morgan fingerprints into Milvus to build a chemical structure database. With different chemical fingerprints, Milvus can perform substructure search, similarity search, and exact search.&lt;/p&gt;

&lt;p&gt;In Python (note that these methods are called on a client instance of the Milvus 0.x API, not on the class itself):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from milvus import Milvus

# Create a client and connect to the Milvus server
client = Milvus()
client.connect(host='localhost', port='19530')

# Insert the fingerprint vectors
client.add_vectors(table_name=MILVUS_TABLE, records=mvecs)

# Search for the top-k most similar fingerprints
client.search_vectors(table_name=MILVUS_TABLE, query_records=query_mvec, top_k=topk)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;Substructure Search&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Checks whether a chemical structure contains another chemical structure.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Similarity Search&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Searches for similar chemical structures. Tanimoto distance is used as the metric by default.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Exact Search&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Checks whether a specified chemical structure exists. This kind of search requires an exact match.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Computing Chemical Fingerprints
&lt;/h3&gt;

&lt;p&gt;Tanimoto distance is often used as a metric for chemical fingerprints. In Milvus, Jaccard distance corresponds with Tanimoto distance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--igmT3Hf4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629516-3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--igmT3Hf4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629516-3.png" alt="Distance metrics" width="592" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Based on these definitions, the distance between two chemical fingerprints can be computed as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jhgcgRvk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629524-4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jhgcgRvk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629524-4.png" alt="Tanimoto and Jaccard distance formulas" width="319" height="122"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that &lt;code&gt;Tanimoto = 1 - Jaccard&lt;/code&gt;. Therefore, using the Jaccard metric in Milvus to compare chemical fingerprints is consistent with ranking by Tanimoto distance.&lt;/p&gt;
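&lt;p&gt;As a quick sanity check of that relationship, here is a small self-contained example (no RDKit or Milvus required) that models two fingerprints by the positions of their set bits; the bit positions are purely illustrative values:&lt;/p&gt;

```python
# Hypothetical fingerprints, modeled as the sets of their "on" bit positions.
fp_a = {1, 5, 9, 12, 40}
fp_b = {1, 5, 9}

common = len(fp_a.intersection(fp_b))  # bits set in both fingerprints: 3
total = len(fp_a.union(fp_b))          # bits set in either fingerprint: 5

tanimoto_similarity = common / total           # 3 / 5 = 0.6
jaccard_distance = 1 - tanimoto_similarity     # 0.4

print(tanimoto_similarity)  # 0.6
print(jaccard_distance)     # 0.4
```

&lt;p&gt;Ranking candidates by ascending Jaccard distance therefore gives the same order as ranking by descending Tanimoto similarity.&lt;/p&gt;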

&lt;h2&gt;
  
  
  System Demo
&lt;/h2&gt;

&lt;p&gt;To better demonstrate how the system works, we have built a demo that uses Milvus to search for more than 90 million chemical fingerprints. The data used comes from &lt;a href="ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF"&gt;ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF&lt;/a&gt;. The initial interface looks as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--EqqVIPap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629560-v2-898afbd40f6cde70a2d47be1fbec942e-r.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--EqqVIPap--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dzone.com/storage/temp/13629560-v2-898afbd40f6cde70a2d47be1fbec942e-r.jpg" alt="Initial interface" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can search for specified chemical structures in the system, and it returns similar chemical structures:&lt;/p&gt;

&lt;p&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8Gm9BMtb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dzone.com/storage/temp/13629576-3.gif" alt="Searching chemical structures" width="800" height="388"&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Similarity search is indispensable in a number of fields, such as images and videos. For drug discovery, similarity search can be applied to chemical structure databases to discover potentially available compounds, which are then converted to seeds for practical synthesis and point-of-care testing. Milvus, as an open-source similarity search engine for massive-scale feature vectors, is built with heterogeneous computing architecture for the best cost efficiency. Searches over billion-scale vectors take only milliseconds with minimum computing resources. Thus, Milvus can help implement accurate, fast chemical structure search in fields such as biology and chemistry.&lt;/p&gt;

&lt;p&gt;You can access the demo by visiting &lt;a href="http://40.117.75.127:8002/"&gt;http://40.117.75.127:8002/&lt;/a&gt;, and don't forget to also pay a visit to our GitHub &lt;a href="https://github.com/milvus-io/milvus"&gt;https://github.com/milvus-io/milvus&lt;/a&gt; to learn more!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>news</category>
      <category>tutorial</category>
      <category>github</category>
    </item>
    <item>
      <title>Building an Intelligent QA System With NLP and Milvus</title>
      <dc:creator>Milvus</dc:creator>
      <pubDate>Tue, 23 Jun 2020 10:09:06 +0000</pubDate>
      <link>https://dev.to/milvusio/building-an-intelligent-qa-system-with-nlp-and-milvus-599i</link>
      <guid>https://dev.to/milvusio/building-an-intelligent-qa-system-with-nlp-and-milvus-599i</guid>
<description>&lt;p&gt;The question answering system is commonly used in the field of natural language processing. It answers questions posed in natural language and has a wide range of applications, including intelligent voice interaction, online customer service, knowledge acquisition, and personalized emotional chatting. Most question answering systems can be classified along a few axes: generative vs. retrieval-based, single-round vs. multi-round, and open-domain vs. domain-specific.&lt;/p&gt;

&lt;p&gt;This article mainly deals with a QA system designed for a specific field, usually called an intelligent customer service robot. In the past, building a customer service robot required converting the domain knowledge into a series of rules and knowledge graphs. The construction process relied heavily on “human” intelligence, and once the scenario changed, a lot of repetitive work was required.&lt;/p&gt;

&lt;p&gt;With the application of deep learning in natural language processing (NLP), machine reading can automatically find answers to matching questions directly from documents. The deep learning language model converts the questions and documents to semantic vectors to find the matching answer.&lt;/p&gt;

&lt;p&gt;This article uses Google’s open source BERT model and Milvus, an open source vector search engine, to quickly build a Q&amp;amp;A bot based on semantic understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Overall Architecture&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;This article implements a question answering system through semantic similarity matching. The general construction process is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Obtain a large number of questions with answers in a specific field (a standard question set).&lt;/li&gt;
&lt;li&gt;Use the BERT model to convert these questions into feature vectors and store them in Milvus. Milvus assigns a vector ID to each feature vector.&lt;/li&gt;
&lt;li&gt;Store these representative question IDs and their corresponding answers in PostgreSQL.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When a user asks a question:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The BERT model converts it to a feature vector.&lt;/li&gt;
&lt;li&gt;Milvus performs a similarity search and retrieves the ID most similar to the question.&lt;/li&gt;
&lt;li&gt;PostgreSQL returns the corresponding answer.&lt;/li&gt;
&lt;/ol&gt;
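&lt;p&gt;The import and query flows above can be sketched end to end. The snippet below is a toy stand-in, not the production pipeline: seeded random NumPy vectors play the role of BERT embeddings, a NumPy matrix plays the role of the Milvus collection, and a plain dict plays the role of the PostgreSQL table.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)

# --- import phase: "encode" standard questions and store answers by ID ---
questions = ["What does my policy cover?", "How do I file a claim?"]
answers = {0: "See the coverage section.", 1: "Call your agent or file online."}
embeddings = rng.normal(size=(len(questions), 768))              # stand-in for BERT output
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # normalize before import

def ask(query_vec):
    # --- query phase: normalize, search by inner product, look up the answer ---
    q = query_vec / np.linalg.norm(query_vec)
    sims = embeddings @ q        # cosine similarity, since all vectors are normalized
    best = int(np.argmax(sims))  # the "vector ID" the search would return
    return answers[best]

# a query vector close to question 1 retrieves question 1's stored answer
noisy = embeddings[1] + rng.normal(scale=0.01, size=768)
print(ask(noisy))
```

&lt;p&gt;Swapping in real BERT vectors and the Milvus/PostgreSQL clients preserves the same three steps.&lt;/p&gt;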

&lt;p&gt;The system architecture diagram is as follows (the blue lines represent the import process and the yellow lines represent the query process):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--dIOASz7A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/642/0%2AwwtEvlq7Cg99V9Se" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--dIOASz7A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/642/0%2AwwtEvlq7Cg99V9Se" alt="img" width="642" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, we will show you how to build an online Q&amp;amp;A system step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Steps to Build the Q&amp;amp;A System&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before you start, you need to install Milvus and PostgreSQL. For the specific installation steps, see the Milvus official website.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Data preparation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The experimental data in this article comes from: &lt;a href="https://github.com/chatopera/insuranceqa-corpus-zh"&gt;https://github.com/chatopera/insuranceqa-corpus-zh&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The data set contains question and answer pairs related to the insurance industry. In this article we extract 20,000 question and answer pairs from it. With this data set, you can quickly build a customer service robot for the insurance industry.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Generate feature vectors&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This system uses a pre-trained BERT model. Download it from the link below before starting the service: &lt;a href="https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip"&gt;https://storage.googleapis.com/bert_models/2018_10_18/cased_L-24_H-1024_A-16.zip&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Use this model to convert the question database to feature vectors for future similarity search. For more information about the BERT service, see &lt;a href="https://github.com/hanxiao/bert-as-service"&gt;https://github.com/hanxiao/bert-as-service&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--a5BloZlX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/681/1%2AWUN1TDbP5YggVefAopaIqg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--a5BloZlX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/681/1%2AWUN1TDbP5YggVefAopaIqg.png" alt="img" width="681" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Import to Milvus and PostgreSQL&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Normalize the generated feature vectors and import them into Milvus, and then import the IDs returned by Milvus and the corresponding answers into PostgreSQL. The following shows the table structure in PostgreSQL:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Ooi8z1xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/626/0%2AgnBIpn3vPLnnOXOR" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Ooi8z1xo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/626/0%2AgnBIpn3vPLnnOXOR" alt="img" width="626" height="184"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Retrieve Answers&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The user inputs a question; after BERT generates its feature vector, the most similar question can be found in the Milvus library. This article uses cosine similarity to measure how close two sentences are. Because all vectors are normalized, the closer the inner product of two feature vectors is to 1, the more similar the sentences.&lt;/p&gt;

&lt;p&gt;In practice, the library may not contain a perfect match for a given question. In that case, you can set a threshold, for example 0.9: if the highest similarity retrieved falls below the threshold, the system reports that it has no related question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ousIAJcv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/622/0%2AxdOoVwfj7dbPgQnH" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ousIAJcv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/622/0%2AxdOoVwfj7dbPgQnH" alt="img" width="622" height="354"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;System Demonstration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The following shows an example interface of the system:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--o63TKSAL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1278/1%2AE688A3vSkyrMqzqEx-tY0A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--o63TKSAL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1278/1%2AE688A3vSkyrMqzqEx-tY0A.png" alt="img" width="800" height="853"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Enter your question in the dialog box and you will receive a corresponding answer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lG3t5AHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1275/0%2AOuoVSDAm55daLHuI" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lG3t5AHd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1275/0%2AOuoVSDAm55daLHuI" alt="img" width="800" height="837"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;After reading this article, we hope you will find it easy to build your own Q&amp;amp;A system.&lt;/p&gt;

&lt;p&gt;With the BERT model, you no longer need to sort and organize the text corpora beforehand. At the same time, thanks to the high performance and high scalability of the open source vector search engine Milvus, your QA system can support a corpus of up to hundreds of millions of texts.&lt;/p&gt;

&lt;p&gt;Milvus has officially joined the LF AI Foundation (under the Linux Foundation) as an incubation project. You are welcome to join the Milvus community and work with us to accelerate the application of AI technologies!&lt;/p&gt;

&lt;p&gt;=&amp;gt; Try our online demo here: &lt;a href="https://www.milvus.io/scenarios"&gt;https://www.milvus.io/scenarios&lt;/a&gt;&lt;/p&gt;

</description>
      <category>githunt</category>
      <category>datascience</category>
      <category>database</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a Reverse Image Search System Based on Milvus and VGG</title>
      <dc:creator>Milvus</dc:creator>
      <pubDate>Tue, 09 Jun 2020 09:16:47 +0000</pubDate>
      <link>https://dev.to/milvusio/building-a-reverse-image-search-system-based-on-milvus-and-vgg-hjb</link>
      <guid>https://dev.to/milvusio/building-a-reverse-image-search-system-based-on-milvus-and-vgg-hjb</guid>
      <description>&lt;p&gt;When you hear “search by image,” do you first think of the reverse image search offered by engines such as Baidu and Google? In fact, you can build your own: assemble a picture library, pick an image to query with, and retrieve several similar pictures from the library.&lt;/p&gt;

&lt;p&gt;As a similarity search engine for massive feature vectors, Milvus aims to help analyze ever-growing unstructured data and uncover the value behind it. To apply Milvus to similar-image retrieval, we designed a reverse image search system based on Milvus and the image feature extraction model VGG.&lt;/p&gt;

&lt;p&gt;This article is divided into the following parts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data preparation:&lt;/strong&gt; introduces the data support of the system.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System overview:&lt;/strong&gt; presents the overall system architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VGG model:&lt;/strong&gt; introduces the structure, features, block structure and weight parameters.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API introduction:&lt;/strong&gt; describes the five fundamental working principles of the system API.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image construction:&lt;/strong&gt; explains how to build client and server docker images from source code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System deployment:&lt;/strong&gt; shows how to set up the system in three steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interface display:&lt;/strong&gt; displays the system GUI.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. Data Preparation
&lt;/h2&gt;

&lt;p&gt;This article uses the PASCAL VOC image data set as an example to build an end-to-end reverse image search solution. The data set contains 17,125 images covering 20 categories: person; animal (bird, cat, cow, dog, horse, sheep); vehicle (airplane, bicycle, boat, bus, car, motorcycle, train); indoor (bottle, chair, dining table, potted plant, sofa, TV). The data set is approximately 2 GB. You can download the training/validation data through this link: &lt;a href="http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar"&gt;http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Note: You can also use other image data sets. The currently supported image formats are .jpg and .png.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;
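&lt;p&gt;If you point the system at your own directory, a small helper can confirm which files will actually be ingested. This is an illustrative utility, not part of the bootcamp code; only the .jpg/.png rule comes from the note above.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def list_images(root):
    """Collect the image files the system accepts (.jpg / .png, case-insensitive)."""
    exts = {".jpg", ".png"}
    return sorted(p for p in Path(root).rglob("*") if p.suffix.lower() in exts)

# quick demo on a throwaway directory
with tempfile.TemporaryDirectory() as d:
    for name in ("cat.jpg", "dog.PNG", "notes.txt"):
        (Path(d) / name).touch()
    print([p.name for p in list_images(d)])  # the .txt file is skipped
```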

&lt;h2&gt;
  
  
  2. System Overview
&lt;/h2&gt;

&lt;p&gt;To let users interact through web pages, we adopted a client/server (C/S) architecture. The webclient receives the user’s request and sends it to the webserver; the webserver handles the HTTP request, performs the operation, and returns the results to the webclient.&lt;/p&gt;

&lt;p&gt;The webserver is mainly composed of two parts, the image feature extraction model VGG and the vector search engine Milvus. The VGG model converts images into vectors, and Milvus is responsible for storing vectors and performing similar vector retrieval. The architecture of the webserver is shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw3unazlg8y7e5ar7r8b6.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw3unazlg8y7e5ar7r8b6.jpeg" alt="Alt Text" width="800" height="1086"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Webserver architecture&lt;/p&gt;




&lt;h2&gt;
  
  
  3. VGG Model
&lt;/h2&gt;

&lt;p&gt;VGGNet was proposed by researchers from the University of Oxford’s Visual Geometry Group and Google DeepMind. It won the localization task and was the 1st runner-up in the classification task at ILSVRC-2014. Its outstanding contribution is proving that small (3×3) convolutions combined with increased network depth can effectively improve a model’s performance. VGGNet is highly scalable and generalizes well when transferred to other data sets. The VGG model outperforms GoogLeNet on multiple transfer learning tasks and is a go-to CNN for extracting image features, which is why VGG is the deep learning model in this solution.&lt;/p&gt;

&lt;p&gt;VGGNet explored the relationship between the depth of a CNN and its performance. By repeatedly stacking 3×3 convolution kernels and 2×2 max pooling layers, VGGNet successfully constructed CNNs 16–19 layers deep. This solution uses the VGG16 model provided by Keras’s application module (keras.applications).&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) VGG16 structure
&lt;/h3&gt;

&lt;p&gt;VGG16 contains 13 &lt;strong&gt;Convolutional Layers&lt;/strong&gt;, 3 &lt;strong&gt;Fully Connected Layers&lt;/strong&gt;, and 5 &lt;strong&gt;Pooling Layers&lt;/strong&gt;. Among them, the convolutional layer and the fully connected layer have weight coefficients, so they are also called weighting layers. The total number of weighting layers is 13 + 3 = 16, which explains why the structure is called VGG16. (The pooling layer does not involve weights, so it does not belong to the weighting layer and is not counted).&lt;/p&gt;

&lt;h3&gt;
  
  
  (2) VGG16 Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The convolutional layers all use the same convolution kernel parameters.&lt;/li&gt;
&lt;li&gt;All pooling layers use the same pooling kernel parameters.&lt;/li&gt;
&lt;li&gt;The model is constructed by stacking several convolutional layers and pooling layers, which is relatively easy to form a deeper network structure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fylkz8syb5nqhi1t88p2u.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fylkz8syb5nqhi1t88p2u.jpeg" alt="Alt Text" width="800" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  (3) VGG16 block structure
&lt;/h3&gt;

&lt;p&gt;The convolutional and pooling layers of VGG16 can be divided into blocks, numbered Block1–Block5 from front to back. Each block contains several convolutional layers and one pooling layer; for example, Block2 contains 2 convolutional layers (conv3-128) and 1 pooling layer (maxpool), and within a block all convolutional layers have the same number of channels. As the VGG16 structure diagram below shows, the input image is 224×224×3. Through the network, the number of channels doubles progressively, from 64 to 128 and then to 256, until it reaches 512 and stays there, while the height and width of the feature maps halve: 224 → 112 → 56 → 28 → 14 → 7.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Cq7WgyLt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1280/1%2Asd6N6x3BOq0vo-Yy9onNzA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Cq7WgyLt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1280/1%2Asd6N6x3BOq0vo-Yy9onNzA.jpeg" alt="img" width="800" height="221"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  (4) Weight parameter
&lt;/h3&gt;

&lt;p&gt;VGG has a simple structure but contains a large number of weights: 138,357,544 parameters in total, counting both &lt;strong&gt;convolution kernel weights&lt;/strong&gt; and &lt;strong&gt;fully connected layer weights&lt;/strong&gt;. This gives it a high fitting capacity.&lt;/p&gt;
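&lt;p&gt;The weight count can be checked by summing the 3×3 convolution kernels (plus biases) and the three fully connected layers; the short computation below reproduces the commonly cited VGG16 total of 138,357,544 parameters:&lt;/p&gt;

```python
# (number of 3x3 conv layers, output channels) for VGG16's five blocks
cfg = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]

params, in_ch = 0, 3  # RGB input has 3 channels
for n_conv, out_ch in cfg:
    for _ in range(n_conv):
        params += 3 * 3 * in_ch * out_ch + out_ch  # kernel weights + biases
        in_ch = out_ch

# three fully connected layers: flattened 7*7*512 -> 4096 -> 4096 -> 1000 classes
for n_in, n_out in [(7 * 7 * 512, 4096), (4096, 4096), (4096, 1000)]:
    params += n_in * n_out + n_out

print(params)
```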




&lt;h2&gt;
  
  
  4. API Introduction
&lt;/h2&gt;

&lt;p&gt;The webserver of the entire system provides five APIs that correspond to train, process, count, search, and delete operations. Users can perform image loading, load progress query, Milvus vector number query, image retrieval, and Milvus table deletion. These five APIs cover all the basic functions of the reverse image search system. The rest of the article will explain in detail each of these functions.&lt;/p&gt;

&lt;h3&gt;
  
  
  (1) Train
&lt;/h3&gt;

&lt;p&gt;The parameters of the train API are shown in the following table:&lt;/p&gt;

&lt;p&gt;Methods: POST | Name: File | Type: String&lt;/p&gt;

&lt;p&gt;Before performing similar image retrieval, you need to load the image library into Milvus, and then call the train API to pass the path of the image to the system. Because Milvus only supports the retrieval of vector data, it is necessary to convert the images to feature vectors. The conversion process is mainly achieved by using Python to call the VGG model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from preprocessor.vggnet import VGGNet
norm_feat = model.vgg_extract_feat(img_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After obtaining the feature vectors of the images, import these vectors into Milvus using Milvus’s insert_vectors interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from indexer.index import milvus_client, insert_vectors
status, ids = insert_vectors(index_client, table_name, vectors)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After importing these feature vectors into Milvus, Milvus will assign a unique id to each vector. In order to better find the image based on the vector id during subsequent retrieval, you need to save the relationship between the vector ids and the corresponding images:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from diskcache import Cache
for i in range (len (names)):
    cache [ids [i]] = names [i]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  (2) Process
&lt;/h3&gt;

&lt;p&gt;The process API’s method is GET, and no other parameters need to be passed in the call. Call it to view the progress of image loading, such as the number of images converted so far and the total number of images in the given path.&lt;/p&gt;

&lt;h3&gt;
  
  
  (3) Count
&lt;/h3&gt;

&lt;p&gt;The count API’s method is POST, and no other parameters need to be passed in the call. The count API can be called to view the total number of vectors in the current Milvus. Each vector is converted from an image.&lt;/p&gt;

&lt;h3&gt;
  
  
  (4) Search
&lt;/h3&gt;

&lt;p&gt;The parameters of the search API are shown in the following table:&lt;/p&gt;

&lt;p&gt;Methods: POST | Num: Topk (int) | File: Image File&lt;/p&gt;

&lt;p&gt;Before importing the image you want to query into the system, call the VGG model to convert the images to vectors first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from preprocessor.vggnet import VGGNet
norm_feat = model.vgg_extract_feat(img_path)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After obtaining the query vectors, call Milvus’s search_vectors interface for similar vector search:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from milvus import Milvus, IndexType, MetricType, Status
status, results = client.search_vectors(table_name=table_name, query_records=vectors, top_k=top_k, nprobe=16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After getting the result vector id, the corresponding image name can be retrieved according to the correspondence between the previously stored vector ids and the image names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from diskcache import Cache
def query_name_from_ids(vids):
    res = []
    cache = Cache(default_cache_dir)
    for i in vids:
        if i in cache:
            res.append(cache[i])
    return res
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;(5) Delete&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The delete API’s method is POST, and no additional parameters need to be passed in the call. The delete API can be used to delete the tables in Milvus and clean up the previously imported vector data.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Docker Image Build
&lt;/h2&gt;

&lt;h3&gt;
  
  
  (1) Build pic-search-webserver image
&lt;/h3&gt;

&lt;p&gt;First pull the Milvus bootcamp code, then use the Dockerfile we provided to build the image of the webserver:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/milvus-io/bootcamp.git
$ cd bootcamp/solutions/pic_search/webserver
# Build image
$ docker build -t pic-search-webserver .
# View the generated image
$ docker images | grep pic-search-webserver
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, you can also directly use the image we uploaded to dockerhub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker pull milvusbootcamp / pic-search-webserver: 0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  (2) Build pic-search-webclient image
&lt;/h3&gt;

&lt;p&gt;First pull the Milvus bootcamp code, then use the Dockerfile we provided to build the image of webclient:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ git clone https://github.com/milvus-io/bootcamp.git
$ cd bootcamp/solutions/pic_search/webclient
# Build image
$ docker build -t pic-search-webclient .
# View the generated image
$ docker images | grep pic-search-webclient
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, you can also directly use the image we uploaded to dockerhub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker pull milvusbootcamp / pic-search-webclient: 0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  6. System Deployment
&lt;/h2&gt;

&lt;p&gt;We provide GPU deployment schemes and CPU deployment schemes, and users can choose for themselves. The detailed deployment process is available through this link: &lt;a href="https://github.com/milvus-io/bootcamp/tree/0.9.0/EN_solutions/pic_search"&gt;https://github.com/milvus-io/bootcamp/tree/0.9.0/EN_solutions/pic_search&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 Start Milvus Docker
&lt;/h3&gt;

&lt;p&gt;For detailed steps, please refer to the link: &lt;a href="https://milvus.io/docs/guides/get_started/install_milvus/install_milvus.md"&gt;https://milvus.io/docs/guides/get_started/install_milvus/install_milvus.md&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 Start pic-search-webserver docker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run -d --name zilliz_search_images_demo \
-v IMAGE_PATH1: / tmp / pic1 \
-v IMAGE_PATH2: / tmp / pic2 \
-p 35000: 5000 \
-e "DATA_PATH = / tmp / images-data" \
-e "MILVUS_HOST = 192.168.1.123" \
milvusbootcamp / pic-search-webserver: 0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3 Start pic-search-webclient docker
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ docker run --name zilliz_search_images_demo_web \
-d --rm -p 8001: 80 \
-e API_URL = http: //192.168.1.123: 35000 \
milvusbootcamp / pic-search-webclient: 0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  7. Interface display
&lt;/h2&gt;

&lt;p&gt;When the above deployment steps are complete, enter “localhost:8001” in the browser to access the reverse image search interface.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1alLdeVe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1280/1%2A8JPN2TVMC55j4dDk5EiOUQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1alLdeVe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1280/1%2A8JPN2TVMC55j4dDk5EiOUQ.jpeg" alt="img" width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fill in the path of the images, and wait until all the images are converted to vectors. Load the vectors into Milvus, and you can get started with your image search:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WEqTBLAd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1280/1%2A0lzdoLhRKesg0fy91dquPQ.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WEqTBLAd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://miro.medium.com/max/1280/1%2A0lzdoLhRKesg0fy91dquPQ.jpeg" alt="img" width="800" height="514"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article demonstrates how to use Milvus and VGG to build a reverse image search system. Milvus is compatible with various deep learning platforms, and searches over billions of vectors take only milliseconds. You can explore more AI applications with Milvus!&lt;/p&gt;

&lt;p&gt;If you have any suggestions or comments, you can raise an issue on our GitHub repo or contact us on the Slack community.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Milvus source code:&lt;/em&gt; &lt;a href="https://github.com/milvus-io/milvus"&gt;&lt;em&gt;https://github.com/milvus-io/milvus&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Milvus official website:&lt;/em&gt; &lt;a href="https://milvus.io/"&gt;&lt;em&gt;https://milvus.io/&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Milvus Bootcamp:&lt;/em&gt; &lt;a href="https://github.com/milvus-io/bootcamp"&gt;&lt;em&gt;https://github.com/milvus-io/bootcamp&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Milvus Slack community:&lt;/em&gt; &lt;a href="http://milvusio.slack.com/"&gt;&lt;em&gt;http://milvusio.slack.com/&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more information about the VGG model, please visit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;VGG official website:&lt;/em&gt; &lt;a href="http://www.robots.ox.ac.uk/~vgg/research/very_deep/"&gt;&lt;em&gt;http://www.robots.ox.ac.uk/~vgg/research/very_deep/&lt;/em&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;VGG GitHub:&lt;/em&gt; &lt;em&gt;&lt;a href="https://github.com/machrisaa/tensorflow-vgg"&gt;https://github.com/machrisaa/tensorflow-vgg&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Easiest Way to Search Among 1 Billion Image Vectors</title>
      <dc:creator>Milvus</dc:creator>
      <pubDate>Sat, 06 Jun 2020 10:01:09 +0000</pubDate>
      <link>https://dev.to/milvusio/the-easiest-way-to-search-among-1-billion-image-vectors-2pgh</link>
      <guid>https://dev.to/milvusio/the-easiest-way-to-search-among-1-billion-image-vectors-2pgh</guid>
      <description>&lt;p&gt;That’s correct: 1 server and 10 lines of code are all you need to search among 1 billion images in a few hundred milliseconds. Milvus’s ease of use means a few lines of code can handle massive-scale reverse image search; its superior standalone performance satisfies the need for low-latency, real-time searches; and its support for distributed systems and cloud-native extensions scales to 10-billion or 100-billion vector searches. Milvus is a fantastic high-performance vector search engine, and on November 5, 2019, the Milvus team officially made it an open source project on GitHub for developers and AI scientists across the globe.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unstructured Data, AI, and Vector Search
&lt;/h3&gt;

&lt;p&gt;With the development of information technology, we are experiencing both a data explosion and interesting shifts in data types. Since the advent of electronic computers in the mid-1900s, developers have processed structured data (integers, floats, etc.), then semi-structured data (webpages and logs) in the internet era around 2000, and now unstructured data (images, videos, voices, texts, etc.) in the AI era since around 2012.&lt;/p&gt;

&lt;p&gt;For each type of data, computer scientists have created indexing algorithms to organize, search, and analyze it. For structured data, common indexes include bitmaps, hash tables, and B-trees, used in relational databases such as Oracle and DB2. For semi-structured data, the inverted index is common, used in search engines such as Solr and Elasticsearch.&lt;/p&gt;

&lt;p&gt;It is difficult to use traditional computation methods and processors to process and mine unstructured data. This bottleneck in computer science saw no breakthrough before the advent of AI algorithms, which use models (CNN, RNN, VGG, BERT, etc.) to convert images, videos, voices, and texts into feature vectors, each a string of integers or floats. AI algorithms turn complex unstructured data processing into vector computations that are far friendlier to processors. Tasks such as reverse image search, reverse video search, and natural language processing (NLP) become computations of vector similarity based on Euclidean distance or cosine similarity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F94ym2t0vmep432psf1n6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F94ym2t0vmep432psf1n6.jpg" alt="Alt Text" width="800" height="220"&gt;&lt;/a&gt;           &lt;em&gt;AI algorithms convert unstructured data to vectors&lt;/em&gt;   &lt;/p&gt;

&lt;p&gt;Although computing the similarity of two vectors is relatively simple, unstructured data far exceeds traditional structured and semi-structured data in quantity (by more than 3 orders of magnitude) and grows faster (roughly 1 GB of unstructured data is generated for every 1 KB of structured data). Similarity computation over massive-scale vectors has become one of the challenges of deploying AI algorithms at scale. Thus ANNS (Approximate Nearest Neighbor Search) came into being: it clusters similar vectors to shrink the search space and reduce computation, yielding much faster vector search. Common ANNS approaches include quantization-, tree-, and graph-based algorithms, as well as combined algorithms (tree-graph, quantization-graph).&lt;/p&gt;
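&lt;p&gt;A minimal IVF-style sketch makes the idea concrete: vectors are bucketed under coarse centroids, and a query probes only the nearest buckets instead of scanning everything. This toy on random data illustrates the principle, not Milvus’s actual implementation:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8)).astype(np.float32)

# coarse "training": pick k centroids and bucket every vector under its nearest one
k = 10
centroids = data[rng.choice(len(data), size=k, replace=False)]
assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)

def ann_search(query, nprobe=2):
    # probe only the nprobe nearest buckets instead of scanning all 1000 vectors
    nearest_buckets = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    candidates = np.where(np.isin(assign, nearest_buckets))[0]
    dists = ((data[candidates] - query) ** 2).sum(-1)
    return int(candidates[np.argmin(dists)])

# a vector always finds itself: its own bucket is the first one probed
print(ann_search(data[42], nprobe=1))
```

&lt;p&gt;Raising nprobe trades speed for recall, the same knob the nprobe parameter exposes in Milvus searches.&lt;/p&gt;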

&lt;h3&gt;
  
  
  A High-Performance Vector Search Engine
&lt;/h3&gt;

&lt;p&gt;Using world-leading ANNS indexing techniques, Milvus reaches a 99% recall rate for top-5 searches, and data loading exceeds 1 million entries per minute. Milvus supports heterogeneous acceleration and is compatible with x86/GPU/ARM/Power architectures; in the future it will also support TPUs and other ASIC processors. In the standalone scenario, Milvus can search among billion-scale vectors within a second, while distributed systems and cloud-native extensions handle 10-billion or 100-billion scale searches. Milvus is licensed under the Apache License, Version 2.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmmdbvfg43mefv4sluo9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmmdbvfg43mefv4sluo9r.png" alt="Alt Text" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Milvus architecture&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;During the development of Milvus, we researched ANNS algorithms in depth and worked through a large body of papers and references. We constantly adjusted the hardware and software architecture, carefully designed and tuned each algorithm, and implemented numerous optimizations for different processors and instruction sets. &lt;/p&gt;

&lt;p&gt;After 300 days of hard work, we released the first stable version of Milvus, 0.5.1, which has passed rigorous testing and production deployment at multiple well-known technology companies. We open-sourced Milvus to help more developers tackle the opportunities and challenges of unstructured data across more AI scenarios, and we hope to attract a community of passionate engineers to keep developing and improving it. Our goal is to make Milvus a globally influential, next-generation search engine for unstructured data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Application Scenarios
&lt;/h3&gt;

&lt;p&gt;So, where can Milvus be applied? Consider an e-commerce website with about 50 million product SKUs. On average, each product has 20 images from vendors and customer reviews, so roughly 1 billion images are stored in the backend in total. Developers can use pre-trained AI models to convert these 1 billion images into 1 billion feature vectors and then use Milvus to search products by image, letting customers conveniently find their favorite products via reverse image search.&lt;/p&gt;
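&lt;p&gt;The search-by-image flow described above can be sketched in a few lines: an embedding model maps each image to a feature vector, and the query image's vector is ranked against the stored vectors. The &lt;code&gt;embed&lt;/code&gt; function below is a hypothetical stand-in for a real pre-trained model, and the catalog is tiny mock data.&lt;/p&gt;

```python
# Sketch of reverse image search over product images: embed once at
# indexing time, then rank the catalog by cosine similarity at query time.
import math
import random

def embed(image_id, dim=32):
    # hypothetical stand-in for a CNN feature extractor; deterministic
    # per image so the same image always maps to the same vector
    rng = random.Random(image_id)
    return [rng.gauss(0, 1) for _ in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# "index" the catalog: one feature vector per product image
catalog = {f"sku-{i}": embed(i) for i in range(1000)}

def search_by_image(query_vec, top_k=5):
    ranked = sorted(catalog, key=lambda sku: cosine(query_vec, catalog[sku]),
                    reverse=True)
    return ranked[:top_k]

hits = search_by_image(embed(42))
# the identical image ranks itself first
```

&lt;p&gt;At the billion-image scale in the example, the linear scan above is exactly what becomes infeasible, which is why an ANNS index is used in its place.&lt;/p&gt;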

&lt;p&gt;Apart from reverse image search, Milvus can also handle other massive-scale unstructured data, such as videos, audio, and text. For example, suppose 1 million short videos are uploaded to a UGC video website per day, each 1 minute long at 720p resolution, and a keyframe is extracted every 2 seconds. That amounts to 900 million keyframes per month and 10.8 billion keyframes per year. Developers can use AI models to convert these 10.8 billion keyframes into 10.8 billion vectors and then use Milvus to perform reverse video search, so that users can conveniently jump to the video fragments they are interested in.&lt;/p&gt;
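&lt;p&gt;A quick back-of-the-envelope check of the keyframe volume quoted above, assuming 1 million 1-minute uploads per day, one keyframe every 2 seconds, and a 30-day month:&lt;/p&gt;

```python
# Keyframe volume arithmetic for the video UGC example above.
keyframes_per_video = 60 // 2                  # 30 keyframes per 1-minute video
per_day = 1_000_000 * keyframes_per_video      # 30 million keyframes per day
per_month = per_day * 30                       # 900 million per month
per_year = per_month * 12                      # 10.8 billion per year
print(per_month, per_year)                     # 900000000 10800000000
```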

&lt;p&gt;Milvus can also help NLP developers with tasks such as massive-scale duplicate text detection and semantic search, on top of which search engine developers can build recommendation systems and precise advertising.&lt;/p&gt;

&lt;p&gt;Currently, Milvus is adopted by more than 200 renowned technology companies in fields such as internet entertainment (reverse image/video search), new retail (product search by image), intelligent finance (user authentication), new drug discovery, and more.&lt;/p&gt;

&lt;p&gt;If you would like to try searching among 1 billion images with just 10 lines of code, check out Milvus on GitHub: &lt;a href="https://github.com/milvus-io"&gt;https://github.com/milvus-io&lt;/a&gt; &lt;/p&gt;

</description>
      <category>datascience</category>
      <category>github</category>
      <category>news</category>
    </item>
  </channel>
</rss>
