<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: GeorgePearse</title>
    <description>The latest articles on DEV Community by GeorgePearse (@georgepearse).</description>
    <link>https://dev.to/georgepearse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F446200%2F4f4510e0-3212-42d0-96d2-d9055c03690f.png</url>
      <title>DEV Community: GeorgePearse</title>
      <link>https://dev.to/georgepearse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/georgepearse"/>
    <language>en</language>
    <item>
      <title>Vector Databases for Data-Centric AI (Part 2)</title>
      <dc:creator>GeorgePearse</dc:creator>
      <pubDate>Fri, 26 Aug 2022 13:36:00 +0000</pubDate>
      <link>https://dev.to/georgepearse/vector-databases-for-data-centric-ai-part-2-4a2j</link>
      <guid>https://dev.to/georgepearse/vector-databases-for-data-centric-ai-part-2-4a2j</guid>
      <description>&lt;p&gt;Building applications with QDrant, Hugging Face and Streamlit.&lt;/p&gt;

&lt;p&gt;This article also lives here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@george.pearse"&gt;https://medium.com/@george.pearse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;QDrant have created an excellent vector database, and I suspect ML Engineers are only beginning to scratch the surface of its potential applications.&lt;/p&gt;

&lt;p&gt;Vector databases support hybrid similarity search and provide a CRUD API for updating your datasets. They are a significant improvement on first-wave Approximate Nearest Neighbour tools like Faiss and Annoy, which enable very high-performance in-memory vector search but offer little support for update flows or metadata filters.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Hybrid search is vector or "semantic" search combined with attribute filtering.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The semantic search implemented by QDrant takes a list of positive and negative examples. Each positive datapoint is an example of what you want the responses to be similar to; each negative datapoint is an example of what you want the responses to be different from.&lt;/p&gt;

&lt;p&gt;This allows you to build up arbitrarily complex decision boundaries within your feature space.&lt;br&gt;
An example QDrant query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"positive": [0],    
"negative": [1],    
"top": 10,   
"with_payload": true}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
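&lt;p&gt;As a sketch of how such a query might be issued from Python (the collection name and endpoint in the comment are illustrative assumptions, not taken from the repo):&lt;/p&gt;

```python
import json

# Build the recommendation payload shown above: point 0 is a positive
# example, point 1 is a negative example, return the 10 best matches
# together with their stored payloads.
def build_recommend_query(positive, negative, top=10, with_payload=True):
    return {
        "positive": list(positive),
        "negative": list(negative),
        "top": top,
        "with_payload": with_payload,
    }

query = build_recommend_query(positive=[0], negative=[1])

# Assumed usage (not verified against the repo): POST the payload to
# QDrant's recommend endpoint for your collection, e.g. with requests:
#   requests.post("http://localhost:6333/collections/ag_news/points/recommend",
#                 data=json.dumps(query))
print(json.dumps(query, indent=2))
```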


&lt;p&gt;&lt;strong&gt;This enables the interactive definition of classes&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with a single positive datapoint.&lt;/li&gt;
&lt;li&gt;Look through the responses.&lt;/li&gt;
&lt;li&gt;Add those that you consider similar to the list of positives.&lt;/li&gt;
&lt;li&gt;Add those you consider different to the list of negatives.&lt;/li&gt;
&lt;li&gt;Run that new query, and repeat.&lt;/li&gt;
&lt;/ol&gt;
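&lt;p&gt;To make the loop concrete, here is a toy in-memory stand-in for the recommend call (plain Python, not QDrant itself) that ranks points by mean similarity to the positives minus mean similarity to the negatives:&lt;/p&gt;

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def recommend(vectors, positive_ids, negative_ids, top=10):
    # Toy stand-in for a recommend API: rank each remaining point by
    # (mean similarity to positives) - (mean similarity to negatives).
    exclude = set(positive_ids) | set(negative_ids)
    scores = {}
    for pid, vec in vectors.items():
        if pid in exclude:
            continue
        pos = sum(cosine(vec, vectors[i]) for i in positive_ids) / len(positive_ids)
        neg = 0.0
        if negative_ids:
            neg = sum(cosine(vec, vectors[i]) for i in negative_ids) / len(negative_ids)
        scores[pid] = pos - neg
    return sorted(scores, key=scores.get, reverse=True)[:top]

# One round of the loop: start from a single positive (0) and one
# negative (1), inspect the ranking, then grow both lists and re-query.
vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.9, 0.1], 3: [0.1, 0.9]}
results = recommend(vectors, positive_ids=[0], negative_ids=[1], top=2)
```

Point 2, which lies close to the positive example, ranks above point 3, which lies close to the negative one.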

&lt;p&gt;After a batch of labelling you would also be in a position to improve your embeddings and continue the process with a better-separated dataset (I'll be experimenting more with this next).&lt;/p&gt;

&lt;p&gt;I've built a mini Streamlit application to support this flow; once a query is complete you can save it along with a CSV of its results.&lt;/p&gt;
&lt;h2&gt;QDrant-NLP: a short demo&lt;/h2&gt;

&lt;p&gt;&lt;iframe src="https://player.vimeo.com/video/743387784" width="710" height="399"&gt;
&lt;/iframe&gt;
&lt;/p&gt;
&lt;h2&gt;How to Run&lt;/h2&gt;

&lt;p&gt;Just clone the repo &lt;a href="https://github.com/GeorgePearse/QDrant-NLP"&gt;QDrant-NLP&lt;/a&gt; and run:&lt;br&gt;
&lt;code&gt;docker-compose up&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I would like to increase the number of datasets this can be tried on, either with GPU-backed lambda functions or by saving many example datasets to S3. So far I've only made a 6K subset of &lt;a href="https://huggingface.co/datasets/ag_news"&gt;ag_news&lt;/a&gt; available. The embeddings were generated via Hugging Face with the snippet below.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
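&lt;p&gt;(The gist embedded above didn't survive this export. As a rough sketch of the shape such a Hugging Face embedding snippet might take, with the model call stubbed out; none of the names here should be read as the exact code from the repo:)&lt;/p&gt;

```python
def batched(items, batch_size):
    # Yield fixed-size batches so long datasets fit in memory.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_texts(texts, model, batch_size=64):
    # `model` is anything with an .encode(list_of_texts) method returning
    # one vector per text, e.g. a sentence-transformers model such as
    # SentenceTransformer("all-MiniLM-L6-v2") -- an assumed choice, not
    # necessarily the one used in the repo.
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(model.encode(batch))
    return vectors

# Stub model so the sketch runs without downloading any weights.
class StubModel:
    def encode(self, batch):
        return [[float(len(text))] for text in batch]

vectors = embed_texts(["ag_news headline one", "another headline"], StubModel())
```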



&lt;h2&gt;Where to Use&lt;/h2&gt;

&lt;p&gt;Shout out to both Kern.AI (an excellent open-source NLP labelling tool)&lt;br&gt;
&lt;a href="https://github.com/code-kern-ai/refinery"&gt;https://github.com/code-kern-ai/refinery&lt;/a&gt;&lt;br&gt;
and Voxel51 (an excellent open-source Computer Vision analysis tool)&lt;br&gt;
&lt;a href="https://github.com/voxel51/fiftyone"&gt;https://github.com/voxel51/fiftyone&lt;/a&gt;&lt;br&gt;
for being early adopters of the technology in their platforms, though I don't believe either has yet made use of all the value it can provide.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Vector Databases for Data-Centric AI</title>
      <dc:creator>GeorgePearse</dc:creator>
      <pubDate>Fri, 26 Aug 2022 13:12:00 +0000</pubDate>
      <link>https://dev.to/georgepearse/vector-databases-for-data-centric-ai-1epe</link>
      <guid>https://dev.to/georgepearse/vector-databases-for-data-centric-ai-1epe</guid>
      <description>&lt;p&gt;Other homes for this article: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://medium.com/@george.pearse"&gt;https://medium.com/@george.pearse&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Vector Databases are one of the newest tools in the MLOps / Data Engineering space. They're designed to be efficient at nearest neighbour queries over embeddings while providing a simple CRUD interface for maintainability.&lt;/p&gt;

&lt;p&gt;Embeddings are the outputs of a layer of a Deep Learning model for a given input (a single datapoint). They are learned representations in which objects of the same class are projected close to each other.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PLoHBWOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7tv12p1o4bnj3pq25lrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PLoHBWOh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7tv12p1o4bnj3pq25lrn.png" alt="Image description" width="880" height="772"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The best vector databases let you combine metadata queries (e.g. the dataset split or class you want the results to belong to) with a nearest-neighbour request (e.g. return the nearest neighbours to this input example that do not have the same label). This is hard to achieve with nearest-neighbour libraries such as Faiss and Annoy because the index is built up-front and cannot be filtered; to achieve an equivalent result you would need to return an excess of nearest neighbours and then apply the filter afterwards.&lt;/p&gt;
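&lt;p&gt;A small illustration of that workaround in plain Python: over-fetch neighbours from an unfiltered index, then apply the metadata filter afterwards (the points and labels are invented for the example):&lt;/p&gt;

```python
import heapq

def post_filtered_neighbours(query, points, predicate, k, overfetch=4):
    # points: list of (payload, vector); distance is squared Euclidean.
    def dist(point):
        return sum((a - b) ** 2 for a, b in zip(query, point[1]))
    # A static index can't filter during search, so fetch k * overfetch
    # candidates first, then drop the ones the metadata filter rejects.
    candidates = heapq.nsmallest(k * overfetch, points, key=dist)
    return [p for p in candidates if predicate(p[0])][:k]

points = [
    ({"label": "sports"}, [0.0, 0.1]),
    ({"label": "world"}, [0.0, 0.2]),
    ({"label": "sports"}, [0.9, 0.9]),
]
# "Nearest neighbours to this query that do NOT share its label."
hits = post_filtered_neighbours([0.0, 0.0], points, lambda m: m["label"] != "sports", k=1)
```

The obvious cost is that if the filter is very selective, even a large over-fetch may not contain k matching points, which is exactly the gap hybrid search closes.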

&lt;p&gt;I'd recommend the &lt;a href="//data.projector.tensorflow.org"&gt;TensorFlow Embedding Projector&lt;/a&gt; to develop an intuition if you're not familiar (MNIST with images is best).&lt;/p&gt;

&lt;p&gt;Though it's important to note that the embeddings undergo dimensionality reduction via PCA, t-SNE or UMAP in order to be projected into 3 dimensions.&lt;/p&gt;

&lt;p&gt;This article, &lt;a href="https://medium.com/p/9c65a3bd0696"&gt;Not All Vector Databases Are Made Equal&lt;/a&gt; (a detailed comparison of Milvus, Pinecone, Vespa, Weaviate, Vald, GSI and Qdrant), does an excellent job of comparing the best offerings currently available. But what might you actually want to use a vector database for? The use cases below are most relevant to image classification problems.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Active Learning&lt;/strong&gt;&lt;br&gt;
Got an error in a validation set and want to fix it? Get the nearest neighbours to your error in your unlabelled dataset labelled, and retrain. Repeat until the problem is reduced or resolved. If the nearest neighbour query does not return many similar examples, consider using a package that lets you increase the number of augmentations applied to these instances, or their weighting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Unit Test Construction&lt;/strong&gt;&lt;br&gt;
Identified a specific type of problem that's particularly costly (e.g. can't distinguish between spoons and forks) and want to monitor your progress against it? Retrieve the nearest neighbours to an instance of the error case within the labelled set, provide a description of the problem and track how performance changes over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Closest Counterfactual&lt;/strong&gt;&lt;br&gt;
Think the labels of your training dataset may be inconsistent? Look at the nearest instances with a different label. Consider getting your experts to review the examples and come to consensus, add further descriptions to your labelling rules or keep them as examples to use in the training of labellers. &lt;strong&gt;NB:&lt;/strong&gt; here you may be better off using something like KNN conformity or simply looking at the cases where there's the largest disagreement between your model and the label. Closest counterfactual is great but it is quite manual and doesn't scale well compared to more systematic approaches.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Finding Mislabelled Instances&lt;/strong&gt;&lt;br&gt;
Is your model making an error that seems like an easy case? Check the nearest neighbours within the training sets for mislabelled instances and check that you actually have some instances that are similar in the training set in the first place.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
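&lt;p&gt;The first use case above can be sketched in a few lines of plain Python (the ids and embeddings are invented): take the embedding of a validation error, pull its nearest unlabelled neighbours, and queue them for labelling:&lt;/p&gt;

```python
def nearest(query, pool, k):
    # pool: list of (item_id, vector); rank by squared Euclidean distance.
    def dist(item):
        return sum((a - b) ** 2 for a, b in zip(query, item[1]))
    return sorted(pool, key=dist)[:k]

# Unlabelled pool and the embedding of a misclassified validation example.
unlabelled = [("id-a", [0.1, 0.0]), ("id-b", [0.9, 0.9]), ("id-c", [0.0, 0.2])]
error_embedding = [0.0, 0.0]

# Queue the closest unlabelled points for annotation, then retrain and
# repeat until the error is reduced or resolved.
to_label = nearest(error_embedding, unlabelled, k=2)
labelling_queue = [item_id for item_id, _ in to_label]
```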

&lt;p&gt;Ideally you want your Vector Database to be updated directly from your live ML service, so that you always have access to the latest embeddings and don't have to maintain a separate batch pipeline just for this task. Let me know if you have any other uses of Vector Databases (particularly any valuable in image classification) and I'll add them to the list, and please click follow if the content interests you.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>programming</category>
      <category>python</category>
    </item>
  </channel>
</rss>
