<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PLABAN NAYAK</title>
    <description>The latest articles on DEV Community by PLABAN NAYAK (@plaban1981).</description>
    <link>https://dev.to/plaban1981</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1241005%2Fc4b39df0-ea4a-472c-b2c7-2043d4ab6dd3.jpeg</url>
      <title>DEV Community: PLABAN NAYAK</title>
      <link>https://dev.to/plaban1981</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/plaban1981"/>
    <language>en</language>
    <item>
      <title>How to use FiftyOne and Qdrant to Search through Billions of Images in Computer Vision Applications</title>
      <dc:creator>PLABAN NAYAK</dc:creator>
      <pubDate>Tue, 30 Jan 2024 02:23:47 +0000</pubDate>
      <link>https://dev.to/plaban1981/how-to-use-fiftyone-and-qdrant-to-search-through-billions-of-images-in-computer-vision-applications-2oie</link>
      <guid>https://dev.to/plaban1981/how-to-use-fiftyone-and-qdrant-to-search-through-billions-of-images-in-computer-vision-applications-2oie</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn37nra2ojwpea78vdxx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqn37nra2ojwpea78vdxx.png" alt="Image Search Workflow Implementation" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Enhancing Computer Vision Workflows with Multi-Modal Search: A Deep Dive into CLIP Integration
&lt;/h2&gt;

&lt;p&gt;Machine learning has witnessed a remarkable evolution in recent years, with one of the most exciting developments being the significant strides in multi-modal AI. This progress has fostered a synergistic relationship between computer vision and natural language processing, catalyzed by breakthroughs such as OpenAI’s CLIP model. Employing a contrastive learning technique, CLIP seamlessly embeds diverse multimedia content — ranging from language to images — into a unified latent space.&lt;/p&gt;

&lt;p&gt;The unparalleled capabilities of multi-modal models like CLIP have propelled advancements in various domains, including zero-shot image classification, knowledge transfer, synthetic data generation, and semantic search. In this article, our focus will be on the latter — leveraging the power of CLIP for natural language image search.&lt;/p&gt;

&lt;h2&gt;
  
  
  Unveiling the Integration
&lt;/h2&gt;

&lt;p&gt;Traditionally, vector search tools and libraries have stood as independent entities, providing glimpses into the potential of cross-domain searches. Today, however, we present a groundbreaking approach to seamlessly incorporate natural language image search directly into your computer vision workflows. Our integration involves three key components: the open-source computer vision toolkit FiftyOne, the vector database Qdrant, and the powerful CLIP model from OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding CLIP’s Role
&lt;/h2&gt;

&lt;p&gt;At the heart of our integration lies OpenAI’s CLIP model, which acts as the bridge between textual and visual domains. CLIP’s contrastive learning methodology allows it to encode textual descriptions and corresponding images into a shared latent space. This intrinsic capability forms the foundation for our natural language image search.&lt;/p&gt;

&lt;p&gt;OpenAI’s CLIP (Contrastive Language-Image Pre-training) model is a powerful and innovative deep learning model designed for understanding and processing both natural language and images in a unified framework. Developed by OpenAI, CLIP leverages a contrastive learning approach to learn a shared representation space for text and images, enabling it to understand the relationships between the two modalities. The model was introduced in the research paper “Learning Transferable Visual Models From Natural Language Supervision.”&lt;/p&gt;

&lt;p&gt;Key characteristics and components of the CLIP model include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unified Embedding Space: CLIP is trained to embed images and corresponding textual descriptions into the same high-dimensional space. This shared space allows for direct comparison and understanding of the relationships between visual and textual concepts.&lt;/li&gt;
&lt;li&gt;Contrastive Learning: The training of CLIP involves contrastive learning, a technique where the model learns by contrasting positive pairs (correct image-text pairs) with negative pairs (incorrect image-text pairs). This process encourages the model to bring similar images and texts closer together while pushing dissimilar pairs apart in the embedding space.&lt;/li&gt;
&lt;li&gt;Vision Transformer (ViT) Architecture: CLIP’s image encoder is typically built upon the Vision Transformer (ViT) architecture, which has proven effective in image processing tasks. ViT divides images into fixed-size patches and processes them using transformer layers, allowing the model to capture both local and global information in images.&lt;/li&gt;
&lt;li&gt;Text Tokenization: CLIP uses a text tokenization method to convert textual descriptions into a format suitable for input to the model. This ensures that both images and text can be processed consistently.&lt;/li&gt;
&lt;li&gt;Zero-Shot Learning: One of the notable features of CLIP is its ability to perform zero-shot learning. This means the model can generalize to recognize images or understand text related to categories it has not seen during training. This is achieved by leveraging the shared embedding space, allowing the model to make predictions based on semantic similarities.&lt;/li&gt;
&lt;li&gt;Versatility in Tasks: CLIP has demonstrated effectiveness across a wide range of tasks, including image classification, object detection, natural language image retrieval, and even tasks that require understanding nuanced textual prompts.&lt;/li&gt;
&lt;li&gt;Pre-Trained Models: OpenAI provides pre-trained CLIP models in various configurations, allowing users to leverage the model’s capabilities without extensive training. These pre-trained models can be fine-tuned for specific tasks if needed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Applications of CLIP span various domains, including computer vision, natural language processing, and AI-driven systems that require a unified understanding of both visual and textual information. The model’s versatility and performance make it a valuable tool for developers and researchers working on multi-modal AI applications.&lt;/p&gt;
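&lt;p&gt;To make the shared embedding space concrete, here is a minimal sketch that scores an image against a few candidate captions using the Hugging Face transformers implementation of CLIP. The checkpoint name and the local image path are illustrative assumptions, not details from this article.&lt;/p&gt;

```python
def rank_prompts(scores):
    # Order prompt indices from best to worst match for one image.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def clip_demo():
    # Requires: pip install transformers pillow torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    image = Image.open("dog.jpg")  # hypothetical local image
    prompts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image[0].tolist()  # one score per prompt
    print(prompts[rank_prompts(scores)[0]])  # best-matching caption
```

&lt;p&gt;Because the image and all prompts live in the same latent space, "classification" reduces to sorting the image-to-prompt similarity scores.&lt;/p&gt;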

&lt;h2&gt;
  
  
  FiftyOne: Bridging Vision and Language
&lt;/h2&gt;

&lt;p&gt;FiftyOne serves as the glue that seamlessly connects CLIP with your computer vision workflows. This open-source toolkit provides an intuitive interface for exploring, visualizing, and iterating over your multi-modal datasets. We will delve into how FiftyOne facilitates the integration, enabling you to harness the power of CLIP with ease.&lt;/p&gt;

&lt;p&gt;FiftyOne is an open-source Python package designed to facilitate the exploration, visualization, and analysis of computer vision datasets. It provides a user-friendly interface for working with diverse datasets, especially those involving images and annotations. FiftyOne aims to simplify the process of understanding and debugging machine learning models by offering tools for visualizing model predictions, analyzing dataset statistics, and iteratively refining the dataset during the development process.&lt;/p&gt;

&lt;p&gt;Key features of the FiftyOne package include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interactive Exploration: FiftyOne provides an interactive, customizable UI that allows users to explore images and their associated annotations, facilitating a deeper understanding of the dataset’s characteristics.&lt;/li&gt;
&lt;li&gt;Visualization Tools : Users can visualize images, ground truth annotations, and model predictions directly within the FiftyOne interface. This is particularly useful for inspecting model outputs and assessing the model’s performance.&lt;/li&gt;
&lt;li&gt;Dataset Statistics : The package offers tools to compute and visualize various statistics about the dataset, such as class distribution, label co-occurrence, and image quality metrics. This aids in gaining insights into the dataset’s composition and potential biases.&lt;/li&gt;
&lt;li&gt;Debugging Models : FiftyOne is designed to help users debug and analyze the outputs of machine learning models. It allows users to visualize model predictions, compare them with ground truth annotations, and identify areas where the model may need improvement.&lt;/li&gt;
&lt;li&gt;Annotation Integration : The package supports various annotation formats commonly used in computer vision tasks, including bounding boxes, segmentation masks, keypoints, and classification labels. This flexibility makes it suitable for a wide range of computer vision applications.&lt;/li&gt;
&lt;li&gt;Iterative Dataset Refinement : Users can iteratively refine datasets by adding or modifying annotations directly within the FiftyOne interface. This supports an agile development process, where datasets can be improved based on insights gained during exploration.&lt;/li&gt;
&lt;li&gt;Compatibility with Deep Learning Frameworks : FiftyOne integrates with popular deep learning frameworks such as TensorFlow and PyTorch. This allows users to seamlessly incorporate it into their machine learning workflows.&lt;/li&gt;
&lt;li&gt;Extensibility: The package is designed to be extensible, and users can build custom plugins and extensions to tailor it to their specific needs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FiftyOne is a valuable tool for researchers, data scientists, and machine learning practitioners working on computer vision projects. It enhances the efficiency of the data exploration and model development process by providing a unified platform for visualizing and interacting with image datasets.&lt;/p&gt;
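&lt;p&gt;As a rough sketch of these features in practice, the snippet below loads FiftyOne’s built-in “quickstart” zoo dataset, prints a label distribution (dataset statistics), and opens the interactive App. It assumes a working FiftyOne installation and is illustrative rather than code from this article.&lt;/p&gt;

```python
def summarize_counts(labels):
    # Pure-Python mirror of what FiftyOne's count_values() aggregation
    # reports: tally how often each label occurs.
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts

def explore_quickstart():
    # Requires: pip install fiftyone (first run downloads the dataset)
    import fiftyone as fo
    import fiftyone.zoo as foz

    dataset = foz.load_zoo_dataset("quickstart")
    # Dataset statistics: distribution of ground-truth detection labels
    print(dataset.count_values("ground_truth.detections.label"))
    # Interactive exploration: opens the FiftyOne App in the browser
    session = fo.launch_app(dataset)
    session.wait()
```

&lt;p&gt;Calling explore_quickstart() keeps the App session open until the browser tab is closed, which is the typical loop for inspecting annotations and model predictions.&lt;/p&gt;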

&lt;h2&gt;
  
  
  Qdrant: Powering Vector Search
&lt;/h2&gt;

&lt;p&gt;To augment our natural language image search, we leverage Qdrant — a vector database that excels in handling high-dimensional data efficiently. Qdrant’s ability to index and search vectors at scale is pivotal in making our integration scalable and performant.&lt;/p&gt;

&lt;p&gt;Qdrant “is a vector similarity search engine that provides a production-ready service with a convenient API to store, search, and manage points (i.e. vectors) with an additional payload.” You can think of the payloads as additional pieces of information that can help you home in on your search and also receive useful information that you can give to your users.&lt;/p&gt;

&lt;p&gt;You can get started using Qdrant with the Python qdrant-client, by pulling the latest docker image of Qdrant and connecting to it locally, or by trying out Qdrant’s Cloud free tier option until you are ready to make the full switch.&lt;/p&gt;
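&lt;p&gt;A minimal qdrant-client sketch of that workflow, using the in-process “:memory:” mode for experimentation; the collection name, vector size, and payload fields are illustrative assumptions:&lt;/p&gt;

```python
import math

def normalize(vec):
    # Unit-normalize so cosine similarity reduces to a dot product.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def demo():
    # Requires: pip install qdrant-client
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    client = QdrantClient(":memory:")  # in-process mode, no server needed
    client.create_collection(
        collection_name="images",
        vectors_config=VectorParams(size=512, distance=Distance.COSINE),
    )
    client.upsert(
        collection_name="images",
        points=[
            PointStruct(
                id=1,
                vector=normalize([0.9, 0.1] * 256),  # stand-in for a CLIP embedding
                payload={"filepath": "dog.jpg"},     # extra info returned with hits
            )
        ],
    )
    hits = client.search(
        collection_name="images",
        query_vector=normalize([0.8, 0.2] * 256),
        limit=1,
    )
    print(hits[0].payload)
```

&lt;p&gt;Swapping QdrantClient(":memory:") for QdrantClient(url="http://localhost:6333") targets the Docker image with the same API.&lt;/p&gt;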

&lt;p&gt;For a step-by-step implementation, please follow this &lt;a href="https://nayakpplaban.medium.com/how-to-use-fiftyone-and-qdrant-to-search-through-billions-of-images-in-computer-vision-applications-ca1d85ef8cab"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The amalgamation of FiftyOne, Qdrant, and CLIP introduces a new dimension to computer vision workflows. By seamlessly incorporating natural language image search, users can unlock novel applications, ranging from content discovery to interactive exploration of multi-modal datasets. As the landscape of multi-modal AI continues to evolve, this integration stands at the forefront, exemplifying the potential of cross-pollination between computer vision and natural language processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/documentation/frameworks/fifty-one/"&gt;https://qdrant.tech/documentation/frameworks/fifty-one/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://voxel51.com/"&gt;https://voxel51.com/&lt;/a&gt;&lt;br&gt;
Connect with me.&lt;/p&gt;

</description>
      <category>fiftyone</category>
      <category>qdrant</category>
    </item>
    <item>
      <title>Building an Application for Facial Recognition Using Python, OpenCV, Transformers and Qdrant</title>
      <dc:creator>PLABAN NAYAK</dc:creator>
      <pubDate>Fri, 29 Dec 2023 05:17:29 +0000</pubDate>
      <link>https://dev.to/plaban1981/building-an-application-for-facial-recognition-using-python-opencv-transformers-and-qdrant-5fah</link>
      <guid>https://dev.to/plaban1981/building-an-application-for-facial-recognition-using-python-opencv-transformers-and-qdrant-5fah</guid>
<description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tvPvrTej--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n8jc4yf04nr9yjbclaqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tvPvrTej--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n8jc4yf04nr9yjbclaqt.png" alt="Face Recognition Application Workflow" width="800" height="519"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;Method 1. Facial Recognition Using Python, OpenCV and Qdrant.&lt;/strong&gt;&lt;br&gt;
Facial recognition technology has become a ubiquitous force, reshaping industries like security, social media, and smartphone authentication. In this blog, we dive into the captivating realm of facial recognition armed with the formidable combination of Python, OpenCV, image embeddings, and Qdrant. Join us on this journey as we unravel the intricacies of creating a robust facial recognition system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 1: An Introduction to Facial Recognition&lt;/strong&gt;&lt;br&gt;
In Part 1, we lay the foundation by delving into the fundamentals of facial recognition technology. Understand the underlying principles, explore its applications, and grasp the significance of Python and OpenCV in our development stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 2: Setting Up the Environment&lt;/strong&gt;&lt;br&gt;
A crucial step in any project is preparing the development environment. Learn how to seamlessly integrate Python, OpenCV, and Qdrant to create a harmonious ecosystem for our facial recognition system. We provide step-by-step instructions, ensuring you have solid groundwork before moving forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 3: Implementation of Facial Recognition Algorithms&lt;/strong&gt;&lt;br&gt;
With the groundwork in place, we dive into the core of the project. Explore the intricacies of facial recognition algorithms and witness the magic unfold as we implement them using Python and OpenCV. Uncover the inner workings of face detection, feature extraction, and model training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part 4: Database Integration with Qdrant&lt;/strong&gt;&lt;br&gt;
No facial recognition system is complete without a robust database to store and manage facial data efficiently. In the final installment, we guide you through the integration of Qdrant, to enhance the storage and retrieval capabilities of our system. Witness the synergy between Python, OpenCV, and Qdrant as we bring our project to its culmination.&lt;/p&gt;

&lt;p&gt;By the end of this blog, you will have gained a comprehensive understanding of facial recognition technology and the practical skills to develop your own system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step-by-Step Implementation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download all the pictures of interest into a local folder.&lt;/li&gt;
&lt;li&gt;Identify and extract faces from the pictures.&lt;/li&gt;
&lt;li&gt;Calculate facial embeddings from the extracted faces.&lt;/li&gt;
&lt;li&gt;Store these facial embeddings in a Qdrant database.&lt;/li&gt;
&lt;li&gt;Obtain a colleague’s picture for identification purposes.&lt;/li&gt;
&lt;li&gt;Match the face with the provided picture.&lt;/li&gt;
&lt;li&gt;Calculate embeddings for the identified face in the provided picture.&lt;/li&gt;
&lt;li&gt;Utilize the Qdrant distance function to retrieve the closest matching faces and corresponding photos from the database.&lt;/li&gt;
&lt;/ul&gt;
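&lt;p&gt;The steps above can be sketched roughly as follows. The file names are hypothetical, and the 768-dimensional vector size assumes imgbeddings’ default CLIP-based model; treat this as an outline under those assumptions rather than the article’s exact code.&lt;/p&gt;

```python
def to_crop_box(rect):
    # Convert an OpenCV (x, y, w, h) face rectangle into a PIL crop box
    # (left, upper, right, lower).
    x, y, w, h = rect
    return (x, y, x + w, y + h)

def index_and_search_faces():
    # Requires: pip install opencv-python imgbeddings qdrant-client pillow
    import cv2
    from PIL import Image
    from imgbeddings import imgbeddings
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct

    # Steps 1-2: detect faces with a pre-trained Haar Cascade
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    photo = "team_photo.jpg"  # hypothetical local picture
    gray = cv2.cvtColor(cv2.imread(photo), cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    # Steps 3-4: embed each face with imgbeddings, store in Qdrant (local mode)
    ibed = imgbeddings()
    client = QdrantClient(":memory:")
    client.create_collection(
        collection_name="faces",
        vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    )
    points = []
    for idx, rect in enumerate(faces):
        face = Image.open(photo).crop(to_crop_box(rect))
        vector = ibed.to_embeddings(face)[0].tolist()
        points.append(PointStruct(id=idx, vector=vector, payload={"photo": photo}))
    client.upsert(collection_name="faces", points=points)

    # Steps 5-8: embed the query face and retrieve the closest stored faces
    query = ibed.to_embeddings(Image.open("colleague.jpg"))[0].tolist()
    for hit in client.search(collection_name="faces", query_vector=query, limit=3):
        print(hit.payload, hit.score)
```

&lt;p&gt;Everything here, including the Qdrant instance, runs in-process, which matches the article’s goal of keeping sensitive images 100% local.&lt;/p&gt;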

&lt;p&gt;This experiment demonstrates the practical implementation of Python, OpenCV, and advanced AI technologies in creating a sophisticated facial recognition and search application, showcasing the potential for enhanced user interactions and cognitive responses. Since images are sensitive data, we do not want to rely on any online service or upload them to the internet. The entire pipeline defined above is designed to run 100% locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Technology Stack&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qdrant:&lt;/strong&gt; &lt;em&gt;Vector store for storing image embeddings.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCV:&lt;/strong&gt; &lt;em&gt;Detects faces in the images. To extract faces from the pictures, we use OpenCV, a computer vision library, together with a pre-trained Haar Cascade model.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;imgbeddings:&lt;/strong&gt; &lt;em&gt;A Python package to generate embedding vectors from images, using OpenAI’s robust CLIP model via Hugging Face transformers.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;An Overview of OpenCV&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;OpenCV, or Open Source Computer Vision Library, is an open-source computer vision and machine learning software library. Originally developed by Intel, OpenCV is now maintained by a community of developers. It provides a wide range of tools and functions for image and video analysis, including various algorithms for image processing, computer vision, and machine learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key features of OpenCV include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image Processing:&lt;/strong&gt; &lt;em&gt;OpenCV offers a plethora of functions for basic and advanced image processing tasks, such as filtering, transformation, and color manipulation.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computer Vision Algorithms:&lt;/strong&gt; &lt;em&gt;The library includes implementations of various computer vision algorithms, including feature detection, object recognition, and image stitching.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine Learning:&lt;/strong&gt; &lt;em&gt;OpenCV integrates with machine learning frameworks and provides tools for training and deploying machine learning models. This is particularly useful for tasks like object detection and facial recognition.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Camera Calibration:&lt;/strong&gt; &lt;em&gt;OpenCV includes functions for camera calibration, essential in computer vision applications to correct distortions caused by camera lenses.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Computer Vision:&lt;/strong&gt; &lt;em&gt;It supports real-time computer vision applications, making it suitable for tasks like video analysis, motion tracking, and augmented reality.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-Platform Support:&lt;/strong&gt; &lt;em&gt;OpenCV is compatible with various operating systems, including Windows, Linux, macOS, Android, and iOS. This makes it versatile for a wide range of applications.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Community Support:&lt;/strong&gt; &lt;em&gt;With a large and active community, OpenCV is continuously evolving, with contributions from researchers, developers, and engineers worldwide.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OpenCV is widely used in academia, industry, and research for tasks ranging from simple image manipulation to complex computer vision and machine learning applications. Its versatility and comprehensive set of tools make it a go-to library for developers working in the field of computer vision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An Overview of imgbeddings&lt;/strong&gt;&lt;br&gt;
imgbeddings is a Python package that generates embedding vectors from images, using OpenAI’s robust CLIP model via Hugging Face transformers. These image embeddings, derived from an image model that has seen the entire internet up to mid-2020, can be used for many things: unsupervised clustering (e.g. via umap), embeddings search (e.g. via faiss), and downstream use in other framework-agnostic ML/AI tasks such as building a classifier or calculating image similarity.&lt;/p&gt;

&lt;p&gt;The embeddings generation models are ONNX INT8-quantized — meaning they’re 20–30% faster on the CPU, much smaller on disk, and don’t require PyTorch or TensorFlow as a dependency!&lt;br&gt;
Works for many different image domains, thanks to CLIP’s zero-shot performance.&lt;br&gt;
Includes utilities for using principal component analysis (PCA) to reduce the dimensionality of generated embeddings without losing much info.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Store Explained&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Definition&lt;/strong&gt;&lt;br&gt;
Vector stores are specialized databases designed for efficient storage and retrieval of vector embeddings. This specialization is crucial, as conventional databases like SQL are not finely tuned for handling extensive vector data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Role of Embeddings&lt;/strong&gt;&lt;br&gt;
Embeddings represent data, typically unstructured data like text or images, in numerical vector formats within a high-dimensional space. Traditional relational databases are ill-suited for storing and retrieving these vector representations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features of Vector Stores&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Indexing:&lt;/strong&gt; Vector stores can index and rapidly search for similar vectors using similarity algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced Retrieval:&lt;/strong&gt; This functionality allows applications to identify related vectors based on a provided target vector query.&lt;/li&gt;
&lt;/ul&gt;
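&lt;p&gt;Conceptually, the retrieval a vector store performs can be illustrated with a brute-force linear scan; real stores replace this scan with approximate nearest-neighbor indexes, and the vectors below are toy values:&lt;/p&gt;

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, limit=2):
    # Score every stored vector against the query and return the ids
    # of the closest ones: the "enhanced retrieval" a vector store
    # performs, minus the efficient index.
    ranked = sorted(store, key=lambda vid: cosine(query_vec, store[vid]),
                    reverse=True)
    return ranked[:limit]

embeddings = {
    "sunset.jpg": [0.9, 0.1, 0.0],
    "beach.jpg": [0.8, 0.2, 0.1],
    "invoice.pdf": [0.0, 0.1, 0.9],
}
print(retrieve([1.0, 0.0, 0.0], embeddings))
```

&lt;p&gt;Efficient indexing exists precisely because this linear scan becomes infeasible at millions or billions of vectors.&lt;/p&gt;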

&lt;p&gt;&lt;strong&gt;An Overview of Qdrant&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Getting started with Qdrant is seamless. Utilize the Python qdrant-client, access the latest Docker image of Qdrant and establish a local connection, or explore Qdrant’s Cloud free tier option until you are prepared for a comprehensive transition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High-Level Qdrant Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2HnP98xQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhwt8diy2fdrnyp56xwh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2HnP98xQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mhwt8diy2fdrnyp56xwh.png" alt="Image description" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding Semantic Similarity&lt;/strong&gt;&lt;br&gt;
Semantic similarity, in the context of a set of documents or terms, is a metric that gauges the distance between items based on the similarity of their meaning or semantic content, rather than relying on lexicographical similarities. This involves employing mathematical tools to assess the strength of the semantic relationship between language units, concepts, or instances. The numerical description obtained through this process results from comparing the information that supports their meaning or describes their nature.&lt;/p&gt;

&lt;p&gt;It’s crucial to distinguish between semantic similarity and semantic relatedness. Semantic relatedness encompasses any relation between two terms, whereas semantic similarity specifically involves an ‘is a’ relation. This distinction clarifies the nuanced nature of semantic comparisons and their application in various linguistic and conceptual contexts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Method 2. Using Transformers and Qdrant for Image Recognition&lt;/strong&gt;&lt;br&gt;
Apart from OpenCV, we can also use Vision Transformers to perform the same task. &lt;/p&gt;

&lt;p&gt;For detailed code implementation please refer &lt;a href="https://medium.com/@nayakpplaban/building-an-application-for-facial-recognition-using-python-opencv-transformers-and-qdrant-a144871f40d9"&gt;here&lt;/a&gt;&lt;br&gt;
References&lt;br&gt;
&lt;a href="https://qdrant.tech/documentation"&gt;https://qdrant.tech/documentation&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/opencv/opencv"&gt;https://github.com/opencv/opencv&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.linkedin.com/in/plaban-nayak-a9433a25/"&gt;connect with me&lt;/a&gt;&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>opencv</category>
      <category>transformers</category>
    </item>
    <item>
      <title>Building an Ecommerce-Based Search Application Using Langchain and Qdrant’s Latest Pure Vector-Based Hybrid Search</title>
      <dc:creator>PLABAN NAYAK</dc:creator>
      <pubDate>Mon, 25 Dec 2023 11:43:26 +0000</pubDate>
      <link>https://dev.to/plaban1981/building-an-ecommerce-based-search-application-using-langchain-and-qdrants-latest-pure-vector-based-hybrid-search-2ei7</link>
      <guid>https://dev.to/plaban1981/building-an-ecommerce-based-search-application-using-langchain-and-qdrants-latest-pure-vector-based-hybrid-search-2ei7</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7NLjGHTn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a834beo53bkogeesogj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7NLjGHTn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/a834beo53bkogeesogj7.png" alt="Workflow for Image and Text based search using Qdrant’s Vector Based Hybrid Search" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is Keyword Search?
&lt;/h2&gt;

&lt;p&gt;Keyword search, also known as keyword-based search, is a traditional and fundamental method of retrieving information from a database or a search engine. It involves using specific words or phrases (keywords) to search for documents, web pages, or other forms of data that contain those exact terms or closely related variations.&lt;/p&gt;

&lt;p&gt;Here’s how keyword search typically works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User Input:&lt;/strong&gt; The user enters one or more keywords or key phrases into a search box, representing the information they are looking for.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Matching Algorithm:&lt;/strong&gt; The search engine or system uses a matching algorithm to identify documents or content that contain the exact keywords or closely related terms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ranking:&lt;/strong&gt; The search results are ranked based on relevance, often using algorithms that consider factors like keyword frequency, proximity, and other relevance indicators.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Display of Results:&lt;/strong&gt; The system displays the search results to the user, usually in a list format, with each result containing a title, snippet, and a link to the full content.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
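&lt;p&gt;The four steps above can be illustrated with a toy, purely literal matcher; the product catalog and the frequency-based scoring are invented for demonstration:&lt;/p&gt;

```python
def keyword_search(query, documents):
    # Rank document ids by raw keyword-frequency score: literal matching
    # with no context analysis, exactly as described above.
    terms = query.lower().split()
    scored = []
    for doc_id, text in documents.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)
        if score:
            scored.append((score, doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored]

catalog = {
    "p1": "red running shoes with red laces",
    "p2": "blue canvas shoes",
    "p3": "garden hose 20m",
}
print(keyword_search("red shoes", catalog))  # p1 outranks p2; p3 never matches
```

&lt;p&gt;Note the limitation baked in: a query like “sneakers” would return nothing, because no document contains that literal term.&lt;/p&gt;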

&lt;h2&gt;
  
  
  The key characteristics of keyword search include:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Explicit Query:&lt;/strong&gt; The user provides a specific query made up of terms they believe are relevant to their information needs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Literal Matching:&lt;/strong&gt; The search system matches the keywords literally with the content available, aiming to find documents that contain those exact words or phrases.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Context Analysis:&lt;/strong&gt; Keyword search does not deeply analyze the context or meaning of the keywords; it primarily focuses on matching the terms provided.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Overly Broad or Narrow Results:&lt;/strong&gt; Depending on the keywords used, the search results may be too broad, resulting in irrelevant matches, or too narrow, potentially missing relevant information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limited Understanding of User Intent&lt;/strong&gt;: Keyword search often struggles to grasp the user’s underlying intent, as it relies solely on the terms input by the user without considering the context or semantics.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficulty with Synonyms and Variations:&lt;/strong&gt; Keyword search may miss relevant content due to variations in language, synonyms, or different ways of expressing the same concept.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability to Manipulation:&lt;/strong&gt; The search results can be influenced or manipulated by strategic keyword usage, which can impact the relevancy and trustworthiness of the results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Is Dense Vector Search?
&lt;/h2&gt;

&lt;p&gt;Dense vector search, often referred to as vector search or semantic search, is a modern approach to information retrieval that involves representing textual data (such as documents, queries, or other pieces of text) as dense vectors in a high-dimensional vector space. In this approach, words or phrases are mapped to multi-dimensional vectors using techniques like Word2Vec, Doc2Vec, or embeddings from transformer-based models (e.g., BERT, GPT, etc.).&lt;/p&gt;

&lt;h2&gt;
  
  
  Here’s how dense vector search typically works:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Vector Representation:&lt;/strong&gt; Textual data (e.g., sentences, documents, queries) is converted into dense vectors, where each dimension of the vector represents a different aspect of the semantic meaning of the text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Space:&lt;/strong&gt; The dense vectors are placed in a high-dimensional vector space, where semantically similar items lie close to each other. Similarity measures, such as cosine similarity, are used to calculate the similarity between vectors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Query Processing:&lt;/strong&gt; When a user submits a query, it is also converted into a dense vector using the same representation model. This vector is then used to search for similar vectors (representing documents) in the vector space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking:&lt;/strong&gt; The system ranks the documents based on their similarity to the query vector, with more similar documents ranked higher.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Display of Results:&lt;/strong&gt; The search results are displayed to the user, typically in order of relevance based on the similarity scores.&lt;/li&gt;
&lt;/ol&gt;
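&lt;p&gt;As an illustrative sketch of this pipeline, the snippet below embeds a small product catalog and a query with the sentence-transformers library; the library choice and checkpoint name are assumptions for demonstration, not details from this article.&lt;/p&gt;

```python
def rank(scores):
    # Step 4: order document indices from most to least similar.
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def dense_search_demo():
    # Requires: pip install sentence-transformers
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    docs = [
        "Wireless noise cancelling headphones",
        "Leather running shoes for men",
        "Stainless steel kitchen knife set",
    ]
    doc_vecs = model.encode(docs)                     # step 1: vector representation
    query_vec = model.encode("sneakers for jogging")  # step 3: query processing
    scores = util.cos_sim(query_vec, doc_vecs)[0].tolist()
    for idx in rank(scores):                          # steps 4-5: rank and display
        print(docs[idx], round(scores[idx], 3))
```

&lt;p&gt;Note that the query shares no keywords with the catalog entries; any ranking it produces comes purely from proximity in the embedding space.&lt;/p&gt;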

&lt;h2&gt;
  
  
  Key characteristics of dense vector search include:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Understanding:&lt;/strong&gt; Dense vector search aims to capture the semantic understanding and meaning of the text, allowing for more accurate and contextually relevant search results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contextual Analysis:&lt;/strong&gt; The approach considers the context and relationships between words, phrases, and documents, enabling a deeper understanding of the content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Less Reliance on Exact Keywords:&lt;/strong&gt; Unlike keyword search, which relies heavily on exact keyword matches, dense vector search can find relevant information even if the exact keywords are not present.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility and Adaptability:&lt;/strong&gt; Dense vector search is more flexible in handling synonyms, variations, and related terms. It is also adaptable to different languages and domains.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Sensitivity to Noise:&lt;/strong&gt; The dense vector representation tends to be more robust against noise or irrelevant terms in the query, improving the overall search experience.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Limitations:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;High Dimensionality and Resource Intensiveness:&lt;/strong&gt;  Dense vector representations often reside in high-dimensional spaces, which can be computationally intensive and require substantial memory and processing power, especially when dealing with large datasets.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training Data Dependency:&lt;/strong&gt; The quality and effectiveness of dense vectors heavily depend on the availability and quality of training data. If the training data is biased, insufficient, or not representative, it can lead to suboptimal vector representations.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Semantic Drift:&lt;/strong&gt; The semantic meaning of words and phrases can change over time, and dense vectors may not always capture these changes accurately. The embeddings may become outdated and not reflect current semantic relationships.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Difficulty Capturing Ambiguity:&lt;/strong&gt; Dense vector representations struggle to capture polysemy (multiple meanings of a word) and homonymy (different words with the same form) effectively. A single vector representation may not accurately capture all possible meanings.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Context Sensitivity:&lt;/strong&gt; Dense vector representations may not fully capture context, especially complex contextual understanding that involves understanding long-range dependencies or multiple layers of context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Out-of-Vocabulary Words:&lt;/strong&gt; Words not present in the training data may pose challenges as they lack pre-trained vector representations. Handling previously unseen words (out-of-vocabulary words) requires special techniques.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Difficulty with Domain-Specific Language:&lt;/strong&gt; Pre-trained models might not perform optimally in specialized domains or specific jargon-laden language where the vocabulary and usage are unique.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Scalability Issues:&lt;/strong&gt; As the amount of data grows, maintaining and querying a high-dimensional vector space becomes computationally expensive, potentially affecting the scalability of the search system.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lack of Explainability:&lt;/strong&gt; Dense vectors lack inherent interpretability, making it challenging to understand how the model arrived at a particular similarity score or ranking, which can be crucial for certain applications.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cold Start Problem:&lt;/strong&gt; Initializing the vector space for a new system or domain without pre-existing embeddings can be challenging, especially when there’s limited training data available.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Need for Regular Updating:&lt;/strong&gt; Continuous monitoring and updates to the dense vector models are necessary to ensure that the representations stay relevant and accurate over time.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Understanding these limitations is essential for effectively utilizing dense vector search and considering appropriate strategies to address these challenges in various applications and contexts.&lt;/p&gt;

&lt;p&gt;In summary, keyword search relies on exact keyword matches and is limited in semantic understanding, whereas dense vector search uses vector representations to capture semantic meaning and provide more contextually relevant search results. Dense vector search is flexible in handling synonyms and related terms, making it suitable for a wide range of applications, especially those that require a deeper understanding of user intent and content semantics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Search
&lt;/h2&gt;

&lt;p&gt;Hybrid vector search is a combination of traditional keyword search and modern dense vector search. It has emerged as a powerful tool for e-commerce companies looking to improve the search experience for their customers.&lt;/p&gt;

&lt;p&gt;By combining the strengths of traditional text-based search algorithms with the visual recognition capabilities of deep learning models, hybrid vector search allows users to search for products using a combination of text and images. This can be especially useful for product searches, where customers may not know the exact name or details of the item they are looking for.&lt;/p&gt;

&lt;p&gt;Here we will implement an e-commerce chat over fashion products using hybrid search. The components include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding Model (Sparse + Dense)&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Qdrant:&lt;/strong&gt; for storage and retrieval&lt;br&gt;
&lt;strong&gt;LLM (gpt-3.5-turbo):&lt;/strong&gt; for generative question answering&lt;/p&gt;
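&lt;p&gt;One judgment call in any hybrid setup is how to merge the dense and sparse result lists into a single ranking. A common, model-agnostic option (not specific to this stack) is Reciprocal Rank Fusion, sketched here with illustrative document ids:&lt;/p&gt;

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one.

    Each document earns 1 / (k + rank + 1) from every list it appears in,
    so items ranked highly by both retrievers float to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d1", "d3", "d2"]   # e.g., order by dense (CLIP) similarity
sparse_ranking = ["d1", "d4", "d3"]  # e.g., order by sparse (SPLADE) score
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)
```

&lt;p&gt;Because fusion works on ranks rather than raw scores, it avoids having to calibrate dense cosine similarities against sparse term weights.&lt;/p&gt;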

&lt;h2&gt;
  
  
  What Is SPLADE?
&lt;/h2&gt;

&lt;p&gt;SPLADE leverages a transformer architecture to generate sparse representations of documents and queries, enabling efficient retrieval. Let’s dive into the process.&lt;/p&gt;

&lt;p&gt;SPLADE builds on the output logits of a transformer backbone, which can be a familiar architecture such as BERT. Rather than producing dense probability distributions, SPLADE uses these logits to construct sparse vectors: think of them as a distilled essence of tokens, where each dimension corresponds to a term from the vocabulary and its associated weight in the context of the given document or query.&lt;/p&gt;

&lt;p&gt;This sparsity is critical; it mirrors the probability distributions from a typical Masked Language Modeling task but is tuned for retrieval effectiveness, emphasizing terms that are both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Contextually relevant:&lt;/strong&gt; terms that represent a document well should be given more weight.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discriminative across documents:&lt;/strong&gt; terms that a document has, and other documents don’t, should be given more weight.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The token-level distributions that you’d expect in a standard transformer model are transformed into token-level importance scores in SPLADE. These scores reflect the significance of each term in the context of the document or query, guiding the model to allocate more weight to terms that are likely to be more meaningful for retrieval purposes.&lt;/p&gt;

&lt;p&gt;The resulting sparse vectors are not only memory-efficient but also tailored for precise matching in the high-dimensional space of a search engine like Qdrant.&lt;/p&gt;
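&lt;p&gt;As a rough sketch of that construction: SPLADE pools the per-token MLM logits into a single vocabulary-sized vector, commonly via w_j = max over tokens of log(1 + ReLU(logit)), so most term weights collapse to zero. The logits below are toy values, not real model output:&lt;/p&gt;

```python
import math

def splade_pool(logits):
    """Collapse per-token MLM logits (seq_len x vocab_size) into one
    sparse vocabulary-sized vector: w_j = max_i log(1 + relu(logit_ij))."""
    vocab_size = len(logits[0])
    weights = []
    for j in range(vocab_size):
        # ReLU zeroes negative logits; log1p compresses the positive ones.
        weights.append(max(math.log1p(max(row[j], 0.0)) for row in logits))
    # Keep only the non-zero terms: {vocab index: weight}.
    return {j: w for j, w in enumerate(weights) if w > 0.0}

# Toy logits for a 3-token input over a 6-term vocabulary.
toy_logits = [
    [2.0, -1.0, 0.0, 0.5, -3.0, 0.0],
    [0.1, -0.5, 0.0, 4.0, -2.0, 0.0],
    [0.0, -0.2, 0.0, 1.0, -1.0, 0.3],
]
sparse_vec = splade_pool(toy_logits)
print(sparse_vec)
```

&lt;p&gt;In practice the logits come from a model such as naver/efficient-splade-VI-BT-large-doc via a masked-language-modeling head, and the surviving indices map back to vocabulary terms.&lt;/p&gt;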

&lt;h2&gt;
  
  
  Interpreting SPLADE
&lt;/h2&gt;

&lt;p&gt;A downside of dense vectors is that they are not interpretable, making it difficult to understand why a document is relevant to a query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SPLADE&lt;/strong&gt; importance estimation can provide insights into the ‘why’ behind a document’s relevance to a query. By shedding light on which tokens contribute most to the retrieval score, SPLADE offers some degree of interpretability alongside performance, a rare feat in the realm of neural IR systems. For engineers working on search, this transparency is invaluable.&lt;/p&gt;

&lt;p&gt;Compared to other sparse methods, retrieval with SPLADE is slow. There are three primary reasons for this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The number of non-zero values in SPLADE query and document vectors is typically greater than in traditional sparse vectors, and sparse retrieval systems are not optimized for this.&lt;/li&gt;
&lt;li&gt;The distribution of non-zero values deviates from the traditional distribution expected by sparse retrieval systems, again causing slowdowns.&lt;/li&gt;
&lt;li&gt;SPLADE vectors are not natively supported by most sparse retrieval systems, meaning we must perform multiple pre- and post-processing steps, weight discretization, and so on.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The key ideas behind the SPLADE scoring mechanism are as follows:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The term frequency component measures how often a query term appears within a document, giving more weight to terms that occur frequently in that document.&lt;/li&gt;
&lt;li&gt;The inverse document frequency factor considers the overall frequency of a term in the entire collection, penalizing common terms.&lt;/li&gt;
&lt;li&gt;Document length normalization helps to adjust for variations in document length, ensuring fairness in scoring.&lt;/li&gt;
&lt;/ul&gt;
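&lt;p&gt;These three ingredients are the classic ones combined by sparse scoring functions such as BM25; a compact sketch follows, using the usual default parameters k1 and b and an illustrative corpus:&lt;/p&gt;

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one tokenized document against a query with BM25,
    combining term frequency, inverse document frequency,
    and document length normalization."""
    N = len(corpus)
    avg_len = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)           # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)  # penalize common terms
        tf = doc.count(term)                               # reward in-document frequency
        norm = k1 * (1.0 - b + b * len(doc) / avg_len)     # length normalization
        score += idf * tf * (k1 + 1.0) / (tf + norm)
    return score

corpus = [
    "red summer dress".split(),
    "blue denim jacket".split(),
    "red wool jacket".split(),
]
query = "red dress".split()
scores = [bm25_score(query, d, corpus) for d in corpus]
```

&lt;p&gt;The document containing both query terms scores highest; the one sharing only "red" gets partial credit, and the one sharing nothing scores zero.&lt;/p&gt;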

&lt;p&gt;Note: Qdrant supports a separate index for Sparse Vectors. This enables us to use the same collection for both dense and sparse vectors. Each “Point” in Qdrant can have both dense and sparse vectors.&lt;/p&gt;

&lt;h2&gt;
  
  
  Hybrid Search Implementation in Qdrant
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;SPLADE Implementation for Sparse Vectors:&lt;/strong&gt; This is a new feature in Qdrant, added in their latest release.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;For document embedding: &lt;em&gt;naver/efficient-splade-VI-BT-large-doc&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For query embedding: &lt;em&gt;naver/efficient-splade-VI-BT-large-query&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CLIP MODEL&lt;/strong&gt; for Dense Vector.&lt;/p&gt;

&lt;p&gt;For the complete code implementation, please refer &lt;a href="https://medium.com/@nayakpplaban/building-an-ecommerce-based-search-application-using-langchain-and-qdrants-latest-pure-a60df053066a"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
