PLABAN NAYAK
Building an Ecommerce-Based Search Application Using Langchain and Qdrant’s Latest Pure Vector-Based Hybrid Search

Workflow for Image- and Text-Based Search Using Qdrant's Vector-Based Hybrid Search

What Is Keyword Search?

Keyword search, also known as keyword-based search, is a traditional and fundamental method of retrieving information from a database or a search engine. It involves using specific words or phrases (keywords) to search for documents, web pages, or other forms of data that contain those exact terms or closely related variations.

Here’s how keyword search typically works:

  • User Input: The user enters one or more keywords or key phrases into a search box, representing the information they are looking for.
  • Matching Algorithm: The search engine or system uses a matching algorithm to identify documents or content that contain the exact keywords or closely related terms.

  • Ranking: The search results are ranked based on relevance, often using algorithms that consider factors like keyword frequency, proximity, and other relevance indicators.

  • Display of Results: The system displays the search results to the user, usually in a list format, with each result containing a title, snippet, and a link to the full content.
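To make the mechanics concrete, the following minimal sketch implements literal keyword matching with a naive frequency-based ranking. The toy document collection, tokenizer, and function name are illustrative only and not taken from the article:

```python
import re
from collections import Counter

documents = {
    "d1": "red running shoes with breathable mesh",
    "d2": "leather formal shoes for office wear",
    "d3": "red cotton t-shirt, regular fit",
}

def tokenize(text):
    # Lowercase and split on non-alphanumeric characters.
    return re.findall(r"[a-z0-9]+", text.lower())

def keyword_search(query, docs):
    query_terms = tokenize(query)
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(tokenize(text))
        # Literal matching: the score is simply the total frequency of query terms.
        score = sum(counts[t] for t in query_terms)
        if score > 0:
            scores[doc_id] = score
    # Rank by score, highest first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(keyword_search("red shoes", documents))
# [('d1', 2), ('d2', 1), ('d3', 1)]
```

The same sketch also illustrates the limitations discussed below: a query for "sneakers" returns nothing, even though the running-shoes document is clearly relevant.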

The key characteristics of keyword search include:

  • Explicit Query: The user provides a specific query made up of terms they believe are relevant to their information needs.
  • Literal Matching: The search system matches the keywords literally with the content available, aiming to find documents that contain those exact words or phrases.
  • No Context Analysis: Keyword search does not deeply analyze the context or meaning of the keywords; it primarily focuses on matching the terms provided.

Limitations:

  • Overly Broad or Narrow Results: Depending on the keywords used, the search results may be too broad, resulting in irrelevant matches, or too narrow, potentially missing relevant information.
  • Limited Understanding of User Intent: Keyword search often struggles to grasp the user’s underlying intent, as it relies solely on the terms input by the user without considering the context or semantics.
  • Difficulty with Synonyms and Variations: Keyword search may miss relevant content due to variations in language, synonyms, or different ways of expressing the same concept.
  • Vulnerability to Manipulation: The search results can be influenced or manipulated by strategic keyword usage, which can impact the relevancy and trustworthiness of the results.

What Is Dense Vector Search?

Dense vector search, often referred to as vector search or semantic search, is a modern approach to information retrieval that involves representing textual data (such as documents, queries, or other pieces of text) as dense vectors in a high-dimensional vector space. In this approach, words or phrases are mapped to multi-dimensional vectors using techniques like Word2Vec, Doc2Vec, or embeddings from transformer-based models (e.g., BERT, GPT, etc.).

Here’s how dense vector search typically works:

  1. Vector Representation: Textual data (e.g., sentences, documents, queries) is converted into dense vectors, where each dimension of the vector represents a different aspect of the semantic meaning of the text.
  2. Vector Space: The dense vectors are placed in a high-dimensional vector space where semantically similar texts end up close to each other. Similarity measures, such as cosine similarity, are used to quantify how close two vectors are.
  3. Query Processing: When a user submits a query, it is also converted into a dense vector using the same representation model. This vector is then used to search for similar vectors (representing documents) in the vector space.
  4. Ranking: The system ranks the documents based on their similarity to the query vector, with more similar documents ranked higher.
  5. Display of Results: The search results are displayed to the user, typically in order of relevance based on the similarity scores.
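As a minimal sketch of steps 1 to 5, the snippet below embeds a few product descriptions and a query with the same model and ranks them by cosine similarity. It assumes the sentence-transformers package; the all-MiniLM-L6-v2 model and the toy documents are illustrative choices, not from the article:

```python
from sentence_transformers import SentenceTransformer, util

# Any sentence-embedding model works here; all-MiniLM-L6-v2 is just a small example choice.
model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "red running shoes with breathable mesh",
    "leather formal shoes for office wear",
    "red cotton t-shirt, regular fit",
]

# Steps 1-2: encode documents into dense vectors that live in a shared vector space.
doc_vectors = model.encode(documents, convert_to_tensor=True, normalize_embeddings=True)

# Step 3: encode the query with the same model.
query_vector = model.encode("sneakers for jogging", convert_to_tensor=True, normalize_embeddings=True)

# Step 4: rank documents by cosine similarity to the query.
similarities = util.cos_sim(query_vector, doc_vectors)[0]
ranked = sorted(zip(documents, similarities.tolist()), key=lambda x: x[1], reverse=True)

# Step 5: display the results in order of relevance.
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```

Note that the query shares no keywords with any of the documents, yet the running-shoes description should rank highest, which is exactly the behavior keyword search struggles to deliver.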

Key characteristics of dense vector search include:

  1. Semantic Understanding: Dense vector search aims to capture the semantic understanding and meaning of the text, allowing for more accurate and contextually relevant search results.
  2. Contextual Analysis: The approach considers the context and relationships between words, phrases, and documents, enabling a deeper understanding of the content.
  3. Less Reliance on Exact Keywords: Unlike keyword search, which relies heavily on exact keyword matches, dense vector search can find relevant information even if the exact keywords are not present.
  4. Flexibility and Adaptability: Dense vector search is more flexible in handling synonyms, variations, and related terms. It is also adaptable to different languages and domains.
  5. Reduced Sensitivity to Noise: The dense vector representation tends to be more robust against noise or irrelevant terms in the query, improving the overall search experience.

Limitations:

  1. High Dimensionality and Resource Intensiveness: Dense vector representations often reside in high-dimensional spaces, which can be computationally intensive and require substantial memory and processing power, especially when dealing with large datasets.
  2. Training Data Dependency: The quality and effectiveness of dense vectors heavily depend on the availability and quality of training data. If the training data is biased, insufficient, or not representative, it can lead to suboptimal vector representations.

  3. Semantic Drift: The semantic meaning of words and phrases can change over time, and dense vectors may not always capture these changes accurately. The embeddings may become outdated and not reflect current semantic relationships.

  4. Difficulty Capturing Ambiguity: Dense vector representations struggle to capture polysemy (multiple meanings of a word) and homonymy (different words with the same form) effectively. A single vector representation may not accurately capture all possible meanings.

  5. Context Sensitivity: Dense vector representations may not fully capture context, especially complex contextual understanding that involves understanding long-range dependencies or multiple layers of context.

  6. Out-of-Vocabulary Words: Words not present in the training data may pose challenges as they lack pre-trained vector representations. Handling previously unseen words (out-of-vocabulary words) requires special techniques.

  7. Difficulty with Domain-Specific Language: Pre-trained models might not perform optimally in specialized domains or specific jargon-laden language where the vocabulary and usage are unique.

  8. Scalability Issues: As the amount of data grows, maintaining and querying a high-dimensional vector space becomes computationally expensive, potentially affecting the scalability of the search system.

  9. Lack of Explainability: Dense vectors lack inherent interpretability, making it challenging to understand how the model arrived at a particular similarity score or ranking, which can be crucial for certain applications.

  10. Cold Start Problem: Initializing the vector space for a new system or domain without pre-existing embeddings can be challenging, especially when there’s limited training data available.

  11. Need for Regular Updating: Continuous monitoring and updates to the dense vector models are necessary to ensure that the representations stay relevant and accurate over time.

Understanding these limitations is essential for effectively utilizing dense vector search and considering appropriate strategies to address these challenges in various applications and contexts.

In summary, keyword search relies on exact keyword matches and is limited in semantic understanding, whereas dense vector search uses vector representations to capture semantic meaning and provide more contextually relevant search results. Dense vector search is flexible in handling synonyms and related terms, making it suitable for a wide range of applications, especially those that require a deeper understanding of user intent and content semantics.

Hybrid Search

Hybrid vector search is a combination of traditional keyword search and modern dense vector search. It has emerged as a powerful tool for e-commerce companies looking to improve the search experience for their customers.

By combining the strengths of traditional text-based search algorithms with the visual recognition capabilities of deep learning models, hybrid vector search allows users to search for products using a combination of text and images. This can be especially useful for product searches, where customers may not know the exact name or details of the item they are looking for.

Here we will implement an e-commerce chat over fashion products using hybrid search. The components include:

  • Embedding models: sparse + dense
  • Qdrant: storage and retrieval
  • LLM (gpt-3.5-turbo): generative question answering
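The generative question-answering step takes the products retrieved by hybrid search and asks gpt-3.5-turbo to answer over them. The sketch below is a minimal illustration of that step using the openai Python package directly; the function name, prompt wording, and example products are hypothetical, and the post's full implementation may wire this up through LangChain instead:

```python
from openai import OpenAI  # assumes openai >= 1.0 and OPENAI_API_KEY set in the environment

client = OpenAI()

def answer_from_products(question: str, retrieved_products: list[str]) -> str:
    """Generative question answering over the products returned by hybrid search."""
    context = "\n".join(f"- {p}" for p in retrieved_products)
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Answer the shopper's question using only the product context provided."},
            {"role": "user",
             "content": f"Products:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

# retrieved_products would come from the Qdrant hybrid search shown later in the post.
print(answer_from_products(
    "Do you have anything comfortable for jogging?",
    ["Red running shoes with breathable mesh", "Leather formal shoes"],
))
```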

What Is SPLADE?

SPLADE leverages a transformer architecture to generate sparse representations of documents and queries, enabling efficient retrieval. Let’s dive into the process.

SPLADE builds on the output logits of a transformer backbone; the architecture can be something familiar like BERT. Rather than producing dense probability distributions, SPLADE uses these logits to construct sparse vectors: think of them as a distilled essence of tokens, where each dimension corresponds to a term from the vocabulary, weighted by its importance in the context of the given document or query.

This sparsity is critical; it mirrors the probability distributions from a typical Masked Language Modeling task but is tuned for retrieval effectiveness, emphasizing terms that are both:

  • Contextually relevant: terms that represent a document well should be given more weight.
  • Discriminative across documents: terms that appear in a given document but not in others should be given more weight.

The token-level distributions that you’d expect in a standard transformer model are transformed into token-level importance scores in SPLADE. These scores reflect the significance of each term in the context of the document or query, guiding the model to allocate more weight to terms that are likely to be more meaningful for retrieval purposes.

The resulting sparse vectors are not only memory-efficient but also tailored for precise matching in the high-dimensional space of a search engine like Qdrant.
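As a rough sketch of how those logits become a sparse vector, the snippet below runs the document-side SPLADE model (the same checkpoint the post uses later for document embeddings) and applies the commonly used log(1 + ReLU(logits)) transformation with max pooling over the token positions. Treat it as an illustration of the idea rather than the post's exact code:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Document-side SPLADE encoder (also referenced later in this post).
model_id = "naver/efficient-splade-VI-BT-large-doc"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

text = "red running shoes with breathable mesh"
tokens = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**tokens).logits  # shape: (1, seq_len, vocab_size)

# SPLADE turns MLM logits into term importances:
#   w_j = max over token positions of log(1 + ReLU(logit_ij)), masked by the attention mask.
weights = torch.log1p(torch.relu(logits))
weights = weights * tokens.attention_mask.unsqueeze(-1)
sparse_vector = weights.max(dim=1).values.squeeze(0)  # one weight per vocabulary term

# Most dimensions are zero; only the non-zero (term id, weight) pairs need to be stored.
indices = sparse_vector.nonzero().squeeze(1)
values = sparse_vector[indices]
print(f"{len(indices)} non-zero terms out of {sparse_vector.shape[0]}")
```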

Interpreting SPLADE

A downside of dense vectors is that they are not interpretable, making it difficult to understand why a document is relevant to a query.

SPLADE importance estimation can provide insights into the ‘why’ behind a document’s relevance to a query. By shedding light on which tokens contribute most to the retrieval score, SPLADE offers some degree of interpretability alongside performance, a rare feat in the realm of neural IR systems. For engineers working on search, this transparency is invaluable.
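Continuing the sketch above (it reuses the sparse_vector and tokenizer from that snippet), mapping the highest-weighted dimensions back to vocabulary tokens shows which terms drive the score:

```python
# Continuing from the sparse_vector computed in the previous sketch:
top = torch.topk(sparse_vector, k=10)
terms = tokenizer.convert_ids_to_tokens(top.indices.tolist())
for term, weight in zip(terms, top.values.tolist()):
    print(f"{term:>15s}  {weight:.3f}")
# Expect high weights for terms like "shoes" and "mesh", plus related terms the model
# expands to even though they never appear in the original text.
```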

Compared to other sparse methods, retrieval with SPLADE is slow. There are three primary reasons for this:

  • The number of non-zero values in SPLADE query and document vectors is typically greater than in traditional sparse vectors, and sparse retrieval systems are not optimized for this.
  • The distribution of non-zero values deviates from the distribution traditionally expected by sparse retrieval systems, again causing slowdowns.
  • SPLADE vectors are not natively supported by most sparse retrieval systems, so we must perform multiple pre- and post-processing steps, weight discretization, and so on.

The key ideas in the SPLADE scoring mechanism are as follows:

  • The term frequency component measures how often a query term appears within a document, giving more weight to rare terms.
  • The inverse document frequency factor considers the overall frequency of a term in the entire collection, penalizing common terms.
  • Document length normalization helps to adjust for variations in document length, ensuring fairness in scoring.

Note: Qdrant supports a separate index for Sparse Vectors. This enables us to use the same collection for both dense and sparse vectors. Each “Point” in Qdrant can have both dense and sparse vectors.
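A minimal sketch of such a collection, assuming qdrant-client 1.7 or later and a local Qdrant instance; the collection name, vector names, and placeholder values are hypothetical:

```python
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

# One collection holding both a named dense vector and a named sparse vector per point.
client.create_collection(
    collection_name="fashion_products",
    vectors_config={
        "clip": models.VectorParams(size=512, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={
        "splade": models.SparseVectorParams(),
    },
)

# Placeholders: in the real pipeline these come from CLIP and SPLADE respectively.
dense_vector = [0.0] * 512
sparse_indices, sparse_values = [2748, 10955], [1.2, 0.8]

# Each point carries both representations plus a payload with the product metadata.
client.upsert(
    collection_name="fashion_products",
    points=[
        models.PointStruct(
            id=1,
            vector={
                "clip": dense_vector,
                "splade": models.SparseVector(indices=sparse_indices, values=sparse_values),
            },
            payload={"title": "red running shoes with breathable mesh"},
        )
    ],
)
```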

Hybrid Search Implementation in Qdrant

SPLADE implementation for sparse vectors: this is a new feature added in Qdrant's latest release.

  • For document embedding: naver/efficient-splade-VI-BT-large-doc
  • For query embedding: naver/efficient-splade-VI-BT-large-query

CLIP model for dense vectors.
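Putting the two together, a hybrid query might encode the user's text with both the query-side SPLADE model and CLIP, run one search per vector type, and fuse the two ranked lists. The sketch below uses simple reciprocal-rank fusion and the hypothetical collection from the earlier sketch; the post's full code may combine the results differently:

```python
import torch
from qdrant_client import QdrantClient, models
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForMaskedLM, AutoTokenizer

client = QdrantClient(url="http://localhost:6333")

# Dense query embedding with CLIP (text side); the same model can also embed product images.
clip = SentenceTransformer("clip-ViT-B-32")

# Sparse query embedding with the query-side SPLADE model.
splade_id = "naver/efficient-splade-VI-BT-large-query"
splade_tok = AutoTokenizer.from_pretrained(splade_id)
splade = AutoModelForMaskedLM.from_pretrained(splade_id)

def splade_encode(text):
    tokens = splade_tok(text, return_tensors="pt")
    with torch.no_grad():
        logits = splade(**tokens).logits
    weights = (torch.log1p(torch.relu(logits)) * tokens.attention_mask.unsqueeze(-1)).max(dim=1).values.squeeze(0)
    idx = weights.nonzero().squeeze(1)
    return idx.tolist(), weights[idx].tolist()

query = "red sneakers for jogging"
indices, values = splade_encode(query)

dense_hits = client.search(
    collection_name="fashion_products",
    query_vector=models.NamedVector(name="clip", vector=clip.encode(query).tolist()),
    limit=10,
)
sparse_hits = client.search(
    collection_name="fashion_products",
    query_vector=models.NamedSparseVector(
        name="splade", vector=models.SparseVector(indices=indices, values=values)
    ),
    limit=10,
)

# Simple reciprocal-rank fusion of the two result lists (60 is the usual RRF constant).
fused = {}
for hits in (dense_hits, sparse_hits):
    for rank, hit in enumerate(hits):
        fused[hit.id] = fused.get(hit.id, 0.0) + 1.0 / (60 + rank + 1)

for point_id, score in sorted(fused.items(), key=lambda kv: kv[1], reverse=True):
    print(point_id, round(score, 4))
```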

For the complete code implementation, please refer here.
