<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: sagaruprety</title>
    <description>The latest articles on DEV Community by sagaruprety (@sagaruprety).</description>
    <link>https://dev.to/sagaruprety</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1242510%2F4b4b9a76-57c6-4e4c-8674-8cc4e74591ed.png</url>
      <title>DEV Community: sagaruprety</title>
      <link>https://dev.to/sagaruprety</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sagaruprety"/>
    <language>en</language>
    <item>
      <title>Music Video Search with Qdrant</title>
      <dc:creator>sagaruprety</dc:creator>
      <pubDate>Thu, 22 Feb 2024 17:29:49 +0000</pubDate>
      <link>https://dev.to/sagaruprety/music-video-search-with-qdrant-3ba6</link>
      <guid>https://dev.to/sagaruprety/music-video-search-with-qdrant-3ba6</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In a world inundated with content, discovering new music videos can feel like stumbling through a maze of endless options. As music enthusiasts, we often find ourselves craving a more intuitive and personalized way to explore the vast sea of melodies.&lt;/p&gt;

&lt;p&gt;In this blog post, we’ll discuss how to build a music video search and discovery application powered by the capabilities of Qdrant vector store in Python. Qdrant offers advanced search and indexing functionalities, offering a robust and versatile solution for efficient data retrieval and exploration. Designed to meet the evolving needs of modern applications, it provides a powerful framework for organizing, querying, and analyzing vast datasets with speed and precision.&lt;/p&gt;

&lt;p&gt;For this blog, we will be using the search and discovery functionalities of Qdrant, so let’s go through the recently introduced Discovery API. All the code in this blog can also be accessed at: &lt;a href="https://github.com/sagaruprety/youtube_video_search"&gt;https://github.com/sagaruprety/youtube_video_search&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Qdrant Discovery API
&lt;/h2&gt;

&lt;p&gt;In the Discovery API, Qdrant introduces the concept of “context,” used for partitioning the space. A context comprises positive-negative pairs of data points, with each pair delineating the space into a positive and a negative zone. During search, points are prioritized based on their presence in positive zones and their avoidance of negative ones.&lt;/p&gt;

&lt;p&gt;Context can be supplied either via the IDs of the data points or via their embeddings. In both cases, the points must be provided as positive-negative pairs.&lt;/p&gt;

&lt;p&gt;The Discovery API facilitates two novel search types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovery Search&lt;/strong&gt;: Utilizes the context (positive-negative vector pairs) and a target to retrieve points most akin to the target while adhering to the context’s constraints.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6crx9g16tvr8mcyaqgo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6crx9g16tvr8mcyaqgo.png" alt="a figure showing scattered data points grouped into positive and negative context. It also contains a target data point." width="720" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://qdrant.tech/documentation/concepts/explore/#discovery-search"&gt;https://qdrant.tech/documentation/concepts/explore/#discovery-search&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;Context Search&lt;/strong&gt;: Employs only the context pairs to fetch points residing in the optimal zone, where loss is minimized. There is no need to specify a target embedding. This can be used for recommendation once a few data points about a user’s likes and dislikes have been collected.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1jnm8pj9se7xb77vuha.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo1jnm8pj9se7xb77vuha.png" alt="A figure containing scattered data points which are grouped into positive and negative contexts" width="720" height="357"&gt;&lt;/a&gt;&lt;br&gt;
Source: &lt;a href="https://qdrant.tech/documentation/concepts/explore/#context-search"&gt;https://qdrant.tech/documentation/concepts/explore/#context-search&lt;/a&gt;&lt;/p&gt;
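&lt;p&gt;To build intuition for how the context pairs partition the space, here is a toy, stdlib-only sketch of pair-based scoring (an illustration of the idea, not Qdrant’s internal implementation): a point is penalized for every pair in which it sits closer to the negative example than to the positive one.&lt;/p&gt;

```python
import math

def cosine_sim(a, b):
    # cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def context_score(point, pairs):
    # zero is the best possible score: the point is on the positive
    # side of every pair; each violated pair adds a negative penalty
    return sum(
        min(0.0, cosine_sim(point, pos) - cosine_sim(point, neg))
        for pos, neg in pairs
    )

pairs = [([1.0, 0.0], [-1.0, 0.0])]  # one positive-negative pair along the x-axis
good = [0.9, 0.1]    # lies in the positive zone
bad = [-0.9, 0.1]    # lies in the negative zone
print(context_score(good, pairs))  # 0.0
print(context_score(bad, pairs))   # negative
```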
&lt;h2&gt;
  
  
  Installations
&lt;/h2&gt;

&lt;p&gt;You first need to start a local Qdrant server. The easiest way to do this is via Docker. Ensure you have Docker installed on your system. Then, open a terminal and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then go to your browser at &lt;a href="http://localhost:6333/dashboard"&gt;http://localhost:6333/dashboard&lt;/a&gt; and see the Qdrant dashboard.&lt;/p&gt;

&lt;p&gt;We also need to install the Qdrant Python client, the Sentence-Transformers library (which contains the vector embedding models), and Pandas for data preprocessing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install qdrant-client pandas sentence-transformers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dataset
&lt;/h2&gt;

&lt;p&gt;We use an openly available YouTube videos dataset from &lt;a href="https://www.kaggle.com/datasets/rahulanand0070/youtubevideodataset"&gt;Kaggle&lt;/a&gt;. This dataset consists of the most trending YouTube videos and was scraped using YouTube’s API. It is essentially a CSV file with a video URL and some metadata about the videos. Specifically, there are four fields — &lt;strong&gt;Title&lt;/strong&gt;, &lt;strong&gt;Videourl&lt;/strong&gt;, &lt;strong&gt;Category&lt;/strong&gt;, and &lt;strong&gt;Description&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
   Create Collection
&lt;/h2&gt;

&lt;p&gt;We first download the above-mentioned dataset, load it using the Pandas library, and pre-process it. We filter out all videos which do not belong to the category — ‘Art&amp;amp;Music’.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Load the CSV file into a Pandas DataFrame
csv_file = './data/Youtube_Video_Dataset.csv'
df = pd.read_csv(csv_file)

# filter out all other categories; .copy() avoids pandas' SettingWithCopyWarning below
only_music = df[df['Category'] == 'Art&amp;amp;Music'].copy()

# convert all values into string type
only_music['Title'] = only_music['Title'].astype(str)
only_music['Description'] = only_music['Description'].astype(str)
only_music['Category'] = only_music['Category'].astype(str)
only_music.head()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Title&lt;/th&gt;
&lt;th&gt;Videourl&lt;/th&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;9446&lt;/td&gt;
&lt;td&gt;FINE ART Music and Painting PEACEFUL SELECTION...&lt;/td&gt;
&lt;td&gt;/watch?v=13E5azGDK1k&lt;/td&gt;
&lt;td&gt;Art&amp;amp;Music&lt;/td&gt;
&lt;td&gt;CALM MELODIES AND BEAUTIFUL PICTURES\nDebussy,...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9447&lt;/td&gt;
&lt;td&gt;Improvised Piano Music and Emotional Art Thera...&lt;/td&gt;
&lt;td&gt;/watch?v=5mWjq2BsD9Q&lt;/td&gt;
&lt;td&gt;Art&amp;amp;Music&lt;/td&gt;
&lt;td&gt;When watching this special episode of The Perf...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9448&lt;/td&gt;
&lt;td&gt;babyfirst art and music&lt;/td&gt;
&lt;td&gt;/watch?v=rrJbuF6zOIk&lt;/td&gt;
&lt;td&gt;Art&amp;amp;Music&lt;/td&gt;
&lt;td&gt;nan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9449&lt;/td&gt;
&lt;td&gt;Art: music &amp;amp; painting - Van Gogh on Caggiano, ...&lt;/td&gt;
&lt;td&gt;/watch?v=1b8xiXKd9Kk&lt;/td&gt;
&lt;td&gt;Art&amp;amp;Music&lt;/td&gt;
&lt;td&gt;♫ Buy “Art: Music &amp;amp; Painting - Van Gogh on on ...&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9450&lt;/td&gt;
&lt;td&gt;The Great Masterpieces of Art &amp;amp; Music&lt;/td&gt;
&lt;td&gt;/watch?v=tsKlRF2Gw1s&lt;/td&gt;
&lt;td&gt;Art&amp;amp;Music&lt;/td&gt;
&lt;td&gt;Skip the art museum and come experience “Great...&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Next, we create a Qdrant collection. We need to instantiate a Qdrant client and connect it to Qdrant’s local server running at port 6333. The recreate_collection function takes a collection_name argument, which is the name you want to give to your collection. Note also the vectors_config argument, where we define the size of the vector embeddings (our embedding model outputs 384-dimensional vectors) and the similarity metric (here, cosine similarity). One can also use the create_collection function, but it will throw an error if you call it again with the same collection name.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient("localhost", port=6333)

client.recreate_collection(
   collection_name="youtube_music_videos",
   vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also initialize the embedding model. Here we use the Sentence-Transformers library and the MiniLM model, which is a lightweight embedding model and good enough for common language text.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize SentenceTransformer model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also need to convert the Pandas dataframe to a list of records (dictionaries) to insert into the Qdrant collection.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# convert pandas dataframe to a dictionary of records for inserting into Qdrant collection
music_videos_dict = only_music.to_dict(orient='records')
music_videos_dict
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{'Title': 'FINE ART Music and Painting PEACEFUL SELECTION (Calm Melodies and Beautiful Pictures)',
 'Videourl': '/watch?v=13E5azGDK1k',
 'Category': 'Art&amp;amp;Music',
 'Description': 'CALM MELODIES AND BEAUTIFUL PICTURES\nDebussy, Milena Stanisic,\nPiano, Flute, Harp,\nFlowers, Sailing, Mediterranean, Lavender,'},
 {'Title': 'Improvised Piano Music and Emotional Art Therapy - Featuring Erica Orth',
 'Videourl': '/watch?v=5mWjq2BsD9Q',
 'Category': 'Art&amp;amp;Music',
 'Description': 'When watching this special episode of The Perfect Note, keep in mind, every single note heard and stroke of paint seen in this video is completely improvised…'},
 {'Title': 'babyfirst art and music',
 'Videourl': '/watch?v=rrJbuF6zOIk',
 'Category': 'Art&amp;amp;Music',
 'Description': 'nan',…]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we insert the records into the collection, converting the text in the Title column to embeddings as we go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# upload the records in the Qdrant collection, including creating the vector embeddings of the Title column
for idx, doc in enumerate(music_videos_dict):
 client.upload_records(
 collection_name="youtube_music_videos",
 records=[
 models.Record(
 id=idx, vector=model.encode(doc["Title"]), payload=doc
 )])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have the data in our collection, let’s do some semantic search on it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# perform semantic search for a given query in the collection
def search_video(query: str) -&amp;gt; list[dict]:
   collection_name = "youtube_music_videos"
# Convert text query into vector
   vector = model.encode(query).tolist()

   # Use `vector` for search for closest vectors in the collection
   search_results = client.search(
       collection_name=collection_name,
       query_vector=vector,
       query_filter=None,  # If you don't want any other filters
       limit=10,  # get 10 most similar results
   )
   # `search_results` contains found vector ids with similarity scores along with the stored payload
   results = []
   for hit in search_results:
       item = {}
       # print(hit)
       item['score'] = hit.score
       item['Title'] = hit.payload['Title']
       url = hit.payload['Videourl']
       item['URL'] = f'youtube.com{url}'
       results.append(item)
   return results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have the search function ready. All we need is a query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# query the collection
query = 'dua lipa'
search_video(query)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{'score': 0.8309551,
  'Title': 'Dua Lipa - New Rules (Official Music Video)',
  'URL': 'youtube.com/watch?v=k2qgadSvNyU'},
 {'score': 0.8116781,
  'Title': 'Dua Lipa - IDGAF (Official Music Video)',
  'URL': 'youtube.com/watch?v=Mgfe5tIwOj0'},
 {'score': 0.80936086,
  'Title': 'Dua Lipa - Be The One (Official Music Video)',
  'URL': 'youtube.com/watch?v=-rey3m8SWQI'},
 {'score': 0.55487275,
  'Title': 'Sean Paul - No Lie ft. Dua Lipa (Krajnc Remix) (Baywatch Official Music Video)',
  'URL': 'youtube.com/watch?v=hMiHGkzr3ZQ'},
 {'score': 0.49306965,
  'Title': 'Lana Del Rey - Music To Watch Boys To (Official Music Video)',
  'URL': 'youtube.com/watch?v=5kYsxoWfjCg'},
 {'score': 0.48478898,
  'Title': 'Smash Mouth - All Star (Official Music Video)',
  'URL': 'youtube.com/watch?v=L_jWHffIx5E'},
 {'score': 0.47906196,
  'Title': 'Iggy Azalea - Fancy ft. Charli XCX (Official Music Video)',
  'URL': 'youtube.com/watch?v=O-zpOMYRi0w'},
 {'score': 0.47792414,
  'Title': 'ZAYN - PILLOWTALK (Official Music Video)',
  'URL': 'youtube.com/watch?v=C_3d6GntKbk'},
 {'score': 0.46913695,
  'Title': 'ZAYN - Dusk Till Dawn ft. Sia (Official Music Video)',
  'URL': 'youtube.com/watch?v=tt2k8PGm-TI'},
 {'score': 0.46150804,
  'Title': 'Sia - Chandelier (Official Music Video)',
  'URL': 'youtube.com/watch?v=2vjPBrBU-TM'}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that the search API does an excellent job of retrieving the Dua Lipa videos, ranking all of them at the top. Since our search limit is 10, other videos are also returned, but their scores are much lower than those of the Dua Lipa videos. For the cosine similarity metric that we use, the higher the score, the better the match.&lt;/p&gt;

&lt;p&gt;We can set the score_threshold parameter to 0.5 in Qdrant’s search function to filter out results below that score. We then get only the top four results, even though the limit is set to 10.&lt;/p&gt;
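&lt;p&gt;The effect of such a threshold is easy to verify client-side. Taking the scores returned above, a 0.5 cutoff keeps exactly the four Dua Lipa hits:&lt;/p&gt;

```python
# the top five hits from the search above (scores as returned by Qdrant)
hits = [
    {'score': 0.8309551, 'Title': 'Dua Lipa - New Rules (Official Music Video)'},
    {'score': 0.8116781, 'Title': 'Dua Lipa - IDGAF (Official Music Video)'},
    {'score': 0.80936086, 'Title': 'Dua Lipa - Be The One (Official Music Video)'},
    {'score': 0.55487275, 'Title': 'Sean Paul - No Lie ft. Dua Lipa (Krajnc Remix)'},
    {'score': 0.49306965, 'Title': 'Lana Del Rey - Music To Watch Boys To'},
]

# mirror the effect of passing score_threshold=0.5 to client.search
filtered = [hit for hit in hits if hit['score'] >= 0.5]
print(len(filtered))  # 4
```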

&lt;h2&gt;
  
  
  Video Discovery
&lt;/h2&gt;

&lt;p&gt;Let’s proceed to use Qdrant’s Discovery API to discover music videos without explicitly searching with a query. Assume that your music search website has captured a user’s preferences, either by asking them directly or from their search history.&lt;/p&gt;

&lt;p&gt;As mentioned above, the discovery API takes in a context of positive and negative data points and searches for new points that are far away from the negative points and close to the positive points.&lt;/p&gt;

&lt;p&gt;Let’s assume a given user likes classical and instrumental music and dislikes heavy metal or rock music. As we don’t have an explicit target query, we use the context search functionality of Qdrant.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# specify likes and dislikes as positive and negative queries
negative_1 = 'heavy metal'
positive_1 = 'piano music'

negative_2 = 'rock music'
positive_2 = 'classical music'

# context search needs no target; a discovery search would additionally
# encode a target query here and pass it to client.discover via `target`

# calculate embeddings for the positive and negative points
positive_embedding_1 = model.encode(positive_1).tolist()
negative_embedding_1= model.encode(negative_1).tolist()

# calculate embeddings for the another pair of positive and negative points
positive_embedding_2 = model.encode(positive_2).tolist()
negative_embedding_2= model.encode(negative_2).tolist()

# create the context example pair
context = [models.ContextExamplePair(positive=positive_embedding_1, negative=negative_embedding_1),
          models.ContextExamplePair(positive=positive_embedding_2, negative=negative_embedding_2)]

# call the discover api
discover = client.discover(
   collection_name="youtube_music_videos",
   context=context,
   limit=5,
)

# organize the results from the discover api
results = []
for hit in discover:
   item = {}
   item['Title'] = hit.payload['Title']
   url = hit.payload['Videourl']
   item['URL'] = f'youtube.com{url}'
   results.append(item)

display(results)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{'Title': 'The computer as artist: AI art and music',
  'URL': 'youtube.com/watch?v=ZDcaDv0U8yw'},
 {'Title': 'Arts For Healing: Music and Art Therapy',
  'URL': 'youtube.com/watch?v=6By9oTQIQxQ'},
 {'Title': 'Elephants, Art and Music on the River Kwai',
  'URL': 'youtube.com/watch?v=r1uDNRzcAV0'},
 {'Title': "Art: music &amp;amp; painting - Van Gogh on Caggiano, Floridia, Boito, Mahler and Brahms' music",
  'URL': 'youtube.com/watch?v=1b8xiXKd9Kk'},
 {'Title': 'The Artist Who Paints What She Hears',
  'URL': 'youtube.com/watch?v=zbh7tAnwLCY'}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results are not perfect, but we still get some videos related to music for relaxing or healing. One likely reason is that the original dataset simply lacks many videos matching our preferences. Another is that the vectors were generated from the video titles alone, and titles do not carry much information about the videos. We could embed the Description column instead, but descriptions also contain many irrelevant details, which can further distort the vector space. Nevertheless, context search still surfaces the available videos closest to our interests and far in content from the negative examples. This shows the power of Qdrant’s Discovery API for discovering points of interest in a large, high-dimensional space of YouTube music videos.&lt;/p&gt;
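&lt;p&gt;If we wanted richer embeddings without pulling in the full noisy description, one option (a hypothetical combined_text field, not code from the original repo) is to concatenate the title with the description before encoding:&lt;/p&gt;

```python
records = [
    {'Title': 'babyfirst art and music', 'Description': 'nan'},
    {'Title': 'Improvised Piano Music and Emotional Art Therapy',
     'Description': 'every note heard and stroke of paint seen is improvised'},
]

for doc in records:
    # pandas reads empty descriptions in this dataset as the string 'nan'
    desc = '' if doc['Description'] == 'nan' else doc['Description']
    doc['combined_text'] = f"{doc['Title']}. {desc}".strip()

print(records[0]['combined_text'])  # babyfirst art and music.
```

&lt;p&gt;The combined_text value could then be passed to model.encode in place of the title alone.&lt;/p&gt;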

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this blog, we saw how to leverage Qdrant to build a music search and discovery system. Qdrant abstracts away much of the hard work, making it possible to implement such a system in a few lines of code. For this specific example, we could improve the search further by using a better embedding model, or by embedding the videos themselves.&lt;/p&gt;

&lt;p&gt;Qdrant also offers robust support for filters, which restrict searches based on metadata; discovery search complements this by imposing additional constraints within the vector space where the search is conducted.&lt;/p&gt;

</description>
      <category>qdrant</category>
      <category>recommendations</category>
      <category>musicdiscovery</category>
    </item>
    <item>
      <title>How to Build a Legal Information Retrieval Engine Using Mistral, Qdrant, and LangChain</title>
      <dc:creator>sagaruprety</dc:creator>
      <pubDate>Wed, 24 Jan 2024 15:12:09 +0000</pubDate>
      <link>https://dev.to/sagaruprety/how-to-build-a-legal-information-retrieval-engine-using-mistral-qdrant-and-langchain-4e0d</link>
      <guid>https://dev.to/sagaruprety/how-to-build-a-legal-information-retrieval-engine-using-mistral-qdrant-and-langchain-4e0d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Finding relevant legal cases is an extremely important task for lawyers, and also among the most time-consuming and labor-intensive. There is a vast trove of judicial decisions to search, and existing software mostly offers boolean and keyword-based search approaches.&lt;/p&gt;

&lt;p&gt;In this blog, we build a simple AI application that lets lawyers upload case documents and then search and analyze legal case documents related to a given new case scenario. We utilize all open-source components — the Mistral AI model, the Qdrant vector database, and the LangChain library.&lt;/p&gt;

&lt;p&gt;To run the code, you need to download the Mistral AI model (Mistral-7B) either on your local machine or in the cloud. Although we quantize the model to reduce its size, it still needs a GPU with at least 16 GB of memory. I would recommend running the code on &lt;a href="https://colab.research.google.com/"&gt;Google Colab&lt;/a&gt;, as its free-tier GPU handles this load easily.&lt;/p&gt;

&lt;p&gt;The full codebase of this tutorial is available &lt;a href="https://github.com/sagaruprety/case_law_retrieval_agent"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Legal Case Discovery Search?
&lt;/h2&gt;

&lt;p&gt;Legal case discovery is the process of identifying and gathering relevant information to support a given legal case. Technically termed case law retrieval, it is needed to analyze judicial precedents and decisions so that lawyers can advise their clients on a similar legal case. Case law is one of the two sources of law, along with statutes. While statutes are limited in size and are amended or expanded slowly, case law forms a rapidly and ever-expanding source. This process can be time-consuming and labor-intensive, especially when dealing with large volumes of data. Large language models (LLMs) can help expedite it by semantically matching queries to a broad yet relevant set of case precedents and statutes. This can help not only legal professionals but also laypersons, who can get a preliminary understanding of similar cases and their outcomes before deciding to proceed with their own case and hire a lawyer.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;This tutorial utilizes LLMs and the Retrieval Augmented Generation (RAG) architecture to build a search agent over case law documents. We build a traditional retrieval component using a vector database to narrow down the large number of case documents based on a user query. The filtered document chunks are then passed to the LLM along with the query. The reasoning and semantic understanding capabilities of LLMs help extract the exact answer to the query.&lt;/p&gt;
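&lt;p&gt;The retrieve-then-read flow can be sketched with a toy, stdlib-only retriever (keyword overlap standing in for the vector-similarity search that Qdrant performs; the function and document names below are hypothetical):&lt;/p&gt;

```python
def retrieve(query, documents, k=2):
    # rank documents by naive keyword overlap with the query
    # (a stand-in for the vector-similarity retrieval Qdrant performs)
    q_terms = set(query.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_terms.intersection(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query, documents):
    # retrieval-augmented generation: retrieved chunks plus the query form the LLM prompt
    context = "\n\n".join(retrieve(query, documents))
    return f"Use the context to answer the question.\n\nContext:\n{context}\n\nQuestion: {query}"

docs = [
    "Tender dispute between the electricity board and a supplier.",
    "Property inheritance case between two siblings.",
]
prompt = build_prompt("electricity board tender dispute", docs)
print(prompt.splitlines()[0])  # Use the context to answer the question.
```

&lt;p&gt;In the real pipeline below, Qdrant plays the role of retrieve and Mistral-7B consumes the assembled prompt.&lt;/p&gt;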

&lt;h2&gt;
  
  
  About Mistral
&lt;/h2&gt;

&lt;p&gt;Mistral-7B is a relatively recent open-source large language model developed by &lt;a href="https://mistral.ai/"&gt;Mistral AI&lt;/a&gt;, a French startup that has gained attention for outperforming the popular Llama 2 models. Specifically, the 7-billion-parameter version of Mistral is reported to &lt;a href="https://mistral.ai/news/announcing-mistral-7b/"&gt;outperform&lt;/a&gt; the 13-billion-parameter version of Llama 2 and the 34-billion-parameter version of Llama 1, a significant milestone in generative AI, as it means improved latency without sacrificing model performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Qdrant
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://qdrant.tech/"&gt;Qdrant&lt;/a&gt; is an open-source vector search engine that enables fast and efficient similarity search. It is designed to work with high-dimensional data, making it suitable for use with large language models like Mistral. The integration of Qdrant in the architecture aims to enhance the search capabilities for legal case discovery, allowing for quick and accurate retrieval of relevant information from a large corpus of legal documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Langchain
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.langchain.com/"&gt;Langchain&lt;/a&gt; is an open-source library used for building solutions using LLMs. It offers ready made modules and functions to iteratively build prompts and connect to vector databases. It also offers intuitive syntax to chain together all these components to input into an LLM.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Dataset
&lt;/h2&gt;

&lt;p&gt;The dataset used in this tutorial was constructed as part of the &lt;a href="https://sites.google.com/view/fire-2019-aila/"&gt;Artificial Intelligence for Legal Assistance Track&lt;/a&gt; at FIRE 2019 conference, which is an important conference in the discipline of Information Retrieval. It can be downloaded from &lt;a href="https://www.kaggle.com/datasets/ananyapam7/legalai"&gt;here&lt;/a&gt;. It contains thousands of case law documents, but one can consider a subset of 500 documents (files c1.txt to c500.txt in the Object_casedocs directory) for the purpose of this tutorial.&lt;/p&gt;
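&lt;p&gt;Selecting the c1.txt to c500.txt subset can be sketched with the standard library (the directory below is simulated; with the real dataset you would point the glob at the extracted Object_casedocs directory):&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# simulate a small Object_casedocs directory (stand-in for the extracted Kaggle dataset)
root = Path(tempfile.mkdtemp())
for i in range(1, 6):
    (root / f"c{i}.txt").write_text(f"case document {i}")

# keep only files whose numeric suffix falls in the c1..c500 range
subset = [p for p in sorted(root.glob("c*.txt")) if int(p.stem[1:]) in range(1, 501)]
print(len(subset))  # 5
```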

&lt;h2&gt;
  
  
  Guide to building the app
&lt;/h2&gt;

&lt;p&gt;We first start with installing the required libraries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install -q -U transformers langchain bitsandbytes qdrant-client
!pip install -q -U sentence-transformers accelerate unstructured
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Most of these libraries will become visible as and when they are imported and used. The exceptions are &lt;strong&gt;&lt;em&gt;accelerate&lt;/em&gt;&lt;/strong&gt;, which is used under the hood by the transformers model pipeline to map the loaded model onto the GPU or CPU depending on availability, and the &lt;strong&gt;&lt;em&gt;unstructured&lt;/em&gt;&lt;/strong&gt; library, which is utilized by LangChain’s data loader when loading all files from a directory.&lt;/p&gt;

&lt;p&gt;Next, load all required libraries, modules and functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import torch
from operator import itemgetter
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from langchain.vectorstores import Qdrant
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_community.document_loaders import DirectoryLoader
from langchain.embeddings import HuggingFaceEmbeddings
file_path = '/content/drive/MyDrive/Colab Notebooks/data/Object_casedocs_500/'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start first by loading the Mistral model and quantizing it to 4 bit weights using the bitsandbytes library.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# preparing config for quantizing the model into 4 bits
quantization_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_compute_dtype=torch.float16,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
)

# load the tokenizer and the quantized mistral model
model_id = "mistralai/Mistral-7B-Instruct-v0.2"

model_4bit = AutoModelForCausalLM.from_pretrained(
             model_id, 
             device_map="auto",
             quantization_config=quantization_config,)

tokenizer = AutoTokenizer.from_pretrained(model_id)

# using HuggingFace's pipeline
pipe = pipeline(
       "text-generation",
       model=model_4bit,
       tokenizer=tokenizer,
       use_cache=True,
       device_map="auto",
       max_new_tokens=5000,
       do_sample=True,
       top_k=1,
       temperature=0.01,
       num_return_sequences=1,
       eos_token_id=tokenizer.eos_token_id,
       pad_token_id=tokenizer.eos_token_id,
)
# assign to a new name so we don't shadow the imported pipeline function
model = HuggingFacePipeline(pipeline=pipe)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
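&lt;p&gt;To see what 4-bit quantization does numerically, here is a toy affine-quantization sketch (bitsandbytes’ NF4 scheme is non-uniform and more sophisticated, so this is only an illustration): each weight is snapped to one of 16 levels and later reconstructed with a small, bounded error.&lt;/p&gt;

```python
def quantize_4bit(weights):
    # map each float onto one of 16 evenly spaced levels (4 bits) between min and max
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo

def dequantize(codes, scale, lo):
    # reconstruct approximate weights from the integer codes
    return [c * scale + lo for c in codes]

weights = [-0.42, -0.1, 0.0, 0.07, 0.35]
codes, scale, lo = quantize_4bit(weights)
restored = dequantize(codes, scale, lo)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes)    # integers between 0 and 15
print(max_err)  # at most half a quantization step
```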



&lt;p&gt;Now that we have the LLM ready, let’s get the fodder for the LLM — legal case documents. We first load the documents from disk and then define a text splitter to break them into manageable chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# load the legal case documents and define text splitter
loader = DirectoryLoader(file_path)
docs = loader.load()
print(len(docs))

text_splitter = RecursiveCharacterTextSplitter(
   chunk_size=1000,
   chunk_overlap=20,
   length_function=len,
   is_separator_regex=False,
)

docs = text_splitter.split_documents(docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
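&lt;p&gt;To make the chunk_size and chunk_overlap parameters concrete, here is a simplified, stdlib-only chunker (RecursiveCharacterTextSplitter additionally tries to break on paragraph, sentence, and word boundaries rather than at fixed offsets):&lt;/p&gt;

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    # slide a window of chunk_size characters; consecutive chunks share chunk_overlap characters
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 2500
chunks = chunk_text(doc)
print([len(c) for c in chunks])  # [1000, 1000, 540]
```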



&lt;p&gt;The standard next step is the key part of RAG — defining the embedding model, embedding the documents, and indexing them into a vector store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# define the embedding model
emb_model = "sentence-transformers/all-MiniLM-L6-v2"
embeddings = HuggingFaceEmbeddings(
   model_name=emb_model,
   cache_folder=os.getenv('SENTENCE_TRANSFORMERS_HOME'))

qdrant_collection = Qdrant.from_documents(
   docs,
   embeddings,
   location=":memory:",  # Local mode with in-memory storage only
   collection_name="legal_case_docs",
)
# construct a retriever on top of the vector store
qdrant_retriever = qdrant_collection.as_retriever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can check how well the qdrant retriever has indexed the legal case documents by querying something:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;qdrant_retriever.invoke('Cite me a dispute related to electricity board tender')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Document(page_content="The Chhattisgarh State Electricity Board (for short 'the CSEB') issued an advertisement inviting tender (NIT) bearing No. T- 136/2004 dated 02.06.2004 for its work at Hasedeo Thermal Power Station (Korba West) towards Designing, Engineering, Testing, Supply, Erection &amp;amp; Commission of HEA Ignition system. The applications received there under…", metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/Object_casedocs_500/C21.txt'}),
 Document(page_content='25. In the present case, the respondent no.1 challenged the impugned advertisement dated 6.12.2004 issued by the Nagar Nigam. We have carefully perused the said advertisement and find no illegality in the same. It has been held by this Court in several decisions that the Court should not ordinarily interfere with the terms mentioned in such an advertisement. Thus in Global Energy Ltd. and Anr. vs. Adani Exports Ltd. and Ors. 2005(4) SCC 435 2005 Indlaw SC 384 this Court observed at para 11:\n\n"The principle is, therefore, well settled…"', metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/Object_casedocs_500/C401.txt'}),
 Document(page_content="Ram, General Manager of the said Power Station furnished his report dated 28.12.2004 wherein it was summed up that due to the defects in the scanning system, supplied by the respondent, generation had been adversely effected and the said Electricity Board was not satisfied with the equipment supplied by the respondent. In spite of the aforesaid material, the tender Committee acted with caution and even the technical expertise was sought…", metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/Object_casedocs_500/C21.txt'}),]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that the retriever returns the document chunks whose embeddings most closely match the query. Inspecting the chunks, we do find mentions of cases related to electricity boards and tenders.&lt;/p&gt;

&lt;p&gt;But we want a bit more than search: we want specific question answering and reasoning over these documents, in order to glean new insights and curate new ideas.&lt;/p&gt;

&lt;p&gt;We want our engine to answer more specific queries regarding the cases.&lt;/p&gt;

&lt;p&gt;In the next step, we therefore chain together these retrieved document chunks and pass them on to the LLM. Here we start using the Langchain Expression Language (LCEL), a declarative syntax in Langchain for easily composing LLM input chains.&lt;/p&gt;

&lt;p&gt;An essential concept in LCEL is the Runnable, an atomic unit of work which can be invoked, batched, or streamed. It can also be understood as something which can be ‘run’ (through an LLM). In Pythonic terms, a Runnable is a class. LCEL has two main composite Runnables: RunnableSequence, which invokes a series of runnables sequentially, with one runnable’s output serving as the next one’s input, and RunnableParallel, which invokes runnables concurrently, providing the same input to each.&lt;/p&gt;
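&lt;p&gt;To make the two composition modes concrete, here is a minimal pure-Python sketch (not Langchain code; the helper names are illustrative) of how sequential and parallel composition behave:&lt;/p&gt;

```python
# Conceptual sketch of LCEL's two composition modes, without Langchain.
# A "runnable" here is simply a function of one input.

def run_sequence(runnables, value):
    # RunnableSequence-style: each runnable's output feeds the next one's input
    for r in runnables:
        value = r(value)
    return value

def run_parallel(runnables, value):
    # RunnableParallel-style: every runnable receives the same input,
    # and the results are collected into a dict
    return {name: r(value) for name, r in runnables.items()}

# sequential composition: strip whitespace, then lowercase
print(run_sequence([str.strip, str.lower], "  Hello LCEL  "))  # hello lcel

# parallel composition: build {"context": ..., "question": ...} from one query
print(run_parallel(
    {"context": str.upper, "question": lambda q: q},
    "tender dispute",
))  # {'context': 'TENDER DISPUTE', 'question': 'tender dispute'}
```

&lt;p&gt;This mirrors how the chain below fans the user query out to the retriever (to build the context) and to a passthrough (to keep the question).&lt;/p&gt;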

&lt;p&gt;The prompt used below takes two inputs: the context of retrieved documents and a user question. We first set up these two inputs using the RunnableParallel class, which takes the input question and, in parallel, creates the context as the output of the retriever. Note the RunnablePassthrough class, which simply passes the invocation input through unchanged.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# define prompt template
template = """&amp;lt;s&amp;gt;[INST] You are a helpful, respectful and honest legal assistant.
Your task is to assist lawyers in legal case discovery.
Answer the question below using the context given below.
{context}
{question} [/INST] &amp;lt;/s&amp;gt;
"""

# create the prompt from the above template
prompt = ChatPromptTemplate.from_template(template)

# combine document chunks into one
def format_docs(docs):
   return "\n\n".join(doc.page_content for doc in docs)

# setup the context and question part of the chain
setup_and_retrieval = RunnableParallel(
   {"context": qdrant_retriever| format_docs, "question": RunnablePassthrough()})

# extend the chain to include the prompt, model, and output parser
rag_chain = setup_and_retrieval | prompt | model | StrOutputParser()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Let’s start with a bit more complex query than what we passed to the qdrant retriever:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain.invoke("Cite me a dispute related to electricity board tender")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In this dispute, the Chhattisgarh State Electricity Board (CSEB) invited tenders for the Designing, Engineering, Testing, Supply, Erection &amp;amp; Commission of HEA Ignition system at Hasedeo Thermal Power Station (Korba West). The respondent, M/s Control Electronics India (CEI), submitted an application for the tender documents but it was rejected due to incomplete documents, specifically the non-submission of documentary evidence of past performance and experience. The respondent then complained against the appellant for not issuing the tender documents.\n\nHowever, the present case is not about the initial rejection of the respondent's application. Instead, it revolves around the respondent's allegations that the records were fabricated and the tender document was not opened and returned in furtherance of official duties by the appellant.\n\nThe Court has carefully examined the impugned advertisement issued by the Nagar Nigam and found no illegality in it. The Court has also considered th...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, one may say that this is not too different from the response of the Qdrant retriever. Indeed, the retriever returned a list of candidate documents and the LLM response picks the most relevant of them. However, the utility of the LLM goes further: it can answer questions from within a document, as we see below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain.invoke("In the dispute related to electricity board tender, what was the outcome?")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;In this dispute, the respondent had filed a civil suit challenging the decision of the Electricity Board in returning his tender documents due to non-compliance with the pre-qualifying conditions. However, he withdrew the suit, leading to its dismissal for non-prosecution. The respondent's attempt to challenge the decision of the Tender Committee in not considering his tender was unfaulted due to the constructive res judicata effect of the withdrawn suit. The tender of the respondent was rejected due to the defects in the scanning system supplied by him, which adversely affected the generation at Patratu Thermal Power Station. The Tender Committee sought expert opinions and rejected the respondent's tender based on the reports received. The allegations of fabricating records made by the respondent were considered mischievous and an afterthought. The appellant, R.C. Jain, was deputed to verify the claim of the respondent, and he reported that the works carried out by the respondent at Patratu Thermal Power Station were not satisfactory. Based on this information, the respondent's tender document was not opened and returned.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Lastly, you can also construct your chain to return the source documents. You just need to pass them through the whole chain.&lt;/p&gt;



&lt;p&gt;As mentioned earlier, RunnablePassthrough simply takes the input and passes it through. We can use its &lt;strong&gt;assign&lt;/strong&gt; function to add extra arguments to it. The cells below build a composite chain: although &lt;strong&gt;&lt;em&gt;rag_chain_from_docs&lt;/em&gt;&lt;/strong&gt; is defined first, it executes second. On invocation, &lt;strong&gt;&lt;em&gt;rag_chain_with_source&lt;/em&gt;&lt;/strong&gt; constructs the context from the &lt;strong&gt;&lt;em&gt;question&lt;/em&gt;&lt;/strong&gt; and the &lt;strong&gt;&lt;em&gt;retriever&lt;/em&gt;&lt;/strong&gt;, passes it to &lt;strong&gt;&lt;em&gt;rag_chain_from_docs&lt;/em&gt;&lt;/strong&gt;, and assigns that chain’s output to a new variable, &lt;strong&gt;&lt;em&gt;answer&lt;/em&gt;&lt;/strong&gt;. The final output thus has three variables: the question, the LLM answer, and all the retrieved documents as context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain_from_docs = (
   RunnablePassthrough.assign(context=(lambda x: format_docs(x["context"])))
   | prompt
   | model
   | StrOutputParser()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain_with_source = RunnableParallel(
   {"context": qdrant_retriever, "question": RunnablePassthrough()}
).assign(answer=rag_chain_from_docs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rag_chain_with_source.invoke("Cite me a dispute related to electricity board tender")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{'context': [Document(page_content="The Chhattisgarh State Electricity Board (for short 'the CSEB') issued an advertisement inviting tender (NIT) bearing No. T- 136/2004 dated 02.06.2004 for its work at Hasedeo Thermal Power Station (Korba West) towards Designing, Engineering, Testing, Supply, Erection &amp;amp; Commission of HEA Ignition system. The applications received there under were required to be processed in three stages successively namely…", metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/Object_casedocs_500/C21.txt'}),
  Document(page_content='15. As already pointed above, tender was floated by the CSEB and the CEI herein was one of the parties who had submitted its bid through the respondent. However, tender conditions mentioned certain conditions and it was necessary to fulfill those conditions to become eligible to submit the bid and have it considered. As per the appellants, tender of the respondent was rejected on the ground that plant…', metadata={'source': '/content/drive/MyDrive/Colab Notebooks/data/Object_casedocs_500/C21.txt'})],

 'question': 'Cite me a dispute related to electricity board tender',

 'answer': "In this dispute, the Chhattisgarh State Electricity Board (CSEB) invited tenders for the Designing, Engineering, Testing, Supply, Erection &amp;amp; Commission of HEA Ignition system at Hasedeo Thermal Power Station (Korba West). The respondent, M/s Control Electronics India (CEI), submitted an application for the tender documents but it was rejected due to incomplete documents, specifically the non-submission of documentary evidence.."}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We have built a simple case law retrieval system in a few lines of code, which proves effective not only at retrieving legal precedents relevant to a query, but also at answering specific questions about those precedents. The secret sauce lies in the capability of LLMs, which have been trained on billions of text tokens, and in Qdrant’s ability to index and retrieve documents with very low latency.&lt;/p&gt;

&lt;p&gt;We hope that this serves as a starting point for a larger case retrieval and discovery engine, which can index and query millions or billions of legal precedents, so that anyone seeking legal advice and justice has access to their own &lt;a href="https://youtu.be/ImEnWAVRLU0?si=TwvaaET6078NVymz"&gt;Mike Ross from Suits&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>langchain</category>
      <category>legaltech</category>
      <category>qdrant</category>
    </item>
    <item>
      <title>Semantic Search Through Resumes for HR Tech Startups Using Qdrant</title>
      <dc:creator>sagaruprety</dc:creator>
      <pubDate>Fri, 12 Jan 2024 13:31:51 +0000</pubDate>
      <link>https://dev.to/sagaruprety/semantic-search-through-resumes-for-hr-tech-startups-using-qdrant-5gn</link>
      <guid>https://dev.to/sagaruprety/semantic-search-through-resumes-for-hr-tech-startups-using-qdrant-5gn</guid>
      <description>&lt;h2&gt;
  
  
  Resume Filtering &amp;amp; Its Challenges
&lt;/h2&gt;

&lt;p&gt;Resume filtering is a common practice in companies that are inundated with resumes for a handful of positions. Most of it is achieved through keyword-based filtering software. While keyword-based filtering is useful, there is always the risk of filtering out good resumes that don’t fit the filtering algorithm’s rules. Hence, it is important that any resume filtering software looks at resumes like a recruiter would: not only focusing on specific terms, but trying to understand how the experience and skill set of a given candidate fits the job description based on the semantics of the resume text. In AI parlance, one needs to perform semantic search and filtering on resumes.&lt;/p&gt;
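&lt;p&gt;As a toy illustration (invented data and a made-up helper, not this tutorial's code), here is how a literal keyword filter drops a relevant resume that semantic search would catch:&lt;/p&gt;

```python
# Why exact keyword filtering can drop good resumes: synonyms never match.
resumes = {
    "a.pdf": "Senior software developer, built Java microservices",
    "b.pdf": "Software engineer experienced in Python and cloud",
}

def keyword_filter(resumes, keyword):
    # keep only resumes containing the literal keyword
    return [name for name, text in resumes.items()
            if keyword.lower() in text.lower()]

print(keyword_filter(resumes, "engineer"))  # ['b.pdf']
# a.pdf is missed even though "developer" describes the same role;
# semantic search matches both via embedding similarity, not exact tokens.
```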

&lt;p&gt;This blog will take you through the building of a semantic search LLM agent for question-answering and analysis on a collection of resumes. The agent is implemented using OpenAI models, Langchain, and utilizes the Qdrant vector store for semantic search capability. &lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing the Semantic Search
&lt;/h2&gt;

&lt;p&gt;You will need the following libraries to follow and implement this tutorial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install qdrant-client langchain openai pypdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also view the full notebook in this &lt;a href="https://github.com/sagaruprety/resume_search_LLM_agent"&gt;github repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, one has to download the openly available &lt;a href="https://www.kaggle.com/datasets/snehaanbhawal/resume-dataset"&gt;dataset of resumes from Kaggle&lt;/a&gt;. Note that this dataset contains thousands of resumes in different folders divided by the domain of the job. We will focus on the Information-Technology domain for the purpose of this tutorial.&lt;/p&gt;

&lt;p&gt;We start with importing the required libraries and set the OpenAI API key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import os
import getpass
from operator import itemgetter
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain.vectorstores import Qdrant
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings.openai import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel
from langchain.schema import format_document
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you have downloaded the resumes and stored them on disk (assumed location: ../data), we need to load them into memory. However, the resumes are in PDF format, which stores text in a layout-oriented binary format rather than as plain text. We need a PDF parser to extract the text from each file.&lt;/p&gt;

&lt;p&gt;Fortunately, Langchain has a built-in module which not only extracts all the text from pdfs, but also loads them into the memory.&lt;/p&gt;

&lt;p&gt;Note: This is why you needed to install PyPDF as shown above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;loader = PyPDFDirectoryLoader("../data/INFORMATION-TECHNOLOGY")
docs = loader.load()
print(len(docs))
# 247
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We now have the data we need to give to our LLM agent. We proceed to build a semantic search application. Semantic search is different from traditional search, which is based on keyword matches between query tokens and document tokens. Semantic search matches queries to documents based on the meaning of the query and its tokens. For example, it can match the word ‘car’ in a query to ‘automobile’ or ‘vehicle’ in documents, or match the word ‘bank’ to documents according to the meaning expressed in the rest of the query (river bank or financial bank).&lt;/p&gt;

&lt;p&gt;Now, semantic search first involves vectorizing the documents and indexing them appropriately. We also vectorize the incoming query and then perform an optimized search over the vector space via some similarity metric.&lt;/p&gt;
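&lt;p&gt;To make the similarity metric concrete, here is a small sketch of cosine similarity over toy 3-dimensional vectors (the numbers are invented; real embeddings have hundreds or thousands of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

car = [0.9, 0.1, 0.3]            # hypothetical embedding of "car"
automobile = [0.85, 0.15, 0.35]  # hypothetical embedding of "automobile"
river = [0.1, 0.9, 0.2]          # hypothetical embedding of "river"

# "car" scores far higher against "automobile" than against "river"
print(round(cosine_similarity(car, automobile), 3))  # 0.996
print(round(cosine_similarity(car, river), 3))       # 0.271
```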

&lt;p&gt;Qdrant vector store &lt;a href="https://qdrant.tech/documentation/tutorials/search-beginners/"&gt;takes care of all these steps&lt;/a&gt; and has a super smooth integration with Langchain. To begin, we first vectorize the documents using OpenAI embeddings and create a Qdrant collection using the &lt;strong&gt;from_documents&lt;/strong&gt; function of the Qdrant class we imported in the beginning.&lt;/p&gt;

&lt;p&gt;Qdrant also provides functionality to be used directly as a &lt;strong&gt;retriever&lt;/strong&gt;, a Langchain construct which uses the vector store to retrieve documents for a query via similarity search. In one line of code, we can abstract the retrieval process: we pass the input query to this retriever and it returns the relevant documents, which are in turn passed to the LLM as prompt context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# initialise embeddings used to convert text to vectors
embeddings = OpenAIEmbeddings()
# create a qdrant collection - a vector based index of all resumes
qdrant_collection = Qdrant.from_documents(
   docs,
   embeddings,
   location=":memory:",  # Local mode with in-memory storage only
   collection_name="it_resumes",
)
# construct a retriever on top of the vector store
qdrant_retriever = qdrant_collection.as_retriever()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that we have embedded and indexed the resumes in Qdrant and built a retriever, we proceed to build the LLM chain. We start with a custom prompt template which takes two input variables: resume, composed of retrieved resume chunks, and question, the user query. We initialize the gpt-3.5-turbo-16k model because its 16k context window allows us to send larger resume chunks to the LLM, and we set the temperature to 0 to minimize randomness in the LLM’s outputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Now we define and initialise the components of an LLM chain, beginning with the prompt template and the model.
template = """You are a helpful assistant to a recruiter at a technology firm. You will be provided the following input context \
from a dataset of resumes of IT professionals.
Answer the question based only on the context. Also provide the source documents.
{resume}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model='gpt-3.5-turbo-16k-0613')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We use Langchain Expression Language (LCEL) to link together different components of the chain. We first collect the retrieved resumes from Qdrant. The RunnablePassthrough function takes in the text, which is passed when the chain is invoked. This is the user query, which is therefore passed onto the Qdrant retriever to perform the semantic search. The question variable is also similarly assigned.&lt;/p&gt;

&lt;p&gt;These two components are then passed on to the prompt which is, in turn, passed on to our model initialized earlier. Finally, we chain a string output parser module available in Langchain, which ensures that the output of the chain is a well-formatted string.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Construct the chain, with two variables - resume and question to be passed onto the prompt.
chain = (
   {"resume": RunnablePassthrough() | qdrant_retriever, "question": RunnablePassthrough()}
   | prompt
   | model
   | StrOutputParser()
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s set the chain in motion:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chain.invoke('Which resumes mention both Java and HTML?')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The resumes that mention both Java and HTML are:
1. Document(source='../data/INFORMATION-TECHNOLOGY/83816738.pdf')
2. Document(source='../data/INFORMATION-TECHNOLOGY/12334140.pdf')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We see that the agent is able to understand the query, process it, and return the corresponding resumes from the resume dataset.&lt;/p&gt;



&lt;p&gt;Let us try more queries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chain.invoke('Which resumes mention working with Cisco technologies.')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The resumes that mention working with Cisco technologies are:
1. Document 1: The first document mentions administering CISCO multi-VLAN layer 3 switched LAN/WAN, designing and administering CISCO VPN solution, and working with CISCO switches.
2. Document 2: The second document mentions skills in Cisco switching and Cisco routers.
3. Document 4: The fourth document mentions experience with Cisco hardware and software business solutions, Cisco command line IOS, and Cisco switches.
Sources:
- Document 1: ../data/INFORMATION-TECHNOLOGY/20879311.pdf
- Document 2: ../data/INFORMATION-TECHNOLOGY/91635250.pdf
- Document 4: ../data/INFORMATION-TECHNOLOGY/31243710.pdf
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent not only returns the resume file names, but also explains how each resume answers the query. Note that the document numbering is not sequential because these are document numbers from within the retrieved context provided in the prompt. The semantic search retrieved four documents and, therefore, the answer is based on those four documents.&lt;/p&gt;

&lt;p&gt;Let’s make a slightly more abstract query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chain.invoke('Find me some resumes who have recent experience at Director level')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Based on the provided input context, here are some resumes of IT professionals who have recent experience at the Director level:
1. Resume: INFORMATION TECHNOLOGY DIRECTOR
 - Experience: April 1999 to January 2015
 - Company Name: Not specified
 - Source: ../data/INFORMATION-TECHNOLOGY/24038620.pdf
2. Resume: INFORMATION TECHNOLOGY SENIOR MANAGER
 - Experience: April 2013 to February 2015
 - Company Name: Not specified
 - Source: ../data/INFORMATION-TECHNOLOGY/17681064.pdf
Please note that the specific company names are not provided in the input context.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent consistently answers the queries and returns source documents. All it took was specifying this instruction in the prompt itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this tutorial, we have built an agent for similarity search through resumes. We have used the Qdrant database for similarity search, which performs the tasks of converting documents to embeddings, indexing the document embeddings, performing optimized similarity search, and retrieving the relevant documents. This helps us in sending only the most relevant context to the LLM, thus saving us multiple LLM calls and token costs.&lt;/p&gt;

&lt;p&gt;This agent is built as a simple chain. One can add more functionality like customizing text splitting and document chunking. One can also utilize an advanced feature in Qdrant called &lt;a href="https://qdrant.tech/articles/sparse-vectors/"&gt;sparse vectors&lt;/a&gt; for hybrid search. This can enable keyword based search for simple queries, which can help you avoid LLM calls for exact match queries (e.g. get all resumes for candidates located in Germany), and semantic search for more subjective queries, like the ones discussed in the blog.&lt;/p&gt;
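&lt;p&gt;The hybrid-search routing idea can be sketched in a few lines; the heuristic and names below are illustrative assumptions, not Qdrant or Langchain APIs:&lt;/p&gt;

```python
# Toy router: send exact-match style queries to keyword (sparse-vector)
# search and open-ended queries to semantic (dense-vector) search.

def route_query(query):
    # crude heuristic: phrases that signal an exact filter
    exact_markers = ("all resumes", "located in", "exactly")
    if any(marker in query.lower() for marker in exact_markers):
        return "keyword"
    return "semantic"

print(route_query("get all resumes for candidates located in Germany"))  # keyword
print(route_query("who has leadership experience in cloud migrations?"))  # semantic
```

&lt;p&gt;A production system would use a classifier or Qdrant’s hybrid scoring rather than string matching, but the routing structure is the same.&lt;/p&gt;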

</description>
      <category>llm</category>
      <category>qdrant</category>
      <category>langchain</category>
      <category>hrtech</category>
    </item>
    <item>
      <title>Boosting Your Blog Recommendations with Flowise &amp; Qdrant: A Step-by-Step Guide</title>
      <dc:creator>sagaruprety</dc:creator>
      <pubDate>Wed, 27 Dec 2023 11:02:31 +0000</pubDate>
      <link>https://dev.to/sagaruprety/boosting-your-blog-recommendations-with-flowise-qdrant-a-step-by-step-guide-nnb</link>
      <guid>https://dev.to/sagaruprety/boosting-your-blog-recommendations-with-flowise-qdrant-a-step-by-step-guide-nnb</guid>
      <description>&lt;p&gt;Looking to incorporate a recommendation module on your blog website? Something that would help you recommend high-quality, personalized blogs to your users based on their interests and preferences? Don’t have the time to acquaint yourself with machine learning algorithms, models, and Python programming? This tutorial will help you out in building a high-quality blog recommendation application by using the latest open-source no-code tools and large language models.&lt;/p&gt;

&lt;p&gt;The code used in this tutorial can also be followed &lt;a href="https://github.com/sagaruprety/blogs/tree/main/blog_recommender_flowise_qdrant"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Specifically, we are going to use Flowise — a no-code drag &amp;amp; drop graphical tool with the aim to make it easy for people to visualize and build LLM apps, and Qdrant — an open-source vector search engine and vector database.&lt;/p&gt;

&lt;p&gt;We need a vector database for this task because we will recommend from a collection of blog articles. This collection can be too large to feed to an LLM, because of limited context windows and the cost per input token. We therefore use the Retrieval Augmented Generation (RAG) technique: first chunk, embed, and index the blogs in a vector database, and then retrieve a smaller, more precise set of blogs to pass to the LLM to recommend from.&lt;/p&gt;
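&lt;p&gt;The chunking step of that pipeline can be sketched as a simple overlapping character window (a hedged sketch: Flowise's text-splitter node does this for you, and the sizes are illustrative):&lt;/p&gt;

```python
# A blog article is split into overlapping character windows, so
# neighbouring chunks share a little context across the boundary.

def chunk_text(text, chunk_size=1000, overlap=20):
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

article = "some long blog article text " * 200   # 5600 characters
chunks = chunk_text(article)
print(len(chunks))     # 6
print(len(chunks[0]))  # 1000
```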

&lt;p&gt;While there are many vector databases out there to choose from, &lt;a href="https://qdrant.tech/"&gt;Qdrant&lt;/a&gt; is a complete vector database for building LLM applications. It is open-source, and also provides a managed cloud service. It also doubles up as a hybrid and full-text search engine, benefiting from its &lt;a href="https://qdrant.tech/articles/sparse-vectors/"&gt;sparse-vectors&lt;/a&gt; capability. Moreover, it offers metadata filtering, in-built embeddings creation (both text and image), sharding, disk-based indexing, and easily integrates with LangChain and LlamaIndex.&lt;/p&gt;

&lt;p&gt;Following are the steps you need to take to ensure you have all the tools at your disposal before we begin using them to build the blog recommender.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You need to ensure that you have an OpenAI API key.&lt;/li&gt;
&lt;li&gt;You need to obtain a Qdrant API key.&lt;/li&gt;
&lt;li&gt;You need to install Flowise in your machine or cloud.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Install Flowise (&lt;a href="https://docs.flowiseai.com/getting-started"&gt;Official documentation&lt;/a&gt;)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Download and Install &lt;a href="https://nodejs.org/en/download"&gt;NodeJS &amp;gt;= 18.15.0&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Install Flowise:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npm install -g flowise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start Flowise:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;npx flowise start
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Open &lt;a href="http://localhost:3000"&gt;http://localhost:3000&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  You can now see a similar page open up in your browser
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iVT2jTy7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6thodamlx4co409hjhnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iVT2jTy7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6thodamlx4co409hjhnc.png" alt="Flowise dashboard home" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The chatflow is the place where you plug in different components to create an LLM app. Notice that on the left-hand pane, apart from chatflows, you have other options as well. You can explore the marketplace to use some ready-made LLM app flows, e.g. a conversation agent, a QnA agent, etc.&lt;/p&gt;

&lt;p&gt;For now, let’s try out the Flowise Docs QnA template. Click on it and you can see the template as connected blocks. These blocks, also called nodes, are essential components of any LLM app. Think of them as the functions one would use when programmatically creating an LLM app via Langchain or LlamaIndex.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6mXVJcPS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tg8aow07do8me7g7k8d8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6mXVJcPS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tg8aow07do8me7g7k8d8.png" alt="Flowise QnA chatflow" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These nodes are explained herein:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A text splitter chunks large documents; you can specify the relevant parameters, like chunk size and chunk overlap.&lt;/li&gt;
&lt;li&gt;The text splitter is connected to a document source, in this case the Flowise GitHub repo. You need to specify the connect credential, which is essentially any form of authorization, such as an API key, needed to access the source documents.&lt;/li&gt;
&lt;li&gt;There is an embedding model, in this case OpenAI embeddings and therefore you need the OpenAI API key as connect credential.&lt;/li&gt;
&lt;li&gt;The embedding model and the document source is connected to the vector store, wherein the chunked and embedded documents are indexed and stored for retrieval.&lt;/li&gt;
&lt;li&gt;We also have the LLM model, in this case the ChatOpenAI model from OpenAI.&lt;/li&gt;
&lt;li&gt;The LLM and the output of the vector store are input to the Conversational Retrieval QA chain, which is a chain of prompts meant to perform the required task: chatting with the LLM over the Flowise documentation.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On the top right, you can see the Use template button. We can use this template as a starting point of our recommendation app.&lt;/p&gt;

&lt;p&gt;As we can see, the first thing we need to build a recommendation system is a set of documents: a pool of blog articles from which our LLM agent can recommend.&lt;br&gt;
One can either scrape blogs from the internet using a scraper node like &lt;strong&gt;Cheerio Web Scraper&lt;/strong&gt; in Flowise, or load a collection of blogs from disk via a document loader.&lt;/p&gt;

&lt;p&gt;Fortunately, I found a well-scraped, clean collection of Medium blogs on diverse topics in &lt;a href="https://huggingface.co/datasets/fabiochiu/medium-articles"&gt;Hugging Face datasets&lt;/a&gt;. Following this link and going to the Files and versions tab, you can download the 1.04 GB file named &lt;strong&gt;medium_articles.csv&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the cells below, I show how to use this file to create the LLM blog recommendation agent.&lt;/p&gt;

&lt;p&gt;First, we need to import the libraries required to load and process the dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;

&lt;span class="c1"&gt;# replace the file path as appropriate
&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./data/medium_articles.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;title&lt;/th&gt;
      &lt;th&gt;text&lt;/th&gt;
      &lt;th&gt;url&lt;/th&gt;
      &lt;th&gt;authors&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;tags&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Mental Note Vol. 24&lt;/td&gt;
      &lt;td&gt;Photo by Josh Riemer on Unsplash\n\nMerry Chri...&lt;/td&gt;
      &lt;td&gt;https://medium.com/invisible-illness/mental-no...&lt;/td&gt;
      &lt;td&gt;['Ryan Fan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 03:38:10.479000+00:00&lt;/td&gt;
      &lt;td&gt;['Mental Health', 'Health', 'Psychology', 'Sci...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;1&lt;/th&gt;
      &lt;td&gt;Your Brain On Coronavirus&lt;/td&gt;
      &lt;td&gt;Your Brain On Coronavirus\n\nA guide to the cu...&lt;/td&gt;
      &lt;td&gt;https://medium.com/age-of-awareness/how-the-pa...&lt;/td&gt;
      &lt;td&gt;['Simon Spichak']&lt;/td&gt;
      &lt;td&gt;2020-09-23 22:10:17.126000+00:00&lt;/td&gt;
      &lt;td&gt;['Mental Health', 'Coronavirus', 'Science', 'P...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;2&lt;/th&gt;
      &lt;td&gt;Mind Your Nose&lt;/td&gt;
      &lt;td&gt;Mind Your Nose\n\nHow smell training can chang...&lt;/td&gt;
      &lt;td&gt;https://medium.com/neodotlife/mind-your-nose-f...&lt;/td&gt;
      &lt;td&gt;[]&lt;/td&gt;
      &lt;td&gt;2020-10-10 20:17:37.132000+00:00&lt;/td&gt;
      &lt;td&gt;['Biotechnology', 'Neuroscience', 'Brain', 'We...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;3&lt;/th&gt;
      &lt;td&gt;The 4 Purposes of Dreams&lt;/td&gt;
      &lt;td&gt;Passionate about the synergy between science a...&lt;/td&gt;
      &lt;td&gt;https://medium.com/science-for-real/the-4-purp...&lt;/td&gt;
      &lt;td&gt;['Eshan Samaranayake']&lt;/td&gt;
      &lt;td&gt;2020-12-21 16:05:19.524000+00:00&lt;/td&gt;
      &lt;td&gt;['Health', 'Neuroscience', 'Mental Health', 'P...&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;4&lt;/th&gt;
      &lt;td&gt;Surviving a Rod Through the Head&lt;/td&gt;
      &lt;td&gt;You’ve heard of him, haven’t you? Phineas Gage...&lt;/td&gt;
      &lt;td&gt;https://medium.com/live-your-life-on-purpose/s...&lt;/td&gt;
      &lt;td&gt;['Rishav Sinha']&lt;/td&gt;
      &lt;td&gt;2020-02-26 00:01:01.576000+00:00&lt;/td&gt;
      &lt;td&gt;['Brain', 'Health', 'Development', 'Psychology...&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# some basic info about the dataframe
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;class 'pandas.core.frame.DataFrame'&amp;gt;
RangeIndex: 192368 entries, 0 to 192367
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   title      192363 non-null  object
 1   text       192368 non-null  object
 2   url        192368 non-null  object
 3   authors    192368 non-null  object
 4   timestamp  192366 non-null  object
 5   tags       192368 non-null  object
dtypes: object(6)
memory usage: 8.8+ MB
None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We see that there are close to 200k articles in this dataset. We don't need this many articles in our pool to recommend from.&lt;/p&gt;

&lt;p&gt;Therefore, we focus on a niche domain, 'AI', and sample only those articles which include 'AI' as a tag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# first converting the tags values from string to a list for the explode operation
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tags&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ast&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;literal_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;explode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;title&lt;/th&gt;
      &lt;th&gt;text&lt;/th&gt;
      &lt;th&gt;url&lt;/th&gt;
      &lt;th&gt;authors&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;tags&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Mental Note Vol. 24&lt;/td&gt;
      &lt;td&gt;Photo by Josh Riemer on Unsplash\n\nMerry Chri...&lt;/td&gt;
      &lt;td&gt;https://medium.com/invisible-illness/mental-no...&lt;/td&gt;
      &lt;td&gt;['Ryan Fan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 03:38:10.479000+00:00&lt;/td&gt;
      &lt;td&gt;Mental Health&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Mental Note Vol. 24&lt;/td&gt;
      &lt;td&gt;Photo by Josh Riemer on Unsplash\n\nMerry Chri...&lt;/td&gt;
      &lt;td&gt;https://medium.com/invisible-illness/mental-no...&lt;/td&gt;
      &lt;td&gt;['Ryan Fan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 03:38:10.479000+00:00&lt;/td&gt;
      &lt;td&gt;Health&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Mental Note Vol. 24&lt;/td&gt;
      &lt;td&gt;Photo by Josh Riemer on Unsplash\n\nMerry Chri...&lt;/td&gt;
      &lt;td&gt;https://medium.com/invisible-illness/mental-no...&lt;/td&gt;
      &lt;td&gt;['Ryan Fan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 03:38:10.479000+00:00&lt;/td&gt;
      &lt;td&gt;Psychology&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Mental Note Vol. 24&lt;/td&gt;
      &lt;td&gt;Photo by Josh Riemer on Unsplash\n\nMerry Chri...&lt;/td&gt;
      &lt;td&gt;https://medium.com/invisible-illness/mental-no...&lt;/td&gt;
      &lt;td&gt;['Ryan Fan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 03:38:10.479000+00:00&lt;/td&gt;
      &lt;td&gt;Science&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;0&lt;/th&gt;
      &lt;td&gt;Mental Note Vol. 24&lt;/td&gt;
      &lt;td&gt;Photo by Josh Riemer on Unsplash\n\nMerry Chri...&lt;/td&gt;
      &lt;td&gt;https://medium.com/invisible-illness/mental-no...&lt;/td&gt;
      &lt;td&gt;['Ryan Fan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 03:38:10.479000+00:00&lt;/td&gt;
      &lt;td&gt;Neuroscience&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# now we see that the explode operation has duplicated rest of the row for each tag in the tags list
# We can further filter only the AI tagged articles
&lt;/span&gt;
&lt;span class="n"&gt;df_ai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tags == &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;AI&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
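&lt;p&gt;To see the explode-then-filter pattern in isolation, here is the same logic on a tiny made-up DataFrame (the rows are invented; only the column names match the dataset):&lt;/p&gt;

```python
import pandas as pd

# toy stand-in for the Medium dataset: one row per article, tags already parsed to lists
toy = pd.DataFrame({
    "title": ["A", "B"],
    "tags": [["AI", "Health"], ["Health"]],
})
exploded = toy.explode("tags")            # one row per (article, tag) pair
ai_only = exploded.query('tags == "AI"')  # keep only the AI-tagged rows
print(ai_only["title"].tolist())  # → ['A']
```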



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;title&lt;/th&gt;
      &lt;th&gt;text&lt;/th&gt;
      &lt;th&gt;url&lt;/th&gt;
      &lt;th&gt;authors&lt;/th&gt;
      &lt;th&gt;timestamp&lt;/th&gt;
      &lt;th&gt;tags&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;th&gt;34&lt;/th&gt;
      &lt;td&gt;AI creating Human-Looking Images and Tracking ...&lt;/td&gt;
      &lt;td&gt;AI creating Human-Looking Images and Tracking ...&lt;/td&gt;
      &lt;td&gt;https://medium.com/towards-artificial-intellig...&lt;/td&gt;
      &lt;td&gt;['David Yakobovitch']&lt;/td&gt;
      &lt;td&gt;2020-09-07 18:01:01.467000+00:00&lt;/td&gt;
      &lt;td&gt;AI&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;69&lt;/th&gt;
      &lt;td&gt;Predicting The Protein Structures Using AI&lt;/td&gt;
      &lt;td&gt;Proteins are found essentially in all organism...&lt;/td&gt;
      &lt;td&gt;https://medium.com/datadriveninvestor/predicti...&lt;/td&gt;
      &lt;td&gt;['Vishnu Aravindhan']&lt;/td&gt;
      &lt;td&gt;2020-12-26 08:46:36.656000+00:00&lt;/td&gt;
      &lt;td&gt;AI&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;72&lt;/th&gt;
      &lt;td&gt;Unleash the Potential of AI in Circular Econom...&lt;/td&gt;
      &lt;td&gt;Business Potential of AI in promoting circular...&lt;/td&gt;
      &lt;td&gt;https://medium.com/swlh/unleash-the-potential-...&lt;/td&gt;
      &lt;td&gt;['Americana Chen']&lt;/td&gt;
      &lt;td&gt;2020-12-07 22:46:53.490000+00:00&lt;/td&gt;
      &lt;td&gt;AI&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;85&lt;/th&gt;
      &lt;td&gt;Essential OpenCV Functions to Get You Started ...&lt;/td&gt;
      &lt;td&gt;Reading, writing and displaying images\n\nBefo...&lt;/td&gt;
      &lt;td&gt;https://towardsdatascience.com/essential-openc...&lt;/td&gt;
      &lt;td&gt;['Juan Cruz Martinez']&lt;/td&gt;
      &lt;td&gt;2020-06-12 16:03:06.663000+00:00&lt;/td&gt;
      &lt;td&gt;AI&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;th&gt;105&lt;/th&gt;
      &lt;td&gt;Google Objectron — A giant leap for the 3D obj...&lt;/td&gt;
      &lt;td&gt;bjecrPhoto by Tamara Gak on Unsplash\n\nGoogle...&lt;/td&gt;
      &lt;td&gt;https://towardsdatascience.com/google-objectro...&lt;/td&gt;
      &lt;td&gt;['Jair Ribeiro']&lt;/td&gt;
      &lt;td&gt;2020-11-23 17:48:03.183000+00:00&lt;/td&gt;
      &lt;td&gt;AI&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Finally, we concatenate information across the columns into a single column, so that we have to index only a single column in the Qdrant vector DB
# Also, we use only the title of the article as a stand-in for its content. This is to minimise the DB upsert time (as we will be running Qdrant on our local machine)
# We keep the url field, so that the LLM agent can cite the url of the recommended blog.
# finally, we take a random sample of 200 articles, again to minimise the DB upsert time.
&lt;/span&gt;&lt;span class="n"&gt;df_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_info&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, url: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_ai_combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df_ai&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_info&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;df_ai_combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;medium_articles_ai.csv&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df_ai_combined&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;34     title: AI creating Human-Looking Images and Tr...
69     title: Predicting The Protein Structures Using...
72     title: Unleash the Potential of AI in Circular...
85     title: Essential OpenCV Functions to Get You S...
105    title: Google Objectron — A giant leap for the...
Name: combined_info, dtype: object
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
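&lt;p&gt;One caveat about the sample(200) call above: pandas draws a fresh random sample on every run, so the generated csv changes each time. Passing a random_state makes the selection reproducible, shown here on a toy Series:&lt;/p&gt;

```python
import pandas as pd

s = pd.Series(range(1000))
a = s.sample(200, random_state=42)
b = s.sample(200, random_state=42)
print(a.equals(b))  # → True: same seed, same 200 rows every run
```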

&lt;p&gt;Now that we have the csv file, we go back to the Flowise dashboard.&lt;/p&gt;

&lt;p&gt;In the QnA template we discussed earlier, replace the MarkdownTextSplitter with the RecursiveTextSplitter, the GitHub document loader with the CSV document loader, the in-memory retriever with the Qdrant vector store, and the Conversational Retrieval Chain with the Retrieval QA Chain.&lt;/p&gt;

&lt;p&gt;Also, in the CSV document loader, upload the csv file we created in the cell above, putting 'combined_info' in the 'Single Column Extraction' field. Then ensure that you enter your OpenAI API key, set the Qdrant server URL to '&lt;a href="http://0.0.0.0:6333"&gt;http://0.0.0.0:6333&lt;/a&gt;', and give a new collection (database) name.&lt;/p&gt;
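&lt;p&gt;For intuition, the 'Single Column Extraction' setting amounts to reading the csv and treating each value of that column as one document. A rough stand-alone equivalent, using an in-memory csv in place of the uploaded file:&lt;/p&gt;

```python
import io
import pandas as pd

# in-memory stand-in for medium_articles_ai.csv; in Flowise the CSV loader reads the uploaded file
csv_data = io.StringIO(
    "combined_info\n"
    '"title: Predicting The Protein Structures Using AI, url: https://medium.com/..."\n'
)
docs = pd.read_csv(csv_data)["combined_info"].dropna().tolist()
print(len(docs))  # → 1: one document per row of the extracted column
```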

&lt;p&gt;The dashboard looks like this now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UsQR0JJT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/txxg6urx3qq2wkr3xh91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UsQR0JJT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/txxg6urx3qq2wkr3xh91.png" alt="Flowise custom chatflow" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can save your chatflow using the save icon on the top right, and run the flow using the green database icon below it.&lt;/p&gt;

&lt;p&gt;But before proceeding, you first need to start a local Qdrant server. The easiest way to do this is via Docker. Ensure you have &lt;a href="https://docs.docker.com/engine/install/"&gt;Docker&lt;/a&gt; installed on your system. Then open a terminal and run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can then go to your browser at &lt;a href="http://localhost:6333/dashboard"&gt;http://localhost:6333/dashboard&lt;/a&gt; and see the Qdrant dashboard. We will come back to this dashboard later.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NTVs_cwy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivb4p3a3c8lrqp3o8r6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NTVs_cwy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ivb4p3a3c8lrqp3o8r6s.png" alt="Qdrant dashboard home" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run the flow now using the green database icon, and click the Upsert button on the pop-up that follows.&lt;/p&gt;

&lt;p&gt;Once the documents are upserted into the Qdrant DB, you can go over to the Qdrant dashboard at &lt;a href="http://localhost:6333/dashboard"&gt;http://localhost:6333/dashboard&lt;/a&gt; and refresh to see the new collection 'medium_articles_ai'. Click on it to see the indexed csv file.&lt;/p&gt;

&lt;p&gt;Finally, let's start the chatbot by clicking on the purple message icon next to the green database icon.&lt;/p&gt;

&lt;p&gt;You can ask the bot about articles on AI, and it will recommend articles from the collection we created with the csv file. Without us instructing it to do so, it also cites the url of each recommended blog, so that the user can straightaway start reading.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8IzVWZXy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9y947fiwl9knr0f4qqqv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8IzVWZXy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9y947fiwl9knr0f4qqqv.png" alt="Chatbot" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Conclusion
&lt;/h4&gt;

&lt;p&gt;In this article we saw how to use an LLM to build a blog recommendation application. We used OpenAI’s GPT-3.5 as the language model and Flowise as a graphical interface with built-in nodes to design the architecture of the application. The Qdrant vector database performed the retrieval step of RAG, which augmented the LLM’s reasoning capability by feeding it the pool of blogs in an optimal, vectorized manner. The LLM may not have access to this specific pool of blogs, as they may either be privately held or created after its pre-training cut-off date. While the chat example I showed asked the LLM agent to explicitly recommend blogs on a particular topic, one can also input another blog and ask the agent to recommend similar ones.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>flowise</category>
      <category>qdrant</category>
      <category>nocode</category>
    </item>
  </channel>
</rss>
