Semantic Search Through Resumes for HR Tech Startups Using Qdrant

Resume Filtering & Its Challenges

Resume filtering is a common practice in companies that are inundated with resumes for a handful of positions. Most of it is done with keyword-based filtering software. While keyword-based filtering is useful, there is always the risk of filtering out good resumes that don't fit the filtering algorithm's rules. Any resume filtering software should therefore look at resumes the way a recruiter would: not just scanning for specific terms, but understanding how a candidate's experience and skill set fit the job description based on the semantics of the resume text. In AI parlance, one needs to perform semantic search/filtering on resumes.

This blog will walk you through building a semantic search LLM agent for question answering and analysis over a collection of resumes. The agent is implemented with OpenAI models and Langchain, and uses the Qdrant vector store for its semantic search capability.

Implementing the Semantic Search

You will need the following libraries to follow along with this tutorial:

```
pip install qdrant-client langchain openai pypdf
```

You can also view the full notebook in this GitHub repo.

First, download the openly available dataset of resumes from Kaggle. Note that this dataset contains thousands of resumes, organized into folders by job domain. We will focus on the Information-Technology domain for the purpose of this tutorial.

We start by importing the required libraries and setting the OpenAI API key:

```python
import os
import getpass
from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_community.vectorstores import Qdrant
from langchain_community.chat_models import ChatOpenAI
from langchain_community.embeddings import OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
```

After you have downloaded the resumes and stored them on disk (assume the location ../data), we need to load them into memory. However, the resumes are in PDF format, a layout format rather than plain text, so we need a PDF parser to extract the text.

Fortunately, Langchain has a built-in module that not only extracts the text from the PDFs but also loads the resulting documents into memory.

Note: this is why you needed to install pypdf above.

```python
loader = PyPDFDirectoryLoader("../data/INFORMATION-TECHNOLOGY")
docs = loader.load()
print(len(docs))
```

```
247
```

We now have the data we need to give to our LLM agent. We proceed to build a semantic search application. Semantic search is different from traditional search, which is based on keyword matches between query tokens and document tokens. Semantic search matches queries to documents based on the meaning of the query and its tokens. For example, it can match the word ‘car’ in a query to ‘automobile’ or ‘vehicle’ in documents, or match the word ‘bank’ to documents according to the meaning expressed in the rest of the query (river bank or financial bank).

Semantic search involves first vectorizing the documents and indexing them appropriately. We then vectorize the incoming query and perform an optimized search over the vector space using a similarity metric.
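To make the similarity metric concrete, here is a minimal sketch, separate from the tutorial's pipeline, that embeds a query and two candidate texts and scores them with cosine similarity, a common metric for text embeddings. The example words come from the discussion above; numpy is assumed to be installed.

```python
import numpy as np
from langchain_community.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

def cosine_similarity(a, b):
    # dot product divided by the product of the vector magnitudes
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# embed a query and two candidate document texts
query_vec = embeddings.embed_query("car")
doc_vecs = embeddings.embed_documents(["automobile", "river bank"])

# 'car' should score noticeably higher against 'automobile' than 'river bank'
for text, vec in zip(["automobile", "river bank"], doc_vecs):
    print(text, round(cosine_similarity(query_vec, vec), 3))
```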

The Qdrant vector store takes care of all these steps and has a super smooth integration with Langchain. To begin, we vectorize the documents using OpenAI embeddings and create a Qdrant collection using the from_documents function of the Qdrant class we imported at the beginning.

Qdrant can also be used directly as a retriever, a Langchain construct that uses the vector store to retrieve documents for a query via similarity search. In one line of code, we can abstract the retrieval process: we pass the input query to the retriever, and it returns the relevant documents, which are in turn passed to the LLM as prompt context.

```python
# initialise embeddings used to convert text to vectors
embeddings = OpenAIEmbeddings()

# create a qdrant collection - a vector-based index of all resumes
qdrant_collection = Qdrant.from_documents(
    docs,
    embeddings,
    location=":memory:",  # local mode with in-memory storage only
    collection_name="it_resumes",
)

# construct a retriever on top of the vector store
qdrant_retriever = qdrant_collection.as_retriever()
```
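Before wiring the retriever into a chain, you can sanity-check it on its own; a minimal sketch, where the query string is just an illustration:

```python
# fetch the resumes most similar to an ad-hoc query
retrieved = qdrant_retriever.get_relevant_documents("experience with Java and HTML")
for doc in retrieved:
    # each result is a Document whose metadata records the source PDF
    print(doc.metadata["source"])
```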

Now that we have embedded and indexed the resumes in Qdrant and built a retriever, we proceed to build the LLM chain. We start with a custom prompt template that takes two input variables: resume, composed of retrieved resume chunks, and question, the user query. We initialize the gpt-3.5-turbo-16k model because its 16k context window allows us to send larger resume chunks to the LLM, and set the temperature to 0 to minimize randomness in the LLM's outputs.

```python
# Now we define and initialise the components of an LLM chain, beginning with the prompt template and the model.
template = """You are a helpful assistant to a recruiter at a technology firm. You will be provided the following input context \
from a dataset of resumes of IT professionals.
Answer the question based only on the context. Also provide the source documents.
{resume}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
model = ChatOpenAI(temperature=0, model='gpt-3.5-turbo-16k-0613')
```

We use Langchain Expression Language (LCEL) to link the components of the chain together. The dictionary at the start of the chain builds the two prompt variables. RunnablePassthrough forwards the text the chain is invoked with, i.e. the user query: for the resume variable, the query is piped into the Qdrant retriever, which performs the semantic search; for the question variable, the query is passed through unchanged.

These two components are then passed to the prompt, which is in turn passed to the model we initialized earlier. Finally, we append Langchain's string output parser, which ensures that the output of the chain is a well-formatted string.

```python
# Construct the chain, with two variables - resume and question - to be passed to the prompt.
chain = (
    {"resume": RunnablePassthrough() | qdrant_retriever, "question": RunnablePassthrough()}
    | prompt
    | model
    | StrOutputParser()
)
```

Let’s set the chain in motion:

```python
chain.invoke('Which resumes mention both Java and HTML?')
```

```
The resumes that mention both Java and HTML are:
1. Document(source='../data/INFORMATION-TECHNOLOGY/83816738.pdf')
2. Document(source='../data/INFORMATION-TECHNOLOGY/12334140.pdf')
```

We see that the agent is able to understand the query, process it, and return the corresponding resumes from the resume dataset.

Let us try more queries:

```python
chain.invoke('Which resumes mention working with Cisco technologies.')
```

```
The resumes that mention working with Cisco technologies are:
1. Document 1: The first document mentions administering CISCO multi-VLAN layer 3 switched LAN/WAN, designing and administering CISCO VPN solution, and working with CISCO switches.
2. Document 2: The second document mentions skills in Cisco switching and Cisco routers.
3. Document 4: The fourth document mentions experience with Cisco hardware and software business solutions, Cisco command line IOS, and Cisco switches.
Sources:
- Document 1: ../data/INFORMATION-TECHNOLOGY/20879311.pdf
- Document 2: ../data/INFORMATION-TECHNOLOGY/91635250.pdf
- Document 4: ../data/INFORMATION-TECHNOLOGY/31243710.pdf
```

The agent not only returns the resume file names but also explains how each resume answers the query. Note that the document numbering is not sequential: these numbers refer to positions within the retrieved context provided in the prompt. The semantic search retrieved four documents, and the answer is based on those four.
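Four is the retriever's default. If you want the LLM to see more (or fewer) resumes, you can pass search_kwargs when constructing the retriever; a small sketch, where the value 8 is arbitrary:

```python
# retrieve the top 8 matches instead of the default 4
qdrant_retriever = qdrant_collection.as_retriever(search_kwargs={"k": 8})
```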

Let’s make a slightly more abstract query:

```python
chain.invoke('Find me some resumes who have recent experience at Director level')
```

```
Based on the provided input context, here are some resumes of IT professionals who have recent experience at the Director level:
1. Resume: INFORMATION TECHNOLOGY DIRECTOR
 - Experience: April 1999 to January 2015
 - Company Name: Not specified
 - Source: ../data/INFORMATION-TECHNOLOGY/24038620.pdf
2. Resume: INFORMATION TECHNOLOGY SENIOR MANAGER
 - Experience: April 2013 to February 2015
 - Company Name: Not specified
 - Source: ../data/INFORMATION-TECHNOLOGY/17681064.pdf
Please note that the specific company names are not provided in the input context.
```

The agent consistently answers the queries and returns the source documents. All we needed to do was specify this instruction in the prompt itself.

Conclusion

In this tutorial, we have built an agent for semantic search through resumes. We used the Qdrant vector store, which handles converting documents to embeddings, indexing the document embeddings, performing an optimized similarity search, and retrieving the relevant documents. This lets us send only the most relevant context to the LLM, saving us extra LLM calls and token costs.

This agent is built as a simple chain. One can add more functionality, such as customizing text splitting and document chunking (see the sketch below). One can also use an advanced Qdrant feature called sparse vectors for hybrid search. This enables keyword-based search for simple queries, which can help you avoid LLM calls for exact-match queries (e.g. get all resumes for candidates located in Germany), alongside semantic search for more subjective queries like the ones discussed in this blog.
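As a sketch of the chunking idea, here is one way to split the loaded pages before indexing, using Langchain's standard RecursiveCharacterTextSplitter; the chunk sizes are illustrative, not tuned:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# split each resume page into overlapping ~1000-character chunks so
# retrieval can surface the specific section that matches a query
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

# index the chunks instead of whole pages
qdrant_collection = Qdrant.from_documents(
    chunks,
    embeddings,
    location=":memory:",
    collection_name="it_resumes_chunked",
)
```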
