Jaydeep Biswas

LangChain's 2nd Module: Retrieval🦜🐕

In our recent blog series, we've traversed a diverse range of topics. Here's what we've covered so far:

  1. Installation and Setup of LangChain
  2. LangChain's 1st Module: Model I/O

Retrieval Augmented Generation (RAG) is a crucial process in LangChain, especially for applications that need specific user data not present in the model's training set. In simpler terms, it involves fetching external data and blending it seamlessly into the language model's generation process. LangChain offers a robust set of tools and features to make this process easy, accommodating both simple and complex applications.

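To make the flow concrete before we unpack each piece, here's a minimal end-to-end sketch (not from the original series; it assumes an OpenAI API key is configured and a local sample.txt exists). Every component it touches (loader, splitter, embeddings, vector store, retriever) is covered in detail below:

from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load the external data and split it into chunks
docs = TextLoader("./sample.txt").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50).split_documents(docs)

# 2. Embed the chunks and index them in a vector store
db = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 3. Retrieve relevant chunks and let the LLM answer using that context
qa = RetrievalQA.from_chain_type(llm=ChatOpenAI(temperature=0), retriever=db.as_retriever())
print(qa.run("What does the document say about X?"))  # placeholder question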
Let's break down the components involved in the retrieval process:

Document Loaders 📄

Document loaders in LangChain enable the extraction of data from a wide variety of sources. With over 100 loaders available, they support a diverse range of document types and sources such as private S3 buckets, public websites, and databases. All of these loaders ingest data into Document classes.

You can choose a document loader based on your specific needs from LangChain's document loader integrations. Here are examples of different loaders:

Text File Loader

To load a simple .txt file into a document, you can use the TextLoader:

from langchain.document_loaders import TextLoader

loader = TextLoader("./sample.txt")
document = loader.load()

CSV Loader

For loading data from a CSV file, LangChain provides the CSVLoader. You can even customize parsing by specifying field names:

from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./example_data/sample.csv')
documents = loader.load()
loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['MLB Team', 'Payroll in millions', 'Wins']
})
documents = loader.load()

PDF Loaders

LangChain's PDF Loaders offer various methods for parsing and extracting content from PDF files. Different loaders cater to different needs:

PyPDFLoader (Basic PDF Parsing)
from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
MathPixLoader (Mathematical Content and Diagrams)
from langchain.document_loaders import MathpixPDFLoader

loader = MathpixPDFLoader("example_data/math-content.pdf")
data = loader.load()
PyMuPDFLoader (Fast PDF Parsing with Detailed Metadata)
from langchain.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("example_data/layout-parser-paper.pdf")
data = loader.load()

# Optionally pass additional arguments for PyMuPDF's get_text() call
data = loader.load(option="text")
PDFMiner Loader (Granular Control over Text Extraction)
from langchain.document_loaders import PDFMinerLoader

loader = PDFMinerLoader("example_data/layout-parser-paper.pdf")
data = loader.load()
AmazonTextractPDFLoader (OCR and Advanced Parsing via AWS Textract)
from langchain.document_loaders import AmazonTextractPDFLoader

# Requires AWS account and configuration
loader = AmazonTextractPDFLoader("example_data/complex-layout.pdf")
documents = loader.load()
PDFMinerPDFasHTMLLoader (HTML Output for Semantic Parsing)
from langchain.document_loaders import PDFMinerPDFasHTMLLoader

loader = PDFMinerPDFasHTMLLoader("example_data/layout-parser-paper.pdf")
data = loader.load()
PDFPlumberLoader (Detailed Metadata, One Document per Page)
from langchain.document_loaders import PDFPlumberLoader

loader = PDFPlumberLoader("example_data/layout-parser-paper.pdf")
data = loader.load()

Integrated Loaders play a vital role in LangChain by allowing direct data loading from various applications such as Slack, Figma, Google Drive, databases, and more. These loaders empower LLMs to seamlessly incorporate information from diverse sources, expanding the capabilities of language generation applications.

Let's explore a couple of examples to illustrate how Integrated Loaders can be employed:

Example I - Slack 💬

Slack, a popular instant messaging platform, can be integrated into LLM workflows with ease. Here's a simplified step-by-step guide:

  1. Go to your Slack Workspace Management page.
  2. Navigate to {your_slack_domain}.slack.com/services/export.
  3. Select the desired date range and initiate the export.
  4. Slack notifies you via email and DM once the export is ready.
  5. The exported data is in a .zip file located in your Downloads folder or the designated download path.

Now, you can use the SlackDirectoryLoader from the langchain.document_loaders package to load this data into LangChain:

from langchain.document_loaders import SlackDirectoryLoader

SLACK_WORKSPACE_URL = "https://xxx.slack.com"  # Replace with your Slack URL
LOCAL_ZIPFILE = ""  # Path to the Slack zip file

loader = SlackDirectoryLoader(LOCAL_ZIPFILE, SLACK_WORKSPACE_URL)
docs = loader.load()
print(docs)

Example II - Figma 🎨

Figma, a widely-used tool for interface design, offers a REST API for data integration. Here's a simplified guide:

  1. Obtain the Figma file key from the URL format: https://www.figma.com/file/{filekey}/sampleFilename.
  2. Node IDs are found in the URL parameter ?node-id={node_id}.
  3. Generate an access token following instructions at the Figma Help Center.

Now, you can use the FigmaFileLoader class from langchain.document_loaders.figma to load Figma data into LangChain. This example demonstrates how to generate HTML/CSS code based on Figma design input:

import os
from langchain.document_loaders.figma import FigmaFileLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.indexes import VectorstoreIndexCreator
from langchain.chains import ConversationChain, LLMChain
from langchain.memory import ConversationBufferWindowMemory
from langchain.prompts.chat import ChatPromptTemplate, SystemMessagePromptTemplate, AIMessagePromptTemplate, HumanMessagePromptTemplate

figma_loader = FigmaFileLoader(
    os.environ.get("ACCESS_TOKEN"),
    os.environ.get("NODE_IDS"),
    os.environ.get("FILE_KEY"),
)

index = VectorstoreIndexCreator().from_loaders([figma_loader])
figma_doc_retriever = index.vectorstore.as_retriever()
  • The generate_code function uses the Figma data to create HTML/CSS code.
  • It employs a templated conversation with a GPT-based model.
def generate_code(human_input):
    # Templates for the system and human prompts.
    # Note: the system template needs a {context} placeholder so the retrieved
    # Figma nodes are actually injected when the prompt is formatted below.
    system_prompt_template = "Your coding instructions... Design context: {context}"
    human_prompt_template = "Code the {text}. Ensure it's mobile responsive"

    # Creating prompt templates
    system_message_prompt = SystemMessagePromptTemplate.from_template(system_prompt_template)
    human_message_prompt = HumanMessagePromptTemplate.from_template(human_prompt_template)

    # Setting up the AI model
    gpt_4 = ChatOpenAI(temperature=0.02, model_name="gpt-4")

    # Retrieving relevant documents
    relevant_nodes = figma_doc_retriever.get_relevant_documents(human_input)

    # Generating and formatting the prompt
    conversation = [system_message_prompt, human_message_prompt]
    chat_prompt = ChatPromptTemplate.from_messages(conversation)
    response = gpt_4(chat_prompt.format_prompt(context=relevant_nodes, text=human_input).to_messages())

    return response

# Example usage
response = generate_code("page top header")
print(response.content)

In this example, the generate_code function utilizes Figma data to create HTML/CSS code through LangChain's capabilities. These Integrated Loaders showcase how LangChain simplifies the integration of external data, enabling powerful applications in various domains.

Document Transformers 🔄

Document Transformers in LangChain help shape and modify documents, the building blocks we've created earlier. These tools are crucial for tasks like breaking down long texts, combining information, and filtering content, making them fit neatly into a model's understanding or specific application requirements.

One handy tool is the RecursiveCharacterTextSplitter, a versatile text splitter using a character list. It allows you to tweak parameters like chunk size, overlap, and starting index. Here's a simple example in Python:

from langchain.text_splitter import RecursiveCharacterTextSplitter

state_of_the_union = "Your long text here..."

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    length_function=len,
    add_start_index=True,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

Another tool is the CharacterTextSplitter, which divides text based on a chosen character, with controls for chunk size and overlap:

from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

Now, if you're dealing with HTML content, use the HTMLHeaderTextSplitter. It cleverly splits HTML content based on header tags while keeping the semantic structure intact:

from langchain.text_splitter import HTMLHeaderTextSplitter

html_string = "Your HTML content here..."
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
print(html_header_splits[0])

Things get even more interesting when you combine tools, for example piping the output of the HTMLHeaderTextSplitter into a RecursiveCharacterTextSplitter:

from langchain.text_splitter import HTMLHeaderTextSplitter, RecursiveCharacterTextSplitter

url = "https://example.com"
headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)

chunk_size = 500
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)
splits = text_splitter.split_documents(html_header_splits)
print(splits[0])

LangChain also offers splitters tuned to different programming languages, such as Python and JavaScript, via RecursiveCharacterTextSplitter.from_language:

from langchain.text_splitter import RecursiveCharacterTextSplitter, Language

python_code = """
def hello_world():
    print("Hello, World!")
hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50
)
python_docs = python_splitter.create_documents([python_code])
print(python_docs[0])

js_code = """
function helloWorld() {
  console.log("Hello, World!");
}
helloWorld();
"""

js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS, chunk_size=60
)
js_docs = js_splitter.create_documents([js_code])
print(js_docs[0])

For handling text based on token count (useful for models with token limits), there's the TokenTextSplitter:

from langchain.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(chunk_size=10)
texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Lastly, there's the LongContextReorder, which reorders retrieved documents so the most relevant ones sit at the beginning and end of the list, countering the "lost in the middle" degradation models show with lengthy contexts:

from langchain.document_transformers import LongContextReorder

reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)
print(reordered_docs[0])

These tools showcase the incredible ways you can transform documents in LangChain, from simple text splitting to complex reordering and language-specific separation. For more detailed insights and specific use cases, diving into the LangChain documentation and Integrations section is highly recommended. And don't worry, in our examples, the loaders have already done the heavy lifting of creating chunked documents for us!

Text Embedding Model 📝➡️🔠

Text Embedding Models in LangChain bring a standardized way of handling various embedding model providers like OpenAI, Cohere, and Hugging Face. These models work by transforming text into vector representations, allowing for powerful operations like semantic search through text similarity in vector space.

Getting started is usually a breeze, involving the installation of specific packages and setting up API keys. In our case, we've already taken care of this for OpenAI.
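For instance, a typical OpenAI setup looks roughly like this (a sketch, assuming the openai package is installed and you have a valid key):

import os

# Make the API key available to LangChain's OpenAI integrations
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder; use your own key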

In LangChain, the go-to method for embedding multiple texts is embed_documents. Take a look at this example using OpenAI:

from langchain.embeddings import OpenAIEmbeddings

# Initialize the model
embeddings_model = OpenAIEmbeddings()

# Embed a list of texts
embeddings = embeddings_model.embed_documents(
    ["Hi there!", "Oh, hello!", "What's your name?", "My friends call me World", "Hello World!"]
)
print("Number of documents embedded:", len(embeddings))
print("Dimension of each embedding:", len(embeddings[0]))

For a single text, like a search query, you can use embed_query. This is handy for comparing a query to a set of document embeddings:

from langchain.embeddings import OpenAIEmbeddings

# Initialize the model
embeddings_model = OpenAIEmbeddings()

# Embed a single query
embedded_query = embeddings_model.embed_query("What was the name mentioned in the conversation?")
print("First five dimensions of the embedded query:", embedded_query[:5])

Understanding these embeddings is key. Each piece of text becomes a vector, and the dimension depends on the model used; for OpenAI, it's typically a 1,536-dimensional vector. These embeddings are then used to retrieve relevant information.
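To see what "similarity in vector space" means in practice, here's a small sketch (plain Python, not a LangChain API) that ranks the document embeddings from the snippets above against the embedded query by cosine similarity; the variables embeddings and embedded_query are assumed from those snippets:

import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank the earlier document embeddings against the embedded query
scores = [cosine_similarity(embedded_query, doc_vec) for doc_vec in embeddings]
best = max(range(len(scores)), key=lambda i: scores[i])
print("Most similar document index:", best, "score:", round(scores[best], 3))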

And here's the cool part: LangChain isn't limited to just OpenAI. It's designed to seamlessly work with various providers. While the setup and usage might differ slightly based on the provider, the core concept of embedding texts into vector space stays the same. For all the nitty-gritty details, including advanced configurations and integrations with different embedding model providers, the LangChain documentation in the Integrations section is a goldmine.
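As a quick illustration of that provider-agnostic interface, here's a sketch using a local Hugging Face sentence-transformers model instead of OpenAI (assuming pip install sentence-transformers; the model name is just an example). The methods are the same embed_documents and embed_query:

from langchain.embeddings import HuggingFaceEmbeddings

# A local embedding model; no API key required
hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vectors = hf_embeddings.embed_documents(["Hi there!", "Hello World!"])
query_vector = hf_embeddings.embed_query("a friendly greeting")
print("Documents embedded:", len(vectors), "| Dimension:", len(vectors[0]))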

Vector Stores 🗄️

Vector Stores in LangChain are the vaults that efficiently store and search those text embeddings. LangChain integrates with over 50 vector stores, offering a standardized interface for a smooth user experience.

Let's dive into an example where we take raw texts and store and search them using Chroma:

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Chroma embeds the texts internally using the supplied embedding function
db = Chroma.from_texts(texts, embedding=OpenAIEmbeddings())  # texts: a list of strings
similar_texts = db.similarity_search("search query")

Alternatively, if we want to use FAISS for creating indexes, here's an example:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# pdfpages and airtabledocs are Document lists produced by loaders earlier in this series
pdfstore = FAISS.from_documents(pdfpages, embedding=OpenAIEmbeddings())

airtablestore = FAISS.from_documents(airtabledocs, embedding=OpenAIEmbeddings())

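Either store can then be searched directly. A minimal sketch, assuming the pdfstore index built above (the query string is a placeholder):

# Return the three PDF pages most similar to the query
similar_pages = pdfstore.similarity_search("What is layout parsing?", k=3)
print(similar_pages[0].page_content)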

Retrievers 🔍

Retrievers in LangChain are like smart search engines, but more flexible: given an unstructured query, they return the documents relevant to it. Unlike vector stores, which focus on storing embeddings, retrievers are all about finding information; a vector store is just one common backbone for a retriever.

Let's start with the Chroma retriever. Setting it up involves a few steps, like installing Chroma with pip install chromadb. Then, you load, split, embed, and retrieve documents. Here's a simple example:

from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

full_text = open("state_of_the_union.txt", "r").read()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
texts = text_splitter.split_text(full_text)

embeddings = OpenAIEmbeddings()
db = Chroma.from_texts(texts, embeddings)
retriever = db.as_retriever()

retrieved_docs = retriever.invoke("What did the president say about Ketanji Brown Jackson?")
print(retrieved_docs[0].page_content)

Next, there's the MultiQueryRetriever, which automates prompt tuning: it uses an LLM to generate multiple variations of your query, retrieves documents for each, and returns the unique union of the results. Check it out:

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=db.as_retriever(), llm=llm
)

unique_docs = retriever_from_llm.get_relevant_documents(query=question)
print("Number of unique documents:", len(unique_docs))

Now, imagine you want just the relevant parts from a long document. That's where Contextual Compression Retriever steps in. It compresses retrieved documents, keeping only the relevant info. Take a look:

from langchain.llms import OpenAI
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

llm = OpenAI(temperature=0)
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(base_compressor=compressor, base_retriever=retriever)

compressed_docs = compression_retriever.get_relevant_documents("What did the president say about Ketanji Brown Jackson")
print(compressed_docs[0].page_content)

Now, let's talk team players. The EnsembleRetriever brings different algorithms together for a grand performance. In this example, BM25 and FAISS Retrievers join forces:

from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# A toy corpus; in practice, use your own texts (BM25Retriever requires: pip install rank_bm25)
doc_list = ["I like apples", "I like oranges", "Apples and oranges are fruits"]

bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 2  # number of results to return
faiss_vectorstore = FAISS.from_texts(doc_list, OpenAIEmbeddings())
faiss_retriever = faiss_vectorstore.as_retriever(search_kwargs={"k": 2})

ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)

docs = ensemble_retriever.get_relevant_documents("apples")
print(docs[0].page_content)

Now, here's something for those who want more from a document. The MultiVector Retriever lets you query with multiple vectors per document. Here's how you can split documents into smaller chunks:

from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain.document_loaders import TextLoader
import uuid

loaders = [TextLoader("file1.txt"), TextLoader("file2.txt")]
docs = [doc for loader in loaders for doc in loader.load()]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)

vectorstore = Chroma(collection_name="full_documents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, docstore=store, id_key=id_key)

doc_ids = [str(uuid.uuid4()) for _ in docs]
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

# Split each parent document into child chunks and tag every chunk with its parent's ID
sub_docs = []
for doc, doc_id in zip(docs, doc_ids):
    child_docs = child_text_splitter.split_documents([doc])
    for child in child_docs:
        child.metadata[id_key] = doc_id
    sub_docs.extend(child_docs)

# Index the small chunks in the vector store; keep the full parents in the docstore
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))

Lastly, for those who want a balance between accuracy and context, there's the Parent Document Retriever:

from langchain.retrievers import ParentDocumentRetriever

loaders = [TextLoader("file1.txt"), TextLoader("file2.txt")]
docs = [doc for loader in loaders for doc in loader.load()]

child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
vectorstore = Chroma(collection_name="full_documents", embedding_function=OpenAIEmbeddings())
store = InMemoryStore()
retriever = ParentDocumentRetriever(vectorstore=vectorstore, docstore=store, child_splitter=child_splitter)

retriever.add_documents(docs, ids=None)

retrieved_docs = retriever.get_relevant_documents("query")

These retrievers make LangChain a powerhouse for retrieving information. Whether you want focused content, multiple perspectives, or a balanced approach, there's a retriever for you. And hey, don't forget the documentation for more explorations!

A self-querying retriever uses an LLM to construct structured queries (including metadata filters) from natural-language input and applies them to its underlying vector store. Its usage is shown in the following code:

from langchain.chat_models import ChatOpenAI
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.retrievers.self_query.base import SelfQueryRetriever

# Describe the metadata fields and document contents so the LLM can construct structured queries
metadata_field_info = [AttributeInfo(name="genre", description="...", type="string"), ...]
document_content_description = "Brief summary of a movie"
llm = ChatOpenAI(temperature=0)

# vectorstore is an existing vector store (e.g., the Chroma instance built earlier)
retriever = SelfQueryRetriever.from_llm(llm, vectorstore, document_content_description, metadata_field_info)

retrieved_docs = retriever.invoke("query")

The WebResearchRetriever performs web research based on a given query:

from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.utilities import GoogleSearchAPIWrapper
from langchain.retrievers.web_research import WebResearchRetriever

# Initialize components (GoogleSearchAPIWrapper expects GOOGLE_API_KEY and GOOGLE_CSE_ID to be set)
llm = ChatOpenAI(temperature=0)
search = GoogleSearchAPIWrapper()
vectorstore = Chroma(embedding_function=OpenAIEmbeddings())

# Instantiate WebResearchRetriever
web_research_retriever = WebResearchRetriever.from_llm(vectorstore=vectorstore, llm=llm, search=search)

# Retrieve documents
docs = web_research_retriever.get_relevant_documents("query")

For our examples, we can also use the standard retriever exposed by our vector store objects; a minimal sketch, assuming the FAISS stores built earlier:

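# Standard retrievers from the FAISS stores created earlier
pdf_retriever = pdfstore.as_retriever()
airtable_retriever = airtablestore.as_retriever()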
We can now query the retrievers. The output will be Document objects relevant to the query, which we'll ultimately use to generate responses in later sections. The query strings below are just placeholders:
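# Each call returns a list of relevant Document objects
relevant_pdf_docs = pdf_retriever.get_relevant_documents("What does the paper say about layout parsing?")
relevant_airtable_docs = airtable_retriever.get_relevant_documents("Which records mention pricing?")

print(relevant_pdf_docs[0].page_content)
print(relevant_airtable_docs[0].metadata)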



Next Chapter: LangChain's 3rd Module: Agents
