Eti Ijeoma
Building RAG-Powered Applications with LangChain, Pinecone, and OpenAI

The rise of large language models (LLMs) has transformed the field of artificial intelligence, allowing machines to understand and generate text that feels remarkably human. While many LLMs are available, this article focuses on the generative pre-trained transformer (GPT), one of OpenAI's most advanced model families.

GPT models are pre-trained on massive datasets, most of them gathered from the Internet. This broad training gives the model strong general language abilities, so it performs well on a range of natural language processing (NLP) tasks, including question answering, summarization, and generating human-like text. However, the model's knowledge is frozen at its training cutoff, and it can hallucinate answers when asked about information it has never seen.

In this article, we will explore how retrieval-augmented generation can mitigate these limitations and improve the quality of the model's responses.

Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) is a technique for generating high-quality, context-aware responses: information retrieved from external knowledge sources is combined with the initial prompt, and the augmented prompt is passed to the LLM. These knowledge sources may include collections of web pages, documents, or other textual resources that improve the LLM's grasp of the subject. Giving a language model access to external data after its initial training means we do not have to retrain it every time our knowledge changes. In this article, we will leverage OpenAI alongside Langchain and Pinecone to create a context-aware chatbot.

Langchain

Langchain is a framework for building applications on top of large language models. It is available as a Python or JavaScript package and lets developers build applications around pre-existing models. Langchain can also connect LLMs to data sources and tools, allowing them to interact with their environment. Check out the Langchain documentation for a guide on how to get started.

Pinecone

Pinecone is a cloud-based vector database specializing in efficiently storing, indexing, and querying high-dimensional vectors. It is designed for fast similarity search, enabling you to find the vectors most similar to a query based on metrics like Euclidean distance or cosine similarity.

To improve information retrieval for our AI model, we convert our knowledge documents into "word embeddings" (also called word vectors) using an embedding model and store them in a vector database. This makes searching for relevant information faster and more accurate, especially when dealing with unstructured or semi-structured text data.

Bringing it All Together

With our understanding of retrieval-augmented generation, we will leverage the power of the OpenAI LLM, Langchain, and Pinecone to create a question-answering application.

The implementation will be as follows: we provide knowledge base documents, embed them, and store them in Pinecone. When a query arrives, it is converted into an embedding using the same model used for the knowledge base text. The embedded query is then used to search Pinecone for the most similar, most relevant vectors. The documents behind those vectors are then passed back to the LLM as context so it can generate a grounded response.

Steps Involved in Retrieval-Augmented Generation

Data preparation and ingestion: After gathering our knowledge base data from one or more sources, we need to ingest (transform) it into a standard structure that the language model can easily process. Langchain provides document loader tools that load text from different sources (text files, CSV files, YouTube transcripts, etc.) and turn it into Langchain documents. A Langchain document has two fields:

  • page_content: contains the text of the document.

  • metadata: stores additional relevant information about the text, such as its source URL.

Examples of document loaders include text loaders, which read a plain text file, and transform loaders, which load documents from specific formats (HTML, CSV, etc.). A minimal example is sketched below.
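
Here is a small sketch, assuming a hypothetical local file knowledge.txt, that shows a Langchain document built by hand and one produced by a loader:

from langchain.schema import Document
from langchain.document_loaders import TextLoader

# Build a document by hand: page_content holds the text, metadata holds extra info
doc = Document(
    page_content="Scotland beat England 2-1 in a friendly match.",
    metadata={"source": "https://example.com/results"},
)
print(doc.page_content, doc.metadata)

# Or let a document loader do the work for a local text file (hypothetical file name)
loader = TextLoader("knowledge.txt")
loaded_docs = loader.load()  # returns a list of Document objects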

Chunking text: This entails breaking a document down into smaller, more meaningful fragments, often along sentence boundaries. This step is critical when dealing with lengthy text because of the context window constraints of models such as GPT-4, whose base version supports a maximum of 8,192 tokens. The idea is to construct manageable fragments that are semantically coherent.

To accomplish this, we use a length function based on tiktoken to measure the size of these smaller fragments in tokens. The goal is to avoid abruptly splitting related information during chunking. In addition, we include an overlap so that adjacent chunks share some content; this overlap preserves continuity by repeating the words or phrases at the end of one chunk at the start of the next.

Embedding: This process transforms complex data (e.g. text, images, and audio) into high-dimensional numeric vectors. Storing data numerically allows for efficient storage and processing. Embedding methods such as word embeddings (Word2Vec) also capture the semantic relationships between words and concepts. This semantic information is essential during retrieval augmentation, where understanding the context and relationships between terms is crucial for retrieving relevant content.

Embedding tools: Langchain integrates with several models for generating word or sentence embeddings. In this article, we will use OpenAIEmbeddings to create them.

Storage: The embedded documents are then stored in Pinecone's vector database. Pinecone's indexing techniques organize and optimize the index so that vectors similar to a query vector can be retrieved efficiently.

Retrieval: The incoming query is embedded and then used to search Pinecone's vector database for the most relevant documents. For example, in a simplified similarity search, the term "dog" might be represented numerically as [0.617]. Whenever a word is searched, it is also converted into a vector; a good embedding model ensures that words used in similar contexts, such as "puppy", yield closely aligned vectors, like [0.691], reflecting the shared context between the words.
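
Real embeddings have hundreds or thousands of dimensions rather than one, but the idea is the same. Here is a small, self-contained sketch (the vectors are made up for illustration) of comparing vectors with cosine similarity, the metric we will configure Pinecone to use later:

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 means similar direction, close to 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" (illustrative values only)
dog = np.array([0.61, 0.12, 0.88])
puppy = np.array([0.69, 0.10, 0.83])
car = np.array([0.05, 0.95, 0.01])

print(cosine_similarity(dog, puppy))  # high score: similar meaning
print(cosine_similarity(dog, car))    # low score: unrelated meaning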

Generation: The language model produces an accurate response to the query by using the retrieved documents as additional context.

Implementing the question-answering chatbot

In this section, we'll walk through a practical example of building an international football match question-answering bot. The context documents come from a Kaggle dataset of international football matches, which you can download.

Setting up the environment

To begin, we will create a .env file in which we will keep the Pinecone and OpenAI secret keys. You can generate or find these keys in your OpenAI and Pinecone dashboards.

OPENAI_API_KEY=""
PINECONE_ENV=""
PINECONE_API_KEY=""

Next, we'll utilize pip to install the necessary packages.

!pip install openai langchain pinecone-client tiktoken python-dotenv

Then we'll import the relevant libraries.

import os
import time
import tiktoken
import pinecone
from dotenv import load_dotenv
from langchain.llms import OpenAI
from langchain.vectorstores import Pinecone
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.schema import Document
from langchain.prompts import PromptTemplate
from langchain.memory import ConversationBufferMemory
from langchain.callbacks.base import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import ConversationalRetrievalChain
from langchain.chains.conversational_retrieval.prompts import CONDENSE_QUESTION_PROMPT, QA_PROMPT
from langchain.chains.question_answering import load_qa_chain
from langchain.embeddings.openai import OpenAIEmbeddings

# Load the secret keys from the .env file created earlier (requires python-dotenv)
load_dotenv()

We'll then use the Langchain DirectoryLoader to load documents from a directory. In this example, the documents are in CSV format and placed in a ./data directory.

# In your project folder, create a ./data directory and place the context
# documents (the Kaggle CSV files downloaded earlier) inside it
directory = './data'

def load_docs(directory):
  loader = DirectoryLoader(directory, glob='**/*.csv', show_progress=True, loader_cls=CSVLoader)
  documents = loader.load()
  return documents

documents = load_docs(directory)
print(f"Loaded {len(documents)} documents")

Next, we will split the documents into smaller chunks. This can be complex, but RecursiveCharacterTextSplitter from Langchain simplifies it. First, we define a custom length function using the tiktoken library, which lets us recursively split the data into chunks of at most n tokens.

# Check which encoding our embedding model uses
# ('text-embedding-ada-002' uses the 'cl100k_base' encoding)
tiktoken.encoding_for_model('text-embedding-ada-002')

# Initialize a tiktoken tokenizer (i.e. a tool that splits text into tokens)
tokenizer = tiktoken.get_encoding('cl100k_base')

# Create our custom length function
def tiktoken_len(text: str) -> int:
    """
    Count the number of tokens in a body of text using the tokenizer above.

    :param text: Text we'd like to measure.
    """
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

def chunk_by_size(text: str, size: int = 50) -> list[Document]:
    """
    Chunk up text recursively.

    :param text: Text to be chunked up.
    :return: List of Document items (i.e. chunks).
    """
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=size,
        chunk_overlap=20,
        length_function=tiktoken_len,
        add_start_index=True,
    )
    return text_splitter.create_documents([text])
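
The splitter above works on raw text, so before indexing we apply it to the text of each loaded document. Here is a minimal sketch of that step (the variable name docs is what the Pinecone step below expects):

# Chunk every loaded document; `docs` will be indexed into Pinecone below
docs = []
for document in documents:
    docs.extend(chunk_by_size(document.page_content))

print(f"Created {len(docs)} chunks")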

Next, we will initialize the OpenAI embedding model.

# Initialize the OpenAI embedding model, reading the key from the .env file
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
model_name = 'text-embedding-ada-002'

embeddings = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

After that, the Pinecone client will be initialized using the environment and Pinecone API key. If an index does not already exist, one will be created and configured to store 1536-dimensional vectors, matching the length of the embeddings, with cosine similarity as the similarity metric. A Pinecone vector store will then be created from the provided embeddings and index, and the chunked documents will be added to it using the Pinecone.from_documents() method.

PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
PINECONE_ENV = os.getenv('PINECONE_ENV')

# Initialize the Pinecone client
pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)
index_name = 'international-sport'

if index_name not in pinecone.list_indexes():
    print(f'Creating index {index_name}...')
    # Create the index. This might take a while
    pinecone.create_index(index_name, dimension=1536, metric='cosine')
    # Wait a moment for the index to be fully initialized
    time.sleep(1)
    print('Done')
else:
    print(f'Index {index_name} already exists')

# Embed the chunked documents and add them to the vector store
vectorstore = Pinecone.from_documents(docs, embeddings, index_name=index_name)

# Retrieve the number of vectors in the index
index = pinecone.Index(index_name)
print(index.describe_index_stats())

Finding similar documents

We will now define a function that searches Pinecone for the documents most similar to the user's query.

def get_similar_docs(query, k=5):
    found_docs = vectorstore.similarity_search(query, k=k)
    print(found_docs)
    if len(found_docs) == 0:
      return "Sorry, there is no relevant answer to your question. Please try again."
    print("Found documents similar to the query")
    return found_docs

Next, we'll use the Langchain PromptTemplate to create a predefined, parameterized prompt format that directs response generation in a specific context.

template = """
You are an AI chatbot with a sense of humor.
Your mission is to turn the user's input into funny jokes.

{chat_history}
Human: {human_input}
Chatbot:"""

new_prompt = PromptTemplate(
    input_variables=["chat_history", "human_input"],
    template=template
)
new_memory = ConversationBufferMemory(memory_key="chat_history")
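
The prompt and memory above are not wired into the retrieval flow that follows, but as a rough, hypothetical sketch, they could drive a simple conversational chain on their own:

# Hypothetical usage of new_prompt and new_memory (not part of the retrieval flow below)
from langchain.chains import LLMChain

joke_llm = OpenAI(model_name="gpt-3.5-turbo-0301")
joke_chain = LLMChain(llm=joke_llm, prompt=new_prompt, memory=new_memory)

# The memory fills in {chat_history}; we only need to pass {human_input}
print(joke_chain.predict(human_input="Scotland lost again this weekend."))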

Now, we will write a function that uses the OpenAI LLM and Langchain's question-answering chain load_qa_chain, and takes the user's query as input. You can choose from a variety of chain types depending on your use case; here we'll use the stuff chain type, which stuffs all of the retrieved documents into the prompt.

model_name = "gpt-3.5-turbo-0301"
llm = OpenAI(model_name=model_name)

chain = load_qa_chain(llm, chain_type="stuff")

def retrieve_answer(query):
  similar_docs = get_similar_docs(query)
  return chain.run(input_documents=similar_docs, question=query)

Lastly, we will test the question-answering system with the query below.

query = "How many away goals did Scotland score in the last century?"
response = retrieve_answer(query)
print(response)
# Output: 419

Conclusion

In this article, we explained how to mitigate hallucination during response generation by giving the model relevant context. We wrapped up by building a context-aware question-answering system that uses semantic search to extract relevant information from a set of documents and passes it as context to OpenAI's LLM.
