DEV Community

Vishnu Sivan
Talk with documents using LlamaIndex

Discover the latest buzz in the tech world with LangChain and LlamaIndex! These open-source libraries offer developers the opportunity to harness the incredible power of Large Language Models (LLMs) in their applications. LlamaIndex acts as a central hub, seamlessly connecting LLMs with external data sources. Meanwhile, LangChain provides a robust framework for constructing and managing LLM-powered applications. Though still in development, these game-changing tools have the potential to revolutionize the way we build and integrate advanced language models.

In this article, we will cover the basics of LlamaIndex and create a data extraction and analysis tool using LlamaIndex, LangChain and OpenAI.

Getting Started

Table of contents

  • What are Large Language Models (LLMs)
  • What is LangChain
  • What is Streamlit
  • Introduction to LlamaIndex
  • Basic workflow of LlamaIndex
  • LlamaIndex indices
  • Creating a document extractor / analyzer application using LlamaIndex, LangChain and OpenAI
  • Installing the dependencies
  • Setting up environment variables
  • Importing the libraries
  • Designing the sidebar
  • Defining the get_response method
  • Designing streamlit input field and submit button
  • Complete code for the app
  • Running the app

What are Large Language Models (LLMs)

Large Language Models (LLMs) refer to powerful AI models that are designed to understand and generate human language. LLMs are characterized by their ability to process and generate text that is coherent, contextually relevant, and often indistinguishable from human-written content. These models are pre-trained on diverse and extensive corpora of text, such as books, articles, websites, and other sources of written language. During pre-training, the models learn to predict the next word in a given sentence or fill in missing words in a paragraph, which helps them capture grammar, syntax, and semantic relationships between words and phrases.

Large Language Models have gained significant attention and popularity due to their versatility and the impressive quality of their language generation capabilities. They have found applications in various domains, including natural language processing, content creation, chatbots, virtual assistants, and even creative writing. However, it’s important to note that LLMs are still machines and may occasionally produce inaccurate or biased outputs, highlighting the need for careful evaluation and human oversight when using them in real-world applications.

What is LangChain

LangChain is an open-source framework developed to streamline the creation of applications powered by large language models (LLMs). It provides a comprehensive set of tools, components, and interfaces that simplify the development process of LLM-centric applications. By leveraging LangChain, developers can effortlessly manage interactions with language models, seamlessly connect various components, and integrate resources like APIs and databases. The LangChain platform also offers a range of embedded APIs that empower developers to incorporate language processing capabilities without starting from scratch.

As natural language processing continues to advance and gain wider adoption, the potential applications of this technology become virtually boundless. Here are some notable features of LangChain:

  • LangChain allows developers to tailor prompts according to their specific requirements, enabling more precise and relevant language model outputs.
  • LangChain enables developers to manipulate context to establish and guide the context for improved precision and user satisfaction, enhancing the overall user experience.
  • With LangChain, developers can construct chain link components, which facilitate advanced usage scenarios and provide greater flexibility in the application design.
  • The framework provides versatile components that can be mixed and matched to suit specific application needs, providing a modular approach to development.
  • LangChain supports the integration of various models, including popular ones like GPT and HuggingFace Hub, allowing developers to leverage the cutting-edge capabilities of these language models.
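
The chaining idea above can be sketched in plain Python. This is an illustration of the concept only, not LangChain's actual API: each component transforms its input and hands the result to the next link.

```python
# Toy "chain": each step's output becomes the next step's input.
# (Conceptual sketch only; LangChain's real components differ.)

def prompt_step(question: str) -> str:
    # Fill a fixed template, mimicking a prompt-template component
    return f"Answer concisely: {question}"

def fake_llm_step(prompt: str) -> str:
    # Stand-in for a real LLM call; it just echoes for demonstration
    return f"LLM response to [{prompt}]"

def parse_step(raw: str) -> dict:
    # Post-process the model output into a structured result
    return {"answer": raw}

def run_chain(question, steps):
    value = question
    for step in steps:
        value = step(value)
    return value

result = run_chain("What is LangChain?", [prompt_step, fake_llm_step, parse_step])
print(result["answer"])
```

Real LangChain chains follow the same pattern conceptually, with prompt templates, LLM wrappers, and output parsers as the links.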

What is Streamlit

Streamlit is a Python library that enables the effortless creation and sharing of interactive web applications and data visualizations. It provides a user-friendly interface for developing interactive charts and graphs using popular data visualization libraries such as matplotlib, pandas, and plotly. With Streamlit, you can build web apps that respond in real-time to user input, making it easy to create dynamic and engaging data-driven applications.

Introduction to LlamaIndex

The primary concept behind LlamaIndex is the ability to query your own documents, whether they consist of text or code, using a large language model (LLM) such as ChatGPT. It is an open-source project that serves as a bridge between LLMs and external data sources such as APIs, PDFs, and SQL databases. It offers a straightforward interface and facilitates the creation of indices for both structured and unstructured data, effectively handling the variations among different data sources. LlamaIndex can store the context required for prompt engineering, address the challenges of large context windows, and help balance cost and performance when executing queries.

LlamaIndex org profile on Hugging Face: huggingface.co

Basic workflow of LlamaIndex


  • The document is loaded into LlamaIndex using pre-built readers for various sources, including databases, Discord, Slack, Google Docs, Notion, and GitHub repositories.
  • LlamaIndex parses the documents, breaking them down into nodes or chunks of text.
  • An index is created to efficiently retrieve relevant data when querying the documents. The index can be stored in different ways, with the Vector Store being a commonly used method.
  • To perform a query, the document is searched using the index stored in the vector store. The response is then sent back to the user.
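
The steps above can be sketched as a toy in plain Python, with a bag-of-words count standing in for a real embedding model. This illustrates the load, chunk, index, and query flow only; it is not how LlamaIndex is implemented internally.

```python
# Toy sketch of the load -> chunk -> index -> query flow (not LlamaIndex internals).
import math
from collections import Counter

def chunk(text, size=5):
    # Break a document into fixed-size "nodes" of words
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text):
    # Stand-in "embedding": a bag-of-words count vector
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Load" a document, break it into nodes, and build a tiny vector store
document = "LlamaIndex connects large language models to external data sources for querying"
nodes = chunk(document)
vector_store = [(node, embed(node)) for node in nodes]

# "Query": return the node most similar to the question
query = "external data sources"
best = max(vector_store, key=lambda item: cosine(embed(query), item[1]))
print(best[0])
```

A real vector store replaces the count vectors with dense embeddings and the linear scan with approximate nearest-neighbour search.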

LlamaIndex indices

LlamaIndex provides specialized indices in the form of unique data structures.

  • Vector store index: Widely used for answering queries across a large corpus of data.
  • List index: Beneficial for synthesizing answers that combine information from multiple data sources.
  • Keyword table index: Useful for routing queries to different unrelated data sources.
  • Knowledge graph index: Effective for constructing and utilizing knowledge graphs.
  • Structured store index: Well-suited for handling structured data, such as SQL queries.
  • Tree index: Valuable for summarizing collections of documents.
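
To make the keyword table index concrete, here is a toy router in plain Python. It is a sketch of the routing idea only, not LlamaIndex's implementation, and the table entries are made-up examples.

```python
# Toy keyword routing: map query keywords to the data source that should answer them.
keyword_table = {
    "invoice": "billing_docs",
    "refund": "billing_docs",
    "gpu": "hardware_docs",
    "driver": "hardware_docs",
}

def route(query: str) -> str:
    # Return the first data source whose keyword appears in the query
    for word in query.lower().split():
        if word.strip("?,.!") in keyword_table:
            return keyword_table[word.strip("?,.!")]
    return "general_docs"

print(route("Where is my refund?"))
print(route("update the gpu driver"))
```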

Creating a document extractor / analyzer application using LlamaIndex, LangChain and OpenAI

In the previous sections, we discussed the basics of LLMs, LangChain, and LlamaIndex. In this section, we will create a basic document extractor / analyzer application using these generative AI tools. The application takes an OpenAI API key and a directory path as inputs and provides an interface for interacting with the documents in the specified directory.

Installing the dependencies

Create and activate a virtual environment by executing the following command.

python -m venv venv
source venv/bin/activate  # for Ubuntu/macOS
venv\Scripts\activate     # for Windows

Install llama-index and streamlit libraries using pip.

Note that LlamaIndex requires Python 3.8 or later. Use the pinned streamlit and llama-index versions below; at the time of writing, the latest llama-index release raised a RateLimit error that had not yet been fixed by the LlamaIndex team.

pip install streamlit==1.24.0
pip install llama-index==0.5.27

Setting up environment variables

An OpenAI API key is required for LlamaIndex to call the OpenAI models. Follow these steps to create a new key.

  • Open platform.openai.com.
  • Click on your name or icon in the top right corner of the page and select “API Keys”, or go directly to the Account API Keys page.
  • Click the “Create new secret key” button to generate a new key.
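
Rather than hard-coding the key in your script, you can export it as an environment variable (a placeholder key is shown; substitute your own):

```shell
# Linux/macOS: export the key for the current shell session
export OPENAI_API_KEY="sk-<your-key>"

# Windows PowerShell equivalent (commented out here):
# $env:OPENAI_API_KEY = "sk-<your-key>"
```

Libraries that wrap the OpenAI API typically pick up OPENAI_API_KEY automatically, so the key never lands in source control.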

Importing the libraries

Create a file named app.py and import the necessary libraries by adding the following code to it.

import os, streamlit as st

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper, ServiceContext
from langchain.llms.openai import OpenAI

Designing the sidebar

Create a sidebar using Streamlit's st.sidebar to collect the OpenAI API key and the directory path from the user. Add the following code to the app.py file.

openai_api_key = st.sidebar.text_input(
    label="#### Your OpenAI API key 👇",
    placeholder="Paste your openAI API key, sk-",
    type="password")

directory_path = st.sidebar.text_input(
    label="#### Your data directory path 👇",
    placeholder=r"C:\data",
    type="default")

Defining the get_response method

Create a get_response() method which takes query, directory_path, and openai_api_key as arguments and returns the query response.

Add the following code to the app.py file.

def get_response(query,directory_path,openai_api_key):
    # This example uses text-davinci-003 by default; change the model if desired.
    # Skip the openai_api_key argument if you have already set OPENAI_API_KEY as an environment variable
    llm_predictor = LLMPredictor(llm=OpenAI(openai_api_key=openai_api_key, temperature=0, model_name="text-davinci-003"))

    # Configure prompt parameters and initialise helper
    max_input_size = 4096
    num_output = 256
    max_chunk_overlap = 20

    prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

    if os.path.isdir(directory_path): 
        # Load documents from the given directory
        documents = SimpleDirectoryReader(directory_path).load_data()
        service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
        index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

        response = index.query(query)
        if response is None:
            st.error("Oops! No result found")
        else:
            st.success(response)
    else:
        st.error(f"Not a valid directory: {directory_path}")

Understanding the code:

  • Create an object llm_predictor for the class LLMPredictor which accepts a parameter llm. Specify a model, text-davinci-003 from OpenAI’s API. Specify temperature and openai api key as the object arguments.
  • Create a PromptHelper by specifying the maximum input size (max_input_size), number of outputs (num_output), maximum chunk overlap (max_chunk_overlap).
  • The SimpleDirectoryReader class reads data from a directory. It is given the directory path collected from the sidebar, and when its load_data method is invoked it loads every supported file in that directory and returns them as a list of documents.
  • The GPTSimpleVectorIndex class is specifically designed to establish an index that enables efficient searching and retrieval of documents. To create this index, we use the from_documents method of the class, which requires two parameters — documents and service_context.
  • The documents parameter is used to represent the actual documents that will be indexed.
  • The service_context parameter is used to denote the service context being passed along with the documents.
  • Query documents using index.query(query) as the query input.
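
The max_chunk_overlap value controls how much text consecutive chunks share, so context isn't lost at chunk boundaries. A toy sliding-window splitter (not PromptHelper's actual algorithm) makes the effect visible:

```python
# Toy sliding-window splitter to illustrate chunk overlap
# (not PromptHelper's actual algorithm).
def split_with_overlap(words, chunk_size, overlap):
    step = chunk_size - overlap  # advance by chunk_size minus the shared portion
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break
    return chunks

words = ["w%d" % i for i in range(10)]
chunks = split_with_overlap(words, chunk_size=4, overlap=2)
print(chunks)  # each consecutive pair of chunks shares 2 words
```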

Designing streamlit input field and submit button

Create an input field and a submit button using streamlit to get the user queries. Call the get_response() method inside the submit button to execute llama-index for querying the documents through the given input.

Add the following code to the app.py file.

# Define a simple Streamlit app
st.title("ChatMATE")
query = st.text_input("What would you like to ask?", "")

# If the 'Submit' button is clicked
if st.button("Submit"):
    if not query.strip():
        st.error("Please provide a search query.")
    else:
        try:
            if len(openai_api_key) > 0:
                get_response(query, directory_path, openai_api_key)
            else:
                st.error("Enter a valid OpenAI API key")
        except Exception as e:
            st.error(f"An error occurred: {e}")

Complete code for the app

import os, streamlit as st

from llama_index import GPTSimpleVectorIndex, SimpleDirectoryReader, LLMPredictor, PromptHelper, ServiceContext
from langchain.llms.openai import OpenAI

# Uncomment to specify your OpenAI API key here, or add corresponding environment variable (recommended)
# os.environ['OPENAI_API_KEY'] = "sk-<your-api-key>"

# Provide the OpenAI key from the frontend if you are not using the line above to set the key
openai_api_key = st.sidebar.text_input(
    label="#### Your OpenAI API key 👇",
    placeholder="Paste your openAI API key, sk-",
    type="password")

directory_path = st.sidebar.text_input(
    label="#### Your data directory path 👇",
    placeholder=r"C:\data",
    type="default")

def get_response(query,directory_path,openai_api_key):
    # This example uses text-davinci-003 by default; change the model if desired.
    # Skip the openai_api_key argument if you have already set OPENAI_API_KEY as an environment variable
    llm_predictor = LLMPredictor(llm=OpenAI(openai_api_key=openai_api_key, temperature=0, model_name="text-davinci-003"))

    # Configure prompt parameters and initialise helper
    max_input_size = 4096
    num_output = 256
    max_chunk_overlap = 20

    prompt_helper = PromptHelper(max_input_size, num_output, max_chunk_overlap)

    if os.path.isdir(directory_path): 
        # Load documents from the given directory
        documents = SimpleDirectoryReader(directory_path).load_data()
        service_context = ServiceContext.from_defaults(llm_predictor=llm_predictor, prompt_helper=prompt_helper)
        index = GPTSimpleVectorIndex.from_documents(documents, service_context=service_context)

        response = index.query(query)
        if response is None:
            st.error("Oops! No result found")
        else:
            st.success(response)
    else:
        st.error(f"Not a valid directory: {directory_path}")

# Define a simple Streamlit app
st.title("ChatMATE")
query = st.text_input("What would you like to ask?", "")

# If the 'Submit' button is clicked
if st.button("Submit"):
    if not query.strip():
        st.error("Please provide a search query.")
    else:
        try:
            if len(openai_api_key) > 0:
                get_response(query, directory_path, openai_api_key)
            else:
                st.error("Enter a valid OpenAI API key")
        except Exception as e:
            st.error(f"An error occurred: {e}")

Running the app

To run the app, an OpenAI API key and a directory path are required. Create a few text files containing the content you want to query and place them in a directory; specify that directory when running the app. In this demo, text files containing information about quantum physics and quantum computing were used.

Run the app using the following command,

streamlit run app.py

The output is as given below.

[Screenshot: app output]

Thanks for reading this article.

Thanks Gowri M Bhatt for reviewing the content.

If you enjoyed this article, please click on the heart button ♥ and share to help others find it!

The full source code for this tutorial can be found here,

GitHub - codemaker2015/llamaindex-based-document-extractor | github.com
