Kevin Piacentini

Chat with your PDF: Build a PDF Analyst with LlamaIndex and AgentLabs

In this tutorial, we'll learn how to use some basic features of LlamaIndex to build a PDF Document Analyst.

We'll use the AgentLabs interface to interact with our analyst, uploading documents and asking questions about them.

The tools we'll use

LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models.

It makes it easy to build LLM-powered backend applications.

AgentLabs will allow us to get a frontend in no time, using either Python or TypeScript for the backend (here, we'll use Python).

What we are building

Getting started

As usual, let's install all the dependencies we'll need.

If you're using pip:

pip install pypdf langchain llama-index agentlabs-sdk

If you're using poetry:

poetry add pypdf langchain llama-index agentlabs-sdk

And now import them all:

from langchain.llms import OpenAI

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.node_parser import SimpleNodeParser
from llama_index import set_global_service_context
from llama_index.response.pprint_utils import pprint_response
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.query_engine import SubQuestionQueryEngine

from agentlabs.chat import IncomingChatMessage, MessageAttachment
from agentlabs.project import Project
from agentlabs.chat import MessageFormat

import asyncio
import os

Preparing our model

Before getting started, we need to instantiate the model we'll use throughout this tutorial to handle users' requests and compute our documents' embeddings (we'll talk more about this later).

Here, we'll use OpenAI's text-davinci-003 model. We pass it a max_tokens value of -1 so the output length is treated as unlimited.

llm = OpenAI(temperature=0, model_name="text-davinci-003", max_tokens=-1)

Now we construct a ServiceContext, passing our llm as an argument so that every time the framework needs to call a model, it uses our instance.

service_context = ServiceContext.from_defaults(llm=llm)
set_global_service_context(service_context=service_context)

What's next?

Okay, so now our users will be able to upload some files.

To give our application the ability to retrieve information from these large files, we'll need to transform them a bit and store them in dedicated storage.

Note: If you're not familiar with embeddings, here's an article that explains embeddings in detail.

Long story short, semantic search is not something a regular database can do out of the box. To allow our model to retrieve data by semantic proximity, we'll proceed in two steps (illustrated by the small sketch below):

  1. we'll ask a model to transform the data into a mathematical representation (called an embedding)
  2. we'll store these representations in a database that can compute the spatial similarity between two embeddings.
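To make these two steps concrete, here's a tiny optional sketch that embeds three sentences with llama_index's OpenAIEmbedding wrapper (it reads your OPENAI_API_KEY) and compares them with cosine similarity. The example sentences are made up, and none of this code is needed later: the vector store index we build below does all of this for us.

from llama_index.embeddings import OpenAIEmbedding

embed_model = OpenAIEmbedding()  # uses the OPENAI_API_KEY environment variable

# Step 1: turn each text into a vector (its embedding)
invoice = embed_model.get_text_embedding("The invoice total is 420 EUR.")
question = embed_model.get_text_embedding("How much do I owe on this invoice?")
unrelated = embed_model.get_text_embedding("Penguins live in Antarctica.")

# Step 2: compare vectors by spatial (cosine) similarity
def cosine(u, v):
  dot = sum(a * b for a, b in zip(u, v))
  return dot / ((sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5))

print(cosine(invoice, question))   # semantically close -> higher score
print(cosine(invoice, unrelated))  # unrelated -> lower score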

Files handling and indexing

Let's assume we know where the files are stored on the filesystem and that we have the absolute path of every file.

First, we'll initialize a variable that will hold our vector store index; it will be None at first, and you'll understand why very soon.

vs_index = None

Now, we'll use the SimpleDirectoryReader and its load_data() method to transform every PDF file into a plain text document using pypdf under the hood.

def load_and_index_files(paths):
  docs = SimpleDirectoryReader(input_files=paths).load_data()

In that same function, we'll now use VectorStoreIndex.from_documents() to create an in-memory vector store index containing our embeddings.

Under the hood, this method uses the embedding model from our service context (OpenAI's embedding model by default) to compute all the vector embeddings for us.

Let's update our function:


def load_and_index_files(paths):
  docs = SimpleDirectoryReader(input_files=paths).load_data()
  vs_index = VectorStoreIndex.from_documents(docs)

Final change: since our users will be able to upload multiple documents, we want to re-index and update our vector store index every time a user uploads a new document.

We can achieve this by using the SimpleNodeParser and inserting nodes directly into our index.

Here's our final function:

def load_and_index_files(paths):
  docs = SimpleDirectoryReader(input_files=paths).load_data()
  global vs_index
  if vs_index is None:
    # first upload: build the index from scratch
    vs_index = VectorStoreIndex.from_documents(docs)
  else:
    # subsequent uploads: split the new documents into nodes
    # and insert them into the existing index
    parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
    new_nodes = parser.get_nodes_from_documents(docs)
    vs_index.insert_nodes(new_nodes)

Querying

Now that we know how to handle our files and build our index, we can create what we need to query it.

To do so, we'll configure a query engine that retrieves the 3 chunks most similar to each query (similarity_top_k=3):

engine = vs_index.as_query_engine(similarity_top_k=3)

Now that we have our query engine, we can use it to send queries:

# or: response = await engine.aquery(...) if you're already in an async function
response = asyncio.run(engine.aquery("your query about your document"))
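Optionally, since we imported pprint_response earlier, we can use it to inspect the answer together with the source chunks it was built from:

# pretty-print the answer; show_source=True also prints the retrieved chunks
pprint_response(response, show_source=True)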

But obviously, to get it working, we now need to wrap everything up and set up the UI for our users.

No worries, this is probably the most straightforward part.

Setting up the UI

We'll start by setting up the user interface with AgentLabs.
It's fairly easy to do:

  • sign in to https://agentlabs.dev
  • create a project
  • create an agent and name it ChatGPT
  • create a secret key for this agent


Init the AgentLabs project

Now, we'll initialize AgentLabs with the information provided in our dashboard.

from agentlabs.agent import Agent
from agentlabs.chat import IncomingChatMessage, MessageFormat
from agentlabs.project import Project
import os


alabs = Project(
    project_id="df3e3beb-49c4-4bd7-9193-e7755e4e1578",
    agentlabs_url="https://llamaindex-analyst.app.agentlabs.dev",
    secret=os.environ['AGENTLABS_SECRET'],
)

agent = alabs.agent(id="5fb3e7af-5cb3-4095-bca1-47db49774730")

# connect to AgentLabs and block until the process stops; in the final script,
# register the message handler (see below) before calling wait()
alabs.connect()
alabs.wait()

Here, we read our secret from an environment variable for security reasons. All the values above can be found in your AgentLabs console.
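If you want to fail fast when the secret isn't set, a small optional check like this can help:

# set the secret before starting the app, e.g. `export AGENTLABS_SECRET=...`
if "AGENTLABS_SECRET" not in os.environ:
  raise RuntimeError("Please set the AGENTLABS_SECRET environment variable.")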

Handling user uploads and messages

We'll use the on_chat_message method provided by AgentLabs to handle every message (including files) sent by the user.

We'll define a handler with simple logic:

  • if the message contains one or more attachments, we'll download them and use the load_and_index_files function we created earlier;

  • if the message contains no attachments but we haven't indexed any files yet, we'll send a kind message inviting the user to upload some files;

  • otherwise, we'll run the query against our engine and return the result to the user.
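Note that the handler below calls a small download_attachments helper that we haven't defined. Here's a minimal sketch of what it could look like; attachment.name and attachment.download() are assumptions about the MessageAttachment API, so check the AgentLabs SDK documentation and adapt as needed.

import tempfile

def download_attachments(attachments):
  # Hypothetical helper: save each uploaded attachment to a temporary
  # directory and return the absolute paths of the downloaded files.
  # NOTE: `attachment.name` and `attachment.download(path)` are assumed here.
  tmp_dir = tempfile.mkdtemp()
  paths = []
  for attachment in attachments:
    path = os.path.join(tmp_dir, attachment.name)
    attachment.download(path)
    paths.append(path)
  return paths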

Here's our handler code:

def handle_message(msg: IncomingChatMessage):
  if len(msg.attachments) > 0:
    agent.typewrite(
      conversation_id=msg.conversation_id,
      text="Ok, I am indexing your files"
    )
    st = agent.create_stream(conversation_id=msg.conversation_id, format=MessageFormat.MARKDOWN)

    paths = download_attachments(msg.attachments)
    load_and_index_files(paths)

    st.typewrite("All files have been indexed. You can ask me questions now.")
    st.end()
    return

  if vs_index is None:
    return agent.typewrite(
      conversation_id=msg.conversation_id,
      text="No files have been indexed yet. Please upload some files."
    )

  engine = vs_index.as_query_engine(similarity_top_k=3)

  response = asyncio.run(engine.aquery(msg.text))

  agent.typewrite(
    conversation_id=msg.conversation_id,
    text=response.response,
  )



alabs.on_chat_message(handle_message)

You probably noticed that AgentLabs provides some practical built-in methods to interact with our users in real time, such as agent.typewrite() and agent.create_stream().

You can find more information about these methods in the official documentation.

Et voilà!

Congrats, your project is ready!

You can also retrieve the entire source code here.

Here's how it looks again:

Conclusion

In this tutorial, we only covered some of the basic querying and storage mechanisms available in LlamaIndex.

However, it should give you an idea of how to get started and easily prototype powerful LLM apps with AgentLabs and LlamaIndex.

If you liked this tutorial, feel free to leave a comment below and smash the like button :)
