Talk with your PDF documents in SharePoint

Vincent Cheng — Fri, 06 Dec 2024 13:42:59 +0000

A dreaded Teams/Slack message pops up: “Hey, could you help find out which documents contain [information]?” You open the SharePoint folder, only to realize that you have no idea which documents hold this information.

Fear not! In this article, we will be building a RAG application to search through the mountain of PDF documents in your SharePoint.

RAG app: https://finance-chatbot-vincent-cheng.streamlit.app/

Tech Stack

Database: ChromaDB
LLMs: OpenAI’s gpt-4o-mini, Google’s Gemini 1.5 Flash-8B
Text embeddings: OpenAI’s text-embedding-3-large, Google’s embedding-001
Frontend: Streamlit
Cloud: Streamlit Community Cloud
Tools: LangChain
Storage: Microsoft SharePoint

Architecture Overview

GitHub: https://github.com/cyshen11/finance-chatbot/tree/main

Index

For indexing, we convert the PDF documents into vector embeddings and store them in a vector database.

Given that your documents are in SharePoint, we can load them using the LangChain SharePointLoader. Before using the SharePointLoader, we need to obtain a few parameters: O365_CLIENT_ID, O365_CLIENT_SECRET, O365_TOKEN, DOCUMENT_LIBRARY_ID and FOLDER_ID. You can follow this guide on how to obtain these parameters. For the O365_TOKEN, convert the content of o365_token.txt into TOML format. Copy the output and paste it into your Streamlit secrets in this format:

[O365_TOKEN]
token_type = ...
scope = ...
expires_in = ...
...
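
One way to do the conversion is with a few lines of Python. The sketch below is an illustration, not code from the repository: the helper name token_json_to_toml is hypothetical, and it assumes the token file holds a flat JSON object (strings, numbers, booleans and lists). TOML has no null value, so null entries are skipped, and nested objects are rejected rather than guessed at.

```python
import json


def token_json_to_toml(token: dict, table: str = "O365_TOKEN") -> str:
    """Render a flat token dict as a TOML table (hypothetical helper)."""
    lines = [f"[{table}]"]
    for key, value in token.items():
        if value is None:
            # TOML cannot represent null, so the key is left out.
            continue
        if isinstance(value, dict):
            raise TypeError(f"nested value under {key!r} is not supported")
        # JSON literals for strings, numbers, booleans and flat lists
        # are also valid TOML values.
        lines.append(f"{key} = {json.dumps(value, ensure_ascii=False)}")
    return "\n".join(lines) + "\n"


# Usage (assumes o365_token.txt is in the current directory):
# from pathlib import Path
# print(token_json_to_toml(json.loads(Path("o365_token.txt").read_text())))
```

Paste the printed table into the Streamlit secrets editor as-is.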

In the Python code, read this secret, convert it into JSON, and write the JSON to o365_token.txt in the directory Path.home() / ".credentials". Then, you can initialize the SharePointLoader with the token and load the documents.

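import json
from pathlib import Path

import streamlit as st
from langchain_community.document_loaders.sharepoint import SharePointLoader

# Read the values from Streamlit secrets.
# Assumption: the two IDs are stored as root-level secrets under these names.
# SharePointLoader reads O365_CLIENT_ID and O365_CLIENT_SECRET from the environment.
O365_TOKEN = dict(st.secrets["O365_TOKEN"])
document_library_id = st.secrets["DOCUMENT_LIBRARY_ID"]
folder_id = st.secrets["FOLDER_ID"]

# Create the credentials directory if it does not exist
directory_path = Path.home() / ".credentials"
directory_path.mkdir(parents=True, exist_ok=True)

# Write the O365 token as JSON to the text file the loader looks for
with open(directory_path / "o365_token.txt", "w") as f:
    json.dump(O365_TOKEN, f)

# Initialize the document loader, authenticating with the saved token
loader = SharePointLoader(
    document_library_id=document_library_id,
    auth_with_token=True,
    folder_id=folder_id,
)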

Next, load the documents with the SharePointLoader. Before initializing the vector database, obtain the API key for the model provider you are going to use (OpenAI or Google). Then initialize the vector database (ChromaDB), specifying the collection name and the embedding model that matches the user-selected model. Pass a directory to the persist_directory parameter so that the vector database is saved on disk. Finally, add the loaded documents to the vector database with generated IDs.
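
The indexing step could look roughly like the sketch below. It is not the repository's code: the function names, the "openai"/"google" provider labels, the collection naming scheme, the ./chroma_db directory and the use of random UUIDs for the IDs are all assumptions made for illustration. It relies on the langchain-chroma, langchain-openai and langchain-google-genai packages.

```python
from uuid import uuid4


def generate_ids(n: int) -> list[str]:
    """Return n random UUID strings, one per loaded document."""
    return [str(uuid4()) for _ in range(n)]


def build_vector_store(docs, provider: str, api_key: str,
                       persist_directory: str = "./chroma_db"):
    """Embed docs with the selected provider and store them in ChromaDB."""
    if provider not in ("openai", "google"):
        raise ValueError(f"unknown provider: {provider}")

    # Imported lazily so the ID helper works without these packages installed.
    from langchain_chroma import Chroma

    if provider == "openai":
        from langchain_openai import OpenAIEmbeddings
        embeddings = OpenAIEmbeddings(
            model="text-embedding-3-large", api_key=api_key
        )
    else:
        from langchain_google_genai import GoogleGenerativeAIEmbeddings
        embeddings = GoogleGenerativeAIEmbeddings(
            model="models/embedding-001", google_api_key=api_key
        )

    vector_store = Chroma(
        # One collection per provider (hypothetical naming scheme).
        collection_name=f"sharepoint_docs_{provider}",
        embedding_function=embeddings,
        persist_directory=persist_directory,  # saved on disk
    )
    vector_store.add_documents(documents=docs, ids=generate_ids(len(docs)))
    return vector_store


# Usage (needs network access and a valid API key):
# docs = loader.load()
# store = build_vector_store(docs, "openai", api_key="...")
```

Keeping a separate collection per provider is a deliberate choice in this sketch: the two embedding models produce vectors of different sizes, so their embeddings cannot share one collection.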

Retrieval

When we submit a question in the app, the RAG pipeline converts the question into an embedding and performs a vector search, returning the top K documents (the K nearest neighbors) by vector similarity.
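
To make the idea of a top-K vector search concrete, here is a toy sketch in plain Python. It is only an illustration with made-up two-dimensional vectors and cosine similarity; in the app the search is handled by ChromaDB (in LangChain, typically through a call such as vector_store.similarity_search(question, k=...)), and the actual distance metric depends on how the collection is configured.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up embeddings for three document pages
doc_vecs = {
    "a.pdf p1": [1.0, 0.0],
    "b.pdf p3": [0.0, 1.0],
    "c.pdf p2": [0.7, 0.7],
}
nearest = top_k([1.0, 0.0], doc_vecs, k=2)
```

Real embeddings have hundreds or thousands of dimensions, and a vector database uses an index instead of scoring every document, but the ranking principle is the same.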

Generation

The RAG pipeline then passes the retrieved documents as context, together with the user's question, to the LLM to generate a response. We also extract the source and page of each document and de-duplicate them. Finally, the response, sources and pages are passed back to the frontend.
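
The context assembly and de-duplication can be sketched as follows. This is a simplified stand-in rather than the app's code: the Doc class mimics the page_content and metadata attributes of a LangChain document, the "source" and "page" metadata keys are assumed, and the prompt wording and function names are hypothetical. The LLM call itself is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Minimal stand-in for a retrieved document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def build_prompt(question, docs):
    """Combine the retrieved text and the user question into one prompt."""
    context = "\n\n".join(d.page_content for d in docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def unique_sources(docs):
    """Return (source, page) pairs in retrieval order, without duplicates."""
    seen = set()
    result = []
    for d in docs:
        key = (d.metadata.get("source"), d.metadata.get("page"))
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


docs = [
    Doc("Revenue grew in Q3.", {"source": "a.pdf", "page": 1}),
    Doc("Q3 costs were flat.", {"source": "a.pdf", "page": 1}),
    Doc("Outlook for next year.", {"source": "b.pdf", "page": 4}),
]
prompt = build_prompt("How did Q3 go?", docs)
sources = unique_sources(docs)
```

De-duplicating on the (source, page) pair means that two chunks from the same page are cited once, while different pages of the same file are still listed separately.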

Result

Tada! We found the documents!