<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vincent Cheng</title>
    <description>The latest articles on DEV Community by Vincent Cheng (@vincent_cys).</description>
    <link>https://dev.to/vincent_cys</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2533462%2F7c4932e6-aa64-46b4-9336-510adaf81b65.jpg</url>
      <title>DEV Community: Vincent Cheng</title>
      <link>https://dev.to/vincent_cys</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vincent_cys"/>
    <language>en</language>
    <item>
      <title>Talk with your PDF documents in SharePoint</title>
      <dc:creator>Vincent Cheng</dc:creator>
      <pubDate>Fri, 06 Dec 2024 13:42:59 +0000</pubDate>
      <link>https://dev.to/vincent_cys/talk-with-your-pdf-documents-in-sharepoint-2dgc</link>
      <guid>https://dev.to/vincent_cys/talk-with-your-pdf-documents-in-sharepoint-2dgc</guid>
      <description>&lt;p&gt;A dreadful Teams/Slack message pops up! &lt;em&gt;“Hey, could you help find out which documents [information] is in?”&lt;/em&gt; You open the SharePoint folder, only to realize that you have no idea which documents contain this information.&lt;/p&gt;

&lt;p&gt;Fear not! In this article, we will build a retrieval-augmented generation (RAG) application to search through the mountain of PDF documents in your SharePoint.&lt;/p&gt;

&lt;p&gt;RAG app: &lt;a href="https://finance-chatbot-vincent-cheng.streamlit.app/" rel="noopener noreferrer"&gt;https://finance-chatbot-vincent-cheng.streamlit.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ji680a7tc5w4vch17i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp1ji680a7tc5w4vch17i.png" alt="RAG app preview" width="800" height="485"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Database: ChromaDB&lt;/li&gt;
&lt;li&gt;LLM: OpenAI’s gpt-4o-mini, Google’s Gemini 1.5 Flash-8B&lt;/li&gt;
&lt;li&gt;Text embeddings: OpenAI’s text-embedding-3-large, Google’s embedding-001&lt;/li&gt;
&lt;li&gt;Frontend: Streamlit&lt;/li&gt;
&lt;li&gt;Cloud: Streamlit Community Cloud&lt;/li&gt;
&lt;li&gt;Tools: LangChain&lt;/li&gt;
&lt;li&gt;Storage: Microsoft SharePoint&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1nbii9xz4hjauqi8zly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft1nbii9xz4hjauqi8zly.png" alt="Architecture Overview" width="800" height="307"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/cyshen11/finance-chatbot/tree/main" rel="noopener noreferrer"&gt;https://github.com/cyshen11/finance-chatbot/tree/main&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Index
&lt;/h2&gt;

&lt;p&gt;In the indexing step, we convert the PDF documents into vector embeddings and store them in a vector database.&lt;/p&gt;

&lt;p&gt;Since your documents are in SharePoint, we can load them using LangChain’s SharePointLoader. Before using the SharePointLoader, we need to obtain a few parameters: &lt;code&gt;O365_CLIENT_ID&lt;/code&gt;, &lt;code&gt;O365_CLIENT_SECRET&lt;/code&gt;, &lt;code&gt;O365_TOKEN&lt;/code&gt;, &lt;code&gt;DOCUMENT_LIBRARY_ID&lt;/code&gt; and &lt;code&gt;FOLDER_ID&lt;/code&gt;. You can follow this &lt;a href="https://python.langchain.com/docs/integrations/document_loaders/microsoft_sharepoint/" rel="noopener noreferrer"&gt;guide&lt;/a&gt; to obtain them. For &lt;code&gt;O365_TOKEN&lt;/code&gt;, convert the content of &lt;code&gt;o365_token.txt&lt;/code&gt; into TOML format, then copy the output and paste it into your Streamlit secrets in this format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[O365_TOKEN]
token_type = ...
scope = ...
expires_in = ...
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
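
&lt;p&gt;The JSON-to-TOML conversion mentioned above can be done by hand or with a few lines of Python. The following is a minimal sketch, assuming the token file holds a flat JSON object whose keys are valid bare TOML keys; the function name &lt;code&gt;token_json_to_toml&lt;/code&gt; and the sample values are hypothetical and only for illustration.&lt;/p&gt;

```python
import json


def token_json_to_toml(token_json, table_name="O365_TOKEN"):
    """Render a flat JSON object as a single TOML table."""
    token = json.loads(token_json)
    lines = [f"[{table_name}]"]
    for key, value in token.items():
        if value is None:
            # TOML has no null type, so empty fields are skipped
            continue
        # JSON strings, numbers, booleans and flat lists of them
        # are also valid TOML literals (assumes no nested objects)
        lines.append(f"{key} = {json.dumps(value, ensure_ascii=False)}")
    return "\n".join(lines)


# Hypothetical token content; a real token file holds more fields
sample = '{"token_type": "Bearer", "scope": ["Sites.Read.All"], "expires_in": 3600}'
print(token_json_to_toml(sample))
```

&lt;p&gt;The printed text can then be pasted into the Streamlit secrets. Nested objects are not handled by this sketch and would need to be converted separately.&lt;/p&gt;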



&lt;p&gt;In the Python code, read this secret, convert it into JSON, and write the JSON to a file in the &lt;code&gt;Path.home() / ".credentials"&lt;/code&gt; directory. Then you can initialize the SharePointLoader with the token and load the documents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; directory_path = Path.home() / ".credentials"

 # Check if dir exist
  if not os.path.exists(directory_path):
    os.makedirs(directory_path)

  # Write O365 token into text file 
  with open(directory_path / "o365_token.txt", 'w') as f:
    json.dump(O365_TOKEN, f)

  # Initialize document loader
  loader = SharePointLoader(
    document_library_id=document_library_id, 
    auth_with_token=True,
    folder_id=folder_id
  )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Load the documents using the SharePointLoader. Before initializing the vector database, obtain the API key for the LLM provider you are going to use. Then initialize the vector database (ChromaDB), specifying the collection name and the embedding function that matches the user-selected model. Pass a directory to the &lt;code&gt;persist_directory&lt;/code&gt; parameter to save the vector database on disk. Finally, add the loaded documents to the vector database with generated ids.&lt;/p&gt;
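
&lt;p&gt;How the ids are generated is not shown here. One possible approach, sketched below as an assumption rather than the method used in the repository, is to derive a deterministic id from each chunk’s source and page so that re-indexing the same document yields the same ids. The namespace URL, &lt;code&gt;make_chunk_id&lt;/code&gt; and the sample metadata are all hypothetical.&lt;/p&gt;

```python
import uuid

# Hypothetical namespace; any fixed UUID works as long as it never changes
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/finance-chatbot")


def make_chunk_id(source, page, chunk_index=0):
    """Derive a stable id from where a chunk came from."""
    return str(uuid.uuid5(NAMESPACE, f"{source}|{page}|{chunk_index}"))


# Hypothetical metadata, shaped like what a PDF loader typically attaches
chunks = [
    {"source": "annual_report.pdf", "page": 0},
    {"source": "annual_report.pdf", "page": 1},
]
ids = [make_chunk_id(c["source"], c["page"]) for c in chunks]
print(ids)
```

&lt;p&gt;The resulting list would be passed alongside the documents when adding them to the vector database. Random ids (for example from &lt;code&gt;uuid.uuid4()&lt;/code&gt;) also work if stable ids are not needed.&lt;/p&gt;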

&lt;h2&gt;
  
  
  Retrieval
&lt;/h2&gt;

&lt;p&gt;When we submit a question in the app, the RAG pipeline converts the question into an embedding and performs a vector search to return the top K documents (the K nearest neighbors) ranked by vector similarity.&lt;/p&gt;
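
&lt;p&gt;To make the idea of a top K search concrete, here is a toy sketch in plain Python. It ranks documents by cosine similarity, which is one common choice; the metric the vector database actually uses depends on its configuration, and real systems rely on an index instead of comparing against every vector. The file names and 3-dimensional vectors are made up, and the vectors are assumed to be non-zero.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=2):
    """Return the names of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings; real models produce hundreds or thousands of dimensions
docs = {
    "budget.pdf": [0.9, 0.1, 0.0],
    "policy.pdf": [0.0, 1.0, 0.0],
    "forecast.pdf": [0.7, 0.3, 0.1],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # ['budget.pdf', 'forecast.pdf']
```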

&lt;h2&gt;
  
  
  Generation
&lt;/h2&gt;

&lt;p&gt;The RAG pipeline then passes the retrieved documents as &lt;code&gt;context&lt;/code&gt;, together with the user’s &lt;code&gt;question&lt;/code&gt;, to the LLM to generate a &lt;code&gt;response&lt;/code&gt;. We also retrieve the &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;page&lt;/code&gt; from the documents and de-duplicate them. Finally, the &lt;code&gt;response&lt;/code&gt;, &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;page&lt;/code&gt; are passed back to the front-end.&lt;/p&gt;
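
&lt;p&gt;The de-duplication step can be sketched as follows. This is an illustrative version that assumes each retrieved document exposes its &lt;code&gt;source&lt;/code&gt; and &lt;code&gt;page&lt;/code&gt; as dictionary keys; the function name &lt;code&gt;unique_sources&lt;/code&gt; and the sample data are hypothetical.&lt;/p&gt;

```python
def unique_sources(docs):
    """Collect (source, page) pairs in first-seen order without repeats."""
    seen = set()
    result = []
    for doc in docs:
        pair = (doc["source"], doc["page"])
        if pair not in seen:
            seen.add(pair)
            result.append(pair)
    return result


# Hypothetical retrieval result where two chunks come from the same page
retrieved = [
    {"source": "annual_report.pdf", "page": 3},
    {"source": "annual_report.pdf", "page": 3},
    {"source": "budget.pdf", "page": 1},
]
print(unique_sources(retrieved))  # [('annual_report.pdf', 3), ('budget.pdf', 1)]
```

&lt;p&gt;Keeping the first-seen order means the most relevant source stays at the top of the list shown to the user.&lt;/p&gt;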

&lt;h2&gt;
  
  
  Result
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4jh548rqjit4xsrz7vh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fv4jh548rqjit4xsrz7vh.png" alt="Result" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tada! We found the documents!&lt;/p&gt;

</description>
      <category>rag</category>
      <category>llm</category>
      <category>langchain</category>
      <category>sharepoint</category>
    </item>
  </channel>
</rss>
