Yet another example of applying LangChain, to give some inspiration for the new community Grand Prix contest.
I was initially looking to build a chain to dynamically search the HTML of the documentation site, but in the end it was simpler to just borg the static PDFs instead.
Create a new virtual environment
```shell
mkdir chainpdf
cd chainpdf
python -m venv .
scripts\activate

pip install openai
pip install langchain
pip install wget
pip install lancedb
pip install tiktoken
pip install pypdf

set OPENAI_API_KEY=[ Your OpenAI Key ]
python
```
Prepare the docs
```python
import glob
import zipfile

import wget

# download the documentation PDFs
url = 'https://docs.intersystems.com/irisforhealth20231/csp/docbook/pdfs.zip'
wget.download(url)

# extract docs
with zipfile.ZipFile('pdfs.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

# get a list of files
pdfFiles = [file for file in glob.glob("./pdfs/pdfs/*")]
```
Load docs into Vector Store
```python
import glob

import lancedb
from langchain import OpenAI
from langchain.chains import LLMChain
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts.prompt import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import LanceDB

embeddings = OpenAIEmbeddings()
db = lancedb.connect('lancedb')
table = db.create_table(
    "my_table",
    data=[{"vector": embeddings.embed_query("Hello World"),
           "text": "Hello World",
           "id": "1"}],
    mode="overwrite")

documentsAll = []
pdfFiles = [file for file in glob.glob("./pdfs/pdfs/*")]
for file_name in pdfFiles:
    loader = PyPDFLoader(file_name)
    pages = loader.load_and_split()
    # Strip unwanted padding (non-breaking spaces) from each page
    for page in pages:
        del page.lc_kwargs
        page.page_content = "".join(page.page_content.split('\xa0'))
    documents = CharacterTextSplitter().split_documents(pages)
    # Ignore the cover pages
    for document in documents[2:]:
        documentsAll.append(document)

# This will take a couple of minutes to complete
docsearch = LanceDB.from_documents(documentsAll, embeddings, connection=table)
```
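The padding-stripping line above simply deletes the non-breaking-space characters (`\xa0`) that PDF extraction tends to leave behind. As a standalone helper (the name `strip_padding` is mine, not from the original code), it would look like this:

```python
def strip_padding(text: str) -> str:
    """Remove non-breaking spaces (\\xa0) left behind by PDF extraction."""
    return "".join(text.split('\xa0'))
```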
Prep the search template
```python
_GetDocWords_TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

PROMPT = PromptTemplate(
    input_variables=["docs", "question"],
    template=_GetDocWords_TEMPLATE
)

llm = OpenAI(temperature=0, verbose=True)
chain = LLMChain(llm=llm, prompt=PROMPT)
```
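Under the hood, `PromptTemplate` is essentially doing string substitution: the retrieved document chunks and the user's question get slotted into the template before the whole thing is sent to the LLM. A plain-Python sketch of what the final prompt looks like (the sample values are mine):

```python
# The same template the chain uses, filled in with ordinary str.format
template = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

prompt_text = template.format(
    question="What is a File adapter",
    docs="<retrieved document chunks would go here>")
```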
Are you sitting down... Let's talk with the documentation
"What is a File adapter?"
```python
# Ask the question
# First query the vector store for matching content
query = "What is a File adapter"
docs = docsearch.similarity_search(query)

# Only pass the first two documents to keep the OpenAI token count down
chain.run(docs=docs[:2], question=query)
```
Answer:
'\nA file adapter is a type of software that enables the transfer of data between two different systems. It is typically used to move data from one system to another, such as from a database to a file system, or from a file system to a database. It can also be used to move data between different types of systems, such as from a web server to a database.'
"What is a lock table?"
```python
# Ask the question
# First query the vector store for matching content
query = "What is a lock table"
docs = docsearch.similarity_search(query)

# Only pass the first two documents to keep the OpenAI token count down
chain.run(docs=docs[:2], question=query)
```
Answer:
'\nA lock table is a system-wide, in-memory table maintained by InterSystems IRIS that records all current locks and the processes that have owned them. It is accessible via the Management Portal, where you can view the locks and (in rare cases, if needed) remove them.'
I'll leave building a user interface on top of this functionality as a future exercise.
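A minimal starting point for that exercise might be to wrap the retrieve-then-ask steps into one function and loop on console input. This is only a sketch, assuming the `docsearch` and `chain` objects built above; the `ask` helper and its `k` parameter are my own names:

```python
def ask(query, docsearch, chain, k=2):
    """Retrieve the top matches from the vector store, then ask the LLM."""
    docs = docsearch.similarity_search(query)
    # Only pass the first k documents to keep the token count down
    return chain.run(docs=docs[:k], question=query)

# A bare-bones console "UI" (run interactively):
# while True:
#     question = input("Ask the docs> ")
#     if not question:
#         break
#     print(ask(question, docsearch, chain))
```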