Yet another example of applying LangChain, to give some inspiration for the new community Grand Prix contest.
I was initially looking to build a chain to dynamically search the HTML of the documentation site, but in the end it was simpler to just borg the static PDFs instead.
Create a new virtual environment
```shell
mkdir chainpdf
cd chainpdf
python -m venv .
scripts\activate

pip install openai
pip install langchain
pip install wget
pip install lancedb
pip install tiktoken
pip install pypdf

set OPENAI_API_KEY=[ Your OpenAI Key ]
python
```
Prepare the docs
```python
import glob
import zipfile

import wget

# download the documentation PDFs
url = 'https://docs.intersystems.com/irisforhealth20231/csp/docbook/pdfs.zip'
wget.download(url)

# extract docs
with zipfile.ZipFile('pdfs.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

# get a list of files
pdfFiles = [file for file in glob.glob("./pdfs/pdfs/*")]
```
Load docs into Vector Store
```python
import glob

import lancedb
from langchain import OpenAI
from langchain.chains import LLMChain
from langchain.document_loaders import PyPDFLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.prompts.prompt import PromptTemplate
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import LanceDB

embeddings = OpenAIEmbeddings()
db = lancedb.connect('lancedb')
table = db.create_table(
    "my_table",
    data=[{"vector": embeddings.embed_query("Hello World"),
           "text": "Hello World",
           "id": "1"}],
    mode="overwrite")

documentsAll = []
pdfFiles = [file for file in glob.glob("./pdfs/pdfs/*")]
for file_name in pdfFiles:
    loader = PyPDFLoader(file_name)
    pages = loader.load_and_split()
    # Strip unwanted padding (non-breaking spaces) from each page
    for page in pages:
        del page.lc_kwargs
        page.page_content = "".join(page.page_content.split('\xa0'))
    documents = CharacterTextSplitter().split_documents(pages)
    # Ignore the cover pages
    for document in documents[2:]:
        documentsAll.append(document)

# This will take a couple of minutes to complete
docsearch = LanceDB.from_documents(documentsAll, embeddings, connection=table)
```
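The padding-stripping line above simply deletes the non-breaking-space characters (`\xa0`) that PDF extraction tends to leave behind. As a standalone helper (the name `strip_padding` is mine, not from the original code), it would look like this:

```python
def strip_padding(text: str) -> str:
    """Remove non-breaking spaces (\\xa0) left behind by PDF extraction."""
    return "".join(text.split('\xa0'))
```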
Prep the search template
```python
_GetDocWords_TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

PROMPT = PromptTemplate(
    input_variables=["docs", "question"],
    template=_GetDocWords_TEMPLATE
)

llm = OpenAI(temperature=0, verbose=True)
chain = LLMChain(llm=llm, prompt=PROMPT)
```
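Under the hood, `PromptTemplate` is essentially doing string substitution: the retrieved document chunks and the user's question get slotted into the template before the whole thing is sent to the LLM. A plain-Python sketch of what the final prompt looks like (the sample values are mine):

```python
# The same template the chain uses, filled in with ordinary str.format
template = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

prompt_text = template.format(
    question="What is a File adapter",
    docs="<retrieved document chunks would go here>")
```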
Are you sitting down... Let's talk with the documentation
"What is a File adapter?"
```python
# Ask the question
# First query the vector store for matching content
query = "What is a File adapter"
docs = docsearch.similarity_search(query)

# Only pass the first two documents to keep the OpenAI token count down
chain.run(docs=docs[:2], question=query)
```
Answer:
'\nA file adapter is a type of software that enables the transfer of data between two different systems. It is typically used to move data from one system to another, such as from a database to a file system, or from a file system to a database. It can also be used to move data between different types of systems, such as from a web server to a database.'
"What is a lock table?"
```python
# Ask the question
# First query the vector store for matching content
query = "What is a lock table"
docs = docsearch.similarity_search(query)

# Only pass the first two documents to keep the OpenAI token count down
chain.run(docs=docs[:2], question=query)
```
Answer:
'\nA lock table is a system-wide, in-memory table maintained by InterSystems IRIS that records all current locks and the processes that have owned them. It is accessible via the Management Portal, where you can view the locks and (in rare cases, if needed) remove them.'
I'll leave building a user interface on top of this functionality as a future exercise.
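A minimal starting point for that exercise might be to wrap the retrieve-then-ask steps into one function and loop on console input. This is only a sketch, assuming the `docsearch` and `chain` objects built above; the `ask` helper and its `k` parameter are my own names:

```python
def ask(query, docsearch, chain, k=2):
    """Retrieve the top matches from the vector store, then ask the LLM."""
    docs = docsearch.similarity_search(query)
    # Only pass the first k documents to keep the token count down
    return chain.run(docs=docs[:k], question=query)

# A bare-bones console "UI" (run interactively):
# while True:
#     question = input("Ask the docs> ")
#     if not question:
#         break
#     print(ask(question, docsearch, chain))
```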