Chatting to a Website with OpenAI, LangChain, and ChromaDB

Jason Webster — Fri, 04 Aug 2023 18:21:38 +0000

Large language models (LLMs) are proving to be a powerful generational tool and assistant that can handle a large variety of questions and return human readable responses. They have also seen a lot of interest from big tech giants. ChatGPT, Bing's Assistant, and Google's Bard are all examples of large language models that can be askde questions and will respond with remarkably generalised and creative answers.

Here, we take the power of LLMs and build our own custom application around them. Specifically we're going to build a chatbot will allow us to ask natural questions of a set of scraped website data by giving an LLM app long term searchable memory with a vector database. This is a similar concept to SiteGPT. We'll use OpenAI's gpt-3.5-turbo model for our LLM, and LangChain to help us build our chatbot. Finally, we'll use use ChromaDB as a vector store, and embed data to it using OpenAI's text-ada-embedding-002 model.

This will be a beginner to intermediate level tutorial. I'll assume you have some experience with Python, but not much experience with LangChain or building applications around LLMs. I've written these sections largely independent of one another, so feel free to jump around to the section you're most interested in learning about. The code for this project is available on GitHub.

Web Scraping

We want to build a bot to chat to a website. So, we'll build a quick webscraper to collect our data. We'll use BeautifulSoup and the requests module to build it. Let's install BeautifulSoup.

pip install beautifulsoup4

First, we'll write a utility function to make a GET request to a url, and save the contents of response to a local file. We'll return the response object too. We'll put our code in a new file called scrape.py, and put all our scraped website content into a folder ./scrape.

# scrape.py

import os
import requests
import json

def get_response_and_save(url: str):
    response = requests.get(url)

    # create the scrape dir (if not found)
    if not os.path.exists("./scrape"):
        os.mkdir("./scrape")

    # save scraped content to a cleaned filename
    parsedUrl = cleanUrl(url)
    with open("./scrape/" + parsedUrl + ".html", "wb") as f:
        f.write(response.content)

    return response

def cleanUrl(url: str):
    return url.replace("https://", "").replace("/", "-").replace(".", "_")

We'll use this to create a recursive scraper function. Our intent is to:

Scrape a url and use BeautifulSoup to find all other links on the page.
Then for each url we find from the links we repeat the scrape and go to step 1.

We'll do this recursively up to some depth - where setting depth=1 will scrape the initial url's linked pages, depth=2 will scrape the links found in those pages, etc. We can use a sitemap dict object to track any already visited urls. We'll also make sure we don't escape the origin of the initial url provided (or risk scraping a hefty portion of the internet).

# scrape.py

from urllib.parse import urlparse
from collections import defaultdict
from bs4 import BeautifulSoup

...

def scrape_links(
    scheme: str,
    origin: str,
    path: str,
    depth=3,
    sitemap: dict = defaultdict(lambda: ""),
):
    siteUrl = scheme + "://" + origin + path
    cleanedUrl = cleanUrl(siteUrl)

    if depth < 0:
        return
    if sitemap[cleanedUrl] != "":
        return

    sitemap[cleanedUrl] = siteUrl
    response = get_response_and_save(siteUrl)
    soup = BeautifulSoup(response.content, "html.parser")
    links = soup.find_all("a")

    for link in links:
        href = urlparse(link.get("href"))
        if (href.netloc != origin and href.netloc != "") or (
            href.scheme != "" and href.scheme != "https"
        ):
            # don't scrape external links
            continue

        scrape_links(
            href.scheme or "https",
            href.netloc or origin,
            href.path,
            depth=depth - 1,
            sitemap=sitemap,
        )

    return sitemap

Finally, we'll make this callable via a simple command line interface using argparse. We'll also save the sitemap dict that the scrape_links function returns; this just maps the names of the locally saved/scraped files to their online source - the reason for which will become clear later.

# scrape.py

import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--site", type=str, required=True)
parser.add_argument("--depth", type=int, default=3)

if __name__ == "__main__":
    args = parser.parse_args()
    url = urlparse(args.site)
    sitemap = scrape_links(url.scheme, url.netloc, url.path, depth=args.depth)
    with open("./scrape/sitemap.json", "w") as f:
        f.write(json.dumps(sitemap))

We can now scrape a site by calling

python scrape.py --site <your_site_url> --depth 3

To get some demo data, we'll scrape LangChain's documentation at a depth of 10.

python scrape.py \
  --site https://python.langchain.com/docs/get_started/introduction.html \
  --depth 10

Embedding & Vector Databases

Now that we have data, we'll store this in a way that is easily accessible to our AI via a vector database. Specifically, we'll be using ChromaDB with the help of LangChain.

If you don't know what a vector database is, the TL;DR is that they can store and query data by using embedding vectors. An embedding vector is a way to numerically represent raw data. Ideally, embedding vectors store relational information between two or more data points, such that semantically similar data get similar numerical representations. Most AIs generate these embedding vectors during their training, and they'll learn to produce meaningful embeddings if given a large enough set of data. This makes a vector database a natural candidate for storing "memory" to "recall" for future purposes.

To get started, let's install the relevant packages.

pip install chroma langchain

We'll turn our text into embedding vectors with OpenAI's text-embedding-ada-002 model. We'll need to install openai to access it.

pip install openai

To be able to call OpenAI's model, we'll need a .env file. Let's create one.

# .env

OPENAI_API_KEY=<your_openai_api_key_here>

Replacing <your_openai_api_key_here> with an API key from OpenAI's platform. If you don't have an OpenAI account, now is a good time to create one. You'll get $18 of free credits for a new account, and if you follow along with this tutorial, you'll only use at most $2.

Now, we first need to load the data from the ./scrape/ folder. In a new embed.py file, we add the following.

# embed.py

import os
import json

from langchain.document_loaders import (
    BSHTMLLoader,
    DirectoryLoader,
)
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma

from dotenv import load_dotenv
load_dotenv()

loader = DirectoryLoader(
        "./scrape",
        glob="*.html",
        loader_cls=BSHTMLLoader,
        show_progress=True,
        loader_kwargs={"get_text_separator": " "},
)
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
)
documents = text_splitter.split_documents(data)

# map sources from file directory to web source
with open("./scrape/sitemap.json", "r") as f:
        sitemap = json.loads(f.read())
for document in documents:
        document.metadata["source"] = sitemap[
                document.metadata["source"].replace(".html", "").replace("scrape/", "")
        ]

embedding_model = OpenAIEmbeddings(model="text-embedding-ada-002")
db = Chroma.from_documents(
    documents, 
    embedding_model,
    persist_directory="./chroma"
)
db.persist()

This will

Load the html documents in the ./scrape folder, using the BeautifulSoup wrapper class BSHTMLLoader to parse the files.
Split the documents into 1000 character long chunks with 200 character overlaps (this will make querying data easier later).
Add the original URL of the scraped site as the source of the document. The DirectoryLoader automatically adds the file location as the source, and we use the sitemap from earlier to get the original URL.
Set up an embedding model using text-embedding-ada-002.
Store the documents into a ChromaDB vector store using the embedding model.
Persists the data in ChromaDB to a local ./chroma directory to be used later.

Finally, we can embed our data by just running this file. We'll load it up when we create our AI chatbot.

python embed.py

Chatting to Data

Now the real fun can begin. We can create an AI assistant to chat to our vector store. Specifically, we'll be building a question and answering (QA) chatbot that will allow us to ask it questions about the scraped data. We'll at the same time make it return its sources, for anyone with a healthy distrust of chatbots and their hallucinations.

To build our chatbot we'll use LangChain, a framework for building application on top of LLMs. This also helps abstract away much of the prompt engineering we would need to do.

We'll define the LLM we want to use as OpenAI's gpt-3.5-turbo chat model. In our main.py folder, we'll add the following.

# main.py
from dotenv import load_dotenv
load_dotenv()

from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=0, model="gpt-3.5-turbo")

Handling Questions and Memory

As we chat to our model, we will be asking it questions about our scraped website. To do this effectively, we need to ask a question that can be turned into an embedding that can be used to query our vector store. Since our question might be a follow on from a previous question, we should condense our chat history into a single question that we can eventually embed and pass to our vector store. To do this, we create a new LLMChain that will prompt our LLM with an instruction to condense our question.

# main.py

from langchain.chains import LLMChain

condense_question_prompt = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.\
Make sure to avoid using any unclear pronouns.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""

condense_question_prompt = PromptTemplate.from_template(condense_question_prompt)

condense_question_chain = LLMChain(
    llm=llm,
    prompt=condense_question_prompt,
)

Notice we're using {chat_history} and {question} templates in our string. We'll pass the {question} text in as input later down the line. For the {chat_history}, we can add memory using ConversationBufferMemory from LangChain.

# main.py

from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

Note the memory_key argument is set to chat_history. This will automatically keep a text log in memory of our conversation with our chat bot, and will look something like this:

# example chat_history injected by theConversationBufferMemory object
chat_history = """
Human: Hi bot!
AI: Hi human! What can I assist you with today?
Human: What is the answer to life, the universe, and everything else?
AI: 42
...
"""

What is a Chain?

You might be wondering, what exactly is the LLMChain doing? And what are chains anyway? Chains are essentially sequential function calls that are set up to take the output of one function as the input to another. So essentially if we define the functions f and g, then a chain would be f(g(x)).

LangChain provides many off the shelf chains to help with a variety of use cases with LLMs. The one we used above, LLMChain, will setup a chain that can take a list of inputs, format prompts for each input and then call the llm given to it.

The beauty here is we can create chains upon chains. In our case, we can create a chain that figures out the correct question to ask a vector store and retrieve a document. Then another chain that can add that document to a prompt as context (LangChain calls this "stuffing"), then prompt an LLM with the initial question to retrieve an answer. We could even create a prompt that figures out which data store is best to prompt, or create a chain that can detect if any further output is needed from the user.

We'll use a few more off the shelf chains in order to build question and answering (QA) bot that can chat to our scraped web data.

Creating a QA Chain

The basis of asking questions and getting answered is handled with LangChain's QA chains. These are a set of classes and helper functions that aid us in building a QA system on top of a data retriever object. The data retriever will just be our embedded vector store we created earlier.

We'll start by creating a QA chain that will retrieve its sources.

# main.py

from langchain.chains import create_qa_with_sources_chain

qa_chain = create_qa_with_sources_chain(llm)

We're putting a lot of heavy lifting under the create_qa_with_sources_chain function. Under the hood, all this is doing is creating an LLMChain with a special prompt that will answer a user's question given context from a retrieved document. We provide the document context with a StuffDocumentsChain, which stuffs a document into the context of a prompt.

# main.py

from langchain.prompts import PromptTemplate
from langchain.chains import create_qa_with_sources_chain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain

qa_chain = create_qa_with_sources_chain(llm)

doc_prompt = PromptTemplate(
    template="Content: {page_content}\nSource: {source}",
    input_variables=["page_content", "source"],
)

final_qa_chain = StuffDocumentsChain(
    llm_chain=qa_chain,
    document_variable_name="context",
    document_prompt=doc_prompt,
)

Finally, we wrap up the QA retrieval by finally loading our vector store with Chroma, set it up as a retriever.

# main.py

from langchain.chains import ConversationalRetrievalChain
from langchain.vectorstores import Chroma

db = Chroma(
    persist_directory="./chroma",
    embedding_function=OpenAIEmbeddings(model="text-embedding-ada-002"),
)

retrieval_qa = ConversationalRetrievalChain(
    question_generator=condense_question_chain,
    retriever=db.as_retriever(),
    memory=memory,
    combine_docs_chain=final_qa_chain,
)

And that's it. The retrieval_qa chain is the last chain we need to start querying our data. We scraped the LangChain docs in our example, so let's ask it a LangChain related question.

response = retrieval_qa.run({question: 'How can I use LangChain with LLMs?'})
print(response)

# output:
"""
{
  "answer": "LangChain provides a standard interface for LLMs, which are language models that take a string as input and return a string as output. To use LangChain with LLMs, you need to understand the different types of language models and how to work with them. You can configure the LLM and/or the prompt used in LangChain applications to customize the output. Additionally, LangChain provides prompt management, prompt optimization, and common utilities for working with LLMs. By combining LLMs with other modules in LangChain, such as chains and agents, you can create more complex applications.",
  "sources": [
    "https://python.langchain.com/docs/get_started/quickstart", "https://python.langchain.com/docs/modules/data_connection/document_loaders/markdown"
  ]
}
"""

Wow! That's a pretty good answer! And it even cites the original sources (this is because we mapped the original documents source to the initial website with the sitemap.json file). It seems to be returning a json structure as a string with the answer and sources list. We'd be able to separate these from the response using the json package.

import json

response = retrieval_qa.run({question: 'How can I use LangChain with LLMs?'})
responseDict = json.loads(response)
answer = responseDict["answer"]
sources = responseDict["sources"]

print(answer)
> """LangChain provides a standard interface for LLMs, which are language models that take a string as input and return a string as output. To use LangChain with LLMs, you need to understand the different types of language models and how to work with them. You can configure the LLM and/or the prompt used in LangChain applications to customize the output. Additionally, LangChain provides prompt management, prompt optimization, and common utilities for working with LLMs. By combining LLMs with other modules in LangChain, such as chains and agents, you can create more complex applications."""

print(sources)
> [
      "https://python.langchain.com/docs/get_started/quickstart",
      ...
  ]

Of course, this is not the best way to interface with our bot. For that, we can use Gradio.

Chatbot Interface with Gradio

Gradio is a very powerful framework for building AI demos. It can quickly create beautiful web interfaces that can interact with an underlying AI bot, and serve the interfaces locally. We'd also be able to host our demo on Hugging Face Spaces if we so choose.

To get started, we first install it

pip install gradio

and in our main.py file, all we need to do is add the following

# main.py

import gradio

def predict(message, history):
    response = retrieval_qa.run({"question": message})
    responseDict = json.loads(response)
    answer = responseDict["answer"]
    sources = responseDict["sources"]

    if type(sources) == list:
        sources = "\n".join(sources)

    if sources:
        return answer + "\n\nSee more:\n" + sources
    return answer


gradio.ChatInterface(predict).launch()

This

Creates a predict function that takes in a message and history arguments. We don't need the history, since we track it already with ConversationalBufferHistory, but Gradio's ChatInterface will require it.
The message variable is passed to the retrieval_qa model as a question.
The response from the model is a string that takes a json schema. We parse the schema and load the answer and sources, and then generate a nice output for the user.
We then setup a Gradio ChatInterface and launch it.

We can then run

python main.py

to launch our Gradio server. This is what it looks like.

Our previous question now looks really good, and we can now chat with our bot in a natural interface.

Evaluation

For the evaluation, we can scrape the LangChain docs using our custom webscraper. Using an original url, and a depth of 10, we run our scraping function

python scrape.py \
  --site https://python.langchain.com/docs/get_started/quickstart \
  --depth 10

Which will get us 659 scraped html files. We then embed this to our vector store

python embed.py

before finally running

python main.py

and interacting with the chatbot in our Gradio server.

The chatbot seems to do quite well with very direct questions. Some examples:

Question: What is LangChain and how can I use it to build apps on top of LLMs?
✅ Answer: LangChain is an open-source library that provides developers with the tools to build applications powered by large language models (LLMs). It is a framework for including AI from LLMs inside data pipelines and applications. LangChain provides a standard interface for chains, integrations with other tools, and end-to-end chains for common applications. You can use LangChain to build apps on top of LLMs by leveraging its features such as prompt management, prompt optimization, working with LLMs, chains, data augmented generation, and agents.

See more:
Question: How can I get started?
✅ Answer: To get started with LangChain, follow these steps:
1. Installation: Install the LangChain framework by following the installation guide.
2. Quickstart: Familiarize yourself with the framework by building your first LangChain application using the Quickstart guide.
3. Modules: Explore the different modules provided by LangChain, such as Models, Prompts, Memory, Indexes, Chains, and Agents.
4. Data connection: Interface with application-specific data by using the Data connection module.
5. Guides: Refer to the Guides section for additional resources and conceptual guides. Source: LangChain Documentation - Get started
See more:
- https://python.langchain.com/docs/get_started

Curiously, the answer will also sometimes cite the source of its knowledge, as well as provide sources separately.

It seems to do less well with more generic questions, especially when there are multiple plausible sources. The bot seems to get confused by all the sources, and there may not be enough context provided for it to answer accurately.

One particular example is if you ask it what LangChain is, without specifying LLMs, it will think LangChain provides integration with blockchain technology. This is technically true (with the blockchain document loader) but it is not the true purpose of LangChain.

Question: What is LangChain?
❌ Answer: LangChain is a platform that provides integration with blockchain technology. It offers document loaders for blockchain integration and provides an introduction to get started with LangChain.

See more:

Of course, the underlying ChatGPT chat model has no clue what LangChain is. ChatGPT's training data only includes data up to September 2021, and LangChain was released in 2022. A query like "What is LangChain" will only be good as the query it can make to the vector store, which will be far too generic when almost everything in the vector store is referencing LangChain.

We could get around this by fine tuning a model first, with the more generic questions, before arming it with a query-able vector store. We could also explore tweaking the chain or tweaking, tweaking the embedding process, or tweaking the vector store querying algorithms.

Conclusion and Next Steps

We've managed to build a fairly simple chatbot that can answer questions about a scraped website using LangChain, OpenAI models, a vector store, and about $2 of credit for our OpenAI API calls.

Our chatbot performs quite well when asked direct questions, and can answer multiple questions in sequence while keeping a memory of any past questions.

Our chatbot does a little less well at generalisations and when asked very generic questions. We might be able to improve it by tweaking the character size when generating documents to embed, tweak the vector store search from the default similarity_search to mmr, construct a more complex chain that attempts to query more documents, or fine tune the chat model before using it in our chain.

If you have any questions please feel free to reach out to me at jasonrobwebster@gmail.com. If you were inspired by this work I'd also love to hear it!

This article was originally published on my personal blog.

DEV Community: Jason Webster