Use ChatGPT to Query Your Internal Website

ChatGPT is making waves across the internet. It excels at answering questions from the factual knowledge it was trained on, but that knowledge has a limit: it covers only what was available on the public internet up to September 2021. What if you want ChatGPT to answer questions about content created after September 2021? What if you want it to answer questions based on internal documentation that is not on the public internet? This is where the OpenAI API for interacting with GPT models helps.

There are two ways to give ChatGPT new knowledge:
1) Fine-tune the GPT model on new training data.
2) Insert the new knowledge into the question you are asking, as context.

For factual recall, the second approach works better. How much context you can supply is limited by the maximum amount of text each model can read at once (credit: OpenAI Cookbook):

Model	Maximum text length
gpt-3.5-turbo	4,096 tokens (~5 pages)
gpt-4	8,192 tokens (~10 pages)
gpt-4-32k	32,768 tokens (~40 pages)
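
As a rough way to check whether a document fits within one of these limits, the sketch below estimates token counts with the common rule of thumb of about four characters per English token. The helper names, the heuristic, and the 500 tokens reserved for the reply are assumptions for illustration; an exact count needs the model's own tokenizer.

```python
# Context limits taken from the table above (tokens per request).
MAX_TOKENS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}

def estimate_tokens(text):
    """Very rough estimate: about 4 characters per token for English text."""
    return max(1, len(text) // 4)

def fits_in_context(text, model, reserved_for_answer=500):
    """True if the estimated prompt size leaves room for the model's reply."""
    return estimate_tokens(text) + reserved_for_answer <= MAX_TOKENS[model]

page = "word " * 4000  # 20,000 characters, roughly 5,000 tokens
print(fits_in_context(page, "gpt-3.5-turbo"))  # False
print(fits_in_context(page, "gpt-4"))          # True
```

Pages that do not fit have to be shortened or split before they can be used as context.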

To enable ChatGPT to answer questions based on your internal documentation, follow this procedure:

1) Collect the source text from your internal documentation and create embeddings from it.

Embeddings are vectors (arrays of floating-point numbers) that represent the meaning of a piece of text. These vectors typically have a thousand or more dimensions. We will store the embeddings in a file for this POC, but for real use they should live in a vector database.

2) When a question comes in from the user, create an embedding of the question.

3) Compare the question's embedding with the documentation embeddings and rank the documents by a similarity measure to find the most relevant ones.

4) Pass the most relevant documentation to the GPT model as context, alongside the question.
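
The four steps can be sketched end to end with a toy stand-in for the embedding model. Here `toy_embed` simply counts words from a small fixed vocabulary; it is a hypothetical placeholder, not how OpenAI embeddings are computed, and the two documents are made up for illustration.

```python
import math

VOCAB = ["pasta", "boil", "cake", "bake", "oven", "water"]

def toy_embed(text):
    """Hypothetical embedding: word counts over a tiny vocabulary."""
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Step 1: embed the documentation once.
docs = {
    "Pasta": "boil water then add pasta",
    "Cake": "bake the cake in the oven",
}
doc_vectors = {title: toy_embed(body) for title, body in docs.items()}

# Step 2: embed the question.
question = "how long do I boil pasta"
q_vec = toy_embed(question)

# Step 3: rank documents by similarity to the question.
ranked = sorted(docs, key=lambda t: cosine(doc_vectors[t], q_vec), reverse=True)

# Step 4: put the best match into the prompt as context.
prompt = f"Context:\n{docs[ranked[0]]}\n\nQ: {question}\nA:"
print(prompt)
```

With a real embedding model the vectors capture meaning rather than exact word overlap, but the ranking and prompt-building steps stay the same.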

This is like letting ChatGPT take an open-book exam: it can look up the relevant information in the book before answering, so it answers well.

For the proof of concept, we have a Python application that reads a Confluence site and answers questions based on that knowledge.

https://github.com/manumaan/custom_chat_gpt

You can create a free Confluence site at https://www.atlassian.com/software/confluence/jira-integration/try

Once the site is created, create a "space" in it, and then pages inside that space. To create the API token, click your username icon and go to Settings -> Password.

You can create an OpenAI API account at https://platform.openai.com. Note that as of March 2023, OpenAI API calls are no longer free: you need to add a payment method, and usage is charged to that card. You can set a hard usage limit so that API calls are blocked once spending crosses it (Rate limits section).

The cost of OpenAI API calls depends on which model you use and how many tokens you send. (As a rough guide, think of one token as about one word; in English a token averages closer to three-quarters of a word.)

The following figures are from the OpenAI Cookbook:
For gpt-3.5-turbo using ~1,000 tokens per query, it costs ~$0.002 per query, or ~500 queries per dollar (as of Apr 2023)
For gpt-4, again assuming ~1,000 tokens per query, it costs ~$0.03 per query, or ~30 queries per dollar (as of Apr 2023)
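
The arithmetic behind these figures can be reproduced in a few lines. The per-1,000-token prices below are the ones implied by the quoted figures, and the function names are hypothetical; treating the price as a single flat rate per 1,000 tokens is a simplification made for this sketch.

```python
def cost_per_query(tokens, usd_per_1k_tokens):
    """Cost in USD of one query that uses `tokens` tokens in total."""
    return tokens / 1000 * usd_per_1k_tokens

def queries_per_dollar(tokens, usd_per_1k_tokens):
    """How many such queries one dollar buys, rounded to the nearest integer."""
    return round(1 / cost_per_query(tokens, usd_per_1k_tokens))

# Prices implied by the figures quoted above (April 2023).
print(queries_per_dollar(1000, 0.002))  # 500 for gpt-3.5-turbo
print(queries_per_dollar(1000, 0.03))   # 33 for gpt-4, quoted above as ~30
```

Because the retrieved context is sent with every question, the prompt size set by the context budget drives the cost per query far more than the question itself.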

internal_doc_chatbot.py is the engine of our custom ChatGPT. At the top are constants that store the Confluence site URL, space name, and credentials, along with the OpenAI API key and model names.

CONFLUENCE_URL = 'https://manucommerce.atlassian.net/'
CONFLUENCE_SPACE = 'Recipes'
CONFLUENCE_USER = "manu.commerce@gmail.com"
CONFLUENCE_PASSWORD = 'API_Key_For_Confluence'  # replace with your Confluence API token
OPENAI_API_KEY = 'OPENAI_API_KEY'  # replace with your OpenAI API key
EMBEDDING_MODEL = 'text-search-ada-doc-001'
COMPLETIONS_MODEL = "gpt-3.5-turbo"
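
Hard-coding credentials in a source file makes them easy to leak. One common alternative, sketched here, is to read them from environment variables. The environment variable names and the `load_setting` helper are assumptions for illustration and are not part of the original repository.

```python
import os

def load_setting(name, default=None):
    """Read a setting from the environment, failing loudly if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value

# Hypothetical variable names; the defaults mirror the constants above.
CONFLUENCE_URL = load_setting("CONFLUENCE_URL", "https://manucommerce.atlassian.net/")
CONFLUENCE_SPACE = load_setting("CONFLUENCE_SPACE", "Recipes")

# Secrets get no default, so they must be set before the app starts, e.g.:
# CONFLUENCE_PASSWORD = load_setting("CONFLUENCE_API_TOKEN")
# OPENAI_API_KEY = load_setting("OPENAI_API_KEY")
```

Non-secret settings keep a default, while a missing secret stops the program at startup rather than failing later inside an API call.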

We use two models for our POC: 'text-search-ada-doc-001' to compute the embeddings and "gpt-3.5-turbo" to answer the question.
The module contains these functions:
connect_to_Confluence - Connects to Confluence using the credentials given.

from atlassian import Confluence  # provided by the atlassian-python-api package

def connect_to_Confluence():
    '''
    Connect to Confluence

    We use the API token for the cloud
    To create an API token here: Confluence -> Profile Pic -> Settings -> Password -> Create and manage tokens

    Return
    ------
    A connector to Confluence
    '''

    url = CONFLUENCE_URL
    username = CONFLUENCE_USER
    password  = CONFLUENCE_PASSWORD
    confluence = Confluence(
        url=url,
        username=username,
        password=password,
        cloud=True)

    return confluence

get_all_pages - Gets all the pages from the space

def get_all_pages(confluence, space=CONFLUENCE_SPACE):
    '''
    Get all the pages within the CONFLUENCE_SPACE space.

    Parameters
    ----------
    confluence: a connector to Confluence
    space: Space of the Confluence (i.e. 'Recipes')

    Return
    ------
    List of page objects. Each page object has all the information concerning
    a Confluence page (title, body, etc)
    '''

    # There is a limit on how many pages we can retrieve at a time,
    # so we retrieve 100 at a time and loop until we have retrieved all of
    # them.
    keep_going = True
    start = 0
    limit = 100
    pages = []
    while keep_going:
        results = confluence.get_all_pages_from_space(space, start=start, limit=limit, status=None, expand='body.storage', content_type='page')
        pages.extend(results)
        if len(results) < limit:
            keep_going = False
        else:
            start = start + limit
    return pages
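
The stopping rule is easier to see in isolation. In this sketch `fake_fetch` is a hypothetical stand-in for `confluence.get_all_pages_from_space` that serves slices of an in-memory list; a batch shorter than `limit` signals the last page.

```python
def fake_fetch(start, limit):
    """Hypothetical stand-in for the Confluence call: serves 250 fake pages."""
    all_pages = [{"id": str(i), "title": f"Page {i}"} for i in range(250)]
    return all_pages[start:start + limit]

def fetch_everything(fetch, limit=100):
    """Request batches of `limit` items until a short (or empty) batch arrives."""
    pages, start = [], 0
    while True:
        batch = fetch(start, limit)
        pages.extend(batch)
        if len(batch) < limit:  # short batch: nothing left to fetch
            return pages
        start += limit

pages = fetch_everything(fake_fetch)
print(len(pages))  # 250, fetched as batches of 100, 100 and 50
```

When the total is an exact multiple of `limit`, the loop makes one extra request that returns an empty batch and then stops.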

collect_title_body_embeddings - Goes through each page and creates an embedding for it. The results are saved to a CSV file.

def collect_title_body_embeddings(pages, save_csv=True):
    '''
    From a list of page objects, get the title and the body, calculate
    the number of tokens as well as the embeddings of the body.

    Parameters
    ----------
    pages: List of page objects, i.e. output of get_all_pages()
    save_csv: Boolean. If True, the dataframe is saved locally
    into a CSV file.

    Return
    ------
    A dataframe of the title and body of all pages.
    '''

    collect = []
    for page in pages:
        title = page['title']
        # Strip any trailing slash from the base URL to avoid a double slash in the link
        link = CONFLUENCE_URL.rstrip('/') + '/wiki/spaces/' + CONFLUENCE_SPACE + '/pages/' + page['id']
        htmlbody = page['body']['storage']['value']
        htmlParse = BeautifulSoup(htmlbody, 'html.parser')
        body = []
        for para in htmlParse.find_all("p"):
            # Keep a paragraph only if it contains at least one verb and one noun.
            # Otherwise, we assume it does not carry enough useful information
            # to be included in the context for OpenAI.
            sentence = para.get_text()
            tokens = nltk.tokenize.word_tokenize(sentence)
            token_tags = nltk.pos_tag(tokens)
            tags = [x[1] for x in token_tags]
            if any(x[:2] == 'VB' for x in tags): # There is at least one verb
                if any(x[:2] == 'NN' for x in tags): # There is at least one noun
                    body.append(sentence)
        body = '. '.join(body)
        # Calculate number of tokens
        tokens = tokenizer.encode(body)
        collect += [(title, link, body, len(tokens))]
    DOC_title_content_embeddings = pd.DataFrame(collect, columns=['title', 'link', 'body', 'num_tokens'])
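    # Calculate the embeddings
    # Limit first to pages that fit within the token limit from get_max_num_tokens()
    # (.copy() lets us add the embeddings column without a pandas SettingWithCopyWarning)
    DOC_title_content_embeddings = DOC_title_content_embeddings[DOC_title_content_embeddings.num_tokens<=get_max_num_tokens()].copy()
    print(DOC_title_content_embeddings)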
    doc_model = EMBEDDING_MODEL
    DOC_title_content_embeddings['embeddings'] = DOC_title_content_embeddings.body.apply(lambda x: get_embeddings(x, doc_model))

    if save_csv:
        DOC_title_content_embeddings.to_csv('DOC_title_content_embeddings.csv', index=False)

    return DOC_title_content_embeddings
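
One consequence of saving to CSV is that each embedding list is written as its string representation and must be parsed back into numbers when the file is loaded. The sketch below shows the round trip with the standard `csv` module and an in-memory buffer instead of pandas; the sample row is made up.

```python
import csv
import io

row = {"title": "Pasta", "embeddings": [0.12, -0.5, 0.33]}

# Writing: the list is stored as the text "[0.12, -0.5, 0.33]".
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["title", "embeddings"])
writer.writeheader()
writer.writerow(row)

# Reading: every field comes back as a string.
buffer.seek(0)
loaded = next(csv.DictReader(buffer))
print(type(loaded["embeddings"]))  # <class 'str'>

# Parse the string back into a list of floats.
vector = [float(x) for x in loaded["embeddings"].strip("[]").split(",")]
print(vector == row["embeddings"])  # True
```

This is one reason a vector database, or a binary format that preserves arrays, is a better fit than CSV beyond a proof of concept.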

update_internal_doc_embeddings - Calls the above functions to create the embeddings CSV file

order_document_sections_by_query_similarity - Creates an embedding for the query and compares it with the knowledge base embeddings for similarity.

def order_document_sections_by_query_similarity(query: str, doc_embeddings: pd.DataFrame):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections.

    Return the list of document sections, sorted by relevance in descending order.
    """
    query_model = EMBEDDING_MODEL
    query_embedding = get_embeddings(query, model=query_model)
    doc_embeddings['similarity'] = doc_embeddings['embeddings'].apply(lambda x: vector_similarity(x, query_embedding))
    doc_embeddings.sort_values(by='similarity', inplace=True, ascending=False)
    doc_embeddings.reset_index(drop=True, inplace=True)

    return doc_embeddings
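
The `vector_similarity` helper is not shown in this excerpt. A plausible minimal version, assuming the embeddings are unit-length so that the dot product equals cosine similarity, might look like the following. It uses plain Python lists and made-up vectors for illustration; the repository's own helper may differ.

```python
import math

def vector_similarity(x, y):
    """Dot product; equals cosine similarity when both vectors have length 1."""
    return sum(a * b for a, b in zip(x, y))

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]

query = normalize([1.0, 1.0, 0.0])
docs = {
    "close": normalize([2.0, 2.0, 0.1]),
    "far": normalize([0.0, 0.1, 3.0]),
}
ranked = sorted(docs, key=lambda name: vector_similarity(docs[name], query), reverse=True)
print(ranked)  # ['close', 'far']
```

If the vectors were not normalized, longer vectors would score higher regardless of direction, so the lengths would need to be divided out.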

construct_prompt - Constructs the prompt to be sent to ChatGPT.

def construct_prompt(query, doc_embeddings):

    MAX_SECTION_LEN = get_max_num_tokens()
    SEPARATOR = "\n* "
    separator_len = len(tokenizer.tokenize(SEPARATOR))

    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_links = []

    for section_index in range(len(doc_embeddings)):
        # Add contexts until we run out of space.
        document_section = doc_embeddings.loc[section_index]

        chosen_sections_len += document_section.num_tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break

        chosen_sections.append(SEPARATOR + document_section.body.replace("\n", " "))
        chosen_sections_links.append(document_section.link)

    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    prompt = header + "".join(chosen_sections) + "\n\n Q: " + query + "\n A:"

    return (prompt,  chosen_sections_links)
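
The selection loop is a greedy fill of a token budget: sections are taken in relevance order until the next one would overflow. The sketch below isolates that logic with made-up sections and token counts; `pick_sections` is a hypothetical name and the budget of 100 is arbitrary.

```python
def pick_sections(sections, budget, separator_len=3):
    """sections: (text, num_tokens) pairs, already sorted by relevance."""
    chosen, used = [], 0
    for text, num_tokens in sections:
        used += num_tokens + separator_len
        if used > budget:
            break  # stop at the first section that does not fit
        chosen.append(text)
    return chosen

ranked_sections = [("most relevant", 40), ("second", 50), ("third", 20)]
print(pick_sections(ranked_sections, budget=100))  # ['most relevant', 'second']
```

Because the loop stops at the first overflow instead of skipping it, a smaller section further down the ranking is never considered, which keeps the context strictly in relevance order.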

internal_doc_chatbot_answer - Calls the two functions above and then submits the prompt to ChatGPT.

def internal_doc_chatbot_answer(query, DOC_title_content_embeddings):

    # Order docs by similarity of the embeddings with the query
    DOC_title_content_embeddings = order_document_sections_by_query_similarity(query, DOC_title_content_embeddings)
    # Construct the prompt
    prompt, links = construct_prompt(query, DOC_title_content_embeddings)
    # Ask the question with the context to ChatGPT

    print(prompt)

    messages = [
        {"role": "system", "content": "You answer questions about the Recipes space."},
        {"role": "user", "content": prompt},
    ]

    response = openai.ChatCompletion.create(
        model=COMPLETIONS_MODEL,
        messages=messages,
        temperature=0
    )

    #output = response["choices"][0]["text"].strip(" \n")
    output = response["choices"][0]["message"]["content"].strip(" \n")

    return output, links
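
The answer is read from a nested structure in the response. The dictionary below is a hand-written mock that mirrors only the fields the function uses; a real response carries more fields, and no API call is made here. `extract_answer` is a hypothetical helper name.

```python
# Hand-written mock of the parts of a chat completion response used above.
mock_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "\n Boil for 10 minutes. \n"}}
    ]
}

def extract_answer(response):
    """Return the first choice's text with surrounding spaces and newlines removed."""
    return response["choices"][0]["message"]["content"].strip(" \n")

print(extract_answer(mock_response))  # Boil for 10 minutes.
```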

Now we create a minimal Flask app to give our custom ChatGPT a web interface. On a POST request, it checks whether the embeddings file exists; the file is created if it is missing and regenerated if it is more than 7 days old. It then calls functions from the chatbot module to get the response from ChatGPT.

import os
from flask import Flask, request, render_template
import datetime
import internal_doc_chatbot
import pandas as pd

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def main_page():
    if request.method == 'POST':
        text_input = request.form['text_input']
        text_output, links = process_text(text_input)
        print(text_output)
        return render_template('index.html', text_output=text_output, links=links)
    return render_template('index.html')

def parse_numbers(s):
    return [float(x) for x in s.strip('[]').split(',')]

def return_Confluence_embeddings():

    # Today's date
    today = datetime.datetime.today()
    # Current file where the embeddings of our internal Confluence document is saved
    Confluence_embeddings_file = 'DOC_title_content_embeddings.csv'
    # If the embeddings file exists, check how old it is; otherwise create it
    if os.path.exists(Confluence_embeddings_file):
        # Run the embeddings again if the file is more than a week old
        # Otherwise, read the saved file
        Confluence_embeddings_file_date = datetime.datetime.fromtimestamp(os.path.getmtime(Confluence_embeddings_file))
        delta = today - Confluence_embeddings_file_date
        if delta.days > 7:
            DOC_title_content_embeddings= internal_doc_chatbot.update_internal_doc_embeddings()
        else:
            DOC_title_content_embeddings= pd.read_csv(Confluence_embeddings_file, dtype={'embeddings': object})
            DOC_title_content_embeddings['embeddings'] = DOC_title_content_embeddings['embeddings'].apply(lambda x: parse_numbers(x))
    else:
        DOC_title_content_embeddings= internal_doc_chatbot.update_internal_doc_embeddings()
    return DOC_title_content_embeddings

def process_text(query):

    DOC_title_content_embeddings= return_Confluence_embeddings()
    output, links = internal_doc_chatbot.internal_doc_chatbot_answer(query, DOC_title_content_embeddings)

    return output, links

if __name__ == '__main__':
    app.run()
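
The freshness check can be tried without Flask or OpenAI. This sketch factors it into a hypothetical `needs_refresh` helper and backdates a temporary file's modification time to simulate an old cache.

```python
import datetime
import os
import tempfile
import time

def needs_refresh(path, max_age_days=7):
    """True if the file is missing or was last modified more than max_age_days ago."""
    if not os.path.exists(path):
        return True
    modified = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    return (datetime.datetime.today() - modified).days > max_age_days

with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "embeddings.csv")
    missing = needs_refresh(cache)   # file does not exist yet
    open(cache, "w").close()
    fresh = needs_refresh(cache)     # just created
    ten_days_ago = time.time() - 10 * 24 * 3600
    os.utime(cache, (ten_days_ago, ten_days_ago))
    stale = needs_refresh(cache)     # backdated by ten days

print(missing, fresh, stale)  # True False True
```

Running the check on every request is cheap, but a rebuild triggered this way makes that one request slow; a scheduled job would be a reasonable alternative outside a proof of concept.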

Running the Flask app serves the site at localhost:5000 by default. When you ask a question, ChatGPT answers it using context from the Confluence pages.