ChatGPT is making a lot of waves on the internet. It excels at answering questions based on the factual knowledge it has. But there is a limit: it only knows what was available on the public internet up to September 2021. What if you want ChatGPT to answer questions based on content that was created after September 2021? What if you want it to answer questions based on internal documentation that is not available on the public internet? This is where the OpenAI API for interacting with GPT models comes in.
There are two ways to help ChatGPT learn new knowledge:
1) Fine-tuning the GPT model with new training data
2) Inserting the new knowledge, as context, into the question that you are asking
For the factual-recall use case, the second method is better. How much context can be fed to each model is limited by the maximum amount of text it can read at once (credit: OpenAI Cookbook):
| Model | Maximum text length |
|---|---|
| gpt-3.5-turbo | 4,096 tokens (~5 pages) |
| gpt-4 | 8,192 tokens (~10 pages) |
| gpt-4-32k | 32,768 tokens (~40 pages) |
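To get a feel for these limits, here is a minimal sketch that checks whether a piece of text is likely to fit into a model's context window. The function names and the `reserved_for_answer` parameter are made up for this illustration, and the four-characters-per-token ratio is only a rule of thumb for English text; a real tokenizer gives exact counts.

```python
# Limits taken from the table above.
MODEL_TOKEN_LIMITS = {
    "gpt-3.5-turbo": 4096,
    "gpt-4": 8192,
    "gpt-4-32k": 32768,
}


def estimate_tokens(text: str) -> int:
    """Approximate the token count (assumption: ~4 characters per token)."""
    return max(1, len(text) // 4)


def fits_in_context(text: str, model: str, reserved_for_answer: int = 500) -> bool:
    """Return True if the text plus room for the answer stays within the limit."""
    return estimate_tokens(text) + reserved_for_answer <= MODEL_TOKEN_LIMITS[model]


sample = "word " * 4000  # 20,000 characters, roughly 5,000 tokens
print(fits_in_context(sample, "gpt-3.5-turbo"))
print(fits_in_context(sample, "gpt-4"))
```

Reserving some tokens for the answer matters because the limit covers the prompt and the completion together.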
To enable ChatGPT to answer questions based on your internal documentation, you can follow the procedure below:
1) Collect the content from your internal documentation and create embeddings from it.
Embeddings are vectors (arrays of floating-point numbers) that represent the meaning of the text. These vectors typically have a thousand or more dimensions. We will store the embeddings in a file for this POC, but for real use they should be in a vector database.
2) When a question comes from the user, create the embedding of the question.
3) Compare the embeddings of the documentation with the embedding of the question and rank the documents by similarity, which serves as a measure of relevance.
4) Give the most relevant (most similar) documents to the GPT model as context for the question.
This is like ChatGPT taking an open-book exam: it can check the book for the relevant information before answering, and is therefore able to answer the questions well.
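The four steps can be sketched end to end without calling any API. In the toy example below, `toy_embedding` is a hypothetical stand-in for a real embedding model (it just counts words), and the two documents are invented for the illustration; only the overall flow mirrors the procedure described above.

```python
import math
import re
from collections import Counter


def toy_embedding(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


documents = {
    "Pancakes": "Mix flour, milk and eggs, then fry the batter in a pan.",
    "Tomato soup": "Simmer tomatoes with onion and garlic, then blend until smooth.",
}

# Step 1: embed every document once and keep the vectors.
doc_vectors = {title: toy_embedding(body) for title, body in documents.items()}


def build_prompt(question: str) -> str:
    # Step 2: embed the question.
    question_vector = toy_embedding(question)
    # Step 3: pick the document whose vector is most similar to the question.
    best_title = max(
        doc_vectors,
        key=lambda title: cosine_similarity(doc_vectors[title], question_vector),
    )
    # Step 4: put that document into the prompt as context.
    return f"Context:\n{documents[best_title]}\n\nQ: {question}\nA:"


prompt = build_prompt("How do I fry the batter for pancakes?")
print(prompt)
```

A real embedding model captures meaning rather than exact word overlap, so it can match a question to a document even when they share few words.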
For the proof of concept, we have a Python application that reads a Confluence site and answers questions based on that knowledge.
https://github.com/manumaan/custom_chat_gpt
You can create a free Confluence site by going to https://www.atlassian.com/software/confluence/jira-integration/try
Once the site is created, you can create a "space" inside it, and then pages inside that space. Click on your username icon and go to Settings -> Password. Here you can create the API token.
You can create an OpenAI API account at https://platform.openai.com. Note that, as of March 2023, OpenAI API calls are no longer free: you need to add a payment method, and usage is charged to that card. You can specify a hard usage limit that blocks API calls once it is exceeded (see the Rate limits section).
The cost of OpenAI API calls depends on which model you are using and how many tokens you send. (As a rough guide, a token is about three quarters of an English word.)
Below is from OpenAI Cookbook:
For gpt-3.5-turbo using ~1,000 tokens per query, it costs ~$0.002 per query, or ~500 queries per dollar (as of Apr 2023)
For gpt-4, again assuming ~1,000 tokens per query, it costs ~$0.03 per query, or ~30 queries per dollar (as of Apr 2023)
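The arithmetic behind those figures is simple enough to script. The sketch below uses the per-1,000-token prices quoted above (as of Apr 2023, so they may well have changed); the function names are invented for this example.

```python
# Prices per 1,000 tokens, as quoted above (Apr 2023). Treat them as assumptions.
PRICE_PER_1K_TOKENS = {"gpt-3.5-turbo": 0.002, "gpt-4": 0.03}


def query_cost(model: str, total_tokens: int) -> float:
    """Cost in dollars of one query that uses total_tokens tokens."""
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS[model]


def queries_per_dollar(model: str, tokens_per_query: int = 1000) -> float:
    """How many queries of the given size one dollar pays for."""
    return 1 / query_cost(model, tokens_per_query)


for model in PRICE_PER_1K_TOKENS:
    print(model, query_cost(model, 1000), round(queries_per_dollar(model)))
```

With these prices, gpt-4 works out to about 33 queries per dollar, which the Cookbook rounds to ~30.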
internal_doc_chatbot.py is the engine of our custom ChatGPT. At the top we have constants that store the Confluence site URL, the space name, and the credentials.
CONFLUENCE_URL = 'https://manucommerce.atlassian.net/'
CONFLUENCE_SPACE = 'Recipes'
CONFLUENCE_USER = "manu.commerce@gmail.com"
CONFLUENCE_PASSWORD = 'API_Key_For_Confluence'  # the Confluence API token
OPENAI_API_KEY = 'OPENAI_API_KEY'  # your OpenAI API key
EMBEDDING_MODEL = 'text-search-ada-doc-001'
COMPLETIONS_MODEL = "gpt-3.5-turbo"
We are using two models for our POC: 'text-search-ada-doc-001' for getting the embeddings and "gpt-3.5-turbo" for asking the question to ChatGPT.
It contains these functions:
connect_to_Confluence - this will connect to the confluence using the credentials given.
from atlassian import Confluence  # provided by the atlassian-python-api package

def connect_to_Confluence():
    '''
    Connect to Confluence.
    We use the API token for the cloud.
    To create an API token: Confluence -> Profile Pic -> Settings -> Password -> Create and manage tokens

    Return
    ------
    A connector to Confluence
    '''
    url = CONFLUENCE_URL
    username = CONFLUENCE_USER
    password = CONFLUENCE_PASSWORD
    confluence = Confluence(
        url=url,
        username=username,
        password=password,
        cloud=True)
    return confluence
get_all_pages - Gets all the pages from the space
def get_all_pages(confluence, space=CONFLUENCE_SPACE):
    '''
    Get all the pages within the CONFLUENCE_SPACE space.

    Parameters
    ----------
    confluence: a connector to Confluence
    space: Space of the Confluence (i.e. 'Recipes')

    Return
    ------
    List of page objects. Each page object has all the information concerning
    a Confluence page (title, body, etc)
    '''
    # There is a limit on how many pages we can retrieve at a time,
    # so we retrieve 100 at a time and loop until we know we retrieved all of
    # them.
    keep_going = True
    start = 0
    limit = 100
    pages = []
    while keep_going:
        results = confluence.get_all_pages_from_space(space, start=start, limit=limit, status=None, expand='body.storage', content_type='page')
        pages.extend(results)
        # A short batch means we have reached the last page of results
        if len(results) < limit:
            keep_going = False
        else:
            start = start + limit
    return pages
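The same offset-based pagination loop can be tried out without a Confluence account. In this sketch, `fetch_all`, `fake_fetch_page` and `FAKE_PAGES` are hypothetical names standing in for the real API call and its data.

```python
def fetch_all(fetch_page, limit=100):
    """Collect every item from a paginated source, `limit` items at a time."""
    items = []
    start = 0
    while True:
        batch = fetch_page(start=start, limit=limit)
        items.extend(batch)
        # A batch shorter than the limit means there is nothing left to fetch.
        if len(batch) < limit:
            break
        start += limit
    return items


# Hypothetical data source: 250 fake pages served in slices.
FAKE_PAGES = [{"id": str(i), "title": f"Page {i}"} for i in range(250)]


def fake_fetch_page(start, limit):
    return FAKE_PAGES[start:start + limit]


all_pages = fetch_all(fake_fetch_page)
print(len(all_pages))
```

When the total is an exact multiple of the limit, the loop makes one extra request that returns an empty batch and then stops, which is harmless.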
collect_title_body_embeddings - Goes through each page and creates embeddings for them. It will be saved to a CSV file.
# Imports used by this function. tokenizer, get_embeddings() and
# get_max_num_tokens() are defined elsewhere in internal_doc_chatbot.py.
import nltk
import pandas as pd
from bs4 import BeautifulSoup

def collect_title_body_embeddings(pages, save_csv=True):
    '''
    From a list of page objects, get the title and the body, calculate
    the number of tokens as well as the embeddings of the body.

    Parameters
    ----------
    pages: List of page objects, i.e. output of get_all_pages()
    save_csv: Boolean. If True, the dataframe is saved locally
    into a CSV file.

    Return
    ------
    A dataframe of the title, link, body, token count and embeddings of all pages.
    '''
    collect = []
    for page in pages:
        title = page['title']
        # rstrip avoids a double slash when CONFLUENCE_URL ends with '/'
        link = CONFLUENCE_URL.rstrip('/') + '/wiki/spaces/' + CONFLUENCE_SPACE + '/pages/' + page['id']
        htmlbody = page['body']['storage']['value']
        htmlParse = BeautifulSoup(htmlbody, 'html.parser')
        body = []
        for para in htmlParse.find_all("p"):
            # Keep a sentence only if it has a noun and a verb.
            # Otherwise, we assume the sentence does not contain enough useful information
            # to be included in the context for OpenAI.
            sentence = para.get_text()
            tokens = nltk.tokenize.word_tokenize(sentence)
            token_tags = nltk.pos_tag(tokens)
            tags = [x[1] for x in token_tags]
            if any(x[:2] == 'VB' for x in tags):  # There is at least one verb
                if any(x[:2] == 'NN' for x in tags):  # There is at least one noun
                    body.append(sentence)
        body = '. '.join(body)
        # Calculate number of tokens
        tokens = tokenizer.encode(body)
        collect += [(title, link, body, len(tokens))]
    DOC_title_content_embeddings = pd.DataFrame(collect, columns=['title', 'link', 'body', 'num_tokens'])
    # Calculate the embeddings.
    # First keep only the pages that fit within get_max_num_tokens() tokens.
    DOC_title_content_embeddings = DOC_title_content_embeddings[DOC_title_content_embeddings.num_tokens <= get_max_num_tokens()]
    print(DOC_title_content_embeddings)
    doc_model = EMBEDDING_MODEL
    DOC_title_content_embeddings['embeddings'] = DOC_title_content_embeddings.body.apply(lambda x: get_embeddings(x, doc_model))
    if save_csv:
        DOC_title_content_embeddings.to_csv('DOC_title_content_embeddings.csv', index=False)
    return DOC_title_content_embeddings
update_internal_doc_embeddings - Calls the above functions to create the embeddings CSV file
order_document_sections_by_query_similarity - Creates the embedding for the query and then compares it with the knowledge-base embeddings for similarity.
def order_document_sections_by_query_similarity(query: str, doc_embeddings: pd.DataFrame):
    """
    Find the query embedding for the supplied query, and compare it against all of the pre-calculated document embeddings
    to find the most relevant sections.

    Return the list of document sections, sorted by relevance in descending order.
    """
    query_model = EMBEDDING_MODEL
    query_embedding = get_embeddings(query, model=query_model)
    doc_embeddings['similarity'] = doc_embeddings['embeddings'].apply(lambda x: vector_similarity(x, query_embedding))
    # Most similar sections first; the dataframe is modified in place
    doc_embeddings.sort_values(by='similarity', inplace=True, ascending=False)
    doc_embeddings.reset_index(drop=True, inplace=True)
    return doc_embeddings
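The `vector_similarity` helper is not shown in this article. One plausible definition, assumed here, is a plain dot product: for vectors normalized to length 1, the dot product equals the cosine similarity. The vectors and section names below are made up for the illustration.

```python
import math


def vector_similarity(x, y):
    """Dot product of two equal-length vectors (an assumed definition)."""
    return sum(a * b for a, b in zip(x, y))


def normalize(v):
    """Scale a vector to length 1 so the dot product equals cosine similarity."""
    length = math.sqrt(sum(a * a for a in v))
    return [a / length for a in v]


query = normalize([1.0, 0.0, 1.0])
sections = {
    "baking": normalize([1.0, 0.1, 0.9]),
    "grilling": normalize([0.0, 1.0, 0.2]),
}

# Rank section names from most to least similar to the query.
ranked = sorted(
    sections,
    key=lambda name: vector_similarity(sections[name], query),
    reverse=True,
)
print(ranked)
```

If the embeddings were not normalized, longer vectors would score higher regardless of direction, so dividing by the vector lengths (true cosine similarity) would be the safer choice.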
construct_prompt - Constructs the prompt to be sent to ChatGPT.
def construct_prompt(query, doc_embeddings):
    MAX_SECTION_LEN = get_max_num_tokens()
    SEPARATOR = "\n* "
    separator_len = len(tokenizer.tokenize(SEPARATOR))
    chosen_sections = []
    chosen_sections_len = 0
    chosen_sections_links = []
    for section_index in range(len(doc_embeddings)):
        # Add contexts until we run out of space.
        document_section = doc_embeddings.loc[section_index]
        chosen_sections_len += document_section.num_tokens + separator_len
        if chosen_sections_len > MAX_SECTION_LEN:
            break
        chosen_sections.append(SEPARATOR + document_section.body.replace("\n", " "))
        chosen_sections_links.append(document_section.link)
    header = """Answer the question as truthfully as possible using the provided context, and if the answer is not contained within the text below, say "I don't know."\n\nContext:\n"""
    prompt = header + "".join(chosen_sections) + "\n\n Q: " + query + "\n A:"
    return (prompt, chosen_sections_links)
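The core of construct_prompt is a greedy packing loop: take sections in order of relevance until the token budget runs out. Here is a stripped-down sketch of that idea, using plain tuples instead of a dataframe; `pack_sections` and the sample numbers are made up for the illustration.

```python
def pack_sections(sections, max_tokens, separator_tokens=3):
    """Greedily pick sections (most relevant first) that fit the token budget.

    sections: list of (body, num_tokens) tuples, already sorted by relevance.
    """
    chosen = []
    used = 0
    for body, num_tokens in sections:
        used += num_tokens + separator_tokens
        if used > max_tokens:
            break  # stop at the first section that does not fit
        chosen.append(body)
    return chosen


ranked_sections = [
    ("Pancake recipe", 40),
    ("Waffle recipe", 50),
    ("Soup recipe", 30),
]
print(pack_sections(ranked_sections, max_tokens=100))
```

Because the loop breaks rather than skips, a smaller but less relevant section further down the list is never used to fill leftover space. That keeps the context focused on the best matches.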
internal_doc_chatbot_answer - Calls the above two functions and then submits the prompt to ChatGPT.
def internal_doc_chatbot_answer(query, DOC_title_content_embeddings):
    # Order docs by similarity of the embeddings with the query
    DOC_title_content_embeddings = order_document_sections_by_query_similarity(query, DOC_title_content_embeddings)
    # Construct the prompt
    prompt, links = construct_prompt(query, DOC_title_content_embeddings)
    # Ask the question with the context to ChatGPT
    print(prompt)
    messages = [
        {"role": "system", "content": "You answer questions about the Recipes space."},
        {"role": "user", "content": prompt},
    ]
    # openai.ChatCompletion is the interface of the openai Python package before version 1.0
    response = openai.ChatCompletion.create(
        model=COMPLETIONS_MODEL,
        messages=messages,
        temperature=0
    )
    output = response["choices"][0]["message"]["content"].strip(" \n")
    return output, links
Now we create a minimal Flask app so that we have a web interface for calling our custom ChatGPT. On a POST request, it checks whether the embeddings file exists; the file is created if it does not exist, and recreated if it is more than 7 days old. Then it calls functions from the chatbot module to get the response from ChatGPT.
import os
from flask import Flask, request, render_template
import datetime
import internal_doc_chatbot
import pandas as pd

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def main_page():
    if request.method == 'POST':
        text_input = request.form['text_input']
        text_output, links = process_text(text_input)
        print(text_output)
        return render_template('index.html', text_output=text_output, links=links)
    return render_template('index.html')

def parse_numbers(s):
    # Turn the string "[0.1, 0.2, ...]" read from the CSV back into a list of floats
    return [float(x) for x in s.strip('[]').split(',')]

def return_Confluence_embeddings():
    # Today's date
    today = datetime.datetime.today()
    # Current file where the embeddings of our internal Confluence document is saved
    Confluence_embeddings_file = 'DOC_title_content_embeddings.csv'
    # If the embeddings file exists, check how old it is
    if os.path.exists(Confluence_embeddings_file):
        # Run the embeddings again if the file is more than a week old.
        # Otherwise, read the saved file.
        Confluence_embeddings_file_date = datetime.datetime.fromtimestamp(os.path.getmtime(Confluence_embeddings_file))
        delta = today - Confluence_embeddings_file_date
        if delta.days > 7:
            DOC_title_content_embeddings = internal_doc_chatbot.update_internal_doc_embeddings()
        else:
            DOC_title_content_embeddings = pd.read_csv(Confluence_embeddings_file, dtype={'embeddings': object})
            DOC_title_content_embeddings['embeddings'] = DOC_title_content_embeddings['embeddings'].apply(parse_numbers)
    else:
        # The embeddings file does not exist yet, so create it
        DOC_title_content_embeddings = internal_doc_chatbot.update_internal_doc_embeddings()
    return DOC_title_content_embeddings

def process_text(query):
    DOC_title_content_embeddings = return_Confluence_embeddings()
    output, links = internal_doc_chatbot.internal_doc_chatbot_answer(query, DOC_title_content_embeddings)
    return output, links

if __name__ == '__main__':
    app.run()
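The freshness check on the embeddings file can be isolated into a small helper, which also makes it easy to test. The sketch below is one possible way to write it; `cache_is_fresh`, `MAX_AGE` and the temporary file are hypothetical names used only for this example.

```python
import datetime
import os
import tempfile

MAX_AGE = datetime.timedelta(days=7)


def cache_is_fresh(path, max_age=MAX_AGE, now=None):
    """Return True if the file exists and was modified within max_age."""
    if not os.path.exists(path):
        return False
    now = now or datetime.datetime.now()
    modified = datetime.datetime.fromtimestamp(os.path.getmtime(path))
    return now - modified <= max_age


with tempfile.TemporaryDirectory() as tmp:
    cache_path = os.path.join(tmp, "embeddings.csv")
    fresh_when_missing = cache_is_fresh(cache_path)

    with open(cache_path, "w") as f:
        f.write("title,embeddings\n")
    fresh_now = cache_is_fresh(cache_path)

    # Pretend ten days have passed by supplying a later "now".
    in_ten_days = datetime.datetime.now() + datetime.timedelta(days=10)
    fresh_later = cache_is_fresh(cache_path, now=in_ten_days)

print(fresh_when_missing, fresh_now, fresh_later)
```

Note the small difference from the Flask code above: `delta.days > 7` only triggers once the file is at least 8 full days old, while comparing the whole timedelta triggers as soon as 7 days have passed.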
Running the Flask app lets us access the site at localhost:5000 by default. When asked a question, ChatGPT answers it using the context from the Confluence page:
[Screenshot: ChatGPT's answer in the web interface]
Context data as it appears in the Confluence page:
[Screenshot: the Confluence page]
This prompt to ChatGPT cost us 1,577 prompt tokens + 189 completion tokens = 1,766 tokens in total.
Hope this is helpful!
Credits:
https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb
https://medium.com/@francois.ascani/running-chatgpt-on-your-internal-confluence-documentation-d7761aa8fc68