<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Marcelo Xavier</title>
    <description>The latest articles on DEV Community by Marcelo Xavier (@mrxavier).</description>
    <link>https://dev.to/mrxavier</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1037982%2Fa556648a-4650-4b75-8448-d6e260d374c2.jpeg</url>
      <title>DEV Community: Marcelo Xavier</title>
      <link>https://dev.to/mrxavier</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrxavier"/>
    <language>en</language>
    <item>
      <title>Fake-GPT: Fake news classifier with GPT-3</title>
      <dc:creator>Marcelo Xavier</dc:creator>
      <pubDate>Mon, 24 Apr 2023 16:37:15 +0000</pubDate>
      <link>https://dev.to/mrxavier/fake-gpt-fake-news-classifier-with-gpt-3-5c46</link>
      <guid>https://dev.to/mrxavier/fake-gpt-fake-news-classifier-with-gpt-3-5c46</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you don't live under a rock you most probably have heard about ChatGPT since it's the main topic of the internet recently. It's an artificial intelligence chatbot developed by OpenAI in which you can have a conversation about any subject and you are answered with human-like coherent responses. It is built on top of the largest language model trained so far and it's based on massive amounts of text data from all over the internet and beyond. It was launched on November 30, 2022 and drew attention for its coherent, correct and articulate answers across many domains of knowledge.&lt;/p&gt;

&lt;p&gt;Besides the main functionality of answering questions with coherent responses, OpenAI provides a way to customize their model to answer questions and generate text for inputs it has never seen before. This feature is called Fine-Tuning and allows anyone to make use of GPT-3 technology to train the original model based on examples from a dataset. This results in a personalized model that answers questions for your specific scenarios. With this model the quality of the responses for your specific cases will improve and the latency of the requests will decrease.&lt;/p&gt;

&lt;p&gt;With that tool in hand I decided to try it out and train a custom model to test how it works. After looking for open datasets I found one that caught my attention, a Fake News dataset. This dataset is composed of fake and real news and it would be an interesting and bold scenario to try out and to test the potential of GPT-3 Fine Tuning feature.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine Tuning Fake-GPT
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gather and sanitize your dataset
&lt;/h3&gt;

&lt;p&gt;First of all, I searched for an open dataset that would fit the fine-tuning process. I was looking for a dataset that could fit into the format GPT-3 uses: question and answer, or prompt and completion. After several hours of searching I finally found a promising fake news dataset, composed of thousands of examples of real and fake news. It is divided into two CSV files, one with fake news and the other with real news. Each file has the following properties for each news article: title, text body, subject and date. You can see the dataset example below (&lt;em&gt;fake_news_dataset.csv&lt;/em&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;title,text,subject,date
News Title 1,News body 1,politicsNews,31/12/22
News Title 2,News body 2,politicsNews,09/10/22
News Title 3,News body 3,worldNews,17/11/22
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The idea is to identify if a news article is fake or real based on the news title and the text body. With the raw dataset in hand, it's time to format the data in a way that fits the fine-tuning process. The file format should be CSV and must contain two properties: "prompt" and "completion". The prompt has the news title and the text body appended together. The completion is simply one word: fake or true. The resulting file should be structured like the example below (&lt;em&gt;formated_dataset.csv&lt;/em&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt,completion
"News Title: News body ... end","fake"
"News Title: News body ... end","true"
"News Title: News body ... end","fake"
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To format the raw CSV data into the fine-tuning format I used the csvkit CLI tool for the sake of simplicity. You can use other tools like pandas for Python, or even Spark for large amounts of data. When you have the data together, sanitized and formatted, you are ready to start the fine-tuning process. &lt;/p&gt;
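&lt;p&gt;As a minimal sketch of this formatting step, here is how the merge could look in plain Python using only the standard library's csv module (the file names and the "title: body" prompt layout are illustrative assumptions based on the examples above):&lt;/p&gt;

```python
import csv

def build_finetune_rows(path, label):
    """Read one raw news CSV and yield one prompt/completion pair per article."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # The prompt is the title and body appended together, as described above.
            prompt = "{}: {}".format(row["title"], row["text"])
            yield {"prompt": prompt, "completion": label}

def write_formatted_dataset(fake_path, real_path, out_path):
    """Merge the fake and real news files into one prompt/completion CSV."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.DictWriter(out, fieldnames=["prompt", "completion"])
        writer.writeheader()
        for row in build_finetune_rows(fake_path, "fake"):
            writer.writerow(row)
        for row in build_finetune_rows(real_path, "true"):
            writer.writerow(row)
```

&lt;p&gt;csvkit or pandas gets you to the same place; the point is simply that every output row must collapse to the two columns the fine-tuning tooling expects.&lt;/p&gt;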

&lt;h3&gt;
  
  
  Prerequisites for Fine Tuning
&lt;/h3&gt;

&lt;p&gt;First of all, I installed the OpenAI Python library using pip (pypi.org/project/OpenAI): &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install --upgrade openai&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Then I had to set the OpenAI API key to allow me to send commands to OpenAI on behalf of my account. I got the API key for my account from the api-keys page. Then I set the environment variable to authorize commands to OpenAI with the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;export OPENAI_API_KEY=&amp;lt;your_key&amp;gt;&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Format dataset
&lt;/h3&gt;

&lt;p&gt;The OpenAI fine-tuning process accepts only files in JSONL format, but don't worry, it provides a tool to transform other formats (such as CSV) to JSONL. So I ran the following command to do so:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openai tools fine_tunes.prepare_data -f formated_dataset.csv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The command failed at first because it was missing a dependency, so I ran the following command to install the required dependency for the openai package and tried again:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install OpenAI[dependency]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The openai prepare_data command analyzed my dataset and showed me a list of suggestions with important information about my dataset and some options for the fine tuning. Based on the analysis, it suggested performing some actions, such as removing duplicates, adding a suffix separator, adding a whitespace character at the beginning of the completion, etc. I applied all recommended actions since they all fit my dataset.&lt;br&gt;
One important action is to "split the data into training and validation sets", which generates some assessment results after the training process. It produces two files: a training set to be used for fine tuning (around ~95% of the examples) and a validation set to be used to assess the performance of the trained model (around ~5% of the examples). This allows OpenAI to generate detailed results about the model performance later on.&lt;/p&gt;
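&lt;p&gt;The split itself is simple; below is a hypothetical sketch of roughly what this action does (the real prepare_data implementation may shuffle and split differently, so treat this only as an illustration of the 95/5 idea):&lt;/p&gt;

```python
import random

def train_valid_split(examples, valid_fraction=0.05, seed=42):
    """Shuffle examples and split them into training and validation sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Hold out ~5% for validation; the remaining ~95% is used for training.
    n_valid = max(1, int(len(shuffled) * valid_fraction))
    return shuffled[n_valid:], shuffled[:n_valid]
```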

&lt;p&gt;Finally, I had two JSONL files in the right format to be used in the fine tuning. Note that when the command finishes it will give you some tips and recommendations on how to run the fine tuning command. In this case the suggested command was:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openai api fine_tunes.create -t "dataset_train.jsonl" -v "dataset_valid.jsonl" -m "ada" --compute_classification_metrics --classification_positive_class " true"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This command will be used to create the trained model and will be explained in the next section.&lt;/p&gt;
&lt;h3&gt;
  
  
  Run Fine-tuning
&lt;/h3&gt;

&lt;p&gt;Now that I have the data correctly formatted in JSONL I am ready to run the fine tuning process to train the custom model. Note that this is a paid functionality and the price varies based on the size of your dataset and the base model you use for the training. &lt;br&gt;
I ran the "fine_tunes.create" command suggested previously. This command has several options to set or change based on your needs. For example, the option "-m" can be set to "ada", "babbage", "curie" or "davinci", each being one of the base models OpenAI provides, with different trade-offs in performance, accuracy, price, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai api fine_tunes.create
-m "ada" 
-t "prepared_dataset_train.jsonl" 
-v "prepared_dataset_valid.jsonl" 
--compute_classification_metrics 
--classification_positive_class " true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that the command can change depending on your dataset values. For example, for classifier problems, where each input in the prompt should be classified into one of the predefined classes, I had to set the option "--compute_classification_metrics". Since there were only two classes I had to set '--classification_positive_class " true"'. If there are more classes you have to set something like "--classification_n_classes 5".&lt;/p&gt;

&lt;p&gt;The options explanation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;-m: the model used in the fine tuning&lt;/li&gt;
&lt;li&gt;-t: file name of the training dataset&lt;/li&gt;
&lt;li&gt;-v: file name of the validation dataset&lt;/li&gt;
&lt;li&gt;--compute_classification_metrics: generates classification metrics&lt;/li&gt;
&lt;li&gt;--classification_positive_class: the class to treat as positive, in this case " true"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you execute the command it uploads the dataset file(s) to OpenAI servers and prompts you with a fine tune id such as "ft-FpsHdkc836TLMi53FidOabcF". I saved this id for reference later. At this point the request is received by OpenAI servers and the fine tuning is queued to be executed. The fine tuning execution took some hours because of the dataset size and the chosen model. I didn't have to wait for the process to finish, so I killed it (ctrl-c) without hesitation and came back later to check the progress of the job. I checked the status of the process by running the following command with the id received previously:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openai api fine_tunes.get -i ft-FpsHdkc836TLMi53FidOabcF&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;When the fine tuning finished, the status of the fine tune process showed as "processed", meaning my custom model training was done and ready to be used. To use my trained model I had to get its name, which I found in the property called "fine_tuned_model" (ex.: "fine_tuned_model": &lt;code&gt;"ada:ft-personal-2023-01-18-19-50-00"&lt;/code&gt;).&lt;br&gt;
To use the trained model when sending prompts to OpenAI, I had to set the "-m" option with the id of my model as shown below:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openai api completions.create -m ada:ft-personal-2023-01-18-19-50-00 -p "your prompt here!"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I was also able to use the trained model in my account's playground. Now I'm free to play, check, test and assess my custom trained model to make sure it responds with the expected answer for my scenario. In this case I could send a news article and it would tell me whether the article is fake or not! See the results below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance assess results (Bonus)
&lt;/h3&gt;

&lt;p&gt;Remember we generated two files for the dataset, one for training and the other for validation. OpenAI used the first one to train the model and the second one to run assessment procedures measuring stats such as accuracy and loss. I could see the stats using the following command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;openai api fine_tunes.results -i ft-FpsHdkc836TLMi53FidOabcF &amp;gt; results.csv&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The output is CSV by default and can be redirected to a file for better visualization. Each line of the CSV file represents one step of the training process and shows valuable information about that step: how many examples it processed, number of tokens processed, training accuracy, training loss, validation accuracy, validation loss, and much more. In this case the results were: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oWrOZQYj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4obmv0oih2lflk2tjnwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oWrOZQYj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4obmv0oih2lflk2tjnwq.png" alt="Image description" width="800" height="106"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;See more about results in this link.&lt;/p&gt;
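&lt;p&gt;To pull the final numbers out of results.csv without opening a spreadsheet, a small sketch like the following can help (the column names are assumptions based on my run's output; check your own file's header):&lt;/p&gt;

```python
import csv

def final_metrics(results_path):
    """Return the non-empty metrics of the last step recorded in results.csv.

    Column names such as training_loss or validation_loss come from my
    run's output; they may differ in yours."""
    with open(results_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    # Some steps leave certain columns blank, so keep only filled values.
    return {key: value for key, value in rows[-1].items() if value != ""}
```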

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenAI is bringing one of the most revolutionary technological disruptions of recent times with the creation of GPT-3, GPT-4, and ChatGPT. The best part is that anyone can have access to this incredible tool. Besides that, OpenAI provides the necessary tools for anyone to create personalized models to address any specific scenario. This fine-tuning process is made easily available to anyone by OpenAI, and provides access to the cutting-edge machine learning technology that OpenAI uses in GPT. This opens endless possibilities to improve human-computer interaction for companies, their employees and their customers.&lt;/p&gt;

&lt;p&gt;Fine tuning allows anyone to create customized models for any scenario. In this article's case, a fake news classifier was created. Imagine how useful it could be to have something alerting you that an article you're reading is probably fake! &lt;br&gt;
Other examples can be: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a chatbot to interact with customers in a conversational manner to answer specific questions about your company, your services and your products. &lt;/li&gt;
&lt;li&gt;Help employees to find valuable information quickly and easily in a large knowledge base with several text articles where it's hard and time consuming to find what you need. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The scenarios involving high volumes of text where it can help are countless.&lt;/p&gt;

&lt;p&gt;If you have any questions, comments or suggestions feel free to contact me: &lt;a href="https://www.linkedin.com/in/marceloax/"&gt;https://www.linkedin.com/in/marceloax/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>openai</category>
    </item>
    <item>
      <title>How to fine-tuning OpenAI GPT-3</title>
      <dc:creator>Marcelo Xavier</dc:creator>
      <pubDate>Fri, 17 Mar 2023 16:43:59 +0000</pubDate>
      <link>https://dev.to/mrxavier/how-to-fine-tuning-openai-gpt-3-5g00</link>
      <guid>https://dev.to/mrxavier/how-to-fine-tuning-openai-gpt-3-5g00</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;ChatGPT is an artificial intelligence chatbot developed by OpenAI in which you can have a conversation about any subject and you are answered with human-like coherent responses. It is built on top of a large language model trained on massive amounts of text data from all over the internet and several other sources. It was launched on November 30, 2022 and drew attention for its coherent, correct and articulate answers across many domains of knowledge.&lt;/p&gt;

&lt;p&gt;Besides the main functionality of answering questions with coherent responses, OpenAI provides a way to customize their model to adjust the answers to fit a specific case. This feature is called fine-tuning and allows anyone to make use of GPT-3 technology to train a personalized model. This results in a model that answers questions that fits in your scenario based on the examples you provided. With a personalized model the quality of the responses related to your topic will improve and the latency of the requests will decrease.&lt;/p&gt;

&lt;p&gt;This article provides you with the step by step on how to fine-tune the OpenAI GPT-3 model and have a trained model to fit your specific scenarios. Some example scenarios you can apply the fine tuning process are: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a personalized chatbot for automatically answering frequently asked questions (FAQ) from your customers about your company's services&lt;/li&gt;
&lt;li&gt;Compilation of knowledge base documents from your company so any employee will be able to find information easily, faster and using natural language&lt;/li&gt;
&lt;li&gt;Summarization of text documents, discussion, notes or meetings transcriptions in succinct abstract with the important topics&lt;/li&gt;
&lt;li&gt;Programming code generation based on your company's code base&lt;/li&gt;
&lt;li&gt;And many more other cases...&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How to run fine-tuning process
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gather and sanitize your dataset
&lt;/h3&gt;

&lt;p&gt;First of all you need to gather a dataset with hundreds or thousands of examples that represent your scenario. You can use tools to help you format this data, like Apache Spark or any other MapReduce tool or framework. The dataset can be saved in the following formats: &lt;strong&gt;CSV&lt;/strong&gt;, &lt;strong&gt;TSV&lt;/strong&gt;, &lt;strong&gt;XLSX&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt; or &lt;strong&gt;JSONL&lt;/strong&gt;, and must have only two properties: "prompt" and "completion". You can see an example in CSV format below of how your dataset should be structured, &lt;em&gt;dataset.csv&lt;/em&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prompt,completion
How to create an Order?,To create an order you must do 5 things:...
How to create a new product?,To create a new product you must first...
How to create a promotion campaign?,To create a promotion campaign...
. . .
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you have the data together, sanitized and in one of the supported formats, then you are ready to start the fine-tuning process. &lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;First of all you have to install the OpenAI Python library using pip. Run the &lt;a href="//pypi.org/project/OpenAI"&gt;following command&lt;/a&gt; to install OpenAI library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install --upgrade openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you have to set the OpenAI API key to allow you to send commands to OpenAI on behalf of your account. You can find the API key for your account in the &lt;a href="https://platform.openai.com/account/api-keys"&gt;api-keys&lt;/a&gt; page. Run the following command to set the environment variable that authorizes commands to OpenAI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export OPENAI_API_KEY=&amp;lt;your_key&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Format dataset
&lt;/h3&gt;

&lt;p&gt;The OpenAI fine-tuning process accepts only files in JSONL format, but don't worry, it provides a tool to transform any of the formats mentioned above to JSONL. Just run the following command to convert the dataset to JSONL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai tools fine_tunes.prepare_data -f dataset.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the command fails because it's missing some dependency, then run the following command to fix it and try again (&lt;a href="http://pypi.org/project/OpenAI"&gt;click here for more details&lt;/a&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install OpenAI[_dependency_]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command will analyze your dataset and show you a list of suggestions with important information about your dataset and some options for the fine tuning. Read it carefully! Then, based on the analysis, it will suggest performing some actions, such as removing duplicates, adding a suffix separator, adding a whitespace character at the beginning of the completion, etc. I would suggest applying all recommended actions if there isn't any reason not to.&lt;/p&gt;

&lt;p&gt;One important action to do is "split the data into training and validation sets" which will generate two files with a training set to be used for fine tuning (around ~95% of the examples) and a validation set to be used to assess the performance of the trained model (around ~5% of the examples). This action allows OpenAI to generate detailed results about the model performance later on.&lt;/p&gt;

&lt;p&gt;Finally, you will have the two files with the dataset in the right format to be used in the fine tuning. Note that when the command finishes it will give you some tips and recommendations to run the fine tuning. For example, the suggested command to be used, like &lt;code&gt;OpenAI api fine_tunes.create ...&lt;/code&gt;. Take note of that.&lt;/p&gt;
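&lt;p&gt;As a rough illustration, the core transformation prepare_data performs boils down to writing one JSON object per CSV row (the real tool does more, like deduplication, splitting and suffix separators, so treat this only as a sketch of the JSONL layout):&lt;/p&gt;

```python
import csv
import json

def csv_to_jsonl(csv_path, jsonl_path):
    """Write one JSON object per line, keeping only the two expected keys."""
    with open(csv_path, newline="", encoding="utf-8") as src:
        records = [
            {"prompt": row["prompt"], "completion": row["completion"]}
            for row in csv.DictReader(src)
        ]
    with open(jsonl_path, "w", encoding="utf-8") as dst:
        for record in records:
            dst.write(json.dumps(record) + "\n")
```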

&lt;h3&gt;
  
  
  Run Fine-tuning
&lt;/h3&gt;

&lt;p&gt;Now that you have the data correctly formatted in JSONL you are ready to run the fine tuning process for your custom model. Note that this is a paid functionality and the price varies based on the size of your dataset and the base model you use for the training (&lt;a href="https://openai.com/pricing#faq-fine-tuning-pricing-calculation"&gt;see pricing here&lt;/a&gt;). You can run the suggested command from the "prepare data" command output or you can build the command yourself with the options you would like to set. The command has several options you can set or change based on your needs (&lt;a href="https://platform.openai.com/docs/api-reference/fine-tunes/create"&gt;check out the fine-tune command details here&lt;/a&gt;). For example, the option "-m" can be set to "ada", "babbage", "curie" or "davinci", each being one of the base models OpenAI provides, with different trade-offs in performance, accuracy, price, etc. (&lt;a href="https://platform.openai.com/docs/models"&gt;see more about the models OpenAI provides&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Note that the command can change depending on your dataset values. For example, for classifier problems where each input in the prompt should be classified into one of the predefined classes you have to set the option "&lt;em&gt;--compute_classification_metrics&lt;/em&gt;". If there are only two classes you have to set  '&lt;em&gt;--classification_positive_class "class1"&lt;/em&gt;'. If there are more classes you have to set "&lt;em&gt;--classification_n_classes 5&lt;/em&gt;". See these examples below.&lt;/p&gt;

&lt;p&gt;Fine tuning command for binary classification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai api fine_tunes.create -t "dataset_train.jsonl" -v "dataset_valid.jsonl" -m "ada" --compute_classification_metrics --classification_positive_class " true"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Fine tuning command for multi-class classification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai api fine_tunes.create -t "dataset_train.jsonl" -v "dataset_valid.jsonl" -m "ada" --compute_classification_metrics --classification_n_classes 5
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you execute the command it will upload the dataset file(s) to OpenAI servers and prompt you with a fine tune id such as "ft-FpsHdkc836TLMi53FidOabcF"; save this id. At this point the request is received by OpenAI servers and the fine tuning is queued to be executed. The fine tuning execution can take from a few minutes to many hours depending on your dataset size and the model you chose. You don't need to wait for the command to finish; you can kill the process (ctrl-c) without hesitation and come back later to check the progress of the job. You can check the status of your process by running the following command with the id you received previously:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai api fine_tunes.get -i &amp;lt;FINE_TUNING_ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the fine tuning is done you will see the status of your fine tune process as "processed" meaning your custom model training is done and it's ready to be used. Note that to use your trained model you have to know its name which you can find in the property called "&lt;em&gt;fine_tuned_model&lt;/em&gt;" (ex.: &lt;code&gt;"fine_tuned_model": "ada:ft-personal-2023-01-18-19-50-00"&lt;/code&gt;).&lt;/p&gt;
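&lt;p&gt;If you save the JSON that the get command prints, extracting the model name takes only a few lines; here is a small sketch (the helper name is hypothetical):&lt;/p&gt;

```python
import json

def fine_tuned_model_name(status_payload):
    """Extract the trained model's name from a fine-tune status response.

    Returns None while the job is still running, because the
    "fine_tuned_model" property is only populated once training is done."""
    status = json.loads(status_payload)
    return status.get("fine_tuned_model")
```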

&lt;p&gt;To use your trained model when sending prompts to OpenAI just set the "&lt;em&gt;-m&lt;/em&gt;" option with the id of your model as shown below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai api completions.create -m &amp;lt;FINE_TUNED_MODEL&amp;gt; -p "your prompt here!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also use the created model in your account's &lt;a href="https://platform.openai.com/playground"&gt;playground&lt;/a&gt;. Now feel free to play, check, test and assess your model to make sure it responds with the expected answer for your scenario!&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance assess results (Bonus)
&lt;/h3&gt;

&lt;p&gt;Remember we generated two files for the dataset, one for training and the other for validation. OpenAI used the first one to train the model and the second one to run assessment procedures measuring stats such as accuracy and loss. You can see the stats using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;openai api fine_tunes.results -i &amp;lt;FINE_TUNING_ID&amp;gt; &amp;gt; results.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl https://api.OpenAI.com/v1/files/$RESULTS_FILE_ID/content \
  -H "Authorization: Bearer $OPENAI_API_KEY" &amp;gt; results.csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output data is in CSV format by default and can be saved to a file for better visualization. Each line of the CSV file represents one step of the training process and shows valuable information about that step: how many examples it processed, how many tokens, training loss, training accuracy, validation loss, and much more. See more about results in this &lt;a href="https://platform.openai.com/docs/guides/fine-tuning/analyzing-your-fine-tuned-model"&gt;link&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenAI is bringing one of the most revolutionary technological disruptions of recent times with the creation of GPT-3 and ChatGPT. The best part is that anyone can have access to this incredible tool for free. Besides that, OpenAI provides the necessary tools for anyone to create personalized models to address any specific scenario. This process is called fine-tuning and it's made easily available to anyone by OpenAI. Fine tuning provides access to the cutting-edge machine learning technology that OpenAI used in GPT-3. This opens endless possibilities to improve human-computer interaction for companies, their employees and their customers.&lt;/p&gt;

&lt;p&gt;With fine tuning anyone can create customized GPT-3 models specific to their scenario. For example, create a chatbot to interact with customers in a conversational manner to answer specific questions about your company, your services and your products. Another example is to help employees find valuable information quickly and easily in a large knowledge base with many text articles, where it's hard and time consuming to find what you need. The scenarios involving high volumes of text where it can help are countless.&lt;/p&gt;

&lt;p&gt;If you have any questions, feel free to contact me: &lt;br&gt;
&lt;a href="https://www.linkedin.com/in/marceloax/"&gt;https://www.linkedin.com/in/marceloax/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>chatgpt</category>
      <category>gpt3</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
