(Updated on 20 February 2022)
Introduction
In this post, I will fine-tune GPT-2, specifically rinna's model, which is one of the Japanese GPT-2 models. I am Japanese and most of my chat histories are in Japanese, so I will fine-tune a "Japanese" GPT-2.
GPT-2 stands for Generative Pre-trained Transformer 2, and it generates sentences, as the name suggests. We can build a chatbot by fine-tuning a pre-trained model with only a tiny amount of training data.
I will not go through GPT-2 in detail. I highly recommend the article How to Build an AI Text Generator: Text Generation with a GPT-2 Model on dev.to to understand what GPT-2 is and what a language model is.
git repository: chatbot_with_gpt2
I am grateful to the authors of the following two articles.
Thanks to the first author, I was able to build my chatbot model; the sources in my git repository are mostly built from his code, which I merely summarised. Thanks to the second author, I was able to get an understanding of GPT-2.
What is rinna
rinna is a conversational pre-trained model provided by rinna Co., Ltd., and five pre-trained models are available on Hugging Face [rinna Co., Ltd.] as of 19 February 2022. rinna is somewhat famous in Japan because the company published the rinna AI on LINE, one of the most popular messaging apps in Japan. Her persona is a junior high school girl, and we can have conversations with her on LINE.
I am not sure when the models were published on Hugging Face, but anyway, they are available now. I will fine-tune rinna/japanese-gpt2-small, which has a small number of parameters. By the way, I wanted to use rinna/japanese-gpt-1b, which has around one billion parameters, but I couldn't because of the memory capacity on Google Colab.
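As a quick check that the base model is accessible, it can be loaded directly from Hugging Face with the transformers library. This is only a minimal sketch, assuming transformers and sentencepiece are installed; the use_fast=False option mirrors the use_fast_tokenizer: False setting that appears later in model_config.yaml.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch: load the pre-trained rinna model from Hugging Face.
# Assumes the transformers and sentencepiece packages are installed.
model_name = "rinna/japanese-gpt2-small"
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)  # sentencepiece-based tokenizer
model = AutoModelForCausalLM.from_pretrained(model_name)

print(model.num_parameters())  # rough size check of the model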
Process
I will assume you have a Google account and a GitHub account and can use Google Colab.
Furthermore, I will use a chat history from LINE. If you have no account on the app, that is okay: all you have to do is prepare a chat history and adjust the data, although I know these are the hardest and most bothersome steps. If you do have an account, the following processes should work. Note that if your LINE setting language is Japanese, you should change it to English until you have exported the chat history, because the following processes assume the setting language (not the message language) is English.
Prepare the environment
At the end of this process, your Google Drive will be organised as follows.
MyDrive ---- chatbot_with_gpt2.ipynb
  |
  |- config
  |    |- general_config.yaml
  |
  |- data
       |- chat_history.txt
- 1: Clone the chatbot_with_gpt2 repository onto your local machine.
This is accomplished by running the following command in Git Bash.
git clone https://github.com/ksk0629/chatbot_with_gpt2
- 2: Upload chatbot_with_gpt2/chatbot_with_gpt2.ipynb to your Google Drive.
- 3: Make a directory named config on your Google Drive and create general_config.yaml in the config folder.
general_config.yaml is as follows.
github:
  username: your_github_username
  email: your_email
  token: your_access_token
ngrok:
  token: anything
The ngrok block is not actually used, but it needs to be present to avoid an error.
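For what it's worth, the notebook presumably reads this file with an ordinary YAML parser. Here is a minimal sketch of that, assuming PyYAML; the path follows the directory layout above, and the variable names are only illustrative, not the actual code in the repository.

import yaml  # PyYAML, usually preinstalled on Google Colab

# Rough sketch: reading general_config.yaml after mounting Google Drive.
config_path = "/content/gdrive/MyDrive/config/general_config.yaml"
with open(config_path, "r") as yaml_file:
    general_config = yaml.safe_load(yaml_file)

github_username = general_config["github"]["username"]
github_token = general_config["github"]["token"]
ngrok_token = general_config["ngrok"]["token"]  # not actually used, but the key must exist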
- 4: Get a chat history from LINE.
You can export the history by following the official instructions [Help centre - Chat history].
- 5: Make a directory named data on your Google Drive and move the chat history into that directory.
Prepare training data and build the model
- 1: Open chatbot_with_gpt2.ipynb on Google Colaboratory.
- 2: Run the cells in the Preparation block.
Running these cells prepares the environment for getting the training data and building the model.
- 3: Change chatbot_with_gpt2/pre_processor_config.yaml.
The initial yaml file is as follows.
line:
  initial:
    input_username: "input_username"
    output_username: "output_username"
    target_year_list: "[2016,2017,2018,2019,2020,2021,2022]"
  path:
    input_path: "/content/gdrive/MyDrive/data/chat_history.txt"
    output_path: "chat_history_cleaned.pk"
You have to change at least the initial block. The meaning of each line is as follows.
- input_username: the username whose messages will be used as input to the model
- output_username: the username whose messages you want the model to learn to output
- target_year_list: the years of messages that you want to use for training
- input_path: the path to the raw chat history
- output_path: the path to the cleaned data produced by the following process
Note that if you do not change output_path to a location on Google Drive, your training data will not be available after closing the notebook. Of course, it remains available while the notebook is running.
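To make these settings concrete, here is a very rough sketch of the kind of cleaning this step performs. It is not the actual preprocessor.py from the repository; it only illustrates the idea under the assumption that the exported history contains date header lines plus tab-separated lines of the form time, username, message (see the comment section at the end of this post).

import pickle
import re

# Very rough sketch of the cleaning step; not the repository's preprocessor.py.
# Illustrative values -- in the repository they come from pre_processor_config.yaml.
input_username = "input_username"
output_username = "output_username"
target_years = [2016, 2017, 2018, 2019, 2020, 2021, 2022]
input_path = "/content/gdrive/MyDrive/data/chat_history.txt"
output_path = "chat_history_cleaned.pk"

pairs = []           # (message to input, message the model should reply with)
current_year = None  # updated whenever a date header line appears
last_input = None    # most recent message from input_username

with open(input_path, encoding="utf-8") as history_file:
    for line in history_file:
        line = line.rstrip("\n")
        # Date header lines such as "2022/02/19 ..." carry the year (assumed format).
        date_match = re.match(r"(\d{4})/\d{2}/\d{2}", line)
        if date_match:
            current_year = int(date_match.group(1))
            continue
        # Message lines are assumed to be "time<TAB>username<TAB>message".
        parts = line.split("\t")
        if len(parts) < 3 or current_year not in target_years:
            continue
        username, message = parts[1], "\t".join(parts[2:])
        if username == input_username:
            last_input = message
        elif username == output_username and last_input is not None:
            pairs.append((last_input, message))
            last_input = None

# Save the cleaned pairs as a pickle file, matching the .pk output_path above.
with open(output_path, "wb") as cleaned_file:
    pickle.dump(pairs, cleaned_file)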
- 4: Run the cell in the Preprocessing data block.
The data is cleaned by this cell.
- 5: Change chatbot_with_gpt2/model_config.yaml.
The initial yaml file is as follows.
general:
  basemodel: "rinna/japanese-gpt2-xsmall"
dataset:
  input_path: "chat_history_cleaned.pk"
  output_path: "gpt2_train_data.txt"
train:
  epochs: 10
  save_steps: 10000
  save_total_limit: 3
  per_device_train_batch_size: 1
  per_device_eval_batch_size: 1
  output_dir: "model/default"
  use_fast_tokenizer: False
You have to change input_path in the dataset block to the path of the cleaned data, which is specified in pre_processor_config.yaml. You can change basemodel to rinna/japanese-gpt2-small, but the others (medium and 1b) would not work because of a lack of GPU memory, as I mentioned in the What is rinna section.
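For reference, here is a rough sketch of how settings like these typically map onto Hugging Face's Trainer API. It is not the exact code in the repository, only an illustration of the training loop that such a configuration describes.

from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, TextDataset,
                          Trainer, TrainingArguments)

# Illustrative values; in the repository these come from model_config.yaml.
basemodel = "rinna/japanese-gpt2-xsmall"
train_data_path = "gpt2_train_data.txt"

tokenizer = AutoTokenizer.from_pretrained(basemodel, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(basemodel)

# The training text file is cut into fixed-length blocks of token ids.
train_dataset = TextDataset(tokenizer=tokenizer, file_path=train_data_path, block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="model/default",
    num_train_epochs=10,
    save_steps=10000,
    save_total_limit=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model()  # writes the fine-tuned model into output_dir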
- 6: Run the cells in the Training data preparation and Building model blocks.
That is all! After running these cells, all you have to do is wait for a while. You will see your model files in the directory that is specified in model_config.yaml.
Let's talk to the model
Again, all you have to do is run the single cell in the Talking with the model block. Then the source code runs and you can talk with the model, like the following.
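For those curious about what that cell does under the hood, the conversation loop boils down to something like the following sketch. It is not the exact code in the repository; it assumes the fine-tuned model was saved to model/default (the output_dir from model_config.yaml) and reuses the base model's tokenizer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch of a conversation loop with the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("rinna/japanese-gpt2-xsmall", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("model/default")
model.eval()

while True:
    user_message = input("> ")
    if not user_message:
        break
    input_ids = tokenizer.encode(user_message, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(
            input_ids,
            max_length=100,
            do_sample=True,
            top_p=0.95,
            pad_token_id=tokenizer.pad_token_id,  # set explicitly instead of falling back to eos_token_id
        )
    # Drop the prompt tokens and decode only the generated continuation.
    reply = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
    print(reply)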
Conclusion
I fine-tuned GPT-2 with my chat history from LINE. I certainly did it, but there are the following problems, as you can see in the Let's talk to the model section.
- There is an unnecessary line, Setting 'pad_token_id' to 'eos_token_id':2 for open-end generation., in each conversation.
- There are some tokens, like <br:, [<unk>hoto]<br///, and <br/ゥ>, that disturb the coherence of the sentences.
- The model did not reply well.
The first response, 帰ったんか おつかれさま!, looks quite good because "おっす" means "Hey" and the response means "You are home. You've got to be exhausted", or something like that. But the others look wrong. To improve the model, I could clean the training data more, and I need to understand GPT-2 and the source code better.
If you have any suggestions, comments, or questions about this article, please comment below. I'd appreciate it.
Top comments (4)
Could you please show me an example of the data in chat_history.txt?
I encountered a problem when running preprocessor.py on my own chat_history.txt pulled from LINE... I guess the problem might be related to the data format.
Sorry for this late reply. The example is as below.
Note that the input is supposed to be Japanese content and the LINE settings are supposed to be English. It might still work if they are not, though.
Thannnnnks! I'm so glad to receive your reply!
BTW,
21:32 [someone's name]. <----- There is a space between the two items, right?
【the input is supposed to be Japanese content and the settings of line is supposed to be English】--> In Japanese: [someone's name] & [content]; In English: Everything else, right?
Hi. Sorry for this late reply again.
Yes. There is a space, which is a tab.
Yes, you are right. Technically, [someone's name] is okay even in Japanese, and if [content] is written in English, it would work, but the analyser is for Japanese, so the result would not be great.
Let me know if it works well.