For the last nine months I have been working on a project that integrates OpenAI's ChatGPT into our product to help create creative marketing assets for our customers. Over the last couple of months I have been working with LLMs (Large Language Models) and trying to get a deeper understanding of what they are and how they work. One of the first things I wanted to understand is how they are built, and one of the key steps in building an LLM is pre-training. I wanted to understand what pre-training is and how it works. This blog is a summary of what I have learned.
An LLM at its core is really a prediction engine: given a sequence of words or a sentence, it predicts which words should come next. Pre-training is the process where the LLM is fed a huge amount of data so it learns to recognize syntax, grammar and common phrases. As an example, GPT-3 was pre-trained on roughly 0.5 trillion tokens, and the outcome was a 175-billion-parameter model. The pre-training process is very compute-intensive and requires a lot of compute power. During pre-training, Self-Supervised Learning (SSL) is used to get the model to solve pre-training tasks that help it learn the language.
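To make the "prediction engine" idea concrete, here is a minimal sketch of next-word prediction. It assumes the Hugging Face transformers and PyTorch packages are installed, and it uses GPT-2 as a stand-in since GPT-3's weights are not publicly available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 stands in here for a pre-trained causal language model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The best way to market a new product is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits   # shape: (1, seq_len, vocab_size)

# The logits at the last position score every token in the vocabulary
# as a candidate for the next word; take the highest-scoring one.
next_token_id = int(logits[0, -1].argmax())
print(tokenizer.decode(next_token_id))
```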
Some common pre-training tasks used in SSL are listed below; a toy sketch of how their training examples are built follows the list.
Masked Language Modeling (MLM): The model tries to predict randomly masked words based on the surrounding context. This helps it learn relationships between words. Used in BERT.
Next Sentence Prediction (NSP): The model predicts whether two sentences appear consecutively in the source text. This helps it learn sentence structure. Used in BERT.
Causal Language Modeling (CLM): The model predicts the next word in a sequence. This helps it learn fluency. Used in GPT-1.
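As a toy illustration (plain Python, hypothetical helper names, no ML libraries), here is one way training examples for the MLM and CLM tasks can be constructed from raw text. Real implementations work on sub-word tokens and typically mask around 15% of them.

```python
import random

text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# Masked Language Modeling: hide random words; the model must recover them.
def make_mlm_example(tokens, mask_prob=0.15, mask_token="[MASK]"):
    inputs, labels = [], []
    for tok in tokens:
        if random.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(tok)        # the model is scored only on masked positions
        else:
            inputs.append(tok)
            labels.append(None)       # ignored by the loss
    return inputs, labels

# Causal Language Modeling: every position predicts the word that follows it.
def make_clm_example(tokens):
    return tokens[:-1], tokens[1:]    # inputs shifted against labels

print(make_mlm_example(tokens))
print(make_clm_example(tokens))
```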
When a model is pre-trained on multiple tasks (as BERT is with MLM and NSP), the overall SSL loss combines the losses from the different pre-training tasks. Each task has its own specialized loss function that the model tries to minimize during training, and using multiple complementary tasks makes the model more robust.
The loss function is key to how the LLM learns to predict the next words. The different losses and tasks teach the model important language skills like word meaning, sentence structure, grammar rules, and fluency. This develops a strong language foundation.
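As a hedged sketch (assuming PyTorch; the logits below are random stand-ins for what a real model would produce, so only the shapes matter), the standard cross-entropy loss for next-token prediction looks like this. For a multi-task setup like BERT, the per-task losses are simply summed.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50_000, 8
logits = torch.randn(1, seq_len, vocab_size)          # model predictions
labels = torch.randint(0, vocab_size, (1, seq_len))   # the actual next tokens

# Cross-entropy rewards putting high probability on the true next token.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))
print(loss.item())

# Multi-task pre-training (e.g. BERT) simply adds the per-task losses:
# total_loss = mlm_loss + nsp_loss
```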
Pre-training is an essential first step when creating large language models. It builds a deep understanding of language from huge datasets that enables LLMs to perform well on downstream tasks.