Umair Syed
Introduction to LLMs: A Useful Handbook

In the rapidly advancing field of Artificial Intelligence (AI), understanding the foundations is key, especially when dealing with Large Language Models (LLMs). This guide aims to simplify complex topics for beginners, taking you through essential concepts like neural networks, natural language processing (NLP), and LLMs. We'll explore how LLMs are built and trained, key terminology around them, and the challenges they face, like biases and hallucinations.

What Are Neural Networks?

A neural network is a machine learning model that makes decisions much like the human brain, mimicking how biological neurons work together to recognize patterns, evaluate choices, and reach conclusions. Neural networks are the backbone of AI models, including LLMs, and are organized into layers that process data and learn from it. The neurons in a neural network are arranged in layers:

  • Input Layer: Where the data first enters the model.
  • Hidden Layers: Where computations happen, allowing the network to learn patterns.
  • Output Layer: Where the final prediction or decision is made.

For example, a neural network can be trained to recognize images of cats. It processes an image through layers of neurons, enabling it to identify a cat among different shapes or objects.
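The layer structure above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real trained network: the weights, biases, and inputs are made-up numbers, and a real model would learn them from data. Each neuron computes a weighted sum of its inputs plus a bias, then applies an activation function (ReLU here).

```python
def relu(x):
    # Activation function: pass positive values through, clamp negatives to 0
    return max(0.0, x)

def layer(inputs, weights, biases):
    # Each neuron: weighted sum of all inputs plus a bias, then ReLU
    return [relu(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Input layer: two features (arbitrary example values)
inputs = [0.5, -0.2]

# One hidden layer with two neurons (weights chosen arbitrarily)
hidden = layer(inputs, weights=[[0.1, 0.8], [0.4, -0.3]], biases=[0.0, 0.1])

# Output layer with one neuron producing the final score
output = layer(hidden, weights=[[-1.0, 1.0]], biases=[0.0])
print(output)
```

In a real network, training nudges those weights and biases so the output layer's prediction gets closer to the right answer, which is exactly what "learning patterns" means in practice.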


What Is Natural Language Processing (NLP)?

Natural language processing (NLP) is the field focused on enabling machines to understand and generate human language, both spoken and written. NLP powers everything from chatbots to voice assistants, translating human language into something a machine can process. It involves tasks like:

  • Tokenization: Breaking text into smaller components (e.g., words or subwords).
  • Parsing: Understanding the structure of sentences.
  • Sentiment Analysis: Determining whether a piece of text is positive, negative, or neutral.

Without NLP, LLMs wouldn't be able to grasp the nuances of human language.
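To get a feel for two of these tasks, here is a toy sketch: a regex-based tokenizer and a sentiment check against tiny hand-made word lists. Real NLP systems use learned models for both; the word lists and the pattern here are simplifications for illustration only.

```python
import re

def tokenize(text):
    # Grab runs of word characters, keeping punctuation as separate tokens
    return re.findall(r"\w+|[^\w\s]", text)

# Tiny hand-made lexicons — a toy stand-in for a trained sentiment model
POSITIVE = {"love", "great", "happy"}
NEGATIVE = {"hate", "terrible", "sad"}

def sentiment(tokens):
    # Score = positive hits minus negative hits
    score = (sum(t.lower() in POSITIVE for t in tokens)
             - sum(t.lower() in NEGATIVE for t in tokens))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

tokens = tokenize("I love this great library!")
print(tokens)       # ['I', 'love', 'this', 'great', 'library', '!']
print(sentiment(tokens))  # positive
```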

What Are Large Language Models (LLMs)?

Large Language Models (LLMs) are advanced neural networks designed to understand and generate human language. They are trained on vast amounts of text data, allowing them to learn patterns, context, and nuances of language. LLMs can perform various tasks, such as answering questions, writing essays, translating languages, and engaging in conversations. Their primary goal is to create human-like text that captures the intricacies of natural language.

The core idea behind LLMs is simple: predict the next word in a sentence. For example, if you input "The sun rises in the...", the LLM should predict "east." While this may seem basic, this task leads to the development of complex, emergent abilities like text generation, reasoning, and even creativity.

Key LLM Terminologies

The Transformer

A Transformer is a type of neural network architecture that revolutionized natural language processing (NLP) by enabling models to handle sequences of data more efficiently. The Transformer architecture was introduced in the groundbreaking paper Attention Is All You Need. Traditional models like Recurrent Neural Networks (RNNs) processed sequential data while maintaining an internal state, allowing them to handle sequences like sentences. However, they struggled with long sequences because of the vanishing gradient problem: the adjustments made during training became too small to have any real impact, so over time the model effectively forgot earlier information.

The Transformer addressed these challenges by using a mechanism called attention, which allows the model to focus on different parts of a sentence or document more effectively, regardless of their position. This innovation laid the foundation for groundbreaking models like GPT-4, Claude, and LLaMA.

The architecture was first designed as an encoder-decoder framework. In this setup, the encoder processes input text, picking out important parts and creating a representation of it. The decoder then transforms this representation back into readable text. This approach is useful for tasks like summarization, where the decoder creates summaries based on the input passed to the encoder. The encoder and decoder can work together or separately, offering flexibility for various tasks. Some models only use the encoder to turn text into a vector, while others rely on just the decoder, which is the foundation of large language models.
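The attention mechanism at the heart of the Transformer can be sketched for a single query in plain Python. This is scaled dot-product attention in its simplest form, with made-up 2-dimensional vectors; real models use learned, high-dimensional vectors and many attention heads in parallel.

```python
import math

def attention(query, keys, values):
    # Scaled dot-product attention for a single query vector
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns raw scores into weights that sum to 1
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted average of the value vectors
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# Three tokens; the query points in the same direction as the first key,
# so the output leans toward the first token's value
keys = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
out = attention([1.0, 0.0], keys, values)
print(out)
```

Notice that the weighting depends only on how well the query matches each key, not on where a token sits in the sequence — which is exactly why attention handles long-range relationships better than an RNN's fading internal state.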

Language Modeling

Language modeling refers to the process of teaching LLMs to understand the probability distribution of words in a language. This allows models to predict the most likely next word in a sentence, a critical task in generating coherent text. The ability to generate coherent and contextually appropriate text is crucial in many applications, such as text summarization, translation, or conversational agents.
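A probability distribution over next words can be built even with the crudest possible model: counting which word follows which in a corpus (a bigram model). This toy sketch uses a tiny made-up corpus; LLMs learn vastly richer distributions with neural networks, but the prediction objective is the same.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    # For each word, count how often each next word follows it
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            counts[a][b] += 1
    return counts

def predict_next(counts, word):
    # Return the most frequent follower — the highest-probability next word
    followers = counts[word.lower()]
    return followers.most_common(1)[0][0] if followers else None

corpus = [
    "the sun rises in the east",
    "the sun rises in the morning",
    "the sun sets in the west",
    "the sun rises early",
]
model = train_bigram(corpus)
print(predict_next(model, "rises"))  # in
print(predict_next(model, "the"))    # sun
```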

Tokenization

Tokenization is the first step when working with large language models (LLMs). It means breaking down a sentence into smaller parts, called tokens. These tokens can be anything from individual letters to whole words depending on the model, and how they're split can affect how well the model works.

For example, consider the sentence: The developer’s favorite machine.

If we split the text by spaces, we get:

["The", "developer's", "favorite", "machine."]

Here, punctuation like the apostrophe in developer’s and the period at the end of machine. stays attached to the words. But we can also split the sentence based on spaces and punctuation:

["The", "developer", "'", "s", "favorite", "machine", "."]

The way text is split into tokens depends on the model, and many advanced models use methods like subword tokenization. This breaks words into smaller, meaningful parts. For example, the sentence It's raining can be split as:

["It", "'", "s", "rain", "ing", "."]

In this case, raining is broken into rain and ing, which helps the model understand the structure of words. By splitting words into their base forms and endings (like rain and ing for raining), the model can learn the meaning more effectively without needing to store different versions of every word.

During tokenization, the text is scanned, and each token is assigned a unique ID in a dictionary. This allows the model to quickly refer to the dictionary when processing the text, making the input easier to understand and work with.
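That scan-and-assign step can be shown in a few lines. This sketch builds a vocabulary in order of first appearance and encodes the token sequence as IDs; real tokenizers build their vocabularies once during training and reuse them, but the lookup idea is the same.

```python
def build_vocab(tokens):
    # Assign each distinct token a unique integer ID, in order of first appearance
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    # Replace each token with its dictionary ID
    return [vocab[tok] for tok in tokens]

tokens = ["It", "'", "s", "rain", "ing", ".", "It", "'", "s"]
vocab = build_vocab(tokens)
ids = encode(tokens, vocab)
print(vocab)  # {'It': 0, "'": 1, 's': 2, 'rain': 3, 'ing': 4, '.': 5}
print(ids)    # [0, 1, 2, 3, 4, 5, 0, 1, 2]
```

Note how the repeated tokens It, ', and s map back to the same IDs — the model only ever sees the numbers.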

Embeddings

After tokenization, the next step is to convert these tokens into something a computer can work with — which is done using embeddings. Embeddings are a way to represent tokens (words or parts of words) as numbers that the computer can understand. These numbers help the model recognize relationships between words and their context.

For example, let's say we have the words happy and joyful. The model assigns each word a set of numbers (its embedding) that captures its meaning. If two words are similar, like happy and joyful, their numbers will be close together, even though the words are different.

At first, the model assigns random numbers to each token. But as the model trains—by reading and learning from large amounts of text—it adjusts those numbers. The goal is for tokens with similar meanings to have similar sets of numbers, helping the model understand the connections between them.

Although it may sound complicated, embeddings are just lists of numbers that allow the model to store and process information efficiently. Using these numbers (or vectors) makes it easier for the model to understand how tokens relate to one another.

Let's look at a simple example of how embeddings work:
Imagine we have three words: cat, dog, and car. The model will assign each word a set of numbers, like this:

cat → [1.2, 0.5]
dog → [1.1, 0.6]
car → [4.0, 3.5]
Here, cat and dog have similar numbers because they are both animals, so their meanings are related. On the other hand, car has very different numbers because it’s a vehicle, not an animal.
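"Close together" can be measured. A common choice is cosine similarity, which compares the direction of two vectors; this sketch applies it to the toy 2-number embeddings above (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    # Similarity of direction: 1.0 means same direction, 0.0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

embeddings = {
    "cat": [1.2, 0.5],
    "dog": [1.1, 0.6],
    "car": [4.0, 3.5],
}

# cat vs. dog scores higher than cat vs. car
print(cosine_similarity(embeddings["cat"], embeddings["dog"]))
print(cosine_similarity(embeddings["cat"], embeddings["car"]))
```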

Training and Fine-Tuning

Large language models (LLMs) are trained by reading massive amounts of text to learn how to predict the next word in a sentence. The model's goal is to adjust its internal settings to improve the chances of making accurate predictions based on the patterns it observes in the text. Initially, LLMs are trained on general datasets from the internet, like The Pile or CommonCrawl, which contain a wide variety of topics. For specialized knowledge, the model might also be trained on focused datasets, such as Reddit posts, which help it learn specific areas like programming.

This initial training phase is called pre-training, where the model learns to understand language overall. During this phase, the model’s internal weights (its settings) are adjusted to help it predict the next word more accurately based on the training data.

Once pre-training is done, the model usually undergoes a second phase called fine-tuning. In fine-tuning, the model is trained on smaller datasets focused on specific tasks or domains, like medical text or financial reports. This helps the model apply what it learned during pre-training to perform better on specific tasks, such as translating text or answering questions about a particular field.

For advanced models like GPT-4, fine-tuning requires complex techniques and even larger amounts of data to achieve their impressive performance levels.

Prediction

After training or fine-tuning, the model can create text by predicting the next word in a sentence (or next token to be precise). It does this by analyzing the input and giving each possible next token a score based on how likely it is to come next. The token with the highest score is chosen, and this process repeats for each new token. This way, the model can generate sentences of any length, but it’s important to remember that the model can only handle a certain amount of text at a time as input, known as its context size.
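The score-pick-repeat loop described above is called greedy decoding, and it can be sketched with a toy scoring table standing in for a real model's output layer (the table and its scores are invented for illustration):

```python
def generate(scores_fn, prompt, max_new_tokens):
    # Greedy decoding: repeatedly pick the highest-scoring next token
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = scores_fn(tokens)            # maps candidate token -> score
        next_token = max(scores, key=scores.get)
        tokens.append(next_token)
    return tokens

def toy_scores(tokens):
    # A hand-made lookup table playing the role of a trained model
    table = {
        "the":   {"sun": 0.9, "moon": 0.1},
        "sun":   {"rises": 0.8, "sets": 0.2},
        "rises": {".": 0.7, "early": 0.3},
    }
    return table.get(tokens[-1], {".": 1.0})

print(generate(toy_scores, ["the"], 3))  # ['the', 'sun', 'rises', '.']
```

Real models don't always take the top-scoring token — sampling strategies like temperature or top-k introduce variety — but the basic generate-one-token-at-a-time loop is the same.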

Context Size

The context size, or context window, is a crucial aspect of LLMs. It refers to the maximum number of tokens the model can process in a single request. It determines how much information the model can handle in a single go, which impacts how well it performs and the quality of its output.

Different models have different context sizes. For instance, OpenAI’s gpt-3.5-turbo-16k model can handle up to 16,000 tokens (which are parts of words or words themselves). Smaller models might manage only 1,000 tokens, while bigger ones like GPT-4-0125-preview can process up to 128,000 tokens. This limit affects how much text the model can generate at one time.
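When a conversation grows past the context window, something has to be dropped. A common simple strategy is to keep only the most recent tokens — a sliding window. This sketch assumes tokens are already split out; a real application would count tokens with the model's own tokenizer, not by list length.

```python
def truncate_to_context(tokens, context_size):
    # Keep only the most recent tokens that fit in the window
    return tokens[-context_size:]

# 20 tokens of history, but a (tiny, hypothetical) 16-token context window
history = ["tok%d" % i for i in range(20)]
window = truncate_to_context(history, 16)
print(len(window))  # 16
print(window[0])    # tok4 — the four oldest tokens were dropped
```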

Scaling Laws

Scaling laws explain how a language model's performance is affected by different factors, such as the number of parameters, the size of the training dataset, the computing power available, and the model's design. These laws, discussed in the Chinchilla paper, help us understand how to best use resources to train models effectively. They also offer insights into optimizing performance. According to scaling laws, the following elements determine a language model's performance:

  1. Number of Parameters (N): Parameters are like tiny parts of the model’s brain that help it learn. When the model reads data, it adjusts these parameters to get better at understanding patterns. The more parameters the model has, the smarter it becomes, meaning it can pick up on more complex and detailed patterns in the data.
  2. Training Dataset Size (D): The training dataset is the collection of text or data the model learns from. The bigger the training dataset, the more the model can learn and recognize patterns in different texts.
  3. FLOPs (Floating Point Operations): This term refers to the total amount of computing work needed to train the model — the number of calculations performed over the whole training run. More FLOPs let the model train on more data or with more parameters, but also require more computational resources.
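These three quantities are linked by a widely cited rule of thumb from the scaling-law literature: training compute is roughly 6 × N × D FLOPs, where N is the parameter count and D is the number of training tokens. The sketch below applies that approximation to Chinchilla's published scale (about 70B parameters trained on about 1.4T tokens); treat the result as a back-of-the-envelope estimate, not an exact figure.

```python
def training_flops(n_params, n_tokens):
    # Rule of thumb: total training compute ≈ 6 * N * D floating point operations
    return 6 * n_params * n_tokens

# Chinchilla: roughly 70B parameters, roughly 1.4T training tokens
flops = training_flops(70e9, 1.4e12)
print(f"{flops:.2e}")  # ~5.88e+23 FLOPs
```

This is why doubling either the model size or the dataset roughly doubles the compute bill — and why the Chinchilla paper argues for balancing N and D rather than growing parameters alone.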

Emergent Abilities in LLMs

As LLMs grow in size and complexity, they start exhibiting emergent abilities that were not explicitly programmed into them. For example, GPT-4 can summarize long texts or even perform basic arithmetic without being specifically trained for those tasks. These abilities emerge because the model learns so much about language and data during training.

Prompts

Prompts are the instructions you give to LLMs to generate a desired output. Designing the right prompt can significantly improve the quality of the generated text. For example:

1. Use Clear Language:

Be specific in your prompts to get better results.

  • Less Clear: Write about Allama Iqbal.
  • More Clear: Write a 500-word article on Allama Iqbal, the great poet of the subcontinent.

2. Provide Enough Context:

Context helps the model know what you want.

  • Less Context: Write a story.
  • More Context: Write a short story about a baby girl lost in the woods, with a happy ending.

3. Try Different Variations:

Experiment with different prompt styles to see what works best.

  • Original: Write a blog post about the benefits of programming.
  • Variation 1: Write a 1000-word blog post on the mental and financial benefits of regularly practicing programming.
  • Variation 2: Create an engaging blog post highlighting the top 10 benefits of programming.

4. Review Outputs:

Always check the automated responses for accuracy before sharing.

Hallucinations

Hallucinations occur when LLMs generate content that is factually incorrect or nonsensical. For instance, an LLM might state that "The capital of Australia is Sydney," when the correct answer is Canberra. This happens because the model is focused on generating likely text based on its training, not verifying facts.

Biases

Bias in LLMs arises when the training data reflects cultural, gender, or racial biases. For example, if a model is trained predominantly on English text from Western sources, it may produce biased outputs that favor Western perspectives. Efforts are being made to minimize these biases, but they remain a critical challenge in the field.
