Large language models (LLMs) are among the most talked-about topics in tech today.
A few years ago, a Google search on this topic would have returned a very different picture: LLM development was a highly specialized activity confined to AI research.
But today, if you Google “How to build an LLM from scratch” or “Should I build an LLM?”, you will see a much different story.
In general, there are four main steps to building an LLM:
- Data Curation / Data Collection
- Model Architecture
- Training at Scale
- Evaluation
1. Data Curation / Data Collection
This is the very first step of building an LLM.
It is also the most important and time-consuming step of the process.
A common saying in the field is: “The quality of the model is driven by the quality of the data.”
To get a sense of the scale of data involved:
- GPT-3 was trained on 300 billion tokens
- LLaMA 2 was trained on 2 trillion tokens
What is a Token?
A token is a small chunk of text, such as a word, part of a word, or a punctuation mark, that the model treats as a single unit.
For example, "PCIU Computer Club"
might become 4 tokens:
["PCI", "U", " Computer", " Club"]
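To make this concrete, here is a toy greedy subword tokenizer. This is for illustration only: real tokenizers (such as the byte-pair-encoding tokenizers used by GPT-style models) learn their vocabulary from data, whereas the vocabulary below is hand-picked to reproduce the example above.

```python
def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"PCI", "U", " Computer", " Club"}
print(tokenize("PCIU Computer Club", vocab))
# → ['PCI', 'U', ' Computer', ' Club']
```

Note that tokens can include leading spaces, which is why " Computer" and " Club" are separate vocabulary entries from "Computer" and "Club".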
Questions that arise:
What kind of data do we need?
This depends on the objective of the LLM. It may be:
- General-purpose
- Domain-specific (e.g., medical, financial fields)
Where to collect this data?
Generally, data is collected from:
- Books
- Scientific papers
- Codebases (e.g., GitHub, GitLab)
- Websites
Alternatively, public datasets such as Hugging Face datasets and Common Crawl can be used.
Data enrichment:
After collection, the data is cleaned by:
- Removing duplicate text
- Removing low-quality text
Finally, the text is converted into tokens using a tokenizer.
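The two cleaning passes above can be sketched in a few lines. This is a deliberately minimal illustration: production pipelines use far more sophisticated filters (language identification, perplexity filtering, near-duplicate detection, and so on), and the length threshold here is just a crude stand-in for a quality check.

```python
def clean_corpus(documents, min_length=20):
    """Drop exact duplicates and very short (low-quality) documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        # Removing duplicate text: skip anything seen before.
        if text in seen:
            continue
        # Removing low-quality text: here, just a length check.
        if len(text) < min_length:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

docs = [
    "The Transformer architecture underpins most modern language models.",
    "The Transformer architecture underpins most modern language models.",
    "ok",  # too short: dropped as low-quality
]
print(clean_corpus(docs))  # keeps only the first document
```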
2. Model Architecture
As the name suggests, in this step we design or choose the architecture of the model.
Most popular models are based on the Transformer architecture.
What is Transformer architecture?
Conceptually, it is a way to read and understand language by paying attention to the relationships between words rather than reading word-by-word.
This is why Transformers power most modern LLMs.
Example:
- “I hit the basketball with a bat” (bat = a stick)
- “I hit the bat with a basketball” (bat = an animal)
The model does not memorize a fixed meaning for each word.
Instead, it looks for relationships:
- Who is doing the action? (I)
- What is being hit? (basketball / bat)
- What tool is used? (bat / basketball)
By paying attention to these links, the Transformer figures out which “bat” is meant (stick vs. animal).
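The mechanism behind this is called scaled dot-product attention. Here is a minimal sketch in plain Python: a query vector is compared against every key, and the resulting weights decide how much each value contributes. The tiny hand-made vectors are illustrative only; real models use learned, high-dimensional embeddings and matrix libraries.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# One query attending over three key/value pairs.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
print(attention(q, keys, values))
```

Keys that align with the query (the first and third here) receive more weight, so their values dominate the output. This is, in miniature, how "bat" can attend to "hit" and "basketball" to resolve its meaning.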
To keep this article simple, we won’t go deep into the technical aspects of Transformers and their variants.
3. Training at Scale
In this step, we feed the data to the model and adjust its parameters until it becomes good at predicting and generating text.
The defining challenge of building an LLM is scale.
Training on trillions of tokens and billions of parameters carries an enormous computational cost.
It is practically impossible to train at this scale without computational optimizations such as:
- Mixed precision
- Gradient checkpointing
- Parallelism
These techniques reduce cost while keeping performance high.
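The core idea of "adjusting parameters" can be shown with a drastically simplified stand-in: one parameter, one loss, gradient descent. Real LLM training does the same thing with billions of parameters, a cross-entropy loss over next-token predictions, and the tricks above to make it feasible.

```python
def train(data, lr=0.1, steps=100):
    """Fit a single parameter w so that w * x predicts y."""
    w = 0.0  # a single "model parameter"
    for _ in range(steps):
        # Mean-squared-error loss gradient, averaged over the data.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # adjust the parameter against the gradient
    return w

# Toy dataset where the true relationship is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
print(round(w, 3))  # converges toward 2.0
```

In an LLM, "x" is a sequence of tokens, "y" is the next token, and the loop above runs across thousands of accelerators at once.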
4. Evaluation
In this step, we check whether the model works for the desired use case and whether it generates toxic or biased outputs.
There are many benchmarks and leaderboards available for evaluation, for example:
- The Hugging Face Open LLM Leaderboard, which aggregates scores from several standard benchmarks
Evaluation is critical before deploying an LLM to ensure it is safe, reliable, and useful.
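At its simplest, benchmark evaluation means scoring a model's answers against known-correct ones. The sketch below illustrates the idea; `fake_model` is a hypothetical stand-in for a trained LLM, and real harnesses run thousands of questions across many benchmarks, along with safety and bias checks.

```python
def evaluate(model, benchmark):
    """Fraction of questions the model answers correctly."""
    correct = sum(1 for question, answer in benchmark
                  if model(question) == answer)
    return correct / len(benchmark)

benchmark = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]

def fake_model(question):
    # Hypothetical model: always answers "4".
    return "4"

print(f"accuracy: {evaluate(fake_model, benchmark):.2f}")  # 0.33
```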