Large language models (LLMs) are among the most talked-about topics in tech today.
A few years ago, a Google search on this topic would have returned a very different picture: LLM development was a highly specialized activity confined to AI research.
But today, if you Google “How to build an LLM from scratch” or “Should I build an LLM?”, you will see a much different story.
In general, there are four main steps to building an LLM:
- Data Curation / Data Collection
- Model Architecture
- Training at Scale
- Evaluation
1. Data Curation / Data Collection
This is the very first step of building an LLM.
It is also the most important and time-consuming step of the process.
A common saying in the field is: “The quality of the model is driven by the quality of the data.”
To get a sense of the scale of data involved:
- GPT-3 was trained on 300 billion tokens
- LLaMA 2 was trained on 2 trillion tokens
What is a Token?
A token is a small chunk of text, such as a word, part of a word, or a punctuation mark, that the model treats as a single unit.
For example, "PCIU Computer Club"
might become 4 tokens:
["PCI", "U", " Computer", " Club"]
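To make this concrete, here is a toy greedy subword tokenizer. This is for illustration only: real tokenizers (such as the byte-pair-encoding tokenizers used by GPT-style models) learn their vocabulary from data, whereas the vocabulary below is hand-picked to reproduce the example above.

```python
def tokenize(text, vocab):
    """Greedily match the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Unknown character: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens

vocab = {"PCI", "U", " Computer", " Club"}
print(tokenize("PCIU Computer Club", vocab))
# → ['PCI', 'U', ' Computer', ' Club']
```

Note that tokens can include leading spaces, which is why " Computer" and " Club" are separate vocabulary entries from "Computer" and "Club".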
Questions that arise:
What kind of data do we need?
This depends on the objective of the LLM. It may be:
- General-purpose
- Domain-specific (e.g., medical, financial fields)
Where to collect this data?
Generally, data is collected from:
- Books
- Scientific papers
- Codebases (e.g., GitHub, GitLab)
- Websites
Alternatively, public datasets such as Hugging Face datasets and Common Crawl can be used.
Data enrichment:
After collection, the data is cleaned by:
- Removing duplicate text
- Removing low-quality text
Finally, the text is converted into tokens using a tokenizer.
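The two cleaning passes above can be sketched in a few lines. This is a deliberately minimal illustration: production pipelines use far more sophisticated filters (language identification, perplexity filtering, near-duplicate detection, and so on), and the length threshold here is just a crude stand-in for a quality check.

```python
def clean_corpus(documents, min_length=20):
    """Drop exact duplicates and very short (low-quality) documents."""
    seen = set()
    cleaned = []
    for doc in documents:
        text = doc.strip()
        # Removing duplicate text: skip anything seen before.
        if text in seen:
            continue
        # Removing low-quality text: here, just a length check.
        if len(text) < min_length:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

docs = [
    "The Transformer architecture underpins most modern language models.",
    "The Transformer architecture underpins most modern language models.",
    "ok",  # too short: dropped as low-quality
]
print(clean_corpus(docs))  # keeps only the first document
```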
2. Model Architecture
As the name suggests, in this step we design or choose the architecture of the model.
Most popular models are based on the Transformer architecture.
What is Transformer architecture?
Conceptually, it is a way to read and understand language by paying attention to the relationships between words rather than reading word-by-word.
This is why Transformers power most modern LLMs.
Example:
- “I hit the basketball with a bat” (bat = a stick)
- “I hit the bat with a basketball” (bat = an animal)
The model does not memorize a fixed meaning for each word.
Instead, it looks for relationships:
- Who is doing the action? (I)
- What is being hit? (basketball / bat)
- What tool is used? (bat / basketball)
By paying attention to these links, the Transformer figures out which “bat” is meant (stick vs. animal).
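The mechanism behind this is called scaled dot-product attention. Here is a minimal sketch in plain Python: a query vector is compared against every key, and the resulting weights decide how much each value contributes. The tiny hand-made vectors are illustrative only; real models use learned, high-dimensional embeddings and matrix libraries.

```python
import math

def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(query, keys, values):
    """Weight each value by how well its key matches the query."""
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))]

# One query attending over three key/value pairs.
q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
print(attention(q, keys, values))
```

Keys that align with the query (the first and third here) receive more weight, so their values dominate the output. This is, in miniature, how "bat" can attend to "hit" and "basketball" to resolve its meaning.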
To keep this article simple, we won’t go deep into the technical aspects of Transformers and their variants.
3. Training at Scale
In this step, we feed the data to the model and adjust its parameters until it becomes good at predicting and generating text.
The defining challenge of building an LLM is scale.
Training on trillions of tokens and billions of parameters carries an enormous computational cost.
It is practically impossible to train at this scale without computational optimizations such as:
- Mixed precision
- Gradient checkpointing
- Parallelism
These techniques reduce cost while keeping performance high.
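The core idea of "adjusting parameters" can be shown with a drastically simplified stand-in: one parameter, one loss, gradient descent. Real LLM training does the same thing with billions of parameters, a cross-entropy loss over next-token predictions, and the tricks above to make it feasible.

```python
def train(data, lr=0.1, steps=100):
    """Fit a single parameter w so that w * x predicts y."""
    w = 0.0  # a single "model parameter"
    for _ in range(steps):
        # Mean-squared-error loss gradient, averaged over the data.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # adjust the parameter against the gradient
    return w

# Toy dataset where the true relationship is y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
print(round(w, 3))  # converges toward 2.0
```

In an LLM, "x" is a sequence of tokens, "y" is the next token, and the loop above runs across thousands of accelerators at once.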
4. Evaluation
In this step, we check whether the model works for the desired use case and whether it generates toxic or biased outputs.
There are many benchmarks and leaderboards available for evaluation, for example:
- The Hugging Face Open LLM Leaderboard, which aggregates scores from several standard benchmarks
Evaluation is critical before deploying an LLM to ensure it is safe, reliable, and useful.
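At its simplest, benchmark evaluation means scoring a model's answers against known-correct ones. The sketch below illustrates the idea; `fake_model` is a hypothetical stand-in for a trained LLM, and real harnesses run thousands of questions across many benchmarks, along with safety and bias checks.

```python
def evaluate(model, benchmark):
    """Fraction of questions the model answers correctly."""
    correct = sum(1 for question, answer in benchmark
                  if model(question) == answer)
    return correct / len(benchmark)

benchmark = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]

def fake_model(question):
    # Hypothetical model: always answers "4".
    return "4"

print(f"accuracy: {evaluate(fake_model, benchmark):.2f}")  # 0.33
```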