<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: therealoliver</title>
    <description>The latest articles on DEV Community by therealoliver (@therealoliver).</description>
    <link>https://dev.to/therealoliver</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2894859%2F95c78759-0479-486e-ba0e-c632f2e163ce.png</url>
      <title>DEV Community: therealoliver</title>
      <link>https://dev.to/therealoliver</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/therealoliver"/>
    <language>en</language>
    <item>
      <title>DeepDive in everything of Llama3: revealing detailed insights and implementation from scratch</title>
      <dc:creator>therealoliver</dc:creator>
      <pubDate>Wed, 26 Feb 2025 15:43:45 +0000</pubDate>
      <link>https://dev.to/therealoliver/deepdive-in-everything-of-llama3-revealing-detailed-insights-and-implementation-from-scratch-1m23</link>
      <guid>https://dev.to/therealoliver/deepdive-in-everything-of-llama3-revealing-detailed-insights-and-implementation-from-scratch-1m23</guid>
      <description>&lt;p&gt;&lt;strong&gt;GitHub Project Link: &lt;a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch" rel="noopener noreferrer"&gt;https://github.com/therealoliver/Deepdive-llama3-from-scratch&lt;/a&gt; | Bilingual Code &amp;amp; Docs | Core Concepts | Process Derivation | Full Implementation&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;What Does This Project Do?&lt;/h2&gt;

&lt;p&gt;Large language models like Meta's Llama3 are reshaping AI, but their inner workings often feel like a "black box." In this project, we demystify Transformer inference by &lt;strong&gt;implementing Llama3 from scratch&lt;/strong&gt; - with &lt;strong&gt;bilingual code annotations&lt;/strong&gt;, &lt;strong&gt;dimension tracking&lt;/strong&gt;, and &lt;strong&gt;KV-Cache derivations&lt;/strong&gt;. Whether you're a beginner or an experienced developer, this is your gateway to understanding LLMs at the tensor level!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faia4zs0ry8pysoxf1g71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faia4zs0ry8pysoxf1g71.png" alt="480+ stars in 4 days!" width="800" height="571"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;480+ stars in 4 days!&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;🔥 8 Key Features&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Well-Organized Structure&lt;/strong&gt;&lt;br&gt;
 A reorganized code flow that guides you from model loading to token prediction, layer by layer, matrix by matrix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Code Annotations &amp;amp; Dimension Tracking&lt;/strong&gt;&lt;br&gt;
 Every matrix operation is annotated with shape changes to eliminate confusion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#### Example: Part of RoPE calculation ####
&lt;/span&gt;
&lt;span class="c1"&gt;# Split the query vectors in pairs along the dimension direction.
# .float() is for switch back to full precision to ensure the precision and numerical stability in the subsequent trigonometric function calculations.
# [17x128] -&amp;gt; [17x64x2]
&lt;/span&gt;&lt;span class="n"&gt;q_per_token_split_into_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q_per_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q_per_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
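&lt;p&gt;To make the rotation step itself concrete, here is a minimal pure-Python sketch of how RoPE rotates one such pair. The function name and the base frequency are illustrative only - the notebook works on full torch tensors and takes the base from the model configuration:&lt;/p&gt;

```python
import math

def rope_rotate_pair(x0, x1, position, pair_index, dim=128, base=10000.0):
    # Each dimension pair (2i, 2i+1) is rotated by an angle that depends on
    # the token position and on the pair's frequency base ** (-2i / dim).
    angle = position * base ** (-2.0 * pair_index / dim)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Standard 2D rotation of the pair (x0, x1).
    return (x0 * cos_a - x1 * sin_a, x0 * sin_a + x1 * cos_a)

# Position 0 means a zero angle for every pair, so the vector is unchanged.
print(rope_rotate_pair(1.0, 0.0, position=0, pair_index=0))  # (1.0, 0.0)
```

&lt;p&gt;Because a rotation preserves vector length, only the relative angle between query and key pairs (and hence their relative position) influences the dot product.&lt;/p&gt;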



&lt;p&gt;&lt;strong&gt;3. Principle Explanations&lt;/strong&gt;&lt;br&gt;
 Rich explanations of the underlying principles, backed by many detailed derivations. The project tells you not only "what to do" but also "why", so you can truly master the design ideas behind the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Deep Insights into KV-Cache&lt;/strong&gt;&lt;br&gt;
 A dedicated chapter on KV-Cache - from theory to implementation - to optimize inference speed.&lt;/p&gt;
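&lt;p&gt;The core idea of that chapter can be sketched in a few lines: during decoding, the keys and values of past tokens never change, so they are cached and only the new token's projections are computed at each step. The class below is purely illustrative, not the notebook's implementation:&lt;/p&gt;

```python
class KVCache:
    """Minimal illustration: store past keys/values, append one step at a time."""

    def __init__(self):
        self.keys = []    # one key vector per past token
        self.values = []  # one value vector per past token

    def step(self, new_key, new_value):
        # Only the newest token's K/V projections are computed upstream;
        # attention then runs over the whole cache.
        self.keys.append(new_key)
        self.values.append(new_value)
        return self.keys, self.values

cache = KVCache()
cache.step([0.1, 0.2], [0.3, 0.4])                 # decode step 1
keys, values = cache.step([0.5, 0.6], [0.7, 0.8])  # decode step 2
print(len(keys))  # 2 - the cache now covers both tokens
```

&lt;p&gt;Without the cache, every step would recompute K and V for the entire prefix, so the per-step cost would grow with sequence length instead of staying roughly constant.&lt;/p&gt;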

&lt;p&gt;&lt;strong&gt;5. Bilingual Code &amp;amp; Docs&lt;/strong&gt;&lt;br&gt;
 Native Chinese and English versions, avoiding awkward machine translations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. End-to-End Prediction&lt;/strong&gt;&lt;br&gt;
 Input the prompt "the answer to the ultimate question…" and watch the model output 42 (a nod to The Hitchhiker's Guide to the Galaxy!).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Google Colab Support&lt;/strong&gt;&lt;br&gt;
Thanks to the open-source community, you can now run the project for free on Google Colab with a single click - no worries about computing resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Switchable Models&lt;/strong&gt;&lt;br&gt;
Also thanks to the open-source community, you can freely switch between Llama models - for example Llama 3.1 or 3.2, and different sizes such as 1B, 3B, and 8B - making it easy to compare results and adapt to different resource budgets.&lt;/p&gt;




&lt;h2&gt;📖 Full Implementation Roadmap&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Loading the model

&lt;ul&gt;
&lt;li&gt;Loading the tokenizer&lt;/li&gt;
&lt;li&gt;Reading model files and configuration files&lt;/li&gt;
&lt;li&gt;Inferring model details using the configuration file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Convert the input text into embeddings

&lt;ul&gt;
&lt;li&gt;Convert the text into a sequence of token ids&lt;/li&gt;
&lt;li&gt;Convert the sequence of token ids into embeddings&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Build the first Transformer block

&lt;ul&gt;
&lt;li&gt;Normalization&lt;/li&gt;
&lt;li&gt;Using RMS normalization for embeddings&lt;/li&gt;
&lt;li&gt;Implementing the single-head attention mechanism from scratch&lt;/li&gt;
&lt;li&gt;Obtain the QKV vectors corresponding to the input tokens

&lt;ul&gt;
&lt;li&gt;Obtain the query vector&lt;/li&gt;
&lt;li&gt;Unfold the query weight matrix&lt;/li&gt;
&lt;li&gt;Obtain the first head&lt;/li&gt;
&lt;li&gt;Multiply the token embeddings by the query weights to obtain the query vectors corresponding to the tokens&lt;/li&gt;
&lt;li&gt;Obtain the key vector (almost the same as the query vector)&lt;/li&gt;
&lt;li&gt;Obtain the value vector (almost the same as the key vector)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Add positional information to the query and key vectors

&lt;ul&gt;
&lt;li&gt;Rotary Position Encoding (RoPE)&lt;/li&gt;
&lt;li&gt;Add positional information to the query vectors&lt;/li&gt;
&lt;li&gt;Add positional information to the key vectors (same as the query)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Everything's ready. Let's start calculating the attention weights between tokens.

&lt;ul&gt;
&lt;li&gt;Multiply the query and key vectors to obtain the attention scores.&lt;/li&gt;
&lt;li&gt;Now we must mask the future query-key scores.&lt;/li&gt;
&lt;li&gt;Calculate the final attention weights, that is, softmax(score).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Finally! Calculate the final result of the single-head attention mechanism!&lt;/li&gt;

&lt;li&gt;Calculate the multi-head attention mechanism (a simple loop to repeat the above process)&lt;/li&gt;

&lt;li&gt;Calculate the result for each head&lt;/li&gt;

&lt;li&gt;Merge the results of each head into a large matrix&lt;/li&gt;

&lt;li&gt;Head-to-head information interaction (linear mapping), the final step of the self-attention layer!&lt;/li&gt;

&lt;li&gt;Perform the residual operation (add)&lt;/li&gt;

&lt;li&gt;Perform the second normalization operation&lt;/li&gt;

&lt;li&gt;Perform the calculation of the FFN (Feed-Forward Neural Network) layer&lt;/li&gt;

&lt;li&gt;Perform the residual operation again (Finally, we get the final output of the Transformer block!)&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;Everything is here. Let's complete the calculation of all 32 Transformer blocks. Happy reading :)&lt;/li&gt;

&lt;li&gt;Let's complete the last step and predict the next token

&lt;ul&gt;
&lt;li&gt;First, perform one last normalization on the output of the last Transformer layer&lt;/li&gt;
&lt;li&gt;Then, make the prediction based on the embedding corresponding to the last token (perform a linear mapping to the vocabulary dimension)&lt;/li&gt;
&lt;li&gt;Here's the prediction result!&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Let's dive deeper and see how different embeddings or token masking strategies might affect the prediction results :)&lt;/li&gt;

&lt;li&gt;Need to predict multiple tokens? Just use KV-Cache! (It really took me a lot of effort to sort this out. Orz)&lt;/li&gt;

&lt;li&gt;Thank you all. Thanks for your continuous learning. Love you all :)

&lt;ul&gt;
&lt;li&gt;From Me&lt;/li&gt;
&lt;li&gt;From the author of the predecessor project&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;LICENSE&lt;/li&gt;

&lt;/ul&gt;
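&lt;p&gt;The attention steps listed above (scores, causal mask, softmax, weighted sum of values) fit into a few lines of plain Python. This is a hedged sketch with illustrative names and tiny dimensions, not the notebook's tensor code:&lt;/p&gt;

```python
import math

def causal_attention(queries, keys, values):
    # queries/keys/values: lists of per-token vectors (seq_len x head_dim)
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores, restricted to positions 0..i
        # (slicing the lists plays the role of the causal mask).
        scores = [sum(qe * ke for qe, ke in zip(q, k)) / math.sqrt(d)
                  for k in keys[: i + 1]]
        # softmax(score) gives the attention weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values[: i + 1]))
                        for j in range(d)])
    return outputs

result = causal_attention([[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 0.0], [0.0, 1.0]],
                          [[2.0, 0.0], [0.0, 2.0]])
# Token 0 can only attend to itself, so its output equals values[0].
print(result[0])  # [2.0, 0.0]
```

&lt;p&gt;Multi-head attention then just repeats this per head on sliced Q/K/V projections and concatenates the per-head outputs, exactly as the roadmap describes.&lt;/p&gt;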




&lt;h2&gt;🔍 Why Choose This Project?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Zero Magic, Just Math&lt;/strong&gt;&lt;br&gt;
 Implement matrix multiplications and attention without high-level frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bilingual Clarity&lt;/strong&gt;&lt;br&gt;
 Code comments and docs in both English and Chinese for global accessibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproducible Results&lt;/strong&gt;&lt;br&gt;
 Predict the iconic "42" using Meta's original model files, and trace the interesting process by which the model arrives at that answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hands-On Experiments&lt;/strong&gt;&lt;br&gt;
 Test unmasked attention, explore intermediate token predictions, and more.&lt;/p&gt;




&lt;h2&gt;🚀 Quick Start&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the Project and Download the Model Weights&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. Follow the Code Walkthrough&lt;/strong&gt;&lt;br&gt;
 Start with &lt;strong&gt;&lt;em&gt;Deepdive-llama3-from-scratch-en.ipynb&lt;/em&gt;&lt;/strong&gt; in Jupyter Notebook.&lt;br&gt;
&lt;strong&gt;3. Join the Community&lt;/strong&gt;&lt;br&gt;
 Share your insights or ask questions in GitHub Discussions!&lt;/p&gt;




&lt;h2&gt;🌟 Hope this project will help you unravel the mysteries of LLMs!&lt;/h2&gt;

&lt;p&gt;GitHub Project Link: &lt;a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch" rel="noopener noreferrer"&gt;https://github.com/therealoliver/Deepdive-llama3-from-scratch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's unlock the secrets of Llama3 - one tensor at a time. 🚀&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>chatgpt</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
