<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: therealoliver</title>
    <description>The latest articles on DEV Community by therealoliver (@therealoliver).</description>
    <link>https://dev.to/therealoliver</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2894859%2F95c78759-0479-486e-ba0e-c632f2e163ce.png</url>
      <title>DEV Community: therealoliver</title>
      <link>https://dev.to/therealoliver</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/therealoliver"/>
    <language>en</language>
    <item>
      <title>DeepDive in everything of Llama3: revealing detailed insights and implementation from scratch</title>
      <dc:creator>therealoliver</dc:creator>
      <pubDate>Wed, 26 Feb 2025 15:43:45 +0000</pubDate>
      <link>https://dev.to/therealoliver/deepdive-in-everything-of-llama3-revealing-detailed-insights-and-implementation-from-scratch-1m23</link>
      <guid>https://dev.to/therealoliver/deepdive-in-everything-of-llama3-revealing-detailed-insights-and-implementation-from-scratch-1m23</guid>
      <description>&lt;p&gt;&lt;strong&gt;GitHub Project Link: &lt;a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch" rel="noopener noreferrer"&gt;https://github.com/therealoliver/Deepdive-llama3-from-scratch&lt;/a&gt; | Bilingual Code &amp;amp; Docs | Core Concepts | Process Derivation | Full Implementation&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;What Does This Project Do?&lt;/h2&gt;

&lt;p&gt;Large language models like Meta's Llama3 are reshaping AI, but their inner workings often feel like a "black box." In this project, we demystify Transformer inference by &lt;strong&gt;implementing Llama3 from scratch&lt;/strong&gt; - with &lt;strong&gt;bilingual code annotations&lt;/strong&gt;, &lt;strong&gt;dimension tracking&lt;/strong&gt;, and &lt;strong&gt;KV-Cache derivations&lt;/strong&gt;. Whether you're a beginner or an experienced developer, this is your gateway to understanding LLMs at the tensor level!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faia4zs0ry8pysoxf1g71.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faia4zs0ry8pysoxf1g71.png" alt="480+ stars in 4 days!" width="800" height="571"&gt;&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;480+ stars in 4 days!&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;🔥 8 Key Features&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Well-Organized Structure&lt;/strong&gt;&lt;br&gt;
 A reorganized code flow that guides you from model loading to token prediction, layer by layer, matrix by matrix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Code Annotations &amp;amp; Dimension Tracking&lt;/strong&gt;&lt;br&gt;
 Every matrix operation is annotated with shape changes to eliminate confusion.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;#### Example: Part of RoPE calculation ####
&lt;/span&gt;
&lt;span class="c1"&gt;# Split the query vectors in pairs along the dimension direction.
# .float() is for switch back to full precision to ensure the precision and numerical stability in the subsequent trigonometric function calculations.
# [17x128] -&amp;gt; [17x64x2]
&lt;/span&gt;&lt;span class="n"&gt;q_per_token_split_into_pairs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;q_per_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;float&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;q_per_token&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
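&lt;p&gt;To make the rotation step itself concrete, here is a minimal pure-Python sketch of how RoPE rotates one such pair. The function name and the base frequency are illustrative only - the notebook works on full torch tensors and takes the base from the model configuration:&lt;/p&gt;

```python
import math

def rope_rotate_pair(x0, x1, position, pair_index, dim=128, base=10000.0):
    # Each dimension pair (2i, 2i+1) is rotated by an angle that depends on
    # the token position and on the pair's frequency base ** (-2i / dim).
    angle = position * base ** (-2.0 * pair_index / dim)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    # Standard 2D rotation of the pair (x0, x1).
    return (x0 * cos_a - x1 * sin_a, x0 * sin_a + x1 * cos_a)

# Position 0 means a zero angle for every pair, so the vector is unchanged.
print(rope_rotate_pair(1.0, 0.0, position=0, pair_index=0))  # (1.0, 0.0)
```

&lt;p&gt;Because a rotation preserves vector length, only the relative angle between query and key pairs (and hence their relative position) influences the dot product.&lt;/p&gt;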



&lt;p&gt;&lt;strong&gt;3. Principle Explanations&lt;/strong&gt;&lt;br&gt;
 Rich explanations of the underlying principles, backed by many detailed derivations. The project tells you not only "what to do" but also "why", so you can truly master the design ideas behind the model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Deep Insights into KV-Cache&lt;/strong&gt;&lt;br&gt;
 A dedicated chapter on KV-Cache - from theory to implementation - to optimize inference speed.&lt;/p&gt;
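&lt;p&gt;The core idea of that chapter can be sketched in a few lines: during decoding, the keys and values of past tokens never change, so they are cached and only the new token's projections are computed at each step. The class below is purely illustrative, not the notebook's implementation:&lt;/p&gt;

```python
class KVCache:
    """Minimal illustration: store past keys/values, append one step at a time."""

    def __init__(self):
        self.keys = []    # one key vector per past token
        self.values = []  # one value vector per past token

    def step(self, new_key, new_value):
        # Only the newest token's K/V projections are computed upstream;
        # attention then runs over the whole cache.
        self.keys.append(new_key)
        self.values.append(new_value)
        return self.keys, self.values

cache = KVCache()
cache.step([0.1, 0.2], [0.3, 0.4])                 # decode step 1
keys, values = cache.step([0.5, 0.6], [0.7, 0.8])  # decode step 2
print(len(keys))  # 2 - the cache now covers both tokens
```

&lt;p&gt;Without the cache, every step would recompute K and V for the entire prefix, so the per-step cost would grow with sequence length instead of staying roughly constant.&lt;/p&gt;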

&lt;p&gt;&lt;strong&gt;5. Bilingual Code &amp;amp; Docs&lt;/strong&gt;&lt;br&gt;
 Native Chinese and English versions, avoiding awkward machine translations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. End-to-End Prediction&lt;/strong&gt;&lt;br&gt;
 Input the prompt "the answer to the ultimate question…" and watch the model output 42 (a nod to The Hitchhiker's Guide to the Galaxy!).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Google Colab Support&lt;/strong&gt;&lt;br&gt;
Thanks to the open-source community, you can now run the project for free on Google Colab with a single click - no worries about computing resources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8. Switchable Models&lt;/strong&gt;&lt;br&gt;
Also thanks to the open-source community, you can freely switch between Llama models - for example Llama 3.1 or 3.2, and different sizes such as 1B, 3B, and 8B - making it easy to compare results and adapt to different resource budgets.&lt;/p&gt;




&lt;h2&gt;📖 Full Implementation Roadmap&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Loading the model

&lt;ul&gt;
&lt;li&gt;Loading the tokenizer&lt;/li&gt;
&lt;li&gt;Reading model files and configuration files&lt;/li&gt;
&lt;li&gt;Inferring model details using the configuration file&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Convert the input text into embeddings

&lt;ul&gt;
&lt;li&gt;Convert the text into a sequence of token ids&lt;/li&gt;
&lt;li&gt;Convert the sequence of token ids into embeddings&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Build the first Transformer block

&lt;ul&gt;
&lt;li&gt;Normalization&lt;/li&gt;
&lt;li&gt;Using RMS normalization for embeddings&lt;/li&gt;
&lt;li&gt;Implementing the single-head attention mechanism from scratch&lt;/li&gt;
&lt;li&gt;Obtain the QKV vectors corresponding to the input tokens

&lt;ul&gt;
&lt;li&gt;Obtain the query vector&lt;/li&gt;
&lt;li&gt;Unfold the query weight matrix&lt;/li&gt;
&lt;li&gt;Obtain the first head&lt;/li&gt;
&lt;li&gt;Multiply the token embeddings by the query weights to obtain the query vectors corresponding to the tokens&lt;/li&gt;
&lt;li&gt;Obtain the key vector (almost the same as the query vector)&lt;/li&gt;
&lt;li&gt;Obtain the value vector (almost the same as the key vector)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Add positional information to the query and key vectors

&lt;ul&gt;
&lt;li&gt;Rotary Position Encoding (RoPE)&lt;/li&gt;
&lt;li&gt;Add positional information to the query vectors&lt;/li&gt;
&lt;li&gt;Add positional information to the key vectors (same as the query)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Everything's ready. Let's start calculating the attention weights between tokens.

&lt;ul&gt;
&lt;li&gt;Multiply the query and key vectors to obtain the attention scores.&lt;/li&gt;
&lt;li&gt;Now we must mask the future query-key scores.&lt;/li&gt;
&lt;li&gt;Calculate the final attention weights, that is, softmax(score).&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Finally! Calculate the final result of the single-head attention mechanism!&lt;/li&gt;

&lt;li&gt;Calculate the multi-head attention mechanism (a simple loop to repeat the above process)&lt;/li&gt;

&lt;li&gt;Calculate the result for each head&lt;/li&gt;

&lt;li&gt;Merge the results of each head into a large matrix&lt;/li&gt;

&lt;li&gt;Head-to-head information interaction (linear mapping), the final step of the self-attention layer!&lt;/li&gt;

&lt;li&gt;Perform the residual operation (add)&lt;/li&gt;

&lt;li&gt;Perform the second normalization operation&lt;/li&gt;

&lt;li&gt;Perform the calculation of the FFN (Feed-Forward Neural Network) layer&lt;/li&gt;

&lt;li&gt;Perform the residual operation again (Finally, we get the final output of the Transformer block!)&lt;/li&gt;

&lt;/ul&gt;

&lt;/li&gt;

&lt;li&gt;Everything is here. Let's complete the calculation of all 32 Transformer blocks. Happy reading :)&lt;/li&gt;

&lt;li&gt;Let's complete the last step and predict the next token

&lt;ul&gt;
&lt;li&gt;First, perform one last normalization on the output of the last Transformer layer&lt;/li&gt;
&lt;li&gt;Then, make the prediction based on the embedding corresponding to the last token (perform a linear mapping to the vocabulary dimension)&lt;/li&gt;
&lt;li&gt;Here's the prediction result!&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Let's dive deeper and see how different embeddings or token masking strategies might affect the prediction results :)&lt;/li&gt;

&lt;li&gt;Need to predict multiple tokens? Just use KV-Cache! (It really took me a lot of effort to sort this out. Orz)&lt;/li&gt;

&lt;li&gt;Thank you all. Thanks for your continuous learning. Love you all :)

&lt;ul&gt;
&lt;li&gt;From Me&lt;/li&gt;
&lt;li&gt;From the author of the predecessor project&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;LICENSE&lt;/li&gt;

&lt;/ul&gt;
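&lt;p&gt;The attention steps listed above (scores, causal mask, softmax, weighted sum of values) fit into a few lines of plain Python. This is a hedged sketch with illustrative names and tiny dimensions, not the notebook's tensor code:&lt;/p&gt;

```python
import math

def causal_attention(queries, keys, values):
    # queries/keys/values: lists of per-token vectors (seq_len x head_dim)
    d = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        # Scaled dot-product scores, restricted to positions 0..i
        # (slicing the lists plays the role of the causal mask).
        scores = [sum(qe * ke for qe, ke in zip(q, k)) / math.sqrt(d)
                  for k in keys[: i + 1]]
        # softmax(score) gives the attention weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the visible value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values[: i + 1]))
                        for j in range(d)])
    return outputs

result = causal_attention([[1.0, 0.0], [0.0, 1.0]],
                          [[1.0, 0.0], [0.0, 1.0]],
                          [[2.0, 0.0], [0.0, 2.0]])
# Token 0 can only attend to itself, so its output equals values[0].
print(result[0])  # [2.0, 0.0]
```

&lt;p&gt;Multi-head attention then just repeats this per head on sliced Q/K/V projections and concatenates the per-head outputs, exactly as the roadmap describes.&lt;/p&gt;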




&lt;h2&gt;🔍 Why Choose This Project?&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Zero Magic, Just Math&lt;/strong&gt;&lt;br&gt;
 Implement matrix multiplications and attention without high-level frameworks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bilingual Clarity&lt;/strong&gt;&lt;br&gt;
 Code comments and docs in both English and Chinese for global accessibility.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reproducible Results&lt;/strong&gt;&lt;br&gt;
 Predict the iconic "42" using Meta's original model files, and trace the interesting process by which the model arrives at that answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hands-On Experiments&lt;/strong&gt;&lt;br&gt;
 Test unmasked attention, explore intermediate token predictions, and more.&lt;/p&gt;




&lt;h2&gt;🚀 Quick Start&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Clone the Project and Download the Model Weights&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;2. Follow the Code Walkthrough&lt;/strong&gt;&lt;br&gt;
 Start with &lt;strong&gt;&lt;em&gt;Deepdive-llama3-from-scratch-en.ipynb&lt;/em&gt;&lt;/strong&gt; in Jupyter Notebook.&lt;br&gt;
&lt;strong&gt;3. Join the Community&lt;/strong&gt;&lt;br&gt;
 Share your insights or ask questions in GitHub Discussions!&lt;/p&gt;




&lt;h2&gt;🌟 Hope this project will help you unravel the mysteries of LLMs!&lt;/h2&gt;

&lt;p&gt;GitHub Project Link: &lt;a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch" rel="noopener noreferrer"&gt;https://github.com/therealoliver/Deepdive-llama3-from-scratch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Let's unlock the secrets of Llama3 - one tensor at a time. 🚀&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>chatgpt</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
