<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pranay Bathini </title>
    <description>The latest articles on DEV Community by Pranay Bathini  (@pranaybathini).</description>
    <link>https://dev.to/pranaybathini</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F54499%2F1ca8304e-fb32-4f44-8936-85ad814ba2ed.jpg</url>
      <title>DEV Community: Pranay Bathini </title>
      <link>https://dev.to/pranaybathini</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pranaybathini"/>
    <language>en</language>
    <item>
      <title>The Transformer Architecture: A Deep Dive into How LLMs Actually Work</title>
      <dc:creator>Pranay Bathini </dc:creator>
      <pubDate>Sat, 27 Dec 2025 19:56:48 +0000</pubDate>
      <link>https://dev.to/pranaybathini/the-transformer-architecture-a-deep-dive-into-how-llms-actually-work-4c46</link>
      <guid>https://dev.to/pranaybathini/the-transformer-architecture-a-deep-dive-into-how-llms-actually-work-4c46</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Originally published at &lt;a href="https://pranaybathini.com/learn/llm-fundamentals/transformer-architecture" rel="noopener noreferrer"&gt;pranaybathini.com/learn/llm-fundamentals/transformer-architecture&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After spending months studying transformer architectures and building LLM applications, I realized something: most explanations are either overwhelming or missing important details. This article is my attempt to bridge that gap: explaining transformers the way I wish someone had explained them to me.&lt;/p&gt;

&lt;p&gt;For an introduction to what a large language model (LLM) is, see this &lt;a href="https://pranaybathini.com/learn/llm-fundamentals/introduction-to-language-models" rel="noopener noreferrer"&gt;article&lt;/a&gt; I published previously.&lt;/p&gt;

&lt;p&gt;By the end of this lesson, you will be able to look at any LLM architecture diagram and understand what is happening.&lt;/p&gt;

&lt;p&gt;This is not just academic knowledge — understanding the Transformer architecture will help you make better decisions about model selection, optimize your prompts, and debug issues when your LLM applications behave unexpectedly.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;How to Read This Lesson:&lt;/strong&gt; You don't need to absorb everything in one read. Skim first, revisit later—this lesson is designed to compound over time. The concepts build on each other, so come back as you need deeper understanding.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What You Will Learn
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The complete Transformer architecture from input to output&lt;/li&gt;
&lt;li&gt;How positional encodings let models understand word order&lt;/li&gt;
&lt;li&gt;The difference between encoder-only, decoder-only, and encoder-decoder models&lt;/li&gt;
&lt;li&gt;Why layer normalization and residual connections matter&lt;/li&gt;
&lt;li&gt;How to read and interpret architecture diagrams&lt;/li&gt;
&lt;li&gt;Practical implications for choosing the right model type&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't worry if some of these terms sound unfamiliar—we'll explain each concept step by step, starting with the basics. By the end of this lesson, these technical terms will make perfect sense, even if you're new to machine learning architecture.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Big Picture
&lt;/h2&gt;

&lt;p&gt;Let's start with a simple analogy. Imagine you're reading a book and trying to understand a sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The animal didn't cross the street because it was too tired."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To understand this, your brain does several things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Recognizes the words&lt;/strong&gt; - You know what "animal", "street", and "tired" mean&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understands word order&lt;/strong&gt; - "The animal was tired" means something different from "Tired was the animal"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connects related words&lt;/strong&gt; - You figure out that "it" refers to "animal", not "street"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grasps the overall meaning&lt;/strong&gt; - The animal's tiredness caused it to not cross&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A Transformer does something remarkably similar, but using math. Let me give you a simple explanation of how it works:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What goes in:&lt;/strong&gt; Text broken into pieces (called tokens)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's a token? Think of tokens as the basic building blocks that language models understand:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sometimes a token is a full word (like "cat" or "the")&lt;/li&gt;
&lt;li&gt;Sometimes it's part of a word (like "under" and "stand" for "understand")&lt;/li&gt;
&lt;li&gt;Even punctuation marks and spaces can be their own tokens&lt;/li&gt;
&lt;li&gt;For example, "I love AI!" might be split into tokens: ["I", " love", " AI", "!"] &lt;/li&gt;
&lt;/ul&gt;
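&lt;p&gt;The token examples above can be sketched in a few lines of Python. This is a toy splitter invented for illustration; real tokenizers (BPE, WordPiece) learn their splitting rules from data:&lt;/p&gt;

```python
import re

def toy_tokenize(text):
    # Keep each word (with its leading space, as many real tokenizers do)
    # and treat punctuation marks as their own tokens.
    return re.findall(r" ?\w+|[^\w\s]", text)

print(toy_tokenize("I love AI!"))  # ['I', ' love', ' AI', '!']
```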

&lt;p&gt;&lt;strong&gt;What happens inside:&lt;/strong&gt; The model processes this text through several stages (we'll explore each in detail):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Converts words to numbers (because computers only understand math)&lt;/li&gt;
&lt;li&gt;Adds information about word positions (1st word, 2nd word, etc.)&lt;/li&gt;
&lt;li&gt;Figures out which words are related to each other&lt;/li&gt;
&lt;li&gt;Builds deeper understanding by repeating this process many times&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;What comes out:&lt;/strong&gt; Depends on what you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understanding text (aka: encoding):&lt;/strong&gt; A mathematical representation that captures meaning (useful for: "Is this email spam?" or "Find similar articles")&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generating text (aka: decoding):&lt;/strong&gt; Prediction of what word should come next (useful for: ChatGPT, code completion, translation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of a Transformer like an assembly line where each station refines the product. Raw materials (words) enter, each station adds something (position info, relationships, meaning), and the final product emerges more polished at each step.&lt;/p&gt;

&lt;h3&gt;
  
  
  A Quick Visual Journey
&lt;/h3&gt;

&lt;p&gt;Here's how text flows through a Transformer:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttu2v4lwwmom9hx2zchy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttu2v4lwwmom9hx2zchy.png" alt="Tranformer Architecture: Text processing flow" width="800" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The diagram shows how a simple sentence like "The cat sat on the mat" gets processed through the transformer architecture - from tokenization to final output. The key steps include embedding the tokens into vectors, adding positional information, applying self-attention to understand relationships between words, and repeating the attention and processing steps multiple times to refine understanding.&lt;/p&gt;

&lt;p&gt;Modern LLMs repeat the attention and processing steps many times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller models: 12 layers (like BERT-base)&lt;/li&gt;
&lt;li&gt;Large models: around 100 layers (GPT-3 has 96; GPT-4's exact count hasn't been published)&lt;/li&gt;
&lt;li&gt;Each repetition = one "layer" that deepens understanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now let's walk through each step in detail, starting from the very beginning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Tokenization and Embeddings
&lt;/h2&gt;

&lt;p&gt;Before the model can process text, it needs to solve two problems: breaking text into pieces (tokenization) and converting those pieces into numbers (embeddings).&lt;/p&gt;

&lt;h3&gt;
  
  
  Part A: Tokenization - Breaking Text Into Pieces
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; How do you break text into manageable chunks? You might think "just split by spaces into words," but that's too simple.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not just use words?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider these challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"running" and "runs" are related, but treating them as completely separate words wastes the model's capacity&lt;/li&gt;
&lt;li&gt;New words like "ChatGPT" appear constantly - you can't have infinite vocabulary&lt;/li&gt;
&lt;li&gt;Different languages don't use spaces (Chinese, Japanese)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The solution: Subword Tokenization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Modern models break text into subwords - pieces smaller than words but larger than individual characters. Think of it like Lego blocks: instead of needing a unique piece for every possible structure, you reuse common blocks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text: "I am playing happily"

Split by spaces (naive approach):
["I", "am", "playing", "happily"]
Problem: Need separate entries for "play", "playing", "played", "player", "plays"...

Subword tokenization (smart approach):
["I", "am", "play", "##ing", "happy", "##ly"]
Better: Reuse "play" and "##ing" for "playing", "running", "jumping"
        Reuse "happy" and "##ly" for "happily", "sadly", "quickly"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters - concrete examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handling related words:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"unhappiness" → ["un", "##happy", "##ness"]&lt;/li&gt;
&lt;li&gt;Now the model knows: "un" = negative, "happy" = emotion, "ness" = state&lt;/li&gt;
&lt;li&gt;When it sees "uncomfortable", it recognizes "un" means negative!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Handling rare/new words:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Imagine the word "unsubscribe" wasn't in training&lt;/li&gt;
&lt;li&gt;Model breaks it down: ["un", "##subscribe"]&lt;/li&gt;
&lt;li&gt;It can guess meaning from pieces it knows: "un" (undo) + "subscribe" (join)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Vocabulary efficiency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;50,000 tokens can represent millions of word combinations&lt;/li&gt;
&lt;li&gt;Like having 1,000 Lego pieces that make infinite structures&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
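&lt;p&gt;The subword idea above can be sketched as a greedy longest-match tokenizer, in the style of WordPiece. The tiny vocabulary here is invented for illustration; real vocabularies contain tens of thousands of learned pieces, which don't always line up with dictionary morphemes:&lt;/p&gt;

```python
# Invented toy vocabulary; "##" marks a piece that continues a word.
vocab = {"un", "##happi", "##ness", "play", "##ing"}

def wordpiece(word):
    tokens, start = [], 0
    while start < len(word):
        # Try the longest remaining piece first, then shrink.
        for end in range(len(word), start, -1):
            piece = word[start:end] if start == 0 else "##" + word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            return ["[UNK]"]  # nothing in the vocabulary matched
    return tokens

print(wordpiece("playing"))      # ['play', '##ing']
print(wordpiece("unhappiness"))  # ['un', '##happi', '##ness']
```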

&lt;p&gt;&lt;strong&gt;Real example of tokenization impact:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input: "The animal didn't cross the street because it was tired"

Tokens (what the model actually sees):
["The", "animal", "didn", "'", "t", "cross", "the", "street", "because", "it", "was", "tired"]

Notice:
- "didn't" → ["didn", "'", "t"] (split to handle contractions)
- Each token gets converted to numbers (embeddings) next
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Part B: Embeddings - Converting Tokens to Numbers
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Computers don't understand tokens. They only work with numbers. So how do we convert "cat" into something a computer can process?&lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding Dimensions with a Simple Analogy
&lt;/h3&gt;

&lt;p&gt;Before we dive in, let's understand what "dimensions" mean with a familiar example:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Describing a person in 3 dimensions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Height: 5.8 feet&lt;/li&gt;
&lt;li&gt;Weight: 150 lbs
&lt;/li&gt;
&lt;li&gt;Age: 30 years&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These 3 numbers (dimensions) give us a mathematical way to represent a person. Now, what if we want to represent a word mathematically?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Describing a word needs way more dimensions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To capture everything about the word "cat", we need hundreds of numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dimension 1: How "animal-like" is this word? (0.9 - very animal-like)&lt;/li&gt;
&lt;li&gt;Dimension 2: How "small" is this? (0.7 - fairly small)&lt;/li&gt;
&lt;li&gt;Dimension 3: How "domestic" is it? (0.8 - very domestic)&lt;/li&gt;
&lt;li&gt;Dimension 4: How "fluffy" is this? (0.6 - somewhat fluffy)&lt;/li&gt;
&lt;li&gt;... (and hundreds more capturing different aspects)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern models use anywhere from 768 to over 12,000 dimensions (BERT-base uses 768; GPT-3 uses 12,288) because words are complex! But here's the key: &lt;strong&gt;you don't need to understand what each dimension represents&lt;/strong&gt;. The model figures this out during training.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Words Get Converted to Numbers
&lt;/h3&gt;

&lt;p&gt;Let's walk through a concrete example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is a simplified embedding table (real ones have thousands of words)
# Each word maps to a list of numbers (a "vector")
&lt;/span&gt;&lt;span class="n"&gt;embedding_table&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="c1"&gt;# 768 numbers total
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dog&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;    &lt;span class="c1"&gt;# Notice: similar to "cat"!
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bank&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# Very different from "cat"
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# When we input a sentence:
&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The cat sat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Break into tokens
&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Look up each token's vector
&lt;/span&gt;&lt;span class="n"&gt;embedded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="n"&gt;embedding_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# Gets: [0.1, 0.3, ..., 0.2]  (768 numbers)
&lt;/span&gt;    &lt;span class="n"&gt;embedding_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# Gets: [0.2, -0.5, ..., 0.1] (768 numbers)  
&lt;/span&gt;    &lt;span class="n"&gt;embedding_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;   &lt;span class="c1"&gt;# Gets: [0.4, 0.2, ..., 0.3]  (768 numbers)
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Result: We now have 3 vectors, each with 768 dimensions
# The model can now do math with these!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where Does This Table Come From?
&lt;/h3&gt;

&lt;p&gt;Great question! The embedding table isn't written by hand. Here's how it's created:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Start with random numbers&lt;/strong&gt;: Initially, every word gets random numbers&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"cat" → [0.43, 0.12, 0.88, ...] (random)&lt;/li&gt;
&lt;li&gt;"dog" → [0.71, 0.05, 0.33, ...] (random)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Training adjusts these numbers&lt;/strong&gt;: As the model trains on billions of text examples, it learns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"cat" and "dog" appear in similar contexts → Their numbers become similar&lt;/li&gt;
&lt;li&gt;"cat" and "bank" appear in different contexts → Their numbers stay different&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;After training&lt;/strong&gt;: Words with similar meanings have similar number patterns&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"cat" → [0.2, -0.5, 0.8, ...]&lt;/li&gt;
&lt;li&gt;"dog" → [0.3, -0.4, 0.7, ...] ← Very similar to "cat"!&lt;/li&gt;
&lt;li&gt;"happy" → [0.5, 0.8, 0.3, ...]&lt;/li&gt;
&lt;li&gt;"joyful" → [0.6, 0.7, 0.4, ...] ← Similar to "happy"!&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Why This Matters
&lt;/h3&gt;

&lt;p&gt;These embeddings capture word relationships mathematically:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"king" - "man" + "woman" ≈ "queen" (this actually works with the vectors!)&lt;/li&gt;
&lt;li&gt;Similar words cluster together in this high-dimensional space&lt;/li&gt;
&lt;li&gt;The model can now reason about word meanings using math&lt;/li&gt;
&lt;/ul&gt;
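&lt;p&gt;Here is a toy sketch of the "king" - "man" + "woman" arithmetic. The 4-dimensional vectors are made up so the arithmetic works out cleanly; real embeddings have hundreds of dimensions and the relationship only holds approximately:&lt;/p&gt;

```python
import math

# Invented 4-dimensional embeddings (real ones have 768+ dimensions).
emb = {
    "king":  [0.9, 0.8, 0.1, 0.7],
    "man":   [0.1, 0.9, 0.1, 0.2],
    "woman": [0.1, 0.1, 0.9, 0.2],
    "queen": [0.9, 0.0, 0.9, 0.7],
}

def cosine(a, b):
    # Cosine similarity: 1.0 = same direction, 0.0 = unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

result = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]
closest = max(emb, key=lambda word: cosine(emb[word], result))
print(closest)  # queen
```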

&lt;h3&gt;
  
  
  Key Insight: Embeddings as Parameters
&lt;/h3&gt;

&lt;p&gt;When we say GPT-3 has 175 billion parameters, where are they? A significant chunk lives in the embedding table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What happens in the embedding layer:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each token in your vocabulary (like "cat" or "the") gets its own vector of numbers&lt;/li&gt;
&lt;li&gt;These numbers ARE the parameters - they're what the model learns during training&lt;/li&gt;
&lt;li&gt;For a model with 50,000 tokens and 1,024 dimensions per token, that's 51.2 million parameters just for embeddings&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
If "cat" = token #847, the model looks up row #847 in its embedding table and retrieves a vector like [0.2, -0.5, 0.7, ...] with hundreds or thousands of numbers. Each of these numbers is a parameter that was optimized during training.&lt;/p&gt;

&lt;p&gt;This is why embeddings contain so much "knowledge" - they encode the meaning and relationships between words that the model learned from massive amounts of text.&lt;/p&gt;
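&lt;p&gt;The parameter arithmetic above is easy to verify yourself:&lt;/p&gt;

```python
# Embedding parameters = vocabulary size x dimensions per token.
vocab_size = 50_000
dims = 1_024
embedding_params = vocab_size * dims
print(f"{embedding_params:,} parameters")  # 51,200,000 parameters

# The lookup itself is just reading one row of the table, e.g.
# vector = embedding_table[847]  # 847 = the article's hypothetical id for "cat"
```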


&lt;h2&gt;
  
  
  Step 2: Adding Position Information
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; After converting words to numbers, we have another issue. Look at these two sentences:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;"The cat sat"&lt;/li&gt;
&lt;li&gt;"sat cat The"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;They have the same words, just in different order. But right now, the model sees them as identical because it just has three vectors with no order information!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The dog bit the man" vs "The man bit the dog"&lt;/li&gt;
&lt;li&gt;Same words, completely different meanings!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformers process all words at the same time (unlike reading left-to-right), so we need to explicitly tell the model: "This is word #1, this is word #2, this is word #3."&lt;/p&gt;
&lt;h3&gt;
  
  
  How We Add Position Information
&lt;/h3&gt;

&lt;p&gt;Think of it like adding page numbers to a book. Each word gets a "position tag" added to its embedding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple Example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We have our word embeddings from Step 1:
&lt;/span&gt;&lt;span class="n"&gt;word_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# "The" (768 numbers)
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt; &lt;span class="c1"&gt;# "cat" (768 numbers)
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# "sat" (768 numbers)
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Now add position information:
&lt;/span&gt;&lt;span class="n"&gt;position_tags&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# Position 1 tag (768 numbers)
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# Position 2 tag (768 numbers)  
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# Position 3 tag (768 numbers)
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Combine them (add the numbers together):
&lt;/span&gt;&lt;span class="n"&gt;final_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# "The" at position 1
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt; &lt;span class="c1"&gt;# "cat" at position 2
&lt;/span&gt;    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...],&lt;/span&gt;  &lt;span class="c1"&gt;# "sat" at position 3
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Now each word carries both:
# - What the word means (from embeddings)
# - Where the word is located (from position tags)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How Are Position Tags Created?
&lt;/h3&gt;

&lt;p&gt;The original Transformer paper used a mathematical pattern based on sine and cosine waves. You don't need to understand the math — just know that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Each position gets a unique pattern&lt;/strong&gt; - Position 1 gets one pattern, position 2 gets another, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The pattern encodes relative distance&lt;/strong&gt; - The model can figure out "word 5 is 2 steps after word 3"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It works for any length&lt;/strong&gt; - The mathematical pattern extends beyond what the model saw during training, so a model trained on 100-token sentences can still assign positions in a 1,000-token document (though quality tends to degrade far beyond the training length)&lt;/li&gt;
&lt;/ol&gt;
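&lt;p&gt;If you're curious what that sine/cosine pattern looks like, here is a minimal sketch of the formula from the original paper (illustrative, not optimized production code):&lt;/p&gt;

```python
import math

def positional_encoding(position, d_model):
    # Even dimensions use sin, odd dimensions use cos, each pair at a
    # progressively longer wavelength (the 10000 ** ... term).
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((i - i % 2) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe

# Every position gets its own unique pattern; we use 8 dimensions
# here (instead of 768) to keep the output readable.
print(positional_encoding(0, 8))  # [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
```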

&lt;h3&gt;
  
  
  Modern Improvement: Rotary Position Embeddings (RoPE)
&lt;/h3&gt;

&lt;p&gt;Newer models like Llama and Mistral use an improved approach called &lt;strong&gt;RoPE (Rotary Position Embeddings)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple analogy:&lt;/strong&gt; Think of a clock face with moving hands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Word at position 1: Clock hand at 12 o'clock (0°)
Word at position 2: Clock hand at 1 o'clock (30°)
Word at position 3: Clock hand at 2 o'clock (60°)
Word at position 4: Clock hand at 3 o'clock (90°)
...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzcnn0ch8mbfhgiffga9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzcnn0ch8mbfhgiffga9.png" alt="RoPE Example in LLM" width="800" height="497"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How this connects to RoPE:&lt;/strong&gt;&lt;br&gt;
Just like the clock hands rotate to show different times, RoPE literally &lt;em&gt;rotates&lt;/em&gt; each word's embedding vector based on its position. Word 1 gets rotated 0°, word 2 gets rotated 30°, word 3 gets rotated 60°, and so on. This rotation encodes position information directly into the word vectors themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Words next to each other have clock hands that are close (12 o'clock vs 1 o'clock)&lt;/li&gt;
&lt;li&gt;Words far apart have very different clock positions (12 o'clock vs 6 o'clock)&lt;/li&gt;
&lt;li&gt;Just by looking at the clock hands, the model can tell:

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Where each word is&lt;/strong&gt;: "This word is at the 5 o'clock position"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How far apart words are&lt;/strong&gt;: "These two words are 3 hours apart"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
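&lt;p&gt;The clock analogy maps directly to code: rotate consecutive pairs of a word's dimensions by a position-dependent angle. This sketch uses one fixed 30° step to match the analogy; real RoPE uses a different rotation frequency for each pair of dimensions:&lt;/p&gt;

```python
import math

def rope_rotate(vec, position, step_degrees=30.0):
    # Rotate each (even, odd) pair of dimensions by position * step.
    angle = math.radians(position * step_degrees)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    out = []
    for x, y in zip(vec[0::2], vec[1::2]):
        out += [x * cos_a - y * sin_a, x * sin_a + y * cos_a]
    return out

v = [1.0, 0.0, 0.5, 0.5]
print(rope_rotate(v, 0))  # position 0 = rotated 0 degrees: unchanged
# The key property: the dot product between two rotated vectors depends
# only on how far apart their positions are, not on where they sit.
```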

&lt;p&gt;&lt;strong&gt;Why this matters in practice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better performance on long documents&lt;/li&gt;
&lt;li&gt;Enables "context extension" tricks (train on 4K words, use with 32K words)&lt;/li&gt;
&lt;li&gt;More natural understanding of word distances&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; Position encoding ensures the model knows "The cat sat" is different from "sat cat The". Without this, word order would be lost!&lt;/p&gt;


&lt;h2&gt;
  
  
  Step 3: Understanding Which Words Are Related (Attention)
&lt;/h2&gt;

&lt;p&gt;This is the magic that makes Transformers work! Let's understand it with a story.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Dinner Party Analogy
&lt;/h3&gt;

&lt;p&gt;Imagine you're at a dinner party with 10 people. Someone mentions "Paris" and you want to understand what they mean:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;You scan the room&lt;/strong&gt; (looking at all other conversations)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You notice&lt;/strong&gt; someone just said "France" and another said "Eiffel Tower"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You connect the dots&lt;/strong&gt; - "Ah! They're talking about Paris the city, not Paris Hilton"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You gather information&lt;/strong&gt; from those relevant conversations&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Attention does exactly this for words in a sentence!&lt;/p&gt;
&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;Let's process this sentence:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"The animal didn't cross the street because &lt;strong&gt;it&lt;/strong&gt; was too tired."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;When the model processes the word "it", it needs to figure out: What does "it" refer to?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: The word "it" asks questions&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"I'm a pronoun. Who do I refer to? I'm looking for nouns that came before me."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 2: All other words offer information&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"The" says: "I'm just an article, not important"&lt;/li&gt;
&lt;li&gt;"animal" says: "I'm a noun! I'm a subject! Pay attention to me!"&lt;/li&gt;
&lt;li&gt;"didn't" says: "I'm a verb helper, not what you're looking for"&lt;/li&gt;
&lt;li&gt;"street" says: "I'm a noun too, but I'm the location, not the subject"&lt;/li&gt;
&lt;li&gt;"tired" says: "I describe a state, might be relevant"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3: "it" calculates relevance scores&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"animal": 0.45 (45% relevant - very high!)&lt;/li&gt;
&lt;li&gt;"street": 0.08 (8% relevant - somewhat relevant)&lt;/li&gt;
&lt;li&gt;"tired": 0.15 (15% relevant - moderately relevant)&lt;/li&gt;
&lt;li&gt;All others: ~0.02 (2% each - barely relevant)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 4: "it" gathers information&lt;/strong&gt;&lt;br&gt;
The model now knows: "it" = mostly "animal" + a bit of "tired" + tiny bit of others&lt;/p&gt;
&lt;h3&gt;
  
  
  How Does This Work Mathematically?
&lt;/h3&gt;

&lt;p&gt;The model creates three versions of each word:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Query (Q)&lt;/strong&gt;: "What am I looking for?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For "it": Looking for nouns, subjects, things that can be tired&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Key (K)&lt;/strong&gt;: "What do I contain?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For "animal": I'm a noun, I'm the subject, I can get tired&lt;/li&gt;
&lt;li&gt;For "street": I'm a noun, but I'm an object/location&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Value (V)&lt;/strong&gt;: "What information do I carry?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For "animal": Carries the actual meaning/features of "animal"&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;The matching process:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified example (real numbers would be 768-dimensional)
&lt;/span&gt;
&lt;span class="c1"&gt;# Word "it" creates its Query:
&lt;/span&gt;&lt;span class="n"&gt;query_it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Looking for: subject, noun, living thing
&lt;/span&gt;
&lt;span class="c1"&gt;# Word "animal" has this Key:
&lt;/span&gt;&lt;span class="n"&gt;key_animal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Offers: subject, noun, living thing
&lt;/span&gt;
&lt;span class="c1"&gt;# How well do they match? Multiply and sum:
&lt;/span&gt;&lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.72&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.72&lt;/span&gt;  
          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.56&lt;/span&gt;  &lt;span class="c1"&gt;# High match!
&lt;/span&gt;
&lt;span class="c1"&gt;# Compare with "street":
&lt;/span&gt;&lt;span class="n"&gt;key_street&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Offers: not-subject, noun, non-living thing
&lt;/span&gt;&lt;span class="n"&gt;relevance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="err"&gt;×&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.08&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.12&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.18&lt;/span&gt;
          &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.38&lt;/span&gt;  &lt;span class="c1"&gt;# Lower match
&lt;/span&gt;
&lt;span class="c1"&gt;# Convert to percentages (this is what "softmax" does):
# "animal" gets 45%, "street" gets 8%, etc.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where Does The Formula Come From?
&lt;/h3&gt;

&lt;p&gt;You might see this formula in papers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Attention(Q, K, V) = softmax(Q × K^T / √d_k) × V
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What it means in plain English:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Q × K^T&lt;/strong&gt;: Match each word's Query against all other words' Keys (like our multiplication above)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;/ √d_k&lt;/strong&gt;: Scale down the numbers (prevents them from getting too big)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;softmax&lt;/strong&gt;: Convert to percentages that add up to 100%&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;× V&lt;/strong&gt;: Gather information from relevant words based on those percentages&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Where it comes from&lt;/strong&gt;: Researchers at Google introduced this formula in the 2017 paper "Attention Is All You Need". It's inspired by information retrieval: a Query is matched against Keys to retrieve Values, much like a search engine matches your query against documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You don't need to memorize this!&lt;/strong&gt; Just remember: attention = figuring out which words are related and gathering information from them.&lt;/p&gt;
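&lt;p&gt;For the curious, the four steps above fit in a few lines of NumPy. This is a toy sketch with tiny made-up sizes, not production code:&lt;/p&gt;

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q @ K.T / sqrt(d_k)) @ V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # 1-2. match Queries to Keys, scale down
    scores -= scores.max(axis=-1, keepdims=True)    # (numerical stability trick)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # 3. softmax: each row sums to 100%
    return weights @ V                              # 4. gather Values by relevance

# 3 words, 4 dimensions each (real models use hundreds of dimensions)
rng = np.random.default_rng(0)
Q, K, V = rng.random((3, 4)), rng.random((3, 4)), rng.random((3, 4))
out = attention(Q, K, V)  # each row is a context-enriched word vector
```

&lt;p&gt;Each output row is a weighted average of the Value rows, with the weights being exactly the percentage scores described above.&lt;/p&gt;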

&lt;h3&gt;
  
  
  Complete Example Walkthrough
&lt;/h3&gt;

&lt;p&gt;Let's see attention in action with actual numbers:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentence:&lt;/strong&gt; "The animal didn't cross the street because it was tired"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When processing "it", the attention mechanism calculates:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Word         Relevance Score    What This Means
─────────────────────────────────────────────────────────
"The"        →  2%              Article, not important
"animal"     → 45%              Main subject! Likely referent
"didn't"     →  3%              Verb helper, not the focus
"cross"      →  5%              Action, minor relevance
"the"        →  2%              Article again
"street"     →  8%              Object/location, somewhat relevant
"because"    →  2%              Connector word
"it"         → 10%              Self-reference (checking own meaning)
"was"        →  8%              Linking verb, somewhat relevant  
"tired"      → 15%              State description, quite relevant
                ─────
Total        = 100%              (Scores sum to 100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; The model now knows "it" primarily refers to "animal" (45%), with some connection to being "tired" (15%). This understanding gets encoded into the updated representation of "it".&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How does this actually update "it"?&lt;/strong&gt; The model takes a weighted average of all words' Value vectors using these percentages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Each word has a Value vector (what information it contains)
&lt;/span&gt;&lt;span class="n"&gt;value_animal&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Contains: mammal, four-legged, animate
&lt;/span&gt;&lt;span class="n"&gt;value_tired&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;   &lt;span class="c1"&gt;# Contains: state, adjective, fatigue
&lt;/span&gt;&lt;span class="n"&gt;value_street&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Contains: place, concrete, inanimate
# ... (other words)
&lt;/span&gt;
&lt;span class="c1"&gt;# Updated representation of "it" = weighted combination
&lt;/span&gt;&lt;span class="n"&gt;new_it&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;value_animal&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;value_tired&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="n"&gt;value_street&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.45&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.15&lt;/span&gt; &lt;span class="err"&gt;×&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;
       &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.52&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.19&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.61&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Now "it" carries meaning from "animal" + "tired"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The word "it" now has a richer representation that includes information from "animal" (heavily weighted) and "tired" (moderately weighted), helping the model understand the sentence better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why "Multi-Head" Attention?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Simple analogy:&lt;/strong&gt; When you read a sentence, you notice multiple things simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Grammar relationships (subject → verb)&lt;/li&gt;
&lt;li&gt;Meaning relationships (dog → animal)&lt;/li&gt;
&lt;li&gt;Reference relationships (it → what does "it" mean?)&lt;/li&gt;
&lt;li&gt;Position relationships (which words are nearby?)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-head attention lets the model do the same thing! Instead of one attention mechanism, models use 8 to 128 different attention "heads" running in parallel.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example with the sentence "The fluffy dog chased the cat":&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Head 1&lt;/strong&gt; might focus on: "dog" ↔ "chased" (subject-verb)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head 2&lt;/strong&gt; might focus on: "fluffy" ↔ "dog" (adjective-noun)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head 3&lt;/strong&gt; might focus on: "chased" ↔ "cat" (verb-object)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head 4&lt;/strong&gt; might focus on: nearby words (local context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Head 5&lt;/strong&gt; might focus on: animate things (dog, cat)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; These specializations aren't programmed! During training, different heads naturally learn to focus on different relationships. Researchers discovered this by analyzing trained models—it emerges automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How they combine:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Each head produces its own understanding:
&lt;/span&gt;&lt;span class="n"&gt;head_1_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;attention_head_1&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Finds subject-verb
&lt;/span&gt;&lt;span class="n"&gt;head_2_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;attention_head_2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Finds adjective-noun
&lt;/span&gt;&lt;span class="n"&gt;head_8_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;attention_head_8&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Finds other patterns
&lt;/span&gt;
&lt;span class="c1"&gt;# Combine all heads into a rich understanding:
&lt;/span&gt;&lt;span class="n"&gt;final_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;combine&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;head_1_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;head_2_output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...,&lt;/span&gt; &lt;span class="n"&gt;head_8_output&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Now each word has information from all types of relationships!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Having multiple attention heads is like having multiple experts analyze the same text from different angles. The final result is much richer than any single perspective.&lt;/p&gt;
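&lt;p&gt;A minimal sketch of the split-attend-concatenate pattern behind multi-head attention. The sizes are made up, and real models also apply learned projection matrices before and after each head, which are omitted here for clarity:&lt;/p&gt;

```python
import numpy as np

def softmax(scores):
    """Convert raw scores to percentages that sum to 1 along each row."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, num_heads=4):
    """Split each word vector into head-sized chunks, let every head run
    its own attention over its chunk, then concatenate the results."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        chunk = x[:, h * d_head:(h + 1) * d_head]       # this head's slice
        weights = softmax(chunk @ chunk.T / np.sqrt(d_head))
        head_outputs.append(weights @ chunk)            # independent attention per head
    return np.concatenate(head_outputs, axis=-1)        # combine all heads

x = np.random.default_rng(1).random((6, 16))  # 6 words, 16 dims -> 4 heads of 4
out = multi_head_attention(x)                 # same shape, enriched per head
```

&lt;p&gt;Because each head only sees its own slice, the heads are free to learn different relationships, which is exactly the specialization described above.&lt;/p&gt;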

&lt;h2&gt;
  
  
  Step 4: Processing the Information (Feed-Forward Network)
&lt;/h2&gt;

&lt;p&gt;After attention gathers information, each word needs to process what it learned. This is where the &lt;strong&gt;Feed-Forward Network (FFN)&lt;/strong&gt; comes in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple analogy:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention&lt;/strong&gt; = Gathering ingredients from your kitchen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FFN&lt;/strong&gt; = Actually cooking with those ingredients&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After "it" gathered information that it refers to "animal" and relates to "tired", the FFN processes this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified version
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_word&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_vector&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 1: Expand to more dimensions (gives more room to think)
&lt;/span&gt;    &lt;span class="n"&gt;bigger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word_vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# 768 numbers → 3072 numbers
&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 2: Apply complex transformations (the "thinking")
&lt;/span&gt;    &lt;span class="n"&gt;processed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;activate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bigger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# Non-linear processing
&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 3: Compress back to original size
&lt;/span&gt;    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;compress&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;processed&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;      &lt;span class="c1"&gt;# 3072 numbers → 768 numbers
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What's it doing?&lt;/strong&gt; Let's trace through a concrete example using our sentence:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: Processing "it" in "The animal didn't cross the street because it was tired"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After attention, "it" has gathered information showing it refers to "animal" (45%) and relates to "tired" (15%). Now the FFN enriches this understanding:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 - What comes in:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vector for "it" after attention: [0.52, 0.19, 0.61, ...]
This already knows: "it" refers to "animal" and connects to "tired"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 - FFN adds learned knowledge:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of the FFN as having millions of pattern detectors (neurons) that learned from billions of text examples. When "it" enters with its current meaning, specific patterns activate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input pattern: word "it" + animal reference + tired state

FFN recognizes patterns:
- Pattern A activates: "Pronoun referring to living creature" → Strengthens living thing understanding
- Pattern B activates: "Subject experiencing fatigue" → Adds physical/emotional state concept  
- Pattern C activates: "Reason for inaction" → Links tiredness to not crossing
- Pattern D stays quiet: "Object being acted upon" → Not relevant here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What the FFN is really doing: It's checking "it" against thousands of patterns it learned during training, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"When a pronoun refers to an animal + there's a state like 'tired', the pronoun is the one experiencing that state"&lt;/li&gt;
&lt;li&gt;"Tiredness causes inaction" (learned from millions of examples)&lt;/li&gt;
&lt;li&gt;"Animals get tired, streets don't" (learned semantic knowledge)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 3 - What comes out:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Enriched vector: [0.61, 0.23, 0.71, ...]
Now contains: pronoun role + animal reference + tired state + causal link (tired → didn't cross)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; The model now has a richer understanding: "it" isn't just referring to "animal"—it understands the animal is tired, and this tiredness is causally linked to why it didn't cross the street.&lt;/p&gt;

&lt;p&gt;Here's another example showing how FFN removes uncertainty of word meanings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example - "bank":&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input sentence: "I sat on the river bank"&lt;/li&gt;
&lt;li&gt;After attention: "bank" knows it's near "river" and "sat"&lt;/li&gt;
&lt;li&gt;FFN adds: bank → shoreline → natural feature → place to sit&lt;/li&gt;
&lt;li&gt;Output: Model understands it's a river bank (not a financial institution!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Think of FFN as the model's "knowledge base"&lt;/strong&gt; where millions of facts and patterns are stored in billions of network weights (the connections between neurons). Unlike attention (which gathers context from other words), FFN applies learned knowledge to that context.&lt;/p&gt;

&lt;p&gt;It's the difference between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention: "What words are nearby?" → Finds "river" and "sat"&lt;/li&gt;
&lt;li&gt;FFN: "What does 'bank' mean here?" → Applies knowledge: must be shoreline, not finance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention = figures out which words are related&lt;/li&gt;
&lt;li&gt;FFN = applies knowledge and reasoning to those relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Modern improvement:&lt;/strong&gt; Newer models use something called "SwiGLU" instead of older activation functions. It provides better performance, but the core idea remains: process the gathered information to extract deeper meaning.&lt;/p&gt;
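&lt;p&gt;For the curious, here is a minimal sketch of a SwiGLU-style feed-forward block. The weight shapes and random inputs are illustrative; in a real model these matrices are learned during training:&lt;/p&gt;

```python
import numpy as np

def swish(x):
    """Swish/SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward: a gated variant of the expand-process-compress
    pattern. The swish branch acts as a gate that controls how much of
    each expanded feature passes through."""
    gate = swish(x @ W_gate)       # expand + non-linear gate
    up = x @ W_up                  # expand (linear branch)
    return (gate * up) @ W_down    # element-wise gating, then compress back

d_model, d_hidden = 8, 32          # e.g. 768 -> 3072 in real models
rng = np.random.default_rng(0)
x = rng.normal(size=(3, d_model))  # 3 word vectors
out = swiglu_ffn(x,
                 rng.normal(size=(d_model, d_hidden)),
                 rng.normal(size=(d_model, d_hidden)),
                 rng.normal(size=(d_hidden, d_model)))
```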


&lt;h2&gt;
  
  
  Step 5: Two Important Tricks (Residual Connections &amp;amp; Normalization)
&lt;/h2&gt;

&lt;p&gt;These might sound technical, but they solve simple problems. Let me explain with everyday analogies.&lt;/p&gt;
&lt;h3&gt;
  
  
  Residual Connections: The "Don't Forget Where You Started" Trick
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Imagine you're editing a document. You make 96 rounds of edits. By round 96, you've completely forgotten what the original said! Sometimes the original information was important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; Keep a copy of the original and mix it back in after each edit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In the Transformer:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Start with a word's representation
&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;  &lt;span class="c1"&gt;# "cat" representation
&lt;/span&gt;
&lt;span class="c1"&gt;# After attention + processing, we get changes
&lt;/span&gt;&lt;span class="n"&gt;changes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;  &lt;span class="c1"&gt;# What we learned
&lt;/span&gt;
&lt;span class="c1"&gt;# Residual connection: Keep the original + add changes
&lt;/span&gt;&lt;span class="n"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;original&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;changes&lt;/span&gt;
      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="o"&gt;+&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;
      &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;1.1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...]&lt;/span&gt;  &lt;span class="c1"&gt;# Original info preserved!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Better analogy:&lt;/strong&gt; Think of editing a photo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Without residual&lt;/strong&gt;: Each filter completely replaces the image (after 50 filters, original is lost)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;With residual&lt;/strong&gt;: Each filter adds to the image (original always visible + 50 layers of enhancements)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Deep networks (96-120 layers) need this. Otherwise, information from early layers disappears by the time you reach the end.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer Normalization: The "Keep Numbers Reasonable" Trick
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Imagine you're calculating daily expenses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Day 1: ₹500&lt;/li&gt;
&lt;li&gt;Day 2: ₹450&lt;/li&gt;
&lt;li&gt;Day 3: ₹520&lt;/li&gt;
&lt;li&gt;Then suddenly Day 4: ₹50,00,00,000 (a bug in your calculator!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The huge number breaks everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution:&lt;/strong&gt; After each step, check if numbers are getting too big or too small, and adjust them to a reasonable range.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What normalization does:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before normalization:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Word vectors might be:
"the":  [0.1, 0.2, 0.3, ...]
"cat":  [5.2, 8.9, 12.3, ...]      ← Too big!
"sat":  [0.001, 0.002, 0.001, ...] ← Too small!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After normalization:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"the":  [0.1, 0.2, 0.3, ...]
"cat":  [0.4, 0.6, 0.8, ...]      ← Scaled down to reasonable range
"sat":  [0.2, 0.4, 0.1, ...]      ← Scaled up to reasonable range
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;How it works (simplified):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For each word's vector:
# 1. Calculate average and spread of numbers
&lt;/span&gt;&lt;span class="n"&gt;average&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;span class="n"&gt;spread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Adjust so average=0, spread=1
&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;average&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;spread&lt;/span&gt;

&lt;span class="c1"&gt;# Now all numbers are in a similar range!
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents numbers from exploding or vanishing&lt;/li&gt;
&lt;li&gt;Makes training faster and more stable&lt;/li&gt;
&lt;li&gt;Like cruise control for your model's internal numbers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway:&lt;/strong&gt; These two tricks (residual connections + normalization) are like safety features in a car—they keep everything running smoothly even when the model gets very deep (many layers).&lt;/p&gt;
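&lt;p&gt;Both tricks together fit in a few lines. A toy sketch, where the "sublayer" stands in for attention or the FFN (shown in the pre-norm ordering used by most modern models):&lt;/p&gt;

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Rescale each vector so its numbers have average 0 and spread 1."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)   # eps avoids division by zero

def transformer_sublayer(x, sublayer_fn):
    """The standard pattern: normalize, transform, then add the original
    back in (residual connection) so early information is never lost."""
    return x + sublayer_fn(layer_norm(x))

# Two "word" vectors: one with huge numbers, one with tiny numbers
x = np.array([[5.2, 8.9, 12.3],
              [0.001, 0.002, 0.001]])
out = transformer_sublayer(x, lambda v: 0.1 * v)  # toy stand-in sublayer
```

&lt;p&gt;Note how &lt;code&gt;layer_norm&lt;/code&gt; brings both the huge and the tiny vector into the same reasonable range before the sublayer runs, while the &lt;code&gt;x +&lt;/code&gt; keeps the original information intact.&lt;/p&gt;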




&lt;h2&gt;
  
  
  Three Types of Transformer Models
&lt;/h2&gt;

&lt;p&gt;Transformers come in three varieties, like three different tools in a toolbox. Each is designed for specific jobs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type 1: Encoder-Only (BERT-style) - The "Understanding" Expert
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Think of it like:&lt;/strong&gt; A reading comprehension expert who thoroughly understands text but can't write new text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Sees the entire text at once, looks at relationships in all directions (words can look both forward and backward).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Show it: "The [MASK] sat on the mat"
It learns: "The cat sat on the mat"

By filling in blanks, it learns deep understanding!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real-world uses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Email spam detection&lt;/strong&gt;: "Is this email spam or legitimate?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needs: Deep understanding of the entire email&lt;/li&gt;
&lt;li&gt;Example: Gmail's spam filter&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Search engines&lt;/strong&gt;: "Find documents similar to this query"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needs: Understanding what documents mean&lt;/li&gt;
&lt;li&gt;Example: Google Search understanding your query&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Sentiment analysis&lt;/strong&gt;: "Is this review positive or negative?"&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Needs: Understanding the overall tone&lt;/li&gt;
&lt;li&gt;Example: Analyzing customer feedback&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Popular models:&lt;/strong&gt; BERT, RoBERTa (used by many search engines)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key limitation:&lt;/strong&gt; Can understand and classify text, but &lt;strong&gt;cannot generate&lt;/strong&gt; new text. It's like a reading expert who can't write.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type 2: Decoder-Only (GPT-style) - The "Writing" Expert
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Think of it like:&lt;/strong&gt; A creative writer who generates text one word at a time, always building on what came before.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Processes text from left to right. Each word can only "see" previous words, not future ones (because future words don't exist yet during generation!).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Show it: "The cat sat on the"
It learns: Next word should be "mat" (or "floor", "chair", etc.)

By predicting next words billions of times, it learns to write!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why only look backward?&lt;/strong&gt; Because when generating text, future words don't exist yet—you can only use what you've written so far. It's like writing a story one word at a time: after "The cat sat on the", you can only look back at those 5 words to decide what comes next.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;When predicting "sat":
  Can see: "The", "cat"  ← Use these to predict
  Cannot see: "on", "the", "mat"  ← Don't exist yet during generation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
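&lt;p&gt;This "only look backward" rule is implemented with a causal mask. A toy sketch (the token list and helper names are made up for illustration):&lt;/p&gt;

```python
# Toy causal mask: position i may only attend to positions 0..i.

tokens = ["The", "cat", "sat", "on", "the", "mat"]

def visible_words(position):
    """Words available when predicting the word AFTER this position."""
    return tokens[: position + 1]

def causal_mask(n):
    """Row i marks which positions token i may attend to (1 = visible)."""
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):  # only positions up to and including i
            mask[i][j] = 1
    return mask

print(visible_words(1))  # ['The', 'cat']
for row in causal_mask(4):
    print(row)
```

Each row of the mask has one more 1 than the row above it: the further along the sentence you are, the more context you can use.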



&lt;p&gt;&lt;strong&gt;Real-world uses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;ChatGPT / Claude&lt;/strong&gt;: Conversational AI assistants&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: Generate helpful responses to questions&lt;/li&gt;
&lt;li&gt;Example: "Explain quantum physics simply" → generates explanation&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Code completion&lt;/strong&gt;: GitHub Copilot&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: Complete your code as you type&lt;/li&gt;
&lt;li&gt;Example: You type &lt;code&gt;def calculate_&lt;/code&gt; → it suggests the rest&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Content creation&lt;/strong&gt;: Blog posts, emails, stories&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: Generate coherent, creative text&lt;/li&gt;
&lt;li&gt;Example: "Write a product description for..." → generates description&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Popular models:&lt;/strong&gt; GPT-4, Claude, Llama, Mistral (basically all modern chatbots)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this is dominant:&lt;/strong&gt; These models can both understand AND generate, making them incredibly versatile. This is what you use when you chat with AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Type 3: Encoder-Decoder (T5-style) - The "Translator" Expert
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Think of it like:&lt;/strong&gt; A two-person team: one person reads and understands (encoder), another person writes the output (decoder).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Encoder&lt;/strong&gt; (the reader): Thoroughly understands the input, looking in all directions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoder&lt;/strong&gt; (the writer): Generates output one word at a time, consulting the encoder's understanding&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Training example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input (to encoder):  "translate English to French: Hello world"
Output (from decoder): "Bonjour le monde"

Encoder understands English, Decoder writes French!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Real-world uses:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Translation&lt;/strong&gt;: Google Translate&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: Convert text from one language to another&lt;/li&gt;
&lt;li&gt;Example: English → Spanish, preserving meaning&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Summarization&lt;/strong&gt;: News article summaries&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: Read long document (encoder), write short summary (decoder)&lt;/li&gt;
&lt;li&gt;Example: 10-page report → 3-sentence summary&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Question answering&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task: Read document (encoder), generate answer (decoder)&lt;/li&gt;
&lt;li&gt;Example: "Based on this article, what caused...?" → generates answer&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Popular models:&lt;/strong&gt; T5, BART (less common nowadays)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why less popular now:&lt;/strong&gt; Decoder-only models (like GPT) turned out to be more versatile—they can do translation AND chatting AND coding, all in one architecture. Encoder-decoder models are more specialized.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Decision Guide: Which Type Should You Use?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Need to understand/classify text?&lt;/strong&gt; → Encoder (BERT)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spam detection&lt;/li&gt;
&lt;li&gt;Sentiment analysis
&lt;/li&gt;
&lt;li&gt;Search/similarity&lt;/li&gt;
&lt;li&gt;Document classification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Need to generate text?&lt;/strong&gt; → Decoder (GPT)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Chatbots (ChatGPT, Claude)&lt;/li&gt;
&lt;li&gt;Code completion&lt;/li&gt;
&lt;li&gt;Creative writing&lt;/li&gt;
&lt;li&gt;Question answering&lt;/li&gt;
&lt;li&gt;Content generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Need translation/summarization only?&lt;/strong&gt; → Encoder-Decoder (T5)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Language translation&lt;/li&gt;
&lt;li&gt;Document summarization&lt;/li&gt;
&lt;li&gt;Specific input→output transformations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Not sure?&lt;/strong&gt; → Use Decoder-only (GPT-style)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Most versatile&lt;/li&gt;
&lt;li&gt;Can handle both understanding and generation&lt;/li&gt;
&lt;li&gt;This is what most modern AI tools use&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Bottom line:&lt;/strong&gt; If you're building something today, you'll most likely use a decoder-only model (like GPT, Claude, Llama) because they're the most flexible and powerful.&lt;/p&gt;
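&lt;p&gt;The decision guide above can be condensed into a small helper. The task names and mapping below are illustrative, not an official taxonomy:&lt;/p&gt;

```python
# Toy task-to-architecture mapping (illustrative categories only).

ARCHITECTURE_FOR_TASK = {
    "spam_detection": "encoder-only (BERT-style)",
    "sentiment_analysis": "encoder-only (BERT-style)",
    "search_similarity": "encoder-only (BERT-style)",
    "chatbot": "decoder-only (GPT-style)",
    "code_completion": "decoder-only (GPT-style)",
    "creative_writing": "decoder-only (GPT-style)",
    "translation": "encoder-decoder (T5-style)",
    "summarization": "encoder-decoder (T5-style)",
}

def pick_architecture(task):
    # When unsure, decoder-only is the most versatile default.
    return ARCHITECTURE_FOR_TASK.get(task, "decoder-only (GPT-style)")

print(pick_architecture("translation"))    # encoder-decoder (T5-style)
print(pick_architecture("something_new"))  # decoder-only (GPT-style)
```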




&lt;h2&gt;
  
  
  Scaling the Architecture
&lt;/h2&gt;

&lt;p&gt;Now that you understand the components, let us see how they scale:&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Bigger?
&lt;/h3&gt;

&lt;p&gt;As models grow from small to large, here's what changes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Small (125M params)&lt;/th&gt;
&lt;th&gt;Medium (7B params)&lt;/th&gt;
&lt;th&gt;Large (70B params)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Layers&lt;/strong&gt; (depth)&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Hidden size&lt;/strong&gt; (vector width)&lt;/td&gt;
&lt;td&gt;768&lt;/td&gt;
&lt;td&gt;4,096&lt;/td&gt;
&lt;td&gt;8,192&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Attention heads&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insights:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Layers (depth)&lt;/strong&gt; - This is how many times you repeat Steps 3 &amp;amp; 4&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each layer = one pass of Attention (Step 3) + FFN (Step 4)&lt;/li&gt;
&lt;li&gt;Small model with 12 layers = processes the sentence 12 times&lt;/li&gt;
&lt;li&gt;Large model with 80 layers = processes the sentence 80 times&lt;/li&gt;
&lt;li&gt;Think of it like editing a document: more passes = more refinement and deeper understanding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example: Processing "it" in our sentence:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Layer 1: Figures out "it" refers to "animal"&lt;/li&gt;
&lt;li&gt;Layer 5: Understands the tiredness connection&lt;/li&gt;
&lt;li&gt;Layer 15: Grasps the causal relationship (tired → didn't cross)&lt;/li&gt;
&lt;li&gt;Layer 30: Picks up subtle implications (the animal wanted to cross but couldn't)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Hidden size (vector width)&lt;/strong&gt; - How many numbers represent each word&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bigger vectors = more "memory slots" to store information&lt;/li&gt;
&lt;li&gt;768 dimensions vs 8,192 dimensions = like having 768 notes vs 8,192 notes about each word&lt;/li&gt;
&lt;li&gt;Larger hidden size lets the model capture more nuanced meanings and relationships&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Attention heads&lt;/strong&gt; - How many different perspectives each layer examines&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;12 heads = looking at the sentence in 12 different ways simultaneously&lt;/li&gt;
&lt;li&gt;64 heads = 64 different ways (grammar, meaning, references, dependencies, etc.)&lt;/li&gt;
&lt;li&gt;More heads = catching more types of word relationships in parallel&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where do the parameters live?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Surprising fact: The Feed-Forward Network (FFN) actually takes up most of the model's parameters, not the attention mechanism! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; In each layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Attention parameters: relatively small (mostly for Q, K, V transformations)&lt;/li&gt;
&lt;li&gt;FFN parameters: huge (expands 4,096 dimensions to 16,384 then back, with millions of learned patterns)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a standard transformer layer, FFN parameters outnumber attention parameters by roughly 2x (and by more in variants with wider FFNs). That's where most of the "knowledge" is stored!&lt;/p&gt;
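&lt;p&gt;You can check this ratio with simple arithmetic. The sketch below uses the textbook per-layer parameter formulas (biases ignored) and assumes the classic FFN expansion factor of 4:&lt;/p&gt;

```python
# Rough per-layer parameter counts for a standard transformer (biases ignored).

def layer_params(hidden_size, ffn_multiplier=4):
    # Attention: four hidden_size x hidden_size matrices (Q, K, V, output)
    attention = 4 * hidden_size * hidden_size
    # FFN: expand to ffn_multiplier * hidden_size, then project back down
    ffn = 2 * hidden_size * (ffn_multiplier * hidden_size)
    return attention, ffn

attn, ffn = layer_params(4096)
print(f"Attention: {attn:,}")  # 67,108,864
print(f"FFN:       {ffn:,}")   # 134,217,728
print(f"FFN / attention ratio: {ffn / attn}")  # 2.0
```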

&lt;h3&gt;
  
  
  Why Self-Attention is Expensive: The O(N²) Problem
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Simple explanation:&lt;/strong&gt; Every word needs to look at every other word. If you have N words, that's N × N comparisons.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Concrete example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;3 words: "The cat sat"
- "The" looks at: The, cat, sat (3 comparisons)
- "cat" looks at: The, cat, sat (3 comparisons)
- "sat" looks at: The, cat, sat (3 comparisons)
Total: 3 × 3 = 9 comparisons

6 words: "The cat sat on the mat"
- Each of 6 words looks at all 6 words
Total: 6 × 6 = 36 comparisons (4x more for 2x words!)

12 words:
Total: 12 × 12 = 144 comparisons (16x more for 4x words!)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The scaling problem:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Sentence Length&lt;/th&gt;
&lt;th&gt;Attention Calculations&lt;/th&gt;
&lt;th&gt;Growth Factor&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;512 tokens&lt;/td&gt;
&lt;td&gt;262,144&lt;/td&gt;
&lt;td&gt;1x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2,048 tokens&lt;/td&gt;
&lt;td&gt;4,194,304&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;16x&lt;/strong&gt; more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8,192 tokens&lt;/td&gt;
&lt;td&gt;67,108,864&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;256x&lt;/strong&gt; more&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt; Doubling the length doesn't double the work—it &lt;strong&gt;quadruples&lt;/strong&gt; it! This is why:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long documents are expensive to process&lt;/li&gt;
&lt;li&gt;Context windows have hard limits (memory/compute)&lt;/li&gt;
&lt;li&gt;New techniques are needed for longer contexts&lt;/li&gt;
&lt;/ul&gt;
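&lt;p&gt;The numbers in the table above come straight from N × N. A few lines reproduce them:&lt;/p&gt;

```python
# Quadratic growth of attention comparisons with sequence length.

def attention_comparisons(n):
    return n * n  # every token attends to every token

baseline = attention_comparisons(512)
for n in (512, 2048, 8192):
    ops = attention_comparisons(n)
    print(f"{n:5} tokens: {ops:11,} comparisons ({ops // baseline}x)")
```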

&lt;p&gt;&lt;strong&gt;Solutions being developed:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Flash Attention&lt;/strong&gt;: Clever memory tricks to compute attention faster&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sliding window attention&lt;/strong&gt;: Each word only looks at nearby words (not all words)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sparse attention&lt;/strong&gt;: Skip some comparisons that matter less&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These tricks help models handle longer texts without the full quadratic cost!&lt;/p&gt;
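&lt;p&gt;To see why sliding-window attention helps, compare comparison counts. With a fixed window, the cost grows linearly in the sequence length instead of quadratically (a toy count, ignoring real implementation details):&lt;/p&gt;

```python
# Full attention vs sliding-window attention: number of token-pair comparisons.

def full_attention_ops(n):
    return n * n

def sliding_window_ops(n, window):
    # Each token looks at itself plus up to (window - 1) previous tokens.
    return sum(min(i + 1, window) for i in range(n))

n, window = 8192, 512
print(full_attention_ops(n))          # 67,108,864
print(sliding_window_ops(n, window))  # 4,063,488 -- roughly 16x cheaper
```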




&lt;h2&gt;
  
  
  Understanding the Complete Architecture Diagram
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; This diagram represents the universal Transformer architecture. All Transformer models (BERT, GPT, T5) follow this basic structure, with variations in how they use certain components.&lt;/p&gt;

&lt;p&gt;Let's walk through the complete flow step by step:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcgohq9qw9kiyb04tyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkgcgohq9qw9kiyb04tyu.png" alt="Understanding the Complete Architecture Diagram of Transformer in LLM like ChatGPT, Claude" width="800" height="1137"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Detailed Walkthrough with Example
&lt;/h3&gt;

&lt;p&gt;Let's trace "The cat sat" through this architecture:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Input Tokens&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Your text: "The cat sat"
Tokens: ["The", "cat", "sat"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Embeddings + Position&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The" → [0.1, 0.3, ...] + position_1_tag → [0.1, 0.8, ...]
"cat" → [0.2, -0.5, ...] + position_2_tag → [0.4, -0.2, ...]
"sat" → [0.4, 0.2, ...] + position_3_tag → [0.8, 0.5, ...]

Now each word is a 768-number vector with position info!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
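&lt;p&gt;In code, Step 2 is just a table lookup plus an element-wise addition. The vectors below are tiny made-up examples (real models use 768 or more dimensions):&lt;/p&gt;

```python
# Toy embedding lookup + positional encoding (made-up 3-dimensional vectors).

embedding_table = {
    "The": [0.1, 0.3, -0.2],
    "cat": [0.2, -0.5, 0.7],
    "sat": [0.4, 0.2, 0.1],
}

position_encodings = [
    [0.0, 0.5, 0.1],   # position 0
    [0.2, 0.3, -0.1],  # position 1
    [0.4, 0.3, 0.2],   # position 2
]

def embed(tokens):
    """Look up each token's vector and add its position's encoding."""
    return [
        [e + p for e, p in zip(embedding_table[tok], position_encodings[i])]
        for i, tok in enumerate(tokens)
    ]

vectors = embed(["The", "cat", "sat"])
```

The same word at a different position gets a different final vector, which is how word order survives parallel processing.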



&lt;p&gt;&lt;strong&gt;Step 3: Through N Transformer Layers&lt;/strong&gt; (repeated 12-120 times)&lt;/p&gt;

&lt;p&gt;Each layer does this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4a: Multi-Head Attention&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Each word looks at all other words
- "cat" realizes it's the subject
- "sat" realizes it's the action "cat" does
- Words gather information from related words
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4b: Add &amp;amp; Normalize&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Add original vector back (residual connection)
- Normalize numbers to reasonable range
- Keeps information stable
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4c: Feed-Forward Network&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Process the gathered information
- Apply learned knowledge
- Each word's vector gets richer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4d: Add &amp;amp; Normalize (again)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- Add vector from before FFN (another residual)
- Normalize again
- Ready for next layer!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After going through all N layers, each word's representation is incredibly rich with understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Linear + Softmax&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take the final word's vector: [0.8, 0.3, 0.9, ...]

Convert to predictions for EVERY word in vocabulary (50,000 words):
"the"    → 5%
"a"      → 3%
"on"     → 15%  ← High probability!
"mat"    → 12%
"floor"  → 8%
...
(All probabilities sum to 100%)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
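&lt;p&gt;The softmax in Step 5 is what turns raw scores into a probability distribution. A toy version over a five-word vocabulary (the logit values are made up for illustration):&lt;/p&gt;

```python
import math

# Toy softmax over a tiny vocabulary (real vocabularies have ~50,000 entries).

vocab = ["the", "a", "on", "mat", "floor"]
logits = [1.0, 0.5, 2.5, 2.0, 1.5]  # raw scores from the final linear layer

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
for word, p in zip(vocab, probs):
    print(f"{word}: {p:.1%}")

# The probabilities always sum to 1, and "on" (highest logit) wins.
```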



&lt;p&gt;&lt;strong&gt;Step 6: Output&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pick the most likely word: "on"

Complete sentence so far: "The cat sat on"

Then repeat the whole process to predict the next word!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How the Three Model Types Use This Architecture
&lt;/h3&gt;

&lt;p&gt;Now that you've seen the complete flow, here's how each model type uses it differently:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Encoder-Only (BERT):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses: Steps 1-4 (everything except the final output prediction)&lt;/li&gt;
&lt;li&gt;Attention: &lt;strong&gt;Bidirectional&lt;/strong&gt; - each word sees ALL other words (past AND future)&lt;/li&gt;
&lt;li&gt;Training: Fill-in-the-blank ("The [MASK] sat" → predict "cat")&lt;/li&gt;
&lt;li&gt;Purpose: Rich understanding for classification, search, sentiment analysis&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Decoder-Only (GPT, Claude, Llama):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses: All steps 1-6 (the complete flow we just walked through)&lt;/li&gt;
&lt;li&gt;Attention: &lt;strong&gt;Causal/Unidirectional&lt;/strong&gt; - each word only sees PAST words&lt;/li&gt;
&lt;li&gt;Training: Next-word prediction ("The cat sat" → predict "on")&lt;/li&gt;
&lt;li&gt;Purpose: Text generation, chatbots, code completion&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Encoder-Decoder (T5):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uses: TWO stacks - one encoder (steps 1-4), one decoder (full steps 1-6)&lt;/li&gt;
&lt;li&gt;Encoder: Bidirectional attention to understand input&lt;/li&gt;
&lt;li&gt;Decoder: Causal attention to generate output, also attends to encoder&lt;/li&gt;
&lt;li&gt;Training: Input→output mapping ("translate: Hello" → "Bonjour")&lt;/li&gt;
&lt;li&gt;Purpose: Translation, summarization, transformation tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9s3kkl0rv41mu5ul7je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb9s3kkl0rv41mu5ul7je.png" alt="Three Tranformer Architecture types" width="800" height="663"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key difference:&lt;/strong&gt; Same architecture blocks, different attention patterns and how they're connected!&lt;/p&gt;

&lt;h3&gt;
  
  
  Additional Key Insights
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;It's a loop&lt;/strong&gt;: For generation, this process repeats. After predicting "on", the model adds it to the input and predicts again.&lt;/p&gt;
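&lt;p&gt;That loop is the heart of text generation. A sketch with a hypothetical lookup table standing in for the full forward pass:&lt;/p&gt;

```python
# Greedy generation loop. The "model" here is a fake lookup table standing in
# for a real transformer forward pass.

FAKE_MODEL = {
    ("The", "cat", "sat"): "on",
    ("The", "cat", "sat", "on"): "the",
    ("The", "cat", "sat", "on", "the"): "mat",
}

def predict_next(tokens):
    return FAKE_MODEL.get(tuple(tokens), "END")

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)  # full forward pass in a real model
        if next_token == "END":
            break
        tokens.append(next_token)          # feed the prediction back in
    return tokens

print(generate(["The", "cat", "sat"]))  # ['The', 'cat', 'sat', 'on', 'the', 'mat']
```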

&lt;p&gt;&lt;strong&gt;The "N" matters&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small models: N = 12 layers&lt;/li&gt;
&lt;li&gt;GPT-3: N = 96 layers&lt;/li&gt;
&lt;li&gt;GPT-4: layer count undisclosed, but widely believed to be even deeper&lt;/li&gt;
&lt;li&gt;More layers = deeper understanding but slower/more expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is universal&lt;/strong&gt;: Whether you're reading a research paper about a new model or trying to understand GPT-4, this diagram applies. The core architecture is the same!&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Implications
&lt;/h2&gt;

&lt;p&gt;Understanding the architecture helps you make better decisions:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Context Window Limitations
&lt;/h3&gt;

&lt;p&gt;The context window is not just a number—it is a hard architectural limit. A model trained on 4K context cannot magically understand 100K tokens without modifications (RoPE interpolation, fine-tuning, etc.).&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Why Position Matters
&lt;/h3&gt;

&lt;p&gt;Tokens at the beginning and end of context often get more attention (primacy and recency effects). If you have critical information, consider its placement in your prompt.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Layer-wise Understanding
&lt;/h3&gt;

&lt;p&gt;Early layers capture syntax and basic patterns. Later layers capture semantics and complex reasoning. This is why techniques like layer freezing during fine-tuning work—early layers transfer well across tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Attention is Expensive
&lt;/h3&gt;

&lt;p&gt;Every extra token in your prompt increases compute quadratically. Be concise when you can.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Transformers process all tokens in parallel, using positional encoding to preserve order&lt;/li&gt;
&lt;li&gt;Self-attention lets each token gather information from all other tokens&lt;/li&gt;
&lt;li&gt;Multi-head attention captures different types of relationships simultaneously&lt;/li&gt;
&lt;li&gt;Residual connections and layer normalization enable training very deep networks&lt;/li&gt;
&lt;li&gt;Encoder-only models (BERT) excel at understanding; decoder-only (GPT) at generation&lt;/li&gt;
&lt;li&gt;Modern LLMs are decoder-only with causal masking&lt;/li&gt;
&lt;li&gt;Context window limitations come from O(n²) attention complexity&lt;/li&gt;
&lt;li&gt;Understanding architecture helps you write better prompts and choose appropriate models&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>architecture</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>llm</category>
    </item>
    <item>
      <title>Conventional commit specification</title>
      <dc:creator>Pranay Bathini </dc:creator>
      <pubDate>Fri, 08 Mar 2024 19:17:39 +0000</pubDate>
      <link>https://dev.to/pranaybathini/conventional-commit-specification-77p</link>
      <guid>https://dev.to/pranaybathini/conventional-commit-specification-77p</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3jkir2h0y69lhpd5v20.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3jkir2h0y69lhpd5v20.png" alt="https://www.pranaybathini.com/2024/03/conventional-commit-specification.html" width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Conventional Commits specification is a lightweight convention that provides a standardized format for writing commit messages. It defines a simple set of rules to structure commit messages in a consistent manner, making it easier to understand the purpose and context of each commit. It builds on the basic anatomy of a Git commit message.&lt;/p&gt;

&lt;p&gt;By answering the “why”, “what” and “how” behind each change, commit messages written to this specification make it possible to understand past decisions and changes in the codebase even months or years later. They serve as an informal wiki, offering valuable insight into the reasons behind code changes.&lt;/p&gt;

&lt;p&gt;According to this specification, a commit message should be structured as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;commit type&amp;gt;[optional scope]: &amp;lt;description&amp;gt;
[optional body]
[optional footer(s)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
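&lt;p&gt;A quick way to check a message header against this structure is a regular expression. The type list below mirrors the categories in this article; this is a sketch, not a complete implementation of the spec:&lt;/p&gt;

```python
import re

# Sketch of a Conventional Commits header check (types from this article).
TYPES = "build|chore|ci|docs|feat|fix|perf|refactor|revert|style|test"
HEADER_RE = re.compile(rf"^(?:{TYPES})(?:\([\w.-]+\))?!?: .+")

def is_conventional(header):
    return bool(HEADER_RE.match(header))

print(is_conventional("feat(api): add user profile fields"))  # True
print(is_conventional("fix: resolve validation issue"))       # True
print(is_conventional("updated stuff"))                       # False
```

A hook like this (for example, in a commit-msg Git hook) can reject non-conforming messages before they land in history.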



&lt;p&gt;The commonly used commit types are categorized below.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Build
&lt;/h2&gt;

&lt;p&gt;This type is used for changes that affect the build system or external dependencies. For example, if you upgrade a library or modify build configurations, it falls under this category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;build: update webpack configuration
build: update Hive Maven dependency to version 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2. Chore
&lt;/h2&gt;

&lt;p&gt;Chore-type commits are for regular maintenance tasks, such as updating dependencies, package manager configurations, or other tasks that don’t modify the source code or affect the behavior of the application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chore: clean up unused dependencies from package.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. CI
&lt;/h2&gt;

&lt;p&gt;Commits related to continuous integration configurations or scripts. This includes changes to tools and processes used for automation and testing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ci: add vault stage to gitlab ci pipeline to fetch api secrets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  4. Docs
&lt;/h2&gt;

&lt;p&gt;Commits that only involve documentation changes, such as updating README files, adding comments, or writing documentation for functions or classes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docs: add auth service instructions to README
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5. Feat
&lt;/h2&gt;

&lt;p&gt;This type is for new features added to the project. If you introduce new functionality or capabilities, it’s considered a feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat: introduce real-time notifications for new messages
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  6. Fix
&lt;/h2&gt;

&lt;p&gt;Fix-type commits are for patches that resolve bugs or issues in the codebase. If the commit addresses a problem or bug, it falls under this category.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fix: resolve issue with form submission not triggering validation
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  7. Performance
&lt;/h2&gt;

&lt;p&gt;Commits that improve the performance of the codebase without changing its external behavior. Optimization-related changes would be categorized as performance improvements.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;perf: optimize guest misconduct database search by reservation for faster retrieval
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  8. Refactor
&lt;/h2&gt;

&lt;p&gt;Refactoring commits involve modifications to the code that neither fixes a bug nor adds a feature. Refactoring aims to improve the code’s structure, readability, or maintainability without changing its external behavior.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;refactor: simplify validation logic in user registration
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  9. Revert
&lt;/h2&gt;

&lt;p&gt;Revert-type commits are used when reverting previous changes. It’s essential to mention the commit ID or title of the changes being reverted in the commit message.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;revert: revert changes made in commit abc123
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  10. Style
&lt;/h2&gt;

&lt;p&gt;Commits that are related to code style and formatting. This could include indentation changes, code reformatting, or renaming variables to follow coding conventions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;style: format code according to linting rules
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  11. Test
&lt;/h2&gt;

&lt;p&gt;Commits that add or modify tests. This includes unit tests, integration tests, or any other kind of automated testing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;test: add unit tests for authentication service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  12. BREAKING CHANGE
&lt;/h2&gt;

&lt;p&gt;This is not a commit type but a footer section defined by the Conventional Commits specification. It flags changes that are not backward compatible with the existing codebase or API. In other words, it marks modifications that could break existing functionality or require adjustments in dependent code or configurations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;feat: update API response format to include additional fields

This commit extends the API response format to include new fields for user profile information. 
The added fields include 'birthdate' and 'country', providing more comprehensive user data. This 
change enhances the functionality of the API and enables richer user experiences.

BREAKING CHANGE:
- The structure of the API response has been modified to include additional fields.
- Clients relying on the previous response format will need to be updated to accommodate the changes.
Another syntax for breaking change — https://www.conventionalcommits.org/en/v1.0.0/#commit-message-with--to-draw-attention-to-breaking-change
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Usage in open source
&lt;/h2&gt;

&lt;p&gt;One of the many open source projects using the Conventional Commits specification is Angular.&lt;/p&gt;

&lt;p&gt;Link — &lt;a href="https://github.com/angular/angular/blob/main/CONTRIBUTING.md"&gt;angular/CONTRIBUTING.md&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;type&amp;gt;(&amp;lt;scope&amp;gt;): &amp;lt;short summary&amp;gt;
│ │ │
│ │ └─⫸ Summary in present tense. Not capitalized. No period at the end.
│ │
│ └─⫸ Commit Scope: animations|bazel|benchpress|common|compiler|compiler-cli|core|
│ elements|forms|http|language-service|localize|platform-browser|
│ platform-browser-dynamic|platform-server|router|service-worker|
│ upgrade|zone.js|packaging|changelog|docs-infra|migrations|
│ devtools
│
└─⫸ Commit Type: build|ci|docs|feat|fix|perf|refactor|test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What are the advantages?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Consistency&lt;/strong&gt;: Helps maintain consistency and standardization across commits within the team. When everyone follows the same format, it becomes easier to understand the history of changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Meaning&lt;/strong&gt;: By categorizing commits into types like feat, fix, docs, chore, etc., conventional commit messages provide semantic meaning to changes. This makes it easier to understand the nature of each commit at a glance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Clarity&lt;/strong&gt;: Encourages writing clear and descriptive commit messages, including information about the changes made, why they were made, and any associated issue or task identifiers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Communication&lt;/strong&gt;: Facilitates communication among team members by providing a structured way to convey information about code changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation&lt;/strong&gt;: Enables the use of automated tools that can parse commit messages to generate release notes, changelogs, or perform other tasks based on commit history.&lt;/li&gt;
&lt;/ol&gt;
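&lt;p&gt;To make the automation point concrete, here is a minimal sketch (my own illustration, not from any specific tool) of how commit subjects in the "type(scope): description" format could be grouped into changelog sections. The type names follow the spec; the grouping labels are my own:&lt;/p&gt;

```python
import re

# Conventional commit header: type, optional (scope), optional ! for breaking, then ": description"
HEADER = re.compile(r"^([a-z]+)(\(([^)]+)\))?(!)?: (.+)$")

def group_commits(subjects):
    """Group commit subjects by type; '!' after the type marks a breaking change."""
    sections = {}
    for subject in subjects:
        match = HEADER.match(subject)
        if not match:
            continue  # skip commits that don't follow the convention
        commit_type, bang, description = match.group(1), match.group(4), match.group(5)
        key = "BREAKING" if bang else commit_type
        sections.setdefault(key, []).append(description)
    return sections

log = [
    "feat(auth): add OAuth2 login",
    "fix: handle empty cart on checkout",
    "feat!: drop support for v1 API",
    "update stuff",  # ignored: not a conventional commit
]
sections = group_commits(log)
```

&lt;p&gt;From here, a release script could render each section ("feat", "fix", "BREAKING", ...) as a heading in the generated changelog.&lt;/p&gt;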

&lt;h2&gt;
  
  
  What are the disadvantages of not using it?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistency&lt;/strong&gt;: Without guidelines, commit messages may vary widely in format and quality, making it harder to understand the history of changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ambiguity&lt;/strong&gt;: Lack of structure can lead to vague or incomplete commit messages, which may make it difficult for team members to understand the context or purpose of certain changes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Difficulty in Tracking&lt;/strong&gt;: Tracking changes becomes more challenging, especially when trying to understand why certain changes were made or when investigating issues.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Miscommunication&lt;/strong&gt;: Poorly written commit messages can result in miscommunication or misunderstandings among team members.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Manual Processes&lt;/strong&gt;: Tasks such as generating release notes or tracking changes may require more manual effort without structured commit messages.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  FAQ about Conventional commit
&lt;/h2&gt;

&lt;p&gt;I recommend going over this list - &lt;a href="https://www.conventionalcommits.org/en/v1.0.0/#faq"&gt;Conventional Commits FAQ&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.conventionalcommits.org/en/v1.0.0/#specification"&gt;Conventional Commits Specification&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://commitizen-tools.github.io/commitizen/"&gt;Commitzen tool for the teams practices&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This blog was originally published on &lt;a href="https://www.pranaybathini.com/2024/03/conventional-commit-specification.html"&gt;https://www.pranaybathini.com/2024/03/conventional-commit-specification.html&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Creating Loot Royale NFT Collection DApp</title>
      <dc:creator>Pranay Bathini </dc:creator>
      <pubDate>Sat, 19 Feb 2022 18:48:56 +0000</pubDate>
      <link>https://dev.to/pranaybathini/creating-an-loot-royale-nft-collection-dapp-2ie</link>
      <guid>https://dev.to/pranaybathini/creating-an-loot-royale-nft-collection-dapp-2ie</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In this series of blogs, I am going to show how to create on-chain and off-chain NFT collection smart contracts and build a DApp to mint NFTs.&lt;/p&gt;

&lt;p&gt;I will also walk through the DApp I designed for fun last year, partly to learn about NFTs and DApps, so you can create your own.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idea
&lt;/h2&gt;

&lt;p&gt;The idea for this NFT DApp is inspired by the Loot project, but I also wanted to add my own creative flavor and understand how smart contracts work in the process.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning
&lt;/h2&gt;

&lt;p&gt;Before this, I tried different things, like building my &lt;a href="https://www.thecryptoinsight.com/2021/07/how-to-create-your-own-cryptocurrency-on-ethereum.html"&gt;own token on the Ethereum test network&lt;/a&gt; back in July 2021, when I was new to the cryptocurrency space, and understanding concepts like how a &lt;a href="https://www.pranaybathini.com/2021/05/merkle-tree.html"&gt;Merkle tree&lt;/a&gt; is used.&lt;/p&gt;

&lt;p&gt;That is when I learned about the &lt;a href="https://remix.ethereum.org/"&gt;Remix IDE&lt;/a&gt;, the MetaMask wallet, the different Ethereum test networks, and deploying smart contracts through Remix. I also learned that the same contract deployed on the Ethereum blockchain can be deployed on other blockchains, &lt;a href="https://www.thecryptoinsight.com/2021/07/how-to-create-your-own-cryptocurrency-on-binance-smart-chain.html"&gt;like Binance Smart Chain&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I had decided what to build but didn’t know how. With that little knowledge, I began exploring resources and found this &lt;a href="https://cryptomarketpool.com/getting-started-with-solidity/"&gt;Solidity tutorial&lt;/a&gt; through a Reddit post.&lt;/p&gt;

&lt;p&gt;I went through the Solidity basics so I could understand the code of verified smart contracts and develop new ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Smart Contract Development
&lt;/h2&gt;

&lt;p&gt;After gaining some knowledge of Solidity, I began exploring the Loot project’s smart contract. It didn’t make much sense at first, but after spending some time with it, I understood how it works.&lt;/p&gt;

&lt;p&gt;I knew the ERC-721 standard is used to build NFTs and the ERC-20 standard to build our own token on the Ethereum blockchain, but not exactly which functions those standards require us to define. &lt;/p&gt;

&lt;p&gt;I referred to the &lt;a href="https://openzeppelin.com/"&gt;OpenZeppelin docs&lt;/a&gt; to understand more about the functions in ERC-721. So, without any further journey lessons, let’s jump into smart contract development.&lt;/p&gt;

&lt;p&gt;To develop our own NFT smart contract, we require:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://docs.openzeppelin.com/contracts/2.x/api/ownership#Ownable"&gt;Ownable&lt;/a&gt; Smart contract

&lt;ul&gt;
&lt;li&gt;It is used to manage the ownership of the contract. By default, the owner of a smart contract is the address from which it is deployed.&lt;/li&gt;
&lt;li&gt;It has a function that lets you transfer ownership of the contract to another address: the &lt;a href="https://docs.openzeppelin.com/contracts/2.x/api/ownership#Ownable-transferOwnership-address-"&gt;transferOwnership(newOwner)&lt;/a&gt; method.&lt;/li&gt;
&lt;li&gt;It lets you renounce your ownership of the contract with  &lt;a href="https://docs.openzeppelin.com/contracts/2.x/api/ownership#Ownable-renounceOwnership--"&gt;renounceOwnership()&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;It provides the &lt;a href="https://docs.openzeppelin.com/contracts/2.x/api/ownership#Ownable-onlyOwner--"&gt;onlyOwner()&lt;/a&gt; modifier to restrict certain functions to the owner, such as starting the sale, pausing the contract, or giving away NFTs from the contract.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.openzeppelin.com/contracts/4.x/api/token/erc721#ERC721Enumerable"&gt;Enumerable721&lt;/a&gt; Smart contract

&lt;ul&gt;
&lt;li&gt;It extends the &lt;a href="https://docs.openzeppelin.com/contracts/2.x/api/token/erc721"&gt;ERC721 standard&lt;/a&gt;, so our contract does not need to extend the ERC721 contract explicitly.&lt;/li&gt;
&lt;li&gt;This provides enumerability of all the token ids in the contract as well as all token ids owned by each account.&lt;/li&gt;
&lt;li&gt;See the functions it provides in the &lt;a href="https://docs.openzeppelin.com/contracts/4.x/api/token/erc721#ERC721Enumerable"&gt;documentation&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.openzeppelin.com/contracts/4.x/api/security#ReentrancyGuard"&gt;ReentrancyGuard&lt;/a&gt; Smart Contract

&lt;ul&gt;
&lt;li&gt;This module helps prevent reentrant calls to a function: it makes the &lt;a href="https://docs.openzeppelin.com/contracts/4.x/api/security#ReentrancyGuard-nonReentrant--"&gt;nonReentrant&lt;/a&gt; modifier available, which can be applied to functions to make sure there are no nested (reentrant) calls to them.&lt;/li&gt;
&lt;li&gt;Functions marked as &lt;code&gt;nonReentrant&lt;/code&gt; may not call one another.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://docs.openzeppelin.com/contracts/4.x/api/security#Pausable"&gt;Pausable&lt;/a&gt; Smart Contract

&lt;ul&gt;
&lt;li&gt;It is also advisable to extend this contract, which allows child contracts to implement an emergency stop mechanism that can be triggered by an authorized account.&lt;/li&gt;
&lt;li&gt;It is used through inheritance and makes the &lt;code&gt;whenNotPaused&lt;/code&gt; and &lt;code&gt;whenPaused&lt;/code&gt; modifiers available, which can be applied to the functions of your contract.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can make our main smart contract extend these contracts directly by importing them in our code.&lt;/p&gt;

&lt;p&gt;The code for the Loot Royale on-chain NFT smart contract is found here - &lt;a href="https://gist.github.com/pranaybathini/52f4ff2da4c3518855264febe7d95739"&gt;Loot Royale On chain NFT Collection Smart contract code &lt;/a&gt;. Please open it in a new tab to follow along.&lt;/p&gt;

&lt;p&gt;Let us discuss the functions inside the main smart contract - &lt;strong&gt;BattleRoyale.sol&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Three functions are important. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A function to let users mint NFT tokens by paying the required price for the NFTs they are minting.
&lt;/li&gt;
&lt;li&gt;A function to get the token URI of the NFT. It takes a token ID and returns JSON describing the NFT. In our case, the image is an SVG generated from the code - on chain, with no external calls required.&lt;/li&gt;
&lt;li&gt;A function to withdraw the balance from the contract address to another address. I wrote the function to withdraw to the owner address of the smart contract.&lt;/li&gt;
&lt;/ol&gt;
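&lt;p&gt;As a rough sketch of how those pieces fit together (my own simplification assuming OpenZeppelin 4.x contracts, not the actual gist code), the main contract could look like this:&lt;/p&gt;

```solidity
// SPDX-License-Identifier: MIT
pragma solidity ^0.8.0;

import "@openzeppelin/contracts/access/Ownable.sol";
import "@openzeppelin/contracts/token/ERC721/extensions/ERC721Enumerable.sol";
import "@openzeppelin/contracts/security/ReentrancyGuard.sol";
import "@openzeppelin/contracts/security/Pausable.sol";

contract BattleRoyale is ERC721Enumerable, Ownable, ReentrancyGuard, Pausable {
    uint256 public constant PRICE = 0.01 ether;

    constructor() ERC721("Loot Royale", "LOOT") {}

    // 1. Let a user mint one NFT by paying the required price.
    function mintSingle() external payable nonReentrant whenNotPaused {
        require(msg.value >= PRICE, "Insufficient payment");
        _safeMint(msg.sender, totalSupply() + 1);
    }

    // 2. Return on-chain metadata; the real contract assembles
    //    the SVG image and JSON here with no external calls.
    function tokenURI(uint256 tokenId) public view override returns (string memory) {
        require(_exists(tokenId), "Nonexistent token");
        return ""; // placeholder for the on-chain SVG + JSON
    }

    // 3. Withdraw the contract balance to the owner address.
    function withdraw() external onlyOwner {
        payable(owner()).transfer(address(this).balance);
    }
}
```

&lt;p&gt;Names like PRICE here are placeholders; the real contract in the gist adds sale controls and the card-design library call.&lt;/p&gt;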

&lt;h2&gt;
  
  
  NFT Design for on chain smart contract
&lt;/h2&gt;

&lt;p&gt;Here is what one of my NFTs looks like. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--HMvRlqmN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xcmk8jhbky04eznvxw1o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--HMvRlqmN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xcmk8jhbky04eznvxw1o.jpg" alt="Loot Royale NFT" width="492" height="543"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is a simple HTML/CSS design inside an SVG tag, with a curved border, and it displays the token ID at the top. It took me some time to realize that we can include HTML/CSS inside an SVG tag. Initially, I took the HTML/CSS code and used &lt;a href="https://www.hiqpdf.com/demo/ConvertHtmlToSvg.aspx"&gt;this site&lt;/a&gt; to convert it to SVG, but the output SVG was very heavy.&lt;/p&gt;

&lt;p&gt;Another problem I faced while storing the design data was that a smart contract cannot be larger than &lt;strong&gt;24576&lt;/strong&gt; bytes. So, I moved the code to another smart contract, then to a library, so I could reuse it from the main contract. It is just one line, but it took some time, since I first tried optimizing it before realizing I could move it to another contract. Learning++.&lt;/p&gt;

&lt;p&gt;Here is the link to &lt;a href="https://gist.github.com/pranaybathini/d3875355c93e726899c5104cb388cb66?short_path=a38f07c"&gt;SVG Code for the above NFT.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, you can see that the data in my code is different from the Loot project’s. You should have guessed by now where I got my data from. Now let’s see what on-chain vs off-chain NFTs are.&lt;/p&gt;

&lt;h2&gt;
  
  
  On chain vs Off chain NFTs
&lt;/h2&gt;

&lt;p&gt;The difference between on-chain and off-chain NFTs is, as the name suggests, that for on-chain NFTs we store all the data related to the NFTs on the blockchain itself.&lt;/p&gt;

&lt;p&gt;In the off-chain case, I need to provide external URLs like &lt;a href="https://somedomain.com"&gt;https://&lt;/a&gt;somedomain.com or ipfs://sha256hash/1.png.&lt;/p&gt;

&lt;p&gt;In both cases, I need to return a JSON output following the &lt;a href="https://docs.metaplex.com/token-metadata/Versions/v1.0.0/nft-standard"&gt;NFT metadata standard.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A sample IPFS image URL looks like this -  ipfs://QmV3yGkzx2Uw3NHPZV9SAMLA58j7LvCFLFYtyfCMBAvstF/2.jpg&lt;/p&gt;

&lt;p&gt;You can open it directly in the Brave browser - it is a pin I drew on the ArtFlow mobile app a long time ago. &lt;/p&gt;

&lt;p&gt;If we were to create off-chain NFTs, what would the code look like? &lt;/p&gt;

&lt;p&gt;Refer to this GitHub gist link for the off-chain smart contract.&lt;/p&gt;

&lt;p&gt;I will set the IPFS or HTTPS base URI in the contract when the sale starts; nowadays, developers often reveal the NFTs only after all of them are minted.&lt;/p&gt;

&lt;p&gt;It is simple to write a function that sets the URI variable and to call it once all NFTs are minted. &lt;/p&gt;

&lt;p&gt;There are two variables&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Base URI - the actual URI for the NFTs&lt;/li&gt;
&lt;li&gt;Blind URI - until we reveal the NFTs, this image or video is the same for all minted NFTs. &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The URI we set will look like ipfs://QmV3yGkzx2Uw3NHPZV9SAMLA58j7LvCFLFYtyfCMBAvstF/&lt;/p&gt;

&lt;p&gt;When we query, the token ID gets appended to it, and OpenSea or any other marketplace can find and display the NFT.&lt;/p&gt;

&lt;p&gt;The starting steps are to set the contract active and then set the blind URI. Once all NFTs are minted, you can reveal them.&lt;/p&gt;
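&lt;p&gt;A minimal sketch of that reveal flow (variable and function names here are my own illustration, assuming an OpenZeppelin ERC721 base and its Strings utility, not necessarily the gist’s code):&lt;/p&gt;

```solidity
string private blindURI;  // placeholder metadata shown before the reveal
string private baseURI;   // real metadata location, e.g. ipfs://CID/
bool private revealed;

function setBlindURI(string memory uri) external onlyOwner { blindURI = uri; }

// Called once all NFTs are minted: sets the real base URI and reveals.
function reveal(string memory uri) external onlyOwner {
    baseURI = uri;
    revealed = true;
}

function tokenURI(uint256 tokenId) public view override returns (string memory) {
    require(_exists(tokenId), "Nonexistent token");
    if (!revealed) return blindURI;  // same placeholder for every token
    // The token ID is appended so marketplaces resolve ipfs://CID/tokenId
    return string(abi.encodePacked(baseURI, Strings.toString(tokenId)));
}
```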

&lt;p&gt;The tokenURI method’s output will look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    "image": "ipfs://QmV3yGkzx2Uw3NHPZV9SAMLA58j7LvCFLFYtyfCMBAvstF/1.png",
    "name": "Loot Royale 1",
    "description" : "Some description"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Compiling and Deploying to blockchain
&lt;/h2&gt;

&lt;p&gt;Now that we have our contracts, how do we deploy them to a blockchain network? Many blockchains forked Ethereum with added changes to support DApp development. You can deploy to any network that supports Solidity.  &lt;/p&gt;

&lt;p&gt;Let us deploy this on-chain NFT smart contract to Ethereum’s Ropsten test network. But before that, we need:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Remix IDE - you can use it online. Simply visit &lt;a href="https://remix.ethereum.org/"&gt;https://remix.ethereum.org/&lt;/a&gt;. In case you want to run it offline, refer to &lt;a href="https://www.thecryptoinsight.com/2021/07/how-to-create-your-own-cryptocurrency-on-ethereum.html"&gt;this blog on how to download it with Docker and run it.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;MetaMask wallet - if you don’t have a MetaMask wallet, follow &lt;a href="https://www.thecryptoinsight.com/2021/06/how-to-create-an-ethereum-wallet-using-metamask.html"&gt;this blog to install MetaMask in the browser of your choice.&lt;/a&gt; If you are using Brave, a wallet is built into the browser; you can access it from settings.&lt;/li&gt;
&lt;li&gt;Get some Ropsten testnet Ether / Rinkeby testnet Ether / Polygon testnet MATIC to deploy the contract on the corresponding network, and change the network on MetaMask accordingly.

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://faucet.ropsten.be/"&gt;https://faucet.ropsten.be/&lt;/a&gt;  - Ropsten Testnet Faucet&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://faucet.polygon.technology/"&gt;https://faucet.polygon.technology/&lt;/a&gt; - Polygon testnet faucet&lt;/li&gt;
&lt;/ol&gt;


&lt;/li&gt;
&lt;li&gt;If you deploy on the Rinkeby or Polygon testnet, you will be able to see your NFTs on the OpenSea test network. The next steps are the same for all networks. &lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Copy the contract to your remix IDE. It should look like below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7jMbuXmM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xam1dwd6mijis55y9k54.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7jMbuXmM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xam1dwd6mijis55y9k54.jpg" alt="Remix IDE" width="880" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can select any compiler version above 0.8.0, but remember which version you used to compile. It is needed when verifying the contract on the block explorer.&lt;/li&gt;
&lt;li&gt;In case you don’t know what a block explorer is, refer to &lt;a href="https://www.thecryptoinsight.com/2021/08/block-explorers-how-to-use-a-block-explorer.html"&gt;this blog to understand it in detail.&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Go to the compiler tab and click on compile. You should see a green tick mark, like below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3OVS9xyr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/omy9d2b30l4v1jt2u383.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3OVS9xyr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/omy9d2b30l4v1jt2u383.jpg" alt="Compiled Contract" width="880" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the deploy tab and select BattleRoyale.sol as the contract.&lt;/li&gt;
&lt;li&gt;By default, the environment will be the JavaScript VM. All transactions are executed in a sandbox blockchain in the browser, so nothing is persisted when you reload the page: the JavaScript VM is its own blockchain, each reload starts a new one, and the old one is not saved.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--m8vFM0bY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lx21ydeeepqkq98jazk9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--m8vFM0bY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lx21ydeeepqkq98jazk9.jpg" alt="IDeploy to local Network Remix" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need to select Injected Web3 as the environment, which allows Remix to use MetaMask and deploy to the network selected there. You should be able to see your MetaMask account and balance in the accounts section.&lt;/li&gt;
&lt;li&gt;When you click on deploy, you will be prompted twice: first the CardDesign library is deployed, and then our smart contract.&lt;/li&gt;
&lt;li&gt;You will receive two MetaMask notifications once the contracts are deployed. Click on them and you will be redirected to the Ropsten block explorer. You can also view the transactions from the activity tab in MetaMask.&lt;/li&gt;
&lt;li&gt;Congrats, the on-chain NFT smart contract - Loot Royale - has now been deployed.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But how do we interact with it? I will explain in detail later how to interact with it from our custom React frontend. &lt;/p&gt;

&lt;p&gt;Let’s see how to interact with the contract from the block explorer. &lt;/p&gt;

&lt;p&gt;FYI, block explorer is also a DApp.&lt;/p&gt;

&lt;h2&gt;
  
  
  Verifying the contract on blockchain Explorer
&lt;/h2&gt;

&lt;p&gt;Now, let us verify the contract on the blockchain explorer and interact with it from the explorer itself. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open the contract address of the CardDesign library in one tab (we need this address), and open the contract address of the Loot Royale contract in another tab, like below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FdLDwTFx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssy5pe9p44p9w4ykei74.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FdLDwTFx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ssy5pe9p44p9w4ykei74.jpg" alt="Loot Royale Contract" width="880" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Click on Verify and Publish and fill in the details as below. I used 0.8.7 as the compiler version while deploying from Remix, so I am using the same here. Click on continue.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cAqmmc89--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tj8o1hb8zvzr6a3l0rqh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cAqmmc89--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/tj8o1hb8zvzr6a3l0rqh.jpg" alt="Verify Step 2" width="880" height="411"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Copy the code from the Remix IDE and paste it in the code section. Remove the string from the constructor part, like below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vJbLrRyd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/13qj8nhcgxugfkush8jb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vJbLrRyd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/13qj8nhcgxugfkush8jb.jpg" alt="Step 3" width="880" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enter the library address as below, verify you are not a robot, and click on Verify and Publish.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vdv21KUP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffmvedd3zy1ams0qwxs7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vdv21KUP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ffmvedd3zy1ams0qwxs7.jpg" alt="Step 4" width="880" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now our contract is verified; it looks like the image below. You should see the functions to read from and write to the blockchain. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6VaNQl_e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rn9hyfx92byudhj3u4ee.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6VaNQl_e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rn9hyfx92byudhj3u4ee.jpg" alt="Verified" width="880" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to the write contract section and click on Connect to Web3; you will be prompted to connect to a provider, like below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q1ytCEZm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gl8pm82hgix6zyqvckwz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q1ytCEZm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gl8pm82hgix6zyqvckwz.jpg" alt="Write COntract" width="880" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on MetaMask and it will prompt you to connect. Click on Connect to Web3 again, and the explorer should connect with MetaMask, like below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O--X4ARi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xd026zlm3098q1m8dst.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O--X4ARi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1xd026zlm3098q1m8dst.jpg" alt="Connected Explorer" width="880" height="464"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now click on the mintSingle function; it will mint an NFT. Enter 0.01 as the NFT price, like below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y7pgCNJq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qvw54by15sxj19p6iqrm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y7pgCNJq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qvw54by15sxj19p6iqrm.jpg" alt="Mint Single" width="880" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Congratulations! You have successfully minted an NFT. Now, let’s view the token URI. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vkCJgIST--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r12a40lp42ptma1u1xuy.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vkCJgIST--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r12a40lp42ptma1u1xuy.jpg" alt="TOken URI" width="880" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can go to &lt;a href="https://testnets.opensea.io/account"&gt;testnets.opensea.io&lt;/a&gt;,  connect your wallet and you should be able to see your Loot Royale NFTs. &lt;/p&gt;

&lt;h2&gt;
  
  
  Loot Royale NFTs on Open sea Testnet
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Vc7WZQ5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gaym44cfhi7d8ph4roz3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Vc7WZQ5R--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gaym44cfhi7d8ph4roz3.jpg" alt="NFTs" width="880" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the next blog, I will show how to design the frontend for &lt;a href="https://lootroyale.xyz/"&gt;lootroyale.xyz&lt;/a&gt; to mint the Loot Royale NFTs, since this blog has already become lengthy.&lt;/p&gt;

&lt;p&gt;Any feedback is appreciated. In case of doubts, issues, or new ideas, DM me on Twitter - &lt;a href="https://twitter.com/pranay_bathini"&gt;@pranay_bathini&lt;/a&gt;. Let us learn together. &lt;/p&gt;

&lt;p&gt;Thanks for reading!!&lt;/p&gt;

</description>
      <category>nfts</category>
      <category>solidity</category>
      <category>smartcontracts</category>
      <category>web3</category>
    </item>
  </channel>
</rss>
