<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Abhinav Tripathi</title>
    <description>The latest articles on DEV Community by Abhinav Tripathi (@genuineabhinav).</description>
    <link>https://dev.to/genuineabhinav</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3959803%2Fc12d632e-7097-4a9b-aed1-e8e29d52c82a.jpg</url>
      <title>DEV Community: Abhinav Tripathi</title>
      <link>https://dev.to/genuineabhinav</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/genuineabhinav"/>
    <language>en</language>
    <item>
      <title>Learning LLMs by training a 1B param model from scratch on Strix Halo</title>
      <dc:creator>Abhinav Tripathi</dc:creator>
      <pubDate>Mon, 08 Jun 2026 12:38:39 +0000</pubDate>
      <link>https://dev.to/genuineabhinav/learning-llms-by-training-a-1b-param-model-from-scratch-on-strix-halo-1p3k</link>
      <guid>https://dev.to/genuineabhinav/learning-llms-by-training-a-1b-param-model-from-scratch-on-strix-halo-1p3k</guid>
      <description>&lt;p&gt;About 1 year ago, AMD released their &lt;a href="https://www.amd.com/en/products/processors/laptop/ryzen/ai-300-series/amd-ryzen-ai-max-plus-395.html" rel="noopener noreferrer"&gt;AI Max+ series CPUs&lt;/a&gt; (aka &lt;code&gt;Strix Halo&lt;/code&gt;). It seemed that my whole YouTube feed was filled with praise for the architectural decision to bring unified memory to non-Mac hardware. I finally bought a &lt;a href="https://www.gmktec.com/products/amd-ryzen%E2%84%A2-ai-max-395-evo-x2-ai-mini-pc?srsltid=AfmBOoo7F2woafROWl2OLcXy3kFzaoyrLYtvaccIB5HbtXEOBWRucOfH&amp;amp;variant=cd4ec00d-73f2-4f62-b5d1-91129c439d1c" rel="noopener noreferrer"&gt;GMKTec EVO x2&lt;/a&gt; with 128GB RAM in November last year. I started following different tutorials, installing &lt;code&gt;rocm&lt;/code&gt; via non-official releases and trying to run local LLMs!&lt;/p&gt;

&lt;p&gt;I wanted to write about my journey of trying to train a 1B parameter LLM on the strix halo! I will also share all the things I learned along the way. I will provide some links for all the topics that felt helpful to me.&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR:
&lt;/h2&gt;

&lt;p&gt;This is the repo for the AI training framework I created using Antigravity: &lt;a href="https://github.com/genuinelucifer/onix" rel="noopener noreferrer"&gt;https://github.com/genuinelucifer/onix&lt;/a&gt;&lt;br&gt;
The repo has all the tools needed, from downloading a dataset to pretraining, finetuning and running the model locally.&lt;/p&gt;

&lt;p&gt;The repo also contains this doc which outlines the full process to train and fine-tune a model using the tiny-stories dataset. It also mentions how to run it locally using the framework: &lt;a href="https://github.com/genuinelucifer/onix/blob/main/docs/training_llama1b_on_tinystories.md" rel="noopener noreferrer"&gt;https://github.com/genuinelucifer/onix/blob/main/docs/training_llama1b_on_tinystories.md&lt;/a&gt;&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Note:&lt;/strong&gt; Only the code for the framework was written with AI. This blog is completely hand written by me.
&lt;/div&gt;


&lt;h1&gt;
  
  
  Starting phase:
&lt;/h1&gt;

&lt;p&gt;I got my mini-pc with Windows 11 because I also wanted to play games on it. I didn’t know, at that time, how far Proton support for Steam games had come. I tried to install the ROCm drivers and run models on Windows, which was a mistake. Windows support is very limited and none of the YouTubers seem to be using Windows for running AI models (at least on Strix Halo).&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; Use Linux (most recent releases) for any local AI training/running.
&lt;/div&gt;


&lt;p&gt;After trying for multiple weeks and failing to get &lt;a href="https://github.com/Comfy-Org/ComfyUI" rel="noopener noreferrer"&gt;ComfyUI&lt;/a&gt; or &lt;a href="https://qwen.ai/blog?id=qwen-image-edit" rel="noopener noreferrer"&gt;Qwen-image-edit&lt;/a&gt; working properly on Windows, I finally went ahead and dual booted Ubuntu. This unlocked a wide set of tutorials people had created for running and training models on Linux. Most people searching for tutorials on running models on Strix Halo will almost certainly stumble upon the awesome &lt;a href="https://github.com/kyuz0/amd-strix-halo-toolboxes" rel="noopener noreferrer"&gt;toolboxes for strix-halo&lt;/a&gt; created by &lt;a href="https://www.youtube.com/@donatocapitella" rel="noopener noreferrer"&gt;Donato Capitella&lt;/a&gt;. These helped me start my journey of running models on my PC.&lt;br&gt;
But the documentation for the toolboxes asks you to set the lowest amount of dedicated VRAM in the BIOS config and then edit the grub config to add the following kernel parameters, which increase the amount of memory shared between RAM and VRAM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;amd_iommu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;off amdgpu.gttsize&lt;span class="o"&gt;=&lt;/span&gt;126976 ttm.pages_limit&lt;span class="o"&gt;=&lt;/span&gt;32505856
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
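
&lt;p&gt;To make those numbers less magical, here is a small sketch of what they work out to. It assumes that &lt;code&gt;amdgpu.gttsize&lt;/code&gt; is given in MiB and that &lt;code&gt;ttm.pages_limit&lt;/code&gt; counts 4 KiB pages; the variable names are only for illustration.&lt;/p&gt;

```python
# Assumption: amdgpu.gttsize is in MiB and ttm.pages_limit counts 4 KiB pages.
GTT_SIZE_MIB = 126976
TTM_PAGES_LIMIT = 32505856
PAGE_SIZE_BYTES = 4096  # assumed page size

gtt_gib = GTT_SIZE_MIB / 1024
ttm_gib = TTM_PAGES_LIMIT * PAGE_SIZE_BYTES / 1024**3

print(f"GTT size:  {gtt_gib:.0f} GiB")
print(f"TTM limit: {ttm_gib:.0f} GiB")
```

&lt;p&gt;Under those assumptions both settings come out to 124 GiB, i.e. almost all of the 128 GB is made addressable by the GPU.&lt;/p&gt;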


&lt;p&gt;Making this change eagerly was my second mistake. It might be a great hack for running models that need 100+ GB of VRAM, but I never ran any model that needed that much. I found that the Strix Halo is far more compute limited than memory limited. Much later, I found that setting 96 GB of dedicated VRAM and leaving 32 GB as RAM was a much better solution for me (more on this later).&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; Do not prematurely optimize your setup just looking at what configurations worked for others.
&lt;/div&gt;



&lt;p&gt;I ran a few models using ComfyUI and LM-Studio. I got about 40 tokens/second on 30B parameter models, which felt pretty good. Creating a 1080p image via the FLUX model on ComfyUI takes about 50 seconds. So the PC is good for local coding agents and image creation. Creating videos is still out of reach though; it took me about 6 hours to create a 4 second video using the Hunyuan video model on ComfyUI!&lt;/p&gt;

&lt;p&gt;After all the experimentation with running models, I wanted to train my own model; just running existing models wasn’t why I bought this PC.&lt;/p&gt;

&lt;h1&gt;
  
  
  Phase Two: Learning about LLMs and setting goals
&lt;/h1&gt;

&lt;p&gt;Note that I had already learned about CNNs, RNNs etc, so I knew the basics of how the models work. I knew the maths involved and what a basic model's architecture looks like.&lt;br&gt;
What I wanted to learn was all the components of an actual LLM without too much detail on the maths involved. I was also focussed on learning it from a software perspective rather than a research perspective.&lt;br&gt;
I bought the book &lt;a href="https://www.amazon.in/dp/1633437167" rel="noopener noreferrer"&gt;Build a Large Language Model from Scratch&lt;/a&gt; by &lt;a href="https://www.amazon.in/stores/Sebastian-Raschka/author/B00J1DHHFS" rel="noopener noreferrer"&gt;Sebastian Raschka&lt;/a&gt; to learn the fundamentals of how LLMs work and what the components inside an LLM are. This was the perfect book for my use-case! It took me about 1.5 months to go through the book the old fashioned way (read the text and write every line of code manually). I have my code uploaded to &lt;a href="https://github.com/genuinelucifer/yallm" rel="noopener noreferrer"&gt;my yallm repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After this, I wanted to speed up my workflow. I also wanted to check the hype around AI tools on a personal project. So, I took a Gemini Pro subscription, installed Antigravity, and decided that I would give the code from my yallm repo to the agent and then ask it to write all the code from then on. I also wanted to create a generic framework to train an LLM from scratch on my PC, because (in my head) I was going to be training all different types of large language models.&lt;br&gt;
I also had the goal to &lt;strong&gt;NOT&lt;/strong&gt; write any code for this framework myself. I would (from time to time) review some of the code and just see what I got when training, finetuning and running my models.&lt;/p&gt;

&lt;p&gt;So, I went full steam ahead and created my Onix framework (slight pun on the ONNX model format) which I used to fully train, finetune and run a 1B parameter model on my strix-halo. I intend to keep developing the framework (maybe not 100% via AI going forward) and hopefully add support for training more types of model architectures with more optimizations. Currently it supports training a text based LLM, a VQ-VAE embedding for images and then a “multimodal” model using the trained VQ-VAE that generates images based on text.&lt;/p&gt;

&lt;p&gt;I looked for English datasets that I could pretrain my model on and selected the &lt;a href="https://huggingface.co/datasets/roneneldan/TinyStories" rel="noopener noreferrer"&gt;tiny-stories dataset&lt;/a&gt; as my first dataset. It has about 470M tokens. It seemed small enough to do some useful tests before moving on to bigger datasets (considering I was trying to train a 1B parameter model).&lt;/p&gt;
&lt;h1&gt;
  
  
  Phase Three: Memory footprint for pretraining the model
&lt;/h1&gt;

&lt;p&gt;After this I selected llama as the base model architecture to start with. Llama's architecture is very well documented and easy to re-use. I stuck to the GPT2 tokenizer that I read about in the book by Sebastian Raschka, just so that I could re-use most of my code even in new model architectures. After this I started the pretraining process.&lt;/p&gt;

&lt;p&gt;When trying to run training &lt;strong&gt;on a 1B Llama model&lt;/strong&gt;, I started &lt;strong&gt;with a 1024-token context window&lt;/strong&gt;. I saw that &lt;strong&gt;it used about 26 GB of VRAM&lt;/strong&gt;! That was both a relief and a shock. Shock because I didn't know why a 1B parameter model would take 26 GB of VRAM, and relief because I had 110+ GB of shared memory at this point. So, I could increase both the context window and the batch size. I could even increase the model size and train much larger models. Oh, such were those times of naivety!&lt;/p&gt;

&lt;p&gt;So, I looked through the memory usage, and it broke down roughly as follows. I was training my model at the default fp32 precision, where every value takes 4 bytes (32 bits).&lt;/p&gt;

&lt;p&gt;Following are some approximate guesses on which things need how much VRAM during training (not sure how to find exact values):&lt;/p&gt;

&lt;p&gt;1 - Total static overhead (&lt;strong&gt;~16 GB&lt;/strong&gt;, i.e. about 4B fp32 values at 4 bytes each):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~1B model parameters&lt;/li&gt;
&lt;li&gt;~1B model gradients.&lt;/li&gt;
&lt;li&gt;~2B values of AdamW optimizer state. AdamW keeps 2 values, momentum and variance, per model parameter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;2 - Total Activation Memory: (&lt;strong&gt;~5GB&lt;/strong&gt;)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sebastianraschka.com/blog/2023/self-attention-from-scratch.html" rel="noopener noreferrer"&gt;Self-attention matrix&lt;/a&gt; (~context_size&lt;sup&gt;2&lt;/sup&gt; * n_heads * n_layers values)&lt;/li&gt;
&lt;li&gt;Feed forward block outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;3 - Cached parameters for the eval stage of the model since I am running eval after the first iteration of training.&lt;/p&gt;

&lt;p&gt;4 - PyTorch/ROCm caching has some overhead as well.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; I cannot train a 100B model on 100GB VRAM! :(
&lt;/div&gt;
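
&lt;p&gt;The static part of that estimate is simple enough to sketch in a few lines. This assumes plain fp32 training with AdamW (weights, gradients and two optimizer states per parameter) and ignores activations and caching; the function name is made up for illustration.&lt;/p&gt;

```python
BYTES_PER_FP32 = 4

def static_training_gb(n_params, bytes_per_value=BYTES_PER_FP32):
    """Rough static footprint: weights + gradients + 2 AdamW states."""
    copies_per_param = 1 + 1 + 2
    return n_params * copies_per_param * bytes_per_value / 1e9

for n_params in (1e9, 100e9):
    print(f"{n_params / 1e9:.0f}B params: ~{static_training_gb(n_params):.0f} GB")
```

&lt;p&gt;For 1B parameters this gives ~16 GB, and for 100B parameters ~1600 GB before a single activation is stored.&lt;/p&gt;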


&lt;p&gt;So, I went and increased the batch size to 8. Now I saw that it used about 65 GB of VRAM. This was also a shock because I expected it to take about 8x the VRAM and cause OOM. But it did not!&lt;br&gt;
It turned out that the static overhead remains the same, since there is only 1 instance of the model. So all the model and optimizer parameters still need only about 16GB of VRAM. The activations, however, are stored once per sample in the batch so that the samples can be processed in parallel, and only that part grows with the batch size.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; With batch size, the activation memory scales linearly. The static overhead remains the same.
&lt;/div&gt;
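
&lt;p&gt;As a rough sanity check, the two measurements above (26 GB at batch size 1, 65 GB at batch size 8) can be fitted to a straight line. This is only a two-point sketch, assuming memory is exactly "fixed + per-sample * batch size"; the helper name is made up.&lt;/p&gt;

```python
def fit_linear_memory(batch_a, gb_a, batch_b, gb_b):
    """Fit memory = fixed + per_sample * batch_size through two points."""
    per_sample = (gb_b - gb_a) / (batch_b - batch_a)
    fixed = gb_a - per_sample * batch_a
    return fixed, per_sample

fixed_gb, per_sample_gb = fit_linear_memory(1, 26, 8, 65)
print(f"fixed: ~{fixed_gb:.1f} GB, per sample: ~{per_sample_gb:.1f} GB")
```

&lt;p&gt;That gives roughly 20.4 GB of fixed cost and 5.6 GB per sample, which is in the same ballpark as the ~16 GB static overhead (plus eval and caching overhead) and ~5 GB of activations guessed earlier.&lt;/p&gt;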


&lt;p&gt;After this, I decided to train on a larger context window, because I presumed that whatever data I finally trained the model on, it would work somewhat like a chatbot, and 1024 tokens is a very small context window for that. So I increased the context window to 8192 tokens (8x the earlier size and the least amount of context needed as per my uneducated guess).&lt;br&gt;
And to my shock, &lt;strong&gt;even with a batch size of one, I ran into OOM when trying to train an 8192 context size llama model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In this case as well, the static memory overhead is still the same. But the memory for self-attention is quadratic in nature: it is proportional to context_size&lt;sup&gt;2&lt;/sup&gt;. If we increase the context size by 8x, the attention matrices grow by 64x.&lt;br&gt;
So, the self-attention memory needed would itself be &amp;gt;150GB!&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; With context window size, the self-attention memory scales quadratically.
&lt;/div&gt;
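
&lt;p&gt;Here is a back-of-the-envelope sketch of that quadratic growth. The head and layer counts are a hypothetical 1B-class config chosen for illustration, not necessarily what the trained model uses, and the sketch counts only one fp32 score matrix per head per layer, so real usage is higher.&lt;/p&gt;

```python
def attention_scores_gb(context_size, n_heads, n_layers, batch_size=1, bytes_per_value=4):
    """Memory for one context x context score matrix per head per layer."""
    values = batch_size * n_layers * n_heads * context_size**2
    return values * bytes_per_value / 1e9

# Hypothetical 1B-class config for illustration only.
N_HEADS, N_LAYERS = 32, 16

small = attention_scores_gb(1024, N_HEADS, N_LAYERS)
large = attention_scores_gb(8192, N_HEADS, N_LAYERS)
print(f"1024 ctx: ~{small:.1f} GB, 8192 ctx: ~{large:.1f} GB, ratio {large / small:.0f}x")
```

&lt;p&gt;With these assumed numbers, a single copy of the score matrices goes from about 2.1 GB to about 137 GB, which is already more than the machine has.&lt;/p&gt;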


&lt;p&gt;Now I needed to optimize this memory usage. The naive self-attention implementation wasn't usable anymore, so I switched to the &lt;a href="https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/" rel="noopener noreferrer"&gt;flash attention mechanism&lt;/a&gt;. It computes the same attention result without ever storing the full attention matrix, which makes the attention memory scale linearly with context size instead of quadratically.&lt;br&gt;
With &lt;a href="https://docs.pytorch.org/docs/2.12/generated/torch.nn.functional.scaled_dot_product_attention.html" rel="noopener noreferrer"&gt;flash attention (SDPA)&lt;/a&gt;, training took 58GB with a batch-size of 1.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; With flash attention, the attention memory scales linearly with context size.
&lt;/div&gt;
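
&lt;p&gt;The core trick that makes this possible is an "online" softmax: the scores are processed block by block while keeping only a running maximum, a running normaliser and a running weighted sum. Below is a minimal pure-Python sketch for a single query with scalar values. It is a toy illustration of the idea, not the actual fused GPU kernel, and both function names are made up.&lt;/p&gt;

```python
import math

def naive_attention_row(scores, values):
    """Materialise the whole softmax row, then mix the values."""
    row_max = max(scores)
    weights = [math.exp(s - row_max) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

def streaming_attention_row(scores, values, block_size=2):
    """Same result, but only one block of scores is looked at at a time."""
    running_max = float("-inf")
    normaliser = 0.0
    weighted_sum = 0.0
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        # Rescale what was accumulated under the previous maximum.
        scale = math.exp(running_max - new_max)
        normaliser *= scale
        weighted_sum *= scale
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            normaliser += w
            weighted_sum += w * v
        running_max = new_max
    return weighted_sum / normaliser

scores = [0.5, 2.0, -1.0, 3.0, 0.1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
print(naive_attention_row(scores, values))
print(streaming_attention_row(scores, values))
```

&lt;p&gt;Both functions print the same number (up to floating point rounding), but the streaming version never needs the full row of weights in memory at once.&lt;/p&gt;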


&lt;p&gt;I then tried &lt;a href="https://residentmario.github.io/pytorch-training-performance-guide/gradient-checkpoints.html" rel="noopener noreferrer"&gt;gradient checkpointing&lt;/a&gt;, which lets the training discard most of the intermediate activations from the forward pass and recompute them during the backward pass. This saves more memory at the cost of extra compute. Training now took only about 29 GB of VRAM, and every iteration took about 47 seconds.&lt;/p&gt;

&lt;p&gt;All the data till this point is summarized in this table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Context size&lt;/th&gt;
&lt;th&gt;Batch Size&lt;/th&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;VRAM (GB)&lt;/th&gt;
&lt;th&gt;Time (s/step)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;3.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;65&lt;/td&gt;
&lt;td&gt;23.5&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;td&gt;-&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Flash Attention&lt;/td&gt;
&lt;td&gt;58&lt;/td&gt;
&lt;td&gt;38&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;4&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Flash Attention&lt;/td&gt;
&lt;td&gt;95&lt;/td&gt;
&lt;td&gt;80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Flash Attention + Grad Checkpointing&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;47&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8192&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Flash Attention + Grad Checkpointing&lt;/td&gt;
&lt;td&gt;34&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note that the data in this table was re-calculated while writing this post. So it may not align 100% with the next table. I had already enabled dedicated VRAM at this point. Also upgraded pytorch, rocm, ubuntu etc.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The main issue now came to light. Training took about 2x the time per step when I increased the batch-size by 2x (both with and without grad checkpointing). So, it became very clear that the GPU was limited by compute and not by memory.&lt;/p&gt;

&lt;p&gt;I stopped trying to use more VRAM at this point. I now needed to focus on improving the training time. &lt;strong&gt;I decided to not use grad checkpointing&lt;/strong&gt; since it was about 25% slower and time matters more at this scale.&lt;br&gt;
Now, each step of the training process needed about 80 seconds. For the tiny-stories dataset, with batch-size of 2, I still needed 24414 steps of training to complete one epoch of training.&lt;br&gt;
So, it would take me about one month of continuous training to complete just 1 epoch of training (after adding evaluation steps every few 100 training steps)! It was not feasible.&lt;br&gt;
And there was no point in increasing the training data size anymore, since that would be completely impractical. So, I decided that tiny-stories would be the only dataset I would train my model on.&lt;/p&gt;
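
&lt;p&gt;The arithmetic behind that estimate, as a quick sketch. It only counts pure training steps and ignores evaluation and any downtime; the helper name is made up.&lt;/p&gt;

```python
SECONDS_PER_DAY = 24 * 60 * 60

def epoch_days(steps_per_epoch, seconds_per_step):
    """Wall-clock days for one epoch, ignoring evaluation pauses."""
    return steps_per_epoch * seconds_per_step / SECONDS_PER_DAY

print(f"~{epoch_days(24414, 80):.1f} days per epoch")
```

&lt;p&gt;That is about 22.6 days of uninterrupted training before evaluation steps are added on top.&lt;/p&gt;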

&lt;p&gt;Since I had already spent so much time optimizing the memory, &lt;strong&gt;I decided to continue with a context window size of 8192 and a batch size of 2 for training my model&lt;/strong&gt;. This was obviously a mistake. Tiny stories doesn't need such a huge context window. I would have gotten much faster training times if I had stuck to a context window of 512 tokens, which would have been enough for this dataset. But then I wouldn't have needed to learn how to optimize training time, which I got to do here. So it was a win, in hindsight.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; First decide the dataset and then define the model configuration that would work for that dataset, not the other way around.
&lt;/div&gt;


&lt;h1&gt;
  
  
  Phase Four: Improving training speed on my hardware
&lt;/h1&gt;

&lt;p&gt;At this point, I still had the shared GTT memory of 110+ GB and dedicated VRAM of 1 GB. But since both the GPU and the CPU share this memory, it is often fragmented and all access needs to go via a GTT (Graphics Translation Table).&lt;br&gt;
So, I went ahead and removed the grub config and updated the dedicated VRAM in my BIOS to 96GB. This left 32 GB for RAM. And voilà, I got about a 1.5x speedup just from doing this!&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; Dedicated VRAM is not the same as shared memory. It is much faster even on the same hardware (LPDDR5).
&lt;/div&gt;


&lt;p&gt;After this, I learned about &lt;a href="https://cloud.google.com/blog/products/ai-machine-learning/bfloat16-the-secret-to-high-performance-on-cloud-tpus" rel="noopener noreferrer"&gt;BF16 numbers&lt;/a&gt;, which use the same 8 exponent bits as FP32 numbers and hence cover the same range, but keep only 7 mantissa bits instead of 23. They are less precise than FP32, but machine learning algorithms usually tolerate that loss of precision much better than a loss of range, and each value takes half the memory.&lt;br&gt;
Shifting to BF16 for training gave me a further ~5x speedup (74s to 15s per step)!&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; Prefer BF16 over FP32 for training. It keeps FP32's range at half the size, and the lost precision usually doesn't matter.
&lt;/div&gt;
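
&lt;p&gt;To see what BF16 keeps and what it drops, here is a rough sketch that emulates it in plain Python by truncating an FP32 value to its top 16 bits (1 sign bit, 8 exponent bits, 7 mantissa bits). Real hardware normally rounds rather than truncates, and the helper name &lt;code&gt;to_bf16&lt;/code&gt; is made up for illustration.&lt;/p&gt;

```python
import struct

def to_bf16(x):
    """Emulate BF16 by keeping only the top 16 bits of the FP32 encoding."""
    bits = struct.unpack("!I", struct.pack("!f", x))[0]
    truncated = (bits // 65536) * 65536  # zero the low 16 mantissa bits
    return struct.unpack("!f", struct.pack("!I", truncated))[0]

print(to_bf16(1.001))  # small differences near 1.0 are lost
print(to_bf16(1e38))   # very large magnitudes still fit
```

&lt;p&gt;The first value collapses to 1.0 because only 7 mantissa bits survive, while the second stays within about 1% of 1e38, far beyond the largest value FP16 can hold (65504).&lt;/p&gt;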


&lt;p&gt;After this I learned about &lt;a href="https://rocm.blogs.amd.com/artificial-intelligence/torch_compile/README.html" rel="noopener noreferrer"&gt;compiling models so that torch can create a fused kernel&lt;/a&gt; (which can optimize away certain operations). This gave further improvement to training time.&lt;/p&gt;

&lt;p&gt;PyTorch also has an optimization that uses &lt;a href="https://www.youtube.com/watch?v=bhplJt1XAMI" rel="noopener noreferrer"&gt;Flash Attention kernels compiled ahead-of-time&lt;/a&gt;, which can be enabled via &lt;code&gt;TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL&lt;/code&gt;. I enabled it along with the previous optimization.&lt;/p&gt;

&lt;p&gt;After all these optimizations, I got about a 10x total improvement in training speed, which was good enough for me to start training the model.&lt;/p&gt;

&lt;p&gt;All the data for speed is summarized in this table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Optimization&lt;/th&gt;
&lt;th&gt;Time/step&lt;/th&gt;
&lt;th&gt;Improvement&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;0&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Baseline&lt;/td&gt;
&lt;td&gt;110s&lt;/td&gt;
&lt;td&gt;1.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;1&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;96GB Dedicated VRAM&lt;/td&gt;
&lt;td&gt;74s&lt;/td&gt;
&lt;td&gt;~1.5x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;2&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+ BF16 Precision&lt;/td&gt;
&lt;td&gt;15s&lt;/td&gt;
&lt;td&gt;~7.3x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;3&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+ torch.compile + AOTriton&lt;/td&gt;
&lt;td&gt;~11s&lt;/td&gt;
&lt;td&gt;~10.0x&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: After all 3 optimizations, the model takes about 63 GB VRAM while training which was a welcome surprise. I was not looking to reduce memory footprint at this point, so I did not capture memory footprint after every optimization.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;After this, I saw that my GPU utilization graph dips every iteration and the CPU spikes at exactly the same time. It looked something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36em3w7i0sgd7nbby68t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F36em3w7i0sgd7nbby68t.png" alt="Image showcasing GPU usage dip below 100% at regular intervals for 1-2 seconds" width="722" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This was because after every iteration, the CPU needed to load the next set of data for the next iteration. So I needed to optimize my data loader to get full 100% GPU usage at all times.&lt;br&gt;
I did so by adding &lt;a href="https://docs.pytorch.org/docs/2.12/data.html" rel="noopener noreferrer"&gt;num-workers and prefetch-factor&lt;/a&gt; to my data loaders. This didn't increase the speed of training by any noticeable amount, but I can at least see my GPU at 100% utilization throughout the training! :)&lt;/p&gt;

&lt;p&gt;After all these optimizations, I finally ran the full pretraining on the tiny-stories dataset. &lt;strong&gt;It took about 3.5 days for 1 epoch of training to finish.&lt;/strong&gt; I did have to create a pause and resume mechanism so that I could resume training whenever my PC restarted due to electricity issues.&lt;br&gt;
I did not have the heart to continue training for more epochs. I ran the model at this stage and got good enough output, IMO. Example (at this point the model is auto-completing the story, which I started with "There was a boy"):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbexrksnhg7xahryt402.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkbexrksnhg7xahryt402.png" alt="Image showcasing an LLM completing a story after the users inputs " width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h1&gt;
  
  
  Phase Five: Model fine-tuning
&lt;/h1&gt;

&lt;p&gt;Although the model was ready to do auto-complete for tiny stories, it didn't feel natural. I wanted to do instruction fine-tuning on it. I used the &lt;a href="https://huggingface.co/datasets/roneneldan/TinyStoriesInstruct" rel="noopener noreferrer"&gt;TinyStoriesInstruct dataset&lt;/a&gt; for this. However, this dataset wasn’t really what I expected it to be. It is more like a “story-elements-to-story” dataset. So I wrote &lt;a href="https://github.com/genuinelucifer/onix/blob/main/utils/preprocess_tinystories_instruct_data.py" rel="noopener noreferrer"&gt;a python script to convert it into the format I wanted&lt;/a&gt;. I wanted the dataset to be something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User: Tell me a story about a king and a queen. It should have the words river, gate and knight.

Model: ….&amp;lt;tells a story about the same&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
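
&lt;p&gt;A minimal sketch of how one such sample could be assembled from a topic, a list of words and a story. This is not the script from the repo; the function names and the exact prompt wording are assumptions that simply mirror the example above.&lt;/p&gt;

```python
def join_words(words):
    """Join words as 'a, b and c' (hypothetical helper)."""
    if len(words) == 1:
        return words[0]
    return ", ".join(words[:-1]) + " and " + words[-1]

def build_sample(topic, words, story):
    """Format one training example in the User/Model layout shown above."""
    prompt = f"Tell me a story about {topic}. It should have the words {join_words(words)}."
    return f"User: {prompt}\n\nModel: {story}"

sample = build_sample(
    "a king and a queen",
    ["river", "gate", "knight"],
    "Once upon a time...",
)
print(sample)
```

&lt;p&gt;Keeping the layout in one small function makes it easy to use exactly the same template later when prompting the finetuned model.&lt;/p&gt;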


&lt;p&gt;After this I finetuned the model on the newly created dataset. Fine-tuning was exactly the same as training: just take the earlier model and train it further. This dataset was surprisingly larger (in number of data points) than the actual pretraining data.&lt;br&gt;
I had the boon of hindsight and decided that for finetuning I would only use the first 1024 tokens of the context window, which would drastically reduce the time needed to train on my dataset. I also added grad checkpointing to drastically reduce the memory needed for finetuning. And then I increased the batch-size to 32.&lt;/p&gt;


&lt;div class="crayons-card c-embed"&gt;

  &lt;strong&gt;Learning:&lt;/strong&gt; The same model can be fine-tuned with a slightly different config (for example a smaller context window) to make fine-tuning faster.
&lt;/div&gt;



&lt;p&gt;With 1024 context, grad checkpointing enabled and a batch-size of 32, finetuning took about 9 seconds per step and consumed about 53 GB of VRAM.&lt;br&gt;
With this, &lt;strong&gt;it would take about 6 days to finetune the model on the full instruct dataset, just for 1 epoch&lt;/strong&gt;. I ran the finetune only on half the dataset and checked the results after 3 days. They looked okay to me for it being my first time trying to finetune a model.&lt;br&gt;
Example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bm6f1dvrlna1eqrjfqm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bm6f1dvrlna1eqrjfqm.png" alt="Image showcasing an LLM replying with a story after the user asks it to generate a story." width="799" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Next Steps:
&lt;/h1&gt;

&lt;p&gt;Next, I want to work on the following items (in no particular order); which, I think, will lead me to improve my framework (&amp;amp; knowledge of LLMs) significantly more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work on the model runner to learn about what optimizations are applied for inference (for both memory and time).&lt;/li&gt;
&lt;li&gt;Retraining the same model with smaller context size.&lt;/li&gt;
&lt;li&gt;Converting model to safetensors and sharing via huggingface.&lt;/li&gt;
&lt;li&gt;Quantize the BF16 models to 4-bit and compare outputs.&lt;/li&gt;
&lt;li&gt;Train an 8-bit or maybe even a 1-bit model and compare outputs.&lt;/li&gt;
&lt;li&gt;Create a LoRA to allow fine-tuning the model with a smaller set of parameters. Learn how it works.&lt;/li&gt;
&lt;li&gt;Train a model to create small images... Maybe galaxies &amp;amp; 32x32 platforming characters. Learn about image models.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Final thoughts:
&lt;/h1&gt;

&lt;p&gt;If you are starting today, install Ubuntu 26.04 and go hacking. Try my framework and let me know what other things I should try to make the model training framework better!&lt;br&gt;
Please share your thoughts/experience on training LLMs and what are some things I should learn next to better understand how everything works.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>strixhalo</category>
    </item>
  </channel>
</rss>
