<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thokozani Buthelezi</title>
    <description>The latest articles on DEV Community by Thokozani Buthelezi (@thokozani_buthelezi_2cd41).</description>
    <link>https://dev.to/thokozani_buthelezi_2cd41</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3908811%2Ff6d8e313-f93a-4c40-baa1-d2f31d776099.png</url>
      <title>DEV Community: Thokozani Buthelezi</title>
      <link>https://dev.to/thokozani_buthelezi_2cd41</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thokozani_buthelezi_2cd41"/>
    <language>en</language>
    <item>
      <title>DDP Is Not Always Faster</title>
      <dc:creator>Thokozani Buthelezi</dc:creator>
      <pubDate>Sat, 16 May 2026 12:11:57 +0000</pubDate>
      <link>https://dev.to/thokozani_buthelezi_2cd41/ddp-is-not-always-faster-23mh</link>
      <guid>https://dev.to/thokozani_buthelezi_2cd41/ddp-is-not-always-faster-23mh</guid>
      <description>&lt;p&gt;That is the result of this experiment, and it is the most important thing to understand about distributed training before you reach for it.&lt;/p&gt;

&lt;p&gt;I ran my nanoGPT implementation, a 4M-parameter character-level transformer, across two T4 GPUs on Kaggle using PyTorch's DistributedDataParallel. The single-GPU baseline ran at 22ms per step. The DDP run across two GPUs ran at 26ms per step. Adding a second GPU made training slower.&lt;/p&gt;




&lt;h2&gt;
  
  
  How DDP Works
&lt;/h2&gt;

&lt;p&gt;DDP replicates the model on every GPU, and &lt;code&gt;DistributedSampler&lt;/code&gt; gives each GPU a different subset of the data. Each GPU runs its own forward and backward pass independently. As gradients become ready during backward, DDP runs an all-reduce that averages them across all GPUs, so synchronization finishes before the optimizer step. Every process ends up with identical gradients and takes an identical optimizer step.&lt;/p&gt;

&lt;p&gt;The key line in the code is one wrapping call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;DDP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device_ids&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything else (the model definition, the optimizer, the training loop) stays the same. DDP handles the gradient synchronization automatically via hooks registered on each parameter.&lt;/p&gt;
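
&lt;p&gt;To make the synchronization step concrete, here is a toy, framework-free sketch of what an averaging all-reduce achieves. It is not PyTorch code: the function name &lt;code&gt;all_reduce_mean&lt;/code&gt; and the gradient values are made up for illustration, and real DDP does this in buckets over NCCL rather than with Python lists.&lt;/p&gt;

```python
# Toy illustration of DDP's gradient all-reduce (mean), without any framework.
# Each inner list stands for the gradients one GPU computed on its own data shard.

def all_reduce_mean(per_worker_grads):
    """Return what every worker holds after an averaging all-reduce."""
    world_size = len(per_worker_grads)
    num_params = len(per_worker_grads[0])
    averaged = [
        sum(worker[i] for worker in per_worker_grads) / world_size
        for i in range(num_params)
    ]
    # Every worker receives its own copy of the same averaged gradients.
    return [list(averaged) for _ in range(world_size)]


# Hypothetical gradients from two workers for a three-parameter model.
grads_rank0 = [0.5, -1.0, 2.0]
grads_rank1 = [1.5, 1.0, 0.0]

synced = all_reduce_mean([grads_rank0, grads_rank1])
print(synced[0])                # gradients rank 0 applies
print(synced[0] == synced[1])   # both ranks hold identical gradients
```

&lt;p&gt;Because both workers end up with the same gradients, identical optimizer steps keep the two model replicas in sync without ever exchanging the weights themselves.&lt;/p&gt;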




&lt;h2&gt;
  
  
  The Numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single GPU step time&lt;/td&gt;
&lt;td&gt;22.03ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DDP step time (2 GPUs)&lt;/td&gt;
&lt;td&gt;26.36ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compute time per step&lt;/td&gt;
&lt;td&gt;15.12ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communication time per step&lt;/td&gt;
&lt;td&gt;11.24ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Communication overhead&lt;/td&gt;
&lt;td&gt;42.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Scaling efficiency&lt;/td&gt;
&lt;td&gt;41.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Scaling efficiency measures how close you got to ideal linear speedup. At 100% efficiency, two GPUs would halve your step time to 11ms. At 41.8%, the DDP run is actually slower than single GPU.&lt;/p&gt;
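
&lt;p&gt;The derived rows in the table can be reproduced from the measured step times. The sketch below assumes scaling efficiency is defined as speedup divided by GPU count, and communication overhead as communication time over DDP step time; those definitions are my reading of the table, and the variable names are illustrative.&lt;/p&gt;

```python
# Measured values from the table above (milliseconds per step).
single_gpu_ms = 22.03
ddp_ms = 26.36
comm_ms = 11.24
num_gpus = 2

# Speedup below 1.0 means the DDP run is slower per step than the baseline.
speedup = single_gpu_ms / ddp_ms

# Assumed definition: fraction of the ideal linear speedup actually achieved.
scaling_efficiency = speedup / num_gpus

# Assumed definition: share of each DDP step spent on communication.
comm_overhead = comm_ms / ddp_ms

# Step time two GPUs would reach at 100% efficiency.
ideal_ms = single_gpu_ms / num_gpus

print(f"speedup: {speedup:.2f}x")
print(f"scaling efficiency: {scaling_efficiency:.1%}")
print(f"communication overhead: {comm_overhead:.1%}")
print(f"ideal step time: {ideal_ms:.2f}ms")
```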




&lt;h2&gt;
  
  
  Why Scaling Efficiency Was Low
&lt;/h2&gt;

&lt;p&gt;42.6% of every DDP step was gradient communication, not compute. The all-reduce has to move every gradient across the PCIe bus connecting the two T4s, and with only 4M parameters there is too little compute per step to offset that cost.&lt;/p&gt;

&lt;p&gt;This is a compute-to-communication ratio problem. DDP only pays off when the model is large enough that compute time swamps communication time. At 4M parameters on T4s without NVLink, the ratio is far too low: each step spends 11.24ms on communication against 15.12ms of compute, so the time spent talking between GPUs outweighs what the second GPU saves.&lt;/p&gt;

&lt;p&gt;The same experiment on a 1B-parameter model would likely look very different. Compute should dominate, and I would expect scaling efficiency to climb toward 80-90%.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;The right experiment for demonstrating DDP scaling is a model at least one order of magnitude larger, or a run across more than two GPUs where the communication overhead amortizes better. Weeks 13-14 exposed the constraint rather than the benefit, which is a legitimate result: knowing when not to use DDP is as useful as knowing how to use it.&lt;/p&gt;

&lt;p&gt;Phase III (FSDP) addresses this directly: instead of replicating the full model on every GPU, FSDP shards parameters across GPUs, which changes the communication pattern and makes large model training viable.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Code and results committed to &lt;code&gt;distributed_data_parallel&lt;/code&gt; in the monorepo.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026" rel="noopener noreferrer"&gt;https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>pytorch</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Evaluating LLMs for Under a Dollar</title>
      <dc:creator>Thokozani Buthelezi</dc:creator>
      <pubDate>Thu, 14 May 2026 13:39:16 +0000</pubDate>
      <link>https://dev.to/thokozani_buthelezi_2cd41/evaluating-llms-for-under-a-dollar-4d4b</link>
      <guid>https://dev.to/thokozani_buthelezi_2cd41/evaluating-llms-for-under-a-dollar-4d4b</guid>
      <description>&lt;h2&gt;
  
  
  Why Evals Matter
&lt;/h2&gt;

&lt;p&gt;Training a model is only half the job. Without a systematic way to measure what it can actually do, you are flying blind. The problem is that evaluation is easy to do badly: you can run a benchmark, get a number, and walk away thinking you know something when you don't.&lt;/p&gt;

&lt;p&gt;This post is about doing it properly on a budget. I ran three standard benchmarks against Qwen2.5-0.5B on a free Colab T4, logged wall-clock time and dollar cost for each task, and documented every methodological decision along the way. Total spend: &lt;strong&gt;$0.1185&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Benchmarks
&lt;/h2&gt;

&lt;p&gt;I picked three tasks that cover meaningfully different capabilities rather than variations of the same thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GSM8K&lt;/strong&gt; (Cobbe et al., 2021) tests grade-school math reasoning. The model has to produce a chain-of-thought and arrive at a final numeric answer. Scoring is exact match: either the answer is right or it isn't. This is a generative task, which makes it slower and more expensive than the others. I used 5-shot prompting, the common convention for this benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HellaSwag&lt;/strong&gt; (Zellers et al., 2019) tests commonsense sentence completion. Given a partial sentence, the model scores four candidate continuations using normalized log-likelihood and picks the highest. The dataset was constructed with adversarial filtering, meaning the wrong answers were specifically chosen to fool models that rely on surface-level patterns. Human performance is around 95%. I used 10-shot prompting, the common convention for this benchmark.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TruthfulQA-MC2&lt;/strong&gt; (Lin et al., 2021) tests whether the model produces truthful answers to questions that commonly elicit false beliefs. I used the MC2 variant (multiple choice scored by log-likelihood) rather than the generative version, which requires a separate judge model. This keeps the eval fully self-contained and free. 0-shot, following the original paper.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Harness
&lt;/h2&gt;

&lt;p&gt;All three tasks were run through &lt;a href="https://github.com/EleutherAI/lm-evaluation-harness" rel="noopener noreferrer"&gt;lm-evaluation-harness&lt;/a&gt; by EleutherAI. The harness standardizes few-shot prompt construction, normalization, and metric computation across tasks, which matters a lot for reproducibility. Running the same eval twice should give the same number.&lt;/p&gt;

&lt;p&gt;One non-obvious decision: GSM8K in the harness defaults to &lt;code&gt;max_gen_toks=2048&lt;/code&gt;, which generates up to 2048 tokens per sample. On a T4 that run was on track to take over 4 hours. I capped generation at 256 tokens, which I figured is enough to capture a complete chain-of-thought for grade-school math, and set &lt;code&gt;limit=0.25&lt;/code&gt;, which evaluates 25% of the test set. Together these bring the runtime down to under 50 minutes.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Model
&lt;/h2&gt;

&lt;p&gt;Qwen2.5-0.5B is a 500M-parameter base model from Alibaba. I chose it because it fits comfortably in the 15GB VRAM on a free Colab T4 and is fast enough to run all three benchmarks in a single session. It is a base model rather than an instruction-tuned one, which is worth noting: the experiment primarily reflects the runtime, generation behaviour, and evaluation cost characteristics of the base model under standard benchmark workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  Cost Accounting
&lt;/h2&gt;

&lt;p&gt;Cost basis: Colab Pro at approximately $0.10/hr for a T4 session.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GSM8K&lt;/td&gt;
&lt;td&gt;46.52 min&lt;/td&gt;
&lt;td&gt;$0.0775&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;HellaSwag&lt;/td&gt;
&lt;td&gt;23.67 min&lt;/td&gt;
&lt;td&gt;$0.0394&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TruthfulQA-MC2&lt;/td&gt;
&lt;td&gt;0.97 min&lt;/td&gt;
&lt;td&gt;$0.0016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;71.16 min&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;$0.1185&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
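
&lt;p&gt;The cost column is just wall-clock time multiplied by the hourly rate. A minimal sketch of that arithmetic, assuming the flat $0.10/hr rate stated above (the helper name &lt;code&gt;cost_usd&lt;/code&gt; is illustrative):&lt;/p&gt;

```python
# Assumed flat rate for a T4 session, taken from the cost basis above.
RATE_USD_PER_HOUR = 0.10

# Measured wall-clock durations from the table, in minutes.
durations_min = {"GSM8K": 46.52, "HellaSwag": 23.67, "TruthfulQA-MC2": 0.97}


def cost_usd(minutes, rate_per_hour=RATE_USD_PER_HOUR):
    """Convert a duration in minutes to dollars at an hourly rate."""
    return minutes / 60 * rate_per_hour


costs = {task: cost_usd(minutes) for task, minutes in durations_min.items()}
total_minutes = sum(durations_min.values())
total_cost = cost_usd(total_minutes)

for task, cost in costs.items():
    print(f"{task}: {durations_min[task]:.2f} min, ${cost:.4f}")
print(f"Total: {total_minutes:.2f} min, ${total_cost:.4f}")
```

&lt;p&gt;The total computed from the unrounded durations can differ in the last digit from the sum of the rounded per-task costs shown in the table.&lt;/p&gt;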




&lt;h2&gt;
  
  
  Published Scores
&lt;/h2&gt;

&lt;p&gt;Free Colab T4 sessions kept disconnecting mid-run: GSM8K's generative evaluation took 46+ minutes per run, and the session limit hit before clean results could be saved. Rather than spending another week on infrastructure, I'm using the official numbers from the Qwen2.5 technical report, which uses the same lm-evaluation-harness with matching few-shot settings.&lt;br&gt;
The runtime and cost figures are from the actual runs. Scores are from the Qwen2.5 technical report (Qwen Team, 2025), cited explicitly because my own runs had an extraction bug.&lt;br&gt;
Total cost: $0.1185 across all three tasks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Task&lt;/th&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Score&lt;/th&gt;
&lt;th&gt;Duration (min)&lt;/th&gt;
&lt;th&gt;Cost ($)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;HellaSwag (10-shot)&lt;/td&gt;
&lt;td&gt;acc_norm&lt;/td&gt;
&lt;td&gt;0.521&lt;/td&gt;
&lt;td&gt;23.67&lt;/td&gt;
&lt;td&gt;$0.0394&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TruthfulQA (0-shot)&lt;/td&gt;
&lt;td&gt;mc2&lt;/td&gt;
&lt;td&gt;0.402&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;$0.0016&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GSM8K (5-shot, 25% subset)&lt;/td&gt;
&lt;td&gt;exact_match&lt;/td&gt;
&lt;td&gt;0.416&lt;/td&gt;
&lt;td&gt;46.52&lt;/td&gt;
&lt;td&gt;$0.0775&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;p&gt;A few things worth being honest about before drawing conclusions from these numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Contamination.&lt;/strong&gt; Qwen's training data composition is not fully disclosed. Any of these benchmarks could have been in the pretraining mix, which would inflate scores. There is no way to verify this from the outside.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact match undercounts GSM8K.&lt;/strong&gt; A model that produces the right reasoning but formats the final answer differently, writing "42 dollars" instead of "42", gets marked wrong. The real accuracy is likely slightly higher than the number reported.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt sensitivity.&lt;/strong&gt; Benchmark scores can shift meaningfully with different few-shot examples or prompt formatting. The numbers here are specific to the default harness prompt templates.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Would Do Differently
&lt;/h2&gt;

&lt;p&gt;Running a single model against three benchmarks gives you a snapshot, not a story. The more interesting experiment is running the same benchmarks against multiple checkpoints (say, the base model, a LoRA fine-tune, and a DPO fine-tune) and measuring the delta. That is what weeks 13+ will set up.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Results and notebook committed to &lt;code&gt;lm_eval_harness&lt;/code&gt; in my GitHub repository.&lt;/em&gt;&lt;br&gt;
&lt;a href="https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026" rel="noopener noreferrer"&gt;https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Reproducing Chinchilla Scaling on a Budget</title>
      <dc:creator>Thokozani Buthelezi</dc:creator>
      <pubDate>Sat, 02 May 2026 11:58:58 +0000</pubDate>
      <link>https://dev.to/thokozani_buthelezi_2cd41/reproducing-chinchilla-scaling-on-a-budget-227b</link>
      <guid>https://dev.to/thokozani_buthelezi_2cd41/reproducing-chinchilla-scaling-on-a-budget-227b</guid>
      <description>&lt;p&gt;Training a 70B parameter model costs millions of dollars. Scaling laws exist so you don't have to guess how to spend that budget. Here's what I learned reproducing them on a free GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Scaling laws are basically rules that tell us how model performance improves as you increase quantities such as model size, dataset size, and compute.&lt;/p&gt;

&lt;p&gt;Instead of guessing "bigger models = better", scaling laws give a mathematical relationship between:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;model size (N, number of parameters)&lt;/li&gt;
&lt;li&gt;dataset size (D, number of tokens) &lt;/li&gt;
&lt;li&gt;compute (C, number of training FLOPs)&lt;/li&gt;
&lt;li&gt;loss (L, how wrong the model is)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;the core idea&lt;/strong&gt; &lt;br&gt;


&lt;/p&gt;
&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;&lt;span class="katex"&gt;&lt;span class="katex-mathml"&gt;L(N,D)=ANα+BDβ+E
L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E
&lt;/span&gt;&lt;span class="katex-html"&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;L&lt;/span&gt;&lt;span class="mopen"&gt;(&lt;/span&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="mpunct"&gt;,&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;D&lt;/span&gt;&lt;span class="mclose"&gt;)&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mrel"&gt;=&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;N&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;α&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;A&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span 
class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mopen nulldelimiter"&gt;&lt;/span&gt;&lt;span class="mfrac"&gt;&lt;span class="vlist-t vlist-t2"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;D&lt;/span&gt;&lt;span class="msupsub"&gt;&lt;span class="vlist-t"&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="sizing reset-size6 size3 mtight"&gt;&lt;span class="mord mtight"&gt;&lt;span class="mord mathnormal mtight"&gt;β&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="frac-line"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span&gt;&lt;span class="pstrut"&gt;&lt;/span&gt;&lt;span class="mord"&gt;&lt;span class="mord mathnormal"&gt;B&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-s"&gt;​&lt;/span&gt;&lt;/span&gt;&lt;span class="vlist-r"&gt;&lt;span class="vlist"&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mclose nulldelimiter"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;span class="mbin"&gt;+&lt;/span&gt;&lt;span class="mspace"&gt;&lt;/span&gt;&lt;/span&gt;&lt;span class="base"&gt;&lt;span class="strut"&gt;&lt;/span&gt;&lt;span class="mord mathnormal"&gt;E&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;This looks intimidating but it's simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;increasing N (model size) -&amp;gt; loss goes down&lt;/li&gt;
&lt;li&gt;increasing D (data) -&amp;gt; loss goes down&lt;/li&gt;
&lt;li&gt;but both have &lt;strong&gt;diminishing returns&lt;/strong&gt; because of the scaling exponents (α, β)&lt;/li&gt;
&lt;li&gt;E is the irreducible loss: the entropy of the data itself, which no amount of parameters or tokens can remove&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The relationship between the loss and these quantities is not linear, it is a power law.&lt;/p&gt;
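
&lt;p&gt;To see the diminishing returns numerically, here is a small sketch that evaluates the formula above. The constants are the fitted values reported by Hoffmann et al. (2022) for their own models and data; they are borrowed purely for illustration and were not fitted to the runs in this post. The function name &lt;code&gt;chinchilla_loss&lt;/code&gt; is my own.&lt;/p&gt;

```python
# Fitted constants reported by Hoffmann et al. (2022); illustrative only,
# not fitted to the experiments described in this post.
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params, n_tokens):
    """Parametric loss L(N, D) = A / N^alpha + B / D^beta + E."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


# Hold the dataset fixed and grow the model by 10x at each step.
tokens = 1e9
sizes = [1e6, 1e7, 1e8, 1e9]
losses = [chinchilla_loss(n, tokens) for n in sizes]

# Loss improvement bought by each successive 10x increase in parameters.
gains = [losses[i] - losses[i + 1] for i in range(len(losses) - 1)]

for n, loss in zip(sizes, losses):
    print(f"N={n:.0e}  L={loss:.3f}")
print("gain per 10x in N:", [round(g, 3) for g in gains])
```

&lt;p&gt;Each tenfold increase in N buys a smaller drop in loss than the one before it, and the loss never falls below E.&lt;/p&gt;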

&lt;h2&gt;
  
  
  The Kaplan vs Chinchilla disagreement
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Kaplan said scale model size faster than dataset size&lt;/li&gt;
&lt;li&gt;Chinchilla said scale both equally&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;why did they disagree?&lt;/strong&gt;&lt;br&gt;
Three experimental assumptions in Kaplan's setup led to the conclusion that model size should be scaled faster:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the use of non-embedding parameters only when scaling&lt;/li&gt;
&lt;li&gt;undertraining of large models &lt;/li&gt;
&lt;li&gt;omission of the offset term in the compute-loss form&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Chinchilla corrects each of these factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;all of the model's parameters are counted&lt;/li&gt;
&lt;li&gt;models are fully trained to the compute-optimal point&lt;/li&gt;
&lt;li&gt;the offset term is included in the compute-loss form&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;one clean takeaway&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kaplan didn’t “get it wrong”; the setup just made model scaling look more effective than it actually is.&lt;br&gt;
Chinchilla corrected the setup and revealed the true balance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The experiment
&lt;/h2&gt;

&lt;p&gt;To reconstruct the Chinchilla scaling, I used three models of different sizes (786K, 4M, and 25M parameters) on the same dataset, &lt;strong&gt;WikiText-2&lt;/strong&gt;, with the same training budget of 500 steps each on a T4 GPU on Google Colab. Every 50 steps I logged the validation loss and total FLOPs consumed. Here's what the data showed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkny13y432u1r7b6k0xf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbkny13y432u1r7b6k0xf.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75q1sfjx4to384bzhp6g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F75q1sfjx4to384bzhp6g.png" alt=" " width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph 1: Validation Loss vs Training Steps&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This graph shows what happens as you let the models practice over time.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;em&gt;size matters immediately&lt;/em&gt;: even at the very first step, the larger model (green) starts with a much lower error than the small model (blue)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;the "Head Start" effect&lt;/em&gt;: the large model's starting point is actually better than the small model's finishing point. This shows that having more parameters makes a model inherently more capable.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;plateauing&lt;/em&gt;: all three lines curve and flatten out. This represents the &lt;u&gt;diminishing returns&lt;/u&gt;: the longer you train a model, the harder it becomes to extract extra accuracy from it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Graph 2: Loss vs Compute (Log-Log Scale)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the "Power Law" graph. By plotting the data on a log-log scale, the curves become straight lines.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;predictable progress&lt;/em&gt;: because these lines are roughly straight, researchers can take the slope measured on small models and estimate how much more compute they need to reach a specific performance level.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;efficiency gains&lt;/em&gt;: notice how the green dots(large model) extend further to the right. To get the lowest loss on the chart, you must use the large model; the small model simply doesn't have the capacity to get that "smart", no matter how much compute you throw at it.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;the slope (-α)&lt;/em&gt;: the legend shows "slopes" like -0.136. This is the scaling exponent that tells us the "exchange rate" between spending more money on GPUs and getting a smarter AI.&lt;/li&gt;
&lt;/ul&gt;
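
&lt;p&gt;The slope in the legend comes from fitting a straight line to the points in log-log space. Here is a minimal sketch of that fit using plain least squares. The data points are synthetic, generated from a power law with exponent -0.136 and an arbitrary constant of 50, and are not my measured results; &lt;code&gt;fit_power_law&lt;/code&gt; is an illustrative helper, not the code from the repository.&lt;/p&gt;

```python
import math


def fit_power_law(compute, loss):
    """Least-squares line through (log10 C, log10 L); returns (slope, intercept)."""
    xs = [math.log10(c) for c in compute]
    ys = [math.log10(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic points generated from L = 50 * C^-0.136 (not measured data).
compute = [1e11, 1e12, 1e13, 1e14]
loss = [50 * c ** -0.136 for c in compute]

slope, intercept = fit_power_law(compute, loss)
print(f"slope = {slope:.3f}")

# Extrapolate the fitted line to a compute budget outside the fitted range.
predicted = 10 ** (intercept + slope * math.log10(1e15))
print(f"predicted loss at 1e15 FLOPs = {predicted:.3f}")
```

&lt;p&gt;On real runs the points only approximate a line, so an extrapolation like this is an estimate, and it gets less reliable the further it reaches beyond the measured range.&lt;/p&gt;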

&lt;p&gt;&lt;strong&gt;The Big Picture&lt;/strong&gt;&lt;br&gt;
Together, these graphs show that scaling isn't random. If you want a smarter AI you don't just guess; you use these straight lines to estimate how many parameters and how much compute you need to reach your goal.&lt;/p&gt;

&lt;p&gt;Full code and results are on my GitHub: &lt;a href="https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026" rel="noopener noreferrer"&gt;https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Next:&lt;/strong&gt; I'll be running lm-evaluation-harness across all three model sizes and analysing what benchmarks like HellaSwag and GSM8K actually measure and where they mislead.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>deeplearning</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
