Reproducing Chinchilla Scaling on a Budget

Thokozani Buthelezi — Sat, 02 May 2026 11:58:58 +0000

Training a 70B parameter model costs millions of dollars. Scaling laws exist so you don't have to guess how to spend that budget. Here's what I learned reproducing them on a free GPU.

Introduction

Scaling laws are empirical rules that describe how model performance improves as you increase model size, dataset size, and compute.

Instead of guessing "bigger models = better", scaling laws give a mathematical relationship between:

model size (N, number of parameters)
dataset size (D, number of tokens)
compute (C, number of training FLOPs)
loss (L, how wrong the model is)

The core idea

L(N, D) = \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}} + E

This looks intimidating but it's simple:

increasing N (model size) -> loss goes down
increasing D (data) -> loss goes down
both have diminishing returns, at a rate set by the scaling exponents (α, β)
E is the irreducible loss, the entropy of the data itself, which no amount of parameters or tokens removes

The relationship between the loss and these quantities is not linear; it follows a power law.
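To make the diminishing returns concrete, here is a minimal sketch that evaluates the formula above. The constants are the fitted values reported in the Chinchilla paper (Hoffmann et al., 2022); they belong to that paper's data and tokenizer, not to my runs, so treat the numbers as illustration only. The function name `parametric_loss` is my own.

```python
# Chinchilla-style parametric loss: L(N, D) = A / N**alpha + B / D**beta + E
# Constants: fitted values reported by Hoffmann et al. (2022), used here
# purely for illustration. They were NOT fitted on the runs in this post.
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28


def parametric_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for n_params parameters trained on n_tokens tokens."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


if __name__ == "__main__":
    tokens = 1e9  # hold the data fixed and grow the model 10x at a time
    previous = None
    for n in (1e6, 1e7, 1e8, 1e9):
        loss = parametric_loss(n, tokens)
        gain = "" if previous is None else f"  (improvement: {previous - loss:.3f})"
        print(f"N={n:.0e}  L={loss:.3f}{gain}")
        previous = loss
```

Each 10x increase in N buys a smaller drop in loss than the one before it, and the loss never goes below E.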

The Kaplan vs Chinchilla disagreement

Kaplan et al. (2020) said: as compute grows, scale model size faster than dataset size
Chinchilla (Hoffmann et al., 2022) said: scale both roughly equally

Why did they disagree?
Three experimental choices in Kaplan's setup led to the conclusion that model size should be scaled faster:

the use of non-embedding parameters only when scaling
undertraining of large models
omission of the offset term in the compute-loss form

Chinchilla corrects each of these:

all of the model's parameters are counted
models are trained to the compute-optimal point
the offset term is included in the compute-loss form

One clean takeaway

Kaplan didn’t “get it wrong”; the setup just made model scaling look more effective than it actually is.
Chinchilla corrected the setup and revealed the true balance.
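The "balance" can be checked numerically. The sketch below assumes the common approximation C ≈ 6ND for training FLOPs and reuses the Chinchilla paper's published constants for L(N, D); for each compute budget it grid-searches the model size that minimises the loss, then measures how fast the optimal N and D grow with C. The helper names `optimal_split` and `fitted_exponents` are hypothetical, and none of this is fitted on my own runs.

```python
import numpy as np

# Fitted constants reported by Hoffmann et al. (2022); illustration only.
A, B, E = 406.4, 410.7, 1.69
ALPHA, BETA = 0.34, 0.28


def parametric_loss(n_params, n_tokens):
    """L(N, D) = A / N**alpha + B / D**beta + E (works on scalars or arrays)."""
    return A / n_params**ALPHA + B / n_tokens**BETA + E


def optimal_split(flops, grid_points=2001):
    """Grid-search the N that minimises loss when D = flops / (6 * N)."""
    n_grid = np.logspace(5, 12, grid_points)   # candidate model sizes
    d_grid = flops / (6.0 * n_grid)            # tokens affordable at each size
    best = int(np.argmin(parametric_loss(n_grid, d_grid)))
    return n_grid[best], d_grid[best]


def fitted_exponents(budgets):
    """Slopes of log N_opt and log D_opt against log C."""
    n_opt, d_opt = zip(*(optimal_split(c) for c in budgets))
    log_c = np.log(budgets)
    a = np.polyfit(log_c, np.log(n_opt), 1)[0]
    b = np.polyfit(log_c, np.log(d_opt), 1)[0]
    return a, b


if __name__ == "__main__":
    a, b = fitted_exponents(np.logspace(17, 22, 6))
    print(f"N_opt grows like C^{a:.2f}, D_opt grows like C^{b:.2f}")
```

With these constants the two exponents come out close to each other (analytically β/(α+β) and α/(α+β)), which is the "scale both roughly equally" result. Kaplan's reported exponent for model size was about 0.73, which is where the disagreement shows up.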

The experiment

To reconstruct the Chinchilla scaling behaviour, I trained three models of different sizes (786K, 4M, and 25M parameters) on the same dataset, WikiText-2, for the same number of steps. Each model ran for 500 steps on a T4 GPU in Google Colab. Every 50 steps I logged the validation loss and the total FLOPs consumed. Here's what the data showed.
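For the FLOPs column, one common way to estimate training compute is the C ≈ 6ND rule of thumb (6 FLOPs per parameter per token). The sketch below shows that bookkeeping under stated assumptions: the batch size and sequence length are hypothetical placeholders, and the exact accounting used in the repository may differ.

```python
def training_flops(n_params: int, tokens_seen: int) -> float:
    """Approximate training compute with the C ~= 6 * N * D rule of thumb."""
    return 6.0 * n_params * tokens_seen


# Hypothetical values: placeholders chosen for illustration only.
BATCH_SIZE = 32
SEQ_LEN = 128
LOG_EVERY = 50
TOTAL_STEPS = 500


def flops_schedule(n_params: int):
    """Return (step, cumulative FLOPs) pairs at every logging step."""
    tokens_per_step = BATCH_SIZE * SEQ_LEN
    return [
        (step, training_flops(n_params, step * tokens_per_step))
        for step in range(LOG_EVERY, TOTAL_STEPS + 1, LOG_EVERY)
    ]


if __name__ == "__main__":
    for name, n in (("786K", 786_000), ("4M", 4_000_000), ("25M", 25_000_000)):
        step, flops = flops_schedule(n)[-1]
        print(f"{name}: {flops:.2e} FLOPs after {step} steps")
```

Under this approximation, models trained for the same number of steps on the same batches consume FLOPs in proportion to their parameter count, so the largest model ends up furthest along the compute axis.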

Graph 1: Validation Loss vs Training Steps

This graph shows what happens to the loss as the models train.

size matters immediately: even at the first logged step, the larger model (green) has a much lower loss than the small model (blue)
the "Head Start" effect: the large model's starting point is actually better than the small model's finishing point. This suggests that, in this setup, having more parameters makes a model more capable per training step.
plateauing: all three lines curve and flatten out. This is diminishing returns: the longer you train a model, the harder it becomes to squeeze further loss reductions out of it.

Graph 2: Loss vs Compute (Log-Log Scale)

This is the "Power Law" graph. On a log-log scale a power law shows up as a straight line, so the curves become approximately straight.

predictable progress: because these lines are close to straight, researchers can take the slope measured on small models and estimate how much more compute is needed to reach a given loss.
efficiency gains: notice how the green dots (large model) extend further to the right. To get the lowest loss on the chart, you must use the large model; in these runs the small model never reaches it, because its loss flattens out at a higher floor.
the slope (-α): the legend shows "slopes" like -0.136. This is the scaling exponent of loss against compute, the "exchange rate" between spending more money on GPUs and getting a lower loss.
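A slope like the one in the legend can be obtained with a straight-line fit in log space. This is a minimal sketch using NumPy's `polyfit`; the data is synthetic (generated from a known power law with a made-up prefactor of 60) and only stands in for the logged (FLOPs, loss) pairs. `fit_power_law` is a hypothetical helper name.

```python
import numpy as np


def fit_power_law(compute, loss):
    """Least-squares line through (log C, log L); returns (slope, prefactor)."""
    slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
    return float(slope), float(np.exp(intercept))


if __name__ == "__main__":
    # Synthetic points on L = 60 * C**-0.136, NOT the measured results.
    compute = np.logspace(11, 14, 10)
    loss = 60.0 * compute**-0.136
    slope, prefactor = fit_power_law(compute, loss)
    print(f"slope = {slope:.3f}, prefactor = {prefactor:.1f}")
```

On real logs the points will not sit exactly on a line, so it is worth plotting the fit over the data before trusting the slope.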

The Big Picture
Together, these graphs suggest that scaling isn't random. If you want a smarter AI you don't just guess; you use these straight lines to estimate how many parameters and how much compute you need to reach your goal.
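As a rough sketch of that estimate: if the fitted line is L = k * C^s with a negative slope s, it can be inverted to get the compute needed for a target loss. The slope below is the -0.136 from the legend; the prefactor 60 is a made-up placeholder, and `compute_for_target` is a hypothetical helper.

```python
def compute_for_target(target_loss: float, slope: float, prefactor: float) -> float:
    """Invert L = prefactor * C**slope to get the compute needed for target_loss."""
    if slope >= 0:
        raise ValueError("slope must be negative for a decreasing power law")
    if target_loss <= 0 or prefactor <= 0:
        raise ValueError("target_loss and prefactor must be positive")
    return (target_loss / prefactor) ** (1.0 / slope)


if __name__ == "__main__":
    slope, prefactor = -0.136, 60.0  # prefactor is a placeholder, not a measured value
    now = compute_for_target(5.0, slope, prefactor)
    later = compute_for_target(4.5, slope, prefactor)  # 10% lower loss
    print(f"extra compute factor for a 10% lower loss: {later / now:.2f}x")
```

With a slope of -0.136, a 10% lower loss costs a little over twice the compute. A pure power law ignores the irreducible term E, so extrapolating far beyond the measured range will be optimistic.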

Full code and results are on my GitHub: https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026

What's next: I'll be running lm-evaluation-harness across all three model sizes and analysing what benchmarks like HellaSwag and GSM8K actually measure and where they mislead.
