Shrijith Venkatramana

What Actually Happens When You Train an LLM? Following the First 12 Hours of the Original Transformer

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


In 2017, eight NVIDIA P100 GPUs sat in a Google data center for about twelve hours.

During those twelve hours, they repeatedly did something almost embarrassingly simple.

They picked up a batch of sentences.

Made predictions.

Measured how wrong those predictions were.

Adjusted tens of millions of numbers.

Then did it again.

Exactly 100,000 times.

Those twelve hours produced the base Transformer model described in Attention Is All You Need. The larger version trained for 300,000 optimization steps, taking roughly 3.5 days on the same hardware.

Today, frontier LLMs train on tens of thousands of GPUs for weeks, but if you inspected the training logs, the core loop would still look remarkably familiar.

This article follows that loop.

We'll watch one training run unfold, from raw text on disk to a model that can translate languages, and along the way learn a bit about each intimidating term that comes up when training a Transformer model.

8:00 AM — Nothing Has Been Learned Yet

Imagine switching on the machine.

The Transformer knows absolutely no language. It doesn't know English or German. It doesn't know grammar. It doesn't even have a conception of what a "word" is.

Internally it contains millions of parameters—ordinary floating-point numbers initialized almost randomly.

The training data, however, already contains knowledge.

For the English-German task, the authors used the WMT 2014 dataset containing roughly 4.5 million sentence pairs.

A tiny sample might look like this:

English:
The meeting begins tomorrow.

German:
Das Treffen beginnt morgen.

Or

English:
The cat sat on the mat.

German:
Die Katze saß auf der Matte.

Notice what's missing.

Nobody wrote rules like

"Adjectives come before nouns."

or

"German verbs often appear at the end."

The only supervision is example after example after example.

The model's job is to discover those rules itself.

8:00:01 — The Computer Doesn't See Words

Before training starts, the text is transformed into something GPUs understand.

Integers.

The paper uses Byte Pair Encoding (BPE), a data-compression technique that Rico Sennrich and colleagues had adapted for machine translation about a year earlier.

Instead of storing every possible English word, BPE builds a vocabulary of common subword pieces.

For example,

unbelievable

might become

un
believ
able

Those pieces become IDs.

un      → 517
believ  → 10328
able    → 294

Why go through this trouble?

Imagine giving every English word its own entry.

You'd need hundreds of thousands of entries, and every new word—"ChatGPT", "Kubernetes", "DeepSeek"—would be unknown.

Subwords solve that elegantly.

Once the model understands "micro", "service" and "architecture", it already has much of what it needs to interpret "microservice architecture", even if it has never encountered the exact phrase before.

Modern tokenizers have evolved, but this basic idea remains.
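To make the merging idea concrete, here is a toy sketch of how BPE can learn its subword pieces. It is not the tokenizer the paper used: the function name `learn_bpe`, the tiny word list and the number of merges are all made up for illustration, and real implementations also handle word boundaries and far larger corpora.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merges from a {word: count} dict (toy version)."""
    # Start with every word split into single characters.
    vocab = {tuple(word): count for word, count in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent pair of symbols occurs.
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = {}
        for symbols, count in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + count
        vocab = new_vocab
    return merges, vocab

# Hypothetical mini-corpus: word -> how often it appears.
corpus = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(corpus, num_merges=2)
print(merges)
print(list(vocab))
```

After two merges the frequent ending "est" has become a single piece, shared by "newest" and "widest". That sharing is what lets a subword vocabulary cover words it never saw whole.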

8:00:02 — The First Batch Arrives

One beginner misconception is that the GPU trains on one sentence at a time.

That would waste almost all of its computational power.

GPUs are throughput machines.

They become efficient only when thousands of arithmetic units work simultaneously.

Instead, the Transformer paper groups examples into batches containing approximately

  • 25,000 source-language tokens
  • 25,000 target-language tokens

or about 50,000 tokens in total.

Think of a factory.

Running one car down an assembly line would be absurd.

Factories move hundreds of products simultaneously because keeping machines idle is expensive.

GPU training works the same way.

Batching is not a machine-learning trick.

It's operations optimization.
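Batching by token count rather than by sentence count can be sketched roughly like this. The paper says sentence pairs were grouped by approximate length. The greedy packing below, the function name `batch_by_tokens` and the tiny budgets are assumptions for illustration only.

```python
def batch_by_tokens(pairs, max_src_tokens, max_tgt_tokens):
    """Group (source_tokens, target_tokens) pairs under per-batch token budgets."""
    # Sorting by length puts similar-sized sentences together,
    # so less of each batch is wasted on padding.
    ordered = sorted(pairs, key=lambda p: (len(p[0]), len(p[1])))
    batches, current, src_total, tgt_total = [], [], 0, 0
    for src, tgt in ordered:
        over_budget = (src_total + len(src) > max_src_tokens
                       or tgt_total + len(tgt) > max_tgt_tokens)
        # Close the current batch before it would exceed either budget.
        # A single oversized pair still gets a batch of its own.
        if current and over_budget:
            batches.append(current)
            current, src_total, tgt_total = [], 0, 0
        current.append((src, tgt))
        src_total += len(src)
        tgt_total += len(tgt)
    if current:
        batches.append(current)
    return batches

# Made-up token-id lists; only their lengths matter here.
pairs = [([1] * 4, [1] * 5), ([1] * 3, [1] * 3), ([1] * 6, [1] * 6), ([1] * 2, [1] * 2)]
for batch in batch_by_tokens(pairs, max_src_tokens=10, max_tgt_tokens=10):
    print([(len(s), len(t)) for s, t in batch])
```

With budgets of 25,000 instead of 10, the same loop would produce batches of roughly the size the paper describes.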

8:00:02.4 — The Model Makes Its First Mistake

The paper reports about 0.4 seconds per training step for the base model, so the first forward pass takes only a fraction of that.

The model receives

The cat sat on the mat.

and produces...

garbage.

Maybe something equivalent to

House.

Tomorrow.

Blue.

Water.

That isn't failure.

It's exactly what we expect.

Every parameter was random only moments ago.

Now comes the crucial question.

How wrong was the prediction?

The answer is summarized by a single number called the loss.

Everything that follows exists solely to reduce that number.
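For a feel of what that number looks like at the very start, here is a small sketch of the usual cross-entropy loss for one token. It assumes an untrained model that spreads its probability evenly over the paper's shared vocabulary of about 37,000 subword tokens. A real randomly initialized network is only approximately uniform, so treat the result as a ballpark.

```python
import math

def cross_entropy(probs, target_id):
    """Negative log-probability the model assigned to the correct token."""
    return -math.log(probs[target_id])

vocab_size = 37_000  # assumption: the paper's ~37,000-token shared BPE vocabulary
uniform = [1.0 / vocab_size] * vocab_size

# With a uniform guess, every target id gives the same loss: ln(vocab_size).
initial_loss = cross_entropy(uniform, target_id=123)
print(round(initial_loss, 2))  # 10.52
```

A perfect prediction would score 0. Training is the long walk from roughly 10.5 down toward that.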


8:00:02.5 — Which of the 65 Million Parameters Was Responsible?

Suppose I asked you to tune an old radio using sixty-five million knobs.

After hearing static, which knob would you turn?

You wouldn't know.

Yet that's essentially the problem.

The Transformer base model contains about 65 million trainable parameters.

The larger model contains around 213 million.

Backpropagation solves this enormous credit-assignment problem.

Rather than saying

"Parameter #18,423 is wrong,"

it computes

"If this parameter increased slightly, would the loss increase or decrease?"

for every single parameter.

The result is a gigantic map of tiny suggested adjustments called gradients.
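The "nudge it and see" question can be answered literally on a tiny model, which is a handy way to see what a gradient is. The sketch below uses a made-up two-parameter model, not the Transformer. It compares the brute-force nudge with the chain-rule formula that backpropagation automates.

```python
def loss(params, x, y):
    """Squared error of a two-parameter model: prediction = w * x + b."""
    w, b = params
    return (w * x + b - y) ** 2

def analytic_grad(params, x, y):
    """Gradient from the chain rule, which is what backpropagation automates."""
    w, b = params
    err = w * x + b - y
    return [2 * err * x, 2 * err]

def numeric_grad(params, x, y, eps=1e-6):
    """Nudge each parameter up and down slightly and watch how the loss moves."""
    grads = []
    for i in range(len(params)):
        up = list(params)
        up[i] += eps
        down = list(params)
        down[i] -= eps
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2 * eps))
    return grads

params = [0.5, -1.0]
print(analytic_grad(params, x=3.0, y=2.0))  # [-9.0, -3.0]
print(numeric_grad(params, x=3.0, y=2.0))   # nearly the same values
```

Both gradients are negative, so increasing either parameter would lower the loss. The nudging approach needs two loss evaluations per parameter, which is hopeless at 65 million parameters. Backpropagation gets all of them in roughly one extra pass.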

Now another algorithm enters the story.

Adam: The Engineer Who Turns the Knobs

The paper uses the Adam optimizer with

  • β₁ = 0.9
  • β₂ = 0.98
  • ε = 10⁻⁹

These aren't arbitrary constants copied from Stack Overflow.

Adam remembers recent gradients, rather like giving the optimization process momentum.

Imagine descending a foggy mountain.

If every step depended only on the slope beneath your feet, you'd zigzag constantly.

Adam remembers where you've been heading over the past several steps, smoothing the journey downhill.

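Interestingly, the paper chose β₂ = 0.98 rather than the more familiar 0.999 found in many deep-learning libraries today. With 0.98, Adam's running average of squared gradients effectively covers the last 50 or so steps instead of the last 1,000, so the step size adapts more quickly when gradient magnitudes change. It is a small but deliberate engineering decision.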
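Here is a minimal sketch of the standard Adam update for a single scalar parameter, using the paper's β₁, β₂ and ε as defaults. The function name `adam_step`, the learning rate of 0.1 and the example gradients are illustrative assumptions. Real frameworks apply the same arithmetic to whole tensors at once.

```python
import math

def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.98, eps=1e-9):
    """One Adam update for a single scalar parameter; t counts steps from 1."""
    m = beta1 * m + (1 - beta1) * grad          # running average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for the zero start
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Two very different gradient sizes, same first step:
# Adam normalizes by the gradient's own scale, so both move by roughly lr = 0.1.
big, _, _ = adam_step(0.0, grad=-6.0, m=0.0, v=0.0, t=1, lr=0.1)
small, _, _ = adam_step(0.0, grad=-0.006, m=0.0, v=0.0, t=1, lr=0.1)
print(big, small)
```

The running averages `m` and `v` are the "memory" in the mountain analogy. They have to be stored for every parameter, which is why Adam costs extra memory on top of the weights themselves.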

Millions of parameters are nudged.

Tiny, incremental changes.

Often by less than one thousandth.

Then the next batch arrives.

9:00 AM — The Strange Equation Everyone Hates

One of the paper's most intimidating equations defines the learning rate.

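\[ \text{lrate} = d_{\text{model}}^{-0.5} \cdot \min\left(\text{step\_num}^{-0.5},\ \text{step\_num} \cdot \text{warmup\_steps}^{-1.5}\right) \]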

It looks frightening.

Its purpose is not.

Early in training, every parameter is effectively random.

Large updates can make optimization unstable.

So the authors warm up the learning rate over the first 4,000 optimization steps, increasing it linearly rather than starting at full speed.

After warmup, the learning rate shrinks in proportion to the inverse square root of the step number.

Imagine sanding a table.

At first you remove material aggressively.

Near the end you make tiny finishing passes.

Training behaves similarly.

The equation also contains the term \(d_{\text{model}}^{-0.5}\).

This compensates for model size.

As hidden representations become larger, gradients naturally change scale. Dividing by the square root of the model dimension helps keep parameter updates numerically well behaved as architectures grow.

The equation is just commonsense engineering.
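Written as code, the schedule is a one-liner. The sketch below assumes the base model's d_model of 512 and the 4,000 warmup steps mentioned above. The function name `transformer_lr` is just a label for this example.

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given optimization step (step counts from 1)."""
    # The first argument to min() is the decay branch,
    # the second is the linear warmup branch.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

for step in (1, 1000, 4000, 16000, 100000):
    print(step, f"{transformer_lr(step):.6f}")
```

The two branches inside `min` cross exactly at the warmup boundary, so with these settings the rate climbs to a peak of about 0.0007 at step 4,000. By step 16,000 it has already halved.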

Noon — Preventing the Model From Memorizing

If optimization only chased lower loss, the network could simply memorize the training data.

The paper deliberately makes learning harder.

First comes dropout.

In the base model, ten percent of the activations at each dropout point are randomly zeroed during training.

Every batch therefore sees a slightly different network.

No neuron can become indispensable.
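A rough sketch of dropout on a plain list of activations follows. It uses the "inverted" variant common in today's libraries, where surviving values are scaled up so nothing needs adjusting at inference time. The paper does not spell out this detail, so treat the rescaling and the function signature as assumptions.

```python
import random

def dropout(activations, p_drop=0.1, training=True, rng=random):
    """Inverted dropout: zero each value with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return list(activations)  # inference: the full network, untouched
    keep = 1.0 - p_drop
    # Dividing survivors by `keep` preserves the expected value of each activation.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
out = dropout([1.0] * 10_000, p_drop=0.1, rng=rng)
print(sum(1 for a in out if a == 0.0) / len(out))  # close to 0.1
```

Because the mask is redrawn on every call, each batch really does pass through a slightly different network.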

Second comes label smoothing with a value of 0.1.

Instead of pretending the correct next token has probability exactly one, the target distribution is softened slightly.

That sounds counterintuitive.

Yet translation quality improved.

Real language is messy.

There are often several acceptable translations.

Slight uncertainty produces a less overconfident model.
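Here is a small sketch of what a smoothed target looks like. It uses one common formulation, in which the 0.1 is spread uniformly over the entire vocabulary. Other variants spread it only over the incorrect tokens. The five-token vocabulary and the example predictions are made up.

```python
import math

def smoothed_targets(target_id, vocab_size, eps=0.1):
    """Spread eps of the probability mass uniformly over the whole vocabulary."""
    dist = [eps / vocab_size] * vocab_size
    dist[target_id] += 1.0 - eps  # the correct token keeps almost all the mass
    return dist

def cross_entropy(target_dist, predicted_probs):
    """Cross-entropy between a target distribution and the model's probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target_dist, predicted_probs) if t > 0)

targets = smoothed_targets(target_id=2, vocab_size=5)
print([round(p, 2) for p in targets])  # [0.02, 0.02, 0.92, 0.02, 0.02]

# An overconfident prediction now scores worse than one that matches the softened target.
overconfident = [0.00025, 0.00025, 0.999, 0.00025, 0.00025]
print(cross_entropy(targets, overconfident) > cross_entropy(targets, targets))  # True
```

With hard one-hot targets the overconfident prediction would have been rewarded. With smoothing, pushing a probability all the way to 1 costs the model something.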

8:00 PM — Twelve Hours Later

After roughly 100,000 optimization steps, the base model has finished training.

The larger model continues until 300,000 steps, taking approximately 3.5 days.

Each step processed about 50,000 tokens.

Back-of-the-envelope, that's around five billion token presentations during the base run—not unique tokens, but training exposures. The same examples are revisited across multiple passes through the dataset.
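The arithmetic is easy to check. This sketch multiplies out the step count, the approximate batch size and the roughly 0.4 seconds per step that the paper reports for the base model. All three inputs are rounded figures, so the results are estimates.

```python
steps = 100_000            # base-model optimization steps
tokens_per_step = 50_000   # about 25,000 source + 25,000 target tokens
seconds_per_step = 0.4     # approximate per-step time reported for the base model

total_tokens = steps * tokens_per_step
hours = steps * seconds_per_step / 3600

print(f"{total_tokens:,} token presentations")  # 5,000,000,000 token presentations
print(f"{hours:.1f} hours of step time")        # 11.1 hours of step time
```

About eleven hours of pure step time sits comfortably next to the twelve hours quoted for the whole run.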

The paper doesn't stop at reporting translation quality.

It also reports an estimate of each model's training cost in FLOPs.

That's significant.

Even in 2017, the authors understood that machine learning was becoming an engineering discipline constrained not only by accuracy, but also by computation.

A model that is 1% better but requires ten times more compute is often a poor engineering trade-off.

That thinking has only become more relevant.

Today, training an LLM is as much about distributed systems, networking, storage bandwidth, GPU utilization, checkpointing and failure recovery as it is about neural networks.

Closing Thoughts

People often remember Attention Is All You Need for introducing self-attention.

Equally important was something less glamorous: it demonstrated a training recipe that scaled.

Large batches kept GPUs busy.

Carefully designed learning-rate schedules stabilized optimization.

Adam made billions of tiny updates practical.

Regularization techniques prevented memorization.

None of these ideas are individually magical. Together, repeated hundreds of thousands of times, they turned random numbers into a model that could translate language.

Nearly a decade later, today's frontier LLMs still follow the same rhythm.

The numbers have changed by orders of magnitude.

The loop has not.

Load a batch. Predict. Measure the loss. Update the weights. Repeat.

That's the heartbeat of every modern language model.


AI agents write code fast. They also silently remove logic, change behavior, and introduce bugs, all without telling you. You often find out in production.

git-lrc fixes this. It hooks into git commit and reviews every diff before it lands. 60-second setup. Completely free.

Any feedback or contributors are welcome! It's online, source-available, and ready for anyone to use.
