DEV Community

Shrijith Venkatramana

Multi-Step Learning Rate Schedulers in LLM Training: Why Some Teams Are Moving Beyond Cosine Decay

Hello, I'm Shrijith Venkatramana. I'm building git-lrc, an AI code reviewer that runs on every commit. Star Us to help devs discover the project. Do give it a try and share your feedback for improving the product.


Training modern Large Language Models is expensive.

When a single training run can consume millions of GPU hours, even small optimization decisions become important. Most developers focus on model architecture, dataset quality, and scaling laws. Yet one of the most influential knobs in training is surprisingly simple:

How should the learning rate change over time?

For years, cosine decay has been the default answer. But many recent LLM projects have quietly adopted an alternative: the multi-step learning rate scheduler.

What's interesting is that the reason isn't necessarily better final accuracy.

It's that multi-step schedules make something else much easier: continuing training later without wasting previous compute.

Let's explore why.

The Learning Rate Is the Model's Step Size

A neural network learns by repeatedly adjusting its parameters.

The learning rate controls the size of those adjustments.

Imagine hiking toward a destination in dense fog:

  • Too large a step → you overshoot repeatedly.
  • Too small a step → progress becomes painfully slow.
  • A well-chosen step size gets you there efficiently.

During LLM training, we rarely keep the learning rate constant.

Instead, we start with a relatively large learning rate to make rapid progress and gradually reduce it as training converges.

The scheduler determines how that reduction happens.
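To make the step-size intuition concrete, here is a minimal sketch of plain gradient descent on the toy function f(x) = x². The function, the starting point, and the three learning rates are illustrative assumptions, not values from any real training run.

```python
def gradient_descent(lr, steps=20, x0=10.0):
    """Minimize f(x) = x**2 with plain gradient descent; return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x      # derivative of x**2
        x = x - lr * grad   # the learning rate scales every update
    return x

too_large = gradient_descent(lr=1.1)    # each step overshoots, so |x| grows
too_small = gradient_descent(lr=0.001)  # barely moves away from x0
reasonable = gradient_descent(lr=0.3)   # shrinks quickly toward the minimum at 0
```

With these toy settings the large rate diverges, the tiny rate stays close to where it started, and the middle rate lands near zero, which mirrors the three cases in the hiking analogy.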

The Popular Choice: Cosine Decay

The most common scheduler in modern LLM training is cosine decay.

The learning rate follows a smooth curve:

[Figure: learning rate under cosine decay, falling smoothly from its peak to a small final value]

At the beginning, the learning rate is high.

As training progresses, it smoothly decreases until reaching a very small value near the end.

Visually, it looks like a gently descending hill.
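One common way to write this curve is the half-cosine formula below. This is a sketch of the standard form, not necessarily the exact variant any particular project uses; `peak_lr`, `min_lr`, and `total_steps` are names chosen here for illustration.

```python
import math

def cosine_lr(step, total_steps, peak_lr, min_lr=0.0):
    """Cosine decay from peak_lr at step 0 down to min_lr at total_steps."""
    progress = min(step / total_steps, 1.0)  # fraction of the run completed
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

start = cosine_lr(0, 1000, peak_lr=1e-4)       # at the peak
middle = cosine_lr(500, 1000, peak_lr=1e-4)    # about half the peak
end = cosine_lr(1000, 1000, peak_lr=1e-4)      # at min_lr
```

Note that `total_steps` appears inside the formula: the shape of the whole curve depends on knowing where training ends.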

Why is cosine decay popular?

  • Simple
  • Stable
  • Works across many model sizes
  • Requires little tuning

For years, it has been the default choice for transformer training.

However, it has an important limitation.

The Problem with Cosine Decay in Continual Training

Suppose your original plan was:

  • Train for 1 trillion tokens
  • Stop
  • Evaluate results

A month later you decide:

Let's continue training for another trillion tokens.

With cosine decay, things become awkward.

The scheduler assumed training would end at a specific point.

By the time you reach that endpoint, the learning rate has already decayed to its minimum, typically zero or a small fraction of the peak.

Extending training now requires redesigning the schedule.

You must decide:

  • Restart the scheduler?
  • Stretch the curve?
  • Create a new decay function?

Each choice changes optimization behavior.

This becomes increasingly inconvenient for organizations that frequently expand training runs as new compute becomes available.
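The awkwardness can be seen numerically. The sketch below assumes a hypothetical run planned for 1,000,000 steps with a peak of 1e-4 and a floor of 1e-5, then asks what the "stretch the curve" option would prescribe at the point where the original run actually stopped.

```python
import math

def cosine_lr(step, total_steps, peak_lr=1e-4, min_lr=1e-5):
    """Cosine decay from peak_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

ORIGINAL_STEPS = 1_000_000   # hypothetical original horizon
EXTENDED_STEPS = 2_000_000   # hypothetical doubled horizon

# LR the finished checkpoint was actually trained with at its last step.
lr_where_run_stopped = cosine_lr(ORIGINAL_STEPS, ORIGINAL_STEPS)

# LR the stretched curve prescribes at that same step.
lr_if_stretched = cosine_lr(ORIGINAL_STEPS, EXTENDED_STEPS)

jump = lr_if_stretched / lr_where_run_stopped  # how far the LR must leap back up
```

Under these assumptions the stretched curve asks for a learning rate several times higher than the one the checkpoint ended on, so resuming means an abrupt jump that the original run never experienced.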

Enter Multi-Step Learning Rate Scheduling

A multi-step scheduler divides training into distinct phases.

Instead of continuously decreasing the learning rate, it stays constant for long periods and then drops abruptly.

For example:

Training Progress    Learning Rate
0% - 80%             1.0×
80% - 90%            0.1×
90% - 100%           0.01×

Rather than a smooth curve, the graph resembles a staircase.

The learning rate remains fixed during a stage and changes only at predefined milestones.

Conceptually:

Stage 1: High LR
───────────────

Stage 2: Lower LR
        ─────────

Stage 3: Very Low LR
                 ───

Some recent LLM efforts use schedules resembling an 80% / 10% / 10% split of the training budget.

Most computation happens in the first phase, while later phases act as refinement stages.
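The staircase from the example table can be written as a small function of training progress. This is a minimal sketch using the 80% / 10% / 10% example values; the function name is made up for illustration.

```python
def multistep_scale(progress):
    """LR multiplier for training progress in [0, 1], using the 80/10/10 example."""
    if progress < 0.8:
        return 1.0    # Stage 1: full learning rate
    if progress < 0.9:
        return 0.1    # Stage 2: first drop
    return 0.01       # Stage 3: second drop

# Sample the schedule at a few points along the run.
samples = [multistep_scale(p) for p in (0.0, 0.5, 0.8, 0.85, 0.9, 1.0)]
```

Unlike a cosine curve, the value during Stage 1 does not depend on how long the run will eventually be, only on whether the first milestone has been reached.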

Why Multi-Step Schedulers Work Surprisingly Well

At first glance, abrupt drops seem less elegant than smooth cosine decay.

Yet empirical results often show something surprising:

Final model quality is frequently very similar.

Why?

Because optimization is dominated by the large early stage.

Most useful learning occurs when:

  • The learning rate is relatively high
  • The model is far from convergence
  • Large parameter updates are still beneficial

The later stages mainly fine-tune the model.

Whether the transition between stages is smooth or abrupt often matters less than developers expect.

In practice, many teams observe nearly identical downstream performance between:

  • Cosine schedules
  • Carefully tuned multi-step schedules

This makes the operational advantages of multi-step scheduling very attractive.

The Hidden Advantage: Reusing Training Compute

This is where multi-step scheduling becomes especially interesting for large-scale training.

Imagine training proceeds like this:

Phase      Tokens Trained
Stage 1    800B
Stage 2    100B
Stage 3    100B

Now suppose additional funding or GPU capacity appears.

Instead of redesigning the entire learning-rate trajectory, you can simply:

  1. Extend Stage 1
  2. Continue training at the same learning rate
  3. Delay later stage transitions

The optimization process remains consistent.

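The expensive computation already completed during Stage 1 remains fully reusable, provided a checkpoint from the end of Stage 1 (before the first drop) was saved: the extended run resumes from that checkpoint rather than from the final, fully decayed model.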

This flexibility becomes valuable when:

  • Scaling budgets change
  • New datasets arrive
  • Training targets evolve
  • Additional compute becomes available unexpectedly

For frontier model teams, this operational convenience can outweigh theoretical elegance.
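A small sketch of why the Stage 1 compute carries over. It compares the 1T-token plan from the table with a hypothetical 2T-token plan that keeps the same 80/10/10 split; the peak learning rate, the drop factor, and the sampled token counts are illustrative assumptions.

```python
def multistep_lr(tokens_seen, milestones, peak_lr=1e-4, gamma=0.1):
    """Multiply peak_lr by gamma once for every milestone already passed."""
    drops = sum(1 for m in milestones if tokens_seen >= m)
    return peak_lr * gamma ** drops

B = 10**9  # one billion tokens

original_plan = [800 * B, 900 * B]     # 1T-token run from the table
extended_plan = [1600 * B, 1800 * B]   # hypothetical 2T-token run, same split

# Every point inside the original Stage 1 has the same LR under both plans,
# so a checkpoint from that stage is a valid starting point for the longer run.
stage_1_points = [0, 200 * B, 500 * B, 799 * B]
stage_1_matches = all(
    multistep_lr(t, original_plan) == multistep_lr(t, extended_plan)
    for t in stage_1_points
)

# Past 800B tokens the plans diverge: the original has dropped, the extension has not.
lr_original_at_850b = multistep_lr(850 * B, original_plan)
lr_extended_at_850b = multistep_lr(850 * B, extended_plan)
```

The two plans only disagree after the original first milestone, which is why extending the run means moving the milestones rather than redesigning the curve.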

Implementing a Multi-Step Scheduler

A simplified PyTorch example looks like this:

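import torch

# Stand-in model so the snippet runs on its own; substitute the real network.
model = torch.nn.Linear(16, 16)

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4  # peak learning rate
)

# Milestones count calls to scheduler.step(), so call it once per optimizer step.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer,
    milestones=[800000, 900000],  # steps at which the LR drops
    gamma=0.1  # each drop multiplies the LR by 0.1
)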

This configuration means:

  • Learning rate = 1e-4 initially
  • At step 800,000 → drop by 10×
  • At step 900,000 → drop by another 10×

Result:

Phase          LR
0 - 800k       1e-4
800k - 900k    1e-5
900k+          1e-6

Real LLM training systems usually include:

  • Warmup stages
  • More sophisticated milestone selection
  • Token-based rather than step-based scheduling
  • Distributed optimizer considerations

But the core idea remains the same.
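As a rough illustration of two of those additions, the sketch below combines a linear warmup with a staircase keyed on tokens consumed rather than optimizer steps. The warmup length, stage boundaries, and function name are hypothetical choices made for this example, not settings taken from any specific training system.

```python
def lr_at(tokens_seen, total_tokens, peak_lr=1e-4,
          warmup_tokens=2_000_000_000,
          stages=((0.8, 1.0), (0.9, 0.1), (1.0, 0.01))):
    """Linear warmup, then a staircase keyed on the fraction of tokens consumed.

    Each entry in stages is (end_fraction, lr_multiplier).
    """
    if tokens_seen < warmup_tokens:
        # Ramp linearly from 0 up to peak_lr during warmup.
        return peak_lr * tokens_seen / warmup_tokens
    progress = tokens_seen / total_tokens
    for end_fraction, scale in stages:
        if progress < end_fraction:
            return peak_lr * scale
    return peak_lr * stages[-1][1]  # at or past the end: stay on the last stage

TOTAL = 10**12  # 1T tokens

lr_start = lr_at(0, TOTAL)
lr_mid_warmup = lr_at(10**9, TOTAL)
lr_stage_1 = lr_at(500 * 10**9, TOTAL)
lr_stage_2 = lr_at(850 * 10**9, TOTAL)
lr_stage_3 = lr_at(TOTAL, TOTAL)
```

In PyTorch, a function like this could plausibly be wired in through `torch.optim.lr_scheduler.LambdaLR`, which expects a multiplier on the base learning rate rather than an absolute value, so the return values would need to be divided by `peak_lr` first.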

When Should You Use Multi-Step Scheduling?

For smaller projects, cosine decay remains an excellent default.

However, multi-step scheduling becomes compelling when:

  • Training runs may be extended later
  • Compute availability is uncertain
  • Continual pretraining is expected
  • Multiple training phases are planned
  • Reusing partially completed training is important

In these environments, optimization quality may remain nearly unchanged while operational flexibility improves significantly.

Sometimes the best engineering decision isn't the theoretically cleanest one.

It's the one that makes future decisions easier.

Conclusion

Learning-rate schedulers are often treated as a minor implementation detail.

Yet at LLM scale, they influence not only optimization but also the economics of training.

Cosine decay offers smooth and reliable convergence. Multi-step schedules often achieve similar final performance while making continual training far easier to manage.

That tradeoff explains why several modern LLM training efforts have adopted multi-step schedulers as their default strategy.

As model training increasingly becomes a long-running, iterative process rather than a single fixed experiment, flexibility may matter just as much as raw optimization performance.


