DEV Community

shangkyu shin

Posted on • Originally published at zeromathai.com

Optimization in Machine Learning — How Models Learn Parameters and What Actually Improves Training

Learn how optimization in machine learning works, from parameter learning and loss minimization to gradient descent, backpropagation, and hyperparameter tuning.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/optimization-in-machine-learning-en/

Why optimization is the real core of machine learning

When people first study machine learning, they usually focus on model types, architectures, or frameworks.

But underneath all of that, machine learning is doing something much simpler:

It is searching for parameter values that make the error smaller.

That process is optimization.

A trained model is not just a container of stored examples. It is a system whose parameters have been adjusted so that its outputs fit patterns in data well enough to make useful predictions on new inputs.

So if you want to understand why training works, why it fails, or why changing one setting can completely alter results, optimization is the right place to look.


Training is parameter optimization

At the center of training is one compact idea:

minimize L(θ)

Where:

  • θ = model parameters
  • L = loss function

That is the real meaning of learning in most ML systems.

You define a model, run a forward pass, measure error with a loss function, compute gradients, update parameters, and repeat. That loop is training.

In practice:

  1. the model makes predictions
  2. the loss measures how wrong they are
  3. gradients show how the parameters affected that loss
  4. the optimizer updates parameters to reduce future error

This is true for linear models, logistic regression, and deep neural networks. The scale changes. The principle does not.
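The four-step loop above can be sketched in a few lines of plain Python. This is a toy example, not a framework API: the 1-D linear model y = w*x + b, the mean-squared-error loss, and the hand-derived gradients are all assumptions chosen to keep the loop visible.

```python
# Minimal sketch of the training loop: forward pass, loss, gradients, update.
# Toy data drawn from y = 2x + 1, so training should recover w ≈ 2, b ≈ 1.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # (x, y) pairs
w, b = 0.0, 0.0
lr = 0.05  # learning rate

for step in range(2000):
    # 1. the model makes predictions (forward pass)
    preds = [w * x + b for x, _ in data]
    # 2. the loss measures how wrong they are (mean squared error)
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    # 3. gradients show how the parameters affected that loss
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / len(data)
    # 4. the optimizer updates parameters to reduce future error
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # approaches w = 2, b = 1
```

Deep learning frameworks automate steps 1 and 3, but the loop they run is structurally this one.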


Backpropagation and gradient descent do different jobs

A lot of beginners blur these together, but it helps to separate them clearly.

Backpropagation computes gradients efficiently through layered models.

Gradient-based optimization uses those gradients to update parameters.

So a good mental model is:

  • backprop tells you how the loss changes
  • the optimizer decides how to step

That distinction becomes especially useful when you compare optimizers like SGD, Momentum, and Adam. They can all use gradients from backprop, but they turn those gradients into updates differently.
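One way to see the split is to write the gradient function and the update rules as separate pieces. The quadratic loss L(θ) = (θ − 3)² and the specific hyperparameter values below are illustrative assumptions; the point is only that both optimizers consume the exact same gradient.

```python
def grad(theta):
    # plays the role of backprop: how the loss L = (theta - 3)^2 changes
    return 2 * (theta - 3)

def sgd_step(theta, g, lr=0.1):
    # plain gradient descent: step straight down the gradient
    return theta - lr * g

def momentum_step(theta, g, v, lr=0.1, beta=0.9):
    # momentum: accumulate a velocity, then step along it
    v = beta * v + g
    return theta - lr * v, v

theta_sgd = theta_mom = 0.0
v = 0.0
for _ in range(200):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd))
    theta_mom, v = momentum_step(theta_mom, grad(theta_mom), v)
# Same gradients in, different trajectories out: only the update rule changed.
```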


Optimization happens at two levels

One of the most useful ideas in machine learning is that optimization is not only about training weights.

1. Inner loop: parameter learning

This is the normal training loop.

You optimize:

  • weights
  • biases
  • other trainable parameters

The goal is to reduce training loss.

2. Outer loop: model selection

Here, you optimize the setup around training.

You choose:

  • learning rate
  • batch size
  • optimizer type
  • depth or width of the network
  • regularization strength

This is why hyperparameter tuning is also optimization.

You are not directly updating weights with gradient descent here. You are searching for a training configuration that allows the inner loop to produce a better model.

That distinction explains why a perfectly reasonable architecture can still perform badly. Sometimes the problem is not the model family. It is the optimization setup.
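The two levels can be sketched as nested loops. Everything here is a deliberately tiny stand-in: the inner loop is one-parameter gradient descent on made-up data, and the outer loop is a bare grid search over learning rates scored on a held-out point.

```python
def inner_loop(lr, steps=200):
    # inner loop: learn w for y = w*x from training data drawn from y = 2x
    train = [(1.0, 2.0), (2.0, 4.0)]
    w = 0.0
    for _ in range(steps):
        g = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * g
    return w

def val_loss(w):
    # score the trained model on a held-out point
    val = [(3.0, 6.0)]
    return sum((w * x - y) ** 2 for x, y in val) / len(val)

# outer loop: search over training configurations, not over weights
best_lr = min([0.001, 0.01, 0.1], key=lambda lr: val_loss(inner_loop(lr)))
print(best_lr)
```

Real hyperparameter search uses smarter strategies than a grid, but the structure is the same: each outer-loop candidate pays for a full inner-loop training run.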


The first thing I check when training goes wrong

When a run starts behaving badly, I usually do not blame the architecture first.

I check the optimization setup:

  • Is the learning rate too high?
  • Is the batch size introducing too much noise?
  • Did the optimizer change?
  • Did the loss function change in a way that altered gradient behavior?

A lot of “model problems” are actually optimization problems.

That is one reason optimization is such a practical topic. It is not just theory for textbooks. It is what you debug when a real training run explodes, stalls, or behaves inconsistently.


Why learning rate matters so much

If there is one hyperparameter that beginners underestimate, it is probably the learning rate.

Too high:

  • updates overshoot good regions
  • loss oscillates or diverges
  • training becomes unstable

Too low:

  • progress is painfully slow
  • the model may look weak simply because it is learning inefficiently

That is why what looks like bad model performance is often really bad optimization behavior.
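Both failure modes are easy to reproduce on the simplest possible loss. The function L(w) = w², the starting point, and the three rates below are illustrative assumptions.

```python
def run(lr, steps=20):
    # gradient descent on L(w) = w^2, starting from w = 1.0
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

print(run(1.1))    # too high: |w| grows every step, loss diverges
print(run(0.001))  # too low: barely moved after 20 steps
print(run(0.1))    # reasonable: close to the minimum at 0
```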


First-order vs. second-order methods

Another useful distinction is based on what information the optimizer uses.

First-order methods

These use gradients only.

Examples:

  • SGD
  • Momentum
  • Adam

Why they dominate practice:

  • cheap updates
  • scalable to large datasets
  • practical for deep networks

This is why most real deep learning systems default to first-order optimization.

Second-order methods

These use curvature information, often through the Hessian.

Classic example:

  • Newton’s method

Why they matter:

  • they can converge very quickly in theory
  • they give richer information about the loss landscape

Why they are less common in deep learning:

  • computing and storing second-order information is expensive
  • large neural networks make that cost hard to justify

So the real trade-off is simple:

  • first-order methods are cheaper and more scalable
  • second-order methods are richer but often impractical
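The trade-off can be sketched in one dimension, where the Hessian is just a number. The loss L(w) = (w − 5)⁴, the step count, and the fixed first-order learning rate are assumptions chosen so the curvature actually matters: the gradient flattens near the minimum, and Newton's curvature-scaled step compensates while the fixed step crawls.

```python
def grad(w):
    # first derivative of L(w) = (w - 5)^4
    return 4 * (w - 5) ** 3

def hess(w):
    # second derivative (curvature) of the same loss
    return 12 * (w - 5) ** 2

w_gd, w_newton = 0.0, 0.0
for _ in range(30):
    w_gd -= 0.001 * grad(w_gd)                   # first-order: small fixed step
    w_newton -= grad(w_newton) / hess(w_newton)  # Newton: step scaled by curvature
# After 30 steps, the Newton iterate is essentially at 5; plain gradient
# descent is still far away.
```

In a network with millions of parameters, `hess` becomes a matrix with millions-squared entries, which is exactly the cost problem described above.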

Convex vs. non-convex optimization

In classical ML, convexity is a big deal.

If an objective is convex, local minima are global minima. That gives strong guarantees and makes the optimization story much cleaner.

But deep learning usually lives in a non-convex world.

That means:

  • many local minima
  • saddle points
  • complicated geometry
  • no simple global guarantee

At first, that sounds like a serious weakness. In practice, it leads to a more realistic goal:

We do not need the perfect global optimum. We need a solution that works well.

That is one of the reasons deep learning is so practical despite being hard to analyze perfectly.
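Non-convexity is visible even in one dimension. The double-well loss L(w) = w⁴ − 2w², with local minima at w = −1 and w = +1, is an illustrative assumption: identical gradient descent runs land in different minima purely because of where they start.

```python
def grad(w):
    # gradient of the double-well loss L(w) = w^4 - 2*w^2
    return 4 * w ** 3 - 4 * w

def descend(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print(descend(-0.5))  # ends near the minimum at -1
print(descend(+0.5))  # ends near the minimum at +1
```

Here both minima happen to be equally good. In deep learning they generally are not, and yet many of the reachable ones are good enough, which is the practical point above.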


SGD vs. Adam in real workflows

This is where optimization becomes concrete for developers.

SGD

  • simple and reliable
  • often slower to converge
  • with tuned momentum and a learning-rate schedule, can still match or beat adaptive methods on final performance in some settings


Adam

  • usually easier to get working quickly
  • often converges faster in practice
  • a common default when you want stable early progress

A simple rule of thumb is this:

  • if I want fast iteration and a strong default baseline, I usually start with Adam
  • if I care about comparing training behavior more carefully, I may still test SGD variants

The point is not that one is always better. The point is that optimizer choice changes training behavior in ways you can directly observe.

Same model. Same data. Different optimizer. Different outcome.

That is optimization in action.
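The mechanical difference is easy to state in code. Both loops below receive gradients of the same toy loss L(w) = (w − 4)²; the hyperparameter values are illustrative, and this sketch shows the update rules rather than which optimizer is faster on real workloads.

```python
import math

def grad(w):
    # gradient of the toy loss L(w) = (w - 4)^2
    return 2 * (w - 4)

# SGD: the step is just learning rate times gradient
w_sgd = 0.0
for _ in range(2000):
    w_sgd -= 0.05 * grad(w_sgd)

# Adam: running averages of the gradient and its square rescale each step
w_adam, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)
# Both end near w = 4, but every intermediate step differs: Adam's
# per-parameter rescaling makes step sizes roughly gradient-scale-invariant.
```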


What optimization explains in practice

A surprising number of ML questions are really optimization questions:

  • Why is loss not going down?
  • Why did training become unstable after changing one setting?
  • Why does Adam behave differently from SGD?
  • Why is tuning taking longer than expected?
  • Why does a simpler model sometimes beat a larger one?

Optimization gives the language for answering all of them.

It connects the mathematics of the objective function to the messy reality of actual training runs.


Key takeaways

  • machine learning is fundamentally an optimization problem
  • training means learning parameter values that minimize loss
  • backpropagation computes gradients, while the optimizer uses them
  • optimization happens both inside the model and around the model
  • hyperparameter tuning is an outer optimization process
  • first-order methods dominate deep learning because they scale
  • deep learning usually involves non-convex optimization, so practical solutions matter more than perfect guarantees

If you understand optimization, you understand a large part of why machine learning systems succeed, fail, converge, stall, or improve.

Discussion

When you debug a training run, what do you usually check first: learning rate, optimizer choice, batch size, loss function, or model architecture?

And in your own projects, do you still start with Adam by default, or have you moved back toward SGD for certain workloads?
