DEV Community

shangkyu shin

Posted on • Originally published at zeromathai.com

Optimization in Machine Learning — How Models Learn Parameters and What Actually Improves Training

Learn how optimization in machine learning works, from parameter learning and loss minimization to gradient descent, backpropagation, and hyperparameter tuning.

Cross-posted from Zeromath. Original article: https://zeromathai.com/en/optimization-in-machine-learning-en/

Why optimization is the real core of machine learning

When people first study machine learning, they usually focus on model types, architectures, or frameworks.

But underneath all of that, machine learning is doing something much simpler:

It is searching for parameter values that make the error smaller.

That process is optimization.

A trained model is not just a container of stored examples. It is a system whose parameters have been adjusted so that its outputs fit patterns in data well enough to make useful predictions on new inputs.

So if you want to understand why training works, why it fails, or why changing one setting can completely alter results, optimization is the right place to look.


Training is parameter optimization

At the center of training is one compact idea:

minimize L(θ)

Where:

  • θ = model parameters
  • L = loss function

That is the real meaning of learning in most ML systems.

You define a model, run a forward pass, measure error with a loss function, compute gradients, update parameters, and repeat. That loop is training.

In practice:

  1. the model makes predictions
  2. the loss measures how wrong they are
  3. gradients show how the parameters affected that loss
  4. the optimizer updates parameters to reduce future error

This is true for linear models, logistic regression, and deep neural networks. The scale changes. The principle does not.
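The four-step loop above can be sketched in a few lines of plain Python. This is a toy example, not a framework API: the 1-D linear model y = w*x + b, the mean-squared-error loss, and the hand-derived gradients are all assumptions chosen to keep the loop visible.

```python
# Minimal sketch of the training loop: forward pass, loss, gradients, update.
# Toy data drawn from y = 2x + 1, so training should recover w ≈ 2, b ≈ 1.
data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]  # (x, y) pairs
w, b = 0.0, 0.0
lr = 0.05  # learning rate

for step in range(2000):
    # 1. the model makes predictions (forward pass)
    preds = [w * x + b for x, _ in data]
    # 2. the loss measures how wrong they are (mean squared error)
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, data)) / len(data)
    # 3. gradients show how the parameters affected that loss
    grad_w = sum(2 * (p - y) * x for p, (x, y) in zip(preds, data)) / len(data)
    grad_b = sum(2 * (p - y) for p, (_, y) in zip(preds, data)) / len(data)
    # 4. the optimizer updates parameters to reduce future error
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # approaches w = 2, b = 1
```

Deep learning frameworks automate steps 1 and 3, but the loop they run is structurally this one.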


Backpropagation and gradient descent do different jobs

A lot of beginners blur these together, but it helps to separate them clearly.

Backpropagation computes gradients efficiently through layered models.

Gradient-based optimization uses those gradients to update parameters.

So a good mental model is:

  • backprop tells you how the loss changes
  • the optimizer decides how to step

That distinction becomes especially useful when you compare optimizers like SGD, Momentum, and Adam. They can all use gradients from backprop, but they turn those gradients into updates differently.
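One way to see the split is to write the gradient function and the update rules as separate pieces. The quadratic loss L(θ) = (θ − 3)² and the specific hyperparameter values below are illustrative assumptions; the point is only that both optimizers consume the exact same gradient.

```python
def grad(theta):
    # plays the role of backprop: how the loss L = (theta - 3)^2 changes
    return 2 * (theta - 3)

def sgd_step(theta, g, lr=0.1):
    # plain gradient descent: step straight down the gradient
    return theta - lr * g

def momentum_step(theta, g, v, lr=0.1, beta=0.9):
    # momentum: accumulate a velocity, then step along it
    v = beta * v + g
    return theta - lr * v, v

theta_sgd = theta_mom = 0.0
v = 0.0
for _ in range(200):
    theta_sgd = sgd_step(theta_sgd, grad(theta_sgd))
    theta_mom, v = momentum_step(theta_mom, grad(theta_mom), v)
# Same gradients in, different trajectories out: only the update rule changed.
```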


Optimization happens at two levels

One of the most useful ideas in machine learning is that optimization is not only about training weights.

1. Inner loop: parameter learning

This is the normal training loop.

You optimize:

  • weights
  • biases
  • other trainable parameters

The goal is to reduce training loss.

2. Outer loop: model selection

Here, you optimize the setup around training.

You choose:

  • learning rate
  • batch size
  • optimizer type
  • depth or width of the network
  • regularization strength

This is why hyperparameter tuning is also optimization.

You are not directly updating weights with gradient descent here. You are searching for a training configuration that allows the inner loop to produce a better model.

That distinction explains why a perfectly reasonable architecture can still perform badly. Sometimes the problem is not the model family. It is the optimization setup.
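The two levels can be sketched as nested loops. Everything here is a deliberately tiny stand-in: the inner loop is one-parameter gradient descent on made-up data, and the outer loop is a bare grid search over learning rates scored on a held-out point.

```python
def inner_loop(lr, steps=200):
    # inner loop: learn w for y = w*x from training data drawn from y = 2x
    train = [(1.0, 2.0), (2.0, 4.0)]
    w = 0.0
    for _ in range(steps):
        g = sum(2 * (w * x - y) * x for x, y in train) / len(train)
        w -= lr * g
    return w

def val_loss(w):
    # score the trained model on a held-out point
    val = [(3.0, 6.0)]
    return sum((w * x - y) ** 2 for x, y in val) / len(val)

# outer loop: search over training configurations, not over weights
best_lr = min([0.001, 0.01, 0.1], key=lambda lr: val_loss(inner_loop(lr)))
print(best_lr)
```

Real hyperparameter search uses smarter strategies than a grid, but the structure is the same: each outer-loop candidate pays for a full inner-loop training run.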


The first thing I check when training goes wrong

When a run starts behaving badly, I usually do not blame the architecture first.

I check the optimization setup:

  • Is the learning rate too high?
  • Is the batch size introducing too much noise?
  • Did the optimizer change?
  • Did the loss function change in a way that altered gradient behavior?

A lot of “model problems” are actually optimization problems.

That is one reason optimization is such a practical topic. It is not just theory for textbooks. It is what you debug when a real training run explodes, stalls, or behaves inconsistently.


Why learning rate matters so much

If there is one hyperparameter that beginners underestimate, it is probably the learning rate.

Too high:

  • updates overshoot good regions
  • loss oscillates or diverges
  • training becomes unstable

Too low:

  • progress is painfully slow
  • the model may look weak simply because it is learning inefficiently

That is why what looks like bad model performance is often really bad optimization behavior.
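Both failure modes are easy to reproduce on the simplest possible loss. The function L(w) = w², the starting point, and the three rates below are illustrative assumptions.

```python
def run(lr, steps=20):
    # gradient descent on L(w) = w^2, starting from w = 1.0
    w = 1.0
    for _ in range(steps):
        w -= lr * 2 * w  # gradient of w^2 is 2w
    return w

print(run(1.1))    # too high: |w| grows every step, loss diverges
print(run(0.001))  # too low: barely moved after 20 steps
print(run(0.1))    # reasonable: close to the minimum at 0
```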


First-order vs. second-order methods

Another useful distinction is based on what information the optimizer uses.

First-order methods

These use gradients only.

Examples:

  • SGD
  • Momentum
  • Adam

Why they dominate practice:

  • cheap updates
  • scalable to large datasets
  • practical for deep networks

This is why most real deep learning systems default to first-order optimization.

Second-order methods

These use curvature information, often through the Hessian.

Classic example:

  • Newton’s method

Why they matter:

  • they can converge very quickly in theory
  • they give richer information about the loss landscape

Why they are less common in deep learning:

  • computing and storing second-order information is expensive
  • large neural networks make that cost hard to justify

So the real trade-off is simple:

  • first-order methods are cheaper and more scalable
  • second-order methods are richer but often impractical
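The trade-off can be sketched in one dimension, where the Hessian is just a number. The loss L(w) = (w − 5)⁴, the step count, and the fixed first-order learning rate are assumptions chosen so the curvature actually matters: the gradient flattens near the minimum, and Newton's curvature-scaled step compensates while the fixed step crawls.

```python
def grad(w):
    # first derivative of L(w) = (w - 5)^4
    return 4 * (w - 5) ** 3

def hess(w):
    # second derivative (curvature) of the same loss
    return 12 * (w - 5) ** 2

w_gd, w_newton = 0.0, 0.0
for _ in range(30):
    w_gd -= 0.001 * grad(w_gd)                   # first-order: small fixed step
    w_newton -= grad(w_newton) / hess(w_newton)  # Newton: step scaled by curvature
# After 30 steps, the Newton iterate is essentially at 5; plain gradient
# descent is still far away.
```

In a network with millions of parameters, `hess` becomes a matrix with millions-squared entries, which is exactly the cost problem described above.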

Convex vs. non-convex optimization

In classical ML, convexity is a big deal.

If an objective is convex, local minima are global minima. That gives strong guarantees and makes the optimization story much cleaner.

But deep learning usually lives in a non-convex world.

That means:

  • many local minima
  • saddle points
  • complicated geometry
  • no simple global guarantee

At first, that sounds like a serious weakness. In practice, it leads to a more realistic goal:

We do not need the perfect global optimum. We need a solution that works well.

That is one of the reasons deep learning is so practical despite being hard to analyze perfectly.
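Non-convexity is visible even in one dimension. The double-well loss L(w) = w⁴ − 2w², with local minima at w = −1 and w = +1, is an illustrative assumption: identical gradient descent runs land in different minima purely because of where they start.

```python
def grad(w):
    # gradient of the double-well loss L(w) = w^4 - 2*w^2
    return 4 * w ** 3 - 4 * w

def descend(w0, lr=0.01, steps=500):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print(descend(-0.5))  # ends near the minimum at -1
print(descend(+0.5))  # ends near the minimum at +1
```

Here both minima happen to be equally good. In deep learning they generally are not, and yet many of the reachable ones are good enough, which is the practical point above.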


SGD vs. Adam in real workflows

This is where optimization becomes concrete for developers.

SGD

  • simple and reliable
  • often slower to converge
  • with tuned momentum and a learning-rate schedule, can still match or beat adaptive methods on final performance in some settings


Adam

  • usually easier to get working quickly
  • often converges faster in practice
  • a common default when you want stable early progress

A simple rule of thumb is this:

  • if I want fast iteration and a strong default baseline, I usually start with Adam
  • if I care about comparing training behavior more carefully, I may still test SGD variants

The point is not that one is always better. The point is that optimizer choice changes training behavior in ways you can directly observe.

Same model. Same data. Different optimizer. Different outcome.

That is optimization in action.
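The mechanical difference is easy to state in code. Both loops below receive gradients of the same toy loss L(w) = (w − 4)²; the hyperparameter values are illustrative, and this sketch shows the update rules rather than which optimizer is faster on real workloads.

```python
import math

def grad(w):
    # gradient of the toy loss L(w) = (w - 4)^2
    return 2 * (w - 4)

# SGD: the step is just learning rate times gradient
w_sgd = 0.0
for _ in range(2000):
    w_sgd -= 0.05 * grad(w_sgd)

# Adam: running averages of the gradient and its square rescale each step
w_adam, m, v = 0.0, 0.0, 0.0
lr, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(w_adam)
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment estimate
    m_hat = m / (1 - beta1 ** t)           # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w_adam -= lr * m_hat / (math.sqrt(v_hat) + eps)
# Both end near w = 4, but every intermediate step differs: Adam's
# per-parameter rescaling makes step sizes roughly gradient-scale-invariant.
```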


What optimization explains in practice

A surprising number of ML questions are really optimization questions:

  • Why is loss not going down?
  • Why did training become unstable after changing one setting?
  • Why does Adam behave differently from SGD?
  • Why is tuning taking longer than expected?
  • Why does a simpler model sometimes beat a larger one?

Optimization gives the language for answering all of them.

It connects the mathematics of the objective function to the messy reality of actual training runs.


Key takeaways

  • machine learning is fundamentally an optimization problem
  • training means learning parameter values that minimize loss
  • backpropagation computes gradients, while the optimizer uses them
  • optimization happens both inside the model and around the model
  • hyperparameter tuning is an outer optimization process
  • first-order methods dominate deep learning because they scale
  • deep learning usually involves non-convex optimization, so practical solutions matter more than perfect guarantees

If you understand optimization, you understand a large part of why machine learning systems succeed, fail, converge, stall, or improve.

Discussion

When you debug a training run, what do you usually check first: learning rate, optimizer choice, batch size, loss function, or model architecture?

And in your own projects, do you still start with Adam by default, or have you moved back toward SGD for certain workloads?
