Training Order Is Not Innocent
What if seeing the same data is not enough?
A common, almost implicit assumption in machine learning is that if a model is trained on the same dataset with the same architecture and optimizer, it should end up learning essentially the same thing. Training order is usually treated as an implementation detail — a convenience, not a defining factor.
In this post, I show a simple but striking counterexample:
Even when two neural networks see exactly the same data, the order in which the data is presented can determine what the model permanently remembers and what it completely forgets.
This is the first part of a short series on hysteresis in neural networks. Here, I focus only on observable behavior (accuracy and forgetting). In the next part, we will look inside the model and explain why this happens geometrically.
The Core Question
Assume we have two datasets, A and B.
We train the same model in two different ways:
- SAB: first on A, then on B
- SBA: first on B, then on A
Crucially:
- The architecture is identical
- The optimizer and hyperparameters are identical
- The random seed is fixed
- The union of the data is the same: A ∪ B
The only difference is chronological order.
The standard intuition is that this should not matter.
This intuition is wrong.
Experimental Setup (Intentionally Simple)
To avoid hiding effects behind complexity, I used a deliberately minimal setup.
- Dataset: MNIST
- Split:
  - A = digits {0,1,2,3,4}
  - B = digits {5,6,7,8,9}
- Model: small CNN + MLP head
- Training: 20 epochs total (10 per phase)
Several things were explicitly disabled:
- No Batch Normalization
- No data augmentation
- Deterministic initialization (fixed seed)
This ensures that any difference we observe is not an artifact of randomness or regularization, but a consequence of training order alone.
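To make this concrete, here is a minimal sketch of how the deterministic split can be set up in PyTorch. This is illustrative code, not the exact experiment script; names such as `loader_A` and `loader_B` are just for this post.

```python
import torch
from torchvision import datasets, transforms

torch.manual_seed(0)  # deterministic initialization: fixed seed

transform = transforms.ToTensor()  # no data augmentation
mnist_train = datasets.MNIST(root="./data", train=True, download=True, transform=transform)

# Split by digit label: A = {0..4}, B = {5..9}
labels = mnist_train.targets
idx_A = torch.nonzero(labels <= 4).squeeze(1)
idx_B = torch.nonzero(labels >= 5).squeeze(1)

subset_A = torch.utils.data.Subset(mnist_train, idx_A.tolist())
subset_B = torch.utils.data.Subset(mnist_train, idx_B.tolist())

loader_A = torch.utils.data.DataLoader(subset_A, batch_size=128, shuffle=True)
loader_B = torch.utils.data.DataLoader(subset_B, batch_size=128, shuffle=True)
```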
What Happens During Training?
Let’s start with the SAB scenario: the model learns A first, then B.
SAB: A → B
Observed behavior:
- During the first phase, accuracy on A quickly rises to ~99%
- Accuracy on B remains near 0% (expected)
After switching to B:
- Accuracy on B rises to ~99%
- Accuracy on A collapses to 0%
SBA: B → A
The mirror experiment produces the mirror result:
Observed behavior:
- During the first phase, accuracy on B rises to ~99%
- Accuracy on A remains near 0%
After switching to A:
- Accuracy on A rises to ~99%
- Accuracy on B collapses to 0%
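The two-phase protocol itself is simple to sketch. The code below is a hedged illustration rather than the exact training script: the optimizer choice and layer sizes are placeholders, and `test_loader_A` / `test_loader_B` are assumed to be per-subset test loaders built the same way as the training loaders above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # same fixed seed for both runs

# Small CNN + MLP head, matching the description above (layer sizes are illustrative)
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def train_phase(model, loader, optimizer, criterion, epochs):
    """Train on one subset for a fixed number of epochs (one 'phase')."""
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

@torch.no_grad()
def accuracy(model, loader):
    """Fraction of correctly classified samples in a loader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Optimizer and learning rate are placeholders, kept identical across both runs
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# SAB: A first, then B (SBA simply swaps the two phases).
# test_loader_A / test_loader_B are assumed per-subset test loaders.
train_phase(model, loader_A, optimizer, criterion, epochs=10)
print("after A:", accuracy(model, test_loader_A), accuracy(model, test_loader_B))

train_phase(model, loader_B, optimizer, criterion, epochs=10)
print("after B:", accuracy(model, test_loader_A), accuracy(model, test_loader_B))
```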
Accuracy Curves
SAB accuracy over epochs
Description: Line plot showing acc_A, acc_B, and acc_full across epochs. A vertical dashed line marks the phase transition (A → B). Accuracy on A drops sharply after the transition.
SBA accuracy over epochs
Description: Same plot as above, but for SBA. Accuracy on B drops sharply after switching to A.
A Subtle but Important Observation
Despite the dramatic forgetting, overall accuracy on the full test set remains around 50% in both cases.
This is not a contradiction.
Each model performs extremely well on half of the classes and completely fails on the other half. Aggregated metrics hide this asymmetry.
Two models can have similar overall accuracy while representing fundamentally different worlds internally.
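As a quick sanity check: since the two subsets each cover roughly half of the test set, the aggregate accuracy is approximately the average of the two per-subset accuracies,

\[
\text{Acc}_{\text{full}} \approx \tfrac{1}{2}\,\text{Acc}_A + \tfrac{1}{2}\,\text{Acc}_B \approx \tfrac{1}{2}(0.99 + 0.00) \approx 0.5
\]

which matches the ~50% observed above regardless of which half was learned last.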
Quantifying the Effect: Hysteresis Loss
We can define a simple, order-dependent metric:
\[
\mathcal{L}_{\text{hyst}}(A) = \left|\,\text{Acc}_A(\text{SAB}) - \text{Acc}_A(\text{SBA})\,\right|
\]
In this experiment:
- Acc_A(SAB) ≈ 0.00
- Acc_A(SBA) ≈ 0.99
This yields a hysteresis loss close to 1.0, i.e. the maximum possible difference.
The same holds symmetrically for dataset B.
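In code, this metric is nothing more than an absolute difference between the two runs. A minimal sketch (variable names are illustrative, not from the experiment code):

```python
def hysteresis_loss(acc_first_order: float, acc_second_order: float) -> float:
    """Absolute accuracy gap between the two training orders on the same subset."""
    return abs(acc_first_order - acc_second_order)

# Values observed in this experiment for subset A:
print(hysteresis_loss(0.00, 0.99))  # ~0.99, close to the maximum of 1.0
```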
Hysteresis summary
Description: Bar chart showing absolute accuracy differences between SAB and SBA.
Hysteresis for the individual subsets A and B is near-maximal, while hysteresis in the aggregate (full test accuracy) is close to zero and visually compressed due to the shared scale.
This highlights a key point: global performance metrics can remain almost invariant, even when class-conditional behavior is maximally path-dependent.
What This Means
This result shows that training order is not just an optimization detail.
- The network does not converge to a single, order-independent solution
- Learning leaves path-dependent traces in weight space
- Once the model commits to one subset, returning to a balanced representation is not trivial
This behavior is strongly reminiscent of hysteresis in physical systems, where the final state depends on the path taken, not only on the endpoint.
What This Post Does Not Explain (Yet)
This post only shows that hysteresis exists.
It does not explain:
- Whether the two final models lie in different basins of attraction
- Whether their internal representations are aligned or incompatible
- Whether one can smoothly interpolate between them without loss spikes
These questions require looking at the geometry of the weight space, not just accuracy curves.
That is exactly what Part 2 will address.
Reproducibility
All experiments were run with a fixed configuration and deterministic setup. The code used for this post (FAZ 1) is available on GitHub:
The geometric analysis (weight trajectories, representation similarity, interpolation barriers) will be released together with the next post.
Closing Thoughts
If training order alone can erase entire subsets of knowledge, then:
- Curriculum learning has hidden costs
- "Same data" does not imply "same model"
- Optimization is not just minimization — it is a history-dependent process
In the next post, we will open the model and examine how these irreversible choices are encoded in the geometry of neural manifolds.
Part 2: Inside the Weight Space — Geometry of Hysteresis


