Training Order Is Not Innocent
What if seeing the same data is not enough?
A common, almost implicit assumption in machine learning is that if a model is trained on the same dataset with the same architecture and optimizer, it should end up learning essentially the same thing. Training order is usually treated as an implementation detail — a convenience, not a defining factor.
In this post, I show a simple but striking counterexample:
Even when two neural networks see exactly the same data, the order in which the data is presented can determine what the model permanently remembers and what it completely forgets.
This is the first part of a short series on hysteresis in neural networks. Here, I focus only on observable behavior (accuracy and forgetting). In the next part, we will look inside the model and explain why this happens geometrically.
The Core Question
Assume we have two datasets, A and B.
We train the same model in two different ways:
- SAB: first on A, then on B
- SBA: first on B, then on A
Crucially:
- The architecture is identical
- The optimizer and hyperparameters are identical
- The random seed is fixed
- The union of the data is the same: A ∪ B
The only difference is chronological order.
The standard intuition is that this should not matter.
This intuition is wrong.
Experimental Setup (Intentionally Simple)
To avoid hiding effects behind complexity, I used a deliberately minimal setup.
- Dataset: MNIST
- Split:
  - A = digits {0,1,2,3,4}
  - B = digits {5,6,7,8,9}
- Model: small CNN + MLP head
- Training: 20 epochs total (10 per phase)
Several things were explicitly disabled:
- No Batch Normalization
- No data augmentation
- Deterministic initialization (fixed seed)
This ensures that any difference we observe is not an artifact of randomness or regularization, but a consequence of training order alone.
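To make this concrete, here is a minimal sketch of how the deterministic split can be set up in PyTorch. This is illustrative code, not the exact experiment script; names such as `loader_A` and `loader_B` are just for this post.

```python
import torch
from torchvision import datasets, transforms

torch.manual_seed(0)  # deterministic initialization: fixed seed

transform = transforms.ToTensor()  # no data augmentation
mnist_train = datasets.MNIST(root="./data", train=True, download=True, transform=transform)

# Split by digit label: A = {0..4}, B = {5..9}
labels = mnist_train.targets
idx_A = torch.nonzero(labels <= 4).squeeze(1)
idx_B = torch.nonzero(labels >= 5).squeeze(1)

subset_A = torch.utils.data.Subset(mnist_train, idx_A.tolist())
subset_B = torch.utils.data.Subset(mnist_train, idx_B.tolist())

loader_A = torch.utils.data.DataLoader(subset_A, batch_size=128, shuffle=True)
loader_B = torch.utils.data.DataLoader(subset_B, batch_size=128, shuffle=True)
```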
What Happens During Training?
Let’s start with the SAB scenario: the model learns A first, then B.
SAB: A → B
Observed behavior:
- During the first phase, accuracy on A quickly rises to ~99%
- Accuracy on B remains near 0% (expected)
After switching to B:
- Accuracy on B rises to ~99%
- Accuracy on A collapses to 0%
SBA: B → A
The mirror experiment produces the mirror result:
Observed behavior:
- During the first phase, accuracy on B rises to ~99%
- Accuracy on A remains near 0%
After switching to A:
- Accuracy on A rises to ~99%
- Accuracy on B collapses to 0%
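The two-phase protocol itself is simple to sketch. The code below is a hedged illustration rather than the exact training script: the optimizer choice and layer sizes are placeholders, and `test_loader_A` / `test_loader_B` are assumed to be per-subset test loaders built the same way as the training loaders above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # same fixed seed for both runs

# Small CNN + MLP head, matching the description above (layer sizes are illustrative)
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 7 * 7, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def train_phase(model, loader, optimizer, criterion, epochs):
    """Train on one subset for a fixed number of epochs (one 'phase')."""
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

@torch.no_grad()
def accuracy(model, loader):
    """Fraction of correctly classified samples in a loader."""
    model.eval()
    correct, total = 0, 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Optimizer and learning rate are placeholders, kept identical across both runs
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# SAB: A first, then B (SBA simply swaps the two phases).
# test_loader_A / test_loader_B are assumed per-subset test loaders.
train_phase(model, loader_A, optimizer, criterion, epochs=10)
print("after A:", accuracy(model, test_loader_A), accuracy(model, test_loader_B))

train_phase(model, loader_B, optimizer, criterion, epochs=10)
print("after B:", accuracy(model, test_loader_A), accuracy(model, test_loader_B))
```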
Accuracy Curves
SAB accuracy over epochs
Description: Line plot showing acc_A, acc_B, and acc_full across epochs. A vertical dashed line marks the phase transition (A → B). Accuracy on A drops sharply after the transition.
SBA accuracy over epochs
Description: Same plot as above, but for SBA. Accuracy on B drops sharply after switching to A.
A Subtle but Important Observation
Despite the dramatic forgetting, overall accuracy on the full test set remains around 50% in both cases.
This is not a contradiction.
Each model performs extremely well on half of the classes and completely fails on the other half. Aggregated metrics hide this asymmetry.
Two models can have similar overall accuracy while representing fundamentally different worlds internally.
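As a quick sanity check: since the two subsets each cover roughly half of the test set, the aggregate accuracy is approximately the average of the two per-subset accuracies,

\[
\text{Acc}_{\text{full}} \approx \tfrac{1}{2}\,\text{Acc}_A + \tfrac{1}{2}\,\text{Acc}_B \approx \tfrac{1}{2}(0.99 + 0.00) \approx 0.5
\]

which matches the ~50% observed above regardless of which half was learned last.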
Quantifying the Effect: Hysteresis Loss
We can define a simple, order-dependent metric:
\[
\mathcal{L}_{\text{hyst}}(A) = \left|\,\text{Acc}_A(\text{SAB}) - \text{Acc}_A(\text{SBA})\,\right|
\]
In this experiment:
- Acc_A(SAB) ≈ 0.00
- Acc_A(SBA) ≈ 0.99
This yields a hysteresis loss close to 1.0, i.e. the maximum possible difference.
The same holds symmetrically for dataset B.
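In code, this metric is nothing more than an absolute difference between the two runs. A minimal sketch (variable names are illustrative, not from the experiment code):

```python
def hysteresis_loss(acc_first_order: float, acc_second_order: float) -> float:
    """Absolute accuracy gap between the two training orders on the same subset."""
    return abs(acc_first_order - acc_second_order)

# Values observed in this experiment for subset A:
print(hysteresis_loss(0.00, 0.99))  # ~0.99, close to the maximum of 1.0
```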
Hysteresis summary
Description: Bar chart showing absolute accuracy differences between SAB and SBA.
Hysteresis for the individual subsets A and B is near-maximal, while hysteresis in the aggregate (full test accuracy) is close to zero and visually compressed due to the shared scale.
This highlights a key point: global performance metrics can remain almost invariant, even when class-conditional behavior is maximally path-dependent.
What This Means
This result shows that training order is not just an optimization detail.
- The network does not converge to a single, order-independent solution
- Learning leaves path-dependent traces in weight space
- Once the model commits to one subset, returning to a balanced representation is not trivial
This behavior is strongly reminiscent of hysteresis in physical systems, where the final state depends on the path taken, not only on the endpoint.
What This Post Does Not Explain (Yet)
This post only shows that hysteresis exists.
It does not explain:
- Whether the two final models lie in different basins of attraction
- Whether their internal representations are aligned or incompatible
- Whether one can smoothly interpolate between them without loss spikes
These questions require looking at the geometry of the weight space, not just accuracy curves.
That is exactly what Part 2 will address.
Reproducibility
All experiments were run with a fixed configuration and deterministic setup. The code used for this post (FAZ 1) is available on GitHub:
The geometric analysis (weight trajectories, representation similarity, interpolation barriers) will be released together with the next post.
Closing Thoughts
If training order alone can erase entire subsets of knowledge, then:
- Curriculum learning has hidden costs
- "Same data" does not imply "same model"
- Optimization is not just minimization — it is a history-dependent process
In the next post, we will open the model and examine how these irreversible choices are encoded in the geometry of neural manifolds.
Part 2: Inside the Weight Space — Geometry of Hysteresis


