114 pages of ML math, and what actually shows up at work

#ai #machinelearning #deeplearning #learning

Sixty-two pages of machine learning math. Fifty-two pages of deep learning math. One hundred and fourteen pages of formulas, compiled by a student studying for theory exams, posted to r/learnmachinelearning over the weekend and currently sitting at 138 upvotes. Every entry carries consistent notation, tensor shapes, and a one-line use label.

The cheatsheet is what a thorough course covers. Linear and logistic regression. Decision trees and tree ensembles. K-means and anomaly detection. PCA. Reinforcement learning and Q-learning on the ML side. Forward prop, backprop, Adam, RMSProp, CNNs, RNNs, GRUs, LSTMs, transformers, self-attention, word embeddings and seq2seq on the DL side. Plus shape reference tables — which is the section a working engineer actually flips back to.

That last line is the spine of this post. Most of the 114 pages disappear behind a .fit() call, a .compile() call, or an import once a project ships. The cheatsheet is genuinely good; students and interview-prep candidates will save themselves real time with it. For the working ML or applied AI engineer in 2026, the load-bearing subset is much smaller than the page count suggests. The interesting question isn't whether to learn the math but which math earns the time.

Teams shipping LLM features in 2026 keep landing on roughly the same answer. Three formulas show up on the screen often enough to be worth memorizing, and one section of any course-shaped cheatsheet has quietly become a museum piece.

The first is the chain-rule application in backprop — ∂L/∂w = ∂L/∂y · ∂y/∂z · ∂z/∂w. Autograd computes it; the engineer reads its consequences. Vanishing gradients in a deep stack, exploding gradients in an RNN-style block, an activation saturating before it can update — none of these failures are visible from the loss curve alone. They become visible the moment an engineer can trace the gradient backward through a layer in their head. Mixed-precision overflows, gradient-clipping thresholds, the choice between tanh and gelu in a custom block all rest on this. Backprop gets taught as the hardest thing in a deep-learning course and shows up at work as the most-used.
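To make the saturation case concrete, here is a minimal sketch in plain Python. It assumes a single tanh neuron with a squared-error loss; the neuron, the loss, and the names neuron_grad, loss, and finite_diff are illustrative choices, not taken from the cheatsheet. It multiplies out the three chain-rule factors by hand and checks the product against a finite difference.

```python
import math

def neuron_grad(w, x, t):
    """dL/dw for z = w*x, y = tanh(z), L = 0.5 * (y - t)^2 (an assumed toy setup)."""
    z = w * x
    y = math.tanh(z)
    dL_dy = y - t          # derivative of 0.5 * (y - t)^2 with respect to y
    dy_dz = 1.0 - y * y    # derivative of tanh(z) with respect to z
    dz_dw = x              # derivative of w * x with respect to w
    return dL_dy * dy_dz * dz_dw

def loss(w, x, t):
    return 0.5 * (math.tanh(w * x) - t) ** 2

def finite_diff(w, x, t, eps=1e-6):
    """Central-difference estimate of dL/dw, used only as a sanity check."""
    return (loss(w + eps, x, t) - loss(w - eps, x, t)) / (2 * eps)

# A neuron in its responsive range versus one pushed deep into saturation.
healthy = neuron_grad(w=0.5, x=1.0, t=1.0)
saturated = neuron_grad(w=10.0, x=1.0, t=-1.0)

print("healthy gradient:  ", healthy)
print("saturated gradient:", saturated)
print("finite-diff check: ", finite_diff(0.5, 1.0, 1.0))
```

In the saturated case the prediction is about as wrong as it can be, yet the gradient is close to zero because the middle factor ∂y/∂z has collapsed. That is the kind of failure the loss curve alone does not explain.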

The second is the softmax-with-cross-entropy gradient — ∂L/∂z = ŷ − y. The simplest derivative in the entire cheatsheet, and the one that decides whether a classifier converges or oscillates. Autograd does the substitution but does not pick the loss; the engineer does. Label smoothing, temperature scaling, focal loss, the choice between BCE and CCE — each of these is a small perturbation on the same identity, and an engineer who can derive it in two lines can debug a misbehaving loss in two minutes.
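A quick way to convince yourself of the identity is to check it numerically. The sketch below uses plain Python with made-up logits; the helper names (softmax, cross_entropy, analytic_grad, numeric_grad, smooth) are hypothetical and chosen for this example. It also applies label smoothing, as an assumed uniform mix with weight alpha, to show that the gradient keeps the same form with a softened target.

```python
import math

def softmax(z):
    m = max(z)                               # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """L = -sum_i y_i * log(softmax(z)_i) for a target distribution y."""
    return -sum(yi * math.log(pi) for yi, pi in zip(y, softmax(z)))

def analytic_grad(z, y):
    """The identity from the text: dL/dz = y_hat - y."""
    return [pi - yi for pi, yi in zip(softmax(z), y)]

def numeric_grad(z, y, eps=1e-6):
    """Central differences on each logit, used only to check the identity."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((cross_entropy(up, y) - cross_entropy(down, y)) / (2 * eps))
    return grads

def smooth(y, alpha):
    """Label smoothing: mix the one-hot target with a uniform distribution."""
    k = len(y)
    return [(1.0 - alpha) * yi + alpha / k for yi in y]

logits = [2.0, -1.0, 0.5]        # illustrative values
one_hot = [1.0, 0.0, 0.0]
soft = smooth(one_hot, alpha=0.1)

print("analytic:", analytic_grad(logits, one_hot))
print("numeric: ", numeric_grad(logits, one_hot))
print("smoothed:", analytic_grad(logits, soft))
```

Because the smoothed target still sums to one, the derivation goes through unchanged and only y moves. The same check is a cheap first step when a custom loss misbehaves.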

The third is scaled dot-product attention: softmax(QK^T / √d_k) · V. Every transformer-shaped model from the smallest open-weights base model to the latest frontier release runs this inside a tight loop. The dimensional algebra of it is what determines KV-cache footprint, multi-head split widths, grouped-query layouts, RoPE position rotation, and the inference-cost story for serving. A team running its own inference stack reaches for this formula every time it tunes batch size, decides between FP8 and BF16 for the projection matrices, or evaluates whether a longer context window is worth the quadratic growth in attention scores and the linear growth of the KV cache.
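The formula and its memory consequences fit in a few lines. The sketch below is plain Python, not how any production kernel is written: attention is a direct, unoptimized reading of the formula, and kv_cache_bytes is a common back-of-envelope estimate that ignores framework overhead. The function names and the model configuration (32 layers, head dimension 128, 8,192 tokens, 2 bytes per element) are assumptions made up for illustration, not figures from any real model.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, with Q, K, V given as lists of row vectors."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # One score per key: dot(q, k) scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem):
    """Rough estimate: K and V tensors (factor 2) cached per layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# A zero query scores every key equally, so the output is the mean of the value rows.
Q = [[0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))

# Hypothetical configuration: full multi-head (32 KV heads) vs grouped-query (8 KV heads).
GIB = 1024 ** 3
mha = kv_cache_bytes(32, 32, 128, 8192, 1, 2)
gqa = kv_cache_bytes(32, 8, 128, 8192, 1, 2)
print(mha / GIB, "GiB vs", gqa / GIB, "GiB")
```

Under these assumed numbers, cutting KV heads from 32 to 8 cuts the cache by a factor of four, and doubling seq_len doubles it. That is the arithmetic behind most context-length and batch-size conversations.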

Three formulas. Backprop, cross-entropy, attention. The rest of the 114 pages have a different shape of usefulness: they are the substrate that makes the three legible. But they are not what shows up on the engineer's screen in a normal week.

The museum piece is a section the cheatsheet leaves out, and one its author names in the cover note: SVM and Naive Bayes. Neither appears in the 62 ML pages, and neither appears in the 2026 production stack at the teams this post follows. SVMs were the central classifier of the 2008–2013 era and are still taught, but the working engineer in 2026 reaches for a gradient-boosted tree if the data is tabular and for a small fine-tuned encoder if it isn't. Naive Bayes is a teaching object now. The author of the cheatsheet flagged the omission as a limitation; from a practitioner's standpoint it reads as a feature of the document, not a gap.

The two other omissions, GANs and diffusion models, point in opposite directions from each other. GANs are largely gone as a training paradigm; the field moved to diffusion and rectified flows. Diffusion is the gap that matters going forward, and any 2026-edition cheatsheet that adds one section should make it the diffusion forward/reverse process, the noise schedule, and the score-matching identity. The 2024 generation of ML curricula didn't include diffusion. The 2026 generation will.

The honest read of the cheatsheet is that it is two documents at once. The first is a study aid for the exam the author wrote it for — comprehensive, consistent, with shape tables that an interview-prep candidate will be grateful for. The second is, by accident, a map of which pieces of the standard ML curriculum still earn their teaching time and which are mostly historical. The student who built it for exams probably didn't intend to write the second document. But the cover note about SVM, Naive Bayes, GANs, and diffusion is itself the most interesting page of the 114.

The repo is at github.com/Jerry-0821/ml-dl-formula-cheatsheet, and the original r/learnmachinelearning thread is where the post landed. Star it if it matches the way a course is taught. The three formulas that show up at work are a different list, and the cheatsheet is most useful as the substrate against which a working engineer can name them.

DEV Community
