DEV Community

Cover image for How a 1787 Formula Picks the Next Word in ChatGPT
bytessyster
bytessyster

Posted on

How a 1787 Formula Picks the Next Word in ChatGPT

Mathematics has an annoying habit: it keeps handing you the same idea over and over under different names, and never once admits you've already met. In your first year you get told about the Legendre transform in mechanics (you understand nothing). Then about free energy in thermodynamics (you understand a bit more but see no connection). Then about softmax in neural networks (by this point you're fairly sure it's just an engineering hack someone came up with because it happened to work). And then, at some point — for me it happened while reading a particular pedagogical paper, more on that below — you suddenly realize that all along you'd been shown the same thing. One operation. Just wearing different costumes.

This is a genuinely satisfying feeling, something like the moment in a detective novel when it turns out the butler, the gardener, and the mysterious stranger on the train are the same person. I want to share that feeling, because it's pleasant, and also because it's useful: once you understand one construction, you get thermodynamics, exponential families from statistics, Fisher information, and — out of nowhere — the last layer of every large language model, plus the temperature slider in its API, all for free. Not a bad exchange rate for one idea.

I'll lean on a paper by Zia, Redish, and McKay, "Making Sense of the Legendre Transform" (arXiv:0806.1147) — a rare genre: three physicists wrote a paper about why nobody properly explains the Legendre transform to students, and then properly explained it. Then I'll pull the thread somewhere they never went: machine learning. Buckle up, there will be formulas, but I promise every single one earns its keep.

The formula everyone memorizes and nobody understands

Let's start with how the Legendre transform usually gets introduced in textbooks, so you can appreciate the scale of the pedagogical disaster. You're given a function $F(x)$, and with no explanation whatsoever, you're shown:

$$G(s) = s\,x(s) - F\big(x(s)\big),$$

where $s = dF/dx$ is the slope, and $x(s)$ is its inverse function. That's it. What usually follows is the phrase "it is easy to check that" (a classic warning sign that something unpleasant is coming), followed by a second magic identity: $dG/ds = x$. A reasonable person at this point has three questions: why the extra term $sx$, where does the minus sign come from, and why couldn't we just plug $x(s)$ into $F$ and call it a day. The textbook traditionally answers these questions with silence.

And yet an answer exists, and it isn't even complicated. Here's the thing: a function is a way of storing information — an input, the control variable $x$, and an output, the response $F$. Sometimes it's more convenient to store the same information in a different package. Everyone's made peace with the Fourier transform precisely because the repackaging is explained: instead of "values at points" you store "how much of each frequency is in the signal." The Legendre transform is the same kind of repackaging, except the new coordinate is the function's slope instead of a frequency. This is legitimate when the function is strictly convex (then every $x$ has its own unique slope, and the slope can serve as a full-fledged ID for the point), and useful when the slope is easier to measure or control than $x$ itself. In mechanics, this happens with velocity and momentum. In thermodynamics, with energy and temperature (try setting a system's energy directly; now try setting its temperature — feel the difference? thermostats are sold in stores; "energostats" are not). In a language model, as we'll see, it happens with probabilities and logits.

The picture that makes the minus sign stop being scary

Now, about that minus sign — and here the authors of the paper do the single most valuable thing: they draw a picture. Take a convex curve $F(x)$. Pick a point, go up to the curve, draw the tangent line with slope $s$, and extend it left until it hits the vertical axis. The intercept on that axis turns out to equal $-G$. What's left is seventh-grade geometry: in the resulting right triangle, "slope times horizontal leg" equals the vertical leg, and the vertical leg is made up of $F$ (above the axis) plus $G$ (below it). Put together:

$$F(x) + G(s) = s\,x.$$

And in this form, everything suddenly clicks into place. The "extra" $sx$ and the "weird" minus sign are there for exactly one reason: to make the equation symmetric. $F$ and $G$ enter it on completely equal footing — neither one is the "main" function. (The variables $x$ and $s$, to be clear, aren't independent here — they're locked together via $dF/dx = s$ and $dG/ds = x$; there's exactly one independent variable in this equation, you just get to choose which one it is.)

The symmetry immediately gives you a property that genuinely delighted me the first time I saw it: the Legendre transform is its own inverse. Apply it to $G$, and you get $F$ back. With Fourier, the forward and inverse transforms are different formulas; here it's the same operation both ways, like a light switch. Mathematicians call this an involution; in human terms it means ${F, x}$ and ${G, s}$ aren't an original and a copy — they're two equally valid descriptions of one object. Duality.

There's one more gift here that we'll need later. Differentiate the pair $dF/dx = s$, $dG/ds = x$ one more time, and you get

$$\frac{d^2F}{dx^2}\cdot\frac{d^2G}{ds^2} = 1.$$

The curvatures of dual functions are reciprocals of each other: wherever one bends sharply, the other is nearly flat. It resembles the uncertainty relation $\Delta x\,\Delta k \approx 1$, and that resemblance isn't entirely a coincidence. Hang on to this formula — three sections from now it comes back wearing a Fisher-information costume, and you'll recognize it.

For a sanity check, the simplest example: the parabola $F(x) = \tfrac{1}{2}\alpha x^2$. The slope is $s = \alpha x$, the inverse is $x = s/\alpha$, and the transform gives $G(s) = \tfrac{1}{2\alpha}s^2$. Boring? Boring. But this exact boring parabola is why statisticians are so fond of the Gaussian distribution: its dual is also Gaussian, and the whole construction closes neatly on itself.

Fenchel, or what to do when there's no derivative

The classical Legendre transform has an Achilles' heel: it needs a smooth, strictly convex function, or the equation $dF/dx = s$ doesn't have a unique solution and the whole scheme collapses. Real functions from optimization and statistics, meanwhile, are full of kinks, flat stretches, and awkward domains. So what do you do?

In the twentieth century, Werner Fenchel offered a solution I admire for its bluntness: if you can't solve the equation for the slope — don't. Take a supremum instead:

$$f^{*}(s) = \sup_{x}\big(\langle s, x\rangle - f(x)\big).$$

This is the Legendre–Fenchel transform, also known as convex conjugation. The geometric meaning is the same — we're still hunting for that tangent-line intercept — except now, instead of differentiating, we simply lay every possible line of slope $s$ against the graph and take the best intercept. No smoothness required, works on anything. As a bonus, $f^{*}$ is automatically convex (it's a supremum over a family of linear functions, and the pointwise max of lines is always convex — you can draw this and check), even if the original $f$ was a mess.

The symmetric equality correspondingly degrades, honestly, into an inequality — and this inequality is the main character of the rest of the story. The Fenchel–Young inequality:

$$f(x) + f^{*}(s) \ge \langle s, x\rangle,$$

with equality exactly when $x$ and $s$ are dual to each other (formally: $s$ is a subgradient of $f$ at $x$). Keep this one in your head — it'll turn out that the gap in this inequality is exactly what you're minimizing every time you train a neural network.

One last property, and it's the deepest one. Conjugate twice, and you get $f^{}$ — and this isn't always the original function: it's its convex hull, the largest convex function lying below $f$. For convex functions, $f^{} = f$, and we're back in involution territory; for non-convex ones, the transform neatly "pours concrete" over every dip. A nice detail: in thermodynamics, this pouring isn't an abstraction — it's exactly the Maxwell construction describing the coexistence of liquid and vapor. When water boils in your kettle, nature is convexifying a non-convex free energy. (You have to admit, your kettle sounds a lot more distinguished now.)

Thermodynamics: a dress rehearsal for neural networks

Now a short detour into physics — not for physics's own sake, but for the vocabulary. Everything we name here is going to show up again in neural networks one section from now, wearing the exact same name tag, and I want you to recognize it on sight.

An isolated system with energy $E$ has some number of microstates $\Omega(E)$, and its logarithm is the dimensionless entropy $\mathcal{S}(E) = \ln \Omega(E)$. Computing $\Omega$ directly is usually miserable, so physicists take a Laplace transform and introduce the partition function:

$$Z(\beta) = \int \Omega(E)\, e^{-\beta E}\, dE, \qquad \mathcal{F}(\beta) = -\ln Z(\beta),$$

where $\beta = 1/k_B T$ is the inverse temperature and $\mathcal{F}$ is the dimensionless free energy. And here's the trick the whole detour was for: when there are many particles, the Laplace integral is evaluated by the saddle-point method, and in that limit the relationship between $\mathcal{S}$ and $\mathcal{F}$ turns out — surprise — to be exactly the symmetric identity from before:

$$\mathcal{F}(\beta) + \mathcal{S}(E) = \beta E, \qquad \frac{d\mathcal{S}}{dE} = \beta, \qquad \frac{d\mathcal{F}}{d\beta} = E.$$

Entropy and free energy are Legendre duals of each other, and the conjugate pair of variables is energy and inverse temperature. Why does this matter to physicists? Because a system's energy is hard to control directly, while its temperature is easy — stick it in a thermal bath and dial in $\beta$. The Legendre transform is precisely what swaps the roles of "what's hard to control" and "what's convenient to set." Three souvenirs to take from this section: the partition function $Z$, the inverse temperature $\beta$, and the fact that the log of $Z$ is dual to entropy. Don't throw these away.

Statistics: every distribution has two coordinate systems

The first place in machine learning where Legendre–Fenchel duality stops being decorative and becomes a load-bearing wall is exponential families. It sounds specialized, but it's basically all of classical probability as actually used in practice: Gaussian, Bernoulli, Poisson, categorical — all members of one club with the same membership card:

$$p(x;\theta) = h(x)\,\exp\big(\langle \theta, T(x)\rangle - A(\theta)\big),$$

where $\theta$ are the so-called natural parameters, $T(x)$ are sufficient statistics, and

$$A(\theta) = \ln \int h(x)\, e^{\langle \theta, T(x)\rangle}\, dx$$

is the log-normalizer, better known as the log-partition function. Recognize it? It's the same partition function from physics, having relocated to statistics (the letter even looks similar). $A$ is convex, and its gradient hands you the expected values of the statistics: $\nabla A(\theta) = \mathbb{E}_\theta[T(x)] = \mu$.

This is where duality fully unfolds. Every such distribution has two equally valid coordinate systems: the natural parameters $\theta$ (internal, "energy-like") and the mean parameters $\mu$ (external, observable — the thing you actually measure from data). The dictionary between them is the Legendre–Fenchel transform: the conjugate function $A^{}(\mu)$ turns out to be (up to sign) the negative entropy, and translation runs $\mu = \nabla A(\theta)$ one way and $\theta = \nabla A^{}(\mu)$ the other. Maximum likelihood estimation in these families reduces to moment matching — "make the model's means equal the observed means" — and this isn't a lucky coincidence, it's a direct consequence of duality.

Remember that curvature formula I asked you to hold onto? Here it is again, in a new costume. The Hessian of the log-partition is the covariance of the statistics, also known as Fisher information:

$$\nabla^2 A(\theta) = \operatorname{Cov}_\theta[T(x)] = \mathcal{I}(\theta), \qquad \nabla^2 A(\theta)\cdot \nabla^2 A^{*}(\mu) = I.$$

"Dual curvatures are reciprocal" has turned into "Fisher information and its inverse live in dual coordinates" — and that's the foundation of the natural gradient, the Cramér–Rao bound, and the whole apparatus of variational inference in the Wainwright–Jordan tradition, where an intractable integral gets swapped out for a convex optimization problem. One triangle drawn on a piece of paper, and this is its family tree.

(A quick aside for probability enthusiasts: the rate function from Cramér's theorem — the one governing how exponentially unlikely it is for an empirical average to stray from the truth — is the Legendre–Fenchel transform of the cumulant generating function, $I(a) = \sup_\theta(\theta a - \ln \mathbb{E}\,e^{\theta X})$. So every time you calmly average something and trust that it'll concentrate, Legendre is personally underwriting that trust. End of aside.)

And here's where we've been headed: softmax

Now — the payoff. A large language model, at every step of generation, produces a vector of logits $z \in \mathbb{R}^V$ — one number for each of the $V$ tokens in its vocabulary — and turns it into a probability distribution via softmax:

$$\text{softmax}(z)i = \frac{e^{z_i}}{\sum{j} e^{z_j}}.$$

Softmax always travels with its shadow, the function LogSumExp: $\operatorname{LSE}(z) = \ln \sum_i e^{z_i}$. Usually softmax gets presented as a convenient heuristic: exponentials to make everything positive, a normalizer so it sums to one, nothing more to see here. Well, there's plenty more to see. Three facts.

Fact one. LogSumExp is convex, and its gradient is exactly softmax: $\nabla \operatorname{LSE}(z) = \text{softmax}(z)$. Compare this to $\nabla A(\theta) = \mu$ from the previous section. This isn't an analogy — it's literally the same object: logits are the natural parameters of a categorical distribution, probabilities are its mean parameters, and LogSumExp is its log-partition function. A language model, on every single token, is performing the coordinate change between the two representations of an exponential family.

Fact two, the main one. LogSumExp is the Legendre–Fenchel transform of negative entropy on the probability simplex. Take $\Omega(p) = \sum_i p_i \ln p_i$ (negative entropy), defined on the simplex $\Delta$ (nonnegative coordinates summing to one), and compute its conjugate:

$$\Omega^{*}(z) = \sup_{p \in \Delta}\Big(\langle z, p\rangle - \sum_i p_i \ln p_i\Big) = \operatorname{LSE}(z),$$

with the supremum achieved exactly at $p = \text{softmax}(z)$. Let me rephrase this, because it's worth rephrasing:

$$\text{softmax}(z) = \arg\max_{p \in \Delta}\Big(\langle z, p\rangle + H(p)\Big).$$

Softmax is argmax, bribed by entropy. Without the $H(p)$ term, the problem $\max_{p\in\Delta}\langle z,p\rangle$ has a crude solution: dump all the probability mass on the single best token, giving you a hard, non-differentiable argmax — a corner of the simplex, a nightmare for gradient descent. Entropy drags the solution away from the corner, into the interior of the simplex, and makes it smooth. LogSumExp is the smooth version of max; softmax is the smooth version of argmax; and both are manufactured by the exact same tool: Fenchel conjugation.

Fact three, about the slider. Every language model API has a temperature parameter, and sampling in practice looks like this:

$$p_i = \frac{e^{z_i/\tau}}{\sum_j e^{z_j/\tau}}.$$

This is the Boltzmann–Gibbs distribution, literally, no metaphor involved: logits play the role of negative energies, and $\tau = 1/\beta$ is the same temperature from the thermodynamics section — the souvenir I told you to hang onto. As $\tau \to 0$, the distribution freezes onto a hard argmax — the model becomes a bore who always says the single most likely thing; as $\tau \to \infty$, it melts into uniform noise. And $\tau\operatorname{LSE}(z/\tau) \to \max_i z_i$ as it cools — free energy converging to ground-state energy, straight out of a stat-mech textbook. Every time you nudge the temperature slider, you are, quite literally, setting $\beta$ in a Gibbs ensemble. This is, in my opinion, the single best fact for ruining a coworker's lunch break.

Let's assemble the whole mosaic: the model's vocabulary is a set of states; logits are energies; LogSumExp is the free energy over the vocabulary; softmax is the Gibbs distribution; temperature is inverse beta. The last layer of an LLM is a small Legendre–Fenchel machine that, at every generated token, re-solves the problem "find the distribution that maximizes expected logit plus entropy."

Training: what cross-entropy actually is

One knot left to untie: the loss function. Recall the Fenchel–Young inequality: $f(x) + f^{}(s) \ge \langle s, x\rangle$, a gap that's nonnegative and vanishes exactly on dual pairs. Here's a question worth asking: what if you declared that gap to be your loss function? You'd get the family of **Fenchel–Young losses* (that's their actual name — Blondel, Martins, and Niculae wrote an entire paper building this out):

$$L_{\Omega}(z, y) = \Omega^{*}(z) + \Omega(y) - \langle z, y\rangle \ge 0.$$

Plug in our $\Omega$ = negative entropy, and $y = e_k$, the one-hot vector for the correct token. Then $\Omega(e_k) = 0$ (a distribution concentrated at a single point has zero entropy), $\langle z, e_k\rangle = z_k$, and the loss collapses to

$$L(z, e_k) = \operatorname{LSE}(z) - z_k = -\ln \text{softmax}(z)_k.$$

That's cross-entropy. The exact loss every language model on the planet is trained on. It isn't "just another entry in the loss function zoo" — it's precisely the gap in the Fenchel–Young inequality between logits and target, and minimizing it means pulling the logits and the correct answer into a dual pair. Training an LLM is, if you squint, a multi-billion-dollar industry devoted to shrinking the gap in one inequality from 1949 convex analysis.

And the construction set is modular: swap out entropy for a different convex $\Omega$, get a different loss with a different activation. A quadratic penalty instead of entropy gives you sparsemax — softmax's sparse cousin, which can honestly assign exactly zero probability to unlikely tokens (softmax can't; every one of its outputs is strictly positive, no matter how unlikely). Other choices of $\Omega$ give you structured losses for CRFs and parsing. One slot for a convex function, an entire product line of outputs.

And very briefly, on how all of this actually gets optimized — because duality reaches there too. Mirror descent, and its special case, the natural gradient, work in both coordinate systems at once: you map a point into dual space through the gradient of a convex function, take a step there, and map back through the conjugate gradient; distances in this geometry are measured with a Bregman divergence, built — you've already guessed — from that same convex function. So Fenchel conjugation governs not only what the network computes and what it's trained on, but how it descends the loss landscape in the first place.

The moral

Let's walk back through the clues one last time. A trick from the late eighteenth century — "rewrite a function in terms of its slope" — turned out to be Gibbs free energy. Free energy turned out to be the log-partition function of exponential families. The log-partition turned out to be the LogSumExp in a transformer's final layer. Softmax turned out to be a Gibbs distribution and, simultaneously, a smoothed argmax; temperature turned out to be inverse temperature; cross-entropy turned out to be a Fenchel–Young gap; and the optimizer turns out to walk between dual coordinate systems like they're two rooms in the same apartment.

The best part of this story is that none of it is a stretch. The butler, the gardener, and the stranger on the train didn't turn out to be the same person because the author wanted a tidy ending — they turned out to be the same person because convex duality is the one canonical way to switch between two descriptions of one convex structure: "what we control" and "what responds," energy and temperature, logits and probabilities. Mathematics simply doesn't offer a second way to do it, which is why everyone — from Legendre to the engineers at OpenAI — inevitably runs into the exact same formula, often without knowing about each other at all.

So if you have one free evening and want to invest it at maximum return, spend it on the Legendre–Fenchel transform. One idea, and it covers material from four separate courses.

Sources and further reading

  • R. K. P. Zia, E. F. Redish, S. R. McKay. Making Sense of the Legendre Transform. Am. J. Phys. 77, 614 (2009). arXiv:0806.1147 — the pedagogical paper this piece leans on: the tangent-line geometry, the symmetric form, and the saddle-point derivation.
  • R. T. Rockafellar. Convex Analysis. Princeton, 1970 — the unforgiving classic on conjugation and everything downstream of it.
  • S. Boyd, L. Vandenberghe. Convex Optimization. Cambridge, 2004 — a humane introduction to duality; the chapter on conjugate functions is an evening's read.
  • M. J. Wainwright, M. I. Jordan. Graphical Models, Exponential Families, and Variational Inference. 2008 — log-partition/entropy duality as the engine of variational inference.
  • M. Blondel, A. F. T. Martins, V. Niculae. Learning with Fenchel-Young Losses. JMLR, 2020 — cross-entropy, sparsemax, and structured losses as one parametric family.

Top comments (0)