Jocer Franquiz
Why Formal Systems Can't Read Their Own Output?

What do concatenative combinator calculus, mathematicians, and language models have in common? They all hit the same wall: Evaluation is lossy. A formal system can produce expressions but cannot unambiguously interpret them using only its own rules, because reduction destroys the information needed to recover intent.

This post traces a thread that surprised me, starting from a very concrete implementation question about a stack-based evaluator and ending somewhere between information theory and machine learning.

Concatenative Combinator evaluation

Imagine a simple concatenative (stack-based) combinator language. We have primitive operations like swap, drop, dip, call. We can define new words in a dictionary:

true  := [ swap drop ]
false := [ drop ]
not   := [ [ swap drop ] [ drop ] ] dip call

The evaluator reduces expressions by expanding words into their definitions and applying reduction rules. It works fine. But the thing is, once an expression is fully reduced, we're left staring at a pile of primitives. The word not has been dissolved into [ [ swap drop ] [ drop ] ] dip call, and nothing in the output tells us it was ever not.

This is normal. Evaluators reduce. They don't un-reduce.
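To make the reduction step concrete, here is a minimal sketch in Python. The representation is an assumption on my part: strings for primitives and dictionary words, nested lists for quotations; `expand` and `DICTIONARY` are illustrative names, not a real implementation.

```python
# Minimal sketch of the expansion step (hypothetical names). Expressions
# are nested Python lists: strings are primitives or dictionary words,
# sublists are quotations.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def expand(expr):
    """Recursively replace every dictionary word with its definition."""
    out = []
    for item in expr:
        if isinstance(item, list):       # quotation: expand its contents
            out.append(expand(item))
        elif item in DICTIONARY:         # word: splice in its definition
            out.extend(expand(DICTIONARY[item]))
        else:                            # primitive: keep as-is
            out.append(item)
    return out

print(expand(["not"]))
# [[['swap', 'drop'], ['drop']], 'dip', 'call'] -- the name "not" is gone
```

After expansion, the output is pure primitives; the function has no way to run backwards.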

The Reflection Problem

Now suppose we want the system to have a reflection property (programs can inspect their own structure at runtime and check whether some subexpression matches a word in the dictionary). Not as an external prettifier applied after the fact, but as a capability within the calculus itself. A program can look at the top of the stack and ask: "do I recognize this?"

Structural equality checking is straightforward. We walk two expression trees and compare. The real problem emerges when we try it on a concrete example.

Take this expression sitting on the stack:

[ [ [ swap drop ] [ drop ] ] dip call ]

Consulting the dictionary, we can read this as:

  • Level 0: just primitives: [ [ [ swap drop ] [ drop ] ] dip call ]
  • Level 1: partial recognition: [ [ true false ] dip call ]
  • Level 2: full recognition: [ not ]

All three are correct. The expression is all of these simultaneously.

One expression, three readings — the same reduced form interpreted at different abstraction levels

The matching is easy. The problem is that we get multiple valid matchings at different levels of abstraction, and nothing inside the formal system tells us which one is "right."
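The multi-level matching can be sketched as a search over "foldings", the inverse of expansion: one step replaces an occurrence of a definition with its word, and the closure of that step enumerates every reading. Same nested-list representation as before; all names here are hypothetical.

```python
# Hypothetical sketch: "folding" is the inverse of word expansion.
# Expressions are nested Python lists: strings are primitives or words,
# sublists are quotations.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def fold_once(expr):
    """Yield every expression obtained by replacing one occurrence of a
    dictionary definition with its word."""
    for word, body in DICTIONARY.items():
        n = len(body)
        for i in range(len(expr) - n + 1):
            if expr[i:i + n] == body:
                yield expr[:i] + [word] + expr[i + n:]
    for i, item in enumerate(expr):  # recurse into quotations
        if isinstance(item, list):
            for folded in fold_once(item):
                yield expr[:i] + [folded] + expr[i + 1:]

def all_readings(expr):
    """Enumerate every reading reachable by repeated folding."""
    seen, out, frontier = set(), [], [expr]
    while frontier:
        e = frontier.pop()
        if repr(e) in seen:
            continue
        seen.add(repr(e))
        out.append(e)
        frontier.extend(fold_once(e))
    return out

stack_top = [[[["swap", "drop"], ["drop"]], "dip", "call"]]
for reading in all_readings(stack_top):
    print(reading)
```

Both `[['not']]` and `[[['true', 'false'], 'dip', 'call']]` show up among the readings. The enumeration is mechanical; choosing between the results is not.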

But this ambiguity is familiar

This is exactly what happens in natural language. "She is cold" means:

  • She is emotionally less expressive than others.
  • She physically feels cold.

Both readings are syntactically and semantically valid. The sentence is genuinely, irreducibly ambiguous. Linguists have studied this for decades (Grice's cooperative principle, Sperber and Wilson's relevance theory, for example). And the consensus is the same: we resolve it through context, the surrounding conversation, shared knowledge, the situation, pragmatic inference about the speaker's intent. The resolution comes from outside the sentence itself.

Our combinator expression has the same property. The syntax supports multiple interpretations. The formal system can enumerate them but cannot choose between them. The original author's intent (did they mean not, or were they deliberately composing true and false with a branching pattern?) was lost during evaluation.

Evaluation is lossy. It destroys authorial intent. Reflection tries to recover the signal that reduction discarded.

If the system can't resolve the ambiguity from within, what can?

The obvious engineering answer is: don't lose the information in the first place. Instrument the evaluator to retain provenance, record which dictionary words were expanded at each step, and we can trace back to the original expression deterministically. This works in a closed system where we control the evaluator.
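A provenance-retaining expander is a small change: log each expansion as it happens. A sketch under the same nested-list assumption (all names illustrative, not a real API):

```python
# Hypothetical sketch: an expander that logs which word was expanded at
# each step, so the original expression can be recovered deterministically.
# Strings are words/primitives, sublists are quotations.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def expand_with_provenance(expr, trace=None):
    """Expand dictionary words, appending (word, definition) per step."""
    if trace is None:
        trace = []
    out = []
    for item in expr:
        if isinstance(item, list):
            sub, _ = expand_with_provenance(item, trace)
            out.append(sub)
        elif item in DICTIONARY:
            trace.append((item, DICTIONARY[item]))
            sub, _ = expand_with_provenance(DICTIONARY[item], trace)
            out.extend(sub)
        else:
            out.append(item)
    return out, trace

reduced, trace = expand_with_provenance(["not"])
print(trace[0][0])  # the reduced form alone would never tell us this
```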

But it breaks down quickly in practice: expressions arrive from external sources without history, interop boundaries strip annotations, provenance metadata grows combinatorially with expression size, and legacy systems weren't built to carry it. In the general case (reading an expression we didn't produce) the information is gone.

So what resolves the ambiguity when provenance isn't available? The same thing that resolves "she is cold": learned preferences shaped by experience.

The approach would be:

  1. The system presents all valid interpretations to the user, ranked by some initial heuristic (longest match, most abstract, most frequently chosen).
  2. The user reorders them: "no, in this context I meant not, not [ true false ] dip call."
  3. A model learns from these supervised signals.
  4. Over time, the system predicts the preferred interpretation without asking.

This is supervised machine learning. More specifically, it's the same setup that underlies language models: given an ambiguous sequence, predict the most probable interpretation based on patterns learned from human-labeled examples.
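As a degenerate sketch of that loop, the "model" can be nothing more than a counter over (context, interpretation) pairs, with a shortest-reading heuristic as the cold-start tiebreaker. All names are hypothetical; a real system would use a learned model rather than raw counts.

```python
# Hypothetical sketch: the simplest possible preference model, a counter
# over (context, interpretation) pairs built from user choices.
from collections import Counter

class PreferenceRanker:
    def __init__(self):
        self.counts = Counter()

    def record_choice(self, context, interpretation):
        # Supervised signal: in this context, the user picked this reading.
        self.counts[(context, interpretation)] += 1

    def rank(self, context, interpretations):
        # Order by how often each reading was chosen in this context;
        # fall back to the shortest reading as the cold-start heuristic.
        return sorted(
            interpretations,
            key=lambda reading: (-self.counts[(context, reading)], len(reading)),
        )

ranker = PreferenceRanker()
candidates = [
    "[ [ [ swap drop ] [ drop ] ] dip call ]",
    "[ [ true false ] dip call ]",
    "[ not ]",
]
ranker.record_choice("logic-code", "[ not ]")
ranker.record_choice("logic-code", "[ not ]")
print(ranker.rank("logic-code", candidates)[0])  # [ not ]
```

Replacing the counter with a model that generalizes across contexts is exactly the step from lookup tables to language models.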

We started with combinatory logic and arrived at a setting where a language model is a natural fit.

Proof tells us what's equivalent. Prediction tells us what was meant.

The Information-Theoretic View

Why is this ambiguity fundamental rather than incidental?

Evaluation is a many-to-one function. Multiple distinct source expressions reduce to the same normal form. not, [ true false ] dip call, and [ [ swap drop ] [ drop ] ] dip call all collapse to identical primitive sequences. The mapping from source to reduced form is non-injective: it destroys distinctions.

Recovering intent from a reduced expression is therefore an inverse problem: given one output, determine which of many possible inputs produced it. This is the same structure we find in lossy compression, decompilation, and signal reconstruction. The ambiguity isn't a bug in our evaluator. It's a structural property of any system where evaluation collapses distinctions that interpretation needs.
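Non-injectivity is easy to exhibit directly: two distinct sources, one normal form. Same nested-list representation assumption as in the earlier sketches; names are illustrative.

```python
# Two distinct sources, one normal form: evaluation is not invertible.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def expand(expr):
    """Recursively replace every dictionary word with its definition."""
    out = []
    for item in expr:
        if isinstance(item, list):
            out.append(expand(item))
        elif item in DICTIONARY:
            out.extend(expand(DICTIONARY[item]))
        else:
            out.append(item)
    return out

source_a = ["not"]
source_b = [["true", "false"], "dip", "call"]
print(expand(source_a) == expand(source_b))  # True
```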

Evaluation is many-to-one — multiple source expressions collapse to the same reduced form

This has a precise formal counterpart. In equality saturation, an e-graph data structure maintains all equivalent representations of an expression simultaneously, exactly the way our dictionary check produces multiple valid readings at different abstraction levels. The egg library (Willsey et al., POPL 2021) formalized this approach, and a key result is that the extraction problem (i.e. choosing the "best" equivalent term from an e-graph according to a cost function) is NP-hard in general.

That hardness result matters here. It tells us that even when we have a perfect representation of all valid interpretations (the e-graph), selecting the right one is fundamentally expensive. Simple cost functions (fewest nodes, shortest expression) give us a tractable approximation, but they don't capture intent. A learned cost model, trained on which interpretations users actually prefer, is a language model by another name. This is what it looks like when a formal system acquires a statistical component not as a convenience, but because the selection problem demands it.
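A fewest-nodes cost function takes only a few lines, and the point is what it cannot do. In this toy case it happens to pick the most abstract reading, but it does so by size, not by intent (sketch, hypothetical names):

```python
# Sketch of extraction with a simple cost function: among equivalent
# readings, pick the one with the fewest nodes. It measures size, not intent.
def cost(expr):
    """Total node count of a nested-list expression."""
    return sum(cost(item) if isinstance(item, list) else 1 for item in expr)

readings = [
    [["not"]],                                        # Level 2
    [[["true", "false"], "dip", "call"]],             # Level 1
    [[[["swap", "drop"], ["drop"]], "dip", "call"]],  # Level 0
]
best = min(readings, key=cost)
print(best)  # [['not']]
```

The cost function has no way to prefer the Level 1 reading when that is what the author meant; that gap is exactly what a learned cost model fills.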

Mathematicians Have the Same Problem

This pattern shows up everywhere, including in how humans do mathematics.

A mathematician works through a long calculation and arrives at some algebraic structure. Formally it's symbols on a page. But after years of training, they look at it and think: "Of course! This is a Lie algebra." Or: "This has a Dirac delta function hiding in it."

That recognition is not deduction. They didn't derive it axiomatically in the moment. They pattern-matched against a vast internal library of structures, built up through years of supervised exposure: working problems, reading textbooks, being corrected by professors.

And just like the combinator case, the recognition is ambiguous. The same structure might be a Lie algebra, a tangent space, or an instance of a fiber bundle. The mathematician picks the interpretation most useful in context, and that judgment comes from experience, not from the formalism.

This is why mathematics takes years to learn. Part of that time goes to absorbing the sheer volume of formal content — modern mathematics is vast, and simply learning the definitions, theorems, and proof techniques is itself a major undertaking. But beyond that, what takes years is training the pattern recognition model in the mathematician's head. Learning to look at a page of symbols and feel that something "smells like" a cohomology group or is "obviously" a convolution. Mathematicians use aesthetic and intuitive language because the recognition process operates partly below the level of conscious formal reasoning.

The pedagogy even mirrors supervised learning: a student works through a problem; the professor says "do we see this is a Lie algebra?"; the student didn't see it; now they have a label; next time, they're more likely to recognize it. Over hundreds of examples with feedback, the model trains.

One Common Thread

Combinator calculus, mathematical practice, and natural language are three seemingly unrelated domains, but all exhibit the same structure:

|            | Combinator Reflection | Mathematics | Natural Language |
|------------|-----------------------|-------------|------------------|
| System     | Combinator calculus   | Symbolic algebra | Grammar + vocabulary |
| Expression | Reduced stack         | Calculation result | Sentence |
| Dictionary | Word definitions      | Known structures | Word meanings |
| Ambiguity  | Multiple valid readings | Multiple recognizable patterns | Multiple interpretations |
| Resolution | Heuristics or learned preference model | Formal training + intuition | Pragmatic inference + contextual knowledge |

In every case:

  • The formal content is produced mechanically.
  • Interpreting it requires recognition that is not mechanical.
  • Recognition involves ambiguity that the formalism cannot resolve internally.
  • Resolution requires judgment shaped by experience, whether heuristic, learned, or pragmatic.

The formalism produces. Recognition consumes. And they require fundamentally different mechanisms.

The information gap — production is deterministic, interpretation is ambiguous

Final remarks

Any sufficiently expressive formal system that tries to understand its own outputs the way its users understand them will eventually benefit from something like a language model inside it. Not as an add-on or a convenience, but as a natural component, because the gap between producing expressions and interpreting them is an information gap. Evaluation destroys distinctions that interpretation needs, and no amount of internal reasoning can recover what was never preserved.

The gap is not a bug in the implementation. It's a structural property of non-injective evaluation. And the way we cope with it in programming, in mathematics, in everyday speech, is always the same: we learn to predict what was meant, because the formal system alone can't recover it.


Further Reading

If these connections interest you, here are some threads worth pulling on.

Concatenative Combinatory Logic

Brent Kerby's The Theory of Concatenative Combinators (2002) is the foundational text for understanding how combinator calculus works in a concatenative (stack-based) setting. It establishes the theoretical framework that languages like Joy and Factor build on, and explores complete combinator bases for the concatenative setting. Available at tunes.org/~iepos/joy.html.

For a more recent and rigorous treatment, Foundations of Dawn: The Untyped Concatenative Calculus (2021) by the Dawn language project formalizes the core calculus and explores how data types can be encoded purely from combinators. See dawn-lang.org.

E-Graphs and Equality Saturation

The egg library by Max Willsey et al. is the key reference for equality saturation. Their paper egg: Fast and Extensible Equality Saturation (POPL 2021 Distinguished Paper) explains both the e-graph data structure and the extraction problem discussed in this post. See the egg website and the SIGPLAN blog post.

Self-Reference and Meaning

Douglas Hofstadter's Gödel, Escher, Bach: An Eternal Golden Braid (1979) remains the classic exploration of self-reference, formal systems, and the emergence of meaning. Relevant background for anyone interested in how systems that produce expressions relate to systems that interpret them.

Pattern Recognition in Mathematics

Jacques Hadamard's The Psychology of Invention in the Mathematical Field (1945) is a classic study of how mathematicians actually think, relying on visual imagery, aesthetic judgment, and sub-symbolic pattern recognition rather than purely formal deduction.

George Pólya's How to Solve It (1945) systematizes the heuristic, pattern-matching aspect of mathematical problem-solving that formal systems cannot capture.

More recently, DeepMind's collaboration with mathematicians used ML to suggest conjectures in knot theory and representation theory that the mathematicians then proved (Davies et al., Nature, 2021). A concrete instance of machines performing the same kind of pattern recognition that human mathematicians develop over years of training.

Language Models and Ambiguity

For the linguistic side, the classic reference is Speech and Language Processing by Dan Jurafsky and James Martin, which covers how statistical models handle ambiguity at every level of language. A free draft is available at web.stanford.edu/~jurafsky/slp3/.

On how humans resolve ambiguity without statistical models, H.P. Grice's "Logic and Conversation" (1975) introduced the cooperative principle and conversational implicature — the framework linguists use to explain how listeners infer intent beyond literal meaning. Sperber and Wilson's Relevance: Communication and Cognition (1986) extended this into a cognitive theory of pragmatic interpretation.


What do you think about this topic? See you next time...
