Jocer Franquiz
Why Formal Systems Can't Read Their Own Output?

What do concatenative combinator calculus, mathematicians, and language models have in common? They all hit the same wall: Evaluation is lossy. A formal system can produce expressions but cannot unambiguously interpret them using only its own rules, because reduction destroys the information needed to recover intent.

This post traces a thread that surprised me, starting from a very concrete implementation question about a stack-based evaluator and ending somewhere between information theory and machine learning.

Concatenative Combinator evaluation

Imagine a simple concatenative (stack-based) combinator language. We have primitive operations like swap, drop, dip, call. We can define new words in a dictionary:

true  := [ swap drop ]
false := [ drop ]
not   := [ [ swap drop ] [ drop ] ] dip call

The evaluator reduces expressions by expanding words into their definitions and applying reduction rules. It works fine. But the thing is, once an expression is fully reduced, we're left staring at a pile of primitives. The word not has been dissolved into [ [ swap drop ] [ drop ] ] dip call, and nothing in the output tells us it was ever not.

This is normal. Evaluators reduce. They don't un-reduce.
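To make the reduction step concrete, here is a minimal sketch in Python. The representation is an assumption on my part: strings for primitives and dictionary words, nested lists for quotations; `expand` and `DICTIONARY` are illustrative names, not a real implementation.

```python
# Minimal sketch of the expansion step (hypothetical names). Expressions
# are nested Python lists: strings are primitives or dictionary words,
# sublists are quotations.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def expand(expr):
    """Recursively replace every dictionary word with its definition."""
    out = []
    for item in expr:
        if isinstance(item, list):       # quotation: expand its contents
            out.append(expand(item))
        elif item in DICTIONARY:         # word: splice in its definition
            out.extend(expand(DICTIONARY[item]))
        else:                            # primitive: keep as-is
            out.append(item)
    return out

print(expand(["not"]))
# [[['swap', 'drop'], ['drop']], 'dip', 'call'] -- the name "not" is gone
```

After expansion, the output is pure primitives; the function has no way to run backwards.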

The Reflection Problem

Now suppose we want the system to have a reflection property (programs can inspect their own structure at runtime and check whether some subexpression matches a word in the dictionary). Not as an external prettifier applied after the fact, but as a capability within the calculus itself. A program can look at the top of the stack and ask: "do I recognize this?"

Structural equality checking is straightforward. We walk two expression trees and compare. The real problem emerges when we try it on a concrete example.

Take this expression sitting on the stack:

[ [ [ swap drop ] [ drop ] ] dip call ]

Consulting the dictionary, we can read this as:

  • Level 0: just primitives: [ [ [ swap drop ] [ drop ] ] dip call ]
  • Level 1: partial recognition: [ [ true false ] dip call ]
  • Level 2: full recognition: [ not ]

All three are correct. The expression is all of these simultaneously.

One expression, three readings — the same reduced form interpreted at different abstraction levels

The matching is easy. The problem is that we get multiple valid matchings at different levels of abstraction, and nothing inside the formal system tells us which one is "right."
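The multi-level matching can be sketched as a search over "foldings", the inverse of expansion: one step replaces an occurrence of a definition with its word, and the closure of that step enumerates every reading. Same nested-list representation as before; all names here are hypothetical.

```python
# Hypothetical sketch: "folding" is the inverse of word expansion.
# Expressions are nested Python lists: strings are primitives or words,
# sublists are quotations.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def fold_once(expr):
    """Yield every expression obtained by replacing one occurrence of a
    dictionary definition with its word."""
    for word, body in DICTIONARY.items():
        n = len(body)
        for i in range(len(expr) - n + 1):
            if expr[i:i + n] == body:
                yield expr[:i] + [word] + expr[i + n:]
    for i, item in enumerate(expr):  # recurse into quotations
        if isinstance(item, list):
            for folded in fold_once(item):
                yield expr[:i] + [folded] + expr[i + 1:]

def all_readings(expr):
    """Enumerate every reading reachable by repeated folding."""
    seen, out, frontier = set(), [], [expr]
    while frontier:
        e = frontier.pop()
        if repr(e) in seen:
            continue
        seen.add(repr(e))
        out.append(e)
        frontier.extend(fold_once(e))
    return out

stack_top = [[[["swap", "drop"], ["drop"]], "dip", "call"]]
for reading in all_readings(stack_top):
    print(reading)
```

Both `[['not']]` and `[[['true', 'false'], 'dip', 'call']]` show up among the readings. The enumeration is mechanical; choosing between the results is not.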

But this ambiguity is familiar

This is exactly what happens in natural language. "She is cold" means:

  • She is emotionally less expressive than others.
  • She physically feels cold.

Both readings are syntactically and semantically valid. The sentence is genuinely, irreducibly ambiguous. Linguists have studied this for decades (Grice's cooperative principle, Sperber and Wilson's relevance theory, for example). And the consensus is the same: we resolve it through context, the surrounding conversation, shared knowledge, the situation, pragmatic inference about the speaker's intent. The resolution comes from outside the sentence itself.

Our combinator expression has the same property. The syntax supports multiple interpretations. The formal system can enumerate them but cannot choose between them. The original author's intent (did they mean not, or were they deliberately composing true and false with a branching pattern?) was lost during evaluation.

Evaluation is lossy. It destroys authorial intent. Reflection tries to recover the signal that reduction discarded.

If the system can't resolve the ambiguity from within, what can?

The obvious engineering answer is: don't lose the information in the first place. Instrument the evaluator to retain provenance, record which dictionary words were expanded at each step, and we can trace back to the original expression deterministically. This works in a closed system where we control the evaluator.
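A provenance-retaining expander is a small change: log each expansion as it happens. A sketch under the same nested-list assumption (all names illustrative, not a real API):

```python
# Hypothetical sketch: an expander that logs which word was expanded at
# each step, so the original expression can be recovered deterministically.
# Strings are words/primitives, sublists are quotations.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def expand_with_provenance(expr, trace=None):
    """Expand dictionary words, appending (word, definition) per step."""
    if trace is None:
        trace = []
    out = []
    for item in expr:
        if isinstance(item, list):
            sub, _ = expand_with_provenance(item, trace)
            out.append(sub)
        elif item in DICTIONARY:
            trace.append((item, DICTIONARY[item]))
            sub, _ = expand_with_provenance(DICTIONARY[item], trace)
            out.extend(sub)
        else:
            out.append(item)
    return out, trace

reduced, trace = expand_with_provenance(["not"])
print(trace[0][0])  # the reduced form alone would never tell us this
```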

But it breaks down quickly in practice: expressions arrive from external sources without history, interop boundaries strip annotations, provenance metadata grows combinatorially with expression size, and legacy systems weren't built to carry it. In the general case (reading an expression we didn't produce) the information is gone.

So what resolves the ambiguity when provenance isn't available? The same thing that resolves "she is cold": learned preferences shaped by experience.

The approach would be:

  1. The system presents all valid interpretations to the user, ranked by some initial heuristic (longest match, most abstract, most frequently chosen).
  2. The user reorders them: "no, in this context I meant not, not [ true false ] dip call."
  3. A model learns from these supervised signals.
  4. Over time, the system predicts the preferred interpretation without asking.

This is supervised machine learning. More specifically, it's the same setup that underlies language models: given an ambiguous sequence, predict the most probable interpretation based on patterns learned from human-labeled examples.
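As a degenerate sketch of that loop, the "model" can be nothing more than a counter over (context, interpretation) pairs, with a shortest-reading heuristic as the cold-start tiebreaker. All names are hypothetical; a real system would use a learned model rather than raw counts.

```python
# Hypothetical sketch: the simplest possible preference model, a counter
# over (context, interpretation) pairs built from user choices.
from collections import Counter

class PreferenceRanker:
    def __init__(self):
        self.counts = Counter()

    def record_choice(self, context, interpretation):
        # Supervised signal: in this context, the user picked this reading.
        self.counts[(context, interpretation)] += 1

    def rank(self, context, interpretations):
        # Order by how often each reading was chosen in this context;
        # fall back to the shortest reading as the cold-start heuristic.
        return sorted(
            interpretations,
            key=lambda reading: (-self.counts[(context, reading)], len(reading)),
        )

ranker = PreferenceRanker()
candidates = [
    "[ [ [ swap drop ] [ drop ] ] dip call ]",
    "[ [ true false ] dip call ]",
    "[ not ]",
]
ranker.record_choice("logic-code", "[ not ]")
ranker.record_choice("logic-code", "[ not ]")
print(ranker.rank("logic-code", candidates)[0])  # [ not ]
```

Replacing the counter with a model that generalizes across contexts is exactly the step from lookup tables to language models.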

We started with combinatory logic and arrived at a setting where a language model is a natural fit.

Proof tells us what's equivalent. Prediction tells us what was meant.

The Information-Theoretic View

Why is this ambiguity fundamental rather than incidental?

Evaluation is a many-to-one function. Multiple distinct source expressions reduce to the same normal form. not, [ true false ] dip call, and [ [ swap drop ] [ drop ] ] dip call all collapse to identical primitive sequences. The mapping from source to reduced form is non-injective: it destroys distinctions.

Recovering intent from a reduced expression is therefore an inverse problem: given one output, determine which of many possible inputs produced it. This is the same structure we find in lossy compression, decompilation, and signal reconstruction. The ambiguity isn't a bug in our evaluator. It's a structural property of any system where evaluation collapses distinctions that interpretation needs.
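Non-injectivity is easy to exhibit directly: two distinct sources, one normal form. Same nested-list representation assumption as in the earlier sketches; names are illustrative.

```python
# Two distinct sources, one normal form: evaluation is not invertible.
DICTIONARY = {
    "true": [["swap", "drop"]],
    "false": [["drop"]],
    "not": [[["swap", "drop"], ["drop"]], "dip", "call"],
}

def expand(expr):
    """Recursively replace every dictionary word with its definition."""
    out = []
    for item in expr:
        if isinstance(item, list):
            out.append(expand(item))
        elif item in DICTIONARY:
            out.extend(expand(DICTIONARY[item]))
        else:
            out.append(item)
    return out

source_a = ["not"]
source_b = [["true", "false"], "dip", "call"]
print(expand(source_a) == expand(source_b))  # True
```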

Evaluation is many-to-one — multiple source expressions collapse to the same reduced form

This has a precise formal counterpart. In equality saturation, an e-graph data structure maintains all equivalent representations of an expression simultaneously, exactly the way our dictionary check produces multiple valid readings at different abstraction levels. The egg library (Willsey et al., POPL 2021) formalized this approach, and a key result is that the extraction problem (i.e. choosing the "best" equivalent term from an e-graph according to a cost function) is NP-hard in general.

That hardness result matters here. It tells us that even when we have a perfect representation of all valid interpretations (the e-graph), selecting the right one is fundamentally expensive. Simple cost functions (fewest nodes, shortest expression) give us a tractable approximation, but they don't capture intent. A learned cost model, trained on which interpretations users actually prefer, is a language model by another name. This is what it looks like when a formal system acquires a statistical component not as a convenience, but because the selection problem demands it.
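A fewest-nodes cost function takes only a few lines, and the point is what it cannot do. In this toy case it happens to pick the most abstract reading, but it does so by size, not by intent (sketch, hypothetical names):

```python
# Sketch of extraction with a simple cost function: among equivalent
# readings, pick the one with the fewest nodes. It measures size, not intent.
def cost(expr):
    """Total node count of a nested-list expression."""
    return sum(cost(item) if isinstance(item, list) else 1 for item in expr)

readings = [
    [["not"]],                                        # Level 2
    [[["true", "false"], "dip", "call"]],             # Level 1
    [[[["swap", "drop"], ["drop"]], "dip", "call"]],  # Level 0
]
best = min(readings, key=cost)
print(best)  # [['not']]
```

The cost function has no way to prefer the Level 1 reading when that is what the author meant; that gap is exactly what a learned cost model fills.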

Mathematicians Have the Same Problem

This pattern shows up everywhere, including in how humans do mathematics.

A mathematician works through a long calculation and arrives at some algebraic structure. Formally it's symbols on a page. But after years of training, they look at it and think: "Of course! This is a Lie algebra." Or: "This has a Dirac delta function hiding in it."

That recognition is not deduction. They didn't derive it axiomatically in the moment. They pattern-matched against a vast internal library of structures, built up through years of supervised exposure: working problems, reading textbooks, being corrected by professors.

And just like the combinator case, the recognition is ambiguous. The same structure might be a Lie algebra, a tangent space, or an instance of a fiber bundle. The mathematician picks the interpretation most useful in context, and that judgment comes from experience, not from the formalism.

This is why mathematics takes years to learn. Part of that time goes to absorbing the sheer volume of formal content — modern mathematics is vast, and simply learning the definitions, theorems, and proof techniques is itself a major undertaking. But beyond that, what takes years is training the pattern recognition model in the mathematician's head. Learning to look at a page of symbols and feel that something "smells like" a cohomology group or is "obviously" a convolution. Mathematicians use aesthetic and intuitive language because the recognition process operates partly below the level of conscious formal reasoning.

The pedagogy even mirrors supervised learning: a student works through a problem; the professor says "do we see this is a Lie algebra?"; the student didn't see it; now they have a label; next time, they're more likely to recognize it. Over hundreds of examples with feedback, the model trains.

One Common Thread

Combinator calculus, mathematical practice, and natural language are three seemingly unrelated domains, but all exhibit the same structure:

|            | Combinator Reflection | Mathematics | Natural Language |
|------------|-----------------------|-------------|------------------|
| System     | Combinator calculus   | Symbolic algebra | Grammar + vocabulary |
| Expression | Reduced stack         | Calculation result | Sentence |
| Dictionary | Word definitions      | Known structures | Word meanings |
| Ambiguity  | Multiple valid readings | Multiple recognizable patterns | Multiple interpretations |
| Resolution | Heuristics or learned preference model | Formal training + intuition | Pragmatic inference + contextual knowledge |

In every case:

  • The formal content is produced mechanically.
  • Interpreting it requires recognition that is not mechanical.
  • Recognition involves ambiguity that the formalism cannot resolve internally.
  • Resolution requires judgment shaped by experience, whether heuristic, learned, or pragmatic.

The formalism produces. Recognition consumes. And they require fundamentally different mechanisms.

The information gap — production is deterministic, interpretation is ambiguous

Final remarks

Any sufficiently expressive formal system that tries to understand its own outputs the way its users understand them will eventually benefit from something like a language model inside it. Not as an add-on or a convenience, but as a natural component, because the gap between producing expressions and interpreting them is an information gap. Evaluation destroys distinctions that interpretation needs, and no amount of internal reasoning can recover what was never preserved.

The gap is not a bug in the implementation. It's a structural property of non-injective evaluation. And the way we cope with it in programming, in mathematics, in everyday speech, is always the same: we learn to predict what was meant, because the formal system alone can't recover it.


Further Reading

If these connections interest you, here are some threads worth pulling on.

Concatenative Combinatory Logic

Brent Kerby's The Theory of Concatenative Combinators (2002) is the foundational text for understanding how combinator calculus works in a concatenative (stack-based) setting. It establishes the theoretical framework that languages like Joy and Factor build on, and explores complete combinator bases for the concatenative setting. Available at tunes.org/~iepos/joy.html.

For a more recent and rigorous treatment, Foundations of Dawn: The Untyped Concatenative Calculus (2021) by the Dawn language project formalizes the core calculus and explores how data types can be encoded purely from combinators. See dawn-lang.org.

E-Graphs and Equality Saturation

The egg library by Max Willsey et al. is the key reference for equality saturation. Their paper egg: Fast and Extensible Equality Saturation (POPL 2021 Distinguished Paper) explains both the e-graph data structure and the extraction problem discussed in this post. See the egg website and the SIGPLAN blog post.

Self-Reference and Meaning

Douglas Hofstadter's Gödel, Escher, Bach: An Eternal Golden Braid (1979) remains the classic exploration of self-reference, formal systems, and the emergence of meaning. Relevant background for anyone interested in how systems that produce expressions relate to systems that interpret them.

Pattern Recognition in Mathematics

Jacques Hadamard's The Psychology of Invention in the Mathematical Field (1945) is a classic study of how mathematicians actually think, relying on visual imagery, aesthetic judgment, and sub-symbolic pattern recognition rather than purely formal deduction.

George Pólya's How to Solve It (1945) systematizes the heuristic, pattern-matching aspect of mathematical problem-solving that formal systems cannot capture.

More recently, DeepMind's collaboration with mathematicians used ML to suggest conjectures in knot theory and representation theory that the mathematicians then proved (Davies et al., Nature, 2021). A concrete instance of machines performing the same kind of pattern recognition that human mathematicians develop over years of training.

Language Models and Ambiguity

For the linguistic side, the classic reference is Speech and Language Processing by Dan Jurafsky and James Martin, which covers how statistical models handle ambiguity at every level of language. A free draft is available at web.stanford.edu/~jurafsky/slp3/.

On how humans resolve ambiguity without statistical models, H.P. Grice's "Logic and Conversation" (1975) introduced the cooperative principle and conversational implicature — the framework linguists use to explain how listeners infer intent beyond literal meaning. Sperber and Wilson's Relevance: Communication and Cognition (1986) extended this into a cognitive theory of pragmatic interpretation.


What do you think about this topic? See you next time...
