TL;DR
your doc looks fine until equations, operators, or table headers show up. then answers sound fluent but drift, and citations land on “similar looking” sections. this is No.11 · Symbolic collapse. the fix is not a reranker band-aid after the fact. you must preserve the symbol channel end-to-end and gate outputs before generation with acceptance targets.
What breaks
- LaTeX/MathML gets stripped or rasterized at ingest. only the surrounding prose remains.
- tokenizers normalize or drop operators (
≤
,≈
,∼
,≠
), or reduce them to ASCII guesses. - embeddings capture the vibe around an equation, not the structure inside it.
- chunking cuts equations across lines, so retrieval never sees the whole statement.
- tables lose header bindings. citations point “near” the row, not the exact cell.
this is not random. it’s structural. the symbolic channel was lost between intake → embedding → retrieval.
Before vs After (why this keeps coming back)
before
most teams patch after generation. tiny regex. “just add a reranker.” the same failure returns next week with a different equation.
after (WFGY way)
install a semantic firewall before generation. keep math blocks as first-class text. encode a symbol channel. add table contracts and operator-set checks. only allow outputs when the semantic state is stable (ΔS, coverage, λ). when you fix it at the reasoning layer, it stays fixed.
- WFGY Problem Map: identify as No.11 · Symbolic collapse
- WFGY Global Fix Map: apply symbol-aware chunking, dual-channel embeddings, citations with block IDs, and hard acceptance gates
60-second triage
Equation boundary probe
search for a full known equation. if top-k returns only prose, your symbol channel got dropped (a paste-ready version of the first two probes follows these checks).
Operator confusion
query two formulas that differ by a single operator. if the result sets overlap heavily, your embedding ignores operators.
Table anchor sanity
ask for row X, col Y. if citations stop near the table but don't bind to a cell, table semantics weren't preserved.
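a minimal sketch of the first two probes, assuming a hypothetical `search(query, k)` that returns the raw text of your retriever's top-k chunks (swap in your own client):

def equation_boundary_probe(search, known_equation, k=5):
    # if no top-k hit contains the literal equation, the symbol channel was dropped at ingest
    hits = search(known_equation, k)
    return any(known_equation in hit for hit in hits)

def operator_confusion_probe(search, eq_a, eq_b, k=5):
    # two formulas that differ by one operator should not retrieve near-identical sets
    hits_a, hits_b = set(search(eq_a, k)), set(search(eq_b, k))
    overlap = len(hits_a & hits_b) / max(len(hits_a | hits_b), 1)
    return overlap  # close to 1.0 means the embedding is ignoring operators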
Minimal fix: keep the symbol channel intact (intake → embed → retrieve)
Don’t strip LaTeX/MathML
persist math blocks as text. store a `symbol_text` field alongside `clean_text`. never convert equations to images at ingest.
Dual-channel representation
build embeddings on `[clean_text + symbol_text]` or maintain two vectors and fuse late. verify ΔS(question, retrieved) ≤ 0.45 on symbol-only queries.
reference: Chunking → Embedding Contract
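a minimal late-fusion sketch, assuming a hypothetical `embed(text)` that returns a numpy vector from your embedding model; the 0.5 / 0.5 weights are placeholders to tune:

import numpy as np

def cosine(a, b):
    # similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dual_channel_score(embed, query, clean_text, symbol_text, w_clean=0.5, w_symbol=0.5):
    # score the prose channel and the symbol channel separately, then fuse late
    q = embed(query)
    return w_clean * cosine(q, embed(clean_text)) + w_symbol * cosine(q, embed(symbol_text))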
Equation-aware chunking
chunk on math boundaries. never split one equation across chunks. attach `(doc_id, block_id, offsets, block_type=equation|table|prose)`.
reference: Chunking checklist
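a minimal chunking sketch, assuming display math is delimited by `$$ ... $$` (swap in your own TeX/MathML block detector); each chunk carries the metadata named above:

import re

MATH_BLOCK = re.compile(r"\$\$.*?\$\$", re.DOTALL)

def chunk_with_math_boundaries(doc_id, text):
    # split on display-math boundaries so no equation is cut across chunks
    chunks, cursor, block_id = [], 0, 0
    for match in MATH_BLOCK.finditer(text):
        if match.start() > cursor:
            chunks.append({
                "doc_id": doc_id, "block_id": block_id, "block_type": "prose",
                "offsets": (cursor, match.start()), "text": text[cursor:match.start()],
            })
            block_id += 1
        chunks.append({
            "doc_id": doc_id, "block_id": block_id, "block_type": "equation",
            "offsets": (match.start(), match.end()), "text": match.group(0),
        })
        block_id += 1
        cursor = match.end()
    if cursor < len(text):
        chunks.append({
            "doc_id": doc_id, "block_id": block_id, "block_type": "prose",
            "offsets": (cursor, len(text)), "text": text[cursor:],
        })
    return chunks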
Table contracts
enforce `table_id, row_key, col_key, cell_value, header_map`. retrieval must return cell coordinates; citations must carry cell IDs.
reference: Retrieval traceability
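a minimal sketch of the table contract as plain dataclasses; the field names mirror the contract above, and the citation string binds to the exact cell instead of landing “near” the table:

from dataclasses import dataclass

@dataclass
class TableCell:
    table_id: str
    row_key: str
    col_key: str
    cell_value: str
    header_map: dict  # column key -> original header text

def cell_citation(cell: TableCell) -> str:
    # a citation that round-trips to one cell, not a region
    return f"{cell.table_id}#{cell.row_key}:{cell.col_key}"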
Reranker with operator features
add features for operator sets, variable names, and numeric patterns. demote candidates with mismatched operator sets.
reference: Rerankers
Metric hygiene
don’t mix L2 and cosine; normalize consistently pre-embedding and at query time.
reference: Embedding vs Meaning, Metric Mismatch
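a minimal hygiene sketch: once vectors are unit-length at both ingest and query time, cosine and inner product rank identically, so the metric can't silently flip between the two sides:

import numpy as np

def l2_normalize(vec):
    # apply the exact same normalization at ingest time and at query time
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec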
Tiny probe you can paste today
import re

def symbol_set(text):
    # toy probe; extend for full TeX/MathML coverage
    keep = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"
    return set(re.findall(keep, text))

def operator_mismatch(query_eq, retrieved_eq):
    q = symbol_set(query_eq)
    r = symbol_set(retrieved_eq)
    return {
        "query_symbols": sorted(q),
        "retrieved_symbols": sorted(r),
        "ok": q == r,
    }

print(operator_mismatch("a ≤ b + c", "a < b + c"))
# shows the operator difference at a glance
wire this into your reranker. if `ok` is false, demote or reject.
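a minimal wiring sketch, assuming candidates arrive as `(score, equation_text)` pairs from your retriever and reusing `operator_mismatch` from above:

def demote_operator_mismatches(query_eq, candidates, penalty=0.5):
    # candidates: list of (score, equation_text); demote instead of silently accepting
    reranked = []
    for score, eq in candidates:
        if not operator_mismatch(query_eq, eq)["ok"]:
            score *= penalty  # or drop the candidate entirely for hard gating
        reranked.append((score, eq))
    return sorted(reranked, key=lambda pair: pair[0], reverse=True)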
Hard fixes when minimal isn’t enough
- Symbol-aware tokenizer for the symbol channel (code/math-aware or byte-level). keep operators intact.
- TeX normalization (spacing, macro expansion) before hashing + embedding to avoid accidental near-duplicates.
- Operator exact-match side index based on operator sequences + variable sets; fuse with your semantic retriever (a sketch follows this list).
- Table schema store as its own thing; join at retrieval time. never guess header bindings at generation time.
- Eval gates that reject when operator sets don’t match, instead of “explaining around” the mismatch.
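a minimal sketch of that side index, keyed on the ordered operator sequence plus the variable set; the regexes are toy versions to extend for your notation:

import re

OPERATORS = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"
VARIABLES = r"[A-Za-zα-ωΑ-Ω][A-Za-z0-9_]*"

def exact_key(equation):
    # ordered operator sequence plus the set of variable names
    ops = tuple(re.findall(OPERATORS, equation))
    variables = frozenset(re.findall(VARIABLES, equation))
    return ops, variables

def build_side_index(blocks):
    # blocks: list of dicts with "block_id" and equation "text"
    index = {}
    for block in blocks:
        index.setdefault(exact_key(block["text"]), []).append(block["block_id"])
    return index

def exact_candidates(index, query_equation):
    # fuse these ids with your semantic retriever's candidate list
    return index.get(exact_key(query_equation), [])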
WFGY guardrails to turn on
Traceability contract
citations must include `block_type` and equation/cell IDs. you must round-trip to the exact span.
ΔS + λ probes
measure ΔS on symbol-only queries. run three paraphrases; λ must converge. reset or redirect if it doesn’t.
SCU (Symbolic Constraint Unlock)
forbid cross-section reuse when operator sets differ. stop “similar looking” prose leakage.
Variance clamp
when `block_type = equation|table`, dampen paraphrase variance to keep the symbolic parts stable.
Acceptance targets (don’t skip these)
- ΔS(question, retrieved) ≤ 0.45 on equation-only and table-only queries
- operator set + variable names must match between query and retrieved block
- citations include `block_type` and equation/cell IDs, and they round-trip exactly
- coverage ≥ 0.70 on symbol-heavy sections
- λ convergent across 3 paraphrases that only vary surrounding prose (see the gate sketch below)
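a minimal gate sketch, assuming ΔS, coverage, and the λ-convergence flag are computed upstream by your own probes; it reuses `symbol_set` from the tiny probe above and mirrors the thresholds in this list:

def acceptance_gate(delta_s, coverage, lambda_convergent, query_eq, retrieved_eq):
    # block generation instead of “explaining around” a symbolic mismatch
    checks = {
        "delta_s_ok": delta_s <= 0.45,
        "coverage_ok": coverage >= 0.70,
        "lambda_ok": bool(lambda_convergent),
        "operators_ok": symbol_set(query_eq) == symbol_set(retrieved_eq),  # extend with variable-name matching
    }
    return all(checks.values()), checks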
Where to go next
- Chunking → Embedding Contract
- Citation First
- Embedding vs Meaning, Metric Mismatch
- Rerankers
- Chunking checklist
- Retrieval traceability
Full series index
ProblemMap · Article Index
about the approach
this is part of a broader “fix before generation” mindset. the WFGY Problem Map catalogs reproducible failure modes and shows the structural fix that makes them stay fixed. the Global Fix Map expands that into RAG, embeddings, vector stores, multimodal alignment, agents, and ops. when the symbol channel is preserved end-to-end and outputs are gated with ΔS, coverage, and λ, symbolic collapse stops being a whack-a-mole and becomes a solved class.