TL;DR
your doc looks fine until equations, operators, or table headers show up. then answers sound fluent but drift, and citations land on "similar-looking" sections. this is No.11 · Symbolic collapse. the fix is not a reranker band-aid after the fact: you must preserve the symbol channel end-to-end and gate outputs with acceptance targets before generation.
What breaks
- LaTeX/MathML gets stripped or rasterized at ingest. only the surrounding prose remains.
- tokenizers normalize or drop operators (≤, ≈, ∼, ≠), or reduce them to ASCII guesses.
- embeddings capture the vibe around an equation, not the structure inside it.
- chunking cuts equations across lines, so retrieval never sees the whole statement.
- tables lose header bindings. citations point “near” the row, not the exact cell.
this is not random. it’s structural. the symbolic channel was lost between intake → embedding → retrieval.
Before vs After (why this keeps coming back)
before
most teams patch after generation. tiny regex. “just add a reranker.” the same failure returns next week with a different equation.
after (WFGY way)
install a semantic firewall before generation. keep math blocks as first-class text. encode a symbol channel. add table contracts and operator-set checks. only allow outputs when the semantic state is stable (ΔS, coverage, λ). when you fix it at the reasoning layer, it stays fixed.
- WFGY Problem Map: identify as No.11 · Symbolic collapse
- WFGY Global Fix Map: apply symbol-aware chunking, dual-channel embeddings, citations with block IDs, and hard acceptance gates
60-second triage
- Equation boundary probe 
 search for a full known equation. if top-k returns only prose, your symbol channel got dropped.
- Operator confusion 
 query two formulas that differ by a single operator. if result sets overlap heavily, your embedding ignores operators.
- Table anchor sanity 
 ask for row X, col Y. if citations stop near the table but don’t bind to a cell, table semantics weren’t preserved.
Minimal fix: keep the symbol channel intact (intake → embed → retrieve)
- Don’t strip LaTeX/MathML 
 persist math blocks as text. store a `symbol_text` field alongside `clean_text`. never convert equations to images at ingest.
- Dual-channel representation 
 build embeddings on `[clean_text + symbol_text]` or maintain two vectors and fuse late. verify ΔS(question, retrieved) ≤ 0.45 on symbol-only queries.
 reference: Chunking → Embedding Contract
- Equation-aware chunking 
 chunk on math boundaries. never split one equation across chunks. attach `(doc_id, block_id, offsets, block_type=equation|table|prose)`.
 reference: Chunking checklist
- Table contracts 
 enforce `table_id, row_key, col_key, cell_value, header_map`. retrieval must return cell coordinates; citations must carry cell IDs.
 reference: Retrieval traceability
- Reranker with operator features 
 add features for operator sets, variable names, and numeric patterns. demote candidates with mismatched operator sets.
 reference: Rerankers
- Metric hygiene 
 don’t mix L2 and cosine; normalize consistently pre-embedding and at query time.
 reference: Embedding vs Meaning, Metric Mismatch
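the equation-aware chunking rule can be sketched in a few lines. this is a toy that only handles `$$...$$` display math; a real pipeline also needs `\[...\]`, inline math, and MathML, and `chunk_math_aware` is a hypothetical name:

```python
import re

def chunk_math_aware(text, max_len=400):
    """Split on $$...$$ boundaries so no equation is ever cut across
    chunks. Prose between equations is buffered up to max_len; oversized
    prose handling is left out of this toy."""
    parts = re.split(r"(\$\$.*?\$\$)", text, flags=re.DOTALL)
    chunks, buf = [], ""
    for part in parts:
        is_eq = part.startswith("$$")
        if is_eq or len(buf) + len(part) > max_len:
            if buf.strip():
                chunks.append({"block_type": "prose", "text": buf.strip()})
            buf = ""
        if is_eq:
            chunks.append({"block_type": "equation", "text": part})
        else:
            buf += part
    if buf.strip():
        chunks.append({"block_type": "prose", "text": buf.strip()})
    return chunks

for c in chunk_math_aware("intro text $$E=mc^2$$ more text"):
    print(c["block_type"], "->", c["text"])
```

note the design choice: equations become their own chunks with `block_type=equation`, which is what lets citations later bind to an equation ID instead of "the paragraph nearby."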
Tiny probe you can paste today
```python
import re

def symbol_set(text):
    # toy probe; extend for full TeX/MathML coverage
    keep = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"
    return set(re.findall(keep, text))

def operator_mismatch(query_eq, retrieved_eq):
    q = symbol_set(query_eq)
    r = symbol_set(retrieved_eq)
    return {
        "query_symbols": sorted(q),
        "retrieved_symbols": sorted(r),
        "ok": q == r,
    }

print(operator_mismatch("a ≤ b + c", "a < b + c"))
# shows the operator difference at a glance: ok is False
```
wire this into your reranker. if `ok` is false, demote or reject the candidate.
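here is one way that wiring could look. a sketch, not a prescription: `rerank_with_operator_gate` and the `penalty` value are made up for illustration, and `symbol_set` is repeated so the snippet runs standalone:

```python
import re

def symbol_set(text):  # same toy probe as above, repeated so this runs alone
    return set(re.findall(r"[=+\-*/<>≤≥≈≠∑∏∫]", text))

def rerank_with_operator_gate(query_eq, candidates, penalty=0.5):
    """Demote candidates whose operator set differs from the query's.
    candidates: list of (text, base_score) pairs from your retriever.
    A hard gate would drop mismatches entirely instead of penalizing."""
    q = symbol_set(query_eq)
    rescored = [(text, score * (1.0 if symbol_set(text) == q else penalty))
                for text, score in candidates]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

cands = [("a < b + c", 0.95), ("a ≤ b + c", 0.90)]
print(rerank_with_operator_gate("a ≤ b + c", cands)[0][0])  # a ≤ b + c
```

the exact-operator match now outranks the near-miss even though its base similarity score was lower, which is the whole point.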
Hard fixes when minimal isn’t enough
- Symbol-aware tokenizer for the symbol channel (code/math-aware or byte-level). keep operators intact.
- TeX normalization (spacing, macro expansion) before hashing + embedding to avoid accidental near-duplicates.
- Operator exact-match side index based on operator sequences + variable sets; fuse with your semantic retriever.
- Table schema store as its own thing; join at retrieval time. never guess header bindings at generation time.
- Eval gates that reject when operator sets don’t match, instead of “explaining around” the mismatch.
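the TeX-normalization step can be sketched too. this toy only collapses whitespace and expands three common aliases; the alias table is a stub of my own, and a real pipeline also expands user-defined macros:

```python
def normalize_tex(tex):
    """Toy TeX normalizer: collapse whitespace and expand a few macro
    aliases so near-duplicate equations hash and embed identically."""
    tex = " ".join(tex.split()) + " "  # collapse whitespace, pad for suffix match
    for short, full in ((r"\le ", r"\leq "), (r"\ge ", r"\geq "), (r"\ne ", r"\neq ")):
        tex = tex.replace(short, full)
    return tex.strip()

print(normalize_tex(r"a \le  b") == normalize_tex(r"a \leq b"))  # True
```

without this step, `a \le b` and `a \leq b` hash to different blocks and your index quietly fills with near-duplicates.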
WFGY guardrails to turn on
- Traceability contract 
 citations must include `block_type` and equation/cell IDs. you must round-trip to the exact span.
- ΔS + λ probes 
 measure ΔS on symbol-only queries. run three paraphrases; λ must converge. reset or redirect if it doesn’t.
- SCU (Symbolic Constraint Unlock) 
 forbid cross-section reuse when operator sets differ. stop “similar looking” prose leakage.
- Variance clamp 
 when `block_type = equation|table`, dampen paraphrase variance to keep the symbolic parts stable.
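for the ΔS probe, one common stand-in is distance in embedding space. this is an assumption on my part, not the official WFGY definition; here ΔS is sketched as 1 minus cosine similarity:

```python
import math

def delta_s(vec_a, vec_b):
    """Toy ΔS probe: 1 - cosine similarity between the question embedding
    and the retrieved-block embedding. Lower is more stable; the targets
    below use ΔS ≤ 0.45 as the pass threshold."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    na = math.sqrt(sum(a * a for a in vec_a))
    nb = math.sqrt(sum(b * b for b in vec_b))
    return 1.0 - dot / (na * nb)

print(delta_s([1.0, 0.0], [1.0, 0.0]))  # 0.0: identical direction
print(delta_s([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal, fails the gate
```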
Acceptance targets (don’t skip these)
- ΔS(question, retrieved) ≤ 0.45 on equation-only and table-only queries
- operator set + variable names must match between query and retrieved block
- citations include `block_type` and equation/cell IDs, and they round-trip exactly
- coverage ≥ 0.70 on symbol-heavy sections
- λ convergent across 3 paraphrases that only vary surrounding prose
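the targets above compose into a single hard gate. a minimal sketch, assuming your own probes supply the inputs (ΔS, coverage, a λ state per paraphrase); `acceptance_gate` is a hypothetical helper, not a WFGY API:

```python
def acceptance_gate(ds, coverage, lambda_states, ops_match, citations_ok):
    """Hard gate before generation: every target must pass or the answer
    is blocked, instead of 'explaining around' the mismatch."""
    checks = {
        "delta_s": ds <= 0.45,
        "coverage": coverage >= 0.70,
        "lambda_convergent": len(set(lambda_states)) == 1,  # 3 paraphrases agree
        "operators_match": ops_match,
        "citations_round_trip": citations_ok,
    }
    return all(checks.values()), checks

ok, detail = acceptance_gate(0.38, 0.81, ["→", "→", "→"], True, True)
print(ok)  # True: safe to generate
```

the returned `checks` dict doubles as a debug trace: when the gate blocks an answer, you can log exactly which target failed.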
Where to go next
- Chunking → Embedding Contract
- Citation First
- Embedding vs Meaning, Metric Mismatch
- Rerankers
- Chunking checklist
- Retrieval traceability
Full series index
ProblemMap · Article Index
about the approach
this is part of a broader “fix before generation” mindset. the WFGY Problem Map catalogs reproducible failure modes and shows the structural fix that makes them stay fixed. the Global Fix Map expands that into RAG, embeddings, vector stores, multimodal alignment, agents, and ops. when the symbol channel is preserved end-to-end and outputs are gated with ΔS, coverage, and λ, symbolic collapse stops being a whack-a-mole and becomes a solved class.