TL;DR
your doc looks fine until equations, operators, or table headers show up. then answers sound fluent but drift, and citations land on “similar looking” sections. this is No.11 · Symbolic collapse. the fix is not a reranker band-aid after the fact. you must preserve the symbol channel end-to-end and gate outputs before generation with acceptance targets.
What breaks
- LaTeX/MathML gets stripped or rasterized at ingest. only the surrounding prose remains.
- tokenizers normalize or drop operators (
≤
,≈
,∼
,≠
), or reduce them to ASCII guesses. - embeddings capture the vibe around an equation, not the structure inside it.
- chunking cuts equations across lines, so retrieval never sees the whole statement.
- tables lose header bindings. citations point “near” the row, not the exact cell.
this is not random. it’s structural. the symbolic channel was lost between intake → embedding → retrieval.
Before vs After (why this keeps coming back)
before
most teams patch after generation. tiny regex. “just add a reranker.” the same failure returns next week with a different equation.
after (WFGY way)
install a semantic firewall before generation. keep math blocks as first-class text. encode a symbol channel. add table contracts and operator-set checks. only allow outputs when the semantic state is stable (ΔS, coverage, λ). when you fix it at the reasoning layer, it stays fixed.
- WFGY Problem Map: identify as No.11 · Symbolic collapse
- WFGY Global Fix Map: apply symbol-aware chunking, dual-channel embeddings, citations with block IDs, and hard acceptance gates
60-second triage
Equation boundary probe
search for a full known equation. if top-k returns only prose, your symbol channel got dropped (a paste-ready version of the first two probes follows these checks).
Operator confusion
query two formulas that differ by a single operator. if the result sets overlap heavily, your embedding ignores operators.
Table anchor sanity
ask for row X, col Y. if citations stop near the table but don't bind to a cell, table semantics weren't preserved.
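a minimal sketch of the first two probes, assuming a hypothetical `search(query, k)` that returns the raw text of your retriever's top-k chunks (swap in your own client):

def equation_boundary_probe(search, known_equation, k=5):
    # if no top-k hit contains the literal equation, the symbol channel was dropped at ingest
    hits = search(known_equation, k)
    return any(known_equation in hit for hit in hits)

def operator_confusion_probe(search, eq_a, eq_b, k=5):
    # two formulas that differ by one operator should not retrieve near-identical sets
    hits_a, hits_b = set(search(eq_a, k)), set(search(eq_b, k))
    overlap = len(hits_a & hits_b) / max(len(hits_a | hits_b), 1)
    return overlap  # close to 1.0 means the embedding is ignoring operators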
Minimal fix: keep the symbol channel intact (intake → embed → retrieve)
Don’t strip LaTeX/MathML
persist math blocks as text. store a `symbol_text` field alongside `clean_text`. never convert equations to images at ingest.
Dual-channel representation
build embeddings on `[clean_text + symbol_text]` or maintain two vectors and fuse late. verify ΔS(question, retrieved) ≤ 0.45 on symbol-only queries.
reference: Chunking → Embedding Contract
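a minimal late-fusion sketch, assuming a hypothetical `embed(text)` that returns a numpy vector from your embedding model; the 0.5 / 0.5 weights are placeholders to tune:

import numpy as np

def cosine(a, b):
    # similarity between two vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def dual_channel_score(embed, query, clean_text, symbol_text, w_clean=0.5, w_symbol=0.5):
    # score the prose channel and the symbol channel separately, then fuse late
    q = embed(query)
    return w_clean * cosine(q, embed(clean_text)) + w_symbol * cosine(q, embed(symbol_text))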
Equation-aware chunking
chunk on math boundaries. never split one equation across chunks. attach `(doc_id, block_id, offsets, block_type=equation|table|prose)`.
reference: Chunking checklist
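a minimal chunking sketch, assuming display math is delimited by `$$ ... $$` (swap in your own TeX/MathML block detector); each chunk carries the metadata named above:

import re

MATH_BLOCK = re.compile(r"\$\$.*?\$\$", re.DOTALL)

def chunk_with_math_boundaries(doc_id, text):
    # split on display-math boundaries so no equation is cut across chunks
    chunks, cursor, block_id = [], 0, 0
    for match in MATH_BLOCK.finditer(text):
        if match.start() > cursor:
            chunks.append({
                "doc_id": doc_id, "block_id": block_id, "block_type": "prose",
                "offsets": (cursor, match.start()), "text": text[cursor:match.start()],
            })
            block_id += 1
        chunks.append({
            "doc_id": doc_id, "block_id": block_id, "block_type": "equation",
            "offsets": (match.start(), match.end()), "text": match.group(0),
        })
        block_id += 1
        cursor = match.end()
    if cursor < len(text):
        chunks.append({
            "doc_id": doc_id, "block_id": block_id, "block_type": "prose",
            "offsets": (cursor, len(text)), "text": text[cursor:],
        })
    return chunks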
Table contracts
enforce `table_id, row_key, col_key, cell_value, header_map`. retrieval must return cell coordinates; citations must carry cell IDs.
reference: Retrieval traceability
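a minimal sketch of the table contract as plain dataclasses; the field names mirror the contract above, and the citation string binds to the exact cell instead of landing “near” the table:

from dataclasses import dataclass

@dataclass
class TableCell:
    table_id: str
    row_key: str
    col_key: str
    cell_value: str
    header_map: dict  # column key -> original header text

def cell_citation(cell: TableCell) -> str:
    # a citation that round-trips to one cell, not a region
    return f"{cell.table_id}#{cell.row_key}:{cell.col_key}"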
Reranker with operator features
add features for operator sets, variable names, and numeric patterns. demote candidates with mismatched operator sets.
reference: Rerankers
Metric hygiene
don’t mix L2 and cosine; normalize consistently pre-embedding and at query time.
reference: Embedding vs Meaning, Metric Mismatch
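a minimal hygiene sketch: once vectors are unit-length at both ingest and query time, cosine and inner product rank identically, so the metric can't silently flip between the two sides:

import numpy as np

def l2_normalize(vec):
    # apply the exact same normalization at ingest time and at query time
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec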
Tiny probe you can paste today
import re

def symbol_set(text):
    # toy probe; extend for full TeX/MathML coverage
    keep = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"
    return set(re.findall(keep, text))

def operator_mismatch(query_eq, retrieved_eq):
    q = symbol_set(query_eq)
    r = symbol_set(retrieved_eq)
    return {
        "query_symbols": sorted(q),
        "retrieved_symbols": sorted(r),
        "ok": q == r,
    }

print(operator_mismatch("a ≤ b + c", "a < b + c"))
# shows the operator difference at a glance
wire this into your reranker. if `ok` is false, demote or reject.
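a minimal wiring sketch, assuming candidates arrive as `(score, equation_text)` pairs from your retriever and reusing `operator_mismatch` from above:

def demote_operator_mismatches(query_eq, candidates, penalty=0.5):
    # candidates: list of (score, equation_text); demote instead of silently accepting
    reranked = []
    for score, eq in candidates:
        if not operator_mismatch(query_eq, eq)["ok"]:
            score *= penalty  # or drop the candidate entirely for hard gating
        reranked.append((score, eq))
    return sorted(reranked, key=lambda pair: pair[0], reverse=True)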
Hard fixes when minimal isn’t enough
- Symbol-aware tokenizer for the symbol channel (code/math-aware or byte-level). keep operators intact.
- TeX normalization (spacing, macro expansion) before hashing + embedding to avoid accidental near-duplicates.
- Operator exact-match side index based on operator sequences + variable sets; fuse with your semantic retriever (a sketch follows this list).
- Table schema store as its own thing; join at retrieval time. never guess header bindings at generation time.
- Eval gates that reject when operator sets don’t match, instead of “explaining around” the mismatch.
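a minimal sketch of that side index, keyed on the ordered operator sequence plus the variable set; the regexes are toy versions to extend for your notation:

import re

OPERATORS = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"
VARIABLES = r"[A-Za-zα-ωΑ-Ω][A-Za-z0-9_]*"

def exact_key(equation):
    # ordered operator sequence plus the set of variable names
    ops = tuple(re.findall(OPERATORS, equation))
    variables = frozenset(re.findall(VARIABLES, equation))
    return ops, variables

def build_side_index(blocks):
    # blocks: list of dicts with "block_id" and equation "text"
    index = {}
    for block in blocks:
        index.setdefault(exact_key(block["text"]), []).append(block["block_id"])
    return index

def exact_candidates(index, query_equation):
    # fuse these ids with your semantic retriever's candidate list
    return index.get(exact_key(query_equation), [])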
WFGY guardrails to turn on
Traceability contract
citations must include `block_type` and equation/cell IDs. you must round-trip to the exact span.
ΔS + λ probes
measure ΔS on symbol-only queries. run three paraphrases; λ must converge. reset or redirect if it doesn’t.
SCU (Symbolic Constraint Unlock)
forbid cross-section reuse when operator sets differ. stop “similar looking” prose leakage.
Variance clamp
when `block_type = equation|table`, dampen paraphrase variance to keep the symbolic parts stable.
Acceptance targets (don’t skip these)
- ΔS(question, retrieved) ≤ 0.45 on equation-only and table-only queries
- operator set + variable names must match between query and retrieved block
- citations include `block_type` and equation/cell IDs, and they round-trip exactly
- coverage ≥ 0.70 on symbol-heavy sections
- λ convergent across 3 paraphrases that only vary surrounding prose (see the gate sketch below)
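a minimal gate sketch, assuming ΔS, coverage, and the λ-convergence flag are computed upstream by your own probes; it reuses `symbol_set` from the tiny probe above and mirrors the thresholds in this list:

def acceptance_gate(delta_s, coverage, lambda_convergent, query_eq, retrieved_eq):
    # block generation instead of “explaining around” a symbolic mismatch
    checks = {
        "delta_s_ok": delta_s <= 0.45,
        "coverage_ok": coverage >= 0.70,
        "lambda_ok": bool(lambda_convergent),
        "operators_ok": symbol_set(query_eq) == symbol_set(retrieved_eq),  # extend with variable-name matching
    }
    return all(checks.values()), checks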
Where to go next
- Chunking → Embedding Contract
- Citation First
- Embedding vs Meaning, Metric Mismatch
- Rerankers
- Chunking checklist
- Retrieval traceability
Full series index
ProblemMap · Article Index
about the approach
this is part of a broader “fix before generation” mindset. the WFGY Problem Map catalogs reproducible failure modes and shows the structural fix that makes them stay fixed. the Global Fix Map expands that into RAG, embeddings, vector stores, multimodal alignment, agents, and ops. when the symbol channel is preserved end-to-end and outputs are gated with ΔS, coverage, and λ, symbolic collapse stops being a whack-a-mole and becomes a solved class.