Symptom
Equations, operators, and table references collapse into prose. Retrieval looks close but not exact. The model explains confidently while citing the wrong row or a different formula.
Root
Your pipeline discards the symbolic channel during intake and embedding. LaTeX and table structure get flattened, so similar-looking prose wins over an exact symbolic match.
Fix model
Keep the symbolic channel intact end to end. Add symbol-aware embeddings, equation boundaries, and table contracts. Verify with ΔS and operator set checks before you ship.
Acceptance targets you must meet:
- ΔS(question, context) ≤ 0.45
- Coverage ≥ 0.70 for the correct section
- λ convergent across 3 paraphrases
You think vs reality
You think
- “We store the PDF text. Equations are there somewhere.”
- “BM25 or a general embedding will find the nearest paragraph.”
- “Reranking will sort it out if top k includes the right neighborhood.”
Reality
- LaTeX blocks were stripped during parsing or turned into images.
- Unicode operators like ≤ ≥ ≈ ≠ got normalized away.
- Chunker split a single equation across two chunks.
- Reranker scores prose around the equation, not the math itself.
- Table header order changed at ingest, so citations point to a lookalike cell.
Before vs After
Traditional patching after generation
- Detect wrong citation. Add reranker, regex, JSON repair, one more rule.
- Ceiling sits near 70 to 85 percent. Every new patch raises risk of regressions.
WFGY firewall before generation
- Inspect semantic field first. Check ΔS and coverage. If unstable, loop or redirect.
- 90 to 95 percent stability becomes achievable because the system only generates from a stable state.
- Once a failure mode is mapped, it stays fixed.
Short write up of the firewall idea here:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md
What symbolic collapse looks like
- “a ≤ b + c” and “a < b + c” retrieve the same passages.
- Table query asks for row X col Y, citation lands near the table but not the cell.
- Long equation split across lines. Retrieval never sees the complete identity.
- OCR swapped ∑ with E or 0 with O. Embedding thinks two formulas are the same.
- Answers change when you paraphrase the question even though the math is exact.
60 second quick tests
1) Equation boundary probe
Search your store for an exact equation you know exists. If top k returns only prose, the symbol channel is gone.
2) Operator confusion test
Query two formulas that differ only by the operator. If the results overlap heavily, your embedding ignores operators.
3) Table anchor sanity
Ask for a value at row key and column key. If the citation does not bind to the exact cell, table contracts are missing.
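The operator confusion test can be turned into a number. This is a minimal sketch, assuming you already have the top k chunk IDs for each query; the IDs below are made up, and the overlap level that counts as "heavy" is yours to pick.

```python
def topk_overlap(ids_a, ids_b):
    """Jaccard overlap between two lists of retrieved chunk IDs."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical top k chunk IDs for "a ≤ b + c" and "a < b + c".
hits_le = ["c12", "c07", "c31", "c02"]
hits_lt = ["c12", "c07", "c31", "c44"]

print(f"overlap = {topk_overlap(hits_le, hits_lt):.2f}")
```

An overlap close to 1.0 means the two operators are indistinguishable to your retriever.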
Minimal fix — symbol aware embedding
Goal: keep the symbolic channel from intake to retrieval. Do not split or normalize away the math.
1) Preserve math blocks
Do not strip LaTeX or MathML. Store an extra symbol_text field alongside clean_text. Keep block_type, offsets, equation_id.
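One possible shape for such a record, as a sketch. It assumes display math is delimited by $$ ... $$ in the parsed source, that symbol_text holds the verbatim match, and that equation_id is a per-document ordinal. None of these choices come from a specific library; extract_blocks and Block are illustrative names.

```python
import re
from dataclasses import dataclass

# Assumption: display math arrives as $$ ... $$ in the parsed source.
MATH_BLOCK = re.compile(r"\$\$(.+?)\$\$", re.DOTALL)

@dataclass
class Block:
    block_type: str   # "equation" or "prose"
    clean_text: str   # text for the prose channel
    symbol_text: str  # verbatim math, delimiters included; empty for prose
    offsets: tuple    # (start, end) character offsets in the source
    equation_id: str  # ID for equations; empty for prose

def extract_blocks(source, doc_id="doc"):
    """Split source into prose and equation blocks without touching the math."""
    blocks, cursor, count = [], 0, 0
    for m in MATH_BLOCK.finditer(source):
        prose = source[cursor:m.start()]
        if prose.strip():
            blocks.append(Block("prose", prose.strip(), "", (cursor, m.start()), ""))
        count += 1
        blocks.append(Block("equation", m.group(1).strip(), m.group(0),
                            (m.start(), m.end()), f"{doc_id}:eq{count}"))
        cursor = m.end()
    tail = source[cursor:]
    if tail.strip():
        blocks.append(Block("prose", tail.strip(), "", (cursor, len(source)), ""))
    return blocks

for block in extract_blocks(r"Bound: $$a \le b + c$$ holds.", doc_id="d1"):
    print(block)
```

The offsets let a citation point back at the exact span in the source, which is what makes the equation citable later.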
2) Dual channel representation
Build one vector on [clean_text + symbol_text], or two vectors with late fusion. Verify ΔS(question, retrieved) ≤ 0.45 on symbol queries.
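A sketch of the two-vector variant with late fusion. The vectors are toy values standing in for whatever your prose and symbol encoders produce, and the weight w_sym is an assumption to tune, not a recommended setting.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def fused_score(q_text, q_sym, d_text, d_sym, w_sym=0.5):
    """Late fusion: weighted sum of the per-channel similarities."""
    return (1 - w_sym) * cosine(q_text, d_text) + w_sym * cosine(q_sym, d_sym)

# Toy vectors. Doc A has identical prose but different symbols;
# doc B has looser prose but the exact symbols.
q_text, q_sym = [1.0, 0.0], [1.0, 0.0]
doc_a = ([1.0, 0.0], [0.0, 1.0])
doc_b = ([0.6, 0.8], [1.0, 0.0])

print("A:", fused_score(q_text, q_sym, *doc_a))
print("B:", fused_score(q_text, q_sym, *doc_b))
```

With a prose-only score, doc A would win. Once the symbol channel carries weight, doc B ranks first.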
3) Equation aware chunking
Chunk on equation boundaries. Never break a single formula. Keep a stable equation_id for citability.
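A minimal chunker that treats each equation as atomic, again assuming $$ ... $$ delimiters. The max_chars value is arbitrary, and prose pieces are not sub-split here, so a long paragraph can still exceed the limit.

```python
import re

# Capturing group so re.split keeps the equations in the output.
EQUATION = re.compile(r"(\$\$.+?\$\$)", re.DOTALL)

def chunk_on_equations(text, max_chars=200):
    """Greedy packing of prose and equation pieces; an equation is never cut."""
    pieces = [p for p in EQUATION.split(text) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # Flush before the piece would push the chunk over the limit.
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += piece
    if current.strip():
        chunks.append(current.strip())
    return chunks

text = r"Intro text. $$a \le b + c$$ More prose here. $$x = y$$"
for chunk in chunk_on_equations(text, max_chars=20):
    print(repr(chunk))
```

An equation longer than max_chars becomes its own oversized chunk rather than being split, which is the trade this step asks for.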
4) Table contracts
Persist table_id, row_key, col_key, cell_value, header_map. Retrieval must return cell coordinates. Cite then explain.
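A sketch of what a table contract could look like in code. The class names, the shape of header_map (col_key to column index, frozen at ingest), and the table values are all assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CellRef:
    table_id: str
    row_key: str
    col_key: str
    cell_value: str

class TableContract:
    def __init__(self, table_id, header_map, rows):
        self.table_id = table_id
        self.header_map = dict(header_map)  # col_key -> column index at ingest
        self.rows = rows                    # row_key -> list of cell values

    def lookup(self, row_key, col_key):
        """Return exact cell coordinates plus value; KeyError if either key is unknown."""
        col_index = self.header_map[col_key]
        cells = self.rows[row_key]
        return CellRef(self.table_id, row_key, col_key, cells[col_index])

# Made-up table for illustration only.
table = TableContract(
    table_id="t1",
    header_map={"revenue": 0, "cost": 1},
    rows={"2023": ["120", "80"], "2024": ["150", "95"]},
)
print(table.lookup("2024", "cost"))
```

Because the lookup fails loudly on an unknown key, a citation can never silently land on a neighboring cell.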
5) Reranker features
Add features for operator sets, variable names, numeric patterns. Penalize mismatched operator sets.
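One way to wire these features into a reranker, as a sketch. The operator list, the single-letter variable pattern, and the multiplicative penalty of 0.5 are assumptions; base_score stands for whatever your existing reranker returns.

```python
import re

OPERATORS = re.compile(r"[=+\-*/<>≤≥≈≠∑∏∫]")
VARIABLES = re.compile(r"\b[a-zA-Z]\b")   # assumption: single-letter variables
NUMBERS = re.compile(r"\d+(?:\.\d+)?")

def jaccard(a, b):
    return len(a & b) / len(a | b) if (a | b) else 1.0

def symbol_features(query, candidate):
    """Overlap features for operators, variable names, and numeric patterns."""
    q_ops, c_ops = set(OPERATORS.findall(query)), set(OPERATORS.findall(candidate))
    q_vars, c_vars = set(VARIABLES.findall(query)), set(VARIABLES.findall(candidate))
    q_nums, c_nums = set(NUMBERS.findall(query)), set(NUMBERS.findall(candidate))
    return {
        "op_mismatch": q_ops != c_ops,
        "var_overlap": jaccard(q_vars, c_vars),
        "num_overlap": jaccard(q_nums, c_nums),
    }

def rerank_score(base_score, query, candidate, penalty=0.5):
    """Scale the base score down when the operator sets differ."""
    features = symbol_features(query, candidate)
    return base_score * (penalty if features["op_mismatch"] else 1.0)

print(rerank_score(0.9, "a ≤ b + c", "a < b + c"))
print(rerank_score(0.8, "a ≤ b + c", "a ≤ b + c"))
```

The variable and numeric overlaps are returned as features only; how to weight them is left to your reranker.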
Reference pages to open:
- Data Contracts → https://github.com/onestardao/WFGY/blob/main/ProblemMap/data-contracts.md
- Retrieval Traceability → https://github.com/onestardao/WFGY/blob/main/ProblemMap/retrieval-traceability.md
- Embedding ≠ Semantic → https://github.com/onestardao/WFGY/blob/main/ProblemMap/embedding-vs-semantic.md
- Rerankers → https://github.com/onestardao/WFGY/blob/main/ProblemMap/rerankers.md
Hard fixes when minimal is not enough
- Symbol tokenizer or byte level model for the math channel.
- Canonicalize LaTeX before hashing and embedding.
- Build a secondary inverted index on operator sequences and variable sets.
- Separate table schema store and join at retrieval time.
- Eval gate that rejects answers when operator sets do not match.
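For the canonicalization item, a small sketch. It handles only whitespace and three standard LaTeX synonym pairs (\leq and \le, \geq and \ge, \neq and \ne); a production canonicalizer would need a much fuller table. The function names are illustrative.

```python
import hashlib
import re

# Standard LaTeX synonyms, mapped to the shorter spelling.
ALIASES = {"leq": "le", "geq": "ge", "neq": "ne"}

def canonicalize_latex(latex):
    """Collapse whitespace and map synonym commands to one spelling."""
    s = latex.strip()
    for alias, canon in ALIASES.items():
        # The lookahead keeps longer commands such as \leqslant untouched.
        pattern = r"\\" + alias + r"(?![A-Za-z])"
        s = re.sub(pattern, lambda m, c=canon: "\\" + c, s)
    return re.sub(r"\s+", " ", s)

def equation_hash(latex):
    """Short content hash of the canonical form, usable as a dedupe key."""
    digest = hashlib.sha256(canonicalize_latex(latex).encode("utf-8"))
    return digest.hexdigest()[:16]

print(canonicalize_latex(r"a \leq   b + c"))
print(equation_hash(r"a \leq b + c") == equation_hash(r"a \le b + c"))
```

Two spellings of the same formula now hash to the same key, while a formula with a different operator does not.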
Guardrails to turn on
Traceability contract
Every citation must include block_type ∈ {equation, table, prose}, and an equation_id or cell coordinates.
ΔS and λ probes
Measure ΔS on symbol-only prompts. Flag divergent λ when the model blends two formulas.
SCU policy
Forbid cross section reuse if operator sets are different.
Variance clamp for math
When block_type = equation or table, clamp paraphrase variance. Stay literal.
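The traceability contract is easy to enforce mechanically. A sketch, assuming citations are plain dicts and that cell coordinates mean table_id, row_key, and col_key:

```python
ALLOWED_BLOCK_TYPES = {"equation", "table", "prose"}

def validate_citation(citation):
    """Return a list of contract violations; an empty list means the citation passes."""
    errors = []
    block_type = citation.get("block_type")
    if block_type not in ALLOWED_BLOCK_TYPES:
        errors.append(f"block_type must be one of {sorted(ALLOWED_BLOCK_TYPES)}")
    if block_type == "equation" and not citation.get("equation_id"):
        errors.append("equation citation needs equation_id")
    if block_type == "table":
        missing = [k for k in ("table_id", "row_key", "col_key") if not citation.get(k)]
        if missing:
            errors.append("table citation missing " + ", ".join(missing))
    return errors

print(validate_citation({"block_type": "equation", "equation_id": "d1:eq1"}))
print(validate_citation({"block_type": "table", "table_id": "t1", "row_key": "2024"}))
```

Run it on every citation before the answer is shown, and reject or retry when the list is not empty.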
Tiny probe you can paste
Use it inside a reranker or a debug notebook.
    import re

    def symbol_set(text):
        """Return the set of math symbols that appear in text."""
        # Operators and relation symbols that must survive retrieval.
        keep = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"
        return set(re.findall(keep, text))

    def operator_mismatch(query_eq, retrieved_eq):
        """Compare the symbol sets of a query equation and a retrieved one."""
        q = symbol_set(query_eq)
        r = symbol_set(retrieved_eq)
        return {
            "query_symbols": sorted(q),
            "retrieved_symbols": sorted(r),
            "ok": q == r,
        }

    print(operator_mismatch("a ≤ b + c", "a < b + c"))
    # shows the operator difference at a glance
Acceptance checks before you ship
- ΔS(question, retrieved) ≤ 0.45 on equation and table queries.
- Operator set and variable names in retrieved block match the query.
- Citations carry block_type and stable equation or cell IDs.
- Coverage ≥ 0.70 for the correct symbolic section.
- λ convergent across 3 paraphrases that vary only the surrounding prose.
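These checks can be collected into a single gate. The sketch below takes ΔS, coverage, and the λ states as already-measured inputs; how you compute them is outside this snippet, and encoding λ as the string "convergent" is an assumption.

```python
def acceptance_gate(delta_s, coverage, lambda_states, operators_match, citation_ok):
    """Apply the ship checks; returns (passed, per-check results)."""
    checks = {
        "delta_s": delta_s <= 0.45,
        "coverage": coverage >= 0.70,
        # One λ state per paraphrase; at least 3 are required.
        "lambda": len(lambda_states) >= 3
                  and all(s == "convergent" for s in lambda_states),
        "operators": bool(operators_match),
        "citation": bool(citation_ok),
    }
    return all(checks.values()), checks

passed, detail = acceptance_gate(
    delta_s=0.38,
    coverage=0.74,
    lambda_states=["convergent", "convergent", "convergent"],
    operators_match=True,
    citation_ok=True,
)
print(passed, detail)
```

Returning the per-check results along with the verdict tells you which target failed, not just that something did.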
Who this helps and how to use it in one minute
- Teams with math or financial reports, scientific PDFs, or heavy tables.
- Open the Global Fix Map index and jump to Embeddings, Retrieval, Chunking, or Data Contracts.
- Apply the minimal fix steps and verify the acceptance targets above.
- If you want a literal quick start, copy TXT OS and ask your model: “which Problem Map number am I hitting?” Then follow the linked page.
Global Fix Map index:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/README.md
TXT OS quick start:
https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt
Why this is in the Global Fix Map
Symbolic collapse is a reproducible failure mode. Once mapped, it can be sealed permanently by checking ΔS and contracts before generation. You reduce debug time, and the fix does not depend on a specific vendor or SDK.
If you have a tough symbolic example, drop a short repro and I will add a test and a checklist to the next page of the map.