PSBigBig

# Day 14 — Symbolic Collapse (ProblemMap No.11)

Symptom

Equations, operators, and table references collapse into prose. Retrieval looks close but not exact. The model explains confidently while citing the wrong row or a different formula.

Root

Your pipeline discards the symbolic channel during intake and embedding. LaTeX and table structure get flattened. Similar-looking prose wins over an exact symbolic match.

Fix model

Keep the symbolic channel intact end to end. Add symbol-aware embeddings, equation boundaries, and table contracts. Verify with ΔS and operator set checks before you ship.

Acceptance targets you must meet:

  • ΔS(question, context) ≤ 0.45
  • Coverage ≥ 0.70 for the correct section
  • λ convergent across 3 paraphrases

You think vs reality

You think

  • “We store the PDF text. Equations are there somewhere.”
  • “BM25 or a general embedding will find the nearest paragraph.”
  • “Reranking will sort it out if top k includes the right neighborhood.”

Reality

  • LaTeX blocks were stripped during parsing or turned into images.
  • Unicode operators like ≤ ≥ ≈ ≠ got normalized away.
  • Chunker split a single equation across two chunks.
  • Reranker scores prose around the equation, not the math itself.
  • Table header order changed at ingest, so citations point to a lookalike cell.

Before vs After

Traditional patching after generation

  • Detect wrong citation. Add reranker, regex, JSON repair, one more rule.
  • The stability ceiling sits near 70 to 85 percent. Every new patch raises the risk of regressions.

WFGY firewall before generation

  • Inspect semantic field first. Check ΔS and coverage. If unstable, loop or redirect.
  • 90 to 95 percent stability becomes achievable because the system only generates from a stable state.
  • Once a failure mode is mapped, it stays fixed.

A short write-up of the firewall idea is here:

https://github.com/onestardao/WFGY/blob/main/ProblemMap/README.md


What symbolic collapse looks like

  • “a ≤ b + c” and “a < b + c” retrieve the same passages.
  • A table query asks for row X, column Y. The citation lands near the table but not on the cell.
  • Long equation split across lines. Retrieval never sees the complete identity.
  • OCR swapped ∑ with E or 0 with O. Embedding thinks two formulas are the same.
  • Answers change when you paraphrase the question even though the math in it is identical.

60 second quick tests

1) Equation boundary probe

Search your store for an exact equation you know exists. If top k returns only prose, the symbol channel is gone.

2) Operator confusion test

Query two formulas that differ only by the operator. If the results overlap heavily, your embedding ignores operators.

3) Table anchor sanity

Ask for a value at row key and column key. If the citation does not bind to the exact cell, table contracts are missing.
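Quick test 2 is easy to script. A minimal sketch, assuming a hypothetical `search(query, k)` function that returns chunk ids from your own store. `fake_search` and the 0.8 threshold are illustrative only, not WFGY values.

```python
from typing import Callable, List


def topk_overlap(ids_a: List[str], ids_b: List[str]) -> float:
    """Jaccard overlap of two result lists (0.0 = disjoint, 1.0 = identical)."""
    a, b = set(ids_a), set(ids_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def operator_confusion(search: Callable[[str, int], List[str]],
                       query_a: str, query_b: str,
                       k: int = 5, threshold: float = 0.8) -> dict:
    """Run two queries that differ only by operator and report the overlap."""
    overlap = topk_overlap(search(query_a, k), search(query_b, k))
    return {"overlap": overlap, "operators_ignored": overlap >= threshold}


# Stand-in retriever (hypothetical): returns the same ids for any query,
# which is how an operator-blind embedding tends to behave.
def fake_search(query: str, k: int) -> List[str]:
    return [f"chunk-{i}" for i in range(k)]


print(operator_confusion(fake_search, "a ≤ b + c", "a < b + c"))
```

Swap `fake_search` for your real retriever. Heavy overlap on operator-only variants means the embedding is not seeing the operators.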


Minimal fix — symbol aware embedding

Goal: keep the symbolic channel from intake to retrieval. Do not split or normalize away the math.

1) Preserve math blocks

Do not strip LaTeX or MathML. Store an extra symbol_text field alongside clean_text. Keep block_type, offsets, equation_id.
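One possible shape for the stored record, as a sketch. It only handles display math in `$$...$$` or `\[...\]`. The `Block` class, the `split_blocks` helper, and the ordinal `equation_id` scheme are assumptions for illustration, not part of WFGY.

```python
import re
from dataclasses import dataclass
from typing import List

# Display math delimited by $$...$$ or \[...\]. Inline math and MathML are out of scope here.
MATH_BLOCK = re.compile(r"\$\$(.+?)\$\$|\\\[(.+?)\\\]", re.DOTALL)


@dataclass
class Block:
    block_type: str    # "equation" or "prose"
    start: int         # character offset in the source text
    end: int
    clean_text: str    # text for the prose channel; empty for equations
    symbol_text: str   # raw LaTeX, untouched; empty for prose
    equation_id: str   # id for citations; empty for prose


def split_blocks(text: str, doc_id: str) -> List[Block]:
    """Split text into prose and equation blocks without altering the LaTeX."""
    blocks: List[Block] = []
    pos = 0
    n = 0
    for m in MATH_BLOCK.finditer(text):
        prose = text[pos:m.start()].strip()
        if prose:
            blocks.append(Block("prose", pos, m.start(), prose, "", ""))
        n += 1
        latex = (m.group(1) or m.group(2)).strip()
        # Ordinal ids stay stable only while document order is unchanged (assumption).
        blocks.append(Block("equation", m.start(), m.end(), "", latex, f"{doc_id}:eq{n}"))
        pos = m.end()
    tail = text[pos:].strip()
    if tail:
        blocks.append(Block("prose", pos, len(text), tail, "", ""))
    return blocks


sample = "Energy: $$E = mc^2$$ and bound \\[a \\le b\\] done."
for block in split_blocks(sample, "doc1"):
    print(block.block_type, block.equation_id, block.symbol_text or block.clean_text)
```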

2) Dual channel representation

Build one vector on clean_text + symbol_text, or build two vectors and combine them with late fusion. Verify ΔS(question, retrieved) ≤ 0.45 on symbol queries.
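Late fusion can be as simple as a weighted sum of per-channel similarities. A sketch under stated assumptions: the vectors come from whatever embedding models you already use, and `symbol_weight=0.5` is an arbitrary starting point, not a recommended value.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def late_fusion_score(q_text: Sequence[float], q_sym: Sequence[float],
                      d_text: Sequence[float], d_sym: Sequence[float],
                      symbol_weight: float = 0.5) -> float:
    """Score the prose channel and the symbol channel separately, then blend."""
    text_sim = cosine(q_text, d_text)
    sym_sim = cosine(q_sym, d_sym)
    return (1.0 - symbol_weight) * text_sim + symbol_weight * sym_sim


# Toy vectors: both documents have identical prose, only the symbol channel differs.
query = {"text": [1.0, 0.0], "sym": [1.0, 0.0]}
exact = {"text": [1.0, 0.0], "sym": [1.0, 0.0]}
lookalike = {"text": [1.0, 0.0], "sym": [0.0, 1.0]}

print(late_fusion_score(query["text"], query["sym"], exact["text"], exact["sym"]))
print(late_fusion_score(query["text"], query["sym"], lookalike["text"], lookalike["sym"]))
```

With a prose-only score the two documents would tie. The symbol channel is what separates them.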

3) Equation aware chunking

Chunk on equation boundaries. Never break a single formula. Keep a stable equation_id for citability.
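A sketch of a chunker that treats each `$$...$$` block as atomic. The `chunk_on_equations` name, the sentence splitter, and the `max_chars` budget are assumptions. An equation longer than the budget becomes its own oversized chunk instead of being cut.

```python
import re
from typing import List

# The capturing group keeps each $$...$$ block in the split output as one unit.
DISPLAY_MATH = re.compile(r"(\$\$.+?\$\$)", re.DOTALL)


def chunk_on_equations(text: str, max_chars: int = 200) -> List[str]:
    """Pack sentences and whole equations into chunks of roughly max_chars."""
    units: List[str] = []
    for piece in DISPLAY_MATH.split(text):
        if not piece.strip():
            continue
        if piece.startswith("$$"):
            units.append(piece)  # an equation is never split
        else:
            sentences = re.split(r"(?<=[.!?])\s+", piece.strip())
            units.extend(s for s in sentences if s)

    chunks: List[str] = []
    current = ""
    for unit in units:
        if current and len(current) + 1 + len(unit) > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = f"{current} {unit}".strip()
    if current:
        chunks.append(current)
    return chunks


doc = "Intro sentence. $$a \\le b + c$$ Follow up."
for chunk in chunk_on_equations(doc, max_chars=20):
    print(repr(chunk))
```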

4) Table contracts

Persist table_id, row_key, col_key, cell_value, header_map. Retrieval must return cell coordinates. Cite first, then explain.
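A sketch of what a table contract could look like in code. The `Cell` and `TableContract` classes are hypothetical, and the table values are made up for the demo. The point is that lookup is by key, never by position, and a miss raises instead of returning a neighbor.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Cell:
    table_id: str
    row_key: str
    col_key: str
    cell_value: str


class TableContract:
    """Holds one table keyed by (row_key, col_key) instead of by position."""

    def __init__(self, table_id: str, header: List[str], rows: Dict[str, List[str]]):
        self.table_id = table_id
        # header_map records the column order seen at ingest.
        self.header_map = {col_key: i for i, col_key in enumerate(header)}
        self.cells: Dict[Tuple[str, str], Cell] = {}
        for row_key, values in rows.items():
            if len(values) != len(header):
                raise ValueError(f"row {row_key!r} does not match header width")
            for col_key, value in zip(header, values):
                self.cells[(row_key, col_key)] = Cell(table_id, row_key, col_key, value)

    def lookup(self, row_key: str, col_key: str) -> Cell:
        """Return the exact cell or raise KeyError; never fall back to a neighbor."""
        return self.cells[(row_key, col_key)]


# Made-up demo data.
table = TableContract("t1", ["2022", "2023"], {"revenue": ["10", "12"], "cost": ["7", "8"]})
print(table.lookup("revenue", "2023"))
```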

5) Reranker features

Add features for operator sets, variable names, numeric patterns. Penalize mismatched operator sets.
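These features can be computed with plain regexes. A rough sketch: the patterns, the `symbol_features` and `rerank_score` names, and the 0.5 penalty are assumptions to tune against your own data, and `base_score` stands for whatever your existing reranker returns.

```python
import re

OPERATORS = re.compile(r"[=+\-*/<>≤≥≈≠∑∏∫]")
VARIABLES = re.compile(r"[A-Za-z_]\w*")
NUMBERS = re.compile(r"\d+(?:\.\d+)?")


def jaccard(a: set, b: set) -> float:
    """Set overlap in [0, 1]; two empty sets count as a perfect match."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def symbol_features(query: str, candidate: str) -> dict:
    """Overlap of operators, variable names, and numeric literals."""
    def overlap(pattern: re.Pattern) -> float:
        return jaccard(set(pattern.findall(query)), set(pattern.findall(candidate)))

    return {
        "operator_overlap": overlap(OPERATORS),
        "variable_overlap": overlap(VARIABLES),
        "number_overlap": overlap(NUMBERS),
    }


def rerank_score(base_score: float, query: str, candidate: str,
                 operator_penalty: float = 0.5) -> float:
    """Subtract a penalty proportional to the operator set mismatch."""
    features = symbol_features(query, candidate)
    return base_score - operator_penalty * (1.0 - features["operator_overlap"])


print(rerank_score(0.9, "a ≤ b + c", "a ≤ b + c"))
print(rerank_score(0.9, "a ≤ b + c", "a < b + c"))
```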



Hard fixes when minimal is not enough

  • Symbol tokenizer or byte level model for the math channel.
  • Canonicalize LaTeX before hashing and embedding.
  • Build a secondary inverted index on operator sequences and variable sets.
  • Separate table schema store and join at retrieval time.
  • Eval gate that rejects answers when operator sets do not match.
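Canonicalizing before hashing might look like the sketch below. It covers only three alias pairs that LaTeX defines as synonyms (`\leq`/`\le`, `\geq`/`\ge`, `\neq`/`\ne`), drops `\left` and `\right`, and collapses whitespace. The function names are hypothetical, and a production canonicalizer would need a real LaTeX parser.

```python
import hashlib
import re

# Alias -> canonical form. These pairs are synonyms in LaTeX.
ALIASES = {r"\leq": r"\le", r"\geq": r"\ge", r"\neq": r"\ne"}


def canonicalize_latex(src: str) -> str:
    """Normalize a small set of cosmetic differences in a LaTeX string."""
    out = src.strip()
    for alias, canonical in ALIASES.items():
        # The lookahead stops \leq from matching inside \leqslant.
        # A function is used as the replacement so the backslash is taken literally.
        pattern = re.escape(alias) + r"(?![A-Za-z])"
        out = re.sub(pattern, lambda _m, c=canonical: c, out)
    # Drop sizing commands but leave \leftarrow, \rightarrow, etc. alone.
    out = re.sub(r"\\(?:left|right)(?![A-Za-z])", "", out)
    return re.sub(r"\s+", " ", out)


def equation_hash(src: str) -> str:
    """Short content hash of the canonical form, usable as a dedupe or index key."""
    canonical = canonicalize_latex(src)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


print(canonicalize_latex(r"a  \leq   b"))
print(equation_hash(r"a \leq b") == equation_hash(r"a \le b"))
```

Two spellings of the same relation now hash to the same key, while a different operator still produces a different key.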

Guardrails to turn on

  • Traceability contract

    Every citation must include block_type ∈ {equation, table, prose}, and an equation_id or cell coordinates.

  • ΔS and λ probes

    Measure ΔS on symbol-only prompts. Flag divergent λ when the model blends two formulas.

  • SCU policy

    Forbid cross-section reuse if the operator sets differ.

  • Variance clamp for math

    When block_type = equation or table, clamp paraphrase variance. Stay literal.
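The traceability contract is the easiest guardrail to enforce mechanically. A sketch of a validator, assuming citations arrive as plain dicts. The field names follow this post. Treating prose citations as valid with only a block_type is my assumption.

```python
from typing import List

ALLOWED_BLOCK_TYPES = {"equation", "table", "prose"}
CELL_FIELDS = ("table_id", "row_key", "col_key")


def validate_citation(citation: dict) -> List[str]:
    """Return a list of contract violations; an empty list means the citation passes."""
    errors: List[str] = []
    block_type = citation.get("block_type")
    if block_type not in ALLOWED_BLOCK_TYPES:
        errors.append(f"block_type must be one of {sorted(ALLOWED_BLOCK_TYPES)}")
    if block_type == "equation" and not citation.get("equation_id"):
        errors.append("equation citation needs an equation_id")
    if block_type == "table":
        missing = [field for field in CELL_FIELDS if not citation.get(field)]
        if missing:
            errors.append(f"table citation is missing {missing}")
    return errors


print(validate_citation({"block_type": "equation", "equation_id": "doc1:eq2"}))
print(validate_citation({"block_type": "table", "table_id": "t1", "row_key": "revenue"}))
```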


Tiny probe you can paste

Use it inside a reranker or a debug notebook.

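import re

# Single-character operators and relation symbols to compare.
# Multi-character ASCII forms such as "<=" or "!=" are matched character by character.
OPERATOR_CHARS = r"[=+\-*/<>≤≥≈≠∑∏∫∇→←↔⊂⊆⊃⊇∀∃∈∉∧∨¬]"

def symbol_set(text):
    """Return the set of operator symbols that appear in text."""
    return set(re.findall(OPERATOR_CHARS, text))

def operator_mismatch(query_eq, retrieved_eq):
    """Compare operator sets; "ok" is True only when they are identical."""
    q = symbol_set(query_eq)
    r = symbol_set(retrieved_eq)
    return {
        "query_symbols": sorted(q),
        "retrieved_symbols": sorted(r),
        "ok": q == r
    }

print(operator_mismatch("a ≤ b + c", "a < b + c"))
# "ok" is False here: the query has ≤ and the retrieved text has <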


Acceptance checks before you ship

  • ΔS(question, retrieved) ≤ 0.45 on equation and table queries.
  • Operator set and variable names in retrieved block match the query.
  • Citations carry block_type and stable equation or cell IDs.
  • Coverage ≥ 0.70 for the correct symbolic section.
  • λ convergent across 3 paraphrases that vary only the surrounding prose.
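The checklist can be turned into a single gate. This sketch takes already-measured values as input and only applies the thresholds from this post. It does not compute ΔS or coverage. Using "all paraphrases cite the same block" as a stand-in for λ convergence is my simplification, and the `ProbeResult` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    delta_s: float                   # measured ΔS(question, retrieved)
    coverage: float                  # measured coverage of the correct section
    operators_match: bool            # operator sets and variable names agree
    paraphrase_citations: List[str]  # cited block id for each paraphrase


def acceptance_gate(result: ProbeResult,
                    max_delta_s: float = 0.45,
                    min_coverage: float = 0.70,
                    n_paraphrases: int = 3) -> dict:
    """Apply the acceptance thresholds; ship only if every check passes."""
    citations = result.paraphrase_citations
    checks = {
        "delta_s": result.delta_s <= max_delta_s,
        "coverage": result.coverage >= min_coverage,
        "operators": result.operators_match,
        # Stand-in for λ convergence: every paraphrase lands on the same block.
        "paraphrase_stable": len(citations) >= n_paraphrases and len(set(citations)) == 1,
    }
    return {"checks": checks, "ship": all(checks.values())}


good = ProbeResult(0.31, 0.82, True, ["doc1:eq2", "doc1:eq2", "doc1:eq2"])
drift = ProbeResult(0.31, 0.82, True, ["doc1:eq2", "doc1:eq3", "doc1:eq2"])
print(acceptance_gate(good)["ship"], acceptance_gate(drift)["ship"])
```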

Who this helps and how to use it in one minute

  • Teams with math or financial reports, scientific PDFs, or heavy tables.
  • Open the Global Fix Map index and jump to Embeddings, Retrieval, Chunking, or Data Contracts.
  • Apply the minimal fix steps and verify the acceptance targets above.
  • If you want a literal quick start, copy TXT OS and ask your model: “which Problem Map number am i hitting” then follow the linked page.

Global Fix Map index:
https://github.com/onestardao/WFGY/blob/main/ProblemMap/GlobalFixMap/README.md

TXT OS quick start:
https://github.com/onestardao/WFGY/blob/main/OS/TXTOS.txt


Why this is in the Global Fix Map

Symbolic collapse is a reproducible failure mode. Once mapped, it can be sealed permanently by checking ΔS and contracts before generation. You reduce debug time, and the fix does not depend on a specific vendor or SDK.

If you have a tough symbolic example, drop a short repro and I will add a test and a checklist to the next page of the map.
