This is a follow-up to my previous post about TRI·TFM Lens. Here I'm sharing the full research data behind the framework.
In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.
## Scale of the Research
| Experiment | Prompts | Repeats | Total Evals | Model |
|---|---|---|---|---|
| Judge calibration v1-v2 (Logs v5-v8) | 40+ | varied | ~190 | Gemini Flash |
| Lexeme experiments (3 batches) | 30+ | 3 | ~90 | Gemini Flash |
| Domain generalization (P1) | 10 | 3 | 30 | Gemini Flash |
| M-axis validation v1 (P2) | 20 | 3 | 46* | Gemini Flash |
| M-axis revalidation v2 (P2) | 20 | 3 | 59* | Gemini Flash |
| M-axis fixed responses (P2v3) | 10 | 5 | 50 | Gemini Flash |
| M-axis extended output (P2v4) | 20 | 3 | 60 | Gemini Flash |
| Cross-model validation (P5) | 10 | 2 | 20 | Gemini Pro |
| Final 100-prompt validation | 100 | 1 | 100 | Gemini Flash |
| Sensitivity analysis (P3) | — | — | 76×4 configs | recomputed |
| **Total** | — | — | **~700+** | — |

\*Runs marked with an asterisk had JSON parse failures, so their eval totals fall short of prompts × repeats.
This isn't a cherry-picked demo. It's 6 months of iterative experimentation across 8 prompt categories, 2 languages, 2 models, 5 judge versions, and 4 research phases.
## Finding #1: The F-Hierarchy Is Real and Stable
The Fact axis (epistemic grounding) produces a clean three-tier hierarchy that holds across EVERY experiment:
Tier 1 — Verifiable (F > 0.85)
├── Technical: F = 0.91 (code, algorithms, how-to)
└── Factual: F = 0.90 (science, history, medicine)
Tier 2 — Mixed (F = 0.55-0.65)
├── Personal: F = 0.60 (advice, life guidance)
└── Directive: F = 0.61 (persuasion, argumentation)
Tier 3 — Unfalsifiable (F < 0.45)
├── Philosophical: F = 0.43 (meaning, consciousness, free will)
├── Creative: F = 0.42 (poetry, fiction, humor)
├── Ethical: F = 0.40 (moral dilemmas)
└── Other: F = 0.39 (paradoxes, meta-questions)
The gap between Tier 1 and Tier 3: Δ_F = 0.494
Here's the kicker: this gap is nearly identical across experiments:
| Experiment | n | Δ_F |
|---|---|---|
| Domain generalization (5 fields) | 30 | 0.496 |
| Cross-model (Gemini Pro) | 20 | 0.480 |
| Final 100-prompt validation | 100 | 0.494 |
The F-calibration algorithm works. Every time.
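The tier gap is straightforward to reproduce from per-response scores. A minimal sketch in Python, using made-up F scores (illustrative values, not the actual research data) in place of the judge logs:

```python
from statistics import mean

# Hypothetical per-response F scores grouped by category (illustrative only).
f_scores = {
    "technical":     [0.90, 0.92, 0.91],
    "factual":       [0.89, 0.91, 0.90],
    "philosophical": [0.42, 0.44, 0.43],
    "creative":      [0.41, 0.43, 0.42],
    "ethical":       [0.39, 0.41, 0.40],
    "other":         [0.38, 0.40, 0.39],
}

TIER1 = ("technical", "factual")
TIER3 = ("philosophical", "creative", "ethical", "other")

def tier_gap(scores: dict) -> float:
    """Delta_F: mean F of Tier 1 categories minus mean F of Tier 3."""
    t1 = mean(mean(scores[c]) for c in TIER1)
    t3 = mean(mean(scores[c]) for c in TIER3)
    return round(t1 - t3, 3)

print(tier_gap(f_scores))  # → 0.495
```

With real data, only the score lists change; the gap computation itself is fixed.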
## Finding #2: F Transfers Across Models, Nothing Else Does
Same 10 prompts, two different models (Gemini Flash vs Pro):
| Axis | Pearson r | What it means |
|---|---|---|
| F (Fact) | 0.963 | Near-identical rankings |
| Bal (Balance) | 0.942 | Formula is model-independent |
| N (Narrative) | 0.742 | Decent agreement |
| M (Depth) | 0.637 | Moderate — content-dependent |
| E (Emotion) | 0.383 | Poor — tone is subjective |
F is objective. E is subjective. Even for AI.
This means: if you build an evaluation system, factual grounding is the axis you can trust across models. Tone assessment requires per-model calibration.
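Cross-model agreement here is plain Pearson correlation over per-prompt scores. A self-contained sketch with hypothetical Flash/Pro F scores (illustrative numbers, not the published data):

```python
from math import sqrt

# Hypothetical per-prompt F scores from two models on the same 10 prompts.
f_flash = [0.91, 0.88, 0.45, 0.60, 0.92, 0.40, 0.85, 0.62, 0.43, 0.90]
f_pro   = [0.93, 0.86, 0.47, 0.58, 0.90, 0.42, 0.88, 0.60, 0.45, 0.89]

def pearson_r(a, b):
    """Pearson correlation between two equal-length score lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

print(f"F cross-model r = {pearson_r(f_flash, f_pro):.3f}")
```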
## Finding #3: Every Category Has a Unique "Fingerprint"
This is the chart that makes TRI·TFM click. Each category produces a distinctive axis profile:
| Category | E | F | N | M | B | Bal | Personality |
|---|---|---|---|---|---|---|---|
| Technical | 0.74 | 0.91 | 0.85 | 0.82 | +0.02 | 0.90 | The reliable expert |
| Factual | 0.74 | 0.90 | 0.83 | 0.75 | 0.00 | 0.89 | The textbook |
| Personal | 0.79 | 0.60 | 0.82 | 0.65 | 0.00 | 0.81 | The therapist |
| Philosophical | 0.72 | 0.43 | 0.81 | 0.69 | 0.00 | 0.78 | The thinker |
| Ethical | 0.74 | 0.40 | 0.81 | 0.72 | 0.00 | 0.76 | The ethicist |
| Directive | 0.79 | 0.62 | 0.85 | 0.70 | +0.72 | 0.65 | The salesman |
| Creative | 0.85 | 0.42 | 0.83 | 0.43 | +0.06 | 0.62 | The artist |
Look at the patterns:
- Technical = highest F + highest M. The model knows stuff AND explains why.
- Creative = highest E + lowest M. Emotionally resonant but doesn't explain anything. Correct.
- Directive = B=+0.72. The model doesn't even pretend to be neutral when asked to persuade. The Bias axis catches this.
- Ethical = low F (0.40) but high M (0.72). You CAN deeply analyze something unfalsifiable. This proves F and M are independent axes.
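A fingerprint is just the per-category mean over the five raw axes. A sketch with a couple of hypothetical judged records (the record shape is an assumption, not the framework's actual log format):

```python
from statistics import mean

AXES = ("E", "F", "N", "M", "B")

# Hypothetical judged responses: category plus per-axis scores.
records = [
    {"cat": "technical", "E": 0.73, "F": 0.92, "N": 0.84, "M": 0.83, "B": 0.02},
    {"cat": "technical", "E": 0.75, "F": 0.90, "N": 0.86, "M": 0.81, "B": 0.02},
    {"cat": "creative",  "E": 0.86, "F": 0.41, "N": 0.84, "M": 0.42, "B": 0.06},
    {"cat": "creative",  "E": 0.84, "F": 0.43, "N": 0.82, "M": 0.44, "B": 0.06},
]

def fingerprint(recs, category):
    """Mean axis profile for one category: its 'fingerprint'."""
    rows = [r for r in recs if r["cat"] == category]
    return {ax: round(mean(r[ax] for r in rows), 2) for ax in AXES}

print(fingerprint(records, "technical"))
# → {'E': 0.74, 'F': 0.91, 'N': 0.85, 'M': 0.82, 'B': 0.02}
```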
## Finding #4: Balance Formula Is Weight-Invariant
"Your formula weights are arbitrary" — the obvious critique. Here's the answer:
Tested 4 weight configurations on 76 measurements:
| Config | w_EFNM | w_B | Mean Bal | %STABLE |
|---|---|---|---|---|
| Default | 0.75 | 0.25 | 0.842 | 92% |
| Bias-heavy | 0.60 | 0.40 | 0.870 | 92% |
| EFNM-heavy | 0.85 | 0.15 | 0.824 | 84% |
| Equal | 0.50 | 0.50 | 0.888 | 95% |
Spearman ρ > 0.97 between ALL pairs.
The ranking doesn't change. The "best" responses are always on top, the "worst" always on bottom. The weights shift the scale, not the order.
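The exact published Bal formula isn't reproduced here; the sketch below assumes a plausible form, a weighted sum of the E/F/N/M mean and a neutrality term 1 - |B|, purely to show how the rank-invariance check works:

```python
from statistics import mean

# Each row: (E, F, N, M, B) for one judged response (illustrative values, tie-free).
rows = [
    (0.74, 0.91, 0.85, 0.82, 0.02),
    (0.85, 0.42, 0.83, 0.43, 0.06),
    (0.79, 0.62, 0.85, 0.70, 0.72),
    (0.72, 0.43, 0.81, 0.69, 0.00),
]

def bal(row, w_efnm, w_b):
    # Assumed form: quality term from E/F/N/M plus a neutrality term
    # penalizing |B| deviation from 0. Not the paper's exact formula.
    e, f, n, m, b = row
    return w_efnm * mean((e, f, n, m)) + w_b * (1 - abs(b))

def ranks(xs):
    # Rank positions, assuming no ties (true for the illustrative data).
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(a, b):
    ra, rb = ranks(a), ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

default = [bal(r, 0.75, 0.25) for r in rows]
equal   = [bal(r, 0.50, 0.50) for r in rows]
print(spearman(default, equal))  # → 1.0 (identical ranking)
```

The scales differ between configs, but the ordering of responses does not, which is exactly what the ρ > 0.97 result reports.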
## Finding #5: RLHF Models Compensate (The Negative Result)
This is the most interesting finding and it's a failure.
I created pairs of prompts — shallow ("What is X?") and deep ("Explain the causal chain of why X works at multiple levels"). Expected: deep prompts get much higher M scores.
| Version | PASS rate | Mean Δ_M | What changed |
|---|---|---|---|
| v1 (initial rubric) | 3/10 | 0.073 | — |
| v2 (tightened rubric) | 3/10 | 0.067 | Stricter scoring bands |
| v3 (fixed responses) | 5/5 | 0.384 | Judge-only, hand-crafted |
| v4 (longer output) | 7/10 | 0.263 | gen_tokens 2048→4096 |
The rubric works perfectly on controlled inputs (5/5). But in end-to-end mode, the generator compensates: even "What is photosynthesis?" gets a multi-paragraph explanation with causal chains.
This is an RLHF property, not a framework limitation. Any evaluation system measuring "depth" on instruction-tuned models will hit this wall. The model always tries to be maximally helpful, which means it over-explains everything.
Implication: If you want to measure depth differences, control the generator or compare across models on the same prompt.
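The pair test reduces to a Δ_M threshold check. A sketch with hypothetical pairs and an assumed PASS threshold of 0.15 (the study's actual threshold isn't stated in this post):

```python
# Paired shallow/deep prompts with judged M (depth) scores — illustrative values.
pairs = [
    {"topic": "photosynthesis", "m_shallow": 0.55, "m_deep": 0.88},
    {"topic": "tcp_handshake",  "m_shallow": 0.70, "m_deep": 0.74},
]

THRESHOLD = 0.15  # assumed minimum Delta_M for a pair to count as PASS

def evaluate(pairs, threshold=THRESHOLD):
    """Return (topic, Delta_M, passed) for each shallow/deep pair."""
    results = []
    for p in pairs:
        delta = round(p["m_deep"] - p["m_shallow"], 3)
        results.append((p["topic"], delta, delta >= threshold))
    return results

for topic, delta, passed in evaluate(pairs):
    print(f"{topic}: dM={delta} {'PASS' if passed else 'FAIL'}")
```

The second pair illustrates the generator-compensation failure mode: the shallow answer already scores high on M, so the delta collapses.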
## Finding #6: Bilingual Robustness
50 English + 50 Russian prompts:
| Axis | EN | RU | Δ |
|---|---|---|---|
| E | 0.761 | 0.770 | +0.009 |
| F | 0.617 | 0.577 | −0.040 |
| N | 0.827 | 0.826 | −0.001 |
| M | 0.688 | 0.665 | −0.024 |
| Bal | 0.777 | 0.769 | −0.008 |
All deltas < 0.05. The framework is language-agnostic.
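The language-agnosticism claim can be checked directly from the table's values:

```python
# Per-language mean axis scores, taken from the table above.
en = {"E": 0.761, "F": 0.617, "N": 0.827, "M": 0.688, "Bal": 0.777}
ru = {"E": 0.770, "F": 0.577, "N": 0.826, "M": 0.665, "Bal": 0.769}

def language_agnostic(en, ru, tol=0.05):
    """True if every axis delta between languages stays under tol."""
    deltas = {ax: round(ru[ax] - en[ax], 3) for ax in en}
    return all(abs(d) < tol for d in deltas.values()), deltas

ok, deltas = language_agnostic(en, ru)
print(ok)  # → True
```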
## Finding #7: Domain Generalization
F-hierarchy tested across 5 professional domains:
| Domain | F_factual | F_philosophical | Δ_F | Status |
|---|---|---|---|---|
| Medicine | 0.933 | 0.400 | 0.533 | ✅ |
| Law | 0.893 | 0.400 | 0.493 | ✅ |
| Finance | 0.900 | 0.400 | 0.500 | ✅ |
| Education | 0.900 | 0.400 | 0.500 | ✅ |
| Marketing | 0.853 | 0.400 | 0.453 | ✅ |
5/5. The 3-step F-calibration generalizes across every domain we tested.
## Finding #8: Judge Reliability Improved 50x
| Metric | Early versions | Final version |
|---|---|---|
| JSON parse failures | 23% (14/60) | 0% (0/100) |
| σ_bal (test-retest) | 0.058 | <0.025 |
| σ_F (test-retest) | 0.035 | 0.000 |
The fix: increasing judge output tokens from 1024→2048 and using strict response_schema enforcement.
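The client side of that fix can be approximated with defensive parsing: strip markdown fences, parse, and validate the required axis keys before accepting a judge score. A minimal sketch (the axis names come from the framework; the fence-stripping heuristic is an assumption about how judges wrap JSON):

```python
import json

REQUIRED = ("E", "F", "N", "M", "B")

def parse_judge(raw: str):
    """Extract a judge score object from raw model output.

    Strips markdown code fences and validates the required axis keys;
    returns None on any failure so the run can be retried or excluded.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop opening fences like ```json and the closing fence.
        lines = [ln for ln in text.splitlines() if not ln.startswith("```")]
        text = "\n".join(lines)
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not all(k in obj for k in REQUIRED):
        return None
    return obj

good = parse_judge('```json\n{"E": 0.7, "F": 0.9, "N": 0.8, "M": 0.8, "B": 0.0}\n```')
bad  = parse_judge("Sure! Here are the scores: E=0.7 ...")
print(good is not None, bad is None)  # → True True
```

Schema enforcement on the API side makes the failure rate drop to zero; this client-side check is the belt-and-suspenders layer that flags anything that still slips through.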
## What's Still Broken (Honest Limitations)
- **L1 — No human validation.** Everything is LLM-judged. We need 3-5 human annotators scoring the same responses to compute inter-rater agreement. This is the #1 priority.
- **L2 — Same model family.** Both Flash and Pro are Gemini. Testing with GPT-4, Claude, and open-source models would strengthen the claims.
- **L3 — N-axis compression.** N ranges from 0.75 to 0.95 with σ = 0.035. RLHF models always produce well-structured responses, so the axis only differentiates on weak models.
- **L4 — E-axis compression.** Same issue: E ranges from 0.70 to 0.90. Modern models are always tone-appropriate.
- **L5 — Self-evaluation bias.** The same model generates and judges. Cross-family evaluation is needed.
## The Evolution: 5 Judge Versions in 6 Months
| Version | Date | Key Change | What Broke | What Fixed |
|---|---|---|---|---|
| v1 | Oct 2025 | Initial 4-axis (E/F/N/B) | Ceiling effects, F inflation | — |
| v2 | Jan 2026 | Strict rubric, variance reduction | F still inflated on philosophy | E/N ceilings fixed |
| v2.1 | Feb 2026 | 3-step F calibration + self-check | N unstable on short creative | F inflation eliminated |
| v3.0 | Mar 2026 | Added M-axis (5 axes), Bloom's grounding | M doesn't discriminate in end-to-end | M validated on controlled |
| v3.0+ | Mar 2026 | Tightened M rubric, extended gen tokens | Generator compensation | 7/10 PASS, 99.4% reliability |
Each version was driven by empirical failure, not theoretical design. 47 documented observations across 4 research phases.
## Try It Yourself

The TRI·TFM Lens Chrome extension is in Chrome Web Store review now. It works on ChatGPT and Google Gemini.
The full research paper (12 pages, 6 figures, LaTeX) is available — DM me or check my Zenodo profile.
The original EFMNB methodology that started this: [Zenodo, September 2025]
700+ evaluations. 8 categories. 2 languages. 2 models. 5 judge versions. 47 observations. One framework.
Arseny Perel — Independent AI Researcher
If you want to discuss the methodology, point out flaws, or suggest experiments — comments are open. Negative results are as valuable as positive ones.