Arseny Perel
I evaluated 700+ AI responses across 5 quality axes — here's the complete dataset and what it reveals

This is a follow-up to my previous post about TRI·TFM Lens. Here I'm sharing the full research data behind the framework.

In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.

Scale of the Research

| Experiment | Prompts | Repeats | Total Evals | Model |
|---|---|---|---|---|
| Judge calibration v1-v2 (Logs v5-v8) | 40+ | varied | ~190 | Gemini Flash |
| Lexeme experiments (3 batches) | 30+ | 3 | ~90 | Gemini Flash |
| Domain generalization (P1) | 10 | 3 | 30 | Gemini Flash |
| M-axis validation v1 (P2) | 20 | 3 | 46\* | Gemini Flash |
| M-axis revalidation v2 (P2) | 20 | 3 | 59\* | Gemini Flash |
| M-axis fixed responses (P2v3) | 10 | 5 | 50 | Gemini Flash |
| M-axis extended output (P2v4) | 20 | 3 | 60 | Gemini Flash |
| Cross-model validation (P5) | 10 | 2 | 20 | Gemini Pro |
| Final 100-prompt validation | 100 | 1 | 100 | Gemini Flash |
| Sensitivity analysis (P3) | 76×4 configs | | recomputed | |
| **Total** | | | ~700+ | |

\* Some evaluations in these runs were lost to JSON parse failures.

This isn't a cherry-picked demo. It's 6 months of iterative experimentation across 8 prompt categories, 2 languages, 2 models, 5 judge versions, and 4 research phases.

Finding #1: The F-Hierarchy Is Real and Stable

The Fact axis (epistemic grounding) produces a clean three-tier hierarchy that holds across EVERY experiment:

```
Tier 1 — Verifiable (F > 0.85)
├── Technical:      F = 0.91  (code, algorithms, how-to)
└── Factual:        F = 0.90  (science, history, medicine)

Tier 2 — Mixed (F = 0.55-0.65)
├── Personal:       F = 0.60  (advice, life guidance)
└── Directive:      F = 0.61  (persuasion, argumentation)

Tier 3 — Unfalsifiable (F < 0.45)
├── Philosophical:  F = 0.43  (meaning, consciousness, free will)
├── Creative:       F = 0.42  (poetry, fiction, humor)
├── Ethical:        F = 0.40  (moral dilemmas)
└── Other:          F = 0.39  (paradoxes, meta-questions)
```

The gap between Tier 1 and Tier 3: Δ_F = 0.494

Here's the kicker — this gap is nearly identical across experiments:

| Experiment | n | Δ_F |
|---|---|---|
| Domain generalization (5 fields) | 30 | 0.496 |
| Cross-model (Gemini Pro) | 20 | 0.480 |
| Final 100-prompt validation | 100 | 0.494 |

The F-calibration algorithm works. Every time.
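
For reference, the tier gap is just the difference of tier means. A minimal sketch recomputing it from the per-category F values in the chart above:

```python
# Per-category mean F scores, taken from the F-hierarchy chart above.
TIER_1 = {"Technical": 0.91, "Factual": 0.90}
TIER_3 = {"Philosophical": 0.43, "Creative": 0.42, "Ethical": 0.40, "Other": 0.39}

def tier_mean(tier: dict) -> float:
    return sum(tier.values()) / len(tier)

delta_f = tier_mean(TIER_1) - tier_mean(TIER_3)
print(f"Δ_F = {delta_f:.3f}")  # 0.495 from rounded chart values; run-level gaps were 0.48-0.50
```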

Finding #2: F Transfers Across Models, Nothing Else Does

Same 10 prompts, two different models (Gemini Flash vs Pro):

| Axis | Pearson r | What it means |
|---|---|---|
| F (Fact) | 0.963 | Near-identical rankings |
| Bal (Balance) | 0.942 | Formula is model-independent |
| N (Narrative) | 0.742 | Decent agreement |
| M (Depth) | 0.637 | Moderate — content-dependent |
| E (Emotion) | 0.383 | Poor — tone is subjective |

F is objective. E is subjective. Even for AI.

This means: if you build an evaluation system, factual grounding is the axis you can trust across models. Tone assessment requires per-model calibration.
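
Reproducing this kind of cross-model check needs nothing more than paired per-prompt scores and a Pearson correlation. A dependency-free sketch (the score lists are hypothetical placeholders, not the actual run data):

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical F scores for the same 5 prompts on two models:
flash_f = [0.92, 0.88, 0.45, 0.60, 0.40]
pro_f   = [0.90, 0.91, 0.42, 0.63, 0.38]
print(f"r = {pearson(flash_f, pro_f):.3f}")  # high r → the axis transfers across models
```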

Finding #3: Every Category Has a Unique "Fingerprint"

This is the chart that makes TRI·TFM click. Each category produces a distinctive axis profile:

| Category | E | F | N | M | B | Bal | Personality |
|---|---|---|---|---|---|---|---|
| Technical | 0.74 | 0.91 | 0.85 | 0.82 | +0.02 | 0.90 | The reliable expert |
| Factual | 0.74 | 0.90 | 0.83 | 0.75 | 0.00 | 0.89 | The textbook |
| Personal | 0.79 | 0.60 | 0.82 | 0.65 | 0.00 | 0.81 | The therapist |
| Philosophical | 0.72 | 0.43 | 0.81 | 0.69 | 0.00 | 0.78 | The thinker |
| Ethical | 0.74 | 0.40 | 0.81 | 0.72 | 0.00 | 0.76 | The ethicist |
| Directive | 0.79 | 0.62 | 0.85 | 0.70 | +0.72 | 0.65 | The salesman |
| Creative | 0.85 | 0.42 | 0.83 | 0.43 | +0.06 | 0.62 | The artist |

Look at the patterns:

- Technical = highest F + highest M. The model knows stuff AND explains why.
- Creative = highest E + lowest M. Emotionally resonant but doesn't explain anything. Correct.
- Directive = B = +0.72. The model doesn't even pretend to be neutral when asked to persuade. The Bias axis catches this.
- Ethical = low F (0.40) but high M (0.72). You CAN deeply analyze something unfalsifiable. This proves F and M are independent axes.
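
One practical use of the fingerprints: classify a new response by its nearest category profile. A minimal sketch using the published means from the table above (the query vector is hypothetical, and Euclidean distance is just one reasonable choice):

```python
import math

# (E, F, N, M, B) per category, copied from the fingerprint table.
PROFILES = {
    "Technical":     (0.74, 0.91, 0.85, 0.82, 0.02),
    "Factual":       (0.74, 0.90, 0.83, 0.75, 0.00),
    "Personal":      (0.79, 0.60, 0.82, 0.65, 0.00),
    "Philosophical": (0.72, 0.43, 0.81, 0.69, 0.00),
    "Ethical":       (0.74, 0.40, 0.81, 0.72, 0.00),
    "Directive":     (0.79, 0.62, 0.85, 0.70, 0.72),
    "Creative":      (0.85, 0.42, 0.83, 0.43, 0.06),
}

def nearest_category(scores: tuple[float, ...]) -> str:
    """Return the category whose axis profile is closest (Euclidean distance)."""
    return min(PROFILES, key=lambda c: math.dist(scores, PROFILES[c]))

# A hypothetical response: high E, low F and M → reads like "the artist".
print(nearest_category((0.86, 0.45, 0.80, 0.45, 0.05)))  # Creative
```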

Finding #4: Balance Formula Is Weight-Invariant

"Your formula weights are arbitrary" — the obvious critique. Here's the answer:

Tested 4 weight configurations on 76 measurements:

| Config | w_EFNM | w_B | Mean Bal | %STABLE |
|---|---|---|---|---|
| Default | 0.75 | 0.25 | 0.842 | 92% |
| Bias-heavy | 0.60 | 0.40 | 0.870 | 92% |
| EFNM-heavy | 0.85 | 0.15 | 0.824 | 84% |
| Equal | 0.50 | 0.50 | 0.888 | 95% |

Spearman ρ > 0.97 between ALL pairs.

The ranking doesn't change. The "best" responses are always on top, the "worst" always on bottom. The weights shift the scale, not the order.
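
The exact Bal formula isn't restated in this post, so the sketch below assumes a simple weighted form, `w_EFNM * mean(E,F,N,M) + w_B * (1 - |B|)`, purely to illustrate why rank order survives weight changes; the response vectors are hypothetical:

```python
def bal(e, f, n, m, b, w_efnm=0.75, w_b=0.25):
    # Assumed form: quality core (mean of E/F/N/M) plus a bias penalty term.
    # The actual TRI·TFM formula may differ; only the weight split matters here.
    return w_efnm * (e + f + n + m) / 4 + w_b * (1 - abs(b))

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman ρ = Pearson correlation of the ranks (no tie handling).
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Hypothetical responses as (E, F, N, M, B):
responses = [(0.9, 0.9, 0.85, 0.8, 0.0), (0.7, 0.6, 0.8, 0.65, 0.1),
             (0.8, 0.4, 0.8, 0.7, 0.7), (0.85, 0.4, 0.8, 0.45, 0.05)]
default = [bal(*r) for r in responses]
equal   = [bal(*r, w_efnm=0.5, w_b=0.5) for r in responses]
print(spearman(default, equal))  # 1.0 → identical ranking, different scale
```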

Finding #5: RLHF Models Compensate (The Negative Result)

This is the most interesting finding and it's a failure.

I created pairs of prompts — shallow ("What is X?") and deep ("Explain the causal chain of why X works at multiple levels"). Expected: deep prompts get much higher M scores.

| Version | PASS rate | Mean Δ_M | What changed |
|---|---|---|---|
| v1 (initial rubric) | 3/10 | 0.073 | |
| v2 (tightened rubric) | 3/10 | 0.067 | Stricter scoring bands |
| v3 (fixed responses) | 5/5 | 0.384 | Judge-only, hand-crafted |
| v4 (longer output) | 7/10 | 0.263 | gen_tokens 2048→4096 |

The rubric works perfectly on controlled inputs (5/5). But in end-to-end mode, the generator compensates: even "What is photosynthesis?" gets a multi-paragraph explanation with causal chains.

This is an RLHF property, not a framework limitation. Any evaluation system measuring "depth" on instruction-tuned models will hit this wall. The model always tries to be maximally helpful, which means it over-explains everything.

Implication: If you want to measure depth differences, control the generator or compare across models on the same prompt.
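
A paired-prompt depth check reduces to computing Δ_M per pair and counting how many clear a threshold. A sketch with entirely hypothetical data; the PASS threshold here is an illustrative choice, not the framework's actual criterion:

```python
# Illustrative PASS threshold for Δ_M (not the framework's real criterion).
PASS_DELTA = 0.15

def depth_report(pairs):
    """pairs: list of (M_shallow, M_deep) per prompt pair."""
    deltas = [deep - shallow for shallow, deep in pairs]
    passed = sum(d >= PASS_DELTA for d in deltas)
    return passed, len(pairs), sum(deltas) / len(deltas)

# Hypothetical end-to-end result: the generator over-explains shallow prompts,
# so most deltas stay small even when the deep prompt scores well.
pairs = [(0.70, 0.74), (0.68, 0.75), (0.72, 0.90), (0.66, 0.71), (0.69, 0.88)]
passed, total, mean_delta = depth_report(pairs)
print(f"PASS {passed}/{total}, mean Δ_M = {mean_delta:.3f}")
```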

Finding #6: Bilingual Robustness

50 English + 50 Russian prompts:

| Axis | EN | RU | Δ |
|---|---|---|---|
| E | 0.761 | 0.770 | +0.009 |
| F | 0.617 | 0.577 | −0.040 |
| N | 0.827 | 0.826 | −0.001 |
| M | 0.688 | 0.665 | −0.024 |
| Bal | 0.777 | 0.769 | −0.008 |

All deltas < 0.05. The framework is language-agnostic.
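
The language-agnosticism claim is easy to state as code: every axis shifts by less than 0.05 between languages. A sketch using the table's values:

```python
# Per-axis means from the bilingual table above.
EN = {"E": 0.761, "F": 0.617, "N": 0.827, "M": 0.688, "Bal": 0.777}
RU = {"E": 0.770, "F": 0.577, "N": 0.826, "M": 0.665, "Bal": 0.769}

# RU−EN delta per axis; language-agnostic means every |Δ| < 0.05.
deltas = {axis: RU[axis] - EN[axis] for axis in EN}
assert all(abs(d) < 0.05 for d in deltas.values())
print(deltas)
```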

Finding #7: Domain Generalization

F-hierarchy tested across 5 professional domains:

| Domain | F_factual | F_philosophical | Δ_F | Status |
|---|---|---|---|---|
| Medicine | 0.933 | 0.400 | 0.533 | PASS |
| Law | 0.893 | 0.400 | 0.493 | PASS |
| Finance | 0.900 | 0.400 | 0.500 | PASS |
| Education | 0.900 | 0.400 | 0.500 | PASS |
| Marketing | 0.853 | 0.400 | 0.453 | PASS |

5/5. The 3-step F-calibration generalizes across every domain we tested.

Finding #8: Judge Reliability Improved 50x

| Metric | Early versions | Final version |
|---|---|---|
| JSON parse failures | 23% (14/60) | 0% (0/100) |
| σ_bal (test-retest) | 0.058 | <0.025 |
| σ_F (test-retest) | 0.035 | 0.000 |

The fix: increasing judge output tokens from 1024→2048 and using strict response_schema enforcement.
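
Schema enforcement happens on the model side, but a defensive client-side parse of the judge's JSON still helps catch the residual failures. A sketch; the field names mirror the five axes, while the function name and error handling are illustrative, not the actual pipeline code:

```python
import json

AXES = ("E", "F", "N", "M", "B")

def parse_judge_output(raw: str) -> dict:
    """Parse and range-check one judge response; raise ValueError if malformed."""
    scores = json.loads(raw)
    for axis in AXES:
        if axis not in scores:
            raise ValueError(f"missing axis {axis}")
        lo = -1.0 if axis == "B" else 0.0  # B is signed; the rest live in 0..1
        if not lo <= scores[axis] <= 1.0:
            raise ValueError(f"{axis}={scores[axis]} out of range")
    return scores

ok = parse_judge_output('{"E": 0.8, "F": 0.9, "N": 0.85, "M": 0.7, "B": 0.0}')
print(ok["F"])  # 0.9
```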

What's Still Broken (Honest Limitations)

L1: No human validation. Everything is LLM-judged. We need 3-5 human annotators scoring the same responses to compute inter-rater agreement. This is the #1 priority.

L2: Same model family. Both Flash and Pro are Gemini. Testing with GPT-4, Claude, and open-source models would strengthen claims.

L3: N-axis compression. N ranges from 0.75-0.95 with σ=0.035. RLHF models always produce well-structured responses. The axis only differentiates on weak models.

L4: E-axis compression. Same issue. E ranges 0.70-0.90. Modern models are always tone-appropriate.

L5: Self-evaluation bias. Same model generates and judges. Cross-family evaluation needed.

The Evolution: 5 Judge Versions in 6 Months

| Version | Date | Key Change | What Broke | What Fixed |
|---|---|---|---|---|
| v1 | Oct 2025 | Initial 4-axis (E/F/N/B) | Ceiling effects, F inflation | |
| v2 | Jan 2026 | Strict rubric, variance reduction | F still inflated on philosophy | E/N ceilings |
| v2.1 | Feb 2026 | 3-step F calibration + self-check | N unstable on short creative | F inflation eliminated |
| v3.0 | Mar 2026 | Added M-axis (5 axes), Bloom's grounding | M doesn't discriminate end-to-end | M validated on controlled inputs |
| v3.0+ | Mar 2026 | Tightened M rubric, extended gen tokens | Generator compensation | 7/10 PASS, 99.4% reliability |

Each version was driven by empirical failure, not theoretical design. 47 documented observations across 4 research phases.

Try It Yourself

TRI·TFM Lens Chrome extension is in Web Store review now. Works on ChatGPT and Google Gemini.

The full research paper (12 pages, 6 figures, LaTeX) is available — DM me or check my Zenodo profile.

The original EFMNB methodology that started this: [Zenodo, September 2025]


700+ evaluations. 8 categories. 2 languages. 2 models. 5 judge versions. 47 observations. One framework.

Arseny Perel — Independent AI Researcher

If you want to discuss the methodology, point out flaws, or suggest experiments — comments are open. Negative results are as valuable as positive ones.
