Arseny Perel
I evaluated 700+ AI responses across 5 quality axes — here's the complete dataset and what it reveals

This is a follow-up to my previous post about TRI·TFM Lens. Here I'm sharing the full research data behind the framework.

In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.

Scale of the Research

| Experiment | Prompts | Repeats | Total Evals | Model |
|---|---|---|---|---|
| Judge calibration v1-v2 (Logs v5-v8) | 40+ | varied | ~190 | Gemini Flash |
| Lexeme experiments (3 batches) | 30+ | 3 | ~90 | Gemini Flash |
| Domain generalization (P1) | 10 | 3 | 30 | Gemini Flash |
| M-axis validation v1 (P2) | 20 | 3 | 46\* | Gemini Flash |
| M-axis revalidation v2 (P2) | 20 | 3 | 59\* | Gemini Flash |
| M-axis fixed responses (P2v3) | 10 | 5 | 50 | Gemini Flash |
| M-axis extended output (P2v4) | 20 | 3 | 60 | Gemini Flash |
| Cross-model validation (P5) | 10 | 2 | 20 | Gemini Pro |
| Final 100-prompt validation | 100 | 1 | 100 | Gemini Flash |
| Sensitivity analysis (P3) | 76×4 configs | | recomputed | |
| **Total** | | | ~700+ | |

\* Some evaluations in these runs were lost to JSON parse failures.

This isn't a cherry-picked demo. It's 6 months of iterative experimentation across 8 prompt categories, 2 languages, 2 models, 5 judge versions, and 4 research phases.

Finding #1: The F-Hierarchy Is Real and Stable

The Fact axis (epistemic grounding) produces a clean three-tier hierarchy that holds across EVERY experiment:

```
Tier 1 — Verifiable (F > 0.85)
├── Technical:      F = 0.91  (code, algorithms, how-to)
└── Factual:        F = 0.90  (science, history, medicine)

Tier 2 — Mixed (F = 0.55-0.65)
├── Personal:       F = 0.60  (advice, life guidance)
└── Directive:      F = 0.61  (persuasion, argumentation)

Tier 3 — Unfalsifiable (F < 0.45)
├── Philosophical:  F = 0.43  (meaning, consciousness, free will)
├── Creative:       F = 0.42  (poetry, fiction, humor)
├── Ethical:        F = 0.40  (moral dilemmas)
└── Other:          F = 0.39  (paradoxes, meta-questions)
```

The gap between Tier 1 and Tier 3: Δ_F = 0.494

Here's the kicker — this gap is nearly identical across experiments:

| Experiment | n | Δ_F |
|---|---|---|
| Domain generalization (5 fields) | 30 | 0.496 |
| Cross-model (Gemini Pro) | 20 | 0.480 |
| Final 100-prompt validation | 100 | 0.494 |

The F-calibration algorithm works. Every time.
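
For reference, the tier gap is just the difference of tier means. A minimal sketch recomputing it from the per-category F values in the chart above:

```python
# Per-category mean F scores, taken from the F-hierarchy chart above.
TIER_1 = {"Technical": 0.91, "Factual": 0.90}
TIER_3 = {"Philosophical": 0.43, "Creative": 0.42, "Ethical": 0.40, "Other": 0.39}

def tier_mean(tier: dict) -> float:
    return sum(tier.values()) / len(tier)

delta_f = tier_mean(TIER_1) - tier_mean(TIER_3)
print(f"Δ_F = {delta_f:.3f}")  # 0.495 from rounded chart values; run-level gaps were 0.48-0.50
```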

Finding #2: F Transfers Across Models, Nothing Else Does

Same 10 prompts, two different models (Gemini Flash vs Pro):

| Axis | Pearson r | What it means |
|---|---|---|
| F (Fact) | 0.963 | Near-identical rankings |
| Bal (Balance) | 0.942 | Formula is model-independent |
| N (Narrative) | 0.742 | Decent agreement |
| M (Depth) | 0.637 | Moderate — content-dependent |
| E (Emotion) | 0.383 | Poor — tone is subjective |

F is objective. E is subjective. Even for AI.

This means: if you build an evaluation system, factual grounding is the axis you can trust across models. Tone assessment requires per-model calibration.
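
Reproducing this kind of cross-model check needs nothing more than paired per-prompt scores and a Pearson correlation. A dependency-free sketch (the score lists are hypothetical placeholders, not the actual run data):

```python
import math

def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical F scores for the same 5 prompts on two models:
flash_f = [0.92, 0.88, 0.45, 0.60, 0.40]
pro_f   = [0.90, 0.91, 0.42, 0.63, 0.38]
print(f"r = {pearson(flash_f, pro_f):.3f}")  # high r → the axis transfers across models
```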

Finding #3: Every Category Has a Unique "Fingerprint"

This is the chart that makes TRI·TFM click. Each category produces a distinctive axis profile:

| Category | E | F | N | M | B | Bal | Personality |
|---|---|---|---|---|---|---|---|
| Technical | 0.74 | 0.91 | 0.85 | 0.82 | +0.02 | 0.90 | The reliable expert |
| Factual | 0.74 | 0.90 | 0.83 | 0.75 | 0.00 | 0.89 | The textbook |
| Personal | 0.79 | 0.60 | 0.82 | 0.65 | 0.00 | 0.81 | The therapist |
| Philosophical | 0.72 | 0.43 | 0.81 | 0.69 | 0.00 | 0.78 | The thinker |
| Ethical | 0.74 | 0.40 | 0.81 | 0.72 | 0.00 | 0.76 | The ethicist |
| Directive | 0.79 | 0.62 | 0.85 | 0.70 | +0.72 | 0.65 | The salesman |
| Creative | 0.85 | 0.42 | 0.83 | 0.43 | +0.06 | 0.62 | The artist |

Look at the patterns:

- Technical = highest F + highest M. The model knows stuff AND explains why.
- Creative = highest E + lowest M. Emotionally resonant but doesn't explain anything. Correct.
- Directive = B = +0.72. The model doesn't even pretend to be neutral when asked to persuade. The Bias axis catches this.
- Ethical = low F (0.40) but high M (0.72). You CAN deeply analyze something unfalsifiable. This proves F and M are independent axes.
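
One practical use of the fingerprints: classify a new response by its nearest category profile. A minimal sketch using the published means from the table above (the query vector is hypothetical, and Euclidean distance is just one reasonable choice):

```python
import math

# (E, F, N, M, B) per category, copied from the fingerprint table.
PROFILES = {
    "Technical":     (0.74, 0.91, 0.85, 0.82, 0.02),
    "Factual":       (0.74, 0.90, 0.83, 0.75, 0.00),
    "Personal":      (0.79, 0.60, 0.82, 0.65, 0.00),
    "Philosophical": (0.72, 0.43, 0.81, 0.69, 0.00),
    "Ethical":       (0.74, 0.40, 0.81, 0.72, 0.00),
    "Directive":     (0.79, 0.62, 0.85, 0.70, 0.72),
    "Creative":      (0.85, 0.42, 0.83, 0.43, 0.06),
}

def nearest_category(scores: tuple[float, ...]) -> str:
    """Return the category whose axis profile is closest (Euclidean distance)."""
    return min(PROFILES, key=lambda c: math.dist(scores, PROFILES[c]))

# A hypothetical response: high E, low F and M → reads like "the artist".
print(nearest_category((0.86, 0.45, 0.80, 0.45, 0.05)))  # Creative
```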

Finding #4: Balance Formula Is Weight-Invariant

"Your formula weights are arbitrary" — the obvious critique. Here's the answer:

Tested 4 weight configurations on 76 measurements:

| Config | w_EFNM | w_B | Mean Bal | %STABLE |
|---|---|---|---|---|
| Default | 0.75 | 0.25 | 0.842 | 92% |
| Bias-heavy | 0.60 | 0.40 | 0.870 | 92% |
| EFNM-heavy | 0.85 | 0.15 | 0.824 | 84% |
| Equal | 0.50 | 0.50 | 0.888 | 95% |

Spearman ρ > 0.97 between ALL pairs.

The ranking doesn't change. The "best" responses are always on top, the "worst" always on bottom. The weights shift the scale, not the order.
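
The exact Bal formula isn't restated in this post, so the sketch below assumes a simple weighted form, `w_EFNM * mean(E,F,N,M) + w_B * (1 - |B|)`, purely to illustrate why rank order survives weight changes; the response vectors are hypothetical:

```python
def bal(e, f, n, m, b, w_efnm=0.75, w_b=0.25):
    # Assumed form: quality core (mean of E/F/N/M) plus a bias penalty term.
    # The actual TRI·TFM formula may differ; only the weight split matters here.
    return w_efnm * (e + f + n + m) / 4 + w_b * (1 - abs(b))

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = rank
    return r

def spearman(x, y):
    # Spearman ρ = Pearson correlation of the ranks (no tie handling).
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

# Hypothetical responses as (E, F, N, M, B):
responses = [(0.9, 0.9, 0.85, 0.8, 0.0), (0.7, 0.6, 0.8, 0.65, 0.1),
             (0.8, 0.4, 0.8, 0.7, 0.7), (0.85, 0.4, 0.8, 0.45, 0.05)]
default = [bal(*r) for r in responses]
equal   = [bal(*r, w_efnm=0.5, w_b=0.5) for r in responses]
print(spearman(default, equal))  # 1.0 → identical ranking, different scale
```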

Finding #5: RLHF Models Compensate (The Negative Result)

This is the most interesting finding and it's a failure.

I created pairs of prompts — shallow ("What is X?") and deep ("Explain the causal chain of why X works at multiple levels"). Expected: deep prompts get much higher M scores.

| Version | PASS rate | Mean Δ_M | What changed |
|---|---|---|---|
| v1 (initial rubric) | 3/10 | 0.073 | |
| v2 (tightened rubric) | 3/10 | 0.067 | Stricter scoring bands |
| v3 (fixed responses) | 5/5 | 0.384 | Judge-only, hand-crafted |
| v4 (longer output) | 7/10 | 0.263 | gen_tokens 2048→4096 |

The rubric works perfectly on controlled inputs (5/5). But in end-to-end mode, the generator compensates: even "What is photosynthesis?" gets a multi-paragraph explanation with causal chains.

This is an RLHF property, not a framework limitation. Any evaluation system measuring "depth" on instruction-tuned models will hit this wall. The model always tries to be maximally helpful, which means it over-explains everything.

Implication: If you want to measure depth differences, control the generator or compare across models on the same prompt.
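
A paired-prompt depth check reduces to computing Δ_M per pair and counting how many clear a threshold. A sketch with entirely hypothetical data; the PASS threshold here is an illustrative choice, not the framework's actual criterion:

```python
# Illustrative PASS threshold for Δ_M (not the framework's real criterion).
PASS_DELTA = 0.15

def depth_report(pairs):
    """pairs: list of (M_shallow, M_deep) per prompt pair."""
    deltas = [deep - shallow for shallow, deep in pairs]
    passed = sum(d >= PASS_DELTA for d in deltas)
    return passed, len(pairs), sum(deltas) / len(deltas)

# Hypothetical end-to-end result: the generator over-explains shallow prompts,
# so most deltas stay small even when the deep prompt scores well.
pairs = [(0.70, 0.74), (0.68, 0.75), (0.72, 0.90), (0.66, 0.71), (0.69, 0.88)]
passed, total, mean_delta = depth_report(pairs)
print(f"PASS {passed}/{total}, mean Δ_M = {mean_delta:.3f}")
```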

Finding #6: Bilingual Robustness

50 English + 50 Russian prompts:

| Axis | EN | RU | Δ |
|---|---|---|---|
| E | 0.761 | 0.770 | +0.009 |
| F | 0.617 | 0.577 | −0.040 |
| N | 0.827 | 0.826 | −0.001 |
| M | 0.688 | 0.665 | −0.024 |
| Bal | 0.777 | 0.769 | −0.008 |

All deltas < 0.05. The framework is language-agnostic.
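
The language-agnosticism claim is easy to state as code: every axis shifts by less than 0.05 between languages. A sketch using the table's values:

```python
# Per-axis means from the bilingual table above.
EN = {"E": 0.761, "F": 0.617, "N": 0.827, "M": 0.688, "Bal": 0.777}
RU = {"E": 0.770, "F": 0.577, "N": 0.826, "M": 0.665, "Bal": 0.769}

# RU−EN delta per axis; language-agnostic means every |Δ| < 0.05.
deltas = {axis: RU[axis] - EN[axis] for axis in EN}
assert all(abs(d) < 0.05 for d in deltas.values())
print(deltas)
```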

Finding #7: Domain Generalization

F-hierarchy tested across 5 professional domains:

| Domain | F_factual | F_philosophical | Δ_F | Status |
|---|---|---|---|---|
| Medicine | 0.933 | 0.400 | 0.533 | PASS |
| Law | 0.893 | 0.400 | 0.493 | PASS |
| Finance | 0.900 | 0.400 | 0.500 | PASS |
| Education | 0.900 | 0.400 | 0.500 | PASS |
| Marketing | 0.853 | 0.400 | 0.453 | PASS |

5/5. The 3-step F-calibration generalizes across every domain we tested.

Finding #8: Judge Reliability Improved 50x

| Metric | Early versions | Final version |
|---|---|---|
| JSON parse failures | 23% (14/60) | 0% (0/100) |
| σ_bal (test-retest) | 0.058 | <0.025 |
| σ_F (test-retest) | 0.035 | 0.000 |

The fix: increasing judge output tokens from 1024→2048 and using strict response_schema enforcement.
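
Schema enforcement happens on the model side, but a defensive client-side parse of the judge's JSON still helps catch the residual failures. A sketch; the field names mirror the five axes, while the function name and error handling are illustrative, not the actual pipeline code:

```python
import json

AXES = ("E", "F", "N", "M", "B")

def parse_judge_output(raw: str) -> dict:
    """Parse and range-check one judge response; raise ValueError if malformed."""
    scores = json.loads(raw)
    for axis in AXES:
        if axis not in scores:
            raise ValueError(f"missing axis {axis}")
        lo = -1.0 if axis == "B" else 0.0  # B is signed; the rest live in 0..1
        if not lo <= scores[axis] <= 1.0:
            raise ValueError(f"{axis}={scores[axis]} out of range")
    return scores

ok = parse_judge_output('{"E": 0.8, "F": 0.9, "N": 0.85, "M": 0.7, "B": 0.0}')
print(ok["F"])  # 0.9
```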

What's Still Broken (Honest Limitations)

L1: No human validation. Everything is LLM-judged. We need 3-5 human annotators scoring the same responses to compute inter-rater agreement. This is the #1 priority.

L2: Same model family. Both Flash and Pro are Gemini. Testing with GPT-4, Claude, and open-source models would strengthen claims.

L3: N-axis compression. N ranges from 0.75-0.95 with σ=0.035. RLHF models always produce well-structured responses. The axis only differentiates on weak models.

L4: E-axis compression. Same issue. E ranges 0.70-0.90. Modern models are always tone-appropriate.

L5: Self-evaluation bias. Same model generates and judges. Cross-family evaluation needed.

The Evolution: 5 Judge Versions in 6 Months

| Version | Date | Key Change | What Broke | What Fixed |
|---|---|---|---|---|
| v1 | Oct 2025 | Initial 4-axis (E/F/N/B) | Ceiling effects, F inflation | |
| v2 | Jan 2026 | Strict rubric, variance reduction | F still inflated on philosophy | E/N ceilings |
| v2.1 | Feb 2026 | 3-step F calibration + self-check | N unstable on short creative | F inflation eliminated |
| v3.0 | Mar 2026 | Added M-axis (5 axes), Bloom's grounding | M doesn't discriminate end-to-end | M validated on controlled inputs |
| v3.0+ | Mar 2026 | Tightened M rubric, extended gen tokens | Generator compensation | 7/10 PASS, 99.4% reliability |

Each version was driven by empirical failure, not theoretical design. 47 documented observations across 4 research phases.

Try It Yourself

TRI·TFM Lens Chrome extension is in Web Store review now. Works on ChatGPT and Google Gemini.

The full research paper (12 pages, 6 figures, LaTeX) is available — DM me or check my Zenodo profile.

The original EFMNB methodology that started this: [Zenodo, September 2025]


700+ evaluations. 8 categories. 2 languages. 2 models. 5 judge versions. 47 observations. One framework.

Arseny Perel — Independent AI Researcher

If you want to discuss the methodology, point out flaws, or suggest experiments — comments are open. Negative results are as valuable as positive ones.
