<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Арсений Перель</title>
    <description>The latest articles on DEV Community by Арсений Перель (@aisarus).</description>
    <link>https://dev.to/aisarus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810479%2F1a79a098-2656-484c-abcc-f5812fff7303.png</url>
      <title>DEV Community: Арсений Перель</title>
      <link>https://dev.to/aisarus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aisarus"/>
    <language>en</language>
    <item>
      <title>I built a prompt refactoring engine using a Proposer–Critic–Verifier pipeline</title>
      <dc:creator>Арсений Перель</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:30:25 +0000</pubDate>
      <link>https://dev.to/aisarus/i-built-a-prompt-refactoring-engine-using-a-proposer-critic-verifier-pipeline-9ib</link>
      <guid>https://dev.to/aisarus/i-built-a-prompt-refactoring-engine-using-a-proposer-critic-verifier-pipeline-9ib</guid>
      <description>&lt;p&gt;I’ve been experimenting with a simple idea:&lt;/p&gt;

&lt;p&gt;Maybe many unstable LLM outputs are caused not by the model itself, but by badly structured prompts.&lt;/p&gt;

&lt;p&gt;So I built a web tool that refactors messy prompts into structured prompt specifications.&lt;/p&gt;

&lt;p&gt;Instead of asking the model to “improve” a prompt once, the system runs an optimization loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proposer restructures the prompt&lt;/li&gt;
&lt;li&gt;Critic evaluates clarity, structure, and task definition&lt;/li&gt;
&lt;li&gt;Verifier checks consistency&lt;/li&gt;
&lt;li&gt;Arbiter decides whether another iteration is needed&lt;/li&gt;
&lt;/ul&gt;
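&lt;p&gt;The loop above can be sketched in a few lines. This is a minimal illustration with stubbed role functions standing in for the actual LLM calls; the function names, scoring stub, and threshold are my own placeholders, not the tool's internals.&lt;/p&gt;

```python
# Minimal sketch of the Proposer / Critic / Verifier / Arbiter loop.
# Each role function stands in for an LLM call; the scoring stub and
# the 0.9 target are illustrative assumptions.

def propose(prompt):
    # LLM call in the real system: restructure the prompt. Stubbed.
    return prompt.strip().capitalize()

def critique(prompt):
    # LLM call: score clarity / structure / task definition in [0, 1].
    return 1.0 if prompt.endswith(".") else 0.6

def verify(prompt):
    # LLM call: check internal consistency. Stubbed as "non-empty".
    return bool(prompt)

def refactor(prompt, max_iters=4, target=0.9):
    best, best_score = prompt, 0.0
    for _ in range(max_iters):
        candidate = propose(best)
        if not verify(candidate):
            continue  # reject inconsistent rewrites outright
        score = critique(candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Arbiter: stop once the critic score clears the target
        if best_score >= target:
            break
    return best, best_score
```

&lt;p&gt;In the post's description the Arbiter is itself a model call deciding whether another iteration is needed, not a fixed threshold as in this sketch.&lt;/p&gt;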

&lt;p&gt;The output is a structured prompt spec with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sections&lt;/li&gt;
&lt;li&gt;explicit requirements&lt;/li&gt;
&lt;li&gt;output constraints&lt;/li&gt;
&lt;li&gt;improved clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full optimization usually takes around 30–40 seconds.&lt;/p&gt;

&lt;p&gt;Demo:&lt;br&gt;
&lt;a href="https://how-to-grab-me.vercel.app/" rel="noopener noreferrer"&gt;https://how-to-grab-me.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I’m trying to validate now is simple:&lt;br&gt;
Should prompt refactoring become a standard preprocessing layer for LLM workflows?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I evaluated 700+ AI responses across 5 quality axes — here's the complete dataset and what it reveals</title>
      <dc:creator>Арсений Перель</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:31:52 +0000</pubDate>
      <link>https://dev.to/aisarus/i-evaluated-700-ai-responses-across-5-quality-axes-heres-the-complete-dataset-and-what-it-415a</link>
      <guid>https://dev.to/aisarus/i-evaluated-700-ai-responses-across-5-quality-axes-heres-the-complete-dataset-and-what-it-415a</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a follow-up to my &lt;a href="https://dev.to%D0%92%D0%A1%D0%A2%D0%90%D0%92%D0%AC_%D0%A1%D0%A1%D0%AB%D0%9B%D0%9A%D0%A3_%D0%9D%D0%90_%D0%9F%D0%95%D0%A0%D0%92%D0%AB%D0%99_%D0%9F%D0%9E%D0%A1%D0%A2"&gt;previous post about TRI·TFM Lens&lt;/a&gt;. Here I'm sharing the full research data behind the framework.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scale of the Research
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experiment&lt;/th&gt;
&lt;th&gt;Prompts&lt;/th&gt;
&lt;th&gt;Repeats&lt;/th&gt;
&lt;th&gt;Total Evals&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judge calibration v1-v2 (Logs v5-v8)&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;varied&lt;/td&gt;
&lt;td&gt;~190&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lexeme experiments (3 batches)&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;~90&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain generalization (P1)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis validation v1 (P2)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;46*&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis revalidation v2 (P2)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;59*&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis fixed responses (P2v3)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis extended output (P2v4)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model validation (P5)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Gemini Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final 100-prompt validation&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitivity analysis (P3)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;76×4 configs&lt;/td&gt;
&lt;td&gt;recomputed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;700+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Some runs had JSON parse failures; those totals are marked with an asterisk.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn't a cherry-picked demo. It's 6 months of iterative experimentation across &lt;strong&gt;8 prompt categories, 2 languages, 2 models, 5 judge versions, and 4 research phases&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #1: The F-Hierarchy Is Real and Stable
&lt;/h2&gt;

&lt;p&gt;The Fact axis (epistemic grounding) produces a clean three-tier hierarchy that holds across EVERY experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tier 1 — Verifiable (F &amp;gt; 0.85)
├── Technical:      F = 0.91  (code, algorithms, how-to)
└── Factual:        F = 0.90  (science, history, medicine)

Tier 2 — Mixed (F = 0.55-0.65)
├── Personal:       F = 0.60  (advice, life guidance)
└── Directive:      F = 0.61  (persuasion, argumentation)

Tier 3 — Unfalsifiable (F &amp;lt; 0.45)
├── Philosophical:  F = 0.43  (meaning, consciousness, free will)
├── Creative:       F = 0.42  (poetry, fiction, humor)
├── Ethical:        F = 0.40  (moral dilemmas)
└── Other:          F = 0.39  (paradoxes, meta-questions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gap between Tier 1 and Tier 3: &lt;strong&gt;Δ_F = 0.494&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the kicker: this gap is nearly identical across experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experiment&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Δ_F&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Domain generalization (5 fields)&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0.496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model (Gemini Pro)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final 100-prompt validation&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;0.494&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The F-calibration algorithm works. Every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #2: F Transfers Across Models, Nothing Else Does
&lt;/h2&gt;

&lt;p&gt;Same 10 prompts, two different models (Gemini Flash vs Pro):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Pearson r&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;F (Fact)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.963&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Near-identical rankings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bal (Balance)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.942&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Formula is model-independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N (Narrative)&lt;/td&gt;
&lt;td&gt;0.742&lt;/td&gt;
&lt;td&gt;Decent agreement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M (Depth)&lt;/td&gt;
&lt;td&gt;0.637&lt;/td&gt;
&lt;td&gt;Moderate — content-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E (Emotion)&lt;/td&gt;
&lt;td&gt;0.383&lt;/td&gt;
&lt;td&gt;Poor — tone is subjective&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;F is objective. E is subjective. Even for AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means: if you build an evaluation system, factual grounding is the axis you can trust across models. Tone assessment requires per-model calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #3: Every Category Has a Unique "Fingerprint"
&lt;/h2&gt;

&lt;p&gt;This is the chart that makes TRI·TFM click. Each category produces a distinctive axis profile:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;E&lt;/th&gt;
&lt;th&gt;F&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th&gt;M&lt;/th&gt;
&lt;th&gt;B&lt;/th&gt;
&lt;th&gt;Bal&lt;/th&gt;
&lt;th&gt;Personality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.91&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.02&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The reliable expert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Factual&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.89&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The textbook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personal&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;0.82&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;The therapist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Philosophical&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;The thinker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ethical&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.72&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;td&gt;The ethicist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Directive&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.72&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;The salesman&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.85&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;+0.06&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;The artist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical&lt;/strong&gt; = highest F + highest M. The model knows stuff AND explains why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative&lt;/strong&gt; = highest E + lowest M. Emotionally resonant but doesn't explain anything. Correct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directive&lt;/strong&gt; = B=+0.72. The model doesn't even pretend to be neutral when asked to persuade. The Bias axis catches this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical&lt;/strong&gt; = low F (0.40) but high M (0.72). You CAN deeply analyze something unfalsifiable. This proves F and M are independent axes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Finding #4: Balance Formula Is Weight-Invariant
&lt;/h2&gt;

&lt;p&gt;"Your formula weights are arbitrary" — the obvious critique. Here's the answer:&lt;/p&gt;

&lt;p&gt;Tested 4 weight configurations on 76 measurements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;w_EFNM&lt;/th&gt;
&lt;th&gt;w_B&lt;/th&gt;
&lt;th&gt;Mean Bal&lt;/th&gt;
&lt;th&gt;%STABLE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.842&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bias-heavy&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.870&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EFNM-heavy&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.824&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Equal&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;0.888&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Spearman ρ &amp;gt; 0.97 between ALL pairs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ranking doesn't change. The "best" responses are always on top, the "worst" always on bottom. The weights shift the scale, not the order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #5: RLHF Models Compensate (The Negative Result)
&lt;/h2&gt;

&lt;p&gt;This is the most interesting finding and it's a failure.&lt;/p&gt;

&lt;p&gt;I created pairs of prompts — shallow ("What is X?") and deep ("Explain the causal chain of why X works at multiple levels"). Expected: deep prompts get much higher M scores.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;PASS rate&lt;/th&gt;
&lt;th&gt;Mean Δ_M&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1 (initial rubric)&lt;/td&gt;
&lt;td&gt;3/10&lt;/td&gt;
&lt;td&gt;0.073&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2 (tightened rubric)&lt;/td&gt;
&lt;td&gt;3/10&lt;/td&gt;
&lt;td&gt;0.067&lt;/td&gt;
&lt;td&gt;Stricter scoring bands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3 (fixed responses)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5/5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.384&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judge-only, hand-crafted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v4 (longer output)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.263&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gen_tokens 2048→4096&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rubric works perfectly on controlled inputs (5/5). But in end-to-end mode, the &lt;strong&gt;generator compensates&lt;/strong&gt;: even "What is photosynthesis?" gets a multi-paragraph explanation with causal chains.&lt;/p&gt;

&lt;p&gt;This is an RLHF property, not a framework limitation. Any evaluation system measuring "depth" on instruction-tuned models will hit this wall. The model always tries to be maximally helpful, which means it over-explains everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: If you want to measure depth differences, control the generator or compare across models on the same prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #6: Bilingual Robustness
&lt;/h2&gt;

&lt;p&gt;50 English + 50 Russian prompts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;EN&lt;/th&gt;
&lt;th&gt;RU&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;0.761&lt;/td&gt;
&lt;td&gt;0.770&lt;/td&gt;
&lt;td&gt;+0.009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;0.617&lt;/td&gt;
&lt;td&gt;0.577&lt;/td&gt;
&lt;td&gt;−0.040&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;0.827&lt;/td&gt;
&lt;td&gt;0.826&lt;/td&gt;
&lt;td&gt;−0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M&lt;/td&gt;
&lt;td&gt;0.688&lt;/td&gt;
&lt;td&gt;0.665&lt;/td&gt;
&lt;td&gt;−0.024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bal&lt;/td&gt;
&lt;td&gt;0.777&lt;/td&gt;
&lt;td&gt;0.769&lt;/td&gt;
&lt;td&gt;−0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;All deltas &amp;lt; 0.05.&lt;/strong&gt; The framework is language-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #7: Domain Generalization
&lt;/h2&gt;

&lt;p&gt;F-hierarchy tested across 5 professional domains:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;F_factual&lt;/th&gt;
&lt;th&gt;F_philosophical&lt;/th&gt;
&lt;th&gt;Δ_F&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medicine&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.533&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Law&lt;/td&gt;
&lt;td&gt;0.893&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.493&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.500&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Education&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.500&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;0.853&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.453&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;5/5. The 3-step F-calibration generalizes across every domain we tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #8: Judge Reliability Improved 50x
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Early versions&lt;/th&gt;
&lt;th&gt;Final version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON parse failures&lt;/td&gt;
&lt;td&gt;23% (14/60)&lt;/td&gt;
&lt;td&gt;0% (0/100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;σ_bal (test-retest)&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;&amp;lt;0.025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;σ_F (test-retest)&lt;/td&gt;
&lt;td&gt;0.035&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fix: raising judge output tokens from 1024 to 2048 and enforcing a strict &lt;code&gt;response_schema&lt;/code&gt; on the judge's output.&lt;/p&gt;
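&lt;p&gt;The shape of that hardening, sketched locally: validate the judge's JSON against the required keys and retry on parse failure. The production fix relies on the API's own schema enforcement; the stubbed judge below just exercises the validate-and-retry path.&lt;/p&gt;

```python
# Validate-and-retry sketch around a judge call. The judge is stubbed
# to return malformed JSON on the first attempt so the retry fires.
import json

REQUIRED_KEYS = ("E", "F", "N", "M", "B")

def judge_call(attempt):
    # Stub for the LLM judge call.
    if attempt == 0:
        return '{"E": 0.8, "F": 0.9'  # truncated output
    return '{"E": 0.8, "F": 0.9, "N": 0.85, "M": 0.7, "B": 0.0}'

def evaluate(max_retries=3):
    for attempt in range(max_retries):
        try:
            scores = json.loads(judge_call(attempt))
        except json.JSONDecodeError:
            continue  # parse failure: retry the judge
        if all(k in scores for k in REQUIRED_KEYS):
            return scores
    raise RuntimeError("judge never returned valid JSON")
```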

&lt;h2&gt;
  
  
  What's Still Broken (Honest Limitations)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;L1: No human validation.&lt;/strong&gt; Everything is LLM-judged. We need 3-5 human annotators scoring the same responses to compute inter-rater agreement. This is the #1 priority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2: Same model family.&lt;/strong&gt; Both Flash and Pro are Gemini. Testing with GPT-4, Claude, and open-source models would strengthen claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L3: N-axis compression.&lt;/strong&gt; N ranges from 0.75 to 0.95 with σ = 0.035. RLHF models always produce well-structured responses, so the axis only differentiates on weak models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L4: E-axis compression.&lt;/strong&gt; Same issue: E ranges from 0.70 to 0.90. Modern models are always tone-appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L5: Self-evaluation bias.&lt;/strong&gt; Same model generates and judges. Cross-family evaluation needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution: 5 Judge Versions in 6 Months
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;th&gt;What Broke&lt;/th&gt;
&lt;th&gt;What Fixed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;Oct 2025&lt;/td&gt;
&lt;td&gt;Initial 4-axis (E/F/N/B)&lt;/td&gt;
&lt;td&gt;Ceiling effects, F inflation&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;Jan 2026&lt;/td&gt;
&lt;td&gt;Strict rubric, variance reduction&lt;/td&gt;
&lt;td&gt;F still inflated on philosophy&lt;/td&gt;
&lt;td&gt;E/N ceilings fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2.1&lt;/td&gt;
&lt;td&gt;Feb 2026&lt;/td&gt;
&lt;td&gt;3-step F calibration + self-check&lt;/td&gt;
&lt;td&gt;N unstable on short creative&lt;/td&gt;
&lt;td&gt;F inflation eliminated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3.0&lt;/td&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Added M-axis (5 axes), Bloom's grounding&lt;/td&gt;
&lt;td&gt;M doesn't discriminate in end-to-end&lt;/td&gt;
&lt;td&gt;M validated on controlled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3.0+&lt;/td&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Tightened M rubric, extended gen tokens&lt;/td&gt;
&lt;td&gt;Generator compensation&lt;/td&gt;
&lt;td&gt;7/10 PASS, 99.4% reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each version was driven by empirical failure, not theoretical design. 47 documented observations across 4 research phases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;TRI·TFM Lens&lt;/strong&gt; Chrome extension is in Chrome Web Store review now. It works on ChatGPT and Google Gemini.&lt;/p&gt;

&lt;p&gt;The full research paper (12 pages, 6 figures, LaTeX) is available — DM me or check my Zenodo profile.&lt;/p&gt;

&lt;p&gt;The original EFMNB methodology that started this: [Zenodo, September 2025]&lt;/p&gt;




&lt;p&gt;&lt;em&gt;700+ evaluations. 8 categories. 2 languages. 2 models. 5 judge versions. 47 observations. One framework.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Arseny Perel — Independent AI Researcher&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to discuss the methodology, point out flaws, or suggest experiments — comments are open. Negative results are as valuable as positive ones.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>gemini</category>
      <category>llm</category>
    </item>
    <item>
      <title>I built a Chrome extension that X-rays AI responses — here's what I learned about LLM quality</title>
      <dc:creator>Арсений Перель</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:27:35 +0000</pubDate>
      <link>https://dev.to/aisarus/i-built-a-chrome-extension-that-x-rays-ai-responses-heres-what-i-learned-about-llm-quality-4e9k</link>
      <guid>https://dev.to/aisarus/i-built-a-chrome-extension-that-x-rays-ai-responses-heres-what-i-learned-about-llm-quality-4e9k</guid>
      <description>&lt;p&gt;Every day millions of people use ChatGPT and Gemini. Nobody knows if the answer is actually good.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;TRI·TFM Lens&lt;/strong&gt; — a Chrome extension that evaluates AI responses across 5 dimensions in real-time. Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI responses all &lt;em&gt;sound&lt;/em&gt; confident. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A philosophical essay cites Kant and Nietzsche → sounds factual, but you can't verify "the meaning of life" by experiment&lt;/li&gt;
&lt;li&gt;A persuasive text reads smoothly → but it's pushing you in one direction with Bias=+0.72&lt;/li&gt;
&lt;li&gt;A simple answer to "how are you?" → high emotion, zero facts, zero depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Single quality scores hide all of this. You need a &lt;strong&gt;profile&lt;/strong&gt;, not a number.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Axes
&lt;/h2&gt;

&lt;p&gt;Every response gets scored on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;E&lt;/strong&gt; (Emotion)&lt;/td&gt;
&lt;td&gt;Is the tone appropriate?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;F&lt;/strong&gt; (Fact)&lt;/td&gt;
&lt;td&gt;Can claims be verified?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;N&lt;/strong&gt; (Narrative)&lt;/td&gt;
&lt;td&gt;Is it well-structured?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;M&lt;/strong&gt; (Depth)&lt;/td&gt;
&lt;td&gt;Explains WHY or just states WHAT?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;B&lt;/strong&gt; (Bias)&lt;/td&gt;
&lt;td&gt;Pushes in one direction?&lt;/td&gt;
&lt;td&gt;-1 to +1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus a &lt;strong&gt;Balance&lt;/strong&gt; score that measures uniformity across axes. STABLE ✅, DRIFTING ⚠️, or DOM 🔴.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;F&lt;/th&gt;
&lt;th&gt;M&lt;/th&gt;
&lt;th&gt;B&lt;/th&gt;
&lt;th&gt;Balance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"How are you?"&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.67 DRIFTING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Why don't antibiotics work on viruses?"&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.88 STABLE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Convince me to buy this product"&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;+0.72&lt;/td&gt;
&lt;td&gt;0.65 DRIFTING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What is the meaning of life?"&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.78 STABLE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Fact axis correctly gives philosophy F=0.40 (unfalsifiable) and science F=0.95 (verifiable). Even when the philosophical answer cites real thinkers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part: F-Calibration
&lt;/h2&gt;

&lt;p&gt;Without calibration, the LLM judge gives F=0.75 to philosophical essays because they cite real sources. But citing Kant doesn't make "the meaning of life" verifiable.&lt;/p&gt;

&lt;p&gt;My 3-step fix:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt;: Is the core question falsifiable?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ceiling&lt;/strong&gt;: If no → F ≤ 0.45, period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt; within the ceiling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Self-check prompt: &lt;em&gt;"Could the central thesis be proven wrong by experiment? If NO → F ≤ 0.45"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This transfers across models at &lt;strong&gt;r=0.96&lt;/strong&gt;. The Fact axis is essentially model-independent.&lt;/p&gt;
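&lt;p&gt;The three steps can be sketched as follows. The falsifiability classifier is stubbed with a keyword check for illustration; in the real system the judge answers the self-check question itself.&lt;/p&gt;

```python
# Sketch of the 3-step F-calibration: classify, cap, score within cap.
# The keyword stub below replaces the LLM self-check for illustration.

UNFALSIFIABLE_HINTS = ("meaning of life", "free will", "moral", "beauty")

def is_falsifiable(question):
    # Stub for step 1: "Could the central thesis be proven wrong
    # by experiment?" The real judge answers this itself.
    q = question.lower()
    return not any(h in q for h in UNFALSIFIABLE_HINTS)

def calibrate_f(question, raw_f):
    # Steps 2 and 3: unfalsifiable questions get a hard F ceiling of
    # 0.45, and the raw judge score is capped under it.
    ceiling = 1.0 if is_falsifiable(question) else 0.45
    return min(raw_f, ceiling)
```

&lt;p&gt;So a philosophical essay the judge would naively score F = 0.75 comes out at 0.45, matching the ceilinged scores in the tables above.&lt;/p&gt;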

&lt;h2&gt;
  
  
  Surprise Finding: Generator Compensation
&lt;/h2&gt;

&lt;p&gt;I tried to show that "deep" prompts get higher Depth scores than "shallow" ones. Expected result: obvious.&lt;/p&gt;

&lt;p&gt;Actual result: &lt;strong&gt;only 7/10 worked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? RLHF-trained models compensate. Even "What is photosynthesis?" gets a mini-lecture on electron transport chains. The model &lt;em&gt;always&lt;/em&gt; tries to be helpful, which means it over-explains simple questions.&lt;/p&gt;

&lt;p&gt;The rubric works perfectly on controlled responses (5/5) — the problem is the generator, not the judge. This has implications for anyone building evaluation frameworks for instruction-tuned models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extension: Manifest V3, vanilla JS
Judge: Gemini Flash API (one call per evaluation)
Balance: computed client-side in JS
Storage: chrome.storage.local (API key only)
Sites: ChatGPT, Google Gemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension injects an "Evaluate" button via &lt;code&gt;MutationObserver&lt;/code&gt; (responses load dynamically). Background service worker handles the API call. ~200 lines of actual logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT and Gemini have completely different DOM structures.&lt;/strong&gt; Separate selectors for each site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude.ai blocks content script injection&lt;/strong&gt; via CSP. No workaround found.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome Web Store requires justification for every permission.&lt;/strong&gt; ActiveTab, storage, host access — each needs a separate paragraph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The research took months, the extension took an afternoon.&lt;/strong&gt; 100+ prompt evaluations, statistical validation, cross-model testing — then wrapping it in a Chrome extension was the easy part.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;TRI·TFM Lens is currently in Chrome Web Store review. Coming this week.&lt;/p&gt;

&lt;p&gt;The research framework behind it has been in development since 2025, with a full paper covering 100-prompt validation across 8 categories, 2 languages, and 2 models.&lt;/p&gt;

&lt;p&gt;I'd love feedback — especially on which axes matter most to you, and what other AI sites you'd want supported.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Arseny Perel. Research framework: TRI·TFM (Triangulated Trust–Fact–Meaning).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
