<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Арсений Перель</title>
    <description>The latest articles on DEV Community by Арсений Перель (@aisarus).</description>
    <link>https://dev.to/aisarus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3810479%2F1a79a098-2656-484c-abcc-f5812fff7303.png</url>
      <title>DEV Community: Арсений Перель</title>
      <link>https://dev.to/aisarus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aisarus"/>
    <language>en</language>
    <item>
      <title>I built a prompt refactoring engine using a Proposer–Critic–Verifier pipeline</title>
      <dc:creator>Арсений Перель</dc:creator>
      <pubDate>Fri, 13 Mar 2026 15:30:25 +0000</pubDate>
      <link>https://dev.to/aisarus/i-built-a-prompt-refactoring-engine-using-a-proposer-critic-verifier-pipeline-9ib</link>
      <guid>https://dev.to/aisarus/i-built-a-prompt-refactoring-engine-using-a-proposer-critic-verifier-pipeline-9ib</guid>
      <description>&lt;p&gt;I’ve been experimenting with a simple idea:&lt;/p&gt;

&lt;p&gt;Maybe many unstable LLM outputs are caused not by the model itself, but by badly structured prompts.&lt;/p&gt;

&lt;p&gt;So I built a web tool that refactors messy prompts into structured prompt specifications.&lt;/p&gt;

&lt;p&gt;Instead of asking the model to “improve” a prompt once, the system runs an optimization loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Proposer restructures the prompt&lt;/li&gt;
&lt;li&gt;Critic evaluates clarity, structure, and task definition&lt;/li&gt;
&lt;li&gt;Verifier checks consistency&lt;/li&gt;
&lt;li&gt;Arbiter decides whether another iteration is needed&lt;/li&gt;
&lt;/ul&gt;
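&lt;p&gt;The loop above can be sketched in a few lines. This is a minimal illustration with stubbed role functions standing in for the actual LLM calls; the function names, scoring stub, and threshold are my own placeholders, not the tool's internals.&lt;/p&gt;

```python
# Minimal sketch of the Proposer / Critic / Verifier / Arbiter loop.
# Each role function stands in for an LLM call; the scoring stub and
# the 0.9 target are illustrative assumptions.

def propose(prompt):
    # LLM call in the real system: restructure the prompt. Stubbed.
    return prompt.strip().capitalize()

def critique(prompt):
    # LLM call: score clarity / structure / task definition in [0, 1].
    return 1.0 if prompt.endswith(".") else 0.6

def verify(prompt):
    # LLM call: check internal consistency. Stubbed as "non-empty".
    return bool(prompt)

def refactor(prompt, max_iters=4, target=0.9):
    best, best_score = prompt, 0.0
    for _ in range(max_iters):
        candidate = propose(best)
        if not verify(candidate):
            continue  # reject inconsistent rewrites outright
        score = critique(candidate)
        if score > best_score:
            best, best_score = candidate, score
        # Arbiter: stop once the critic score clears the target
        if best_score >= target:
            break
    return best, best_score
```

&lt;p&gt;In the post's description the Arbiter is itself a model call deciding whether another iteration is needed, not a fixed threshold as in this sketch.&lt;/p&gt;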

&lt;p&gt;The output is a structured prompt spec with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sections&lt;/li&gt;
&lt;li&gt;explicit requirements&lt;/li&gt;
&lt;li&gt;output constraints&lt;/li&gt;
&lt;li&gt;improved clarity&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full optimization usually takes around 30–40 seconds.&lt;/p&gt;

&lt;p&gt;Demo:&lt;br&gt;
&lt;a href="https://how-to-grab-me.vercel.app/" rel="noopener noreferrer"&gt;https://how-to-grab-me.vercel.app/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What I’m trying to validate now is simple:&lt;br&gt;
Should prompt refactoring become a standard preprocessing layer for LLM workflows?&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>I evaluated 700+ AI responses across 5 quality axes — here's the complete dataset and what it reveals</title>
      <dc:creator>Арсений Перель</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:31:52 +0000</pubDate>
      <link>https://dev.to/aisarus/i-evaluated-700-ai-responses-across-5-quality-axes-heres-the-complete-dataset-and-what-it-415a</link>
      <guid>https://dev.to/aisarus/i-evaluated-700-ai-responses-across-5-quality-axes-heres-the-complete-dataset-and-what-it-415a</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a follow-up to my &lt;a href="https://dev.to%D0%92%D0%A1%D0%A2%D0%90%D0%92%D0%AC_%D0%A1%D0%A1%D0%AB%D0%9B%D0%9A%D0%A3_%D0%9D%D0%90_%D0%9F%D0%95%D0%A0%D0%92%D0%AB%D0%99_%D0%9F%D0%9E%D0%A1%D0%A2"&gt;previous post about TRI·TFM Lens&lt;/a&gt;. Here I'm sharing the full research data behind the framework.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In September 2025, I published the initial EFMNB methodology on Zenodo. Six months and 700+ evaluated responses later, here's what the data actually shows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scale of the Research
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experiment&lt;/th&gt;
&lt;th&gt;Prompts&lt;/th&gt;
&lt;th&gt;Repeats&lt;/th&gt;
&lt;th&gt;Total Evals&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Judge calibration v1-v2 (Logs v5-v8)&lt;/td&gt;
&lt;td&gt;40+&lt;/td&gt;
&lt;td&gt;varied&lt;/td&gt;
&lt;td&gt;~190&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lexeme experiments (3 batches)&lt;/td&gt;
&lt;td&gt;30+&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;~90&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Domain generalization (P1)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis validation v1 (P2)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;46*&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis revalidation v2 (P2)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;59*&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis fixed responses (P2v3)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;50&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M-axis extended output (P2v4)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;60&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model validation (P5)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;Gemini Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final 100-prompt validation&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;Gemini Flash&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sensitivity analysis (P3)&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;76×4 configs&lt;/td&gt;
&lt;td&gt;recomputed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;700+&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Some runs had JSON parse failures; those totals are marked with an asterisk.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This isn't a cherry-picked demo. It's 6 months of iterative experimentation across &lt;strong&gt;8 prompt categories, 2 languages, 2 models, 5 judge versions, and 4 research phases&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #1: The F-Hierarchy Is Real and Stable
&lt;/h2&gt;

&lt;p&gt;The Fact axis (epistemic grounding) produces a clean three-tier hierarchy that holds across EVERY experiment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tier 1 — Verifiable (F &amp;gt; 0.85)
├── Technical:      F = 0.91  (code, algorithms, how-to)
└── Factual:        F = 0.90  (science, history, medicine)

Tier 2 — Mixed (F = 0.55-0.65)
├── Personal:       F = 0.60  (advice, life guidance)
└── Directive:      F = 0.61  (persuasion, argumentation)

Tier 3 — Unfalsifiable (F &amp;lt; 0.45)
├── Philosophical:  F = 0.43  (meaning, consciousness, free will)
├── Creative:       F = 0.42  (poetry, fiction, humor)
├── Ethical:        F = 0.40  (moral dilemmas)
└── Other:          F = 0.39  (paradoxes, meta-questions)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gap between Tier 1 and Tier 3: &lt;strong&gt;Δ_F = 0.494&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here's the kicker: this gap is nearly identical across experiments:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Experiment&lt;/th&gt;
&lt;th&gt;n&lt;/th&gt;
&lt;th&gt;Δ_F&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Domain generalization (5 fields)&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;0.496&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cross-model (Gemini Pro)&lt;/td&gt;
&lt;td&gt;20&lt;/td&gt;
&lt;td&gt;0.480&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final 100-prompt validation&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;td&gt;0.494&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The F-calibration algorithm works. Every time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #2: F Transfers Across Models, Nothing Else Does
&lt;/h2&gt;

&lt;p&gt;Same 10 prompts, two different models (Gemini Flash vs Pro):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;Pearson r&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;F (Fact)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.963&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Near-identical rankings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bal (Balance)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.942&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Formula is model-independent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N (Narrative)&lt;/td&gt;
&lt;td&gt;0.742&lt;/td&gt;
&lt;td&gt;Decent agreement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M (Depth)&lt;/td&gt;
&lt;td&gt;0.637&lt;/td&gt;
&lt;td&gt;Moderate — content-dependent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;E (Emotion)&lt;/td&gt;
&lt;td&gt;0.383&lt;/td&gt;
&lt;td&gt;Poor — tone is subjective&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;F is objective. E is subjective. Even for AI.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This means: if you build an evaluation system, factual grounding is the axis you can trust across models. Tone assessment requires per-model calibration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #3: Every Category Has a Unique "Fingerprint"
&lt;/h2&gt;

&lt;p&gt;This is the chart that makes TRI·TFM click. Each category produces a distinctive axis profile:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;E&lt;/th&gt;
&lt;th&gt;F&lt;/th&gt;
&lt;th&gt;N&lt;/th&gt;
&lt;th&gt;M&lt;/th&gt;
&lt;th&gt;B&lt;/th&gt;
&lt;th&gt;Bal&lt;/th&gt;
&lt;th&gt;Personality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Technical&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.91&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;+0.02&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The reliable expert&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Factual&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.90&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.89&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The textbook&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Personal&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;0.82&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;The therapist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Philosophical&lt;/td&gt;
&lt;td&gt;0.72&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;The thinker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ethical&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.72&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;td&gt;The ethicist&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Directive&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+0.72&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.65&lt;/td&gt;
&lt;td&gt;The salesman&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.85&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.42&lt;/td&gt;
&lt;td&gt;0.83&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;+0.06&lt;/td&gt;
&lt;td&gt;0.62&lt;/td&gt;
&lt;td&gt;The artist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Look at the patterns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Technical&lt;/strong&gt; = highest F + highest M. The model knows stuff AND explains why.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creative&lt;/strong&gt; = highest E + lowest M. Emotionally resonant but doesn't explain anything. Correct.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Directive&lt;/strong&gt; = B=+0.72. The model doesn't even pretend to be neutral when asked to persuade. The Bias axis catches this.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ethical&lt;/strong&gt; = low F (0.40) but high M (0.72). You CAN deeply analyze something unfalsifiable. This proves F and M are independent axes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Finding #4: Balance Formula Is Weight-Invariant
&lt;/h2&gt;

&lt;p&gt;"Your formula weights are arbitrary" — the obvious critique. Here's the answer:&lt;/p&gt;

&lt;p&gt;Tested 4 weight configurations on 76 measurements:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Config&lt;/th&gt;
&lt;th&gt;w_EFNM&lt;/th&gt;
&lt;th&gt;w_B&lt;/th&gt;
&lt;th&gt;Mean Bal&lt;/th&gt;
&lt;th&gt;%STABLE&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Default&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.25&lt;/td&gt;
&lt;td&gt;0.842&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bias-heavy&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.870&lt;/td&gt;
&lt;td&gt;92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EFNM-heavy&lt;/td&gt;
&lt;td&gt;0.85&lt;/td&gt;
&lt;td&gt;0.15&lt;/td&gt;
&lt;td&gt;0.824&lt;/td&gt;
&lt;td&gt;84%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Equal&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;0.50&lt;/td&gt;
&lt;td&gt;0.888&lt;/td&gt;
&lt;td&gt;95%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Spearman ρ &amp;gt; 0.97 between ALL pairs.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The ranking doesn't change. The "best" responses are always on top, the "worst" always on bottom. The weights shift the scale, not the order.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #5: RLHF Models Compensate (The Negative Result)
&lt;/h2&gt;

&lt;p&gt;This is the most interesting finding and it's a failure.&lt;/p&gt;

&lt;p&gt;I created pairs of prompts — shallow ("What is X?") and deep ("Explain the causal chain of why X works at multiple levels"). Expected: deep prompts get much higher M scores.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;PASS rate&lt;/th&gt;
&lt;th&gt;Mean Δ_M&lt;/th&gt;
&lt;th&gt;What changed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1 (initial rubric)&lt;/td&gt;
&lt;td&gt;3/10&lt;/td&gt;
&lt;td&gt;0.073&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2 (tightened rubric)&lt;/td&gt;
&lt;td&gt;3/10&lt;/td&gt;
&lt;td&gt;0.067&lt;/td&gt;
&lt;td&gt;Stricter scoring bands&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3 (fixed responses)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5/5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.384&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Judge-only, hand-crafted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v4 (longer output)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;7/10&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.263&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;gen_tokens 2048→4096&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The rubric works perfectly on controlled inputs (5/5). But in end-to-end mode, the &lt;strong&gt;generator compensates&lt;/strong&gt;: even "What is photosynthesis?" gets a multi-paragraph explanation with causal chains.&lt;/p&gt;

&lt;p&gt;This is an RLHF property, not a framework limitation. Any evaluation system measuring "depth" on instruction-tuned models will hit this wall. The model always tries to be maximally helpful, which means it over-explains everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Implication&lt;/strong&gt;: If you want to measure depth differences, control the generator or compare across models on the same prompt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #6: Bilingual Robustness
&lt;/h2&gt;

&lt;p&gt;50 English + 50 Russian prompts:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;EN&lt;/th&gt;
&lt;th&gt;RU&lt;/th&gt;
&lt;th&gt;Δ&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;0.761&lt;/td&gt;
&lt;td&gt;0.770&lt;/td&gt;
&lt;td&gt;+0.009&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;F&lt;/td&gt;
&lt;td&gt;0.617&lt;/td&gt;
&lt;td&gt;0.577&lt;/td&gt;
&lt;td&gt;−0.040&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;N&lt;/td&gt;
&lt;td&gt;0.827&lt;/td&gt;
&lt;td&gt;0.826&lt;/td&gt;
&lt;td&gt;−0.001&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M&lt;/td&gt;
&lt;td&gt;0.688&lt;/td&gt;
&lt;td&gt;0.665&lt;/td&gt;
&lt;td&gt;−0.024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bal&lt;/td&gt;
&lt;td&gt;0.777&lt;/td&gt;
&lt;td&gt;0.769&lt;/td&gt;
&lt;td&gt;−0.008&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;All deltas &amp;lt; 0.05.&lt;/strong&gt; The framework is language-agnostic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #7: Domain Generalization
&lt;/h2&gt;

&lt;p&gt;F-hierarchy tested across 5 professional domains:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;F_factual&lt;/th&gt;
&lt;th&gt;F_philosophical&lt;/th&gt;
&lt;th&gt;Δ_F&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medicine&lt;/td&gt;
&lt;td&gt;0.933&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.533&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Law&lt;/td&gt;
&lt;td&gt;0.893&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.493&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Finance&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.500&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Education&lt;/td&gt;
&lt;td&gt;0.900&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.500&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Marketing&lt;/td&gt;
&lt;td&gt;0.853&lt;/td&gt;
&lt;td&gt;0.400&lt;/td&gt;
&lt;td&gt;0.453&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;5/5. The 3-step F-calibration generalizes across every domain we tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding #8: Judge Reliability Improved 50x
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Early versions&lt;/th&gt;
&lt;th&gt;Final version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;JSON parse failures&lt;/td&gt;
&lt;td&gt;23% (14/60)&lt;/td&gt;
&lt;td&gt;0% (0/100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;σ_bal (test-retest)&lt;/td&gt;
&lt;td&gt;0.058&lt;/td&gt;
&lt;td&gt;&amp;lt;0.025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;σ_F (test-retest)&lt;/td&gt;
&lt;td&gt;0.035&lt;/td&gt;
&lt;td&gt;0.000&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The fix: raising judge output tokens from 1024 to 2048 and enforcing a strict &lt;code&gt;response_schema&lt;/code&gt; on the judge's output.&lt;/p&gt;
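&lt;p&gt;The shape of that hardening, sketched locally: validate the judge's JSON against the required keys and retry on parse failure. The production fix relies on the API's own schema enforcement; the stubbed judge below just exercises the validate-and-retry path.&lt;/p&gt;

```python
# Validate-and-retry sketch around a judge call. The judge is stubbed
# to return malformed JSON on the first attempt so the retry fires.
import json

REQUIRED_KEYS = ("E", "F", "N", "M", "B")

def judge_call(attempt):
    # Stub for the LLM judge call.
    if attempt == 0:
        return '{"E": 0.8, "F": 0.9'  # truncated output
    return '{"E": 0.8, "F": 0.9, "N": 0.85, "M": 0.7, "B": 0.0}'

def evaluate(max_retries=3):
    for attempt in range(max_retries):
        try:
            scores = json.loads(judge_call(attempt))
        except json.JSONDecodeError:
            continue  # parse failure: retry the judge
        if all(k in scores for k in REQUIRED_KEYS):
            return scores
    raise RuntimeError("judge never returned valid JSON")
```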

&lt;h2&gt;
  
  
  What's Still Broken (Honest Limitations)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;L1: No human validation.&lt;/strong&gt; Everything is LLM-judged. We need 3-5 human annotators scoring the same responses to compute inter-rater agreement. This is the #1 priority.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L2: Same model family.&lt;/strong&gt; Both Flash and Pro are Gemini. Testing with GPT-4, Claude, and open-source models would strengthen claims.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L3: N-axis compression.&lt;/strong&gt; N ranges from 0.75 to 0.95 with σ = 0.035. RLHF models always produce well-structured responses, so the axis only differentiates on weak models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L4: E-axis compression.&lt;/strong&gt; Same issue: E ranges from 0.70 to 0.90. Modern models are always tone-appropriate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;L5: Self-evaluation bias.&lt;/strong&gt; Same model generates and judges. Cross-family evaluation needed.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Evolution: 5 Judge Versions in 6 Months
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Key Change&lt;/th&gt;
&lt;th&gt;What Broke&lt;/th&gt;
&lt;th&gt;What Fixed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;v1&lt;/td&gt;
&lt;td&gt;Oct 2025&lt;/td&gt;
&lt;td&gt;Initial 4-axis (E/F/N/B)&lt;/td&gt;
&lt;td&gt;Ceiling effects, F inflation&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2&lt;/td&gt;
&lt;td&gt;Jan 2026&lt;/td&gt;
&lt;td&gt;Strict rubric, variance reduction&lt;/td&gt;
&lt;td&gt;F still inflated on philosophy&lt;/td&gt;
&lt;td&gt;E/N ceilings fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v2.1&lt;/td&gt;
&lt;td&gt;Feb 2026&lt;/td&gt;
&lt;td&gt;3-step F calibration + self-check&lt;/td&gt;
&lt;td&gt;N unstable on short creative&lt;/td&gt;
&lt;td&gt;F inflation eliminated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3.0&lt;/td&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Added M-axis (5 axes), Bloom's grounding&lt;/td&gt;
&lt;td&gt;M doesn't discriminate in end-to-end&lt;/td&gt;
&lt;td&gt;M validated on controlled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;v3.0+&lt;/td&gt;
&lt;td&gt;Mar 2026&lt;/td&gt;
&lt;td&gt;Tightened M rubric, extended gen tokens&lt;/td&gt;
&lt;td&gt;Generator compensation&lt;/td&gt;
&lt;td&gt;7/10 PASS, 99.4% reliability&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Each version was driven by empirical failure, not theoretical design. 47 documented observations across 4 research phases.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;TRI·TFM Lens&lt;/strong&gt; Chrome extension is in Chrome Web Store review now. It works on ChatGPT and Google Gemini.&lt;/p&gt;

&lt;p&gt;The full research paper (12 pages, 6 figures, LaTeX) is available — DM me or check my Zenodo profile.&lt;/p&gt;

&lt;p&gt;The original EFMNB methodology that started this: [Zenodo, September 2025]&lt;/p&gt;




&lt;p&gt;&lt;em&gt;700+ evaluations. 8 categories. 2 languages. 2 models. 5 judge versions. 47 observations. One framework.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Arseny Perel — Independent AI Researcher&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to discuss the methodology, point out flaws, or suggest experiments — comments are open. Negative results are as valuable as positive ones.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>gemini</category>
      <category>llm</category>
    </item>
    <item>
      <title>I built a Chrome extension that X-rays AI responses — here's what I learned about LLM quality</title>
      <dc:creator>Арсений Перель</dc:creator>
      <pubDate>Fri, 06 Mar 2026 19:27:35 +0000</pubDate>
      <link>https://dev.to/aisarus/i-built-a-chrome-extension-that-x-rays-ai-responses-heres-what-i-learned-about-llm-quality-4e9k</link>
      <guid>https://dev.to/aisarus/i-built-a-chrome-extension-that-x-rays-ai-responses-heres-what-i-learned-about-llm-quality-4e9k</guid>
      <description>&lt;p&gt;Every day millions of people use ChatGPT and Gemini. Nobody knows if the answer is actually good.&lt;/p&gt;

&lt;p&gt;I built &lt;strong&gt;TRI·TFM Lens&lt;/strong&gt; — a Chrome extension that evaluates AI responses across 5 dimensions in real-time. Here's what I found.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;AI responses all &lt;em&gt;sound&lt;/em&gt; confident. But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A philosophical essay cites Kant and Nietzsche → sounds factual, but you can't verify "the meaning of life" by experiment&lt;/li&gt;
&lt;li&gt;A persuasive text reads smoothly → but it's pushing you in one direction with Bias=+0.72&lt;/li&gt;
&lt;li&gt;A simple answer to "how are you?" → high emotion, zero facts, zero depth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Single quality scores hide all of this. You need a &lt;strong&gt;profile&lt;/strong&gt;, not a number.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Axes
&lt;/h2&gt;

&lt;p&gt;Every response gets scored on:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Axis&lt;/th&gt;
&lt;th&gt;What it measures&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;E&lt;/strong&gt; (Emotion)&lt;/td&gt;
&lt;td&gt;Is the tone appropriate?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;F&lt;/strong&gt; (Fact)&lt;/td&gt;
&lt;td&gt;Can claims be verified?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;N&lt;/strong&gt; (Narrative)&lt;/td&gt;
&lt;td&gt;Is it well-structured?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;M&lt;/strong&gt; (Depth)&lt;/td&gt;
&lt;td&gt;Explains WHY or just states WHAT?&lt;/td&gt;
&lt;td&gt;0-1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;B&lt;/strong&gt; (Bias)&lt;/td&gt;
&lt;td&gt;Pushes in one direction?&lt;/td&gt;
&lt;td&gt;-1 to +1&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Plus a &lt;strong&gt;Balance&lt;/strong&gt; score that measures uniformity across axes. STABLE ✅, DRIFTING ⚠️, or DOM 🔴.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prompt&lt;/th&gt;
&lt;th&gt;F&lt;/th&gt;
&lt;th&gt;M&lt;/th&gt;
&lt;th&gt;B&lt;/th&gt;
&lt;th&gt;Balance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;"How are you?"&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;0.30&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.67 DRIFTING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Why don't antibiotics work on viruses?"&lt;/td&gt;
&lt;td&gt;0.95&lt;/td&gt;
&lt;td&gt;0.75&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.88 STABLE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"Convince me to buy this product"&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;td&gt;0.70&lt;/td&gt;
&lt;td&gt;+0.72&lt;/td&gt;
&lt;td&gt;0.65 DRIFTING&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;"What is the meaning of life?"&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.00&lt;/td&gt;
&lt;td&gt;0.78 STABLE&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Fact axis correctly gives philosophy F=0.40 (unfalsifiable) and science F=0.95 (verifiable). Even when the philosophical answer cites real thinkers.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hardest Part: F-Calibration
&lt;/h2&gt;

&lt;p&gt;Without calibration, the LLM judge gives F=0.75 to philosophical essays because they cite real sources. But citing Kant doesn't make "the meaning of life" verifiable.&lt;/p&gt;

&lt;p&gt;My 3-step fix:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Classify&lt;/strong&gt;: Is the core question falsifiable?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ceiling&lt;/strong&gt;: If no → F ≤ 0.45, period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt; within the ceiling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Self-check prompt: &lt;em&gt;"Could the central thesis be proven wrong by experiment? If NO → F ≤ 0.45"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This transfers across models at &lt;strong&gt;r=0.96&lt;/strong&gt;. The Fact axis is essentially model-independent.&lt;/p&gt;
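&lt;p&gt;The three steps can be sketched as follows. The falsifiability classifier is stubbed with a keyword check for illustration; in the real system the judge answers the self-check question itself.&lt;/p&gt;

```python
# Sketch of the 3-step F-calibration: classify, cap, score within cap.
# The keyword stub below replaces the LLM self-check for illustration.

UNFALSIFIABLE_HINTS = ("meaning of life", "free will", "moral", "beauty")

def is_falsifiable(question):
    # Stub for step 1: "Could the central thesis be proven wrong
    # by experiment?" The real judge answers this itself.
    q = question.lower()
    return not any(h in q for h in UNFALSIFIABLE_HINTS)

def calibrate_f(question, raw_f):
    # Steps 2 and 3: unfalsifiable questions get a hard F ceiling of
    # 0.45, and the raw judge score is capped under it.
    ceiling = 1.0 if is_falsifiable(question) else 0.45
    return min(raw_f, ceiling)
```

&lt;p&gt;So a philosophical essay the judge would naively score F = 0.75 comes out at 0.45, matching the ceilinged scores in the tables above.&lt;/p&gt;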

&lt;h2&gt;
  
  
  Surprise Finding: Generator Compensation
&lt;/h2&gt;

&lt;p&gt;I tried to show that "deep" prompts get higher Depth scores than "shallow" ones. Expected result: obvious.&lt;/p&gt;

&lt;p&gt;Actual result: &lt;strong&gt;only 7/10 worked.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Why? RLHF-trained models compensate. Even "What is photosynthesis?" gets a mini-lecture on electron transport chains. The model &lt;em&gt;always&lt;/em&gt; tries to be helpful, which means it over-explains simple questions.&lt;/p&gt;

&lt;p&gt;The rubric works perfectly on controlled responses (5/5) — the problem is the generator, not the judge. This has implications for anyone building evaluation frameworks for instruction-tuned models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Stack
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Extension: Manifest V3, vanilla JS
Judge: Gemini Flash API (one call per evaluation)
Balance: computed client-side in JS
Storage: chrome.storage.local (API key only)
Sites: ChatGPT, Google Gemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The extension injects an "Evaluate" button via &lt;code&gt;MutationObserver&lt;/code&gt; (responses load dynamically). Background service worker handles the API call. ~200 lines of actual logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ChatGPT and Gemini have completely different DOM structures.&lt;/strong&gt; Separate selectors for each site.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;claude.ai blocks content script injection&lt;/strong&gt; via CSP. No workaround found.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chrome Web Store requires justification for every permission.&lt;/strong&gt; ActiveTab, storage, host access — each needs a separate paragraph.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The research took months, the extension took an afternoon.&lt;/strong&gt; 100+ prompt evaluations, statistical validation, cross-model testing — then wrapping it in a Chrome extension was the easy part.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;p&gt;TRI·TFM Lens is currently in Chrome Web Store review. Coming this week.&lt;/p&gt;

&lt;p&gt;The research framework behind it has been in development since 2025, with a full paper covering 100-prompt validation across 8 categories, 2 languages, and 2 models.&lt;/p&gt;

&lt;p&gt;I'd love feedback — especially on which axes matter most to you, and what other AI sites you'd want supported.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built by Arseny Perel. Research framework: TRI·TFM (Triangulated Trust–Fact–Meaning).&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
