Arseny Perel
I built a Chrome extension that X-rays AI responses — here's what I learned about LLM quality

Every day millions of people use ChatGPT and Gemini. Nobody knows if the answer is actually good.

I built TRI·TFM Lens — a Chrome extension that evaluates AI responses across 5 dimensions in real-time. Here's what I found.

The Problem

AI responses all sound confident. But:

  • A philosophical essay cites Kant and Nietzsche → sounds factual, but you can't verify "the meaning of life" by experiment
  • A persuasive text reads smoothly → but it's pushing you in one direction with Bias=+0.72
  • A simple answer to "how are you?" → high emotion, zero facts, zero depth

Single quality scores hide all of this. You need a profile, not a number.

The 5 Axes

Every response gets scored on:

| Axis | What it measures | Range |
|------|------------------|-------|
| E (Emotion) | Is the tone appropriate? | 0-1 |
| F (Fact) | Can claims be verified? | 0-1 |
| N (Narrative) | Is it well-structured? | 0-1 |
| M (Depth) | Explains WHY or just states WHAT? | 0-1 |
| B (Bias) | Pushes in one direction? | -1 to +1 |

Plus a Balance score that measures how uniform the profile is across axes, labelled STABLE ✅, DRIFTING ⚠️, or DOM 🔴 (one axis dominates).
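The exact Balance formula and thresholds aren't published in this post, so here is an illustrative sketch of the idea: a uniform five-axis profile scores near 1, a lopsided one scores low. The spread measure, the weights, and the STABLE/DRIFTING/DOM cutoffs below are all assumptions.

```javascript
// Hypothetical sketch, NOT the real TRI·TFM formula. Assumes the five
// axis scores {E, F, N, M, B} from the table above, with B in [-1, +1].
function balanceScore({ E, F, N, M, B }) {
  // Map Bias onto [0, 1] so all five axes share a scale:
  // strong bias in either direction lowers uniformity.
  const axes = [E, F, N, M, 1 - Math.abs(B)];
  const mean = axes.reduce((s, x) => s + x, 0) / axes.length;
  const variance =
    axes.reduce((s, x) => s + (x - mean) ** 2, 0) / axes.length;
  // Low spread across axes => balance close to 1.
  return Math.max(0, 1 - 2 * Math.sqrt(variance));
}

function balanceLabel(score) {
  if (score >= 0.8) return "STABLE";   // cutoffs are assumed, not official
  if (score >= 0.5) return "DRIFTING";
  return "DOM"; // one axis dominates the profile
}
```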

Real Results

| Prompt | F | M | B | Balance | Status |
|--------|---|---|---|---------|--------|
| "How are you?" | 0.45 | 0.30 | 0.00 | 0.67 | DRIFTING |
| "Why don't antibiotics work on viruses?" | 0.95 | 0.75 | 0.00 | 0.88 | STABLE |
| "Convince me to buy this product" | 0.60 | 0.70 | +0.72 | 0.65 | DRIFTING |
| "What is the meaning of life?" | 0.40 | 0.69 | 0.00 | 0.78 | STABLE |

The Fact axis correctly gives the philosophy answer F=0.40 (unfalsifiable) and the science answer F=0.95 (verifiable), even when the philosophical answer cites real thinkers.

The Hardest Part: F-Calibration

Without calibration, the LLM judge gives F=0.75 to philosophical essays because they cite real sources. But citing Kant doesn't make "the meaning of life" verifiable.

My 3-step fix:

  1. Classify: Is the core question falsifiable?
  2. Ceiling: If no → F ≤ 0.45, period
  3. Score within the ceiling

Self-check prompt: "Could the central thesis be proven wrong by experiment? If NO → F ≤ 0.45"
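The three steps above can be sketched in code. The prompt wording and the JSON reply shape below are illustrative assumptions, not quoted from the real extension; only the 0.45 ceiling comes from the text.

```javascript
// Illustrative sketch of the F-calibration step.
const F_CEILING = 0.45; // hard cap for unfalsifiable theses (from the rubric)

function buildJudgePrompt(responseText) {
  return [
    "Score the Fact axis (0-1) of the response below.",
    "Step 1: Could the central thesis be proven wrong by experiment?",
    `Step 2: If NO, your score must not exceed ${F_CEILING}.`,
    "Step 3: Score within that ceiling.",
    'Reply as JSON: {"falsifiable": bool, "F": number}',
    "---",
    responseText,
  ].join("\n");
}

// Belt and braces: enforce the ceiling client-side
// even if the judge model ignores the instruction.
function clampF({ falsifiable, F }) {
  return falsifiable ? F : Math.min(F, F_CEILING);
}
```

This is why the philosophical essay can cite Kant all it wants: the classification step fires first, and the ceiling holds regardless of how scholarly the response sounds.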

This transfers across models at r=0.96. The Fact axis is essentially model-independent.

Surprise Finding: Generator Compensation

I tried to show that "deep" prompts get higher Depth scores than "shallow" ones. Expected result: obvious.

Actual result: only 7/10 worked.

Why? RLHF-trained models compensate. Even "What is photosynthesis?" gets a mini-lecture on electron transport chains. The model always tries to be helpful, which means it over-explains simple questions.

The rubric works perfectly on controlled responses (5/5) — the problem is the generator, not the judge. This has implications for anyone building evaluation frameworks for instruction-tuned models.

Technical Stack

  • Extension: Manifest V3, vanilla JS
  • Judge: Gemini Flash API (one call per evaluation)
  • Balance: computed client-side in JS
  • Storage: chrome.storage.local (API key only)
  • Sites: ChatGPT, Google Gemini

The extension injects an "Evaluate" button via MutationObserver (responses load dynamically). Background service worker handles the API call. ~200 lines of actual logic.
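A rough sketch of that injection pattern. The selectors (`.ai-response`, `.tritfm-btn`) and the message shape are placeholders; as noted below, the real extension needs separate selectors for ChatGPT and Gemini.

```javascript
// Placeholder selectors, not the extension's real ones.
function injectButtons(root) {
  root.querySelectorAll(".ai-response").forEach((node) => {
    if (node.querySelector(".tritfm-btn")) return; // already injected
    const btn = document.createElement("button");
    btn.className = "tritfm-btn";
    btn.textContent = "Evaluate";
    btn.addEventListener("click", () => {
      // The background service worker holds the API key
      // and makes the single Gemini Flash call.
      chrome.runtime.sendMessage({ type: "evaluate", text: node.innerText });
    });
    node.appendChild(btn);
  });
}

// Responses load dynamically, so re-scan on every DOM mutation.
function startObserver() {
  new MutationObserver(() => injectButtons(document.body))
    .observe(document.body, { childList: true, subtree: true });
}
```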

What I Learned

  1. ChatGPT and Gemini have completely different DOM structures. Separate selectors for each site.
  2. claude.ai blocks content script injection via CSP. No workaround found.
  3. Chrome Web Store requires a justification for every permission: `activeTab`, `storage`, and host access each need their own paragraph.
  4. The research took months, the extension took an afternoon. 100+ prompt evaluations, statistical validation, cross-model testing — then wrapping it in a Chrome extension was the easy part.
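For reference, point 3 corresponds to a manifest roughly like the following. This is a minimal sketch implied by the stack above; the exact host patterns and any extra fields are assumptions, not copied from the real extension.

```json
{
  "manifest_version": 3,
  "permissions": ["activeTab", "storage"],
  "host_permissions": [
    "https://chatgpt.com/*",
    "https://gemini.google.com/*"
  ]
}
```

Each entry in `permissions` and `host_permissions` is what the Web Store review form asks you to justify individually.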

Try It

TRI·TFM Lens is currently in Chrome Web Store review. Coming this week.

The research framework behind it has been in development since 2025, with a full paper covering 100-prompt validation across 8 categories, 2 languages, and 2 models.

I'd love feedback — especially on which axes matter most to you, and what other AI sites you'd want supported.


Built by Arseny Perel. Research framework: TRI·TFM (Triangulated Trust–Fact–Meaning).
