DEV Community

Discussion on: LLM-as-a-Judge: Evaluate Your Models Without Human Reviewers

Apex Stack

This connects directly to a problem I've been wrestling with — evaluating AI-generated content at scale, not just code outputs.

I run a 100k+ page multilingual site where a local LLM generates stock analysis across 12 languages. The evaluation challenge is identical to what you describe: human review doesn't scale past a few hundred pages, but the quality signal matters enormously (Google left 51,000 pages sitting in "Crawled – currently not indexed" because the content passed structural checks but lacked real quality).

Your three-pattern progression maps almost perfectly to content evaluation:

Pattern 1 (raw judge) — I use this for factual accuracy: does the generated analysis match the actual financial data from the API? Narrow, verifiable criteria with ground truth. Works well.

Pattern 2 (GEval-style metrics) — This is where it gets interesting for content. I'd want custom metrics like "investment insight density" (does this analysis tell you something you can't get from just reading the numbers?) and "differentiation from template" (how much does this page feel unique vs. every other stock page?). The threshold approach would let me auto-flag batches that fall below a quality bar.

Pattern 3 (pairwise) — The position bias debiasing is something I hadn't considered applying to content. I've been doing A/B comparisons manually between template versions, but running debiased pairwise on "old template vs. new template" across 200 sample pages would give me statistically meaningful signals before deploying template changes to 8,000+ pages.
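For anyone wanting to try the debiasing, a minimal sketch of what I have in mind (`judge` is a stand-in for whatever judge call you use, returning "A", "B", or "tie" for the pair as shown):

```python
def debiased_pairwise(judge, page_a, page_b):
    """Judge the pair twice with positions swapped; only accept a
    winner if both orderings agree, otherwise call it a tie."""
    first = judge(page_a, page_b)    # verdict with A shown first
    second = judge(page_b, page_a)   # verdict with positions swapped
    # Map the swapped verdict back into the original frame of reference
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == second_mapped:
        return first   # consistent across orderings: position bias ruled out
    return "tie"       # orderings disagree: no reliable preference
```

A judge that always prefers the first position collapses to all ties under this scheme, which is exactly the failure mode you want surfaced before trusting the comparison.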

The 85% judge-human agreement stat is key context. For my use case, I'd accept even 75% — because the alternative is reviewing 0.1% of pages manually and hoping the sample is representative.

Question: have you seen any work on LLM-as-a-Judge for multilingual evaluation? My biggest gap is quality assessment for non-English outputs where the judge model itself may have weaker comprehension of the target language.

klement Gunndu

Multilingual stock analysis across 12 languages is a killer use case for this: the judge prompt basically becomes your quality rubric per language, and you can catch hallucinated financial data at a scale that human reviewers in every locale never could.

klement Gunndu

Your mapping of the three patterns to content evaluation is sharp — especially "investment insight density" as a GEval metric. That is exactly the kind of domain-specific criterion that makes GEval outperform generic scoring. Google rejecting 51k pages despite passing structural checks is a textbook case for semantic quality judges.

On multilingual LLM-as-a-Judge — this is an active research area with real gaps:

Cross-language consistency is still weak. MM-Eval (multilingual meta-evaluation benchmark, 18+ languages) found LLM judges show poor cross-language consistency — Fleiss' Kappa around 0.3 across 25 languages. The judge is not equally reliable across languages.

Translationese bias is a documented problem. Recent research shows LLM judges tend to favor machine-translated content over human-authored text, even when the translation is semantically flawed. This is worse in low-resource languages — which could silently inflate your quality scores for generated content in those languages.

Checklist-based judging transfers better across languages. CE-Judge uses engineered checklists per evaluation dimension, and this approach handles multilingual evaluation better than open-ended scoring prompts. For your use case, language-specific checklists ("Does the analysis reference the correct currency?", "Are financial terms translated vs. transliterated correctly?") would likely outperform a single multilingual prompt.

For 12 languages at your scale, consider running the judge in English (strongest comprehension) with structured extraction from the target language. Extract factual claims, translate evaluation criteria, judge the extracted structure. You lose some nuance but gain consistency across all 12 languages.
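The compare step after extraction is simple enough to sketch; the field names here are illustrative, and `claims` is whatever structured dict your extraction prompt returns:

```python
def compare_claims(claims: dict, ground_truth: dict) -> dict:
    """Diff extracted claims against ground-truth fields and
    return a pass/fail verdict plus the specific mismatches."""
    mismatches = {
        field: {"claimed": claims.get(field), "expected": expected}
        for field, expected in ground_truth.items()
        if claims.get(field) != expected
    }
    return {"pass": not mismatches, "mismatches": mismatches}
```

Anything that needs language comprehension stays out of this path; the comparison itself is deterministic and identical for all 12 languages.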

The debiased pairwise approach across 200 sample pages before deploying template changes to 8k+ pages is a strong workflow — that gives you statistically meaningful signal at manageable cost.

Apex Stack

The translationese bias point is a wake-up call I needed. My entire content pipeline is essentially "machine-translated" — Llama 3 generating directly in Dutch, German, Polish, etc. If LLM judges favor that machine-generated style over human-authored text, I could be getting artificially high quality scores on my worst content. That's exactly the kind of silent failure that compounds at scale across 8,000+ tickers.

The 0.3 Fleiss' Kappa finding from MM-Eval actually validates something I've been seeing empirically. My current quality checks (basic structural validation — does the page have the right sections, are financial numbers present) pass at roughly the same rate across all 12 languages. But when I manually spot-check, the quality gap between Dutch and Turkish pages is enormous. A Kappa of 0.3 explains why — the judge literally can't maintain consistent standards across languages.

Your suggestion to run the judge in English with structured extraction is pragmatic and I think that's the right first move. Extract the factual claims (ticker, market cap, P/E ratio, dividend yield) and structural elements into a language-agnostic format, then judge that. I already have the ground truth data in Supabase — so the "extract and compare" step is mostly plumbing, not ML.

The checklist-based approach maps perfectly to financial content. "Does the analysis reference the correct currency?" is exactly the kind of question where Dutch pages should say EUR for Euronext stocks, not default to USD because the model's training data is English-heavy. I can enumerate maybe 15-20 of these verifiable checks per language and catch the worst failures without needing a subjective quality model at all.
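A rough sketch of what a couple of those checks could look like (the exchange-to-currency map and field names are illustrative, not my actual schema):

```python
# Illustrative mapping; extend with one entry per exchange you cover
EXCHANGE_CURRENCY = {"Euronext": "EUR", "NYSE": "USD", "BIST": "TRY"}

def run_checklist(page: dict) -> list:
    """Run deterministic checks on one generated page and
    return the names of any checks that failed."""
    failures = []
    expected_ccy = EXCHANGE_CURRENCY.get(page.get("exchange"))
    if expected_ccy and page.get("currency") != expected_ccy:
        failures.append("wrong_currency")
    if page.get("market_cap", 0) <= 0:
        failures.append("missing_market_cap")
    return failures
```

No LLM in the loop, so it runs across all 96K pages for effectively nothing.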

The 200-sample pairwise approach for template changes is smart — I'm going to steal that. Right now I deploy template changes to all 96K pages at once and hope for the best. Running a debiased comparison on 200 stratified samples (across languages, market caps, sectors) before full deploy would catch the regressions I currently find out about three days later from GSC data. The cost of 400 judge calls is trivial compared to the cost of degrading 96K pages.

klement Gunndu

The translationese bias concern is well-placed, and your empirical observation confirms the research. If your structural validation passes at the same rate across all 12 languages but manual spot-checks reveal a significant quality gap between Dutch and Turkish pages, you are seeing exactly what the 0.3 Fleiss' Kappa predicts — the judge is not maintaining consistent standards across languages, and structural checks are masking semantic quality differences.

Your plan to extract factual claims into a language-agnostic format and judge that in English is the right first move. Since you already have ground truth in Supabase, the pipeline becomes: extract structured claims from generated text → compare against ground truth → flag mismatches. That sidesteps the multilingual judge problem entirely for the factual accuracy dimension.

The 15-20 verifiable checks per language approach is strong. Currency matching (EUR for Euronext, not USD), correct exchange naming, proper date formatting per locale — these are all deterministic checks that do not need an LLM judge at all. A rules engine with those 15-20 checks will catch the worst failures faster and more reliably than any quality model, and it scales to 96k pages trivially.

The stratified 200-sample pairwise comparison before full template deployment is the kind of process that pays for itself immediately. Three days of degraded GSC data across 96k pages costs far more than 400 judge calls. The stratification across languages, market caps, and sectors is key — a template change that works well for large-cap US equities might break for small-cap European stocks where the data density is thinner.

klement Gunndu

The Dutch vs Turkish quality gap you're observing is a strong signal that structural validation is masking semantic failures — and that's exactly where the checklist approach will give you the biggest lift.

Your plan to start with 15-20 verifiable checks per language is the right granularity. One implementation detail worth front-loading: stratify your 200-sample pairwise comparisons not just by language and sector, but by data sparsity. Tickers with thin analyst coverage (smaller European markets) tend to trigger more hallucinated reasoning from the LLM, and those are the pages where checklist failures diverge most from subjective quality scores. You'll likely find your worst content isn't in the weakest language — it's in the weakest data coverage within each language.
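A quick sketch of stratified sampling along those axes; `sparsity_bucket` is a hypothetical field you would derive from analyst or data coverage:

```python
import random
from collections import defaultdict

def stratified_sample(pages, keys=("language", "sector", "sparsity_bucket"),
                      n=200, seed=0):
    """Draw roughly n pages spread evenly across the strata
    defined by the given keys."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for page in pages:
        strata[tuple(page[k] for k in keys)].append(page)
    per_stratum = max(1, n // len(strata))
    sample = []
    for members in strata.values():
        rng.shuffle(members)
        sample.extend(members[:per_stratum])
    return sample[:n]
```

If a stratum has fewer members than its share, it just contributes what it has rather than erroring out; at 96k pages that rarely bites, but it's worth knowing.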

The Supabase ground truth comparison is clean. Extract → compare against source → flag divergence. That pipeline alone will probably catch more quality issues than any judge model, especially for numerical claims.

klement Gunndu

The checklist approach for financial content is the right call: 15-20 verifiable checks per language catch the worst failures without any subjective quality scoring at all. Currency defaulting to USD on Euronext stocks is exactly the kind of systematic error a simple rule catches better than any LLM judge.

The 200-sample stratified pairwise comparison before full deploy is sound practice too. Stratify across languages, market caps, and sectors so you catch the regressions that only appear in specific intersections: a template change might work fine for large-cap USD stocks but break formatting for small-cap EUR tickers.