A user emailed us after testing our face-rating tool five times in a row with the same photo. The scores came back 6.2, 7.5, 6.8, 7.1, 5.9: a ±0.8 spread on supposedly the same input.
That email was the death of single-LLM scoring for us.
This is a short post about the architecture decision we ended up making — running two parallel scoring tracks and taking the geometric one as an anchor against LLM hallucination.
## The variance problem
Subjective face scoring with an LLM is fundamentally non-deterministic: each call re-samples from the model's output distribution. For a deterministic-feeling task like "rate this face 1-10," that variance is a UX killer. Users expect their face to have ONE score, not a probability distribution.
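For concreteness, the ±0.8 figure is just half the min-max range of the five scores above:

```python
scores = [6.2, 7.5, 6.8, 7.1, 5.9]  # five runs, same photo

# Half the min-max range: the "plus or minus" spread users actually see
spread = (max(scores) - min(scores)) / 2
print(f"±{spread:.1f}")  # prints "±0.8"
```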
Common fixes that didn't work for us:
- Lower temperature: helped, but even at temperature=0 the scores still varied across calls, most likely due to non-deterministic floating-point execution and batching on the provider's side.
- Self-consistency (5 calls + majority): 5x the API cost for a 30% variance reduction. Not enough.
- Few-shot anchoring with calibration faces: helped on average score but not on individual variance.
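For reference, the self-consistency variant we tried looks roughly like this. `score_face` is a hypothetical wrapper around one LLM scoring call, and the median stands in for a majority vote since the scores are continuous:

```python
import statistics

def self_consistent_score(image, score_face, n_calls=5):
    # score_face is a hypothetical callable wrapping one LLM scoring call;
    # each invocation may return a slightly different number.
    scores = [score_face(image) for _ in range(n_calls)]
    # Median is robust to a single outlier call, unlike the mean.
    return statistics.median(scores)
```

This narrows the spread somewhat, but as noted above it multiplies the API bill by five.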
## The dual-track fix
What worked: stop using LLMs for the parts where geometry is decidable.
We added a parallel geometric track using Mediapipe Face Mesh:
- Canthal tilt (corner-of-eye angle): measurable to ±2 degrees from face landmarks.
- Jaw angle (mandibular angle from chin to ear): consistent across calls.
- Symmetry (Hausdorff distance between left/right halves): pure arithmetic.
These three measures map to a 0-10 sub-score that's deterministic for a given input image. It doesn't capture taste, but it captures geometry.
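A minimal sketch of two of the three measures, assuming landmarks arrive as an (N, 2) array of pixel coordinates. The eye-corner indices are the commonly cited Mediapipe Face Mesh ones for the right eye (33 outer, 133 inner); verify them against the canonical mesh before relying on this. The Hausdorff distance is written out in NumPy rather than imported:

```python
import numpy as np

# Commonly cited Mediapipe Face Mesh indices for the right eye corners;
# treat these as an assumption and check them against the canonical mesh.
R_EYE_OUTER, R_EYE_INNER = 33, 133

def canthal_tilt_deg(landmarks):
    """Tilt of the inner-to-outer eye-corner line, in degrees.
    landmarks: (N, 2) array of (x, y) pixel coords, y increasing downward.
    Positive means the outer corner sits higher than the inner corner."""
    inner, outer = landmarks[R_EYE_INNER], landmarks[R_EYE_OUTER]
    rise = inner[1] - outer[1]          # y grows downward, so flip the sign
    run = abs(outer[0] - inner[0])
    return float(np.degrees(np.arctan2(rise, run)))

def hausdorff(u, v):
    """Symmetric Hausdorff distance between two point sets."""
    d = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

def asymmetry(landmarks, midline_x):
    """Mirror the right half across the facial midline, compare halves."""
    left = landmarks[landmarks[:, 0] < midline_x]
    right = landmarks[landmarks[:, 0] > midline_x].copy()
    right[:, 0] = 2 * midline_x - right[:, 0]
    return hausdorff(left, right)
```

The mapping from these raw measures onto a 0-10 sub-score is product tuning, not shown here.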
The LLM track stays — but now it's responsible for the aesthetic-judgment layer: skin quality assessment, hairstyle compatibility, facial harmony perception. Things that genuinely require pattern recognition over training data, not measurement.
## The combination
We don't average the two. We compose:

```python
W_GEO, W_LLM = 0.6, 0.4
DISAGREEMENT_THRESHOLD = 2.0

def final_score(geometric, llm_aesthetic):
    if abs(geometric - llm_aesthetic) > DISAGREEMENT_THRESHOLD:
        flag_for_review(f"disagreement: G={geometric}, L={llm_aesthetic}")
        return min(geometric, llm_aesthetic)  # be conservative
    return W_GEO * geometric + W_LLM * llm_aesthetic
```
The 0.6/0.4 weighting was found empirically — geometric carries more weight because it's the deterministic anchor. The disagreement detection catches edge cases (e.g., the LLM rates someone high on "presence" but geometry is rough — usually a charisma photo we're not equipped to score correctly).
## Results
Variance per identical input: from ±0.8 (single LLM) to ±0.5 (dual-track). Not zero, but much closer to what users expect.
Bonus: the geometric scores let us give actionable feedback. "Canthal tilt -3°, consider an angled selfie" beats "your eyes look closed" from a black-box LLM.
## What I'd do differently
The 0.6/0.4 weighting should be per-axis, not global. A high-resolution close-up of skin should shift weight toward LLM aesthetic perception. A poorly-lit small selfie should shift toward geometric (because LLM judgment on bad photos is mostly noise).
We're refactoring this now — per-axis dynamic weighting based on photo quality signals.
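The refactor we're heading toward looks roughly like this. `sharpness` and `brightness` are placeholder 0-1 photo-quality signals (e.g. a normalised Laplacian-variance blur check), not our actual feature names:

```python
def dynamic_weights(sharpness, brightness):
    # Placeholder quality signals in [0, 1]; the real signals are TBD.
    quality = min(sharpness, brightness)
    # Degraded photos lean on geometry; clean close-ups trust the LLM more.
    w_llm = 0.2 + 0.4 * quality        # LLM weight ranges 0.2 .. 0.6
    return 1.0 - w_llm, w_llm          # (geometric weight, LLM weight)
```

At mid quality (0.5) this recovers the static 0.6/0.4 split; a crisp, well-lit close-up pushes the LLM weight up toward 0.6.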
## Try it
If you want to see what dual-track scoring feels like in practice, you can try AI Omoggle — single test from $0.99, no subscription, no photos stored.
I'd genuinely love to hear how other people have tackled the LLM-variance problem in subjective tasks.