A user emailed us after testing our face-rating tool five times in a row with the same photo. The scores came back 6.2, 7.5, 6.8, 7.1, 5.9: a ±0.8 spread on supposedly the same input.
That email was the death of single-LLM scoring for us.
This is a short post about the architecture decision we ended up making — running two parallel scoring tracks and taking the geometric one as an anchor against LLM hallucination.
## The variance problem
Subjective face scoring with an LLM is fundamentally non-deterministic: each call re-samples from the model's output distribution. For a deterministic-feeling task like "rate this face 1-10," that variance is a UX killer. Users expect their face to have ONE score, not a probability distribution.
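For concreteness, the ±0.8 figure is just half the min-max range of the five scores above:

```python
scores = [6.2, 7.5, 6.8, 7.1, 5.9]  # five runs, same photo

# Half the min-max range: the "plus or minus" spread users actually see
spread = (max(scores) - min(scores)) / 2
print(f"±{spread:.1f}")  # prints "±0.8"
```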
Common fixes that didn't work for us:
- Lower temperature: helped, but even at temperature=0 the scores still varied across calls, most likely due to non-deterministic floating-point execution and batching on the provider's side.
- Self-consistency (5 calls + majority): 5x the API cost for a 30% variance reduction. Not enough.
- Few-shot anchoring with calibration faces: helped on average score but not on individual variance.
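For reference, the self-consistency variant we tried looks roughly like this. `score_face` is a hypothetical wrapper around one LLM scoring call, and the median stands in for a majority vote since the scores are continuous:

```python
import statistics

def self_consistent_score(image, score_face, n_calls=5):
    # score_face is a hypothetical callable wrapping one LLM scoring call;
    # each invocation may return a slightly different number.
    scores = [score_face(image) for _ in range(n_calls)]
    # Median is robust to a single outlier call, unlike the mean.
    return statistics.median(scores)
```

This narrows the spread somewhat, but as noted above it multiplies the API bill by five.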
## The dual-track fix
What worked: stop using LLMs for the parts where geometry is decidable.
We added a parallel geometric track using Mediapipe Face Mesh:
- Canthal tilt (corner-of-eye angle): measurable to ±2 degrees from face landmarks.
- Jaw angle (mandibular angle from chin to ear): consistent across calls.
- Symmetry (Hausdorff distance between left/right halves): pure arithmetic.
These three measures map to a 0-10 sub-score that's deterministic for a given input image. It doesn't capture taste, but it captures geometry.
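A minimal sketch of two of the three measures, assuming landmarks arrive as an (N, 2) array of pixel coordinates. The eye-corner indices are the commonly cited Mediapipe Face Mesh ones for the right eye (33 outer, 133 inner); verify them against the canonical mesh before relying on this. The Hausdorff distance is written out in NumPy rather than imported:

```python
import numpy as np

# Commonly cited Mediapipe Face Mesh indices for the right eye corners;
# treat these as an assumption and check them against the canonical mesh.
R_EYE_OUTER, R_EYE_INNER = 33, 133

def canthal_tilt_deg(landmarks):
    """Tilt of the inner-to-outer eye-corner line, in degrees.
    landmarks: (N, 2) array of (x, y) pixel coords, y increasing downward.
    Positive means the outer corner sits higher than the inner corner."""
    inner, outer = landmarks[R_EYE_INNER], landmarks[R_EYE_OUTER]
    rise = inner[1] - outer[1]          # y grows downward, so flip the sign
    run = abs(outer[0] - inner[0])
    return float(np.degrees(np.arctan2(rise, run)))

def hausdorff(u, v):
    """Symmetric Hausdorff distance between two point sets."""
    d = np.linalg.norm(u[:, None, :] - v[None, :, :], axis=-1)
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))

def asymmetry(landmarks, midline_x):
    """Mirror the right half across the facial midline, compare halves."""
    left = landmarks[landmarks[:, 0] < midline_x]
    right = landmarks[landmarks[:, 0] > midline_x].copy()
    right[:, 0] = 2 * midline_x - right[:, 0]
    return hausdorff(left, right)
```

The mapping from these raw measures onto a 0-10 sub-score is product tuning, not shown here.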
The LLM track stays — but now it's responsible for the aesthetic-judgment layer: skin quality assessment, hairstyle compatibility, facial harmony perception. Things that genuinely require pattern recognition over training data, not measurement.
## The combination
We don't average the two. We compose:

```python
W_GEO, W_LLM = 0.6, 0.4
DISAGREEMENT_THRESHOLD = 2.0

def final_score(geometric, llm_aesthetic):
    if abs(geometric - llm_aesthetic) > DISAGREEMENT_THRESHOLD:
        flag_for_review(f"disagreement: G={geometric}, L={llm_aesthetic}")
        return min(geometric, llm_aesthetic)  # be conservative
    return W_GEO * geometric + W_LLM * llm_aesthetic
```
The 0.6/0.4 weighting was found empirically — geometric carries more weight because it's the deterministic anchor. The disagreement detection catches edge cases (e.g., the LLM rates someone high on "presence" but geometry is rough — usually a charisma photo we're not equipped to score correctly).
## Results
Variance per identical input: from ±0.8 (single LLM) to ±0.5 (dual-track). Not zero, but much closer to what users expect.
Bonus: the geometric scores let us give actionable feedback. "Canthal tilt -3°, consider an angled selfie" beats "your eyes look closed" from a black-box LLM.
## What I'd do differently
The 0.6/0.4 weighting should be per-axis, not global. A high-resolution close-up of skin should shift weight toward LLM aesthetic perception. A poorly-lit small selfie should shift toward geometric (because LLM judgment on bad photos is mostly noise).
We're refactoring this now — per-axis dynamic weighting based on photo quality signals.
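The refactor we're heading toward looks roughly like this. `sharpness` and `brightness` are placeholder 0-1 photo-quality signals (e.g. a normalised Laplacian-variance blur check), not our actual feature names:

```python
def dynamic_weights(sharpness, brightness):
    # Placeholder quality signals in [0, 1]; the real signals are TBD.
    quality = min(sharpness, brightness)
    # Degraded photos lean on geometry; clean close-ups trust the LLM more.
    w_llm = 0.2 + 0.4 * quality        # LLM weight ranges 0.2 .. 0.6
    return 1.0 - w_llm, w_llm          # (geometric weight, LLM weight)
```

At mid quality (0.5) this recovers the static 0.6/0.4 split; a crisp, well-lit close-up pushes the LLM weight up toward 0.6.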
## Try it
If you want to see what dual-track scoring feels like in practice, you can try AI Omoggle — single test from $0.99, no subscription, no photos stored.
I'd genuinely love to hear how other people have tackled the LLM-variance problem in subjective tasks.