Originally published on AI Tech Connect.
What you need to know Once your product calls an LLM, you lose the comfort of deterministic tests. The same prompt can return a different answer each run, so an exact-match assertion is useless and a human reviewing every output does not scale. The pragmatic answer the industry has converged on is the LLM-as-a-judge: a second model, given a rubric, that scores or ranks the outputs of the first. As of mid-2026 this is the default evaluation pattern behind most production agent, RAG and chat systems — but it is also where a lot of teams quietly fool themselves. Pick the right shape. Pairwise (A-vs-B) is more reliable for comparing systems; pointwise (direct scoring) gives you the absolute number you need to gate CI. Write the rubric, not vibes. A good judge prompt has an explicit scale,…
Top comments (0)