The eval gap
Most teams know they should be evaluating their LLM outputs. Few actually do it in production.
The typical setup looks like this: you build a test suite with a handful of golden examples, run it in CI before deploys, and hope those examples are representative of what real users actually send. Sometimes they are. Often they're not. The prompts users write in production are messier, longer, and weirder than anything in your test fixtures. The edge cases that matter most are the ones you didn't think to include.
Meanwhile, the interesting data — the actual requests and responses flowing through your AI pipeline every day — sits in logs that nobody looks at until something breaks.
We think evals should run where the data already is.
Evals on production traffic
At Grepture, we built an AI gateway that sits in the request path of every LLM call — handling PII redaction, prompt management, cost tracking, and observability. That means every request and response is already logged with full context.
Starting today, Grepture can automatically evaluate that production traffic using LLM-as-a-judge scoring. You create an evaluator — from a template or with a custom judge prompt — tell it which traffic to score, and it runs in the background against your real logs. Each response gets a 0-to-1 score with written reasoning.
No synthetic datasets. No separate evaluation pipeline. No batch jobs to manage. Your production traffic is the test suite.
Setting up an evaluator
Evaluators are judge prompts with variables. At minimum, you need {{output}} — the LLM's response. You can also use {{input}} (the user's message) and {{system}} (the system prompt) for more context-aware scoring.
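Conceptually, the variable substitution is plain template rendering. Here's a minimal sketch in Python — the function name and dict shape are illustrative, not Grepture's actual SDK:

```python
import re

def render_judge_prompt(template: str, variables: dict[str, str]) -> str:
    """Substitute {{output}}, {{input}}, and {{system}} placeholders in a judge prompt.
    Unknown placeholders are left intact so a typo stays visible instead of vanishing."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: variables.get(m.group(1), m.group(0)),
                  template)

# Example: a relevance judge that uses both the user's message and the response.
template = ("Rate from 0 to 1 how well the answer addresses the question.\n"
            "Question: {{input}}\nAnswer: {{output}}")
prompt = render_judge_prompt(template, {"input": "What is the capital of France?",
                                        "output": "Paris."})
```

Leaving unknown placeholders intact is a deliberate choice: a misspelled `{{ouput}}` shows up in the rendered prompt where you'll notice it, rather than being silently dropped.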
We ship six templates to get you started:
- Relevance — does the response actually address the question?
- Helpfulness — is the response actionable and useful?
- Toxicity — is the response safe and appropriate?
- Conciseness — does the response convey information efficiently?
- Instruction following — does the response honour what the system prompt asked for?
- Hallucination — is the response grounded in what was provided?
Pick a template, adjust the prompt if you want, and enable it. Each evaluator also supports filters — only score traffic from a specific model, provider, or prompt ID — and a sampling rate so you control how much you spend on judge tokens.
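The judge step itself is a single LLM call that returns a score and reasoning. A hedged sketch of what that round trip looks like — `score_response` and the JSON reply format are assumptions for illustration, and the stub judge stands in for a real model call:

```python
import json

def score_response(judge_prompt: str, call_llm) -> tuple[float, str]:
    """Ask a judge model for a 0-to-1 score with written reasoning.
    `call_llm` is any function that takes a prompt string and returns the model's text."""
    raw = call_llm(
        judge_prompt
        + '\n\nReply with JSON: {"score": <float from 0 to 1>, "reasoning": "<one sentence>"}'
    )
    parsed = json.loads(raw)
    score = min(1.0, max(0.0, float(parsed["score"])))  # clamp in case the judge drifts out of range
    return score, parsed["reasoning"]

# Stub judge for illustration; a real deployment would call an actual model here.
fake_judge = lambda prompt: '{"score": 0.9, "reasoning": "Directly answers the question."}'
score, why = score_response("Rate relevance of this answer...", fake_judge)
```

The clamp matters in practice: judge models occasionally return scores outside the range you asked for, and clamping keeps downstream averages well-defined.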
Why production traffic matters
Here's what you learn from evaluating real traffic that you can't learn from a test suite:
Distribution shifts. Your test suite reflects what you thought users would ask when you wrote it. Production traffic reflects what they actually ask today. When user behaviour changes — and it always does — evals on real traffic catch it.
Long-tail failures. The requests that cause the worst outputs are usually the ones nobody anticipated. A 5% hallucination rate across your test suite might hide a 40% hallucination rate on a specific class of user query you never tested for.
Model regressions. Providers update models without notice. A minor version bump to GPT-4o or Claude might improve average quality but degrade performance on your specific use case. If you're only testing pre-deploy, you won't catch regressions introduced by the model provider.
Prompt drift. If you're managing prompts separately from code (and you should be), every prompt change is a potential quality change. Evals on real traffic give you a continuous quality signal that follows prompt versions automatically.
Controlling evals
Running a judge LLM on every request can be overkill. Two levers help:
Sampling rate. Set each evaluator to score 10% of matching traffic and you get statistically meaningful quality signals at a tenth of the cost. For high-volume workloads, even 1-5% gives you enough data to spot trends.
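The arithmetic behind "statistically meaningful at a tenth of the cost" is just the standard error of a mean. With illustrative numbers (10,000 daily requests and an assumed score spread — neither is from Grepture):

```python
import math

daily_requests = 10_000   # illustrative volume
sampling_rate = 0.10
n = int(daily_requests * sampling_rate)   # 1,000 scored responses per day

sigma = 0.25              # assumed standard deviation of 0-to-1 scores
standard_error = sigma / math.sqrt(n)
# Roughly 0.008: one day of 10% sampling pins the mean score
# to within about +/-0.016 at two standard errors.
```

Because the error shrinks with the square root of the sample, cutting sampling from 100% to 10% only triples the noise — which is why low rates still surface trends on high-volume traffic.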
Filters. Only evaluate what matters. Score production traffic but skip development requests. Evaluate only your customer-facing model. Focus on a specific prompt you're actively iterating on.
Why the gateway is the right place for this
Other evaluation tools require you to export logs, set up a separate pipeline, and manage another integration. That works, but it's friction — and friction means most teams never get around to it.
When your gateway already has every request and response logged with full context, evaluation is a natural extension. No data to export, no pipeline to build, no integration to maintain.
What's coming next
Evals today give you a quality score in the dashboard. We're building toward evals that actively tell you when something goes wrong:
- Email and Slack alerts when average scores drop below a threshold
- Webhook integrations to pipe results into your existing monitoring
- Scheduled reports with weekly quality digests
The goal: quality monitoring as hands-off as the rest of your AI infrastructure.
If you're building with LLMs and want continuous quality visibility on your production traffic, give Grepture a try. Drop in the SDK, point your traffic through the proxy, and you'll have both cost visibility and quality scoring from day one.