I switched on production evals for my LLM app — and they scored nothing

#llm #monitoring #privacy #testing

What data privacy taught me about online evals, and why I stopped treating LLM prompts like magic and started treating them like hostile user input.

The Context & The Constraint

I am building a meeting assistant that fact-checks claims in real time. Because it processes meeting audio in the EU, it is bound by strict data residency rules: personal data and transcripts cannot arbitrarily leave our European infrastructure.

That constraint turns into a fundamental distributed systems headache as soon as online evaluations enter the picture.

Online evals run after you ship, on live traffic. There is no answer key. Instead, you score live outputs for qualitative properties: Is this answer actually grounded in its sources? Is the model hallucinating? To do this, an evaluator needs to look at the inputs and outputs.

I wired up tracing, wrote an LLM judge, enabled LangSmith’s online evaluator, and waited for the scores to roll in. Nothing came back. Not a failure, not an error—just an endless stream of empty dashboards. The reason is a trap that is trivial to fall into, and it requires rethinking how we handle telemetry at the edge.

The Naive Approach (The MVP)

The standard, happy-path implementation for online evals assumes a globally unified system where data flows freely. You pipe your inputs, outputs, and intermediate states to an observability platform, and a cloud-hosted LLM judge scores them asynchronously.

Because of EU data rules, I had aggressively masked my traces. I configured my LangSmith client to strip all inputs and outputs before they left my server. LangSmith kept the metadata (latency, tokens) and nothing else.

Great for privacy. Fatal for evaluations. The evaluator opened the trace, found an empty object, and scored nothing. It failed silently because it was doing exactly what I configured it to do. But fixing this by simply "turning off masking" wasn't a legal option, and running the judge synchronously in the application code is an operational death sentence.
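
One way to make that silent failure loud is to have the evaluator check for an empty payload before it tries to judge anything. The sketch below is a hypothetical guard, not part of LangSmith's API: `classifyRun` and `isEmptyPayload` are invented names, and `run` is assumed to be a plain object with `inputs` and `outputs` fields.

```javascript
// Hypothetical pre-check: does a traced run contain anything to score?
// Assumes `run` is a plain object shaped like { inputs, outputs }.
function isEmptyPayload(value) {
  if (value === null || value === undefined) return true;
  if (typeof value === "string") return value.trim() === "";
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object") return Object.keys(value).length === 0;
  return false; // numbers and booleans count as content
}

function classifyRun(run) {
  const inputsEmpty = isEmptyPayload(run.inputs);
  const outputsEmpty = isEmptyPayload(run.outputs);
  if (outputsEmpty) {
    // Report why nothing was scored instead of returning no score at all.
    return {
      scorable: false,
      reason: inputsEmpty ? "inputs_and_outputs_masked" : "outputs_masked",
    };
  }
  return { scorable: true, reason: null };
}
```

An evaluator could record the `reason` as its own feedback key, so a dashboard full of "outputs_masked" points at the masking config rather than at the judge.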

The Architectural Evolution (Iterative Refinement)

Building a robust evaluation pipeline at the edge is not about just wiring APIs together; it is about respecting the physical limits of your compute environment and the failure modes of generative models.

Subsystem 1: Tracing Without Leaking

The Pitfall: Treating data masking as a boolean operation—send everything or send nothing. Sending nothing blinds your telemetry; sending everything violates data residency.
The Fix: The masking hook isn't a toggle; it’s a transformation function. We can store a derived, safe projection of the data.
The Hidden Pitfall: If you simply hash the text, you lose all semantic value for your downstream judge. If you extract entities, you risk accidentally leaking PII inside those entities.
The Definitive Fix: We emit structurally safe telemetry. No transcripts, no raw claims. We emit hashes for correlation, array lengths for shape validation, and enums for state.

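import { createHash } from "node:crypto"; // assumes a Node-compatible runtime
import { Client } from "langsmith";

// Illustrative helpers (not shown in the original app code).
const sha256 = (text) => createHash("sha256").update(text).digest("hex");
const host = (url) => new URL(url).hostname; // assumes sources are URL strings

// hideInputs / hideOutputs accept a function, so masking can transform
// the payload instead of only dropping it.
const client = new Client({
  hideInputs: () => ({}),
  // Store a structural projection of the outputs and drop everything else
  hideOutputs: (outputs) => (outputs?.cards ? projectForEval(outputs) : {}),
});

function projectForEval(state) {
  return {
    verdicts: state.cards.map((c) => ({
      verdict: c.verdict,                          // Strict enum (SUPPORTED, CONTRADICTED)
      claimHash: sha256(c.claim),                  // Correlation without exposure
      sourceDomains: c.sources.map((s) => host(s)), // FQDNs only, no paths
      evidenceLen: c.evidence.length,              // Shape validation
    })),
  };
}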

This allows our online evaluators to run cheap, deterministic checks (e.g., Did every card cite a valid domain?) without exposing a single sensitive byte.
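
As an illustration of such a deterministic check, here is a minimal sketch that scores the "did every card cite a valid domain?" question over the projected telemetry. It assumes the projection has the shape returned by `projectForEval`; `scoreCitations`, `looksLikeFqdn`, and the `cites_valid_domain` key are hypothetical names, and the hostname rule is only one possible definition of "valid".

```javascript
// Hypothetical validity rule: a lowercase hostname with at least one dot.
const looksLikeFqdn = (d) =>
  /^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,}$/.test(d);

// Deterministic check over the projected telemetry. No LLM involved.
function scoreCitations(projection, isValidDomain = looksLikeFqdn) {
  const verdicts = projection.verdicts ?? [];
  if (verdicts.length === 0) {
    return { key: "cites_valid_domain", score: null, failingClaimHashes: [] };
  }
  // A card fails if it cites nothing, or cites anything that is not a valid domain.
  const failing = verdicts.filter(
    (v) => v.sourceDomains.length === 0 || !v.sourceDomains.every(isValidDomain)
  );
  return {
    key: "cites_valid_domain",
    score: 1 - failing.length / verdicts.length, // fraction of passing cards
    failingClaimHashes: failing.map((v) => v.claimHash),
  };
}
```

Because the result carries claim hashes rather than claims, a failing card can be correlated with the EU-side record without the text ever reaching the observability platform.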

Subsystem 2: The LLM Judge & Edge Constraints

The Pitfall: To check whether evidence is actually faithful to a transcript, an LLM judge must see the text. To keep the text in the EU, I initially ran the judge synchronously inside my Cloudflare Worker, awaiting judge(card, sources) before returning the response. This instantly triggered CPU and wall-clock timeouts. Cloudflare Workers are built for fast I/O, not blocking for 4 seconds while an LLM grades homework.
The Fix: Decouple the evaluation from the critical path using Cloudflare's ctx.waitUntil(), allowing the worker to return the user's response immediately while the judge runs in the background.
The Hidden Pitfall: The Poison Pill. LLMs are non-deterministic. If your background judge hallucinates malformed JSON or markdown backticks, const { score } = JSON.parse(llmOutput) will throw a runtime exception. Because this is happening in waitUntil(), the error is swallowed, and your telemetry pipeline silently drops the trace.
The Definitive Fix: Shift to an asynchronous queue with strict parsing and Dead Letter Queues (DLQ). The edge worker drops the task onto a Cloudflare Queue. A separate background consumer processes the LLM judgment with defensive validation (e.g., Zod). If the LLM returns garbage, it fails the parse and is routed to a DLQ, preserving the pipeline's integrity.

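// Edge Worker: enqueue the eval and return. Never block the user.
export default {
  async fetch(req, env, ctx) {
    // Assumption: handleMeeting returns plain data that includes the
    // LangSmith run ID, rather than a Response object.
    const { runId, card, sources } = await handleMeeting(req, env);

    // Offload evaluation to a background queue. waitUntil lets the send
    // finish after the response has gone out; a failed send is logged
    // instead of vanishing.
    ctx.waitUntil(
      env.EVAL_QUEUE.send({ runId, card, sources }).catch((err) =>
        console.error("Failed to enqueue eval", err)
      )
    );

    return Response.json({ card, sources });
  }
}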

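// Queue Consumer (a separate Worker module): defensive parsing.
import { Client } from "langsmith";
import { z } from "zod";

// Assumed judge contract: a faithfulness score between 0 and 1.
const EvalSchema = z.object({ score: z.number().min(0).max(1) });

export default {
  async queue(batch, env) {
    // LANGSMITH_API_KEY is an assumed secret binding name.
    const langsmith = new Client({ apiKey: env.LANGSMITH_API_KEY });

    for (const msg of batch.messages) {
      try {
        // Assumed to return already-decoded JSON, or to throw if it cannot.
        const rawLLMOutput = await runLocalEUJudge(msg.body.card, msg.body.sources);

        // DEFENSIVE: Never trust LLM output structure
        const parsed = EvalSchema.safeParse(rawLLMOutput);
        if (!parsed.success) {
          // Retrying a malformed verdict will not repair it: park it and move on.
          await env.DLQ.send({
            runId: msg.body.runId,
            issues: parsed.error.issues,
            raw: rawLLMOutput,
          });
          msg.ack();
          continue;
        }

        // DEFENSIVE: Feedback is posted here, so third-party API limits never block edge traffic
        await langsmith.createFeedback(msg.body.runId, "faithfulness", {
          score: parsed.data.score,
        });
        msg.ack();
      } catch (err) {
        // Likely transient (judge or LangSmith unavailable): ask for redelivery
        // instead of letting the batch acknowledge the message implicitly.
        console.error("Eval pipeline failed", err);
        msg.retry();
      }
    }
  }
}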
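
The "strict parsing" step can also be sketched without a schema library, for the case where the judge returns raw text rather than decoded JSON. This is a hand-rolled stand-in for Zod, not the article's implementation: `parseJudgeOutput` is a hypothetical name, and the contract (a JSON object with a numeric `score` between 0 and 1) is an assumption.

```javascript
// Built with repeat() so this listing contains no literal markdown fence.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$"
);

// Returns { ok: true, data } or { ok: false, error }. Never throws.
function parseJudgeOutput(rawText) {
  if (typeof rawText !== "string") return { ok: false, error: "not_a_string" };

  // Judges often wrap JSON in a markdown fence; unwrap one if present.
  const trimmed = rawText.trim();
  const fenced = trimmed.match(FENCED_BLOCK);
  const candidate = fenced ? fenced[1] : trimmed;

  let value;
  try {
    value = JSON.parse(candidate);
  } catch (err) {
    return { ok: false, error: "invalid_json" };
  }

  if (value === null || typeof value !== "object" || Array.isArray(value)) {
    return { ok: false, error: "not_an_object" };
  }
  const { score } = value;
  if (typeof score !== "number" || !Number.isFinite(score) || score < 0 || score > 1) {
    return { ok: false, error: "invalid_score" };
  }
  return { ok: true, data: { score } };
}
```

Returning a result object instead of throwing keeps the routing decision explicit in the consumer: a failed parse goes to the DLQ, while a thrown exception stays reserved for transient failures that are worth retrying.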

Subsystem 3: The Prompt Versioning Illusion

The Pitfall: The rubric that dictates whether a claim is "supported" or "contradicted" lives in the prompt. Prompts are usually treated as disposable strings. LangSmith's Prompt Hub offers a neat solution: a UI to edit prompts and a :production label to pull them dynamically.
The Fix: Fetch the :production prompt at runtime so the application always uses the latest logic.
The Hidden Pitfall: Fetching a prompt mid-request on a stateless edge node is a blocking network call on the hot path. It can add 100 ms or more of latency to every interaction, and it creates a single point of failure. If the Hub goes down, your app goes down.
The Definitive Fix: Treat prompts strictly as immutable source code. You cannot rely on a third-party UI state to dictate edge execution reality. CI/CD pulls the specific, version-pinned prompt at build time, writes it to a constant, and bundles it.
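
A minimal sketch of the "write it to a constant" step, assuming the CI job has already fetched the prompt text and its version identifier. `renderPromptModule` is a hypothetical helper, and the check that the version looks like a hex commit hash is an assumption about how pinned versions are identified; the fetch itself and the file write are left out.

```javascript
// Hypothetical build step: turn an already-fetched, version-pinned prompt
// into the source text of a module that the bundler can inline.
function renderPromptModule({ name, commit, template }) {
  // Reject floating labels such as "production": only a pinned hash is allowed.
  if (!/^[0-9a-f]{7,64}$/.test(commit)) {
    throw new Error(`Refusing to bundle ${name}: "${commit}" is not a pinned commit hash`);
  }
  // JSON.stringify escapes quotes and newlines, so the output is valid JS.
  return [
    "// AUTO-GENERATED at build time. Do not edit by hand.",
    `export const PROMPT_NAME = ${JSON.stringify(name)};`,
    `export const PROMPT_COMMIT = ${JSON.stringify(commit)};`,
    `export const PROMPT_TEMPLATE = ${JSON.stringify(template)};`,
    "",
  ].join("\n");
}
```

Exporting the commit alongside the template means every trace can record which prompt version produced a verdict, which is what makes a rollback auditable.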

"One-click rollback" via a UI label is a dangerous illusion in distributed systems. Once the prompt is baked into the build, changing the label does nothing on its own: the change only takes effect after a rebuild, and then only once that deployment has finished propagating across the edge. Architecture must respect the physical reality of the deployment pipeline.

The Takeaway

Whiteboard architectures assume networks are perfectly reliable, external APIs never degrade, and LLMs always return beautifully formatted JSON. Production environments laugh at these assumptions.

Online evaluations are not free. They cost compute, they cost latency, and they introduce entirely new failure domains into your infrastructure. Building an LLM app requires epistemic humility—accepting that the model will fail, the judge will hallucinate, and the network will stall.

By pushing heavy evaluations to asynchronous queues, defensively parsing every output like it's hostile user input, and binding prompts to immutable build artifacts, we turn a fragile "happy path" demo into a hardened, mechanically sound system. It’s boring plumbing, but boring plumbing is the only thing that survives production.

Top comments (1)

xulingfeng • Jun 24

The DLQ for LLM judge failures — that's the mark of someone who's actually run this in production. Not 'if the judge fails' but 'when the judge hallucinates JSON.'