Dumebi Okolo

Posted on Jul 1

Semantic Observability: Engineering Reliability for Production RAG

#webdev #ai #beginners #python

Quiet failures hidden by HTTP 200 status

When a microservice fails, you notice. A null pointer exception in a Java backend or a 504 Gateway Timeout on an NGINX ingress leaves a clear trail in Datadog.
You see the spike in the error rate, the p99 latency climbs, and the on-call engineer gets a PagerDuty alert.
But LLM-powered applications? They don't behave this way. They fail quietly.

In a Retrieval-Augmented Generation (RAG) system, a user might ask, "What is our refund policy for enterprise customers?" The system retrieves a document about individual tier refunds, the LLM processes it, and returns a polite, well-formatted, but entirely incorrect answer.
To your monitoring stack, this looks like a success. The HTTP status is 200. The latency is within the 2-second budget. The token count is normal. This is a "soft failure"—a semantic deviation where the system remains operationally healthy but functionally broken.

Traditional "Golden Signals" (Latency, Errors, Traffic, Saturation) miss the most critical failure mode in AI applications: hallucination. While we still need to track engine-level metrics like Time to First Token (TTFT) and Time Per Output Token (TPOT) for capacity planning, they don't tell us if the output is true.
According to the vLLM metrics documentation, tracking TTFT is essential for measuring the responsiveness of the prefill phase, but it remains a system metric, not a quality metric.

TRADITIONAL MONITORING               SEMANTIC OBSERVABILITY
┌──────────────────────┐             ┌──────────────────────┐
│ HTTP 200/500         │             │ Faithfulness Score   │
│ CPU / Memory         │             │ Context Relevance    │
│ p99 Latency          │             │ Answer Relevance     │
│ Error Rate           │             │ Hallucination Rate   │
└──────────┬───────────┘             └──────────┬───────────┘
           │                                    │
           ▼                                    ▼
   "Is the service up?"                 "Is the answer true?"

We need a framework that treats the LLM output as a probabilistic variable rather than a deterministic string. This requires a shift from monitoring infrastructure to observing meaning.

The Semantic Observability Stack

Standard OpenTelemetry (OTel) spans are designed for distributed tracing across microservices, but they lack the fields necessary to debug an LLM. When an LLM call fails semantically, you need the full prompt, the completion, the model temperature, and the specific version of the system prompt used.

Injecting these into standard logs is a recipe for storage-cost disasters and PII leaks. Instead, we use OpenLLMetry, an open-source extension of OTel that introduces semantic conventions for AI. It auto-instruments calls to providers like OpenAI or Anthropic, capturing the input and output as attributes within a span.

One major architectural challenge is the overhead. Running a complex evaluation on every request increases latency. We solve this with Shadow Logging. The application thread handles the user request and immediately returns the response. Simultaneously, it pushes the trace data to an asynchronous queue (like RabbitMQ or an AWS SQS buffer). A worker pool then picks up these traces to run semantic evaluations offline.
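The shadow-logging flow above can be sketched with Python's standard library. This is a minimal illustration, not our production code: an in-process `queue.Queue` and a single worker thread stand in for RabbitMQ or SQS, and the names `handle_request`, `evaluate_trace`, and `eval_queue` are hypothetical. The evaluator is a placeholder where a real worker would call the judge model.

```python
import queue
import threading

# In-process stand-in for an external broker (RabbitMQ, SQS). Assumption for illustration.
eval_queue = queue.Queue(maxsize=1000)
eval_results = []


def evaluate_trace(trace):
    """Placeholder evaluator: a real worker would call a judge model here."""
    return {"trace_id": trace["trace_id"], "evaluated": True}


def worker():
    """Drain the queue until a None sentinel arrives."""
    while True:
        trace = eval_queue.get()
        if trace is None:
            eval_queue.task_done()
            break
        eval_results.append(evaluate_trace(trace))
        eval_queue.task_done()


def handle_request(trace_id, query, answer):
    """Return the answer immediately; enqueue the trace without blocking."""
    try:
        eval_queue.put_nowait({"trace_id": trace_id, "query": query, "answer": answer})
    except queue.Full:
        pass  # Drop the evaluation rather than delay the user response.
    return answer


worker_thread = threading.Thread(target=worker, daemon=True)
worker_thread.start()
response = handle_request("t-1", "What is the refund policy?", "30 days.")
eval_queue.join()      # Wait for the worker to finish (demo only).
eval_queue.put(None)   # Sentinel: stop the worker.
worker_thread.join()
```

The bounded queue with `put_nowait` is the important design choice: when the evaluation backlog is full, the request path sheds the evaluation instead of slowing the user down.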

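from openai import OpenAI
from opentelemetry import trace
from traceloop.sdk import Traceloop

# Initialize OpenLLMetry; disable_batch=False keeps batched (asynchronous) span export
Traceloop.init(app_name="finance-rag-service", disable_batch=False)

tracer = trace.get_tracer(__name__)
# Assumed: llm_client is an OpenAI client that reads OPENAI_API_KEY from the environment
llm_client = OpenAI()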

def generate_answer(query: str, context: str):
    with tracer.start_as_current_span("rag_generation") as span:
        # Attach semantic attributes manually if not using auto-instrumentation.
        # These follow OpenLLMetry's current gen_ai.* conventions.
        span.set_attribute("gen_ai.prompt.0.content", query)
        span.set_attribute("gen_ai.request.model", "gpt-4o")

        # Standard inference call
        response = llm_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use only the provided context."},
                {"role": "user", "content": f"Context: {context}\nQuery: {query}"}
            ]
        )

        answer = response.choices[0].message.content
        span.set_attribute("gen_ai.completion.0.content", answer)
        return answer

By decoupling the evaluation from the inference path, we protect the user experience while still capturing the data needed for forensic analysis. If a user reports a bad answer, we can look up the exact trace ID in Arize Phoenix and see the retrieved chunks that led to the error.

Quantifying Quality: Implementing the RAG Triad

Measuring the quality of a RAG system requires more than a single "accuracy" score. We use the RAG Triad, a framework popularized by TruLens, which decomposes the pipeline into three measurable relationships:

Context Relevance: How much of the retrieved information is actually relevant to the query? If your vector DB returns 10 chunks but only chunk 8 contains the answer, relevance is low.
Groundedness: Is the answer derived only from the retrieved context? This detects hallucinations where the LLM uses its internal training data instead of your proprietary docs.
Answer Relevance: Does the response actually address the user's intent?

To automate these metrics, we use the RAGAS library, which implements the same three relationships under its own metric names: Faithfulness (the analog of groundedness), Context Precision (a ranking-aware measure of retrieval relevance), and Answer Relevancy. It uses a "Judge" model (usually a stronger model like GPT-4o) to score these metrics on a scale of 0 to 1.
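To make the "ranking-aware" part concrete, here is a small sketch of the Context Precision@K calculation as we understand it from the RAGAS documentation: the average of precision@k taken at each rank that holds a relevant chunk. The helper `context_precision` is hypothetical. In RAGAS the per-chunk relevance verdicts come from the judge model, whereas here they are hard-coded.

```python
def context_precision(relevance):
    """Context Precision@K: average of precision@k taken at each relevant rank.

    `relevance` is a list of 0/1 verdicts for the retrieved chunks, in rank order.
    """
    hits = 0
    weighted_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted_sum += hits / rank  # precision@rank
    return weighted_sum / hits if hits else 0.0


# Ten retrieved chunks, only the eighth is relevant (the example from the text).
buried = context_precision([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
# The same single relevant chunk, but ranked first.
top_ranked = context_precision([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```

With the relevant chunk buried at rank 8 the score is 0.125. With the same chunk ranked first it is 1.0, which is why retrieval order matters to this metric and not just retrieval recall.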

For production reliability, we don't just rely on live sampling. We maintain a Golden Set. This is a curated JSON file of 50-100 high-stakes query-context-answer triples. During CI/CD, we run the current pipeline against this set. If the Faithfulness score drops below a threshold (e.g., 0.85), the build fails. This prevents regression when updating embedding models or changing chunking strategies, such as switching from fixed-size 512-token windows to recursive character splitting.

The Meta CRAG benchmark is a useful reference for how these datasets can be structured to handle ambiguous or dynamic information. Without a Golden Set, you're just guessing whether your new prompt is better than the last one.
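A minimal sketch of the CI gate described above, assuming the pipeline and judge have already produced one faithfulness score per Golden Set entry. The record layout, the `gate` helper, and the sample scores are all hypothetical. Only the 0.85 threshold comes from the text.

```python
import json
import statistics

FAITHFULNESS_THRESHOLD = 0.85  # CI gate from the text


def gate(scores, threshold=FAITHFULNESS_THRESHOLD):
    """Return (passed, mean_score) for a list of per-example faithfulness scores."""
    mean_score = statistics.mean(scores)
    return mean_score >= threshold, mean_score


# Hypothetical results; in CI these would come from running the current
# pipeline and the judge over the golden JSON file.
golden_results = json.loads("""
[
  {"query": "Refund policy for enterprise customers?", "faithfulness": 0.95},
  {"query": "Refund policy for individual tier?", "faithfulness": 0.90},
  {"query": "Is there a trial period?", "faithfulness": 0.60}
]
""")

passed, mean_score = gate([row["faithfulness"] for row in golden_results])
exit_code = 0 if passed else 1  # A CI script would call sys.exit(exit_code).
```

Here one badly grounded answer pulls the mean below the threshold, so the build would fail. A stricter variant could also fail on any single score below a floor, since a mean can hide one severe regression.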

Architecting LLM-as-a-Judge at Scale

Running a judge model on 100% of traffic is prohibitively expensive. At 1,000 requests per second, evaluating every response with GPT-4o would bankrupt the project. Instead, we implement a sidecar evaluation service that samples 5% of production traffic.
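One way to implement the 5% sample is to hash the trace ID, so the decision is deterministic and every service that sees the same trace agrees on it. This is a sketch under that assumption. The helper `should_evaluate` is hypothetical, and a uniform sample like this is only the simplest possible policy.

```python
import hashlib

SAMPLE_RATE = 0.05  # 5% of traffic, as in the text


def should_evaluate(trace_id, rate=SAMPLE_RATE):
    """Deterministically select about `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # Uniform in [0, 1)
    return bucket < rate


# Roughly 5% of 100,000 synthetic trace IDs are selected.
sampled = sum(should_evaluate(f"trace-{i}") for i in range(100_000))

# At 1,000 requests per second, a 5% sample means about 50 judge calls per second.
judge_calls_per_second = 1_000 * SAMPLE_RATE
```

Even at 5%, the judge load is around 50 calls per second at this traffic level, so the sampling rate is a cost lever worth revisiting as volume grows.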

As a heuristic, the "Judge" should be more capable than the "Worker," though judge models carry their own biases (self-preference, position bias), so this is a rule of thumb rather than an absolute. If your production model is Llama 3 8B, a judge like Claude 3.5 Sonnet or GPT-4o is a reasonable choice. The judge prompt should use Chain-of-Thought (CoT) reasoning, which tends to make scores more consistent and easier to audit: we force the judge to extract specific quotes from the context before making a judgment.

from pydantic import BaseModel, Field
from typing import List

class RAGEvaluationSchema(BaseModel):
    # Field order matters. With structured/constrained generation the model
    # emits fields top-to-bottom, so evidence and reasoning MUST come before
    # the score — otherwise the model commits to a number before "thinking,"
    # defeating the chain-of-thought benefit.
    supporting_evidence: List[str] = Field(..., description="Direct quotes from the context")
    reasoning: str = Field(..., description="Step-by-step logic for the score")
    score: float = Field(..., ge=0.0, le=1.0, description="Score from 0 to 1")

# Example Judge Prompt excerpt:
# "Evaluate the faithfulness of the answer. First, list all claims in the answer.
# Second, for each claim, find a supporting sentence in the context.
# If no sentence exists, the claim is a hallucination."

By enforcing a structured output via Pydantic, the evaluation data remains machine-readable. We can then aggregate these scores into a time-series database. This allows us to see if a specific model update caused a spike in hallucinations across the entire fleet.
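As a rough sketch of that aggregation, the function below groups judged responses by model version and reports the share that fall below a cutoff. The record layout, the `model_version` field, and the 0.5 cutoff are assumptions for illustration. A real setup would write these values to the time-series database instead of a dict.

```python
from collections import defaultdict

HALLUCINATION_CUTOFF = 0.5  # Assumed: scores below this count as hallucinations.


def hallucination_rate_by_version(records, cutoff=HALLUCINATION_CUTOFF):
    """Map each model version to the fraction of judged responses scoring below `cutoff`."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for record in records:
        version = record["model_version"]
        totals[version] += 1
        if record["score"] < cutoff:
            flagged[version] += 1
    return {version: flagged[version] / totals[version] for version in totals}


# Hypothetical judge outputs, already parsed into plain dicts.
records = [
    {"model_version": "v1", "score": 0.9},
    {"model_version": "v1", "score": 0.8},
    {"model_version": "v2", "score": 0.3},
    {"model_version": "v2", "score": 0.9},
]
rates = hallucination_rate_by_version(records)
```

In this toy data the rate moves from 0.0 on `v1` to 0.5 on `v2`, which is the kind of per-version jump the time series is meant to surface.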

Operationalizing the Feedback Loop

The final step is moving semantic metrics from a researcher's notebook into the SRE's dashboard. We use a custom Prometheus exporter that exposes the results from our evaluation worker for Prometheus to scrape. This allows us to visualize the Hallucination Rate alongside CPU Usage and Memory Saturation in Grafana.

Setting up "Semantic Alerts" is a fundamental shift in on-call philosophy. Instead of alerting only on 5xx errors, we trigger PagerDuty when the average Faithfulness score drops below 0.8 over a 5-minute rolling window. This often catches issues that system metrics miss, such as a corrupted vector index or an embedding provider that has silently begun returning degraded or zero-filled vectors.
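The alert condition can be illustrated in plain Python. In a Prometheus setup this logic would normally live in an alerting rule, so treat the class below as a hypothetical sketch of the rolling-window check only. It assumes samples arrive in timestamp order, with timestamps in seconds.

```python
from collections import deque

WINDOW_SECONDS = 300   # 5-minute rolling window
ALERT_THRESHOLD = 0.8  # Paging threshold from the text


class RollingFaithfulness:
    """Keeps (timestamp, score) pairs from the last `window` seconds."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.samples = deque()

    def add(self, timestamp, score):
        self.samples.append((timestamp, score))
        # Evict samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= timestamp - self.window:
            self.samples.popleft()

    def mean(self):
        if not self.samples:
            return None
        return sum(score for _, score in self.samples) / len(self.samples)

    def should_alert(self, threshold=ALERT_THRESHOLD):
        current = self.mean()
        return current is not None and current < threshold


monitor = RollingFaithfulness()
monitor.add(0, 0.95)    # Healthy score, evicted once it is older than 300 s
monitor.add(200, 0.70)
monitor.add(400, 0.60)  # Adding this sample drops the one from t=0
alerting = monitor.should_alert()
```

After the third sample the window holds only the two low scores, so the mean falls below 0.8 and the check fires. With sparse sampling it is also worth requiring a minimum number of samples in the window before paging anyone.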

THE FEEDBACK LOOP
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│ User Query   │─────▶│ RAG Pipeline │─────▶│ Response     │
└──────────────┘      └──────┬───────┘      └──────┬───────┘
                             │                     │
                             ▼                     ▼
                      ┌──────────────┐      ┌──────────────┐
                      │ OTel Trace   │─────▶│ Eval Sidecar │
                      └──────────────┘      └──────┬───────┘
                                                   │
                             ┌─────────────────────┘
                             ▼
                      ┌──────────────┐      ┌──────────────┐
                      │ Prometheus   │─────▶│ PagerDuty    │
                      └──────────────┘      └──────────────┘

We also integrate span-level user feedback. When a user clicks a "thumbs down" in the UI, the frontend sends the trace ID back to our observability backend. We correlate these manual signals with our automated RAG Triad scores. If the automated judge says a response is "grounded" but the user hates it, we have identified a gap in our evaluation logic. This continuous refinement turns observability from a passive monitoring task into a proactive engine for model improvement. Engineering for LLMs means accepting that the output will never be perfect, but it can always be measured.
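The correlation between thumbs-down feedback and judge scores can be sketched as a simple join on trace ID. The function name, the data shapes, and the 0.8 "grounded" cutoff are assumptions for illustration.

```python
GROUNDED_CUTOFF = 0.8  # Assumed: judge scores at or above this count as "grounded".


def find_disagreements(judge_scores, thumbs_down_trace_ids, cutoff=GROUNDED_CUTOFF):
    """Return trace IDs the judge rated grounded but a user rejected."""
    return sorted(
        trace_id
        for trace_id in thumbs_down_trace_ids
        if judge_scores.get(trace_id, 0.0) >= cutoff
    )


# Hypothetical data keyed by trace ID.
judge_scores = {"t-1": 0.95, "t-2": 0.40, "t-3": 0.88}
thumbs_down = {"t-1", "t-2"}
disagreements = find_disagreements(judge_scores, thumbs_down)
```

Here only `t-1` is flagged: the judge and the user already agree that `t-2` was bad, and nobody complained about `t-3`. The flagged traces are the ones worth reviewing by hand when refining the judge prompt.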

This article was generated with the help of Ozigi.

Top comments (8)

Alex Shev • Jul 2

Semantic observability is the missing layer for RAG teams. Latency and uptime tell you the pipe worked; they do not tell you whether the answer used the right evidence for the right reason.

Dumebi Okolo • Jul 3

Exactly.

kingai • Jul 1

This resonates. We built KING AI around the idea that an intelligent system should reflect on its own decisions daily. The improvement curve is steep when you add that feedback loop.

Dumebi Okolo • Jul 3

That's amazing. How's it going for you so far?

Kartik N V J K • Jul 3

The refund-policy example is exactly the failure I care about, where every golden signal stays green and the answer is still wrong. Moving the faithfulness eval onto an async queue via shadow logging is the right instinct for latency, but I keep running into the sampling question: at real traffic I cannot afford an LLM judge on every trace, so how are you deciding which requests to score without missing the rare semantic drift?

Sol • Jul 4

The HTTP-200-but-wrong-answer failure class is the one most teams still underestimate. I like the semantic-observability framing because it treats retrieval context, prompt state, and answer quality as incident evidence instead of just app telemetry. In the longest incident you've handled here, which signal was missing at first: retrieved chunks, prompt/version lineage, or a way to cluster bad responses by user intent?

Kartik N V J K • Jul 2

Splitting the paging threshold at 0.8 from the CI gate at 0.85 is a smart touch, since the failure that wakes you up and the one that should block a deploy aren't the same event. The part I'd watch is judging only 5% of traffic, because groundedness failures tend to cluster around specific query types and a uniform sample can miss a whole broken intent until it's already page-worthy. Do you stratify the sample by query class at all?

Wren Calloway • Jul 3

The load-bearing failure mode this glosses over: your judge is a RAG system too, and it has the same soft-failure problem you're trying to detect. When you alert on "Faithfulness dropped below 0.8," you're implicitly asserting your GPT-4o judge is calibrated — but judge scores drift silently when the provider ships a new model snapshot behind the same API name. You can wake up to a hallucination-rate spike that's actually a judge regression, and now your on-call is debugging a metric, not a system.

The fix your Golden Set already sets up but doesn't close: version the judge and re-baseline it against human-labeled triples on the same cadence you re-baseline the pipeline. Track judge-vs-human agreement (Cohen's kappa or just raw agreement on the golden set) as its own time series. If that number moves, every downstream Faithfulness number is suspect and the PagerDuty threshold is meaningless. Otherwise "0.8" is a number floating on top of an unmeasured, drifting measurement instrument — which is exactly the class of silent failure the whole post is arguing against.