A microservice failure is hard to miss. Whether it is a null pointer exception in a Java backend or a 504 Gateway Timeout on an NGINX ingress, it leaves a clear trail in Datadog.
You see the spike in the error rate, the p99 latency climbs, and the on-call engineer gets a PagerDuty alert.
But LLM-powered applications? They don't behave this way. They fail quietly.
In a Retrieval-Augmented Generation (RAG) system, a user might ask, "What is our refund policy for enterprise customers?" The system retrieves a document about individual tier refunds, the LLM processes it, and returns a polite, well-formatted, but entirely incorrect answer.
To your monitoring stack, this looks like a success. The HTTP status is 200. The latency is within the 2-second budget. The token count is normal. This is a "soft failure"—a semantic deviation where the system remains operationally healthy but functionally broken.
Traditional "Golden Signals" (Latency, Errors, Traffic, Saturation) miss the most critical failure mode in AI applications: output that is fluent but wrong, whether the model hallucinated it or was handed the wrong context. While we still need to track engine-level metrics like Time to First Token (TTFT) and Time Per Output Token (TPOT) for capacity planning, they don't tell us if the output is true.
According to the vLLM metrics documentation, tracking TTFT is essential for measuring the responsiveness of the prefill phase, but it remains a system metric, not a quality metric.
TRADITIONAL MONITORING          SEMANTIC OBSERVABILITY
┌──────────────────────┐        ┌──────────────────────┐
│ HTTP 200/500         │        │ Faithfulness Score   │
│ CPU / Memory         │        │ Context Relevance    │
│ p99 Latency          │        │ Answer Relevance     │
│ Error Rate           │        │ Hallucination Rate   │
└──────────┬───────────┘        └──────────┬───────────┘
           │                               │
           ▼                               ▼
 "Is the service up?"            "Is the answer true?"
We need a framework that treats the LLM output as a probabilistic variable rather than a deterministic string. This requires a shift from monitoring infrastructure to observing meaning.
The Semantic Observability Stack
Standard OpenTelemetry (OTel) spans are designed for distributed tracing across microservices, but they lack the fields necessary to debug an LLM. When an LLM call fails semantically, you need the full prompt, the completion, the model temperature, and the specific version of the system prompt used.
Injecting these into standard logs is a recipe for storage-cost disasters and PII leaks. Instead, we use OpenLLMetry, an open-source extension of OTel that introduces semantic conventions for AI. It auto-instruments calls to providers like OpenAI or Anthropic, capturing the input and output as attributes within a span.
One major architectural challenge is the overhead. Running a complex evaluation on every request increases latency. We solve this with Shadow Logging. The application thread handles the user request and immediately returns the response. Simultaneously, it pushes the trace data to an asynchronous queue (like RabbitMQ or an AWS SQS buffer). A worker pool then picks up these traces to run semantic evaluations offline.
from openai import OpenAI
from opentelemetry import trace
from traceloop.sdk import Traceloop

# Initialize OpenLLMetry with batched (asynchronous) span export
Traceloop.init(app_name="finance-rag-service", disable_batching=False)
tracer = trace.get_tracer(__name__)

# Assumed client setup: the OpenAI SDK reads OPENAI_API_KEY from the environment
llm_client = OpenAI()

def generate_answer(query: str, context: str) -> str:
    with tracer.start_as_current_span("rag_generation") as span:
        # Attach semantic attributes manually if not using auto-instrumentation.
        # These follow OpenLLMetry's current gen_ai.* conventions.
        span.set_attribute("gen_ai.prompt.0.content", query)
        span.set_attribute("gen_ai.request.model", "gpt-4o")

        # Standard inference call
        response = llm_client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Use only the provided context."},
                {"role": "user", "content": f"Context: {context}\nQuery: {query}"},
            ],
        )
        answer = response.choices[0].message.content

        # Record the completion on the same span so prompt and answer stay together
        span.set_attribute("gen_ai.completion.0.content", answer)
        return answer
By decoupling the evaluation from the inference path, we protect the user experience while still capturing the data needed for forensic analysis. If a user reports a bad answer, we can look up the exact trace ID in Arize Phoenix and see the retrieved chunks that led to the error.
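The Shadow Logging hand-off can be sketched with nothing but the standard library. This is a minimal illustration, not our production worker: an in-process `queue.Queue` stands in for RabbitMQ or SQS, and `evaluate_trace`, `handle_request`, and the trace dictionary layout are hypothetical names invented for the example.

```python
import queue
import threading
from typing import Dict, List, Optional

# In-process stand-in for the external queue (RabbitMQ, SQS) named in the text.
trace_queue: "queue.Queue[Optional[Dict[str, str]]]" = queue.Queue(maxsize=1000)
evaluated: List[Dict[str, str]] = []


def evaluate_trace(trace: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical placeholder; a real worker would call a judge model here."""
    return {"trace_id": trace["trace_id"], "status": "evaluated"}


def worker() -> None:
    """Drain the queue until a None sentinel arrives."""
    while True:
        item = trace_queue.get()
        try:
            if item is None:
                return
            evaluated.append(evaluate_trace(item))
        finally:
            trace_queue.task_done()


def handle_request(trace_id: str, query: str, answer: str) -> str:
    """Return the answer right away; enqueue the trace without blocking."""
    try:
        trace_queue.put_nowait(
            {"trace_id": trace_id, "query": query, "answer": answer}
        )
    except queue.Full:
        pass  # shed evaluation load instead of slowing the user-facing path
    return answer
```

The design choice worth noting is `put_nowait` with a bounded queue: when evaluation falls behind, traces are dropped rather than letting back-pressure reach the request path. Starting the consumer is one line, `threading.Thread(target=worker, daemon=True).start()`.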
Quantifying Quality: Implementing the RAG Triad
Measuring the quality of a RAG system requires more than a single "accuracy" score. We use the RAG Triad, a framework popularized by TruLens, which decomposes the pipeline into three measurable relationships:
- Context Relevance: How much of the retrieved information is actually relevant to the query? If your vector DB returns 10 chunks but only chunk 8 contains the answer, relevance is low.
- Groundedness: Is the answer derived only from the retrieved context? This detects hallucinations where the LLM uses its internal training data instead of your proprietary docs.
- Answer Relevance: Does the response actually address the user's intent?
To automate these metrics, we use the RAGAS library, which implements the same three relationships under its own metric names: Faithfulness (the analog of groundedness), Context Precision (a ranking-aware measure of retrieval relevance), and Answer Relevancy. It uses a "Judge" model (usually a stronger model like GPT-4o) to score these metrics on a scale of 0 to 1.
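To make "ranking-aware" concrete, here is a small sketch of a context precision calculation. It assumes the per-chunk relevance verdicts already exist (in RAGAS they come from the judge model) and follows the commonly documented definition: the mean of precision@k taken at each rank where a relevant chunk appears. The function name is ours, and this is not RAGAS code.

```python
from typing import List


def context_precision(relevance: List[bool]) -> float:
    """Ranking-aware precision over retrieved chunks, given in rank order.

    relevance[i] is True when the chunk at rank i + 1 was judged relevant.
    """
    hits = 0
    weighted = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / rank  # precision@rank at this relevant position
    # With no relevant chunk at all, treat precision as 0.0 (an assumption).
    return weighted / hits if hits else 0.0


# The "only chunk 8 of 10 is relevant" case from the list above.
only_chunk_8 = [False] * 7 + [True] + [False] * 2
print(context_precision(only_chunk_8))  # 0.125
```

The same single relevant chunk at rank 1 would score 1.0, which is why a reranker can move this metric without changing what is retrieved.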
For production reliability, we don't just rely on live sampling. We maintain a Golden Set. This is a curated JSON file of 50-100 high-stakes query-context-answer triples. During CI/CD, we run the current pipeline against this set. If the Faithfulness score drops below a threshold (e.g., 0.85), the build fails. This guards against regressions when updating embedding models or changing chunking strategies, such as switching from fixed-size 512-token windows to recursive character splitting.
The Meta CRAG benchmark is a useful reference for how these datasets can be structured to handle ambiguous or dynamic information. Without a Golden Set, you're just guessing whether your new prompt is better than the last one.
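The CI gate itself can be very small. The sketch below assumes an earlier pipeline step has already scored the Golden Set and written a JSON list of rows, each with a `faithfulness` field; that file layout, the function names, and the choice to average the scores are all assumptions made for illustration.

```python
import json
import statistics
from typing import List

FAITHFULNESS_THRESHOLD = 0.85  # the example threshold quoted in the text


def gate(scores: List[float], threshold: float = FAITHFULNESS_THRESHOLD) -> bool:
    """Return True when the mean score meets the threshold."""
    if not scores:
        raise ValueError("golden set produced no scores")
    return statistics.fmean(scores) >= threshold


def main(results_path: str) -> int:
    """Read scored golden-set rows and return a process exit code."""
    # Hypothetical layout: [{"query": "...", "faithfulness": 0.93}, ...]
    with open(results_path, encoding="utf-8") as fh:
        rows = json.load(fh)
    scores = [float(row["faithfulness"]) for row in rows]
    passed = gate(scores)
    print(f"mean faithfulness={statistics.fmean(scores):.3f} pass={passed}")
    return 0 if passed else 1  # a non-zero exit code fails the CI job
```

A CI step would wrap this as `sys.exit(main("golden_results.json"))`, where the file name is again a placeholder. A mean can hide a single catastrophic answer, so a stricter gate might also check the minimum score.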
Architecting LLM-as-a-Judge at Scale
Running a judge model on 100% of traffic is prohibitively expensive. At 1,000 requests per second, evaluating every response with GPT-4o would bankrupt the project. Instead, we implement a sidecar evaluation service that samples 5% of production traffic.
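One way to implement the 5% sample is to hash the trace ID instead of calling a random number generator, so every service that sees the same trace makes the same keep-or-skip decision. This is a sketch under that assumption; `should_evaluate` is a name we made up, not part of any SDK.

```python
import hashlib


def should_evaluate(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly `sample_rate` of traces by hashing the ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

At 1,000 requests per second this still leaves about 50 evaluations per second, or roughly 4.3 million judge calls per day, so the sample rate is a budget decision as much as a statistical one.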
As a heuristic, the "Judge" should be more capable than the "Worker," though judge models carry their own biases (self-preference, position bias), so this is a rule of thumb rather than an absolute. If your production model is Llama 3-8B, a judge like Claude 3.5 Sonnet or GPT-4o is a reasonable choice. The prompt for the judge should use Chain-of-Thought (CoT), which tends to improve accuracy but does not guarantee it. We force the judge to extract specific quotes from the context before making a judgment.
from pydantic import BaseModel, Field
from typing import List

class RAGEvaluationSchema(BaseModel):
    # Field order matters. With structured/constrained generation the model
    # emits fields top-to-bottom, so evidence and reasoning MUST come before
    # the score; otherwise the model commits to a number before "thinking,"
    # defeating the chain-of-thought benefit.
    supporting_evidence: List[str] = Field(..., description="Direct quotes from the context")
    reasoning: str = Field(..., description="Step-by-step logic for the score")
    score: float = Field(..., description="Score from 0 to 1")

# Example Judge Prompt excerpt:
# "Evaluate the faithfulness of the answer. First, list all claims in the answer.
# Second, for each claim, find a supporting sentence in the context.
# If no sentence exists, the claim is a hallucination."
By enforcing a structured output via Pydantic, the evaluation data remains machine-readable. We can then aggregate these scores into a time-series database. This allows us to see if a specific model update caused a spike in hallucinations across the entire fleet.
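The claim-by-claim procedure in the judge prompt excerpt suggests a simple way to turn verdicts into a number: the fraction of claims that have support. The sketch below adds one safeguard that is our own assumption and not something the prompt enforces: a quote only counts if it appears verbatim in the retrieved context, which protects against a judge that invents its evidence. `ClaimVerdict` and `faithfulness_score` are illustrative names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClaimVerdict:
    claim: str
    supporting_quote: Optional[str]  # None when the judge found no support


def faithfulness_score(verdicts: List[ClaimVerdict], context: str) -> float:
    """Fraction of claims backed by a quote that really occurs in the context."""
    if not verdicts:
        # An answer with no extractable claims cannot be scored this way.
        raise ValueError("no claims to score")
    supported = sum(
        1
        for v in verdicts
        if v.supporting_quote and v.supporting_quote in context
    )
    return supported / len(verdicts)
```

Exact substring matching is deliberately strict. A judge that paraphrases its quotes will be penalized, so whitespace or case normalization may be worth adding before the comparison.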
Operationalizing the Feedback Loop
The final step is moving semantic metrics from a researcher's notebook into the SRE's dashboard. We use a custom Prometheus exporter that exposes the results from our evaluation worker as metrics for Prometheus to scrape. This allows us to visualize the Hallucination Rate alongside CPU Usage and Memory Saturation in Grafana.
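For a sense of what such an exporter serves, here is a dependency-free sketch that renders per-model scores in the Prometheus text exposition format. In practice a client library would normally handle this; the metric name `rag_faithfulness_score`, the `model` label, and the function names are hypothetical.

```python
from typing import Dict


def _escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(scores_by_model: Dict[str, float]) -> str:
    """Render mean faithfulness per model as Prometheus gauge samples."""
    lines = [
        "# HELP rag_faithfulness_score Mean faithfulness of sampled responses.",
        "# TYPE rag_faithfulness_score gauge",
    ]
    for model, score in sorted(scores_by_model.items()):
        label = _escape_label(model)
        lines.append(f'rag_faithfulness_score{{model="{label}"}} {score}')
    return "\n".join(lines) + "\n"
```

Serving this string from a `/metrics` endpoint is enough for a scrape job to pick it up, and the `model` label is what lets Grafana split the panel by model version.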
Setting up "Semantic Alerts" is a fundamental shift in on-call philosophy. Instead of alerting only on 5xx errors, we trigger PagerDuty when the average Faithfulness score drops below 0.8 over a 5-minute rolling window. This often catches issues that system metrics miss, such as a corrupted vector index or an embedding provider that has silently begun returning degraded or zero-filled vectors.
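The alert condition is a rolling mean compared against a threshold. In a Prometheus setup this would more likely live in an alerting rule, so treat the class below as a sketch of the logic only. The `min_samples` guard is our own assumption: with sampled traffic, a quiet five minutes may contain too few scores to justify paging anyone.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class RollingFaithfulness:
    """Rolling mean of faithfulness scores over a fixed time window."""

    def __init__(
        self,
        window_seconds: float = 300.0,  # the 5-minute window from the text
        threshold: float = 0.8,         # the alert threshold from the text
        min_samples: int = 20,          # assumed guard against tiny samples
    ) -> None:
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.min_samples = min_samples
        self._samples: Deque[Tuple[float, float]] = deque()

    def add(self, timestamp: float, score: float) -> None:
        """Record a score and evict anything older than the window."""
        self._samples.append((timestamp, score))
        cutoff = timestamp - self.window_seconds
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def mean(self) -> Optional[float]:
        if not self._samples:
            return None
        return sum(score for _, score in self._samples) / len(self._samples)

    def should_alert(self) -> bool:
        """True when enough samples exist and their mean is below threshold."""
        if len(self._samples) < self.min_samples:
            return False
        current = self.mean()
        return current is not None and current < self.threshold
```

Timestamps are assumed to arrive in non-decreasing order, which holds when a single worker feeds the window.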
                      THE FEEDBACK LOOP
┌──────────────┐      ┌──────────────┐      ┌──────────────┐
│  User Query  │─────▶│ RAG Pipeline │─────▶│   Response   │
└──────────────┘      └──────┬───────┘      └──────┬───────┘
                             │                     │
                             ▼                     ▼
                      ┌──────────────┐      ┌──────────────┐
                      │  OTel Trace  │─────▶│ Eval Sidecar │
                      └──────────────┘      └──────┬───────┘
                                                   │
                             ┌─────────────────────┘
                             ▼
                      ┌──────────────┐      ┌──────────────┐
                      │  Prometheus  │─────▶│  PagerDuty   │
                      └──────────────┘      └──────────────┘
We also integrate span-level user feedback. When a user clicks a "thumbs down" in the UI, the frontend sends the trace ID back to our observability backend. We correlate these manual signals with our automated RAG Triad scores. If the automated judge says a response is "grounded" but the user hates it, we have identified a gap in our evaluation logic. This continuous refinement turns observability from a passive monitoring task into a proactive engine for model improvement. Engineering for LLMs means accepting that the output will never be perfect, but it can always be measured.
This article was generated with the help of Ozigi.