DEV Community

Shreekansha

Posted on • Originally published at Medium

Observability in GenAI Systems: What to Log, Measure, and Monitor

In traditional microservices, observability is a solved problem. We monitor CPU, memory, request latency (p99), and HTTP status codes. If a service returns a 200 OK within 100ms, we generally assume it is healthy.

In Generative AI (GenAI), this assumption is dangerous. A GenAI system can return a 200 OK within 100ms while delivering a high-confidence hallucination, leaking PII, or incurring a 500% cost spike due to an unoptimized retrieval loop. Traditional observability tells you the system is alive; GenAI observability tells you if the system is useful, safe, and economically viable.

1. The Unique Challenges of GenAI Observability

GenAI systems are non-deterministic. The same input can yield different outputs, and "correctness" is a moving target.

  • Semantic Failure: A system can fail semantically (hallucinations) without failing technically (exceptions).

  • Token-Based Economics: Costs are tied to sub-request units (tokens), not just request volume.

  • Contextual State: The payload size (context window) directly impacts latency and cost in a non-linear fashion.

  • Multi-Stage Pipelines: A single user query might involve multiple retrieval steps, reranking, and multiple inference calls, each requiring distinct telemetry.

2. The Observability Architecture

A production-ready observability stack must intercept data at every stage of the pipeline.

[User Request]
       |
       v
+--------------------------+      +--------------------------+
|  Input Guardrail Layer   | ---> | Metric: Injection Block  |
+--------------------------+      | Metric: PII Filter Rate  |
       |                          +--------------------------+
       v
+--------------------------+      +--------------------------+
|  Retrieval (RAG) Layer   | ---> | Metric: Recall @ K       |
| (Vector DB / Search)     |      | Metric: Context Precision|
+--------------------------+      +--------------------------+
       |
       v
+--------------------------+      +--------------------------+
|  Inference Service       | ---> | Metric: TTFT, TPS        |
| (LLM Orchestrator)       |      | Metric: Token Usage/Cost |
+--------------------------+      +--------------------------+
       |
       v
+--------------------------+      +--------------------------+
|  Validation Layer        | ---> | Metric: Hallucination %  |
| (Grounding / Fact Check) |      | Metric: Schema Accuracy  |
+--------------------------+      +--------------------------+
       |
       v
[Safe Response to User]


3. Logging Strategies for Production
Logging in GenAI must be structured to allow for post-hoc semantic analysis. You are not just logging strings; you are logging "traces of thought."

Token Usage and Cost Attribution

Every log entry must track the model name, input tokens, and output tokens. In multi-tenant systems, this must be tagged with a tenant_id.


import json
import time
import uuid
import logging

logger = logging.getLogger("genai_runtime")

def log_inference_event(tenant_id, model_name, prompt, response, usage, latency_ms):
    event = {
        "event_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "tenant_id": tenant_id,
        "metadata": {
            "model": model_name,
            "latency_ms": latency_ms,
            "usage": {
                "prompt_tokens": usage.get("prompt_tokens"),
                "completion_tokens": usage.get("completion_tokens"),
                "total_tokens": usage.get("total_tokens")
            }
        },
        "payload": {
            "input": prompt,
            "output": response
        }
    }
    # Structured logging for ELK or BigQuery ingestion
    logger.info(json.dumps(event))


Retrieval Metadata
When using RAG, you must log the specific chunks retrieved. This allows you to debug whether a hallucination was caused by the model or by bad data.

  • Chunk IDs: Which documents were pulled?

  • Similarity Scores: How confident was the vector search?

  • Context Window Position: Where was the correct answer located in the prompt? (Models often suffer from "Lost in the Middle").
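These three fields can be captured in one structured event per retrieval step. A minimal sketch, assuming your vector store returns chunks as dicts with `id` and `score` keys (field names are illustrative):

```python
import json
import logging

logger = logging.getLogger("genai_retrieval")

def log_retrieval_event(trace_id, query, chunks):
    """Log every retrieved chunk with its similarity score and prompt position."""
    event = {
        "trace_id": trace_id,
        "query": query,
        "chunks": [
            {
                "chunk_id": chunk["id"],
                "similarity": round(chunk["score"], 4),
                "context_position": rank,  # 0 = top of the assembled prompt
            }
            for rank, chunk in enumerate(chunks)
        ],
    }
    logger.info(json.dumps(event))
    return event
```

Reusing the same `trace_id` here and in the inference log is what lets you later join "which chunks were in context" against "what the model said."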

4. Performance Monitoring: The Latency Breakdown
Standard latency (Total Time) is insufficient. You must break it down into:

  • Time to First Token (TTFT): The delay until the user sees the first character. This is the primary driver of perceived speed.

  • Tokens Per Second (TPS): The "reading speed" of the generation.

  • Retrieval Latency: Time spent querying the Vector DB.

  • Guardrail Latency: Time spent running safety classifiers.

If TTFT is high but TPS is acceptable, your bottleneck is likely the retrieval step or prompt processing. If TPS is low, your model choice or provider is the bottleneck.
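TTFT and TPS can both be derived from the streaming response itself. A sketch assuming a generic token iterator from whatever client library you use:

```python
import time

def measure_stream(token_iter):
    """Consume a token stream and compute TTFT (ms) and TPS."""
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in token_iter:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        count += 1
    end = time.perf_counter()
    if first_token_at is None:  # empty stream
        return {"ttft_ms": None, "tokens": 0, "tps": 0.0}
    gen_seconds = end - first_token_at
    return {
        "ttft_ms": (first_token_at - start) * 1000,
        "tokens": count,
        "tps": count / gen_seconds if gen_seconds > 0 else float("inf"),
    }
```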

5. Cost Monitoring and Anomaly Detection
GenAI costs can be volatile. A recursive agent loop can burn through thousands of dollars in minutes.

  • Per-Tenant Quotas: Hard-limit token usage at the infrastructure level.

  • Trend Analysis: Monitor the cost_per_query. If it spikes, check if users are inputting massive documents or if your retrieval system is returning too many chunks.

  • Model Switching: Monitor for opportunities to route simple queries to smaller, cheaper models.
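A per-tenant hard limit can be enforced with a small in-process ledger. A sketch only: the model names and `PRICE_PER_1K` rates below are invented for illustration, not a real rate card, and a production system would persist spend in a shared store rather than process memory:

```python
from collections import defaultdict

# Illustrative per-1K-token prices; substitute your provider's rate card.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.03}

class TenantBudget:
    """In-process daily spend ledger with a hard per-tenant cap."""

    def __init__(self, daily_limit_usd):
        self.daily_limit_usd = daily_limit_usd
        self.spend = defaultdict(float)

    def record(self, tenant_id, model, total_tokens):
        cost = (total_tokens / 1000) * PRICE_PER_1K[model]
        self.spend[tenant_id] += cost
        return cost

    def allow(self, tenant_id):
        """Reject new requests once the tenant exhausts its budget."""
        return self.spend[tenant_id] < self.daily_limit_usd
```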

6. Hallucination Detection Signals
Since you cannot manually verify every response, you need "Proxy Signals" for hallucinations:

  • Self-Correction Rate: How often does your validation layer send a response back for a retry?

  • Semantic Variance: Run the same query twice at temperature > 0. If the answers are wildly different, the model is likely hallucinating.

  • NLI (Natural Language Inference): Use a small, fast model to check if the response is logically "entailed" by the context.
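The semantic-variance signal can be prototyped cheaply. The `cosine_similarity` helper below is a crude lexical stand-in for a real embedding comparison; the threshold is a tunable assumption:

```python
import math
from collections import Counter

def cosine_similarity(a: str, b: str) -> float:
    """Crude lexical cosine; swap in an embedding model for production use."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def variance_flag(answers, threshold=0.6):
    """Flag a query when any two sampled answers diverge below the threshold."""
    sims = [cosine_similarity(answers[i], answers[j])
            for i in range(len(answers))
            for j in range(i + 1, len(answers))]
    return bool(sims) and min(sims) < threshold
```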

7. Metrics Breakdown: What to Put on the Dashboard

Quality Metrics

  • Faithfulness: Does the response match the context?

  • Answer Relevance: Does the response actually address the user's prompt?

  • Context Precision: Is the retrieved information actually useful for the answer?

Operational Metrics

  • Request Success Rate (Technical 2xx vs 4xx/5xx).

  • Guardrail Trigger Rate: Percentage of requests blocked for safety/PII.

  • Token Burn Rate: Real-time USD spend per hour.

User Interaction Metrics

  • Thumbs Up/Down: Direct feedback linked to the specific trace ID.

  • Edit Distance: If the user manually edits the AI's output, measure the Levenshtein distance between the AI's draft and the final version.
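The edit-distance signal is cheap to compute in-process. A standard dynamic-programming Levenshtein, normalized by the longer string so 0.0 means the draft was accepted verbatim:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_ratio(draft: str, final: str) -> float:
    """0.0 = user accepted the AI draft verbatim; 1.0 = fully rewritten."""
    longest = max(len(draft), len(final))
    return levenshtein(draft, final) / longest if longest else 0.0
```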

8. Common Production Mistakes

  • Logging Only the Output: Without logging the prompt and the retrieved context, you cannot replicate or fix failures.

  • Ignoring System Prompts: Always log the exact version of the system prompt used. A "minor" prompt tweak can fundamentally change model behavior.

  • No Trace Correlation: Failing to link the Guardrail log, the Retrieval log, and the Inference log. Use a trace_id that persists across the entire pipeline.

  • Over-logging PII: While you need to log for debugging, ensure your logging pipeline itself has PII scrubbing to prevent sensitive user data from ending up in your observability tool (e.g., Datadog/ELK).
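A last-mile scrubber in the logging path can catch the most common PII shapes before a line leaves the process. The regex patterns below are illustrative only and are not a substitute for a dedicated PII detection service:

```python
import re

# Illustrative patterns only; real deployments need a dedicated PII service.
PII_PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<EMAIL>"),
    (re.compile(r"\b\d{3}-\d{2}-\d{4}\b"), "<SSN>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD>"),
]

def scrub(text: str) -> str:
    """Redact common PII shapes before a log line leaves the process."""
    for pattern, replacement in PII_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Applying `scrub` inside the logging helper (rather than at the sink) means the raw values never reach Datadog/ELK at all.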

9. Engineering Takeaways

Observability in GenAI is not a "nice-to-have" post-launch feature; it is a core architectural component. Your system must be designed to be "interrogatable."

  • Decouple your logic so each step (Retrieval, Inference, Validation) can be timed and measured independently.

  • Use structured, machine-readable logs to enable automated evaluation of model quality.

  • Prioritize TTFT and Cost Attribution to maintain both user satisfaction and business health.

  • Build "Semantic Firewalls" using guardrail metrics to catch non-technical failures before they reach the user.

A mature GenAI platform is characterized by its ability to answer exactly why a model gave a specific answer at a specific cost. If your current logs only show a 200 OK and a timestamp, you are flying blind in a probabilistic storm.

Critical Discussion

If a system achieves 99.9% uptime (200 OKs) but provides semantically incorrect or ungrounded answers in 15% of those cases, should the "Availability" metric be redefined to include semantic grounding, and how does this shift the responsibility from DevOps to AI Platform Engineering?
