Design Recipe: Observability Pyramid for LLM Infrastructure

In classic backend systems, we are used to determinism: code either works or crashes with a clear stack trace. In LLM systems, we deal with "soft failures": the system runs fast and logs no errors, yet outputs hallucinations or irrelevant context.

As an engineer with a background in high-load and distributed systems, I like to view the system as a pipeline with measurable efficiency at each stage. For this, I use the Observability Pyramid, where each layer protects the next.

1. System Layer: Telemetry and SRE Basics

Without this layer, the others make no sense. If you don't meet SLAs for availability and speed, response accuracy doesn't matter.

Key Metrics:

  • TTFT (Time to First Token): the main metric for UX
  • TPOT (Time Per Output Token): generation stability
  • Tokens/Sec & Input/Output Ratio: critical for capacity planning and understanding KV-cache load
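As a quick sanity check, TTFT and TPOT can also be measured from the client side by streaming a request. Below is a minimal sketch that assumes an OpenAI-compatible endpoint (e.g., a local vLLM server at a hypothetical URL and model name) and treats streamed chunks as a rough proxy for tokens.

```python
import time
from openai import OpenAI  # assumes the engine exposes an OpenAI-compatible API

# Hypothetical local vLLM endpoint; adjust base_url and model to your deployment
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def measure_ttft_tpot(prompt: str, model: str = "my-model") -> tuple[float, float]:
    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()  # first visible token -> TTFT
            chunks += 1

    total = time.perf_counter() - start
    ttft = (first_token_at or total) - start
    # TPOT approximated as decode time divided by streamed chunks (chunks ~ tokens)
    tpot = (total - ttft) / max(chunks - 1, 1)
    return ttft, tpot

if __name__ == "__main__":
    ttft, tpot = measure_ttft_tpot("Explain KV-cache in one sentence.")
    print(f"TTFT: {ttft * 1000:.0f} ms, TPOT: {tpot * 1000:.1f} ms/token")
```

The engine's own histograms are the source of truth for dashboards, but this gives an end-to-end number that also includes network and queuing time.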

Engineering Approach: Monitor inference engines (vLLM/TGI) via Prometheus/Grafana and OpenTelemetry (OpenLLMetry).
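For alerts and SLO checks, the same numbers can be pulled from Prometheus via its HTTP query API. A minimal sketch, assuming a Prometheus instance that scrapes the engine; the vLLM histogram names are what recent versions expose, so verify them against your engine's /metrics page.

```python
import requests

PROM_URL = "http://localhost:9090"  # assumption: Prometheus scraping the inference engine

def p95_over_5m(histogram: str) -> float:
    """p95 of a Prometheus histogram over the last 5 minutes."""
    promql = f"histogram_quantile(0.95, sum(rate({histogram}_bucket[5m])) by (le))"
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": promql}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else float("nan")

# Metric names follow vLLM's Prometheus exporter; check your version's /metrics output
print("TTFT p95 (s):", p95_over_5m("vllm:time_to_first_token_seconds"))
print("TPOT p95 (s):", p95_over_5m("vllm:time_per_output_token_seconds"))
```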

For details on profiling the engine and finding bottlenecks — see my article:

LLM Engine Telemetry: How to profile models

2. Retrieval Layer: Data Hygiene (RAG Triad)

Most hallucinations stem from poor retrieval. RAG evaluation should be decomposed into three components:

A. Context Precision

How relevant are the retrieved chunks? Noise distracts the model and wastes tokens.

Tools: RAGAS, DeepEval.

B. Context Recall

Does the retrieved set contain the factual answer?

Practice: You need a "golden standard" — a labeled dataset. I use Meta CRAG because it simulates real-world chaos and dynamically changing data.

See my guide on local CRAG evaluation here.
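For illustration, a golden-set record does not need to be complicated: a question, the ground-truth answer, and the chunks the retriever is expected to surface are enough to score recall. The field names below are hypothetical, not the CRAG schema.

```python
# One hypothetical golden-set record (field names are illustrative, not the CRAG schema)
golden_case = {
    "question": "Which cluster serves the checkout assistant?",
    "ground_truth": "The checkout assistant runs on the vLLM cluster in eu-west-1.",
    "expected_chunks": [
        "The checkout assistant is deployed on a vLLM cluster in eu-west-1.",
    ],
}
```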

C. Faithfulness

Is the answer derived from the context or hallucinated?

A judge model checks every claim in the response against the provided source.
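Putting the triad together, here is a minimal sketch using the classic RAGAS API (the sample data is made up, and newer RAGAS releases use different column names); by default it needs a judge LLM configured, e.g. via OPENAI_API_KEY.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import context_precision, context_recall, faithfulness

# Toy evaluation set: question, generated answer, retrieved chunks, ground truth
data = {
    "question": ["Which cluster serves the checkout assistant?"],
    "answer": ["The checkout assistant is served by the vLLM cluster in eu-west-1."],
    "contexts": [[
        "The checkout assistant is deployed on a vLLM cluster in eu-west-1.",
        "Legacy batch jobs still run on TGI.",
    ]],
    "ground_truth": ["The checkout assistant runs on the vLLM cluster in eu-west-1."],
}

# Uses a judge LLM under the hood (OpenAI by default)
result = evaluate(
    Dataset.from_dict(data),
    metrics=[context_precision, context_recall, faithfulness],
)
print(result)  # per-metric scores in [0, 1]
```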

3. Semantic Layer: LLM-as-a-Judge at Scale

This layer checks logic. The main challenge is balancing evaluation quality against cost and speed.

Engineering Best Practices:

  • CI/CD Gating: Full run on a reference dataset. If Faithfulness drops below 0.8, block the deployment (tune the threshold for your domain).
  • Production Sampling: In high-load systems, evaluating 100% of traffic via GPT-4o is financial suicide. Use sampling (1–5%). Additionally, implement judge caching (GPTCache, LangChain cache, or vLLM prefix caching). This is especially effective when users ask similar questions: the same prompt+context may come up for evaluation multiple times, but you pay only once. A sketch of this sampling-plus-caching pattern follows this list.
  • Specialized Judges: Instead of generic small models (which often struggle with evaluation logic), use Prometheus-2 or Flow-Judge. They are trained specifically for evaluation tasks, are comparable in quality to GPT-4, and can be hosted locally.
  • Out-of-band Eval: In production, evaluation always runs asynchronously so it never adds latency to the main request path.
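To make the sampling and out-of-band points concrete, here is a minimal sketch of the pattern: sample a small share of traffic, push it onto a queue, and let a background worker score it, paying for each unique prompt-plus-context once. The judge call and metric exporter are placeholders to swap for your own stack (GPT-4o, Prometheus-2, GPTCache, OpenTelemetry, etc.).

```python
import asyncio
import hashlib
import random

SAMPLE_RATE = 0.02                  # judge ~2% of production traffic (tune per volume/budget)
judge_queue: asyncio.Queue = asyncio.Queue()
judge_cache: dict[str, float] = {}  # naive in-memory stand-in for GPTCache / prefix caching

async def handle_request(question: str, contexts: list[str], answer: str) -> str:
    """Main serving path: evaluation never adds latency here."""
    if random.random() < SAMPLE_RATE:
        judge_queue.put_nowait((question, contexts, answer))  # fire-and-forget
    return answer

async def judge_worker() -> None:
    """Out-of-band worker: scores sampled responses asynchronously."""
    while True:
        question, contexts, answer = await judge_queue.get()
        key = hashlib.sha256("|".join([question, *contexts, answer]).encode()).hexdigest()
        if key not in judge_cache:  # an identical prompt+context+answer is judged only once
            judge_cache[key] = await score_faithfulness(question, contexts, answer)
        report_metric("faithfulness", judge_cache[key])
        judge_queue.task_done()

async def score_faithfulness(question: str, contexts: list[str], answer: str) -> float:
    """Placeholder judge call: swap in GPT-4o, Prometheus-2, or Flow-Judge."""
    await asyncio.sleep(0)  # simulate the async model call
    return 1.0              # stub score

def report_metric(name: str, value: float) -> None:
    """Stand-in for a Prometheus/OpenTelemetry exporter."""
    print(f"{name}={value:.2f}")
```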

Diagnostic Map: What to Fix?

| Metric | If Dropped, Problem In | Action Plan |
| --- | --- | --- |
| Context Recall | Embeddings / Indexing | Switch embedding model, implement Hybrid Search (Vector + Keyword) |
| Context Precision | Chunking / Noise | Add Reranker (Cross-Encoder), revise Chunking Strategy |
| Faithfulness | Temperature / Context | Lower Temperature, strengthen system prompt, check chunk integrity |
| TTFT (Latency) | Hardware / Load | Check Cache Hit Rate, enable quantization or PagedAttention |
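As an example of the "add a reranker" fix from the table, a cross-encoder scores each (query, chunk) pair jointly before the chunks reach the prompt. A minimal sketch with sentence-transformers; the model name and candidate chunks are illustrative.

```python
from sentence_transformers import CrossEncoder

# Public MS MARCO cross-encoder; swap for a domain-specific reranker if needed
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is the TTFT SLA for the checkout assistant?"
candidates = [
    "The checkout assistant must return its first token within 300 ms (p95).",
    "Release notes: the marketing bot was migrated to a new index.",
    "TPOT budgets are tracked separately from TTFT.",
]

# Score each (query, chunk) pair and keep only the top-k chunks for the prompt
scores = reranker.predict([(query, chunk) for chunk in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
top_k = [chunk for chunk, _ in ranked[:2]]
print(top_k)
```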

Implementation Plan (Checklist)

  • Instrument (Day 0): Set up export of metrics and traces (vLLM + OpenTelemetry).
  • Golden Set: Collect 50–100 critical cases. Use the Meta CRAG structure as a reference (details in my article Build Your Own Spaceport: Local RAG Evaluation with Meta CRAG).
  • Automate: Integrate DeepEval/RAGAS into GitHub Actions (see the sketch after this checklist).
  • Sampling & Feedback: Set up log and user feedback collection (thumbs up/down) for gray-zone analysis in Arize Phoenix or LangSmith.
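As referenced in the checklist, the CI gate can be as simple as a pytest file that fails when Faithfulness drops below the layer-3 threshold. A minimal sketch using DeepEval's pytest integration; the golden_set.json file and its fields are hypothetical, and DeepEval also needs a judge LLM configured.

```python
# test_rag_quality.py -- run with `pytest` in CI (e.g., a GitHub Actions job)
import json

import pytest
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

# Hypothetical golden set: [{"question": ..., "answer": ..., "contexts": [...]}, ...]
with open("golden_set.json") as f:
    GOLDEN_CASES = json.load(f)

@pytest.mark.parametrize("case", GOLDEN_CASES)
def test_faithfulness_gate(case):
    test_case = LLMTestCase(
        input=case["question"],
        actual_output=case["answer"],        # answer produced by your RAG pipeline
        retrieval_context=case["contexts"],  # chunks that were retrieved for it
    )
    # Fails the job (and blocks the deployment) if faithfulness drops below 0.8
    assert_test(test_case, [FaithfulnessMetric(threshold=0.8)])
```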

Conclusion

For an experienced engineer, an LLM system is just another probabilistic node in a distributed architecture. Our job is to surround it with sensors so that its behavior becomes predictable, like the trajectory of a rocket in a verified orbit.

Top comments (6)

Alex (alex_ml_ai):

I especially liked the layered approach. In your opinion, is it viable to run automated evaluations (LLM-as-a-judge) continuously in production, or does it add too much overhead in terms of cost and latency?

astronaut (astronaut27):

Great point! It really depends on your infra. If you have idle GPUs, using pricey APIs for every check is overkill. It's better to build a 'Golden Dataset' and fine-tune a smaller model (like Llama 3), or use a specialized LLM as a dedicated judge. Also, throwing GPTCache into the mix is a game-changer: if a similar response has been judged before, just pull it from the cache and save the tokens. I think this will be useful for you: link

Lennon (lennon_developer):

Yes, it's really expensive. I don't think many companies can afford it.

astronaut (astronaut27):

Yes, you're right. I tried to describe a solution in the comment above.

Lennon (lennon_developer):

Regarding Prometheus-2: it's an excellent model, but out of the box it can still be a bit of a pain in narrow niches (medicine, law, etc.).

astronaut (astronaut27):

True. In complex niches, the 'judge' is just a tool, not a silver bullet. The secret lies in Custom Rubrics and providing Ground Truth answers. If you give Prometheus-2 a clear scoring scale and a few domain-specific examples, it hits the mark much more consistently.