DEV Community

James Lee

Full Observability in Istio: Metrics with Prometheus/Grafana + Distributed Tracing with Jaeger

Observability is the ability to understand the internal state of a system from its external outputs. In a microservices architecture — where a single API call may fan out to dozens of services and hundreds of DB queries — observability is not optional. It's survival.

Istio generates telemetry for all three pillars of observability, and each pillar is typically consumed through its own tool:

| Pillar | Tool | Best For |
|---|---|---|
| Metrics | Prometheus + Grafana | Trend analysis, QPS, latency, error rate |
| Distributed Tracing | Jaeger | Root cause analysis of slow/failed requests |
| Logging | ELK Stack | Detailed per-request log investigation |

This article covers the first two in depth.


Part 1: Metrics with Prometheus + Grafana

Why Prometheus Over StatsD or InfluxDB?

Three metrics systems are commonly used in modern infrastructure:

StatsD + Graphite    →  UDP push model, lightweight, limited ecosystem
InfluxDB + Telegraf  →  Rich system-level plugins, custom app metrics need manual work
Prometheus           →  Pull model, native K8s/Consul service discovery, full alerting stack

Prometheus wins in cloud-native environments for two reasons:

  1. Pull model — Prometheus scrapes targets over HTTP. This simplifies client code and makes debugging trivial (just curl the /metrics endpoint).
  2. Native service discovery — No static IP lists. Prometheus discovers targets from Kubernetes, Consul, or DNS automatically.
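To make the pull model concrete, here is a minimal sketch using only the Python standard library: a process exposes its counters as plain text on /metrics, and a scraper fetches them over HTTP. The metric name demo_http_request_total, the REQUEST_TOTAL dictionary, and the helper names are invented for illustration; a real service would normally use an official Prometheus client library rather than hand-rolling the endpoint.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by HTTP status code.
REQUEST_TOTAL = {"200": 3, "500": 1}


def render_metrics() -> str:
    """Render the counters in the Prometheus text exposition format."""
    lines = ["# TYPE demo_http_request_total counter"]
    for code, value in sorted(REQUEST_TOTAL.items()):
        lines.append(f'demo_http_request_total{{code="{code}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging in this demo


def scrape(host: str, port: int) -> str:
    """Pull the current metric values, as a scraper (or curl) would."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        return conn.getresponse().read().decode("utf-8")
    finally:
        conn.close()


def demo() -> str:
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)  # port 0 = any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape("127.0.0.1", server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
```

Calling demo() returns the same text you would see when curling the endpoint by hand, which is why debugging a pull-based target tends to be straightforward.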

Prometheus Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Prometheus Server                        │
│   ┌──────────────┐    ┌──────────────┐    ┌─────────────┐  │
│   │   Scraper    │    │    TSDB       │    │ Rule Engine │  │
│   │  (pull HTTP) │    │  (storage)   │    │  (alerts)   │  │
│   └──────┬───────┘    └──────────────┘    └──────┬──────┘  │
└──────────┼────────────────────────────────────────┼─────────┘
           │ scrape /metrics                        │ fire alerts
           ▼                                        ▼
┌─────────────────────┐                  ┌──────────────────────┐
│  Service Discovery  │                  │    Alertmanager      │
│  (K8s / Consul/DNS) │                  │  (dedup, group,      │
└─────────────────────┘                  │   notify: DingTalk / │
                                         │   WeCom / Email)     │
┌─────────────────────┐                  └──────────────────────┘
│    PushGateway      │ ← for short-lived jobs (batch, PHP scripts)
└─────────────────────┘
┌─────────────────────┐
│     Exporters       │ ← Node Exporter, Blackbox Exporter, etc.
└─────────────────────┘

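PushGateway exists for short-lived processes (batch jobs, PHP scripts) that may exit before Prometheus gets a chance to scrape them. Such processes push their metrics to the PushGateway, and Prometheus scrapes the gateway instead.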

Prometheus Data Model

Every time series is identified by a metric name plus a set of labels:

negri_http_request_total{client="serviceA", code="200", exported_service="serviceB", path="/ping"}
  • negri_http_request_total — metric name (describes what it measures)
  • {client, code, exported_service, path} — labels (dimensions for filtering/aggregation)
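As a rough illustration of how labels act as dimensions, the sketch below filters samples by label value and then sums them by another label, which is conceptually what a label matcher plus a sum-by aggregation does in a query. The sample values and the helper names select and sum_by are made up for this example; this is not how Prometheus stores or queries data internally.

```python
from collections import defaultdict

# (metric name, labels, value); the values are invented for illustration.
SAMPLES = [
    ("negri_http_request_total",
     {"client": "serviceA", "code": "200", "exported_service": "serviceB", "path": "/ping"},
     120.0),
    ("negri_http_request_total",
     {"client": "serviceA", "code": "500", "exported_service": "serviceB", "path": "/ping"},
     5.0),
    ("negri_http_request_total",
     {"client": "serviceC", "code": "200", "exported_service": "serviceB", "path": "/ping"},
     40.0),
]


def select(samples, name, **matchers):
    """Keep samples with the given name whose labels equal every matcher."""
    return [
        (n, labels, value)
        for n, labels, value in samples
        if n == name and all(labels.get(k) == v for k, v in matchers.items())
    ]


def sum_by(samples, label):
    """Sum sample values grouped by one label, dropping all other labels."""
    totals = defaultdict(float)
    for _, labels, value in samples:
        totals[labels.get(label, "")] += value
    return dict(totals)


# Requests made by serviceA, broken down by status code.
by_code = sum_by(select(SAMPLES, "negri_http_request_total", client="serviceA"), "code")
```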

The Four Metric Types

| Type | Computed By | Best For | Example |
|---|---|---|---|
| Counter | Server-side | QPS, total requests (monotonically increasing) | negri_http_request_total |
| Gauge | Client-side | Instantaneous values, event flags (circuit breaker open/closed) | degrade_events{event="eventBreakerOpenStatus"} |
| Histogram | Server-side | Latency percentiles (p90/p95/p99), response size distribution | negri_http_response_time_us_bucket{le="0.5"} |
| Summary | Client-side | Precise percentiles, but less flexible at query time | Similar to Histogram |

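Counter vs Gauge: If the value only ever increases (apart from resets when the process restarts), use a Counter; rates such as QPS are then derived on the Prometheus server at query time. If the value can go up and down, use a Gauge.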
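The following sketch shows the idea behind deriving a per-second rate from raw counter samples, including the reset case. It is a simplified approximation for illustration only: the function name counter_rate and the sample data are assumptions, and Prometheus's own rate() additionally extrapolates to the window boundaries.

```python
def counter_rate(samples):
    """Approximate per-second rate from (timestamp_seconds, value) counter samples.

    Samples must be ordered by time. A drop in value is treated as a counter
    reset (process restart), so the new value counts as the increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
qps = counter_rate(scrapes)  # total increase of 100 over 60 seconds
```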

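Histogram vs Summary: A Histogram exports bucket counts, so percentiles are estimated at query time (for example with PromQL's histogram_quantile(), whose result Grafana then displays) and can be aggregated across instances. A Summary computes percentiles inside the client, which can be more precise, but the percentiles are fixed at instrumentation time and cannot be meaningfully aggregated across instances.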
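To show why histogram percentiles are estimates, here is a sketch of quantile estimation from cumulative buckets using linear interpolation within the matching bucket. It follows the same general approach as histogram_quantile() but is a simplified reimplementation written for this article, and the bucket bounds and counts are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound,
    with math.inf as the last upper bound.
    """
    if not buckets or not 0 <= q <= 1:
        raise ValueError("need at least one bucket and 0 <= q <= 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # No finite upper edge (or an empty bucket): fall back to the lower edge.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical latency buckets in seconds.
latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
p95 = histogram_quantile(0.95, latency_buckets)
```

The accuracy of the estimate depends on how well the bucket boundaries match the real distribution, which is the trade-off against a client-side Summary.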


Grafana Dashboard: What to Look At

Istio ships with pre-built Grafana dashboards. Here's how to navigate them effectively.

Services Dashboard

Key filter parameters:

| Parameter | What It Means |
|---|---|
| Service | The Kubernetes service name (e.g. details.default.svc.cluster.local) |
| Client Workload Namespace | The K8s namespace where the client lives |
| Client Workload | The calling workload, e.g. productpage-v1 calling details |
| Service Workload | The specific version of the service, e.g. reviews-v1, reviews-v2, reviews-v3 |

Why Client Workload matters: Traditional microservice SDKs often struggle to reliably inject the caller's service name into metrics. Developers might hardcode the wrong name or copy-paste it from another service. Istio removes this failure mode: the sidecars attach the caller's identity to the metrics automatically, so nobody has to type a service name by hand.

Key panels in the dashboard:

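  • QPS (client-side vs server-side): the client-side view is closer to what the caller experiences, since it also counts requests that fail before reaching the server, and its latency figures include network time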
  • Success Rate — percentage of non-5xx responses
  • Latency — p50/p90/p99 breakdown
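For reference, the success rate described above can be computed from per-status-code request counts as follows. This is a small sketch with invented numbers; the function name success_rate is an assumption, and the dashboards derive the figure from Prometheus queries rather than from code like this.

```python
def success_rate(requests_by_code):
    """Fraction of responses whose status code is not 5xx."""
    total = sum(requests_by_code.values())
    if total == 0:
        return 1.0  # no traffic: nothing has failed
    failed = sum(n for code, n in requests_by_code.items() if code.startswith("5"))
    return (total - failed) / total


# Hypothetical counts over some time window. Note that 404 counts as a success here.
window = {"200": 950, "404": 30, "503": 20}
rate = success_rate(window)
```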

Workload Dashboard

The Workload dashboard adds a critical dimension: inbound vs outbound traffic.

reviews-v3 Workload
├── INBOUND  ← traffic FROM productpage-v1
└── OUTBOUND ← traffic TO ratings.default.svc.cluster.local

This dual-direction view is invaluable for root cause analysis. When reviews-v3 is slow, you can immediately see:

  • Is it slow because incoming traffic is high? (pressure from callers)
  • Or is it slow because outgoing calls to ratings are slow? (a slow dependency)

Part 2: Distributed Tracing with Jaeger

The Three Pillars of Observability — Compared

Metrics  →  "Something is wrong with reviews-v3 (p99 latency spiked to 2s)"
Tracing  →  "The spike is caused by the ratings service call at step 4 of this specific request"
Logging  →  "Here's the exact SQL query and stack trace that caused it"

Each tool answers a different question. Tracing is the bridge between "something is wrong" and "here's exactly why."

How Distributed Tracing Works (Dapper Model)

Tracing originates from Google's Dapper paper. The core concepts:

TraceId: 5f1db306ef459b2f  (unique per request, generated at the gateway)
│
├── Span A  (SpanId: aaa, ParentSpanId: null)  ← root span (productpage)
│     │
│     ├── Span B  (SpanId: bbb, ParentSpanId: aaa)  ← details service call
│     │
│     └── Span C  (SpanId: ccc, ParentSpanId: aaa)  ← reviews service call
│           │
│           └── Span D  (SpanId: ddd, ParentSpanId: ccc)  ← ratings service call
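The parent/child relationship is all a tracing backend needs to rebuild the call tree. The sketch below reconstructs the tree shown above from a flat list of spans; the dictionary field names and the render_trace helper are assumptions made for this example, not Jaeger's storage schema.

```python
from collections import defaultdict

# Flat span list, as a collector might receive it (order does not matter).
SPANS = [
    {"spanId": "ddd", "parentSpanId": "ccc", "name": "ratings"},
    {"spanId": "aaa", "parentSpanId": None, "name": "productpage"},
    {"spanId": "bbb", "parentSpanId": "aaa", "name": "details"},
    {"spanId": "ccc", "parentSpanId": "aaa", "name": "reviews"},
]


def render_trace(spans):
    """Return the spans as indented lines, children nested under parents."""
    children = defaultdict(list)
    for span in spans:
        children[span["parentSpanId"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            lines.append("  " * depth + f'{span["name"]} ({span["spanId"]})')
            walk(span["spanId"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines


tree = render_trace(SPANS)
```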

Trace context is propagated via HTTP headers:

| Header | Purpose |
|---|---|
| x-request-id | Unique request ID for log correlation |
| x-b3-traceid | Global trace identifier (64-bit or 128-bit, hex-encoded) |
| x-b3-spanid | 64-bit ID of the current span in the trace tree |
| x-b3-parentspanid | ID of the parent span (absent = root span) |
| x-b3-sampled | Sampling flag (1 = record this trace, 0 = do not) |
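To illustrate how these headers relate across one hop, the sketch below derives the B3 headers for an outbound call that is a child of the incoming span: the trace ID is kept, the incoming span ID becomes the parent, and a fresh span ID is generated. The function name child_b3_headers is hypothetical, and this is a conceptual model of what a tracer or proxy does, not Envoy's actual implementation.

```python
import secrets


def child_b3_headers(incoming):
    """Build B3 headers for a child span from the incoming request headers."""
    lower = {k.lower(): v for k, v in incoming.items()}  # header names are case-insensitive

    headers = {
        # Reuse the trace ID, or start a new trace if the request carried none.
        "x-b3-traceid": lower.get("x-b3-traceid") or secrets.token_hex(8),
        # Every span gets its own 64-bit ID (16 hex characters).
        "x-b3-spanid": secrets.token_hex(8),
    }
    if lower.get("x-b3-spanid"):
        headers["x-b3-parentspanid"] = lower["x-b3-spanid"]
    # Carry the sampling decision and request ID through unchanged, if present.
    for name in ("x-b3-sampled", "x-request-id"):
        if name in lower:
            headers[name] = lower[name]
    return headers


outbound = child_b3_headers({
    "X-B3-TraceId": "5f1db306ef459b2f",
    "X-B3-SpanId": "aaaaaaaaaaaaaaaa",
    "X-B3-Sampled": "1",
})
```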

A Real Trace Record

{
  "traceID": "5f1db306ef459b2f",
  "spanID": "5f1db306ef459b2f",
  "parentSpanID": "0",
  "operationName": "/ping",
  "duration": 2065,
  "startTime": 1609241265147010,
  "process": {
    "serviceName": "negri.sidecarserverlistener.myapp",
    "tags": { "hostname": "MacBook-Pro-3.local", "ip": "192.168.1.88" }
  },
  "tags": {
    "http.method": "GET",
    "http.status_code": "200",
    "http.url": "/ping",
    "peer.address": "http://127.0.0.1:8888",
    "span.kind": "server"
  }
}
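When reading such a record, keep in mind that Jaeger's JSON model expresses startTime as microseconds since the Unix epoch and duration in microseconds. Assuming the record above follows that convention, a short sketch for turning the raw numbers into readable values might look like this (span_timing is a hypothetical helper name):

```python
from datetime import datetime, timedelta, timezone

# The timing fields from the record above.
record = {
    "traceID": "5f1db306ef459b2f",
    "spanID": "5f1db306ef459b2f",
    "parentSpanID": "0",
    "duration": 2065,
    "startTime": 1609241265147010,
}


def span_timing(span):
    """Return (start, end, duration_ms), assuming microsecond units."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    # Integer microseconds avoid the rounding a float timestamp would introduce.
    start = epoch + timedelta(microseconds=span["startTime"])
    end = start + timedelta(microseconds=span["duration"])
    return start, end, span["duration"] / 1000.0


start, end, duration_ms = span_timing(record)
```

Under that assumption, the /ping span above lasted roughly two milliseconds.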

Jaeger Architecture

Application Code
     │ UDP
     ▼
┌──────────────┐     ┌──────────────────┐     ┌──────────────┐
│ jaeger-client│────▶│  jaeger-agent    │────▶│jaeger-       │
│ (SDK)        │     │  (per-host       │     │collector     │
└──────────────┘     │   daemon)        │     │(validate +   │
                     └──────────────────┘     │ process)     │
                                              └──────┬───────┘
                                                     │
                                                     ▼
                                             ┌────────────────┐
                                             │   jaeger-db    │
                                             │ (Cassandra /   │
                                             │ Elasticsearch) │
                                             └──────┬─────────┘
                                                    │
                                                    ▼
                                             ┌──────────────┐
                                             │ jaeger-query │
                                             │ (UI + API)   │
                                             └──────────────┘

Why jaeger-agent? Same philosophy as Istio's sidecar — it offloads service discovery and routing complexity from the client SDK. The app just sends UDP to localhost and forgets about it.

The One Thing Your Code Must Do

Envoy automatically generates spans and trace headers for the requests it proxies. What it cannot do is tell which outbound call belongs to which inbound request, because that link exists only inside your application. Your code must therefore copy the trace headers from each incoming request onto the outbound requests it triggers.

Here's the pattern from the Bookinfo sample app (Python):

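@app.route('/productpage')
@trace()
def front():
    product_id = 0  # placeholder product ID
    headers = getForwardHeaders(request)  # collect the trace headers from the incoming request
    details = getProductDetails(product_id, headers)  # pass them along on the outbound call
    # ... remaining calls and page rendering omitted

The getForwardHeaders helper builds that header set: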
def getForwardHeaders(request):
    headers = {}

    # B3 headers — extracted automatically via OpenTracing library
    span = get_current_span()
    carrier = {}
    tracer.inject(span_context=span.context, format=Format.HTTP_HEADERS, carrier=carrier)
    headers.update(carrier)

    # Other headers that must be forwarded manually
    incoming_headers = [
        'x-request-id',
        'x-ot-span-context',
        'traceparent',
        'tracestate',
        'x-cloud-trace-context',
        'grpc-trace-bin',
        'user-agent',
    ]
    for h in incoming_headers:
        val = request.headers.get(h)
        if val is not None:
            headers[h] = val

    return headers

Key insight: Envoy handles span creation and injection automatically. Your app only needs to forward the headers it receives. This is a much lighter instrumentation burden than traditional APM agents.
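If you are not using a tracing library at all, the same effect can be approximated by copying the relevant headers verbatim. The sketch below is a framework-independent variant under that assumption; TRACE_HEADERS and forward_headers are names chosen for this example, and the exact header list your mesh needs depends on the tracer it is configured with.

```python
# Headers commonly forwarded for Envoy-based tracing (B3 plus W3C trace context).
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "x-ot-span-context",
    "traceparent",
    "tracestate",
)


def forward_headers(incoming):
    """Pick the trace headers out of an incoming request's headers.

    Works with any mapping of header name to value; lookup is case-insensitive.
    """
    lower = {k.lower(): v for k, v in incoming.items()}
    return {name: lower[name] for name in TRACE_HEADERS if name in lower}


outgoing = forward_headers({
    "X-Request-Id": "req-123",
    "X-B3-TraceId": "5f1db306ef459b2f",
    "X-B3-SpanId": "5f1db306ef459b2f",
    "X-B3-Sampled": "1",
    "Cookie": "session=secret",  # unrelated headers are not forwarded
})
```

The returned dictionary can be passed as the headers argument of whatever HTTP client the service uses for its outbound calls.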

What Jaeger Shows You

Trace List View — All traces for a service, sortable by duration. Immediately surfaces the slowest requests.

Trace Detail View — Full waterfall of every span in a single request:

productpage  ──────────────────────────────  450ms
  details      ───  45ms
  reviews        ──────────────────  380ms
    ratings        ────────────  310ms  ← bottleneck identified
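One way to read a waterfall like this is to compute each span's self time, meaning its duration minus the time spent in its direct children. The sketch below applies that to the numbers above; it assumes the children of a span run one after another (no overlap), and the field and function names are made up for illustration.

```python
from collections import defaultdict

# Durations in milliseconds, taken from the waterfall above.
SPANS = [
    {"id": "a", "parent": None, "name": "productpage", "ms": 450},
    {"id": "b", "parent": "a", "name": "details", "ms": 45},
    {"id": "c", "parent": "a", "name": "reviews", "ms": 380},
    {"id": "d", "parent": "c", "name": "ratings", "ms": 310},
]


def self_times(spans):
    """Map span name to duration minus the total duration of its direct children."""
    child_total = defaultdict(int)
    for span in spans:
        if span["parent"] is not None:
            child_total[span["parent"]] += span["ms"]
    # Clamp at zero in case children overlap and exceed the parent's duration.
    return {s["name"]: max(s["ms"] - child_total[s["id"]], 0) for s in spans}


times = self_times(SPANS)
bottleneck = max(times, key=times.get)
```

Here reviews looks slow at 380ms, but most of that time is spent waiting on ratings, which is where the self time concentrates.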

System Architecture View — Auto-generated service dependency graph derived from trace data. No manual diagram maintenance needed.


Summary: Metrics vs Tracing — When to Use Which

| Scenario | Use |
|---|---|
| "Is the service healthy right now?" | Metrics (Grafana dashboard) |
| "Has latency been trending up this week?" | Metrics (Prometheus query) |
| "Why did this specific request take 3 seconds?" | Tracing (Jaeger) |
| "Which service is the bottleneck in my call chain?" | Tracing (Jaeger waterfall) |
| "What exact error did service X return at 14:32?" | Logging (ELK) |

The power of Istio is that you get all of this without modifying your application — except for the minimal header forwarding shown above. The sidecar handles metric collection, span generation, and header injection automatically.


💻 Explore the full implementation:
github.com/muzinan123/servicemesh
