James Lee

Posted on May 22

Full Observability in Istio: Metrics with Prometheus/Grafana + Distributed Tracing with Jaeger

#devops #kubernetes #microservices #monitoring

Observability is the ability to understand the internal state of a system from its external outputs. In a microservices architecture — where a single API call may fan out to dozens of services and hundreds of DB queries — observability is not optional. It's survival.

Istio's sidecars emit data for all three pillars of observability. A common tool pairing for each:

Pillar	Tool	Best For
Metrics	Prometheus + Grafana	Trend analysis, QPS, latency, error rate
Distributed Tracing	Jaeger	Root cause analysis of slow/failed requests
Logging	ELK Stack	Detailed per-request log investigation

This article covers the first two in depth.

Part 1: Metrics with Prometheus + Grafana

Why Prometheus Over StatsD or InfluxDB?

Three metrics systems are commonly used in modern infrastructure:

StatsD + Graphite    →  UDP push model, lightweight, limited ecosystem
InfluxDB + Telegraf  →  Rich system-level plugins, custom app metrics need manual work
Prometheus           →  Pull model, native K8s/Consul service discovery, full alerting stack

Prometheus wins in cloud-native environments for two reasons:

Pull model — Prometheus scrapes targets over HTTP. This simplifies client code and makes debugging trivial (just curl the /metrics endpoint).
Native service discovery — No static IP lists. Prometheus discovers targets from Kubernetes, Consul, or DNS automatically.
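To make the pull model concrete, here is a minimal sketch of what a scrape target exposes, using only the Python standard library. The metric name `demo_http_requests_total`, the `REQUEST_COUNTS` values, and the handler class are illustrative assumptions; a real service would normally use a Prometheus client library instead of formatting text by hand.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-process counters; a real service would use a client library.
REQUEST_COUNTS = {("GET", "200"): 42, ("GET", "500"): 3}


def render_metrics(counts):
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE demo_http_requests_total counter"]
    for (method, code), value in sorted(counts.items()):
        lines.append(
            f'demo_http_requests_total{{method="{method}",code="{code}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves GET /metrics so a scraper (or curl) can pull the current values."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUEST_COUNTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

Passing `MetricsHandler` to `http.server.HTTPServer` and calling `serve_forever()` would make the endpoint reachable with curl; the server start is left out here so the sketch has no side effects.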

Prometheus Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Prometheus Server                        │
│   ┌──────────────┐    ┌──────────────┐    ┌─────────────┐  │
│   │   Scraper    │    │    TSDB       │    │ Rule Engine │  │
│   │  (pull HTTP) │    │  (storage)   │    │  (alerts)   │  │
│   └──────┬───────┘    └──────────────┘    └──────┬──────┘  │
└──────────┼────────────────────────────────────────┼─────────┘
           │ scrape /metrics                        │ fire alerts
           ▼                                        ▼
┌─────────────────────┐                  ┌──────────────────────┐
│  Service Discovery  │                  │    Alertmanager      │
│  (K8s / Consul/DNS) │                  │  (dedup, group,      │
└─────────────────────┘                  │   notify: DingTalk / │
                                         │   WeCom / Email)     │
┌─────────────────────┐                  └──────────────────────┘
│    PushGateway      │ ← for short-lived jobs (batch, PHP scripts)
└─────────────────────┘
┌─────────────────────┐
│     Exporters       │ ← Node Exporter, Blackbox Exporter, etc.
└─────────────────────┘

PushGateway exists specifically for short-lived processes (batch jobs, PHP scripts) that won't survive long enough to be scraped.
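As a rough sketch of how a short-lived job could hand its metrics to a Pushgateway before exiting, the snippet below builds the HTTP request without sending it. The gateway address, job name, and metric name are made up for illustration; the `/metrics/job/<job>` path is the Pushgateway's push endpoint.

```python
import urllib.parse
import urllib.request


def build_push_request(gateway_url, job, metrics_text):
    """Build (but do not send) a Pushgateway request for one job's metrics."""
    job_path = urllib.parse.quote(job, safe="")
    url = gateway_url.rstrip("/") + "/metrics/job/" + job_path
    return urllib.request.Request(
        url,
        data=metrics_text.encode("utf-8"),
        method="PUT",  # PUT replaces every metric previously pushed for this job
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


# Hypothetical gateway address and metric name, for illustration only.
req = build_push_request(
    "http://pushgateway.example:9091",
    "nightly_batch",
    "batch_last_success_timestamp_seconds 1609241265\n",
)
# urllib.request.urlopen(req) would send it; omitted so the sketch has no side effects.
```

Prometheus then scrapes the gateway like any other target, so the job's last pushed values stay visible after the process has exited.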

Prometheus Data Model

Every metric is a combination of a name + labels:

negri_http_request_total{client="serviceA", code="200", exported_service="serviceB", path="/ping"}

negri_http_request_total — metric name (describes what it measures)
{client, code, exported_service, path} — labels (dimensions for filtering/aggregation)
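One way to picture the name-plus-labels model: every distinct label combination is its own time series, and queries aggregate across whichever labels you choose to drop. The sketch below uses made-up sample values, and `series_id` and `sum_by` are hypothetical helper names.

```python
from collections import defaultdict

# Each sample: (metric name, labels, value). Values are made up for illustration.
samples = [
    ("negri_http_request_total", {"client": "serviceA", "code": "200", "path": "/ping"}, 120.0),
    ("negri_http_request_total", {"client": "serviceA", "code": "500", "path": "/ping"}, 4.0),
    ("negri_http_request_total", {"client": "serviceC", "code": "200", "path": "/ping"}, 30.0),
]


def series_id(name, labels):
    """A series is identified by its name plus its full, sorted label set."""
    return (name, tuple(sorted(labels.items())))


def sum_by(samples, name, label):
    """Rough equivalent of PromQL `sum by (<label>) (<name>)`."""
    totals = defaultdict(float)
    for sample_name, labels, value in samples:
        if sample_name == name:
            totals[labels.get(label, "")] += value
    return dict(totals)
```

For example, `sum_by(samples, "negri_http_request_total", "code")` collapses the `client` and `path` dimensions and leaves one total per status code.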

The Four Metric Types

Type	Computed By	Best For	Example
Counter	Server-side	QPS, total requests (monotonically increasing)	`negri_http_request_total`
Gauge	Client-side	Instantaneous values, event flags (circuit breaker open/closed)	`degrade_events{event="eventBreakerOpenStatus"}`
Histogram	Server-side	Latency percentiles (p90/p95/p99), response size distribution	`negri_http_response_time_us_bucket{le="0.5"}`
Summary	Client-side	Precise percentiles, but less flexible at query time	Similar to Histogram

Counter vs Gauge: If your value only goes up, use a Counter; Prometheus derives rates such as QPS from it at query time. If it can go up and down, use a Gauge.
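Because a counter only ever increases, a rate such as QPS can be derived from successive scraped values. The function below is a simplified sketch of that idea, including the handling of a counter that restarts from zero; `counter_rate` is a hypothetical name, and Prometheus's own `rate()` additionally extrapolates to the edges of the query window.

```python
def counter_rate(samples):
    """Average per-second increase over (timestamp_seconds, value) samples.

    A drop in value is treated as a counter reset (process restart), so the
    post-reset value counts as new increase. Prometheus's rate() additionally
    extrapolates to the window edges, which this sketch leaves out.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0
```

A gauge needs no such treatment: its latest value is already the answer.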

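Histogram vs Summary: Histogram is more flexible. Percentiles are computed by Prometheus at query time with histogram_quantile() (Grafana only displays the result), so they can be aggregated across instances, with accuracy bounded by the bucket boundaries. Summary is more precise, but its percentiles are fixed at instrumentation time and cannot be meaningfully aggregated across instances.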
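The query-time percentile calculation for histograms can be sketched as linear interpolation inside the bucket that contains the target rank. This is an approximation of what Prometheus's `histogram_quantile()` does, written under the assumption that observations are spread evenly within each bucket; the bucket values in the usage note are invented.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound and ending with the +Inf bucket. Values are assumed to be
    spread evenly inside each bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into the +Inf bucket
            width = upper_bound - lower_bound
            return lower_bound + width * (rank - lower_count) / (count - lower_count)
        lower_bound, lower_count = upper_bound, count
    return lower_bound
```

With buckets `[(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]`, the 95th percentile falls halfway through the 0.5 to 1.0 bucket. The estimate can never be more precise than the bucket layout allows, which is the trade-off against a Summary.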

Grafana Dashboard: What to Look At

Istio ships with pre-built Grafana dashboards. Here's how to navigate them effectively.

Services Dashboard

Key filter parameters:

Parameter	What It Means
`Service`	The Kubernetes service name (e.g. `details.default.svc.cluster.local`)
`Client Workload Namespace`	The K8s namespace where the client lives
`Client Workload`	The workload making the call, e.g. `productpage-v1` calling `details`
`Service Workload`	The specific version of the service — e.g. `reviews-v1`, `reviews-v2`, `reviews-v3`

Why Client Workload matters: Traditional microservice SDKs often struggle to reliably inject the caller's service name into metrics. Developers might hardcode the wrong name or copy-paste it from another service. Istio removes that failure mode: the sidecars attach the caller's workload identity to the metrics automatically, so nobody has to set it by hand.

Key panels in the dashboard:

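QPS (client-side vs server-side): reported separately by the caller's sidecar and the callee's sidecar. The client-side view also counts requests that failed before reaching the server, and its latency figures include network time.
Success Rate: percentage of non-5xx responses
Latency: p50/p90/p99 breakdown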
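For readers who want to reproduce these panels outside the stock dashboards, the sketch below builds PromQL strings over Istio's standard metrics (`istio_requests_total` and `istio_request_duration_milliseconds_bucket`). The Python function names are hypothetical, and the exact queries behind the shipped dashboards may differ from these.

```python
def _selector(labels):
    """Format a dict of label matchers as a PromQL selector body."""
    return ",".join(f'{key}="{value}"' for key, value in labels.items())


def qps_query(service, reporter="destination", window="5m"):
    """Requests per second for one destination service."""
    sel = _selector({"reporter": reporter, "destination_service": service})
    return f"sum(rate(istio_requests_total{{{sel}}}[{window}]))"


def success_rate_query(service, reporter="destination", window="5m"):
    """Share of responses whose status code is not 5xx."""
    sel = _selector({"reporter": reporter, "destination_service": service})
    ok = f'sum(rate(istio_requests_total{{{sel},response_code!~"5.."}}[{window}]))'
    return f"{ok} / {qps_query(service, reporter, window)}"


def latency_query(service, quantile=0.99, reporter="destination", window="5m"):
    """Latency quantile in milliseconds from Istio's duration histogram."""
    sel = _selector({"reporter": reporter, "destination_service": service})
    return (
        f"histogram_quantile({quantile}, sum by (le) "
        f"(rate(istio_request_duration_milliseconds_bucket{{{sel}}}[{window}])))"
    )
```

Switching `reporter` between `"source"` and `"destination"` selects the client-side or server-side view of the same traffic.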

Workload Dashboard

The Workload dashboard adds a critical dimension: inbound vs outbound traffic.

reviews-v3 Workload
├── INBOUND  ← traffic FROM productpage-v1
└── OUTBOUND ← traffic TO ratings.default.svc.cluster.local

This dual-direction view is invaluable for root cause analysis. When reviews-v3 is slow, you can immediately see:

Is it slow because incoming traffic is high? (pressure from its callers)
Or is it slow because outgoing calls to ratings are slow? (a slow dependency)

Part 2: Distributed Tracing with Jaeger

The Three Pillars of Observability — Compared

Metrics  →  "Something is wrong with reviews-v3 (p99 latency spiked to 2s)"
Tracing  →  "The spike is caused by the ratings service call at step 4 of this specific request"
Logging  →  "Here's the exact SQL query and stack trace that caused it"

Each tool answers a different question. Tracing is the bridge between "something is wrong" and "here's exactly why."

How Distributed Tracing Works (Dapper Model)

Tracing originates from Google's Dapper paper. The core concepts:

TraceId: 5f1db306ef459b2f  (unique per request, generated at the gateway)
│
├── Span A  (SpanId: aaa, ParentSpanId: null)  ← root span (productpage)
│     │
│     ├── Span B  (SpanId: bbb, ParentSpanId: aaa)  ← details service call
│     │
│     └── Span C  (SpanId: ccc, ParentSpanId: aaa)  ← reviews service call
│           │
│           └── Span D  (SpanId: ddd, ParentSpanId: ccc)  ← ratings service call
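A tracing backend receives spans as a flat, unordered stream and rebuilds this tree from the parent pointers. A minimal sketch of that reconstruction, with illustrative field names (`span_id`, `parent_id`, `name`) that are not tied to any particular backend:

```python
from collections import defaultdict

# Flat span records mirroring the tree above; field names are illustrative.
spans = [
    {"span_id": "ccc", "parent_id": "aaa", "name": "reviews"},
    {"span_id": "aaa", "parent_id": None, "name": "productpage"},
    {"span_id": "ddd", "parent_id": "ccc", "name": "ratings"},
    {"span_id": "bbb", "parent_id": "aaa", "name": "details"},
]


def render_trace(spans):
    """Rebuild the span tree from parent pointers and return indented lines."""
    children = defaultdict(list)
    for span in spans:
        children[span["parent_id"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in sorted(children[parent_id], key=lambda s: s["span_id"]):
            lines.append("  " * depth + span["name"])
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # root spans have no parent
    return lines
```

Arrival order does not matter: the shared TraceId groups the spans, and the ParentSpanId alone determines the shape.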

Trace context is propagated via HTTP headers:

Header	Purpose
`x-request-id`	Unique request ID for log correlation
`x-b3-traceid`	64- or 128-bit global trace identifier (hex-encoded)
`x-b3-spanid`	Current span position in the trace tree
`x-b3-parentspanid`	Parent span (absent = root node)
`x-b3-sampled`	Sampling flag (`1` = record this trace)
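To illustrate how these headers relate across one hop, here is a sketch of what a tracer (such as the sidecar) does when it starts a child span for an outbound call. The function name `child_b3_headers` is hypothetical, and application code running in the mesh normally just forwards headers rather than minting new span IDs itself.

```python
import secrets


def child_b3_headers(incoming):
    """Derive B3 headers for an outbound call made while handling `incoming`."""
    headers = {
        # The trace id is unchanged across the whole trace.
        "x-b3-traceid": incoming["x-b3-traceid"],
        # The span that received the request becomes the parent.
        "x-b3-parentspanid": incoming["x-b3-spanid"],
        # A fresh 64-bit span id, hex-encoded as 16 characters.
        "x-b3-spanid": secrets.token_hex(8),
    }
    if "x-b3-sampled" in incoming:
        # The sampling decision is made once and inherited downstream.
        headers["x-b3-sampled"] = incoming["x-b3-sampled"]
    return headers
```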

A Real Trace Record

{
  "traceID": "5f1db306ef459b2f",
  "spanID": "5f1db306ef459b2f",
  "parentSpanID": "0",
  "operationName": "/ping",
  "duration": 2065,
  "startTime": 1609241265147010,
  "process": {
    "serviceName": "negri.sidecarserverlistener.myapp",
    "tags": { "hostname": "MacBook-Pro-3.local", "ip": "192.168.1.88" }
  },
  "tags": {
    "http.method": "GET",
    "http.status_code": "200",
    "http.url": "/ping",
    "peer.address": "http://127.0.0.1:8888",
    "span.kind": "server"
  }
}
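The raw fields are easier to read once converted. The sketch below assumes `startTime` and `duration` are in microseconds, as in Jaeger's JSON output, and that a `parentSpanID` of `"0"` marks a root span; `summarize` is a hypothetical helper.

```python
import json
from datetime import datetime, timedelta, timezone

# Subset of the record above.
record = json.loads("""
{
  "traceID": "5f1db306ef459b2f",
  "spanID": "5f1db306ef459b2f",
  "parentSpanID": "0",
  "operationName": "/ping",
  "duration": 2065,
  "startTime": 1609241265147010
}
""")

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def summarize(span):
    """Convert the raw fields into human-friendly values."""
    return {
        "is_root": span["parentSpanID"] == "0",
        # Integer microseconds added to the epoch avoid float rounding.
        "started_at": (EPOCH + timedelta(microseconds=span["startTime"])).isoformat(),
        "duration_ms": span["duration"] / 1000,
    }
```

Under those assumptions this span is a root span that started on 2020-12-29 (UTC) and took about 2 ms.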

Jaeger Architecture

Application Code
     │ in-process SDK call
     ▼
┌──────────────┐ UDP ┌──────────────────┐     ┌──────────────┐
│ jaeger-client│────▶│  jaeger-agent    │────▶│jaeger-       │
│ (SDK)        │     │  (per-host       │     │collector     │
└──────────────┘     │   daemon)        │     │(validate +   │
                     └──────────────────┘     │ process)     │
                                              └──────┬───────┘
                                                     │
                                                     ▼
                                             ┌────────────────┐
                                             │   jaeger-db    │
                                             │ (Cassandra /   │
                                             │ Elasticsearch) │
                                             └──────┬─────────┘
                                                    │
                                                    ▼
                                             ┌──────────────┐
                                             │ jaeger-query │
                                             │ (UI + API)   │
                                             └──────────────┘

Why jaeger-agent? Same philosophy as Istio's sidecar — it offloads service discovery and routing complexity from the client SDK. The app just sends UDP to localhost and forgets about it.

The One Thing Your Code Must Do

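Envoy automatically generates spans and adds B3 headers to the requests that pass through it. What it cannot do is tie an outbound request your application makes to the inbound request that triggered it. Your application code must therefore copy the trace headers from each incoming request onto the outgoing calls it makes while handling that request.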

Here's the pattern from the Bookinfo sample app (Python):

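# Excerpt in the style of Bookinfo's productpage service. `app`, `tracer`, the
# `trace` decorator and `getProductDetails` are defined elsewhere in that file.
# The imports below are assumed from the OpenTracing-era version of the sample.
from flask import request
from opentracing.propagation import Format
from opentracing_instrumentation.request_context import get_current_span

@app.route('/productpage')
@trace()
def front():
    product_id = 0  # placeholder value for illustration
    headers = getForwardHeaders(request)  # collect the trace headers to forward
    details = getProductDetails(product_id, headers)  # pass them to the outbound call
    # rendering the page from `details` is omitted here

def getForwardHeaders(request):
    headers = {}

    # B3 headers: serialized from the current span by the OpenTracing tracer
    span = get_current_span()
    carrier = {}
    tracer.inject(span_context=span.context, format=Format.HTTP_HEADERS, carrier=carrier)
    headers.update(carrier)

    # Other headers that must be copied from the incoming request by hand
    incoming_headers = [
        'x-request-id',
        'x-ot-span-context',
        'traceparent',
        'tracestate',
        'x-cloud-trace-context',
        'grpc-trace-bin',
        'user-agent',
    ]
    for h in incoming_headers:
        val = request.headers.get(h)
        if val is not None:
            headers[h] = val

    return headers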

Key insight: Envoy handles span creation and injection automatically. Your app only needs to forward the headers it receives. This is a much lighter instrumentation burden than traditional APM agents.
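The same forwarding idea can be sketched without any tracing library, by copying a fixed allowlist of headers from the incoming request onto outbound calls. This variant copies the B3 headers verbatim instead of going through a tracer, and `TRACE_HEADERS` and `forward_headers` are hypothetical names; adjust the list to the propagation format your mesh is configured for.

```python
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "traceparent",
    "tracestate",
)


def forward_headers(incoming):
    """Return the subset of `incoming` headers to copy onto outbound calls.

    Header names are matched case-insensitively and returned in lowercase.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {name: lowered[name] for name in TRACE_HEADERS if name in lowered}
```

The returned dict can be passed as the headers of whatever HTTP client the service uses for its outbound calls.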

What Jaeger Shows You

Trace List View — All traces for a service, sortable by duration. Immediately surfaces the slowest requests.

Trace Detail View — Full waterfall of every span in a single request:

productpage  ──────────────────────────────  450ms
  details      ───  45ms
  reviews        ──────────────────  380ms
    ratings        ────────────  310ms  ← bottleneck identified
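Reading the waterfall amounts to asking where time was actually spent rather than merely waited on. A small sketch of that arithmetic, using the durations above and assuming each span's children run one after another (overlapping children would need interval math instead):

```python
# (name, parent, duration_ms) taken from the waterfall above.
waterfall = [
    ("productpage", None, 450),
    ("details", "productpage", 45),
    ("reviews", "productpage", 380),
    ("ratings", "reviews", 310),
]


def self_times(spans):
    """Time each span spent in its own work, excluding its direct children."""
    own = {name: duration for name, _, duration in spans}
    for _, parent, duration in spans:
        if parent is not None:
            own[parent] -= duration
    return own


def bottleneck(spans):
    """Name of the span with the largest self time."""
    own = self_times(spans)
    return max(own, key=own.get)
```

Here `reviews` looks slow at 380 ms, but only 70 ms of that is its own work; the remaining 310 ms is spent inside `ratings`.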

System Architecture View — Auto-generated service dependency graph derived from trace data. No manual diagram maintenance needed.
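The dependency graph falls out of the same parent/child links: every span whose parent belongs to a different service implies a caller-to-callee edge. A minimal sketch with illustrative span tuples (this is the idea, not Jaeger's actual implementation):

```python
from collections import Counter

# Illustrative spans: (span_id, parent_span_id, service).
spans = [
    ("aaa", None, "productpage"),
    ("bbb", "aaa", "details"),
    ("ccc", "aaa", "reviews"),
    ("ddd", "ccc", "ratings"),
]


def dependency_edges(spans):
    """Count caller -> callee service pairs implied by parent/child spans."""
    service_of = {span_id: service for span_id, _, service in spans}
    edges = Counter()
    for _, parent_id, service in spans:
        caller = service_of.get(parent_id)
        if caller is not None and caller != service:
            edges[(caller, service)] += 1
    return edges
```

Aggregating these counts over many traces yields the edges and call volumes that the graph view draws.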

Summary: Metrics vs Tracing — When to Use Which

Scenario	Use
"Is the service healthy right now?"	Metrics (Grafana dashboard)
"Has latency been trending up this week?"	Metrics (Prometheus query)
"Why did this specific request take 3 seconds?"	Tracing (Jaeger)
"Which service is the bottleneck in my call chain?"	Tracing (Jaeger waterfall)
"What exact error did service X return at 14:32?"	Logging (ELK)

The power of Istio is that you get all of this without modifying your application — except for the minimal header forwarding shown above. The sidecar handles metric collection, span generation, and header injection automatically.

💻 Explore the full implementation:
github.com/muzinan123/servicemesh
