Observability is the ability to understand the internal state of a system from its external outputs. In a microservices architecture — where a single API call may fan out to dozens of services and hundreds of DB queries — observability is not optional. It's survival.
Istio provides three pillars of observability out of the box:
| Pillar | Tool | Best For |
|---|---|---|
| Metrics | Prometheus + Grafana | Trend analysis, QPS, latency, error rate |
| Distributed Tracing | Jaeger | Root cause analysis of slow/failed requests |
| Logging | ELK Stack | Detailed per-request log investigation |
This article covers the first two in depth.
Part 1: Metrics with Prometheus + Grafana
Why Prometheus Over StatsD or InfluxDB?
Three metrics systems are commonly used in modern infrastructure:
StatsD + Graphite → UDP push model, lightweight, limited ecosystem
InfluxDB + Telegraf → Rich system-level plugins, custom app metrics need manual work
Prometheus → Pull model, native K8s/Consul service discovery, full alerting stack
Prometheus wins in cloud-native environments for two reasons:
- Pull model: Prometheus scrapes targets over HTTP. This simplifies client code and makes debugging trivial (just `curl` the `/metrics` endpoint).
- Native service discovery: No static IP lists. Prometheus discovers targets from Kubernetes, Consul, or DNS automatically.
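The pull model can be sketched with nothing but the Python standard library: the target exposes its counters as plain text on `/metrics`, and the scraper is just an HTTP GET. This is a minimal illustration, not how a real Prometheus client library works internally; the metric name `demo_requests_total` and the helper names are made up for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counter; a real service would use a client library.
REQUESTS_TOTAL = {"value": 0}


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = (
            "# TYPE demo_requests_total counter\n"
            f"demo_requests_total {REQUESTS_TOTAL['value']}\n"
        ).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def start_metrics_server(port=0):
    """Start the scrape target in a background thread; port 0 picks a free port."""
    server = HTTPServer(("127.0.0.1", port), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def scrape(port):
    """What the scraper does on every interval: a plain HTTP GET."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request("GET", "/metrics")
        return conn.getresponse().read().decode()
    finally:
        conn.close()
```

Because the target only answers requests, it needs no knowledge of where the monitoring server lives, which is what keeps the client side simple.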
Prometheus Architecture
┌─────────────────────────────────────────────────────────────┐
│ Prometheus Server │
│ ┌──────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Scraper │ │ TSDB │ │ Rule Engine │ │
│ │ (pull HTTP) │ │ (storage) │ │ (alerts) │ │
│ └──────┬───────┘ └──────────────┘ └──────┬──────┘ │
└──────────┼────────────────────────────────────────┼─────────┘
│ scrape /metrics │ fire alerts
▼ ▼
┌─────────────────────┐ ┌──────────────────────┐
│ Service Discovery │ │ Alertmanager │
│ (K8s / Consul/DNS) │ │ (dedup, group, │
└─────────────────────┘ │ notify: DingTalk / │
│ WeCom / Email) │
┌─────────────────────┐ └──────────────────────┘
│ PushGateway │ ← for short-lived jobs (batch, PHP scripts)
└─────────────────────┘
┌─────────────────────┐
│ Exporters │ ← Node Exporter, Blackbox Exporter, etc.
└─────────────────────┘
PushGateway exists specifically for short-lived processes (batch jobs, PHP scripts) that won't survive long enough to be scraped.
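For a short-lived job, the push could look roughly like the sketch below. It builds an HTTP request against the Pushgateway's `/metrics/job/<job>` endpoint without sending it. The gateway address, job name, and metric name are hypothetical, and the job name is assumed to be URL-safe.

```python
import urllib.request


def build_push_request(gateway, job, metrics):
    """Build (but do not send) a push of `metrics` for `job`.

    `metrics` maps metric names to numeric values.
    """
    body = "".join(f"{name} {value}\n" for name, value in metrics.items())
    return urllib.request.Request(
        url=f"{gateway}/metrics/job/{job}",
        data=body.encode(),
        method="PUT",  # PUT replaces all metrics previously pushed for this job
        headers={"Content-Type": "text/plain"},
    )


# At the end of a batch job (all names below are made up for illustration):
req = build_push_request(
    "http://pushgateway.example:9091",
    "nightly_report",
    {"nightly_report_rows_processed": 1200},
)
# urllib.request.urlopen(req)  # uncomment to actually push
```

Prometheus then scrapes the Pushgateway like any other target, so the job itself can exit immediately.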
Prometheus Data Model
Every metric is a combination of a name + labels:
negri_http_request_total{client="serviceA", code="200", exported_service="serviceB", path="/ping"}
- `negri_http_request_total`: metric name (describes what it measures)
- `{client, code, exported_service, path}`: labels (dimensions for filtering/aggregation)
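To make the name-plus-labels idea concrete, here is a small sketch that treats each distinct label set as its own series and aggregates along one label. The sample values and the helper names `series_id` and `sum_by` are invented for illustration.

```python
from collections import defaultdict

# Each sample: (metric name, labels, value). Values are made up.
samples = [
    ("negri_http_request_total",
     {"client": "serviceA", "code": "200", "exported_service": "serviceB", "path": "/ping"}, 120),
    ("negri_http_request_total",
     {"client": "serviceA", "code": "500", "exported_service": "serviceB", "path": "/ping"}, 3),
    ("negri_http_request_total",
     {"client": "serviceC", "code": "200", "exported_service": "serviceB", "path": "/ping"}, 40),
]


def series_id(name, labels):
    """A series is identified by its name plus the full, sorted label set."""
    pairs = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{pairs}}}"


def sum_by(samples, name, label):
    """Aggregate one metric along a single label, similar to `sum by (label)`."""
    totals = defaultdict(int)
    for sample_name, labels, value in samples:
        if sample_name == name:
            totals[labels[label]] += value
    return dict(totals)
```

Here `sum_by(samples, "negri_http_request_total", "code")` collapses the three series into one total per status code, which is the kind of slicing labels exist for.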
The Four Metric Types
| Type | Computed By | Best For | Example |
|---|---|---|---|
| Counter | Server-side | QPS, total requests (monotonically increasing) | negri_http_request_total |
| Gauge | Client-side | Instantaneous values, event flags (circuit breaker open/closed) | degrade_events{event="eventBreakerOpenStatus"} |
| Histogram | Server-side | Latency percentiles (p90/p95/p99), response size distribution | negri_http_response_time_us_bucket{le="0.5"} |
| Summary | Client-side | Precise percentiles, but less flexible at query time | Similar to Histogram |
Counter vs Gauge: If your value only goes up, use a Counter; the application just increments it, and Prometheus derives rates such as QPS from it at query time. If it goes up and down, use a Gauge.
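Turning a counter into a per-second rate can be sketched as follows. This is a simplified illustration of the idea, not Prometheus's actual `rate()` implementation, which also extrapolates to the edges of the query window; the function name is made up.

```python
def counter_rate(samples):
    """Average per-second increase over `samples`.

    `samples` is a time-ordered list of (timestamp_seconds, value). A drop
    in value is treated as a counter reset (process restart), so the new
    value counts as the increase since the restart.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    return increase / elapsed
```

Reset handling is the reason a monotonically increasing counter is preferred for request totals: a restart loses at most the increments between the last scrape and the crash.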
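Histogram vs Summary: A Histogram is more flexible, because percentiles are estimated from bucket counts at query time (with `histogram_quantile()`) and can be aggregated across instances. A Summary can be more precise, but its percentiles are computed in the client and fixed at instrumentation time, so they cannot be meaningfully aggregated across instances.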
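The query-time estimate from histogram buckets works roughly like the sketch below: find the bucket that contains the requested rank, then interpolate linearly inside it. This is a simplified approximation of the idea behind `histogram_quantile()`, with made-up names, and it ignores edge cases such as negative bucket bounds.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile `q` (0..1) from cumulative histogram buckets.

    `buckets` is a list of (upper_bound, cumulative_count) sorted by upper
    bound, ending with the (math.inf, total_count) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(upper) or in_bucket == 0:
                # Past the last finite bucket (or an empty one): the best
                # available answer is the previous upper bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound
```

The accuracy depends entirely on the bucket layout: a p99 that falls into a wide bucket is only known to within that bucket's width, which is the trade-off against a Summary.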
Grafana Dashboard: What to Look At
Istio ships with pre-built Grafana dashboards. Here's how to navigate them effectively.
Services Dashboard
Key filter parameters:
| Parameter | What It Means |
|---|---|
| Service | The Kubernetes service name (e.g. details.default.svc.cluster.local) |
| Client Workload Namespace | The K8s namespace where the client lives |
| Client Workload | The upstream caller, e.g. productpage-v1 calling details |
| Service Workload | The specific version of the service, e.g. reviews-v1, reviews-v2, reviews-v3 |
Why `Client Workload` matters: Traditional microservice SDKs often struggle to reliably inject the caller's service name into metrics. Developers might hardcode the wrong name or copy-paste it from another service. Istio eliminates this entirely: the control plane injects the caller identity automatically via the sidecar, so there is no room for human error.
Key panels in the dashboard:
- QPS (client-side vs server-side): the client-side view is closer to what callers actually experience, because it also reflects what happens on the network between the two sidecars
- Success Rate — percentage of non-5xx responses
- Latency — p50/p90/p99 breakdown
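The Success Rate panel boils down to a simple ratio over request counts grouped by response code. A minimal sketch, assuming the counts are already available as a dict keyed by status code string (the function name and input shape are made up):

```python
def success_rate(requests_by_code):
    """Share of responses whose status code is not 5xx, as a percentage."""
    total = sum(requests_by_code.values())
    if total == 0:
        return None  # no traffic: the rate is undefined, not 0% or 100%
    failed = sum(n for code, n in requests_by_code.items() if code.startswith("5"))
    return 100.0 * (total - failed) / total
```

Note that 4xx responses count as successes under this definition, so a burst of 404s will not show up in this panel.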
Workload Dashboard
The Workload dashboard adds a critical dimension: inbound vs outbound traffic.
reviews-v3 Workload
├── INBOUND ← traffic FROM productpage-v1
└── OUTBOUND ← traffic TO ratings.default.svc.cluster.local
This dual-direction view is invaluable for root cause analysis. When reviews-v3 is slow, you can immediately see:
- Is it slow because incoming traffic is high? (upstream pressure)
- Or is it slow because outgoing calls to `ratings` are slow? (downstream dependency)
Part 2: Distributed Tracing with Jaeger
The Three Pillars of Observability — Compared
Metrics → "Something is wrong with reviews-v3 (p99 latency spiked to 2s)"
Tracing → "The spike is caused by the ratings service call at step 4 of this specific request"
Logging → "Here's the exact SQL query and stack trace that caused it"
Each tool answers a different question. Tracing is the bridge between "something is wrong" and "here's exactly why."
How Distributed Tracing Works (Dapper Model)
Tracing originates from Google's Dapper paper. The core concepts:
TraceId: 5f1db306ef459b2f (unique per request, generated at the gateway)
│
├── Span A (SpanId: aaa, ParentSpanId: null) ← root span (productpage)
│ │
│ ├── Span B (SpanId: bbb, ParentSpanId: aaa) ← details service call
│ │
│ └── Span C (SpanId: ccc, ParentSpanId: aaa) ← reviews service call
│ │
│ └── Span D (SpanId: ddd, ParentSpanId: ccc) ← ratings service call
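A tracing backend receives these spans as a flat list and rebuilds the tree from the parent pointers. A minimal sketch of that reconstruction, using the four spans from the diagram (the dict layout and function name are invented for the example):

```python
from collections import defaultdict

# The four spans from the diagram above; None marks the root.
spans = [
    {"spanId": "aaa", "parentSpanId": None, "name": "productpage"},
    {"spanId": "bbb", "parentSpanId": "aaa", "name": "details"},
    {"spanId": "ccc", "parentSpanId": "aaa", "name": "reviews"},
    {"spanId": "ddd", "parentSpanId": "ccc", "name": "ratings"},
]


def render_tree(spans):
    """Rebuild the call tree from flat spans and return it as indented lines."""
    children = defaultdict(list)
    for span in spans:
        children[span["parentSpanId"]].append(span)

    lines = []

    def walk(parent_id, depth):
        for span in children[parent_id]:
            lines.append("  " * depth + span["name"])
            walk(span["spanId"], depth + 1)

    walk(None, 0)
    return lines
```

Calling `render_tree(spans)` yields `productpage` at depth 0, `details` and `reviews` at depth 1, and `ratings` nested under `reviews`.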
Trace context is propagated via HTTP headers:
| Header | Purpose |
|---|---|
| `x-request-id` | Unique request ID for log correlation |
| `x-b3-traceid` | Global trace identifier (64 or 128 bits, hex-encoded) |
| `x-b3-spanid` | Current span position in the trace tree |
| `x-b3-parentspanid` | Parent span (absent = root node) |
| `x-b3-sampled` | Sampling flag (1 = record this trace) |
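How the IDs relate from one hop to the next can be sketched as follows. In an Istio mesh the sidecar does this for you; the function below is only an illustration with a made-up name, and it assumes lowercase header names.

```python
import secrets


def child_span_headers(incoming):
    """Derive B3 headers for an outgoing call from the incoming request's headers.

    `incoming` is a dict with lowercase header names.
    """
    headers = {
        # The trace ID stays the same for the whole request; start a new
        # trace only if the caller sent none.
        "x-b3-traceid": incoming.get("x-b3-traceid") or secrets.token_hex(8),
        # Every hop gets a fresh 64-bit span ID (16 hex characters).
        "x-b3-spanid": secrets.token_hex(8),
    }
    if "x-b3-spanid" in incoming:
        # The caller's span becomes the parent of the new span.
        headers["x-b3-parentspanid"] = incoming["x-b3-spanid"]
    for name in ("x-b3-sampled", "x-request-id"):
        if name in incoming:
            headers[name] = incoming[name]  # passed through unchanged
    return headers
```

The trace ID is the only value that never changes; the span ID of one hop becomes the parent span ID of the next, which is what lets the backend rebuild the tree.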
A Real Trace Record
{
  "traceID": "5f1db306ef459b2f",
  "spanID": "5f1db306ef459b2f",
  "parentSpanID": "0",
  "operationName": "/ping",
  "duration": 2065,
  "startTime": 1609241265147010,
  "process": {
    "serviceName": "negri.sidecarserverlistener.myapp",
    "tags": { "hostname": "MacBook-Pro-3.local", "ip": "192.168.1.88" }
  },
  "tags": {
    "http.method": "GET",
    "http.status_code": "200",
    "http.url": "/ping",
    "peer.address": "http://127.0.0.1:8888",
    "span.kind": "server"
  }
}
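The raw numbers in such a record are easier to read once converted. The sketch below assumes, as the 16-digit value suggests, that `startTime` and `duration` are expressed in microseconds, and that a `parentSpanID` of `"0"` marks the root span as it does in this record. The `summarize` helper is made up for the example.

```python
import json
from datetime import datetime, timedelta, timezone

# A subset of the record above.
record = json.loads("""
{
  "traceID": "5f1db306ef459b2f",
  "spanID": "5f1db306ef459b2f",
  "parentSpanID": "0",
  "operationName": "/ping",
  "duration": 2065,
  "startTime": 1609241265147010
}
""")

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def summarize(span):
    """Turn the raw microsecond fields into something readable."""
    return {
        "operation": span["operationName"],
        "is_root": span["parentSpanID"] == "0",
        # timedelta keeps full microsecond precision, unlike a float timestamp
        "started_at": EPOCH + timedelta(microseconds=span["startTime"]),
        "duration_ms": span["duration"] / 1000,
    }
```

Under those assumptions, this `/ping` call is a root span that took about 2 ms.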
Jaeger Architecture
Application Code
│ UDP
▼
┌──────────────┐ ┌──────────────────┐ ┌──────────────┐
│ jaeger-client│────▶│ jaeger-agent │────▶│jaeger- │
│ (SDK) │ │ (per-host │ │collector │
└──────────────┘ │ daemon) │ │(validate + │
└──────────────────┘ │ process) │
└──────┬───────┘
│
▼
┌──────────────┐
│ jaeger-db │
│ (Cassandra / │
│ Elasticsearch)│
└──────┬───────┘
│
▼
┌──────────────┐
│ jaeger-query │
│ (UI + API) │
└──────────────┘
Why jaeger-agent? Same philosophy as Istio's sidecar — it offloads service discovery and routing complexity from the client SDK. The app just sends UDP to localhost and forgets about it.
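The fire-and-forget nature of that UDP hop is easy to see in a few lines. This is only a sketch: the JSON payload is a stand-in (real Jaeger clients encode spans as Thrift), the function name is made up, and 6831 is used as the default because it is the agent's usual UDP port.

```python
import json
import socket


def report_span(span, agent_addr=("127.0.0.1", 6831)):
    """Send one span to a local agent and return immediately.

    UDP has no connection and no acknowledgement, so a slow or missing
    agent never blocks the request path; the span is simply lost.
    """
    payload = json.dumps(span).encode()  # placeholder encoding, not Thrift
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, agent_addr)
    return len(payload)
```

The trade-off is deliberate: losing an occasional span is cheaper than letting tracing add latency or failures to production requests.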
The One Thing Your Code Must Do
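Envoy automatically generates spans and adds B3 headers to the requests that pass through it. But it cannot tell which outbound call belongs to which inbound request, because that link exists only inside your application. Your application code must therefore copy the trace headers from each incoming request onto the outgoing requests it makes.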
Here's the pattern from the Bookinfo sample app (Python):
from opentracing.propagation import Format

# Excerpt: app, tracer, trace(), get_current_span(), product_id and
# getProductDetails() are defined elsewhere in the sample.

@app.route('/productpage')
@trace()
def front():
    headers = getForwardHeaders(request)  # extract + forward trace headers
    details = getProductDetails(product_id, headers)  # pass them downstream


def getForwardHeaders(request):
    headers = {}

    # B3 headers: extracted automatically via the OpenTracing library
    span = get_current_span()
    carrier = {}
    tracer.inject(span_context=span.context, format=Format.HTTP_HEADERS, carrier=carrier)
    headers.update(carrier)

    # Other headers that must be forwarded manually
    incoming_headers = [
        'x-request-id',
        'x-ot-span-context',
        'traceparent',
        'tracestate',
        'x-cloud-trace-context',
        'grpc-trace-bin',
        'user-agent',
    ]

    for h in incoming_headers:
        val = request.headers.get(h)
        if val is not None:
            headers[h] = val

    return headers
Key insight: Envoy handles span creation and injection automatically. Your app only needs to forward the headers it receives. This is a much lighter instrumentation burden than traditional APM agents.
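If you would rather not pull in a tracing library, the same forwarding can be approximated with a plain header copy. The sketch below is framework-independent; the function name and the exact header list are choices made for this example, not something Istio mandates in this form.

```python
# B3 headers used to stitch spans into one trace, plus the request ID
# used for log correlation.
TRACE_HEADERS = (
    "x-request-id",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
)


def forward_headers(incoming):
    """Copy trace headers from an incoming request onto an outgoing one.

    `incoming` is any mapping of header names to values; names are matched
    case-insensitively because HTTP header names are not case-sensitive.
    """
    lowered = {name.lower(): value for name, value in incoming.items()}
    return {h: lowered[h] for h in TRACE_HEADERS if h in lowered}
```

In a handler, the result would be passed as the headers of every outgoing HTTP call made on behalf of the incoming request. Everything else, such as cookies or authorization headers, is deliberately left out.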
What Jaeger Shows You
Trace List View — All traces for a service, sortable by duration. Immediately surfaces the slowest requests.
Trace Detail View — Full waterfall of every span in a single request:
productpage ────────────────────────────── 450ms
details ─── 45ms
reviews ────────────────── 380ms
ratings ──────────── 310ms ← bottleneck identified
System Architecture View — Auto-generated service dependency graph derived from trace data. No manual diagram maintenance needed.
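Both the bottleneck in the waterfall and the dependency graph can be derived from the same span data. The sketch below uses the durations from the waterfall above and makes two simplifying assumptions: each service appears once in the trace, and child calls run one after another (with parallel calls, subtracting child durations would over-correct). All names are invented for the example.

```python
# The waterfall above as flat spans; durations in milliseconds.
spans = [
    {"service": "productpage", "parent": None, "duration_ms": 450},
    {"service": "details", "parent": "productpage", "duration_ms": 45},
    {"service": "reviews", "parent": "productpage", "duration_ms": 380},
    {"service": "ratings", "parent": "reviews", "duration_ms": 310},
]


def self_times(spans):
    """Time spent in each service itself, excluding time waiting on callees."""
    own = {s["service"]: s["duration_ms"] for s in spans}
    for s in spans:
        if s["parent"] is not None:
            own[s["parent"]] -= s["duration_ms"]
    return own


def dependency_edges(spans):
    """Caller -> callee pairs, the raw material for a dependency graph."""
    return sorted({(s["parent"], s["service"]) for s in spans if s["parent"] is not None})


times = self_times(spans)
bottleneck = max(times, key=times.get)
```

Looking at self time rather than total duration is what separates `ratings` from `reviews` here: `reviews` looks slow in total, but most of its 380 ms is spent waiting on `ratings`.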
Summary: Metrics vs Tracing — When to Use Which
| Scenario | Use |
|---|---|
| "Is the service healthy right now?" | Metrics (Grafana dashboard) |
| "Has latency been trending up this week?" | Metrics (Prometheus query) |
| "Why did this specific request take 3 seconds?" | Tracing (Jaeger) |
| "Which service is the bottleneck in my call chain?" | Tracing (Jaeger waterfall) |
| "What exact error did service X return at 14:32?" | Logging (ELK) |
The power of Istio is that you get all of this without modifying your application — except for the minimal header forwarding shown above. The sidecar handles metric collection, span generation, and header injection automatically.
💻 Explore the full implementation:
github.com/muzinan123/servicemesh