Introduction
Your application just slowed down. Users are complaining. The CEO is asking what's wrong. You have hundreds of microservices, thousands of containers, and millions of log lines. Where do you even start?
This is the observability problem. Traditional monitoring, which checks whether servers are up, isn't enough in cloud-native environments. You need to understand why your system is behaving a certain way, not just that something is wrong.
Observability is about instrumenting your systems to answer any question about their behavior. In this comprehensive guide, we'll explore modern observability practices, focusing on OpenTelemetry as the industry standard for instrumentation.
The Three Pillars of Observability
1. Metrics
Numeric measurements over time:
CPU usage: 45%
Request rate: 1,250 req/sec
Error rate: 0.5%
P95 latency: 450ms
Active users: 12,450
Good for: Dashboards, alerts, trends
Bad for: Understanding why something happened
2. Logs
Discrete events:
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Payment processing failed",
  "error": "Connection timeout to payment gateway",
  "user_id": 12345,
  "amount": 99.99
}
Good for: Debugging, understanding what happened
Bad for: High-cardinality queries, correlation across services
3. Traces
Request journey through distributed system:
User Request → Frontend (50ms)
  ├─> API Gateway (5ms)
  │     ├─> Auth Service (20ms)
  │     ├─> Product Service (100ms)
  │     │     └─> Database Query (80ms)  ← SLOW!
  │     └─> Inventory Service (30ms)
  └─> Response (Total: 205ms)
Good for: Understanding request flow, finding bottlenecks
Bad for: Aggregation, high-level trends
Why Traditional Monitoring Fails
The Kubernetes Problem
Traditional Monitoring (Pre-Kubernetes):
- Fixed servers with static IPs
- Server metrics tell you what's wrong
- SSH to server, check logs
- Simple to debug
Kubernetes:
- Pods come and go every few minutes
- IP addresses change constantly
- Logs disappear when pod dies
- Which pod handled the failing request?
- Impossible to debug with traditional tools
The Microservices Problem
Monolith:
User Request → Application → Database
(Easy to trace)
Microservices:
User Request → API Gateway
  ├─> Service A → Service B → Service C
  ├─> Service D → Service E
  └─> Service F → Service G → Service H → Service I
Question: "Why is this request slow?"
Traditional monitoring: Can't tell you
Observability: Shows exact bottleneck
OpenTelemetry: The Standard
What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source standard for instrumenting applications to generate telemetry data (metrics, logs, traces).
Before OpenTelemetry:
- Proprietary agents for each vendor
- Vendor lock-in
- Different instrumentation for each tool
With OpenTelemetry:
- Single SDK for all telemetry
- Send to any backend
- Standardized across languages
- No vendor lock-in: switch backends through configuration, as sketched below
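In practice, vendor neutrality shows up as configuration. The SDKs and OTLP exporters read standard OTEL_* environment variables, so pointing the same instrumented service at a different backend is a deployment change rather than a code change. A minimal sketch, assuming an OTLP-compatible backend; the endpoint values are illustrative:

# In code, create the exporter without hard-coding a destination...
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter()  # honors OTEL_EXPORTER_OTLP_ENDPOINT if set

# ...and choose the backend at deploy time, e.g. in your shell or manifest:
#   OTEL_SERVICE_NAME=payment-api
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Switching vendors later means changing that endpoint, not re-instrumenting.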
Architecture
Application → OpenTelemetry SDK → OpenTelemetry Collector → Backends
                                          │
                              (process, filter, route)
                                          │
                            ┌─────────────┴─────────────┐
                            │                           │
                       Prometheus                  Jaeger/Tempo
                        (Metrics)                    (Traces)
                            │                           │
                            └───────── Grafana ─────────┘
                                  (Visualization)
Installing OpenTelemetry
Python:
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp \
#   opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests
import requests
from flask import Flask, jsonify

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up the tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export spans to the OpenTelemetry Collector over OTLP/gRPC
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",
    insecure=True
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # The incoming request is traced automatically; add a child span for detail
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)

        # Database query (db and User are your application's ORM objects)
        user = db.query(User).filter(User.id == user_id).first()

        # External API call (auto-instrumented by RequestsInstrumentor)
        orders = requests.get(f"http://order-service/users/{user_id}/orders")

        return jsonify({
            "user": user,
            "orders": orders.json()
        })
Node.js:
// npm install @opentelemetry/api @opentelemetry/sdk-node \
//   @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-grpc
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();

// Your application code - automatically instrumented!
const express = require('express');
const app = express();

app.get('/api/users/:userId', async (req, res) => {
  // Auto-traced! (User is your application's data model)
  const user = await User.findById(req.params.userId);
  const orders = await fetch(`http://order-service/users/${req.params.userId}/orders`);
  res.json({
    user,
    orders: await orders.json()
  });
});

app.listen(3000);
Go:
// go get go.opentelemetry.io/otel \
//        go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
//        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
}

func main() {
    initTracer()

    // Wrap the HTTP handler for automatic tracing
    // (getUserHandler is your application's handler)
    handler := http.HandlerFunc(getUserHandler)
    wrappedHandler := otelhttp.NewHandler(handler, "get-user")

    http.Handle("/api/users/", wrappedHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}
OpenTelemetry Collector
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Sample traces (keep 10%)
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  # Export to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"

  # Export to Jaeger
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  # Export to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Export to Loki (logs)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, probabilistic_sampler]
      exporters: [jaeger, otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
Distributed Tracing Deep Dive
Trace Context Propagation
How traces work across services:
1. Frontend receives request
   trace-id: abc123
   span-id:  001

2. Frontend calls Backend
   Headers:
     traceparent: 00-abc123-001-01

3. Backend creates child span
   trace-id:  abc123 (same!)
   span-id:   002 (new)
   parent-id: 001

4. Backend calls Database
   Headers:
     traceparent: 00-abc123-002-01

5. Database creates child span
   trace-id:  abc123 (same!)
   span-id:   003 (new)
   parent-id: 002

Result: a full trace across all services! (A manual propagation sketch follows below.)
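The auto-instrumentation shown earlier injects and extracts the traceparent header for you. When a call crosses something that isn't instrumented (a custom queue or an in-house RPC layer, for example), you can propagate the context manually with OpenTelemetry's propagation API. A minimal sketch; the downstream URL and function names are hypothetical:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Caller side: copy the current trace context into outgoing headers
def call_backend():
    with tracer.start_as_current_span("call_backend"):
        headers = {}
        inject(headers)  # adds the W3C 'traceparent' header
        return requests.get("http://backend:8080/work", headers=headers)  # hypothetical URL

# Callee side: rebuild the context so the new span joins the same trace
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)  # reads 'traceparent' from the carrier
    with tracer.start_as_current_span("handle_request", context=ctx) as span:
        # This span shares the caller's trace-id and records the caller's span as its parent
        return span.get_span_context().trace_id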
Custom Spans
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.route('/api/checkout')
def checkout():
    # Parent span is auto-created by the Flask instrumentation;
    # the spans below become its children.
    with tracer.start_as_current_span("validate_cart") as span:
        span.set_attribute("cart.items", len(cart.items))
        validate_cart(cart)

    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", cart.total)
        span.set_attribute("payment.method", "credit_card")
        try:
            charge_id = process_payment(cart.total)
            span.set_attribute("payment.charge_id", charge_id)
            span.set_status(trace.Status(trace.StatusCode.OK))
        except PaymentError as e:
            span.set_status(
                trace.Status(
                    trace.StatusCode.ERROR,
                    str(e)
                )
            )
            span.record_exception(e)
            raise

    with tracer.start_as_current_span("create_order"):
        order = create_order(cart, charge_id)

    return {"order_id": order.id}
Sampling Strategies
# Tail-based sampling in the Collector: the decision is made after the whole
# trace has been seen, so errors and slow requests can always be kept.
processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Sample 10% of successful requests
      - name: success
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Always sample slow requests (>1s)
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000

      # Always sample specific endpoints
      - name: critical-endpoints
        type: string_attribute
        string_attribute:
          key: http.route
          values:
            - /api/checkout
            - /api/payment
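Head-based sampling is the complement to the tail-sampling policies above: the decision is made in the SDK when the root span is created, before anything is exported. A minimal sketch using the Python SDK's built-in samplers; the 10% ratio is illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new traces; child spans follow their parent's decision,
# so a trace is kept or dropped as a whole.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

Head-based sampling is cheap because unsampled spans are never exported, but it cannot know in advance which traces will turn out slow or broken; recovering those is exactly what the tail-sampling policies above are for.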
Metrics with OpenTelemetry
# pip install opentelemetry-exporter-prometheus
import time

from flask import Flask, jsonify
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

app = Flask(__name__)

# Set up metrics with a Prometheus scrape endpoint
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter(__name__)

# Create instruments
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)
active_users = meter.create_up_down_counter(
    "active_users",
    description="Currently active users"
)

# Use the instruments
@app.route('/api/users')
def get_users():
    start = time.time()

    # Increment the request counter
    request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})

    # Business logic (User is your application's ORM model)
    users = User.query.all()

    # Record the request duration
    duration = time.time() - start
    request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})

    return jsonify(users)

@app.route('/api/login', methods=['POST'])
def login():
    # User logged in
    active_users.add(1)
    return {"status": "success"}

@app.route('/api/logout', methods=['POST'])
def logout():
    # User logged out
    active_users.add(-1)
    return {"status": "success"}
Observability Stack
The LGTM Stack (Grafana)
# Loki (Logs), Grafana (Visualization), Tempo (Traces), Mimir (Metrics)
version: '3.8'

services:
  # Grafana (Visualization)
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
      - grafana-storage:/var/lib/grafana

  # Loki (Logs)
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  # Tempo (Traces)
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # Tempo HTTP API
      # OTLP gRPC (4317) stays internal; the collector reaches it at tempo:4317
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

  # Mimir (Metrics) or Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus exporter
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    command: ["--config=/etc/otel-collector-config.yaml"]

volumes:
  grafana-storage:
Grafana Dashboard Example
{
  "dashboard": {
    "title": "Application Observability",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Recent Errors",
        "type": "logs",
        "targets": [{
          "expr": "{level=\"error\"}"
        }]
      },
      {
        "title": "Trace Map",
        "type": "nodeGraph",
        "targets": [{
          "query": "traces"
        }]
      }
    ]
  }
}
Best Practices
1. Structured Logging
import structlog

logger = structlog.get_logger()

# Bad: everything is baked into one string, so fields can't be queried
logger.info(f"User {user_id} purchased {item_name} for ${amount}")

# Good: key-value pairs are searchable and machine-parseable
logger.info(
    "purchase_completed",
    user_id=user_id,
    item_name=item_name,
    amount=amount,
    payment_method=payment_method
)
2. Correlation IDs
import uuid

import structlog
from flask import Flask, g, request
from opentelemetry import trace

app = Flask(__name__)

@app.before_request
def before_request():
    # Generate or extract the correlation ID
    correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))
    g.correlation_id = correlation_id

    # Add to logs
    structlog.contextvars.bind_contextvars(correlation_id=correlation_id)

    # Add to the current trace
    span = trace.get_current_span()
    span.set_attribute("correlation.id", correlation_id)

@app.after_request
def after_request(response):
    # Return the correlation ID in the response
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response
3. SLI/SLO Monitoring
# Service Level Indicators/Objectives
SLI (Service Level Indicator): What we measure
- Request success rate
- Request latency P95
- Availability
SLO (Service Level Objective): Target
- 99.9% success rate
- P95 latency < 500ms
- 99.95% availability
Alert: When SLO at risk
- Success rate < 99.9% for 5 minutes
- P95 latency > 500ms for 5 minutes
- Error budget consumed > 80% (see the sketch below for the arithmetic)
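To make the error-budget alert concrete, here is a small sketch of the arithmetic; the 99.9% target and the request counts are illustrative:

def error_budget_consumed(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget that has been used (may exceed 1.0)."""
    allowed_failures = (1 - slo_target) * total_requests  # failures the SLO tolerates
    if allowed_failures == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures

# Example: 99.9% SLO, 1,000,000 requests, 850 failures
consumed = error_budget_consumed(0.999, 1_000_000, 850)
print(f"Error budget consumed: {consumed:.0%}")  # 85% -- above the 80% alert threshold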
4. Cost Management
Observability can be expensive:
1. Sample aggressively
- Keep 100% of errors
- Sample 10% of successful requests
- Sample 1% of health checks
2. Use tiered storage
- Hot: Last 7 days (expensive, fast queries)
- Warm: 8-30 days (cheaper, slower queries)
- Cold: 31-90 days (cheapest, slowest)
- Archive: >90 days (S3, rarely accessed)
3. Set retention policies
- Traces: 30 days
- Metrics: 90 days (1m resolution), 1 year (1h resolution)
- Logs: 7 days (debug), 90 days (error)
Conclusion
Observability is essential for cloud-native applications. OpenTelemetry provides a vendor-neutral standard for instrumenting your applications, giving you the flexibility to choose backends while avoiding vendor lock-in.
Key takeaways:
- Implement all three pillars: Metrics, logs, and traces together provide complete observability
- Use OpenTelemetry: Industry standard, vendor-neutral, future-proof
- Start simple: Auto-instrumentation first, custom spans later
- Sample intelligently: Keep errors, sample successful requests
- Correlate everything: Use trace IDs across metrics, logs, and traces
The investment in observability pays for itself the first time you debug a production issue in minutes instead of hours.
Need help implementing observability? InstaDevOps provides expert consulting for observability, monitoring, and OpenTelemetry implementation. Contact us for a free consultation.
Need Help with Your DevOps Infrastructure?
At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.
Our Services:
- AWS Consulting - Cloud architecture, cost optimization, and migration
- Kubernetes Management - Production-ready clusters and orchestration
- CI/CD Pipelines - Automated deployment pipelines that just work
- Monitoring & Observability - See what's happening in your infrastructure
Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.
Book a Free 15-Min Consultation
Originally published at instadevops.com