Introduction
Your application just slowed down. Users are complaining. The CEO is asking what's wrong. You have hundreds of microservices, thousands of containers, and millions of log lines. Where do you even start?
This is the observability problem. Traditional monitoring, which checks whether servers are up, isn't enough in cloud-native environments. You need to understand why your system is behaving a certain way, not just that something is wrong.
Observability is about instrumenting your systems to answer any question about their behavior. In this comprehensive guide, we'll explore modern observability practices, focusing on OpenTelemetry as the industry standard for instrumentation.
The Three Pillars of Observability
1. Metrics
Numeric measurements over time:
CPU usage: 45%
Request rate: 1,250 req/sec
Error rate: 0.5%
P95 latency: 450ms
Active users: 12,450
Good for: Dashboards, alerts, trends
Bad for: Understanding why something happened
2. Logs
Discrete events:
{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Payment processing failed",
  "error": "Connection timeout to payment gateway",
  "user_id": 12345,
  "amount": 99.99
}
Good for: Debugging, understanding what happened
Bad for: High-cardinality queries, correlation across services
3. Traces
Request journey through distributed system:
User Request → Frontend (50ms)
  ├─> API Gateway (5ms)
  │     ├─> Auth Service (20ms)
  │     ├─> Product Service (100ms)
  │     │     └─> Database Query (80ms)  ← SLOW!
  │     └─> Inventory Service (30ms)
  └─> Response (Total: 205ms)
Good for: Understanding request flow, finding bottlenecks
Bad for: Aggregation, high-level trends
Why Traditional Monitoring Fails
The Kubernetes Problem
Traditional Monitoring (Pre-Kubernetes):
- Fixed servers with static IPs
- Server metrics tell you what's wrong
- SSH to server, check logs
- Simple to debug
Kubernetes:
- Pods come and go every few minutes
- IP addresses change constantly
- Logs disappear when pod dies
- Which pod handled the failing request?
- Impossible to debug with traditional tools
The Microservices Problem
Monolith:
User Request → Application → Database
(Easy to trace)
Microservices:
User Request → API Gateway
  ├─> Service A → Service B → Service C
  ├─> Service D → Service E
  └─> Service F → Service G → Service H → Service I
Question: "Why is this request slow?"
Traditional monitoring: Can't tell you
Observability: Shows exact bottleneck
OpenTelemetry: The Standard
What is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source standard for instrumenting applications to generate telemetry data (metrics, logs, traces).
Before OpenTelemetry:
- Proprietary agents for each vendor
- Vendor lock-in
- Different instrumentation for each tool
With OpenTelemetry:
- Single SDK for all telemetry
- Send to any backend
- Standardized across languages
- No vendor lock-in: switch backends through configuration, as sketched below
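In practice, vendor neutrality shows up as configuration. The SDKs and OTLP exporters read standard OTEL_* environment variables, so pointing the same instrumented service at a different backend is a deployment change rather than a code change. A minimal sketch, assuming an OTLP-compatible backend; the endpoint values are illustrative:

# In code, create the exporter without hard-coding a destination...
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter()  # honors OTEL_EXPORTER_OTLP_ENDPOINT if set

# ...and choose the backend at deploy time, e.g. in your shell or manifest:
#   OTEL_SERVICE_NAME=payment-api
#   OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4317
# Switching vendors later means changing that endpoint, not re-instrumenting.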
Architecture
Application → OpenTelemetry SDK → OpenTelemetry Collector → Backends
                                          │
                              (process, filter, route)
                                          │
                            ┌─────────────┴─────────────┐
                            │                           │
                       Prometheus                  Jaeger/Tempo
                        (Metrics)                    (Traces)
                            │                           │
                            └───────── Grafana ─────────┘
                                  (Visualization)
Installing OpenTelemetry
Python:
# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp \
#   opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests
import requests
from flask import Flask, jsonify

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up the tracer provider
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export spans to the OpenTelemetry Collector over OTLP/gRPC
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",
    insecure=True
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask and the requests library
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # The incoming request is traced automatically; add a child span for detail
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)

        # Database query (db and User are your application's ORM objects)
        user = db.query(User).filter(User.id == user_id).first()

        # External API call (auto-instrumented by RequestsInstrumentor)
        orders = requests.get(f"http://order-service/users/{user_id}/orders")

        return jsonify({
            "user": user,
            "orders": orders.json()
        })
Node.js:
// npm install @opentelemetry/api @opentelemetry/sdk-node \
//   @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-grpc
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});
sdk.start();

// Your application code - automatically instrumented!
const express = require('express');
const app = express();

app.get('/api/users/:userId', async (req, res) => {
  // Auto-traced! (User is your application's data model)
  const user = await User.findById(req.params.userId);
  const orders = await fetch(`http://order-service/users/${req.params.userId}/orders`);
  res.json({
    user,
    orders: await orders.json()
  });
});

app.listen(3000);
Go:
// go get go.opentelemetry.io/otel \
//        go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
//        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
package main

import (
    "context"
    "log"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    exporter, err := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        log.Fatalf("failed to create OTLP exporter: %v", err)
    }

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )
    otel.SetTracerProvider(tp)
}

func main() {
    initTracer()

    // Wrap the HTTP handler for automatic tracing
    // (getUserHandler is your application's handler)
    handler := http.HandlerFunc(getUserHandler)
    wrappedHandler := otelhttp.NewHandler(handler, "get-user")

    http.Handle("/api/users/", wrappedHandler)
    log.Fatal(http.ListenAndServe(":8080", nil))
}
OpenTelemetry Collector
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

  # Sample traces (keep 10%)
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  # Export to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"

  # Export to Jaeger
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  # Export to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Export to Loki (logs)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, probabilistic_sampler]
      exporters: [jaeger, otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]
Distributed Tracing Deep Dive
Trace Context Propagation
How traces work across services:
1. Frontend receives request
   trace-id: abc123
   span-id:  001

2. Frontend calls Backend
   Headers:
     traceparent: 00-abc123-001-01

3. Backend creates child span
   trace-id:  abc123 (same!)
   span-id:   002 (new)
   parent-id: 001

4. Backend calls Database
   Headers:
     traceparent: 00-abc123-002-01

5. Database creates child span
   trace-id:  abc123 (same!)
   span-id:   003 (new)
   parent-id: 002

Result: a full trace across all services! (A manual propagation sketch follows below.)
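The auto-instrumentation shown earlier injects and extracts the traceparent header for you. When a call crosses something that isn't instrumented (a custom queue or an in-house RPC layer, for example), you can propagate the context manually with OpenTelemetry's propagation API. A minimal sketch; the downstream URL and function names are hypothetical:

import requests
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Caller side: copy the current trace context into outgoing headers
def call_backend():
    with tracer.start_as_current_span("call_backend"):
        headers = {}
        inject(headers)  # adds the W3C 'traceparent' header
        return requests.get("http://backend:8080/work", headers=headers)  # hypothetical URL

# Callee side: rebuild the context so the new span joins the same trace
def handle_request(incoming_headers):
    ctx = extract(incoming_headers)  # reads 'traceparent' from the carrier
    with tracer.start_as_current_span("handle_request", context=ctx) as span:
        # This span shares the caller's trace-id and records the caller's span as its parent
        return span.get_span_context().trace_id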
Custom Spans
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.route('/api/checkout')
def checkout():
    # Parent span is auto-created by the Flask instrumentation;
    # the spans below become its children.
    with tracer.start_as_current_span("validate_cart") as span:
        span.set_attribute("cart.items", len(cart.items))
        validate_cart(cart)

    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", cart.total)
        span.set_attribute("payment.method", "credit_card")
        try:
            charge_id = process_payment(cart.total)
            span.set_attribute("payment.charge_id", charge_id)
            span.set_status(trace.Status(trace.StatusCode.OK))
        except PaymentError as e:
            span.set_status(
                trace.Status(
                    trace.StatusCode.ERROR,
                    str(e)
                )
            )
            span.record_exception(e)
            raise

    with tracer.start_as_current_span("create_order"):
        order = create_order(cart, charge_id)

    return {"order_id": order.id}
Sampling Strategies
# Tail-based sampling in the Collector: the decision is made after the whole
# trace has been seen, so errors and slow requests can always be kept.
processors:
  tail_sampling:
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Sample 10% of successful requests
      - name: success
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Always sample slow requests (>1s)
      - name: slow
        type: latency
        latency:
          threshold_ms: 1000

      # Always sample specific endpoints
      - name: critical-endpoints
        type: string_attribute
        string_attribute:
          key: http.route
          values:
            - /api/checkout
            - /api/payment
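Head-based sampling is the complement to the tail-sampling policies above: the decision is made in the SDK when the root span is created, before anything is exported. A minimal sketch using the Python SDK's built-in samplers; the 10% ratio is illustrative:

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Keep ~10% of new traces; child spans follow their parent's decision,
# so a trace is kept or dropped as a whole.
sampler = ParentBased(root=TraceIdRatioBased(0.1))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

Head-based sampling is cheap because unsampled spans are never exported, but it cannot know in advance which traces will turn out slow or broken; recovering those is exactly what the tail-sampling policies above are for.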
Metrics with OpenTelemetry
# pip install opentelemetry-exporter-prometheus
import time

from flask import Flask, jsonify
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

app = Flask(__name__)

# Set up metrics with a Prometheus scrape endpoint
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))
meter = metrics.get_meter(__name__)

# Create instruments
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)
request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)
active_users = meter.create_up_down_counter(
    "active_users",
    description="Currently active users"
)

# Use the instruments
@app.route('/api/users')
def get_users():
    start = time.time()

    # Increment the request counter
    request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})

    # Business logic (User is your application's ORM model)
    users = User.query.all()

    # Record the request duration
    duration = time.time() - start
    request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})

    return jsonify(users)

@app.route('/api/login', methods=['POST'])
def login():
    # User logged in
    active_users.add(1)
    return {"status": "success"}

@app.route('/api/logout', methods=['POST'])
def logout():
    # User logged out
    active_users.add(-1)
    return {"status": "success"}
Observability Stack
The LGTM Stack (Grafana)
# Loki (Logs), Grafana (Visualization), Tempo (Traces), Mimir (Metrics)
version: '3.8'

services:
  # Grafana (Visualization)
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
      - grafana-storage:/var/lib/grafana

  # Loki (Logs)
  loki:
    image: grafana/loki:latest
    ports:
      - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  # Tempo (Traces)
  tempo:
    image: grafana/tempo:latest
    ports:
      - "3200:3200"   # Tempo HTTP API
      # OTLP gRPC (4317) stays internal; the collector reaches it at tempo:4317
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./tempo.yaml:/etc/tempo.yaml

  # Mimir (Metrics) or Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus exporter
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    command: ["--config=/etc/otel-collector-config.yaml"]

volumes:
  grafana-storage:
Grafana Dashboard Example
{
  "dashboard": {
    "title": "Application Observability",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Recent Errors",
        "type": "logs",
        "targets": [{
          "expr": "{level=\"error\"}"
        }]
      },
      {
        "title": "Trace Map",
        "type": "nodeGraph",
        "targets": [{
          "query": "traces"
        }]
      }
    ]
  }
}
Best Practices
1. Structured Logging
import structlog

logger = structlog.get_logger()

# Bad: everything is baked into one string, so fields can't be queried
logger.info(f"User {user_id} purchased {item_name} for ${amount}")

# Good: key-value pairs are searchable and machine-parseable
logger.info(
    "purchase_completed",
    user_id=user_id,
    item_name=item_name,
    amount=amount,
    payment_method=payment_method
)
2. Correlation IDs
import uuid

import structlog
from flask import Flask, g, request
from opentelemetry import trace

app = Flask(__name__)

@app.before_request
def before_request():
    # Generate or extract the correlation ID
    correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))
    g.correlation_id = correlation_id

    # Add to logs
    structlog.contextvars.bind_contextvars(correlation_id=correlation_id)

    # Add to the current trace
    span = trace.get_current_span()
    span.set_attribute("correlation.id", correlation_id)

@app.after_request
def after_request(response):
    # Return the correlation ID in the response
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response
3. SLI/SLO Monitoring
# Service Level Indicators/Objectives
SLI (Service Level Indicator): What we measure
- Request success rate
- Request latency P95
- Availability
SLO (Service Level Objective): Target
- 99.9% success rate
- P95 latency < 500ms
- 99.95% availability
Alert: When SLO at risk
- Success rate < 99.9% for 5 minutes
- P95 latency > 500ms for 5 minutes
- Error budget consumed > 80% (see the sketch below for the arithmetic)
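To make the error-budget alert concrete, here is a small sketch of the arithmetic; the 99.9% target and the request counts are illustrative:

def error_budget_consumed(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the window's error budget that has been used (may exceed 1.0)."""
    allowed_failures = (1 - slo_target) * total_requests  # failures the SLO tolerates
    if allowed_failures == 0:
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures

# Example: 99.9% SLO, 1,000,000 requests, 850 failures
consumed = error_budget_consumed(0.999, 1_000_000, 850)
print(f"Error budget consumed: {consumed:.0%}")  # 85% -- above the 80% alert threshold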
4. Cost Management
Observability can be expensive:
1. Sample aggressively
- Keep 100% of errors
- Sample 10% of successful requests
- Sample 1% of health checks
2. Use tiered storage
- Hot: Last 7 days (expensive, fast queries)
- Warm: 8-30 days (cheaper, slower queries)
- Cold: 31-90 days (cheapest, slowest)
- Archive: >90 days (S3, rarely accessed)
3. Set retention policies
- Traces: 30 days
- Metrics: 90 days (1m resolution), 1 year (1h resolution)
- Logs: 7 days (debug), 90 days (error)
Conclusion
Observability is essential for cloud-native applications. OpenTelemetry provides a vendor-neutral standard for instrumenting your applications, giving you the flexibility to choose backends while avoiding vendor lock-in.
Key takeaways:
- Implement all three pillars: Metrics, logs, and traces together provide complete observability
- Use OpenTelemetry: Industry standard, vendor-neutral, future-proof
- Start simple: Auto-instrumentation first, custom spans later
- Sample intelligently: Keep errors, sample successful requests
- Correlate everything: Use trace IDs across metrics, logs, and traces
The investment in observability pays for itself the first time you debug a production issue in minutes instead of hours.
Need help implementing observability? InstaDevOps provides expert consulting for observability, monitoring, and OpenTelemetry implementation. Contact us for a free consultation.
Need Help with Your DevOps Infrastructure?
At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.
Our Services:
- AWS Consulting - Cloud architecture, cost optimization, and migration
- Kubernetes Management - Production-ready clusters and orchestration
- CI/CD Pipelines - Automated deployment pipelines that just work
- Monitoring & Observability - See what's happening in your infrastructure
Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.
Book a Free 15-Min Consultation
Originally published at instadevops.com