
Cloud-Native Observability: OpenTelemetry and Beyond

Introduction

Your application just slowed down. Users are complaining. The CEO is asking what's wrong. You have hundreds of microservices, thousands of containers, and millions of log lines. Where do you even start?

This is the observability problem. Traditional monitoring (checking if servers are up) isn't enough in cloud-native environments. You need to understand why your system is behaving a certain way, not just that something is wrong.

Observability is about instrumenting your systems to answer any question about their behavior. In this comprehensive guide, we'll explore modern observability practices, focusing on OpenTelemetry as the industry standard for instrumentation.

The Three Pillars of Observability

1. Metrics

Numeric measurements over time:

CPU usage: 45%
Request rate: 1,250 req/sec
Error rate: 0.5%
P95 latency: 450ms
Active users: 12,450

Good for: Dashboards, alerts, trends
Bad for: Understanding why something happened

2. Logs

Discrete events:

{
  "timestamp": "2024-01-15T10:23:45Z",
  "level": "ERROR",
  "service": "payment-api",
  "message": "Payment processing failed",
  "error": "Connection timeout to payment gateway",
  "user_id": 12345,
  "amount": 99.99
}

Good for: Debugging, understanding what happened
Bad for: High-cardinality queries, correlation across services

3. Traces

Request journey through distributed system:

User Request → Frontend (50ms)
  ├─> API Gateway (5ms)
  │   ├─> Auth Service (20ms)
  │   ├─> Product Service (100ms)
  │   │   └─> Database Query (80ms) ← SLOW!
  │   └─> Inventory Service (30ms)
  └─> Response (Total: 205ms)

Good for: Understanding request flow, finding bottlenecks
Bad for: Aggregation, high-level trends

Why Traditional Monitoring Fails

The Kubernetes Problem

Traditional Monitoring (Pre-Kubernetes):
- Fixed servers with static IPs
- Server metrics tell you what's wrong
- SSH to server, check logs
- Simple to debug

Kubernetes:
- Pods come and go every few minutes
- IP addresses change constantly
- Logs disappear when pod dies
- Which pod handled the failing request?
- Impossible to debug with traditional tools

The Microservices Problem

Monolith:
User Request → Application → Database
              (Easy to trace)

Microservices:
User Request → API Gateway
  ├─> Service A → Service B → Service C
  ├─> Service D → Service E
  └─> Service F → Service G → Service H → Service I

Question: "Why is this request slow?"
Traditional monitoring: Can't tell you
Observability: Shows exact bottleneck

OpenTelemetry: The Standard

What is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral, open-source standard for instrumenting applications to generate telemetry data (metrics, logs, traces).

Before OpenTelemetry:
- Proprietary agents for each vendor
- Vendor lock-in
- Different instrumentation for each tool

With OpenTelemetry:
- Single SDK for all telemetry
- Send to any backend
- Standardized across languages
- No vendor lock-in

Architecture

Application → OpenTelemetry SDK → OpenTelemetry Collector → Backend
                                        ↓
                                 (Process, filter, route)
                                        ↓
                            ┌───────────┴───────────┐
                            │                       │
                       Prometheus             Jaeger/Tempo
                       (Metrics)               (Traces)
                            │                       │
                       Grafana ←────────────────────┘
                     (Visualization)

Installing OpenTelemetry

Python:

# pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp \
#             opentelemetry-instrumentation-flask opentelemetry-instrumentation-requests

from flask import Flask, jsonify
import requests

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# Set up tracer
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export to collector
otlp_exporter = OTLPSpanExporter(
    endpoint="http://otel-collector:4317",
    insecure=True
)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Auto-instrument Flask and requests
FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()

app = Flask(__name__)

@app.route('/api/users/<user_id>')
def get_user(user_id):
    # Automatically traced!
    with tracer.start_as_current_span("get_user") as span:
        span.set_attribute("user.id", user_id)

        # Database query (db and User come from your app; add the SQLAlchemy
        # instrumentor if you want these queries traced automatically)
        user = db.query(User).filter(User.id == user_id).first()

        # External API call (auto-instrumented)
        orders = requests.get(f"http://order-service/users/{user_id}/orders")

        return jsonify({
            "user": user,
            "orders": orders.json()
        })

Node.js:

// npm install @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-grpc

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector:4317'
  }),
  instrumentations: [getNodeAutoInstrumentations()]
});

sdk.start();

// Your application code - automatically instrumented!
const express = require('express');
const app = express();

app.get('/api/users/:userId', async (req, res) => {
  // Auto-traced!
  const user = await User.findById(req.params.userId);
  const orders = await fetch(`http://order-service/users/${req.params.userId}/orders`);

  res.json({
    user,
    orders: await orders.json()
  });
});

app.listen(3000);

Go:

// go get go.opentelemetry.io/otel \
//        go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc \
//        go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp

import (
    "context"
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer() {
    exporter, _ := otlptracegrpc.New(
        context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )

    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
    )

    otel.SetTracerProvider(tp)
}

func main() {
    initTracer()

    // Wrap HTTP handler for auto-tracing
    handler := http.HandlerFunc(getUserHandler)
    wrappedHandler := otelhttp.NewHandler(handler, "get-user")

    http.Handle("/api/users/", wrappedHandler)
    http.ListenAndServe(":8080", nil)
}

OpenTelemetry Collector

# otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024

  # Add resource attributes
  resource:
    attributes:
    - key: environment
      value: production
      action: upsert

  # Sample traces (keep 10%)
  probabilistic_sampler:
    sampling_percentage: 10

exporters:
  # Export to Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"

  # Export to Jaeger (note: newer collector-contrib releases dropped this
  # dedicated exporter; recent Jaeger versions ingest OTLP directly instead)
  jaeger:
    endpoint: jaeger:14250
    tls:
      insecure: true

  # Export to Grafana Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

  # Export to Loki (logs)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource, probabilistic_sampler]
      exporters: [jaeger, otlp/tempo]

    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]

    logs:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [loki]

Distributed Tracing Deep Dive

Trace Context Propagation

How traces work across services:

1. Frontend receives request
   trace-id: abc123
   span-id: 001

2. Frontend calls Backend
   Headers: 
     traceparent: 00-abc123-001-01

3. Backend creates child span
   trace-id: abc123 (same!)
   span-id: 002 (new)
   parent-id: 001

4. Backend calls Database
   Headers:
     traceparent: 00-abc123-002-01

5. Database creates child span
   trace-id: abc123 (same!)
   span-id: 003 (new)
   parent-id: 002

Result: Full trace across all services!
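
Auto-instrumentation injects and extracts the traceparent header for you over HTTP. When a hop isn't auto-instrumented (a custom message queue, a batch job), you can propagate the context yourself with the OpenTelemetry propagation API. A minimal Python sketch; the queue client and process() helper are hypothetical placeholders, not part of any specific library:

# Minimal sketch of manual trace-context propagation across a non-HTTP boundary.
# queue.publish(), payload, and process() are hypothetical placeholders.
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Producer side: inject the current trace context into outgoing message headers
headers = {}
with tracer.start_as_current_span("publish_order_event"):
    inject(headers)  # writes the W3C "traceparent" header into the dict
    queue.publish(body=payload, headers=headers)

# Consumer side: extract the context so the processing span joins the same trace
def handle_message(message):
    ctx = extract(message.headers)
    with tracer.start_as_current_span("process_order_event", context=ctx):
        process(message.body)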

Custom Spans

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

@app.route('/api/checkout')
def checkout():
    # Parent span (auto-created by Flask instrumentation)

    with tracer.start_as_current_span("validate_cart") as span:
        span.set_attribute("cart.items", len(cart.items))
        validate_cart(cart)

    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", cart.total)
        span.set_attribute("payment.method", "credit_card")

        try:
            charge_id = process_payment(cart.total)
            span.set_attribute("payment.charge_id", charge_id)
            span.set_status(trace.Status(trace.StatusCode.OK))
        except PaymentError as e:
            span.set_status(
                trace.Status(
                    trace.StatusCode.ERROR,
                    str(e)
                )
            )
            span.record_exception(e)
            raise

    with tracer.start_as_current_span("create_order"):
        order = create_order(cart, charge_id)

    return {"order_id": order.id}

Sampling Strategies

# Tail-based sampling in the collector (the decision is made after the whole trace is seen)

processors:
  # Always sample errors
  tail_sampling:
    policies:
    - name: errors
      type: status_code
      status_code:
        status_codes: [ERROR]

    # Sample 10% of successful requests
    - name: success
      type: probabilistic
      probabilistic:
        sampling_percentage: 10

    # Always sample slow requests (>1s)
    - name: slow
      type: latency
      latency:
        threshold_ms: 1000

    # Always sample specific endpoints
    - name: critical-endpoints
      type: string_attribute
      string_attribute:
        key: http.route
        values:
        - /api/checkout
        - /api/payment

Metrics with OpenTelemetry

import time

from flask import Flask, jsonify

from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.prometheus import PrometheusMetricReader

app = Flask(__name__)

# Set up metrics
reader = PrometheusMetricReader()
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)

# Create metrics
request_counter = meter.create_counter(
    "http_requests_total",
    description="Total HTTP requests"
)

request_duration = meter.create_histogram(
    "http_request_duration_seconds",
    description="HTTP request duration"
)

active_users = meter.create_up_down_counter(
    "active_users",
    description="Currently active users"
)

# Use metrics
@app.route('/api/users')
def get_users():
    start = time.time()

    # Increment counter
    request_counter.add(1, {"method": "GET", "endpoint": "/api/users"})

    # Business logic
    users = User.query.all()

    # Record duration
    duration = time.time() - start
    request_duration.record(duration, {"method": "GET", "endpoint": "/api/users"})

    return jsonify(users)

@app.route('/api/login', methods=['POST'])
def login():
    # User logged in
    active_users.add(1)
    return {"status": "success"}

@app.route('/api/logout', methods=['POST'])
def logout():
    # User logged out
    active_users.add(-1)
    return {"status": "success"}

Observability Stack

The LGTM Stack (Grafana)

# Loki (Logs), Grafana (Visualization), Tempo (Traces), Mimir (Metrics)

version: '3.8'

services:
  # Grafana (Visualization)
  grafana:
    image: grafana/grafana:latest
    ports:
    - "3000:3000"
    environment:
      GF_AUTH_ANONYMOUS_ENABLED: "true"
      GF_AUTH_ANONYMOUS_ORG_ROLE: "Admin"
    volumes:
    - grafana-storage:/var/lib/grafana

  # Loki (Logs)
  loki:
    image: grafana/loki:latest
    ports:
    - "3100:3100"
    command: -config.file=/etc/loki/local-config.yaml

  # Tempo (Traces)
  tempo:
    image: grafana/tempo:latest
    ports:
    - "3200:3200"  # Tempo API
    # 4317 is not published on the host here: the collector already binds it,
    # and it forwards traces to Tempo over the Docker network (tempo:4317)
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
    - ./tempo.yaml:/etc/tempo.yaml

  # Mimir (Metrics) or Prometheus
  prometheus:
    image: prom/prometheus:latest
    ports:
    - "9090:9090"
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # OpenTelemetry Collector
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    ports:
    - "4317:4317"  # OTLP gRPC
    - "4318:4318"  # OTLP HTTP
    - "8889:8889"  # Prometheus exporter
    volumes:
    - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    command: ["--config=/etc/otel-collector-config.yaml"]

Grafana Dashboard Example

{
  "dashboard": {
    "title": "Application Observability",
    "panels": [
      {
        "title": "Request Rate",
        "targets": [{
          "expr": "rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "Error Rate",
        "targets": [{
          "expr": "rate(http_requests_total{status=~\"5..\"}[5m]) / rate(http_requests_total[5m])"
        }]
      },
      {
        "title": "P95 Latency",
        "targets": [{
          "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
        }]
      },
      {
        "title": "Recent Errors",
        "type": "logs",
        "targets": [{
          "expr": "{level=\"error\"}"
        }]
      },
      {
        "title": "Trace Map",
        "type": "nodeGraph",
        "targets": [{
          "query": "traces"
        }]
      }
    ]
  }
}

Best Practices

1. Structured Logging

import structlog

logger = structlog.get_logger()

# Bad
logger.info(f"User {user_id} purchased {item_name} for ${amount}")

# Good
logger.info(
    "purchase_completed",
    user_id=user_id,
    item_name=item_name,
    amount=amount,
    payment_method=payment_method
)
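
For these structured events to stay machine-readable all the way to your log backend, structlog should render JSON. A minimal configuration sketch (one of several reasonable setups):

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,      # include bound context (e.g. correlation_id)
        structlog.processors.add_log_level,           # add a "level" field
        structlog.processors.TimeStamper(fmt="iso"),  # add an ISO-8601 "timestamp" field
        structlog.processors.JSONRenderer(),          # emit one JSON object per event
    ]
)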

2. Correlation IDs

import uuid

@app.before_request
def before_request():
    # Generate or extract correlation ID
    correlation_id = request.headers.get('X-Correlation-ID', str(uuid.uuid4()))
    g.correlation_id = correlation_id

    # Add to logs
    structlog.contextvars.bind_contextvars(correlation_id=correlation_id)

    # Add to traces
    span = trace.get_current_span()
    span.set_attribute("correlation.id", correlation_id)

@app.after_request
def after_request(response):
    # Return correlation ID in response
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

3. SLI/SLO Monitoring

# Service Level Indicators/Objectives

SLI (Service Level Indicator): What we measure
- Request success rate
- Request latency P95
- Availability

SLO (Service Level Objective): Target
- 99.9% success rate
- P95 latency < 500ms
- 99.95% availability

Alert: When SLO at risk
- Success rate < 99.9% for 5 minutes
- P95 latency > 500ms for 5 minutes
- Error budget consumed > 80%
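
The "error budget consumed" alert above is simple arithmetic. A rough Python sketch with illustrative numbers (in practice the request counts come from your metrics backend):

# Error-budget math for a 99.9% success-rate SLO (numbers are illustrative)
SLO_TARGET = 0.999
window_requests = 10_000_000   # total requests in the SLO window (e.g. 30 days)
failed_requests = 7_200        # failures observed so far in the window

error_budget = (1 - SLO_TARGET) * window_requests   # 10,000 allowed failures
budget_consumed = failed_requests / error_budget    # 0.72 -> 72% of the budget used

if budget_consumed > 0.8:
    print(f"SLO at risk: {budget_consumed:.0%} of the error budget consumed")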

4. Cost Management

Observability can be expensive:

1. Sample aggressively
   - Keep 100% of errors
   - Sample 10% of successful requests
   - Sample 1% of health checks

2. Use tiered storage
   - Hot: Last 7 days (expensive, fast queries)
   - Warm: 8-30 days (cheaper, slower queries)
   - Cold: 31-90 days (cheapest, slowest)
   - Archive: >90 days (S3, rarely accessed)

3. Set retention policies
   - Traces: 30 days
   - Metrics: 90 days (1m resolution), 1 year (1h resolution)
   - Logs: 7 days (debug), 90 days (error)

Conclusion

Observability is essential for cloud-native applications. OpenTelemetry provides a vendor-neutral standard for instrumenting your applications, giving you the flexibility to choose backends while avoiding vendor lock-in.

Key takeaways:

  1. Implement all three pillars: Metrics, logs, and traces together provide complete observability
  2. Use OpenTelemetry: Industry standard, vendor-neutral, future-proof
  3. Start simple: Auto-instrumentation first, custom spans later
  4. Sample intelligently: Keep errors, sample successful requests
  5. Correlate everything: Use trace IDs across metrics, logs, and traces

The investment in observability pays for itself the first time you debug a production issue in minutes instead of hours.

Need help implementing observability? InstaDevOps provides expert consulting for observability, monitoring, and OpenTelemetry implementation. Contact us for a free consultation.


Need Help with Your DevOps Infrastructure?

At InstaDevOps, we specialize in helping startups and scale-ups build production-ready infrastructure without the overhead of a full-time DevOps team.

Our Services:

  • 🏗️ AWS Consulting - Cloud architecture, cost optimization, and migration
  • ☸️ Kubernetes Management - Production-ready clusters and orchestration
  • 🚀 CI/CD Pipelines - Automated deployment pipelines that just work
  • 📊 Monitoring & Observability - See what's happening in your infrastructure

Special Offer: Get a free DevOps audit - 50+ point checklist covering security, performance, and cost optimization.

📅 Book a Free 15-Min Consultation

Originally published at instadevops.com
