DEV Community
Production Observability for Kubernetes on AWS using OpenTelemetry Operator

Modern Kubernetes environments are highly dynamic, distributed, and complex. While this enables scalability and flexibility, it also introduces a critical challenge: observability at scale.

In production systems, simply collecting logs or metrics is not enough. You need a unified observability strategy that provides:

  • Metrics (system health)
  • Logs (events & debugging)
  • Traces (request flow across services)

In this blog, we’ll explore how to build a production-grade observability stack on AWS using Kubernetes and the OpenTelemetry Operator, covering architecture, implementation, and best practices.


Why Observability is Critical in Kubernetes

Kubernetes introduces several layers of abstraction:

  • Pods are ephemeral
  • Services scale dynamically
  • Network paths are non-linear
  • Failures are distributed

Without proper observability, it becomes difficult to:

  • Identify bottlenecks
  • Debug latency issues
  • Trace failures across services
  • Monitor system health

Observability Architecture Overview

End-to-end observability architecture in Kubernetes using OpenTelemetry Operator, Collector, and Grafana stack.

Architecture Flow

  1. Applications are instrumented using OpenTelemetry
  2. OpenTelemetry Operator injects auto-instrumentation
  3. Telemetry is collected by OpenTelemetry Collector
  4. Data is exported to:
    • Prometheus (metrics)
    • Loki (logs)
    • Tempo (traces)
  5. Grafana visualizes all signals

Key Components of the Stack

OpenTelemetry Operator

  • Auto-injects agents into pods
  • Manages collectors as CRDs
  • Standardizes telemetry pipelines

OpenTelemetry Collector

  • Receives telemetry
  • Processes data
  • Exports to backends

Prometheus (Metrics)

  • CPU / Memory
  • Request rate
  • Error rate
  • Latency

Grafana Tempo (Traces)

  • Distributed tracing
  • Service dependencies
  • Latency analysis

Loki (Logs)

  • Log aggregation
  • Correlation with traces

Grafana

  • Dashboards
  • Logs
  • Traces

Deploying on AWS (EKS-Based Architecture)

  • Amazon EKS → workloads
  • OpenTelemetry Operator → instrumentation
  • OpenTelemetry Collector → pipeline
  • Amazon S3 → object storage backend for Loki and Tempo
  • Grafana → visualization
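
A common way to install the Operator on EKS is via its official Helm chart (the repository and chart names below follow the upstream open-telemetry Helm charts; cert-manager is a prerequisite for the Operator's admission webhooks):

```shell
# Add the upstream OpenTelemetry Helm repository
helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo update

# Install the Operator into its own namespace
# (assumes cert-manager is already running in the cluster)
helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
  --namespace opentelemetry-operator-system \
  --create-namespace
```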

Auto-Instrumentation Example

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: java-instrumentation
spec:
  # Endpoint of the Collector service created by the Operator
  # (named <collector-name>-collector by convention)
  exporter:
    endpoint: http://otel-collector-collector:4317
  propagators:
    - tracecontext
    - baggage
  java:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-java:latest
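
With the Instrumentation resource in place, workloads opt in via a pod annotation. A minimal sketch for a hypothetical checkout Deployment (the image name is illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout
spec:
  replicas: 2
  selector:
    matchLabels:
      app: checkout
  template:
    metadata:
      labels:
        app: checkout
      annotations:
        # Tells the Operator's webhook to inject the Java agent
        # defined by the Instrumentation resource above
        instrumentation.opentelemetry.io/inject-java: "true"
    spec:
      containers:
        - name: checkout
          image: example.com/checkout:1.0  # hypothetical image
```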

OpenTelemetry Collector Config

apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: otel-collector
spec:
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
          http:

    processors:
      batch:

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
      # Tempo ingests OTLP natively, so traces are sent
      # with the otlp exporter (there is no "tempo" exporter)
      otlp:
        endpoint: tempo:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
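
The config above only wires traces and metrics; logs can be shipped to Loki the same way. A sketch using the contrib distribution's loki exporter (the endpoint assumes an in-cluster Loki service):

```yaml
    exporters:
      loki:
        endpoint: http://loki:3100/loki/api/v1/push

    service:
      pipelines:
        logs:
          receivers: [otlp]
          processors: [batch]
          exporters: [loki]
```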

Correlation Workflow

  1. Alert triggered
  2. Check metrics
  3. Inspect traces
  4. Check logs
  5. Identify root cause

Production Best Practices

  • Use sampling
  • Scale collectors
  • Separate pipelines
  • Monitor collectors
  • Secure telemetry
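
"Use sampling" can be as simple as adding the probabilistic_sampler processor to the trace pipeline. A sketch that keeps roughly 10% of traces (the percentage is illustrative; tune it to your traffic volume):

```yaml
    processors:
      batch:
      probabilistic_sampler:
        sampling_percentage: 10

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [probabilistic_sampler, batch]
          exporters: [otlp]
```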

Common Pitfalls

  • No collector
  • Over-collection
  • No sampling
  • No correlation

Real-World Example

User → Frontend → Product → Cart → Checkout → Payment

Observability helps trace issues across services.


Production Debugging Scenario

Let’s look at a real-world scenario to understand how observability helps in production.

Scenario

Users report that the checkout service is slow in a production e-commerce application.

Step 1: Detect the Issue (Metrics)

Grafana dashboard shows:

  • Increased latency in checkout service
  • Spike in response time (P95/P99)

This indicates a performance issue but doesn’t reveal the root cause.
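
A P95 panel like this is typically backed by a query along these lines (the metric name assumes OpenTelemetry's HTTP server duration histogram as exposed to Prometheus; adjust labels and naming to your instrumentation):

```promql
histogram_quantile(
  0.95,
  sum(rate(http_server_duration_bucket{service="checkout"}[5m])) by (le)
)
```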

Step 2: Trace the Request (Traces)

Using Grafana Tempo:

  • Identify slow traces
  • Analyze request flow

Example trace:

Frontend → Cart Service → Checkout Service → Payment Service

Observation:

  • Checkout service is taking unusually long
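
In Tempo, slow checkout traces can be surfaced with a TraceQL query like the following (the service name and threshold are illustrative):

```traceql
{ resource.service.name = "checkout" && duration > 500ms }
```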

Step 3: Drill Down into Spans

Within the trace:

  • A specific span shows high latency
  • Database query inside checkout service is slow

Step 4: Inspect Logs

Using Loki:

  • Filter logs for checkout service
  • Identify errors or warnings

Finding:

  • Database timeout errors
  • Slow query logs
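
In Loki, filtering the checkout service's logs for these timeouts might look like this (label names depend on how your Collector attaches labels to log streams):

```logql
{service="checkout"} |= "timeout"
```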

Step 5: Root Cause Identified

  • Inefficient database query
  • Missing index

Step 6: Resolution

  • Optimize query
  • Add database index
  • Reduce response latency
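
The database fix, sketched against a hypothetical orders table (table and column names are purely illustrative):

```sql
-- Hypothetical slow query: full table scan on orders by user
-- SELECT * FROM orders WHERE user_id = $1 AND status = 'PENDING';

-- A composite index lets the planner avoid the scan
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
```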

Outcome

  • Latency reduced
  • System stabilized
  • Faster incident resolution

Key Insight

This workflow demonstrates the power of correlating metrics, traces, and logs:

  • Metrics → detect
  • Traces → locate
  • Logs → explain

This significantly reduces MTTR (Mean Time to Resolution) in production systems.


Final Thoughts

Combining:

  • OpenTelemetry Operator
  • OpenTelemetry Collector
  • Prometheus, Loki, Tempo
  • Grafana

enables a scalable, production-grade observability platform on AWS.

Observability is not optional; it is foundational.