
Observability for Resilience on Amazon EKS with OpenTelemetry + Datadog (Free Tier)

Building Dashboards That Truly Matter

Resilience in cloud-native applications is not just about restarting pods or running across multiple Availability Zones.

Without deep observability, you don’t know:

  • where latency increases
  • which service degrades first
  • whether autoscaling actually works
  • how long the system takes to recover

In other words: without observability, you test resilience in the dark.

In this article, you will learn how to build a complete observability platform for resilience on Amazon EKS, using only open-source tools and the Datadog free tier.

🧭 What You Will Build

By the end of this article, you will have:

✅ An EKS cluster ready for testing
✅ OpenTelemetry Collector deployed via Helm
✅ Metrics, logs, and traces exported
✅ Datadog configured (free tier)
✅ Dashboards focused on real resilience
✅ A foundation ready for Chaos Engineering

🧠 High-Level Architecture

The architecture follows the modern cloud-native observability pattern:

  • Instrumented (or auto-instrumented) applications
  • OpenTelemetry Collector as the central layer
  • Datadog as the visualization and APM backend
  • CloudWatch as a native AWS complement

👉 Metrics, logs, and traces flow in a unified way

1️⃣ Why Observability Is Fundamental to Resilience

Resilience is not just about staying up.
It is about understanding system behavior under failure.

With proper observability, you can answer questions such as:

Does latency increase during failures?
Chaos tests almost always impact response time. Without metrics, this goes unnoticed.

Does the system fail gracefully?
5xx and 4xx errors show whether the application degrades gracefully or breaks outright.

Is the bottleneck code or infrastructure?
CPU, memory, I/O, and network saturation tell the truth.

Where is the bottleneck between microservices?
Distributed traces show exactly where time is spent.

Is Kubernetes reacting properly?
Events, restarts, and scheduling behavior reveal a lot about resilience.

You cannot improve what you cannot observe.

2️⃣ Creating the EKS Cluster with eksctl

For labs, testing, and technical articles, eksctl is fast and efficient:

eksctl create cluster \
  --name observability-eks \
  --region us-east-1 \
  --version 1.30 \
  --nodegroup-name ng-default \
  --node-type t3.medium \
  --nodes 2 \
  --nodes-min 2 \
  --nodes-max 4 \
  --managed

This creates:

  • A functional EKS cluster
  • A managed node group
  • IAM automatically configured
  • kubeconfig ready to use
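
Before installing anything else, it is worth a quick sanity check that the cluster is reachable (eksctl writes your kubeconfig by default):

# Both nodes should report Ready
kubectl get nodes -o wide

# Core add-ons (CoreDNS, kube-proxy, aws-node) should be Running
kubectl get pods -n kube-system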

3️⃣ Minimal Application Instrumentation

Even without deep instrumentation, it is already possible to extract significant value.

📌 Automatic Kubernetes Metrics

Collected via kubelet and cAdvisor:

  • CPU and memory per pod
  • Restarts
  • Network usage
  • Scheduling latency

📌 Automatic Tracing (Auto-Instrumentation)

OpenTelemetry supports:

  • Java
  • Node.js
  • Python
  • Go (partial)

Without changing the code, you already get distributed traces.
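
As an illustration, with the OpenTelemetry Operator installed and an Instrumentation resource already created in the namespace, enabling Node.js auto-instrumentation is a single annotation. The deployment name checkout is a placeholder:

# Ask the OpenTelemetry Operator to inject the Node.js auto-instrumentation agent
kubectl patch deployment checkout -n default --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-nodejs":"true"}}}}}'

The changed pod template triggers a rollout, and the restarted pods come up with the language agent attached, exporting spans over OTLP.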

📌 Structured Logs

Recommended format:

{
  "timestamp": "2025-01-01T12:34:56Z",
  "message": "Order created",
  "trace_id": "abc123",
  "span_id": "def456",
  "service": "checkout"
}

This enables direct correlation between logs and traces.

4️⃣ Deploying the OpenTelemetry Collector with Helm

The OpenTelemetry Collector acts as the central observability layer.

It receives data via OTLP, processes it, and exports it to Datadog.

Installation via Helm

helm install otel-collector ./otel-datadog \
  --namespace observability \
  --create-namespace \
  --set datadog.apiKey=<YOUR_API_KEY>

The Collector starts collecting:

  • Metrics (Prometheus / Kubernetes)
  • Logs
  • Traces
  • Cluster events
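
The ./otel-datadog chart above is a local chart, so its exact contents are not shown here. Conceptually, it boils down to an OTLP receiver, a batch processor, and the Datadog exporter. A minimal sketch of the same idea using the upstream opentelemetry-collector chart (the contrib image ships the Datadog exporter; file and release names here are illustrative):

cat > otel-values.yaml <<'EOF'
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib
config:
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
  processors:
    batch: {}
  exporters:
    datadog:
      api:
        key: <YOUR_API_KEY>
        site: datadoghq.com
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [datadog]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [datadog]
EOF

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  -f otel-values.yaml

If you stick with the local chart from the article, only the install command changes; the receiver → processor → exporter pipeline is the same idea.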

5️⃣ Datadog Free Tier

The Datadog Free Tier is surprisingly powerful:

✔ Up to 5 free hosts
✔ APM included
✔ Unlimited dashboards
✔ Automatic Service Map
✔ Basic alerts

This is more than enough for resilience and chaos testing.

6️⃣ Dashboards That Truly Matter for Resilience

This is the key point of the article: what to monitor.

📊 6.1 Service Latency (APM)

Metric:

trace.<service>.request.duration

Helps identify:

  • failure impact
  • progressive degradation
  • bottlenecks between services

🚨 6.2 5xx and 4xx Errors

Metrics:

http.server.request.error.count
trace.<service>.errors

A direct indicator of user-perceived failure.

🔥 6.3 CPU and Memory Saturation per Pod

kubernetes.pod.cpu.usage.total
kubernetes.pod.memory.usage

Essential for validating:

  • HPA
  • Karpenter
  • poorly sized requests and limits

⚙️ 6.4 Event Loop (Node.js)

Custom metric:

event_loop_delay

Shows when the application is alive but unusable.

🗺️ 6.5 Service Map

Automatically visualizes:

  • broken dependencies
  • increased latency
  • critical services

One of Datadog’s most powerful features.

🔄 6.6 Kubernetes Events and Restarts

kubernetes.pod.restart.count
kubernetes.event.count

Detects:

  • CrashLoopBackOff
  • OOMKilled
  • readiness failures
  • scheduling issues
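
The same signals are easy to cross-check from the CLI while an experiment is running:

# Recent cluster events, most recent last
kubectl get events -A --sort-by=.lastTimestamp

# Pods ranked by container restart count
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'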

📊 Datadog Dashboard — EKS Resilience Observability
How to Use

  1. Datadog → Dashboards
  2. New Dashboard
  3. Import JSON
  4. Paste the content below
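
If you prefer automation over the UI, the same JSON can be pushed through the Datadog API (assuming the JSON below is saved as dashboard.json and you have both an API key and an application key):

curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: <YOUR_API_KEY>" \
  -H "DD-APPLICATION-KEY: <YOUR_APP_KEY>" \
  -d @dashboard.json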

Covered Items (Section 6 Checklist)

✔ 6.1 Service latency (APM)
✔ 6.2 5xx and 4xx errors
✔ 6.3 CPU per pod
✔ 6.3 Memory per pod
✔ 6.4 Event Loop (Node.js)
✔ 6.5 Service Map (operational reference)
✔ 6.6 Pod restarts
✔ 6.6 Kubernetes events

🧩 Dashboard JSON

{
  "title": "EKS Resilience Observability",
  "description": "Dashboards focados em resiliência no EKS usando OpenTelemetry + Datadog",
  "layout_type": "ordered",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "title": "APM - Latência por Serviço",
        "requests": [
          {
            "q": "avg:trace.*.request.duration{*} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "APM - Taxa de Erros 5xx",
        "requests": [
          {
            "q": "sum:http.server.request.error.count{status:5xx} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "APM - Taxa de Erros 4xx",
        "requests": [
          {
            "q": "sum:http.server.request.error.count{status:4xx} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - CPU por Pod",
        "requests": [
          {
            "q": "avg:kubernetes.pod.cpu.usage.total{*} by {pod_name,namespace}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - Memória por Pod",
        "requests": [
          {
            "q": "avg:kubernetes.pod.memory.usage{*} by {pod_name,namespace}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Node.js - Event Loop Delay",
        "requests": [
          {
            "q": "avg:event_loop_delay{*} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - Restarts de Pods",
        "requests": [
          {
            "q": "sum:kubernetes.pod.restart.count{*} by {pod_name,namespace}",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - Eventos por Tipo",
        "requests": [
          {
            "q": "sum:kubernetes.event.count{*} by {reason}",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "note",
        "content": "🔎 Use o **Service Map do Datadog (APM → Service Map)** para visualizar dependências, gargalos e falhas de comunicação entre microserviços.",
        "background_color": "blue",
        "font_size": "16"
      }
    }
  ]
}

🧠 How This Dashboard Helps with RESILIENCE

Signal | What It Validates
Latency | Real impact of failures
5xx Errors | User-perceived failure
4xx Errors | Controlled degradation
CPU / Memory | Bottlenecks and autoscaling
Event Loop | App alive but degraded
Restarts | Pod stability
Kubernetes Events | Root cause
Service Map | Critical dependencies

7️⃣ Complementing with CloudWatch

Even when using Datadog, CloudWatch remains useful for:

  • control plane logs
  • VPC CNI
  • EKS events
  • cluster scaling

This creates a hybrid and complete observability approach.
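
Control plane logging is off by default; enabling it for the cluster created earlier is one command (trim the types to what you actually need):

aws eks update-cluster-config \
  --region us-east-1 \
  --name observability-eks \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'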

8️⃣ Validating Resilience in Practice

With everything observable, you can test:

Node failure

  • pod redistribution
  • latency impact
  • recovery time

Pod failure

  • perceived errors
  • retries
  • fallback

Network failures

  • inter-service timeouts
  • artificial latency

Traffic spikes

  • saturation
  • autoscaling behavior

Now you measure, rather than assume.
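
A few low-tech ways to trigger these scenarios with plain kubectl before reaching for a chaos tool (the app=checkout label and service name are placeholders for your own workload):

# Pod failure: kill the pods behind a service, watch error rate and recovery time
kubectl delete pod -n default -l app=checkout

# Node failure: drain a node to force pod redistribution
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# afterwards: kubectl uncordon <node-name>

# Traffic spike: hammer a service from a throwaway pod (Ctrl+C to stop; --rm cleans up)
kubectl run load-test --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c 'while true; do wget -q -O- http://checkout.default.svc.cluster.local > /dev/null; done'

Dedicated tools such as AWS Fault Injection Service, LitmusChaos, or Chaos Mesh are the natural next step once these basics are measurable.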

9️⃣ Conclusion — Observability Is a Pillar of Resilience

Resilience is not luck.
It is data-driven engineering.

With OpenTelemetry + Datadog, even on the free tier, you get:

✅ deep system visibility
✅ correlation between metrics, logs, and traces
✅ actionable dashboards
✅ a solid foundation for Chaos Engineering
✅ continuous feedback for improvement

If you want to build real resilience on Amazon EKS, the journey starts with observability.
