
Observability for Resilience on Amazon EKS with OpenTelemetry + Datadog (Free Tier)

Building Dashboards That Truly Matter

Resilience in cloud-native applications is not just about restarting pods or running across multiple Availability Zones.

Without deep observability, you don’t know:

  • where latency increases
  • which service degrades first
  • whether autoscaling actually works
  • how long the system takes to recover

In other words: without observability, you test resilience in the dark.

In this article, you will learn how to build a complete observability platform for resilience on Amazon EKS, using only open-source tools and the Datadog free tier.

🧭 What You Will Build

By the end of this article, you will have:

✅ An EKS cluster ready for testing
✅ OpenTelemetry Collector deployed via Helm
✅ Metrics, logs, and traces exported
✅ Datadog configured (free tier)
✅ Dashboards focused on real resilience
✅ A foundation ready for Chaos Engineering

🧠 High-Level Architecture

The architecture follows the modern cloud-native observability pattern:

  • Instrumented (or auto-instrumented) applications
  • OpenTelemetry Collector as the central layer
  • Datadog as the visualization and APM backend
  • CloudWatch as a native AWS complement

👉 Metrics, logs, and traces flow in a unified way

1️⃣ Why Observability Is Fundamental to Resilience

Resilience is not just about staying up.
It is about understanding system behavior under failure.

With proper observability, you can answer questions such as:

Does latency increase during failures?
Chaos tests almost always impact response time. Without metrics, this goes unnoticed.

Does the system fail gracefully?
5xx and 4xx errors show whether the application degrades gracefully or breaks outright.

Is the bottleneck code or infrastructure?
CPU, memory, I/O, and network saturation tell the truth.

Where is the bottleneck between microservices?
Distributed traces show exactly where time is spent.

Is Kubernetes reacting properly?
Events, restarts, and scheduling behavior reveal a lot about resilience.

You cannot improve what you cannot observe.

2️⃣ Creating the EKS Cluster with eksctl

For labs, testing, and technical articles, eksctl is fast and efficient:

eksctl create cluster \
  --name observability-eks \
  --region us-east-1 \
  --version 1.30 \
  --nodegroup-name ng-default \
  --node-type t3.medium \
  --nodes 2 \
  --nodes-min 2 \
  --nodes-max 4 \
  --managed

This creates:

  • A functional EKS cluster
  • A managed node group
  • IAM automatically configured
  • kubeconfig ready to use
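
Before installing anything else, it is worth a quick sanity check that the cluster is reachable (eksctl writes your kubeconfig by default):

# Both nodes should report Ready
kubectl get nodes -o wide

# Core add-ons (CoreDNS, kube-proxy, aws-node) should be Running
kubectl get pods -n kube-system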

3️⃣ Minimal Application Instrumentation

Even without deep instrumentation, it is already possible to extract significant value.

📌 Automatic Kubernetes Metrics

Collected via kubelet and cAdvisor:

  • CPU and memory per pod
  • Restarts
  • Network usage
  • Scheduling latency

📌 Automatic Tracing (Auto-Instrumentation)

OpenTelemetry supports:

  • Java
  • Node.js
  • Python
  • Go (partial)

Without changing the code, you already get distributed traces.
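
As an illustration, with the OpenTelemetry Operator installed and an Instrumentation resource already created in the namespace, enabling Node.js auto-instrumentation is a single annotation. The deployment name checkout is a placeholder:

# Ask the OpenTelemetry Operator to inject the Node.js auto-instrumentation agent
kubectl patch deployment checkout -n default --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"instrumentation.opentelemetry.io/inject-nodejs":"true"}}}}}'

The changed pod template triggers a rollout, and the restarted pods come up with the language agent attached, exporting spans over OTLP.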

📌 Structured Logs

Recommended format:

{
  "timestamp": "2025-01-01T12:34:56Z",
  "message": "Order created",
  "trace_id": "abc123",
  "span_id": "def456",
  "service": "checkout"
}

This enables direct correlation between logs and traces.

4️⃣ Deploying the OpenTelemetry Collector with Helm

The OpenTelemetry Collector acts as the central observability layer.

It receives data via OTLP, processes it, and exports it to Datadog.

Installation via Helm

helm install otel-collector ./otel-datadog \
  --namespace observability \
  --create-namespace \
  --set datadog.apiKey=<YOUR_API_KEY>

The Collector starts collecting:

  • Metrics (Prometheus / Kubernetes)
  • Logs
  • Traces
  • Cluster events
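
The ./otel-datadog chart above is a local chart, so its exact contents are not shown here. Conceptually, it boils down to an OTLP receiver, a batch processor, and the Datadog exporter. A minimal sketch of the same idea using the upstream opentelemetry-collector chart (the contrib image ships the Datadog exporter; file and release names here are illustrative):

cat > otel-values.yaml <<'EOF'
mode: daemonset
image:
  repository: otel/opentelemetry-collector-contrib
config:
  receivers:
    otlp:
      protocols:
        grpc: {}
        http: {}
  processors:
    batch: {}
  exporters:
    datadog:
      api:
        key: <YOUR_API_KEY>
        site: datadoghq.com
  service:
    pipelines:
      traces:
        receivers: [otlp]
        processors: [batch]
        exporters: [datadog]
      metrics:
        receivers: [otlp]
        processors: [batch]
        exporters: [datadog]
EOF

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm upgrade --install otel-collector open-telemetry/opentelemetry-collector \
  --namespace observability --create-namespace \
  -f otel-values.yaml

If you stick with the local chart from the article, only the install command changes; the receiver → processor → exporter pipeline is the same idea.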

5️⃣ Datadog Free Tier

The Datadog Free Tier is surprisingly powerful:

✔ Up to 5 free hosts
✔ APM included
✔ Unlimited dashboards
✔ Automatic Service Map
✔ Basic alerts

This is more than enough for resilience and chaos testing.

6️⃣ Dashboards That Truly Matter for Resilience

This is the key point of the article: what to monitor.

📊 6.1 Service Latency (APM)

Metric:

trace.<service>.request.duration

Helps identify:

  • failure impact
  • progressive degradation
  • bottlenecks between services

🚨 6.2 5xx and 4xx Errors

Metrics:

http.server.request.error.count
trace.<service>.errors

A direct indicator of user-perceived failure.

🔥 6.3 CPU and Memory Saturation per Pod

kubernetes.pod.cpu.usage.total
kubernetes.pod.memory.usage

Essential for validating:

  • HPA
  • Karpenter
  • poorly sized requests and limits

⚙️ 6.4 Event Loop (Node.js)

Custom metric:

event_loop_delay

Shows when the application is alive but unusable.

🗺️ 6.5 Service Map

Automatically visualizes:

  • broken dependencies
  • increased latency
  • critical services

One of Datadog’s most powerful features.

🔄 6.6 Kubernetes Events and Restarts

kubernetes.pod.restart.count
kubernetes.event.count

Detects:

  • CrashLoopBackOff
  • OOMKilled
  • readiness failures
  • scheduling issues
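
The same signals are easy to cross-check from the CLI while an experiment is running:

# Recent cluster events, most recent last
kubectl get events -A --sort-by=.lastTimestamp

# Pods ranked by container restart count
kubectl get pods -A --sort-by='.status.containerStatuses[0].restartCount'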

📊 Datadog Dashboard — EKS Resilience Observability
How to Use

  1. Datadog → Dashboards
  2. New Dashboard
  3. Import JSON
  4. Paste the content below
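
If you prefer automation over the UI, the same JSON can be pushed through the Datadog API (assuming the JSON below is saved as dashboard.json and you have both an API key and an application key):

curl -X POST "https://api.datadoghq.com/api/v1/dashboard" \
  -H "Content-Type: application/json" \
  -H "DD-API-KEY: <YOUR_API_KEY>" \
  -H "DD-APPLICATION-KEY: <YOUR_APP_KEY>" \
  -d @dashboard.json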

Covered Items (Section 6 Checklist)

✔ 6.1 Service latency (APM)
✔ 6.2 5xx and 4xx errors
✔ 6.3 CPU per pod
✔ 6.3 Memory per pod
✔ 6.4 Event Loop (Node.js)
✔ 6.5 Service Map (operational reference)
✔ 6.6 Pod restarts
✔ 6.6 Kubernetes events

🧩 Dashboard JSON

{
  "title": "EKS Resilience Observability",
  "description": "Dashboards focados em resiliência no EKS usando OpenTelemetry + Datadog",
  "layout_type": "ordered",
  "widgets": [
    {
      "definition": {
        "type": "timeseries",
        "title": "APM - Latência por Serviço",
        "requests": [
          {
            "q": "avg:trace.*.request.duration{*} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "APM - Taxa de Erros 5xx",
        "requests": [
          {
            "q": "sum:http.server.request.error.count{status:5xx} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "APM - Taxa de Erros 4xx",
        "requests": [
          {
            "q": "sum:http.server.request.error.count{status:4xx} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - CPU por Pod",
        "requests": [
          {
            "q": "avg:kubernetes.pod.cpu.usage.total{*} by {pod_name,namespace}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - Memória por Pod",
        "requests": [
          {
            "q": "avg:kubernetes.pod.memory.usage{*} by {pod_name,namespace}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Node.js - Event Loop Delay",
        "requests": [
          {
            "q": "avg:event_loop_delay{*} by {service}",
            "display_type": "line"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - Restarts de Pods",
        "requests": [
          {
            "q": "sum:kubernetes.pod.restart.count{*} by {pod_name,namespace}",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "timeseries",
        "title": "Kubernetes - Eventos por Tipo",
        "requests": [
          {
            "q": "sum:kubernetes.event.count{*} by {reason}",
            "display_type": "bars"
          }
        ]
      }
    },
    {
      "definition": {
        "type": "note",
        "content": "🔎 Use o **Service Map do Datadog (APM → Service Map)** para visualizar dependências, gargalos e falhas de comunicação entre microserviços.",
        "background_color": "blue",
        "font_size": "16"
      }
    }
  ]
}

🧠 How This Dashboard Helps with RESILIENCE

Signal | What It Validates
Latency | Real impact of failures
5xx Errors | User-perceived failure
4xx Errors | Controlled degradation
CPU / Memory | Bottlenecks and autoscaling
Event Loop | App alive but degraded
Restarts | Pod stability
Kubernetes Events | Root cause
Service Map | Critical dependencies

7️⃣ Complementing with CloudWatch

Even when using Datadog, CloudWatch remains useful for:

  • control plane logs
  • VPC CNI
  • EKS events
  • cluster scaling

This creates a hybrid and complete observability approach.
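
Control plane logging is off by default; enabling it for the cluster created earlier is one command (trim the types to what you actually need):

aws eks update-cluster-config \
  --region us-east-1 \
  --name observability-eks \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'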

8️⃣ Validating Resilience in Practice

With everything observable, you can test:

Node failure

  • pod redistribution
  • latency impact
  • recovery time

Pod failure

  • perceived errors
  • retries
  • fallback

Network failures

  • inter-service timeouts
  • artificial latency

Traffic spikes

  • saturation
  • autoscaling behavior

Now you measure, rather than assume.
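
A few low-tech ways to trigger these scenarios with plain kubectl before reaching for a chaos tool (the app=checkout label and service name are placeholders for your own workload):

# Pod failure: kill the pods behind a service, watch error rate and recovery time
kubectl delete pod -n default -l app=checkout

# Node failure: drain a node to force pod redistribution
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# afterwards: kubectl uncordon <node-name>

# Traffic spike: hammer a service from a throwaway pod (Ctrl+C to stop; --rm cleans up)
kubectl run load-test --rm -it --image=busybox --restart=Never -- \
  /bin/sh -c 'while true; do wget -q -O- http://checkout.default.svc.cluster.local > /dev/null; done'

Dedicated tools such as AWS Fault Injection Service, LitmusChaos, or Chaos Mesh are the natural next step once these basics are measurable.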

9️⃣ Conclusion — Observability Is a Pillar of Resilience

Resilience is not luck.
It is data-driven engineering.

With OpenTelemetry + Datadog, even on the free tier, you get:

✅ deep system visibility
✅ correlation between metrics, logs, and traces
✅ actionable dashboards
✅ a solid foundation for Chaos Engineering
✅ continuous feedback for improvement

If you want to build real resilience on Amazon EKS, the journey starts with observability.
