DEV Community: Diya

Deployment using all three Kubernetes probes

Diya — Mon, 25 May 2026 03:28:10 +0000

Full Example YAML

Here’s the container section of a Deployment that uses all three Kubernetes probes:

containers:
  - name: api
    image: my-api:latest

    startupProbe:
      httpGet:
        path: /readyz
        port: 5000
      failureThreshold: 20
      periodSeconds: 15

    readinessProbe:
      httpGet:
        path: /readyz
        port: 5000
      initialDelaySeconds: 5
      periodSeconds: 10
      failureThreshold: 3

    livenessProbe:
      httpGet:
        path: /healthz
        port: 5000
      initialDelaySeconds: 30
      periodSeconds: 20
      failureThreshold: 3

Now let’s break down what Kubernetes is actually doing here.

startupProbe

startupProbe:
  httpGet:
    path: /readyz
    port: 5000
  failureThreshold: 20
  periodSeconds: 15

This tells Kubernetes:

Check /readyz every 15 seconds.
Allow 20 failures before killing the container.

Calculation:

15 seconds × 20 failures = 300 seconds

So Kubernetes gives the application:

5 minutes to fully start

before deciding:

“The application failed to start.”
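The arithmetic above can be written as a small helper. This is only a sketch: the function name `startup_budget_seconds` is made up for illustration, and the result is an approximation because it ignores per-probe timeouts and scheduling jitter.

```python
def startup_budget_seconds(period_seconds: int,
                           failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container gets to pass its startupProbe."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Values from the startupProbe in this post.
print(startup_budget_seconds(period_seconds=15, failure_threshold=20))  # 300
```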

Default Values

If you define a startupProbe but leave these fields out, Kubernetes uses:

periodSeconds: 10
timeoutSeconds: 1
failureThreshold: 3
successThreshold: 1

Which means by default:

10 seconds × 3 failures = 30 seconds

Your application may only get:

~30 seconds

before Kubernetes decides startup failed.

This is why slow-starting applications often need a custom startupProbe.
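To see how the defaults combine with whatever you set explicitly, here is a minimal Python sketch. `PROBE_DEFAULTS`, `effective_probe`, and `failure_window_seconds` are hypothetical names for this post; the default values are the ones listed above.

```python
PROBE_DEFAULTS = {
    "initialDelaySeconds": 0,
    "periodSeconds": 10,
    "timeoutSeconds": 1,
    "failureThreshold": 3,
    "successThreshold": 1,
}


def effective_probe(spec: dict) -> dict:
    """Return the probe settings with the defaults filled in."""
    return {**PROBE_DEFAULTS, **spec}


def failure_window_seconds(spec: dict) -> int:
    """Rough time of continuous failure before Kubernetes acts."""
    probe = effective_probe(spec)
    return probe["periodSeconds"] * probe["failureThreshold"]


print(failure_window_seconds({}))                                            # 30
print(failure_window_seconds({"periodSeconds": 15, "failureThreshold": 20}))  # 300
```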

Common Real-World Use Cases

Java applications
ML workloads
applications loading huge caches
Python/Gunicorn services
applications waiting for database migrations

The important part:

A startup probe failure itself is NOT the issue.

The issue happens only when failures continue beyond the threshold.

readinessProbe

readinessProbe:
  httpGet:
    path: /readyz
    port: 5000
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3

This tells Kubernetes:

Wait at least 5 seconds after container start (and, because a startupProbe is defined, until that probe has succeeded).
Then check /readyz every 10 seconds.
If it fails 3 consecutive times:
remove the pod from Service traffic.

Calculation:

10 seconds × 3 failures = 30 seconds

If the application cannot respond successfully for:

30 continuous seconds

the pod becomes:

NotReady

But importantly:

The container is NOT restarted.

Traffic simply stops flowing to it temporarily.
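The consecutive-failure rule is easier to see as a small simulation. This is a simplified model, not kubelet code: `readiness_timeline` is a hypothetical helper that takes a list of probe results (True means success) and returns whether the pod would be Ready after each check.

```python
def readiness_timeline(results, failure_threshold=3, success_threshold=1):
    """Map a sequence of probe results to Ready / NotReady states."""
    ready = False  # a pod with a readinessProbe starts out NotReady
    failures = successes = 0
    timeline = []
    for ok in results:
        if ok:
            successes += 1
            failures = 0
            if successes >= success_threshold:
                ready = True
        else:
            failures += 1
            successes = 0
            if failures >= failure_threshold:
                ready = False
        timeline.append(ready)
    return timeline


# Two isolated failures are tolerated; three in a row flip the pod to NotReady.
checks = [True, False, False, True, False, False, False, True]
print(readiness_timeline(checks))
# [True, True, True, True, True, True, False, True]
```

Note that nothing in this model restarts anything. The pod only moves in and out of the Ready state.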

Default Values

If not configured, Kubernetes defaults to:

initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 1
failureThreshold: 3
successThreshold: 1

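This means Kubernetes starts checking almost immediately.

For readiness that is harmless on its own, since a failing pod simply stays NotReady, but expect early "Readiness probe failed" events from applications that: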

take time to boot
warm caches
establish DB connections
initialize workers

Important Concept

A readiness failure usually means:

"Do not send traffic right now."

It does NOT mean:

"The application is dead."

This distinction is extremely important in production.
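One way to keep the two meanings separate is to serve them from separate endpoints. The sketch below uses only the Python standard library; `APP_STATE`, `probe_response`, and `make_server` are illustrative names, and a real service would plug this into its own web framework. Kubernetes treats any HTTP status from 200 to 399 as a successful probe.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Flip "ready" to True once caches, DB connections, workers, etc. are up.
APP_STATE = {"ready": False}


def probe_response(path: str, ready: bool) -> tuple[int, str]:
    """Decide the status code and body for a probe request."""
    if path == "/healthz":
        return 200, "ok"  # the process is alive and able to answer
    if path == "/readyz":
        return (200, "ready") if ready else (503, "not ready")
    return 404, "not found"


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = probe_response(self.path, APP_STATE["ready"])
        payload = body.encode()
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port: int = 5000) -> HTTPServer:
    """Build (but do not start) an HTTP server exposing both endpoints."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)
```

Calling `make_server().serve_forever()` would start it on port 5000, matching the YAML above. While `/readyz` returns 503 the pod receives no traffic, but `/healthz` still returns 200, so the container is not restarted.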

livenessProbe

livenessProbe:
  httpGet:
    path: /healthz
    port: 5000
  initialDelaySeconds: 30
  periodSeconds: 20
  failureThreshold: 3

This tells Kubernetes:

Wait at least 30 seconds after container start, and until the startupProbe has succeeded, before starting checks.
Then check /healthz every 20 seconds.
If it fails 3 consecutive times:
restart the container.

Calculation:

20 seconds × 3 failures = 60 seconds

If health checks fail continuously for:

60 seconds

Kubernetes assumes:

“The application is unhealthy or stuck.”

and restarts the container automatically.

Default Values

Kubernetes defaults:

initialDelaySeconds: 0
periodSeconds: 10
timeoutSeconds: 1
failureThreshold: 3
successThreshold: 1

Which effectively means:

10 seconds × 3 failures = 30 seconds

before restart behavior begins.

Common Mistake

Many teams configure aggressive liveness probes like:

timeoutSeconds: 1

During:

CPU spikes
GC pauses
dependency slowness
temporary latency

the application may briefly respond slowly.

This can accidentally trigger unnecessary restarts.
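To make that concrete, here is a toy model of how response latency interacts with `timeoutSeconds`. It is a sketch under simple assumptions: each number is the time one probe request took, a probe fails when it takes longer than the timeout, and `first_restart_index` is a hypothetical helper, not a Kubernetes API.

```python
def first_restart_index(latencies, timeout_seconds=1.0, failure_threshold=3):
    """Return the index of the probe that triggers a restart, or None."""
    consecutive_failures = 0
    for index, latency in enumerate(latencies):
        if latency > timeout_seconds:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                return index
        else:
            consecutive_failures = 0
    return None


# A short slow patch (for example a GC pause) in an otherwise healthy app.
latencies = [0.2, 0.3, 1.4, 1.8, 1.2, 0.3]
print(first_restart_index(latencies, timeout_seconds=1.0))  # 4
print(first_restart_index(latencies, timeout_seconds=3.0))  # None
```

With a 1 second timeout the slow patch is enough to restart the container. With a 3 second timeout the same application rides it out.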

The Most Important Thing to Understand

Many engineers panic immediately when they see:

Readiness probe failed

or:

Liveness probe failed

But the probe system is designed to tolerate occasional failures.

The real question is:

Did the failures exceed the threshold?

Because Kubernetes only takes action after repeated failures over time.

That’s why these settings matter so much:

failureThreshold
periodSeconds
timeoutSeconds
initialDelaySeconds

Together, they control:

how patient Kubernetes should be
when traffic should stop
when restarts should happen
how tolerant the system should be during spikes

Probe	What Happens on Failure?
startupProbe	Container may be killed if startup takes too long
readinessProbe	Pod stops receiving traffic
livenessProbe	Container gets restarted

Kubernetes probes are not meant to punish applications.

They are safety mechanisms.

The goal is to:

avoid sending traffic to unhealthy pods
restart stuck applications
allow slow startups safely

Once you understand probe thresholds, Kubernetes behavior suddenly becomes much easier to debug.

How Logs Travel From Your EKS Pod to Datadog

Diya — Mon, 25 May 2026 03:04:13 +0000

If you’re running applications on Kubernetes using Amazon EKS and suddenly seeing logs appear in Datadog, you may have wondered:

“How did the logs even get there?”

Your application is running inside a Kubernetes pod.

Datadog is somewhere in the cloud.

Yet somehow every request, every error, and every stack trace magically appears in the Datadog UI.

At first, it feels invisible.

But once you understand the observability pipeline, Kubernetes starts making a lot more sense.

What Is Datadog?

Datadog is an observability platform.

It helps engineers monitor:

Infrastructure
Kubernetes clusters
Applications
Logs
Metrics
Traces
Security events

Think of Datadog as a centralized monitoring brain for your systems.

Instead of SSH-ing into servers and manually checking logs, Datadog collects everything into one searchable place.

You can search:

service:map-service status:error

and instantly see logs from hundreds of Kubernetes pods.

You can also:

Create dashboards
Set alerts
Trace requests
Monitor pod restarts
Watch CPU and memory usage
Detect failures in real time

But here’s the important part:

Your application itself usually does NOT directly communicate with Datadog.

That job belongs to the Datadog Agent.

What Is the Datadog Agent?

The Datadog Agent is the collector.

It runs inside your Kubernetes cluster and continuously gathers:

Logs
Metrics
Traces
Kubernetes metadata
Container information

In Kubernetes, the Datadog Agent is usually deployed as a:

DaemonSet

A DaemonSet means:

“Run one Datadog Agent pod on every Kubernetes node.”

So if your EKS cluster has:

20 worker nodes

Kubernetes automatically creates:

20 Datadog Agent pods

Each agent watches workloads running on its own node.

The Big Question

Here’s what most people wonder:

“My application is inside a pod… so how does Datadog see its logs?”

To understand this, we first need to understand how Kubernetes handles logs internally.

Step 1: Your Application Writes Logs

Inside your container, your application usually writes logs to:

stdout
stderr

For example:

print("hello world")

or:

logger.error("database connection failed")

Or maybe your Gunicorn server prints:

500 Internal Server Error

Your app is simply writing text output.

It doesn’t know anything about Datadog.

It doesn’t know about dashboards.

It doesn’t know about observability.

It is simply talking.
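In Python, "writing to stdout" can be as simple as pointing the standard `logging` module at it. This is a minimal sketch; the `build_logger` helper and the log format are illustrative choices, not something Datadog requires.

```python
import logging
import sys


def build_logger(name="app", stream=None):
    """Create a logger that writes one line per record to stdout (or `stream`)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()      # avoid duplicate handlers on repeated calls
    logger.propagate = False
    handler = logging.StreamHandler(stream if stream is not None else sys.stdout)
    handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    return logger


logger = build_logger()
logger.error("database connection failed")
# ERROR app database connection failed
```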

Step 2: Kubernetes Captures Container Logs

Kubernetes containers run through a container runtime like:

containerd
CRI-O
Docker Engine (through the cri-dockerd adapter)

EKS-optimized AMIs for Kubernetes 1.24 and later ship with:

containerd

The runtime captures container stdout/stderr and stores it as log files on the Kubernetes node.

Usually under paths like:

/var/log/containers/
/var/log/pods/

The entries under /var/log/containers/ are symlinks to the actual files under /var/log/pods/.

At this point:

The logs now physically exist on the EC2 worker node.

This is the key realization:

Your pod logs are not floating magically inside Kubernetes.

They become actual files on the node filesystem.
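With containerd, each line in those files follows the CRI logging format: a timestamp, the stream name, a tag (`F` for a full line, `P` for a partial one), and then the message. Here is a small parser sketch; `CriLogLine` and `parse_cri_line` are names invented for this example, and the sample line is made up.

```python
from typing import NamedTuple


class CriLogLine(NamedTuple):
    timestamp: str
    stream: str   # "stdout" or "stderr"
    tag: str      # "F" = full line, "P" = partial line
    message: str


def parse_cri_line(line: str) -> CriLogLine:
    """Split one CRI-format log line into its four fields."""
    parts = line.rstrip("\n").split(" ", 3)
    if len(parts) == 3:      # the application logged an empty message
        parts.append("")
    if len(parts) != 4:
        raise ValueError(f"not a CRI log line: {line!r}")
    return CriLogLine(*parts)


sample = "2026-05-25T03:04:13.123456789Z stderr F ERROR database timeout\n"
print(parse_cri_line(sample).message)  # ERROR database timeout
```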

Step 3: The Datadog Agent Watches Those Logs

Now the Datadog Agent enters the picture.

Because the Agent runs on every node, it can monitor container log files locally.

Conceptually, it does something similar to:

tail -f /var/log/containers/*.log

The Agent continuously watches:

new logs
new containers
restarted pods
Kubernetes metadata

Whenever your application writes a log like:

ERROR database timeout

the Datadog Agent immediately sees it.
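The `tail -f` idea boils down to remembering how far you have read and picking up only what was appended since. The sketch below shows that technique in plain Python. It is not the Agent's implementation, and `read_new_lines` is a hypothetical helper.

```python
import os
import tempfile


def read_new_lines(path, offset=0):
    """Return (complete lines appended since `offset`, new offset)."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        chunk = handle.read()
    # Only consume complete lines; a partial trailing line waits for next time.
    end = chunk.rfind(b"\n") + 1
    lines = chunk[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "app.log")
    with open(path, "w") as log_file:
        log_file.write("INFO started\n")
    first, offset = read_new_lines(path)

    with open(path, "a") as log_file:
        log_file.write("ERROR database timeout\n")
    second, offset = read_new_lines(path, offset)

    print(first, second)
    # ['INFO started'] ['ERROR database timeout']
```

A real collector also has to handle log rotation and containers that appear or disappear, which this sketch leaves out.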

Step 4: Metadata Enrichment

This is where Datadog becomes powerful.

The Datadog Agent doesn’t just forward raw text.

It enriches logs with Kubernetes metadata.

Example:

{
  "message": "GET /orders 500",
  "pod_name": "map-service-12345abc-west",
  "namespace": "production",
  "service": "map-service",
  "node": "ip-10-0-10-12",
  "cluster": "eks-prod"
}

Now your logs become searchable.

You can filter by:

pod
namespace
service
cluster
environment
container

Without this enrichment, logs would just be random text.
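To get a feel for enrichment, note that the file names under /var/log/containers/ already encode some metadata, in the form `<pod>_<namespace>_<container>-<container id>.log`. The sketch below derives tags from that name. It is only an illustration: the real Agent has richer metadata sources, and `enrich`, the `cluster` argument, and the sample path are all assumptions made for this example.

```python
import os
import re

# <pod>_<namespace>_<container>-<64 hex character container id>.log
_LOG_NAME = re.compile(
    r"^(?P<pod_name>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def enrich(message: str, log_path: str, cluster: str) -> dict:
    """Attach Kubernetes-style metadata to a raw log message."""
    match = _LOG_NAME.match(os.path.basename(log_path))
    if match is None:
        raise ValueError(f"unexpected log file name: {log_path}")
    return {
        "message": message,
        "pod_name": match["pod_name"],
        "namespace": match["namespace"],
        "container": match["container"],
        "cluster": cluster,
    }


path = ("/var/log/containers/map-service-12345abc-west_production_api-"
        + "a" * 64 + ".log")
print(enrich("GET /orders 500", path, cluster="eks-prod"))
```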

Step 5: Secure Upload to Datadog Cloud

After enrichment, the Agent securely uploads logs to Datadog’s backend using HTTPS.
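As a rough picture of what "upload over HTTPS" means, the sketch below builds (but does not send) a compressed JSON request. Treat the details as assumptions: the URL and the `DD-API-KEY` header follow Datadog's public HTTP log intake as I understand it, the intake host differs per Datadog site, and the Agent uses its own batching and retry logic rather than this code.

```python
import gzip
import json
import urllib.request

# Assumed intake endpoint for the default Datadog site; other sites use other hosts.
INTAKE_URL = "https://http-intake.logs.datadoghq.com/api/v2/logs"


def build_upload_request(records, api_key, url=INTAKE_URL):
    """Build a gzip-compressed HTTPS POST carrying a batch of log records."""
    body = gzip.compress(json.dumps(records).encode("utf-8"))
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Content-Encoding": "gzip",
            "DD-API-KEY": api_key,
        },
    )


batch = [{"message": "GET /orders 500", "service": "map-service"}]
request = build_upload_request(batch, api_key="<your-api-key>")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it, which this example deliberately avoids.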

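The flow now looks like this:

Application (stdout/stderr)
   ↓
Container runtime
   ↓
Log files on the node
   ↓
Datadog Agent (tails and enriches)
   ↓
HTTPS upload
   ↓
Datadog Cloud

That’s the full hidden journey.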

What About Metrics?

Datadog does more than logs.

The Agent also collects metrics from Kubernetes.

For example:

CPU usage
Memory usage
Pod restarts
RabbitMQ queue depth
Redis connections
Network traffic

It gathers metrics from:

kubelet
Kubernetes API
cAdvisor
integrations
DogStatsD

Example:

Kubernetes Node
   ↓
kubelet metrics
   ↓
Datadog Agent
   ↓
Datadog Cloud

This is why you can build dashboards showing:

pod CPU %
node memory
HPA scaling
container restarts

in real time.
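Custom application metrics usually reach the Agent through DogStatsD, which accepts small text datagrams over UDP (port 8125 by default). Here is a minimal sketch of that wire format using only the standard library. The helper names and the metric `orders.queue_depth` are invented for illustration, and in practice you would normally use Datadog's client library instead.

```python
import socket


def format_dogstatsd(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag>,<tag>"""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send to a local Agent (assumed address and port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


print(format_dogstatsd("orders.queue_depth", 42, "g",
                       ["env:production", "service:map-service"]))
# orders.queue_depth:42|g|#env:production,service:map-service
```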

What About Traces (APM)?

Datadog APM tracks requests flowing across services.

Example:

Frontend
   ↓
API Gateway
   ↓
ROS Service
   ↓
PostgreSQL

Datadog can measure:

request latency
database queries
failed API calls
slow endpoints

This works using tracing libraries like:

ddtrace (Python)
dd-trace (Node.js)
dd-java-agent (Java)

These libraries send trace data to the Datadog Agent.

Then the Agent forwards it to Datadog.

Why Does the Datadog Agent Need Permissions?

This part surprises many engineers.

The Datadog Agent often mounts host system paths like:

/var/log/containers
/proc
/sys/fs/cgroup

Why?

Because Kubernetes isolation prevents containers from normally seeing host resources.

The Agent needs visibility into:

container logs
processes
cgroups
node metrics

Some advanced integrations require even more access.

For example, the Gunicorn integration may require:

hostPID: true
privileged: true
SYS_PTRACE

because the Agent sometimes needs to inspect processes outside its own container namespace.

This is why observability agents can become security discussions between:

platform teams
DevOps engineers
security teams

Monitoring deeply often requires elevated visibility.

The Biggest Realization

Your application is usually just writing logs normally.

The Datadog Agent acts like a collector sitting beside your workloads.

A simple analogy:

Application = person speaking
Datadog Agent = microphone
Datadog Cloud = recording studio
Datadog UI = playback/search system

The application just talks.

The Agent listens.

Once you understand this pipeline, debugging Kubernetes becomes much easier.

When logs disappear, you know exactly where to investigate:

Is the app writing logs?
Is stdout working?
Did the container runtime capture them?
Is the Datadog Agent healthy?
Are hostPath mounts correct?
Is networking blocking uploads?
Is metadata enrichment failing?

Observability suddenly becomes understandable instead of magical.

And honestly, that’s one of the coolest parts of Kubernetes infrastructure.

Behind every “simple dashboard” is an entire hidden pipeline quietly moving data across your cluster 24/7.

👋

Diya — Mon, 14 Oct 2024 03:03:51 +0000

Hi 👋 Hello!

It’s more important than ever for us to keep ourselves busy, constantly learning, and building new skills. Age is just a number; what truly matters is our willingness to embrace new challenges, grow, and share our knowledge with others. Continuous learning not only helps us stay ahead in our careers, it also keeps negative thoughts away by focusing our energy on self-improvement and personal growth.

I’m excited to start my journey!