DEV Community: devtocash

SLI vs SLO vs SLA: Real SRE Guide with Examples

devtocash — Thu, 16 Jul 2026 13:07:24 +0000

Introduction

Every engineering team talks about uptime. "We need five nines." "Our SLA is 99.9%." "We hit our SLO this quarter."

But ask most engineers what an SLO actually means — mathematically, operationally, legally — and the confidence drops fast. Ask them what SLI their SLO is based on, and you will get a blank stare or a hand-wavy "uh, latency, I guess."

This confusion has real costs:

Teams define SLOs against meaningless SLIs — like "overall uptime" of a system that has 47 microservices, five of which are critical and the rest are decorative.
They set unrealistic targets because "five nines sounds good" without understanding the reliability budget.
They confuse SLOs (internal reliability targets) with SLAs (external, often legal commitments) and end up over-engineering for contracts that don't require it.

This guide fixes that. You will learn:

The precise definition of SLI, SLO, and SLA — and the differences that matter.
How to choose the right SLIs for real production services.
How to set achievable SLOs with math that works.
How SLAs relate to SLOs (spoiler: they are not the same).
Real Prometheus examples for measuring SLIs and burning down SLOs.
The common mistakes teams make — and how to avoid them.

Let's start with a single analogy that makes everything click.

The Analogy: Speedometer, Speed Limit, Traffic Ticket

Think of a car journey.

SLI is the speedometer. It measures something — how fast you are going right now. It is raw data. "The 95th percentile latency of the checkout endpoint over the last 5 minutes was 342 ms." That is an SLI.
SLO is the speed limit. It says: "95th percentile latency should be under 500 ms over a 30-day rolling window." That is your target. You can choose to drive faster (higher risk) or slower (more cautious).
SLA is the traffic ticket. If you violate the speed limit for too long, you pay a penalty. An SLA says: "If 95th percentile latency exceeds 500 ms for more than 0.1% of the month, we credit the customer 5% of their bill."

The speedometer tells you the current value. The speed limit tells you where you want to be. The ticket tells you what happens if you fail.

SLI: The Raw Measurement

Definition

A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the service you care about.

The key words are "carefully defined." A bad SLI definition leads to bad SLOs. A good SLI definition is specific, measurable, and meaningful to users.

The Four Golden Signals

Google's SRE literature defines four golden signals. Every service should have SLIs in at least these categories:

Signal	What it measures	Example SLI
Latency	How long it takes to respond	"95th percentile HTTP response time for GET /api/orders"
Traffic	How much demand is placed on the system	"Requests per second to the web frontend"
Errors	How many requests fail	"Ratio of HTTP 500 responses to total requests"
Saturation	How "full" the system is	"CPU utilization percentage across the cluster"

Most teams stop at latency and errors. That is a good start but incomplete. Saturation, in particular, is a leading indicator — if you only measure it when errors spike, you are always reacting.

Choosing Good SLIs

A good SLI has three properties:

User-visible. Measure what the user experiences, not what the infrastructure is doing. If the database is having replication lag but users are not affected, that is an ops concern, not an SLI. If users are affected (stale data, timeouts), then it becomes an SLI.
Measurable consistently. You need to collect the same measurement the same way every time. "Latency" is not an SLI. "p95 of the last 30 seconds of HTTP request duration measured server-side" is an SLI.
Actionable. If the SLI goes bad, someone should know what to do about it. "Number of times the database is restarted" is measurable but, alone, tells you nothing about what to fix.

Examples of Good vs Bad SLIs

❌ Bad SLI	Why it's bad	✅ Good SLI	Why it's better
"System uptime"	A monolith in a VM is up? That tells you nothing about responsiveness.	"Ratio of successful HTTP requests (2xx) to total requests over 1-minute windows"	Measures what users actually experience.
"Average latency"	Averages hide outliers. 99% of requests in 10ms, 1% in 30s — average is still ~300ms, looks fine.	"p99 HTTP latency over 5-minute rolling windows"	Captures the tail, which is what users feel.
"CPU usage"	CPU at 100% does not necessarily mean poor user experience.	"p99 latency when CPU > 80% vs p99 latency when CPU < 80%"	Ties infrastructure to user experience.

Defining SLIs in Prometheus

Assume you have a service exposing metrics via /metrics. To measure request duration:

# Request duration histogram — already exposed by your instrumentation
http_request_duration_seconds_bucket{job="checkout-service", le="0.1"}
http_request_duration_seconds_bucket{job="checkout-service", le="0.25"}
http_request_duration_seconds_bucket{job="checkout-service", le="0.5"}
http_request_duration_seconds_bucket{job="checkout-service", le="1.0"}
http_request_duration_seconds_bucket{job="checkout-service", le="+Inf"}
http_request_duration_seconds_count{job="checkout-service"}
http_request_duration_seconds_sum{job="checkout-service"}

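Your latency SLI at p99 over the last 5 minutes, aggregated across all instances of the service (summing the bucket rates by le before taking the quantile):

histogram_quantile(
  0.99,
  sum by (le) (
    rate(http_request_duration_seconds_bucket{job="checkout-service"}[5m])
  )
)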

Your error ratio SLI over the last 5 minutes — the fraction of requests that returned HTTP 5xx:

(
  sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[5m]))
  /
  sum(rate(http_requests_total{job="checkout-service"}[5m]))
)

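Your availability SLI: the fraction of 1-minute windows in the last 5 minutes where the error ratio was under a threshold (e.g., under 1%). The bool modifier makes the comparison return 1 (good) or 0 (bad) instead of dropping the non-matching samples, so avg_over_time yields the fraction of good windows rather than the average error ratio:

avg_over_time(
  (
    (
      sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[1m]))
      /
      sum(rate(http_requests_total{job="checkout-service"}[1m]))
    ) < bool 0.01
  )[5m:1m]
)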

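This last one is important. "Availability" measured as "number of good windows / total windows" is often called a windows-based SLI, and it is the form the SLO examples in this guide build on. It is not the same thing as burn rate, which measures how fast the error budget is being consumed.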
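The good-windows calculation can also be sketched outside Prometheus. This is a minimal illustration, not a monitoring implementation: `good_window_fraction` is a hypothetical helper, and the five error ratios are made-up sample values.

```python
def good_window_fraction(error_ratios, threshold=0.01):
    """Return the fraction of windows whose error ratio is under the threshold."""
    if not error_ratios:
        raise ValueError("need at least one window")
    # Each window counts as 1 (good) or 0 (bad); how bad a bad window was is ignored.
    good = sum(1 for ratio in error_ratios if ratio < threshold)
    return good / len(error_ratios)


# Five hypothetical 1-minute windows: four healthy, one with 5% errors.
windows = [0.000, 0.002, 0.050, 0.001, 0.000]
availability = good_window_fraction(windows)
print(f"{availability:.0%} of windows were good")
```

Note the design choice: a window with 5% errors and a window with 100% errors both count as one bad window, which is what makes this measure easy to reason about as a budget.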

SLO: The Target

Definition

A Service Level Objective (SLO) is a target value or range for an SLI over a specified measurement window.

Example: "p99 latency of the checkout service stays under 500 ms for 99.9% of 1-minute windows in any rolling 30-day period."

That sentence contains:

The SLI (p99 latency)
The threshold (under 500 ms)
The measurement window (1-minute windows)
The compliance period (30 rolling days)
The target (99.9% of windows good)

The Error Budget

The most important concept in SRE. If your SLO says "99.9% good," then 0.1% of measurement windows can be bad. That 0.1% is your error budget.

For a 30-day period:

Total 1-minute windows: 30 days × 1440 minutes/day = 43,200 windows
Allowed bad windows at 99.9%: 43,200 × 0.001 = 43 windows
Allowed bad windows at 99.99%: 43,200 × 0.0001 = 4 windows
Allowed bad windows at 99.999% (five nines): 43,200 × 0.00001 = 0.4 windows (meaning you can barely afford any outage)

SLO Target	Minutes you can be down per month	Realistic?
99% ("two nines")	432 min (7.2 hours)	Easy for most services
99.9% ("three nines")	43 min	Achievable
99.95%	22 min	Good for critical services
99.99% ("four nines")	4.3 min	Hard — requires automation and redundancy
99.999% ("five nines")	26 seconds	Almost impossible without multi-region active-active
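The downtime figures in the table follow from one multiplication. A small sketch, assuming a 30-day period as in the list above (`error_budget_minutes` is a hypothetical helper name):

```python
def error_budget_minutes(slo_target, period_days=30):
    """Minutes of allowed 'badness' in the period for a target such as 0.999."""
    if not 0 < slo_target <= 1:
        raise ValueError("slo_target must be in (0, 1]")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - slo_target)


for target in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
    minutes = error_budget_minutes(target)
    print(f"{target:.3%}: {minutes:7.2f} min ({minutes * 60:.0f} s)")
```

Changing `period_days` to 7 or 90 shows how the same percentage translates into very different absolute budgets.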

The error budget changes team behaviour. When the budget is healthy, teams deploy with confidence. When it is running low, teams become conservative — they throttle deployments, add testing, strengthen canary checks. This is the error budget policy.

Setting SLOs: A Practical Approach

Do not start with 99.9% because "it sounds right." Start with data.

Step 1: Collect your SLIs for at least 2–4 weeks.

Before you set a target, you need to know where you are. Run Prometheus, instrument your services, and let the data accumulate.

# Add OpenTelemetry instrumentation to your app
# Example: Python with OpenTelemetry
pip install opentelemetry-distro opentelemetry-exporter-prometheus

Step 2: Determine the worst acceptable performance.

Ask product owners: "What is the slowest response time that would make you consider the service broken?" Not the ideal speed — the worst acceptable.

For an API: "If p99 latency exceeds 1 second for more than 5 minutes, users complain."
For a payment service: "Any failed transaction is unacceptable — 100% of requests must succeed."
For a background job: "If it doesn't complete within 2 hours, the morning report is late."

Step 3: Add headroom.

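Your SLO should be stricter than the worst acceptable level, so you notice trouble before users do. If "p99 under 1 second" is the hard limit, set your SLO at p99 under 800 ms. The one exception is a requirement of 100%: no real system can promise zero failures, so if zero failed transactions is the ideal, set the SLO just below it, for example a 99.95% success rate (giving you a small error budget to handle bad deployments).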

Step 4: Run it for a month and adjust.

The first SLO you set will be wrong. That is normal. Track it for 30 days, see how often you burn through the budget, and adjust.

Monitoring SLOs in Prometheus

You need to track your burn rate — how quickly you are consuming the error budget.

# SLO compliance for p99 latency under 500ms
# Compliance period: 30 days, evaluated over 1-minute windows

# Step 1: Which 1-minute windows are "bad"?
(
  histogram_quantile(
    0.99,
    sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[1m]))
  )
  > 0.5  # 500ms
)

# Step 2: Fraction of bad windows over the last 30 days
# "bool" turns each window into 1 (good) or 0 (bad), so the average is the good fraction
1 - (
  avg_over_time(
    (
      histogram_quantile(
        0.99,
        sum by (le) (rate(http_request_duration_seconds_bucket{job="checkout-service"}[1m]))
      )
      <= bool 0.5
    )[30d:1m]
  )
)

A value of 0.002 means 0.2% of the 1-minute windows in the last 30 days were bad. If your SLO is 99.9% (0.1% budget), you have consumed 200% of your error budget: you are in trouble.
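The relationship between the fraction of bad windows and the error budget can be sketched as a single division. The function name `budget_consumed` is hypothetical; the numbers are the ones from the paragraph above.

```python
def budget_consumed(bad_fraction, slo_target):
    """Share of the error budget used: 1.0 means exactly exhausted."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("a 100% target leaves no error budget")
    return bad_fraction / budget


consumed = budget_consumed(bad_fraction=0.002, slo_target=0.999)
print(f"{consumed:.0%} of the error budget consumed")
```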

Alerting on Error Budget Burn Rate

Do not alert on raw latency spikes. Alert on the rate at which you are burning through your error budget.

Burn rate	What it means	Action
< 0.5x	Budget is being consumed slowly — normal operations.	No alert.
1x	Exactly on target.	Monitor but no action.
2x	Consuming budget twice as fast as planned.	Investigate within 24 hours.
5x	Serious degradation.	Page the on-call engineer within 30 minutes.
10x+	Critical incident.	War room. Immediate response.
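As a rough sketch of how the table could be applied, burn rate is the observed error ratio divided by the budgeted one. The tier boundaries below are an assumption (the table does not say what happens between, say, 2x and 5x), and all three function names are hypothetical.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo_target)


def hours_until_exhausted(rate, period_days=30):
    """Time for a full budget to run out if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return period_days * 24 / rate


def action_for(rate):
    """Map a burn rate to the response tiers from the table (boundaries assumed)."""
    if rate >= 10:
        return "war room"
    if rate >= 5:
        return "page on-call"
    if rate >= 2:
        return "investigate within 24 hours"
    return "no alert"


# 0.6% errors against a 99.9% SLO (0.1% budget).
rate = burn_rate(error_ratio=0.006, slo_target=0.999)
print(f"burn rate {rate:.1f}x -> {action_for(rate)}, "
      f"budget gone in {hours_until_exhausted(rate):.0f} h")
```

The useful intuition is the second function: at 1x the budget lasts exactly the compliance period, and every doubling of the burn rate halves the time you have left.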

Example Prometheus alert rule for a 5x burn rate sustained over 30 minutes. It assumes a 99.9% success SLO (0.1% error budget), so a 5x burn rate corresponds to an error ratio above 0.5%:

groups:
  - name: slo_alerts
    rules:
      - alert: HighErrorBudgetBurnRate
        expr: |
          (
            sum(rate(http_requests_total{job="checkout-service", status=~"5.."}[30m]))
            /
            sum(rate(http_requests_total{job="checkout-service"}[30m]))
          ) > (5 * 0.001)  # 5x the 0.1% budget of a 99.9% SLO
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Error budget burning at >5x rate"
          description: "Error rate {{ $value | humanizePercentage }} over last 30 minutes"

SLA: The Contract

Definition

A Service Level Agreement (SLA) is a formal, legally enforceable contract between a service provider and a customer. It specifies:

The SLIs and SLOs the provider commits to.
The measurement methodology.
The penalties or credits if the SLO is not met.

SLAs are external. SLOs are internal. That distinction is critical.

SLA vs SLO: The Key Differences

	SLO	SLA
Audience	Internal engineering team	External customers
Purpose	Guide operational decisions	Contractual commitment
Consequence	Process changes, deployment throttle	Financial penalties, legal liability
Strictness	You can miss an SLO temporarily	Missing an SLA costs real money
Flexibility	Can be adjusted weekly	Hard to change — written into contracts
Target	Usually tighter than SLA	Usually looser than SLO

The SLO Margin

Smart teams set their internal SLO stricter than their external SLA. The gap is your safety margin.

SLA to customer:  99.9% availability   (about 43 min of downtime per 30 days)
Internal SLO:     99.95% availability  (about 22 min of downtime per 30 days)
                  ^^^^^^^^
                  Buffer of 0.05 percentage points, roughly 21 min of headroom

This margin means:

You will miss the internal SLO long before you miss the external SLA.
You have time to react before customers experience a contract violation.
You can keep your infrastructure simpler (and cheaper) than if you had to guarantee the strictest target.

When SLAs Go Wrong

The most common SLA mistake: committing to something you cannot measure.

Example: "We guarantee p99 latency under 100ms." Sounds great. But if you measure latency from your load balancer (inside the data centre) and the customer measures it from their browser in rural Australia — those are different numbers. Your SLA needs to specify exactly where and how latency is measured.

Second mistake: committing to an SLO that is too tight for your architecture. A single-region deployment cannot realistically offer five nines. A database with a single primary cannot survive a failover without a brief blip. Your SLA must reflect your architecture's actual failure modes.

Putting It All Together: A Real Example

Let us walk through a real scenario. You run a payment service called payment-svc that processes credit card transactions.

Step 1: Define Your SLIs

# Payment service SLIs
sli_latency_p99:  "p99 latency of POST /api/charge over 5-minute windows"
sli_error_rate:   "Ratio of HTTP 5xx responses to total requests over 5-minute windows"
sli_throughput:   "Successful transactions per second"
sli_saturation:   "gRPC connection pool utilization percentage"

Step 2: Set Your Internal SLOs

Based on historical data and product requirements:

# Month-rolling SLOs for payment-svc
slo_latency_p99:
  target: 99.9%             # 0.1% bad windows allowed
  threshold: 300ms          # p99 under 300ms

slo_error_rate:
  target: 99.99%            # 0.01% bad windows allowed
  threshold: 0.001          # less than 0.1% error rate per window

slo_uptime:
  target: 99.95%            # based on simple request success count over 30d

Step 3: Define Your External SLA

Based on business requirements and what competitors offer:

# Customer SLA — intentionally looser than internal SLOs
sla_uptime:        99.9%    # 43 minutes downtime per month
sla_error_rate:    99.9%    # 0.1% error rate
penalty:           "5% monthly credit per 0.1% below SLA, max 50% credit"

Notice the gap: internal SLO for error rate is 99.99%, external SLA is 99.9%. That gives the team a 10× margin to absorb incidents before customers are impacted.
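The penalty clause in the SLA above can be read in more than one way. The sketch below assumes one interpretation: every started 0.1 percentage point below the SLA costs a 5% credit, capped at 50%. `sla_credit_percent` is a hypothetical helper, and a real contract would need to spell out the rounding rule. Decimal is used so that values like 99.85 are not distorted by binary floating point.

```python
from decimal import Decimal, ROUND_CEILING


def sla_credit_percent(measured_pct, sla_pct="99.9"):
    """Credit (in percent of the monthly bill) for a measured availability.

    Assumed rule: 5% per started 0.1 point below the SLA, capped at 50%.
    Both arguments are percentage strings such as "99.85".
    """
    shortfall = Decimal(sla_pct) - Decimal(measured_pct)
    if shortfall <= 0:
        return Decimal(0)
    steps = (shortfall / Decimal("0.1")).to_integral_value(rounding=ROUND_CEILING)
    return min(steps * Decimal(5), Decimal(50))


for measured in ("99.95", "99.85", "99.7", "98.0"):
    print(f"measured {measured}% -> {sla_credit_percent(measured)}% credit")
```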

Step 4: Measure and Alert

Your monitoring dashboard shows:

Time period	Good windows	Total windows	Compliance
Last 24h	1,439	1,440	99.93%
Last 7d	10,067	10,080	99.87%
Last 30d	43,157	43,200	99.90%

The 7-day window is 99.87% — below the 99.9% SLO. The team knows they need to investigate. But the 30-day SLA target (99.9%) is barely being met. No customer penalty yet, but one more incident will push it under.
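The same check can be done in absolute terms: how many bad windows were allowed, and how many occurred. A minimal sketch using the 24-hour and 7-day rows from the dashboard (`window_budget_report` is a hypothetical helper):

```python
def window_budget_report(good, total, target=0.999):
    """Compare bad windows against the number the target allows."""
    bad = total - good
    allowed = total * (1 - target)
    return {
        "compliance": good / total,
        "bad": bad,
        "allowed_bad": allowed,
        "within_budget": bad <= allowed,
    }


for label, good, total in [("Last 24h", 1439, 1440), ("Last 7d", 10067, 10080)]:
    report = window_budget_report(good, total)
    status = "OK" if report["within_budget"] else "OVER BUDGET"
    print(f"{label}: {report['bad']} bad of {report['allowed_bad']:.1f} allowed -> {status}")
```

Counting windows rather than reading rounded percentages avoids the trap where a value that displays as 99.90% is actually a fraction under the target.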

The alert rule catches this:

# Alert if more than 0.1% of 1-minute windows in the last 7 days were bad
# (a window is bad when its error ratio exceeds 0.1%)
avg_over_time(
  (
    (
      sum(rate(http_requests_total{job="payment-svc", status=~"5.."}[1m]))
      /
      sum(rate(http_requests_total{job="payment-svc"}[1m]))
    ) > bool 0.001
  )[7d:1m]
)
> 0.001  # More than 0.1% bad windows

The alert pages the on-call engineer, who investigates, finds a newly deployed service that is not properly handling database connection timeouts, and rolls back the deployment. New errors stop, and the error budget recovers as the bad windows age out of the rolling period.

Common Mistakes and How to Avoid Them

Mistake 1: Too Many SLIs

Teams measure everything — p50, p90, p95, p99, p99.9 of every endpoint, error rates by status code, by region, by instance. The result: alert fatigue and no clear picture.

Fix: Pick 3–5 SLIs per critical service. The golden signals are a good starting point. Add more only when you find a specific gap.

Mistake 2: SLOs Based on Averages

"Average latency" and "average availability" hide the real story. A service can have 99.9% "average uptime" over a month while being completely down for individual users.

Fix: Use percentiles (p95, p99) for latency. Use the "good windows" approach for availability — count windows where the service was good, not the average value.

Mistake 3: Identical SLOs for Every Service

A critical payment service and a background report generator should not share the same target. If you set all services at 99.99%, you are either over-engineering the background job or under-engineering the payment service.

Fix: Classify services by criticality — Tier 1 (customer-facing, revenue-critical), Tier 2 (important but not urgent), Tier 3 (internal tools, batch jobs). Set different SLOs per tier.

Tier	Example	SLO target	On-call response
1	Payment service, API gateway	99.95%	15-minute page
2	Reporting service, admin dashboard	99.9%	1-hour page
3	Internal data sync, ETL	99%	Next business day

Mistake 4: Setting SLOs Without Error Budget Policy

An SLO without an error budget policy is just a dashboard number. If you miss it, nothing happens differently. That defeats the entire purpose.

Fix: Write a one-page error budget policy:

Who decides when to stop deployments (typically the SRE lead or on-call).
At what budget level deployments stop (e.g., "deployments frozen when budget < 10% remaining").
How the budget resets (e.g., at the start of each calendar month).

Mistake 5: Confusing SLA with SLO

Committing the same target to customers that you use internally means zero margin for error. One incident = one missed SLA = financial penalties.

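Fix: Always set your internal SLO tighter than your external SLA, so the internal error budget is at least 2× smaller (as in the 99.95% vs 99.9% example above) and up to 10× smaller for revenue-critical paths. The cost of running slightly better infrastructure is almost always less than the cost of paying SLA penalties.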

Actionable Takeaways

Start with 3–5 SLIs per critical service. Use the golden signals (latency, traffic, errors, saturation). Add more only when you find gaps.
Use percentiles, not averages. p99 latency and the "good windows" approach for availability. Averages will lie to you.
Set SLOs based on data, not intuition. Collect SLIs for at least 2 weeks before defining targets. The first SLO you set will be wrong — adjust it.
Always set internal SLOs tighter than external SLAs. The gap is your safety margin. At least 2× on error budgets.
Alert on error budget burn rate, not raw metrics. A latency spike that lasts 30 seconds is noise. A 10× burn rate sustained for 30 minutes is an incident.
Write an error budget policy. Define explicitly: when deployments stop, who decides, and how the budget resets.
Classify services by criticality. Tier 1 (99.95%), Tier 2 (99.9%), Tier 3 (99%). Do not apply one SLO to everything.

Need to implement SLIs in your infrastructure? Check out our Prometheus Monitoring Setup Guide and OpenTelemetry Tutorial for production-ready instrumentation.


Kubernetes Pod Stuck in Pending (FailedScheduling): How to Fix It

devtocash — Thu, 16 Jul 2026 01:10:11 +0000

What Pending actually means

A pod stuck in Pending has been accepted by the API server but has no node to run on. The kube-scheduler looked at every node in the cluster, none satisfied the pod's constraints, and it gave up for now — recording exactly why in a FailedScheduling event. Unlike a crash, nothing is wrong with your container yet. The image is never pulled, the process never starts. The problem is entirely about placement.

This is a different failure from the "my container won't stay up" family — ImagePullBackOff, CrashLoopBackOff, and OOMKilled all mean a node accepted the pod and then something broke. Pending means no node would take it in the first place. The good news: like the others, the scheduler tells you the exact reason. You never have to guess. This is the sequence I run to place a stuck pod, usually in a couple of minutes.

Step 1: Read the scheduler's verdict

Don't theorize from a Pending status. Read the event the scheduler already wrote:

kubectl describe pod payments-api-7d9f4c8b6-xk2mn

Scroll to Events at the bottom:

  Warning  FailedScheduling  default-scheduler
    0/5 nodes are available: 2 Insufficient cpu,
    2 node(s) had untolerated taint {dedicated: gpu},
    1 node(s) didn't match Pod's node affinity/selector.

That single line accounts for every node in the cluster and why each one rejected the pod. The scheduler evaluates nodes through filter predicates, and the message is the tally of which predicate failed where. Map the phrase to a root cause:

Message fragment	Root cause	Go to
`Insufficient cpu` / `Insufficient memory`	Requests don't fit any node's free capacity	Step 2
`node(s) had untolerated taint`	Node is cordoned or reserved; pod lacks a toleration	Step 3
`didn't match Pod's node affinity/selector`	`nodeSelector`/affinity points at labels no node has	Step 4
`had volume node affinity conflict` / `unbound ... PersistentVolumeClaims`	Storage can't bind, or PV is zone-locked away from capacity	Step 5
`too many pods`	Node hit its max-pods cap (often ENI/IP limits)	Step 6
`didn't match pod topology spread constraints` / anti-affinity	Spread/anti-affinity rules can't be satisfied	Step 7

Read this first and everything below collapses to one path. Often you'll see several fragments at once — fix them in order of how many nodes each blocks.
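When several fragments appear at once, a few lines of scripting can do the ordering for you. This is a sketch that assumes the simple message layout shown above (a header, a colon, then comma-separated "N reason" fragments); real messages can carry extra suffixes or commas inside a taint, which this does not handle. `parse_failed_scheduling` is a hypothetical helper.

```python
import re


def parse_failed_scheduling(message):
    """Split a FailedScheduling message into (node_count, reason) pairs,
    sorted so the reason blocking the most nodes comes first."""
    _, _, tally = message.partition(": ")
    reasons = []
    for part in tally.rstrip(".").split(", "):
        match = re.match(r"(\d+) (.+)", part.strip())
        if match:
            reasons.append((int(match.group(1)), match.group(2)))
    # sorted() is stable, so ties keep the scheduler's original order.
    return sorted(reasons, key=lambda item: item[0], reverse=True)


event = (
    "0/5 nodes are available: 2 Insufficient cpu, "
    "2 node(s) had untolerated taint {dedicated: gpu}, "
    "1 node(s) didn't match Pod's node affinity/selector."
)
for count, reason in parse_failed_scheduling(event):
    print(f"{count} node(s): {reason}")
```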

Step 2: Insufficient CPU or memory

The most common cause. Insufficient cpu doesn't mean the node is using all its CPU — it means the sum of pod requests already scheduled there leaves less than your pod asks for. Scheduling is based on requests, not live usage. A node at 10% actual CPU can still reject a pod if its requests are already reserved.

First, see what the pod is asking for:

kubectl get pod payments-api-7d9f4c8b6-xk2mn \
  -o jsonpath='{.spec.containers[0].resources.requests}'

{"cpu":"2","memory":"4Gi"}

Then see what's actually free on the nodes:

kubectl describe node ip-10-0-1-42 | grep -A6 "Allocated resources"

  Resource   Requests      Limits
  cpu        3500m (87%)   6 (150%)
  memory     6Gi (78%)     10Gi
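The fit check the scheduler performs for CPU boils down to comparing the pod's request with what is left after existing requests. The sketch below is a simplification of that idea, not the scheduler's code. It assumes the node has about 4 allocatable cores (inferred from 3500m being shown as 87%), and both helper names are hypothetical.

```python
def cpu_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '2', '0.5' or '3500m' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def fits(pod_request, node_allocatable, node_requested):
    """True if the pod's CPU request fits into the node's unreserved CPU.

    Live CPU usage plays no role here, only the sum of requests.
    """
    free = cpu_millicores(node_allocatable) - cpu_millicores(node_requested)
    return cpu_millicores(pod_request) <= free


# Pod asks for 2 cores; node has an assumed 4 allocatable, 3500m already requested.
print(fits("2", "4", "3500m"))      # only 500m is unreserved
print(fits("500m", "4", "3500m"))   # a right-sized request would fit
```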

If free CPU is below your request on every node, you have two fixes:

The request is oversized. A pod asking for 2 full cores that actually uses 150m is starving the scheduler of placements it could otherwise make. Right-size the request to real usage — the same discipline that keeps the Kubernetes bill down. The VPA in recommendation mode will suggest values from observed usage; see how it fits alongside HPA and KEDA in the pod autoscaling guide.
The cluster is genuinely full. Every node is legitimately packed. You need more capacity — enable the cluster autoscaler (or Karpenter) so a Pending pod that can't fit triggers a new node instead of waiting forever. Confirm it's reacting:

kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50 | grep -i scale

If the autoscaler logs pod didn't trigger scale-up: max node group size reached, you've hit your node-group ceiling — raise the max, or the pod stays Pending indefinitely.

Step 3: Untolerated taints

A message like node(s) had untolerated taint {node.kubernetes.io/unschedulable} means the target nodes are tainted and your pod carries no matching toleration. Taints repel pods unless the pod explicitly tolerates them. See the taint:

kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints

Two very different situations:

The node is cordoned. node.kubernetes.io/unschedulable appears after kubectl cordon (or a drain during maintenance). If the cordon was intentional, that's expected — the pod schedules once you kubectl uncordon the node or new capacity arrives. A pod stuck Pending during a node upgrade is usually this.
The node is deliberately reserved — GPU nodes, a dedicated=gpu:NoSchedule taint, spot-instance pools. If your pod should run there, add a toleration:

spec:
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"

A toleration lets a pod land on a tainted node; it does not force it there. Pair it with a nodeSelector or affinity (Step 4) when you want the pod to actually target that pool.
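The matching rule between one toleration and one taint is compact enough to sketch. This follows the rules as the Kubernetes documentation describes them (same key and effect, then either Exists, or Equal with the same value); it is an illustration, not the scheduler's implementation, and `tolerates` is a hypothetical name.

```python
def tolerates(toleration, taint):
    """Return True if a single toleration matches a single taint."""
    operator = toleration.get("operator", "Equal")
    # An empty effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An empty key is only valid with Exists and matches every taint key.
    if toleration.get("key"):
        if toleration["key"] != taint["key"]:
            return False
    elif operator != "Exists":
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "dedicated", "value": "gpu", "effect": "NoSchedule"}
toleration = {"key": "dedicated", "operator": "Equal",
              "value": "gpu", "effect": "NoSchedule"}
print(tolerates(toleration, gpu_taint))
```

A pod is only schedulable on a node if every NoSchedule taint on that node is matched by at least one of its tolerations.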

Step 4: Node affinity / selector mismatch

didn't match Pod's node affinity/selector means your nodeSelector or nodeAffinity names a label that no available node carries. Usually a typo, a decommissioned node pool, or a label that was never applied. Check what the pod demands:

kubectl get pod payments-api-7d9f4c8b6-xk2mn -o jsonpath='{.spec.nodeSelector}'

{"disktype":"ssd","topology.kubernetes.io/zone":"us-east-1a"}

Then check which nodes actually have those labels:

kubectl get nodes -l disktype=ssd

If that returns nothing, no node matches. Either the label is wrong on the pod, or it was never set on the nodes:

kubectl label node ip-10-0-1-42 disktype=ssd

A subtle trap: pinning a pod to a single zone with topology.kubernetes.io/zone means it can only schedule in that zone. If that zone is full and your autoscaler scales a different zone, the pod stays Pending forever. A nodeSelector is always a hard requirement, so express soft placement with nodeAffinity and prefer preferredDuringSchedulingIgnoredDuringExecution over requiredDuringSchedulingIgnoredDuringExecution unless the constraint is truly hard. A preference degrades gracefully instead of deadlocking.

Step 5: Unbound volumes and zone conflicts

pod has unbound immediate PersistentVolumeClaims or volume node affinity conflict means storage is blocking placement. Storage and scheduling are coupled: a pod can only run where its volume can attach. Check the PVC:

kubectl get pvc

NAME            STATUS    VOLUME   CAPACITY   STORAGECLASS
data-payments   Pending                       gp3

A Pending PVC means no PersistentVolume satisfied the claim. Common causes:

No matching StorageClass or no dynamic provisioner. kubectl get storageclass — if there's no default class and the PVC names none, nothing provisions the volume. Set a default or name a valid class.
volume node affinity conflict: the PV already exists in us-east-1a, but the only node with free capacity is in us-east-1b, and an EBS volume can't cross zones. For the existing volume, the pod has to run in us-east-1a, so free up or add capacity in that zone. To prevent the problem for newly provisioned volumes, set volumeBindingMode: WaitForFirstConsumer on the StorageClass so the volume is created after the scheduler picks the node, in the same zone, instead of binding first and pinning the pod to the wrong zone.

Step 6: Node is out of pod slots (`too many pods`)

too many pods means the node hit its max-pods limit even though it has spare CPU and memory. On AWS EKS with the VPC CNI, the ceiling is often driven by how many IP addresses the instance's ENIs can hold — a small instance type may cap at 8–17 pods regardless of how much RAM it has. Check the node's capacity and current count:

kubectl get node ip-10-0-1-42 -o jsonpath='{.status.capacity.pods}'
kubectl get pods --all-namespaces --field-selector spec.nodeName=ip-10-0-1-42 \
  --no-headers | wc -l

If the count equals capacity, the node is full of pods, not resources. Fixes: use larger instance types (more ENIs = more IPs = more pods), enable prefix delegation on the VPC CNI to pack far more IPs per node, or add nodes. This is an easy one to miss because kubectl top node shows the node as nearly idle.
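For the VPC CNI without prefix delegation, the pod ceiling is commonly computed as ENIs × (IPv4 addresses per ENI − 1) + 2. A small sketch of that arithmetic follows; the ENI and address counts are inputs you should look up in the AWS ENI limits table for your instance type, and the two examples use figures I believe correspond to a t3.medium and an m5.large.

```python
def max_pods_vpc_cni(enis, ipv4_per_eni):
    """Pod ceiling for the AWS VPC CNI without prefix delegation.

    One address per ENI is the ENI's own primary IP; the +2 accounts for
    host-network pods that do not consume a VPC address.
    """
    return enis * (ipv4_per_eni - 1) + 2


print(max_pods_vpc_cni(enis=3, ipv4_per_eni=6))    # assumed t3.medium limits
print(max_pods_vpc_cni(enis=3, ipv4_per_eni=10))   # assumed m5.large limits
```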

Step 7: Topology spread and anti-affinity deadlocks

didn't match pod topology spread constraints or an anti-affinity rejection means your own placement rules can't be satisfied. A classic self-inflicted deadlock: a podAntiAffinity with requiredDuringScheduling that says "no two replicas on the same node" while you have 5 replicas and only 3 nodes. The 4th and 5th can never place.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:   # hard rule: can deadlock
      - labelSelector:
          matchLabels:
            app: payments-api                         # assumed label on the replicas
        topologyKey: kubernetes.io/hostname

Switch required to preferred so replicas spread when possible but still schedule when they can't, or add nodes so the hard rule is satisfiable. The same reasoning applies to topologySpreadConstraints with whenUnsatisfiable: DoNotSchedule — that's a hard gate; ScheduleAnyway degrades gracefully.

A repeatable checklist

When a pod is stuck Pending, run this in order:

kubectl describe pod <pod> → read the FailedScheduling event; note every N node(s) ... fragment.
Insufficient cpu/memory → right-size requests or enable the cluster autoscaler; check it isn't at max node-group size (Step 2).
untolerated taint → uncordon the node, or add the matching toleration (Step 3).
didn't match node affinity/selector → fix the label on the pod or apply it to nodes; avoid hard zone pins (Step 4).
unbound PVC / volume node affinity conflict → fix the StorageClass; set WaitForFirstConsumer (Step 5).
too many pods → node hit its IP/pod cap; bigger instances, prefix delegation, or more nodes (Step 6).
Topology/anti-affinity rejection → relax required/DoNotSchedule to preferred/ScheduleAnyway (Step 7).

Don't let Pending pods hide

A pod that can't schedule is invisible to a health check that only watches running pods — it never crashes, it just quietly never runs. That's how a scaled-up deployment silently serves at half capacity. Alert on pods stuck Pending past a threshold:

kube_pod_status_phase{phase="Pending"} == 1
  and on(pod, namespace) (time() - kube_pod_created) > 300

Wire that into the Prometheus + Grafana monitoring setup so a five-minute-stuck pod pages before a rollout stalls, and bake the seven-step triage into your incident runbook. A cluster that's chronically full and can't place pods is one of the quiet Kubernetes mistakes that cost companies millions — cheap to catch, expensive to discover during a traffic spike.

Incident Management Runbook: The Complete SRE Template for 2026

devtocash — Wed, 15 Jul 2026 13:07:24 +0000

Introduction

Every minute of downtime costs your company money. For an e-commerce platform, that is thousands of dollars per minute. For a fintech startup, it is lost trust that takes months to rebuild. Yet when an incident strikes, most teams still scramble through Slack, DMs, and scattered Google Docs — wasting the first 15 minutes just figuring out who owns what.

An Incident Management Runbook fixes this. It is a single source of truth that tells everyone — engineer, manager, or new hire awakened at 3 AM — exactly what to do, who to call, and how to escalate. It eliminates guesswork. It compresses time-to-resolution. It saves your team from burnout.

This guide provides a complete, copy-paste-ready incident management runbook built from real-world SRE practices at companies like Google, Netflix, and PagerDuty. By the end, you will have:

A severity-level framework your entire org can agree on
A role assignment system (Incident Commander, Comms Lead, etc.)
A step-by-step response lifecycle from detection to postmortem
A runbook template you can deploy to your wiki or runbook tool today
Automation patterns using PagerDuty, Opsgenie, and Slack integrations

Whether you are a two-person startup or a 200-engineer platform team, this runbook scales with you. Let's build it.

1. What Is an Incident Management Runbook?

A runbook is a documented, step-by-step procedure for responding to specific types of incidents. Unlike a general "incident response policy" (which says what to do at a high level), a runbook specifies exactly how to do it — which commands to run, which dashboards to check, which people to page.

A good runbook answers these questions before the incident starts:

Who is on-call right now, and who is their backup?
What constitutes a SEV1 vs SEV2 vs SEV3?
How do we declare an incident and notify stakeholders?
Where are the dashboards, logs, and runbooks for each service?
When do we escalate to the next tier or wake up the CTO?

The runbook should be stored in a place accessible during an outage — which means not on the company VPN if the VPN itself is down. Git repositories mirrored to multiple locations, printed copies in the NOC, or tools like PagerDuty Runbook Automation and Rundeck are common solutions.

2. Incident Severity Levels

Before anyone can respond, everyone must agree on what "SEV1" means. Without a shared severity framework, you get arguments during incidents about whether something is "really that bad" — wasting time when seconds matter.

Here is a common four-level framework, loosely adapted from the incident-management practices in Google's SRE book. Treat the times and examples as starting points:

Severity	Definition	Example	Response SLA	Escalation
SEV0	Complete service outage. All users affected. Revenue impact.	Website down, payment gateway offline, 100% 5xx errors	5 min acknowledge, 15 min resolve or escalate	Page CTO + VP Eng immediately
SEV1	Major feature broken. Majority of users affected. No workaround.	Login broken, checkout fails, API returning errors for 50%+ users	10 min acknowledge, 30 min resolve or escalate	Page Engineering Director + on-call manager
SEV2	Partial degradation. Subset of users affected. Workaround exists.	Slow page loads, search results stale, one region degraded	30 min acknowledge, 2 hours resolve or escalate	Notify team lead, on-call engineer handles
SEV3	Minor issue. Cosmetic or non-critical.	Typo on landing page, broken image in blog, non-critical cron job failure	Next business day	Create ticket, handle during business hours

Customize the thresholds for your business. An e-commerce site during Black Friday treats a 2% error rate as a SEV0. A SaaS tool on a Sunday afternoon treats it as a SEV2. Define what "revenue impact" means for your specific context.
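
One way to keep those thresholds explicit is to store them as data next to the classification logic. The sketch below is a minimal illustration, not part of any incident tool: `SeverityPolicy`, `classify_severity`, and the numeric cut-offs are all hypothetical names and values chosen to mirror the table's definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeverityPolicy:
    """Hypothetical thresholds; tune these per business context."""
    sev0_user_fraction: float = 0.95   # "all users affected"
    sev1_user_fraction: float = 0.50   # "majority of users affected"


def classify_severity(user_fraction: float, has_workaround: bool,
                      customer_facing: bool = True,
                      policy: SeverityPolicy = SeverityPolicy()) -> str:
    """Map impact to a SEV level following the table's definitions."""
    if not customer_facing or user_fraction == 0:
        return "SEV3"                      # cosmetic or non-critical
    if user_fraction >= policy.sev0_user_fraction:
        return "SEV0"                      # complete outage
    if user_fraction >= policy.sev1_user_fraction and not has_workaround:
        return "SEV1"                      # major feature broken, no workaround
    return "SEV2"                          # partial degradation


# A stricter policy for a peak sales day: a 2% failure rate is already a SEV0.
peak_day = SeverityPolicy(sev0_user_fraction=0.02, sev1_user_fraction=0.01)
print(classify_severity(0.02, has_workaround=False))                   # normal day
print(classify_severity(0.02, has_workaround=False, policy=peak_day))  # peak day
```

Keeping the policy separate from the function means the same 2% error rate can map to different severities on different days without touching the logic.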

Every incident should reference the affected error budget — if it's not burning your SLO budget, it may not warrant waking up the team.

Key principle: Err on the side of over-declaring. It is always better to downgrade a SEV2 → SEV3 after investigation than to discover a SEV0 was misclassified as a SEV2 for 45 minutes.

Incidents are triggered when SLIs breach SLO thresholds — define these boundaries first so your severity levels map directly to your reliability targets.

3. Incident Response Roles

Having clearly defined roles prevents the most common incident pitfall: everyone trying to do everything at once, drowning in Slack noise, and nobody communicating with stakeholders.

Role	Responsibility	Who Usually Fills It
Incident Commander (IC)	Runs the incident. Makes all decisions. Keeps the response moving. Only person who can declare "resolved."	Most senior on-call engineer, or designated IC from rotation
Operations Lead (OL)	Investigates and mitigates. Runs commands, checks dashboards, implements fixes.	On-call engineer for the affected service
Communications Lead (CL)	Manages all external communication — status page updates, Slack announcements, customer-facing messages. Shields IC from interruptions.	Engineering manager, TPM, or designated comms person
Scribe	Documents everything in real time — timeline, actions taken, hypotheses tested. Critical for postmortem.	Junior engineer, intern, or automated via incident tooling

In a small team, one person may wear multiple hats — but never combine IC and CL. The IC needs uninterrupted focus on resolution; the CL absorbs all external noise. If your team is 4+ engineers on-call, rotate the IC role so nobody burns out.

When an incident is declared, the first person on the scene automatically becomes Interim IC until a designated IC joins. They announce in the incident channel:

/incident declare
SEV: [0/1/2/3]
Title: [Brief description]
IC: @username (interim)
Channel: #incident-2026-0625-001

4. The Incident Response Lifecycle

Every incident follows five phases. Your runbook should have a clear procedure for each.

Phase 1: Detection

Incidents are detected through three channels:

Automated monitoring — Alerts from Prometheus Alertmanager, Datadog, Grafana, or New Relic that fire when SLO burn rates exceed thresholds. If you do not have SLO-based alerting yet, read our Error Budgets Guide first.

User reports — Customer support tickets, social media complaints, or internal bug reports. Route these to the on-call channel automatically via webhook.

Engineer observation — A team member notices something wrong during a deployment or code review.

Regardless of how it is detected, the first step is always the same: verify the signal is real. Check the affected dashboard, run a quick smoke test, and confirm you are not chasing a monitoring false positive.

Phase 2: Declaration

Once verified, the on-call engineer declares the incident. Declaring it means doing four things:

Create an incident channel — #incident-YYYY-MMDD-NNN in Slack or Teams
Page the on-call rotation via PagerDuty/Opsgenie
Post initial status to the status page (or internal status channel if no public page)
Assign roles — IC, OL, CL, Scribe

A declaration message looks like:

🚨 INCIDENT DECLARED — SEV2
Title: Checkout API returning 503 errors in us-east-1
IC: @alice (interim, @bob is primary IC joining in 2 min)
OL: @charlie
CL: @diana
Channel: #incident-2026-0625-003
Dashboard: https://grafana.example.com/d/checkout
Runbook: https://wiki.example.com/runbooks/checkout-api
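
If the declaration is scripted, the channel name and message can be generated rather than typed under pressure. This is a rough sketch assuming the #incident-YYYY-MMDD-NNN naming scheme used above; `incident_channel` and `declaration_message` are hypothetical helpers, and posting to Slack or Teams is left out.

```python
from datetime import date


def incident_channel(day: date, sequence: int) -> str:
    """Build a channel name in the #incident-YYYY-MMDD-NNN scheme."""
    return f"#incident-{day:%Y}-{day:%m%d}-{sequence:03d}"


def declaration_message(sev: int, title: str, roles: dict[str, str],
                        channel: str) -> str:
    """Render the declaration text; role keys are free-form labels."""
    lines = [f"INCIDENT DECLARED, SEV{sev}", f"Title: {title}"]
    lines += [f"{role}: {person}" for role, person in roles.items()]
    lines.append(f"Channel: {channel}")
    return "\n".join(lines)


channel = incident_channel(date(2026, 6, 25), 3)
print(declaration_message(
    2, "Checkout API returning 503 errors in us-east-1",
    {"IC": "@alice (interim)", "OL": "@charlie", "CL": "@diana"}, channel))
```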

Phase 3: Diagnosis & Mitigation

This is the core of incident response. The OL investigates while the IC coordinates. The process follows a structured loop:

Triage — Isolate the blast radius. Which users? Which region? Which component?
Hypothesize — Propose a likely cause. "Recent deploy changed the DB connection pool size."
Test — Validate the hypothesis. Check deploy logs. Roll back if the hypothesis is strong enough.
Mitigate — Stop the bleeding. Rollback, scale up, failover, feature flag off. Mitigation comes before root cause. A customer does not care why the site is down — they want it back up.
Verify — Confirm the fix worked. Watch dashboards for 5-10 minutes.

Golden rule: If you have not found the cause in 15 minutes, escalate. Call in more engineers. Wake up the service owner. Do not hero-solo an incident.

Phase 4: Resolution

The IC declares the incident resolved only when:

Service is restored and verified for at least 10 minutes
All alerts have returned to normal
Customers are no longer impacted
A rollback or permanent fix is in place (not a fragile workaround)

The resolution message:

✅ INCIDENT RESOLVED — SEV2
Duration: 47 minutes (14:03 – 14:50 UTC)
Root cause: Upstream payment gateway outage (rollback to v2.4.0 did not help)
Mitigation: Retry circuit breaker activated for payment gateway calls
Impact: ~12% of checkout requests failed (est. 1,200 affected users)
Postmortem: Scheduled for 2026-06-26 10:00 UTC
Action items: #INC-42, #INC-43

Phase 5: Postmortem (Blameless)

Within 24-48 hours of resolution, hold a blameless postmortem. The goal is not to assign blame — it is to prevent recurrence. We cover the full postmortem template in Section 8.

5. The Runbook Template

Here is the actual runbook template. Copy this into your wiki, Notion, Confluence, or runbook automation tool. Fill in the [PLACEHOLDERS] for each service you own.

# Runbook: [SERVICE NAME]

## Service Overview
- **Owner Team:** [Team Name]
- **On-Call Rotation:** [PagerDuty/Opsgenie escalation policy link]
- **Primary Dashboard:** [Grafana/Datadog link]
- **Logs:** [Kibana/Loki/Splunk link]
- **Source Code:** [GitHub/GitLab link]
- **CI/CD Pipeline:** [GitHub Actions/GitLab CI link]
- **Runbook Last Updated:** [YYYY-MM-DD]

## Dependencies
- **Upstream:** [List services this depends on]
- **Downstream:** [List services that depend on this]
- **External:** [Third-party APIs, databases, CDNs]

## Alert Triggers
| Alert Name | Severity | Threshold | Dashboard Link |
|-----------|----------|-----------|---------------|
| High Error Rate | SEV1 | >5% 5xx for 5 min | [link] |
| High Latency p99 | SEV2 | >2s for 10 min | [link] |
| Pod CrashLoopBackOff | SEV1 | Any pod restarting >3 times | [link] |
| Certificate Expiry | SEV3 | <30 days until expiry | [link] |

## Common Incidents

### 1. High 5xx Error Rate
**Symptoms:** Dashboard shows error rate spike, users report failures
**Likely Causes:**
- Recent deployment introduced bug → Check deploy history
- Upstream dependency failure → Check dependency dashboards
- Database connection pool exhausted → Check DB metrics
- Rate limiting triggered → Check API gateway metrics

**Immediate Actions:**
1. Check recent deployments:

kubectl rollout history deployment/[service-name] -n production

2. Roll back if the deployment happened within the last 30 minutes:

kubectl rollout undo deployment/[service-name] -n production

3. Check upstream dependencies:

curl -s https://[dependency]/health

4. Scale up replicas if traffic spike:

kubectl scale deployment/[service-name] --replicas=10 -n production

5. If none of the above help → escalate to [TEAM NAME], page [PERSON]

### 2. Pods in CrashLoopBackOff
**Symptoms:** `kubectl get pods` shows restarts, deployment not progressing
**Likely Causes:**
- Misconfigured environment variables or secrets
- Missing PersistentVolume or storage issue
- OOMKilled (memory limit too low)
- Readiness/Liveness probe misconfigured

**Immediate Actions:**
1. Check pod logs:

kubectl logs [pod-name] -n production --tail=100

2. Check previous container logs (if crash + restart):

kubectl logs [pod-name] -n production --previous

3. Describe the pod for events:

kubectl describe pod [pod-name] -n production

4. Check resource usage:

kubectl top pod [pod-name] -n production

5. If OOMKilled → increase memory limits and restart

### 3. Certificate Expiry
**Preventative:** Run this check weekly via cron:

echo | openssl s_client -servername [domain] -connect [domain]:443 2>/dev/null | \
openssl x509 -noout -dates


## Escalation Path
| Level | Who | When | Contact |
|-------|-----|------|---------|
| L1 | On-call engineer | Immediate | PagerDuty rotation |
| L2 | Service owner / Tech Lead | If unresolved after 15 min | Slack @team-leads |
| L3 | Engineering Manager | If unresolved after 30 min | Phone call |
| L4 | Director / VP Engineering | If SEV0 after 45 min | Phone call |
| L5 | CTO | SEV0 lasting >1 hour | Phone call |

## Post-Incident
- [ ] Create postmortem doc within 24 hours
- [ ] Create action items in issue tracker
- [ ] Update this runbook if new causes or fixes were discovered
- [ ] Verify monitoring covers the failure mode detected

This template is your starting point. Every service in your organization should have one. Keep it updated — an outdated runbook is worse than no runbook because it wastes time with stale information.
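
The Escalation Path table in the template can also be encoded as data, so that tooling can answer "who should have been paged by now?". The following is a minimal sketch under that assumption; `Rung`, `LADDER`, and `who_to_page` are hypothetical names, and the "SEV0 lasting >1 hour" rule is simplified to "60 minutes or more".

```python
from typing import NamedTuple, Optional


class Rung(NamedTuple):
    level: str
    who: str
    after_minutes: int          # page once the incident is at least this old
    only_sev: Optional[int]     # None = applies to every severity


# Mirrors the Escalation Path table in the template above.
LADDER = [
    Rung("L1", "On-call engineer", 0, None),
    Rung("L2", "Service owner / Tech Lead", 15, None),
    Rung("L3", "Engineering Manager", 30, None),
    Rung("L4", "Director / VP Engineering", 45, 0),
    Rung("L5", "CTO", 60, 0),
]


def who_to_page(sev: int, minutes_unresolved: int) -> list[str]:
    """Return every level that should have been paged by now."""
    return [r.level for r in LADDER
            if minutes_unresolved >= r.after_minutes
            and (r.only_sev is None or r.only_sev == sev)]


print(who_to_page(sev=2, minutes_unresolved=50))
print(who_to_page(sev=0, minutes_unresolved=50))
```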

6. Automating the Runbook

A static runbook in a wiki is step one. The real SRE progression is toward automated runbooks — where the on-call engineer receives a pre-filled incident channel with relevant dashboards and diagnostic commands already executed.

PagerDuty + Rundeck Automation

Integrate PagerDuty with Rundeck (or Ansible Automation Platform) to trigger diagnostic jobs automatically when an alert fires:

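# Rundeck job definition (YAML job format), triggered by a PagerDuty webhook
- name: checkout-api-auto-diagnose
  description: Collect first-look diagnostics for checkout-api
  nodefilters:
    filter: 'name: kubernetes-prod'
  sequence:
    keepgoing: true        # run every step even if an earlier one fails
    strategy: node-first
    commands:
      - exec: kubectl get pods -n production -l app=checkout
      - exec: kubectl top pods -n production -l app=checkout
      - exec: kubectl logs -n production -l app=checkout --tail=50
      - exec: curl -s https://checkout-api/health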

The output is posted to the incident Slack channel before the on-call engineer even opens their laptop.

Slack Slash Commands

Build Slack slash commands for common incident actions:

/incident declare checkout-api "503 errors in us-east-1" --sev=SEV2

→ Creates #incident-2026-0625-004, posts dashboard links, pages on-call, assigns IC.

/incident diagnose checkout-api

→ Runs kubectl describe, checks recent deployments, posts logs.

/incident resolve

→ Prompts for root cause, duration, impact summary, posts resolution template.

GitOps for Runbooks

Store runbooks as Markdown in the same Git repository as the service code. This enforces:

Version control — Every runbook change is reviewed via PR
Co-location — Developers update the runbook when they change the service
CI/CD integration — Runbook validity checks in CI (e.g., lint markdown, verify links)

my-service/
├── src/
├── Dockerfile
├── k8s/
└── RUNBOOK.md    # ← Living next to the code

7. Common Pitfalls (and How to Avoid Them)

Even teams with a runbook make these mistakes. Learn from them.

Pitfall 1: The Runbook Is Outdated

Symptom: On-call follows a runbook that references a decommissioned dashboard, a renamed Slack channel, or a service that was migrated six months ago.

Fix: Treat the runbook as code. Require a runbook update as part of every significant deployment or service change. Use a CI check that verifies all links in the runbook return HTTP 200. Set a calendar reminder to audit all runbooks quarterly.

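# CI check: verify that every URL in the runbook returns HTTP 200.
# Needs GNU grep (-P). curl follows redirects and gives up after 10 s per URL.
grep -oP 'https?://[^\s)\]]+' RUNBOOK.md | sort -u | {
  failed=0
  while read -r url; do
    status=$(curl -sIL -o /dev/null --max-time 10 -w "%{http_code}" "$url")
    if [ "$status" != "200" ]; then
      echo "BROKEN: $url → $status"
      failed=1          # keep going so one run reports every broken link
    fi
  done
  exit "$failed"        # non-zero fails the CI job
}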

Pitfall 2: Too Many Alerts, Wrong Severity

Symptom: The on-call phone buzzes 40 times per night. Engineers develop alert fatigue. A real SEV0 gets lost in the noise.

Fix: Every alert must be actionable and correctly prioritized. If an alert fires and the correct response is "acknowledge and ignore," delete the alert. Use error budgets as the gating mechanism — only page when the error budget is burning too fast.
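
"Burning too fast" can be made concrete with a burn rate: the observed error ratio divided by the error ratio the SLO allows. The sketch below assumes a 30-day SLO window and uses the 14.4 threshold from Google's SRE Workbook (a burn rate of 14.4 sustained for one hour consumes 2% of a 30-day budget). The function names and the two-window check are illustrative, not taken from any alerting product.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    if slo_target >= 1.0:
        raise ValueError("slo_target must be below 1.0")
    budget = 1.0 - slo_target            # allowed error ratio, e.g. 0.001
    return error_ratio / budget


def should_page(long_window_errors: float, short_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast: the long window shows the burn
    is significant, the short one shows it is still happening."""
    return (burn_rate(long_window_errors, slo_target) >= threshold
            and burn_rate(short_window_errors, slo_target) >= threshold)


# 99.9% SLO; 2% of requests failing over the last hour and the last 5 minutes.
print(round(burn_rate(0.02, 0.999), 1))
print(should_page(0.02, 0.02, 0.999))
print(should_page(0.02, 0.0005, 0.999))   # already recovering: do not page
```

Requiring both windows to exceed the threshold is what keeps a short blip, or an incident that has already recovered, from waking anyone up.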

Pitfall 3: Hero Culture

Symptom: One senior engineer tries to solve everything alone. They do not escalate, do not communicate, and 90 minutes later, the SEV2 is now a SEV0.

Fix: Escalation is not weakness — it is process. The runbook's escalation path exists for a reason. The IC's job is to recognize when to pull in more people, not to solo the fix. Institute a hard rule: if the incident is not mitigated within the SLA window, escalation is mandatory, not optional.

Pitfall 4: No Communication During Incidents

Symptom: Stakeholders flood the IC with DMs. "Is it fixed yet?" "When will it be back?" "The CEO is asking." The IC cannot focus on actually fixing the problem.

Fix: The Communications Lead exists for exactly this reason. Their only job is to post status updates at regular intervals (every 15 minutes for SEV1, every 30 for SEV2) so nobody has to ask. Template:

📢 INCIDENT UPDATE — SEV2 — 14:20 UTC
Status: Still investigating. Checkout API returning 503s.
Mitigation attempted: Rollback to v2.4.0 — no improvement.
Current hypothesis: Upstream payment gateway timeout.
Next update: 14:35 UTC

Pitfall 5: Skipping the Postmortem

Symptom: Incident resolved. Everyone is tired. "We'll do the postmortem later." Later never comes.

Fix: Schedule the postmortem during the resolution call. Block 1 hour on everyone's calendar within 48 hours, while memory is fresh. A postmortem held a week later is worth far less: details have faded, and logs or timelines may have aged out of retention. If your incident management tooling does not auto-schedule postmortems, add it as a manual step in your runbook.

8. Blameless Postmortem Template

A postmortem is a written record of what happened, why, and what will change. It is not about assigning fault. Use this template:

# Postmortem: [INCIDENT TITLE]

## Metadata
- **Incident ID:** INC-YYYY-MMDD-NNN
- **Date:** [YYYY-MM-DD]
- **Duration:** [HH:MM – HH:MM UTC] ([N] minutes)
- **Severity:** SEV[0/1/2/3]
- **Incident Commander:** @[name]
- **Postmortem Author:** @[name]
- **Status:** [Draft / Reviewed / Published]

## Summary
[One paragraph: what happened, impact, how it was fixed]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:03 | Prometheus alert fired: checkout-api 5xx > 5% |
| 14:05 | @alice acknowledged, began investigation |
| 14:08 | Incident declared SEV2 in #incident-2026-0625-003 |
| 14:12 | Rollback to v2.4.0 attempted — no improvement |
| 14:18 | Upstream payment gateway identified as root cause |
| 14:25 | Payment gateway team paged, confirmed outage on their side |
| 14:35 | Retry circuit breaker activated — error rate dropping |
| 14:47 | All metrics green, 10 min verification passed |
| 14:50 | Incident resolved |

## Root Cause
[Detailed technical explanation. What specific change, failure, or condition triggered the incident?]

## Impact
- **Users affected:** ~1,200 (12% of checkout traffic)
- **Revenue impact:** Estimated $3,400 in lost transactions
- **Data loss:** None
- **Security impact:** None

## What Went Well
- Alert fired within 2 minutes of error rate crossing threshold
- Incident Commander declared within 5 minutes of alert
- Communications Lead posted updates every 15 minutes
- Rollback was attempted quickly even though it didn't help

## What Went Poorly
- Payment gateway was not listed in service dependencies — added 13 min to diagnosis
- No circuit breaker was pre-configured for upstream failures
- Secondary on-call (backup IC) was unreachable for 10 min

## Action Items
| # | Action | Owner | Priority | Due |
|---|--------|-------|----------|-----|
| INC-42 | Add payment gateway to service dependency list and runbook | @charlie | P0 | 2026-06-27 |
| INC-43 | Implement circuit breaker with retry for all upstream calls | @alice | P1 | 2026-07-01 |
| INC-44 | Verify secondary on-call contact info in PagerDuty | @diana | P0 | 2026-06-26 |
| INC-45 | Add synthetic check for payment gateway health | @bob | P2 | 2026-07-15 |

## Lessons Learned
[1-3 sentences capturing the key takeaway for the broader org]

Store postmortems in a shared, searchable location. Over time, they become your organization's institutional memory — patterns emerge, recurring root causes become obvious, and you can justify infrastructure investments with real incident data.
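
Response metrics are easier to compare across incidents when they are computed from the timeline rather than estimated. The sketch below takes the key timestamps from the example timeline above; treating 14:35 (circuit breaker activated) as the mitigation time is an assumption, and `minutes_between` is a hypothetical helper that only handles same-day HH:MM values.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Key events copied from the example timeline above.
timeline = {
    "alert": "14:03",
    "acknowledged": "14:05",
    "declared": "14:08",
    "mitigated": "14:35",
    "resolved": "14:50",
}

metrics = {
    "time_to_acknowledge": minutes_between(timeline["alert"], timeline["acknowledged"]),
    "time_to_declare": minutes_between(timeline["alert"], timeline["declared"]),
    "time_to_mitigate": minutes_between(timeline["alert"], timeline["mitigated"]),
    "total_duration": minutes_between(timeline["alert"], timeline["resolved"]),
}
for name, minutes in metrics.items():
    print(f"{name}: {minutes} min")
```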

9. Conclusion

An incident management runbook does not prevent incidents. What it does is far more valuable: it compresses the time between "something is wrong" and "it is fixed." It removes the cognitive load of deciding what to do under pressure and replaces it with a muscle-memory procedure.

Incident management is a core SRE competency tested in interviews — see how it's covered in the Top 50 SRE interview questions.

Start today:

Pick one service. Write its runbook using the template in Section 5.
Define your severity levels. Get stakeholder alignment — nobody should argue about SEV during an incident.
Practice. Run a fire drill. Fake an incident and walk through the runbook. Find the gaps before a real outage does.
Automate one step. Even something small — an auto-created Slack channel or a diagnostic script — saves minutes during your next SEV2.

The best-run SRE teams do not have fewer incidents. They recover faster, communicate better, and learn more from each one. A runbook is how they do it.

Kubernetes OOMKilled (Exit Code 137): How to Debug and Fix It

devtocash — Wed, 15 Jul 2026 01:10:58 +0000

What OOMKilled actually means

OOMKilled with exit code 137 means the Linux kernel killed your container because it tried to use more memory than it was allowed. The 137 is 128 + 9 — the process received signal 9 (SIGKILL). It gets no warning, no chance to flush, no chance to log. One moment it's serving traffic; the next it's gone, and the pod restarts into a CrashLoopBackOff if the same thing keeps happening.
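
The 128 + signal convention applies to any exit code above 128, so it can be decoded mechanically. This is a small sketch using Python's standard `signal` module; `describe_exit_code` is a hypothetical helper, and the signal names it prints are those of the platform it runs on (Linux is assumed here).

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain a container exit code using the 128 + signal convention."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"      # not a named signal on this platform
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"application error {code}"


print(describe_exit_code(137))   # 128 + 9
print(describe_exit_code(143))   # 128 + 15
```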

There are two completely different situations that both surface as OOMKilled, and the entire fix depends on telling them apart:

Container-level OOM (cgroup): the container exceeded its own resources.limits.memory. Only that container dies. This is by far the more common case.
Node-level OOM: the whole node ran out of physical memory, and the kernel's OOM killer picked a victim — sometimes not even the pod that caused the pressure.

This playbook is the exact sequence I run to confirm which one you have and fix it for good, usually in a few minutes.

Step 1: Confirm it's actually OOMKilled

Don't guess from a restarting pod — read the terminated container's state:

kubectl describe pod payments-api-7d9f4c8b6-xk2mn

Look at the Last State block:

    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Wed, 15 Jul 2026 09:14:02
      Finished:     Wed, 15 Jul 2026 09:14:48

Reason: OOMKilled plus Exit Code: 137 is the signature. A bare 137 without Reason: OOMKilled is different: the container was SIGKILLed for another reason, most often because it did not exit within its termination grace period after a failed liveness probe. That is a probe problem, not a memory one, and the fix is in the liveness and readiness probes guide, not here.

The app logs are almost always useless for OOM because SIGKILL can't be caught — the process never logs its own death. So don't waste time in kubectl logs. The evidence lives in the container state and, for node-level kills, the kernel log.

Step 2: See the limit that was breached

You can't reason about a memory kill without knowing the ceiling it hit:

kubectl get pod payments-api-7d9f4c8b6-xk2mn \
  -o jsonpath='{.spec.containers[0].resources}'

{"limits":{"memory":"256Mi"},"requests":{"memory":"128Mi"}}

Now watch what the container actually consumes over time. You need metrics-server installed for this:

kubectl top pod payments-api-7d9f4c8b6-xk2mn --containers

POD                          NAME          CPU    MEMORY
payments-api-7d9f4c8b6-xk2mn payments-api  120m   248Mi

Sitting at 248Mi against a 256Mi limit is the whole story: this container lives one request away from death. The next question is why — and that's the fork that decides everything.

Step 3: The decisive fork — too-low limit or a leak?

Watch memory over a few minutes, not a single snapshot:

watch -n 5 'kubectl top pod payments-api-7d9f4c8b6-xk2mn --containers'

Flat plateau just under the limit → the limit is genuinely too low. The app has a stable working set that simply doesn't fit. Fix: raise the limit (Step 4).
Steady climb that never comes down, even under constant load → a memory leak. The app grows until it hits any ceiling you set. Fix: a bigger limit only buys hours; you have to find the leak (Step 5).

Getting this wrong wastes a day. If you raise the limit on a leaking app, it OOMs again the next night at 2 a.m. — same crash, bigger blast radius, because now it took more of the node down with it before dying.

Step 4: Right-size requests and limits

If the working set is legitimately larger than the limit, raise it — but set requests and limits deliberately, because they mean different things:

resources:
  requests:
    memory: "512Mi"   # what the scheduler reserves; guarantees placement
  limits:
    memory: "512Mi"   # the hard ceiling; exceed it and you're OOMKilled

Two rules that save real incidents:

For memory, set requests == limits. Memory isn't compressible like CPU: you can't throttle it. With requests equal to limits, the scheduler reserves everything the container is allowed to use, so it never depends on memory the node hasn't set aside. If every container in the pod also has CPU requests equal to CPU limits, the pod gets the Guaranteed QoS class and is the last thing the kubelet evicts under node pressure. If requests is much lower than limits (the Burstable class), the node can end up overcommitted, and your pod becomes a prime eviction target the moment the node gets tight.
Leave headroom for the runtime, not just the heap. A JVM or Node process needs the container limit to hold the heap plus metaspace, thread stacks, and off-heap buffers. Set the runtime heap to roughly 75% of the container limit:

# Node.js: heap capped below the container limit
node --max-old-space-size=384 server.js      # ~384Mi heap under a 512Mi limit

# JVM: let it read the cgroup limit and cap the heap as a percentage
java -XX:MaxRAMPercentage=75.0 -jar app.jar

Setting -Xmx equal to the container limit is a classic self-inflicted OOM: the heap fills to the limit, then the first thread stack allocation pushes the container past its ceiling and the kernel kills it — even though the JVM heap wasn't technically full.
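
The 75% rule is simple arithmetic on the container limit. Here is a minimal sketch that derives a heap size in MiB from a limit string; `parse_memory` and `heap_mib` are hypothetical helpers, and the parser deliberately handles only the binary suffixes Ki, Mi, and Gi rather than the full Kubernetes quantity syntax.

```python
import re

_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}   # binary suffixes only


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '2Gi' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi)", quantity)
    if not match:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def heap_mib(limit: str, heap_fraction: float = 0.75) -> int:
    """Heap budget in MiB, leaving the rest for stacks and off-heap memory."""
    return int(parse_memory(limit) * heap_fraction) // 2**20


for limit in ("512Mi", "1Gi"):
    print(f"limit={limit} -> --max-old-space-size={heap_mib(limit)}")
```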

Right-sizing is also a cost lever, not just a stability one. Over-provisioned memory requests reserve capacity you pay for and never use, and that reservation is what pushes you onto extra nodes. The same discipline runs through Kubernetes cost optimization and FinOps for Kubernetes — right-sized limits keep pods alive and keep the bill down. If you'd rather not tune by hand, the VPA in recommendation mode will watch actual usage and suggest values; see how it fits with HPA and KEDA in the pod autoscaling guide.

Step 5: Hunt the leak

If memory climbs and never recovers, raising the limit is a stopgap. Confirm the trend first, then attack the source.

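Confirm with the cAdvisor metric that the kubelet exposes to Prometheus. container_memory_working_set_bytes is the closest signal to what gets a container killed: it is total usage minus inactive file cache that the kernel can reclaim, and it is the figure the kubelet uses for memory-pressure eviction, so it's the honest signal: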

container_memory_working_set_bytes{pod=~"payments-api-.*", container="payments-api"}

A sawtooth that resets on each restart, with the peak creeping up run over run, is a leak's fingerprint. Common culprits by runtime:

Node.js: unbounded caches, listeners added in a loop, closures holding large buffers. Capture a heap snapshot with node --inspect and diff two snapshots taken minutes apart in Chrome DevTools — the objects that grew are your leak.
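JVM: rising heap after full GC. Enable a heap dump on OutOfMemoryError so the next failure leaves evidence. This fires only when the JVM itself runs out of heap, not when the kernel SIGKILLs the container, so it relies on the heap cap from Step 4 sitting below the container limit, and on /dumps being a volume that survives the restart: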

java -XX:+HeapDumpOnOutOfMemoryError \
     -XX:HeapDumpPath=/dumps/heap.hprof -jar app.jar

Then open the .hprof in Eclipse MAT and run the leak-suspects report.

Go: goroutine leaks (each holds a stack) or slices that keep a reference to a huge backing array. Hit /debug/pprof/heap with go tool pprof and look at inuse_space.

The point is the same everywhere: a leak is a code bug. Kubernetes can only contain it, not cure it, and every limit you set is just a delay timer.

Step 6: When the whole node runs out of memory

If describe shows OOMKilled but the container was nowhere near its own limit, you have node-level OOM. The node ran out of physical RAM — often because too many Burstable pods were overcommitted — and the kernel's OOM killer chose a victim by oom_score, which can be a bystander pod. Check the node:

kubectl describe node ip-10-0-1-42 | grep -A6 "Allocated resources"

If total memory Limits far exceed the node's allocatable RAM, you're overcommitted. You may also see the kubelet evicting pods before the kernel even acts — an Evicted status with The node was low on resource: memory. That's the kubelet trying to reclaim memory gracefully, and Guaranteed QoS pods survive it while BestEffort (no requests/limits at all) pods die first.

Fixes:

Set memory requests on every pod so the scheduler stops overcommitting the node.
Give critical workloads Guaranteed QoS (requests == limits for both CPU and memory, in every container) so they're evicted last.
Add capacity or enable the cluster autoscaler so pressure triggers a scale-out instead of a massacre.
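
Kubernetes assigns the QoS class from the requests and limits of every container in the pod: Guaranteed needs CPU and memory requests equal to limits everywhere, BestEffort means nothing is set at all, and everything else is Burstable. The sketch below reimplements that rule in simplified form to make it easy to check a spec by eye; `qos_class` is a hypothetical helper that compares quantity strings literally and ignores resources other than cpu and memory.

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a pod from its containers' `resources` dicts.

    Simplified: looks only at cpu and memory, and applies the Kubernetes
    default that a missing request falls back to the limit.
    """
    any_set = False
    all_guaranteed = True
    for resources in containers:
        requests = resources.get("requests", {})
        limits = resources.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            request = requests.get(resource, limit)
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


memory_only = [{"requests": {"memory": "512Mi"}, "limits": {"memory": "512Mi"}}]
cpu_and_memory = [{"requests": {"cpu": "500m", "memory": "512Mi"},
                   "limits": {"cpu": "500m", "memory": "512Mi"}}]
print(qos_class(memory_only))
print(qos_class(cpu_and_memory))
print(qos_class([{}]))
```

Note that a pod with only memory requests and limits set, even when they are equal, is still classified as Burstable because no CPU limit is present.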

For the deepest confirmation, the kernel logs the kill on the node itself: dmesg -T | grep -i oom shows the Out of memory: Killed process ... line with the exact PID and RSS the kernel objected to.

A repeatable checklist

When a pod is OOMKilled, run this in order:

kubectl describe pod <pod> → confirm Reason: OOMKilled + Exit Code: 137 in Last State.
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].resources}' → read the limit that was breached.
watch kubectl top pod <pod> --containers → flat plateau = raise the limit; steady climb = leak.
Too-low limit → set requests == limits, leave runtime headroom (MaxRAMPercentage=75, --max-old-space-size).
Leak → confirm with container_memory_working_set_bytes, then heap-dump the runtime and fix the code.
Container under its limit but still killed → node-level OOM; check overcommit and QoS, add requests, scale out.

Don't wait for the kill — alert on the approach

OOMKilled is one of the few pod failures you can see coming. When a container's working set is riding at 90% of its limit, it's a scheduled outage, not a surprise. Wire that into monitoring so it pages before the kernel acts:

container_memory_working_set_bytes{container!=""}
  / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"}
  > 0.9

Alert on that ratio in the Prometheus + Grafana monitoring setup and you convert a 2 a.m. crash loop into a business-hours ticket. Bake the six-step diagnosis into your incident runbook so the next on-call engineer fixes it in minutes. Mis-set memory limits and hard OOMs are among the quiet Kubernetes mistakes that cost companies millions — cheap to prevent, expensive to sleep through.

OpenTelemetry Tutorial 2026: Complete Setup Guide for SRE & DevOps

devtocash — Tue, 14 Jul 2026 13:07:24 +0000

Introduction

If you operate microservices in production, you already know the pain. A user reports a slow checkout. You open three different dashboards — Grafana for metrics, Jaeger for traces, and grep for logs. By the time you correlate the request ID across all three, the incident has been open for 45 minutes.

OpenTelemetry (OTel) solves this by unifying all three signals under one standard. It is now the CNCF's second-most active project after Kubernetes, and every major observability vendor — Datadog, Honeycomb, Grafana Labs, New Relic — has adopted its protocol. In 2026, if you are not instrumenting with OpenTelemetry, you are building technical debt every time you ship code.

This tutorial walks you through a complete OpenTelemetry setup: instrumentation with the OTel SDK, collector configuration, and exporting traces and metrics to Jaeger and Prometheus. Everything is hands-on with real YAML and code snippets you can run today.

By the end, you will have:

A Python service auto-instrumented with traces and metrics
An OpenTelemetry Collector processing and exporting telemetry
Traces visible in Jaeger and metrics scraped by Prometheus
A working mental model of OTel's pipeline architecture

What Is OpenTelemetry, Actually?

OpenTelemetry is not a backend. It is not a database, a dashboard, or an alerting engine. It is a telemetry pipeline standard — a specification, a set of SDKs, and a collector binary that together generate, process, and export traces, metrics, and logs.

The project emerged from the 2019 merger of OpenTracing and OpenCensus. OpenTracing was a CNCF project and OpenCensus came out of Google; the two had overlapping goals. Rather than compete, they merged into a single standard under the CNCF. Today, OTel is at version 1.34+ and is considered stable for traces and metrics.

Three things make OpenTelemetry different from what came before:

Vendor-neutral instrumentation. You instrument once with the OTel SDK. Changing backends — from Jaeger to Honeycomb, or from Prometheus to Datadog — means changing an exporter config, not rewriting code.
The Collector. A standalone binary that receives, processes, and exports telemetry. You can run it as a sidecar, a daemonset, or a central gateway. It handles batching, filtering, sampling, and routing — all config-driven.
Context propagation. The traceparent header (W3C Trace Context standard) passes trace context across HTTP, gRPC, and message queues. Every hop in your distributed system links back to a single root span without custom headers.
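
To make the traceparent header less abstract, here is a small parser for the version-00 format defined by W3C Trace Context: `version-traceid-parentid-flags`, all lowercase hex. This is a hand-written sketch for illustration only; in a real service the OTel SDK's propagators handle this, and `parse_traceparent` and `child_traceparent` are hypothetical names.

```python
import re
from typing import NamedTuple


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


_PATTERN = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header: str) -> TraceParent:
    """Parse a version-00 W3C traceparent header value."""
    match = _PATTERN.fullmatch(header.strip())
    if not match:
        raise ValueError(f"malformed traceparent: {header!r}")
    version, trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id is invalid")
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(incoming: TraceParent, new_span_id: str) -> str:
    """Header for the next hop: same trace-id, this service's span as parent."""
    flags = "01" if incoming.sampled else "00"
    return f"{incoming.version}-{incoming.trace_id}-{new_span_id}-{flags}"


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(ctx.trace_id, ctx.sampled)
print(child_traceparent(ctx, "b7ad6b7169203331"))
```

The trace-id stays constant across every hop while the parent-id changes, which is how each span links back to a single root.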

The telemetry pipeline looks like this:

Application Code --> OTel SDK --> OTel Collector --> Backend (Jaeger/Prometheus/...)
     (API calls)       (auto/manual)    (process/route)         (store/query)

The SDK generates spans and metrics inside your application process. The Collector — a separate binary — receives them via OTLP (OpenTelemetry Protocol) over gRPC or HTTP, then applies processors and exports to one or more backends.

This separation matters. Your application never talks directly to Jaeger or Prometheus. It only talks to the Collector. The Collector absorbs backend changes without touching application code.

OpenTelemetry Architecture: The Pipeline Model

Every observability signal in OTel follows the same pipeline: Instrumentation → Processing → Export.

The Three Components

1. Instrumentation Libraries (SDK)

The SDK lives inside your application process. It creates spans, records metrics, and captures log events. OTel provides SDKs for Python, Go, Java, JavaScript, .NET, Rust, and more. You can use auto-instrumentation (zero code changes; the agent injects hooks at runtime) or manual instrumentation (explicit calls in your code, such as tracer.start_span() followed by span.end() in Python).

Auto-instrumentation covers most common libraries by default: HTTP frameworks (Flask, Express, Spring), database drivers (psycopg2, pgx, JDBC), and gRPC clients. For custom business logic, you add manual spans.

2. The OpenTelemetry Collector

The Collector is the backbone of any production OTel deployment. It is a single Go binary (otelcol-contrib) that runs three types of components in a pipeline:

Receivers: Accept telemetry data (OTLP gRPC, OTLP HTTP, Jaeger, Zipkin, Prometheus scrape)
Processors: Transform data in-flight (batch, filter, tail sampling, attributes mutation, redaction)
Exporters: Send data to backends (Jaeger, Prometheus, Datadog, Honeycomb, Kafka, stdout)

The Collector decouples your application from backends. If you switch from Jaeger to Tempo, or add a second exporter for Honeycomb, you change one YAML file — not every microservice.

3. Exporters and Backends

Exporters are protocol-specific components that push data to observability backends. Common exporters include:

Exporter	Protocol	Typical Backend
`otlp`	gRPC/HTTP	Any OTLP-compatible backend (Jaeger, Tempo, Grafana Agent)
`prometheus`	HTTP scrape	Prometheus server
`jaeger`	Thrift/gRPC	Jaeger backend
`logging`	stdout	Debugging during development
`kafka`	Kafka	Long-term buffering, multi-datacenter pipelines

The OTLP Protocol

All communication between the SDK and the Collector uses OTLP (OpenTelemetry Protocol). OTLP is a Protobuf-based protocol that runs over gRPC (port 4317) or HTTP/1.1 (port 4318). In 2026, OTLP over HTTP has matured enough that many teams prefer it over gRPC for simpler firewall traversal and load balancer compatibility.

A typical OTLP trace payload is a binary-encoded Protobuf message containing resource attributes (service name, host, namespace), span data (trace ID, span ID, parent span ID, start/end timestamps, attributes, events), and instrumentation scope.

Setting Up the OpenTelemetry Collector

Let's start with the Collector — it is the first piece you deploy because your applications need somewhere to send telemetry.

Step 1: Install the Collector

The recommended distribution is otelcol-contrib, which includes receivers and exporters for every major observability tool:

# Linux (AMD64)
wget https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.110.0/otelcol-contrib_0.110.0_linux_amd64.tar.gz
tar -xzf otelcol-contrib_0.110.0_linux_amd64.tar.gz
sudo mv otelcol-contrib /usr/local/bin/

# Verify
otelcol-contrib --version

For Docker-based development:

docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 -p 8888:8888 \
  -v $(pwd)/otel-config.yaml:/etc/otelcol/config.yaml \
  otel/opentelemetry-collector-contrib:0.110.0

Step 2: Write the Collector Configuration

Create otel-config.yaml. This is the heart of your observability pipeline:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 512
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert

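exporters:
  # The dedicated `jaeger` exporter was removed from recent otelcol-contrib
  # releases, so traces go to Jaeger's native OTLP endpoint instead.
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  logging:
    # `verbosity` replaces the deprecated `loglevel` setting
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/jaeger, logging]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus, logging]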

This configuration does several things:

Receivers listen on ports 4317 (gRPC) and 4318 (HTTP) for OTLP data from applications
Processors batch spans for efficiency, limit memory usage to 512 MiB, and add an environment=production attribute to every span
Exporters forward traces to Jaeger, expose metrics on port 8889 for Prometheus scraping, and log debug output to stdout
Pipelines wire everything together — traces and metrics take different paths through the same Collector

The batch processor is critical for production. Without it, the Collector forwards each incoming payload to Jaeger as its own export request, however small, which multiplies network overhead. Batching amortizes that cost across batches of roughly 512 spans.
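A frequent config mistake is a pipeline that references a component which was never defined. The sketch below checks for that. It assumes the YAML has already been parsed into a dict (the sample dict is a trimmed, hypothetical config, with a deliberate typo), and undefined_components is an illustrative helper, not part of the Collector.

```python
# Hypothetical, trimmed config in the shape of otel-config.yaml after parsing.
config = {
    "receivers": {"otlp": {}},
    "processors": {"memory_limiter": {}, "batch": {}, "attributes": {}},
    "exporters": {"prometheus": {}, "logging": {}},
    "service": {
        "pipelines": {
            "traces": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch", "attributes"],
                "exporters": ["logging"],
            },
            "metrics": {
                "receivers": ["otlp"],
                "processors": ["memory_limiter", "batch"],
                "exporters": ["prometheus", "loging"],  # deliberate typo
            },
        }
    },
}


def undefined_components(cfg: dict) -> list:
    """Return (pipeline, section, name) for every reference with no definition."""
    missing = []
    for pipeline, spec in cfg["service"]["pipelines"].items():
        for section in ("receivers", "processors", "exporters"):
            for name in spec.get(section, []):
                if name not in cfg.get(section, {}):
                    missing.append((pipeline, section, name))
    return missing


print(undefined_components(config))
```

The Collector performs its own validation at startup; a check like this is only useful earlier, for example in CI before the config is deployed.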

Step 3: Run the Collector

otelcol-contrib --config=otel-config.yaml

You should see log output confirming that all receivers, processors, and exporters are active. The Collector is now ready to receive telemetry from your applications.

Instrumenting Your First Application

Now that the Collector is running, let's instrument a Python web service. We will use Flask for the HTTP layer and the OpenTelemetry Python SDK for auto-instrumentation, then add manual spans for custom business logic.

Step 1: Install Dependencies

pip install flask opentelemetry-api opentelemetry-sdk \
  opentelemetry-instrumentation-flask \
  opentelemetry-instrumentation-requests \
  opentelemetry-exporter-otlp-proto-grpc

The key packages:

opentelemetry-api and opentelemetry-sdk — the core OTel SDK
opentelemetry-instrumentation-flask — auto-instrumentation for Flask (creates spans for each HTTP request automatically)
opentelemetry-instrumentation-requests — auto-instrumentation for the requests library (spans for outbound HTTP calls)
opentelemetry-exporter-otlp-proto-grpc — the OTLP exporter that sends data to our Collector

Step 2: Write the Application

# app.py
from flask import Flask, request, jsonify
import requests
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.resources import Resource, SERVICE_NAME
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

# --- OTel Setup ---

resource = Resource(attributes={
    SERVICE_NAME: "checkout-service",
    "deployment.environment": "staging"
})

provider = TracerProvider(resource=resource)

otlp_exporter = OTLPSpanExporter(
    endpoint="http://localhost:4317",
    insecure=True
)

provider.add_span_processor(BatchSpanProcessor(otlp_exporter))
trace.set_tracer_provider(provider)

# --- Application ---

app = Flask(__name__)

# Auto-instrument Flask and outgoing HTTP requests
FlaskInstrumentor().instrument_app(app)
RequestsInstrumentor().instrument()

# Get a tracer for manual instrumentation
tracer = trace.get_tracer(__name__)


@app.route("/checkout", methods=["POST"])
def checkout():
    """Process a checkout — spans created automatically by FlaskInstrumentor."""

    data = request.get_json()

    # Manual span for the payment processing step
    with tracer.start_as_current_span("process_payment") as span:
        span.set_attribute("payment.amount", data.get("amount", 0))
        span.set_attribute("payment.method", data.get("method", "unknown"))

        # Simulate payment work
        time.sleep(0.15)

        payment_result = process_payment(data.get("amount", 0))

        span.set_attribute("payment.status", payment_result["status"])
        span.set_status(trace.Status(trace.StatusCode.OK))

    # Manual span for inventory update
    with tracer.start_as_current_span("update_inventory") as span:
        span.set_attribute("inventory.items", len(data.get("items", [])))

        time.sleep(0.08)

        # Outbound HTTP call — automatically traced by RequestsInstrumentor
        resp = requests.post(
            "http://inventory-service:5001/update",
            json={"items": data.get("items", [])}
        )
        span.set_attribute("inventory.response_code", resp.status_code)

    return jsonify({"status": "ok", "order_id": "ord-2026-abc"})


def process_payment(amount: float) -> dict:
    """Simulated payment gateway call."""
    return {"status": "authorized", "transaction_id": "txn-42", "amount": amount}


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

What This Code Does

Every /checkout request now generates a trace with multiple spans:

Root span — created automatically by FlaskInstrumentor for the HTTP request
process_payment — manual span wrapping payment logic, with custom attributes (amount, method, status)
update_inventory — manual span wrapping inventory logic
HTTP POST to inventory-service — nested span created by RequestsInstrumentor, linked to the parent update_inventory span

Context propagation is automatic. When the /checkout handler calls requests.post(...), the OTel SDK injects the traceparent header into the outbound HTTP request. If the inventory service is also instrumented with OTel, it extracts that header and continues the same trace — creating a single distributed trace across both services.

Step 3: Run and Verify

# Terminal 1: Start the Collector (if not already running)
otelcol-contrib --config=otel-config.yaml

# Terminal 2: Start the Flask app
python app.py

# Terminal 3: Generate a trace
curl -X POST http://localhost:5000/checkout \
  -H "Content-Type: application/json" \
  -d '{"amount": 49.99, "method": "card", "items": [{"id": 1}, {"id": 2}]}'

Check the Collector's debug log output. You should see spans being received, processed, and exported. The logging exporter will print span summaries to stdout — useful for debugging before you wire up Jaeger.

Look for lines like:

Span #0
    Trace ID       : 6e8f4c7a1b2d3e4f5a6b7c8d9e0f1a2b
    Parent ID      :
    ID             : 3a4b5c6d7e8f9a0b
    Name           : POST /checkout
    Kind           : Server
    ...

Span #1
    Trace ID       : 6e8f4c7a1b2d3e4f5a6b7c8d9e0f1a2b
    Parent ID      : 3a4b5c6d7e8f9a0b
    ID             : 1b2c3d4e5f6a7b8c
    Name           : process_payment
    ...

The shared Trace ID across both spans confirms that context propagation is working — both spans belong to the same distributed trace.

Exporting Traces to Jaeger

The logging exporter is useful for debugging, but you need a real trace backend. Let's set up Jaeger and configure the Collector to forward traces.

Step 1: Run Jaeger All-in-One

For development, Jaeger's all-in-one image bundles the collector, query UI, and in-memory storage:

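docker run -d --name jaeger \
  -e COLLECTOR_OTLP_ENABLED=true \
  -p 16686:16686 \
  -p 14317:4317 \
  jaegertracing/all-in-one:1.62

Port 16686: Jaeger UI (open http://localhost:16686)
Port 14317 on the host, mapped to 4317 in the container: OTLP gRPC receiver (Jaeger can accept OTLP directly as of 1.35+). Host port 4317 is already taken by our Collector, so Jaeger's receiver is published on a different host port; 14317 is an arbitrary choice.

However, routing through our Collector is the production pattern. Update the Collector config to point at Jaeger:

# otel-config.yaml (exporter section update)
# Assumes the Collector runs as a host binary. From a container, use the
# Jaeger container's network name and port 4317 instead of localhost.
exporters:
  otlp/jaeger:
    endpoint: localhost:14317
    tls:
      insecure: true
  # ... keep the other exporters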

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch, attributes]
      exporters: [otlp/jaeger, logging]

Step 2: Generate Traces and Inspect

Send a few checkout requests:

for i in $(seq 1 5); do
  curl -s -X POST http://localhost:5000/checkout \
    -H "Content-Type: application/json" \
    -d '{"amount": 49.99, "method": "card", "items": [{"id": 1}]}' > /dev/null
done

Open Jaeger UI at http://localhost:16686:

Select checkout-service from the Service dropdown
Click Find Traces
You should see 5 traces, each containing multiple spans

Click any trace to view the waterfall diagram. You will see the parent POST /checkout span and its children — process_payment, update_inventory, and potentially the outbound HTTP call to inventory-service. Expand a span to see attributes like payment.amount, payment.method, and payment.status.

Debugging Tip: Missing Spans

If you see the root span but not the child spans, check:

# Verify the Collector is receiving spans
curl http://localhost:8888/metrics | grep otelcol_receiver_accepted_spans

# Check Collector logs for export errors
otelcol-contrib --config=otel-config.yaml 2>&1 | grep -i error

Common causes:

Batch processor delay: Spans are batched for up to 5 seconds before export. Wait at least 5 seconds after sending a request.
OTLP endpoint mismatch: The SDK sends to localhost:4317 but the Collector listens on a different host. Use 0.0.0.0:4317 in the Collector config for local dev.
TLS mismatch: If the Collector expects TLS but the SDK sends plaintext (or vice versa), the connection fails silently. Match insecure: true settings on both sides.
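To see where spans go missing, it helps to compare what the Collector accepted with what it exported. The sketch below sums counters from the Prometheus text that the Collector serves on port 8888. The sample text is invented for illustration, and the parser assumes lines without trailing timestamps; some Collector versions append a _total suffix to these counter names, so both spellings are accepted.

```python
def sum_metric(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` (or `name_total`) across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]
        if base in (name, name + "_total"):
            total += float(value)
    return total


# Invented sample in the shape of `curl http://localhost:8888/metrics` output.
sample = """\
# TYPE otelcol_receiver_accepted_spans counter
otelcol_receiver_accepted_spans{receiver="otlp",transport="grpc"} 100
otelcol_receiver_accepted_spans{receiver="otlp",transport="http"} 20
otelcol_exporter_sent_spans{exporter="otlp/jaeger"} 115
otelcol_exporter_send_failed_spans{exporter="otlp/jaeger"} 5
"""

accepted = sum_metric(sample, "otelcol_receiver_accepted_spans")
sent = sum_metric(sample, "otelcol_exporter_sent_spans")
failed = sum_metric(sample, "otelcol_exporter_send_failed_spans")
print(f"accepted={accepted:.0f} sent={sent:.0f} failed={failed:.0f}")
```

If accepted is zero, the problem is between the SDK and the Collector. If accepted is healthy but sent lags or failed climbs, look at the exporter and the backend.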

Exporting Metrics to Prometheus

Traces tell you what happened. Metrics tell you how often and how fast. OTel's metrics pipeline works the same way, but the Prometheus exporter is an HTTP server that Prometheus scrapes — it does not push.

Step 1: Configure Prometheus Scrape

Add a scrape target to your prometheus.yml:

scrape_configs:
  - job_name: "otel-collector"
    scrape_interval: 15s
    static_configs:
      - targets: ["localhost:8889"]

The Collector's Prometheus exporter already listens on port 8889 (from our earlier config). No additional setup is needed.

Step 2: Auto-Instrument Metrics

The Flask instrumentation also captures HTTP server metrics automatically:

# Add to app.py after the trace setup
from opentelemetry import metrics
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter

metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True),
    export_interval_millis=15000
)

meter_provider = MeterProvider(
    resource=resource,
    metric_readers=[metric_reader]
)
metrics.set_meter_provider(meter_provider)

This exports HTTP request counts, latency histograms, and error rates — all generated automatically by FlaskInstrumentor.

Step 3: Verify Metrics in Prometheus

# Check that Prometheus is scraping the Collector
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="otel-collector")'

# Query a metric (the exporter's `namespace: otel` setting prefixes every metric name)
curl "http://localhost:9090/api/v1/query?query=otel_http_server_duration_milliseconds_bucket"

The metrics pipeline is now live: your application generates metrics, the SDK ships them to the Collector, and Prometheus scrapes the Collector's Prometheus exporter endpoint. Grafana can query Prometheus to build dashboards.
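Those latency histogram buckets are exactly what a latency SLI is computed from. As a minimal sketch, assuming you have already read the cumulative bucket counts for one time window (the numbers below are invented), the fraction of requests at or under a threshold is one bucket divided by the +Inf bucket:

```python
def fraction_within(buckets: dict, threshold_ms: float) -> float:
    """Share of requests at or under threshold_ms.

    `buckets` maps each `le` upper bound in ms to its cumulative count,
    with float("inf") standing in for the +Inf bucket.
    """
    total = buckets[float("inf")]
    if total == 0:
        raise ValueError("no requests observed in this window")
    if threshold_ms not in buckets:
        raise ValueError(f"{threshold_ms} is not a bucket boundary")
    return buckets[threshold_ms] / total


# Invented cumulative counts for one window of /checkout traffic.
window = {100.0: 620, 250.0: 890, 500.0: 975, 1000.0: 995, float("inf"): 1000}

print(f"{fraction_within(window, 500.0):.1%} of requests finished within 500 ms")
```

The threshold has to sit on a bucket boundary, which is a good reason to pick histogram boundaries that match the latency targets you care about.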

Adding a Custom Metric

Beyond auto-instrumentation, add a business-level counter:

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
order_counter = meter.create_counter(
    "checkout.orders",
    description="Number of completed checkouts",
    unit="1"
)

@app.route("/checkout", methods=["POST"])
def checkout():
    # ... existing code ...
    order_counter.add(1, {"method": data.get("method", "unknown")})
    return jsonify({"status": "ok"})

Now you have an otel_checkout_orders_total metric in Prometheus (the checkout.orders counter, prefixed by the exporter's otel namespace), labeled by payment method. Query it to track business throughput, not just infrastructure health.

Deploying OpenTelemetry on Kubernetes

Running the Collector as a standalone binary works for development. In production, you deploy it to Kubernetes using one of three patterns. Each has tradeoffs in scalability, latency, and operational complexity.

Pattern 1: Sidecar (Per-Pod Collector)

A Collector container runs alongside your application container in the same pod. The application sends telemetry to localhost:4317.

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-service
  template:
    metadata:
      labels:
        app: checkout-service
    spec:
      containers:
        - name: app
          image: checkout-service:latest
          ports:
            - containerPort: 5000
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://localhost:4317"

        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          args: ["--config=/etc/otelcol/config.yaml"]
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-sidecar-config

Pros: Simple, no network hops, pod-level isolation.
Cons: One Collector per pod wastes resources. 100 pods = 100 Collectors. Not suitable for large clusters unless you run low-resource Collector replicas.

Pattern 2: DaemonSet (Per-Node Collector)

One Collector runs on every node as a DaemonSet. All pods on that node send telemetry to the node-local Collector via the host network or a node port.

# otel-daemonset.yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector
  namespace: observability
spec:
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      hostNetwork: true
      containers:
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.110.0
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317
              hostPort: 4317
            - containerPort: 4318
              hostPort: 4318
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol
          resources:
            limits:
              memory: 512Mi
              cpu: 500m
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-daemonset-config


## Advanced Sampling Strategies

Tail sampling is one of OpenTelemetry's most powerful features — and one of the easiest to misconfigure. Understanding the decision flow will save you from exploding your telemetry bill or dropping critical traces.

### Head-Based Sampling (Probabilistic)

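Head sampling makes the keep-or-drop decision at the start of a trace, with no buffering and no delay. It is not the default: an SDK with no sampler configured records every span. You can apply the same probabilistic decision in two places. A ratio-based sampler in the SDK drops spans inside the application process, so those bytes never reach the network. The Collector's `probabilistic_sampler` processor applies it after the spans have arrived, which saves backend storage but not the hop from the SDK to the Collector:

```yaml
# Collector config for probabilistic sampling
processors:
  probabilistic_sampler:
    sampling_percentage: 10
```

This keeps roughly 10% of traces and drops the other 90% without waiting for them to finish. Use this when: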

- **You are cost-sensitive:** Every exported span costs storage and network. At 1,000 requests per second, keeping 100% of spans can exhaust your observability budget.
- **You want a representative subset, not specific traces:** Probabilistic sampling gives you a statistically representative sample of traffic. The decision is made before the request finishes, so it cannot favor slow or failed requests. If you need one specific slow request, head sampling may already have dropped it.
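To show why a ratio-based decision needs no coordination between services, here is a simplified sketch of the idea behind trace-ID ratio sampling: compare part of the trace ID against a fixed bound. The exact comparison a given SDK uses is an assumption here; `should_sample` is an illustrative name.

```python
import secrets


def should_sample(trace_id: int, ratio: float) -> bool:
    """Keep the trace if the low 64 bits of its ID fall below ratio * 2**64."""
    bound = round(ratio * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound


# The decision depends only on the trace ID, so it is repeatable.
trace_id = secrets.randbits(128)
assert should_sample(trace_id, 0.10) == should_sample(trace_id, 0.10)

# Over many random trace IDs the kept share approaches the ratio.
kept = sum(should_sample(secrets.randbits(128), 0.10) for _ in range(100_000))
print(f"kept {kept / 100_000:.1%} of traces")
```

Because the outcome is a pure function of the trace ID, any service that applies the same ratio to the same trace reaches the same verdict, which keeps sampled traces complete.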

### Tail-Based Sampling

Tail sampling makes the decision **after** the spans of a trace have arrived: the Collector buffers each trace for `decision_wait` (typically 5 to 30 seconds) and then evaluates its policies against the assembled trace:

```yaml
processors:
  tail_sampling:
    decision_wait: 30s
    policies:
      - name: errors-and-slow
        type: and
        and:
          and_sub_policy:
            - name: status_code
              type: status_code
              status_code: {status_codes: [ERROR]}
            - name: latency-over-2s
              type: latency
              latency: {threshold_ms: 2000}
      - name: probabilistic
        type: probabilistic
        probabilistic: {sampling_percentage: 25}
```

A trace is kept if any top-level policy matches. Here the `and` policy keeps every trace that both contains an error status code and ran longer than 2 seconds, and the `probabilistic` policy keeps 25% of all traces regardless of outcome. Everything else is dropped. To keep every error and every slow trace independently, list `status_code` and `latency` as separate top-level policies instead of nesting them under `and`. Tail sampling lets you capture the full picture of the requests you care about without storing every fast, successful one.
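The decision flow is easier to reason about as code. The sketch below mirrors the shape of the config above, an `and` policy plus a probabilistic one, with the top-level policies OR-ed together. It is a toy model, not the processor's implementation: the `Span` class, the hash-based stand-in for random sampling, and the use of the longest span as the trace duration are all simplifying assumptions.

```python
import zlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def has_error(spans) -> bool:
    return any(s.is_error for s in spans)


def is_slow(spans, threshold_ms: float = 2000) -> bool:
    # Simplification: treat the longest span as the trace duration.
    return max(s.duration_ms for s in spans) >= threshold_ms


def in_sample(spans, percentage: int) -> bool:
    # Deterministic stand-in for random sampling: hash the trace ID into 0..99.
    return zlib.crc32(spans[0].trace_id.encode()) % 100 < percentage


def keep_trace(spans, percentage: int = 25) -> bool:
    errors_and_slow = has_error(spans) and is_slow(spans)   # the `and` policy
    return errors_and_slow or in_sample(spans, percentage)  # policies are OR-ed


slow_failure = [Span("t1", 2400, is_error=True), Span("t1", 300)]
fast_failure = [Span("t2", 120, is_error=True)]

print(keep_trace(slow_failure, percentage=0))  # True: matches the `and` policy
print(keep_trace(fast_failure, percentage=0))  # False: an error, but under 2 s
```

The second result is the one that surprises people: with an `and` policy, a fast failure is only kept if the probabilistic policy happens to pick it.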

### When to Use Each

| Strategy | When to Use |
|----------|-------------|
| **Head (probabilistic)** | You have a strict sampling budget. You cannot store more than X spans per second. Use for high-throughput, cost-sensitive, always-on observability. |
| **Tail (policy-based)** | You need every trace that matches a condition, such as an error or high latency. Use when debugging errors, analyzing latency, or auditing compliance. |

## Common Pitfalls and Troubleshooting

### 1. The Collector Is Dropping Spans Silently

This is one of the most common OTel production issues. The Collector receives spans from the SDK, runs them through the batch processor, then loses them at the export step. The usual root cause is an exporter that cannot keep up: with small batches, every batch is a separate request to the backend, and under load those requests queue up, time out, and are eventually dropped.

**Fix:** Increase `send_batch_size` and give batches a longer `timeout` to fill:

```yaml
processors:
  batch:
    timeout: 10s
    send_batch_size: 2048
```

Why this works: with a batch size of 512, a burst of 2,000 spans leaves the Collector as four export requests (three full batches and one of 464 spans). At 2,048 the same burst fits in a single request, so the exporter makes fewer round-trips. Do not raise the size without limit: gRPC servers commonly reject messages above 4 MiB by default, so very large batches can fail at the receiving end. To confirm where spans are being lost, compare the Collector's own `otelcol_exporter_sent_spans` and `otelcol_exporter_send_failed_spans` metrics on port 8888.

### 2. The `traceparent` Header Is Missing

Your service A calls service B over HTTP. Service B is also instrumented with OTel. But the trace breaks: service B never receives the `traceparent` header, so its spans start a new trace instead of joining service A's.

**Diagnosis:** Send service B a request that carries a known `traceparent` header, then check whether its spans show up under that trace ID:

```bash
curl -H "traceparent: 00-..." http://service-b:5000/endpoint
```

If service B's spans join that trace, extraction works and the problem is on the calling side: service A's SDK is not injecting `traceparent` into its outbound requests. Fix: verify that the instrumentation library for service A's HTTP client is loaded and that a propagator is configured.

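```python
# Explicitly configure the global propagators
from opentelemetry import trace
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagate import set_global_textmap
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

# Set the global propagator before instrumenting any HTTP clients
set_global_textmap(
    CompositePropagator(
        [
            TraceContextTextMapPropagator(),
            W3CBaggagePropagator(),
        ]
    )
)

# Then create the TracerProvider
provider = TracerProvider()
trace.set_tracer_provider(provider)
```

The `TraceContextTextMapPropagator` writes the W3C `traceparent` header whenever an instrumented HTTP client sends a request. The Python SDK installs these two propagators by default, so setting them explicitly mainly protects against an `OTEL_PROPAGATORS` override or a custom propagator that replaced them. Without a trace-context propagator, distributed context propagation fails silently.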

### 3. High Cardinality Attributes Crash the Backend

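Span attributes like `user.id`, `request.id`, and `session.id` are unbounded: every request can produce a value the backend has never seen. At volume, those values bloat the backend's indexes and storage and can destabilize it.

**Remediation:** Drop high-cardinality attributes at the SDK level:

```python
# Wrap the real exporter and keep only an allow-list of span attributes
from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter

ALLOWED_KEYS = {"http.method", "http.url", "http.status_code"}

class AttributeLimitingExporter(SpanExporter):
    def __init__(self, delegate):
        self._delegate = delegate

    def export(self, spans):
        # Ended spans are read-only, so build filtered copies instead
        trimmed = [ReadableSpan(
            name=s.name, context=s.context, parent=s.parent, resource=s.resource,
            attributes={k: v for k, v in s.attributes.items() if k in ALLOWED_KEYS},
            events=s.events, links=s.links, kind=s.kind, status=s.status,
            start_time=s.start_time, end_time=s.end_time,
            instrumentation_scope=s.instrumentation_scope,
        ) for s in spans]
        return self._delegate.export(trimmed)

    def shutdown(self):
        self._delegate.shutdown()
```

Wrap your real exporter with it, for example `BatchSpanProcessor(AttributeLimitingExporter(otlp_exporter))`. The wrapper limits attributes to `http.method`, `http.url`, and `http.status_code`, dropping `user.id`, session tokens, and every other high-cardinality field. A span is read-only by the time it reaches a processor's `on_end`, which is why the filtering happens on copies at export time rather than by assigning to `span.attributes`. The backend stays stable.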

### 4. Memory Usage Grows Unbounded

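The Collector buffers spans in memory while they are batched and queued for export. Under sustained load, or when a backend slows down, 512 MiB becomes 1 GiB, then 2 GiB. The OOM killer strikes.

**Fix:** Configure the `memory_limiter` processor aggressively:

```yaml
processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 256
    spike_limit_mib: 64
```

The `limit_mib` sets a hard cap at 256 MiB. The `spike_limit_mib` must be smaller than `limit_mib`: it is the headroom reserved for bursts between checks, so the Collector starts refusing new data at the soft limit of `limit_mib - spike_limit_mib` (192 MiB here). Keep `limit_mib` below the container memory limit so the limiter acts before the OOM killer does, and list `memory_limiter` first in each pipeline's processors.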

## Security: Redacting Sensitive Data

OpenTelemetry traces can leak secrets. A span attribute like `credit_card_number` or `user.email` travels from your SDK through the Collector to Jaeger, and from there into your observability vendor's cloud. Every hop that logs or stores the span keeps a copy of the attribute.

**Prevention:** Filter sensitive attributes at the Collector level before they leave your network:

```yaml
processors:
  attributes:
    actions:
      - key: user.email
        action: delete
      - key: user.phone
        action: delete
      - pattern: ^credit_card.*
        action: delete
      - key: password
        action: delete
```

This configuration strips `user.email`, `user.phone`, `password`, and any attribute whose name matches the regular expression `^credit_card.*` from every span before it reaches the exporter. Note that `key` matches one exact attribute name, so wildcard matches need the `pattern` field. The sensitive data never leaves your boundary. Combine this with the [`k8sattributes` processor](https://opentelemetry.io/docs/kubernetes/collector/processor/), which attaches pod labels and annotations to telemetry, if you want to scope redaction rules to specific workloads.
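Redaction rules are worth unit-testing before they guard production data. The sketch below models the delete rules in plain Python so you can check them against sample attributes. It assumes the `credit_card.*` entry is treated as a regular expression on the attribute name; `redact` and the sample attributes are illustrative, not Collector code.

```python
import re

DELETE_KEYS = {"user.email", "user.phone", "password"}
DELETE_PATTERNS = [re.compile(r"^credit_card.*")]


def redact(attributes: dict) -> dict:
    """Return a copy without exact-key matches or pattern matches."""
    return {
        key: value
        for key, value in attributes.items()
        if key not in DELETE_KEYS
        and not any(p.search(key) for p in DELETE_PATTERNS)
    }


span_attributes = {
    "http.method": "POST",
    "user.email": "jane@example.com",
    "credit_card_number": "4111111111111111",
    "credit_card.expiry": "12/29",
    "password": "hunter2",
}

print(redact(span_attributes))  # {'http.method': 'POST'}
```

A deny-list like this only removes the names you thought of. Attributes with unexpected names, such as `cc_number`, pass straight through, which is the argument for reviewing real span samples periodically.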

For full defense in depth, review the [OTel Security documentation](https://opentelemetry.io/docs/security/).

## Further Reading

If you have made it this far, you now have a working OpenTelemetry pipeline — instrumentation, a Collector, and at least one observability backend. Here is where to go next:

- **[Kubernetes Security Best Practices 2026](https://devtocash.com/blog/kubernetes-security-best-practices-2026)** — Hardening your cluster before instrumenting your workloads. Security is not optional when observability is production.
- **[Error Budgets: Stop Wasting Your SRE Team's Time](https://devtocash.com/blog/error-budgets-sre-guide)** — Budget for reliability, not just velocity. Your error budget is a policy decision, not a suggestion.
- **OpenTelemetry Tracing: Instrument Your First Application** *(forthcoming)* — Distributed tracing with manual context propagation. A complete guide to instrumenting every service.

## Conclusion

OpenTelemetry is not a tool — it is a standard. Adopting it means instrumenting once with the SDK, processing through the Collector, and exporting to any backend without rewriting code. You have now walked through a complete setup: instrumentation with Python and Flask, Collector configuration in YAML, Jaeger for trace visualization, Prometheus for metrics, Kubernetes for production deployment, and operational patterns from DaemonSet to Gateway.

The most important things to remember:

1. **Instrument once, export anywhere.** The OTel SDK decouples your application from every backend. Changing exporters in the Collector config is not a code change.
2. **The Collector is your control plane.** Receivers, processors, and exporters form a pipeline. Data flows one way — from your code through the SDK to the Collector, then to Jaeger and Prometheus. You control the flow.
3. **Tail sampling saves budget.** Not every span is worth storing. Decide what to keep at the head (probabilistic, before the request finishes) or at the tail (policy-based, in the Collector once the trace is complete).

The observability landscape in 2026 is converging on OpenTelemetry. Every major vendor now speaks OTLP. The standard is the protocol — adopt it before it becomes a migration project.

---


Kubernetes ImagePullBackOff: How to Debug and Fix It (2026)

devtocash — Tue, 14 Jul 2026 01:17:40 +0000

What ImagePullBackOff actually means

ImagePullBackOff means the kubelet tried to pull your container image, failed, and is now waiting — with an exponential backoff — before trying again. Like CrashLoopBackOff, it is a symptom, not a root cause. The real failure is the pull itself, and the kubelet already recorded exactly why in the pod's events. You never have to guess.

The important distinction: ErrImagePull is the first failed attempt, and ImagePullBackOff is what you see after the kubelet starts backing off (10s, 20s, 40s, capped at 5 minutes). Same underlying problem. Deleting the pod does nothing — a new pod pulls the same broken reference and lands in the same state. This is the exact sequence I run to find the cause in about a minute.
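The backoff schedule explains why a fixed image can still take minutes to recover. As a small sketch using the numbers above (10 seconds doubling to a 5-minute cap; the function name is illustrative):

```python
def pull_backoff_delays(attempts: int, base: int = 10, cap: int = 300) -> list:
    """Delay in seconds before each retry: doubles from `base`, capped at `cap`."""
    return [min(base * 2**i, cap) for i in range(attempts)]


delays = pull_backoff_delays(7)
print(delays)       # [10, 20, 40, 80, 160, 300, 300]
print(sum(delays))  # 910
```

After the sixth failure every retry is five minutes apart, so a pod that has been failing for a while may sit idle for up to five minutes after you fix the image reference.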

Step 1: Read the actual pull error

Never start from theory. Start from describe, because the Events block quotes the container runtime's error verbatim:

kubectl describe pod payments-api-7d9f4c8b6-xk2mn

Scroll to the bottom:

  Warning  Failed     kubelet  Failed to pull image
           "myregistry/payments-api:1.4.2": failed to resolve reference:
           unexpected status: 401 Unauthorized
  Warning  Failed     kubelet  Error: ErrImagePull
  Normal   BackOff    kubelet  Back-off pulling image
  Warning  Failed     kubelet  Error: ImagePullBackOff

That one line — 401 Unauthorized — routes the entire investigation. The runtime error text maps cleanly to a root cause:

Error text	Root cause	Go to
`not found` / `manifest unknown`	Wrong image name or tag	Step 2
`401 Unauthorized` / `denied`	Missing or wrong registry credentials	Step 3
`429 Too Many Requests` / `toomanyrequests`	Docker Hub anonymous rate limit	Step 4
`no such host` / `i/o timeout` / `connection refused`	Node can't reach the registry	Step 5
`no match for platform`	Architecture mismatch (arm64 vs amd64)	Step 6

Read this first and the rest collapses to a single path.
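The table is simple enough to encode, which is handy if you triage pull failures across many pods at once. A minimal sketch, with the substrings taken from the table and an illustrative function name:

```python
# Substrings from the table above, mapped to the step that handles them.
RULES = [
    (("manifest unknown", "not found"), "Step 2: wrong image name or tag"),
    (("401 unauthorized", "denied"), "Step 3: missing or wrong registry credentials"),
    (("429 too many requests", "toomanyrequests"), "Step 4: Docker Hub rate limit"),
    (("no such host", "i/o timeout", "connection refused"), "Step 5: node can't reach the registry"),
    (("no match for platform",), "Step 6: architecture mismatch"),
]


def classify_pull_error(message: str) -> str:
    """Map a kubelet pull error message to the matching step."""
    lowered = message.lower()
    for needles, step in RULES:
        if any(needle in lowered for needle in needles):
            return step
    return "unrecognized: read the full event text"


event = (
    'Failed to pull image "myregistry/payments-api:1.4.2": '
    "failed to resolve reference: unexpected status: 401 Unauthorized"
)
print(classify_pull_error(event))  # Step 3: missing or wrong registry credentials
```

Feed it the Warning lines from kubectl describe, or from kubectl get events, to bucket a whole namespace of failing pods by cause.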

Step 2: Wrong image name or tag (`manifest unknown`)

The most common cause is the simplest: the image reference doesn't exist. A typo in the repository, a tag that was never pushed, or a CI pipeline that pushed 1.4.2 while your manifest still says 1.4.1.

Confirm exactly what the pod is asking for:

kubectl get pod payments-api-7d9f4c8b6-xk2mn \
  -o jsonpath='{.spec.containers[0].image}'

Then verify that reference actually exists in the registry from your workstation:

docker manifest inspect myregistry/payments-api:1.4.2

If that returns manifest unknown, the tag isn't there. Two things to check:

A CI race. Your deploy ran before the image push finished. This is common when build and deploy are separate jobs — pin the deploy to the image digest (@sha256:...) instead of a moving tag so a deploy can never reference a non-existent image.
The :latest trap. If you use :latest with imagePullPolicy: IfNotPresent, a node that already cached an old latest will silently run stale code instead of failing. Always tag immutably (1.4.2, a git SHA) in production. This is one of the Kubernetes mistakes that quietly cost companies money — a "successful" deploy that shipped nothing.

Step 3: Private registry authentication (`401` / `denied`)

If the image exists but the pull returns 401 Unauthorized or denied, the kubelet has no valid credentials for that registry. Kubernetes pulls images using an imagePullSecrets reference, not your local docker login. The node never sees your laptop's ~/.docker/config.json.

Create the pull secret:

kubectl create secret docker-registry regcred \
  --docker-server=myregistry.example.com \
  --docker-username=deploy-bot \
  --docker-password="$REGISTRY_TOKEN" \
  --namespace=prod

Then attach it. Either reference it on the pod spec:

spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: payments-api
      image: myregistry.example.com/payments-api:1.4.2

Or — cleaner for a whole namespace — patch the default ServiceAccount so every pod inherits it:

kubectl patch serviceaccount default -n prod \
  -p '{"imagePullSecrets":[{"name":"regcred"}]}'

Two gotchas that eat hours:

Namespace scope. A pull secret only works in the namespace it was created in. A pod in prod cannot use a secret in default. If you have 12 namespaces, you need the secret in each one that pulls private images.
docker-server must match the image host exactly. If your image is myregistry.example.com/..., the secret's server must be myregistry.example.com — not https://..., not a trailing slash. A mismatch means the kubelet holds a valid credential it never applies.
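The second gotcha is easier to see once you know what the secret contains: a JSON document whose top-level keys are registry hosts. The sketch below builds that structure and applies a deliberately strict exact-match lookup to model the rule above. The kubelet's real matching logic is an assumption here and may be more forgiving; the token is a placeholder.

```python
import base64


def build_dockerconfig(server: str, username: str, password: str) -> dict:
    """Shape of the payload stored in a docker-registry pull secret."""
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    return {"auths": {server: {"username": username, "password": password, "auth": auth}}}


def has_credentials_for(image: str, dockerconfig: dict) -> bool:
    """Strict model: the image host must equal a key under "auths"."""
    host = image.split("/", 1)[0]
    return host in dockerconfig["auths"]


image = "myregistry.example.com/payments-api:1.4.2"
good = build_dockerconfig("myregistry.example.com", "deploy-bot", "example-token")
bad = build_dockerconfig("https://myregistry.example.com/", "deploy-bot", "example-token")

print(has_credentials_for(image, good))  # True
print(has_credentials_for(image, bad))   # False
```

To inspect a real secret the same way, decode its .dockerconfigjson field and compare the keys under auths with the host part of your image reference.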

Because these secrets carry registry write tokens in some setups, scope them tightly and rotate them — the least-privilege reasoning in the Kubernetes RBAC deep dive and the broader Kubernetes security best practices both apply directly to pull credentials.

Step 4: Docker Hub rate limits (`429 Too Many Requests`)

If you pull public images from Docker Hub anonymously, you're capped at 100 pulls per 6 hours per IP. On a busy cluster where many nodes share one NAT egress IP, you hit that ceiling fast, and pulls start failing with toomanyrequests:

Failed to pull image "nginx:1.27": toomanyrequests: You have reached
your pull rate limit.

The fix is to authenticate even for public pulls, which raises the limit substantially. Create a Docker Hub pull secret exactly as in Step 3 (server docker.io) and attach it to the ServiceAccount. Better still, run a pull-through cache (Harbor, or the registry mirror built into most managed clusters) so each image is fetched from Docker Hub once and served internally forever after — that also cuts pull latency on scale-ups and reduces your dependency on an external service during an incident.
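A rough back-of-the-envelope sketch shows how quickly a shared egress IP can exhaust the anonymous limit quoted above. The function name and the assumption of a constant, evenly spread pull rate are illustrative only; real pull patterns are bursty, so the ceiling is usually hit sooner during scale-ups.

```python
def hours_until_limit(nodes, pulls_per_node_per_hour, limit=100, window_hours=6):
    """Hours until a shared IP uses up `limit` pulls, or None if it stays under.

    Assumes every node pulls at a steady rate through the same NAT IP.
    """
    rate = nodes * pulls_per_node_per_hour  # total pulls per hour from one IP
    if rate * window_hours <= limit:
        return None  # the whole window fits under the limit
    return limit / rate


if __name__ == "__main__":
    # 20 nodes, 2 Docker Hub pulls per node per hour, one NAT IP
    print(hours_until_limit(20, 2))
    # 5 nodes, 1 pull per node per hour
    print(hours_until_limit(5, 1))
```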

Step 5: The node can't reach the registry (`no such host` / timeout)

i/o timeout, no such host, or connection refused means DNS or network — the pull never got far enough to check credentials. The problem is on the node, so debug from the node's perspective, not your laptop's.

Find which node the pod landed on, then test resolution and reachability with a throwaway pod scheduled there:

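kubectl get pod payments-api-7d9f4c8b6-xk2mn -o wide   # note the NODE column
# Pin the throwaway pod to that node (replace NODE_NAME) and use the
# node's own network and resolver, which is what the image pull uses
kubectl run netcheck --rm -it --image=nicolaka/netshoot \
  --restart=Never \
  --overrides='{"spec":{"nodeName":"NODE_NAME","hostNetwork":true,"dnsPolicy":"Default"}}' \
  -- sh -c "nslookup myregistry.example.com && \
         curl -sSv https://myregistry.example.com/v2/ 2>&1 | head"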

Common findings:

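CoreDNS is down or misconfigured: every in-cluster service lookup fails cluster-wide. Check kubectl get pods -n kube-system -l k8s-app=kube-dns. Note that the kubelet and container runtime normally resolve registry hostnames through the node's own resolver rather than CoreDNS, so if only image pulls fail, check DNS on the node itself as well.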
A private registry behind a VPC endpoint where the node's security group or NAT route was changed. The registry is reachable from your office VPN but not from the node subnet.
A self-hosted registry with an untrusted TLS cert — x509: certificate signed by unknown authority. The node's container runtime needs the CA added to its trust store.

Persistent pull failures across many nodes are exactly the kind of signal worth alerting on before a scale-up stalls a rollout. The Prometheus + Grafana monitoring setup can watch kubelet image-pull errors (for example kubelet_runtime_operations_errors_total filtered to operation_type="pull_image") and page you before users notice.

Step 6: Architecture mismatch (`no match for platform`)

If you build on an Apple Silicon laptop and deploy to amd64 nodes, an image built only for arm64 fails with no match for platform in manifest. The pull "succeeds" in finding the manifest but has no layer for the node's CPU. Build multi-arch images explicitly:

docker buildx build --platform linux/amd64,linux/arm64 \
  -t myregistry.example.com/payments-api:1.4.2 --push .

This is the pull-time twin of the exec format error you'd see at runtime in a crash loop — same architecture root cause, caught one stage earlier.
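The platform check can be sketched against the JSON that `docker manifest inspect` prints for a multi-arch image, where each entry in the `manifests` array carries a `platform` object with `os` and `architecture`. The helper name `supports_platform` and the sample data are made up for illustration.

```python
def supports_platform(manifest_list: dict, os_name: str, arch: str) -> bool:
    """Return True if the manifest list has an entry for os_name/arch."""
    for entry in manifest_list.get("manifests", []):
        platform = entry.get("platform", {})
        if platform.get("os") == os_name and platform.get("architecture") == arch:
            return True
    return False


if __name__ == "__main__":
    # Sample shaped like `docker manifest inspect` output for an image
    # that was built on Apple Silicon without --platform linux/amd64.
    arm_only = {
        "manifests": [
            {"platform": {"os": "linux", "architecture": "arm64"}},
        ]
    }
    print("amd64 nodes can run it:", supports_platform(arm_only, "linux", "amd64"))
    print("arm64 nodes can run it:", supports_platform(arm_only, "linux", "arm64"))
```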

A repeatable checklist

When a pod is stuck in ImagePullBackOff, run this in order:

kubectl describe pod → read the Failed to pull image event text.
manifest unknown → wrong name/tag; verify with docker manifest inspect (Step 2).
401 / denied → create and attach an imagePullSecrets, right namespace, exact server (Step 3).
429 / toomanyrequests → authenticate to Docker Hub or run a pull-through cache (Step 4).
no such host / timeout → debug DNS and network from the node (Step 5).
no match for platform → build a multi-arch image (Step 6).

ImagePullBackOff looks intimidating because the pod never even starts, but the container runtime always quotes the exact reason in the pod's events. Read that line first and the fix is almost always mechanical. Pin immutable tags, put pull secrets on the ServiceAccount in every namespace, and mirror public images — and most of these loops never reach production. Pair this with the liveness, readiness, and startup probes guide, and you've covered nearly every reason a Kubernetes pod fails to come up.

Kubernetes CrashLoopBackOff: How to Debug and Fix It (2026)

devtocash — Tue, 14 Jul 2026 01:16:25 +0000

What CrashLoopBackOff actually means

CrashLoopBackOff means your container started, ran, and then exited — and the kubelet has restarted it enough times that it's now waiting, with an exponential backoff, before trying again. Unlike ImagePullBackOff, the image pulled fine and the process actually executed. Something inside the container is dying, and Kubernetes is doing exactly what you told it to: restart a failed container.

The word BackOff is the important part. The kubelet restarts a crashed container immediately, then after 10s, 20s, 40s, doubling up to a 5-minute cap. So a pod that's been crashing for an hour might only restart once every five minutes — the RESTARTS count climbs slowly even though the loop is constant. Deleting the pod does nothing useful: the replacement runs the same broken code or config and lands in the same state. This is the exact sequence I run to find the cause, usually in a couple of minutes.
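The schedule described above can be written out as a short sketch. The function name is hypothetical and the defaults (10-second base, 300-second cap, immediate first restart) simply encode the numbers from the paragraph; it ignores the time the container spends running before each crash.

```python
def backoff_delays(restarts: int, base: int = 10, cap: int = 300) -> list[int]:
    """Seconds the kubelet waits before each of the first `restarts` restarts."""
    delays = []
    for attempt in range(restarts):
        if attempt == 0:
            delays.append(0)  # the first restart happens immediately
        else:
            # 10s, 20s, 40s, ... doubling until the 5-minute cap
            delays.append(min(base * 2 ** (attempt - 1), cap))
    return delays


if __name__ == "__main__":
    schedule = backoff_delays(8)
    print(schedule)       # delays in seconds
    print(sum(schedule))  # total time spent waiting across 8 restarts
```

Once the cap is reached, the pod restarts at most about twelve times per hour, which is why RESTARTS climbs slowly on a pod that has been looping for a long time.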

Step 1: Read the crash logs — including the previous container

Never theorize. The crashed process almost always printed why it died before exiting. The catch: by the time you look, the kubelet may have already started a new container, so a plain kubectl logs shows the fresh (and often empty) attempt. You want the logs from the instance that actually crashed:

kubectl logs payments-api-7d9f4c8b6-xk2mn --previous

The --previous (or -p) flag is the single most useful thing in this entire playbook. It dumps stdout/stderr from the last terminated container, which is where the stack trace, the panic:, or the Error: connect ECONNREFUSED lives. Nine times out of ten the answer is right there:

Error: Missing required environment variable DATABASE_URL
    at loadConfig (/app/config.js:14:11)
    at Object.<anonymous> (/app/server.js:3:16)

If the log is empty, the process died before it could log — a missing binary, a bad entrypoint, or an instant OOM kill. That's what Step 2 is for.

Step 2: Decode the exit code from `describe`

Every terminated container records an exit code, and the exit code narrows the cause immediately:

kubectl describe pod payments-api-7d9f4c8b6-xk2mn

Look at the Last State block under the container:

    Last State:     Terminated
      Reason:       Error
      Exit Code:    1
      Started:      ...
      Finished:     ...

Map the exit code to a root cause:

Exit code / Reason	What it means	Go to
`1` (Reason: Error)	Application threw and exited — bad config, missing env, startup exception	Step 3
`137` + Reason `OOMKilled`	Kernel killed the container for exceeding its memory limit	Step 4
`137` / `143` (Reason: Error)	Process got SIGKILL/SIGTERM — usually a failing liveness probe	Step 5
`127`	`command not found` — bad `command`/entrypoint or missing binary	Step 6
`126`	Command found but not executable (bad permissions / wrong arch)	Step 6
`0` (still looping)	Process exits cleanly but has nothing to keep it alive	Step 7

Read the exit code first and the rest of the investigation collapses to a single branch.
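The mapping in the table follows the usual shell convention that an exit code above 128 means "killed by signal (code minus 128)", so 137 is SIGKILL (9) and 143 is SIGTERM (15). A minimal sketch of that decoding, with a hypothetical helper name and only the signals this playbook discusses:

```python
# Only the two signals the playbook cares about; others are reported by number.
SIGNAL_NAMES = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable cause."""
    if code == 0:
        return "clean exit"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if code > 128:
        signum = code - 128  # shell convention: 128 + signal number
        return "killed by " + SIGNAL_NAMES.get(signum, f"signal {signum}")
    return "application error"


if __name__ == "__main__":
    for code in (0, 1, 126, 127, 137, 143):
        print(code, "->", describe_exit_code(code))
```

The exit code alone cannot distinguish an OOM kill from a liveness-probe kill (both can be 137), which is why the Reason field in Last State still matters.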

Step 3: The app crashes on startup (exit code 1)

Exit 1 with a stack trace is the friendliest case — the app told you exactly what's wrong. In practice it's almost always one of three things:

A missing or wrong environment variable. DATABASE_URL, an API key, a feature flag the code assumes is set. Check what's actually injected:

kubectl set env pod/payments-api-7d9f4c8b6-xk2mn --list

A missing ConfigMap or Secret key. If the pod references a key that doesn't exist, the container may not even start; if it starts but the value is empty, it crashes on first use. Verify the source object exists and has the key:

kubectl get secret app-secrets -o jsonpath='{.data}' | tr ',' '\n'
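For a slightly more precise check than eyeballing the raw output, the following sketch takes the JSON printed by `kubectl get secret app-secrets -o json` and reports keys that are missing or that decode to an empty value. The function names and the list of required keys are assumptions for illustration.

```python
import base64
import json


def empty_secret_keys(secret_json: str) -> list[str]:
    """Keys whose base64-decoded value is empty or only whitespace."""
    data = json.loads(secret_json).get("data") or {}
    return sorted(k for k, v in data.items() if not base64.b64decode(v).strip())


def missing_secret_keys(secret_json: str, required: list[str]) -> list[str]:
    """Required keys that are not present in the Secret at all."""
    data = json.loads(secret_json).get("data") or {}
    return sorted(k for k in required if k not in data)


if __name__ == "__main__":
    # Hypothetical Secret: DATABASE_URL is set, API_KEY exists but is empty.
    sample = json.dumps({
        "data": {
            "DATABASE_URL": base64.b64encode(b"postgres://db").decode(),
            "API_KEY": "",
        }
    })
    print("empty:", empty_secret_keys(sample))
    print("missing:", missing_secret_keys(sample, ["DATABASE_URL", "JWT_SECRET"]))
```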

A bad migration or unreachable dependency at boot. Many apps run DB migrations or open a connection pool during startup. If the database isn't reachable yet, they exit non-zero — and Kubernetes crash-loops them until the dependency comes up. The fix is either an initContainer that waits for the dependency, or making the app retry with backoff instead of exiting. Hard-exiting on a transient dependency failure is one of the quiet Kubernetes mistakes that cost companies money — a single slow database restart cascades into every dependent service crash-looping at once.
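The "retry with backoff instead of exiting" option can look roughly like this. It is a sketch, not a drop-in for any particular database driver: `connect` is a stand-in for whatever callable opens your connection, the exception type is assumed to be `ConnectionError`, and the delay values are arbitrary.

```python
import time


def connect_with_retry(connect, attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call connect() until it succeeds, backing off exponentially between tries.

    Re-raises the last ConnectionError once `attempts` is exhausted, so a
    dependency that is truly down still fails loudly.
    """
    for attempt in range(attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == attempts - 1:
                raise
            sleep(min(base_delay * 2 ** attempt, max_delay))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_connect():
        # Simulated dependency that becomes reachable on the third try.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("database not reachable yet")
        return "connected"

    print(connect_with_retry(flaky_connect, base_delay=0.01))
```

Keeping the process alive while it retries means a brief database restart shows up as slow startup rather than a crash loop across every dependent service.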

Step 4: OOMKilled (exit code 137)

If describe shows Reason: OOMKilled and exit 137, the container tried to use more memory than its limit and the kernel killed it. This one is treacherous because the app logs are often empty — the process is killed with SIGKILL and gets no chance to write anything.

Confirm it and see the limit that was breached:

kubectl describe pod payments-api-7d9f4c8b6-xk2mn | grep -A5 "Last State"
kubectl get pod payments-api-7d9f4c8b6-xk2mn \
  -o jsonpath='{.spec.containers[0].resources}'

There are two distinct fixes, and picking the wrong one wastes a day:

The limit is genuinely too low. The app needs more memory than you granted. Raise resources.limits.memory (and requests to match, so the scheduler places it correctly). If it's a JVM or Node process, remember the runtime's heap must fit inside the container limit with headroom: set -Xmx or --max-old-space-size to roughly 75% of the container limit, not equal to it.
There's a memory leak. The app grows until it hits any limit you set. Raising the limit just delays the crash. Watch usage over time with kubectl top pod and fix the leak — a bigger box only buys hours.
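The 75% rule of thumb for runtime heaps is simple arithmetic, sketched below. The helper name is hypothetical, only the binary suffixes Ki, Mi, and Gi are handled, and the fraction is the rough guideline from the text rather than a hard rule.

```python
UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def heap_mib(limit: str, fraction: float = 0.75) -> int:
    """Heap size in MiB as a fraction of a Kubernetes memory limit like '1Gi'."""
    for suffix, factor in UNITS.items():
        if limit.endswith(suffix):
            limit_bytes = int(limit[: -len(suffix)]) * factor
            return int(limit_bytes * fraction) // (1024 ** 2)
    raise ValueError(f"unsupported memory quantity: {limit}")


if __name__ == "__main__":
    mib = heap_mib("1Gi")
    print(f"JVM flag:  -Xmx{mib}m")
    print(f"Node flag: --max-old-space-size={mib}")
```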

Right-sizing these limits is also a cost lever, not just a stability one: over-provisioned memory requests reserve capacity you pay for and never use. The same discipline shows up in Kubernetes cost optimization and FinOps for Kubernetes — right-sized limits keep pods alive and keep the bill down.

Step 5: A failing liveness probe is killing a healthy app

This is the most misdiagnosed CrashLoopBackOff of all. Your app is fine — but the liveness probe is failing, so the kubelet kills the container (SIGTERM, then SIGKILL → exit 143/137) on a schedule, and the pod loops forever. The tell is in the events:

Warning  Unhealthy  kubelet  Liveness probe failed: HTTP probe failed
                              with statuscode: 500
Normal   Killing    kubelet  Container failed liveness probe, will be
                              restarted

Common causes:

The probe hits an endpoint that depends on a slow downstream (a /health that checks the database). If the DB blips, the probe fails and Kubernetes kills an otherwise-serving app. Liveness should test the process, not its dependencies — use readiness for dependency checks.
initialDelaySeconds is too short for an app with a slow boot (JVM warmup, large cache load). The probe starts failing before the app has finished starting, so it never gets a chance to become healthy. Use a startupProbe for slow starters so the liveness clock doesn't start until the app is actually up.
Wrong port or path — the probe checks :8080/healthz but the app serves :3000/health.

The full breakdown of which probe does what — and why conflating liveness and readiness causes exactly this loop — is in the liveness, readiness, and startup probes guide. Fixing the probe config, not the app, is the fix here.
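The split between the two probes can be reduced to a tiny decision sketch. The paths `/healthz` and `/readyz` and the function name are assumptions for illustration; the point is only that the liveness answer never consults the database, while the readiness answer does.

```python
def probe_status(path: str, process_ok: bool, database_ok: bool) -> int:
    """HTTP status a probe endpoint should return for the given state."""
    if path == "/healthz":
        # Liveness: is this process itself still able to serve?
        # Dependencies are deliberately ignored here.
        return 200 if process_ok else 500
    if path == "/readyz":
        # Readiness: should traffic be routed here right now?
        return 200 if process_ok and database_ok else 503
    return 404


if __name__ == "__main__":
    # Database blip: the pod is taken out of rotation but is not restarted.
    print("liveness :", probe_status("/healthz", process_ok=True, database_ok=False))
    print("readiness:", probe_status("/readyz", process_ok=True, database_ok=False))
```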

Step 6: Bad command or wrong architecture (exit 127 / 126)

Exit 127 means the container's entrypoint or command pointed at something that doesn't exist — a typo'd binary path, a script that isn't in the image, or a shell form that assumes /bin/sh in a distroless image that has none. Exit 126 means the file is there but isn't executable, or is built for the wrong CPU (exec format error — an amd64 binary on an arm64 node, the runtime twin of the platform mismatch you'd catch at pull time).

Check what the container is actually told to run, and confirm the binary exists:

kubectl get pod payments-api-7d9f4c8b6-xk2mn \
  -o jsonpath='{.spec.containers[0].command} {.spec.containers[0].args}'

If it's an architecture mismatch, rebuild multi-arch (docker buildx build --platform linux/amd64,linux/arm64 ...). If it's a missing shell in a distroless base, switch the probe/command to exec form with a real binary rather than sh -c.

Step 7: The container exits 0 and loops anyway

Sometimes the container exits cleanly (code 0) and still crash-loops. That happens when the main process finishes and there's nothing to keep PID 1 alive — a script that runs once and returns, or a web server started in the background while the foreground command exits. Kubernetes treats "container finished" as "restart it" under the default restartPolicy: Always. The fix is to run the long-lived process in the foreground as PID 1. For genuine run-once workloads, use a Job or CronJob instead of a Deployment — those are designed to complete.

A repeatable checklist

When a pod is stuck in CrashLoopBackOff, run this in order:

kubectl logs <pod> --previous → read the crash output from the container that actually died.
kubectl describe pod <pod> → read the exit code and Reason in Last State.
Exit 1 → app-level crash; check env vars, ConfigMaps/Secrets, and startup dependencies (Step 3).
OOMKilled / 137 → raise the memory limit or fix the leak — decide which with kubectl top (Step 4).
Exit 143/137 with a Liveness probe failed event → fix the probe, not the app (Step 5).
Exit 127/126 → bad command path or wrong architecture (Step 6).
Exit 0 looping → run the process in the foreground, or use a Job (Step 7).

CrashLoopBackOff looks alarming because the pod visibly refuses to stay up, but the container runtime records the exit code and the process almost always logs its own cause. Read --previous logs first, decode the exit code second, and the fix is nearly always mechanical. A rising kube_pod_container_status_restarts_total is worth alerting on before a loop takes out a whole service — wire it into the Prometheus + Grafana monitoring setup and capture the diagnosis path in your incident runbook so the next on-call engineer solves it in minutes, not hours. Pair this with the ImagePullBackOff playbook and you've covered nearly every reason a Kubernetes pod fails to run.

SelfMem: One Memory Layer Across Every AI Assistant — Free & Open MCP Tool

devtocash — Mon, 29 Jun 2026 05:20:46 +0000

SelfMem is one of the most useful MCP tools I've tested this year — and it's free, open-source, and self-hostable.

Why SelfMem Matters for DevOps/SRE Teams

If you run multiple AI coding assistants (Claude, Cursor, Windsurf, etc.), you've probably dealt with context fragmentation — each tool has its own memory, none of them talk to each other. SelfMem solves this with an MCP-native memory server that all your AI assistants share.

Key Features

Hybrid Search: PostgreSQL full-text search + pgvector semantic search — you get keyword precision AND semantic understanding in one query
9 MCP Tools: create, search, update, delete, tag, bookmark, and more — all exposed as standard MCP tools
Self-Hostable: Docker one-liner. No vendor lock-in. Your data stays on your infrastructure.
Free Tier: No credit card required. Generous free tier for individual developers and small teams.

How It Compares

I've tested Mem0, ChromaDB, and a few other memory solutions for AI agents. SelfMem wins on simplicity, search quality (hybrid FTS + vector beats pure-vector), MCP-native design, and zero recurring cost for small teams.

If you're building AI-assisted DevOps workflows, this is worth 10 minutes of your time to spin up.

Read the full hands-on review, setup guide, and comparison →

📌 Read the latest version of this guide — plus the full library of DevOps, SRE, Kubernetes, observability & cloud-cost guides — on devtocash.com.

eBPF Observability for SRE: The End of Sidecars in 2026

devtocash — Sat, 27 Jun 2026 20:27:14 +0000

Sidecars were the observability pattern of the early 2020s. eBPF is the pattern of 2026 and beyond, and the difference is dramatic.

Traditional monitoring with DaemonSets and sidecars adds 10-30% resource overhead per pod. eBPF observability with tools like Cilium Hubble and Pixie runs in the kernel with less than 1% overhead. Zero code changes. Zero sidecar injection. Just attach a BPF program and you get:

HTTP/gRPC latency at the kernel level (no proxy hop)
Network flow logs with process-level attribution
CPU flame graphs by container without profiling agents
File system and DNS activity per pod

The architecture is elegant: BPF programs hook into kernel tracepoints and kprobes, a user-space agent (running as a DaemonSet) collects and enriches the data, and metrics flow to your existing Prometheus/Grafana stack. No sidecars. No mTLS overhead. No Envoy configuration.

The full article walks through setting up Cilium Hubble on an EKS cluster, configuring Pixie for auto-telemetry, and building eBPF-based SLO dashboards that correlate kernel events to user-facing latency.

Dive into the kernel: https://devtocash.com/blog/ebpf-observability-sre-end-of-sidecars-2026


Kubernetes LLM Inference: Deploy and Scale Open-Source LLMs in 2026

devtocash — Sat, 27 Jun 2026 20:21:21 +0000

Running your own LLMs on Kubernetes isn't just a cost play — it's about latency, data sovereignty, and fine-tuning control. But GPU scheduling at scale is a different beast entirely.

Here's what a production K8s LLM inference stack looks like in 2026: vLLM or TGI for the inference server, NVIDIA GPU Operator for driver management, KEDA for request-based autoscaling, and spot instances for dev/staging environments to cut costs by 60-70%.

The numbers matter: a single A100-80GB can serve Llama 3 70B (quantized, since the 16-bit weights alone are roughly 140 GB and do not fit in 80 GB) with vLLM at ~30 tokens/second for 4 concurrent users. With continuous batching, that jumps to 8-10 users. But cold starts are brutal (45-90 seconds for large models), which is why you need keep-warm pods and predictive scaling.

My article covers the complete architecture: GPU node pool setup, vLLM deployment manifests, HPA vs KEDA tradeoffs, model caching strategies with PersistentVolume, and cost optimization with spot/preemptible instances.

Get the full deployment guide with working YAML manifests at https://devtocash.com/blog/kubernetes-llm-inference-deploy-scale-2026


AI Agents for SRE: Autonomous Incident Response in 2026

devtocash — Sat, 27 Jun 2026 20:21:20 +0000

When your pager goes off at 3 AM, what if an AI agent could handle the entire incident before you even wake up?

That future is already here. AI agents powered by LLMs are transforming how SRE teams handle incidents — from automated diagnosis using RAG over internal runbooks, to executing remediation playbooks via PagerDuty integration, to generating blameless postmortem drafts before the war room even starts.

The key architecture: a supervisor agent orchestrating specialized sub-agents for log analysis, metric correlation, and remediation. Each sub-agent has access to specific tools — kubectl, Prometheus queries, Slack for escalation, and your internal knowledge base via semantic search.

But it's not plug-and-play. You need careful guardrails: human-in-the-loop for production changes, audit trails for every action, and progressive rollout (shadow mode → suggestion mode → semi-autonomous → full auto).

The article breaks down the full implementation: tool definitions, RAG pipeline for runbooks, PagerDuty webhook integration, and a working Python code example you can adapt today.

Read the complete guide with code at https://devtocash.com/blog/ai-agents-sre-autonomous-incident-response-2026


Claude Proxy: Turn Claude Code CLI into an OpenAI-Compatible API Server

devtocash — Sat, 27 Jun 2026 17:01:23 +0000

The Problem: Claude Max Has No API Key

You subscribe to Claude Max. You use Claude Code daily. But the moment you try to plug Claude into Cursor, Continue.dev, Aider, or any agent framework that speaks the OpenAI API, you hit a wall: Claude Max has no raw API key.

The Fix: A 300-Line Zero-Dependency Proxy

claude-proxy wraps your local claude CLI and exposes a standard /v1/chat/completions endpoint. Any OpenAI-compatible client can use your authenticated Claude Code session as the backend.

OpenAI-API client --> /v1/chat/completions --> claude_openai_proxy.py --> claude CLI --> Anthropic

3 commands to run. No pip install. No Docker required. Stream and non-stream. Prompt-based function/tool calling with false-refusal hardening.

Supports sonnet, opus, haiku. Concurrency-controlled (200-400MB per CLI process). systemd and Docker deployment included. Evonic integration example in the repo.

Trade-offs

Tool calling is prompt-emulated (not native API), one CLI process per request (stateless), and using Claude Max as a generic API backend may violate Anthropic's ToS. For production, a real API key is the supported path. For internal tools and agent frameworks on a budget, this is gold.

Full architecture breakdown, concurrency design, env var reference, Docker compose, and comparison table -- I walk through every piece at devtocash.com. Repo: github.com/rephapeng/claude-proxy

