Aisalkyn Aidarova

Kubernetes Probes — REAL Production Project. SLO, SLA, SLI

Mental Model

Kubernetes does NOT know your app logic.
It only knows signals.

Probe            Question Kubernetes asks
startupProbe     “Is the app done starting?”
readinessProbe   “Should I send traffic?”
livenessProbe    “Is this app stuck and needs restart?”

PROJECT OVERVIEW

App behavior

  • App takes 15 seconds to start
  • App has /health
  • App can freeze internally (simulated bug)
  • App sometimes accepts connections but should NOT get traffic

This mimics real production issues.


PART 1 — Create the Demo App (Python)

app.py

from http.server import HTTPServer, BaseHTTPRequestHandler
import time
import threading

READY = False    # flips to True once the simulated slow startup finishes
BROKEN = False   # flips to True when the simulated runtime bug triggers

def startup():
    global READY
    time.sleep(15)   # simulate slow startup
    READY = True

def break_app():
    global BROKEN
    time.sleep(40)   # simulate runtime bug
    BROKEN = True

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            if BROKEN:
                time.sleep(999)  # frozen app: health checks hang too, so liveness can catch it
            if READY:
                self.send_response(200)
                self.end_headers()
                self.wfile.write(b"OK")
            else:
                self.send_response(503)
                self.end_headers()
                self.wfile.write(b"NOT READY")
        else:
            if BROKEN:
                time.sleep(999)  # freeze normal traffic
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"APP RESPONSE")

threading.Thread(target=startup).start()
threading.Thread(target=break_app).start()

HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
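
Quick local sanity check before containerizing (a sketch, assuming Python 3 and a free port 8080 on your machine):

# terminal 1 — start the app
python3 app.py

# terminal 2 — watch readiness flip after ~15 seconds
curl -i http://localhost:8080/health   # 503 NOT READY at first, 200 OK once startup finishes
curl -i http://localhost:8080/         # normal response; hangs once the simulated bug kicks in (~40s)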

Dockerfile

FROM python:3.11-slim
COPY app.py /app.py
CMD ["python", "/app.py"]

Build & push:

docker build -t <your-dockerhub>/probe-demo .
docker push <your-dockerhub>/probe-demo
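
Optionally run the image locally first to confirm it behaves the same inside a container (a sketch; <your-dockerhub> is your own registry namespace, as above):

docker run --rm -p 8080:8080 <your-dockerhub>/probe-demo
# in another terminal
curl -i http://localhost:8080/health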

PART 2 — Deployment WITHOUT PROBES (INTENTIONAL FAILURE)

deployment-no-probes.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: probe-demo
  template:
    metadata:
      labels:
        app: probe-demo
    spec:
      containers:
      - name: app
        image: <your-dockerhub>/probe-demo
        ports:
        - containerPort: 8080

Service

apiVersion: v1
kind: Service
metadata:
  name: probe-demo
spec:
  selector:
    app: probe-demo
  ports:
  - port: 80
    targetPort: 8080

Apply:

kubectl apply -f .

🔴 What breaks (observe carefully)

kubectl get pods -w
  • Pods become Running immediately
  • Traffic starts before app is ready
  • Browser shows timeouts / empty responses
  • Later, app freezes → pods stay Running
  • Kubernetes does NOT restart anything

This is how outages happen without probes.
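
You can watch the premature traffic yourself (a sketch, assuming the Deployment and Service above are applied in the current namespace):

kubectl get endpoints probe-demo -w       # pod IPs are listed immediately, even though the app is not ready
kubectl port-forward deploy/probe-demo 8080:8080
# in another terminal, during the first ~15 seconds:
curl -i http://localhost:8080/health      # 503 NOT READY, yet the Service is already routing to this pod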


PART 3 — Add startupProbe (Fix slow startup)

Why?

Without it:

  • Liveness may kill the app while it is still starting
  • Readiness/liveness timings have to be guessed with initialDelaySeconds

startupProbe

startupProbe:
  httpGet:
    path: /health
    port: 8080
  failureThreshold: 30
  periodSeconds: 1

Meaning:

  • Kubernetes gives the app up to 30 seconds to start (failureThreshold 30 × periodSeconds 1)
  • Liveness and readiness checks are disabled until the startup probe succeeds, so there are no restarts during startup
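
To confirm the startup window after re-applying the manifest (a sketch; pod names will differ):

kubectl apply -f .
kubectl get pods -w                              # READY stays 0/1 for ~15s, then flips to 1/1 with 0 restarts
kubectl describe pod <pod> | grep -i startup     # shows the configured startup probe; failures appear under Events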

PART 4 — Add readinessProbe (Protect users)

readinessProbe

readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 1
  periodSeconds: 2

What happens now:

kubectl get endpoints probe-demo
  • Pod NOT added to service until ready
  • No traffic leaks
  • Zero user impact during startup
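
One way to demonstrate the "zero user impact" claim (a sketch; the client pod name curl-test is just an example): roll the Deployment while polling the Service from inside the cluster and watch for anything other than HTTP 200.

kubectl run curl-test --image=curlimages/curl --restart=Never -- sleep 3600
kubectl wait --for=condition=Ready pod/curl-test
kubectl exec curl-test -- sh -c 'while true; do curl -s -o /dev/null -w "%{http_code}\n" http://probe-demo; sleep 1; done' &
kubectl rollout restart deployment/probe-demo    # new pods receive traffic only after readiness passes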

PART 5 — Add livenessProbe (Self-healing)

livenessProbe

livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 20
  periodSeconds: 5
  failureThreshold: 3

When app freezes:

  • /health stops responding
  • Kubernetes restarts container
  • Service stays stable
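
To watch the self-healing (a sketch; the simulated freeze triggers roughly 40 seconds after start):

kubectl get pods -w                              # RESTARTS increments after 3 consecutive liveness failures
kubectl describe pod <pod> | grep -i liveness    # probe config; "Liveness probe failed" shows up under Events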

FINAL FULL DEPLOYMENT (PRODUCTION-READY)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: probe-demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: probe-demo
  template:
    metadata:
      labels:
        app: probe-demo
    spec:
      containers:
      - name: app
        image: <your-dockerhub>/probe-demo
        ports:
        - containerPort: 8080

        startupProbe:
          httpGet:
            path: /health
            port: 8080
          failureThreshold: 30
          periodSeconds: 1

        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          periodSeconds: 2

        livenessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 20
          periodSeconds: 5
          failureThreshold: 3
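
Apply it and let Kubernetes confirm the rollout (a sketch, using the same placeholder image as above):

kubectl apply -f .
kubectl rollout status deployment/probe-demo     # completes only when the new pods pass their readiness probes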

PART 6 — How DevOps Troubleshoots Probes in PROD

Pod restarting?

kubectl describe pod <pod>

Look for:

Liveness probe failed

Traffic issues?

kubectl get endpoints probe-demo

App alive but no traffic?

→ readiness failing

App stuck but pod running?

→ missing liveness

Pod never ready?

→ bad readiness path or timing
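
A compact checklist for these situations (a sketch; substitute the real pod name):

kubectl get pods -o wide                                                 # restarts? ready? which node?
kubectl describe pod <pod> | grep -iE -A2 'liveness|readiness|startup'   # probe config + failure reasons
kubectl get endpoints probe-demo                                         # is the pod actually behind the Service?
kubectl get events --sort-by=.metadata.creationTimestamp                 # probe failures, kills, backoffs in order
kubectl logs <pod> --previous                                            # what the container logged before its last restart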


WHEN TO USE WHICH (INTERVIEW-LEVEL ANSWER)

Scenario                              Probe
Slow JVM / Spring / DB migrations     startupProbe
Protect users from half-ready app     readinessProbe
Recover from deadlocks / freezes      livenessProbe
Batch jobs                            NO liveness
Stateful DBs                          careful readiness
APIs                                  all three

DEVOPS RULES

  • Never expose traffic without readiness
  • Never use liveness without understanding startup
  • Never copy probe values blindly
  • Probes are part of SLO, not YAML decoration

SLI — Service Level Indicator (MEASUREMENT)

A metric that tells you how the system behaves.

Examples:

  • Request success rate
  • Latency (p95, p99)
  • Error rate
  • Availability
  • Health check success

📌 SLI = what you measure


SLO — Service Level Objective (ENGINEERING TARGET)

The goal you want to meet for an SLI.

Examples:

  • 99.9% requests succeed
  • p95 latency < 500ms
  • Error rate < 0.1%

📌 SLO = what DevOps/SRE designs for


SLA — Service Level Agreement (LEGAL PROMISE)

A contract with users/customers.

Examples:

  • “99.9% uptime monthly”
  • “Credits if breached”

📌 SLA = legal & business, not technical


🔑 Golden rule (INTERVIEW ANSWER)

You engineer for SLOs, not SLAs.
SLAs are derived from SLOs.


2️⃣ Real Production Flow (How companies ACTUALLY use them)

SLI  →  SLO  →  SLA
metric   target   contract

Example: API service

SLI

  • /health success rate
  • HTTP 5xx error %

SLO

  • 99.9% healthy responses per month

SLA

  • “99.5% uptime guaranteed to customers”

Why SLA < SLO?
👉 Safety margin.


3️⃣ Where Kubernetes Probes Fit (THIS IS THE KEY)

Kubernetes probes directly affect SLIs.

Probe             Affects which SLI
readinessProbe    Availability, success rate
livenessProbe     Availability, MTTR
startupProbe      Cold start latency, stability

👉 Bad probes = bad SLIs = SLO breach = SLA penalty


4️⃣ Concrete Example (REAL PROD SCENARIO)

Business SLO

99.9% successful requests monthly

That means:

  • ~43 minutes of downtime allowed per month (see the quick arithmetic below)
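
Where the ~43 minutes comes from (a quick sketch, assuming a 30-day month):

awk 'BEGIN { print 30*24*60 * (1 - 0.999) }'   # 43.2 minutes of allowed downtime per month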

Without readinessProbe

  • App starts
  • Traffic sent too early
  • Users get 500 / timeouts

❌ SLI drops
❌ SLO breached
❌ SLA credits paid


With readinessProbe

  • Pod not added to Service until ready
  • Users never hit half-ready app

✅ SLI stable
✅ SLO met
✅ SLA safe

📌 Readiness protects SLO directly


5️⃣ LivenessProbe & Error Budgets

Error Budget (important SRE concept)

If SLO = 99.9%
→ Error budget = 0.1%

You are allowed some failures.


Liveness without thinking (BAD)

  • Restarts too aggressively
  • Causes cascading failures
  • Burns error budget faster

Liveness done correctly (GOOD)

  • Detects real deadlocks
  • Restarts only when needed
  • Improves MTTR (Mean Time To Recovery)

📌 LivenessProbe protects SLO by reducing recovery time


6️⃣ StartupProbe & SLO (Often missed in interviews)

Without startupProbe

  • Liveness kills slow-starting app
  • CrashLoopBackOff
  • App never becomes available

❌ Availability SLI destroyed


With startupProbe

  • Kubernetes waits
  • No false restarts
  • Stable startup

✅ Startup SLI protected
✅ Availability SLO met


7️⃣ How DevOps USES SLOs Practically (Not theory)

Step 1 — Define SLIs

Examples:

  • % of requests with HTTP status < 500
  • /health success
  • p95 latency

Step 2 — Set SLOs

Examples:

  • 99.9% success
  • p95 < 500ms

Step 3 — Design Kubernetes accordingly

  • readinessProbe → traffic safety
  • livenessProbe → self-healing
  • startupProbe → stability
  • replicas → fault tolerance
  • rollout strategy → limit blast radius

Step 4 — Alert on SLO burn, not raw metrics

Bad alert:

CPU 80%

Good alert:

Error budget burn rate > 2x


8️⃣ Interview-Level Mapping (VERY IMPORTANT)

Interview question:

“How do probes relate to SLO/SLA?”

Strong answer:

“Probes directly impact SLIs like availability and success rate.
Readiness protects user-facing SLIs, liveness reduces MTTR, and startup prevents false restarts.
We design probes based on SLOs, not arbitrarily, to avoid burning error budget and breaching SLA.”

That answer = senior level


9️⃣ Simple Table (Memorize this)

Concept   Owner            Purpose
SLI       DevOps/SRE       Measure reality
SLO       DevOps/SRE       Engineering target
SLA       Business/Legal   Customer promise
Probes    DevOps           Protect SLOs

10️⃣ Final DevOps Truth (Production mindset)

  • Probes are not YAML decorations
  • Probes are reliability controls
  • Every probe decision = business impact
  • Bad probes cost real money
