Kubernetes Probes — Liveness & Startup
1. Why probes exist (the real problem)
Containers do not fail cleanly.
Things that happen in production:
- App process is running, but logic is dead
- App is stuck in deadlock
- JVM / Python app needs 30–120s to start
- App boots but cannot accept traffic yet
- App consumes CPU but does nothing
Kubernetes cannot guess this.
So Kubernetes needs signals from YOU.
That is what probes are.
2. The 3 probe types (mental model first)
| Probe | Run by / consumed by | Purpose |
|---|---|---|
| Startup | kubelet | “Is the app done starting?” |
| Liveness | kubelet | “Is the app stuck or dead?” |
| Readiness | kubelet / Service Endpoints | “Can traffic go here?” |
This lesson focuses on Startup + Liveness, but you must understand how they interact with Readiness.
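For orientation, all three probes can sit on the same container. A minimal sketch — the /health/* paths and the port are assumptions, not a Kubernetes standard:

```yaml
containers:
  - name: app
    image: your-app:1.0.0               # illustrative image
    ports:
      - containerPort: 8080
    startupProbe:                        # gates the other two until it succeeds
      httpGet: { path: /health/startup, port: 8080 }
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:                       # restarts the container when it hangs
      httpGet: { path: /health/live, port: 8080 }
      periodSeconds: 10
    readinessProbe:                      # controls whether the Service sends traffic
      httpGet: { path: /health/ready, port: 8080 }
      periodSeconds: 5
```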
3. Startup Probe (the most misunderstood)
What it REALLY means
“Do NOT touch this container until startup is complete.”
If a startupProbe exists:
- Liveness checks are disabled until it succeeds
- Readiness checks are disabled until it succeeds
- The kubelet waits patiently (up to failureThreshold × periodSeconds)
Why startupProbe exists
Without it:
- Liveness starts too early
- Kubernetes kills the container while it’s still booting
- You get CrashLoopBackOff
- DevOps says: “It works locally”
Correct use cases
You SHOULD add a startup probe when:
- Java / Spring Boot
- .NET Core
- Python app loading ML models
- App performs DB migrations
- App needs secrets, config, cache warm-up
- App startup > 10–15 seconds
Startup probe lifecycle
Flow:
- Container starts
- Startup probe runs repeatedly
- Until success
- THEN liveness + readiness begin
Example (safe production default)
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 5
```
What this means:
- Kubernetes allows up to 150 seconds for startup (30 failures × 5s)
- No restarts during this time
- No traffic yet
4. Liveness Probe (dangerous if misused)
What it REALLY means
“If this fails → KILL the container.”
Not restart the app logic.
Not reload config.
Hard kill.
When to use liveness
You SHOULD use liveness when:
- App can deadlock
- App can hang on memory leaks
- App stops processing requests
- You want self-healing
You SHOULD NOT use liveness when:
- App is slow but working
- Dependency outages are expected
- Startup time is unpredictable (unless startupProbe exists)
Liveness lifecycle
Flow:
- Probe fails N times
- kubelet kills container
- Container restarts
- Pod stays the same
Example (safe liveness)
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
```
This gives:
- ~30 seconds of failure tolerance (3 failures × 10s)
- Prevents flapping
- Allows brief GC pauses
5. Why startup + liveness must work together
BAD (common mistake)
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
```
Result:
- App takes 45s to start
- Liveness probing starts almost immediately (defaults: periodSeconds 10, failureThreshold 3)
- Kubernetes kills the container after ~30s, long before startup finishes
- Infinite CrashLoopBackOff
GOOD (production pattern)
```yaml
startupProbe:
  httpGet:
    path: /health/startup
    port: 8080
  failureThreshold: 30
  periodSeconds: 5
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 10
```
Why this works:
- Startup probe shields the app
- Liveness only checks after startup
- Clean separation of concerns
6. Probe types (how checks are done)
HTTP probe (MOST COMMON)
```yaml
httpGet:
  path: /health
  port: 8080
```
Used for:
- Web apps
- APIs
- Microservices
TCP probe
```yaml
tcpSocket:
  port: 5432
```
Used for:
- Databases
- Message brokers
- Non-HTTP services
Exec probe (RARE, risky)
```yaml
exec:
  command:
    - cat
    - /tmp/healthy
```
Used when:
- Legacy apps
- No health endpoint
- You accept performance overhead
7. Debugging probes (REAL production workflow)
Step 1 — See probe failures
```bash
kubectl describe pod <pod>
```
Look for:
- `Liveness probe failed`
- `Startup probe failed`
Step 2 — Check restart pattern
```bash
kubectl get pod <pod>
```
Indicators:
- `RESTARTS` increasing → liveness issue
- `CrashLoopBackOff` → probe or startup issue
Step 3 — Check timing mismatch
Ask:
- How long does app REALLY start?
- Are probes aggressive?
- Is failureThreshold too low?
Step 4 — Test inside container
```bash
kubectl exec -it <pod> -- curl localhost:8080/health
```
- If this fails → the app really is unhealthy; the probe is doing its job
- If this works → the probe configuration is wrong (path, port, or timing)
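To see exactly what the kubelet is probing, compare the configured probe against reality (a sketch; adjust the pod name and container index to your case):

```bash
# Print the liveness probe exactly as configured on the first container
kubectl get pod <pod> -o jsonpath='{.spec.containers[0].livenessProbe}'
```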
Step 5 — Temporarily disable liveness
During debugging:
```bash
kubectl patch deployment app --type=json -p='[
  {"op":"remove","path":"/spec/template/spec/containers/0/livenessProbe"}
]'
```
Never leave it disabled permanently.
8. Common outage patterns (interview GOLD)
Pattern 1 — Endless restarts
Cause:
- Liveness without startup probe
Fix:
- Add startupProbe
Pattern 2 — Traffic hits broken pods
Cause:
- Missing readiness probe (related, though not the focus of this lesson)
Fix:
- Add readinessProbe
Pattern 3 — Healthy pods killed during GC
Cause:
- Aggressive liveness
Fix:
- Increase failureThreshold / periodSeconds
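A rough illustration of that fix — the numbers are assumptions and should be tuned to the app's real pause behaviour:

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8080
  periodSeconds: 20        # probe less often
  timeoutSeconds: 5        # tolerate slow responses
  failureThreshold: 5      # ~100s of failures before a restart
```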
Pattern 4 — App stuck but not restarted
Cause:
- No liveness probe
Fix:
- Add liveness with safe timing
9. What a 6-year DevOps engineer MUST articulate
You must be able to say:
- Startup probe disables liveness temporarily
- Liveness kills containers, not pods
- Probes run on kubelet
- Probes are not monitoring
- Bad probes cause more outages than no probes
- Readiness controls traffic, not liveness
- Probes must match real app behavior
10. Interview-ready summary (memorize)
“Startup probes protect slow-starting applications from being killed prematurely. Liveness probes provide self-healing for hung or deadlocked processes. In production, startup probes must exist before aggressive liveness probes, otherwise Kubernetes causes crash loops. Probes should reflect application behavior, not infrastructure health.”
Architecture Overview (Mental Model)
Traffic flow:
```
Browser
  ↓
Ingress
  ↓
Service
  ↓
Pod
  ↓
Container
```
Everything in this material builds around this flow.
MODULE 1 — Kubernetes Services & Networking
Why Services Exist
Pods:
- Have dynamic IPs
- Can be recreated at any time
- Must never be accessed directly
A Service provides:
- Stable IP
- Load balancing
- Pod discovery
Service Types
| Type | Purpose | Production Usage |
|---|---|---|
| ClusterIP | Internal access | Most common |
| NodePort | Direct node access | Debug / learning |
| LoadBalancer | Cloud LB | External traffic |
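For comparison, the same backend can be exposed externally just by changing the Service type. A sketch — the name web-external is illustrative, and a LoadBalancer only gets an address on a cloud provider or with something like MetalLB:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-external
spec:
  type: LoadBalancer      # or NodePort for direct node access
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```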
Project 1 — Service Traffic Flow
Step 1 — Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: app
          image: hashicorp/http-echo:0.2.3
          args:
            - "-listen=:8080"
            - "-text=SERVICE WORKS"
          ports:
            - containerPort: 8080
```
Apply:
```bash
kubectl apply -f deployment.yaml
```
Step 2 — ClusterIP Service
```yaml
apiVersion: v1
kind: Service
metadata:
  name: web-svc
spec:
  selector:
    app: web
  ports:
    - port: 80
      targetPort: 8080
```
Apply:
```bash
kubectl apply -f service.yaml
```
Verify:
```bash
kubectl get svc
kubectl get endpoints web-svc
```
Step 3 — Access Inside Cluster
```bash
kubectl run tmp --rm -it --image=busybox -- sh
# inside the busybox shell:
wget -qO- http://web-svc
```
Key Concepts Learned
- Services select Pods using labels
- Endpoints show real traffic targets
- Service failure usually means selector mismatch
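When the endpoint list is empty, comparing labels usually exposes the mismatch. A sketch using this project's names:

```bash
# Labels the Service selects on
kubectl get svc web-svc -o jsonpath='{.spec.selector}'
# Labels the Pods actually carry
kubectl get pods --show-labels
# Empty ENDPOINTS here means no Pod matched the selector
kubectl get endpoints web-svc
```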
MODULE 2 — Ingress (Real Production Entry)
Ingress provides:
- Single entry point
- Path-based routing
- Host-based routing
- SSL termination
Project 2 — Ingress Routing
Step 1 — Deploy Two Versions
Stable Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stable
spec:
  replicas: 2
  selector:
    matchLabels:
      app: echo
      version: stable
  template:
    metadata:
      labels:
        app: echo
        version: stable
    spec:
      containers:
        - name: app
          image: hashicorp/http-echo:0.2.3
          args:
            - "-listen=:8080"
            - "-text=STABLE VERSION"
          ports:
            - containerPort: 8080
```
Canary Deployment
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
      version: canary
  template:
    metadata:
      labels:
        app: echo
        version: canary
    spec:
      containers:
        - name: app
          image: hashicorp/http-echo:0.2.3
          args:
            - "-listen=:8080"
            - "-text=CANARY VERSION"
          ports:
            - containerPort: 8080
```
Step 2 — Services
```yaml
apiVersion: v1
kind: Service
metadata:
  name: stable-svc
spec:
  selector:
    app: echo
    version: stable
  ports:
    - port: 80
      targetPort: 8080
```
```yaml
apiVersion: v1
kind: Service
metadata:
  name: canary-svc
spec:
  selector:
    app: echo
    version: canary
  ports:
    - port: 80
      targetPort: 8080
```
Step 3 — Ingress
```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: stable-svc
                port:
                  number: 80
          - path: /canary
            pathType: Prefix
            backend:
              service:
                name: canary-svc
                port:
                  number: 80
```
Test
```bash
curl http://<INGRESS-IP>/
curl http://<INGRESS-IP>/canary
```
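This assumes an ingress controller (for example ingress-nginx) is installed in the cluster; without one, the Ingress object does nothing. The address to curl can be read from the Ingress itself:

```bash
# ADDRESS stays empty until an ingress controller picks up the resource
kubectl get ingress app-ingress
```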
MODULE 3 — ConfigMaps & Secrets
Why Configuration Is External
Images must:
- Be immutable
- Work in all environments
Configuration must:
- Change without rebuilding images
- Be environment-specific
Project 3 — ConfigMap Injection
Step 1 — ConfigMap
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  MESSAGE: "CONFIGMAP VALUE"
```
Step 2 — Deployment Using ConfigMap
```yaml
containers:
  - name: app
    image: hashicorp/http-echo:0.2.3
    args:
      - "-listen=:8080"
      - "-text=$(MESSAGE)"
    env:
      - name: MESSAGE
        valueFrom:
          configMapKeyRef:
            name: app-config
            key: MESSAGE
```
Update Config Live
```bash
kubectl edit configmap app-config
kubectl rollout restart deployment web
```
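Environment variables injected this way are not refreshed in running Pods, which is why the rollout restart is required. For sensitive values the same pattern works with a Secret instead of a ConfigMap — a sketch, with illustrative names and values:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
stringData:
  API_KEY: "not-a-real-key"
```

Referenced from the container:

```yaml
env:
  - name: API_KEY
    valueFrom:
      secretKeyRef:
        name: app-secret
        key: API_KEY
```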
MODULE 4 — Resource Management
Requests vs Limits
| Setting | Meaning |
|---|---|
| requests | Reserved for scheduling; the container is guaranteed at least this much |
| limits | Hard maximum; exceeding the memory limit gets the container OOM-killed |
Project 4 — OOM Kill Demo
```yaml
resources:
  requests:
    memory: "32Mi"
    cpu: "50m"
  limits:
    memory: "64Mi"
    cpu: "100m"
```
Observe:
```bash
kubectl describe pod
```
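With only the limits above, nothing gets OOM-killed until something actually allocates more than 64Mi. A minimal sketch of a pod that does, assuming the public polinux/stress image is acceptable in your cluster:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: oom-demo
spec:
  containers:
    - name: stress
      image: polinux/stress
      command: ["stress"]
      # Allocate ~150M, well above the 64Mi limit
      args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
      resources:
        requests:
          memory: "32Mi"
          cpu: "50m"
        limits:
          memory: "64Mi"
          cpu: "100m"
```

`kubectl describe pod oom-demo` should then show the container terminated with reason OOMKilled and restarting.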
MODULE 5 — Autoscaling (HPA)
Project 5 — CPU-Based Scaling
Step 1 — Enable Metrics
HPA relies on the metrics-server; confirm it is registered:
```bash
kubectl get apiservices | grep metrics
```
Step 2 — HPA
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 50
```
Generate Load
```bash
while true; do wget -qO- http://web-svc; done
```
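Two caveats: CPU-utilization scaling only works if the target containers declare CPU requests (see Module 4), and web-svc is a ClusterIP, so the loop has to run inside the cluster. A sketch of doing that from a throwaway pod:

```bash
kubectl run load-gen --rm -it --image=busybox -- \
  /bin/sh -c "while true; do wget -qO- http://web-svc > /dev/null; done"
```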
Watch:
```bash
kubectl get hpa
kubectl get pods
```
MODULE 6 — Logs & Troubleshooting
Debug Order
- Pod status
- Events
- Logs
- Resource usage
- Service endpoints
Commands
```bash
kubectl get pods
kubectl describe pod <pod>
kubectl logs <pod>
kubectl get events --sort-by=.metadata.creationTimestamp
```
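Two more commands worth adding to the routine (kubectl top requires metrics-server):

```bash
# Logs from the previous container instance — essential after a liveness-triggered restart
kubectl logs <pod> --previous
# Current CPU/memory usage per pod
kubectl top pod
```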
Incident Simulation
- Pod is Running
- Browser shows nothing
- Endpoint list is empty
- Fix selector
MODULE 7 — Security Basics
Minimal SecurityContext
```yaml
securityContext:
  runAsNonRoot: true
  allowPrivilegeEscalation: false
```
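In a Deployment this block sits at the container level. A slightly fuller sketch — the extra fields are common hardening options, and the image name is illustrative:

```yaml
containers:
  - name: app
    image: your-registry/your-app:1.2.3   # illustrative; always pin a real tag
    securityContext:
      runAsNonRoot: true
      runAsUser: 1000                     # assumes the image works as a non-root UID
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
```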
Image Best Practices
- Never use `latest`
- Use fixed versions
- Use trusted registries
Final Integrated Project
Production Application Includes:
- Deployment with readiness probe
- ClusterIP Service
- Ingress routing
- ConfigMap
- Resource limits
- HPA
- Logs & events
- Secure container settings
This mirrors how Kubernetes is used in real companies.