When running Jobs in Kubernetes, Pods sometimes fail due to application errors, misconfigurations, or external issues.
By default, Kubernetes retries a failed Job's Pods up to 6 times before marking the Job itself as failed. This is configurable via the backoffLimit setting.
In this post, we’ll walk through how to implement backoffLimit in GKE and see how it behaves when a Job keeps failing.
🔹 Step 01: Introduction
- If a Job has errors, Kubernetes will retry it multiple times before finally marking it as failed.
- backoffLimit controls how many retries should happen.
- Default value: 6
- In our demo, we’ll set backoffLimit: 4. That means the Job will retry 4 times before being marked as failed.
👉 Expected Result: We should see 4 Pods in Error state.
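One detail worth knowing: Kubernetes does not recreate a failed Pod immediately. Retries are spaced out with an exponential backoff delay (starting at 10s and doubling each time, capped at six minutes). A quick sketch of the delay schedule for our four retries:

```shell
# Exponential backoff between Job retries: starts at ~10s, doubles each time
delay=10
for attempt in 1 2 3 4; do
  echo "retry $attempt starts ~${delay}s after the previous failure"
  delay=$((delay * 2))
done
```

This is why, in the Pod listing later in this post, the failed Pods have ages spread out over several minutes rather than being created back-to-back.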
🔹 Step 02: job2.yaml
Here’s our Job manifest:
apiVersion: batch/v1
kind: Job
metadata:
  # Unique key of the Job instance
  name: job2
spec:
  template:
    metadata:
      name: job2
    spec:
      containers:
      - name: job2
        image: alpine
        # Exit code 0 = success, non-zero (like 1) = failure
        command: ['sh', '-c', 'echo Kubernetes Jobs Demo - backoffLimit Test ; exit 1']
      # Do not restart containers after they exit
      restartPolicy: Never
  # backoffLimit: Number of retries before marking the Job as failed
  backoffLimit: 4  # Default is 6
👉 Notice: We intentionally use exit 1 to simulate a failure.
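Also notice restartPolicy: Never. That setting is what makes every failure produce a fresh Pod. If you instead used restartPolicy: OnFailure (a variant sketch below, not part of this demo), the kubelet would restart the container inside the same Pod, so you would see a single Pod with a growing RESTARTS count rather than multiple Pods in Error state:

```yaml
# Variant (not used in this demo): restart the failed container in place
spec:
  template:
    spec:
      containers:
      - name: job2
        image: alpine
        command: ['sh', '-c', 'exit 1']
      restartPolicy: OnFailure  # same Pod, RESTARTS counter increments
```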
🔹 Step 03: Deploy Kubernetes Manifests
Apply the Job YAML:
# Deploy the Job
kubectl apply -f job2.yaml
# OR
kubectl create -f job2.yaml
Verify the Job
# List Jobs
kubectl get jobs
# Describe the Job
kubectl describe job job2
# List Pods created by the Job
kubectl get pods
Observation
- We’ll see multiple Pods created, each going into Error state.
- Kubernetes retried 4 times, as per backoffLimit: 4.
✅ Example output:
$ kubectl get pods
NAME         READY   STATUS   RESTARTS   AGE
job2-c88pf   0/1     Error    0          3m14s
job2-dxlzq   0/1     Error    0          3m34s
job2-jsb6n   0/1     Error    0          2m44s
job2-pxn5t   0/1     Error    0          2m29s
# Check Pod details
kubectl describe pod <POD-NAME>
# Check Job details
kubectl describe job job2
👉 Job Status:
- 0/1 completed (because no Pod succeeded)
- 4 Pods failed (as per backoffLimit)
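The controller's bookkeeping can be pictured as a simple retry loop (a conceptual sketch, not actual Kubernetes controller code): each failed Pod increments a failure counter, and once the counter reaches backoffLimit the Job is marked as failed.

```shell
# Conceptual sketch of backoffLimit accounting (not real controller code)
backoff_limit=4
failures=0
while [ "$failures" -lt "$backoff_limit" ]; do
  # Our demo Pod's command always exits 1, so 'break' is never reached
  sh -c 'exit 1' && break
  failures=$((failures + 1))
done
echo "Job marked as failed after $failures failed Pods"
```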
🔹 Step 04: Delete the Job
Once tested, clean up:
kubectl delete job job2
This removes the Job and all associated Pods.
🎯 Conclusion
- backoffLimit is a safety mechanism to prevent Jobs from retrying endlessly.
- By default, Kubernetes retries 6 times.
- You can lower or increase this value based on your workload.
- Useful in scenarios where you want to fail fast instead of wasting cluster resources.
Now you’ve seen how backoffLimit works with a failing Job in Kubernetes 🚀.