When running Jobs in Kubernetes, Pods sometimes fail due to application errors, misconfigurations, or external issues.
By default, Kubernetes retries a failed Job's Pods up to 6 times before marking the Job itself as failed. This is configurable via the backoffLimit setting.
In this post, we’ll walk through how to implement backoffLimit in GKE and see how it behaves when a Job keeps failing.
🔹 Step 01: Introduction
- If a Job has errors, Kubernetes will retry it multiple times before finally marking it as failed.
- backoffLimit controls how many retries should happen.
- Default value: 6
- In our demo, we’ll set backoffLimit: 4. That means the Job will retry 4 times before being marked as failed.
👉 Expected Result: We should see 4 Pods in Error state.
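One detail worth knowing: Kubernetes does not recreate a failed Pod immediately. Retries are spaced out with an exponential backoff delay (starting at 10s and doubling each time, capped at six minutes). A quick sketch of the delay schedule for our four retries:

```shell
# Exponential backoff between Job retries: starts at ~10s, doubles each time
delay=10
for attempt in 1 2 3 4; do
  echo "retry $attempt starts ~${delay}s after the previous failure"
  delay=$((delay * 2))
done
```

This is why, in the Pod listing later in this post, the failed Pods have ages spread out over several minutes rather than being created back-to-back.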
🔹 Step 02: job2.yaml
Here’s our Job manifest:
apiVersion: batch/v1
kind: Job
metadata:
  # Unique key of the Job instance
  name: job2
spec:
  template:
    metadata:
      name: job2
    spec:
      containers:
      - name: job2
        image: alpine
        # Exit code 0 = success, non-zero (like 1) = failure
        command: ['sh', '-c', 'echo Kubernetes Jobs Demo - backoffLimit Test ; exit 1']
      # Do not restart containers after they exit
      restartPolicy: Never
  # backoffLimit: Number of retries before marking the Job as failed
  backoffLimit: 4  # Default is 6
👉 Notice: We intentionally use exit 1 to simulate a failure.
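Also notice restartPolicy: Never. That setting is what makes every failure produce a fresh Pod. If you instead used restartPolicy: OnFailure (a variant sketch below, not part of this demo), the kubelet would restart the container inside the same Pod, so you would see a single Pod with a growing RESTARTS count rather than multiple Pods in Error state:

```yaml
# Variant (not used in this demo): restart the failed container in place
spec:
  template:
    spec:
      containers:
      - name: job2
        image: alpine
        command: ['sh', '-c', 'exit 1']
      restartPolicy: OnFailure  # same Pod, RESTARTS counter increments
```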
🔹 Step 03: Deploy Kubernetes Manifests
Apply the Job YAML:
# Deploy the Job
kubectl apply -f job2.yaml
# OR
kubectl create -f job2.yaml
Verify the Job
# List Jobs
kubectl get jobs
# Describe the Job
kubectl describe job job2
# List Pods created by the Job
kubectl get pods
Observation
- We’ll see multiple Pods created, each going into Error state.
- Kubernetes retried 4 times, as per backoffLimit: 4.
✅ Example output:
$ kubectl get pods
NAME         READY   STATUS   RESTARTS   AGE
job2-c88pf   0/1     Error    0          3m14s
job2-dxlzq   0/1     Error    0          3m34s
job2-jsb6n   0/1     Error    0          2m44s
job2-pxn5t   0/1     Error    0          2m29s
# Check Pod details
kubectl describe pod <POD-NAME>
# Check Job details
kubectl describe job job2
👉 Job Status:
- 0/1 completed (because no Pod succeeded)
- 4 Pods failed (as per backoffLimit)
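The controller's bookkeeping can be pictured as a simple retry loop (a conceptual sketch, not actual Kubernetes controller code): each failed Pod increments a failure counter, and once the counter reaches backoffLimit the Job is marked as failed.

```shell
# Conceptual sketch of backoffLimit accounting (not real controller code)
backoff_limit=4
failures=0
while [ "$failures" -lt "$backoff_limit" ]; do
  # Our demo Pod's command always exits 1, so 'break' is never reached
  sh -c 'exit 1' && break
  failures=$((failures + 1))
done
echo "Job marked as failed after $failures failed Pods"
```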
🔹 Step 04: Delete the Job
Once tested, clean up:
kubectl delete job job2
This removes the Job and all associated Pods.
🎯 Conclusion
- backoffLimit is a safety mechanism to prevent Jobs from retrying endlessly.
- By default, Kubernetes retries 6 times.
- You can lower or increase this value based on your workload.
- Useful in scenarios where you want to fail fast instead of wasting cluster resources.
Now you’ve seen how backoffLimit works with a failing Job in Kubernetes 🚀.