For measuring a service's health, Kubernetes provides probes. Using these probes, Kubernetes becomes aware of whether a container is healthy or unhealthy.
My understanding was ...
I was working on a microservice-architecture-based Spring Boot project, deployed into a Kubernetes cluster.
We used probes to check the health of the services.
What is this probe?
In the deployment definition, we provide a hook that Kubernetes can use to check the health of the service. There are several probe mechanisms; HTTP probes are the most common. For an HTTP probe, the kubelet expects an HTTP 200 OK response to determine that the service is working as expected.
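Besides HTTP probes, Kubernetes also supports exec and tcpSocket probe handlers. The sketch below is purely illustrative (the path, port, and command are made-up values, not from our deployment):

livenessProbe:
  httpGet:              # succeeds on an HTTP 2xx (or 3xx) response
    path: /healthz
    port: 8080
  # Alternative handlers:
  # exec:
  #   command: ["cat", "/tmp/healthy"]   # succeeds if the command exits with code 0
  # tcpSocket:
  #   port: 8080                         # succeeds if a TCP connection can be opened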
There are two common kinds of probes:
- Readiness probe
- Liveness probe
Readiness Probe
If a readiness probe is configured, Kubernetes uses it to decide whether traffic should be routed to a running container.
Liveness Probe
If a liveness probe is configured, Kubernetes uses it to decide whether to keep a container running or to kill it.
Our deployment configuration:
One of our deployment configurations is as follows:
...
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 60
...
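As a side note, the /actuator/health/readiness and /actuator/health/liveness endpoints are provided by Spring Boot Actuator. A minimal application.yml sketch to expose them explicitly (assuming Spring Boot 2.3+; when the application detects it is running on Kubernetes, these probe endpoints are typically enabled automatically) could look like:

management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness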
Readiness probe configuration:
The probe starts firing 30 seconds (initialDelaySeconds) after the container starts, and the readiness probe's HTTP call is repeated every 30 seconds (periodSeconds). As the default successThreshold is 1, as soon as the container responds with HTTP 2xx it is marked as ready and traffic is routed to it. If the container does not return HTTP 2xx, the probe is retried failureThreshold times, which defaults to 3.
So the container gets an initial 30 seconds (initialDelaySeconds) plus 3 (failureThreshold) × 30 seconds (periodSeconds), in total 2 minutes, to get ready.
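For reference, the same readiness probe with the Kubernetes defaults spelled out explicitly would look like the sketch below (we did not set these fields ourselves; the commented values are the documented defaults):

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 1      # default
  successThreshold: 1    # default
  failureThreshold: 3    # default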
Liveness probe configuration:
This probe starts firing after 60 seconds. My understanding was that it starts firing only after the container is marked as ready (this is where I was wrong). The liveness probe's HTTP call is then repeated every 60 seconds, and with failureThreshold defaulting to 3, the probe fires 3 times before the container is killed.
So the container has 60 seconds (initialDelaySeconds) plus 3 (failureThreshold) × 60 seconds (periodSeconds), in total 4 minutes, to respond with HTTP 2xx before it is killed.
Where I was wrong:
My understanding was that the liveness probe only starts firing after the container has passed the readiness probe check.
Recently one of our services became bulky. That service requires around 4+ minutes to start responding to any HTTP request.
I was expecting everything to keep working without any changes to the readiness & liveness probes: 2 minutes for the readiness probe plus 4 minutes for the liveness probe, so the container should not get killed by the kubelet before 6 minutes.
⚠️ But the container was getting killed!!! ⚠️
⚠️ Right after 4 minutes!!! ⚠️
Reality check:
Right after the disaster, I googled, which led me to this GitHub issue.
The issue states that the readiness & liveness probes are fired in parallel, which explains the disaster: the container's liveness probe check starts right after the container starts, and that is why my bulky service container was getting killed.
There is a newer startup probe (although it was introduced in 2020, I was not exposed to it earlier).
Only after this startup probe has passed are the readiness & liveness probes fired, in parallel.
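For example, a startup probe along the following lines (the numbers are illustrative, not taken from our deployment) would give a slow-starting container up to 30 × 10 = 300 seconds to come up before the liveness probe takes over:

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 failed attempts allowed before the container is restarted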
Better understanding:
To get a better understanding of these probes, I set up a simple Node.js + Express project.
server.js
const express = require('express');

const app = express();
const port = 3000;

const livelinessResponseStatusCode = parseInt(process.env.LIVELINESS_RESPONSE_STATUS_CODE || "200");
const readinessResponseStatusCode = parseInt(process.env.READINESS_RESPONSE_STATUS_CODE || "200");
const startupResponseStatusCode = parseInt(process.env.START_RESPONSE_STATUS_CODE || "200");

app.get('/liveliness', (req, res) => {
  console.log(`${new Date()} : Liveliness probe fired & returned : ${livelinessResponseStatusCode}`);
  res.status(livelinessResponseStatusCode);
  res.send("");
});

app.get('/readiness', (req, res) => {
  console.log(`${new Date()} : Readiness probe fired & returned : ${readinessResponseStatusCode}`);
  res.status(readinessResponseStatusCode);
  res.send("");
});

app.get('/startup', (req, res) => {
  console.log(`${new Date()} : Startup probe fired & returned : ${startupResponseStatusCode}`);
  res.status(startupResponseStatusCode);
  res.send("");
});

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`);
});
Dockerfile
FROM node:18.17.0-alpine3.18
RUN mkdir /app
WORKDIR /app
COPY . .
RUN npm ci --omit=dev
CMD [ "node", "server.js" ]
k8-probes-without-startup.yaml
# Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: playground
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8-probes-deployment
  namespace: playground
  labels:
    app: k8-probes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8-probes
  template:
    metadata:
      labels:
        app: k8-probes
    spec:
      containers:
        - name: k8-probes
          image: ratulsharker/k8-probes:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /liveliness
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 60
          readinessProbe:
            httpGet:
              path: /readiness
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 30
          env:
            - name: LIVELINESS_RESPONSE_STATUS_CODE
              value: "500"
            - name: READINESS_RESPONSE_STATUS_CODE
              value: "500"
            - name: START_RESPONSE_STATUS_CODE
              value: "500"
For testing, I used killercoda.com.
Cloning the repository
controlplane $ git clone https://github.com/ratulSharker/k8-probes.git
Cloning into 'k8-probes'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 35 (delta 17), reused 26 (delta 8), pack-reused 0
Unpacking objects: 100% (35/35), 11.87 KiB | 1.98 MiB/s, done.
Getting into the cloned repository:
controlplane $ cd k8-probes/
Running the deployment without the startup probe:
controlplane $ kubectl apply -f k8-probes-without-startup.yaml
namespace/playground created
deployment.apps/k8-probes-deployment created
controlplane $ kubectl logs -f deployment/k8-probes-deployment -n playground
Example app listening on port 3000
Tue Nov 28 2023 20:16:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:16:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:16:53 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
Tue Nov 28 2023 20:17:15 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:17:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:17:53 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
Tue Nov 28 2023 20:17:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:43 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:53 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
Tue Nov 28 2023 20:18:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:19:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
So it is definite that the liveness & readiness probes fire in parallel. The container eventually gets restarted after 3 minutes 5 seconds.
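To double-check what happened, the restart count and the probe-failure events can also be inspected (illustrative commands; the actual pod name will differ):
controlplane $ kubectl get pods -n playground
controlplane $ kubectl describe pod <pod-name> -n playground
The pod events should show the liveness probe failures and the container being killed and restarted.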
Now try with k8-probes-with-startup.yaml
k8-probes-with-startup.yaml
startupProbe:
  httpGet:
    path: /startup
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 20
env:
  ...
  - name: START_RESPONSE_STATUS_CODE
    value: "200"
Now, before starting the container with the startup probe, delete the previous deployment:
controlplane $ kubectl delete deployment k8-probes-deployment -n playground
deployment.apps "k8-probes-deployment" deleted
Running the deployment with the startup probe:
controlplane $ kubectl apply -f k8-probes-with-startup.yaml
namespace/playground unchanged
deployment.apps/k8-probes-deployment created
Inspecting the logs:
controlplane $ kubectl logs -f deployment/k8-probes-deployment -n playground
Example app listening on port 3000
Tue Nov 28 2023 20:30:07 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 200
Tue Nov 28 2023 20:30:08 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:09 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:17 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:47 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:47 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
... continues in the previous pattern
The readiness and liveness probes are fired after the startup probe passes.
Now change the environment variable START_RESPONSE_STATUS_CODE to 500. Delete the existing deployment, start the deployment again, and inspect the logs:
controlplane $ kubectl delete deployment k8-probes-deployment -n playground
deployment.apps "k8-probes-deployment" deleted
controlplane $ kubectl apply -f k8-probes-with-startup.yaml
namespace/playground unchanged
deployment.apps/k8-probes-deployment created
controlplane $ kubectl logs -f deployment/k8-probes-deployment -n playground
Example app listening on port 3000
Tue Nov 28 2023 20:33:36 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 500
Tue Nov 28 2023 20:33:56 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 500
Tue Nov 28 2023 20:34:16 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 500
controlplane $ kubectl get po -n playground
NAME READY STATUS RESTARTS AGE
k8-probes-deployment-76c67669b5-cxl7v 0/1 Running 1 (17s ago) 107s
So if the startup probe does not pass, the readiness and liveness probes are never fired. After the default failureThreshold (3) failed attempts, the container gets restarted.
Conclusion:
Final understanding:
- Startup Probe:
  - Fires right after the container starts.
  - If it passes, the readiness & liveness probes start firing (if declared).
  - If it does not pass, the container is restarted.
- Readiness Probe:
  - If a startup probe is declared, it fires after the startup probe passes.
  - If no startup probe is declared, it fires right after the container starts.
  - If it passes, traffic is routed to the container.
- Liveness Probe:
  - If a startup probe is declared, it fires after the startup probe passes.
  - If no startup probe is declared, it fires right after the container starts.
  - If it passes, the container is kept as it is.
  - If it does not pass, the container gets killed.
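Putting it all together, for a slow-starting service the three probes might be combined like the sketch below (values are illustrative, not a drop-in recommendation):

startupProbe:               # gives the app up to 30 × 10 = 300 seconds to start
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
readinessProbe:             # decides whether traffic is routed to the container
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 30
livenessProbe:              # decides whether the container should be restarted
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 60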
The repository used in the above examples can be accessed on GitHub.