For measuring a service's health, Kubernetes provides probes. Using these probes, Kubernetes becomes aware of whether a container is healthy or unhealthy.
My understanding was ...
I was working on a microservice-architecture-based Spring Boot project, deployed into a Kubernetes cluster.
We used probes to check the health of the services.
What is this probe?
In the deployment definition, we provide a hook that Kubernetes can use to check the health of the service. There are several probe mechanisms; HTTP probes are the most common. For an HTTP probe, the kubelet expects an HTTP 200 OK response to determine that the service is working as expected.
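Besides HTTP probes, Kubernetes also supports exec and tcpSocket probe handlers. The sketch below is purely illustrative (the path, port, and command are made-up values, not from our deployment):

livenessProbe:
  httpGet:              # succeeds on an HTTP 2xx (or 3xx) response
    path: /healthz
    port: 8080
  # Alternative handlers:
  # exec:
  #   command: ["cat", "/tmp/healthy"]   # succeeds if the command exits with code 0
  # tcpSocket:
  #   port: 8080                         # succeeds if a TCP connection can be opened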
There are two common kinds of probes:
- Readiness probe
- Liveness probe
Readiness Probe
If a readiness probe is configured, Kubernetes uses it to decide whether traffic should be routed to a running container.
Liveness Probe
If a liveness probe is configured, Kubernetes uses it to decide whether to keep a container running or to kill it.
Our deployment configuration:
One of our deployment configurations is as follows:
...
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 60
  periodSeconds: 60
...
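As a side note, the /actuator/health/readiness and /actuator/health/liveness endpoints are provided by Spring Boot Actuator. A minimal application.yml sketch to expose them explicitly (assuming Spring Boot 2.3+; when the application detects it is running on Kubernetes, these probe endpoints are typically enabled automatically) could look like:

management:
  endpoint:
    health:
      probes:
        enabled: true   # exposes /actuator/health/liveness and /actuator/health/readiness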
Readiness probe configuration:
The probe starts firing 30 seconds (initialDelaySeconds) after the container starts, and the readiness probe's HTTP call is repeated every 30 seconds (periodSeconds). As the default successThreshold is 1, as soon as the container responds with HTTP 2xx it is marked as ready and traffic is routed to it. If the container does not return HTTP 2xx, the probe is retried failureThreshold times, which defaults to 3.
So the container gets an initial 30 seconds (initialDelaySeconds) plus 3 (failureThreshold) × 30 seconds (periodSeconds), in total 2 minutes, to get ready.
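For reference, the same readiness probe with the Kubernetes defaults spelled out explicitly would look like the sketch below (we did not set these fields ourselves; the commented values are the documented defaults):

readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 30
  timeoutSeconds: 1      # default
  successThreshold: 1    # default
  failureThreshold: 3    # default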
Liveness probe configuration:
This probe starts firing after 60 seconds. My understanding was that it starts firing only after the container is marked as ready (this is where I was wrong). The liveness probe's HTTP call is then repeated every 60 seconds, and with failureThreshold defaulting to 3, the probe fires 3 times before the container is killed.
So the container has 60 seconds (initialDelaySeconds) plus 3 (failureThreshold) × 60 seconds (periodSeconds), in total 4 minutes, to respond with HTTP 2xx before it is killed.
Where I was wrong:
My understanding was that the liveness probe only starts firing after the container has passed the readiness probe check.
Recently one of our services became bulky. That service requires around 4+ minutes to start responding to any HTTP request.
I was expecting everything to keep working without any changes to the readiness & liveness probes: 2 minutes for the readiness probe plus 4 minutes for the liveness probe, so the container should not get killed by the kubelet before 6 minutes.
⚠️ But the container was getting killed!!! ⚠️
⚠️ Right after 4 minutes!!! ⚠️
Reality check:
Right after the disaster, I googled, which led me to this GitHub issue.
The issue states that the readiness & liveness probes are fired in parallel, which explains the disaster: the container's liveness probe check starts right after the container starts, and that is why my bulky service container was getting killed.
There is a newer startup probe (although it was introduced in 2020, I was not exposed to it earlier).
Only after this startup probe has passed are the readiness & liveness probes fired, in parallel.
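For example, a startup probe along the following lines (the numbers are illustrative, not taken from our deployment) would give a slow-starting container up to 30 × 10 = 300 seconds to come up before the liveness probe takes over:

startupProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 30   # 30 failed attempts allowed before the container is restarted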
Better understanding:
To get a better understanding of these probes, I set up a simple Node.js + Express project.
server.js
const express = require('express');

const app = express();
const port = 3000;

const livelinessResponseStatusCode = parseInt(process.env.LIVELINESS_RESPONSE_STATUS_CODE || "200");
const readinessResponseStatusCode = parseInt(process.env.READINESS_RESPONSE_STATUS_CODE || "200");
const startupResponseStatusCode = parseInt(process.env.START_RESPONSE_STATUS_CODE || "200");

app.get('/liveliness', (req, res) => {
  console.log(`${new Date()} : Liveliness probe fired & returned : ${livelinessResponseStatusCode}`);
  res.status(livelinessResponseStatusCode);
  res.send("");
});

app.get('/readiness', (req, res) => {
  console.log(`${new Date()} : Readiness probe fired & returned : ${readinessResponseStatusCode}`);
  res.status(readinessResponseStatusCode);
  res.send("");
});

app.get('/startup', (req, res) => {
  console.log(`${new Date()} : Startup probe fired & returned : ${startupResponseStatusCode}`);
  res.status(startupResponseStatusCode);
  res.send("");
});

app.listen(port, () => {
  console.log(`Example app listening on port ${port}`);
});
Dockerfile
FROM node:18.17.0-alpine3.18
RUN mkdir /app
WORKDIR /app
COPY . .
RUN npm ci --omit=dev
CMD [ "node", "server.js" ]
k8-probes-without-startup.yaml
# Namespace
apiVersion: v1
kind: Namespace
metadata:
  name: playground
---
# Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: k8-probes-deployment
  namespace: playground
  labels:
    app: k8-probes
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8-probes
  template:
    metadata:
      labels:
        app: k8-probes
    spec:
      containers:
        - name: k8-probes
          image: ratulsharker/k8-probes:latest
          ports:
            - containerPort: 3000
          livenessProbe:
            httpGet:
              path: /liveliness
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 60
          readinessProbe:
            httpGet:
              path: /readiness
              port: 3000
            initialDelaySeconds: 5
            periodSeconds: 30
          env:
            - name: LIVELINESS_RESPONSE_STATUS_CODE
              value: "500"
            - name: READINESS_RESPONSE_STATUS_CODE
              value: "500"
            - name: START_RESPONSE_STATUS_CODE
              value: "500"
For testing, I used killercoda.com.
Cloning the repository
controlplane $ git clone https://github.com/ratulSharker/k8-probes.git
Cloning into 'k8-probes'...
remote: Enumerating objects: 35, done.
remote: Counting objects: 100% (35/35), done.
remote: Compressing objects: 100% (27/27), done.
remote: Total 35 (delta 17), reused 26 (delta 8), pack-reused 0
Unpacking objects: 100% (35/35), 11.87 KiB | 1.98 MiB/s, done.
Getting into the cloned repository:
controlplane $ cd k8-probes/
Running the deployment without the startup probe:
controlplane $ kubectl apply -f k8-probes-without-startup.yaml
namespace/playground created
deployment.apps/k8-probes-deployment created
controlplane $ kubectl logs -f deployment/k8-probes-deployment -n playground
Example app listening on port 3000
Tue Nov 28 2023 20:16:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:16:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:16:53 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
Tue Nov 28 2023 20:17:15 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:17:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:17:53 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
Tue Nov 28 2023 20:17:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:43 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:53 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
Tue Nov 28 2023 20:18:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:18:53 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:19:23 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
So it is definite that the liveness & readiness probes fire in parallel. The container eventually gets restarted after 3 minutes 5 seconds.
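To double-check what happened, the restart count and the probe-failure events can also be inspected (illustrative commands; the actual pod name will differ):
controlplane $ kubectl get pods -n playground
controlplane $ kubectl describe pod <pod-name> -n playground
The pod events should show the liveness probe failures and the container being killed and restarted.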
Now try with k8-probes-with-startup.yaml
k8-probes-with-startup.yaml
startupProbe:
  httpGet:
    path: /startup
    port: 3000
  initialDelaySeconds: 5
  periodSeconds: 20
env:
  ...
  - name: START_RESPONSE_STATUS_CODE
    value: "200"
Now, before starting the container with the startup probe, delete the previous deployment:
controlplane $ kubectl delete deployment k8-probes-deployment -n playground
deployment.apps "k8-probes-deployment" deleted
Running the deployment with the startup probe:
controlplane $ kubectl apply -f k8-probes-with-startup.yaml
namespace/playground unchanged
deployment.apps/k8-probes-deployment created
Inspecting the logs:
controlplane $ kubectl logs -f deployment/k8-probes-deployment -n playground
Example app listening on port 3000
Tue Nov 28 2023 20:30:07 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 200
Tue Nov 28 2023 20:30:08 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:09 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:17 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:47 GMT+0000 (Coordinated Universal Time) : Readiness probe fired & returned : 500
Tue Nov 28 2023 20:30:47 GMT+0000 (Coordinated Universal Time) : Liveliness probe fired & returned : 500
... continues in the previous pattern
The readiness and liveness probes are fired after the startup probe passes.
Now change the environment variable START_RESPONSE_STATUS_CODE to 500. Delete the existing deployment, start the deployment again, and inspect the logs:
controlplane $ kubectl delete deployment k8-probes-deployment -n playground
deployment.apps "k8-probes-deployment" deleted
controlplane $ kubectl apply -f k8-probes-with-startup.yaml
namespace/playground unchanged
deployment.apps/k8-probes-deployment created
controlplane $ kubectl logs -f deployment/k8-probes-deployment -n playground
Example app listening on port 3000
Tue Nov 28 2023 20:33:36 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 500
Tue Nov 28 2023 20:33:56 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 500
Tue Nov 28 2023 20:34:16 GMT+0000 (Coordinated Universal Time) : Startup probe fired & returned : 500
controlplane $ kubectl get po -n playground
NAME READY STATUS RESTARTS AGE
k8-probes-deployment-76c67669b5-cxl7v 0/1 Running 1 (17s ago) 107s
So if the startup probe does not pass, the readiness and liveness probes are never fired. After the default failureThreshold (3) failed attempts, the container gets restarted.
Conclusion:
Final understanding:
- Startup Probe:
  - Fires right after the container starts.
  - If it passes, the readiness & liveness probes start firing (if declared).
  - If it does not pass, the container is restarted.
- Readiness Probe:
  - If a startup probe is declared, it fires after the startup probe passes.
  - If no startup probe is declared, it fires right after the container starts.
  - If it passes, traffic is routed to the container.
- Liveness Probe:
  - If a startup probe is declared, it fires after the startup probe passes.
  - If no startup probe is declared, it fires right after the container starts.
  - If it passes, the container is kept as it is.
  - If it does not pass, the container gets killed.
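Putting it all together, for a slow-starting service the three probes might be combined like the sketch below (values are illustrative, not a drop-in recommendation):

startupProbe:               # gives the app up to 30 × 10 = 300 seconds to start
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 30
readinessProbe:             # decides whether traffic is routed to the container
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  periodSeconds: 30
livenessProbe:              # decides whether the container should be restarted
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 60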
The repository used in the above examples can be accessed on GitHub.