Introduction
As we move towards a Cloud Native world, where workloads are ephemeral, horizontal scaling is key and microservices are the norm, monitoring all of these spread-out components becomes not only essential, but mandatory in any production-ready environment.
This post kicks off a series called Cloud Native Monitoring at Scale, which covers all stages of monitoring a cloud-native application deployed on Kubernetes: from a single running application and knowing whether it is up and running as expected (this post), all the way to multiple k8s clusters running multiple applications simultaneously.
Cloud Native Application's health
In this blog post we will go through the process of developing a cloud-native application and making sure that it is up and running at the pod level, by leveraging Kubernetes' built-in Readiness and Liveness probes.
Goal
Create an application that replies with pong on a /ping endpoint.
Steps
- Dockerize our development environment;
- Develop an application in a "cloud-native friendly" programming language such as Go, Rust or Deno (a minimal sketch follows this list);
- Describe and implement the tests needed to validate most (if not all) of our use cases;
- Deploy the application to our test/integration environment and run all smoke tests and validations;
- Get the green light and promote the application to the production environment.
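For reference, a minimal sketch of such an application in Go could look like the block below. The port and the JSON payload match the outputs shown later in this post; everything else (package layout, handler wiring) is only an illustrative assumption, not the exact code used here.

package main

import (
	"log"
	"net/http"
)

func main() {
	// /ping simply replies with the JSON payload we expect to see in production.
	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"msg":"pong"}`))
	})

	log.Println("Running on http://0.0.0.0:8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}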
Once we have gone through all these steps, we are able to see that the application is up and running in our k8s namespace:
→ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-basic-5499dbdcc-4xlmm 1/1 Running 0 6s
app-basic-5499dbdcc-j84bl 1/1 Running 0 6s
We can even validate in our production environment that everything is working as expected:
→ curl --max-time 10 http://kubernetes.docker.internal:31792/ping
{"msg":"pong"}
Go live!
Our application has been successfully implemented and everything seems to be working just fine, so we are live and can now provide this service to our customers without a problem!
Except that a few minutes later we start getting emails and calls from customers saying the service is not working as expected. Weird, right? Let's check if the pod is running:
→ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-basic-6b6dd6b98f-6t7jl 1/1 Running 0 57s
app-basic-6b6dd6b98f-pzgxr 1/1 Running 0 57s
It is running just fine, it says Running right there under STATUS, so there must be an issue on the customer's side, right? Just for the sake of sanity, let's run a test ourselves:
→ curl --max-time 10 http://kubernetes.docker.internal:31792/ping
curl: (28) Operation timed out after 10003 milliseconds with 0 bytes received
This is awkward: we have tested our application and the deployment is working just fine, but apparently after a while our application stops responding, and this is not reflected in our Kubernetes environment.
For this exercise, we created a small "kill switch" that breaks the application 30 seconds after it first starts running, as we can see in the logs:
→ kubectl logs app-basic-6b6dd6b98f-6t7jl
[11/29/2020, 6:49:42 PM] Running on http://0.0.0.0:8080
[11/29/2020, 6:50:12 PM] Upsie, I'm dead...
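For illustration, a kill switch like this can be as simple as flipping a flag 30 seconds after startup and letting the handler hang, extending the earlier Go sketch. This is a hypothetical reconstruction of the behavior, not necessarily how the demo application is actually implemented:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var dead atomic.Bool

	// "Kill switch": 30 seconds after startup the app stops answering.
	time.AfterFunc(30*time.Second, func() {
		dead.Store(true)
		log.Println("Upsie, I'm dead...")
	})

	http.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		if dead.Load() {
			// Hang instead of replying, so callers (and, later, probes) time out.
			select {}
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write([]byte(`{"msg":"pong"}`))
	})

	log.Println("Running on http://0.0.0.0:8080")
	log.Fatal(http.ListenAndServe(":8080", nil))
}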
Kubernetes to the rescue
Well, this is where the Kubernetes concepts of Readiness and Liveness probes come in handy: they allow the kubelet to periodically check the status of a pod by probing an endpoint to understand its state.
Liveness vs. Readiness probe
Although these two concepts can initially be confusing in terms of their responsibilities, the distinction becomes clearer once we explore their main purposes.
An application's health can be split into two main statuses:
- Alive (liveness): The application is up and running and working as expected;
- Ready (readiness): The application is ready to receive new requests;
Let's imagine we have an application that initially loads an in-memory cache to enable replying to external requests faster, and this process takes roughly 15 seconds. This means that the application is alive and running from 0s, but it will only be ready at around 15s.
Having these probes allows Kubernetes to execute crucial orchestration tasks, such as routing requests from a Service to a specific pod only when it is up and ready, as well as setting thresholds to restart a pod when it has not been alive/ready for a certain amount of time.
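On the application side, a sketch of the cache example above could expose separate /health and /ready endpoints, with /ready only returning 200 once the (simulated) cache load has finished. The handler names simply mirror the probe paths used in the manifest below; the actual implementation will of course depend on your application:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	var cacheLoaded atomic.Bool

	// Simulate the in-memory cache that takes ~15 seconds to warm up.
	go func() {
		time.Sleep(15 * time.Second)
		cacheLoaded.Store(true)
		log.Println("cache loaded, ready to serve traffic")
	}()

	// Liveness: the process is up and able to answer HTTP requests.
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: only OK once the cache has been loaded.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !cacheLoaded.Load() {
			http.Error(w, "cache not loaded yet", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}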
From the orchestration perspective, Kubernetes has us covered: we simply add these parameters to our deployment manifest:
...
containers:
- name: app
  image: pong
  ...
  livenessProbe:
    httpGet:
      path: /health
      port: http
    initialDelaySeconds: 5
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 1
  readinessProbe:
    httpGet:
      path: /ready
      port: http
    initialDelaySeconds: 15
    periodSeconds: 5
    timeoutSeconds: 1
    failureThreshold: 1
...
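Assuming the manifest is saved as deployment.yaml, we can apply it and use kubectl describe to confirm that both probes were picked up; describe lists the configured delay, timeout, period and failure threshold for each probe:

→ kubectl apply -f deployment.yaml
→ kubectl describe pod app-basic-6b6dd6b98f-6t7jl | grep -iE 'liveness|readiness'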
The implementation of the /health and /ready endpoints is the application's responsibility, as they can have different meanings across different kinds of applications. In our specific case, the two concepts overlap and we can use the /ping endpoint as the check for whether our application is both alive and ready.
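In practice, that just means pointing both probes at the existing /ping endpoint while keeping the same timings as in the manifest above (illustrative excerpt):

  livenessProbe:
    httpGet:
      path: /ping
      port: http
    ...
  readinessProbe:
    httpGet:
      path: /ping
      port: http
    ...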
If we look back at our pong application: with timeoutSeconds set to 1s and periodSeconds set to 5s (the kubelet probes this endpoint every 5 seconds), Kubernetes would detect that the endpoint was taking more than 1 second to respond and restart the container, enabling our application to receive requests again. Since failureThreshold is 1, a single failed probe is enough, so the hang would be detected within roughly 6 seconds of the kill switch firing.
Of course, this would not necessarily fix our root cause, but it would decrease the impact on our customers and also let us know that something was wrong once we saw multiple restarts on our pods.
→ kubectl get pods
NAME READY STATUS RESTARTS AGE
app-basic-6b6dd6b98f-6t7jl 1/1 Running 4 2m33s
app-basic-6b6dd6b98f-pzgxr 1/1 Running 4 2m33s
Under this scenario, we could investigate a bit further and understand why we had so many restarts by looking at our pod's events:
→ kubectl get event --field-selector involvedObject.name=app-basic-6b6dd6b98f-6t7jl
LAST SEEN TYPE REASON OBJECT MESSAGE
4m Normal Scheduled pod/app-basic-6b6dd6b98f-6t7jl Successfully assigned default/app-basic-6b6dd6b98f-6t7jl to docker-desktop
95s Normal Pulled pod/app-basic-6b6dd6b98f-6t7jl Container image "pong:latest" already present on machine
2m10s Normal Created pod/app-basic-6b6dd6b98f-6t7jl Created container app-basic
2m10s Normal Started pod/app-basic-6b6dd6b98f-6t7jl Started container app-basic
2m48s Warning Unhealthy pod/app-basic-6b6dd6b98f-6t7jl Readiness probe failed: Get "http://10.1.0.15:8080/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
97s Warning Unhealthy pod/app-basic-6b6dd6b98f-6t7jl Liveness probe failed: Get "http://10.1.0.15:8080/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
97s Normal Killing pod/app-basic-6b6dd6b98f-6t7jl Container app-basic failed liveness probe, will be restarted
2m9s Warning Unhealthy pod/app-basic-6b6dd6b98f-6t7jl Readiness probe failed: Get "http://10.1.0.15:8080/ping": dial tcp 10.1.0.15:8080: connect: connection refused
Here we can clearly see that the container was killed because both our liveness and readiness probes were timing out.
This tells us that our application was showing unhealthy symptoms, and that we would need to dig into the code to find out what was causing it to stop responding every 30s.
Nonetheless, by implementing our Readiness and Liveness probes we have made sure that the pod's health status reflects our application's health, and that Kubernetes is aware of it and can react accordingly. In this case, restarting the pod whenever it becomes unhealthy brings it back to a healthy state, able to reply pong to our customers and decreasing the downtime of our much-valued service.
Conclusion
This is the first blog post of our "Cloud Native monitoring at scale" series, in which we will evolve from identifying the steps needed to expose our application's health, to monitoring it at scale, and ultimately to leveraging this system to build a full-fledged, automated setup that alerts us whenever something goes wrong across our application, system or organization.
If you feel like discussing how monitoring not only your infrastructure, but all the layers that can impact your business, is crucial to your organization, then reach out to me on Twitter.