<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Felipe Schmitt</title>
    <description>The latest articles on DEV Community by Felipe Schmitt (@schmittfelipe).</description>
    <link>https://dev.to/schmittfelipe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F522817%2F02818f7f-ea44-4033-9861-13d52143d0aa.jpg</url>
      <title>DEV Community: Felipe Schmitt</title>
      <link>https://dev.to/schmittfelipe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/schmittfelipe"/>
    <language>en</language>
    <item>
      <title>Cloud Native Monitoring at Scale - Collecting Metrics</title>
      <dc:creator>Felipe Schmitt</dc:creator>
      <pubDate>Sun, 29 Nov 2020 20:00:03 +0000</pubDate>
      <link>https://dev.to/schmittfelipe/cloud-native-monitoring-at-scale-collecting-metrics-3j4i</link>
      <guid>https://dev.to/schmittfelipe/cloud-native-monitoring-at-scale-collecting-metrics-3j4i</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;As we move towards a Cloud Native world, where workloads are ephemeral, horizontal scaling is key and microservices are the norm, monitoring all of these spread-out components becomes not just useful but mandatory in any production-ready environment.&lt;/p&gt;

&lt;p&gt;This post is part of a series named &lt;strong&gt;Cloud Native Monitoring at Scale&lt;/strong&gt;, which covers all stages of monitoring a cloud-native application deployed on Kubernetes: from checking whether a single running application is up and behaving as expected (see my &lt;a href="https://felipeschmitt.com/cloud-native-monitoring-app-health/" rel="noopener noreferrer"&gt;previous post&lt;/a&gt;) all the way to running multiple applications across multiple k8s clusters simultaneously.&lt;/p&gt;

&lt;p&gt;This post explains why exposing metrics from our applications is so important and how metrics underpin our ability to monitor systems at scale.&lt;/p&gt;

&lt;h1&gt;
  
  
  What are metrics?
&lt;/h1&gt;

&lt;p&gt;I believe that Digital Ocean's page on &lt;a href="https://www.digitalocean.com/community/tutorials/an-introduction-to-metrics-monitoring-and-alerting" rel="noopener noreferrer"&gt;Metrics, Monitoring and Alerting&lt;/a&gt; gives a very well-written explanation about these concepts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Metrics, monitoring, and alerting are all interrelated concepts that together form the basis of a monitoring system. They have the ability to provide visibility into the health of your systems, help you understand trends in usage or behavior, and to understand the impact of changes you make. If the metrics fall outside of your expected ranges, these systems can send notifications to prompt an operator to take a look, and can then assist in surfacing information to help identify the possible causes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;These metrics can be sourced from the multiple layers of abstraction our applications run on top of. Some examples are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Host-based (Bare metal, VM, Container)&lt;/strong&gt;: CPU usage, Memory usage, Disk space, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Container Orchestration (e.g. Kubernetes)&lt;/strong&gt;: Pod restart count, CPU Throttling, Container health status (as shown on &lt;a href="https://dev.to/schmittfelipe/cloud-native-monitoring-at-scale-application-s-health-17n7"&gt;my previous post&lt;/a&gt;), etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Application&lt;/strong&gt;: the most common methodologies for application metrics are RED (rate, errors, duration), USE (utilization, saturation, errors) and the Four Golden Signals (latency, traffic, errors, saturation);&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Why are metrics important?
&lt;/h1&gt;

&lt;p&gt;Metrics are a way to better understand the behavior and symptoms of our systems, enabling us to act on them and potentially prevent worse scenarios such as service degradation or even downtime.&lt;/p&gt;

&lt;p&gt;A good way to look at metrics is through an analogy with a very well-known complex system that we interact with every single day: &lt;strong&gt;the human body&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Our own bodies expose metrics all the time such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Body temperature&lt;/li&gt;
&lt;li&gt;O&lt;sub&gt;2&lt;/sub&gt; saturation&lt;/li&gt;
&lt;li&gt;Heart rate&lt;/li&gt;
&lt;li&gt;etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw71l0ojys0gsi1ncbw3z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fw71l0ojys0gsi1ncbw3z.png" alt="Human vitals monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These metrics give us (or doctors) a deeper understanding of the overall status of our (body) systems, and let us define thresholds beyond which abnormal behavior triggers an alert or concern.&lt;/p&gt;

&lt;p&gt;For example, if our body temperature is above 38°C (100.4°F), we identify this symptom as a fever. A fever can have multiple root causes (infectious disease, immunological disease, metabolic disorder, etc.), but it tells us that something in our overall system is outside the "normal" threshold and should be investigated further.&lt;/p&gt;

&lt;p&gt;Our applications are no different: exposing metrics such as response latency, error rate or number of cache-miss events allows us to monitor them, notice whenever abnormal behavior is present, and start investigating what the root cause might be.&lt;/p&gt;

&lt;h1&gt;
  
  
  Exposing metrics
&lt;/h1&gt;

&lt;p&gt;In order to take full advantage of these capabilities, we need to expose metrics for every component that can affect the overall health of our application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scenario
&lt;/h2&gt;

&lt;p&gt;Let's picture a scenario where we have a Golang application that reads from a queue and sends emails. This application runs in a container, on top of an orchestrator like Kubernetes, which in turn runs on our on-prem bare-metal machines.&lt;/p&gt;

&lt;p&gt;In this scenario we can clearly split the three levels of abstraction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bare metal machine (Host layer)&lt;/li&gt;
&lt;li&gt;Kubernetes (Orchestration layer)&lt;/li&gt;
&lt;li&gt;Our Application (Application layer)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As we want to have as much information as possible regarding all layers that could affect our application, we would need to expose metrics of each of these layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Host layer
&lt;/h3&gt;

&lt;p&gt;To collect metrics about our host layer, such as CPU usage, disk usage/availability, network traffic, etc., we can take advantage of open-source metrics exporters that not only expose these metrics, but also make them collectable by the most commonly used tools in the market (e.g. Prometheus / PromQL).&lt;/p&gt;

&lt;p&gt;All of the metrics mentioned above can be exposed by running &lt;a href="https://prometheus.io/docs/guides/node-exporter/" rel="noopener noreferrer"&gt;Node Exporter&lt;/a&gt; on your host, which serves them on a specific port (default: 9100), for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;HELP&lt;/span&gt; &lt;span class="n"&gt;go_gc_duration_seconds&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="n"&gt;of&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;GC&lt;/span&gt; &lt;span class="n"&gt;invocation&lt;/span&gt; &lt;span class="n"&gt;durations&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;TYPE&lt;/span&gt; &lt;span class="n"&gt;go_gc_duration_seconds&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;
&lt;span class="n"&gt;go_gc_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;3.8996e-05&lt;/span&gt;
&lt;span class="n"&gt;go_gc_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0.25"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;4.5926e-05&lt;/span&gt;
&lt;span class="n"&gt;go_gc_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;quantile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"0.5"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;5.846e-05&lt;/span&gt;
&lt;span class="err"&gt;#&lt;/span&gt; &lt;span class="n"&gt;etc&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These metrics can then be collected by an external system, as described in the metrics collection section below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Orchestration layer
&lt;/h3&gt;

&lt;p&gt;An orchestration layer such as Kubernetes eases the operational effort of managing all of our workloads, but it also adds complexity on top: another layer that can lead to operational issues if not monitored properly.&lt;/p&gt;

&lt;p&gt;An orchestration layer can provide metrics such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deployment/Pod status (Running, Pending, Error, etc.)&lt;/li&gt;
&lt;li&gt;Container resource usage (CPU, Memory, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the case of &lt;strong&gt;Kubernetes&lt;/strong&gt;, it can expose metrics about some of its resources, such as: DaemonSet, Job, Deployment, CronJob, PersistentVolume, StorageClass, ResourceQuota, Namespaces, Pod, VerticalPodAutoscaler, etc.&lt;/p&gt;

&lt;p&gt;These metrics can be exposed by tools such as &lt;a href="https://github.com/kubernetes/kube-state-metrics" rel="noopener noreferrer"&gt;kube-state-metrics&lt;/a&gt;, which also makes them available for collection.&lt;/p&gt;
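
&lt;p&gt;For instance, kube-state-metrics exposes series like the following (the pod and deployment names here are illustrative):&lt;/p&gt;

```text
kube_pod_status_phase{namespace="default",pod="app-basic-5499dbdcc-4xlmm",phase="Running"} 1
kube_pod_container_status_restarts_total{namespace="default",pod="app-basic-5499dbdcc-4xlmm",container="app"} 0
kube_deployment_status_replicas_available{namespace="default",deployment="app-basic"} 2
```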

&lt;h3&gt;
  
  
  Application layer
&lt;/h3&gt;

&lt;p&gt;This layer varies the most and can be tailor-made to whatever business value your application aims to bring. In most cases, it requires exploring which key metrics about our application's use-case matter most when we want to understand its behavior and automate its monitoring.&lt;/p&gt;

&lt;p&gt;In our scenario, the application reads messages from a queue and, based on some business logic, sends an email with a certain template.&lt;/p&gt;

&lt;p&gt;It might be useful to expose metrics about the usage of our application and its ability to fulfill its business value, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;email_messages_processed&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"sales_approved"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;532&lt;/span&gt;
&lt;span class="n"&gt;email_messages_processed&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"sales_rejected"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;21&lt;/span&gt;
&lt;span class="n"&gt;email_messages_read&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"finance_queue"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;68&lt;/span&gt;
&lt;span class="n"&gt;email_messages_read&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"human_resources_queue"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;11&lt;/span&gt;
&lt;span class="n"&gt;email_messages_read&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"management_queue"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;14&lt;/span&gt;
&lt;span class="n"&gt;email_messages_failed&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"unknown"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;7&lt;/span&gt;
&lt;span class="n"&gt;email_messages_failed&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"timeout"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can be achieved by adding instrumentation to our codebase and exposing the metrics under an endpoint (e.g. &lt;code&gt;/metrics&lt;/code&gt;). To understand how, feel free to go through &lt;a href="https://prometheus.io/docs/guides/go-application/" rel="noopener noreferrer"&gt;Prometheus's guide&lt;/a&gt; using any of their client libraries for &lt;a href="https://github.com/prometheus/client_golang" rel="noopener noreferrer"&gt;&lt;code&gt;Golang&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prometheus/client_python" rel="noopener noreferrer"&gt;&lt;code&gt;Python&lt;/code&gt;&lt;/a&gt;, &lt;a href="https://github.com/prometheus/client_ruby" rel="noopener noreferrer"&gt;&lt;code&gt;Ruby&lt;/code&gt;&lt;/a&gt; or &lt;a href="https://github.com/prometheus/client_java" rel="noopener noreferrer"&gt;&lt;code&gt;Java&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Implementing and exposing these metrics enables our monitoring stack not only to track the health of our underlying systems, but also to track metrics tied to our specific business logic and identify issues in our codebase.&lt;/p&gt;

&lt;h1&gt;
  
  
  Metrics collection and analysis
&lt;/h1&gt;

&lt;p&gt;Once we have the metrics of all these different layers being exposed, we need to collect and store them in a way that lets us query and analyze their values over time. &lt;a href="https://prometheus.io" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt;, the second project hosted by the CNCF (right after Kubernetes), is the cloud-native industry standard for metrics collection, monitoring and alerting. The stack can be composed of the following components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus (Metrics Collection and Monitoring)&lt;/li&gt;
&lt;li&gt;Alertmanager (Alerts management)&lt;/li&gt;
&lt;li&gt;Grafana (Metrics Visualisation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc8gxetrbdt8le4cbfepo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fc8gxetrbdt8le4cbfepo.png" alt="Prometheus architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics collection and monitoring
&lt;/h2&gt;

&lt;p&gt;Prometheus's main responsibility is to scrape target endpoints and collect metrics across multiple systems, through:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pre-defined HTTP endpoints (e.g. &lt;a href="https://example.com/metrics" rel="noopener noreferrer"&gt;https://example.com/metrics&lt;/a&gt;);&lt;/li&gt;
&lt;li&gt;Service Discovery (e.g. Kubernetes Prometheus Operator &lt;a href="https://coreos.com/blog/the-prometheus-operator.html" rel="noopener noreferrer"&gt;ServiceMonitor&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once these metrics are collected, Prometheus lets us use its own query language, &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" rel="noopener noreferrer"&gt;PromQL&lt;/a&gt;, to analyze them and learn more about all of our systems, e.g.:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="s"&gt;"email_processor"&lt;/span&gt;&lt;span class="p"&gt;}}[&lt;/span&gt;&lt;span class="m"&gt;5&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or check if there's any container in a non-ready status within our Kubernetes cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt; &lt;span class="n"&gt;kube_pod_container_status_ready&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By leveraging our ability to query the metrics of all our systems, including our own application metrics, we can create so-called alerting rules: expressions that trigger an alert whenever they return results, for example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;alert: EmailProcessTimeout
expr: rate(email_messages_failed{reason="timeout"}[5m]) &amp;gt; 0.5
for: 5m
labels:
  severity: warning
annotations:
  message: Multiple email message deliveries have suddenly failed with reason Timeout over the last 5 minutes.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the rate of emails failing due to timeouts suddenly rises over a 5-minute window, this alert fires and is sent to an Alertmanager to be routed accordingly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Alert management
&lt;/h2&gt;

&lt;p&gt;As mentioned earlier, another core component of the Prometheus stack is the &lt;a href="https://www.prometheus.io/docs/alerting/latest/alertmanager/" rel="noopener noreferrer"&gt;Alertmanager&lt;/a&gt;, whose core focus is handling alerts sent by client applications such as Prometheus.&lt;/p&gt;

&lt;p&gt;Alertmanager not only handles the routing of these alerts to external vendors such as &lt;a href="https://www.pagerduty.com" rel="noopener noreferrer"&gt;PagerDuty&lt;/a&gt;, &lt;a href="https://www.atlassian.com/software/opsgenie" rel="noopener noreferrer"&gt;OpsGenie&lt;/a&gt;, &lt;a href="https://slack.com/" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;, Email, etc. but it also handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deduplication&lt;/li&gt;
&lt;li&gt;Grouping&lt;/li&gt;
&lt;li&gt;Silencing and inhibition&lt;/li&gt;
&lt;li&gt;High Availability (deployed as a cluster)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are highly useful core capabilities that help prevent &lt;a href="https://en.wikipedia.org/wiki/Alarm_fatigue" rel="noopener noreferrer"&gt;alert fatigue&lt;/a&gt; and better handle alerts across an environment.&lt;/p&gt;

&lt;p&gt;Alertmanager also allows routing alerts based on predefined labels, such as the &lt;code&gt;severity: warning&lt;/code&gt; label defined in the alert example above.&lt;/p&gt;
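
&lt;p&gt;As a minimal, hypothetical fragment of an Alertmanager configuration (the receiver names and Slack channel are made up, and the Slack webhook URL and other global settings are omitted), warning-severity alerts could be routed like this:&lt;/p&gt;

```yaml
route:
  receiver: default        # fallback receiver for anything unmatched
  routes:
    - match:
        severity: warning  # label set by the alerting rule above
      receiver: slack-warnings

receivers:
  - name: default          # no integrations: acts as a blackhole
  - name: slack-warnings
    slack_configs:
      - channel: "#alerts"
```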

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Exposing, collecting and analyzing metrics from every layer of our infrastructure, orchestration and application (any of which, at the end of the day, can hinder our capacity to generate business value) is not just important but essential to enable early detection of issues in our engineering solutions.&lt;/p&gt;

&lt;p&gt;Leveraging these tools to increase observability across our entire stack not only makes an SRE team's job easier, but also gives us the ability to analyze metrics and their impact on the overall health of the system.&lt;/p&gt;

&lt;p&gt;If you feel like discussing how monitoring not only your infrastructure, but every layer that can impact your business, is crucial to your organization, reach out to me &lt;a href="https://twitter.com/schmittfelipe" rel="noopener noreferrer"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
    <item>
      <title>Cloud Native Monitoring at Scale - Application's Health</title>
      <dc:creator>Felipe Schmitt</dc:creator>
      <pubDate>Sun, 29 Nov 2020 18:00:03 +0000</pubDate>
      <link>https://dev.to/schmittfelipe/cloud-native-monitoring-at-scale-application-s-health-17n7</link>
      <guid>https://dev.to/schmittfelipe/cloud-native-monitoring-at-scale-application-s-health-17n7</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;As we move towards a Cloud Native world, where workloads are ephemeral, horizontal scaling is key and microservices are the norm, monitoring all of these spread-out components becomes not just useful but mandatory in any production-ready environment.&lt;/p&gt;

&lt;p&gt;We will navigate through a series named &lt;strong&gt;Cloud Native Monitoring at Scale&lt;/strong&gt;, which covers all stages of monitoring a cloud-native application deployed on Kubernetes: from checking whether a single running application is up and behaving as expected (this post) all the way to running multiple applications across multiple k8s clusters simultaneously.&lt;/p&gt;

&lt;h1&gt;
  
  
  Cloud Native Application's health
&lt;/h1&gt;

&lt;p&gt;In this blog post we will go through the process of developing a cloud-native application and making sure it is up and running at the &lt;code&gt;pod&lt;/code&gt; level, by leveraging Kubernetes' built-in &lt;code&gt;Readiness&lt;/code&gt; and &lt;code&gt;Liveness&lt;/code&gt; probes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Goal
&lt;/h3&gt;

&lt;p&gt;Create an application that replies with &lt;code&gt;pong&lt;/code&gt; on a &lt;code&gt;/ping&lt;/code&gt; endpoint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Dockerize our development environment;&lt;/li&gt;
&lt;li&gt;Develop an application on a "cloud-native friendly" programming language such as Go, Rust or Deno;&lt;/li&gt;
&lt;li&gt;Describe and implement all tests to validate most (if not all) of our use-cases;&lt;/li&gt;
&lt;li&gt;Deploy our application to our test/integration environment, make all smoke tests and validations;&lt;/li&gt;
&lt;li&gt;Get the green light and promote the application to a production environment;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once we have gone through all these steps, we are able to see that the application is up and running in our k8s namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
app-basic-5499dbdcc-4xlmm   1/1     Running   0          6s
app-basic-5499dbdcc-j84bl   1/1     Running   0          6s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We can even validate in our production environment that everything is working as expected:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ curl --max-time 10 http://kubernetes.docker.internal:31792/ping
{"msg":"pong"}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Go live!
&lt;/h3&gt;

&lt;p&gt;Our application has been successfully implemented and everything seems to be working just fine, so we are live and can now provide this service to our customers without a problem!&lt;/p&gt;

&lt;p&gt;Except that, a few minutes later, we start getting emails and calls from customers saying the service is not working as expected. Weird, right? Let's check if the pod is running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ kubectl get pods
NAME                        READY   STATUS    RESTARTS   AGE
app-basic-6b6dd6b98f-6t7jl   1/1     Running   0          57s
app-basic-6b6dd6b98f-pzgxr   1/1     Running   0          57s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is running just fine; it says so right there under &lt;code&gt;STATUS&lt;/code&gt; as &lt;code&gt;Running&lt;/code&gt;, so there must be an issue on the customer's side, right? Let's do a test ourselves, just for sanity's sake:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ curl --max-time 10 http://kubernetes.docker.internal:31792/ping
curl: (28) Operation timed out after 10003 milliseconds with 0 bytes received
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is awkward: we have tested our application and the deployment is working just fine, yet apparently after a while our application stops working, and this is not reflected in our Kubernetes environment.&lt;/p&gt;

&lt;p&gt;For this exercise, we have created a small "kill switch" that breaks the application 30 seconds after it first starts running, as we can see in the logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ kubectl logs app-basic-6b6dd6b98f-6t7jl
[11/29/2020, 6:49:42 PM] Running on http://0.0.0.0:8080
[11/29/2020, 6:50:12 PM] Upsie, I'm dead...
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes to the rescue
&lt;/h3&gt;

&lt;p&gt;Well, this is where the Kubernetes concept of &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/"&gt;Readiness and Liveness probes&lt;/a&gt; comes in handy: it allows the &lt;a href="https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/"&gt;kubelet&lt;/a&gt; to periodically probe an endpoint to understand a pod's state.&lt;/p&gt;

&lt;h4&gt;
  
  
  Liveness vs. Readiness probe
&lt;/h4&gt;

&lt;p&gt;Although these two concepts can initially be confusing in terms of their responsibilities, they become clearer once we explore each one's main purpose.&lt;/p&gt;

&lt;p&gt;An application's health can be split into two main statuses:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Alive&lt;/strong&gt; (liveness): The application is up and running and working as expected;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ready&lt;/strong&gt; (readiness): The application is ready to receive new requests;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's imagine we have an application that initially loads an in-memory cache to enable replying to external requests faster and this process takes roughly 15 seconds. This means that the application is &lt;code&gt;alive&lt;/code&gt; and &lt;code&gt;running&lt;/code&gt; from &lt;code&gt;0s&lt;/code&gt; but it will only be &lt;code&gt;ready&lt;/code&gt; at around &lt;code&gt;15s&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Having these probes allows Kubernetes to execute crucial orchestration tasks, such as routing requests from a service to a specific pod only when it is up and ready, as well as setting thresholds to restart a pod when it is not alive/ready for a certain amount of time. &lt;/p&gt;

&lt;p&gt;From the orchestration perspective, Kubernetes has got us covered; we simply add these parameters to our deployment manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="nn"&gt;...&lt;/span&gt;
&lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;app&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pong&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
    &lt;span class="na"&gt;livenessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/health&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;readinessProbe&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;httpGet&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;/ready&lt;/span&gt;
        &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;initialDelaySeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15&lt;/span&gt;
      &lt;span class="na"&gt;periodSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5&lt;/span&gt;
      &lt;span class="na"&gt;timeoutSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
      &lt;span class="na"&gt;failureThreshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="s"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The implementation of the &lt;code&gt;/health&lt;/code&gt; and &lt;code&gt;/ready&lt;/code&gt; endpoints is the application's responsibility, as they can have different meanings across different kinds of applications. In our specific case, the two concepts overlap, and we can use the &lt;code&gt;/ping&lt;/code&gt; endpoint to check whether our application is both alive and ready.&lt;/p&gt;

&lt;p&gt;If we look back at our &lt;code&gt;pong&lt;/code&gt; application: with &lt;code&gt;timeoutSeconds&lt;/code&gt; set to 1s and &lt;code&gt;periodSeconds&lt;/code&gt; set to 5s (the kubelet probes the endpoint every 5s), Kubernetes would detect that the endpoint was taking more than 1 second to respond, restart the pod, and let our application receive requests again. &lt;/p&gt;

&lt;p&gt;Of course, this would not necessarily fix the root cause, but it would &lt;em&gt;decrease the impact on our customers&lt;/em&gt; and also signal that something was wrong, since we would see &lt;em&gt;multiple triggered restarts&lt;/em&gt; on our pods.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ kubectl get pods
NAME                         READY   STATUS    RESTARTS   AGE
app-basic-6b6dd6b98f-6t7jl   1/1     Running   4          2m33s
app-basic-6b6dd6b98f-pzgxr   1/1     Running   4          2m33s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Under this scenario, we could investigate a bit further to understand why we had so many restarts under our pod's events:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;→ kubectl get event --field-selector involvedObject.name=app-basic-6b6dd6b98f-6t7jl
LAST SEEN   TYPE      REASON      OBJECT                           MESSAGE
4m          Normal    Scheduled   pod/app-basic-6b6dd6b98f-6t7jl   Successfully assigned default/app-basic-6b6dd6b98f-6t7jl to docker-desktop
95s         Normal    Pulled      pod/app-basic-6b6dd6b98f-6t7jl   Container image "pong:latest" already present on machine
2m10s       Normal    Created     pod/app-basic-6b6dd6b98f-6t7jl   Created container app-basic
2m10s       Normal    Started     pod/app-basic-6b6dd6b98f-6t7jl   Started container app-basic
2m48s       Warning   Unhealthy   pod/app-basic-6b6dd6b98f-6t7jl   Readiness probe failed: Get "http://10.1.0.15:8080/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
97s         Warning   Unhealthy   pod/app-basic-6b6dd6b98f-6t7jl   Liveness probe failed: Get "http://10.1.0.15:8080/ping": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
97s         Normal    Killing     pod/app-basic-6b6dd6b98f-6t7jl   Container app-basic failed liveness probe, will be restarted
2m9s        Warning   Unhealthy   pod/app-basic-6b6dd6b98f-6t7jl   Readiness probe failed: Get "http://10.1.0.15:8080/ping": dial tcp 10.1.0.15:8080: connect: connection refused
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here we can clearly see that the container was killed because both the liveness and readiness probes were timing out.&lt;/p&gt;

&lt;p&gt;This tells us that our application was showing unhealthy symptoms, and that we would need to dig into the code to find out what was causing it to stop responding every 30s.&lt;/p&gt;

&lt;p&gt;Nonetheless, by implementing our &lt;code&gt;Readiness&lt;/code&gt; and &lt;code&gt;Liveness&lt;/code&gt; probes, we have made sure that the pod's health status reflects our application's health, and that Kubernetes is aware of it and able to react accordingly. In this case, restarting the pod whenever it becomes unhealthy brings it back to a healthy state, able to reply &lt;code&gt;pong&lt;/code&gt; to our customers, reducing the downtime of our much-valued service.&lt;/p&gt;
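&lt;p&gt;Note that the probe settings used above are deliberately aggressive: with &lt;code&gt;failureThreshold&lt;/code&gt; set to 1, a single slow response already triggers a restart. As a sketch of a more tolerant configuration (the values here are illustrative, not taken from our manifest), we could allow a few consecutive failures before Kubernetes intervenes:&lt;/p&gt;

```yaml
livenessProbe:
  httpGet:
    path: /ping
    port: http
  initialDelaySeconds: 15
  periodSeconds: 5
  timeoutSeconds: 2        # give slow responses a bit more room
  failureThreshold: 3      # restart only after 3 consecutive failures
```

&lt;p&gt;Which trade-off is right depends on the application: a low threshold recovers faster from hangs, while a higher one avoids restart loops caused by transient slowness.&lt;/p&gt;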

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is the first blog post of our &lt;strong&gt;"Cloud Native monitoring at scale"&lt;/strong&gt; series, in which we will evolve from exposing a single application's health to monitoring at scale, and ultimately leverage this foundation to build a full-fledged, automated system that alerts us whenever something goes wrong across our application, system, or organization.&lt;/p&gt;

&lt;p&gt;If you would like to discuss how monitoring not only your infrastructure, but every layer that can impact your business, is crucial to your organization, reach out to me &lt;a href="https://twitter.com/schmittfelipe"&gt;on Twitter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>devops</category>
      <category>cloudnative</category>
    </item>
  </channel>
</rss>
