<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Javier Martínez</title>
    <description>The latest articles on DEV Community by Javier Martínez (@javicps).</description>
    <link>https://dev.to/javicps</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F905745%2F290f42a9-9755-4168-a20f-7e2ebb003e42.jpg</url>
      <title>DEV Community: Javier Martínez</title>
      <link>https://dev.to/javicps</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/javicps"/>
    <language>en</language>
    <item>
      <title>Top metrics for Elasticsearch monitoring with Prometheus</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Tue, 09 May 2023 08:18:13 +0000</pubDate>
      <link>https://dev.to/sysdig/top-metrics-for-elasticsearch-monitoring-with-prometheus-3pca</link>
      <guid>https://dev.to/sysdig/top-metrics-for-elasticsearch-monitoring-with-prometheus-3pca</guid>
      <description>&lt;p&gt;Starting the journey for Elasticsearch monitoring is crucial to get the right visibility and transparency over its behavior.&lt;/p&gt;

&lt;p&gt;Elasticsearch is one of the most widely used &lt;strong&gt;search and analytics engines&lt;/strong&gt;. It provides both scalability and redundancy to deliver highly available search. As of 2023, more than sixty thousand companies of all sizes and backgrounds use it as their search solution to track a diverse range of data, such as analytics, logs, or business information.&lt;/p&gt;

&lt;p&gt;By distributing data as JSON documents and indexing that data across several shards, Elasticsearch provides high availability, fast search, and redundancy.&lt;/p&gt;

&lt;p&gt;In this article, we will evaluate the most important Prometheus metrics provided by the Elasticsearch exporter.&lt;/p&gt;

&lt;p&gt;You will learn the main areas to focus on when monitoring an Elasticsearch system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start monitoring Elasticsearch with Prometheus.&lt;/li&gt;



&lt;li&gt;How to monitor Golden Signals.&lt;/li&gt;



&lt;li&gt;How to monitor infra metrics.&lt;/li&gt;



&lt;li&gt;How to monitor index performance.&lt;/li&gt;



&lt;li&gt;How to monitor search performance.&lt;/li&gt;



&lt;li&gt;How to monitor cluster performance.&lt;/li&gt;



&lt;li&gt;Advanced monitoring and next steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="start-monitoring"&gt;How to start monitoring ElasticSearch with Prometheus&lt;/h2&gt;

&lt;p&gt;As usual, the easiest way to start your Prometheus monitoring journey with Elasticsearch is to use &lt;a href="https://promcat.io" rel="noopener nofollow noreferrer"&gt;PromCat.io&lt;/a&gt; to find the best configs, dashboards, and alerts. The &lt;a href="https://promcat.io/apps/elasticsearch#SetupGuide" rel="noopener nofollow noreferrer"&gt;Elasticsearch setup guide in PromCat&lt;/a&gt; includes the Elasticsearch exporter with a series of out-of-the-box metrics that will be automatically scraped by Prometheus. It also includes a collection of curated alerts and dashboards to start monitoring Elasticsearch right away.&lt;/p&gt;
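&lt;p&gt;If you prefer to wire the exporter up manually, the scrape job is a short addition to &lt;code&gt;prometheus.yml&lt;/code&gt;. This is a minimal sketch, assuming the exporter is reachable at the hypothetical address &lt;code&gt;elasticsearch-exporter:9114&lt;/code&gt; (9114 is the exporter’s default port):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;scrape_configs:
  - job_name: "elasticsearch"
    static_configs:
      - targets: ["elasticsearch-exporter:9114"]&lt;/code&gt;&lt;/pre&gt;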

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZfNQKQOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image1-1170x383.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZfNQKQOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image1-1170x383.png" alt="Top metrics for Elasticsearch - metric list" title="image_tooltip" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can combine these metrics with the &lt;a href="https://sysdig.com/blog/exporters-and-target-labels/" rel="noopener"&gt;Node Exporter&lt;/a&gt; to get more insights into your infrastructure. Also, if you're running Elasticsearch on Kubernetes, you can &lt;a href="https://sysdig.es/blog/kubernetes-monitoring-prometheus/#kube-state-metrics" rel="noopener nofollow noreferrer"&gt;use KSM and CAdvisor&lt;/a&gt; to combine Kubernetes metrics with Elasticsearch metrics.&lt;/p&gt;

&lt;h2 id="monitor-golden-signals"&gt;How to monitor Golden Signals in Elasticsearch&lt;/h2&gt;

&lt;p&gt;To review a bare minimum of important metrics, remember to check the so-called &lt;a href="https://sysdig.com/blog/golden-signals-kubernetes/" rel="noopener"&gt;Golden Signals&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Errors.&lt;/li&gt;



&lt;li&gt;Traffic.&lt;/li&gt;



&lt;li&gt;Saturation.&lt;/li&gt;



&lt;li&gt;Latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These represent the essential metrics to watch in a system for black-box monitoring (focusing only on what’s happening in the system, not why). In other words, Golden Signals measure symptoms, not causes. They are a good starting point for creating an Elasticsearch monitoring dashboard.&lt;/p&gt;

&lt;h3&gt;Errors&lt;/h3&gt;

&lt;h5&gt;elasticsearch_cluster_health_status&lt;/h5&gt;

&lt;p&gt;Cluster health in Elasticsearch is measured by the colors green, yellow, and red, as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: Data integrity is correct, no shard is missing.&lt;/li&gt;



&lt;li&gt;
&lt;strong&gt;Yellow&lt;/strong&gt;: At least one replica shard is unassigned, but data integrity is preserved because the primary shards are available.&lt;/li&gt;



&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt;: A primary shard is missing or unassigned, so some data is unavailable and may be lost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;code&gt;elasticsearch_cluster_health_status&lt;/code&gt;, you can quickly check the current situation for Elasticsearch data on a particular cluster. Remember that this won’t tell you the actual cause of the data integrity loss, only that you need to act to prevent further problems.&lt;/p&gt;
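&lt;p&gt;As a sketch, an alert on this metric can be as simple as the following PromQL expressions, assuming the exporter’s usual labeling of one gauge per color:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Cluster has lost a primary shard
elasticsearch_cluster_health_status{color="red"} == 1

# Cluster is running without some replicas
elasticsearch_cluster_health_status{color="yellow"} == 1&lt;/code&gt;&lt;/pre&gt;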

&lt;h3&gt;Traffic&lt;/h3&gt;

&lt;h5&gt;elasticsearch_indices_search_query_total&lt;/h5&gt;

&lt;p&gt;This metric is a counter with the total number of search queries executed, which, as a raw number, won’t tell you much by itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Consider using &lt;code&gt;rate()&lt;/code&gt; or &lt;code&gt;irate()&lt;/code&gt; as well, to detect sudden changes or spikes in traffic. Dig deeper into Prometheus queries with our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noreferrer noopener"&gt;Getting started with PromQL guide&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
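&lt;p&gt;For example, a per-second search rate over the last five minutes might look like this (the window is an assumption; tune it to your scrape interval):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;rate(elasticsearch_indices_search_query_total[5m])&lt;/code&gt;&lt;/pre&gt;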

&lt;h3&gt;Saturation&lt;/h3&gt;

&lt;p&gt;For a detailed saturation analysis, check the section on How to monitor Elasticsearch infra metrics.&lt;/p&gt;

&lt;h3&gt;Latency&lt;/h3&gt;

&lt;p&gt;For a detailed latency analysis, check the section on How to monitor Elasticsearch index performance.&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-infra-metrics"&gt;How to monitor Elasticsearch infra metrics&lt;/h2&gt;

&lt;p&gt;Infrastructure monitoring focuses on tracking the overall performance of the servers and nodes of a system. As with similar cloud applications, most of the effort will be spent monitoring CPU and memory consumption.&lt;/p&gt;

&lt;h3&gt;Monitoring Elasticsearch CPU&lt;/h3&gt;

&lt;h5&gt;elasticsearch_process_cpu_percent&lt;/h5&gt;

&lt;p&gt;This is a gauge metric used to measure the current CPU usage percent (0-100) of the Elasticsearch process. Since chances are that you’re running several Elasticsearch nodes, you will need to track each one separately.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R1QaaKO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image2-1170x553.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R1QaaKO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image2-1170x553.png" alt="Top metrics for Elasticsearch - CPU usage" title="image_tooltip" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;elasticsearch_indices_store_throttle_time_seconds_total&lt;/h5&gt;

&lt;p&gt;In case you’re using a file system as an index store, you can expect a certain level of delays in input and output operations. This metric represents how much your Elasticsearch index store is being throttled.&lt;/p&gt;

&lt;p&gt;Since this is a counter metric that only accumulates the total number of seconds, consider using &lt;code&gt;rate&lt;/code&gt; or &lt;code&gt;irate&lt;/code&gt; to see how quickly it is changing.&lt;/p&gt;
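&lt;p&gt;A possible query, using a five-minute window as an example, would be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Seconds of throttling per second; anything consistently above 0 deserves attention
rate(elasticsearch_indices_store_throttle_time_seconds_total[5m])&lt;/code&gt;&lt;/pre&gt;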

&lt;h3&gt;Monitoring Elasticsearch JVM Memory&lt;/h3&gt;

&lt;p&gt;Elasticsearch is based on &lt;a rel="noopener nofollow noreferrer" href="https://lucene.apache.org/"&gt;Lucene&lt;/a&gt;, which is written in Java. This means that monitoring the Java Virtual Machine (JVM) memory is crucial to understanding the current usage of the whole system.&lt;/p&gt;

&lt;h5&gt;elasticsearch_jvm_memory_used_bytes&lt;/h5&gt;

&lt;p&gt;This metric is a gauge that represents the memory usage in bytes for each area.&lt;/p&gt;
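&lt;p&gt;To turn this into a percentage, you can divide it by the maximum heap available. This sketch assumes the exporter also exposes &lt;code&gt;elasticsearch_jvm_memory_max_bytes&lt;/code&gt; with a matching &lt;code&gt;area&lt;/code&gt; label:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;100 * elasticsearch_jvm_memory_used_bytes{area="heap"}
  / elasticsearch_jvm_memory_max_bytes{area="heap"}&lt;/code&gt;&lt;/pre&gt;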

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7PmWYJ4H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image3-1170x455.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7PmWYJ4H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image3-1170x455.png" alt="Top metrics for Elasticsearch - Memory used" title="image_tooltip" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-index-performance"&gt;How to monitor Elasticsearch index performance&lt;/h2&gt;

&lt;p&gt;An index in Elasticsearch is a logical namespace that partitions data. Elasticsearch indexes documents so that they can be retrieved or searched as fast as possible.&lt;/p&gt;

&lt;p&gt;Every time a new index is created, you can define the number of shards and replicas for it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}&lt;/code&gt;&lt;/pre&gt;
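&lt;p&gt;For instance, you could apply those settings when creating a hypothetical index called &lt;code&gt;my-index&lt;/code&gt; through the REST API (assuming Elasticsearch listens on localhost:9200):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;curl -X PUT "http://localhost:9200/my-index" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}'&lt;/code&gt;&lt;/pre&gt;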

&lt;h5&gt;elasticsearch_indices_indexing_index_time_seconds_total&lt;/h5&gt;

&lt;p&gt;This metric is a counter of the accumulated seconds spent on indexing. It gives you a good approximation of Elasticsearch indexing performance.&lt;/p&gt;

&lt;p&gt;Note that you can divide this metric by &lt;code&gt;elasticsearch_indices_indexing_index_total&lt;/code&gt; in order to get the average indexing time per operation.&lt;/p&gt;
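&lt;p&gt;In PromQL, dividing the rates of the two counters gives the recent average rather than an all-time average; the five-minute window here is just an example:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;rate(elasticsearch_indices_indexing_index_time_seconds_total[5m])
  / rate(elasticsearch_indices_indexing_index_total[5m])&lt;/code&gt;&lt;/pre&gt;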

&lt;h5&gt;elasticsearch_indices_refresh_time_seconds_total&lt;/h5&gt;

&lt;p&gt;For an index to be searchable, Elasticsearch needs a refresh to be executed. This is controlled by the &lt;code&gt;index.refresh_interval&lt;/code&gt; setting, which defaults to one second.&lt;/p&gt;

&lt;p&gt;The metric &lt;code&gt;elasticsearch_indices_refresh_time_seconds_total&lt;/code&gt; is a counter with the total time spent refreshing in Elasticsearch.&lt;/p&gt;

&lt;p&gt;In case you want to measure the average refresh time, you can divide this metric by &lt;code&gt;elasticsearch_indices_refresh_total&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-search-performance"&gt;How to monitor Elasticsearch search performance&lt;/h2&gt;

&lt;p&gt;While Elasticsearch promises near-instant query speed, chances are that in the real world, you may find that this is not the case. The number of shards, the chosen storage solution, or the cache configuration can all impact search performance, and it’s crucial to track the current behavior.&lt;/p&gt;

&lt;p&gt;Additionally, the use of wildcards, joins, or the number of fields being searched will drastically affect the overall processing time of search queries.&lt;/p&gt;

&lt;h5&gt;elasticsearch_indices_search_fetch_time_seconds&lt;/h5&gt;

&lt;p&gt;A counter metric aggregating the total number of seconds spent fetching search results.&lt;/p&gt;

&lt;p&gt;In case you want to retrieve the average fetch time per operation, just divide the result by &lt;code&gt;elasticsearch_indices_search_fetch_total&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-cluster"&gt;How to monitor Elasticsearch cluster performance&lt;/h2&gt;

&lt;p&gt;Apart from the usual cloud requirements, for an Elasticsearch system you will also want to look at:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of shards.&lt;/li&gt;



&lt;li&gt;Number of replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a rule of thumb, keep fewer than 20 shards per GB of heap space.&lt;/p&gt;

&lt;p&gt;Note as well that it’s suggested to have a separate cluster dedicated to monitoring.&lt;/p&gt;

&lt;h5&gt;elasticsearch_cluster_health_active_shards&lt;/h5&gt;

&lt;p&gt;This metric is a gauge that indicates the number of active shards (both primaries and replicas) across the cluster.&lt;/p&gt;

&lt;h5&gt;elasticsearch_cluster_health_relocating_shards&lt;/h5&gt;

&lt;p&gt;Elasticsearch will dynamically move shards between nodes based on balancing or current usage. With this metric, you can track when this movement is happening.&lt;/p&gt;
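&lt;p&gt;A simple expression to surface ongoing relocation, sketched as an alert condition, could be:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;elasticsearch_cluster_health_relocating_shards &amp;gt; 0&lt;/code&gt;&lt;/pre&gt;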

&lt;h2 id="advanced-monitoring"&gt;Advanced Monitoring&lt;/h2&gt;

&lt;p&gt;Remember that the Prometheus exporter will give you a set of out-of-the-box metrics that are relevant enough to kickstart your monitoring journey. But the real challenge comes when you take the step to create your own custom metrics tailored to your application.&lt;/p&gt;

&lt;h3&gt;REST API&lt;/h3&gt;

&lt;p&gt;Additionally, mind that Elasticsearch &lt;a rel="noopener nofollow noreferrer" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html"&gt;provides a REST API&lt;/a&gt; that you can query for more fine-grained monitoring.&lt;/p&gt;

&lt;h3&gt;VisualVM&lt;/h3&gt;

&lt;p&gt;The &lt;a rel="noopener nofollow noreferrer" href="https://visualvm.github.io/"&gt;Java VisualVM&lt;/a&gt; project is an advanced dashboard for Memory and CPU monitoring. It features advanced resource visualization, as well as process and thread utilization.&lt;/p&gt;

&lt;h2&gt;Download the Dashboards&lt;/h2&gt;

&lt;p&gt;You can download the dashboards with the metrics seen in this article &lt;a rel="noopener nofollow noreferrer" href="https://promcat.io/apps/elasticsearch"&gt;through the Promcat official page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a curated selection of the above metrics that can be easily integrated with your Grafana or Sysdig Monitor solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cidTehhY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image4-1170x739.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cidTehhY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image4-1170x739.png" alt="Top metrics for Elasticsearch - Grafana dashboards" title="image_tooltip" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Elasticsearch is one of the most important search engines available, featuring high availability, high scalability, and distributed capabilities through redundancy.&lt;/p&gt;

&lt;p&gt;Using the Elasticsearch exporter for Prometheus you can kickstart the monitoring journey in an easy way, by automatically receiving the important metrics directly.&lt;/p&gt;

&lt;p&gt;As with many other applications, CPU and memory are crucial to understanding system saturation. You should be aware of current CPU throttling and the memory handling of the JVM.&lt;/p&gt;

&lt;p&gt;Finally, it’s important to dig deeper into the particularities of Elasticsearch, like indices and search capabilities, to truly understand the challenges of monitoring and visualization.&lt;/p&gt;





</description>
      <category>elasticsearch</category>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes CreateContainerConfigError and CreateContainerError</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 23 Mar 2023 15:58:05 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-createcontainerconfigerror-and-createcontainererror-1o5a</link>
      <guid>https://dev.to/sysdig/kubernetes-createcontainerconfigerror-and-createcontainererror-1o5a</guid>
      <description>&lt;p&gt;CreateContainerConfigError and CreateContainerError are two of the most prevalent Kubernetes errors found in cloud-native applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CreateContainerConfigError&lt;/strong&gt; is an error happening when the &lt;strong&gt;configuration specified for a container in a Pod is not correct&lt;/strong&gt; or is missing a vital part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CreateContainerError&lt;/strong&gt; is a problem happening&lt;strong&gt; at a later stage&lt;/strong&gt; in the container creation flow. Kubernetes displays this error when it attempts to create the container in the Pod.&lt;/p&gt;

&lt;p&gt;In this article, you will learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Kubernetes CreateContainerConfigError?&lt;/li&gt;



&lt;li&gt;What is Kubernetes CreateContainerError?&lt;/li&gt;



&lt;li&gt;Kubernetes container creation flow&lt;/li&gt;



&lt;li&gt;Common causes for CreateContainerError and CreateConfigError&lt;/li&gt;



&lt;li&gt;How to troubleshoot both errors&lt;/li&gt;



&lt;li&gt;How to detect both errors in Prometheus&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="what-is-createcontainerconfigerror"&gt;What is CreateContainerConfigError?&lt;/h2&gt;

&lt;p&gt;During the process to start a new container, Kubernetes first tries to generate the configuration for it. In fact, this is handled internally by calling a method called &lt;em&gt;generateContainerConfig&lt;/em&gt;, which will try to retrieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container command and arguments&lt;/li&gt;



&lt;li&gt;Relevant persistent volumes for the container&lt;/li&gt;



&lt;li&gt;Relevant ConfigMaps for the container&lt;/li&gt;



&lt;li&gt;Relevant secrets for the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any problem in the elements above will result in a CreateContainerConfigError.  &lt;/p&gt;

&lt;h2 id="what-is-createcontainererror"&gt;What is CreateContainerError?&lt;/h2&gt;

&lt;p&gt;Kubernetes throws a CreateContainerError when there’s a problem in the creation of the container, but unrelated to configuration, like a referenced volume not being accessible or a container name already being used.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;Similar to other problems like &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/" rel="noopener noreferrer"&gt;CrashLoopBackOff&lt;/a&gt;, this article only covers the most common causes, but there are many others depending on your current application.&lt;/em&gt;&lt;span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;How you can detect CreateContainerConfigError and CreateContainerError&lt;/h2&gt;

&lt;p&gt;You can detect both errors by running &lt;code&gt;kubectl get pods&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;NAME  READY STATUS                     RESTARTS AGE
mypod 0/1   CreateContainerConfigError 0        11m&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see from this output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Pod is not ready: a container has an error.&lt;/li&gt;



&lt;li&gt;There are no restarts: these two errors are not like CrashLoopBackOff, where automatic retries are in place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="container-creation-flow"&gt;Kubernetes container creation flow&lt;/h2&gt;

&lt;p&gt;In order to understand CreateContainerError and CreateContainerConfigError, we first need to know the exact flow for container creation.&lt;/p&gt;

&lt;p&gt;Kubernetes follows these steps every time a new container needs to be started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull the image.&lt;/li&gt;



&lt;li&gt;Generate container configuration.&lt;/li&gt;



&lt;li&gt;Precreate container.&lt;/li&gt;



&lt;li&gt;Create container.&lt;/li&gt;



&lt;li&gt;Pre-start container.&lt;/li&gt;



&lt;li&gt;Start container.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, steps 2 and 4 are where a CreateContainerConfigError and a CreateContainerError might appear, respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Createcontainererror-02-1170x464.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FCreatecontainererror-02-1170x464.png" alt="Create container and start container flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="common-causes-createcontainerconfigerror"&gt;Common causes for CreateContainerError and CreateContainerConfigError&lt;/h2&gt;

&lt;h3&gt;Not found ConfigMap&lt;/h3&gt;

&lt;p&gt;Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/configuration/configmap/" rel="noopener noreferrer"&gt;ConfigMaps&lt;/a&gt; are a key element to store non-confidential information to be used by Pods as key-value pairs.&lt;/p&gt;

&lt;p&gt;When adding a ConfigMap reference in a Pod, you are effectively indicating that it should retrieve specific data from it. But, if a Pod references a non-existent ConfigMap, Kubernetes will return a CreateContainerConfigError.&lt;/p&gt;
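&lt;p&gt;As an illustration, the following Pod manifest would trigger the error if the referenced ConfigMap does not exist (all names here are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: nginx
      envFrom:
        - configMapRef:
            name: missing-configmap  # not created, so the Pod shows CreateContainerConfigError&lt;/code&gt;&lt;/pre&gt;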

&lt;h3&gt;Not found Secret&lt;/h3&gt;

&lt;p&gt;Secrets are a more secure manner to store sensitive information in Kubernetes. Remember, though, this is just raw data encoded in base64, so it’s not really encrypted, just obfuscated.&lt;/p&gt;

&lt;p&gt;In case a Pod contains a reference to a non-existent secret, Kubelet will throw a CreateContainerConfigError, indicating that necessary data couldn’t be retrieved in order to form container config.&lt;/p&gt;

&lt;h3&gt;Container name already in use&lt;/h3&gt;

&lt;p&gt;While unusual, in some cases a conflict might occur because a particular container name is already in use. Since every Docker container must have a unique name, you will need to either delete the original or rename the new one being created.&lt;/p&gt;

&lt;h2 id="troubleshoot"&gt;How to troubleshoot CreateContainerError and CreateContainerConfigError&lt;/h2&gt;

&lt;p&gt;While the causes for an error in container creation might vary, you can always rely on the following methods to troubleshoot the problem that’s preventing the container from starting.&lt;/p&gt;

&lt;h3&gt;Describe Pods&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;kubectl describe pod&lt;/code&gt;, you can retrieve the detailed information for the affected Pod and its containers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Containers:
  mycontainer:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           &amp;lt;none&amp;gt;
    Host Port:      &amp;lt;none&amp;gt;
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:  3
---
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      myconfigmap
    Optional:  false&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Get logs from containers&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;kubectl logs&lt;/code&gt; to retrieve the log information from containers in the Pod. Note that for Pods with multiple containers, you need to use the &lt;code&gt;--all-containers&lt;/code&gt; parameter:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (BadRequest): container "mycontainer" in pod "mypod" is waiting to start: CreateContainerConfigError&lt;/code&gt;&lt;/pre&gt;
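&lt;p&gt;For reference, the commands would look like this for a hypothetical Pod named &lt;code&gt;mypod&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl logs mypod
kubectl logs mypod --all-containers&lt;/code&gt;&lt;/pre&gt;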

&lt;h3&gt;Check the events&lt;/h3&gt;

&lt;p&gt;You can also run &lt;code&gt;kubectl get events&lt;/code&gt; to retrieve all the recent events happening in your Pods. Remember that the &lt;code&gt;kubectl describe pod&lt;/code&gt; command also displays the Pod’s events at the end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Createcontainererror-03-1170x586.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FCreatecontainererror-03-1170x586.png" alt="Createcontainerconfig error troubleshooting diagram"&gt;&lt;/a&gt;Terminal windows for the kubectl commands used to troubleshoot a CreateContainerConfigError&lt;/p&gt;



&lt;h2 id="detect-in-prometheus"&gt;How to detect CreateContainerConfigError and CreateContainerError in Prometheus&lt;/h2&gt;

&lt;p&gt;When using Prometheus + kube-state-metrics, you can quickly retrieve Pods that have containers with errors at creation or config steps:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"} &amp;gt; 0
kube_pod_container_status_waiting_reason{reason="CreateContainerError"} &amp;gt; 0&lt;/code&gt;&lt;/pre&gt;
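&lt;p&gt;Wrapped into a Prometheus alerting rule, a sketch covering both reasons at once might look like this (rule names and thresholds are assumptions to adapt):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
  - name: container-creation-errors
    rules:
      - alert: ContainerCreationError
        expr: kube_pod_container_status_waiting_reason{reason=~"CreateContainerConfigError|CreateContainerError"} &amp;gt; 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container in {{ $labels.namespace }}/{{ $labels.pod }} cannot be created"&lt;/code&gt;&lt;/pre&gt;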

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image2-53-1170x997.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage2-53-1170x997.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="other-errors"&gt;Other similar errors&lt;/h2&gt;

&lt;h3&gt;Pending&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noopener noreferrer"&gt;Pending is a Pod status&lt;/a&gt; that appears when the Pod couldn’t even be started. Note that this happens at schedule time, so Kube-scheduler couldn’t find a node because of not enough resources or not proper taints/tolerations config.&lt;/p&gt;

&lt;h3&gt;ContainerCreating&lt;/h3&gt;

&lt;p&gt;ContainerCreating is another waiting status reason that can appear when the container could not be started because of a problem in the execution (e.g., &lt;code&gt;No command specified&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (BadRequest): container "mycontainer" in pod "mypod" is waiting to start: ContainerCreating   &lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;RunContainerError&lt;/h3&gt;

&lt;p&gt;This might be a similar situation to CreateContainerError, but note that this happens during the run step and not the container creation step.&lt;/p&gt;

&lt;p&gt;A RunContainerError most likely points to problems happening at runtime, like attempts to write on a read-only volume.&lt;/p&gt;

&lt;h3&gt;CrashLoopBackOff&lt;/h3&gt;

&lt;p&gt;Remember that &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/" rel="noopener noreferrer"&gt;CrashLoopBackOff&lt;/a&gt; is not technically an error, but the waiting time grace period that is added between retrials.&lt;/p&gt;

&lt;p&gt;Unlike CrashLoopBackOff events, CreateContainerError and CreateContainerConfigError won’t be retried automatically.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this article, you have seen how both CreateContainerConfigError and CreateContainerError are important messages in the Kubernetes container creation process. Being able to detect them and understand at which stage they are happening is crucial for the day-to-day debugging of cloud-native services.&lt;/p&gt;

&lt;p&gt;Also, it’s important to know the internal behavior of the Kubernetes container creation flow and what is errors might appear at each step.&lt;/p&gt;

&lt;p&gt;Finally, CreateContainerConfigError and CreateContainerError might be mistaken for other Kubernetes errors, but these two happen at the container creation stage and are not automatically retried.&lt;/p&gt;









&lt;h2&gt;&lt;em&gt;Troubleshoot CreateContainerError with Sysdig Monitor
&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;With Sysdig Monitor’s Advisor, you can easily detect which containers are having CreateContainerConfigError or CreateContainerError problems in your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image3-39-1170x579.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage3-39-1170x579.png" alt="Rightsize your Kubernetes Resources with Sysdig Monitor"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://sysdig.com/start-free/" rel="noopener noreferrer"&gt;Try it free&lt;/a&gt; for 30 days!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring with Custom Metrics</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 02 Mar 2023 10:19:53 +0000</pubDate>
      <link>https://dev.to/sysdig/monitoring-with-custom-metrics-p52</link>
      <guid>https://dev.to/sysdig/monitoring-with-custom-metrics-p52</guid>
      <description>&lt;p&gt;&lt;strong&gt;Custom metrics&lt;/strong&gt; are application-level or business-related tailored metrics, as opposed to the ones that come directly out-of-the-box from monitoring systems like Prometheus (e.g: kube-state-metrics or node exporter)&lt;/p&gt;

&lt;p&gt;When kickstarting a monitoring project with Prometheus, you might realize that you get an initial set of out-of-the-box metrics from just Node Exporter and Kube State Metrics. But this will only get you so far, since you will just be performing black-box monitoring. How can you go to the next level and observe what’s beyond?&lt;/p&gt;

&lt;p&gt;Custom metrics are an essential part of the day-to-day monitoring of cloud-native systems, as they add a business- and application-level dimension. A custom metric can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics provided by an exporter&lt;/li&gt;



&lt;li&gt;Tailored metrics designed by the customer&lt;/li&gt;



&lt;li&gt;An aggregate from previous existing metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you will see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why custom metrics are important&lt;/li&gt;

&lt;li&gt;When to use custom metrics&lt;/li&gt;

&lt;li&gt;Considerations when creating custom metrics&lt;/li&gt;

&lt;li&gt;Kubernetes Metric API&lt;/li&gt;

&lt;li&gt;Prometheus custom metrics&lt;/li&gt;

&lt;li&gt;Challenges when using custom metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why custom metrics are important&lt;/h2&gt;

&lt;p&gt;Custom metrics allow companies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor Key Performance Indicators (KPIs).&lt;/li&gt;



&lt;li&gt;Detect issues faster.&lt;/li&gt;



&lt;li&gt;Track resource utilization.&lt;/li&gt;



&lt;li&gt;Measure latency.&lt;/li&gt;



&lt;li&gt;Track specific values from their services and systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of custom metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency of transactions in milliseconds.&lt;/li&gt;

&lt;li&gt;Open database connections.&lt;/li&gt;

&lt;li&gt;Percentage of cache hits vs. cache misses.&lt;/li&gt;

&lt;li&gt;Orders or sales in an e-commerce site.&lt;/li&gt;

&lt;li&gt;Percentage of slow responses.&lt;/li&gt;

&lt;li&gt;Percentage of resource-intensive responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, any metric retrieved from an exporter or created ad hoc fits the definition of a custom metric.&lt;/p&gt;

&lt;h2&gt;When to use Custom Metrics&lt;/h2&gt;

&lt;h3&gt;Autoscaling&lt;/h3&gt;

&lt;p&gt;By providing specific visibility over your system, you can define rules on how the workload should scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Horizontal autoscaling: add or remove replicas of a Pod.&lt;/li&gt;



&lt;li&gt;Vertical autoscaling: modify limits and requests of a container.&lt;/li&gt;



&lt;li&gt;Cluster autoscaling: add or remove nodes in a cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to dig deeper, check &lt;a href="https://sysdig.com/blog/kubernetes-autoscaler/" rel="noopener"&gt;this article about autoscaling in Kubernetes&lt;/a&gt;.&lt;/p&gt;
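&lt;p&gt;As a concrete sketch, a HorizontalPodAutoscaler can scale on a custom metric served through &lt;code&gt;custom.metrics.k8s.io&lt;/code&gt;. The deployment and metric names below (&lt;code&gt;checkout&lt;/code&gt;, &lt;code&gt;orders_per_second&lt;/code&gt;) are hypothetical:&lt;/p&gt;

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout        # hypothetical deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: orders_per_second   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "50"
```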

&lt;h3&gt;Latency monitoring&lt;/h3&gt;

&lt;p&gt;Latency measures the time it takes for a system to serve a request. This monitoring &lt;a href="https://sysdig.com/blog/golden-signals-kubernetes/" rel="noopener"&gt;golden signal&lt;/a&gt; is essential to understand what the end-user experience for your application is.&lt;/p&gt;

&lt;p&gt;These are considered custom metrics as they are not part of the out-of-the-box set of metrics coming from Kube State Metrics or Node Exporter. In order to measure latency, you might want to either track individual systems (database, API) or end-to-end.&lt;/p&gt;

&lt;h3&gt;Application level monitoring&lt;/h3&gt;

&lt;p&gt;Kube-state-metrics or node-exporter might be a good &lt;a href="https://sysdig.com/blog/cloud-monitoring-journey/" rel="noopener"&gt;starting point for observability&lt;/a&gt;, but they just scratch the surface as they perform black-box monitoring. By instrumenting your own application and services, you create a curated and personalized set of metrics for your own particular case.&lt;/p&gt;
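&lt;p&gt;Instrumentation ultimately boils down to exposing samples in the Prometheus text exposition format on a /metrics endpoint. A minimal sketch in plain Python of what one exposed sample looks like (in practice you would use a client library such as prometheus_client):&lt;/p&gt;

```python
def format_metric(name, labels, value):
    """Render one sample in the Prometheus text exposition format:
    metric_name{label="value",...} value"""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{label_str}}} {value}"

# A hypothetical business metric for an e-commerce service:
line = format_metric("orders_total", {"env": "prod", "region": "eu"}, 42)
# orders_total{env="prod",region="eu"} 42
```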

&lt;h2&gt;Considerations when creating Custom Metrics&lt;/h2&gt;

&lt;h3&gt;Naming&lt;/h3&gt;

&lt;p&gt;Check for any &lt;a href="https://prometheus.io/docs/practices/naming/" rel="noreferrer noopener"&gt;existing naming conventions&lt;/a&gt;, as a new name might collide with existing ones or simply be confusing. A custom metric’s name is the first description of its purpose.&lt;/p&gt;

&lt;h3&gt;Labels&lt;/h3&gt;

&lt;p&gt;Thanks to labels, we can add dimensions to our metrics and later filter and refine by those characteristics. Cardinality is the number of possible values for each label; since each combination of label values requires its own time series entry, cardinality can increase resource usage drastically. Choosing labels carefully is key to avoiding a cardinality explosion, one of the main causes of sudden resource spending spikes.&lt;/p&gt;
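&lt;p&gt;As a back-of-the-envelope check, the worst-case number of time series for one metric is the product of the distinct values of each label (the label names and counts below are made up for illustration):&lt;/p&gt;

```python
from math import prod

def series_count(label_cardinalities):
    """Worst-case time series for one metric: the product of the
    number of distinct values of each label."""
    return prod(label_cardinalities.values())

# One HTTP metric: 5 methods x 10 status codes x 50 endpoints
n = series_count({"method": 5, "status_code": 10, "endpoint": 50})  # 2500
# Adding a user_id label with 10,000 values would multiply that by 10,000.
```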

&lt;h3&gt;Costs&lt;/h3&gt;

&lt;p&gt;Custom metrics may have costs associated with them, depending on the monitoring system you are using. Double-check which dimension is used to scale costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of time series&lt;/li&gt;



&lt;li&gt;Number of labels&lt;/li&gt;



&lt;li&gt;Data storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Custom Metric lifecycle&lt;/h3&gt;

&lt;p&gt;In case the custom metric is related to a batch job or a short-lived script, consider using &lt;a href="https://github.com/prometheus/pushgateway" rel="noopener nofollow noreferrer"&gt;Pushgateway&lt;/a&gt;, since the process may finish before Prometheus gets a chance to scrape it.&lt;/p&gt;

&lt;h2&gt;Kubernetes Metric API&lt;/h2&gt;

&lt;p&gt;One of the most important features of Kubernetes is the ability to automatically scale the workload based on metric values.&lt;/p&gt;

&lt;p&gt;The metrics APIs are defined in the &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/metrics"&gt;official Kubernetes metrics repository&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;metrics.k8s.io&lt;/code&gt;&lt;/li&gt;



&lt;li&gt;&lt;code&gt;custom.metrics.k8s.io&lt;/code&gt;&lt;/li&gt;



&lt;li&gt;&lt;code&gt;external.metrics.k8s.io&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Creating new metrics&lt;/h3&gt;

&lt;p&gt;You can write new custom metric values by calling the Kubernetes metrics API through the API server proxy, as follows:&lt;/p&gt;

&lt;pre&gt;curl -X POST \
  -H 'Content-Type: application/json' \
  http://localhost:8001/api/v1/namespaces/custom-metrics/services/custom-metrics-apiserver:http/proxy/write-metrics/namespaces/default/services/kubernetes/test-metric \
  --data-raw '"300m"'
&lt;/pre&gt;

&lt;h2&gt;Prometheus custom metrics&lt;/h2&gt;

&lt;p&gt;As we mentioned, every exporter that we include in our Prometheus integration will account for several custom metrics.&lt;/p&gt;

&lt;p&gt;Check the following post for a &lt;a href="https://sysdig.com/blog/prometheus-metrics/" rel="noopener"&gt;detailed guide on Prometheus metrics&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Challenges when using custom metrics&lt;/h2&gt;

&lt;h3&gt;Cardinality explosion&lt;/h3&gt;

&lt;p&gt;While the resources consumed by some metrics might be negligible, the moment those metrics are combined with labels in queries, things might get out of hand.&lt;/p&gt;

&lt;p&gt;Cardinality refers to the Cartesian product of a metric and its label values. The result is the number of time series entries that must be stored for that single metric.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0nVj2Ktr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Custom-metrics-image-1-1-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0nVj2Ktr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Custom-metrics-image-1-1-1170x644.png" alt="Custom metrics - cardinality example" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, every metric is scraped and stored in a time series database at each &lt;code&gt;scrape_interval&lt;/code&gt;. The shorter this interval, the higher the number of samples stored.&lt;/p&gt;
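&lt;p&gt;A rough sizing sketch (the numbers are illustrative): samples stored per day equal the series count times the number of scrapes per day.&lt;/p&gt;

```python
def samples_per_day(series, scrape_interval_seconds):
    """Samples ingested per day for a set of time series
    scraped at a fixed interval."""
    scrapes_per_day = 24 * 3600 // scrape_interval_seconds
    return series * scrapes_per_day

# 2,500 series scraped every 15 seconds:
n = samples_per_day(2500, 15)  # 14,400,000 samples per day
```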

&lt;p&gt;All these factors will eventually lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher resource consumption.&lt;/li&gt;



&lt;li&gt;Higher storage demand.&lt;/li&gt;



&lt;li&gt;Monitoring performance degradation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moreover, most common monitoring tools don’t give &lt;a href="https://sysdig.com/use-cases/cloud-monitoring/" rel="noopener"&gt;visibility into the current cardinality of metrics or the costs associated with them&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Exporter overuse&lt;/h3&gt;

&lt;p&gt;Exporters are a great way to add relevant metrics to your system: with them, you can easily instrument metrics bound to your microservices and containers. But with great power comes great responsibility. Chances are that many of the metrics included in an exporter package are not relevant to your business at all.&lt;/p&gt;

&lt;p&gt;By enabling custom metrics and exporters in your solution, you may end up with a burst in the number of time series database entries.&lt;/p&gt;

&lt;h3&gt;Cost spikes&lt;/h3&gt;

&lt;p&gt;Because of the factors explained above, monitoring costs can spike suddenly: your current solution might consume more resources than expected, or you might surpass pricing thresholds in your monitoring platform.&lt;/p&gt;

&lt;h3&gt;Alert fatigue&lt;/h3&gt;

&lt;p&gt;Once metrics are in place, most companies and individuals start adding alerts and notifications for values that exceed certain thresholds. However, this can multiply notification sources and reduce attention span.&lt;br&gt;Learn more about &lt;a href="https://sysdig.com/blog/prometheus-alertmanager" rel="noopener"&gt;Alert Fatigue and how to mitigate it&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Custom metrics represent the next step for cloud-native monitoring, as they are the core of business observability. While using Prometheus alongside kube-state-metrics and Node Exporter is a good starting point, eventually companies and organizations will need to take the next step and create tailored, on-point metrics to suit their needs.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Prometheus Alertmanager best practices</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 09 Feb 2023 09:59:54 +0000</pubDate>
      <link>https://dev.to/sysdig/prometheus-alertmanager-best-practices-4872</link>
      <guid>https://dev.to/sysdig/prometheus-alertmanager-best-practices-4872</guid>
      <description>&lt;p&gt;Have you ever fallen asleep to the sounds of your on-call team in a Zoom call? If you’ve had the misfortune to sympathize with this experience, you likely understand the problem of &lt;strong&gt;Alert Fatigue&lt;/strong&gt; firsthand.&lt;/p&gt;

&lt;p&gt;During an active incident, it can be exhausting to tease the upstream root cause from downstream noise while you’re context switching between your terminal and your alerts.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Alertmanager&lt;/strong&gt; comes in, providing a way to mitigate each of the problems related to Alert Fatigue.&lt;/p&gt;

&lt;p&gt;In this article, you will learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Alert Fatigue is&lt;/li&gt;



&lt;li&gt;What AlertManager is&lt;/li&gt;



&lt;li&gt;Routing&lt;/li&gt;



&lt;li&gt;Inhibition&lt;/li&gt;



&lt;li&gt;Silencing and Throttling&lt;/li&gt;



&lt;li&gt;Grouping&lt;/li&gt;



&lt;li&gt;Notification Template&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="alert-fatigue"&gt;Alert Fatigue&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alert Fatigue&lt;/strong&gt; is the exhaustion of frequently responding to unprioritized and unactionable alerts, and it is not sustainable in the long term. Not every alert is so urgent that it should wake up a developer: a sustainable on-call week must prioritize sleep as well. Some questions worth asking:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was an engineer woken up more than twice this week?&lt;/li&gt;



&lt;li&gt;Can the resolution be automated or wait until morning?&lt;/li&gt;



&lt;li&gt;How many people were involved?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Companies often focus on response time and how long a resolution takes, but how do they know the on-call process itself is not contributing to burnout?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pain Point&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Alertmanager&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Send alerts to the right team&lt;/td&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Labeled alerts are routed to the corresponding receiver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Too many alerts at once&lt;/td&gt;
&lt;td&gt;Inhibition&lt;/td&gt;
&lt;td&gt;Alerts can inhibit other alerts (e.g., Datacenter down alert inhibits downtime alert)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive on an Alert&lt;/td&gt;
&lt;td&gt;Silencing&lt;/td&gt;
&lt;td&gt;Temporarily silence an alert, especially when performing scheduled maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerts are too frequent&lt;/td&gt;
&lt;td&gt;Throttling&lt;/td&gt;
&lt;td&gt;Customizable back-off options to avoid re-notifying too frequently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unorganized alerts&lt;/td&gt;
&lt;td&gt;Grouping&lt;/td&gt;
&lt;td&gt;Logically group alerts by labels such as 'environment=dev' or 'service=broker'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notifications are unstructured&lt;/td&gt;
&lt;td&gt;Notification Template&lt;/td&gt;
&lt;td&gt;Standardize alerts to a template so that alerts are structured across services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;h2 id="alertmanager"&gt;Alertmanager&lt;/h2&gt;

&lt;p&gt;Prometheus &lt;strong&gt;Alertmanager&lt;/strong&gt; is the open source standard for translating alerts into alert notifications for your engineering team. &lt;a href="https://prometheus.io/docs/alerting/latest/alertmanager/" rel="noreferrer noopener"&gt;Alertmanager&lt;/a&gt; challenges the assumption that a dozen alerts should result in a dozen alert notifications. By leveraging the features of Alertmanager, dozens of alerts can be distilled into a handful of alert notifications, allowing on-call engineers to context switch less by thinking in terms of incidents rather than alerts.&lt;/p&gt;

&lt;h2 id="routing"&gt;Routing&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Routing&lt;/strong&gt; is the ability to send alerts to a variety of receivers including Slack, Pagerduty, and email. It is the core feature of Alertmanager.&lt;/p&gt;

&lt;pre&gt;route:
  receiver: slack-default            # Fallback Receiver if no routes are matched
  routes:
    - receiver: pagerduty-logging
      continue: true
    - match:
      team: support
      receiver: jira
    - match:
      team: on-call
      receiver: pagerduty-prod&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram1-1170x351.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram1-1170x351.png" alt="Prometheus alertmanager diagram 1" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, an alert with the label &lt;code&gt;{team:on-call}&lt;/code&gt; was triggered. Routes are matched from top to bottom, with the first receiver being &lt;code&gt;pagerduty-logging&lt;/code&gt;, a receiver for your on-call manager to track all alerts at the end of each month. Since the alert does not have a &lt;code&gt;{team:support}&lt;/code&gt; label, matching continues to &lt;code&gt;{team:on-call}&lt;/code&gt;, where the alert is properly routed to the &lt;code&gt;pagerduty-prod&lt;/code&gt; receiver. The fallback receiver, &lt;code&gt;slack-default&lt;/code&gt;, is specified at the root route in case no routes match.&lt;/p&gt;

&lt;h2 id="inhibition"&gt;Inhibition&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Inhibition&lt;/strong&gt; is the process of &lt;strong&gt;muting downstream alerts&lt;/strong&gt; depending on their label set. Of course, this means that alerts must be systematically tagged in a logical and standardized way, but that's a human problem, not an Alertmanager one. While there is no native support for warning thresholds, the user can take advantage of labels and inhibit a warning when the critical condition is met.&lt;/p&gt;

&lt;p&gt;This has the unique advantage of supporting a warning condition for alerts that don't use a scalar comparison. It's all well and good to warn at 60% CPU usage and alert at 80% CPU usage, but what if we wanted to craft a warning and alert that compares two queries? This alert triggers when a node has more pods than its capacity.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(sum by (kube_node_name) (kube_pod_container_status_running)) &amp;gt; 
on(kube_node_name) kube_node_status_capacity_pods&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can do exactly this by using inhibition with Alertmanager. In the first example, an alert with the label &lt;code&gt;{severity=critical}&lt;/code&gt; will inhibit an alert of &lt;code&gt;{severity=warning}&lt;/code&gt; if they share the same region and alertname.&lt;/p&gt;

&lt;p&gt;In the second example, we can also inhibit downstream alerts when we know they won't be important in the root cause. It might be expected that a Kafka consumer behaves anomalously when the Kafka producer doesn't publish anything to the topic.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['region','alertname']
  - source_match:
      service: 'kafka_producer'
    target_match:
      service: 'kafka_consumer'
    equal: ['environment','topic']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram2-1170x429.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram2-1170x429.png" alt="Prometheus alertmanager diagram 2" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="silencing-throttling"&gt;Silencing and Throttling&lt;/h2&gt;

&lt;p&gt;Now that you've woken up at 2 a.m. to exactly one root cause alert, you may want to acknowledge the alert and move forward with remediation. It’s too early to resolve the alert but alert re-notifications don’t give any extra context. This is where silencing and throttling can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silencing&lt;/strong&gt; allows you to temporarily snooze an alert if you're expecting the alert to trigger for a scheduled procedure, such as database maintenance, or if you've already acknowledged the alert during an incident and want to keep it from renotifying while you remediate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throttling&lt;/strong&gt; solves a similar pain point but in a slightly different fashion. Throttles allow the user to tailor the renotification settings with three main parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;group_wait&lt;/li&gt;



&lt;li&gt;group_interval&lt;/li&gt;



&lt;li&gt;repeat_interval&lt;/li&gt;
&lt;/ul&gt;
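&lt;p&gt;These three parameters can be modeled as a tiny decision function (a simplified sketch of Alertmanager's timing semantics, not its actual implementation):&lt;/p&gt;

```python
def next_notification_delay(is_new_group, group_changed,
                            group_wait, group_interval, repeat_interval):
    """Delay (seconds) before the next notification for an alert group,
    in a simplified model of Alertmanager's throttling parameters."""
    if is_new_group:
        return group_wait          # wait to batch the first alerts of a new group
    if group_changed:
        return group_interval      # batch newly added or resolved alerts
    return repeat_interval         # remind about still-firing, unchanged alerts

# New group with group_wait=30s, group_interval=300s, repeat_interval=24h:
d = next_notification_delay(True, False, 30, 300, 86400)  # 30
```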

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram3-1170x468.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram3-1170x468.png" alt="Prometheus alertmanager diagram 3" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When Alert #1 and Alert #3 are initially triggered, Alertmanager uses &lt;code&gt;group_wait&lt;/code&gt; to delay the first notification by 30 seconds. After that initial notification, any new alert notifications are delayed by &lt;code&gt;group_interval&lt;/code&gt;. Since no new alert arrived in the next 90 seconds, no notification was sent. Over the subsequent 90 seconds, however, Alert #2 was triggered, so a notification covering Alert #2 and Alert #3 was sent. So that currently firing alerts aren't forgotten when nothing new triggers, &lt;code&gt;repeat_interval&lt;/code&gt; can be set to a value such as 24 hours, making the firing alerts re-notify every 24 hours.&lt;/p&gt;

&lt;h2 id="grouping"&gt;Grouping&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Grouping&lt;/strong&gt; in Alertmanager allows multiple alerts sharing a similar label set to be sent in the same notification. This is not to be confused with Prometheus grouping, where alert rules in a group are evaluated in sequential order. By default, all alerts for a given route are grouped together. A &lt;code&gt;group_by&lt;/code&gt; field can be specified to logically group alerts.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;route:
  receiver: slack-default            # Fallback Receiver if no routes are matched
  group_by: [env]
  routes:
    - match:
        team: on-call
      group_by: [region, service]
      receiver: pagerduty-prod
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram4-1170x819.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram4-1170x819.png" alt="Prometheus alertmanager diagram 4" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alerts that have the label {team:on-call} will be grouped by both region and service. This allows users to immediately have context that all of the notifications within this alert group share the same service and region. Grouping with information such as &lt;code&gt;instance_id&lt;/code&gt; or &lt;code&gt;ip_address&lt;/code&gt; tends to be less useful, since it means that every unique &lt;code&gt;instance_id&lt;/code&gt; or &lt;code&gt;ip_address&lt;/code&gt; will produce its own notification group. This may produce noisy notifications and defeat the purpose of grouping.&lt;/p&gt;

&lt;p&gt;If no grouping is configured, all alerts will be part of the same alert notification for a given route.&lt;/p&gt;

&lt;h2 id="notification-template"&gt;Notification Template&lt;/h2&gt;

&lt;p&gt;Notification templates offer a way to customize and standardize alert notifications. For example, a notification template can use labels to automatically link to a runbook or include useful labels for the on-call team to build context. Here, &lt;code&gt;app&lt;/code&gt; and &lt;code&gt;alertname&lt;/code&gt; labels are interpolated into a path that links out to a runbook. Standardizing on a notification template can make the on-call process run more smoothly since the on-call team may not be the direct maintainers of the microservice that is paging.&lt;/p&gt;
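&lt;p&gt;As a sketch (the channel name and runbook URL scheme are hypothetical), a Slack receiver can interpolate those labels in its notification template:&lt;/p&gt;

```yaml
receivers:
  - name: slack-on-call
    slack_configs:
      - channel: '#on-call'
        title: '{{ .CommonLabels.alertname }} on {{ .CommonLabels.app }}'
        text: 'Runbook: https://runbooks.example.com/{{ .CommonLabels.app }}/{{ .CommonLabels.alertname }}'
```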

&lt;h2&gt;&lt;em&gt;Manage alerts with a click with Sysdig Monitor&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;As organizations grow, maintaining Prometheus and Alertmanager can become difficult to manage across teams. Sysdig Monitor makes this easy with Role-Based Access Control where teams can focus on the metrics and alerts most important to them. We offer a turn-key solution where you can manage your alerts from a single pane of glass. With Sysdig Monitor you can spend less time maintaining Prometheus Alertmanager and spend more time monitoring your actual infrastructure. Come chat with industry experts in monitoring and alerting and we'll get you up and running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Prometheus-Alertmanager-CTA-1170x530.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FPrometheus-Alertmanager-CTA-1170x530.png" alt="Alert Monitoring in Sysdig Monitor"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Sign up now for a &lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;free trial of Sysdig Monitor&lt;/a&gt;&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Kubernetes OOM and CPU Throttling</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 26 Jan 2023 09:56:52 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-oom-and-cpu-throttling-n55</link>
      <guid>https://dev.to/sysdig/kubernetes-oom-and-cpu-throttling-n55</guid>
      <description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;

&lt;p&gt;When working with Kubernetes, Out of Memory (OOM) errors and CPU throttling are the main headaches of resource handling in cloud applications. Why is that?&lt;/p&gt;

&lt;p&gt;CPU and Memory requirements in cloud applications are ever more important, since they are tied directly to your cloud costs.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noreferrer noopener"&gt;limits and requests&lt;/a&gt;, you can configure how your pods should allocate memory and CPU resources in order to prevent resource starvation and adjust cloud costs.&lt;/p&gt;
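&lt;p&gt;For reference, requests and limits are set per container in the Pod spec; a minimal sketch (the values are illustrative):&lt;/p&gt;

```yaml
resources:
  requests:
    memory: "256Mi"   # guaranteed amount, used for scheduling
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this gets the container OOMKilled
    cpu: "500m"       # exceeding this gets the container throttled
```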

&lt;p&gt;In case a Node doesn’t have enough resources, &lt;a href="https://sysdig.com/blog/kubernetes-pod-evicted/" rel="noopener noreferrer"&gt;Pods might get evicted&lt;/a&gt; via preemption or node-pressure.&lt;br&gt;When a process runs Out Of Memory (OOM), it’s killed because it exceeded the memory it was allowed to use.&lt;br&gt;When CPU consumption is higher than the actual limits, the process starts being throttled.&lt;/p&gt;

&lt;p&gt;But how can you actively monitor how close your Kubernetes Pods are to OOM kills and CPU throttling?&lt;/p&gt;

&lt;h2 id="kubernetes-oom"&gt;Kubernetes OOM&lt;/h2&gt;

&lt;p&gt;Every container in a Pod needs memory to run.&lt;/p&gt;

&lt;p&gt;Kubernetes limits are set per container in either a Pod definition or a Deployment definition.&lt;/p&gt;

&lt;p&gt;All modern Unix systems have a way to kill processes when they need to reclaim memory. In Kubernetes, this is reported as exit code 137 or &lt;code&gt;OOMKilled&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;   State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 10 Oct 2019 11:04:03 +0200
      Finished:     Thu, 10 Oct 2019 11:14:11 +0200
&lt;/pre&gt;

&lt;p&gt;This Exit Code 137 means that the process used more memory than the allowed amount and had to be terminated.&lt;/p&gt;

&lt;p&gt;This is a Linux kernel feature, where the kernel sets an &lt;code&gt;oom_score&lt;/code&gt; value for each process running in the system. It also allows setting a value called &lt;code&gt;oom_score_adj&lt;/code&gt;, which Kubernetes uses to implement Quality of Service. The kernel’s &lt;code&gt;OOM Killer&lt;/code&gt; reviews running processes and terminates those that are using more memory than they should.&lt;/p&gt;

&lt;p&gt;Note that in Kubernetes, a process can reach any of these limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes Limit set on the container.&lt;/li&gt;



&lt;li&gt;A Kubernetes ResourceQuota set on the namespace.&lt;/li&gt;



&lt;li&gt;The node’s actual Memory size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-1.png" alt="Kubernetes OOM graph" title="image_tooltip" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Memory overcommitment&lt;/h3&gt;

&lt;p&gt;Limits can be higher than requests, so the sum of all limits can be higher than node capacity. This is called overcommit and it is very common. In practice, if all containers use more memory than requested, it can exhaust the memory in the node. This usually causes the &lt;a href="https://sysdig.com/blog/kubernetes-pod-evicted/" rel="noopener noreferrer"&gt;death of some pods&lt;/a&gt; in order to free some memory.&lt;/p&gt;
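&lt;p&gt;A quick way to reason about this: the overcommit ratio is the sum of all container memory limits divided by the node's capacity (the numbers below are made up):&lt;/p&gt;

```python
def overcommit_ratio(container_limits_gib, node_capacity_gib):
    """Ratio above 1.0 means the node is overcommitted: if every
    container used its full limit, the node would run out of memory."""
    return sum(container_limits_gib) / node_capacity_gib

# Three containers with 3, 3, and 2 GiB limits on a 4 GiB node:
ratio = overcommit_ratio([3, 3, 2], 4)  # 2.0 -- heavily overcommitted
```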

&lt;h3&gt;Monitoring Kubernetes OOM&lt;/h3&gt;

&lt;p&gt;When using node exporter in Prometheus, there’s one metric called &lt;code&gt;node_vmstat_oom_kill&lt;/code&gt;. It’s important to track when an OOM kill happens, but you might want to get ahead and have visibility of such an event before it happens.&lt;/p&gt;

&lt;p&gt;Instead, you can check how close a process is to its Kubernetes memory limit:&lt;/p&gt;

&lt;pre&gt;(sum by (namespace,pod,container)
(container_memory_working_set_bytes{container!=""}) / sum by
(namespace,pod,container)
(kube_pod_container_resource_limits{resource="memory"})) &amp;gt; 0.8
&lt;/pre&gt;

&lt;h2 id="kubernetes-cpu-throttling"&gt;Kubernetes CPU throttling&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CPU Throttling&lt;/strong&gt; is a behavior where processes are slowed when they are about to reach some resource limits.&lt;/p&gt;

&lt;p&gt;Similar to the memory case, these limits could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes Limit set on the container.&lt;/li&gt;



&lt;li&gt;A Kubernetes ResourceQuota set on the namespace.&lt;/li&gt;



&lt;li&gt;The node’s actual CPU capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of the following analogy. We have a highway with some traffic where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU is the road.&lt;/li&gt;



&lt;li&gt;Vehicles represent processes, each with a different size.&lt;/li&gt;



&lt;li&gt;Multiple lanes represent having several cores.&lt;/li&gt;



&lt;li&gt;A request would be an exclusive road, like a bike lane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Throttling here is represented as a traffic jam: eventually, all processes will run, but everything will be slower.&lt;/p&gt;

&lt;h3&gt;CPU process in Kubernetes&lt;/h3&gt;

&lt;p&gt;CPU is handled in Kubernetes with &lt;strong&gt;shares&lt;/strong&gt;. Each CPU core is divided into 1024 shares, then divided between all processes running by using the cgroups (control groups) feature of the Linux kernel.&lt;/p&gt;
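&lt;p&gt;The arithmetic behind this is simple: a CPU request in millicores maps to cgroup v1 shares at 1024 shares per core (a sketch of the conversion, not the actual kubelet code):&lt;/p&gt;

```python
def cpu_shares(request_millicores):
    """cgroup v1 cpu.shares for a CPU request: 1024 shares per core."""
    return request_millicores * 1024 // 1000

shares = cpu_shares(500)  # a 500m CPU request maps to 512 shares
```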

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-4.png" alt="Kubernetes shares system for CPU" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the CPU can handle all current processes, no action is needed. If processes are using more than 100% of the CPU, shares come into play. Like any Linux system, Kubernetes relies on the kernel’s CFS (Completely Fair Scheduler), so the processes with more shares get more CPU time.&lt;/p&gt;

&lt;p&gt;Unlike memory, Kubernetes won’t kill Pods because of CPU throttling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-2.png" alt="Kubernetes Throttling graph" title="image_tooltip" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;You can check CPU stats in &lt;code&gt;/sys/fs/cgroup/cpu/cpu.stat&lt;/code&gt;&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h3&gt;CPU overcommitment&lt;/h3&gt;

&lt;p&gt;As we saw in the &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;limits and requests article&lt;/a&gt;, it’s important to set limits or requests when we want to restrict the resource consumption of our processes. Nevertheless, beware of setting total requests larger than the node’s actual CPU capacity: since each container is guaranteed its requested amount of CPU, some Pods would never be scheduled.&lt;/p&gt;

&lt;h3&gt;Monitoring Kubernetes CPU throttling&lt;/h3&gt;

&lt;p&gt;You can check how close a process is to the Kubernetes limits:&lt;/p&gt;

&lt;pre&gt;(sum by (namespace,pod,container)(rate(container_cpu_usage_seconds_total
{container!=""}[5m])) / sum by (namespace,pod,container)
(kube_pod_container_resource_limits{resource="cpu"})) &amp;gt; 0.8&lt;/pre&gt;

&lt;p&gt;In case we want to track the amount of throttling happening in our cluster, cadvisor provides &lt;code&gt;container_cpu_cfs_throttled_periods_total&lt;/code&gt; and &lt;code&gt;container_cpu_cfs_periods_total&lt;/code&gt;. With these two, you can easily calculate the % of throttling in all CPU periods.&lt;/p&gt;
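&lt;p&gt;For instance, the throttled percentage over a window is just the ratio of the two counters' increases (the values below are illustrative):&lt;/p&gt;

```python
def throttled_percentage(throttled_periods, total_periods):
    """Share of CFS periods in which the container was throttled,
    i.e. the increase of container_cpu_cfs_throttled_periods_total
    over the increase of container_cpu_cfs_periods_total."""
    if total_periods == 0:
        return 0.0
    return 100.0 * throttled_periods / total_periods

# 25 throttled periods out of 200 CFS periods in the window:
pct = throttled_percentage(25, 200)  # 12.5
```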

&lt;h2 id="best-practices"&gt;Best practices&lt;/h2&gt;

&lt;h3&gt;Beware of limits and requests&lt;/h3&gt;

&lt;p&gt;Limits are a way to set up a maximum cap on resources in your node, but these need to be treated carefully, as you might end up with a process throttled or killed.&lt;/p&gt;

&lt;h3&gt;Prepare against eviction&lt;/h3&gt;

&lt;p&gt;By setting very low requests, you might think this will grant a minimum of either CPU or Memory to your process. But &lt;code&gt;kubelet&lt;/code&gt; will first evict those Pods whose usage is higher than their requests, so you’re marking those as the first to be killed!&lt;/p&gt;

&lt;p&gt;In case you need to protect specific Pods against preemption (when &lt;code&gt;kube-scheduler&lt;/code&gt; needs to allocate a new Pod), assign Priority Classes to your most important processes.&lt;/p&gt;
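
&lt;p&gt;As a minimal sketch (the class name, value, and Pod spec here are arbitrary), you create the Priority Class and then reference it from the Pod spec via &lt;code&gt;priorityClassName&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-workload
value: 1000000
globalDefault: false
description: "Pods that should not be preempted"
---
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  priorityClassName: critical-workload
  containers:
  - name: myapp
    image: myapp:latest
&lt;/pre&gt;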

&lt;h3&gt;Throttling is a silent enemy&lt;/h3&gt;

&lt;p&gt;By setting unrealistic limits or overcommitting, you might not be aware that your processes are being throttled and their performance impacted. Proactively monitor your CPU usage and know your actual limits in both containers and namespaces.&lt;/p&gt;

&lt;h2 id="wrapping-up"&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;Here’s a cheat sheet on Kubernetes resource management for CPU and Memory. This summarizes the current article plus these ones which are part of the same series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-pod-evicted/" rel="noopener noreferrer"&gt;https://sysdig.com/blog/kubernetes-pod-evicted/&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;https://sysdig.com/blog/kubernetes-limits-requests/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-diagram-1170x664.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-diagram-1170x664.png" alt="Kubernetes CPU and Memory cheatsheet" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;








&lt;h2&gt;Rightsize your Kubernetes Resources with Sysdig Monitor&lt;/h2&gt;





&lt;p&gt;With Sysdig Monitor’s new feature, Cost Advisor, you can optimize your Kubernetes costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory requests&lt;/li&gt;



&lt;li&gt;CPU requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With our out-of-the-box Kubernetes Dashboards, you can &lt;a href="https://sysdig.com/blog/kubernetes-capacity-planning/" rel="noreferrer noopener"&gt;discover underutilized resources&lt;/a&gt; in a couple of clicks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-How-to-do-capacity-planning-for-Kubernetes-Image-14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-How-to-do-capacity-planning-for-Kubernetes-Image-14.png" alt="Capacity planning Kubernetes Sysdig Monitor" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Kubernetes Services: ClusterIP, Nodeport and LoadBalancer</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Fri, 09 Dec 2022 10:03:19 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-services-clusterip-nodeport-and-loadbalancer-1g3m</link>
      <guid>https://dev.to/sysdig/kubernetes-services-clusterip-nodeport-and-loadbalancer-1g3m</guid>
      <description>&lt;p&gt;Pods are ephemeral. And they are meant to be. They can be seamlessly destroyed and replaced if using a Deployment. Or they can be scaled at some point when using Horizontal Pod Autoscaling (HPA).&lt;/p&gt;

&lt;p&gt;This means we can’t rely on the Pod IP address to connect with applications running in our containers internally or externally, as the Pod might not be there in the future.&lt;/p&gt;

&lt;p&gt;You may have noticed that Kubernetes Pods get assigned an IP address:&lt;/p&gt;

&lt;pre&gt;stable-kube-state-metrics-758c964b95-6fnbl               1/1     Running   0          3d20h   100.96.2.5      ip-172-20-54-111.ec2.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
stable-prometheus-node-exporter-4brgv                    1/1     Running   0          3d20h   172.20.60.26    ip-172-20-60-26.ec2.internal
&lt;/pre&gt;

&lt;p&gt;This is a unique and internal IP for this particular Pod, but there’s no guarantee that this IP will exist in the future, due to the Pod's nature.&lt;/p&gt;

&lt;h2&gt;Services&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Kubernetes Service&lt;/strong&gt; is a mechanism to &lt;strong&gt;expose applications both internally and externally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every service will create a stable, long-lived IP address that can be used as a connection point.&lt;/p&gt;

&lt;p&gt;Additionally, it will open a &lt;code&gt;port&lt;/code&gt; that will be linked with a &lt;code&gt;targetPort&lt;/code&gt;. Some services can create ports in every &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-101/what-is-a-kubernetes-node/" rel="noreferrer noopener"&gt;Node&lt;/a&gt;, and even external IPs to create connectors outside the cluster.&lt;/p&gt;

&lt;p&gt;With the combination of both IP and Port, we can create a way to uniquely identify an application.&lt;/p&gt;

&lt;h3&gt;Creating a service&lt;/h3&gt;

&lt;p&gt;Every service has a selector filter that will link it with a set of Pods in your cluster.&lt;/p&gt;

&lt;pre&gt;spec:
  selector:
    app.kubernetes.io/name: myapp
&lt;/pre&gt;

&lt;p&gt;So all Pods with the label &lt;em&gt;app.kubernetes.io/name: myapp&lt;/em&gt; will be linked to this service.&lt;/p&gt;

&lt;p&gt;There are three port attributes involved in a Service configuration:&lt;/p&gt;

&lt;pre&gt;  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30036
    protocol: TCP
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;port: the new service port that will be created to connect to the application.&lt;/li&gt;



&lt;li&gt;targetPort: application port that we want to target with the services requests.&lt;/li&gt;



&lt;li&gt;nodePort: this is a port in the range of 30000-32767 that will be open in each node. If left empty, Kubernetes selects a free one in that range.&lt;/li&gt;



&lt;li&gt;protocol: TCP is the default one, but you can use others like SCTP or UDP.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can review services created with:&lt;/p&gt;

&lt;pre&gt;kubectl get services
kubectl get svc
&lt;/pre&gt;

&lt;h3&gt;Types of services&lt;/h3&gt;

&lt;p&gt;Kubernetes allows the creation of these types of services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClusterIP (default)&lt;/li&gt;



&lt;li&gt;Nodeport&lt;/li&gt;



&lt;li&gt;LoadBalancer&lt;/li&gt;



&lt;li&gt;ExternalName&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see each of them in detail.&lt;/p&gt;

&lt;h2&gt;ClusterIP&lt;/h2&gt;

&lt;p&gt;This is the default Service type in Kubernetes.&lt;/p&gt;

&lt;p&gt;As indicated by its name, this is just an address that can be used inside the cluster.&lt;/p&gt;

&lt;p&gt;Take, for example, the initial helm installation for Prometheus Stack. It installs Pods, Deployments, and Services for the Prometheus and Grafana ecosystem.&lt;/p&gt;

&lt;pre&gt;NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None            &amp;lt;none&amp;gt;        9093/TCP,9094/TCP,9094/UDP   3m27s
kubernetes                                ClusterIP   100.64.0.1      &amp;lt;none&amp;gt;        443/TCP                      18h
prometheus-operated                       ClusterIP   None            &amp;lt;none&amp;gt;        9090/TCP                     3m27s
stable-grafana                            ClusterIP   100.66.46.251   &amp;lt;none&amp;gt;        80/TCP                       3m29s
stable-kube-prometheus-sta-alertmanager   ClusterIP   100.64.23.19    &amp;lt;none&amp;gt;        9093/TCP                     3m29s
stable-kube-prometheus-sta-operator       ClusterIP   100.69.14.239   &amp;lt;none&amp;gt;        443/TCP                      3m29s
stable-kube-prometheus-sta-prometheus     ClusterIP   100.70.168.92   &amp;lt;none&amp;gt;        9090/TCP                     3m29s
stable-kube-state-metrics                 ClusterIP   100.70.80.72    &amp;lt;none&amp;gt;        8080/TCP                     3m29s
stable-prometheus-node-exporter           ClusterIP   100.68.71.253   &amp;lt;none&amp;gt;        9100/TCP                     3m29s
&lt;/pre&gt;



&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s4XVCH_6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-01-1170x644.png" alt="Kubernetes Services ClusterIP" width="880" height="484"&gt;



&lt;p&gt;This creates a connection using an internal Cluster IP address and a Port.&lt;/p&gt;
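
&lt;p&gt;Inside the cluster, each service is also reachable through a DNS name of the form &lt;code&gt;&amp;lt;service&amp;gt;.&amp;lt;namespace&amp;gt;.svc.cluster.local&lt;/code&gt;. For instance, assuming the Grafana service above lives in the &lt;em&gt;default&lt;/em&gt; namespace, a Pod in the cluster could reach it with:&lt;/p&gt;

&lt;pre&gt;curl http://stable-grafana.default.svc.cluster.local:80
&lt;/pre&gt;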



&lt;p&gt;But, what if we need to use this connector from outside the Cluster? This IP is internal and won’t work outside.&lt;/p&gt;



&lt;p&gt;This is where the rest of the services come in…&lt;/p&gt;



&lt;h2&gt;NodePort&lt;/h2&gt;



&lt;p&gt;A NodePort differs from the ClusterIP in the sense that it exposes a port in each Node.&lt;/p&gt;



&lt;p&gt;When a NodePort is created, kube-proxy exposes a port in the range 30000-32767:&lt;/p&gt;



&lt;pre&gt;apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  selector:
    app: myapp
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30036
    protocol: TCP&lt;/pre&gt;



&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0hrnbgS6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-02-1170x644.png" alt="Kubernetes Services Nodeport" width="880" height="484"&gt;



&lt;p&gt;NodePort is the preferred choice for non-HTTP communication.&lt;/p&gt;



&lt;p&gt;The problem with using a NodePort is that you still need to access each of the Nodes separately.&lt;/p&gt;
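
&lt;p&gt;In other words, with the NodePort above you reach the application through each individual node's address (the node IP is a placeholder here):&lt;/p&gt;

&lt;pre&gt;curl http://&amp;lt;node-ip&amp;gt;:30036
&lt;/pre&gt;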



&lt;p&gt;So, let’s have a look at the next item on the list…&lt;/p&gt;



&lt;h2&gt;LoadBalancer&lt;/h2&gt;



&lt;p&gt;A LoadBalancer is a Kubernetes service that:&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;Creates a service like ClusterIP&lt;/li&gt;



&lt;li&gt;Opens a port in every node like NodePort&lt;/li&gt;



&lt;li&gt;Uses a load balancer implementation from your cloud provider (the provider must support this for LoadBalancers to work).&lt;/li&gt;
&lt;/ul&gt;



&lt;pre&gt;apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - name: web
    port: 80
  selector:
    app: web
  type: LoadBalancer
&lt;/pre&gt;

&lt;p&gt;Right after creation, the external address shows as pending while the cloud provider provisions it; a bit later, it becomes available:&lt;/p&gt;

&lt;pre&gt;my-service                                LoadBalancer   100.71.69.103   &amp;lt;pending&amp;gt;     80:32147/TCP                 12s
my-service                                LoadBalancer   100.71.69.103   a16038a91350f45bebb49af853ab6bd3-2079646983.us-east-1.elb.amazonaws.com   80:32147/TCP                 16m
&lt;/pre&gt;

&lt;p&gt;In this case, Amazon Web Services (AWS) was being used, so an external address from AWS was created.&lt;/p&gt;

&lt;p&gt;Then, if you use &lt;code&gt;kubectl describe service my-service&lt;/code&gt;, you will find that several new attributes were added:&lt;/p&gt;

&lt;pre&gt;Name:                     my-service
Namespace:                default
Labels:                   &amp;lt;none&amp;gt;
Annotations:              &amp;lt;none&amp;gt;
Selector:                 app.kubernetes.io/name=pegasus
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       100.71.69.103
IPs:                      100.71.69.103
LoadBalancer Ingress:     a16038a91350f45bebb49af853ab6bd3-2079646983.us-east-1.elb.amazonaws.com
Port:                     &amp;lt;unset&amp;gt;  80/TCP
TargetPort:               9376/TCP
NodePort:                 &amp;lt;unset&amp;gt;  32147/TCP
Endpoints:                &amp;lt;none&amp;gt;
Session Affinity:         None
External Traffic Policy:  Cluster
&lt;/pre&gt;

&lt;p&gt;The main difference with NodePort is that a LoadBalancer can be reached through a single external address, and it will try to distribute requests equally across Nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGQ-V_pF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-03-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGQ-V_pF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-03-1170x644.png" alt="Kubernetes Service LoadBalancer" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;ExternalName&lt;/h2&gt;

&lt;p&gt;The ExternalName service was introduced to cover the need of connecting to an element outside of the Kubernetes cluster. Think of it not as a way to connect to an item within your cluster, but as a connector to an element that lives outside the cluster.&lt;/p&gt;

&lt;p&gt;This serves two purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It creates a single endpoint for all communications to that element.&lt;/li&gt;



&lt;li&gt;In case that external service needs to be replaced, it’s easier to switch by just modifying the ExternalName, instead of all connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
    - name: web
      port: 80
  type: ExternalName
  externalName: db.myexternalserver.com
&lt;/pre&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Services are a key aspect of Kubernetes, as they provide a way to expose internal endpoints inside and outside of the cluster.&lt;/p&gt;

&lt;p&gt;The ClusterIP service just creates a connector for in-cluster communication. Use it only in case you have a specific application that needs to connect with others inside your cluster.&lt;/p&gt;

&lt;p&gt;NodePort and LoadBalancer are used for external access to your applications. It’s preferable to use LoadBalancer to distribute requests equally in multi-Pod implementations, but note that your cloud provider must implement load balancing for this type to be available.&lt;/p&gt;

&lt;p&gt;Apart from these, Kubernetes provides Ingresses, a way to create an HTTP connection with load balancing for external use.&lt;/p&gt;
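
&lt;p&gt;As a quick illustration (the host and service names here are assumed), an Ingress routes HTTP traffic to a backing service:&lt;/p&gt;

&lt;pre&gt;apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myingress
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myservice
            port:
              number: 80
&lt;/pre&gt;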








&lt;h2&gt;&lt;em&gt;Debug service golden signals with Sysdig Monitor&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;With Sysdig Monitor, you can quickly debug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error rate&lt;/li&gt;
&lt;li&gt;Saturation&lt;/li&gt;
&lt;li&gt;Traffic&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And thanks to its Container Observability with eBPF, you can do this without adding any app or code instrumentation.&lt;/p&gt;

&lt;p&gt;Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;

&lt;p&gt;&lt;a&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NpnA6y3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/image-3-1.png" alt="How to debug a crashloopbackoff with Sysdig Monitor Advisor" width="880" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sign up now for a &lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener"&gt;free trial of Sysdig Monitor&lt;/a&gt;&lt;/p&gt;



</description>
      <category>kubernetes</category>
      <category>services</category>
    </item>
    <item>
      <title>Kubernetes 1.26 What's new?</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 01 Dec 2022 09:19:05 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-126-whats-new-4736</link>
      <guid>https://dev.to/sysdig/kubernetes-126-whats-new-4736</guid>
      <description>&lt;p&gt;&lt;strong&gt;Kubernetes 1.26&lt;/strong&gt; is about to be released, and it comes packed with novelties! Where do we begin?&lt;/p&gt;

&lt;p&gt;This release brings 37 enhancements, on par with the &lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/" rel="noopener noreferrer"&gt;40 in Kubernetes 1.25&lt;/a&gt; and the &lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/" rel="noopener noreferrer"&gt;46 in Kubernetes 1.24&lt;/a&gt;. Of those 37 enhancements, 11 are graduating to Stable, 10 are existing features that keep improving, 16 are completely new, and one is a deprecated feature.&lt;/p&gt;

&lt;p&gt;Watch out for all the &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#deprecations" rel="noopener noreferrer"&gt;deprecations and removals in this version&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;One new feature stands out in this release that has the potential to &lt;strong&gt;change the way users interact with Kubernetes&lt;/strong&gt;: being able to &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3294" rel="noopener noreferrer"&gt;provision volumes with snapshots from other namespaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are also new features aimed at &lt;strong&gt;high performance workloads&lt;/strong&gt;, like scientific research or machine learning: better &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3545" rel="noopener noreferrer"&gt;control over what physical CPU cores your workloads run on&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, other features will make life easier for cluster administrators, like &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3515" rel="noopener noreferrer"&gt;support for OpenAPIv3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are really hyped about this release!&lt;/p&gt;

&lt;p&gt;There is plenty to talk about, so let's get started with what’s new in Kubernetes 1.26.&lt;/p&gt;

&lt;h2 id="editors"&gt;Kubernetes 1.26 – Editor’s pick:&lt;/h2&gt;

&lt;p&gt;These are the features that look most exciting to us in this release (ymmv):&lt;/p&gt;

&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3294" rel="noopener noreferrer"&gt;#3294&lt;/a&gt; Provision volumes from cross-namespace snapshots&lt;/h3&gt;





&lt;p&gt;The &lt;em&gt;VolumeSnapshot&lt;/em&gt; feature allows Kubernetes users to provision volumes from volume snapshots, providing great benefits for users and applications, like enabling database administrators to snapshot a database before any critical operation, or the ability to develop and implement backup solutions.&lt;/p&gt;





&lt;p&gt;Starting in Kubernetes 1.26 as an Alpha feature, users will be able to create a &lt;em&gt;PersistentVolumeClaim&lt;/em&gt; from a &lt;em&gt;VolumeSnapshot&lt;/em&gt; across namespaces, breaking the initial limitation of having both objects in the same namespace.&lt;/p&gt;





&lt;p&gt;This enhancement eliminates constraints that prevented users and applications from carrying out fundamental tasks, like saving a database checkpoint when applications and services live in different namespaces.&lt;/p&gt;
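
&lt;p&gt;As a sketch (the names are hypothetical, and the Alpha feature gate must be enabled), the &lt;em&gt;PersistentVolumeClaim&lt;/em&gt; references the snapshot through &lt;code&gt;dataSourceRef&lt;/code&gt; with an explicit namespace:&lt;/p&gt;

&lt;pre&gt;apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: app-team
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: db-checkpoint
    namespace: backup-team
&lt;/pre&gt;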





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/v%C3%ADctor-hernando-martin-49836334/"&gt;Víctor Hernando&lt;/a&gt; - Sr. Technical Marketing Manager at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3488" rel="noopener noreferrer"&gt;#3488&lt;/a&gt; CEL for admission control&lt;/h3&gt;





&lt;p&gt;Finally, a practical implementation of the validation expression language from Kubernetes 1.25!&lt;/p&gt;





&lt;p&gt;By defining rules for the admission controller as Kubernetes objects, we can start forgetting about managing webhooks, simplifying the setup of our clusters. Not only that, but implementing &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-security/kubernetes-security-101/" rel="noopener noreferrer"&gt;Kubernetes security&lt;/a&gt; is a bit easier now.&lt;/p&gt;
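
&lt;p&gt;As a sketch of what such a rule looks like (the resource scope and expression are illustrative), a &lt;em&gt;ValidatingAdmissionPolicy&lt;/em&gt; declares its checks as CEL expressions:&lt;/p&gt;

&lt;pre&gt;apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: replica-limit
spec:
  matchConstraints:
    resourceRules:
    - apiGroups: ["apps"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["deployments"]
  validations:
  - expression: "object.spec.replicas &amp;lt;= 5"
&lt;/pre&gt;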





&lt;p&gt;We love to see these user-friendly improvements. They are the key to keep growing Kubernetes adoption.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/victorjimenez/"&gt;Víctor Jiménez Cerrada&lt;/a&gt; - Content Engineering Manager at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3466" rel="noopener noreferrer"&gt;#3466&lt;/a&gt; Kubernetes component health SLIs&lt;/h3&gt;





&lt;p&gt;Since Kubernetes 1.26, you can configure Service Level Indicator (SLI) metrics for the Kubernetes components binaries. Once you enable them, Kubernetes will expose the SLI metrics in the &lt;code&gt;/metrics/slis&lt;/code&gt; endpoint - so you won't need a Prometheus exporter. This can take &lt;a href="https://sysdig.com/blog/kubernetes-monitoring-prometheus/" rel="noopener noreferrer"&gt;Kubernetes monitoring&lt;/a&gt; to another level making it easier to create health dashboards and configure PromQL alerts to assure your cluster's stability.&lt;/p&gt;
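
&lt;p&gt;As a sketch, after enabling the &lt;code&gt;ComponentSLIs&lt;/code&gt; feature gate on a component, the new endpoint can be queried directly, for example against the API server:&lt;/p&gt;

&lt;pre&gt;kubectl get --raw /metrics/slis
&lt;/pre&gt;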





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://twitter.com/eckelon"&gt;Jesús Ángel Samitier&lt;/a&gt; - Integrations Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#2371" rel="noopener noreferrer"&gt;#2371&lt;/a&gt; cAdvisor-less, CRI-full container and &lt;em&gt;Pod&lt;/em&gt; stats&lt;/h3&gt;





&lt;p&gt;Currently, to gather metrics from containers, such as CPU or memory consumed, Kubernetes relies on cAdvisor. This feature presents an alternative, enriching the CRI API to provide all the metrics from the containers, allowing more flexibility and better accuracy. After all, it's the Container Runtime who best knows the behavior of the container.&lt;/p&gt;





&lt;p&gt;This feature represents one more step on the roadmap to remove cAdvisor from Kubernetes code. However, during this transition, cAdvisor will be modified not to generate the metrics added to the CRI API, avoiding duplicated metrics with possibly different and incoherent values.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://twitter.com/maellyssa"&gt;David de Torres Huerta&lt;/a&gt; – Engineering Manager at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3063" rel="noopener noreferrer"&gt;#3063&lt;/a&gt; Dynamic resource allocation&lt;/h3&gt;





&lt;p&gt;This new Kubernetes release introduces a new Alpha feature which will provide extended resource management for advanced hardware. As a cherry on top, it comes with a user-friendly API to describe resource requests. With the increasing demand to process different hardware components, like GPU or FPGA, and the need to set up initialization and cleanup, this new feature will speed up Kubernetes adoption in areas like scientific research or edge computing.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/javier-martinez-2b2a955/"&gt;Javier Martínez&lt;/a&gt; - Devops Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3545" rel="noopener noreferrer"&gt;#3545&lt;/a&gt; Improved multi-numa alignment in Topology Manager&lt;/h3&gt;





&lt;p&gt;This is yet another feature aimed at high performance workloads, like those involved in scientific computing. We are seeing the new CPU manager taking shape since Kubernetes 1.22 and 1.23, enabling developers to keep their workloads close to where their data is stored in memory, improving performance. Kubernetes 1.26 goes a step further, opening the door to further customizations for this feature. After all, not all workloads and CPU architectures are the same.&lt;/p&gt;





&lt;p&gt;The future of HPC on Kubernetes is looking quite promising, indeed.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/vjjmiras/"&gt;Vicente J. Jiménez Miras&lt;/a&gt; – Security Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3335" rel="noopener noreferrer"&gt;#3335&lt;/a&gt; Allow StatefulSet to control start replica ordinal numbering&lt;/h3&gt;





&lt;p&gt;&lt;em&gt;StatefulSets&lt;/em&gt; in Kubernetes often are critical backend services, like clustered databases or message queues.&lt;br&gt;This enhancement, seemingly a trivial numbering change, allows for greater flexibility and enables new techniques for rolling cross-namespace or even cross-cluster migrations of the replicas of the &lt;em&gt;StatefulSet&lt;/em&gt; &lt;strong&gt;without any downtime&lt;/strong&gt;. While the process might seem a bit clunky, involving careful definition of &lt;em&gt;PodDisruptionBudgets&lt;/em&gt; and the moving of resources relative to the migrating replica, we can surely envision tools (or existing operators enhancements) that automate these operations for &lt;strong&gt;seamless migrations&lt;/strong&gt;, in stark contrast with the cold-migration strategy (shutdown-backup-restore) that is currently possible.&lt;/p&gt;
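
&lt;p&gt;A minimal sketch of the new field (requires the &lt;code&gt;StatefulSetStartOrdinal&lt;/code&gt; feature gate; the name is illustrative): setting &lt;code&gt;spec.ordinals.start&lt;/code&gt; shifts the replica numbering, so three replicas would be named db-5, db-6, and db-7 instead of starting at zero:&lt;/p&gt;

&lt;pre&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  ordinals:
    start: 5
  replicas: 3
&lt;/pre&gt;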





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/danielsimionato/"&gt;Daniel Simionato&lt;/a&gt; - Security Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3325" rel="noopener noreferrer"&gt;#3325&lt;/a&gt; Auth API to get self user attributes&lt;/h3&gt;





&lt;p&gt;This new feature coming to Alpha will simplify cluster administrators' work, especially when they are managing multiple clusters. It will also assist in complex authentication flows, as it lets users query their own user information or permissions inside the cluster.&lt;/p&gt;





&lt;p&gt;Also, this works whether you are using a proxy (the Kubernetes API server fills in the &lt;code&gt;userInfo&lt;/code&gt; after all authentication mechanisms are applied) or impersonating a user (you receive the details and properties of the user that was impersonated), so you can retrieve your effective user information in a very easy way.&lt;/p&gt;
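
&lt;p&gt;In practice, this surfaces through a new &lt;em&gt;SelfSubjectReview&lt;/em&gt; API and, assuming the Alpha feature is enabled in your cluster and client, a companion kubectl command:&lt;/p&gt;

&lt;pre&gt;kubectl alpha auth whoami
&lt;/pre&gt;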





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/miguelhzbz/"&gt;Miguel Hernández&lt;/a&gt; - Security Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3352" rel="noopener noreferrer"&gt;#3352&lt;/a&gt; Aggregated Discovery&lt;/h3&gt;





&lt;p&gt;This is a tiny change for users, but one step further in cleaning up the Kubernetes internals and improving performance. Reducing the number of API calls by aggregating them (at least on the discovery side) is a nice solution to a growing problem. Hopefully, this will provide a small break to cluster administrators.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/ddok/"&gt;Devid Dokash&lt;/a&gt; - Content Engineering Intern at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h2 id="deprecations"&gt;Deprecations&lt;/h2&gt;





&lt;p&gt;A few beta APIs and features have been removed in Kubernetes 1.26, including:&lt;/p&gt;





&lt;p&gt;&lt;strong&gt;Deprecated &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25"&gt;API versions&lt;/a&gt;&lt;/strong&gt; that are no longer served, and you should use a newer one:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;CRI &lt;code&gt;v1alpha2&lt;/code&gt;, use &lt;code&gt;v1&lt;/code&gt; (&lt;em&gt;containerd&lt;/em&gt; version 1.5 and older are not supported).&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;flowcontrol.apiserver.k8s.io/v1beta1&lt;/code&gt;, use &lt;code&gt;v1beta2&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;autoscaling/v2beta2&lt;/code&gt;, use &lt;code&gt;v2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Deprecated&lt;/strong&gt;. Implement an alternative before the next release goes out:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;In-tree GlusterFS driver.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;kubectl --prune-whitelist&lt;/code&gt;, use &lt;code&gt;--prune-allowlist&lt;/code&gt; instead.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;kube-apiserver --master-service-namespace&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Several unused options for &lt;code&gt;kubectl run&lt;/code&gt;: &lt;code&gt;--cascade&lt;/code&gt;, &lt;code&gt;--filename&lt;/code&gt;, &lt;code&gt;--force&lt;/code&gt;, &lt;code&gt;--grace-period&lt;/code&gt;, &lt;code&gt;--kustomize&lt;/code&gt;, &lt;code&gt;--recursive&lt;/code&gt;, &lt;code&gt;--timeout&lt;/code&gt;, &lt;code&gt;--wait&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;CLI flag &lt;code&gt;pod-eviction-timeout&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;The &lt;code&gt;apiserver_request_slo_duration_seconds&lt;/code&gt; metric, use &lt;code&gt;apiserver_request_sli_duration_seconds&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Removed&lt;/strong&gt;. Implement an alternative before upgrading:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;Legacy authentication for Azure and Google Cloud.&lt;/li&gt;



&lt;li&gt;The &lt;code&gt;userspace&lt;/code&gt; proxy mode.&lt;/li&gt;



&lt;li&gt;Dynamic &lt;em&gt;kubelet&lt;/em&gt; configuration.&lt;/li&gt;



&lt;li&gt;Several &lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#2845" rel="noopener noreferrer"&gt;command line arguments related to logging&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;in-tree OpenStack (&lt;code&gt;cinder&lt;/code&gt; volume type), use &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/cloud-provider-openstack"&gt;the CSI driver&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Other changes&lt;/strong&gt; you should adapt your configs for:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;Pod Security admission: the pod-security &lt;code&gt;warn&lt;/code&gt; level will now default to the &lt;code&gt;enforce&lt;/code&gt; level.&lt;/li&gt;



&lt;li&gt;kubelet: The default &lt;code&gt;cpuCFSQuotaPeriod&lt;/code&gt; value with the &lt;code&gt;cpuCFSQuotaPeriod&lt;/code&gt; flag enabled is now 100µs instead of 100ms.&lt;/li&gt;



&lt;li&gt;kubelet: The &lt;code&gt;--container-runtime-endpoint&lt;/code&gt; flag cannot be empty anymore.&lt;/li&gt;



&lt;li&gt;kube-apiserver: gzip compression switched from level 4 to level 1.&lt;/li&gt;



&lt;li&gt;Metrics: Changed &lt;code&gt;preemption_victims&lt;/code&gt; from &lt;code&gt;LinearBuckets&lt;/code&gt; to &lt;code&gt;ExponentialBuckets&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Metrics: &lt;code&gt;etcd_db_total_size_in_bytes&lt;/code&gt; is renamed to &lt;code&gt;apiserver_storage_db_total_size_in_bytes&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Metrics: &lt;code&gt;kubelet_kubelet_credential_provider_plugin_duration&lt;/code&gt; is renamed &lt;code&gt;kubelet_credential_provider_plugin_duration&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Metrics: &lt;code&gt;kubelet_kubelet_credential_provider_plugin_errors&lt;/code&gt; is renamed &lt;code&gt;kubelet_credential_provider_plugin_errors&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Removed Windows Server, Version 20H2 flavors from various container images.&lt;/li&gt;



&lt;li&gt;The e2e.test binary no longer emits JSON structs to document progress.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;You can check the full list of changes in the &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.26.md"&gt;Kubernetes 1.26 release notes&lt;/a&gt;. Also, we recommend the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/blog/2022/11/18/upcoming-changes-in-kubernetes-1-26/"&gt;Kubernetes Removals and Deprecations In 1.26&lt;/a&gt; article, as well as keeping the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/using-api/deprecation-guide/"&gt;deprecated API migration guide&lt;/a&gt; close for the future.&lt;/p&gt;





&lt;h3 id="281"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/281"&gt;#281&lt;/a&gt; Dynamic Kubelet configuration&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;/p&gt;





&lt;p&gt;After being in beta since Kubernetes 1.11, the Kubernetes team &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/kubernetes/issues/100799"&gt;has decided&lt;/a&gt; to deprecate &lt;code&gt;DynamicKubeletConfig&lt;/code&gt; instead of continuing its development.&lt;/p&gt;





&lt;p&gt;This feature was marked for deprecation in 1.21, then removed from the Kubelet in 1.24. Now in 1.26, it has been &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/pull/3605/files#diff-138ec4a122ef9ea3b885191796faf63ca6511747e4be18840dd67ffa2a386d1d"&gt;completely removed from Kubernetes&lt;/a&gt;.&lt;/p&gt;





&lt;h2 id="api"&gt;Kubernetes 1.26 API&lt;/h2&gt;





&lt;h3 id="3352"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3352"&gt;#3352&lt;/a&gt; Aggregated discovery&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; api-machinery&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;AggregatedDiscoveryEndpoint&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Every Kubernetes client, like &lt;code&gt;kubectl&lt;/code&gt;, needs to discover which APIs, and which versions of those APIs, are available in the &lt;code&gt;kube-apiserver&lt;/code&gt;. For that, they need to make a request for each API group and version, which causes a storm of requests.&lt;/p&gt;





&lt;p&gt;This enhancement aims to reduce all those calls to just two.&lt;/p&gt;





&lt;p&gt;Clients can include &lt;code&gt;as=APIGroupDiscoveryList&lt;/code&gt; in the &lt;code&gt;Accept&lt;/code&gt; header of their requests to the &lt;code&gt;/api&lt;/code&gt; and &lt;code&gt;/apis&lt;/code&gt; endpoints. Then, the server will return an aggregated document (&lt;code&gt;APIGroupDiscoveryList&lt;/code&gt;) with all the available APIs and their versions.&lt;/p&gt;
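
&lt;p&gt;As an illustrative sketch, with the feature gate enabled and the API server proxied locally via &lt;code&gt;kubectl proxy&lt;/code&gt;, the aggregated document can be requested directly. The exact &lt;code&gt;Accept&lt;/code&gt; parameters below follow the KEP and may change while the feature is alpha:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl proxy --port=8001 &amp;amp;
$ curl -H 'Accept: application/json;g=apidiscovery.k8s.io;v=v2beta1;as=APIGroupDiscoveryList' \
    http://localhost:8001/apis
&lt;/code&gt;&lt;/pre&gt;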





&lt;h3 id="3488"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3488"&gt;#3488&lt;/a&gt; CEL for admission control&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; api-machinery&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Building on &lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#2876" rel="noopener noreferrer"&gt;#2876 CRD validation expression language&lt;/a&gt; from Kubernetes 1.25, this enhancement provides a new admission controller type (&lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt;) that allows implementing some validations without relying on webhooks.&lt;/p&gt;





&lt;p&gt;These new policies can be defined like:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
    - expression: "object.spec.replicas &amp;lt;= 5"
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;This policy would reject the creation or update of any matching Deployment with more than &lt;code&gt;5&lt;/code&gt; replicas.&lt;/p&gt;





&lt;p&gt;Discover the full power of this feature &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/website/blob/9a8c421c1e18dd9485788f1ffc23944c41e91483/content/en/docs/reference/access-authn-authz/validating-admission-policy.md"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="1965"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1965"&gt;#1965&lt;/a&gt; &lt;em&gt;kube-apiserver&lt;/em&gt; identity&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; api-machinery&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;APIServerIdentity&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;In order to better control which &lt;em&gt;kube-apiservers&lt;/em&gt; are alive in a high availability cluster, a new lease / heartbeat system has been implemented.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/#1965" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.20&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="apps"&gt;Apps in Kubernetes 1.26&lt;/h2&gt;





&lt;h3 id="3017"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3017"&gt;#3017&lt;/a&gt; &lt;em&gt;PodHealthyPolicy&lt;/em&gt; for &lt;em&gt;PodDisruptionBudget&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PDBUnhealthyPodEvictionPolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/"&gt;PodDisruptionBudget&lt;/a&gt;&lt;/em&gt; allows you to communicate some minimums to your cluster administrator to make maintenance tasks easier, like "Do not destroy more than one of these" or "Keep at least two of these alive".&lt;/p&gt;

&lt;p&gt;However, this only takes into account whether the pods are running, not whether they are healthy. It may happen that your pods are Running but not Ready, and a &lt;em&gt;PodDisruptionBudget&lt;/em&gt; may be preventing their eviction.&lt;/p&gt;

&lt;p&gt;This enhancement expands these budget definitions with the &lt;code&gt;status.currentHealthy&lt;/code&gt;, &lt;code&gt;status.desiredHealthy&lt;/code&gt;, and &lt;code&gt;spec.unhealthyPodEvictionPolicy&lt;/code&gt; extra fields to help you define how to manage unhealthy pods.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl get poddisruptionbudgets example-pdb -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
[...]
spec:
  unhealthyPodEvictionPolicy: IfHealthyBudget
[...]
status:
  currentHealthy: 3
  desiredHealthy: 2
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1
&lt;/code&gt;&lt;/pre&gt;
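
&lt;p&gt;To opt out of the default behavior, set the policy in the budget's spec. A minimal sketch, where the name and selector are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: example-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: example
  # AlwaysAllow permits evicting Running-but-not-Ready pods,
  # even when the budget would otherwise block the eviction.
  unhealthyPodEvictionPolicy: AlwaysAllow
&lt;/code&gt;&lt;/pre&gt;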

&lt;h3 id="3335"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3335"&gt;#3335&lt;/a&gt; Allow &lt;em&gt;StatefulSet&lt;/em&gt; to control start replica ordinal numbering&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;StatefulSetStartOrdinal&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;&lt;em&gt;StatefulSets&lt;/em&gt; in Kubernetes currently number their pods using ordinal numbers, with the first replica being &lt;code&gt;0&lt;/code&gt; and the last being &lt;code&gt;spec.replicas - 1&lt;/code&gt;.&lt;/p&gt;





&lt;p&gt;This enhancement adds a new struct with a single field to the &lt;em&gt;StatefulSet&lt;/em&gt; manifest spec, &lt;code&gt;spec.ordinals.start&lt;/code&gt;, which lets you define the starting number for the replicas controlled by the &lt;em&gt;StatefulSet&lt;/em&gt;.&lt;/p&gt;





&lt;p&gt;This is useful, for example, in cross-namespace or cross-cluster migrations of a &lt;em&gt;StatefulSet&lt;/em&gt;, where a clever use of &lt;em&gt;PodDisruptionBudgets&lt;/em&gt; (and multi-cluster services) allows a controlled rolling migration of the replicas without any downtime for the &lt;em&gt;StatefulSet&lt;/em&gt;.&lt;/p&gt;
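
&lt;p&gt;A minimal sketch of the new field, with a placeholder name and replica count:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example
spec:
  replicas: 3
  ordinals:
    # Replicas will be named example-5, example-6, and example-7
    # instead of example-0 through example-2.
    start: 5
[...]
&lt;/code&gt;&lt;/pre&gt;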





&lt;h3 id="3329"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3329"&gt;#3329&lt;/a&gt; Retriable and non-retriable &lt;em&gt;Pod&lt;/em&gt; failures for &lt;em&gt;Jobs&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;JobPodFailurePolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PodDisruptionConditions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement allows us to configure a &lt;code&gt;.spec.podFailurePolicy&lt;/code&gt; on the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/concepts/workloads/controllers/job/"&gt;Jobs&lt;/a&gt;'s spec that determines whether the Job should be retried or not in case of failure. This way, Kubernetes can terminate Jobs early, avoiding increasing the backoff time in case of infrastructure failures or application errors.&lt;/p&gt;
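
&lt;p&gt;A sketch of a Job spec using this policy (the exit code and image name are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: example-image
  podFailurePolicy:
    rules:
    # Fail the whole Job immediately on a non-retriable
    # application error, instead of retrying up to backoffLimit.
    - action: FailJob
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    # Do not count pod disruptions (e.g., a node drain)
    # against the backoff limit.
    - action: Ignore
      onPodConditions:
      - type: DisruptionTarget
&lt;/code&gt;&lt;/pre&gt;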

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#3329" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.25&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="2307"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2307"&gt;#2307&lt;/a&gt; &lt;em&gt;Job&lt;/em&gt; tracking without lingering &lt;em&gt;Pods&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;JobTrackingWithFinalizers&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With this enhancement, Jobs will be able to remove completed pods earlier, freeing resources in the cluster.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2307" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h2 id="auth"&gt;Kubernetes 1.26 Auth&lt;/h2&gt;

&lt;h3 id="3325"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3325"&gt;#3325&lt;/a&gt; Auth &lt;em&gt;API&lt;/em&gt; to get self user attributes&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; auth&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;APISelfSubjectAttributesReview&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This new feature is extremely useful when a complicated authentication flow is used in a Kubernetes cluster and you want to know your resulting &lt;code&gt;userInfo&lt;/code&gt; after all authentication mechanisms are applied.&lt;/p&gt;





&lt;p&gt;Executing &lt;code&gt;kubectl alpha auth whoami&lt;/code&gt; will produce the following output:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: authentication.k8s.io/v1alpha1
kind: SelfSubjectReview
status:
  userInfo:
    username: jane.doe
    uid: b79dbf30-0c6a-11ed-861d-0242ac120002
    groups:
    - students
    - teachers
    - system:authenticated
    extra:
      skills:
      - reading
      - learning
      subjects:
      - math
      - sports
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;In summary, we can now make a typical &lt;em&gt;/me&lt;/em&gt; request to find out our own user attributes once we are authenticated in the cluster.&lt;/p&gt;





&lt;h3 id="2799"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2799"&gt;#2799&lt;/a&gt; Reduction of secret-based service account tokens&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; auth&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;LegacyServiceAccountTokenNoAutoGeneration&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;API credentials are now obtained through the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/"&gt;TokenRequest API&lt;/a&gt;, stable since &lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#542" rel="noopener noreferrer"&gt;Kubernetes 1.22&lt;/a&gt;, and are mounted into Pods using a projected volume. They are automatically invalidated when their associated Pod is deleted.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#2799" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="network"&gt;Network in Kubernetes 1.26&lt;/h2&gt;





&lt;h3 id="3453"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3453"&gt;#3453&lt;/a&gt; Minimizing &lt;em&gt;iptables-restore&lt;/em&gt; input size&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;MinimizeIPTablesRestore&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement aims to improve the performance of &lt;code&gt;kube-proxy&lt;/code&gt;. It will do so by only sending the rules that have changed on the calls to &lt;code&gt;iptables-restore&lt;/code&gt;, instead of the whole set of rules.&lt;/p&gt;
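
&lt;p&gt;Since the feature is alpha, it has to be enabled explicitly on &lt;code&gt;kube-proxy&lt;/code&gt;; a sketch (how the flag is passed depends on how your cluster deploys kube-proxy):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kube-proxy --feature-gates=MinimizeIPTablesRestore=true [...]
&lt;/code&gt;&lt;/pre&gt;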





&lt;h3 id="1669"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1669"&gt;#1669&lt;/a&gt; &lt;em&gt;Proxy&lt;/em&gt; terminating endpoints&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ProxyTerminatingEndpoints&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement prevents traffic drops during rolling updates by sending all external traffic to both ready and not ready terminating endpoints (preferring the ready ones).&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#1669" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="2595"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2595"&gt;#2595&lt;/a&gt; Expanded DNS configuration&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ExpandedDNSConfig&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;With this enhancement, Kubernetes allows up to 32 DNS search domains, and an increased number of characters for the search path (up to 2048), to keep up with recent DNS resolvers.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2595" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="1435"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1435"&gt;#1435&lt;/a&gt; Support of mixed protocols in &lt;em&gt;Services&lt;/em&gt; with &lt;em&gt;type=LoadBalancer&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;MixedProtocolLBService&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement allows a LoadBalancer Service to serve different protocols under the same port (UDP, TCP). For example, serving both UDP and TCP requests for a DNS or SIP server on the same port.&lt;/p&gt;
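
&lt;p&gt;A sketch of such a Service for a DNS server, with placeholder names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: example-dns
spec:
  type: LoadBalancer
  selector:
    app: example-dns
  ports:
  # Both protocols are served on the same port number.
  - name: dns-tcp
    protocol: TCP
    port: 53
  - name: dns-udp
    protocol: UDP
    port: 53
&lt;/code&gt;&lt;/pre&gt;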

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/#1435" rel="noopener noreferrer"&gt;Kubernetes 1.20 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="2086"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2086"&gt;#2086&lt;/a&gt; Service internal traffic policy&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ServiceInternalTrafficPolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;You can now set the &lt;code&gt;spec.internalTrafficPolicy&lt;/code&gt; field on &lt;code&gt;Service&lt;/code&gt; objects to optimize your cluster traffic:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;With &lt;code&gt;Cluster&lt;/code&gt; (the default), the routing will behave as usual.&lt;/li&gt;



&lt;li&gt;With &lt;code&gt;Local&lt;/code&gt;, traffic will only be sent to endpoints on the same node as the client, and dropped if there are none.&lt;/li&gt;
&lt;/ul&gt;
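
&lt;p&gt;A sketch with placeholder names:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: example-local
spec:
  selector:
    app: example
  ports:
  - port: 80
  # Only route internal traffic to endpoints on the same node
  # as the client pod.
  internalTrafficPolicy: Local
&lt;/code&gt;&lt;/pre&gt;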





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#2086" rel="noopener noreferrer"&gt;Kubernetes 1.21 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3070"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3070"&gt;#3070&lt;/a&gt; Reserve service IP ranges for dynamic and static IP allocation&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ServiceIPStaticSubrange&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This update to the &lt;code&gt;--service-cluster-ip-range&lt;/code&gt; flag will lower the risk of IP conflicts between Services using static and dynamic IP allocation, while remaining backwards compatible.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#3070" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.24&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="nodes"&gt;Kubernetes 1.26 Nodes&lt;/h2&gt;





&lt;h3 id="2371"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2371"&gt;#2371&lt;/a&gt; cAdvisor-less, CRI-full container and Pod stats&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Major change to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PodAndContainerStatsFromCRI&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement summarizes the efforts to retrieve all the stats about running containers and pods from the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/"&gt;Container Runtime Interface (CRI)&lt;/a&gt;, removing the dependencies from cAdvisor.&lt;/p&gt;





&lt;p&gt;Starting with 1.26, when this feature gate is enabled, the metrics on &lt;code&gt;/metrics/cadvisor&lt;/code&gt; are gathered through the CRI instead of cAdvisor.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#2371" rel="noopener noreferrer"&gt;Kubernetes 1.23 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3063"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3063"&gt;#3063&lt;/a&gt; Dynamic resource allocation&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;DynamicResourceAllocation&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Traditionally, the Kubernetes scheduler could only take into account &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;CPU and memory limits and requests&lt;/a&gt;. Later on, the scheduler was expanded to also take storage and other resources into account. However, this is limiting in many scenarios.&lt;/p&gt;





&lt;p&gt;For example, what if the device needs initialization and cleanup, like an FPGA; or what if you want to limit the access to the resource, like a shared GPU?&lt;/p&gt;





&lt;p&gt;This new API covers those scenarios of resource allocation and dynamic detection, using the new &lt;code&gt;ResourceClaimTemplate&lt;/code&gt; and &lt;code&gt;ResourceClass&lt;/code&gt; objects, and the new &lt;code&gt;resourceClaims&lt;/code&gt; field inside &lt;em&gt;Pods&lt;/em&gt;.&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
[...]
spec:
  resourceClaims:
  - name: resource0
    source:
      resourceClaimTemplateName: resource-claim-template
  - name: resource1
    source:
      resourceClaimTemplateName: resource-claim-template
[...]
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;The scheduler can keep track of these resource claims, and only schedule &lt;em&gt;Pods&lt;/em&gt; in those nodes with enough resources available.&lt;/p&gt;
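
&lt;p&gt;The claim template referenced by the Pod points at objects that a vendor's resource driver would define. A sketch, where the class name and driver name are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: example-class
# The third-party driver responsible for allocating this resource.
driverName: resources.example.com
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: resource-claim-template
spec:
  spec:
    resourceClassName: example-class
&lt;/code&gt;&lt;/pre&gt;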





&lt;h3 id="3386"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3386"&gt;#3386&lt;/a&gt; Kubelet evented &lt;em&gt;PLEG&lt;/em&gt; for better performance&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;EventedPLEG&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;The aim of this enhancement is to reduce the CPU usage of the &lt;code&gt;kubelet&lt;/code&gt; when keeping track of all the pod states.&lt;/p&gt;





&lt;p&gt;It will partially reduce the periodic polling that the &lt;code&gt;kubelet&lt;/code&gt; performs, instead relying on notifications from the Container Runtime Interface (CRI) as much as possible.&lt;/p&gt;





&lt;p&gt;If you are interested in the implementation details, you may want to &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3386-kubelet-evented-pleg/README.md"&gt;take a look at the KEP&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="3545"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3545"&gt;#3545&lt;/a&gt; Improved &lt;em&gt;multi-NUMA&lt;/em&gt; alignment in topology manager&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;TopologyManagerPolicyOptions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;TopologyManagerPolicyBetaOptions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;TopologyManagerPolicyAlphaOptions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This is an improvement for &lt;em&gt;TopologyManager&lt;/em&gt; to better handle Non-Uniform Memory Access (&lt;a rel="noopener nofollow noreferrer" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access"&gt;NUMA&lt;/a&gt;) nodes. For some high-performance workloads, it is very important to control in which physical CPU cores they run. You can significantly improve performance if you avoid memory jumping between the caches of the same chip, or between sockets.&lt;/p&gt;





&lt;p&gt;A new &lt;code&gt;--topology-manager-policy-options&lt;/code&gt; flag for the &lt;code&gt;kubelet&lt;/code&gt; will let you pass options that modify the behavior of the Topology Manager.&lt;/p&gt;





&lt;p&gt;Currently, only one alpha option is available:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;When &lt;code&gt;prefer-closest-numa-nodes=true&lt;/code&gt; is passed along, the Topology Manager will align the resources on either a single NUMA node or the minimum number of NUMA nodes possible.&lt;/li&gt;
&lt;/ul&gt;
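
&lt;p&gt;A sketch of how this option could be passed to the &lt;code&gt;kubelet&lt;/code&gt; (the policy choice here is illustrative, since the option is only meaningful with a non-default Topology Manager policy):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubelet --topology-manager-policy=best-effort \
  --topology-manager-policy-options=prefer-closest-numa-nodes=true [...]
&lt;/code&gt;&lt;/pre&gt;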





&lt;p&gt;As new options may be added in the future, several feature gates have been added so you can choose to focus only on the stable ones:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TopologyManagerPolicyOptions&lt;/code&gt;: Will enable the &lt;code&gt;topology-manager-policy-options&lt;/code&gt; flag and the stable options.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;TopologyManagerPolicyBetaOptions&lt;/code&gt;: Will also enable the beta options.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;TopologyManagerPolicyAlphaOptions&lt;/code&gt;: Will also enable the alpha options.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#2902" rel="noopener noreferrer"&gt;#2902 CPUManager policy option to distribute CPUs across NUMA nodes in Kubernetes 1.23&lt;/a&gt;.&lt;br&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2625" rel="noopener noreferrer"&gt;#2625 New CPU Manager Policies in Kubernetes 1.22&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="2133"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2133"&gt;#2133&lt;/a&gt; Kubelet credential provider&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;KubeletCredentialProviders&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement replaces in-tree container image registry credential providers with a new mechanism that is external and pluggable.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/#2133" rel="noopener noreferrer"&gt;Kubernetes 1.20 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3570"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3570"&gt;#3570&lt;/a&gt; Graduate to &lt;em&gt;CPUManager&lt;/em&gt; to GA&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CPUManager&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;The CPUManager is the Kubelet component responsible for assigning pod containers to sets of CPUs on the local node.&lt;/p&gt;





&lt;p&gt;It was introduced in Kubernetes 1.8, and graduated to beta in release 1.10. For 1.26, the core CPUManager has been deemed stable, while experimentation continues with the additional work on its policies.&lt;/p&gt;





&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3545" rel="noopener noreferrer"&gt;#3545 Improved multi-numa alignment in Topology Manager in Kubernetes 1.26&lt;/a&gt;.&lt;br&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2625" rel="noopener noreferrer"&gt;#2625 New CPU Manager Policies in Kubernetes 1.22&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="3573"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3573"&gt;#3573&lt;/a&gt; Graduate &lt;em&gt;DeviceManager&lt;/em&gt; to GA&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;DevicePlugins&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;The DeviceManager in the Kubelet is the component managing the interactions with the different Device Plugins.&lt;/p&gt;





&lt;p&gt;Initially introduced in Kubernetes 1.8 and moved to beta stage in release 1.10, the Device Plugin framework saw widespread adoption and is finally moving to GA in 1.26.&lt;/p&gt;





&lt;p&gt;This framework allows the use of external devices (e.g., &lt;a rel="noopener nofollow noreferrer" href="https://github.com/NVIDIA/k8s-device-plugin"&gt;NVIDIA GPUs&lt;/a&gt;, &lt;a rel="noopener nofollow noreferrer" href="https://github.com/RadeonOpenCompute/k8s-device-plugin"&gt;AMD GPUs&lt;/a&gt;, &lt;a rel="noopener nofollow noreferrer" href="https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin"&gt;SR-IOV NICs&lt;/a&gt;) without modifying core Kubernetes components.&lt;/p&gt;





&lt;h2 id="scheduling"&gt;Scheduling in Kubernetes 1.26&lt;/h2&gt;





&lt;h3 id="3521"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3521"&gt;#3521&lt;/a&gt; &lt;em&gt;Pod&lt;/em&gt; scheduling readiness&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; scheduling&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PodSchedulingReadiness&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement aims to optimize scheduling by letting the Pods define when they are ready to be actually scheduled.&lt;/p&gt;





&lt;p&gt;Not all &lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noopener noreferrer"&gt;pending Pods&lt;/a&gt; are ready to be scheduled. Some of them lack essential resources for some time, which causes extra work in the scheduler as it repeatedly tries to place them.&lt;/p&gt;





&lt;p&gt;The new &lt;code&gt;.spec.schedulingGates&lt;/code&gt; field of a Pod lets you indicate when it is ready for scheduling:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
[...]
spec:
  schedulingGates:
  - name: foo
  - name: bar
[...]
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;When any scheduling gate is present, the Pod won't be scheduled.&lt;/p&gt;





&lt;p&gt;You can check the status with:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;$ kubectl get pod test-pod
NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          7s
&lt;/code&gt;&lt;/pre&gt;
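&lt;p&gt;Once the external controller decides the Pod is ready, it clears the gates and scheduling proceeds. As a minimal sketch (assuming the &lt;code&gt;test-pod&lt;/code&gt; and gate names from the snippets above), one way to lift them is a JSON patch that drops the field:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Remove all scheduling gates; the scheduler will then consider the Pod
$ kubectl patch pod test-pod --type=json \
  -p='[{"op": "remove", "path": "/spec/schedulingGates"}]'
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that scheduling gates can only be removed from an existing Pod, never added.&lt;/p&gt;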





&lt;h3 id="3094"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3094"&gt;#3094&lt;/a&gt; Take taints/tolerations into consideration when calculating &lt;em&gt;PodTopologySpread&lt;/em&gt; skew&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; scheduling&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;NodeInclusionPolicyInPodTopologySpread&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;As we discussed in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-16/#895" rel="noopener noreferrer"&gt;Kubernetes 1.16 - What's new?&lt;/a&gt;" article, the &lt;code&gt;topologySpreadConstraints&lt;/code&gt; fields, along with &lt;code&gt;maxSkew&lt;/code&gt;, allow you to spread your workloads across nodes. A new &lt;code&gt;NodeInclusionPolicies&lt;/code&gt; field allows taking into account &lt;code&gt;NodeAffinity&lt;/code&gt; and &lt;code&gt;NodeTaint&lt;/code&gt; when calculating this pod topology spread skew.&lt;/p&gt;
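&lt;p&gt;The new policies are set per constraint. A minimal sketch of how they fit into a Pod spec (the label and values are illustrative; both fields accept &lt;code&gt;Honor&lt;/code&gt; or &lt;code&gt;Ignore&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;spec:
  topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: myapp
    nodeAffinityPolicy: Honor   # respect nodeAffinity/nodeSelector when computing skew
    nodeTaintsPolicy: Honor     # exclude tainted nodes the Pod does not tolerate
&lt;/code&gt;&lt;/pre&gt;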





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#3094" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.25&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="storage"&gt;Kubernetes 1.26 storage&lt;/h2&gt;





&lt;h3 id="3294"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3294"&gt;#3294&lt;/a&gt; Provision volumes from cross-namespace snapshots&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CrossNamespaceVolumeDataSource&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Prior to Kubernetes 1.26, users were able to provision volumes from snapshots thanks to the &lt;code&gt;VolumeSnapshot&lt;/code&gt; feature. While this is a great and super useful feature, it had some limitations, like the inability to bind a &lt;code&gt;PersistentVolumeClaim&lt;/code&gt; to &lt;code&gt;VolumeSnapshots&lt;/code&gt; from other namespaces.&lt;/p&gt;





&lt;p&gt;This enhancement breaks this limitation and allows Kubernetes users to provision volumes from snapshots across namespaces.&lt;/p&gt;





&lt;p&gt;If you want to use the cross-namespace VolumeSnapshot feature, you’ll have to first create a &lt;code&gt;ReferenceGrant&lt;/code&gt; object, and then a &lt;code&gt;PersistentVolumeClaim&lt;/code&gt; binding to the &lt;code&gt;VolumeSnapshot&lt;/code&gt;. Here, you’ll find a simple example of both objects for learning purposes.&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: ReferenceGrant
metadata:
  name: test
  namespace: default
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: nstest1
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: testsnapshot
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testvolumeclaim
  namespace: nstest1
spec:
  storageClassName: mystorageclass
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: testsnapshot
    namespace: default
  volumeMode: Filesystem
&lt;/code&gt;&lt;/pre&gt;





&lt;h3 id="2268"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2268"&gt;#2268&lt;/a&gt; Non-graceful node shutdown&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;NodeOutOfServiceVolumeDetach&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement addresses node shutdown cases that are not detected properly, where the pods that are part of a &lt;em&gt;StatefulSet&lt;/em&gt; will be stuck in terminating status on the shutdown node and cannot be moved to a new running node.&lt;/p&gt;





&lt;p&gt;In this case, the pods will be forcefully deleted, triggering the deletion of the &lt;em&gt;VolumeAttachments&lt;/em&gt;, and new pods will be created on a different running node so that the application can continue to function.&lt;/p&gt;
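&lt;p&gt;The mechanism is opt-in per incident: after confirming the node is really down, a cluster administrator applies the out-of-service taint, which tells Kubernetes it is safe to detach the volumes and delete the stuck pods (the node name below is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Mark a node as out of service after a hard, undetected shutdown
$ kubectl taint nodes node-1 node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
&lt;/code&gt;&lt;/pre&gt;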





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#2268" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3333"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3333"&gt;#3333&lt;/a&gt; Retroactive default &lt;em&gt;StorageClass&lt;/em&gt; &lt;em&gt;assignement&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;RetroactiveDefaultStorageClass&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement helps manage the case when cluster administrators change the default storage class. All &lt;em&gt;PVCs&lt;/em&gt; without &lt;em&gt;StorageClass&lt;/em&gt; that were created while the change took place will retroactively be set to the new default &lt;em&gt;StorageClass&lt;/em&gt;.&lt;/p&gt;
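&lt;p&gt;As a reminder, the cluster default is controlled by an annotation on the &lt;em&gt;StorageClass&lt;/em&gt;. A minimal sketch of switching it (the class name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Make "mystorageclass" the cluster default; with this feature enabled,
# PVCs created without a StorageClass while no default existed are updated retroactively
$ kubectl patch storageclass mystorageclass \
  -p '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
&lt;/code&gt;&lt;/pre&gt;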

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#3333" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.25&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="1491"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1491"&gt;#1491&lt;/a&gt; vSphere &lt;em&gt;in-tree&lt;/em&gt; to CSI driver migration&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CSIMigrationvSphere&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As we covered in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#1487" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.23&lt;/a&gt;" article, the CSI driver for vSphere has been stable for some time. All plugin operations for &lt;code&gt;vspherevolume&lt;/code&gt; are now redirected to &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes-sigs/vsphere-csi-driver"&gt;the out-of-tree 'csi.vsphere.vmware.com' driver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This enhancement is part of the &lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#625" rel="noopener noreferrer"&gt;#625 In-tree storage plugin to CSI Driver Migration&lt;/a&gt; effort.&lt;/p&gt;

&lt;h3 id="1885"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1885"&gt;#1885&lt;/a&gt; Azure file &lt;em&gt;in-tree&lt;/em&gt; to CSI driver migration&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CSIMigrationAzureFile&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement summarizes &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/concepts/storage/volumes/#azurefile"&gt;the work to move Azure File code&lt;/a&gt; out of the main Kubernetes binaries (out-of-tree).&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#1885" rel="noopener noreferrer"&gt;Kubernetes 1.21 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="2317"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2317"&gt;#2317&lt;/a&gt; Allow Kubernetes to supply pod's &lt;em&gt;fsgroup&lt;/em&gt; to CSI driver on mount&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;DelegateFSGroupToCSIDriver&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement proposes providing the CSI driver with the &lt;em&gt;fsgroup&lt;/em&gt; of the pods as an explicit field, so the CSI driver can be the one applying this natively on mount time.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2317" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h2 id="other"&gt;Other enhancements in Kubernetes 1.26&lt;/h2&gt;

&lt;h3 id="3466"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3466"&gt;#3466&lt;/a&gt; Kubernetes component health SLIs&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; instrumentation&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ComponentSLIs&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;There isn't a standard format to query the health data of Kubernetes components.&lt;/p&gt;

&lt;p&gt;Starting with Kubernetes 1.26, a new endpoint &lt;code&gt;/metrics/slis&lt;/code&gt; will be available on each component exposing their Service Level Indicator (SLI) metrics in Prometheus format.&lt;/p&gt;

&lt;p&gt;For each component, two metrics will be exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;gauge&lt;/strong&gt;, representing the current state of the healthcheck.&lt;/li&gt;



&lt;li&gt;A &lt;strong&gt;counter&lt;/strong&gt;, recording the cumulative counts observed for each healthcheck state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this information, you can check the status of the Kubernetes internals over time, e.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubernetes_healthcheck{name="etcd",type="readyz"}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And create an alert for when something's wrong, e.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubernetes_healthchecks_total{name="etcd",status="error",type="readyz"} &amp;gt; 0&lt;/code&gt;&lt;/pre&gt;
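&lt;p&gt;For instance, the expression above could be wired into a Prometheus alerting rule (group and alert names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
- name: kubernetes-slis
  rules:
  - alert: EtcdHealthcheckFailing
    expr: kubernetes_healthchecks_total{name="etcd",status="error",type="readyz"} &amp;gt; 0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: etcd readyz healthcheck is reporting errors
&lt;/code&gt;&lt;/pre&gt;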

&lt;h3 id="3498"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3498"&gt;#3498&lt;/a&gt; Extend metrics stability&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; instrumentation&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; N/A&lt;/p&gt;

&lt;p&gt;Metrics in Kubernetes are classified as &lt;code&gt;alpha&lt;/code&gt; or &lt;code&gt;stable&lt;/code&gt;. The &lt;code&gt;stable&lt;/code&gt; ones are guaranteed to be maintained, providing you with the information to prepare your dashboards so they don't break unexpectedly when you upgrade your cluster.&lt;/p&gt;

&lt;p&gt;In Kubernetes 1.26, two new classes are added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;beta&lt;/code&gt;: For metrics related to beta features. They may change or disappear, but they are in a more advanced development state than the alpha ones.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;internal&lt;/code&gt;: Metrics for internal usage that you shouldn't worry about, either because they don't provide useful information for cluster administrators, or because they may change without notice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can check a full &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/instrumentation/metrics/"&gt;list of available metrics in the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#1209" rel="noopener noreferrer"&gt;#1209 Metrics stability enhancement in Kubernetes 1.21&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="3515"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3515"&gt;#3515&lt;/a&gt; OpenAPI v3 for &lt;em&gt;kubectl&lt;/em&gt; explain&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; cli&lt;br&gt;&lt;strong&gt;Environment variable:&lt;/strong&gt; &lt;code&gt;KUBECTL_EXPLAIN_OPENAPIV3&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement allows &lt;code&gt;kubectl explain&lt;/code&gt; to gather the data from OpenAPIv3 instead of v2.&lt;/p&gt;

&lt;p&gt;In OpenAPIv3, some data can be represented in a better way, like &lt;em&gt;CustomResourceDefinition&lt;/em&gt;s (CRDs).&lt;/p&gt;

&lt;p&gt;Internal work is also being made to improve how &lt;code&gt;kubectl explain&lt;/code&gt; prints the output.&lt;/p&gt;
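&lt;p&gt;To try it out, opt in with the environment variable mentioned above (the field being explained is just an example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Use the OpenAPI v3 schema when explaining a field
$ KUBECTL_EXPLAIN_OPENAPIV3=true kubectl explain pods.spec.schedulingGates
&lt;/code&gt;&lt;/pre&gt;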

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#2896" rel="noopener noreferrer"&gt;#2896 OpenAPI v3 in Kubernetes 1.24&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="1440"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1440"&gt;#1440&lt;/a&gt; &lt;em&gt;kubectl&lt;/em&gt; events&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; cli&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; N/A&lt;/p&gt;

&lt;p&gt;A new &lt;code&gt;kubectl events&lt;/code&gt; command is available that will enhance the current functionality of &lt;code&gt;kubectl get events&lt;/code&gt;.&lt;/p&gt;
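&lt;p&gt;For example, &lt;code&gt;kubectl events&lt;/code&gt; can filter by a single object and keep watching for new events (the pod name is illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Show events related to one pod and keep watching for new ones
$ kubectl events --for pod/web-1 --watch
&lt;/code&gt;&lt;/pre&gt;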

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#1440" rel="noopener noreferrer"&gt;Kubernetes 1.23 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="3031"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3031"&gt;#3031&lt;/a&gt; Signing release artifacts&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; release&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; N/A&lt;/p&gt;

&lt;p&gt;This enhancement introduces a unified way to sign artifacts in order to help avoid &lt;a href="https://sysdig.com/blog/software-supply-chain-security/" rel="noopener noreferrer"&gt;supply chain attacks&lt;/a&gt;. It relies on the &lt;a rel="noopener nofollow noreferrer" href="https://www.sigstore.dev/"&gt;sigstore&lt;/a&gt; project tools, and more specifically &lt;code&gt;&lt;a rel="noopener nofollow noreferrer" href="https://github.com/sigstore/cosign"&gt;cosign&lt;/a&gt;&lt;/code&gt;. Although it doesn’t add new functionality, it will surely help to keep our cluster more protected.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#3031" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="3503"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3503"&gt;#3503&lt;/a&gt; Host network support for Windows pods&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; windows&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;WindowsHostNetwork&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;There is an odd situation with Windows pods: you can set &lt;code&gt;hostNetwork=true&lt;/code&gt; for them, but it doesn't change anything. There isn't any platform impediment; the implementation was simply missing.&lt;/p&gt;

&lt;p&gt;Starting with Kubernetes 1.26, the &lt;code&gt;kubelet&lt;/code&gt; can now request that Windows pods use the host's network namespace instead of creating a new pod network namespace.&lt;/p&gt;

&lt;p&gt;This will come in handy to avoid port exhaustion when there are large numbers of services.&lt;/p&gt;
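&lt;p&gt;With the feature gate enabled, the usual field now takes effect on Windows nodes too. A minimal sketch (pod name and image are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: host-net-win
spec:
  hostNetwork: true          # now honored on Windows when WindowsHostNetwork is enabled
  nodeSelector:
    kubernetes.io/os: windows
  containers:
  - name: app
    image: mcr.microsoft.com/windows/nanoserver:ltsc2022
&lt;/code&gt;&lt;/pre&gt;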

&lt;h3 id="1981"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1981"&gt;#1981&lt;/a&gt; Support for Windows privileged containers&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; windows&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;WindowsHostProcessContainers&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement brings the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/concepts/workloads/pods/#privileged-mode-for-containers"&gt;privileged containers&lt;/a&gt; feature available in Linux to Windows hosts.&lt;/p&gt;

&lt;p&gt;Privileged containers have access to the host, as if they were running directly on it. Although they are not recommended for most of the workloads, they are quite useful for administration, security, and monitoring purposes.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#1981" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;








&lt;p&gt;That’s all for Kubernetes 1.26, folks! Exciting as always; get ready to upgrade your clusters if you are intending to use any of these features.&lt;/p&gt;

&lt;p&gt;If you liked this, you might want to check out our previous ‘What’s new in Kubernetes’ editions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-26-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.26 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.25 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.23 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.21 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/" rel="noopener noreferrer"&gt;Kubernetes 1.20 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-19/" rel="noopener noreferrer"&gt;Kubernetes 1.19 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-18/" rel="noopener noreferrer"&gt;Kubernetes 1.18 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-17/" rel="noopener noreferrer"&gt;Kubernetes 1.17 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-16/" rel="noopener noreferrer"&gt;Kubernetes 1.16 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-15/" rel="noopener noreferrer"&gt;Kubernetes 1.15 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-14/" rel="noopener noreferrer"&gt;Kubernetes 1.14 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-in-kubernetes-1-13" rel="noopener noreferrer"&gt;Kubernetes 1.13 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-in-kubernetes-1-12" rel="noopener noreferrer"&gt;Kubernetes 1.12 - What's new?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Get involved in the Kubernetes community:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io"&gt;the project homepage&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Check out &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/"&gt;the Kubernetes project on GitHub&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Get involved &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/community/"&gt;with the Kubernetes community&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Meet the maintainers &lt;a rel="noopener nofollow noreferrer" href="https://slack.k8s.io"&gt;on the Kubernetes Slack&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Follow &lt;a rel="noopener nofollow noreferrer" href="https://twitter.com/kubernetesio"&gt;@KubernetesIO on Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if you enjoy keeping up to date with the Kubernetes ecosystem, &lt;a href="https://go.sysdig.com/container-newsletter-signup.html" rel="noopener noreferrer"&gt;subscribe to our container newsletter&lt;/a&gt;, a monthly email with the coolest stuff happening in the cloud-native ecosystem.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Understanding Kubernetes Limits and Requests</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Mon, 21 Nov 2022 08:43:18 +0000</pubDate>
      <link>https://dev.to/sysdig/understanding-kubernetes-limits-and-requests-5m1</link>
      <guid>https://dev.to/sysdig/understanding-kubernetes-limits-and-requests-5m1</guid>
      <description>&lt;p&gt;When working with containers in Kubernetes, it’s important to know what are the resources involved and how they are needed. Some processes will require more CPU or memory than others. Some are critical and should never be starved. &lt;/p&gt;

&lt;p&gt;Knowing that, we should configure our containers and Pods properly in order to get the best out of both.&lt;/p&gt;

&lt;p&gt;In this article, we will see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Kubernetes Limits and Requests&lt;/li&gt;



&lt;li&gt;Hands-on example&lt;/li&gt;



&lt;li&gt;Kubernetes Requests&lt;/li&gt;



&lt;li&gt;Kubernetes Limits&lt;/li&gt;



&lt;li&gt;CPU particularities&lt;/li&gt;



&lt;li&gt;Memory particularities&lt;/li&gt;



&lt;li&gt;Namespace ResourceQuota&lt;/li&gt;



&lt;li&gt;Namespace LimitRange&lt;/li&gt;



&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="introduction"&gt;Introduction to Kubernetes Limits and Requests&lt;/h2&gt;

&lt;p&gt;Limits and Requests are important settings when working with Kubernetes. This article will focus on the two most important ones: CPU and memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes defines Limits as the&lt;/strong&gt; &lt;strong&gt;maximum amount of a resource&lt;/strong&gt; to be used by a container. This means that the container can never consume more than the memory amount or CPU amount indicated. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requests, on the other hand, are the minimum guaranteed amount of a resource&lt;/strong&gt; that is reserved for a container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-04-1-1170x585.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-04-1-1170x585.png" alt="" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="handson"&gt;Hands-on example&lt;/h2&gt;

&lt;p&gt;Let’s have a look at this deployment, where we are setting up limits and requests for two different containers on both CPU and memory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kind: Deployment
apiVersion: apps/v1
…
template:
  spec:
    containers:
      - name: redis
        image: redis:5.0.3-alpine
        resources:
          &lt;strong&gt;limits&lt;/strong&gt;:
            memory: 600Mi
            cpu: 1
          &lt;strong&gt;requests&lt;/strong&gt;:
            memory: 300Mi
            cpu: 500m
      - name: busybox
        image: busybox:1.28
        resources:
          &lt;strong&gt;limits&lt;/strong&gt;:
            memory: 200Mi
            cpu: 300m
          &lt;strong&gt;requests&lt;/strong&gt;:
            memory: 100Mi
            cpu: 100m&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s say we are running a cluster with, for example, 4 cores and 16GB RAM nodes. We can extract a lot of information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Kubernetes-Limits-and-Request-05.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-05-1170x828.png" alt="Kubernetes Limits and Requests practical example" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pod effective request&lt;/strong&gt; is 400 MiB of memory and 600 millicores of CPU. You need a node with enough free allocatable space to schedule the pod.&lt;/li&gt;



&lt;li&gt;
&lt;strong&gt;CPU shares&lt;/strong&gt; for the redis container will be 512, and 102 for the busybox container. Kubernetes always assigns 1024 shares per core, so redis: 1024 * 0.5 cores ≅ 512 and busybox: 1024 * 0.1 cores ≅ 102.&lt;/li&gt;



&lt;li&gt;Redis container will be &lt;strong&gt;OOM killed&lt;/strong&gt; if it tries to allocate more than 600Mi of RAM, most likely making the pod fail.&lt;/li&gt;



&lt;li&gt;Redis will suffer &lt;strong&gt;CPU throttling&lt;/strong&gt; if it tries to use more than 100ms of CPU every 100ms (since we have 4 cores, the available time would be 400ms every 100ms), causing performance degradation.&lt;/li&gt;



&lt;li&gt;Busybox container will be &lt;strong&gt;OOM killed&lt;/strong&gt; if it tries to allocate more than 200Mi of RAM, resulting in a failed pod.&lt;/li&gt;



&lt;li&gt;Busybox will suffer &lt;strong&gt;CPU throttling&lt;/strong&gt; if it tries to use more than 30ms of CPU every 100ms, causing performance degradation.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="kubernetesrequests"&gt;Kubernetes Requests&lt;/h2&gt;

&lt;p&gt;Kubernetes defines requests as a &lt;strong&gt;guaranteed minimum amount of a resource&lt;/strong&gt; to be used by a container.&lt;/p&gt;

&lt;p&gt;Basically, it will set the minimum amount of the resource for the container to consume.&lt;/p&gt;

&lt;p&gt;When a Pod is scheduled, kube-scheduler will check the Kubernetes requests in order to allocate it to a particular Node that can satisfy at least that amount for all containers in the Pod. If the requested amount is higher than the available resource, the Pod will not be scheduled and remain in Pending status.&lt;/p&gt;

&lt;p&gt;For more information about Pending status, check &lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noreferrer noopener"&gt;Understanding Kubernetes Pod pending problems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this example, in the container definition we set a request of 0.1 cores (100 millicores) of CPU and 4Mi of memory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resources:
  requests:
    cpu: 0.1
    memory: 4Mi&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Requests are used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When allocating Pods to a Node, so the indicated requests by the containers in the Pod are satisfied.&lt;/li&gt;



&lt;li&gt;At runtime, the indicated amount of requests will be guaranteed as a minimum for the containers in that Pod.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage4.png" alt="How to set good CPU requests" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="kuberneteslimits"&gt;Kubernetes Limits&lt;/h2&gt;

&lt;p&gt;Kubernetes defines &lt;strong&gt;limits&lt;/strong&gt; as a &lt;strong&gt;maximum amount of a resource&lt;/strong&gt; to be used by a container.&lt;/p&gt;

&lt;p&gt;This means that the container can never consume more than the memory amount or CPU amount indicated.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;    resources:
      limits:
        cpu: 0.5
        memory: 100Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Limits are used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When allocating Pods to a Node. If no requests are set, by default, Kubernetes will assign requests = limits.&lt;/li&gt;



&lt;li&gt;At runtime, Kubernetes will check that the containers in the Pod are not consuming a higher amount of resources than indicated in the limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage6.png" alt="Setting good Limits in Kubernetes" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="cpuparticularities"&gt;CPU particularities&lt;/h2&gt;

&lt;p&gt;CPU is a &lt;strong&gt;compressible resource&lt;/strong&gt;, meaning that it can be stretched in order to satisfy all the demand. If processes request too much CPU, some of them will be throttled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt; represents &lt;strong&gt;computing processing time&lt;/strong&gt;, measured in cores. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use millicores (m) to represent smaller amounts than a core (e.g., 500m would be half a core)&lt;/li&gt;



&lt;li&gt;The minimum amount is 1m&lt;/li&gt;



&lt;li&gt;A Node might have more than one core available, so requesting CPU &amp;gt; 1 is possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-1-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-1-1170x644.png" alt="Kubernetes requests for CPU image" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="memoryparticularities"&gt;Memory particularities&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; is a &lt;strong&gt;non-compressible&lt;/strong&gt; resource, meaning that it can’t be stretched in the same manner as CPU. If a process doesn’t get enough memory to work, the process is killed.&lt;/p&gt;

&lt;p&gt;Memory is measured in Kubernetes in &lt;strong&gt;bytes&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use E, P, T, G, M, k to represent Exabyte, Petabyte, Terabyte, Gigabyte, Megabyte, and kilobyte, although only the last four are commonly used (e.g., 500M, 4G)&lt;/li&gt;



&lt;li&gt;Warning: don’t use lowercase m for memory (this represents Millibytes, which is ridiculously low)&lt;/li&gt;



&lt;li&gt;You can define Mebibytes using Mi, as well as the rest as Ei, Pi, Ti (e.g., 500Mi)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;A Mebibyte is 2 to the power of 20 bytes (its analogues scale accordingly: a Kibibyte is 2^10 bytes, and a Gibibyte is 2^30). These units were created to avoid confusion with the kilo and mega prefixes of the metric system, which are multiples of 1,000. You should prefer this notation, as it is the unambiguous way to specify byte quantities.&lt;/em&gt;&lt;br&gt;&lt;/span&gt;&lt;/p&gt;
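
&lt;p&gt;For example (values are illustrative), a container requesting memory in Mebibytes would declare:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resources:
  requests:
    memory: 256Mi
  limits:
    memory: 512Mi
&lt;/code&gt;&lt;/pre&gt;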

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-2-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-2-1170x644.png" alt="Kubernetes Limits for memory image" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="bestpractices"&gt;Best practices&lt;/h2&gt;

&lt;p&gt;In very few cases should you be using limits to control your resource usage in Kubernetes. This is because if you want to avoid starvation (ensuring that every important process gets its share), you should be using requests in the first place. &lt;/p&gt;

&lt;p&gt;By setting limits, you are only preventing a process from retrieving additional resources in exceptional cases, causing an OOM kill when the memory limit is exceeded, and throttling when the CPU limit is exceeded (the process will need to wait until the CPU can be used again).&lt;/p&gt;

&lt;p&gt;For more information, check the &lt;a href="https://sysdig.com/blog/troubleshoot-kubernetes-oom/" rel="noopener noreferrer"&gt;article about OOM and Throttling&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re setting a request value equal to the limit in all containers of a Pod, that Pod will get the Guaranteed Quality of Service. &lt;/p&gt;
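
&lt;p&gt;As a minimal sketch (the Pod and image names are illustrative), a Pod where every container sets requests equal to limits gets the Guaranteed QoS class:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: guaranteed-demo
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        cpu: 500m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 256Mi
&lt;/code&gt;&lt;/pre&gt;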

&lt;p&gt;Note as well that Pods with a resource usage higher than their requests are more likely to be evicted, so setting very low requests causes more harm than good. For more information, check the article about &lt;a href="https://docs.google.com/document/u/0/d/1NvedVZgcPdtiSIFZH_-q5-C43xt12nfMWERA6Mk5dOc/edit" rel="noopener noreferrer"&gt;Pod eviction and Quality of Service&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="namespaceresourcequota"&gt;Namespace ResourceQuota&lt;/h2&gt;

&lt;p&gt;Thanks to namespaces, we can isolate Kubernetes resources into different groups, also called tenants.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;ResourceQuotas&lt;/strong&gt;, you can &lt;strong&gt;set a memory or CPU limit for the entire namespace&lt;/strong&gt;, ensuring that the entities in it can’t consume more than that amount.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: 2
    requests.memory: 1Gi
    limits.cpu: 3
    limits.memory: 2Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;requests.cpu: the maximum amount of CPU for the sum of all requests in this namespace&lt;/li&gt;



&lt;li&gt;requests.memory: the maximum amount of Memory for the sum of all requests in this namespace&lt;/li&gt;



&lt;li&gt;limits.cpu: the maximum amount of CPU for the sum of all limits in this namespace&lt;/li&gt;



&lt;li&gt;limits.memory: the maximum amount of memory for the sum of all limits in this namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, apply it to your namespace:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl apply -f resourcequota.yaml --namespace=mynamespace
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can list the current ResourceQuota for a namespace with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get resourcequota -n mynamespace
&lt;/code&gt;&lt;/pre&gt;
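
&lt;p&gt;To check how much of the quota is currently consumed, describe it and compare the Used and Hard columns (mem-cpu-demo is the quota name from the example above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl describe resourcequota mem-cpu-demo -n mynamespace
&lt;/code&gt;&lt;/pre&gt;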

&lt;p&gt;Note that if you set up ResourceQuota for a given resource in a namespace, you then need to specify limits or requests accordingly for every Pod in that namespace. If not, Kubernetes will return a “failed quota” error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (Forbidden): error when creating "mypod.yaml": pods "mypod" is forbidden: failed quota: mem-cpu-demo: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In case you try to add a new Pod with container limits or requests that exceed the current ResourceQuota, Kubernetes will return an “exceeded quota” error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (Forbidden): error when creating "mypod.yaml": pods "mypod" is forbidden: exceeded quota: mem-cpu-demo, requested: limits.memory=2Gi,requests.memory=2Gi, used: limits.memory=1Gi,requests.memory=1Gi, limited: limits.memory=2Gi,requests.memory=1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="namespacelimitrange"&gt;Namespace LimitRange&lt;/h2&gt;

&lt;p&gt;ResourceQuotas are useful if we want to restrict the total amount of a resource allocatable for a namespace. But what happens if we want to give default values to the elements inside?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LimitRanges&lt;/strong&gt; are a Kubernetes policy that &lt;strong&gt;restricts the resource settings for each entity&lt;/strong&gt; in a namespace.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-constraint
spec:
  limits:
  - default:
      cpu: 500m
    defaultRequest:
      cpu: 500m
    min:
      cpu: 100m
    max:
      cpu: "1"
    type: Container
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;default&lt;/code&gt;: Containers created will have this limit if none is specified.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;defaultRequest&lt;/code&gt;: Containers created will have this request if none is specified.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;min&lt;/code&gt;: Containers created can’t have limits or requests smaller than this.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;max&lt;/code&gt;: Containers created can’t have limits or requests bigger than this.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Later, if you create a new Pod with no requests or limits set, LimitRange will automatically set these values to all its containers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;    Limits:
      cpu:  500m
    Requests:
      cpu:  500m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, imagine that you add a new Pod with 1200m as its CPU limit. You will receive the following error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (Forbidden): error when creating "pods/mypod.yaml": pods "mypod" is forbidden: maximum cpu usage per Container is 1, but limit is 1200m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that, by default, all containers in a Pod will effectively have a request of 100m CPU, even with no LimitRanges set.&lt;/p&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Choosing the optimal limits for our Kubernetes cluster is key in order to get the best of both energy consumption and costs.&lt;/p&gt;

&lt;p&gt;Oversizing or dedicating too many resources for our Pods may lead to costs skyrocketing.&lt;/p&gt;

&lt;p&gt;Undersizing or dedicating very few CPU or Memory will lead to applications not performing correctly, or even Pods being evicted.&lt;/p&gt;

&lt;p&gt;As mentioned, Kubernetes limits shouldn’t be used, except in very specific situations, as they may cause more harm than good: a container can be killed when it exceeds its memory limit, or throttled when it exhausts its CPU limit.&lt;/p&gt;

&lt;p&gt;For requests, use them when you need to ensure a process gets a guaranteed share of a resource.&lt;/p&gt;








&lt;h2&gt;Rightsize your Kubernetes resources with Sysdig Monitor&lt;/h2&gt;





&lt;p&gt;With Sysdig Monitor’s new feature, Cost Advisor, you can optimize your Kubernetes costs by rightsizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory requests&lt;/li&gt;



&lt;li&gt;CPU requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-06-1170x1063.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-06-1170x1063.png" alt="" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>cpu</category>
      <category>memory</category>
    </item>
    <item>
      <title>The four Golden Signals of Kubernetes monitoring</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Fri, 28 Oct 2022 09:20:10 +0000</pubDate>
      <link>https://dev.to/sysdig/the-four-golden-signals-of-kubernetes-monitoring-b7d</link>
      <guid>https://dev.to/sysdig/the-four-golden-signals-of-kubernetes-monitoring-b7d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Golden Signals&lt;/strong&gt; are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective: Latency, Traffic, Errors and Saturation. By focusing on these, you can be quicker at detecting potential problems that might be directly affecting the behavior of the application.&lt;/p&gt;

&lt;p&gt;Google introduced the term "Golden Signals" to refer to the essential metrics that you need to measure in your applications. They are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Errors - rate of requests that fail.&lt;/li&gt;



&lt;li&gt;
Saturation - consumption of your system resources.&lt;/li&gt;



&lt;li&gt;
Traffic - amount of use of your service per time unit.&lt;/li&gt;



&lt;li&gt;
Latency - the time it takes to serve a request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image9-10.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage9-10-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;This is just a set of essential signals to start monitoring in your system. In other words, if you’re wondering which signals to monitor, you will need to look at these four first.&lt;/p&gt;

&lt;p&gt;Enter: Goldilocks and the four Monitoring Signals&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once upon a time, there was a little girl called Goldilocks, who lived at the other side of the wood and had been sent on an errand by her mother, passed by the house, and looked in at the window…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id="errors"&gt;Errors&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Goldilocks then tried the little chair, which belonged to the Little Bear, and found it just right, but she sat in it so hard that she broke it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image4-19.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage4-19-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;The error rate for the chairs is ⅓&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Errors&lt;/strong&gt; golden signal measures the rate of requests that fail.&lt;/p&gt;

&lt;p&gt;Note that measuring the bulk amount of errors might not be the best course of action. If your application has a sudden peak of requests, then logically the amount of failed requests may increase.&lt;/p&gt;

&lt;p&gt;That’s why usually monitoring systems focus on the error rate, calculated as the percent of calls that are failing from the total.&lt;/p&gt;

&lt;p&gt;If you’re managing a web application, typically you will discriminate between those calls returning HTTP status in the 400-499 range (client errors) and 500-599 (server errors).&lt;/p&gt;

&lt;h3&gt;Measuring errors in Kubernetes&lt;/h3&gt;

&lt;p&gt;One thermometer for the errors happening in Kubernetes is the Kubelet. You can use several of its Prometheus metrics to measure the amount of errors.&lt;/p&gt;

&lt;p&gt;The most important one is &lt;code&gt;kubelet_runtime_operations_errors_total&lt;/code&gt;, which indicates low-level issues in the node, like problems with the container runtime.&lt;/p&gt;

&lt;p&gt;If you want to visualize the error rate per operation, you can divide by &lt;code&gt;kubelet_runtime_operations_total&lt;/code&gt;.&lt;/p&gt;
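
&lt;p&gt;As a sketch, the resulting error ratio per operation divides the error counter by the total counter:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(kubelet_runtime_operations_errors_total{job="kubelet"}[5m])) by (operation_type)
/
sum(rate(kubelet_runtime_operations_total{job="kubelet"}[5m])) by (operation_type)
&lt;/code&gt;&lt;/pre&gt;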

&lt;h3&gt;Errors example&lt;/h3&gt;

&lt;p&gt;Here's the Kubelet Prometheus metric for error rate in a Kubernetes cluster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(kubelet_runtime_operations_errors_total{cluster="",
job="kubelet", metrics_path="/metrics"}[$__rate_interval])) 
by (instance, operation_type)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image16-2-1170x445.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage16-2-1170x445.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="saturation"&gt;Saturation&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Goldilocks tasted the porridge in the dear little bowl, and it was just right, and it tasted so good that she tasted and tasted, and tasted and tasted until she was full.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image17-2-1170x644.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage17-2-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;After eating one small bowl, Goldilocks is unable to eat more. That’s saturation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturation&lt;/strong&gt; measures the consumption of your system resources, usually as a percentage of the maximum capacity. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;



&lt;li&gt;Disk space&lt;/li&gt;



&lt;li&gt;Memory usage&lt;/li&gt;



&lt;li&gt;Network bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, cloud applications run on machines, which have a limited amount of these resources.&lt;/p&gt;

&lt;p&gt;In order to measure saturation correctly, you should be aware of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the consequences if the resource is depleted? It could be that your entire system is unusable because this space has run out. Or maybe further requests are throttled until the system is less saturated.&lt;/li&gt;



&lt;li&gt;Saturation is not only about resources that are about to be depleted. It’s also about over-resourcing: allocating a higher quantity of resources than what is needed. This is crucial for cost savings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Measuring saturation in Kubernetes&lt;/h3&gt;

&lt;p&gt;Since saturation depends on the resource being observed, you can use different metrics for Kubernetes entities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;node_cpu_seconds_total&lt;/code&gt; to measure machine CPU utilization.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;container_memory_usage_bytes&lt;/code&gt; to measure the memory utilization at container level (paired with &lt;code&gt;container_memory_max_usage_bytes&lt;/code&gt;).&lt;/li&gt;



&lt;li&gt;The amount of Pods that a &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-101/what-is-a-kubernetes-node/" rel="noopener noreferrer"&gt;Node&lt;/a&gt; can contain is also a Kubernetes resource.&lt;/li&gt;
&lt;/ul&gt;
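
&lt;p&gt;As a hedged sketch (it assumes kube-state-metrics is deployed to expose the configured limits), memory saturation per Pod can be approximated as usage divided by the memory limit:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(container_memory_usage_bytes{container!=""}) by (namespace, pod)
/
sum(kube_pod_container_resource_limits{resource="memory"}) by (namespace, pod)
&lt;/code&gt;&lt;/pre&gt;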

&lt;h3&gt;Saturation example&lt;/h3&gt;

&lt;p&gt;Here’s a PromQL example of a Saturation signal, measuring CPU usage percent in a Kubernetes node.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image13-8-1170x410.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage13-8-1170x410.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="traffic"&gt;Traffic&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;And the Middle-sized Bear said:&lt;br&gt;“Somebody has been tumbling my bed!”&lt;br&gt;And the Little bear piped:&lt;br&gt;“Somebody has been tumbling my bed, and here she is!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image12-7-1170x644.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage12-7-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;One of the beds is being used when none should be. That’s unusual traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic&lt;/strong&gt; measures the amount of use of your service per time unit.&lt;/p&gt;

&lt;p&gt;In essence, this will represent the usage of your current service. This is important not only for business reasons, but also to detect anomalies.&lt;/p&gt;

&lt;p&gt;Is the amount of requests too high? This could be due to a peak of users or because of a misconfiguration causing retries.&lt;/p&gt;

&lt;p&gt;Is the amount of requests too low? That may reflect that one of your systems is failing.&lt;/p&gt;

&lt;p&gt;Still, traffic signals should always be measured with a time reference. As an example, this blog receives more visits from Tuesday to Thursday.&lt;/p&gt;

&lt;p&gt;Depending on your application, you could be measuring traffic by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests per minute for a web application&lt;/li&gt;



&lt;li&gt;Queries per minute for a database application&lt;/li&gt;



&lt;li&gt;Endpoint requests per minute for an API&lt;/li&gt;
&lt;/ul&gt;
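
&lt;p&gt;As a sketch (the metric name http_requests_total is an assumption about your instrumentation), requests per minute for a web application could be measured as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(http_requests_total[5m])) * 60
&lt;/code&gt;&lt;/pre&gt;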

&lt;h3&gt;Traffic example&lt;/h3&gt;

&lt;p&gt;Here’s a Google Analytics chart displaying traffic distributed by hour:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image3-23.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage3-23.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="latency"&gt;Latency&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;At that, Goldilocks woke in a fright, and jumped out of the window and ran away as fast as her legs could carry her, and never went near the Three Bears’ snug little house again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image11-6-1170x644.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage11-6-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;Goldilocks ran down the stairs in just two seconds. That’s a very low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt; is defined as the time it takes to serve a request.&lt;/p&gt;

&lt;h3&gt;Average latency&lt;/h3&gt;

&lt;p&gt;When working with latencies, your first impulse may be to measure average latency, but depending on your system that might not be the best idea. There may be very fast or very slow requests distorting the results.&lt;/p&gt;

&lt;p&gt;Instead, consider using percentiles, like p99, p95, and p50 (also known as the median), to measure how long the fastest 99%, 95%, or 50% of requests, respectively, took to complete.&lt;/p&gt;

&lt;h3&gt;Failed vs. successful&lt;/h3&gt;

&lt;p&gt;When measuring latency, it’s also important to discriminate between failed and successful requests, as failed ones might take significantly less time than the correct ones.&lt;/p&gt;

&lt;h3&gt;Apdex Score&lt;/h3&gt;

&lt;p&gt;As described above, latency information may not be informative enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some users might perceive applications as slower, depending on the action they are performing.&lt;/li&gt;



&lt;li&gt;Some users might perceive applications as slower, based on the typical latencies of the industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the Apdex (Application Performance Index) comes in. It’s defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image1-37.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage1-37.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where t is the target latency that we consider reasonable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Satisfied will represent the amount of users with requests under the target latency.&lt;/li&gt;



&lt;li&gt;Tolerant will represent the amount of non-satisfied users with requests below four times the target latency.&lt;/li&gt;



&lt;li&gt;Frustrated will represent the amount of users with requests above the tolerant latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output for the formula will be an index from 0 to 1, indicating how performant our system is in terms of latency.&lt;/p&gt;
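
&lt;p&gt;As a hedged sketch: with a Prometheus histogram that happens to have buckets at the target latency t (here, 0.3s) and at 4t (1.2s), the Apdex score can be approximated as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The le="1.2" bucket already includes the satisfied requests, so summing both buckets and halving yields satisfied + tolerant/2.&lt;/p&gt;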

&lt;h3&gt;Measuring latency in Kubernetes&lt;/h3&gt;

&lt;p&gt;In order to measure the latency in your Kubernetes cluster, you can use metrics like &lt;code&gt;http_request_duration_seconds_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also measure the latency for the api-server by using Prometheus metrics like &lt;code&gt;apiserver_request_duration_seconds&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;Latency example&lt;/h3&gt;

&lt;p&gt;Here’s an example of a Latency PromQL query for the 95% best performing HTTP requests in Prometheus:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m]))
by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image10-10-1170x408.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage10-10-1170x408.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="red-method"&gt;RED Method&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;RED Method&lt;/strong&gt; was created by Tom Wilkie, from Weaveworks. It is heavily inspired by the Golden Signals and it’s focused on microservices architectures.&lt;/p&gt;

&lt;p&gt;RED stands for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate&lt;/li&gt;



&lt;li&gt;Error&lt;/li&gt;



&lt;li&gt;Duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rate&lt;/strong&gt; measures the number of requests per second (equivalent to Traffic in the Golden Signals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error&lt;/strong&gt; measures the number of failed requests (similar to the one in Golden Signals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt; measures the amount of time to process a request (similar to Latency in Golden Signals).&lt;/p&gt;

&lt;h2 id="use-method"&gt;USE Method&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;USE Method&lt;/strong&gt; was created by Brendan Gregg and it’s used to measure infrastructure.&lt;/p&gt;

&lt;p&gt;USE stands for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization&lt;/li&gt;



&lt;li&gt;Saturation&lt;/li&gt;



&lt;li&gt;Errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means for every resource in your system (CPU, disk, etc.), you need to check the three elements above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utilization&lt;/strong&gt; is defined as the percentage of usage for that resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturation&lt;/strong&gt; is defined as the queue for requests in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors&lt;/strong&gt; is defined as the number of errors happening in the system.&lt;/p&gt;

&lt;p&gt;While it may not be intuitive, Saturation in the Golden Signals does not correspond to Saturation in USE, but rather to Utilization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image8-12-1170x439.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage8-12-1170x439.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="practical-example"&gt;A practical example of Golden signals in Kubernetes&lt;/h2&gt;

&lt;p&gt;As an example to illustrate the use of Golden Signals, here’s a simple Go application with Prometheus instrumentation. The application applies a random delay between 0 and 12 seconds in order to produce usable latency information. Traffic is generated with curl, in several infinite loops.&lt;/p&gt;

&lt;p&gt;A &lt;a rel="noopener nofollow noreferrer" href="https://prometheus.io/docs/practices/histograms/"&gt;histogram&lt;/a&gt; was included to collect metrics related to latency and requests. These metrics will help us obtain the first three Golden Signals: latency, request rate, and error rate. To obtain saturation, use the percentage of CPU used in the nodes, directly from Prometheus and node-exporter.&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;
File: main.go
-------------
package main
import (
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "time"
    "github.com/gorilla/mux"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)
func main() {
    //Prometheus: Histogram to collect required metrics
    histogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "greeting_seconds",
        Help:    "Time take to greet someone",
        Buckets: []float64{1, 2, 5, 6, 10}, //Buckets sized to cover the artificial delay of up to 11 seconds
    }, []string{"code"}) //This will be partitioned by the HTTP code.
    router := mux.NewRouter()
    router.Handle("/sayhello/{name}", Sayhello(histogram))
    router.Handle("/metrics", promhttp.Handler()) //Metrics endpoint for scraping
    router.Handle("/{anything}", Sayhello(histogram))
    router.Handle("/", Sayhello(histogram))
    //Registering the defined metric with Prometheus
    prometheus.Register(histogram)
    log.Fatal(http.ListenAndServe(":8080", router))
}
func Sayhello(histogram *prometheus.HistogramVec) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        //Monitoring how long it takes to respond
        start := time.Now()
        defer r.Body.Close()
        code := 500
        defer func() {
            httpDuration := time.Since(start)
            histogram.WithLabelValues(fmt.Sprintf("%d", code)).Observe(httpDuration.Seconds())
        }()
        if r.Method == "GET" {
            vars := mux.Vars(r)
            code = http.StatusOK
            if _, ok := vars["anything"]; ok {
                //Sleep random seconds
                rand.Seed(time.Now().UnixNano())
                n := rand.Intn(2) // n will be 0 or 1
                time.Sleep(time.Duration(n) * time.Second)
                code = http.StatusNotFound
                w.WriteHeader(code)
            }
            //Sleep random seconds
            rand.Seed(time.Now().UnixNano())
            n := rand.Intn(12) //n will be between 0 and 11
            time.Sleep(time.Duration(n) * time.Second)
            name := vars["name"]
            greet := fmt.Sprintf("Hello %s \n", name)
            w.Write([]byte(greet))
        } else {
            code = http.StatusBadRequest
            w.WriteHeader(code)
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The application was deployed in a Kubernetes cluster with Prometheus and Grafana, and a dashboard with the Golden Signals was generated. These are the PromQL queries used to obtain the data for the dashboards:&lt;/p&gt;

&lt;h3&gt;Latency:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;sum(greeting_seconds_sum)/sum(greeting_seconds_count)  //Average
histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (le)) //Percentile p95&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Request rate:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(greeting_seconds_count{}[2m]))  //Including errors
rate(greeting_seconds_count{code="200"}[2m])  //Only 200 OK requests&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Errors per second:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(greeting_seconds_count{code!="200"}[2m]))&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Saturation:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Golden Signals, RED, and USE are guidelines on what to focus on when looking at your systems, but they are just the bare minimum of what to measure.&lt;/p&gt;

&lt;p&gt;Understand the &lt;strong&gt;errors&lt;/strong&gt; in your system. They will be a thermometer for all the other metrics, as they point to any unusual behavior. And remember to mark requests as erroneous correctly, counting only those that are genuinely incorrect. Otherwise, your system will be prone to false positives or false negatives.&lt;/p&gt;

&lt;p&gt;Measure &lt;strong&gt;latency&lt;/strong&gt; of your requests. Try to understand your bottlenecks and what the negative experiences are when latency is higher than expected.&lt;/p&gt;

&lt;p&gt;Visualize &lt;strong&gt;saturation&lt;/strong&gt; and understand the resources involved in your solution. What are the consequences if a resource gets depleted?&lt;/p&gt;

&lt;p&gt;Measure &lt;strong&gt;traffic&lt;/strong&gt; to understand your usage curves. You will be able to find the best time to take down your system for an update, or you could be alerted when there’s an unexpected amount of users.&lt;/p&gt;

&lt;p&gt;Once metrics are in place, it’s important to set up alerts, which will notify you in case any of these metrics reach a certain threshold.&lt;/p&gt;









&lt;h2&gt;Track golden signals easily with Sysdig Monitor&lt;/h2&gt;
&lt;p&gt;With Sysdig Monitor, you can quickly review the golden signals in your system, out of the box.&lt;/p&gt;

&lt;p&gt;Easily review the latency, errors, saturation, and traffic for the Pods in your cluster. And thanks to its container observability with eBPF, you can do this without adding any app or code instrumentation.&lt;/p&gt;

&lt;p&gt;Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/GoldenSignals-11.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FGoldenSignals-11.png" alt="Sysdig Monitor Golden Signals"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Kubernetes ErrImagePull and ImagePullBackOff in detail</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Wed, 05 Oct 2022 19:34:01 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-errimagepull-and-imagepullbackoff-in-detail-1ga2</link>
      <guid>https://dev.to/sysdig/kubernetes-errimagepull-and-imagepullbackoff-in-detail-1ga2</guid>
      <description>&lt;p&gt;
Pod statuses like ImagePullBackOff or ErrImagePull are common when working with containers.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;ErrImagePull&lt;/strong&gt; is an error happening when the image specified for a container can’t be retrieved or pulled.
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;ImagePullBackOff&lt;/strong&gt; is the waiting grace period while the image pull is fixed.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/ErrImagePull-ImagePullbackOff-00.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FErrImagePull-ImagePullbackOff-00.png" alt="ErrImagePull and ImagePullBackOff cover"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
In this article, we will take a look at:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Container Images
&lt;/li&gt;
&lt;li&gt;
Pulling Images
&lt;/li&gt;
&lt;li&gt;
Image Pull Policy
&lt;/li&gt;
&lt;li&gt;
ErrImagePull
&lt;/li&gt;
&lt;li&gt;
Debugging ErrImagePull
&lt;/li&gt;
&lt;li&gt;
Monitoring Image Pull Errors
&lt;/li&gt;
&lt;li&gt;
Other Image Errors
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="containerimages"&gt;Container Images&lt;/h2&gt;

&lt;p&gt;
One of the greatest strengths of containerization is the ability to run any particular image in seconds. A &lt;b&gt;container&lt;/b&gt; is a group of processes executing in isolation from the underlying system. A &lt;b&gt;container image&lt;/b&gt; contains all the resources needed to run those processes: the binaries, libraries, and any necessary configuration.
&lt;/p&gt;

&lt;p&gt;
A &lt;b&gt;container registry&lt;/b&gt; is a repository for container images, where there are two basic actions:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Push&lt;/strong&gt;: upload an image so it’s available in the repo
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt;: download an image to use it in a container
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;
The docker CLI will be used in the examples for this article, but you can use any tool that implements the Open Container Initiative Distribution specs for all the container registry interactions.
&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h2 id="pullingimages"&gt;Pulling images&lt;/h2&gt;
&lt;p&gt;
Images are identified by name. In addition, a particular version of an image can be labeled with a specific &lt;strong&gt;tag&lt;/strong&gt;. An image can also be identified by its &lt;strong&gt;digest&lt;/strong&gt;, a hash of its content.
&lt;/p&gt;

&lt;p&gt;
The tag &lt;code&gt;latest&lt;/code&gt; refers to the most recent version of a given image.
&lt;/p&gt;

&lt;h3&gt;Pull images by name&lt;/h3&gt;

&lt;p&gt;
By only providing the name for the image, the image with the tag &lt;code&gt;latest&lt;/code&gt; will be pulled:
&lt;/p&gt;

&lt;pre&gt;docker pull nginx
kubectl run mypod --image=nginx
&lt;/pre&gt;

&lt;h3&gt;Pull images by name + tag&lt;/h3&gt;

&lt;p&gt;
If you don’t want to pull the &lt;code&gt;latest&lt;/code&gt; image, you can provide a specific release tag:
&lt;/p&gt;

&lt;pre&gt;docker pull nginx:1.23.1-alpine
kubectl run mypod --image=nginx:1.23.1-alpine
&lt;/pre&gt;

&lt;p&gt;
For more information, you can check this &lt;a href="https://sysdig.es/blog/toctou-tag-mutability/" rel="noopener noreferrer"&gt;article about tag mutability&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Pull images by digest&lt;/h3&gt;

&lt;p&gt;
A digest is a SHA-256 hash of the image content. You can pull an image by its digest, which lets you verify the authenticity and integrity of what was downloaded.
&lt;/p&gt;

&lt;pre&gt;docker pull nginx@sha256:d164f755e525e8baee113987bdc70298da4c6f48fdc0bbd395817edf17cf7c2b
kubectl run mypod --image=nginx@sha256:45b23dee08af5e43a7fea6c4cf9c25ccf269ee113168c19722f87876677c5
&lt;/pre&gt;
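&lt;p&gt;&lt;em&gt;As a minimal sketch of that verification step (not tied to any registry client), you can hash the downloaded content and compare it against the expected digest:&lt;/em&gt;&lt;/p&gt;

```python
import hashlib

# Minimal sketch: check that downloaded image content matches an expected
# "sha256:..." digest, the same integrity check a registry client performs.
def verify_digest(content, expected):
    algo, _, hex_digest = expected.partition(":")
    if algo != "sha256":
        raise ValueError("only sha256 digests are handled in this sketch")
    return hashlib.sha256(content).hexdigest() == hex_digest

blob = b"example layer content"
digest = "sha256:" + hashlib.sha256(blob).hexdigest()
print(verify_digest(blob, digest))              # True
print(verify_digest(b"tampered blob", digest))  # False
```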

&lt;h2 id="imagepullpolicy"&gt;Image Pull Policy&lt;/h2&gt;

&lt;p&gt;
Kubernetes features the ability to set an &lt;strong&gt;Image Pull Policy&lt;/strong&gt; (&lt;strong&gt;imagePullPolicy&lt;/strong&gt; field) for each container. Based on this, the way the &lt;code&gt;kubelet&lt;/code&gt; retrieves the container image will differ.
&lt;/p&gt;

&lt;p&gt;
There are three different values for imagePullPolicy:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Always
&lt;/li&gt;
&lt;li&gt;IfNotPresent
&lt;/li&gt;
&lt;li&gt;Never
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Always&lt;/h3&gt;

&lt;p&gt;
With imagePullPolicy set to &lt;strong&gt;Always&lt;/strong&gt;, kubelet &lt;strong&gt;will check the repository every time&lt;/strong&gt; it pulls the image for this container.
&lt;/p&gt;

&lt;h3&gt;IfNotPresent&lt;/h3&gt;

&lt;p&gt;
With imagePullPolicy set to &lt;strong&gt;IfNotPresent&lt;/strong&gt;, kubelet will only pull the image from the repository&lt;strong&gt; if it isn’t already present&lt;/strong&gt; on the node.
&lt;/p&gt;

&lt;h3&gt;Never&lt;/h3&gt;

&lt;p&gt;
With imagePullPolicy set to &lt;strong&gt;Never&lt;/strong&gt;, kubelet &lt;strong&gt;will never try to pull&lt;/strong&gt; images from the image registry. If there’s an image cached locally (pre-pulled), it will be used to start the container.
&lt;/p&gt;

&lt;p&gt;
If the image is not present locally, Pod creation will fail with &lt;code&gt;ErrImageNeverPull&lt;/code&gt; error.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/ErrImagePull-ImagePullbackOff-01.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FErrImagePull-ImagePullbackOff-01.png" alt="ImagePullPolicy description: always, never and ifnotpresent"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
Note that you can modify the entire image pull policy of your cluster by using the AlwaysPullImages &lt;a href="https://sysdig.com/blog/kubernetes-admission-controllers/" rel="noopener noreferrer"&gt;admission controller&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Default Image Pull Policy&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you omit the imagePullPolicy and the tag is &lt;code&gt;latest&lt;/code&gt;, imagePullPolicy is set to &lt;code&gt;Always&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you omit the imagePullPolicy and the tag for the image, imagePullPolicy is set to &lt;code&gt;Always&lt;/code&gt;.
&lt;/li&gt;
&lt;li&gt;If you omit the imagePullPolicy and the tag is set to a value different than &lt;code&gt;latest&lt;/code&gt;, imagePullPolicy is set to &lt;code&gt;IfNotPresent&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;
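&lt;p&gt;&lt;em&gt;These defaulting rules can be sketched as a small helper. This is an illustrative function only, not part of any Kubernetes client library:&lt;/em&gt;&lt;/p&gt;

```python
# Illustrative helper mirroring the defaulting rules above; not part of any
# Kubernetes client library. Registry hosts with ports are ignored for brevity.
def default_image_pull_policy(image):
    _, _, tag = image.partition(":")
    if tag in ("", "latest"):
        return "Always"        # tag omitted, or tag is "latest"
    return "IfNotPresent"      # any other explicit tag

print(default_image_pull_policy("nginx"))                # Always
print(default_image_pull_policy("nginx:latest"))         # Always
print(default_image_pull_policy("nginx:1.23.1-alpine"))  # IfNotPresent
```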

&lt;h2 id="errimagepull"&gt;ErrImagePull&lt;/h2&gt;

&lt;p&gt;
When Kubernetes tries to pull an image for a container in a Pod, things might go wrong. The status &lt;strong&gt;ErrImagePull&lt;/strong&gt; is displayed when &lt;code&gt;kubelet&lt;/code&gt; tried to start a container in the Pod, but something was wrong with the image specified in your Pod, Deployment, or ReplicaSet manifest.
&lt;/p&gt;

&lt;p&gt;
Imagine that you are using kubectl to retrieve information about the Pods in your cluster:
&lt;/p&gt;

&lt;pre&gt;$ kubectl get pods
NAME    READY   STATUS             RESTARTS   AGE
goodpod 1/1     Running            0          21h
mypod   0/1     ErrImagePull       0          4s
&lt;/pre&gt;

&lt;p&gt;
Which means:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod is not in &lt;code&gt;READY&lt;/code&gt; status
&lt;/li&gt;
&lt;li&gt;Status is &lt;code&gt;ErrImagePull&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
Additionally, you can check the logs for containers in your Pod:
&lt;/p&gt;

&lt;pre&gt;$ kubectl logs mypod --all-containers
Error from server (BadRequest): container "mycontainer" in pod "mypod" is waiting to start: trying and failing to pull image
&lt;/pre&gt;

&lt;p&gt;
In this case, this points to a 400 Error (BadRequest), likely because the indicated image is not available or doesn’t exist.
&lt;/p&gt;

&lt;h2 id="imagepullbackoff"&gt;ImagePullBackOff&lt;/h2&gt;

&lt;p&gt;
ImagePullBackOff is a Kubernetes waiting status: a grace period with an increasing back-off between retries. After the back-off period expires, kubelet will try to pull the image again.
&lt;/p&gt;

&lt;p&gt;
This is similar to the &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/" rel="noopener noreferrer"&gt;CrashLoopBackOff status&lt;/a&gt;, which is also a grace period between retries after an error in a container. The back-off delay increases with each retry, up to a maximum of five minutes.
&lt;/p&gt;
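&lt;p&gt;&lt;em&gt;The back-off progression can be sketched as follows, assuming a 10-second initial delay for illustration, doubling up to the five-minute cap:&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the back-off progression between image pull retries: the delay
# doubles after each failure, capped at five minutes. The 10-second initial
# delay is an assumption for illustration.
def backoff_delays(attempts, initial=10, cap=300):
    delay, delays = initial, []
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, cap)
    return delays

print(backoff_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```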

&lt;p&gt;
Note that ImagePullBackOff is not an error. As mentioned, it’s just a waiting status caused by a problem when pulling the image.
&lt;/p&gt;

&lt;pre&gt;$ kubectl get pods
NAME    READY   STATUS             RESTARTS   AGE
goodpod 1/1     Running            0          21h
mypod   0/1     ImagePullBackOff   0          84s
&lt;/pre&gt;

&lt;p&gt;
Which means:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pod is not in &lt;code&gt;READY&lt;/code&gt; status
&lt;/li&gt;
&lt;li&gt;Status is &lt;code&gt;ImagePullBackOff&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Unlike CrashLoopBackOff, there are no restarts (technically, the Pod hasn’t even started)
&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;$ kubectl describe pod mypod
State:          Waiting
Reason:       ImagePullBackOff
...
Warning  Failed     3m57s (x4 over 5m28s)  kubelet            Error: ErrImagePull
Warning  Failed     3m42s (x6 over 5m28s)  kubelet            Error: ImagePullBackOff
Normal   BackOff    18s (x20 over 5m28s)   kubelet            Back-off pulling image "failed-image"
&lt;/pre&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/ErrImagePull-ImagePullbackOff-02.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FErrImagePull-ImagePullbackOff-02.png" alt="ErrImagePull and ImagePullBackOff timeline"&gt;&lt;/a&gt;


&lt;/p&gt;
&lt;h2 id="debuggingerrimagepull"&gt;Debugging ErrImagePull and ImagePullBackOff&lt;/h2&gt;
&lt;p&gt;
There are several potential causes of why you might encounter an Image Pull Error. Here are some examples:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Wrong image name
&lt;/li&gt;
&lt;li&gt;Wrong image tag
&lt;/li&gt;
&lt;li&gt;Wrong image digest
&lt;/li&gt;
&lt;li&gt;Network problem or image repo not available
&lt;/li&gt;
&lt;li&gt;Pulling from a private registry but no imagePullSecrets was provided
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
This is just a list of possible causes; there might be many others depending on your setup. The best course of action is to check:
&lt;/p&gt;

&lt;pre&gt;
$ kubectl describe pod podname

$ kubectl logs podname --all-containers

$ kubectl get events --field-selector involvedObject.name=podname
&lt;/pre&gt;

&lt;p&gt;
In the following example you can see how to dig into the logs, where an image error is found.

&lt;a href="https://sysdig.com/wp-content/uploads/ErrImagePull-ImagePullbackOff-03.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FErrImagePull-ImagePullbackOff-03.png" alt="Three terminals with debugging options for ErrImagePull and ImagePullBackOff"&gt;&lt;/a&gt;

&lt;/p&gt;
&lt;h2 id="otherimageerrors"&gt;Other image errors&lt;/h2&gt;


&lt;h3&gt;ErrImageNeverPull&lt;/h3&gt;
&lt;p&gt;
This error appears when kubelet fails to pull an image in the node and the imagePullPolicy is set to Never. In order to fix it, either change the Pull Policy to allow images to be pulled externally or add the correct image locally.
&lt;/p&gt;

&lt;h3&gt;Pending&lt;/h3&gt;

&lt;p&gt;
Remember that an ErrImagePull and the associated ImagePullBackOff may be different from a Pending status on your Pod.
&lt;/p&gt;

&lt;p&gt;
A Pending status is most likely the result of kube-scheduler not being able to assign your Pod to an eligible Node.
&lt;/p&gt;

&lt;h2 id="monitoringimagepullerrors"&gt;Monitoring Image Pull Errors in Prometheus&lt;/h2&gt;

&lt;p&gt;
Using Prometheus and Kube State Metrics (KSM), we can easily track our Pods with containers in ImagePullBackOff or ErrImagePull statuses.
&lt;/p&gt;

&lt;pre&gt;kube_pod_container_status_waiting_reason{reason="ErrImagePull"}
kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"}
&lt;/pre&gt;

&lt;p&gt;
In fact, these two metrics are complementary, as we can see in the following Prometheus queries:
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/ErrImagePull-ImagePullbackOff-04-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FErrImagePull-ImagePullbackOff-04-1.png" alt="Monitoring ErrImagePull and ImagePullBackOff in Prometheus"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
The Pod shifts between the waiting period in ImagePullBackOff and the image pull retry returning an ErrImagePull.
&lt;/p&gt;

&lt;p&gt;
Also, if you’re using containers with &lt;code&gt;ImagePullPolicy&lt;/code&gt; set to &lt;code&gt;Never&lt;/code&gt;, remember that you need to track the error as &lt;code&gt;ErrImageNeverPull&lt;/code&gt;.
&lt;/p&gt;

&lt;pre&gt;kube_pod_container_status_waiting_reason{reason="ErrImageNeverPull"}
&lt;/pre&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
Container images are a great way to kickstart your cloud application needs. Thanks to them, you have access to thousands of curated applications that are ready to be started and scaled.
&lt;/p&gt;

&lt;p&gt;
However, due to misconfiguration, misalignments, or repository problems, image errors might start appearing. A container can’t start properly if the image definition is malformed or there are errors on the setup.
&lt;/p&gt;

&lt;p&gt;
Kubernetes provides a grace period in case of an image pull error. This Image Pull Backoff is quite useful, as it gives you time to fix the problem in the image definition. But you need to be aware of when this happens in your cluster and what it means each time.
&lt;/p&gt;




&lt;h2&gt;Troubleshoot Image Pull Errors with Sysdig Monitor&lt;/h2&gt;

&lt;p&gt;
With the Advisor feature of Sysdig Monitor, you can easily review Image Pull Errors happening in your Kubernetes cluster.
&lt;/p&gt;

&lt;p&gt;
Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/ErrImagePull-ImagePullbackOff-05.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FErrImagePull-ImagePullbackOff-05.png" alt="Troubleshooting ErrImagePull and ImagePullBackOff in Sysdig Monitor"&gt;&lt;/a&gt;
&lt;/p&gt;
&lt;p&gt;
&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>container</category>
      <category>docker</category>
      <category>orchestration</category>
    </item>
    <item>
      <title>Understanding Kubernetes Evicted Pods</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Tue, 20 Sep 2022 15:55:44 +0000</pubDate>
      <link>https://dev.to/sysdig/understanding-kubernetes-evicted-pods-1hmd</link>
      <guid>https://dev.to/sysdig/understanding-kubernetes-evicted-pods-1hmd</guid>
      <description>&lt;p&gt;
What does it mean that Kubernetes Pods are evicted? They are terminated, usually the result of not having enough resources. But why does this happen?
&lt;/p&gt;

&lt;p&gt;
&lt;strong&gt;Eviction&lt;/strong&gt; is a process where a &lt;strong&gt;Pod&lt;/strong&gt; assigned to a Node is &lt;strong&gt;asked to terminate&lt;/strong&gt;. One of the most common cases in Kubernetes is &lt;strong&gt;Preemption&lt;/strong&gt;, where in order to schedule a new Pod in a Node with limited resources, another Pod needs to be terminated to free resources for the first one. 
&lt;/p&gt;
&lt;p&gt;
Also, Kubernetes constantly checks resources and evicts Pods if needed, a process called &lt;strong&gt;Node-pressure eviction&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog_Images-SecureKubernetesDeployment-featured.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog_Images-SecureKubernetesDeployment-featured.png" alt="Understanding Kubernetes Evicted Pods main image"&gt;&lt;/a&gt;
 
&lt;em&gt;Every day, thousands of Pods are evicted from their homes. Stranded and confused, they have to abandon their previous lifestyle. Some of them even become nodeless. The current society, imposing higher demands of CPU and memory, is part of the problem.&lt;/em&gt;
&lt;/p&gt;



&lt;p&gt;
During this article, you will discover:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Reasons why Pods are evicted: Preemption and Node-pressure
&lt;/li&gt;
&lt;li&gt;
Preemption eviction
&lt;/li&gt;
&lt;li&gt;
Pod Priority Classes
&lt;/li&gt;
&lt;li&gt;
Node-pressure eviction
&lt;/li&gt;
&lt;li&gt;
Quality of Service Classes
&lt;/li&gt;
&lt;li&gt;
Other types of eviction
&lt;/li&gt;
&lt;li&gt;
Kubernetes Pod eviction monitoring in Prometheus
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="reasonswhypodsareevicted"&gt;Reasons why Pods are evicted: Preemption and Node-pressure&lt;/h2&gt;

&lt;p&gt;
There are several reasons why Pod eviction can happen in Kubernetes. The most important ones are:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Preemption
&lt;/li&gt;
&lt;li&gt;Node-pressure eviction
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="preemptioneviction"&gt;Preemption eviction&lt;/h2&gt;

&lt;p&gt;
&lt;strong&gt;Preemption&lt;/strong&gt; is the following process: if a new Pod needs to be scheduled but &lt;strong&gt;doesn’t have any suitable Node with enough resources&lt;/strong&gt;, then kube-scheduler will check if, by &lt;strong&gt;evicting&lt;/strong&gt; (terminating) some Pods with &lt;strong&gt;lower priority&lt;/strong&gt;, the new Pod can be scheduled on that Node.
&lt;/p&gt;

&lt;p&gt;
Let’s first understand how Kubernetes scheduling works.
&lt;/p&gt;

&lt;h3&gt;Pod Scheduling&lt;/h3&gt;

&lt;p&gt;
Kubernetes &lt;strong&gt;Scheduling&lt;/strong&gt; is the process where &lt;strong&gt;Pods are assigned to nodes&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
By default, there’s a Kubernetes entity responsible for scheduling, called &lt;code&gt;kube-scheduler&lt;/code&gt;, which runs in the control plane. The Pod will start in the &lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noopener noreferrer"&gt;Pending state&lt;/a&gt; until a matching node is found.
&lt;/p&gt;

&lt;p&gt;
The process of assigning a Pod to a Node follows this sequence:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filtering
&lt;/li&gt;
&lt;li&gt;Scoring
&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;Filtering&lt;/h4&gt;

&lt;p&gt;
During the &lt;strong&gt;Filtering step,&lt;/strong&gt; &lt;code&gt;kube-scheduler&lt;/code&gt; will &lt;strong&gt;select&lt;/strong&gt; all Nodes &lt;strong&gt;where the current Pod might be placed&lt;/strong&gt;. Features like Taints and Tolerations will be taken into account here. Once finished, it will have a list of suitable Nodes for that Pod.
&lt;/p&gt;

&lt;h4&gt;Scoring&lt;/h4&gt;

&lt;p&gt;
During the &lt;strong&gt;Scoring step,&lt;/strong&gt; &lt;code&gt;kube-scheduler&lt;/code&gt; will take the resulting list from the previous step and &lt;strong&gt;assign a score to each of the nodes&lt;/strong&gt;. This way, candidate nodes are ordered from most suitable to least. In case two nodes have the same score, kube-scheduler orders them randomly.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/Troubleshooting-Evicted-Pods-02.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FTroubleshooting-Evicted-Pods-02.png" alt="Filtering and Scoring process"&gt;&lt;/a&gt;
&lt;/p&gt;
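&lt;p&gt;&lt;em&gt;A toy sketch of the two steps: filter out nodes that can’t fit the Pod, then score the survivors (here, simply by free memory). Real kube-scheduler plugins are far richer; the names and values below are illustrative only:&lt;/em&gt;&lt;/p&gt;

```python
# Toy sketch of the two scheduling steps. Filtering keeps nodes that can fit
# the Pod's request; Scoring picks the best survivor (here, most free memory).
# Real kube-scheduler plugins are far richer; names below are illustrative.
def schedule(pod_request, nodes):
    feasible = [n for n in nodes if n["free_mem"] >= pod_request]   # Filtering
    if not feasible:
        return None              # no suitable Node: Kubernetes tries preemption
    return max(feasible, key=lambda n: n["free_mem"])["name"]       # Scoring

nodes = [
    {"name": "node-a", "free_mem": 512},
    {"name": "node-b", "free_mem": 2048},
    {"name": "node-c", "free_mem": 128},
]
print(schedule(1024, nodes))  # node-b
print(schedule(4096, nodes))  # None
```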

&lt;p&gt;
But, what happens if there are no suitable Nodes for a Pod to run? When that’s the case, Kubernetes will start the &lt;strong&gt;preemption&lt;/strong&gt; process, trying to &lt;strong&gt;evict&lt;/strong&gt; lower priority Pods in order for the new one to be assigned.
&lt;/p&gt;

&lt;h3 id="podpriorityclasses"&gt;Pod Priority Classes&lt;/h3&gt;

&lt;p&gt;
How can I prevent a particular Pod from being evicted in case of a preemption process? Chances are, a specific Pod is critical for you and should never be terminated.
&lt;/p&gt;

&lt;p&gt;
That’s why Kubernetes features &lt;strong&gt;Priority Classes&lt;/strong&gt;.
&lt;/p&gt;

&lt;p&gt;
A &lt;strong&gt;Priority Class&lt;/strong&gt; is a Kubernetes object that allows us to map &lt;strong&gt;numerical priority values&lt;/strong&gt; to specific Pods. Those with a higher value are classified as more important and less likely to be evicted.
&lt;/p&gt;

&lt;p&gt;
You can query current Priority Classes using:
&lt;/p&gt;

&lt;pre&gt;kubectl get priorityclasses
kubectl get pc

NAME                      VALUE        GLOBAL-DEFAULT   AGE
system-cluster-critical   2000000000   false            2d
system-node-critical      2000001000   false            2d
&lt;/pre&gt;

&lt;h3&gt;Priority Class example&lt;/h3&gt;

&lt;p&gt;
Let’s do a practical example using the &lt;a href="https://tapas.io/episode/220482" rel="noopener noreferrer"&gt;Berry Club comic&lt;/a&gt; from &lt;a href="https://www.mrlovenstein.com/" rel="noopener noreferrer"&gt;Mr. Lovenstein&lt;/a&gt;:
&lt;/p&gt;

&lt;p&gt;
There are three Pods representing blueberry, raspberry and strawberry:
&lt;/p&gt;

&lt;pre&gt;NAME         READY   STATUS             RESTARTS   AGE
blueberry    1/1     Running            0          4h41m
raspberry    1/1     Running            0          58m
strawberry   1/1     Running            0          5h22m
&lt;/pre&gt;

&lt;p&gt;
And there are two Priority Classes: trueberry and falseberry. The first one will have a higher value indicating higher priority.
&lt;/p&gt;

&lt;pre&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: trueberry
value: 1000000
globalDefault: false
description: "This fruit is a true berry"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: falseberry
value: 5000
globalDefault: false
description: "This fruit is a false berry"
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;Blueberry will have the trueberry priority class (value = 1000000)
&lt;/li&gt;
&lt;li&gt;raspberry and strawberry will both have the falseberry priority class (value = 5000)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
This will mean that in case of a preemption, raspberry and strawberry are more likely to be evicted to make room for higher priority Pods.
&lt;/p&gt;

&lt;p&gt;
Then assign the Priority Classes to Pods by adding this to the Pod definition:
&lt;/p&gt;

&lt;pre&gt; priorityClassName: trueberry
&lt;/pre&gt;

&lt;p&gt;
Let’s now try to add three more fruits, but with a twist. All of the new fruits will contain the higher Priority Class called &lt;code&gt;trueberry&lt;/code&gt;.
&lt;/p&gt;

&lt;p&gt;
Since the three new fruits have memory or CPU requirements that the node can’t satisfy, &lt;code&gt;kube-scheduler&lt;/code&gt; preempts (evicts) all Pods with lower priority than the new fruits. Blueberry stays running, as it has the highest priority class.
&lt;/p&gt;

&lt;pre&gt;NAME         READY   STATUS             RESTARTS   AGE
banana       0/1     ContainerCreating  0          2s
blueberry    1/1     Running            0          4h42m
raspberry    0/1     Terminating        0          59m
strawberry   0/1     Terminating        0          5h23m
tomato       0/1     ContainerCreating  0          2s
watermelon   0/1     ContainerCreating  0          2s
&lt;/pre&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/Blog_Images-TrobleshootingEvicted-diagram4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog_Images-TrobleshootingEvicted-diagram4.png" alt="Kubernetes Priority Classes - Live example"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
This is the end result:
&lt;/p&gt;

&lt;pre&gt;NAME         READY   STATUS             RESTARTS   AGE
banana       1/1     Running            0          3s
blueberry    1/1     Running            0          4h43m
tomato       1/1     Running            0          3s
watermelon   1/1     Running            0          3s
&lt;/pre&gt;

&lt;p&gt;
These are strange times for berry club...
&lt;/p&gt;

&lt;h2 id="nodepressureeviction"&gt;Node-pressure eviction&lt;/h2&gt;

&lt;p&gt;
Apart from preemption, Kubernetes also constantly checks node resources, like disk pressure, CPU or Out of Memory (OOM).
&lt;/p&gt;

&lt;p&gt;
In case a resource (like &lt;strong&gt;CPU&lt;/strong&gt; or &lt;strong&gt;memory&lt;/strong&gt;) consumption in the node &lt;strong&gt;reaches a certain threshold&lt;/strong&gt;, &lt;code&gt;kubelet&lt;/code&gt; will start evicting Pods in order to &lt;strong&gt;free up the resource&lt;/strong&gt;. Quality of Service (QoS) will be taken into account to determine the eviction order.
&lt;/p&gt;

&lt;h3 id="qosclasses"&gt;Quality of Service Classes&lt;/h3&gt;

&lt;p&gt;
In Kubernetes, Pods are given one of three &lt;strong&gt;QoS Classes&lt;/strong&gt;, which define how likely they are to be evicted when resources run short, from least likely to most likely:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Guaranteed&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Burstable&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BestEffort&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
How are these QoS Classes assigned to Pods? This is based on &lt;strong&gt;limits and requests&lt;/strong&gt; for &lt;strong&gt;CPU&lt;/strong&gt; and &lt;strong&gt;memory&lt;/strong&gt;. As a reminder:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Limits&lt;/strong&gt;: maximum amount of a resource that a container can use.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Requests&lt;/strong&gt;: minimum desired amount of resources for a container to run.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
For more information about limits and requests, please check &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;Understanding Kubernetes limits and requests by example&lt;/a&gt;.
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/wp-content/uploads/Troubleshooting-Evicted-Pods-03.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FTroubleshooting-Evicted-Pods-03.png" alt="QoS Classes in Kubernetes"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;h4&gt;Guaranteed&lt;/h4&gt;

&lt;p&gt;
A Pod is assigned a QoS Class of &lt;strong&gt;Guaranteed&lt;/strong&gt; if:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All containers in the Pod have &lt;strong&gt;both Limits and Requests set&lt;/strong&gt; for CPU and memory.
&lt;/li&gt;
&lt;li&gt;All containers in the Pod have &lt;strong&gt;the same value&lt;/strong&gt; for CPU Limit and CPU Request.
&lt;/li&gt;
&lt;li&gt;All containers in the Pod have &lt;strong&gt;the same value&lt;/strong&gt; for memory Limit and memory Request.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
A Guaranteed Pod won’t be evicted in normal circumstances to allocate another Pod in the node.
&lt;/p&gt;

&lt;h4&gt;Burstable&lt;/h4&gt;

&lt;p&gt;
A Pod is assigned a QoS Class of &lt;strong&gt;Burstable&lt;/strong&gt; if:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn’t have a QoS Class of Guaranteed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Either Limits or Requests&lt;/strong&gt; have been set for a container in the Pod.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
A Burstable Pod can be evicted, though it’s less likely than the next category.
&lt;/p&gt;

&lt;h4&gt;BestEffort&lt;/h4&gt;

&lt;p&gt;
A Pod is assigned a QoS Class of &lt;strong&gt;BestEffort&lt;/strong&gt; if:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No Limits and Requests&lt;/strong&gt; are set for any container in the Pod.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;
BestEffort Pods have the highest chance of eviction in case of a node-pressure process happening in the node.
&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;
Important: there may be other available resources in Limits and Requests, like ephemeral-storage, but they are not used for QoS Class calculation.
&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;
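&lt;p&gt;&lt;em&gt;The classification rules above can be sketched as a small function. This is an illustration, not the exact kubelet implementation (for instance, it ignores the case where requests default to limits):&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the QoS classification rules above. Each container is a dict with
# optional "requests" and "limits" maps; real kubelet logic also handles
# requests defaulting to limits, which this illustration skips.
def qos_class(containers):
    def guaranteed(c):
        req, lim = c.get("requests", {}), c.get("limits", {})
        return all(
            req.get(r) is not None and req.get(r) == lim.get(r)
            for r in ("cpu", "memory")
        )
    if all(guaranteed(c) for c in containers):
        return "Guaranteed"
    if any(c.get("requests") or c.get("limits") for c in containers):
        return "Burstable"
    return "BestEffort"

print(qos_class([{"requests": {"cpu": "500m", "memory": "1Gi"},
                  "limits":   {"cpu": "500m", "memory": "1Gi"}}]))  # Guaranteed
print(qos_class([{"requests": {"cpu": "250m"}}]))                   # Burstable
print(qos_class([{}]))                                              # BestEffort
```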
&lt;p&gt;


&lt;a href="https://sysdig.com/wp-content/uploads/Blog_Images-TrobleshootingEvicted-diagram3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog_Images-TrobleshootingEvicted-diagram3.png" alt="Quality of Service cheatsheet"&gt;&lt;/a&gt;
&lt;/p&gt;

&lt;p&gt;
As mentioned, QoS Classes will be taken into account for node-pressure eviction. Here’s the process that happens internally.
&lt;/p&gt;

&lt;p&gt;
The kubelet ranks the Pods to be evicted in the following order:
&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;BestEffort&lt;/code&gt; Pods or &lt;code&gt;Burstable&lt;/code&gt; Pods where usage exceeds requests
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Burstable&lt;/code&gt; Pods where usage is below requests or &lt;code&gt;Guaranteed&lt;/code&gt; Pods
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Kubernetes will try to evict Pods from group 1 before group 2.
&lt;/p&gt;

&lt;p&gt;
Some takeaways from the above:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you set very low requests in your containers, their Pod is likely to be assigned to group 1, which means it's more likely to be evicted.
&lt;/li&gt;
&lt;li&gt;You can’t tell which specific Pod is going to be evicted, just that Kubernetes will try to evict ones from group 1 before group 2.
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Guaranteed&lt;/code&gt; Pods are usually safe from eviction: &lt;code&gt;kubelet&lt;/code&gt; won’t evict them in order to schedule other Pods. But if some system services need more resources, the kubelet will terminate &lt;code&gt;Guaranteed&lt;/code&gt; Pods if necessary, always with the lowest priority.
&lt;/li&gt;
&lt;/ul&gt;
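&lt;p&gt;&lt;em&gt;The two-group ranking can be sketched like this, with illustrative numeric values (BestEffort Pods request nothing, so any usage exceeds their requests):&lt;/em&gt;&lt;/p&gt;

```python
# Sketch of the eviction ranking: group 1 is Pods whose usage exceeds their
# requests (BestEffort Pods request nothing, so any usage qualifies);
# group 2 is everyone else. Values are illustrative, not real units.
def eviction_group(pod):
    return 1 if pod["usage"] > pod["requests"] else 2

pods = [
    {"name": "besteffort",    "requests": 0,   "usage": 50},
    {"name": "burstable-hot", "requests": 100, "usage": 150},
    {"name": "guaranteed",    "requests": 200, "usage": 180},
]
ranked = sorted(pods, key=eviction_group)  # group 1 first, stable within groups
print([p["name"] for p in ranked])  # ['besteffort', 'burstable-hot', 'guaranteed']
```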

&lt;h2 id="othertypeseviction"&gt;Other types of eviction&lt;/h2&gt;

&lt;p&gt;
This article is focused on preemption and node-pressure eviction, but Pods can be evicted in other ways as well. Examples include:
&lt;/p&gt;

&lt;h3&gt;API-initiated eviction&lt;/h3&gt;

&lt;p&gt;
You can request an on-demand eviction of a Pod in one of your nodes by using the &lt;a href="https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.25/#create-eviction-pod-v1-core" rel="noopener noreferrer"&gt;Kubernetes Eviction API&lt;/a&gt;.
&lt;/p&gt;

&lt;h3&gt;Taint-based eviction&lt;/h3&gt;

&lt;p&gt;
With Kubernetes Taints and Tolerations, you can guide how your Pods are assigned to Nodes. But if you apply a &lt;code&gt;NoExecute&lt;/code&gt; taint to an existing Node, all Pods that don’t tolerate it will be evicted immediately.
&lt;/p&gt;
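&lt;p&gt;
For example, tainting a node with the &lt;code&gt;NoExecute&lt;/code&gt; effect (the node name, key, and value here are illustrative) evicts every Pod on it that lacks a matching toleration:
&lt;/p&gt;

&lt;pre&gt;
kubectl taint nodes node1 maintenance=planned:NoExecute
&lt;/pre&gt;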

&lt;h3&gt;Node drain&lt;/h3&gt;

&lt;p&gt;
There are times when Nodes become unusable or you no longer want to schedule work on them. The command &lt;code&gt;kubectl cordon&lt;/code&gt; prevents new Pods from being scheduled on a node, but you can also empty a node of its current Pods entirely. If you run &lt;code&gt;kubectl drain nodename&lt;/code&gt;, all Pods in the node will be evicted, respecting their graceful termination period.
&lt;/p&gt;
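&lt;p&gt;
A typical sequence looks like this (the node name is illustrative):
&lt;/p&gt;

&lt;pre&gt;
# Mark the node unschedulable; existing Pods keep running
kubectl cordon node1

# Evict every Pod from the node (DaemonSet Pods require this flag)
kubectl drain node1 --ignore-daemonsets
&lt;/pre&gt;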

&lt;h2 id="podevictionprometheus"&gt;Kubernetes Pod eviction monitoring in Prometheus&lt;/h2&gt;

&lt;p&gt;
You can use Prometheus to easily monitor Pod evictions with the following query:
&lt;/p&gt;

&lt;pre&gt;
kube_pod_status_reason{reason="Evicted"} &amp;gt; 0
&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Troubleshooting-Evicted-Pods-6.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FTroubleshooting-Evicted-Pods-6.png" alt="Monitor Evicted Pods with Prometheus"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;
This will display all evicted Pods in your cluster. You can also pair it with &lt;code&gt;kube_pod_status_phase{phase="Failed"}&lt;/code&gt; to alert on Pods that were evicted after a failure.
&lt;/p&gt;
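&lt;p&gt;
As a sketch of that pairing (assuming both series come from kube-state-metrics with matching &lt;code&gt;namespace&lt;/code&gt; and &lt;code&gt;pod&lt;/code&gt; labels), you could use:
&lt;/p&gt;

&lt;pre&gt;
kube_pod_status_reason{reason="Evicted"} &amp;gt; 0
  and on(namespace, pod)
kube_pod_status_phase{phase="Failed"} &amp;gt; 0
&lt;/pre&gt;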

&lt;p&gt;
If you want to dig deeper, check the following articles for monitoring resources in Prometheus:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://sysdig.com/blog/kubernetes-resource-limits/" rel="noopener noreferrer"&gt;How to rightsize the Kubernetes resource limits&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sysdig.com/blog/kubernetes-capacity-planning/" rel="noopener noreferrer"&gt;Kubernetes capacity planning: How to rightsize the requests of your cluster &lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;
As you can see, eviction is just another Kubernetes mechanism for managing limited resources: in this case, the node resources that Pods are using.
&lt;/p&gt;

&lt;p&gt;
During &lt;strong&gt;preemption&lt;/strong&gt;, Kubernetes will try to free up resources by evicting lower-priority Pods in order to schedule a new one. With Priority Classes, you can control which Pods are more likely to keep running after preemption.
&lt;/p&gt;

&lt;p&gt;
During execution, Kubernetes will check for &lt;strong&gt;node pressure&lt;/strong&gt; and evict Pods if needed. With QoS classes, you can control which Pods are more likely to be evicted under node pressure.
&lt;/p&gt;

&lt;p&gt;
Memory and CPU are finite resources in your nodes, and you need to configure your Pods, containers, and nodes to use the right amount of them. Managing these resources properly not only reduces costs, but also ensures that your most important processes keep running.
&lt;/p&gt;





&lt;h2&gt;Get ahead of Pod eviction with Sysdig Monitor&lt;/h2&gt;
&lt;p&gt;
With Sysdig Advisor, you can review cluster resource availability in order to prevent Pod eviction. Featuring:
&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU overcommitment metric
&lt;/li&gt;
&lt;li&gt;CPU capacity
&lt;/li&gt;
&lt;li&gt;Tips on how to fix issues
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/cpu-overcommit.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fcpu-overcommit.png" alt="Screenshot of Sysdig Monitor, with the CPU overcommit resource"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;
Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!
&lt;/p&gt;

&lt;p&gt;
&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;
&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>memory</category>
      <category>pods</category>
      <category>evicted</category>
    </item>
    <item>
      <title>What is Kubernetes CrashLoopBackOff? And how to fix it</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Mon, 29 Aug 2022 11:49:48 +0000</pubDate>
      <link>https://dev.to/sysdig/what-is-kubernetes-crashloopbackoff-and-how-to-fix-it-1p84</link>
      <guid>https://dev.to/sysdig/what-is-kubernetes-crashloopbackoff-and-how-to-fix-it-1p84</guid>
      <description>&lt;p&gt;&lt;strong&gt;CrashLoopBackOff is a Kubernetes state representing a restart loop&lt;/strong&gt; that is happening in a Pod: a container in the Pod is started, but &lt;strong&gt;crashes and is then restarted, over and over again&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kubernetes will wait an increasing back-off time between restarts to give you a chance to fix the error. As such, CrashLoopBackOff is not an error in itself, but an indication that an underlying error is preventing a Pod from starting properly.&lt;/p&gt;

&lt;p&gt;Note that the Pod restarts because its &lt;code&gt;restartPolicy&lt;/code&gt; is set to &lt;code&gt;Always&lt;/code&gt; (the default) or &lt;code&gt;OnFailure&lt;/code&gt;. &lt;a href="https://sysdig.com/blog/how-to-monitor-kubelet/" rel="noopener noreferrer"&gt;The kubelet&lt;/a&gt; reads this configuration and restarts the containers in the Pod, causing the loop. This behavior is actually useful, since it provides some time for missing resources to finish loading, as well as for us to detect the problem and debug it – more on that &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/#howtodebugtroubleshootandfixkubernetescrashloopbackoff" rel="noopener noreferrer"&gt;later&lt;/a&gt;.&lt;/p&gt;
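&lt;p&gt;The &lt;code&gt;restartPolicy&lt;/code&gt; lives at the Pod spec level. An illustrative fragment (the names are placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  restartPolicy: Always   # default; OnFailure and Never are also valid
  containers:
  - name: myapp
    image: myapp:1.0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;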

&lt;p&gt;That explains the &lt;em&gt;CrashLoop&lt;/em&gt; part, but what about the &lt;strong&gt;&lt;em&gt;BackOff&lt;/em&gt; time&lt;/strong&gt;? Basically, it’s an &lt;strong&gt;exponential delay between restarts&lt;/strong&gt; (10s, 20s, 40s, …) which is capped at five minutes. When a Pod’s status displays CrashLoopBackOff, it means that it’s currently waiting the indicated time before restarting the containers again. And they will probably fail again, unless the underlying issue is fixed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-02.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-02.png" alt="Kubernetes Crashloopbackoff, an illustrated representation. A Pod is in a loop. It tries to run, but it fails, so it goes to a Failed state. If waits a bit to help you debug, then it tries to run again. If the issue is not fixed, we are in a loop. It fails again."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this article you’ll see:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/#whatisakubernetescrashloopbackoffthemeaning" rel="noopener noreferrer"&gt;What is CrashLoopBackOff?&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/#howcaniseeiftherearecrashloopbackoffinmycluster" rel="noopener noreferrer"&gt;How to detect CrashLoopBackOff problems&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/#whydoesacrashloopbackoffoccur" rel="noopener noreferrer"&gt;Common causes for a CrashLoopBackOff&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/#howtodebugtroubleshootandfixkubernetescrashloopbackoff" rel="noopener noreferrer"&gt;Kubernetes tools for debugging a CrashLoopBackOff&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt; &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/#howtoalertonkubernetescrashloopbackoff" rel="noopener noreferrer"&gt;How to detect CrashLoopBackOff with Prometheus&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  How to detect a CrashLoopBackOff in your cluster?
&lt;/h2&gt;

&lt;p&gt;Most likely, you discovered one or more pods in this state by listing the pods with &lt;code&gt;kubectl get pods&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl get pods
NAME                     READY     STATUS             RESTARTS   AGE
flask-7996469c47-d7zl2   1/1       Running            1          77d
flask-7996469c47-tdr2n   1/1       Running            0          77d
nginx-5796d5bc7c-2jdr5   0/1       CrashLoopBackOff   2          1m
nginx-5796d5bc7c-xsl6p   0/1       CrashLoopBackOff   2          1m

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the output, you can see that the last two pods:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Are not&lt;/strong&gt; in &lt;code&gt;READY&lt;/code&gt; condition (&lt;code&gt;0/1&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;  Their &lt;strong&gt;status&lt;/strong&gt; displays &lt;code&gt;CrashLoopBackOff&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  Column &lt;code&gt;RESTARTS&lt;/code&gt; displays &lt;strong&gt;one or more&lt;/strong&gt; restarts.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These three signals are pointing to what we explained: pods are failing, and they are being restarted. Between restarts, there’s a grace period which is represented as &lt;code&gt;CrashLoopBackOff&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You may also be “lucky” enough to catch the Pod during the brief time it is in the &lt;code&gt;Running&lt;/code&gt; or the &lt;code&gt;Failed&lt;/code&gt; state.&lt;/p&gt;
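&lt;p&gt;Rather than relying on luck, you can watch the Pod list and see those state transitions as they happen:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -w

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;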

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-03.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-03.png" alt="A timeline of a CrashloopBackoff. Everytime it fails, the BackoffTime and the Restart Count are increased"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Common reasons for a CrashLoopBackOff
&lt;/h2&gt;

&lt;p&gt;It’s important to note that a CrashLoopBackOff is &lt;strong&gt;not the actual error that is crashing the pod&lt;/strong&gt;. Remember that it’s just displaying the loop happening in the &lt;code&gt;STATUS&lt;/code&gt; column. You need to find the underlying error affecting the containers.&lt;/p&gt;

&lt;p&gt;Some of the errors linked to the actual &lt;strong&gt;application&lt;/strong&gt; are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Misconfigurations:&lt;/strong&gt; Like a typo in a configuration file.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;A resource is not available:&lt;/strong&gt; Like a PersistentVolume that is not mounted.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Wrong command line arguments:&lt;/strong&gt; Either missing, or the incorrect ones.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Bugs &amp;amp; Exceptions:&lt;/strong&gt; That can be anything, very specific to your application.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And finally, errors from the network and permissions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  You tried to &lt;strong&gt;bind an existing port&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;memory limits are too low&lt;/strong&gt;, so the container is &lt;a href="https://sysdig.com/blog/troubleshoot-kubernetes-oom/" rel="noopener noreferrer"&gt;Out Of Memory killed&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Failing &lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-16/#950" rel="noopener noreferrer"&gt;liveness probes&lt;/a&gt;&lt;/strong&gt;, which cause the container to be restarted repeatedly.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Read-only filesystems&lt;/strong&gt;, or lack of permissions in general.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Once again, this is just a list of possible causes but there could be many others.&lt;/p&gt;

&lt;p&gt;Let’s now see how to dig deeper and find the actual cause.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to debug, troubleshoot and fix a CrashLoopBackOff state
&lt;/h2&gt;

&lt;p&gt;From the previous section, you understand that there are plenty of reasons why a pod ends up in a CrashLoopBackOff state. Now, how do you know which one is affecting you? Let’s review &lt;strong&gt;some tools you can use to debug it&lt;/strong&gt;, and in which order to use them.&lt;/p&gt;

&lt;p&gt;This could be our best course of action:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Check the &lt;strong&gt;pod description&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Check the &lt;strong&gt;pod&lt;/strong&gt; &lt;strong&gt;logs&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Check the &lt;strong&gt;events&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt; Check the &lt;strong&gt;deployment&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1. Check the pod description – kubectl describe pod
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;kubectl describe pod&lt;/code&gt; command provides detailed information of a specific Pod and its containers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kubectl describe pod the-pod-name
Name:         the-pod-name
Namespace:    default
Priority:     0
…
State:          Waiting
Reason:       CrashLoopBackOff
Last State:     Terminated
Reason:       Error
…
Warning  BackOff                1m (x5 over 1m)   kubelet, ip-10-0-9-132.us-east-2.compute.internal  Back-off restarting failed container
…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From the describe output, you can extract the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Current pod &lt;code&gt;State&lt;/code&gt; is &lt;code&gt;Waiting&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reason&lt;/strong&gt; for the Waiting state is “&lt;strong&gt;CrashLoopBackOff&lt;/strong&gt;”.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Last&lt;/strong&gt; (or previous) state was “&lt;strong&gt;Terminated&lt;/strong&gt;”.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Reason&lt;/strong&gt; for the last termination was “&lt;strong&gt;Error&lt;/strong&gt;”.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That aligns with the loop behavior we’ve been explaining.&lt;/p&gt;

&lt;p&gt;By using &lt;code&gt;kubectl describe pod&lt;/code&gt; you can check for misconfigurations in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The pod definition.&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;container&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  The &lt;strong&gt;image&lt;/strong&gt; pulled for the container.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Resources&lt;/strong&gt; allocated for the container.&lt;/li&gt;
&lt;li&gt;  Wrong or missing &lt;strong&gt;arguments&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;  …
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…
Warning  BackOff                1m (x5 over 1m)   kubelet, ip-10-0-9-132.us-east-2.compute.internal  Back-off restarting failed container
…

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the final lines, you see a list of the last events associated with this pod, where one of those is &lt;code&gt;"Back-off restarting failed container"&lt;/code&gt;. This is the event linked to the restart loop. There will be just one event line even if multiple restarts have happened; the counter (here, &lt;code&gt;x5&lt;/code&gt;) shows how many times.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Check the logs – kubectl logs
&lt;/h3&gt;

&lt;p&gt;You can view the logs for all the containers of the pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs mypod --all-containers

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or even a container in that pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl logs mypod -c mycontainer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a wrong value in the affected pod is the cause, the logs may display useful information.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Check the events – kubectl get events
&lt;/h3&gt;

&lt;p&gt;They can be listed with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can list all of the events of a single Pod by using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events --field-selector involvedObject.name=mypod

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that this information is also present at the bottom of the &lt;code&gt;describe pod&lt;/code&gt; output.&lt;/p&gt;
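&lt;p&gt;Events are not guaranteed to be listed in order, so it can also help to sort them by creation timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get events --sort-by=.metadata.creationTimestamp

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;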

&lt;h3&gt;
  
  
  4. Check the deployment – kubectl describe deployment
&lt;/h3&gt;

&lt;p&gt;You can get this information with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl describe deployment mydeployment

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there’s a Deployment defining the desired Pod state, it might contain a misconfiguration that is causing the CrashLoopBackOff.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it all together
&lt;/h3&gt;

&lt;p&gt;In the following example you can see how to dig into the logs, where an error in a command argument is found.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-04.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-04.png" alt="Debugging a Crashloopbackoff. It shows three terminals with the relationship between several debug commands."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to detect CrashLoopBackOff in Prometheus
&lt;/h2&gt;

&lt;p&gt;If you’re using &lt;a href="https://sysdig.com/blog/prometheus-query-examples/" rel="noopener noreferrer"&gt;Prometheus for cloud monitoring&lt;/a&gt;, here are some tips that can help you alert when a CrashLoopBackOff takes place.&lt;/p&gt;

&lt;p&gt;You can quickly scan the containers in your cluster that are in CrashLoopBackOff &lt;code&gt;status&lt;/code&gt; by using the following expression (you will need &lt;a href="https://docs.sysdig.com/en/docs/installation/sysdig-agent/agent-configuration/enable-kube-state-metrics/" rel="noopener noreferrer"&gt;Kube State Metrics&lt;/a&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-05.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-05.png" alt="PromQL example of CrashLoopBackOff detection based on pod status waiting."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, you can track the number of restarts happening in pods with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rate(kube_pod_container_status_restarts_total[5m]) &amp;gt; 0

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-06.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-06.png" alt="PromQL example of CrashLoopBackOff detection based on restart rate"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning:&lt;/strong&gt; Not all restarts happening in your cluster are related to CrashLoopBackOff statuses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-07.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-07.png" alt="Correlation between restarts and crashloopbackoff. Not all restarts are caused by a crashloopbackoff."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After every CrashLoopBackOff period there should be a restart (1), but there could be restarts not related with CrashLoopBackOff (2).&lt;/p&gt;

&lt;p&gt;Afterwards, you could create a Prometheus Alerting Rule like the following to receive notifications if any of your pods are in this state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- alert: RestartsAlert
  expr: rate(kube_pod_container_status_restarts_total[5m]) &amp;gt; 0
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: Pod is being restarted
    description: Pod {{ $labels.pod }} in {{ $labels.namespace }} has a container {{ $labels.container }} which is being restarted

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;In this article, we have seen how CrashLoopBackOff isn’t an error by itself, but just a notification of the retrial loop that is happening in the pod.&lt;/p&gt;

&lt;p&gt;We saw a description of the states it passes through, and then how to track it with &lt;code&gt;kubectl&lt;/code&gt; commands.&lt;/p&gt;

&lt;p&gt;Also, we have seen common misconfigurations that can cause this state and what tools you can use to debug it.&lt;/p&gt;

&lt;p&gt;Finally, we reviewed how Prometheus can help us in tracking and alerting CrashLoopBackOff events in our pods.&lt;/p&gt;

&lt;p&gt;Although not an intuitive message, CrashLoopBackOff is a useful concept that makes sense and is nothing to be afraid of.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;em&gt;Debug CrashLoopBackOff faster with Sysdig Monitor&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Advisor, a new Kubernetes troubleshooting product in Sysdig Monitor, &lt;strong&gt;accelerates troubleshooting by up to 10x&lt;/strong&gt;. Advisor displays a prioritized list of issues and relevant troubleshooting data to surface the biggest problem areas and accelerate time to resolution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/What-is-Crashloopbackoff-08.gif" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FWhat-is-Crashloopbackoff-08.gif" alt="How to debug a crashloopbackoff with Sysdig Monitor Advisor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it for yourself for free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>crashloopbackoff</category>
      <category>pods</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
