<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sysdig</title>
    <description>The latest articles on DEV Community by Sysdig (@sysdig).</description>
    <link>https://dev.to/sysdig</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6084%2Fd2633afb-5834-4976-913e-e4df77f6cd5c.png</url>
      <title>DEV Community: Sysdig</title>
      <link>https://dev.to/sysdig</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sysdig"/>
    <language>en</language>
    <item>
      <title>Top metrics for Elasticsearch monitoring with Prometheus</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Tue, 09 May 2023 08:18:13 +0000</pubDate>
      <link>https://dev.to/sysdig/top-metrics-for-elasticsearch-monitoring-with-prometheus-3pca</link>
      <guid>https://dev.to/sysdig/top-metrics-for-elasticsearch-monitoring-with-prometheus-3pca</guid>
      <description>&lt;p&gt;Starting the journey for Elasticsearch monitoring is crucial to get the right visibility and transparency over its behavior.&lt;/p&gt;

&lt;p&gt;Elasticsearch is one of the most widely used &lt;strong&gt;search and analytics engines&lt;/strong&gt;. It provides both scalability and redundancy for highly available search. As of 2023, more than sixty thousand companies of all sizes and backgrounds are using it as their search solution to track a diverse range of data, such as analytics, logs, or business information.&lt;/p&gt;

&lt;p&gt;By distributing data in JSON documents and indexing that data into several shards, Elasticsearch provides high availability, fast search, and redundancy.&lt;/p&gt;

&lt;p&gt;In this article, we will evaluate the most important Prometheus metrics provided by the Elasticsearch exporter.&lt;/p&gt;

&lt;p&gt;You will learn the main areas to focus on when monitoring an Elasticsearch system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to start monitoring Elasticsearch with Prometheus.&lt;/li&gt;



&lt;li&gt;How to monitor Golden Signals.&lt;/li&gt;



&lt;li&gt;How to monitor infra metrics.&lt;/li&gt;



&lt;li&gt;How to monitor index performance.&lt;/li&gt;



&lt;li&gt;How to monitor search performance.&lt;/li&gt;



&lt;li&gt;How to monitor cluster performance.&lt;/li&gt;



&lt;li&gt;Advanced monitoring and next steps.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="start-monitoring"&gt;How to start monitoring ElasticSearch with Prometheus&lt;/h2&gt;

&lt;p&gt;As usual, the easiest way to start your Prometheus monitoring journey with Elasticsearch is to use &lt;a href="https://promcat.io" rel="noopener nofollow noreferrer"&gt;PromCat.io&lt;/a&gt; to find the best configs, dashboards, and alerts. The &lt;a href="https://promcat.io/apps/elasticsearch#SetupGuide" rel="noopener nofollow noreferrer"&gt;Elasticsearch setup guide in PromCat&lt;/a&gt; includes the Elasticsearch exporter with a series of out-of-the-box metrics that will be automatically scraped by Prometheus. It also includes a collection of curated alerts and dashboards to start monitoring Elasticsearch right away.&lt;/p&gt;
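&lt;p&gt;As a reference, a minimal Prometheus scrape job for the exporter could look like the following sketch. The job name and target are placeholders; the prometheus-community &lt;code&gt;elasticsearch_exporter&lt;/code&gt; listens on port 9114 by default.&lt;/p&gt;

```yaml
# Sketch: scrape the Elasticsearch exporter from Prometheus
# (job name and target host are placeholders for your environment)
scrape_configs:
  - job_name: 'elasticsearch'
    static_configs:
      - targets: ['elasticsearch-exporter:9114']
```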

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZfNQKQOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image1-1170x383.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZfNQKQOA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image1-1170x383.png" alt="Top metrics for Elasticsearch - metric list" title="image_tooltip" width="800" height="262"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can combine these metrics with the &lt;a href="https://sysdig.com/blog/exporters-and-target-labels/" rel="noopener"&gt;Node Exporter&lt;/a&gt; to get more insights into your infrastructure. Also, if you're running Elasticsearch on Kubernetes, you can &lt;a href="https://sysdig.es/blog/kubernetes-monitoring-prometheus/#kube-state-metrics" rel="noopener nofollow noreferrer"&gt;use KSM and CAdvisor&lt;/a&gt; to combine Kubernetes metrics with Elasticsearch metrics.&lt;/p&gt;

&lt;h2 id="monitor-golden-signals"&gt;How to monitor Golden Signals in Elasticsearch&lt;/h2&gt;

&lt;p&gt;To review a bare minimum of important metrics, remember to check the so-called &lt;a href="https://sysdig.com/blog/golden-signals-kubernetes/" rel="noopener"&gt;Golden Signals&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Errors.&lt;/li&gt;



&lt;li&gt;Traffic.&lt;/li&gt;



&lt;li&gt;Saturation.&lt;/li&gt;



&lt;li&gt;Latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These represent the essential metrics to look for in a system when doing black-box monitoring (focusing only on what’s happening in the system, not why). In other words, Golden Signals measure symptoms, not causes. They are a good starting point for an Elasticsearch monitoring dashboard.&lt;/p&gt;

&lt;h3&gt;Errors&lt;/h3&gt;

&lt;h5&gt;elasticsearch_cluster_health_status&lt;/h5&gt;

&lt;p&gt;Cluster health in Elasticsearch is measured by the colors green, yellow, and red, as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Green&lt;/strong&gt;: Data integrity is correct, no shard is missing.&lt;/li&gt;



&lt;li&gt;
&lt;strong&gt;Yellow&lt;/strong&gt;: At least one replica shard is unassigned, but data integrity is preserved thanks to the remaining copies.&lt;/li&gt;



&lt;li&gt;
&lt;strong&gt;Red&lt;/strong&gt;: A primary shard is missing or unassigned, and data loss is possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With &lt;code&gt;elasticsearch_cluster_health_status&lt;/code&gt;, you can quickly check the current health of Elasticsearch data on a particular cluster. Remember that this won’t tell you the actual cause of the problem, only that you need to act in order to prevent further damage.&lt;/p&gt;
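&lt;p&gt;As a sketch, an alerting expression on this metric could look like the one below. It assumes the prometheus-community exporter, which exposes one series per color with value 1 for the color currently reported.&lt;/p&gt;

```promql
# Fires while the cluster reports red status
elasticsearch_cluster_health_status{color="red"} == 1
```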

&lt;h3&gt;Traffic&lt;/h3&gt;

&lt;h5&gt;elasticsearch_indices_search_query_total&lt;/h5&gt;

&lt;p&gt;This metric is a counter with the total number of search queries executed, which by itself won’t give you much information as a number.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Consider as well using &lt;code&gt;rate()&lt;/code&gt; or &lt;code&gt;irate()&lt;/code&gt; to detect sudden changes or spikes in traffic. Dig deeper into Prometheus queries with our &lt;a href="https://sysdig.com/blog/getting-started-with-promql-cheatsheet/" rel="noreferrer noopener"&gt;Getting started with PromQL guide&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
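&lt;p&gt;For example, a simple PromQL expression to turn the raw counter into traffic per second:&lt;/p&gt;

```promql
# Search queries per second, averaged over the last 5 minutes
rate(elasticsearch_indices_search_query_total[5m])
```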

&lt;h3&gt;Saturation&lt;/h3&gt;

&lt;p&gt;For a detailed saturation analysis, check the section on How to monitor Elasticsearch infra metrics.&lt;/p&gt;

&lt;h3&gt;Latency&lt;/h3&gt;

&lt;p&gt;For a detailed latency analysis, check the section on How to monitor Elasticsearch index performance.&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-infra-metrics"&gt;How to monitor Elasticsearch infra metrics&lt;/h2&gt;

&lt;p&gt;Infrastructure monitoring focuses on tracking the overall performance of the servers and nodes of a system. As with similar cloud applications, most of the effort will be spent on monitoring CPU and memory consumption.&lt;/p&gt;

&lt;h3&gt;Monitoring Elasticsearch CPU&lt;/h3&gt;

&lt;h5&gt;elasticsearch_process_cpu_percent&lt;/h5&gt;

&lt;p&gt;This is a gauge metric used to measure the current CPU usage percent (0-100) of the Elasticsearch process. Since chances are that you’re running several Elasticsearch nodes, you will need to track each one separately.&lt;/p&gt;
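&lt;p&gt;For instance, a simple alerting expression could flag any node whose Elasticsearch process stays busy; the 90% threshold here is just an illustrative value:&lt;/p&gt;

```promql
# Any Elasticsearch node above 90% process CPU usage
elasticsearch_process_cpu_percent > 90
```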

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R1QaaKO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image2-1170x553.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R1QaaKO---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image2-1170x553.png" alt="Top metrics for Elasticsearch - CPU usage" title="image_tooltip" width="800" height="378"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;elasticsearch_indices_store_throttle_time_seconds_total&lt;/h5&gt;

&lt;p&gt;In case you’re using a file system as an index store, you can expect a certain level of delays in input and output operations. This metric represents how much your Elasticsearch index store is being throttled.&lt;/p&gt;

&lt;p&gt;Since this is a counter metric that only accumulates the total number of seconds, consider using &lt;code&gt;rate&lt;/code&gt; or &lt;code&gt;irate&lt;/code&gt; to evaluate how quickly it is changing.&lt;/p&gt;

&lt;h3&gt;Monitoring Elasticsearch JVM Memory&lt;/h3&gt;

&lt;p&gt;Elasticsearch is based on &lt;a rel="noopener nofollow noreferrer" href="https://lucene.apache.org/"&gt;Lucene&lt;/a&gt;, which is built in Java. This means that monitoring the Java Virtual Machine (JVM) memory is crucial to understand the current usage of the whole system.&lt;/p&gt;

&lt;h5&gt;elasticsearch_jvm_memory_used_bytes&lt;/h5&gt;

&lt;p&gt;This metric is a gauge that represents the memory usage in bytes for each area.&lt;/p&gt;
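&lt;p&gt;To get a usage ratio instead of raw bytes, you can divide it by the maximum available memory. This sketch assumes the exporter also exposes &lt;code&gt;elasticsearch_jvm_memory_max_bytes&lt;/code&gt;:&lt;/p&gt;

```promql
# Fraction of JVM heap currently in use per node (0-1)
elasticsearch_jvm_memory_used_bytes{area="heap"}
  / elasticsearch_jvm_memory_max_bytes{area="heap"}
```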

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7PmWYJ4H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image3-1170x455.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7PmWYJ4H--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image3-1170x455.png" alt="Top metrics for Elasticsearch - Memory used" title="image_tooltip" width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-index-performance"&gt;How to monitor Elasticsearch index performance&lt;/h2&gt;

&lt;p&gt;Indices in Elasticsearch partition the data as a logical namespace. Elasticsearch indexes documents in order to retrieve or search them as fast as possible.&lt;/p&gt;

&lt;p&gt;Every time a new index is created, you can define the number of shards and replicas for it:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  }
}&lt;/code&gt;&lt;/pre&gt;
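&lt;p&gt;For example, you could apply these settings when creating an index through the Elasticsearch REST API; the index name and host below are placeholders:&lt;/p&gt;

```shell
# Create a hypothetical index "my-index" with the settings above
curl -X PUT "http://localhost:9200/my-index" \
  -H 'Content-Type: application/json' \
  -d '{"settings": {"number_of_shards": 1, "number_of_replicas": 1}}'
```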

&lt;h5&gt;elasticsearch_indices_indexing_index_time_seconds_total&lt;/h5&gt;

&lt;p&gt;This metric is a counter of the accumulated seconds spent on indexing. It gives you a good approximation of Elasticsearch indexing performance.&lt;/p&gt;

&lt;p&gt;Note that you can divide this metric by &lt;code&gt;elasticsearch_indices_indexing_index_total&lt;/code&gt; in order to get the average indexing time per operation.&lt;/p&gt;
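&lt;p&gt;In PromQL, that division is usually done over rates so that it reflects recent behavior rather than the all-time average:&lt;/p&gt;

```promql
# Average indexing time per operation over the last 5 minutes
rate(elasticsearch_indices_indexing_index_time_seconds_total[5m])
  / rate(elasticsearch_indices_indexing_index_total[5m])
```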

&lt;h5&gt;elasticsearch_indices_refresh_time_seconds_total&lt;/h5&gt;

&lt;p&gt;For an index to be searchable, Elasticsearch needs a refresh to be executed. This is controlled by the &lt;code&gt;index.refresh_interval&lt;/code&gt; setting, which defaults to one second.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;elasticsearch_indices_refresh_time_seconds_total&lt;/code&gt; metric is a counter with the total time dedicated to refreshing in Elasticsearch.&lt;/p&gt;

&lt;p&gt;In case you want to measure the average time for refresh, you can divide this metric by &lt;code&gt;elasticsearch_indices_refresh_total&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-search-performance"&gt;How to monitor Elasticsearch search performance&lt;/h2&gt;

&lt;p&gt;While Elasticsearch promises near-instant query speed, chances are that in the real world this is not always the case. The number of shards, the storage solution chosen, or the cache configuration can all impact search performance, and it’s crucial to track the current behavior.&lt;/p&gt;

&lt;p&gt;Additionally, the usage of wildcards, joins, or the number of fields being searched will drastically affect the overall processing time of search queries.&lt;/p&gt;

&lt;h5&gt;elasticsearch_indices_search_fetch_time_seconds&lt;/h5&gt;

&lt;p&gt;A counter metric aggregating the total number of seconds spent fetching search results.&lt;/p&gt;

&lt;p&gt;In case you want to retrieve the average fetch time per operation, just divide the result by &lt;code&gt;elasticsearch_indices_search_fetch_total&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id="monitor-elasticsearch-cluster"&gt;How to monitor Elasticsearch cluster performance&lt;/h2&gt;

&lt;p&gt;Apart from the usual cloud requirements, for an Elasticsearch system you will want to keep an eye on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of shards.&lt;/li&gt;



&lt;li&gt;Number of replicas.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a rule of thumb, the number of shards per GB of heap space should be less than 20.&lt;/p&gt;
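&lt;p&gt;As an illustrative sketch, that rule of thumb can be expressed in PromQL, assuming the exporter exposes &lt;code&gt;elasticsearch_jvm_memory_max_bytes&lt;/code&gt; for the heap area:&lt;/p&gt;

```promql
# Shards per GB of JVM heap across the cluster; flag when above 20
elasticsearch_cluster_health_active_shards
  / (sum(elasticsearch_jvm_memory_max_bytes{area="heap"}) / 1024^3) > 20
```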

&lt;p&gt;Note as well that it’s suggested to have a separate cluster dedicated to monitoring.&lt;/p&gt;

&lt;h5&gt;elasticsearch_cluster_health_active_shards&lt;/h5&gt;

&lt;p&gt;This metric is a gauge that indicates the number of active shards (both primaries and replicas) across the cluster.&lt;/p&gt;

&lt;h5&gt;elasticsearch_cluster_health_relocating_shards&lt;/h5&gt;

&lt;p&gt;Elasticsearch dynamically moves shards between nodes for rebalancing or based on current usage. With this metric, you can track when this movement is happening.&lt;/p&gt;

&lt;h2 id="advanced-monitoring"&gt;Advanced Monitoring&lt;/h2&gt;

&lt;p&gt;Remember that the Prometheus exporter will give you a set of out-of-the-box metrics that are relevant enough to kickstart your monitoring journey. But the real challenge comes when you take the step to create your own custom metrics tailored to your application.&lt;/p&gt;

&lt;h3&gt;REST API&lt;/h3&gt;

&lt;p&gt;Additionally, mind that Elasticsearch &lt;a rel="noopener nofollow noreferrer" href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rest-apis.html"&gt;provides a REST API&lt;/a&gt; that you can query for more fine-grained monitoring.&lt;/p&gt;

&lt;h3&gt;VisualVM&lt;/h3&gt;

&lt;p&gt;The &lt;a rel="noopener nofollow noreferrer" href="https://visualvm.github.io/"&gt;Java VisualVM&lt;/a&gt; project is an advanced dashboard for Memory and CPU monitoring. It features advanced resource visualization, as well as process and thread utilization.&lt;/p&gt;

&lt;h2&gt;Download the Dashboards&lt;/h2&gt;

&lt;p&gt;You can download the dashboards with the metrics seen in this article &lt;a rel="noopener nofollow noreferrer" href="https://promcat.io/apps/elasticsearch"&gt;through the Promcat official page&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a curated selection of the above metrics that can be easily integrated with your Grafana or Sysdig Monitor solution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cidTehhY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image4-1170x739.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cidTehhY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://sysdig.com/wp-content/uploads/Top-metrics-elasticsearch-image4-1170x739.png" alt="Top metrics for Elasticsearch - Grafana dashboards" title="image_tooltip" width="800" height="505"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Elasticsearch is one of the most important search engines available, featuring high availability, high scalability, and distributed capabilities through redundancy.&lt;/p&gt;

&lt;p&gt;Using the Elasticsearch exporter for Prometheus, you can easily kickstart your monitoring journey by automatically receiving the most important metrics.&lt;/p&gt;

&lt;p&gt;As with many other applications, CPU and memory are crucial to understanding system saturation. You should be aware of the current CPU throttling and the memory handling of the JVM.&lt;/p&gt;

&lt;p&gt;Finally, it’s important to dig deeper into the particularities of Elasticsearch, like indices and search capabilities, to truly understand the challenges of monitoring and visualization.&lt;/p&gt;





</description>
      <category>elasticsearch</category>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes CreateContainerConfigError and CreateContainerError</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 23 Mar 2023 15:58:05 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-createcontainerconfigerror-and-createcontainererror-1o5a</link>
      <guid>https://dev.to/sysdig/kubernetes-createcontainerconfigerror-and-createcontainererror-1o5a</guid>
      <description>&lt;p&gt;CreateContainerConfigError and CreateContainerError are two of the most prevalent Kubernetes errors found in cloud-native applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CreateContainerConfigError&lt;/strong&gt; is an error happening when the &lt;strong&gt;configuration specified for a container in a Pod is not correct&lt;/strong&gt; or is missing a vital part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CreateContainerError&lt;/strong&gt; is a problem happening&lt;strong&gt; at a later stage&lt;/strong&gt; in the container creation flow. Kubernetes displays this error when it attempts to create the container in the Pod.&lt;/p&gt;

&lt;p&gt;In this article, you will learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is Kubernetes CreateContainerConfigError?&lt;/li&gt;



&lt;li&gt;What is Kubernetes CreateContainerError?&lt;/li&gt;



&lt;li&gt;Kubernetes container creation flow&lt;/li&gt;



&lt;li&gt;Common causes for CreateContainerError and CreateConfigError&lt;/li&gt;



&lt;li&gt;How to troubleshoot both errors&lt;/li&gt;



&lt;li&gt;How to detect both errors in Prometheus&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="what-is-createcontainerconfigerror"&gt;What is CreateContainerConfigError?&lt;/h2&gt;

&lt;p&gt;During the process to start a new container, Kubernetes first tries to generate the configuration for it. In fact, this is handled internally by calling a method called &lt;em&gt;generateContainerConfig&lt;/em&gt;, which will try to retrieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Container command and arguments&lt;/li&gt;



&lt;li&gt;Relevant persistent volumes for the container&lt;/li&gt;



&lt;li&gt;Relevant ConfigMaps for the container&lt;/li&gt;



&lt;li&gt;Relevant secrets for the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Any problem in the elements above will result in a CreateContainerConfigError.  &lt;/p&gt;

&lt;h2 id="what-is-createcontainererror"&gt;What is CreateContainerError?&lt;/h2&gt;

&lt;p&gt;Kubernetes throws a CreateContainerError when there’s a problem in the creation of the container that is unrelated to its configuration, like a referenced volume not being accessible or a container name already being in use.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;As with similar problems like &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/" rel="noopener noreferrer"&gt;CrashLoopBackOff&lt;/a&gt;, this article only covers the most common causes; there are many others depending on your particular application.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;How you can detect CreateContainerConfigError and CreateContainerError&lt;/h2&gt;

&lt;p&gt;You can detect both errors by running &lt;code&gt;kubectl get pods&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;NAME  READY STATUS                     RESTARTS AGE
mypod 0/1   CreateContainerConfigError 0        11m&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;As you can see from this output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Pod is not ready: one of its containers has an error.&lt;/li&gt;



&lt;li&gt;There are no restarts: these two errors are not like CrashLoopBackOff, where automatic retries are in place.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="container-creation-flow"&gt;Kubernetes container creation flow&lt;/h2&gt;

&lt;p&gt;In order to understand CreateContainerError and CreateContainerConfigError, we first need to know the exact flow for container creation.&lt;/p&gt;

&lt;p&gt;Kubernetes follows these steps every time a new container needs to be started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pull the image.&lt;/li&gt;



&lt;li&gt;Generate container configuration.&lt;/li&gt;



&lt;li&gt;Precreate container.&lt;/li&gt;



&lt;li&gt;Create container.&lt;/li&gt;



&lt;li&gt;Pre-start container.&lt;/li&gt;



&lt;li&gt;Start container.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As you can see, steps 2 and 4 are where a CreateContainerConfigError and a CreateContainerError might appear, respectively.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Createcontainererror-02-1170x464.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FCreatecontainererror-02-1170x464.png" alt="Create container and start container flow"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="common-causes-createcontainerconfigerror"&gt;Common causes for CreateContainerError and CreateContainerConfigError&lt;/h2&gt;

&lt;h3&gt;Not found ConfigMap&lt;/h3&gt;

&lt;p&gt;Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/configuration/configmap/" rel="noopener noreferrer"&gt;ConfigMaps&lt;/a&gt; are a key element to store non-confidential information to be used by Pods as key-value pairs.&lt;/p&gt;

&lt;p&gt;When adding a ConfigMap reference in a Pod, you are effectively indicating that it should retrieve specific data from it. But, if a Pod references a non-existent ConfigMap, Kubernetes will return a CreateContainerConfigError.&lt;/p&gt;
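&lt;p&gt;As an illustration, the minimal Pod below would end up in CreateContainerConfigError, assuming no ConfigMap named &lt;code&gt;missing-configmap&lt;/code&gt; exists in the namespace (all names here are hypothetical):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
    - name: mycontainer
      image: nginx
      envFrom:
        - configMapRef:
            # Referencing a non-existent ConfigMap triggers the error
            name: missing-configmap
```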

&lt;h3&gt;Not found Secret&lt;/h3&gt;

&lt;p&gt;Secrets are a more secure manner to store sensitive information in Kubernetes. Remember, though, this is just raw data encoded in base64, so it’s not really encrypted, just obfuscated.&lt;/p&gt;

&lt;p&gt;In case a Pod contains a reference to a non-existent secret, Kubelet will throw a CreateContainerConfigError, indicating that necessary data couldn’t be retrieved in order to form container config.&lt;/p&gt;

&lt;h3&gt;Container name already in use&lt;/h3&gt;

&lt;p&gt;Although unusual, in some cases a conflict might occur because a particular container name is already in use. Since every Docker container must have a unique name, you will need to either delete the original container or rename the new one being created.&lt;/p&gt;

&lt;h2 id="troubleshoot"&gt;How to troubleshoot CreateContainerError and CreateContainerConfigError&lt;/h2&gt;

&lt;p&gt;While the causes for an error in container creation might vary, you can always rely on the following methods to troubleshoot the problem that’s preventing the container from starting.&lt;/p&gt;

&lt;h3&gt;Describe Pods&lt;/h3&gt;

&lt;p&gt;With &lt;code&gt;kubectl describe pod&lt;/code&gt;, you can retrieve the detailed information for the affected Pod and its containers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Containers:
  mycontainer:
    Container ID:
    Image:          nginx
    Image ID:
    Port:           &amp;lt;none&amp;gt;
    Host Port:      &amp;lt;none&amp;gt;
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:  3
---
Volumes:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      myconfigmap
    Optional:  false&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Get logs from containers&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;kubectl logs&lt;/code&gt; to retrieve the log information from containers in the Pod. Note that for Pods with multiple containers, you need to use the &lt;code&gt;--all-containers&lt;/code&gt; parameter:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (BadRequest): container "mycontainer" in pod "mypod" is waiting to start: CreateContainerConfigError&lt;/code&gt;&lt;/pre&gt;
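&lt;p&gt;For example, for a multi-container Pod the invocation would look like this (the Pod name is a placeholder):&lt;/p&gt;

```shell
# Fetch logs from every container in the Pod
kubectl logs mypod --all-containers
```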

&lt;h3&gt;Check the events&lt;/h3&gt;

&lt;p&gt;You can also run &lt;code&gt;kubectl get events&lt;/code&gt; to retrieve all the recent events happening in your Pods. Remember that the describe pods command also displays the particular events at the end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Createcontainererror-03-1170x586.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FCreatecontainererror-03-1170x586.png" alt="Createcontainerconfig error troubleshooting diagram"&gt;&lt;/a&gt;Terminal windows for the kubectl commands used to troubleshoot a CreateContainerConfigError&lt;/p&gt;



&lt;h2 id="detect-in-prometheus"&gt;How to detect CreateContainerConfigError and CreateContainerError in Prometheus&lt;/h2&gt;

&lt;p&gt;When using Prometheus + kube-state-metrics, you can quickly retrieve Pods that have containers with errors at creation or config steps:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kube_pod_container_status_waiting_reason{reason="CreateContainerConfigError"} &amp;gt; 0
kube_pod_container_status_waiting_reason{reason="CreateContainerError"} &amp;gt; 0&lt;/code&gt;&lt;/pre&gt;
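&lt;p&gt;As a sketch, both expressions can be combined into a single Prometheus alerting rule; the group and alert names below are placeholders:&lt;/p&gt;

```yaml
groups:
  - name: container-creation
    rules:
      - alert: ContainerCreationError
        # Matches either waiting reason reported by kube-state-metrics
        expr: kube_pod_container_status_waiting_reason{reason=~"CreateContainerConfigError|CreateContainerError"} > 0
        for: 5m
        labels:
          severity: warning
```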

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image2-53-1170x997.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage2-53-1170x997.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="other-errors"&gt;Other similar errors&lt;/h2&gt;

&lt;h3&gt;Pending&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noopener noreferrer"&gt;Pending is a Pod status&lt;/a&gt; that appears when the Pod couldn’t even be started. Note that this happens at schedule time, so Kube-scheduler couldn’t find a node because of not enough resources or not proper taints/tolerations config.&lt;/p&gt;

&lt;h3&gt;ContainerCreating&lt;/h3&gt;

&lt;p&gt;ContainerCreating is another waiting status reason, which can happen when the container could not be started because of a problem in the execution (e.g., &lt;code&gt;No command specified&lt;/code&gt;):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (BadRequest): container "mycontainer" in pod "mypod" is waiting to start: ContainerCreating   &lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;RunContainerError&lt;/h3&gt;

&lt;p&gt;This might be a similar situation to CreateContainerError, but note that this happens during the run step and not the container creation step.&lt;/p&gt;

&lt;p&gt;A RunContainerError most likely points to problems happening at runtime, like attempts to write on a read-only volume.&lt;/p&gt;

&lt;h3&gt;CrashLoopBackOff&lt;/h3&gt;

&lt;p&gt;Remember that &lt;a href="https://sysdig.com/blog/debug-kubernetes-crashloopbackoff/" rel="noopener noreferrer"&gt;CrashLoopBackOff&lt;/a&gt; is not technically an error, but the grace period added between retries.&lt;/p&gt;

&lt;p&gt;Unlike CrashLoopBackOff events, CreateContainerError and CreateContainerConfigError won’t be retried automatically.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;In this article, you have seen how both CreateContainerConfigError and CreateContainerError are important messages in the Kubernetes container creation process. Being able to detect them and understand at which stage they are happening is crucial for the day-to-day debugging of cloud-native services.&lt;/p&gt;

&lt;p&gt;Also, it’s important to know the internal behavior of the Kubernetes container creation flow and which errors might appear at each step.&lt;/p&gt;

&lt;p&gt;Finally, CreateContainerConfigError and CreateContainerError might be mistaken for other Kubernetes errors, but these two happen at the container creation stage and are not automatically retried.&lt;/p&gt;









&lt;h2&gt;&lt;em&gt;Troubleshoot CreateContainerError with Sysdig Monitor
&lt;/em&gt;&lt;/h2&gt;
&lt;p&gt;With Sysdig Monitor’s Advisor, you can easily detect which containers are having CreateContainerConfigError or CreateContainerError problems in your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image3-39-1170x579.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage3-39-1170x579.png" alt="Rightsize your Kubernetes Resources with Sysdig Monitor"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;&lt;a href="https://sysdig.com/start-free/" rel="noopener noreferrer"&gt;Try it free&lt;/a&gt; for 30 days!&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>cloud</category>
      <category>devops</category>
    </item>
    <item>
      <title>Monitoring with Custom Metrics</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 02 Mar 2023 10:19:53 +0000</pubDate>
      <link>https://dev.to/sysdig/monitoring-with-custom-metrics-p52</link>
      <guid>https://dev.to/sysdig/monitoring-with-custom-metrics-p52</guid>
      <description>&lt;p&gt;&lt;strong&gt;Custom metrics&lt;/strong&gt; are application-level or business-related tailored metrics, as opposed to the ones that come directly out-of-the-box from monitoring systems like Prometheus (e.g: kube-state-metrics or node exporter)&lt;/p&gt;

&lt;p&gt;When kickstarting a monitoring project with Prometheus, you might realize that you get an initial set of out-of-the-box metrics with just Node Exporter and Kube State Metrics. But this will only get you so far, since you will just be performing black-box monitoring. How can you go to the next level and observe what’s beyond?&lt;/p&gt;

&lt;p&gt;Custom metrics are an essential part of the day-to-day monitoring of cloud-native systems, as they add a business- and application-level dimension. A custom metric can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metrics provided by an exporter&lt;/li&gt;



&lt;li&gt;Tailored metrics designed by the customer&lt;/li&gt;



&lt;li&gt;An aggregate from previous existing metrics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, you will see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; &lt;a href="https://docs.google.com/document/d/19Z6q_lY60IqRz4DcnubmU4U1HwjxBPJJy3q_okC6-rE/edit#heading=h.hl0n2legtxy2"&gt;Why custom metrics are important&lt;/a&gt;
&lt;/li&gt;



&lt;li&gt;&lt;a href="https://docs.google.com/document/d/19Z6q_lY60IqRz4DcnubmU4U1HwjxBPJJy3q_okC6-rE/edit#heading=h.1pkldw1lpoee"&gt;When to use custom metrics&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://docs.google.com/document/d/19Z6q_lY60IqRz4DcnubmU4U1HwjxBPJJy3q_okC6-rE/edit#heading=h.7o1c1smwtm7b"&gt;Considerations when creating custom metrics&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://docs.google.com/document/d/19Z6q_lY60IqRz4DcnubmU4U1HwjxBPJJy3q_okC6-rE/edit#heading=h.8hxg63te0986"&gt;Kubernetes Metric API&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://docs.google.com/document/d/19Z6q_lY60IqRz4DcnubmU4U1HwjxBPJJy3q_okC6-rE/edit#heading=h.q5qjsc4mpwar"&gt;Prometheus custom metrics&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://docs.google.com/document/d/19Z6q_lY60IqRz4DcnubmU4U1HwjxBPJJy3q_okC6-rE/edit#heading=h.lfrto1shfhfv"&gt;Challenges when using custom metrics&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Why custom metrics are important&lt;/h2&gt;

&lt;p&gt;Custom metrics allow companies to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor Key Performance Indicators (KPIs).&lt;/li&gt;



&lt;li&gt;Detect issues faster.&lt;/li&gt;



&lt;li&gt;Track resource utilization.&lt;/li&gt;



&lt;li&gt;Measure latency.&lt;/li&gt;



&lt;li&gt;Track specific values from their services and systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Examples of custom metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency of transactions in milliseconds.&lt;/li&gt;



&lt;li&gt;Database open connections.&lt;/li&gt;



&lt;li&gt;% cache hits / cache misses.&lt;/li&gt;



&lt;li&gt;Orders/sales in an e-commerce site.&lt;/li&gt;



&lt;li&gt;% of slow responses.&lt;/li&gt;



&lt;li&gt;% of responses that are resource intensive.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, any metric retrieved from an exporter or created ad hoc fits the definition of a custom metric.&lt;/p&gt;
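&lt;p&gt;To make the definition concrete, here is a minimal sketch that renders a tailored business metric in the Prometheus text exposition format using only the Python standard library. The metric name and labels are made up for illustration; in a real service you would use an official Prometheus client library instead.&lt;/p&gt;

```python
# Minimal sketch of exposing a custom business metric in the
# Prometheus text exposition format, using only the standard library.

def render_counter(name, help_text, value, labels=None):
    """Render one counter sample in Prometheus exposition format."""
    label_str = ""
    if labels:
        pairs = ",".join('{0}="{1}"'.format(k, v) for k, v in sorted(labels.items()))
        label_str = "{" + pairs + "}"
    return (
        "# HELP {0} {1}\n".format(name, help_text)
        + "# TYPE {0} counter\n".format(name)
        + "{0}{1} {2}\n".format(name, label_str, value)
    )

if __name__ == "__main__":
    # e.g. a hypothetical e-commerce "orders placed" business metric
    print(render_counter(
        "shop_orders_total",
        "Total orders placed.",
        42,
        {"region": "eu"},
    ))
```

&lt;p&gt;Anything your service can count or measure can be exposed this way and scraped like any other custom metric.&lt;/p&gt;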

&lt;h2&gt;When to use Custom Metrics&lt;/h2&gt;

&lt;h3&gt;Autoscaling&lt;/h3&gt;

&lt;p&gt;By providing specific visibility over your system, you can define rules on how the workload should scale.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Horizontal autoscaling: add or remove replicas of a Pod.&lt;/li&gt;



&lt;li&gt;Vertical autoscaling: modify limits and requests of a container.&lt;/li&gt;



&lt;li&gt;Cluster autoscaling: add or remove nodes in a cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you want to dig deeper, check &lt;a href="https://sysdig.com/blog/kubernetes-autoscaler/" rel="noopener"&gt;this article about autoscaling in Kubernetes&lt;/a&gt;.&lt;/p&gt;
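&lt;p&gt;As a sketch of how a custom metric can drive horizontal autoscaling, the following HorizontalPodAutoscaler scales a hypothetical deployment on a made-up &lt;code&gt;orders_in_flight&lt;/code&gt; metric served through the custom metrics API (all names here are illustrative):&lt;/p&gt;

```yaml
# Sketch: scale a hypothetical deployment on a custom metric.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: orders_in_flight   # exposed via custom.metrics.k8s.io
        target:
          type: AverageValue
          averageValue: "100"
```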

&lt;h3&gt;Latency monitoring&lt;/h3&gt;

&lt;p&gt;Latency measures the time it takes for a system to serve a request. This monitoring &lt;a href="https://sysdig.com/blog/golden-signals-kubernetes/" rel="noopener"&gt;golden signal&lt;/a&gt; is essential to understand what the end-user experience for your application is.&lt;/p&gt;

&lt;p&gt;These are considered custom metrics as they are not part of the out-of-the-box set of metrics coming from Kube State Metrics or Node Exporter. In order to measure latency, you might want to either track individual systems (database, API) or end-to-end.&lt;/p&gt;
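&lt;p&gt;As a rough sketch of what such a latency measurement looks like before it ever reaches Prometheus, the snippet below times individual calls and derives an average and a 95th percentile with the Python standard library (the timed workload is a stand-in for a real request):&lt;/p&gt;

```python
import statistics
import time

def timed(fn, *args):
    """Measure the wall-clock latency of a single call, in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms

# Simulate 100 request latencies and summarize them the way a
# latency dashboard would: an average plus a high quantile.
samples = []
for _ in range(100):
    _, ms = timed(lambda: sum(range(1000)))
    samples.append(ms)

avg = statistics.mean(samples)
p95 = statistics.quantiles(samples, n=20)[18]  # 19th of 19 cut points = p95
```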

&lt;h3&gt;Application level monitoring&lt;/h3&gt;

&lt;p&gt;Kube-state-metrics or node-exporter might be a good &lt;a href="https://sysdig.com/blog/cloud-monitoring-journey/" rel="noopener"&gt;starting point for observability&lt;/a&gt;, but they just scratch the surface as they perform black-box monitoring. By instrumenting your own application and services, you create a curated and personalized set of metrics for your own particular case.&lt;/p&gt;

&lt;h2&gt;Considerations when creating Custom Metrics&lt;/h2&gt;

&lt;h3&gt;Naming&lt;/h3&gt;

&lt;p&gt;Check for any &lt;a href="https://prometheus.io/docs/practices/naming/" rel="noreferrer noopener"&gt;existing naming conventions&lt;/a&gt;, as new names might collide with existing ones or be confusing. A custom metric’s name is the first description of its purpose.&lt;/p&gt;

&lt;h3&gt;Labels&lt;/h3&gt;

&lt;p&gt;Labels let us add dimensions to our metrics, so we can filter and refine queries by additional characteristics. Cardinality is the number of possible values for each label; since every combination of label values requires its own time series, a careless choice of labels can increase resource usage drastically. Choosing labels carefully is key to avoiding a cardinality explosion, one of the main causes of sudden resource-spending spikes.&lt;/p&gt;
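&lt;p&gt;The toy metric below shows why: each distinct combination of label values becomes its own time series, so low-cardinality labels keep the series count bounded (the class is a simplified stand-in for a real client library):&lt;/p&gt;

```python
class LabeledGauge:
    """Toy in-memory labeled metric: one series per label combination."""

    def __init__(self, name):
        self.name = name
        self.series = {}  # maps a label tuple to its latest value

    def set(self, value, **labels):
        key = tuple(sorted(labels.items()))
        self.series[key] = value

    def series_count(self):
        return len(self.series)

g = LabeledGauge("http_open_connections")
# Bounded labels (service, region) keep cardinality low; a label
# like user_id or request_id would create a series per value.
g.set(12, service="checkout", region="eu")
g.set(7, service="checkout", region="us")
g.set(3, service="search", region="eu")
```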

&lt;h3&gt;Costs&lt;/h3&gt;

&lt;p&gt;Custom metrics may have some costs associated with them, depending on the monitoring system you are using. Double-check which dimension is used to scale costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Number of time series&lt;/li&gt;



&lt;li&gt;Number of labels&lt;/li&gt;



&lt;li&gt;Data storage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Custom Metric lifecycle&lt;/h3&gt;

&lt;p&gt;In case the Custom Metric is related to a job or a short-lived script, consider using &lt;a href="https://github.com/prometheus/pushgateway" rel="noopener nofollow noreferrer"&gt;Pushgateway&lt;/a&gt;.&lt;/p&gt;
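&lt;p&gt;A short-lived job cannot wait around to be scraped, so it pushes its metrics instead. The sketch below builds (but does not send) the HTTP request such a push would use; the gateway address and the metric are placeholders:&lt;/p&gt;

```python
import urllib.request

def pushgateway_request(gateway, job, body):
    """Build a Pushgateway request for a one-off job.

    The Pushgateway accepts the Prometheus text format via an
    HTTP PUT to /metrics/job/NAME.
    """
    url = "{0}/metrics/job/{1}".format(gateway.rstrip("/"), job)
    return urllib.request.Request(
        url, data=body.encode("utf-8"), method="PUT"
    )

req = pushgateway_request(
    "http://pushgateway.example:9091",      # hypothetical address
    "nightly_backup",
    "backup_last_duration_seconds 42\n",
)
# urllib.request.urlopen(req)  # uncomment to actually push
```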

&lt;h2&gt;Kubernetes Metric API&lt;/h2&gt;

&lt;p&gt;One of the most important features of Kubernetes is the ability to scale the workload based on the values of metrics automatically.&lt;/p&gt;

&lt;p&gt;The Metrics APIs are defined in the &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/metrics"&gt;official Kubernetes repository&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;metrics.k8s.io&lt;/code&gt;&lt;/li&gt;



&lt;li&gt;&lt;code&gt;custom.metrics.k8s.io&lt;/code&gt;&lt;/li&gt;



&lt;li&gt;&lt;code&gt;external.metrics.k8s.io&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Creating new metrics&lt;/h3&gt;

&lt;p&gt;You can publish a new metric value by calling a custom metrics API server (here exposed through &lt;code&gt;kubectl proxy&lt;/code&gt;) as follows:&lt;/p&gt;

&lt;pre&gt;curl -X POST \
  -H 'Content-Type: application/json' \
  http://localhost:8001/api/v1/namespaces/custom-metrics/services/custom-metrics-apiserver:http/proxy/write-metrics/namespaces/default/services/kubernetes/test-metric \
  --data-raw '"300m"'
&lt;/pre&gt;

&lt;h2&gt;Prometheus custom metrics&lt;/h2&gt;

&lt;p&gt;As we mentioned, every exporter included in our Prometheus setup will account for several custom metrics.&lt;/p&gt;

&lt;p&gt;Check the following post for a &lt;a href="https://sysdig.com/blog/prometheus-metrics/" rel="noopener"&gt;detailed guide on Prometheus metrics&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Challenges when using custom metrics&lt;/h2&gt;

&lt;h3&gt;Cardinality explosion&lt;/h3&gt;

&lt;p&gt;While the resources consumed by some metrics might be negligible, the moment these are available to be used with labels in queries, things might get out of hand.&lt;/p&gt;

&lt;p&gt;Cardinality refers to the Cartesian product of a metric and its label values: the result is the number of time series entries needed for that single metric.&lt;/p&gt;
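&lt;p&gt;In other words, the series count for one metric is the product of the number of values each label can take. A quick back-of-the-envelope calculation (the label counts are illustrative):&lt;/p&gt;

```python
import math

# Time series for one metric = product of possible values per label.
label_values = {
    "method": 4,        # GET, POST, PUT, DELETE
    "status_code": 12,  # distinct HTTP codes actually emitted
    "pod": 50,          # one value per pod replica
}

series = math.prod(label_values.values())
# 4 * 12 * 50 = 2,400 time series for a single metric name.
```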

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0nVj2Ktr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Custom-metrics-image-1-1-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0nVj2Ktr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Custom-metrics-image-1-1-1170x644.png" alt="Custom metrics - cardinality example" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, every metric will be scraped and stored in a time series database based on your &lt;code&gt;scrape_interval&lt;/code&gt;. The shorter this interval, the more samples are stored.&lt;/p&gt;

&lt;p&gt;All these factors will eventually lead to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher resource consumption.&lt;/li&gt;



&lt;li&gt;Higher storage demand.&lt;/li&gt;



&lt;li&gt;Monitoring performance degradation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Moreover, most common monitoring tools don’t give &lt;a href="https://sysdig.com/use-cases/cloud-monitoring/" rel="noopener"&gt;visibility into the current cardinality of metrics or their associated costs&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Exporter overuse&lt;/h3&gt;

&lt;p&gt;Exporters are a great way to add relevant metrics to your system. With them, you can easily instrument relevant metrics bound to your microservices and containers. But with great power comes great responsibility: chances are that many of the metrics included in the package are not relevant to your business at all.&lt;/p&gt;

&lt;p&gt;By enabling custom metrics and exporters in your solution, you may end up having a burst in the amount of time series database entries.&lt;/p&gt;

&lt;h3&gt;Cost spikes&lt;/h3&gt;

&lt;p&gt;Because of the factors explained above, monitoring costs can spike suddenly: your solution may be consuming more resources than expected, or your monitoring vendor’s pricing thresholds may have been surpassed.&lt;/p&gt;

&lt;h3&gt;Alert fatigue&lt;/h3&gt;

&lt;p&gt;Once metrics are in place, most companies and individuals are tempted to add alerts and notifications whenever values exceed certain thresholds. However, this can multiply notification sources and shrink the team’s attention span.&lt;br&gt;Learn more about &lt;a href="https://sysdig.com/blog/prometheus-alertmanager/" rel="noopener"&gt;Alert Fatigue and how to mitigate it&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Custom metrics represent the next step for cloud-native monitoring, as they sit at the core of business observability. While using Prometheus alongside kube-state-metrics and node-exporter is a nice starting step, eventually companies and organizations will need to take the next step and create tailored, on-point metrics to suit their needs.&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>monitoring</category>
      <category>devops</category>
    </item>
    <item>
      <title>ChatGPT creating OSS security rules and plugins</title>
      <dc:creator>Miguel</dc:creator>
      <pubDate>Mon, 27 Feb 2023 09:44:31 +0000</pubDate>
      <link>https://dev.to/sysdig/chatgpt-creating-security-rules-and-plugins-5g23</link>
      <guid>https://dev.to/sysdig/chatgpt-creating-security-rules-and-plugins-5g23</guid>
      <description>&lt;p&gt;Can &lt;a href="https://openai.com/blog/chatgpt/" rel="noopener noreferrer"&gt;OpenAI ChatGPT&lt;/a&gt; become a contributor for an open source project like &lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;Falco&lt;/a&gt;? Is this the end of GitHub Copilot?&lt;/p&gt;

&lt;p&gt;ChatGPT is SO trendy that I overheard my grandma talking about it to her friends the other day. We’re seeing more and more uses of this AI for real world applications. That made us think…&lt;/p&gt;

&lt;p&gt;Falco, the first runtime security project to join CNCF as an incubation-level project, needs contributors. Would OpenAI’s ChatGPT language model be able to write security rules or plugins for Falco, and become a contributor?&lt;/p&gt;

&lt;h3&gt;
  
  
  ChatGPT as a potential new Falco partner
&lt;/h3&gt;

&lt;p&gt;Jokes aside, in this article we explore the viability of ChatGPT as a tool for organizations to create effective and efficient security rules. Would those new rules in Falco ensure their containers and hosts are protected from potential threats?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is runtime security for ChatGPT?
&lt;/h2&gt;

&lt;p&gt;In order to begin, we need to understand &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-security/runtime-security/" rel="noopener noreferrer"&gt;what runtime security is&lt;/a&gt;. Rather than using a search engine, we will ask ChatGPT in order to speed up the process. &lt;/p&gt;

&lt;p&gt;According to ChatGPT:&lt;/p&gt;

&lt;p&gt; “&lt;em&gt;Runtime security involves monitoring and protecting software systems in&lt;/em&gt; &lt;strong&gt;&lt;em&gt;real-time&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;as they are running, detecting, and responding to security threats, and implementing mitigation techniques to prevent further damage. The goal of runtime security is to&lt;/em&gt; &lt;strong&gt;&lt;em&gt;identify and respond&lt;/em&gt;&lt;/strong&gt; &lt;em&gt;to security threats and vulnerabilities&lt;/em&gt; &lt;strong&gt;&lt;em&gt;before they can be exploited and cause harm&lt;/em&gt;&lt;/strong&gt;&lt;em&gt;.&lt;/em&gt;”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-27.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-27-1170x855.png" alt="ChatGPT: What is runtime security?" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A good definition, but let’s see if we can go deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Falco for ChatGPT?
&lt;/h2&gt;

&lt;p&gt;ChatGPT summarized the open source project Falco in a clear and concise manner. &lt;/p&gt;

&lt;p&gt;Rather than copying the information from the landing page of &lt;a href="https://falco.org/" rel="noopener noreferrer"&gt;falco.org&lt;/a&gt;, ChatGPT provided useful context as to how Falco utilizes eBPF to achieve low-overhead when detecting security threats from data collected within the Linux Kernel. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-28-1170x683.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-28-1170x683.png" alt="Chatgpt what is Falco" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, we understand what runtime security is, and how Falco can be used to detect anomalous runtime security issues. Now that we are familiar with open source Falco, let’s ask ChatGPT to write us some useful Falco rules.&lt;/p&gt;

&lt;h2&gt;
  
  
  Asking ChatGPT to create a Falco rule
&lt;/h2&gt;

&lt;p&gt;Now, let’s ask ChatGPT if the language model is capable of writing OSS Falco security rules. &lt;/p&gt;

&lt;p&gt;Based on the below screenshot, &lt;strong&gt;does ChatGPT look like a useful contributor to the Falco community?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-29.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-29-1170x820.png" alt="Chatgpt Falco rule" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At this point, we are happy with the answer that was returned. &lt;/p&gt;

&lt;p&gt;There was a correctly-formatted Falco rule and the language model also returned some added context as to how the rule will work. &lt;/p&gt;

&lt;p&gt;My only concern is that the first rule it created is similar to a rule that already exists in the &lt;a href="https://github.com/falcosecurity/rules/blob/c558fc7d2d02cc2c2edc968fe5770d544f1a9d55/rules/falco_rules.yaml#L2060" rel="noopener noreferrer"&gt;Falco community rules feed&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- rule: Terminal shell in container
  desc: A shell was used as the entrypoint/exec point into a container with an attached terminal.
  condition: &amp;gt;
    spawned_process and container
    and shell_procs and proc.tty != 0
    and container_entrypoint
    and not user_expected_terminal_shell_in_container_conditions
  output: &amp;gt;
    A shell was spawned in a container with an attached terminal (user=%user.name user_loginuid=%user.loginuid %container.info
    shell=%proc.name parent=%proc.pname cmdline=%proc.cmdline pid=%proc.pid terminal=%proc.tty container_id=%container.id image=%container.image.repository)
  priority: NOTICE
  tags: [container, shell, mitre_execution, T1059]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above Falco community rule covers more use cases: proc.name is not just sh, but a long list of shells contained in the shell_procs macro. As a result, this leads to fewer false positives and reduces an attacker’s chances of bypassing the rule. Rules that are too generic can end up capturing expected behavior.&lt;/p&gt;

&lt;p&gt;When asking our questions to ChatGPT, we need to be more precise to generate an accurate rule. For example, we would ask ChatGPT to create a Falco rule that detects suspicious login activity on a Linux workstation between certain hours of the day.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-30.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-30-1170x918.png" alt="Chatgpt Falco OSS rule" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Again, we like how the rule looks. &lt;/p&gt;

&lt;p&gt;Since Falco is designed to handle Linux system calls, there is no need to explicitly mention the workstation OS type. However, ChatGPT nicely mentioned that the rule triggers for activity on Linux workstations because we specifically requested this. We will copy the code snippet and paste it below so that we can dissect it further:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- rule: Detect suspicious login activity during off-hours
  desc: Detects login sessions initiated during off-hours on a Linux workstation
  condition: (evt.time &amp;gt; "2022-12-31T02:00:00.000Z" and evt.time &amp;lt; "2022-12-31T07:00:00.000Z") and (evt.type=execve and evt.argc=3 and evt.argv[2]=login)
  output: Suspicious login activity detected during off-hours: user=%user.name command=%proc.cmdline
  priority: WARNING
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Falco rule uses the below system call activity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;a href="https://man7.org/linux/man-pages/man8/sysdig.8.html#:~:text=evt.time%20is%20the%20event%20timestamp" rel="noopener noreferrer"&gt;evt.time&lt;/a&gt; – This is the event timestamp. It’s between T02:00 (2 a.m.) and T07:00 (7 a.m.).&lt;/li&gt;
&lt;li&gt; &lt;a href="https://man7.org/linux/man-pages/man8/sysdig.8.html#:~:text=evt.type%20is%20the%20name%20of%20the%20event%2C%20e.g.%20%20%27open%27%20or%20%27read%27" rel="noopener noreferrer"&gt;evt.type&lt;/a&gt; – This is the name of the event, for example, ‘open’ or ‘read.’ In this case, it’s &lt;a href="https://man7.org/linux/man-pages/man2/execve.2.html" rel="noopener noreferrer"&gt;execve&lt;/a&gt;. The execve event executes the program referred to by pathname.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you are ever unsure about a certain argument used, what it means, or how to use it going forward, you can ask ChatGPT to elaborate on its findings without re-writing the entire question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-31.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-31-1170x918.png" alt="ChatGPT evt.argc and evt.argv" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since ChatGPT is a language model, it does a great job of not just providing rules, but also providing clarity on its findings. With this additional context provided by ChatGPT, we are happy with how this rule turned out. &lt;/p&gt;

&lt;p&gt;Since we don’t have any business need for this specific rule, let’s use ChatGPT to solve some real business problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  ChatGPT, MITRE ATT&amp;amp;CK, and Falco
&lt;/h2&gt;

&lt;p&gt;Continuing the conversation, we got more technical with ChatGPT and tried to combine two areas of expertise: Falco and MITRE.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://attack.mitre.org/matrices/enterprise/" rel="noopener noreferrer"&gt;MITRE ATT&amp;amp;CK framework&lt;/a&gt; for Enterprise environments is BIG! As a result, it can be hard to provide extensive coverage of all Tactics, Techniques, and Sub-Techniques for Linux Systems. &lt;/p&gt;

&lt;p&gt;Since ChatGPT can read and interpret large volumes of operational data, it speeds up the process of building Falco rules that better align with this widely-used risk framework.&lt;/p&gt;

&lt;p&gt;In the &lt;a href="https://github.com/falcosecurity/rules/blob/c558fc7d2d02cc2c2edc968fe5770d544f1a9d55/rules/falco_rules.yaml" rel="noopener noreferrer"&gt;Falco community rules feed&lt;/a&gt;, there was no existing rule aligned to the Technique ID ‘&lt;a href="https://attack.mitre.org/techniques/T1529/" rel="noopener noreferrer"&gt;T1529&lt;/a&gt;.’ Under this technique, adversaries may shut down or reboot workstations to interrupt access to those systems, or to aid in their destruction. When requesting a rule that detects system shutdown or reboot, we also want to request the appropriate tagging so the rule aligns with the MITRE ATT&amp;amp;CK framework. Surprisingly, &lt;strong&gt;ChatGPT answered with an incorrect tactic and technique associated with that technique ID&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-33.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-33-1170x918.png" alt="ChatGPT Falco MITRE" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The technique &lt;a href="https://attack.mitre.org/techniques/T1538/" rel="noopener noreferrer"&gt;&lt;strong&gt;Cloud Service Dashboard&lt;/strong&gt;&lt;/a&gt; is assigned to the Tactic ‘&lt;strong&gt;Discovery&lt;/strong&gt;’ and the Technique ID T1538. The Technique ID T1529, by contrast, is associated with &lt;a href="https://attack.mitre.org/techniques/T1529/" rel="noopener noreferrer"&gt;&lt;strong&gt;shutdown/reboot activity&lt;/strong&gt;&lt;/a&gt; and aligns with the Tactic ‘&lt;strong&gt;Impact&lt;/strong&gt;.’ &lt;/p&gt;

&lt;p&gt;For the first time, ChatGPT made an obvious mistake in its answer. When we confronted ChatGPT, it immediately apologized and provided an amended answer that looks more like the Falco rule we would expect.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-32.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-32-1170x918.png" alt="ChatGPT Falco MITRE wrong" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This regained my trust in ChatGPT becoming an approved Falco contributor. &lt;/p&gt;

&lt;p&gt;However, since we cannot guarantee that ChatGPT is going to return the correct rule, we also need to validate that the rule conditions are valid.&lt;/p&gt;

&lt;p&gt;Again, I’ve pasted the findings into the following snippet field for further inspection. As mentioned by ChatGPT, this rule checks for &lt;em&gt;execve&lt;/em&gt; events where the second argument (evt.argv[1]) contains either shutdown or reboot. This indicates that the process is attempting to shut down or reboot the system, which is a technique used to disrupt normal system operation and, therefore, correctly aligns with the MITRE tactic and technique.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;- rule: Detect T1529 - System Shutdown/Reboot
  desc: Detects attempts to shut down or reboot the system
  condition: (evt.type=execve and (evt.argv[1] contains "shutdown" or evt.argv[1] contains "reboot"))
  output: "Detected attempt to shut down or reboot the system. T1529 - System Shutdown/Reboot detected"
  priority: WARNING
  tags: [tactic=impact, technique=T1529, technique_id=T1529]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So far, we have learned that we cannot rely on ChatGPT to contribute Falco rules without being vetted by an experienced Falco user. &lt;/p&gt;

&lt;p&gt;That said, ChatGPT has quickly contributed rules that can be used to address regulatory and/or risk frameworks such as MITRE ATT&amp;amp;CK. The injected tags allow users to categorize and track detections of this technique within their security management tooling.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to detect cryptomining with ChatGPT and Falco
&lt;/h2&gt;

&lt;p&gt;The rules we have created so far are fairly simplistic. In order to test the true power of ChatGPT, we need to ask it for help creating more complex Falco rules involving additional abstractions such as &lt;a href="https://falco.org/docs/rules/appending/" rel="noopener noreferrer"&gt;Macros and Lists&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;An example that we were working on recently was the creation of a small list of known cryptomining binaries for a &lt;a href="https://community.cncf.io/events/details/cncf-cncf-online-programs-presents-cloud-native-live-detecting-crypto-jacking-in-kubernetes-workloads/" rel="noopener noreferrer"&gt;CNCF Livestream&lt;/a&gt;. We would like to see how ChatGPT addresses this request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-36.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-36-1170x769.png" alt="ChatGPT Falco Rule Cryptomining" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We were disappointed with this response. &lt;/p&gt;

&lt;p&gt;While the syntax is valid, the default approach from ChatGPT is always to list the process names within the Falco rule, rather than creating a list of known binaries, and mapping this to the Falco rules via a referenced Macro. &lt;/p&gt;
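&lt;p&gt;For reference, the list-plus-macro pattern we had in mind looks roughly like this (the binary names and tags are illustrative, not an authoritative blocklist):&lt;/p&gt;

```yaml
# Sketch of the list + macro approach; extend miner_binaries as needed.
- list: miner_binaries
  items: [xmrig, minerd, cpuminer]

- macro: spawned_miner
  condition: spawned_process and proc.name in (miner_binaries)

- rule: Detect cryptomining binary execution
  desc: A known cryptomining binary was executed in a container
  condition: spawned_miner and container
  output: Cryptomining binary launched (command=%proc.cmdline container_id=%container.id)
  priority: CRITICAL
  tags: [container, cryptomining, mitre_impact, T1496]
```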

&lt;p&gt;We can ask ChatGPT to specifically reference the binaries in the List.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-35.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-35-1170x861.png" alt="ChatGPT Falco List" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Funnily enough, ChatGPT was even more confused by this instruction, to the point where it started appending syntax that is foreign to the Falco rules syntax. &lt;/p&gt;

&lt;p&gt;At this point, the rule would no longer work and ChatGPT is losing credibility as a valid contributor to the Falco project.&lt;/p&gt;

&lt;p&gt;As an experienced Falco user, I had to explain that ChatGPT misunderstood my request and that further evaluation is required. It’s not that ChatGPT is unable to answer the request, but it can misunderstand certain aspects of the request depending on our phrasing. &lt;/p&gt;

&lt;p&gt;That’s why your request might require further fine tuning, but we can see that ChatGPT got there in the end. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-34.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-34-1170x913.png" alt="Wrong Falco List" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT has given us a correctly-formatted Falco rule, which is a great foundation for further development. However, the rule is certainly not foolproof. &lt;/p&gt;

&lt;p&gt;There are many examples of cryptomining binaries other than ‘&lt;a href="https://xmrig.com/download" rel="noopener noreferrer"&gt;&lt;strong&gt;xmrig&lt;/strong&gt;&lt;/a&gt;’ – though xmrig is certainly the most common example. The value here is creating an extensive, up-to-date list of all common binaries so we can provide as much security coverage as possible. We mention some of these binaries in the following &lt;a href="https://falco.org/blog/falco-detect-cryptomining/#list" rel="noopener noreferrer"&gt;Falco blog&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can ChatGPT create Falco plugins?
&lt;/h2&gt;

&lt;p&gt;Don’t be disappointed just yet. Let’s see if ChatGPT is able to help us create a plugin for Falco.&lt;/p&gt;

&lt;p&gt;It’s super important to understand how ChatGPT responds to generic commands. When we asked if ChatGPT can create Falco plugins, it said, “&lt;em&gt;I do not have the ability to write or compile code. However, I can assist you in writing the code for a Falco plugin&lt;/em&gt;.”&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-37.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-37-1170x816.png" alt="ChatGPT Falco Plugins" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s also worth noting that ChatGPT explains the supported protocol (gRPC) and the languages that support it (C++, Go, or Python). &lt;/p&gt;

&lt;p&gt;In that case, we just need to ask for guidance on how to write a Falco plugin. The request needs to target a specific service to be effective. Since &lt;a href="https://www.ghacks.net/2023/02/10/if-you-use-lastpass-you-need-to-change-all-of-your-passwords-asap/" rel="noopener noreferrer"&gt;LastPass is in the news lately&lt;/a&gt;, we will ask ChatGPT to help us create a LastPass Plugin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-38.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-38-1170x911.png" alt="LastPass Falco Plugin" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Amazingly, the below Python code snippet was provided to help configure a LastPass plugin with the appropriate gRPC protocol that we mentioned previously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import grpc
from falco_proto import event_pb2
from falco_proto import event_pb2_grpc

class LastPassPlugin(event_pb2_grpc.EventServiceServicer):
    def HandleEvent(self, request, context):
        event = request.event
        if event.event_type == "executed" and event.output.find("lastpass-cli") != -1:
            print("LastPass CLI was executed")
        return event_pb2.HandleEventResponse()

server = grpc.server(futures.ThreadPoolExecutor(max_workers=10))
event_pb2_grpc.add_EventServiceServicer_to_server(LastPassPlugin(), server)
server.add_insecure_port('[::]:50051')
server.start()
server.wait_for_termination()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code sets up a gRPC server and implements a custom HandleEvent method that is called whenever a Falco event occurs. The method checks if the event is an “executed” event and if the output contains the string “lastpass-cli.” If both conditions are met, it prints a message indicating that LastPass CLI was executed.&lt;/p&gt;

&lt;p&gt;Note that this is just a basic example. ChatGPT clearly explains that this was provided as guidance and therefore you will likely need to modify the code to meet your specific requirements. &lt;/p&gt;

&lt;p&gt;For more information on creating Falco plugins, we would recommend referring to the &lt;a href="https://falco.org/docs/plugins/developers-guide/" rel="noopener noreferrer"&gt;official Falco documentation&lt;/a&gt; and the gRPC protocol documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can ChatGPT contribute to the Falco project?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Unfortunately, no!&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;As ChatGPT explained to us, it can help with the rule creation. But as an &lt;strong&gt;AI language model, it is not authorized to create pull requests&lt;/strong&gt;. As a result, ChatGPT cannot be officially included as a contributor to the open source project. However, project contributors and community members can rely on ChatGPT to validate their rule formatting, identify discrepancies in misconfigured rules, as well as provide insights on how a rule should be formatted to address a framework requirement.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image-39.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage-39-1170x736.png" alt="ChatGPT Falco PR" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;ChatGPT is a powerful language model that can assist in creating Falco security rules. With its vast knowledge of various topics and its ability to generate text, it can provide helpful guidance and examples of how to create a rule that detects a specific threat. However, while it can be a valuable resource, ChatGPT should not be trusted to fully automate the creation of security rules. &lt;/p&gt;

&lt;p&gt;The accuracy and relevance of the information it provides can be limited by its training data and its knowledge cutoff, and it may not have the expertise or context to make informed decisions about the specific security needs of an organization. Additionally, security rule creation is an ongoing process that requires constant monitoring, tuning, and updating to keep up with new threats and changes in technology. &lt;/p&gt;

&lt;p&gt;Therefore, the best approach is to use ChatGPT as an assistant and consult security experts to verify and refine the rules before deploying them in a production environment.&lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>security</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Prometheus Alertmanager best practices</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 09 Feb 2023 09:59:54 +0000</pubDate>
      <link>https://dev.to/sysdig/prometheus-alertmanager-best-practices-4872</link>
      <guid>https://dev.to/sysdig/prometheus-alertmanager-best-practices-4872</guid>
      <description>&lt;p&gt;Have you ever fallen asleep to the sounds of your on-call team in a Zoom call? If you’ve had the misfortune to sympathize with this experience, you likely understand the problem of &lt;strong&gt;Alert Fatigue&lt;/strong&gt; firsthand.&lt;/p&gt;

&lt;p&gt;During an active incident, it can be exhausting to tease the upstream root cause from downstream noise while you’re context switching between your terminal and your alerts.&lt;/p&gt;

&lt;p&gt;This is where &lt;strong&gt;Alertmanager&lt;/strong&gt; comes in, providing a way to mitigate each of the problems related to Alert Fatigue.&lt;/p&gt;

&lt;p&gt;In this article, you will learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What Alert Fatigue is&lt;/li&gt;



&lt;li&gt;What AlertManager is&lt;/li&gt;



&lt;li&gt;Routing&lt;/li&gt;



&lt;li&gt;Inhibition&lt;/li&gt;



&lt;li&gt;Silencing and Throttling&lt;/li&gt;



&lt;li&gt;Grouping&lt;/li&gt;



&lt;li&gt;Notification Template&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="alert-fatigue"&gt;Alert Fatigue&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Alert Fatigue&lt;/strong&gt; is the exhaustion that comes from frequently responding to unprioritized and unactionable alerts. This is not sustainable in the long term: not every alert is so urgent that it should wake up a developer, and a sustainable on-call week must leave room for sleep. Ask yourself:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Was an engineer woken up more than twice this week?&lt;/li&gt;



&lt;li&gt;Can the resolution be automated or wait until morning?&lt;/li&gt;



&lt;li&gt;How many people were involved?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Companies often focus on response time and how long a resolution takes, but how do they know the on-call process itself is not contributing to burnout?&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pain Point&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Feature&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Alertmanager&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Send alerts to the right team&lt;/td&gt;
&lt;td&gt;Routing&lt;/td&gt;
&lt;td&gt;Labeled alerts are routed to the corresponding receiver&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Too many alerts at once&lt;/td&gt;
&lt;td&gt;Inhibition&lt;/td&gt;
&lt;td&gt;Alerts can inhibit other alerts (e.g., Datacenter down alert inhibits downtime alert)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;False positive on an Alert&lt;/td&gt;
&lt;td&gt;Silencing&lt;/td&gt;
&lt;td&gt;Temporarily silence an alert, especially when performing scheduled maintenance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Alerts are too frequent&lt;/td&gt;
&lt;td&gt;Throttling&lt;/td&gt;
&lt;td&gt;Customizable back-off options to avoid re-notifying too frequently&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Unorganized alerts&lt;/td&gt;
&lt;td&gt;Grouping&lt;/td&gt;
&lt;td&gt;Logically group alerts by labels such as 'environment=dev' or 'service=broker'&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Notifications are unstructured&lt;/td&gt;
&lt;td&gt;Notification Template&lt;/td&gt;
&lt;td&gt;Standardize alerts to a template so that alerts are structured across services&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;h2 id="alertmanager"&gt;Alertmanager&lt;/h2&gt;

&lt;p&gt;Prometheus &lt;strong&gt;Alertmanager&lt;/strong&gt; is the open source standard for translating alerts into alert notifications for your engineering team. &lt;a href="https://prometheus.io/docs/alerting/latest/alertmanager/" rel="noreferrer noopener"&gt;Alertmanager&lt;/a&gt; challenges the assumption that a dozen alerts should result in a dozen alert notifications. By leveraging the features of Alertmanager, dozens of alerts can be distilled into a handful of alert notifications, allowing on-call engineers to context switch less by thinking in terms of incidents rather than alerts.&lt;/p&gt;

&lt;h2 id="routing"&gt;Routing&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Routing&lt;/strong&gt; is the ability to send alerts to a variety of receivers, including Slack, PagerDuty, and email. It is the core feature of Alertmanager.&lt;/p&gt;

&lt;pre&gt;route:
  receiver: slack-default            # Fallback receiver if no routes are matched
  routes:
    - receiver: pagerduty-logging
      continue: true
    - match:
        team: support
      receiver: jira
    - match:
        team: on-call
      receiver: pagerduty-prod&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram1-1170x351.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram1-1170x351.png" alt="Prometheus alertmanager diagram 1" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, an alert with the label &lt;code&gt;{team:on-call}&lt;/code&gt; was triggered. Routes are matched from top to bottom, with the first receiver being &lt;code&gt;pagerduty-logging&lt;/code&gt;, a receiver for your on-call manager to track all alerts at the end of each month. Since the alert does not have a &lt;code&gt;{team:support}&lt;/code&gt; label, matching continues to &lt;code&gt;{team:on-call}&lt;/code&gt;, where the alert is properly routed to the &lt;code&gt;pagerduty-prod&lt;/code&gt; receiver. The default receiver, &lt;code&gt;slack-default&lt;/code&gt;, is specified at the top of the route block in case no routes match.&lt;/p&gt;

&lt;h2 id="inhibition"&gt;Inhibition&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Inhibition&lt;/strong&gt; is the process of &lt;strong&gt;muting downstream alerts &lt;/strong&gt;depending on their label set. Of course, this means that alerts must be systematically tagged in a logical and standardized way, but that's a human problem, not an Alertmanager one. While there is no native support for warning thresholds, the user can take advantage of labels and inhibit a warning when the critical condition is met. &lt;/p&gt;

&lt;p&gt;This has the unique advantage of supporting a warning condition for alerts that don't use a scalar comparison. It's all well and good to warn at 60% CPU usage and alert at 80% CPU usage, but what if we wanted to craft a warning and alert that compares two queries? This alert triggers when a node has more pods than its capacity.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(sum by (kube_node_name) (kube_pod_container_status_running)) &amp;gt; 
on(kube_node_name) kube_node_status_capacity_pods&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;We can do exactly this by using inhibition with Alertmanager. In the first example, an alert with the label &lt;code&gt;{severity=critical}&lt;/code&gt; will inhibit an alert of &lt;code&gt;{severity=warning}&lt;/code&gt; if they share the same region, and alertname.&lt;/p&gt;

&lt;p&gt;In the second example, we can also inhibit downstream alerts when we know they won't be important in the root cause. It might be expected that a Kafka consumer behaves anomalously when the Kafka producer doesn't publish anything to the topic.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['region','alertname']
  - source_match:
      service: 'kafka_producer'
    target_match:
      service: 'kafka_consumer'
    equal: ['environment','topic']
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram2-1170x429.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram2-1170x429.png" alt="Prometheus alertmanager diagram 2" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="silencing-throttling"&gt;Silencing and Throttling&lt;/h2&gt;

&lt;p&gt;Now that you've woken up at 2 a.m. to exactly one root cause alert, you may want to acknowledge the alert and move forward with remediation. It’s too early to resolve the alert but alert re-notifications don’t give any extra context. This is where silencing and throttling can help.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silencing&lt;/strong&gt; allows you to temporarily snooze an alert if you're expecting the alert to trigger for a scheduled procedure, such as database maintenance, or if you've already acknowledged the alert during an incident and want to keep it from renotifying while you remediate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throttling&lt;/strong&gt; solves a similar pain point but in a slightly different fashion. Throttles allow the user to tailor the renotification settings with three main parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;group_wait&lt;/li&gt;



&lt;li&gt;group_interval&lt;/li&gt;



&lt;li&gt;repeat_interval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram3-1170x468.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram3-1170x468.png" alt="Prometheus alertmanager diagram 3" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When Alert #1 and Alert #3 are initially triggered, Alertmanager uses &lt;code&gt;group_wait&lt;/code&gt; to delay the first notification by 30 seconds. After an initial alert has been triggered, any new alert notifications are delayed by &lt;code&gt;group_interval&lt;/code&gt;. Since no new alert fired during the next 90 seconds, no notification was sent. Over the subsequent 90 seconds, however, Alert #2 was triggered, so a notification containing Alert #2 and Alert #3 was sent. To avoid forgetting about currently firing alerts when nothing new triggers, &lt;code&gt;repeat_interval&lt;/code&gt; can be configured to a value such as 24 hours, so that still-firing alerts send a re-notification every 24 hours.&lt;/p&gt;
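
&lt;p&gt;As a sketch, these three parameters are set per route in the Alertmanager configuration. The receiver name and values below are illustrative, matching the timeline described above:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;route:
  receiver: pagerduty-prod
  group_wait: 30s       # delay before sending the first notification for a new group
  group_interval: 90s   # minimum delay before notifying about new alerts in an existing group
  repeat_interval: 24h  # re-notify about still-firing alerts once a day
&lt;/code&gt;&lt;/pre&gt;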

&lt;h2 id="grouping"&gt;Grouping&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Grouping&lt;/strong&gt; in Alertmanager allows multiple alerts sharing a similar label set to be sent in the same notification. This is not to be confused with Prometheus grouping, where alert rules in a group are evaluated in sequential order. By default, all alerts for a given route are grouped together. A &lt;code&gt;group_by&lt;/code&gt; field can be specified to logically group alerts.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;route:
  receiver: slack-default            # Fallback Receiver if no routes are matched
  group_by: [env]
  routes:
    - match:
        team: on-call
      group_by: [region, service]
      receiver: pagerduty-prod
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-PrometheusAlertmanagerBestPractices-diagram4-1170x819.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-PrometheusAlertmanagerBestPractices-diagram4-1170x819.png" alt="Prometheus alertmanager diagram 4" title="image_tooltip"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alerts that have the label &lt;code&gt;{team:on-call}&lt;/code&gt; will be grouped by both region and service. This allows users to immediately have context that all of the notifications within this alert group share the same service and region. Grouping with information such as &lt;code&gt;instance_id&lt;/code&gt; or &lt;code&gt;ip_address&lt;/code&gt; tends to be less useful, since it means that every unique &lt;code&gt;instance_id&lt;/code&gt; or &lt;code&gt;ip_address&lt;/code&gt; will produce its own notification group. This may produce noisy notifications and defeat the purpose of grouping.&lt;/p&gt;

&lt;p&gt;If no grouping is configured, all alerts will be part of the same alert notification for a given route.&lt;/p&gt;

&lt;h2 id="notification-template"&gt;Notification Template&lt;/h2&gt;

&lt;p&gt;Notification templates offer a way to customize and standardize alert notifications. For example, a notification template can use labels to automatically link to a runbook or include useful labels for the on-call team to build context. Here, &lt;code&gt;app&lt;/code&gt; and &lt;code&gt;alertname&lt;/code&gt; labels are interpolated into a path that links out to a runbook. Standardizing on a notification template can make the on-call process run more smoothly since the on-call team may not be the direct maintainers of the microservice that is paging.&lt;/p&gt;
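
&lt;p&gt;A minimal sketch of such a template in a receiver configuration might look like the following. The Slack channel and the runbook base URL are hypothetical; &lt;code&gt;.CommonLabels&lt;/code&gt; exposes the labels shared by all alerts in the notification group:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;receivers:
  - name: slack-default
    slack_configs:
      - channel: '#alerts'
        title: '{{ .CommonLabels.alertname }}'
        # interpolate the app and alertname labels into a runbook link
        text: 'Runbook: https://runbooks.example.com/{{ .CommonLabels.app }}/{{ .CommonLabels.alertname }}'
&lt;/code&gt;&lt;/pre&gt;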

&lt;h2&gt;&lt;em&gt;Manage alerts with a click with Sysdig Monitor&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;As organizations grow, maintaining Prometheus and Alertmanager can become difficult to manage across teams. Sysdig Monitor makes this easy with Role-Based Access Control where teams can focus on the metrics and alerts most important to them. We offer a turn-key solution where you can manage your alerts from a single pane of glass. With Sysdig Monitor you can spend less time maintaining Prometheus Alertmanager and spend more time monitoring your actual infrastructure. Come chat with industry experts in monitoring and alerting and we'll get you up and running.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Prometheus-Alertmanager-CTA-1170x530.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FPrometheus-Alertmanager-CTA-1170x530.png" alt="Alert Monitoring in Sysdig Monitor"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Sign up now for a &lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;free trial of Sysdig Monitor&lt;/a&gt;&lt;/p&gt;

</description>
      <category>prometheus</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
    <item>
      <title>Kubernetes OOM and CPU Throttling</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 26 Jan 2023 09:56:52 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-oom-and-cpu-throttling-n55</link>
      <guid>https://dev.to/sysdig/kubernetes-oom-and-cpu-throttling-n55</guid>
      <description>&lt;h2 id="introduction"&gt;Introduction&lt;/h2&gt;

&lt;p&gt;When working with Kubernetes, Out of Memory (OOM) errors and CPU throttling are the main headaches of resource handling in cloud applications. Why is that?&lt;/p&gt;

&lt;p&gt;CPU and Memory requirements in cloud applications are ever more important, since they are tied directly to your cloud costs.&lt;/p&gt;

&lt;p&gt;With &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noreferrer noopener"&gt;limits and requests&lt;/a&gt;, you can configure how your pods should allocate memory and CPU resources in order to prevent resource starvation and adjust cloud costs.&lt;/p&gt;

&lt;p&gt;In case a Node doesn’t have enough resources, &lt;a href="https://sysdig.com/blog/kubernetes-pod-evicted/" rel="noopener noreferrer"&gt;Pods might get evicted&lt;/a&gt; via preemption or node-pressure.&lt;br&gt;When a process runs Out Of Memory (OOM), it’s killed since it doesn’t have the required resources.&lt;br&gt;In case CPU consumption is higher than the actual limits, the process will start to be throttled.&lt;/p&gt;

&lt;p&gt;But, how can you actively monitor how close your Kubernetes Pods are to OOM and CPU throttling?&lt;/p&gt;

&lt;h2 id="kubernetes-oom"&gt;Kubernetes OOM&lt;/h2&gt;

&lt;p&gt;Every container in a Pod needs memory to run.&lt;/p&gt;

&lt;p&gt;Kubernetes limits are set per container in either a Pod definition or a Deployment definition.&lt;/p&gt;
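
&lt;p&gt;For example, a minimal Pod definition with a per-container memory request and limit might look like this (the name, image, and values are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: memory-demo
spec:
  containers:
    - name: app
      image: nginx
      resources:
        requests:
          memory: "128Mi"   # guaranteed amount for scheduling
        limits:
          memory: "256Mi"   # exceeding this gets the container OOMKilled
&lt;/code&gt;&lt;/pre&gt;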

&lt;p&gt;All modern Unix systems have a way to kill processes in case they need to reclaim memory. In Kubernetes, this shows up as exit code 137 or &lt;code&gt;OOMKilled&lt;/code&gt;.&lt;/p&gt;

&lt;pre&gt;   State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 10 Oct 2019 11:04:03 +0200
      Finished:     Thu, 10 Oct 2019 11:14:11 +0200
&lt;/pre&gt;

&lt;p&gt;This Exit Code 137 means that the process used more memory than the allowed amount and had to be terminated.&lt;/p&gt;

&lt;p&gt;This is a feature of Linux, where the kernel sets an &lt;code&gt;oom_score&lt;/code&gt; value for each process running in the system. Additionally, it allows setting a value called &lt;code&gt;oom_score_adj&lt;/code&gt;, which Kubernetes uses to implement Quality of Service. The kernel also features an &lt;code&gt;OOM Killer&lt;/code&gt;, which reviews processes and terminates those that are using more memory than they should.&lt;/p&gt;

&lt;p&gt;Note that in Kubernetes, a process can reach any of these limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes Limit set on the container.&lt;/li&gt;



&lt;li&gt;A Kubernetes ResourceQuota set on the namespace.&lt;/li&gt;



&lt;li&gt;The node’s actual Memory size.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-1.png" alt="Kubernetes OOM graph" title="image_tooltip" width="800" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Memory overcommitment&lt;/h3&gt;

&lt;p&gt;Limits can be higher than requests, so the sum of all limits can be higher than node capacity. This is called overcommit and it is very common. In practice, if all containers use more memory than requested, it can exhaust the memory in the node. This usually causes the &lt;a href="https://sysdig.com/blog/kubernetes-pod-evicted/" rel="noopener noreferrer"&gt;death of some pods&lt;/a&gt; in order to free some memory.&lt;/p&gt;

&lt;h3&gt;Monitoring Kubernetes OOM&lt;/h3&gt;

&lt;p&gt;When using node exporter in Prometheus, there’s one metric called &lt;code&gt;node_vmstat_oom_kill&lt;/code&gt;. It’s important to track when an OOM kill happens, but you might want to get ahead and have visibility of such an event before it happens.&lt;/p&gt;

&lt;p&gt;Instead, you can check how close a process is to the Kubernetes limits:&lt;/p&gt;

&lt;pre&gt;(sum by (namespace,pod,container)
(container_memory_working_set_bytes{container!=""}) / sum by
(namespace,pod,container)
(kube_pod_container_resource_limits{resource="memory"})) &amp;gt; 0.8
&lt;/pre&gt;
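
&lt;p&gt;As a sketch, this kind of threshold can page the on-call team before the OOM Killer acts, by comparing each container's working-set memory against its configured limit in a Prometheus alerting rule (the rule and group names are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
  - name: memory-alerts
    rules:
      - alert: ContainerMemoryNearLimit
        expr: |
          (sum by (namespace,pod,container)
          (container_memory_working_set_bytes{container!=""}) / sum by
          (namespace,pod,container)
          (kube_pod_container_resource_limits{resource="memory"})) &amp;gt; 0.8
        for: 5m
        labels:
          severity: warning
&lt;/code&gt;&lt;/pre&gt;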

&lt;h2 id="kubernetes-cpu-throttling"&gt;Kubernetes CPU throttling&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CPU Throttling&lt;/strong&gt; is a behavior where processes are slowed when they are about to reach some resource limits.&lt;/p&gt;

&lt;p&gt;Similar to the memory case, these limits could be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Kubernetes Limit set on the container.&lt;/li&gt;



&lt;li&gt;A Kubernetes ResourceQuota set on the namespace.&lt;/li&gt;



&lt;li&gt;The node’s actual CPU capacity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of the following analogy. We have a highway with some traffic where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU is the road.&lt;/li&gt;



&lt;li&gt;Vehicles represent the process, where each one has a different size.&lt;/li&gt;



&lt;li&gt;Multiple lanes represent having several cores.&lt;/li&gt;



&lt;li&gt;A request would be an exclusive road, like a bike lane.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Throttling here is represented as a traffic jam: eventually, all processes will run, but everything will be slower.&lt;/p&gt;

&lt;h3&gt;CPU process in Kubernetes&lt;/h3&gt;

&lt;p&gt;CPU is handled in Kubernetes with &lt;strong&gt;shares&lt;/strong&gt;. Each CPU core is divided into 1,024 shares, which are distributed among all running processes using the cgroups (control groups) feature of the Linux kernel.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-4.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-4.png" alt="Kubernetes shares system for CPU" width="800" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If the CPU can handle all current processes, then no action is needed. If processes are using more than 100% of the CPU, shares come into play. Like any Linux system, Kubernetes relies on the kernel’s CFS (Completely Fair Scheduler), so processes with more shares get more CPU time.&lt;/p&gt;

&lt;p&gt;Unlike with memory, Kubernetes won’t kill Pods because of CPU throttling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-2.png" alt="Kubernetes Throttling graph" title="image_tooltip" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;You can check CPU stats in /sys/fs/cgroup/cpu/cpu.stat&lt;/em&gt;&lt;/span&gt;&lt;/p&gt;

&lt;h3&gt;CPU overcommitment&lt;/h3&gt;

&lt;p&gt;As we saw in the &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;limits and requests article&lt;/a&gt;, it’s important to set limits or requests when we want to restrict the resource consumption of our processes. Nevertheless, beware of setting total requests larger than the node’s actual CPU capacity: each container’s request is a guaranteed amount of CPU, and the scheduler cannot guarantee more than the node can provide.&lt;/p&gt;

&lt;h3&gt;Monitoring Kubernetes CPU throttling&lt;/h3&gt;

&lt;p&gt;You can check how close a process is to the Kubernetes limits:&lt;/p&gt;

&lt;pre&gt;(sum by (namespace,pod,container)(rate(container_cpu_usage_seconds_total
{container!=""}[5m])) / sum by (namespace,pod,container)
(kube_pod_container_resource_limits{resource="cpu"})) &amp;gt; 0.8&lt;/pre&gt;

&lt;p&gt;In case we want to track the amount of throttling happening in our cluster, cadvisor provides &lt;code&gt;container_cpu_cfs_throttled_periods_total&lt;/code&gt; and &lt;code&gt;container_cpu_cfs_periods_total&lt;/code&gt;. With these two, you can easily calculate the % of throttling in all CPU periods.&lt;/p&gt;
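
&lt;p&gt;For example, a query along these lines returns the ratio of throttled CPU periods per container over the last five minutes:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum by (namespace,pod,container)
(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) / sum by
(namespace,pod,container)
(rate(container_cpu_cfs_periods_total{container!=""}[5m]))
&lt;/code&gt;&lt;/pre&gt;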

&lt;h2 id="best-practices"&gt;Best practices&lt;/h2&gt;

&lt;h3&gt;Beware of limits and requests&lt;/h3&gt;

&lt;p&gt;Limits are a way to set up a maximum cap on resources in your node, but these need to be treated carefully, as you might end up with a process throttled or killed.&lt;/p&gt;

&lt;h3&gt;Prepare against eviction&lt;/h3&gt;

&lt;p&gt;By setting very low requests, you might think this will grant a minimum of either CPU or Memory to your process. But &lt;code&gt;kubelet&lt;/code&gt; will first evict those Pods whose usage is higher than their requests, so you’re marking those as the first to be killed!&lt;/p&gt;

&lt;p&gt;In case you need to protect specific Pods against preemption (when &lt;code&gt;kube-scheduler&lt;/code&gt; needs to allocate a new Pod), assign Priority Classes to your most important processes.&lt;/p&gt;
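
&lt;p&gt;A sketch of a PriorityClass and a Pod that references it might look like this (the names and the priority value are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: critical-service
value: 1000000            # higher values are scheduled first and preempted last
globalDefault: false
description: "For Pods that should not be preempted."
---
apiVersion: v1
kind: Pod
metadata:
  name: important-app
spec:
  priorityClassName: critical-service
  containers:
    - name: app
      image: nginx
&lt;/code&gt;&lt;/pre&gt;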

&lt;h3&gt;Throttling is a silent enemy&lt;/h3&gt;

&lt;p&gt;By setting unrealistic limits or overcommitting, you might not be aware that your processes are being throttled and that performance is impacted. Proactively monitor your CPU usage and know your actual limits in both containers and namespaces.&lt;/p&gt;

&lt;h2 id="wrapping-up"&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;Here’s a cheat sheet on Kubernetes resource management for CPU and Memory. It summarizes the current article plus these articles, which are part of the same series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-pod-evicted/" rel="noopener noreferrer"&gt;https://sysdig.com/blog/kubernetes-pod-evicted/&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;https://sysdig.com/blog/kubernetes-limits-requests/&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-TroubleshootKubernetesOOM-diagram-1170x664.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlogImages-TroubleshootKubernetesOOM-diagram-1170x664.png" alt="Kubernetes CPU and Memory cheatsheet" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;








&lt;h2&gt;Rightsize your Kubernetes Resources with Sysdig Monitor&lt;/h2&gt;





&lt;p&gt;With Sysdig Monitor’s new feature, Cost Advisor, you can optimize your Kubernetes costs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory requests&lt;/li&gt;



&lt;li&gt;CPU requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With our out-of-the-box Kubernetes Dashboards, you can &lt;a href="https://sysdig.com/blog/kubernetes-capacity-planning/" rel="noreferrer noopener"&gt;discover underutilized resources&lt;/a&gt; in a couple of clicks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-How-to-do-capacity-planning-for-Kubernetes-Image-14.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FBlog-How-to-do-capacity-planning-for-Kubernetes-Image-14.png" alt="Capacity planning Kubernetes Sysdig Monitor" width="800" height="515"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>cloud</category>
      <category>prometheus</category>
    </item>
    <item>
      <title>Getting started with kubectl plugins</title>
      <dc:creator>Miguel</dc:creator>
      <pubDate>Wed, 18 Jan 2023 16:46:11 +0000</pubDate>
      <link>https://dev.to/sysdig/getting-started-with-kubectl-plugins-372e</link>
      <guid>https://dev.to/sysdig/getting-started-with-kubectl-plugins-372e</guid>
      <description>&lt;p&gt;Let's dig deeper into this list of Kubectl plugins that we strongly feel will be very useful for anyone, especially security engineers.&lt;/p&gt;

&lt;p&gt;Kubernetes, by design, is incredibly customizable. It supports custom configurations for specific use case scenarios, which eliminates the need to patch underlying features. Plugins are the means to extend Kubernetes capabilities beyond what ships out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Kubernetes Plugins?
&lt;/h2&gt;

&lt;p&gt;Users can install and write extensions for kubectl, the &lt;a href="https://kubernetes.io/docs/reference/kubectl/" rel="noopener noreferrer"&gt;Kubernetes command line tool&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By viewing the core kubectl commands as essential building blocks for interacting with a Kubernetes cluster, a cluster administrator can think of plugins as a means of combining these building blocks to create more complex behavior.&lt;/p&gt;

&lt;p&gt;Plugins extend kubectl with new sub-commands, allowing for new and custom features not included in the main distribution of kubectl.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why are plugins useful for security operations?
&lt;/h3&gt;

&lt;p&gt;Kubernetes plugins provide countless security benefits to the platform. Incident responders can develop additional functionality “on the fly” in their language of choice.&lt;/p&gt;

&lt;p&gt;Since Kubernetes features often fall short in cases where businesses need to achieve “out-of-scope” functionality, teams will often need to implement their own custom operations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Potential security considerations for Kubernetes plugins
&lt;/h2&gt;

&lt;p&gt;While custom implementations add functionality that is not necessarily provided out-of-the-box with &lt;em&gt;kubectl&lt;/em&gt;, these plugins are not always as secure as we would like them to be. This article aims to address the most common or &lt;strong&gt;useful Kubernetes plugins for improving your security posture&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Managing plugins with Krew
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://krew.sigs.k8s.io/" rel="noopener noreferrer"&gt;Krew&lt;/a&gt; is a plugin manager maintained by the Kubernetes Special Interest Group (&lt;a href="https://github.com/kubernetes/community/blob/master/sig-cli/README.md" rel="noopener noreferrer"&gt;SIG&lt;/a&gt;) CLI community. Krew makes it easy to use kubectl plugins and helps you discover, install, and manage them on your machine. It is similar to tools like &lt;a href="https://en.wikipedia.org/wiki/APT_(software)" rel="noopener noreferrer"&gt;apt&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/DNF_(software)" rel="noopener noreferrer"&gt;dnf&lt;/a&gt;, or &lt;a href="https://en.wikipedia.org/wiki/Homebrew_(package_manager)" rel="noopener noreferrer"&gt;brew&lt;/a&gt;. Today, over 200 kubectl plugins are available on Krew - and that number is only increasing. Some projects are actively used and some get deprecated over time, but are still accessible via Krew.&lt;/p&gt;

&lt;p&gt;Command to install kubectl plugins via Krew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install &amp;lt;PLUGIN_NAME&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kubectl plugins available via the Krew plugin index are &lt;strong&gt;not audited, which can cause a problem in the &lt;a href="https://sysdig.com/blog/software-supply-chain-security/" rel="noopener noreferrer"&gt;supply chain&lt;/a&gt;&lt;/strong&gt;. As mentioned earlier, the Krew plugin index houses hundreds of kubectl plugins:&lt;br&gt;&lt;br&gt;
&lt;a href="https://krew.sigs.k8s.io/plugins/" rel="noopener noreferrer"&gt;https://krew.sigs.k8s.io/plugins/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you install and run third-party plugins, you are doing this at your own risk. At the end of the day, kubectl plugins are just arbitrary programs running in your shell.&lt;/p&gt;

&lt;p&gt;Finally, we want to share our top 15 kubectl plugins that will improve your security posture in your Kubernetes cluster.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Stern plugin
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/stern/stern" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stern&lt;/strong&gt; is a kubectl plugin that works a lot like ‘&lt;a href="https://linuxways.net/centos/tail-command-in-linux/" rel="noopener noreferrer"&gt;tail -f&lt;/a&gt;’ in Linux. Unlike &lt;strong&gt;kubectl logs -f&lt;/strong&gt;, which has its own limitations around input parameters, Stern allows you to specify both the Pod ID and the Container ID as regular expressions.&lt;/p&gt;

&lt;p&gt;Any match will be followed and the output is multiplexed together, prefixed with the Pod and Container ID, and color-coded for human consumption (colors are stripped if piping to a file).&lt;/p&gt;

&lt;p&gt;You can install Stern with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install stern
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Command to tail an appname in Stern:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl stern appname
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will match any pod whose name contains the word appname and follow all containers within it. If you only want to see logs from the server container, you could do:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl stern --container 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will stream the logs of all the server containers, even if running in multiple pods.&lt;/p&gt;
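&lt;p&gt;Because both selectors are regular expressions, you can narrow the stream precisely. As an illustrative example (the pod and container names here are hypothetical), the following follows only containers named &lt;strong&gt;nginx&lt;/strong&gt; in pods whose names start with &lt;strong&gt;web-&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl stern "^web-.*" --container "nginx" --namespace default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;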

&lt;p&gt;One interesting security use case for the Stern plugin is to look at authentication activity to your Kubernetes clusters. To show the authentication activity within the last 15 minutes with relevant highlighted timestamps, run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl stern -t --since 15m auth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. RBAC-tool
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/alcideio/rbac-tool" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Role-based access control (&lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-security/kubernetes-rbac/" rel="noopener noreferrer"&gt;RBAC&lt;/a&gt;) is a method of regulating access to computer or network resources based on the roles of individual users within your organization. The RBAC-tool simplifies querying and the creation of RBAC policies.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;RBAC-tool&lt;/strong&gt; with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install rbac-tool
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are unfamiliar with how RBAC roles are assigned to different Kubernetes components, the visualization command generates an insightful graph of all RBAC decisions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rbac-tool viz --cluster-context nigel-douglas-cluster
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above command scans the cluster with the kubeconfig context '&lt;strong&gt;nigel-douglas-cluster&lt;/strong&gt;.' These graphs are useful for showing a visual before-and-after of permissions assigned to service accounts.&lt;/p&gt;

&lt;p&gt;There are multiple commands other than ‘&lt;strong&gt;viz&lt;/strong&gt;’ provided by the RBAC-tool plugin. The most useful is the ‘&lt;strong&gt;who-can&lt;/strong&gt;’ command. This shows which subjects have RBAC permissions to perform an action denoted by ‘VERB’ (e.g., get, create, update, or delete) on an object.&lt;/p&gt;

&lt;p&gt;To see who can read a secret resource by the name ‘&lt;strong&gt;important-secret&lt;/strong&gt;,’ run the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rbac-tool who-can get secret/important-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Cilium Plugin
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/bmcustodio/kubectl-cilium" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cilium&lt;/strong&gt; is a network security project that continues to grow in popularity due to its powerful eBPF dataplane. Since Kubernetes is not designed with any specific &lt;a href="https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/" rel="noopener noreferrer"&gt;CNI (Network) Plugin&lt;/a&gt; in mind, it can be tedious to manage the Cilium agent via kubectl. That’s why the kubectl-cilium plugin was created.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;Cilium&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install cilium
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As a basic first step, you can do a connectivity check for a single node powered by Cilium networking via the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl cilium connectivity test --single-node &amp;lt;node&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This doesn’t just provide improved operational visibility - it’s incredibly beneficial to network security engineers. For instance, if Cilium is unable to communicate with core components such as ‘&lt;a href="https://github.com/cilium/hubble" rel="noopener noreferrer"&gt;Hubble&lt;/a&gt;,’ this will show up in the connectivity test.&lt;/p&gt;

&lt;p&gt;Hubble provides network, service, and security observability for Kubernetes. Being able to quickly diagnose a connection error, such as “connection refused,” improves the overall visibility of threats and provides the centralized network event view required to maintain regulatory compliance. If you want to dig deeper into network policies, discover &lt;a href="https://sysdig.com/blog/denial-of-service-kubernetes-calico-falco/" rel="noopener noreferrer"&gt;how to prevent a Denial of Service (DoS) attack on Kubernetes&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Kube Policy Advisor
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/sysdiglabs/kube-policy-advisor" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kube-policy-advisor plugin suggests PodSecurityPolicies and Open Policy Agent (OPA) Policies for your Kubernetes cluster. While &lt;a href="https://kubernetes.io/blog/2021/04/06/podsecuritypolicy-deprecation-past-present-and-future/" rel="noopener noreferrer"&gt;PodSecurityPolicies are deprecated&lt;/a&gt;, and therefore should not be used, OPA is very much a recommended tool for &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-security/kubernetes-admission-controllers/" rel="noopener noreferrer"&gt;admission control&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;You can install &lt;strong&gt;advise-policy&lt;/strong&gt; with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install advise-policy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kubectl plugin provides security and compliance checks for Kubernetes clusters. It can help identify potential security risks and violations of best practices in a cluster's configuration, and provide recommendations for how to remediate those issues. Some examples of the types of checks that kube-policy-advisor can perform include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Ensures pods are running with minimal privileges and are not granted unnecessary permissions.&lt;/li&gt;
&lt;li&gt;  Checks that secrets and other sensitive data are not stored in plain text or checked into source control.&lt;/li&gt;
&lt;li&gt;  Verifies that network policies are in place to protect against unauthorized access to resources.&lt;/li&gt;
&lt;li&gt;  Evaluates the security of container images and ensures that they come from trusted sources.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In Kubernetes, Admission Controllers enforce semantic validation of objects during create, update, and delete operations. With OPA, you can enforce custom policies on Kubernetes objects without recompiling or reconfiguring the Kubernetes API server.&lt;/p&gt;

&lt;p&gt;kube-policy-advisor is a tool that makes it easier to create OPA Policy from either a live K8s environment or from a single .yaml file containing a pod specification (Deployment, DaemonSet, Pod, etc.). In the below command, the plugin inspects any given namespace to print a report or OPA Policy.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl advise-policy inspect --namespace=&amp;lt;ns&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note: If you do not specify a namespace, it will generate the OPA Policy for all namespaces.&lt;/p&gt;

&lt;p&gt;By using kube-policy-advisor, you can help ensure that your Kubernetes cluster is secure and compliant with best practices, which can help protect your applications and data from potential threats.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Kubectl-ssm-secret
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/pr8kerl/kubectl-ssm-secret" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;kubectl-ssm-secret&lt;/strong&gt; plugin allows admins to import or export their &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-101/how-to-create-and-use-kubernetes-secrets/" rel="noopener noreferrer"&gt;Kubernetes Secrets&lt;/a&gt; to or from an &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/what-is-systems-manager.html" rel="noopener noreferrer"&gt;AWS SSM&lt;/a&gt; Parameter Store path. A Kubernetes Secret is sensitive information – such as a password or access key – that is used within a Kubernetes environment. It’s important to be able to safely control these sensitive credentials when transmitting between Kubernetes and AWS cloud.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;ssm-secret&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install ssm-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Secrets are not unique to Kubernetes, of course. Sensitive data like this is used in virtually every type of modern application environment or platform. In the case of the ssm-secret plugin, all parameters found under a given parameter store path can be imported into a single Kubernetes Secret as “StringData.”&lt;/p&gt;

&lt;p&gt;This is incredibly useful if you are reprovisioning clusters or namespaces and need to provision the same secrets over and over. Also, it could be useful to backup/restore your &lt;a href="https://letsencrypt.org/" rel="noopener noreferrer"&gt;LetsEncrypt&lt;/a&gt; or other certificates.&lt;/p&gt;

&lt;p&gt;If an AWS parameter at path /foo/bar contains a secret value, and the parameter /foo/passwd contains a secure password, we can view the keys and values in parameter store using the kubectl ssm-secret list subcommand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl ssm-secret list --ssm-path /foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Those output parameters can then be imported with the following import command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl ssm-secret import foo --ssm-path /foo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
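&lt;p&gt;Assuming the two parameters above, the import would produce a single Secret named &lt;strong&gt;foo&lt;/strong&gt; with one key per parameter, along the lines of (values shown as placeholders):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;apiVersion: v1
kind: Secret
metadata:
  name: foo
type: Opaque
stringData:
  bar: &amp;lt;secret-value&amp;gt;
  passwd: &amp;lt;secure-password&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;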



&lt;h4&gt;
  
  
  Security considerations
&lt;/h4&gt;

&lt;p&gt;You must specify a single parameter store path for this plugin to work. It will not recursively search more than one level under a given path. As a result, the plugin is highly opinionated, and users run the risk of failing to import/export secrets to the correct path if they don’t track these paths correctly.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Kubelogin
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/int128/kubelogin" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re running kubectl v1.12 or higher, Kubelogin (also known as &lt;strong&gt;&lt;em&gt;kubectl-login&lt;/em&gt;&lt;/strong&gt;) is a useful security plugin for logging into clusters via the CLI. It achieves this through &lt;a href="https://openid.net/connect/" rel="noopener noreferrer"&gt;OpenID Connect&lt;/a&gt; providers like Dex. OpenID Connect is a simple identity layer on top of the OAuth 2.0 protocol. It allows clients to verify the identity of the end user based on the authentication performed by an authorization server, as well as to obtain basic profile information about the end user in an interoperable and REST-like manner.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;kubectl-login&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install kubectl-login
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your OpenID Connect provider must have the default callback endpoint for the Kubernetes API Client listed within the OpenID configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:33768/auth/callback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This kubectl plugin takes the OpenID Connect (OIDC) issuer URL from your &lt;em&gt;.kube/config&lt;/em&gt;, so the issuer URL must be configured there first. Once you have made this change to the kubeconfig file, you can proceed to use the username assigned to your OIDC provider:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl login nigeldouglas-oidc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After this command is executed in your CLI, the browser will be opened with a redirect to the OpenID Connect Provider login page. The tokens in your kubeconfig file will be replaced after a successful authentication on the OIDC provider’s end.&lt;/p&gt;
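&lt;p&gt;As a sketch (the issuer URL, client ID, and client secret below are placeholders for your own OIDC provider's values), the corresponding user entry in your kubeconfig might look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;users:
- name: nigeldouglas-oidc
  user:
    auth-provider:
      name: oidc
      config:
        idp-issuer-url: https://issuer.example.com
        client-id: kubernetes
        client-secret: &amp;lt;client-secret&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;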

&lt;h3&gt;
  
  
  7. Kubectl-whisper-secret
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/rewanthtammana/kubectl-whisper-secret" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We mentioned the importance of securing sensitive credentials like ‘Secrets’ using the kubectl-ssm-secret plugin. The whisper-secret plugin focuses on creating those secrets with improved privacy. The plugin allows users to create secrets with secure input prompts to prevent information leakages through &lt;a href="https://gist.github.com/JPvRiel/df1d4c795ebbcad522188759c8fd69c7" rel="noopener noreferrer"&gt;terminal (bash) history&lt;/a&gt;, &lt;a href="https://capec.mitre.org/data/definitions/508.html" rel="noopener noreferrer"&gt;shoulder surfing&lt;/a&gt; attacks, etc.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;whisper-secret&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install whisper-secret
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;’kubectl create secret’&lt;/strong&gt; has a few sub-commands we use most often that can possibly leak sensitive information in multiple ways, as mentioned above. For example, you can connect to a Docker registry via the &lt;strong&gt;’kubectl create secret’&lt;/strong&gt; command with a plain-text password for authentication.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl create secret docker-registry my-secret --docker-password nigelDouglasP@ssw0rD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;’kubectl whisper-secret’&lt;/strong&gt; plugin allows users to create secrets with a secure input prompt for fields like &lt;strong&gt;--from-literal&lt;/strong&gt; and &lt;strong&gt;--docker-password&lt;/strong&gt; that contain sensitive information.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl whisper-secret docker-registry my-secret --docker-password -- -n nigel-test --docker-username &amp;lt;insert-password&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You are then prompted to enter the Docker password, but this is not inserted into the command itself. This way, the password will not show up in the bash history as a plain-text value, increasing security.&lt;/p&gt;

&lt;h3&gt;
  
  
  8. Kubectl-capture
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/sysdiglabs/kubectl-capture" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sysdig open source (&lt;a href="https://sysdig.com/blog/sysdig-inspect/" rel="noopener noreferrer"&gt;Sysdig Inspect&lt;/a&gt;) is a powerful tool for container troubleshooting, performance tuning, and security investigation. The team at Sysdig created a kubectl plugin which triggers a Sysdig capture of system activity on the underlying host that is running a pod.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;kubectl-capture&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install kubectl-capture
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Such captures are incredibly useful for &lt;a href="https://sysdig.com/blog/guide-kubernetes-forensics-dfir/" rel="noopener noreferrer"&gt;incident response and forensics in Kubernetes&lt;/a&gt;. The capture file is created for a duration of time and is downloaded locally in order to use it with &lt;a href="https://github.com/draios/sysdig-inspect" rel="noopener noreferrer"&gt;Sysdig Inspect&lt;/a&gt;, a powerful open source interface designed to intuitively navigate the data-dense Sysdig captures that contain granular system, network, and application activity of a Linux system.&lt;/p&gt;

&lt;p&gt;Simply run the below command against any running pod in the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl capture kinsing-78f5d695bd-bcbd8
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the capture container is spun up, it takes some time to compile the Sysdig kernel module and begin capturing system calls. Once completed, you can read the content within the Sysdig Inspect UI from your workstation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FScreenshot-2023-01-16-at-16.22.46.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FScreenshot-2023-01-16-at-16.22.46.png" title="Sysdig Inspect" alt="alt_text" width="800" height="496"&gt;&lt;/a&gt;&lt;br&gt;&lt;br&gt;
With these tools, it will be much easier for the analyst to find the source of the problem or to audit what happened. If you want to go deeper, you can read &lt;a href="https://sysdig.com/blog/sysdig-inspect/" rel="noopener noreferrer"&gt;container troubleshooting with Sysdig Inspect&lt;/a&gt; or &lt;a href="https://sysdig.com/blog/triaging-malicious-docker-container/" rel="noopener noreferrer"&gt;triaging malicious containers&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  9. Kubectl-trace
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/iovisor/kubectl-trace" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;kubectl-trace&lt;/strong&gt; is a kubectl plugin that allows you to schedule the execution of bpftrace programs in your Kubernetes cluster. In short, kubectl-trace lets you attach eBPF-based tracing programs to nodes, pods, and containers, giving you low-level visibility into the system calls and kernel events generated by your workloads.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;kubectl-trace&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install trace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One potential security benefit of using the kubectl-trace plugin is that it can help you identify and troubleshoot issues at the system-call level. For example, if you suspect that a workload is being blocked or slowed down by some issue on a node, you can use kubectl-trace to probe the relevant kernel events and identify the source of the problem.&lt;/p&gt;

&lt;p&gt;This plugin runs a program that probes a tracepoint on the node of choice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl trace run &amp;lt;node-name&amp;gt; -e "tracepoint:syscalls:sys_enter_* { @[probe] = count(); }"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another potential security benefit is that kubectl-trace can help you understand how workloads behave at the kernel level, which can be useful for identifying potential vulnerabilities or misconfigurations. For example, if you suspect that a pod has been compromised, you can trace its system activity to identify the source and scope of the issue.&lt;/p&gt;
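&lt;p&gt;Recent versions of the plugin can also target an individual pod rather than a whole node. As a hypothetical example (the pod name is illustrative), the following counts system calls by process name inside a single pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl trace run pod/nginx-pod -e "tracepoint:syscalls:sys_enter_* { @[comm] = count(); }"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;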

&lt;p&gt;Overall, the kubectl-trace plugin can be a useful tool for improving the security of a Kubernetes cluster by helping to identify and address issues visible at the level of system calls and kernel events.&lt;/p&gt;

&lt;h3&gt;
  
  
  10. Access-matrix
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/corneliusweig/rakkess" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Access-matrix&lt;/strong&gt; (often referred to as ‘Rakkess’) is a kubectl plugin that shows an access matrix for your server resources.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;access-matrix&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install access-matrix
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simply run the below command to see the Create, Read, Update &amp;amp; Delete (CRUD) permissions for all resources in the ‘default’ namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rakkess –n default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some roles only apply to resources with a specific name. To review such configurations, provide the resource name as an additional argument. For example, show access rights for the ConfigMap called &lt;strong&gt;sysdig-controller&lt;/strong&gt; in namespace &lt;strong&gt;sysdig-agent&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl access-matrix r cm sysdig-controller -n sysdig-agent --verbs=all
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As rakkess needs to query &lt;strong&gt;Roles&lt;/strong&gt;, &lt;strong&gt;ClusterRoles&lt;/strong&gt;, and their &lt;strong&gt;bindings&lt;/strong&gt;, it usually requires administrative cluster access.&lt;/p&gt;

&lt;h3&gt;
  
  
  11. Rolesum
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/Ladicle/kubectl-rolesum" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Rolesum kubectl plugin is a tool for generating a summary of the roles and permissions defined in a Kubernetes cluster. It allows you to see all of the roles and permissions that have been defined in a cluster, along with the users and groups that have been granted those roles. It summarizes RBAC roles for a specified subject (ServiceAccount, User, or Group).&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;rolesum&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install rolesum
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One potential security benefit of using the Rolesum kubectl plugin is that it can help you identify and understand the roles and permissions that have been defined in a cluster. This can be useful for ensuring that appropriate access controls have been put in place, and for identifying potential vulnerabilities or misconfigurations.&lt;/p&gt;

&lt;p&gt;You can summarize roles bound to the "nigeldouglas" ServiceAccount.&lt;br&gt;&lt;br&gt;
By default, rolesum looks for serviceaccounts. There’s no need to specify any flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rolesum nigeldouglas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Another potential security benefit is that Rolesum can help you quickly identify users and groups that have been granted certain roles or permissions, which can be useful for troubleshooting issues or for performing security assessments.&lt;/p&gt;

&lt;p&gt;For example, you can summarize roles bound to the "staging" group.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rolesum -k Group staging
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Overall, the Rolesum kubectl plugin can be a useful tool for improving the security of a Kubernetes cluster by helping you understand and manage the roles and permissions that have been defined in the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  12. Cert-Manager
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/cert-manager/cert-manager" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cert-Manager is a Kubectl plugin that provides automatic management of Transport Layer Security (TLS) certificates within a cluster. It allows you to easily provision, manage, and renew TLS certificates for your applications without having to manually handle the certificate signing process.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;cert-manager&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install cert-manager
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One potential security benefit of using cert-manager is that it can help you ensure that your applications are using valid, up-to-date TLS certificates. This can be important for protecting the confidentiality and integrity of communication between your applications and their users.&lt;/p&gt;

&lt;p&gt;Another potential security benefit is that cert-manager can help you automate the process of obtaining and renewing TLS certificates, which can reduce the risk of certificate expiration or mismanagement.&lt;/p&gt;
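&lt;p&gt;For example, assuming a Certificate resource named &lt;strong&gt;example-tls&lt;/strong&gt; already exists in the current namespace, the plugin provides subcommands to inspect its status and trigger a manual renewal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl cert-manager status certificate example-tls
kubectl cert-manager renew example-tls
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;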

&lt;p&gt;Overall, the cert-manager kubectl plugin can be a useful tool for improving the security of a Kubernetes cluster by helping you to manage TLS certificates in a secure and automated manner. The cert-manager plugin is loosely based upon the work of kube-lego and has borrowed some wisdom from other similar projects, such as kube-cert-manager.&lt;/p&gt;

&lt;h3&gt;
  
  
  13. np-viewer
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/runoncloud/kubectl-np-viewer" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The kubectl-np-viewer plugin is a tool for visualizing Kubernetes network policies. It prints the ingress and egress rules that apply to pods within a cluster in a human-readable table format.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;np-viewer&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install np-viewer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Unlike the Cilium plugin we mentioned earlier, the kubectl-np-viewer plugin helps users understand and visualize the communication patterns within a cluster regardless of the CNI plugin used. The Cilium plugin only helps manage Cilium resources, such as the Cilium network policy. By viewing the default Kubernetes network policies, teams who are starting off with Kubernetes networking benefit from useful visibility into potential vulnerabilities or misconfigurations, such as pods that are communicating with unintended resources or are exposed to the internet.&lt;/p&gt;

&lt;p&gt;The below command prints network policies rules affecting a specific pod in the current namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl np-viewer -p pod-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, a potential security benefit from the kubectl-np-viewer plugin is that it helps users troubleshoot network issues within a cluster. For example, if you are experiencing connectivity issues between pods or services, you can use the plugin to visualize the connections between those resources and identify the source of the problem across all namespaces.&lt;/p&gt;

&lt;p&gt;The below command prints all network policies rules for all namespaces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl np-viewer --all-namespaces
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Overall, the kubectl-np-viewer plugin can be a useful tool for improving the security of a Kubernetes cluster by helping you to understand and monitor the network topology of the cluster. Not all businesses have moved to advanced network policy implementations, such as Calico and Cilium. While users are exploring the Kubernetes Network Policy implementation, they can better understand how their policies control potentially unwanted/malicious traffic within their cluster with this security plugin.&lt;/p&gt;

&lt;h3&gt;
  
  
  14. ksniff
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/eldadru/ksniff" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The ksniff kubectl plugin is a tool for capturing and analyzing network traffic in a Kubernetes cluster. It can be used to troubleshoot network issues, monitor traffic patterns, and perform security assessments.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;ksniff&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install ksniff
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One benefit of using ksniff is that it allows you to capture and analyze traffic without having to directly access the nodes in a cluster. This can be helpful in situations where you don't have direct access to the nodes, or where you want to minimize the potential impact of capturing traffic on the cluster.&lt;/p&gt;

&lt;p&gt;Another benefit is that ksniff can be used to capture traffic between pods and services, which can be useful for understanding how applications communicate within a cluster. This is helpful for troubleshooting issues, optimizing performance, and identifying potential security vulnerabilities.&lt;/p&gt;

&lt;p&gt;Overall, the ksniff kubectl plugin can be a useful tool for improving the security of a Kubernetes cluster by helping to identify and address network-related issues and vulnerabilities. It achieves this by sniffing traffic on Kubernetes pods with existing technologies, such as tcpdump and Wireshark.&lt;/p&gt;
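As a sketch of a typical capture (the pod name, namespace, capture filter, and output file below are placeholders; the `-n`, `-f`, and `-o` flags follow the ksniff README):

```
kubectl sniff mypod -n mynamespace -f "port 80" -o capture.pcap
```

The resulting `capture.pcap` can then be opened in Wireshark for offline analysis, without ever logging into the node that runs the pod.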

&lt;h3&gt;
  
  
  15. Inspektor-Gadget
&lt;/h3&gt;

&lt;p&gt;Link to &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inspektor-Gadget&lt;/strong&gt; is one of the most useful kubectl plugins. The plugin executes on the user's system and runs as a DaemonSet when deployed within the cluster. It is actually a collection of tools (or gadgets) to debug and inspect Kubernetes resources and applications.&lt;/p&gt;

&lt;p&gt;You can install the &lt;strong&gt;gadget&lt;/strong&gt; plugin with the below Krew command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl krew install gadget
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can deploy one or more gadgets. Example gadgets are categorized into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Advise&lt;/strong&gt; (Generates &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/advise/seccomp-profile.md" rel="noopener noreferrer"&gt;seccomp profiles&lt;/a&gt; and &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/advise/network-policy.md" rel="noopener noreferrer"&gt;network policies&lt;/a&gt; for the cluster)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Audit&lt;/strong&gt; (Traces the system calls that the &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/audit/seccomp.md" rel="noopener noreferrer"&gt;seccomp profile&lt;/a&gt; sends to the audit log)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Profile&lt;/strong&gt; (Analyzes &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/profile/block-io.md" rel="noopener noreferrer"&gt;Block I/O&lt;/a&gt; through distributed latency and &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/profile/cpu.md" rel="noopener noreferrer"&gt;CPU Perf&lt;/a&gt; by sampled stack traces)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Snapshot&lt;/strong&gt; (Gathers information about running &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/snapshot/process.md" rel="noopener noreferrer"&gt;processes&lt;/a&gt; and TCP/UDP &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/snapshot/socket.md" rel="noopener noreferrer"&gt;sockets&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Top&lt;/strong&gt; (Periodically reports &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/top/block-io.md" rel="noopener noreferrer"&gt;block device I/O&lt;/a&gt; activity, &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/top/ebpf.md" rel="noopener noreferrer"&gt;eBPF&lt;/a&gt; runtime stats, and read/write activity by &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/top/file.md" rel="noopener noreferrer"&gt;file&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Trace&lt;/strong&gt; (Traces almost all activity, from &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/trace/dns.md" rel="noopener noreferrer"&gt;DNS&lt;/a&gt; queries/responses to &lt;a href="https://github.com/inspektor-gadget/inspektor-gadget/blob/main/docs/guides/trace/oomkill.md" rel="noopener noreferrer"&gt;OOMkill&lt;/a&gt; triggering a process kill)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It manages the packaging, deployment, and execution of eBPF programs in a Kubernetes cluster, including many based on BPF Compiler Collection (BCC) tools, as well as some developed specifically for use in Inspektor Gadget. It automatically maps low-level kernel primitives to high-level Kubernetes resources, making it easier and quicker to find the relevant information.&lt;/p&gt;

&lt;p&gt;To “Advise” on a Kubernetes Network Policy based on network trace activity, run the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl gadget advise network-policy report --input ./networktrace.log &amp;gt; network-policy.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To “Audit” a seccomp profile based on pods, namespaces, syscalls, and code, run the below command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl gadget audit seccomp -o custom-columns=namespace,pod,syscall,code
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  DIY kubectl plugins
&lt;/h2&gt;

&lt;p&gt;You can write a plugin in any programming language or script that allows you to write command-line commands. There is no plugin installation or pre-loading required, which makes creating these plugins rather simple.&lt;/p&gt;

&lt;p&gt;Plugin executables receive the inherited environment from the kubectl binary. The plugin will then determine which command path it wishes to implement based on the name – for example, a plugin named &lt;strong&gt;kubectl-sysdig&lt;/strong&gt; provides a command &lt;strong&gt;kubectl sysdig&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You must install the plugin executable somewhere in your &lt;strong&gt;PATH&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A plugin script would look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#!/bin/bash
# optional argument handling
if [[ "$1" == "version" ]]
then
    echo "1.0.0"
    exit 0
fi
# optional argument handling
if [[ "$1" == "config" ]]
then
    echo "$KUBECONFIG"
    exit 0
fi
echo "I am a plugin named kubectl-sysdig"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
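To sketch the full workflow (the file name and location below are illustrative), you could save the script as `kubectl-sysdig`, mark it executable, and place it on your `PATH`; kubectl then dispatches `kubectl sysdig …` to it:

```shell
# Save the example plugin script (same content as above).
cat > kubectl-sysdig <<'EOF'
#!/bin/bash
# optional argument handling
if [[ "$1" == "version" ]]
then
    echo "1.0.0"
    exit 0
fi
echo "I am a plugin named kubectl-sysdig"
EOF

# Make it executable; moving it into a PATH directory lets kubectl discover it.
chmod +x kubectl-sysdig

# Invoking the script directly shows the same behavior as `kubectl sysdig version`:
./kubectl-sysdig version
```

Running the last line prints the plugin's version string, and invoking it with no arguments prints the greeting from the final `echo`.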



&lt;p&gt;For a complete guide on building Kubectl plugins, check out the &lt;a href="https://kubernetes.io/docs/tasks/extend-kubectl/kubectl-plugins/#using-a-plugin" rel="noopener noreferrer"&gt;official Kubernetes documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Final considerations on kubectl plugins
&lt;/h3&gt;

&lt;p&gt;At the time of writing this blog post, there were &lt;strong&gt;208 kubectl plugins&lt;/strong&gt; accessible on Krew. Those kubectl plugins are available to developers across all major platforms, like macOS, Linux, and Windows. While these plugins often address clear limitations of the default kubectl utility for operational tasks and security auditing, they can also open new security gaps in your Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;From a security standpoint, we discussed 15 of the most useful kubectl plugins for giving security teams better visibility for &lt;a href="https://sysdig.com/blog/guide-kubernetes-forensics-dfir/" rel="noopener noreferrer"&gt;incident response and forensics in Kubernetes&lt;/a&gt;. However, as we add more plugins into the environment, we are also adding unaudited binaries that could be compromised. Krew makes no commitment to audit these binaries for known vulnerabilities or insecure configurations.&lt;/p&gt;

&lt;p&gt;Some security implications of using kubectl plugins include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Plugin vulnerabilities:&lt;/strong&gt; If a kubectl plugin has a vulnerability, it can potentially be exploited by an attacker to gain access to your Kubernetes cluster.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Insecure plugin installation:&lt;/strong&gt; If a plugin is installed from an untrusted source, it could contain malicious code that could compromise the security of your cluster.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Privilege escalation&lt;/strong&gt;: kubectl plugins run with the same privileges as the kubectl command, so if a plugin is compromised, it could potentially be used to escalate privileges and gain access to sensitive resources in your cluster.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Data leakage:&lt;/strong&gt; If a kubectl plugin is not properly secured, it could potentially leak sensitive data from your cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To mitigate these risks, it is important to only install kubectl plugins from trusted sources and to regularly update and patch any plugins you have installed. It is also a good idea to regularly review the plugins you have installed and remove any that are no longer needed.&lt;/p&gt;
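For example, Krew itself can help with that housekeeping: listing what is installed, keeping plugins patched, and removing ones you no longer need (the plugin name below is a placeholder):

```
kubectl krew list                  # inventory of installed plugins
kubectl krew upgrade               # update all installed plugins
kubectl krew uninstall np-viewer   # remove a plugin you no longer need
```

Reviewing the output of `kubectl krew list` periodically is a simple way to keep the set of third-party binaries on your workstation to the minimum you actually use.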

&lt;p&gt;If you don’t feel like a specific plugin adds value to your cluster, it would be wise to remove it just in case.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>beginners</category>
      <category>devops</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Kubernetes Services: ClusterIP, Nodeport and LoadBalancer</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Fri, 09 Dec 2022 10:03:19 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-services-clusterip-nodeport-and-loadbalancer-1g3m</link>
      <guid>https://dev.to/sysdig/kubernetes-services-clusterip-nodeport-and-loadbalancer-1g3m</guid>
      <description>&lt;p&gt;Pods are ephemeral. And they are meant to be. They can be seamlessly destroyed and replaced if using a Deployment. Or they can be scaled at some point when using Horizontal Pod Autoscaling (HPA).&lt;/p&gt;

&lt;p&gt;This means we can’t rely on the Pod IP address to connect with applications running in our containers internally or externally, as the Pod might not be there in the future.&lt;/p&gt;

&lt;p&gt;You may have noticed that Kubernetes Pods get assigned an IP address:&lt;/p&gt;

&lt;pre&gt;stable-kube-state-metrics-758c964b95-6fnbl               1/1     Running   0          3d20h   100.96.2.5      ip-172-20-54-111.ec2.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
stable-prometheus-node-exporter-4brgv                    1/1     Running   0          3d20h   172.20.60.26    ip-172-20-60-26.ec2.internal
&lt;/pre&gt;

&lt;p&gt;This is a unique and internal IP for this particular Pod, but there’s no guarantee that this IP will exist in the future, due to the Pod's nature.&lt;/p&gt;

&lt;h2&gt;Services&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Kubernetes Service&lt;/strong&gt; is a mechanism to &lt;strong&gt;expose applications both internally and externally&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every service gets a stable IP address that outlives the Pods behind it and can be used as a connector.&lt;/p&gt;

&lt;p&gt;Additionally, it will open a &lt;code&gt;port&lt;/code&gt; that will be linked with a &lt;code&gt;targetPort&lt;/code&gt;. Some services can create ports in every &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-101/what-is-a-kubernetes-node/" rel="noreferrer noopener"&gt;Node&lt;/a&gt;, and even external IPs to create connectors outside the cluster.&lt;/p&gt;

&lt;p&gt;With the combination of both IP and Port, we can create a way to uniquely identify an application.&lt;/p&gt;

&lt;h3&gt;Creating a service&lt;/h3&gt;

&lt;p&gt;Every service has a selector that links it with a set of Pods in your cluster.&lt;/p&gt;

&lt;pre&gt;spec:
  selector:
    app.kubernetes.io/name: myapp
&lt;/pre&gt;

&lt;p&gt;So all Pods with the label &lt;em&gt;myapp&lt;/em&gt; will be linked to this service.&lt;/p&gt;

&lt;p&gt;There are three port attributes involved in a Service configuration:&lt;/p&gt;

&lt;pre&gt;  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30036
    protocol: TCP
&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;port: the new service port that will be created to connect to the application.&lt;/li&gt;



&lt;li&gt;targetPort: application port that we want to target with the services requests.&lt;/li&gt;



&lt;li&gt;nodePort: this is a port in the range of 30000-32767 that will be open in each node. If left empty, Kubernetes selects a free one in that range.&lt;/li&gt;



&lt;li&gt;protocol: TCP is the default one, but you can use others like SCTP or UDP.&lt;/li&gt;
&lt;/ul&gt;
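Putting the selector and the port attributes together, a minimal Service manifest could look like the following sketch (the service name and label value are placeholders; with no `type` set, it defaults to ClusterIP, so `nodePort` is omitted):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp-service          # placeholder name
spec:
  selector:
    app.kubernetes.io/name: myapp   # links the Service to Pods carrying this label
  ports:
  - port: 80          # port the Service exposes
    targetPort: 8080  # container port the traffic is forwarded to
    protocol: TCP
```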

&lt;p&gt;You can review services created with:&lt;/p&gt;

&lt;pre&gt;kubectl get services
kubectl get svc
&lt;/pre&gt;

&lt;h3&gt;Types of services&lt;/h3&gt;

&lt;p&gt;Kubernetes allows the creation of these types of services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ClusterIP (default)&lt;/li&gt;



&lt;li&gt;Nodeport&lt;/li&gt;



&lt;li&gt;LoadBalancer&lt;/li&gt;



&lt;li&gt;ExternalName&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see each of them in detail.&lt;/p&gt;

&lt;h2&gt;ClusterIP&lt;/h2&gt;

&lt;p&gt;This is the default type for service in Kubernetes.&lt;/p&gt;

&lt;p&gt;As indicated by its name, this is just an address that can be used inside the cluster.&lt;/p&gt;

&lt;p&gt;Take, for example, the initial helm installation for Prometheus Stack. It installs Pods, Deployments, and Services for the Prometheus and Grafana ecosystem.&lt;/p&gt;

&lt;pre&gt;NAME                                      TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
alertmanager-operated                     ClusterIP   None            &amp;lt;none&amp;gt;        9093/TCP,9094/TCP,9094/UDP   3m27s
kubernetes                                ClusterIP   100.64.0.1      &amp;lt;none&amp;gt;        443/TCP                      18h
prometheus-operated                       ClusterIP   None            &amp;lt;none&amp;gt;        9090/TCP                     3m27s
stable-grafana                            ClusterIP   100.66.46.251   &amp;lt;none&amp;gt;        80/TCP                       3m29s
stable-kube-prometheus-sta-alertmanager   ClusterIP   100.64.23.19    &amp;lt;none&amp;gt;        9093/TCP                     3m29s
stable-kube-prometheus-sta-operator       ClusterIP   100.69.14.239   &amp;lt;none&amp;gt;        443/TCP                      3m29s
stable-kube-prometheus-sta-prometheus     ClusterIP   100.70.168.92   &amp;lt;none&amp;gt;        9090/TCP                     3m29s
stable-kube-state-metrics                 ClusterIP   100.70.80.72    &amp;lt;none&amp;gt;        8080/TCP                     3m29s
stable-prometheus-node-exporter           ClusterIP   100.68.71.253   &amp;lt;none&amp;gt;        9100/TCP                     3m29s
&lt;/pre&gt;



&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s4XVCH_6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-01-1170x644.png" alt="Kubernetes Services ClusterIP" width="880" height="484"&gt;



&lt;p&gt;This creates a connection using an internal Cluster IP address and a Port.&lt;/p&gt;



&lt;p&gt;But, what if we need to use this connector from outside the Cluster? This IP is internal and won’t work outside.&lt;/p&gt;



&lt;p&gt;This is where the rest of the services come in…&lt;/p&gt;



&lt;h2&gt;NodePort&lt;/h2&gt;



&lt;p&gt;A NodePort differs from the ClusterIP in the sense that it exposes a port in each Node.&lt;/p&gt;



&lt;p&gt;When a NodePort is created, kube-proxy exposes a port in the range 30000-32767:&lt;/p&gt;



&lt;pre&gt;apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  selector:
    app: myapp
  type: NodePort
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30036
    protocol: TCP&lt;/pre&gt;



&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0hrnbgS6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-02-1170x644.png" alt="Kubernetes Services Nodeport" width="880" height="484"&gt;



&lt;p&gt;NodePort is the preferred element for non-HTTP communication.&lt;/p&gt;



&lt;p&gt;The problem with using a NodePort is that you still need to access each of the Nodes separately.&lt;/p&gt;



&lt;p&gt;So, let’s have a look at the next item on the list…&lt;/p&gt;



&lt;h2&gt;LoadBalancer&lt;/h2&gt;



&lt;p&gt;A LoadBalancer is a Kubernetes service that:&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;Creates a service like ClusterIP&lt;/li&gt;



&lt;li&gt;Opens a port in every node like NodePort&lt;/li&gt;



&lt;li&gt;Uses a LoadBalancer implementation from your cloud provider (your cloud provider needs to support this for LoadBalancers to work).&lt;/li&gt;
&lt;/ul&gt;



&lt;pre&gt;apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  ports:
  - name: web
    port: 80
  selector:
    app: web
  type: LoadBalancer
&lt;/pre&gt;

&lt;p&gt;Watching the service with &lt;code&gt;kubectl get service&lt;/code&gt;, the EXTERNAL-IP column moves from pending to the address assigned by the cloud provider:&lt;/p&gt;

&lt;pre&gt;my-service                                LoadBalancer   100.71.69.103   &amp;lt;pending&amp;gt;     80:32147/TCP                 12s
my-service                                LoadBalancer   100.71.69.103   a16038a91350f45bebb49af853ab6bd3-2079646983.us-east-1.elb.amazonaws.com   80:32147/TCP                 16m
&lt;/pre&gt;

&lt;p&gt;In this case, Amazon Web Services (AWS) was being used, so an external IP from AWS was created.&lt;/p&gt;

&lt;p&gt;Then, if you use &lt;code&gt;kubectl describe service my-service&lt;/code&gt;, you will find that several new attributes were added:&lt;/p&gt;

&lt;pre&gt;Name:                     my-service
Namespace:                default
Labels:                   &amp;lt;none&amp;gt;
Annotations:              &amp;lt;none&amp;gt;
Selector:                 app.kubernetes.io/name=pegasus
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       100.71.69.103
IPs:                      100.71.69.103
LoadBalancer Ingress:     a16038a91350f45bebb49af853ab6bd3-2079646983.us-east-1.elb.amazonaws.com
Port:                     &amp;lt;unset&amp;gt;  80/TCP
TargetPort:               9376/TCP
NodePort:                 &amp;lt;unset&amp;gt;  32147/TCP
Endpoints:                &amp;lt;none&amp;gt;
Session Affinity:         None
External Traffic Policy:  Cluster
&lt;/pre&gt;

&lt;p&gt;The main difference with NodePort is that a LoadBalancer can be accessed through a single external endpoint and will try to distribute requests equally across Nodes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eGQ-V_pF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-03-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eGQ-V_pF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/Kubernetes-services-03-1170x644.png" alt="Kubernetes Service LoadBalancer" width="880" height="484"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;ExternalName&lt;/h2&gt;

&lt;p&gt;The ExternalName service was introduced due to the need to connect to an element outside of the Kubernetes cluster. Think of it not as a way to connect to an item within your cluster, but as a connector to an element external to the cluster.&lt;/p&gt;

&lt;p&gt;This serves two purposes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It creates a single endpoint for all communications to that element.&lt;/li&gt;



&lt;li&gt;In case that external service needs to be replaced, it’s easier to switch by just modifying the ExternalName, instead of all connections.&lt;/li&gt;
&lt;/ul&gt;

&lt;pre&gt;apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: ExternalName
  externalName: db.myexternalserver.com
&lt;/pre&gt;

&lt;p&gt;Note that ExternalName services don’t use selectors: the service resolves directly to the external DNS name, so no Pods are selected.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Services are a key aspect of Kubernetes, as they provide a way to expose internal endpoints inside and outside of the cluster.&lt;/p&gt;

&lt;p&gt;The ClusterIP service just creates a connector for in-cluster communication. Use it only in case you have a specific application that needs to connect with others inside your cluster.&lt;/p&gt;

&lt;p&gt;NodePort and LoadBalancer are used for external access to your applications. It’s preferred to use LoadBalancer to equally distribute requests in multi-pod implementations, but note that your vendor should implement load balancing for this to be available.&lt;/p&gt;

&lt;p&gt;Apart from these, Kubernetes provides Ingresses, a way to create an HTTP connection with load balancing for external use.&lt;/p&gt;








&lt;h2&gt;&lt;em&gt;Debug service golden signals with Sysdig Monitor&lt;/em&gt;&lt;/h2&gt;

&lt;p&gt;With Sysdig Monitor, you can quickly debug:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Error rate&lt;/li&gt;
&lt;li&gt;Saturation&lt;/li&gt;
&lt;li&gt;Traffic&lt;/li&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And thanks to its Container Observability with eBPF, you can do this without adding any app or code instrumentation.&lt;/p&gt;

&lt;p&gt;Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;

&lt;p&gt;&lt;a&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NpnA6y3k--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://sysdig.com/wp-content/uploads/image-3-1.png" alt="How to debug a crashloopbackoff with Sysdig Monitor Advisor" width="880" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Sign up now for a &lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener"&gt;free trial of Sysdig Monitor&lt;/a&gt;&lt;/p&gt;



</description>
      <category>kubernetes</category>
      <category>services</category>
    </item>
    <item>
      <title>How attackers use exposed Prometheus server to exploit Kubernetes clusters</title>
      <dc:creator>Miguel</dc:creator>
      <pubDate>Fri, 02 Dec 2022 11:53:43 +0000</pubDate>
      <link>https://dev.to/sysdig/how-attackers-use-exposed-prometheus-server-to-exploit-kubernetes-clusters-3lek</link>
      <guid>https://dev.to/sysdig/how-attackers-use-exposed-prometheus-server-to-exploit-kubernetes-clusters-3lek</guid>
      <description>&lt;p&gt;What is the main thing we want to explain in this article? It’s simple; &lt;strong&gt;don’t expose your metrics for free&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Sometimes we think about deep and complex defense methods and that’s fine. We don’t know why, but we always forget about the basics. &lt;strong&gt;Don’t expose your data&lt;/strong&gt;. By default, your Prometheus server can allow anyone to make queries to get information from your Kubernetes Cluster.&lt;/p&gt;

&lt;p&gt;This is not something new. In 2018, &lt;a href="https://arstechnica.com/information-technology/2018/02/tesla-cloud-resources-are-hacked-to-run-cryptocurrency-mining-malware/" rel="noopener noreferrer"&gt;Tesla had a cryptocurrency mining application in their cloud account&lt;/a&gt;, and the initial access was an exposed Kubernetes Dashboard with credentials in the clear.&lt;/p&gt;

&lt;p&gt;Moreover, we are not the first to talk about (in) security in monitoring tools. Here are three good examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://github.com/juice-shop/juice-shop/issues/1275" rel="noopener noreferrer"&gt;Exposed Prometheus metrics Endpoint&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://jfrog.com/blog/dont-let-prometheus-steal-your-fire/" rel="noopener noreferrer"&gt;Don’t let Prometheus Steal your Fire&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.cncf.io/online-programs/a-look-at-how-hackers-exploit-prometheus-grafana-fluentd-jaeger-more/" rel="noopener noreferrer"&gt;Hacking Monitoring for Fun and Profit&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this in mind, are exposed Prometheus servers a real attack surface?&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus exposed in the wild
&lt;/h2&gt;

&lt;p&gt;One of the &lt;strong&gt;most important steps&lt;/strong&gt; in any pentest, ethical hacking, or real attack is &lt;strong&gt;gathering as much information as you can about the target&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The fastest way to check if something is exposed on the internet is to query Google. The specific queries used to gather information are known as &lt;a href="https://www.businessinsider.com/term-of-the-day-google-dorking-2014-8" rel="noopener noreferrer"&gt;Google Dorking&lt;/a&gt; and, in our case, make it trivial to find real exposed Prometheus servers.&lt;/p&gt;

&lt;p&gt;A cooler way to find exposed Prometheus servers is using device search engines. We used the most common ones to check how many servers we could access:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Search Engine&lt;/td&gt;
&lt;td&gt;Number of exposed Prometheus servers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.shodan.io/"&gt;Shodan&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;31,679&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a rel="noopener nofollow noreferrer" href="https://censys.io/"&gt;Censys&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;61,854&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a rel="noopener nofollow noreferrer" href="https://fofa.info/toLogin"&gt;Fofa&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;161,274&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At this point, we would like to clarify a critical fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer&lt;/strong&gt;: We have &lt;strong&gt;not used an actual exposed Prometheus server to consult or prepare for this talk&lt;/strong&gt;. We performed all testing in our demo environment and strongly recommend always following security best practices.&lt;/p&gt;

&lt;p&gt;After that, what can we do if we have access to an exposed Prometheus server? How can we use it to fingerprint the Kubernetes cluster?&lt;/p&gt;

&lt;h2&gt;
  
  
  Prometheus exporters and Kubernetes fingerprinting
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/" rel="noopener noreferrer"&gt;Prometheus&lt;/a&gt; is the &lt;em&gt;de facto&lt;/em&gt; &lt;a href="https://sysdig.com/blog/monitoring-kubernetes/" rel="noopener noreferrer"&gt;monitoring standard in Kubernetes&lt;/a&gt;. All the Kubernetes components of the control plane generate Prometheus metrics out of the box, and many Kubernetes distributions come with Prometheus installed by default including a series of standard exporters, generally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Node Exporter for infrastructure and host metrics.&lt;/li&gt;
&lt;li&gt;  KSM Exporter for Kubernetes objects state metrics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;An exporter&lt;/strong&gt; is an application that &lt;strong&gt;generates metrics&lt;/strong&gt; from other applications or systems that &lt;strong&gt;do not expose Prometheus metrics natively&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud provider, where are you?
&lt;/h3&gt;

&lt;p&gt;Imagine that you have a possible target in &lt;em&gt;&lt;a href="http://www.example.com" rel="noopener noreferrer"&gt;www.example.com&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;All you know is that this site is a web page with users and a little e-commerce section. Under that domain, you find an open exposed Prometheus. The first thing you can do is try to &lt;strong&gt;identify the cloud provider&lt;/strong&gt; where the site is hosted.&lt;/p&gt;

&lt;p&gt;You can use the metric &lt;code&gt;node_dmi_info&lt;/code&gt; from the Node Exporter. This metric is very interesting, as it gives information about each &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-101/what-is-a-kubernetes-node/" rel="noopener noreferrer"&gt;Kubernetes node&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;System vendor&lt;/strong&gt;: It exposes the cloud vendor’s name. Some example values could be “&lt;code&gt;Amazon EC2&lt;/code&gt;” or “&lt;code&gt;Tencent Cloud&lt;/code&gt;.”&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Product name&lt;/strong&gt;: Useful to identify both the cloud provider and the product used, as we can find some popular product names from the AWS EC2 catalog (like “&lt;code&gt;m5.xlarge&lt;/code&gt;“) or other vendors’ products.&lt;/li&gt;
&lt;/ul&gt;
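As a sketch, a query like the following summarizes vendor and product across nodes (the `system_vendor` and `product_name` label names are those exposed by recent Node Exporter versions; older versions may differ):

```
count by (system_vendor, product_name) (node_dmi_info)
```

A result such as `{system_vendor="Amazon EC2", product_name="m5.xlarge"}` immediately tells an attacker both the cloud provider and the instance type in use.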

&lt;p&gt;But the cloud provider, even if interesting, is still rather vague information. You can gather more if you focus on networking. You can start with the &lt;code&gt;node_network_info&lt;/code&gt; metric from the &lt;strong&gt;Node Exporter&lt;/strong&gt;, and you can narrow your search even further by filtering only the Ethernet interfaces.&lt;/p&gt;

&lt;p&gt;Why only Ethernet ones? Because usually, they are the ones that the host identifies as physical network connections and are used to connect the host with the outside world and other machines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node_network_info{device=~'eth.+'}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query provides the following information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  IP address of each host.&lt;/li&gt;
&lt;li&gt;  Device ID.&lt;/li&gt;
&lt;li&gt;  Availability zone of the cloud provider.&lt;/li&gt;
&lt;li&gt;  ID of the VPC (Virtual Private Cloud).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is an example of some possible values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    address="06:d5:XX:XX:XX:XX"
    broadcast="ff:ff:ff:ff:ff:ff"
    device="eth0"
    instance="172.31.XX.XX:9100"
    instance_az="us-west-2a"
    instance_id="i-XXXXX"
    instance_name="XXX-XXX"
    instance_type="c5.xlarge"
    instance_vpc="vpc-XXXXXXX"
    operstate="up"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also get more information, like the hostname of each node with the metric &lt;code&gt;kube_node_info&lt;/code&gt; from KSM.&lt;/p&gt;
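Querying that metric directly returns one series per node; depending on the kube-state-metrics version, the labels include values such as `node`, `kernel_version`, `os_image`, and `kubelet_version`:

```
kube_node_info
```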

&lt;h3&gt;
  
  
  The long and windy road to the pod
&lt;/h3&gt;

&lt;p&gt;This was all about physical info, but &lt;strong&gt;how can we get from outside the web page to a pod in the cluster?&lt;/strong&gt; The answer to this question is in the &lt;strong&gt;ingress&lt;/strong&gt; and the &lt;strong&gt;services&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;ingress controllers in Kubernetes&lt;/strong&gt; act as reverse proxies and allow redirecting different paths of the URL to different Kubernetes services. These services normally act as load balancers in front of a set of pods that expose a port for connections. The metric &lt;code&gt;kube_ingress_path&lt;/code&gt; from KSM will give you information about the URL paths and the associated services of the ingress controllers in your cluster.&lt;/p&gt;

&lt;p&gt;This way, you can know that the path &lt;code&gt;/api/users/login&lt;/code&gt; goes to the Kubernetes service &lt;code&gt;users-login&lt;/code&gt; of the namespace &lt;code&gt;api&lt;/code&gt;. Funny, right?&lt;/p&gt;
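Continuing the example, querying the metric directly lists every ingress path and its backing service; each resulting series carries labels such as `ingress`, `host`, `path`, `service_name`, and `service_port` (exact label set depends on the kube-state-metrics version):

```
kube_ingress_path{namespace="api"}
```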

&lt;p&gt;Load balancer services are a special kind of Kubernetes service that cloud providers use to expose a service to the outside world. For example, when you create a load-balancer service in an AWS Kubernetes cluster, it creates an ELB (Elastic Load Balancer) instance bound to the service.&lt;/p&gt;

&lt;p&gt;This &lt;strong&gt;PromQL&lt;/strong&gt; query will give you &lt;strong&gt;information about all the load-balancer services&lt;/strong&gt; in a Kubernetes cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube_service_info * on (service) group_left group by (service,type) (kube_service_spec_type{type="LoadBalancer"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To figure out which pods are behind each service, you have two options. You can check the metric &lt;code&gt;kube_pod_labels&lt;/code&gt; from KSM. These labels are usually the ones that the service uses to select the pods that will serve the requests but, unfortunately, there is no direct way to get the association between pods and services in pure KSM.&lt;/p&gt;

&lt;p&gt;However, if you are lucky enough, the cluster will have the OpenCost exporter installed, a tool that helps infrastructure engineers understand the costs of their cloud usage. This exporter generates an interesting metric called &lt;code&gt;service_selector_labels&lt;/code&gt;, which directly gives you the association between the service and the labels that a pod needs to have to be part of that particular service.&lt;/p&gt;

&lt;p&gt;This &lt;strong&gt;PromQL&lt;/strong&gt; query will give you the labels of each workload used for matching in services:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg by (namespace,label_app,owner_name)(kube_pod_labels{app="cost-model"} * on(namespace,pod) group_left(owner_name) kube_pod_owner{job="kube-state-metrics"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While this other one will give you the labels that each service uses to find the pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;avg by (namespace,label_app, service)(service_selector_labels)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since this is a many-to-many association, there is no easy way to collect all this info in a single &lt;strong&gt;PromQL&lt;/strong&gt; query, but the info is there, and it’s easy to make a quick correlation between services and pods.&lt;/p&gt;

&lt;p&gt;This way, we have all the points of the path from the URL to the pods: the path of the URL (thanks to the ingress), and pods serving the requests (thanks to the services and labels of the pods).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/exposed-Prometheus-Kubernetes-02.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fexposed-Prometheus-Kubernetes-02.png" alt="Exposed Prometheus to gather Kubernetes network information"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Logical song of the cluster
&lt;/h3&gt;

&lt;p&gt;You used the metric &lt;code&gt;kube_node_info&lt;/code&gt; to get information on the nodes, but now, you are also interested in making a logical map of namespaces, workloads, and pods inside the Kubernetes cluster.&lt;/p&gt;

&lt;p&gt;This is easy using the KSM metrics. The metric &lt;code&gt;kube_namespace_status_phase&lt;/code&gt; gives you all the namespaces in the cluster. From there, you can go down with the following metrics for each of the different workload types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;kube_deployment_spec_replicas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;kube_daemonset_status_desired_number_scheduled&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;kube_statefulset_replicas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;kube_replicaset_spec_replicas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;kube_cronjob_info&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
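
&lt;p&gt;As a sketch, you can count the deployments declared in each namespace with a query like the following; the same pattern works for the rest of the workload metrics above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (namespace) (kube_deployment_spec_replicas)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;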

&lt;p&gt;After that, you can get info on the pods using &lt;code&gt;kube_pod_info&lt;/code&gt;, and associate them with their workloads using &lt;code&gt;kube_pod_owner&lt;/code&gt; in the following &lt;strong&gt;PromQL&lt;/strong&gt; query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube_pod_info * on(namespace,pod) group_left(owner_name) kube_pod_owner
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, you can even get the containers inside each pod with the metric &lt;code&gt;kube_pod_container_info&lt;/code&gt;. For example, a pod called &lt;code&gt;postgres-db&lt;/code&gt; can have two containers named &lt;code&gt;postgresql&lt;/code&gt; and &lt;code&gt;postgres-exporter&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But there is more. Not only can you know the namespace and workload of a pod, you can also discover the node where it is running thanks to the label &lt;code&gt;node&lt;/code&gt; of the metric &lt;code&gt;kube_pod_info&lt;/code&gt;. Why is this important? Keep reading.&lt;/p&gt;

&lt;h3&gt;
  
  
  The boulevard of broken nodes
&lt;/h3&gt;

&lt;p&gt;You used the metric &lt;code&gt;kube_node_info&lt;/code&gt; before to get the hostname of each node, but this metric has more surprises in store.&lt;/p&gt;

&lt;p&gt;Two labels of this metric will give us full information about the operating system image used to build the node and the detailed kernel version:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;os_image&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;kernel_version&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A quick search on &lt;strong&gt;CVE&lt;/strong&gt; for “Ubuntu 18.04.4 LTS” or “Linux 3.10.0-1160.59.1.el7.x86_64” will give a possible attacker a &lt;strong&gt;good set of exploits&lt;/strong&gt; to use if they can get access to the machine.&lt;/p&gt;
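
&lt;p&gt;A query like this one (assuming a standard kube-state-metrics deployment) returns one series per node with both labels:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;group by (node, os_image, kernel_version) (kube_node_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;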

&lt;h3&gt;
  
  
  Let’s talk about K8s
&lt;/h3&gt;

&lt;p&gt;You have done a good job gathering information about the cluster so far: namespaces, pods, services, and more. But what about Kubernetes itself? There is a set of processes in Kubernetes that are just there, and we don’t even think about them unless they start causing problems. We are talking about the Kubernetes control plane.&lt;/p&gt;

&lt;p&gt;What would you say if we told you that there is a metric that specifies the exact version of each of the components of the control plane? While presenting Prometheus, we said that the Kubernetes control plane components natively expose metrics. Well, one of those metrics is &lt;code&gt;kubernetes_build_info&lt;/code&gt;. It gives you not only the full (major and minor) version of each component, but also the git commit and the build date.&lt;/p&gt;

&lt;p&gt;This is great if you want to know whether a specific vulnerability affects one of the control plane components of the cluster (among other things).&lt;/p&gt;
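
&lt;p&gt;For instance, a query like the following (the &lt;code&gt;job&lt;/code&gt; label depends on how your Prometheus scrapes the control plane) groups each scraped component by its exact build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;group by (job, git_version, git_commit, build_date) (kubernetes_build_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;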

&lt;h3&gt;
  
  
  We have a secret…
&lt;/h3&gt;

&lt;p&gt;Everybody loves secrets, especially attackers. In KSM, there is a metric called &lt;code&gt;kube_secret_info&lt;/code&gt; that gives you information about the namespace, node, and name of the secrets of your cluster.&lt;/p&gt;
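
&lt;p&gt;As a quick sketch, this query counts how many secrets live in each namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (namespace) (kube_secret_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;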

&lt;p&gt;But if you are interested in knowing the content of the secrets, you can use this query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kube_secret_annotations{kubectl_kubernetes_io_last_applied_configuration != ""}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Why? Well, this is somewhat embarrassing. Some older versions of &lt;code&gt;kubectl&lt;/code&gt; used to save &lt;a href="https://kubernetes.io/docs/tasks/manage-kubernetes-objects/declarative-config/" rel="noopener noreferrer"&gt;the last applied configuration in an annotation&lt;/a&gt;. This was done for every object, including secrets. As a result, even if the secret was only accessible to the expected service accounts and role bindings, Prometheus could expose the content of the secret in plain text in that metric.&lt;/p&gt;

&lt;h3&gt;
  
  
  On images and registries
&lt;/h3&gt;

&lt;p&gt;Do you think you’ve had enough? There is one more interesting thing you can get from KSM. The metric &lt;code&gt;kube_pod_container_info&lt;/code&gt; has interesting pieces of information in these labels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;code&gt;image&lt;/code&gt;: name and tag of the image of the container (for example &lt;code&gt;docker.io/library/cassandra:3.11.6&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;  &lt;code&gt;image_id&lt;/code&gt;: name, tag, and hash of the image of the container&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you information about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Application used.&lt;/li&gt;
&lt;li&gt;  Registry used to pull the image.&lt;/li&gt;
&lt;li&gt;  Image used.&lt;/li&gt;
&lt;li&gt;  Tag of the image.&lt;/li&gt;
&lt;li&gt;  Hash that uniquely identifies the image, independently of its tag.&lt;/li&gt;
&lt;/ul&gt;
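
&lt;p&gt;For example, this query gives a rough inventory of all the images (and, implicitly, the registries) running in the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;count by (image) (kube_pod_container_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;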

&lt;h3&gt;
  
  
  Summary: Kubernetes fingerprint
&lt;/h3&gt;

&lt;p&gt;Let’s see what you’ve done so far. You gathered information about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Cloud provider.&lt;/li&gt;
&lt;li&gt;  Kubernetes control plane component versions.&lt;/li&gt;
&lt;li&gt;  Network path from the outside to pods.&lt;/li&gt;
&lt;li&gt;  Node hostnames and IPs.&lt;/li&gt;
&lt;li&gt;  Operating system and kernel versions.&lt;/li&gt;
&lt;li&gt;  Logical structure of the cluster: namespaces, workloads, and pods.&lt;/li&gt;
&lt;li&gt;  Images used for the containers, from the source repository to the image tag.&lt;/li&gt;
&lt;li&gt;  Annotations and names of the secrets of the cluster.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/exposed-Prometheus-Kubernetes-03.gif" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fexposed-Prometheus-Kubernetes-03.gif" alt="Exposed Prometheus fingerprinting Kubernetes CVEs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All this information is enough to make a good attack surface analysis of the cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ninja mode!
&lt;/h3&gt;

&lt;p&gt;Do you want to hear something funny? &lt;strong&gt;We gathered all this information and, most likely, there is no trace of the queries we ran to get it&lt;/strong&gt;. Prometheus can &lt;strong&gt;log the queries it serves&lt;/strong&gt;, but that’s &lt;strong&gt;disabled by default&lt;/strong&gt;. You can even check whether your activity is being logged with this metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;prometheus_engine_query_log_enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Inside the attackers’ minds
&lt;/h2&gt;

&lt;p&gt;Now, attackers just need to know what their target is. In the vast majority of attacks it’s money, but how to get that money from the victim is what defines the attacker’s path.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/exposed-Prometheus-Kubernetes-04.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fexposed-Prometheus-Kubernetes-04.png" alt="Three Kubernetes threats: leaks, crypto mining and ransomware"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the talk, we presented three examples, and in each of them the tools and services exploited are different. The important thing is that we already know where the weaknesses are.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leak sensitive data
&lt;/h3&gt;

&lt;p&gt;In the first scenario, the exposed application is running on a Kubernetes cluster and the attacker wants to access its data without authorization. The first thing the attacker could check is whether the application can be exploited through normal pentesting techniques; for example, with &lt;a href="https://sqlmap.org/" rel="noopener noreferrer"&gt;SQLmap&lt;/a&gt;, the attacker can try to gain access to the data.&lt;/p&gt;

&lt;p&gt;But if this does not work, what is the next step?&lt;/p&gt;

&lt;p&gt;The attacker can check if the container has vulnerable dependencies or if the image used could be exploited, then see if the components or the node itself are exploitable. But everything seems to be fine. There are no &lt;strong&gt;CVE&lt;/strong&gt; matches and no known exploits that could be used to gain initial access.&lt;/p&gt;

&lt;p&gt;What’s next? Well, Prometheus already exposed the image and registry to the attacker, so what about attacking the supply chain? In this case, we have two scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Official/private registry&lt;/strong&gt;: In this case, the attacker could use similar image names, such as &lt;a href="https://github.com/bbvanexttechnologies/deep-confusables-cli/" rel="noopener noreferrer"&gt;homographs&lt;/a&gt; (visually similar names built with different Unicode groups), to trick the target. Another technique could be to abuse an insider to manually change the exposed image. Which one the attacker picks depends on their expected financial gain.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Third-party registry&lt;/strong&gt;: In this case, one of the methods could be social engineering, using tools like &lt;a href="https://beefproject.com/" rel="noopener noreferrer"&gt;BeeF&lt;/a&gt; to create a specific phishing or fake page to get the login credentials, change the image to a new one with a known and exploitable vulnerability, and wait for the deployment. Keep in mind that this is not magic, nor 100% successful: if the company scans the images at deployment, it could be detected!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cryptomining
&lt;/h3&gt;

&lt;p&gt;In this scenario, one of the most relevant in recent years with the rise of the cloud, the attacker wants to get access to the cloud account where the application or Kubernetes cluster is deployed. The attacker could take two paths. The long path is to identify an app exposed via ingress controller that has a known vulnerability easily exploited via HTTP, and obtain Remote Code Execution inside the container.&lt;/p&gt;

&lt;p&gt;The vulnerability exploited in this case will be the infamous &lt;a href="https://sysdig.com/blog/exploit-detect-mitigate-log4j-cve/" rel="noopener noreferrer"&gt;log4shell&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Once the attacker has access to the container, they don’t even need to gather more information about the cluster or the node, because Prometheus exposed this information as well. From there, they could directly exploit another vulnerability to &lt;a href="https://sysdig.com/blog/container-escape-capabilities-falco-detection/" rel="noopener noreferrer"&gt;escape the container&lt;/a&gt; and get full access to the node without using more tools or scanning, evading typical defense methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: This is not 100% successful. If &lt;a href="https://sysdig.com/products/secure/runtime-security/" rel="noopener noreferrer"&gt;runtime security&lt;/a&gt; is used and the shell within the container is detected as malicious behavior, the incident will be detected before impacting resources.&lt;/p&gt;

&lt;p&gt;Now that the attacker has full control of the node, they will be able to deploy containers to run cryptominers, or find cloud credentials in configuration files or env variables to gain initial access.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/exposed-Prometheus-Kubernetes-05.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fexposed-Prometheus-Kubernetes-05.png" alt="Attacker path exploit Kubernetes Cluster by exposed Prometheus server"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But this is the long way; what’s the short way? Well, it is possible for Prometheus to directly expose credentials to these cloud providers, in the same way that the Kubernetes Dashboard did in the past. In that case, the attacker only needs to run a query and get the API keys in clear text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ransomware
&lt;/h3&gt;

&lt;p&gt;Yes, ransomware in Kubernetes is not typical, but it’s not impossible. The scenario is similar to the previous one: we need to get write access, and for that, we need to jump or move between namespaces.&lt;/p&gt;

&lt;p&gt;In this case, we find another application with a different vulnerability, &lt;a href="https://sysdig.com/blog/cve-2022-22963-spring-cloud/" rel="noopener noreferrer"&gt;Spring Cloud&lt;/a&gt;, but with the same purpose: to get a shell inside the container.&lt;/p&gt;

&lt;p&gt;Once inside, we know that a Kubernetes component is running an old, vulnerable version that we can exploit to get access to etcd, and with that, full access to the namespaces.&lt;/p&gt;

&lt;p&gt;The curious thing here is that after the data is encrypted, the attacker needs to ask for the ransom through some channel. In a typical scenario, our PC would be locked and the screen would show instructions to pay via BTC or ETH, but how do you do that from inside a container? We hate to share ideas with the bad guys, but one option could be to deploy a container with a modified UI and force the ingress to display it in front of the actual application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We might think that metrics are not important from a security perspective, but we demonstrated that’s not true. &lt;strong&gt;Kubernetes&lt;/strong&gt; and &lt;strong&gt;Prometheus&lt;/strong&gt; warn about the problems of exposing your data to the world but, regardless, these problems are still widespread.&lt;/p&gt;

&lt;p&gt;Following the &lt;strong&gt;security best practices&lt;/strong&gt; in every part of our chain keeps us safe from most security incidents. Otherwise, we turn the typical scenario, a long battle between attackers and defenders, into a speedrun.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/exposed-Prometheus-Kubernetes-06.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fexposed-Prometheus-Kubernetes-06.png" alt="Kubernetes threats speedrun"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We will have to keep fighting new vulnerabilities that impact our services, and we also need a plan against insiders. But let’s at least make things difficult for them.&lt;/p&gt;

&lt;p&gt;If you want to see the talk:&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/5cbbm_L6n7w"&gt;
&lt;/iframe&gt;
&lt;br&gt;
The slides are available &lt;a href="https://sched.co/ytmB" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>security</category>
      <category>monitoring</category>
      &lt;category&gt;kubernetes&lt;/category&gt;
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes 1.26 What's new?</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Thu, 01 Dec 2022 09:19:05 +0000</pubDate>
      <link>https://dev.to/sysdig/kubernetes-126-whats-new-4736</link>
      <guid>https://dev.to/sysdig/kubernetes-126-whats-new-4736</guid>
      <description>&lt;p&gt;&lt;strong&gt;Kubernetes 1.26&lt;/strong&gt; is about to be released, and it comes packed with novelties! Where do we begin?&lt;/p&gt;

&lt;p&gt;This release brings 37 enhancements, on par with the &lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/" rel="noopener noreferrer"&gt;40 in Kubernetes 1.25&lt;/a&gt; and the &lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/" rel="noopener noreferrer"&gt;46 in Kubernetes 1.24&lt;/a&gt;. Of those 37 enhancements, 11 are graduating to Stable, 10 are existing features that keep improving, 16 are completely new, and one is a deprecated feature.&lt;/p&gt;

&lt;p&gt;Watch out for all the &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#deprecations" rel="noopener noreferrer"&gt;deprecations and removals in this version&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;Two new features stand out in this release that have the potential to &lt;strong&gt;change the way users interact with Kubernetes&lt;/strong&gt;. One of them is being able to &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3294" rel="noopener noreferrer"&gt;provision volumes from snapshots in other namespaces&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There are also new features aimed at &lt;strong&gt;high-performance workloads&lt;/strong&gt;, like scientific research or machine learning, such as better control over &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3545" rel="noopener noreferrer"&gt;what physical CPU cores your workloads run on&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, other features will make life easier for cluster administrators, like &lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3515" rel="noopener noreferrer"&gt;support for OpenAPIv3&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We are really hyped about this release!&lt;/p&gt;

&lt;p&gt;There is plenty to talk about, so let's get started with what’s new in Kubernetes 1.26.&lt;/p&gt;

&lt;h2 id="editors"&gt;Kubernetes 1.26 – Editor’s pick:&lt;/h2&gt;

&lt;p&gt;These are the features that look most exciting to us in this release (ymmv):&lt;/p&gt;

&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3294" rel="noopener noreferrer"&gt;#3294&lt;/a&gt; Provision volumes from cross-namespace snapshots&lt;/h3&gt;





&lt;p&gt;The &lt;em&gt;VolumeSnapshot&lt;/em&gt; feature allows Kubernetes users to provision volumes from volume snapshots, providing great benefits for users and applications, like enabling database administrators to snapshot a database before any critical operation, or the ability to develop and implement backup solutions.&lt;/p&gt;





&lt;p&gt;Starting in Kubernetes 1.26 as an Alpha feature, users will be able to create a &lt;em&gt;PersistentVolumeClaim&lt;/em&gt; from a &lt;em&gt;VolumeSnapshot&lt;/em&gt; across namespaces, breaking the initial limitation of having both objects in the same namespace.&lt;/p&gt;





&lt;p&gt;This enhancement eliminates the constraints that prevented users and applications from performing fundamental tasks, like saving a database checkpoint when applications and services are in different namespaces.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/v%C3%ADctor-hernando-martin-49836334/"&gt;Víctor Hernando&lt;/a&gt; - Sr. Technical Marketing Manager at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3488" rel="noopener noreferrer"&gt;#3488&lt;/a&gt; CEL for admission control&lt;/h3&gt;





&lt;p&gt;Finally, a practical implementation of the validation expression language from Kubernetes 1.25!&lt;/p&gt;





&lt;p&gt;By defining rules for the admission controller as Kubernetes objects, we can start forgetting about managing webhooks, simplifying the setup of our clusters. Not only that, but implementing &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-security/kubernetes-security-101/" rel="noopener noreferrer"&gt;Kubernetes security&lt;/a&gt; is a bit easier now.&lt;/p&gt;





&lt;p&gt;We love to see these user-friendly improvements. They are the key to keep growing Kubernetes adoption.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/victorjimenez/"&gt;Víctor Jiménez Cerrada&lt;/a&gt; - Content Engineering Manager at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3466" rel="noopener noreferrer"&gt;#3466&lt;/a&gt; Kubernetes component health SLIs&lt;/h3&gt;





&lt;p&gt;Since Kubernetes 1.26, you can configure Service Level Indicator (SLI) metrics for the Kubernetes components binaries. Once you enable them, Kubernetes will expose the SLI metrics in the &lt;code&gt;/metrics/slis&lt;/code&gt; endpoint - so you won't need a Prometheus exporter. This can take &lt;a href="https://sysdig.com/blog/kubernetes-monitoring-prometheus/" rel="noopener noreferrer"&gt;Kubernetes monitoring&lt;/a&gt; to another level making it easier to create health dashboards and configure PromQL alerts to assure your cluster's stability.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://twitter.com/eckelon"&gt;Jesús Ángel Samitier&lt;/a&gt; - Integrations Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#2371" rel="noopener noreferrer"&gt;#2371&lt;/a&gt; cAdvisor-less, CRI-full container and &lt;em&gt;Pod&lt;/em&gt; stats&lt;/h3&gt;





&lt;p&gt;Currently, to gather metrics from containers, such as CPU or memory consumed, Kubernetes relies on cAdvisor. This feature presents an alternative, enriching the CRI API to provide all the metrics from the containers, allowing more flexibility and better accuracy. After all, it's the container runtime that best knows the behavior of the container.&lt;/p&gt;





&lt;p&gt;This feature represents one more step on the roadmap to remove cAdvisor from Kubernetes code. However, during this transition, cAdvisor will be modified not to generate the metrics added to the CRI API, avoiding duplicated metrics with possible different and incoherent values.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://twitter.com/maellyssa"&gt;David de Torres Huerta&lt;/a&gt; – Engineer Manager at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3063" rel="noopener noreferrer"&gt;#3063&lt;/a&gt; Dynamic resource allocation&lt;/h3&gt;





&lt;p&gt;This new Kubernetes release introduces a new Alpha feature which will provide extended resource management for advanced hardware. As a cherry on top, it comes with a user-friendly API to describe resource requests. With the increasing demand to process different hardware components, like GPU or FPGA, and the need to set up initialization and cleanup, this new feature will speed up Kubernetes adoption in areas like scientific research or edge computing.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/javier-martinez-2b2a955/"&gt;Javier Martínez&lt;/a&gt; - Devops Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3545" rel="noopener noreferrer"&gt;#3545&lt;/a&gt; Improved multi-numa alignment in Topology Manager&lt;/h3&gt;





&lt;p&gt;This is yet another feature aimed at high performance workloads, like those involved in scientific computing. We are seeing the new CPU manager taking shape since Kubernetes 1.22 and 1.23, enabling developers to keep their workloads close to where their data is stored in memory, improving performance. Kubernetes 1.26 goes a step further, opening the door to further customizations for this feature. After all, not all workloads and CPU architectures are the same.&lt;/p&gt;





&lt;p&gt;The future of HPC on Kubernetes is looking quite promising, indeed.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/vjjmiras/"&gt;Vicente J. Jiménez Miras&lt;/a&gt; – Security Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3335" rel="noopener noreferrer"&gt;#3335&lt;/a&gt; Allow StatefulSet to control start replica ordinal numbering&lt;/h3&gt;





&lt;p&gt;&lt;em&gt;StatefulSets&lt;/em&gt; in Kubernetes are often critical backend services, like clustered databases or message queues.&lt;br&gt;This enhancement, seemingly a trivial numbering change, allows for greater flexibility and enables new techniques for rolling cross-namespace or even cross-cluster migrations of the replicas of the &lt;em&gt;StatefulSet&lt;/em&gt; &lt;strong&gt;without any downtime&lt;/strong&gt;. While the process might seem a bit clunky, involving careful definition of &lt;em&gt;PodDisruptionBudgets&lt;/em&gt; and the moving of resources relative to the migrating replica, we can surely envision tools (or enhancements to existing operators) that automate these operations for &lt;strong&gt;seamless migrations&lt;/strong&gt;, in stark contrast with the cold-migration strategy (shutdown-backup-restore) that is currently possible.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/danielsimionato/"&gt;Daniel Simionato&lt;/a&gt; - Security Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3325" rel="noopener noreferrer"&gt;#3325&lt;/a&gt; Auth API to get self user attributes&lt;/h3&gt;





&lt;p&gt;This new feature coming to alpha will simplify cluster administrators' work, especially when they are managing multiple clusters. It will also assist in complex authentication flows, as it lets users query their user information or permissions inside the cluster.&lt;/p&gt;





&lt;p&gt;Also, this works whether you are using a proxy (the Kubernetes API server fills in the &lt;code&gt;userInfo&lt;/code&gt; after all authentication mechanisms are applied) or impersonating a user (you receive the details and properties of the user that was impersonated), so you can get your user information very easily.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/miguelhzbz/"&gt;Miguel Hernández&lt;/a&gt; - Security Content Engineer at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h3&gt;
&lt;a href="http://sysdig.com/blog/kubernetes-1-26-whats-new/#3352" rel="noopener noreferrer"&gt;#3352&lt;/a&gt; Aggregated Discovery&lt;/h3&gt;





&lt;p&gt;This is a tiny change for users, but one more step toward cleaning up the Kubernetes internals and improving their performance. Reducing the number of API calls by aggregating them (at least for the discovery part) is a nice solution to a growing problem. Hopefully, this will give cluster administrators a small break.&lt;/p&gt;





&lt;p&gt;&lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://www.linkedin.com/in/ddok/"&gt;Devid Dokash&lt;/a&gt; - Content Engineering Intern at Sysdig&lt;/em&gt;&lt;/p&gt;





&lt;h2 id="deprecations"&gt;Deprecations&lt;/h2&gt;





&lt;p&gt;A few beta APIs and features have been removed in Kubernetes 1.26, including:&lt;/p&gt;





&lt;p&gt;&lt;strong&gt;Deprecated &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/using-api/deprecation-guide/#v1-25"&gt;API versions&lt;/a&gt;&lt;/strong&gt; that are no longer served, and you should use a newer one:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;CRI &lt;code&gt;v1alpha2&lt;/code&gt;, use &lt;code&gt;v1&lt;/code&gt; (&lt;em&gt;containerd&lt;/em&gt; version 1.5 and older are not supported).&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;flowcontrol.apiserver.k8s.io/v1beta1&lt;/code&gt;, use &lt;code&gt;v1beta2&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;autoscaling/v2beta2&lt;/code&gt;, use &lt;code&gt;v2&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Deprecated&lt;/strong&gt;. Implement an alternative before the next release goes out:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;In-tree GlusterFS driver.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;kubectl --prune-whitelist&lt;/code&gt;, use &lt;code&gt;--prune-allowlist&lt;/code&gt; instead.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;kube-apiserver --master-service-namespace&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Several unused options for &lt;code&gt;kubectl run&lt;/code&gt;: &lt;code&gt;--cascade&lt;/code&gt;, &lt;code&gt;--filename&lt;/code&gt;, &lt;code&gt;--force&lt;/code&gt;, &lt;code&gt;--grace-period&lt;/code&gt;, &lt;code&gt;--kustomize&lt;/code&gt;, &lt;code&gt;--recursive&lt;/code&gt;, &lt;code&gt;--timeout&lt;/code&gt;, &lt;code&gt;--wait&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;CLI flag &lt;code&gt;pod-eviction-timeout&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;The &lt;code&gt;apiserver_request_slo_duration_seconds&lt;/code&gt; metric, use &lt;code&gt;apiserver_request_sli_duration_seconds&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Removed&lt;/strong&gt;. Implement an alternative before upgrading:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;Legacy authentication plugins for Azure and Google Cloud.&lt;/li&gt;



&lt;li&gt;The &lt;code&gt;userspace&lt;/code&gt; proxy mode.&lt;/li&gt;



&lt;li&gt;Dynamic &lt;em&gt;kubelet&lt;/em&gt; configuration.&lt;/li&gt;



&lt;li&gt;Several &lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#2845" rel="noopener noreferrer"&gt;command line arguments related to logging&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;in-tree OpenStack (&lt;code&gt;cinder&lt;/code&gt; volume type), use &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/cloud-provider-openstack"&gt;the CSI driver&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Other changes&lt;/strong&gt; you should adapt your configs for:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;Pod Security admission: the pod-security &lt;code&gt;warn&lt;/code&gt; level will now default to the &lt;code&gt;enforce&lt;/code&gt; level.&lt;/li&gt;



&lt;li&gt;kubelet: The default &lt;code&gt;cpuCFSQuotaPeriod&lt;/code&gt; value with the &lt;code&gt;cpuCFSQuotaPeriod&lt;/code&gt; flag enabled is now 100µs instead of 100ms.&lt;/li&gt;



&lt;li&gt;kubelet: The &lt;code&gt;--container-runtime-endpoint&lt;/code&gt; flag cannot be empty anymore.&lt;/li&gt;



&lt;li&gt;kube-apiserver: gzip compression switched from level 4 to level 1.&lt;/li&gt;



&lt;li&gt;Metrics: Changed &lt;code&gt;preemption_victims&lt;/code&gt; from &lt;code&gt;LinearBuckets&lt;/code&gt; to &lt;code&gt;ExponentialBuckets&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Metrics: &lt;code&gt;etcd_db_total_size_in_bytes&lt;/code&gt; is renamed to &lt;code&gt;apiserver_storage_db_total_size_in_bytes&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Metrics: &lt;code&gt;kubelet_kubelet_credential_provider_plugin_duration&lt;/code&gt; is renamed &lt;code&gt;kubelet_credential_provider_plugin_duration&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Metrics: &lt;code&gt;kubelet_kubelet_credential_provider_plugin_errors&lt;/code&gt; is renamed &lt;code&gt;kubelet_credential_provider_plugin_errors&lt;/code&gt;.&lt;/li&gt;



&lt;li&gt;Removed Windows Server, Version 20H2 flavors from various container images.&lt;/li&gt;



&lt;li&gt;The e2e.test binary no longer emits JSON structs to document progress.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;You can check the full list of changes in the &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.26.md"&gt;Kubernetes 1.26 release notes&lt;/a&gt;. Also, we recommend the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/blog/2022/11/18/upcoming-changes-in-kubernetes-1-26/"&gt;Kubernetes Removals and Deprecations In 1.26&lt;/a&gt; article, as well as keeping the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/using-api/deprecation-guide/"&gt;deprecated API migration guide&lt;/a&gt; close for the future.&lt;/p&gt;





&lt;h3 id="281"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/281"&gt;#281&lt;/a&gt; Dynamic Kubelet configuration&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;/p&gt;





&lt;p&gt;After keeping this feature in beta since Kubernetes 1.11, the Kubernetes team &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/kubernetes/issues/100799"&gt;has decided&lt;/a&gt; to deprecate &lt;code&gt;DynamicKubeletConfig&lt;/code&gt; instead of continuing its development.&lt;/p&gt;





&lt;p&gt;This feature was marked for deprecation in 1.21, then removed from the Kubelet in 1.24. Now in 1.26, it has been &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/pull/3605/files#diff-138ec4a122ef9ea3b885191796faf63ca6511747e4be18840dd67ffa2a386d1d"&gt;completely removed from Kubernetes&lt;/a&gt;.&lt;/p&gt;





&lt;h2 id="api"&gt;Kubernetes 1.26 API&lt;/h2&gt;





&lt;h3 id="3352"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3352"&gt;#3352&lt;/a&gt; Aggregated discovery&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; api-machinery&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;AggregatedDiscoveryEndpoint&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Every Kubernetes client like &lt;code&gt;kubectl&lt;/code&gt; needs to discover what APIs, and which versions of those APIs, are available in the &lt;code&gt;kubernetes-apiserver&lt;/code&gt;. To do so, they make a request for each API group and version, which causes a storm of requests.&lt;/p&gt;





&lt;p&gt;This enhancement aims to reduce all those calls to just two.&lt;/p&gt;





&lt;p&gt;Clients can include &lt;code&gt;as=APIGroupDiscoveryList&lt;/code&gt; to the &lt;code&gt;Accept&lt;/code&gt; field of their requests to the &lt;code&gt;/api&lt;/code&gt; and &lt;code&gt;/apis&lt;/code&gt; endpoints. Then, the server will return an aggregated document (&lt;code&gt;APIGroupDiscoveryList&lt;/code&gt;) with all the available APIs and their versions.&lt;/p&gt;





&lt;h3 id="3488"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3488"&gt;#3488&lt;/a&gt; CEL for admission control&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; api-machinery&lt;/p&gt;





&lt;p&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Building on &lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#2876" rel="noopener noreferrer"&gt;#2876 CRD validation expression language&lt;/a&gt; from Kubernetes 1.25, this enhancement provides a new admission controller type (&lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt;) that allows implementing some validations without relying on webhooks.&lt;/p&gt;





&lt;p&gt;These new policies can be defined like this:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicy
metadata:
  name: "demo-policy.example.com"
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups:   ["apps"]
      apiVersions: ["v1"]
      operations:  ["CREATE", "UPDATE"]
      resources:   ["deployments"]
  validations:
    - expression: "object.spec.replicas &amp;lt;= 5"
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;This policy would deny create and update requests for Deployments with more than &lt;code&gt;5&lt;/code&gt; replicas.&lt;/p&gt;
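&lt;p&gt;On its own, a &lt;code&gt;ValidatingAdmissionPolicy&lt;/code&gt; does nothing until it is bound to a scope with a &lt;code&gt;ValidatingAdmissionPolicyBinding&lt;/code&gt;. A minimal sketch, where the binding name and the namespace selector are illustrative:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: admissionregistration.k8s.io/v1alpha1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: "demo-binding.example.com"
spec:
  policyName: "demo-policy.example.com"
  matchResources:
    namespaceSelector:
      matchLabels:
        environment: test   # only enforce the policy in matching namespaces
&lt;/code&gt;&lt;/pre&gt;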





&lt;p&gt;Discover the full power of this feature &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/website/blob/9a8c421c1e18dd9485788f1ffc23944c41e91483/content/en/docs/reference/access-authn-authz/validating-admission-policy.md"&gt;in the docs&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="1965"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1965"&gt;#1965&lt;/a&gt; &lt;em&gt;kube-apiserver&lt;/em&gt; identity&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; api-machinery&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;APIServerIdentity&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;In order to better control which &lt;em&gt;kube-apiservers&lt;/em&gt; are alive in a high availability cluster, a new lease / heartbeat system has been implemented.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/#1965" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.20&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="apps"&gt;Apps in Kubernetes 1.26&lt;/h2&gt;





&lt;h3 id="3017"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3017"&gt;#3017&lt;/a&gt; &lt;em&gt;PodHealthyPolicy&lt;/em&gt; for &lt;em&gt;PodDisruptionBudget&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PDBUnhealthyPodEvictionPolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;A &lt;em&gt;&lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/tasks/run-application/configure-pdb/"&gt;PodDisruptionBudget&lt;/a&gt;&lt;/em&gt; allows you to communicate some minimums to your cluster administrator to make maintenance tasks easier, like "Do not destroy more than one of these" or "Keep at least two of these alive".&lt;/p&gt;

&lt;p&gt;However, this only takes into account whether the pods are running, not whether they are healthy. It may happen that your pods are Running but not Ready, and a &lt;em&gt;PodDisruptionBudget&lt;/em&gt; may be preventing their eviction.&lt;/p&gt;

&lt;p&gt;This enhancement expands these budget definitions with the &lt;code&gt;status.currentHealthy&lt;/code&gt;, &lt;code&gt;status.desiredHealthy&lt;/code&gt;, and &lt;code&gt;spec.unhealthyPodEvictionPolicy&lt;/code&gt; extra fields to help you define how to manage unhealthy pods.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;$ kubectl get poddisruptionbudgets example-pod -o yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
[...]
spec:
  unhealthyPodEvictionPolicy: IfHealthyBudget
[...]
status:
  currentHealthy: 3
  desiredHealthy: 2
  disruptionsAllowed: 1
  expectedPods: 3
  observedGeneration: 1
&lt;/code&gt;&lt;/pre&gt;

&lt;h3 id="3335"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3335"&gt;#3335&lt;/a&gt; Allow &lt;em&gt;StatefulSet&lt;/em&gt; to control start replica ordinal numbering&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;StatefulSetStartOrdinal&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;&lt;em&gt;StatefulSets&lt;/em&gt; in Kubernetes currently number their pods using ordinal numbers, with the first replica being &lt;code&gt;0&lt;/code&gt; and the last being &lt;code&gt;spec.replicas - 1&lt;/code&gt;.&lt;/p&gt;





&lt;p&gt;This enhancement adds a new struct with a single field to the &lt;em&gt;StatefulSet&lt;/em&gt; manifest spec, &lt;code&gt;spec.ordinals.start&lt;/code&gt;, which lets you define the starting ordinal for the replicas controlled by the &lt;em&gt;StatefulSet&lt;/em&gt;.&lt;/p&gt;





&lt;p&gt;This is useful, for example, in cross-namespace or cross-cluster migrations of a &lt;em&gt;StatefulSet&lt;/em&gt;, where a clever use of &lt;em&gt;PodDisruptionBudgets&lt;/em&gt; (and multi-cluster services) can allow a controlled rolling migration of the replicas without any downtime for the &lt;em&gt;StatefulSet&lt;/em&gt;.&lt;/p&gt;
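&lt;p&gt;As a sketch, assuming the &lt;code&gt;StatefulSetStartOrdinal&lt;/code&gt; feature gate is enabled, a &lt;em&gt;StatefulSet&lt;/em&gt; like the following (names and image are illustrative) would create pods named &lt;code&gt;web-5&lt;/code&gt;, &lt;code&gt;web-6&lt;/code&gt;, and &lt;code&gt;web-7&lt;/code&gt;:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  serviceName: web
  replicas: 3
  ordinals:
    start: 5   # replicas are numbered 5, 6, 7 instead of 0, 1, 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: nginx
        image: nginx
&lt;/code&gt;&lt;/pre&gt;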





&lt;h3 id="3329"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3329"&gt;#3329&lt;/a&gt; Retriable and non-retriable &lt;em&gt;Pod&lt;/em&gt; failures for &lt;em&gt;Jobs&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;JobPodFailurePolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PodDisruptionConditions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement allows us to configure a &lt;code&gt;.spec.podFailurePolicy&lt;/code&gt; on a &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/concepts/workloads/controllers/job/"&gt;Job&lt;/a&gt;'s spec that determines whether the Job should be retried in case of failure. This way, Kubernetes can terminate Jobs early, avoiding an ever-increasing backoff time in case of infrastructure failures or application errors.&lt;/p&gt;
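&lt;p&gt;As an illustrative sketch, the following &lt;em&gt;Job&lt;/em&gt; fails immediately on a non-retriable application error while infrastructure-caused pod disruptions do not count against the backoff limit (the container name, image, and exit code are hypothetical):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: batch/v1
kind: Job
metadata:
  name: example-job
spec:
  backoffLimit: 6
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: main
        image: busybox
        command: ["sh", "-c", "exit 42"]
  podFailurePolicy:
    rules:
    - action: FailJob        # do not retry on this application error
      onExitCodes:
        containerName: main
        operator: In
        values: [42]
    - action: Ignore         # infrastructure disruptions do not consume retries
      onPodConditions:
      - type: DisruptionTarget
&lt;/code&gt;&lt;/pre&gt;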

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#3329" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.25&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="2307"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2307"&gt;#2307&lt;/a&gt; &lt;em&gt;Job&lt;/em&gt; tracking without lingering &lt;em&gt;Pods&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; apps&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;JobTrackingWithFinalizers&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;With this enhancement, Jobs will be able to remove completed pods earlier, freeing resources in the cluster.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2307" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h2 id="auth"&gt;Kubernetes 1.26 Auth&lt;/h2&gt;

&lt;h3 id="3325"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3325"&gt;#3325&lt;/a&gt; Auth &lt;em&gt;API&lt;/em&gt; to get self user attributes&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; auth&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;APISelfSubjectAttributesReview&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This new feature is extremely useful when a complicated authentication flow is used in a Kubernetes cluster and you want to know all your &lt;code&gt;userInfo&lt;/code&gt; after all authentication mechanisms are applied.&lt;/p&gt;





&lt;p&gt;Executing &lt;code&gt;kubectl alpha auth whoami&lt;/code&gt; will produce the following output:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: authentication.k8s.io/v1alpha1
kind: SelfSubjectReview
status:
  userInfo:
    username: jane.doe
    uid: b79dbf30-0c6a-11ed-861d-0242ac120002
    groups:
    - students
    - teachers
    - system:authenticated
    extra:
      skills:
      - reading
      - learning
      subjects:
      - math
      - sports
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;In summary, we are now allowed to do a typical &lt;em&gt;/me&lt;/em&gt; request to know our own user attributes once we are authenticated in the cluster.&lt;/p&gt;





&lt;h3 id="2799"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2799"&gt;#2799&lt;/a&gt; Reduction of secret-based service account tokens&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; auth&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;LegacyServiceAccountTokenNoAutoGeneration&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;API credentials are now obtained through the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/"&gt;TokenRequest API&lt;/a&gt;, stable since &lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#542" rel="noopener noreferrer"&gt;Kubernetes 1.22&lt;/a&gt;, and are mounted into Pods using a projected volume. They are automatically invalidated when their associated Pod is deleted.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#2799" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="network"&gt;Network in Kubernetes 1.26&lt;/h2&gt;





&lt;h3 id="3453"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3453"&gt;#3453&lt;/a&gt; Minimizing &lt;em&gt;iptables-restore&lt;/em&gt; input size&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;MinimizeIPTablesRestore&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement aims to improve the performance of &lt;code&gt;kube-proxy&lt;/code&gt;. It will do so by only sending the rules that have changed on the calls to &lt;code&gt;iptables-restore&lt;/code&gt;, instead of the whole set of rules.&lt;/p&gt;





&lt;h3 id="1669"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1669"&gt;#1669&lt;/a&gt; &lt;em&gt;Proxy&lt;/em&gt; terminating endpoints&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ProxyTerminatingEndpoints&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement prevents traffic drops during rolling updates by sending all external traffic to both ready and not ready terminating endpoints (preferring the ready ones).&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#1669" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="2595"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2595"&gt;#2595&lt;/a&gt; Expanded DNS configuration&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ExpandedDNSConfig&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;With this enhancement, Kubernetes allows up to 32 entries in the DNS search path, and an increased search path length (up to 2048 characters), to keep up with recent DNS resolvers.&lt;/p&gt;
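&lt;p&gt;For instance, a &lt;em&gt;Pod&lt;/em&gt; can now carry a longer list of search domains in its &lt;code&gt;dnsConfig&lt;/code&gt;; a minimal sketch, where the domain names are placeholders:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: dns-demo
spec:
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
  dnsConfig:
    searches:   # with ExpandedDNSConfig: up to 32 entries, 2048 characters in total
    - svc.cluster.local
    - team-a.example.com
    - team-b.example.com
&lt;/code&gt;&lt;/pre&gt;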





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2595" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="1435"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1435"&gt;#1435&lt;/a&gt; Support of mixed protocols in &lt;em&gt;Services&lt;/em&gt; with &lt;em&gt;type=LoadBalancer&lt;/em&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;MixedProtocolLBService&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement allows a LoadBalancer Service to serve different protocols under the same port (UDP, TCP). For example, serving both UDP and TCP requests for a DNS or SIP server on the same port.&lt;/p&gt;
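&lt;p&gt;A sketch of such a &lt;em&gt;Service&lt;/em&gt;, exposing port 53 over both protocols (the name and selector are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: dns
spec:
  type: LoadBalancer
  selector:
    app: coredns
  ports:
  - name: dns-udp    # same port number, two protocols
    port: 53
    protocol: UDP
  - name: dns-tcp
    port: 53
    protocol: TCP
&lt;/code&gt;&lt;/pre&gt;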

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/#1435" rel="noopener noreferrer"&gt;Kubernetes 1.20 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="2086"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2086"&gt;#2086&lt;/a&gt; Service internal traffic policy&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ServiceInternalTrafficPolicy&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;You can now set the &lt;code&gt;spec.internalTrafficPolicy&lt;/code&gt; field on &lt;code&gt;Service&lt;/code&gt; objects to optimize your cluster traffic:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;With &lt;code&gt;Cluster&lt;/code&gt;, the routing will behave as usual.&lt;/li&gt;



&lt;li&gt;When set to &lt;code&gt;Topology&lt;/code&gt;, it will use &lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#2433" rel="noopener noreferrer"&gt;the topology-aware routing&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;With &lt;code&gt;PreferLocal&lt;/code&gt;, it will route traffic to endpoints on the same node when possible.&lt;/li&gt;



&lt;li&gt;With &lt;code&gt;Local&lt;/code&gt;, it will only send traffic to endpoints on the same node.&lt;/li&gt;
&lt;/ul&gt;
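&lt;p&gt;For example, a node-local DNS cache can avoid cross-node hops with a &lt;em&gt;Service&lt;/em&gt; like this minimal sketch (the names and port are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Service
metadata:
  name: node-local-dns
spec:
  selector:
    app: dns-cache
  internalTrafficPolicy: Local   # only route to endpoints on the same node
  ports:
  - port: 53
    protocol: UDP
&lt;/code&gt;&lt;/pre&gt;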





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#2086" rel="noopener noreferrer"&gt;Kubernetes 1.21 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3070"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3070"&gt;#3070&lt;/a&gt; Reserve service IP ranges for dynamic and static IP allocation&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; network&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ServiceIPStaticSubrange&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This update to the &lt;code&gt;--service-cluster-ip-range&lt;/code&gt; flag will lower the risk of IP conflicts between Services using static and dynamic IP allocation while, at the same time, keeping backwards compatibility.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#3070" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.24&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="nodes"&gt;Kubernetes 1.26 Nodes&lt;/h2&gt;





&lt;h3 id="2371"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2371"&gt;#2371&lt;/a&gt; cAdvisor-less, CRI-full container and Pod stats&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Major change to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PodAndContainerStatsFromCRI&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement summarizes the efforts to retrieve all the stats about running containers and pods from the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/blog/2016/12/container-runtime-interface-cri-in-kubernetes/"&gt;Container Runtime Interface (CRI)&lt;/a&gt;, removing the dependency on cAdvisor.&lt;/p&gt;





&lt;p&gt;Starting with 1.26, when this feature gate is enabled, the metrics on &lt;code&gt;/metrics/cadvisor&lt;/code&gt; are gathered from the CRI instead of cAdvisor.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#2371" rel="noopener noreferrer"&gt;Kubernetes 1.23 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3063"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3063"&gt;#3063&lt;/a&gt; Dynamic resource allocation&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;DynamicResourceAllocation&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Traditionally, the Kubernetes scheduler could only take into account &lt;a href="https://sysdig.com/blog/kubernetes-limits-requests/" rel="noopener noreferrer"&gt;CPU and memory limits and requests&lt;/a&gt;. Later on, the scheduler was expanded to also take storage and other resources into account. However, this is limiting in many scenarios.&lt;/p&gt;





&lt;p&gt;For example, what if the device needs initialization and cleanup, like an FPGA; or what if you want to limit the access to the resource, like a shared GPU?&lt;/p&gt;





&lt;p&gt;This new API covers those scenarios of resource allocation and dynamic detection, using the new &lt;code&gt;ResourceClaimTemplate&lt;/code&gt; and &lt;code&gt;ResourceClass&lt;/code&gt; objects, and the new &lt;code&gt;resourceClaims&lt;/code&gt; field inside &lt;em&gt;Pods&lt;/em&gt;.&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
[...]
spec:
  resourceClaims:
  - name: resource0
    source:
      resourceClaimTemplateName: resource-claim-template
  - name: resource1
    source:
      resourceClaimTemplateName: resource-claim-template
[...]
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;The scheduler can keep track of these resource claims, and only schedule &lt;em&gt;Pods&lt;/em&gt; in those nodes with enough resources available.&lt;/p&gt;
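&lt;p&gt;The claim template referenced above can be sketched together with its &lt;code&gt;ResourceClass&lt;/code&gt;; the driver name below is a hypothetical example, as resource drivers are provided by third-party vendors:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClass
metadata:
  name: example-gpu
driverName: gpu.example.com       # hypothetical third-party resource driver
---
apiVersion: resource.k8s.io/v1alpha1
kind: ResourceClaimTemplate
metadata:
  name: resource-claim-template
spec:
  spec:
    resourceClassName: example-gpu
&lt;/code&gt;&lt;/pre&gt;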





&lt;h3 id="3386"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3386"&gt;#3386&lt;/a&gt; Kubelet evented &lt;em&gt;PLEG&lt;/em&gt; for better performance&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;EventedPLEG&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;The aim of this enhancement is to reduce the CPU usage of the &lt;code&gt;kubelet&lt;/code&gt; when keeping track of all the pod states.&lt;/p&gt;





&lt;p&gt;It will partially reduce the periodic polling that the &lt;code&gt;kubelet&lt;/code&gt; performs, instead relying on notifications from the Container Runtime Interface (CRI) as much as possible.&lt;/p&gt;





&lt;p&gt;If you are interested in the implementation details, you may want to &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/3386-kubelet-evented-pleg/README.md"&gt;take a look at the KEP&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="3545"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3545"&gt;#3545&lt;/a&gt; Improved &lt;em&gt;multi-NUMA&lt;/em&gt; alignment in topology manager&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;TopologyManagerPolicyOptions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;TopologyManagerPolicyBetaOptions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;TopologyManagerPolicyAlphaOptions&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This is an improvement for &lt;em&gt;TopologyManager&lt;/em&gt; to better handle Non-Uniform Memory Access (&lt;a rel="noopener nofollow noreferrer" href="https://en.wikipedia.org/wiki/Non-uniform_memory_access"&gt;NUMA&lt;/a&gt;) nodes. For some high-performance workloads, it is very important to control in which physical CPU cores they run. You can significantly improve performance if you avoid memory jumping between the caches of the same chip, or between sockets.&lt;/p&gt;





&lt;p&gt;A new &lt;code&gt;--topology-manager-policy-options&lt;/code&gt; flag for the &lt;code&gt;kubelet&lt;/code&gt; will allow you to pass options and modify the behavior of the Topology Manager.&lt;/p&gt;





&lt;p&gt;Currently, only one alpha option is available:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;When &lt;code&gt;prefer-closest-numa-nodes=true&lt;/code&gt; is passed along, the Topology Manager will align the resources on either a single NUMA node or the minimum number of NUMA nodes possible.&lt;/li&gt;
&lt;/ul&gt;
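&lt;p&gt;Assuming the relevant feature gates are enabled, the option could be passed to the kubelet like this sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubelet --topology-manager-policy=best-effort \
        --topology-manager-policy-options=prefer-closest-numa-nodes=true
&lt;/code&gt;&lt;/pre&gt;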





&lt;p&gt;As new options may be added in the future, several feature gates have been added so you can choose to focus only on the stable ones:&lt;/p&gt;





&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;TopologyManagerPolicyOptions&lt;/code&gt;: Will enable the &lt;code&gt;topology-manager-policy-options&lt;/code&gt; flag and the stable options.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;TopologyManagerPolicyBetaOptions&lt;/code&gt;: Will also enable the beta options.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;TopologyManagerPolicyAlphaOptions&lt;/code&gt;: Will also enable the alpha options.&lt;/li&gt;
&lt;/ul&gt;





&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#2902" rel="noopener noreferrer"&gt;#2902 CPUManager policy option to distribute CPUs across NUMA nodes in Kubernetes 1.23&lt;/a&gt;.&lt;br&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2625" rel="noopener noreferrer"&gt;#2625 New CPU Manager Policies in Kubernetes 1.22&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="2133"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2133"&gt;#2133&lt;/a&gt; Kubelet credential provider&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;KubeletCredentialProviders&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement replaces in-tree container image registry credential providers with a new mechanism that is external and pluggable.&lt;/p&gt;





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/#2133" rel="noopener noreferrer"&gt;Kubernetes 1.20 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3570"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3570"&gt;#3570&lt;/a&gt; Graduate to &lt;em&gt;CPUManager&lt;/em&gt; to GA&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CPUManager&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;The CPUManager is the Kubelet component responsible for assigning pod containers to sets of CPUs on the local node.&lt;/p&gt;





&lt;p&gt;It was introduced in Kubernetes 1.8, and graduated to beta in release 1.10. For 1.26, the core CPUManager has been deemed stable, while experimentation continues with the additional work on its policies.&lt;/p&gt;
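&lt;p&gt;For reference, the CPUManager is driven by the kubelet configuration; a minimal sketch enabling the static policy (field values shown are illustrative) could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# "none" (the default) keeps the regular CFS behavior; "static" allows
# exclusive CPU assignment for Guaranteed pods with integer CPU requests
cpuManagerPolicy: static
cpuManagerPolicyOptions:
  full-pcpus-only: "true"
&lt;/code&gt;&lt;/pre&gt;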





&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-26-whats-new/#3545" rel="noopener noreferrer"&gt;#3545 Improved multi-numa alignment in Topology Manager in Kubernetes 1.26&lt;/a&gt;.&lt;br&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2625" rel="noopener noreferrer"&gt;#2625 New CPU Manager Policies in Kubernetes 1.22&lt;/a&gt;.&lt;/p&gt;





&lt;h3 id="3573"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3573"&gt;#3573&lt;/a&gt; Graduate &lt;em&gt;DeviceManager&lt;/em&gt; to GA&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; node&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;DevicePlugins&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;The DeviceManager in the Kubelet is the component managing the interactions with the different Device Plugins.&lt;/p&gt;





&lt;p&gt;Initially introduced in Kubernetes 1.8 and moved to beta stage in release 1.10, the Device Plugin framework saw widespread adoption and is finally moving to GA in 1.26.&lt;/p&gt;





&lt;p&gt;This framework allows the use of external devices (e.g., &lt;a rel="noopener nofollow noreferrer" href="https://github.com/NVIDIA/k8s-device-plugin"&gt;NVIDIA GPUs&lt;/a&gt;, &lt;a rel="noopener nofollow noreferrer" href="https://github.com/RadeonOpenCompute/k8s-device-plugin"&gt;AMD GPUs&lt;/a&gt;, &lt;a rel="noopener nofollow noreferrer" href="https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin"&gt;SR-IOV NICs&lt;/a&gt;) without modifying core Kubernetes components.&lt;/p&gt;





&lt;h2 id="scheduling"&gt;Scheduling in Kubernetes 1.26&lt;/h2&gt;





&lt;h3 id="3521"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3521"&gt;#3521&lt;/a&gt; &lt;em&gt;Pod&lt;/em&gt; scheduling readiness&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; scheduling&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;PodSchedulingReadiness&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement aims to optimize scheduling by letting the Pods define when they are ready to be actually scheduled.&lt;/p&gt;





&lt;p&gt;Not all &lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noopener noreferrer"&gt;pending Pods&lt;/a&gt; are ready to be scheduled. Some stay in a &lt;code&gt;miss-essential-resources&lt;/code&gt; state for some time, which causes extra work in the scheduler.&lt;/p&gt;





&lt;p&gt;The new &lt;code&gt;.spec.schedulingGates&lt;/code&gt; field of a Pod identifies when it is ready for scheduling:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
[...]
spec:
  schedulingGates:
  - name: foo
  - name: bar
[...]
&lt;/code&gt;&lt;/pre&gt;





&lt;p&gt;When any scheduling gate is present, the Pod won't be scheduled.&lt;/p&gt;





&lt;p&gt;You can check the status with:&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;$ kubectl get pod test-pod
NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          7s
&lt;/code&gt;&lt;/pre&gt;





&lt;h3 id="3094"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3094"&gt;#3094&lt;/a&gt; Take taints/tolerations into consideration when calculating &lt;em&gt;PodTopologySpread&lt;/em&gt; skew&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; scheduling&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;NodeInclusionPolicyInPodTopologySpread&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;As we discussed in our "&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-16/#895" rel="noopener noreferrer"&gt;Kubernetes 1.16 - What's new?&lt;/a&gt;" article, the &lt;code&gt;topologySpreadConstraints&lt;/code&gt; fields, along with &lt;code&gt;maxSkew&lt;/code&gt;, allow you to spread your workloads across nodes. A new &lt;code&gt;NodeInclusionPolicies&lt;/code&gt; field allows taking into account &lt;code&gt;NodeAffinity&lt;/code&gt; and &lt;code&gt;NodeTaint&lt;/code&gt; when calculating this pod topology spread skew.&lt;/p&gt;
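&lt;p&gt;As a sketch, the new inclusion policies are set per constraint; the values shown below are the defaults, and the label selector is a placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    # Honor: skip nodes that do not match the pod's nodeAffinity/nodeSelector
    nodeAffinityPolicy: Honor
    # Ignore: count nodes regardless of their taints
    nodeTaintsPolicy: Ignore
    labelSelector:
      matchLabels:
        app: my-app   # placeholder label
&lt;/code&gt;&lt;/pre&gt;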





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#3094" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.25&lt;/a&gt;" article.&lt;/p&gt;





&lt;h2 id="storage"&gt;Kubernetes 1.26 storage&lt;/h2&gt;





&lt;h3 id="3294"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3294"&gt;#3294&lt;/a&gt; Provision volumes from cross-namespace snapshots&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CrossNamespaceVolumeDataSource&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;Prior to Kubernetes 1.26, users were able to provision volumes from snapshots thanks to the &lt;code&gt;VolumeSnapshot&lt;/code&gt; feature. While great and super useful, this feature had some limitations, like the inability to bind a &lt;code&gt;PersistentVolumeClaim&lt;/code&gt; to &lt;code&gt;VolumeSnapshots&lt;/code&gt; from other namespaces.&lt;/p&gt;





&lt;p&gt;This enhancement removes that limitation and allows Kubernetes users to provision volumes from snapshots across namespaces.&lt;/p&gt;





&lt;p&gt;If you want to use the cross-namespace VolumeSnapshot feature, you’ll have to first create a &lt;code&gt;ReferenceGrant&lt;/code&gt; object, and then a &lt;code&gt;PersistentVolumeClaim&lt;/code&gt; binding to the &lt;code&gt;VolumeSnapshot&lt;/code&gt;. Here, you’ll find a simple example of both objects for learning purposes.&lt;/p&gt;





&lt;pre&gt;&lt;code&gt;---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: ReferenceGrant
metadata:
  name: test
  namespace: default
spec:
  from:
  - group: ""
    kind: PersistentVolumeClaim
    namespace: nstest1
  to:
  - group: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: testsnapshot
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: testvolumeclaim
  namespace: nstest1
spec:
  storageClassName: mystorageclass
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  dataSourceRef:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: testsnapshot
    namespace: default
  volumeMode: Filesystem
&lt;/code&gt;&lt;/pre&gt;





&lt;h3 id="2268"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2268"&gt;#2268&lt;/a&gt; Non-graceful node shutdown&lt;/h3&gt;





&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;NodeOutOfServiceVolumeDetach&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;





&lt;p&gt;This enhancement addresses node shutdown cases that are not detected properly, where the pods that are part of a &lt;em&gt;StatefulSet&lt;/em&gt; will be stuck in terminating status on the shutdown node and cannot be moved to a new running node.&lt;/p&gt;





&lt;p&gt;In this case, the pods will be forcefully deleted, triggering the deletion of the &lt;em&gt;VolumeAttachments&lt;/em&gt;, and new pods will be created on a different running node so that the application can continue to function.&lt;/p&gt;
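&lt;p&gt;This recovery flow is triggered manually: an operator marks the dead node with the &lt;code&gt;out-of-service&lt;/code&gt; taint, which expressed in a Node manifest would look roughly like this (the node name is a placeholder):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Node
metadata:
  name: my-shutdown-node   # placeholder name
spec:
  taints:
    # Signals that the node is down for good, so stateful pods and
    # their volume attachments can be cleaned up and rescheduled
    - key: node.kubernetes.io/out-of-service
      value: nodeshutdown
      effect: NoExecute
&lt;/code&gt;&lt;/pre&gt;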





&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#2268" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;" article.&lt;/p&gt;





&lt;h3 id="3333"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3333"&gt;#3333&lt;/a&gt; Retroactive default &lt;em&gt;StorageClass&lt;/em&gt; assignment
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;RetroactiveDefaultStorageClass&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement helps manage the case when cluster administrators change the default storage class. All &lt;em&gt;PVCs&lt;/em&gt; without a &lt;em&gt;StorageClass&lt;/em&gt; that were created while the change took place will retroactively be assigned the new default &lt;em&gt;StorageClass&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#3333" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.25&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="1491"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1491"&gt;#1491&lt;/a&gt; vSphere &lt;em&gt;in-tree&lt;/em&gt; to CSI driver migration&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;CSIMigrationvSphere&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;As we covered in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#1487" rel="noopener noreferrer"&gt;What's new in Kubernetes 1.19&lt;/a&gt;" article, the CSI driver for vSphere has been stable for some time. Now, all plugin operations for &lt;code&gt;vspherevolume&lt;/code&gt; are redirected to &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes-sigs/vsphere-csi-driver"&gt;the out-of-tree 'csi.vsphere.vmware.com' driver&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This enhancement is part of the &lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/#625" rel="noopener noreferrer"&gt;#625 In-tree storage plugin to CSI Driver Migration&lt;/a&gt; effort.&lt;/p&gt;

&lt;h3 id="1885"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1885"&gt;#1885&lt;/a&gt; Azure file &lt;em&gt;in-tree&lt;/em&gt; to CSI driver migration&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;InTreePluginAzureDiskUnregister&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement summarizes &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/concepts/storage/volumes/#azurefile"&gt;the work to move Azure File code&lt;/a&gt; out of the main Kubernetes binaries (out-of-tree).&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#1885" rel="noopener noreferrer"&gt;Kubernetes 1.21 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="2317"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/2317"&gt;#2317&lt;/a&gt; Allow Kubernetes to supply pod's &lt;em&gt;fsgroup&lt;/em&gt; to CSI driver on mount&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; storage&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;DelegateFSGroupToCSIDriver&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement proposes providing the CSI driver with the &lt;em&gt;fsgroup&lt;/em&gt; of the pods as an explicit field, so the CSI driver can be the one applying this natively on mount time.&lt;/p&gt;
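&lt;p&gt;The &lt;em&gt;fsgroup&lt;/em&gt; itself is declared in the pod's security context as usual; what changes is who applies it. A minimal sketch:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;spec:
  securityContext:
    # With delegation enabled, this group is handed to the CSI driver
    # at mount time instead of being applied recursively by the kubelet
    fsGroup: 2000
&lt;/code&gt;&lt;/pre&gt;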

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#2317" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h2 id="other"&gt;Other enhancements in Kubernetes 1.26&lt;/h2&gt;

&lt;h3 id="3466"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3466"&gt;#3466&lt;/a&gt; Kubernetes component health SLIs&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; instrumentation&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;ComponentSLIs&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;There isn't a standard format to query the health data of Kubernetes components.&lt;/p&gt;

&lt;p&gt;Starting with Kubernetes 1.26, a new endpoint &lt;code&gt;/metrics/slis&lt;/code&gt; will be available on each component exposing their Service Level Indicator (SLI) metrics in Prometheus format.&lt;/p&gt;

&lt;p&gt;For each component, two metrics will be exposed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;gauge&lt;/strong&gt;, representing the current state of the healthcheck.&lt;/li&gt;



&lt;li&gt;A &lt;strong&gt;counter&lt;/strong&gt;, recording the cumulative counts observed for each healthcheck state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this information, you can check the status of the Kubernetes internals over time, e.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubernetes_healthcheck{name="etcd",type="readyz"}&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;And create an alert for when something's wrong, e.g.:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubernetes_healthchecks_total{name="etcd",status="error",type="readyz"} &amp;gt; 0&lt;/code&gt;&lt;/pre&gt;
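&lt;p&gt;To collect these metrics, a Prometheus scrape job could point at the new endpoint; a minimal sketch, where the job name and target are placeholders for your own setup:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;scrape_configs:
  - job_name: apiserver-slis          # placeholder job name
    metrics_path: /metrics/slis       # new SLI endpoint in 1.26
    scheme: https
    static_configs:
      - targets: ["my-apiserver:6443"]  # placeholder target
&lt;/code&gt;&lt;/pre&gt;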

&lt;h3 id="3498"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3498"&gt;#3498&lt;/a&gt; Extend metrics stability&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; instrumentation&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; N/A&lt;/p&gt;

&lt;p&gt;Metrics in Kubernetes are classified as &lt;code&gt;alpha&lt;/code&gt; or &lt;code&gt;stable&lt;/code&gt;. The &lt;code&gt;stable&lt;/code&gt; ones are guaranteed to be maintained, providing you with the information to prepare your dashboards so they don't break unexpectedly when you upgrade your cluster.&lt;/p&gt;

&lt;p&gt;In Kubernetes 1.26, two new classes are added:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;beta&lt;/code&gt;: For metrics related to beta features. They may change or disappear, but they are in a more advanced development state than the alpha ones.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;internal&lt;/code&gt;: Metrics for internal usage that you shouldn't worry about, either because they don't provide useful information for cluster administrators, or because they may change without notice.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can check a full &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/reference/instrumentation/metrics/"&gt;list of available metrics in the documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/#1209" rel="noopener noreferrer"&gt;#1209 Metrics stability enhancement in Kubernetes 1.21&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="3515"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3515"&gt;#3515&lt;/a&gt; OpenAPI v3 for &lt;em&gt;kubectl&lt;/em&gt; explain&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; cli&lt;br&gt;&lt;strong&gt;Environment variable:&lt;/strong&gt; &lt;code&gt;KUBECTL_EXPLAIN_OPENAPIV3&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement allows &lt;code&gt;kubectl explain&lt;/code&gt; to gather the data from OpenAPIv3 instead of v2.&lt;/p&gt;

&lt;p&gt;In OpenAPIv3, some data can be represented in a better way, like &lt;em&gt;CustomResourceDefinition&lt;/em&gt;s (CRDs).&lt;/p&gt;

&lt;p&gt;Internal work is also being done to improve how &lt;code&gt;kubectl explain&lt;/code&gt; prints the output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Related:&lt;/strong&gt; &lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#2896" rel="noopener noreferrer"&gt;#2896 OpenAPI v3 in Kubernetes 1.24&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id="1440"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1440"&gt;#1440&lt;/a&gt; &lt;em&gt;kubectl&lt;/em&gt; events&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; cli&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; N/A&lt;/p&gt;

&lt;p&gt;A new &lt;code&gt;kubectl events&lt;/code&gt; command is available that will enhance the current functionality of &lt;code&gt;kubectl get events&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/#1440" rel="noopener noreferrer"&gt;Kubernetes 1.23 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="3031"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3031"&gt;#3031&lt;/a&gt; Signing release artifacts&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Beta&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; release&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; N/A&lt;/p&gt;

&lt;p&gt;This enhancement introduces a unified way to sign artifacts in order to help avoid &lt;a href="https://sysdig.com/blog/software-supply-chain-security/" rel="noopener noreferrer"&gt;supply chain attacks&lt;/a&gt;. It relies on the &lt;a rel="noopener nofollow noreferrer" href="https://www.sigstore.dev/"&gt;sigstore&lt;/a&gt; project tools, and more specifically &lt;code&gt;&lt;a rel="noopener nofollow noreferrer" href="https://github.com/sigstore/cosign"&gt;cosign&lt;/a&gt;&lt;/code&gt;. Although it doesn’t add new functionality, it will surely help to keep our cluster more protected.&lt;/p&gt;

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/#3031" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;" article.&lt;/p&gt;

&lt;h3 id="3503"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/3503"&gt;#3503&lt;/a&gt; Host network support for Windows pods&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Net new to Alpha&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; windows&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;WindowsHostNetwork&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;false&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Windows pods were in an odd situation: you could set &lt;code&gt;hostNetwork=true&lt;/code&gt; on them, but it didn't change anything. There was no platform impediment; the implementation was simply missing.&lt;/p&gt;

&lt;p&gt;Starting with Kubernetes 1.26, the &lt;code&gt;kubelet&lt;/code&gt; can now request that Windows pods use the host's network namespace instead of creating a new pod network namespace.&lt;/p&gt;

&lt;p&gt;This will come in handy to avoid port exhaustion when there are large numbers of services.&lt;/p&gt;
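&lt;p&gt;With the feature gate enabled, a Windows pod could opt into the host network just like its Linux counterpart; a sketch (pod name and image are illustrative):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: win-hostnetwork-pod   # placeholder name
spec:
  hostNetwork: true           # now honored on Windows nodes
  nodeSelector:
    kubernetes.io/os: windows
  containers:
    - name: app
      image: mcr.microsoft.com/windows/nanoserver:ltsc2022
&lt;/code&gt;&lt;/pre&gt;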

&lt;h3 id="1981"&gt;
&lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/enhancements/issues/1981"&gt;#1981&lt;/a&gt; Support for Windows privileged containers&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Stage:&lt;/strong&gt; Graduating to Stable&lt;br&gt;&lt;strong&gt;Feature group:&lt;/strong&gt; windows&lt;br&gt;&lt;strong&gt;Feature gate:&lt;/strong&gt; &lt;code&gt;WindowsHostProcessContainers&lt;/code&gt; &lt;strong&gt;Default value:&lt;/strong&gt; &lt;code&gt;true&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This enhancement brings the &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/docs/concepts/workloads/pods/#privileged-mode-for-containers"&gt;privileged containers&lt;/a&gt; feature available in Linux to Windows hosts.&lt;/p&gt;

&lt;p&gt;Privileged containers have access to the host, as if they were running directly on it. Although they are not recommended for most workloads, they are quite useful for administration, security, and monitoring purposes.&lt;/p&gt;
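&lt;p&gt;On Windows, privileged workloads run as &lt;em&gt;HostProcess&lt;/em&gt; containers; a sketch of the relevant pod fields (name and image are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: Pod
metadata:
  name: win-hostprocess-pod   # placeholder name
spec:
  securityContext:
    windowsOptions:
      hostProcess: true       # run directly on the host
      runAsUserName: "NT AUTHORITY\\SYSTEM"
  hostNetwork: true           # required for HostProcess pods
  containers:
    - name: admin-task
      image: mcr.microsoft.com/windows/nanoserver:ltsc2022
&lt;/code&gt;&lt;/pre&gt;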

&lt;p&gt;Read more in our "&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/#1981" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;" article.&lt;/p&gt;








&lt;p&gt;That’s all for Kubernetes 1.26, folks! Exciting as always; get ready to upgrade your clusters if you intend to use any of these features.&lt;/p&gt;

&lt;p&gt;If you liked this, you might want to check out our previous ‘What’s new in Kubernetes’ editions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-26-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.26 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-25-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.25 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-24-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.24 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-23-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.23 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-22-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.22 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/kubernetes-1-21-whats-new/" rel="noopener noreferrer"&gt;Kubernetes 1.21 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-20/" rel="noopener noreferrer"&gt;Kubernetes 1.20 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-19/" rel="noopener noreferrer"&gt;Kubernetes 1.19 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-18/" rel="noopener noreferrer"&gt;Kubernetes 1.18 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-17/" rel="noopener noreferrer"&gt;Kubernetes 1.17 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-16/" rel="noopener noreferrer"&gt;Kubernetes 1.16 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-15/" rel="noopener noreferrer"&gt;Kubernetes 1.15 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-kubernetes-1-14/" rel="noopener noreferrer"&gt;Kubernetes 1.14 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-in-kubernetes-1-13" rel="noopener noreferrer"&gt;Kubernetes 1.13 - What's new?&lt;/a&gt;&lt;/li&gt;



&lt;li&gt;&lt;a href="https://sysdig.com/blog/whats-new-in-kubernetes-1-12" rel="noopener noreferrer"&gt;Kubernetes 1.12 - What's new?&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Get involved in the Kubernetes community:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Visit &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io"&gt;the project homepage&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Check out &lt;a rel="noopener nofollow noreferrer" href="https://github.com/kubernetes/"&gt;the Kubernetes project on GitHub&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Get involved &lt;a rel="noopener nofollow noreferrer" href="https://kubernetes.io/community/"&gt;with the Kubernetes community&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Meet the maintainers &lt;a rel="noopener nofollow noreferrer" href="https://slack.k8s.io"&gt;on the Kubernetes Slack&lt;/a&gt;.&lt;/li&gt;



&lt;li&gt;Follow &lt;a rel="noopener nofollow noreferrer" href="https://twitter.com/kubernetesio"&gt;@KubernetesIO on Twitter&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And if you enjoy keeping up to date with the Kubernetes ecosystem, &lt;a href="https://go.sysdig.com/container-newsletter-signup.html" rel="noopener noreferrer"&gt;subscribe to our container newsletter&lt;/a&gt;, a monthly email with the coolest stuff happening in the cloud-native ecosystem.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>docker</category>
    </item>
    <item>
      <title>Understanding Kubernetes Limits and Requests</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Mon, 21 Nov 2022 08:43:18 +0000</pubDate>
      <link>https://dev.to/sysdig/understanding-kubernetes-limits-and-requests-5m1</link>
      <guid>https://dev.to/sysdig/understanding-kubernetes-limits-and-requests-5m1</guid>
      <description>&lt;p&gt;When working with containers in Kubernetes, it’s important to know what are the resources involved and how they are needed. Some processes will require more CPU or memory than others. Some are critical and should never be starved. &lt;/p&gt;

&lt;p&gt;Knowing that, we should configure our containers and Pods properly in order to get the best out of both resources.&lt;/p&gt;

&lt;p&gt;In this article, we will see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Introduction to Kubernetes Limits and Requests&lt;/li&gt;



&lt;li&gt;Hands-on example&lt;/li&gt;



&lt;li&gt;Kubernetes Requests&lt;/li&gt;



&lt;li&gt;Kubernetes Limits&lt;/li&gt;



&lt;li&gt;CPU particularities&lt;/li&gt;



&lt;li&gt;Memory particularities&lt;/li&gt;



&lt;li&gt;Namespace ResourceQuota&lt;/li&gt;



&lt;li&gt;Namespace LimitRange&lt;/li&gt;



&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="introduction"&gt;Introduction to Kubernetes Limits and Requests&lt;/h2&gt;

&lt;p&gt;Limits and Requests are important settings when working with Kubernetes. This article will focus on the two most important ones: CPU and memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes defines Limits as the&lt;/strong&gt; &lt;strong&gt;maximum amount of a resource&lt;/strong&gt; to be used by a container. This means that the container can never consume more than the memory amount or CPU amount indicated. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requests, on the other hand, are the minimum guaranteed amount of a resource&lt;/strong&gt; that is reserved for a container.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-04-1-1170x585.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-04-1-1170x585.png" alt="" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="handson"&gt;Hands-on example&lt;/h2&gt;

&lt;p&gt;Let’s have a look at this deployment, where we are setting up limits and requests for two different containers on both CPU and memory.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kind: Deployment
apiVersion: extensions/v1beta1
…
template:
  spec:
    containers:
      - name: redis
        image: redis:5.0.3-alpine
        resources:
          limits:
            memory: 600Mi
            cpu: 1
          requests:
            memory: 300Mi
            cpu: 500m
      - name: busybox
        image: busybox:1.28
        resources:
          limits:
            memory: 200Mi
            cpu: 300m
          requests:
            memory: 100Mi
            cpu: 100m&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Let’s say we are running a cluster with, for example, 4 cores and 16GB RAM nodes. We can extract a lot of information:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/Kubernetes-Limits-and-Request-05.png" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-05-1170x828.png" alt="Kubernetes Limits and Requests practical example" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pod effective request&lt;/strong&gt; is 400 MiB of memory and 600 millicores of CPU. You need a node with enough free allocatable space to schedule the pod.&lt;/li&gt;



&lt;li&gt;
&lt;strong&gt;CPU shares&lt;/strong&gt; for the redis container will be 512, and 102 for the busybox container. Kubernetes always assigns 1024 shares to every core, so redis: 1024 * 0.5 cores ≅ 512 and busybox: 1024 * 0.1 cores ≅ 102.&lt;/li&gt;



&lt;li&gt;Redis container will be &lt;strong&gt;OOM killed&lt;/strong&gt; if it tries to allocate more than 600MB of RAM, most likely making the pod fail.&lt;/li&gt;



&lt;li&gt;Redis will suffer &lt;strong&gt;CPU throttle&lt;/strong&gt; if it tries to use more than 100ms of CPU in every 100ms, (since we have 4 cores, available time would be 400ms every 100ms) causing performance degradation.&lt;/li&gt;



&lt;li&gt;Busybox container will be &lt;strong&gt;OOM killed&lt;/strong&gt; if it tries to allocate more than 200MB of RAM, resulting in a failed pod.&lt;/li&gt;



&lt;li&gt;Busybox will suffer &lt;strong&gt;CPU throttle&lt;/strong&gt; if it tries to use more than 30ms of CPU every 100ms, causing performance degradation.&lt;/li&gt;
&lt;/ol&gt;
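&lt;p&gt;The CPU shares arithmetic from point 2 can be sketched in a few lines; this is a back-of-the-envelope helper, not a Kubernetes API:&lt;/p&gt;

```python
def cpu_shares(cpu_request_cores: float) -> int:
    """Approximate the cgroups v1 cpu.shares value Kubernetes derives
    from a container's CPU request: 1024 shares per core."""
    return int(cpu_request_cores * 1024)

# redis requests 500m (0.5 cores), busybox requests 100m (0.1 cores)
print(cpu_shares(0.5))  # 512
print(cpu_shares(0.1))  # 102
```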

&lt;h2 id="kubernetesrequests"&gt;Kubernetes Requests&lt;/h2&gt;

&lt;p&gt;Kubernetes defines requests as a &lt;strong&gt;guaranteed minimum amount of a resource&lt;/strong&gt; to be used by a container.&lt;/p&gt;

&lt;p&gt;Basically, it will set the minimum amount of the resource for the container to consume.&lt;/p&gt;

&lt;p&gt;When a Pod is scheduled, kube-scheduler will check the Kubernetes requests in order to allocate it to a particular Node that can satisfy at least that amount for all containers in the Pod. If the requested amount is higher than the available resource, the Pod will not be scheduled and remain in Pending status.&lt;/p&gt;

&lt;p&gt;For more information about Pending status, check &lt;a href="https://sysdig.com/blog/kubernetes-pod-pending-problems/" rel="noreferrer noopener"&gt;Understanding Kubernetes Pod pending problems&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this example, in the container definition we set a request of 100m of CPU (0.1 cores) and 4Mi of memory:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resources:
   requests:
        cpu: 0.1
        memory: 4Mi&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Requests are used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When allocating Pods to a Node, so the indicated requests by the containers in the Pod are satisfied.&lt;/li&gt;



&lt;li&gt;At runtime, the indicated amount of requests will be guaranteed as a minimum for the containers in that Pod.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage4.png" alt="How to set good CPU requests" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="kuberneteslimits"&gt;Kubernetes Limits&lt;/h2&gt;

&lt;p&gt;Kubernetes defines &lt;strong&gt;limits&lt;/strong&gt; as a &lt;strong&gt;maximum amount of a resource&lt;/strong&gt; to be used by a container.&lt;/p&gt;

&lt;p&gt;This means that the container can never consume more than the memory amount or CPU amount indicated.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;resources:
  limits:
    cpu: 0.5
    memory: 100Mi
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Limits are used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When allocating Pods to a Node. If no requests are set, by default, Kubernetes will assign requests = limits.&lt;/li&gt;



&lt;li&gt;At runtime, Kubernetes will check that the containers in the Pod are not consuming a higher amount of resources than indicated in the limit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage6.png" alt="Setting good Limits in Kubernetes" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
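&lt;p&gt;At runtime, you can check whether containers are being throttled against their CPU limits using the cAdvisor metrics exposed by the kubelet. A sketch (assuming the standard cAdvisor metric names):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Fraction of CPU periods in which each pod was throttled
sum(rate(container_cpu_cfs_throttled_periods_total{container!=""}[5m])) by (pod)
/
sum(rate(container_cpu_cfs_periods_total{container!=""}[5m])) by (pod)
&lt;/code&gt;&lt;/pre&gt;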

&lt;h2 id="cpuparticularities"&gt;CPU particularities&lt;/h2&gt;

&lt;p&gt;CPU is a &lt;strong&gt;compressible resource&lt;/strong&gt;, meaning that it can be stretched in order to satisfy all the demand. If processes request too much CPU, some of them will be throttled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CPU&lt;/strong&gt; represents &lt;strong&gt;computing processing time&lt;/strong&gt;, measured in cores. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use millicores (m) to represent smaller amounts than a core (e.g., 500m would be half a core)&lt;/li&gt;



&lt;li&gt;The minimum amount is 1m&lt;/li&gt;



&lt;li&gt;A Node might have more than one core available, so requesting CPU &amp;gt; 1 is possible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-1-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-1-1170x644.png" alt="Kubernetes requests for CPU image" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
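&lt;p&gt;The following CPU quantities are equivalent, so you can pick whichever notation is clearest:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;cpu: "1"     # one full core
cpu: 1000m   # also one full core
cpu: 0.5     # half a core
cpu: 500m    # also half a core
&lt;/code&gt;&lt;/pre&gt;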

&lt;h2 id="memoryparticularities"&gt;Memory particularities&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory&lt;/strong&gt; is a &lt;strong&gt;non-compressible&lt;/strong&gt; resource, meaning that it can’t be stretched in the same manner as CPU. If a process doesn’t get enough memory to work, the process is killed.&lt;/p&gt;

&lt;p&gt;Memory is measured in Kubernetes in &lt;strong&gt;bytes&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can use E, P, T, G, M, and k to represent Exabytes, Petabytes, Terabytes, Gigabytes, Megabytes, and kilobytes, although only the last four are commonly used (e.g., 500M, 4G)&lt;/li&gt;



&lt;li&gt;Warning: don’t use lowercase m for memory (this represents Millibytes, which is ridiculously low)&lt;/li&gt;



&lt;li&gt;You can define Mebibytes using Mi, and likewise Ki, Gi, Ti, Pi, and Ei for the other binary units (e.g., 500Mi)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;span&gt;&lt;em&gt;A Mebibyte is 2 to the power of 20 bytes (and analogously, a Kibibyte is 2^10 and a Gibibyte is 2^30). These binary units were created to avoid confusion with the Kilo and Mega prefixes of the metric system, which are multiples of 1000. Prefer this notation, as it unambiguously refers to powers of two.&lt;/em&gt;&lt;br&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-2-1170x644.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Requests-2-1170x644.png" alt="Kubernetes Limits for memory image" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="bestpractices"&gt;Best practices&lt;/h2&gt;

&lt;p&gt;In very few cases should you be using limits to control your resource usage in Kubernetes. This is because if you want to avoid starvation (ensure that every important process gets its share), you should be using requests in the first place.&lt;/p&gt;

&lt;p&gt;By setting limits, you are only preventing a process from retrieving additional resources in exceptional cases, causing an OOM kill in the case of memory, and throttling in the case of CPU (the process will need to wait until the CPU can be used again).&lt;/p&gt;

&lt;p&gt;For more information, check the &lt;a href="https://sysdig.com/blog/troubleshoot-kubernetes-oom/" rel="noopener noreferrer"&gt;article about OOM and Throttling&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you’re setting a request value equal to the limit in all containers of a Pod, that Pod will get the Guaranteed Quality of Service. &lt;/p&gt;

&lt;p&gt;Note as well that Pods with resource usage higher than their requests are more likely to be evicted, so setting very low requests causes more harm than good. For more information, check the article about &lt;a href="https://docs.google.com/document/u/0/d/1NvedVZgcPdtiSIFZH_-q5-C43xt12nfMWERA6Mk5dOc/edit" rel="noopener noreferrer"&gt;Pod eviction and Quality of Service&lt;/a&gt;.&lt;/p&gt;

&lt;h2 id="namespaceresourcequota"&gt;Namespace ResourceQuota&lt;/h2&gt;

&lt;p&gt;Thanks to namespaces, we can isolate Kubernetes resources into different groups, also called tenants.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;ResourceQuotas&lt;/strong&gt;, you can &lt;strong&gt;set a memory or CPU limit to the entire namespace&lt;/strong&gt;, ensuring that entities in it can’t consume more than that amount.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: 2
    requests.memory: 1Gi
    limits.cpu: 3
    limits.memory: 2Gi

&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;requests.cpu: the maximum amount of CPU for the sum of all requests in this namespace&lt;/li&gt;



&lt;li&gt;requests.memory: the maximum amount of Memory for the sum of all requests in this namespace&lt;/li&gt;



&lt;li&gt;limits.cpu: the maximum amount of CPU for the sum of all limits in this namespace&lt;/li&gt;



&lt;li&gt;limits.memory: the maximum amount of memory for the sum of all limits in this namespace&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then, apply it to your namespace:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl apply -f resourcequota.yaml --namespace=mynamespace
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;You can list the current ResourceQuota for a namespace with:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;kubectl get resourcequota -n mynamespace
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that if you set up ResourceQuota for a given resource in a namespace, you then need to specify limits or requests accordingly for every Pod in that namespace. If not, Kubernetes will return a “failed quota” error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (Forbidden): error when creating "mypod.yaml": pods "mypod" is forbidden: failed quota: mem-cpu-demo: must specify limits.cpu,limits.memory,requests.cpu,requests.memory
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;In case you try to add a new Pod with container limits or requests that exceed the current ResourceQuota, Kubernetes will return an “exceeded quota” error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (Forbidden): error when creating "mypod.yaml": pods "mypod" is forbidden: exceeded quota: mem-cpu-demo, requested: limits.memory=2Gi,requests.memory=2Gi, used: limits.memory=1Gi,requests.memory=1Gi, limited: limits.memory=2Gi,requests.memory=1Gi
&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="namespacelimitrange"&gt;Namespace LimitRange&lt;/h2&gt;

&lt;p&gt;ResourceQuotas are useful if we want to restrict the total amount of a resource allocatable for a namespace. But what happens if we want to give default values to the elements inside?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LimitRanges&lt;/strong&gt; are a Kubernetes policy that &lt;strong&gt;restricts the resource settings for each entity&lt;/strong&gt; in a namespace.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-resource-constraint
spec:
  limits:
  - default:
      cpu: 500m
    defaultRequest:
      cpu: 500m
    min:
      cpu: 100m
    max:
      cpu: "1"
    type: Container
&lt;/code&gt;&lt;/pre&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;default&lt;/code&gt;: Containers created without a limit will get this limit value.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;defaultRequest&lt;/code&gt;: Containers created without a request will get this request value.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;min&lt;/code&gt;: Containers created can’t have limits or requests smaller than this.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;max&lt;/code&gt;: Containers created can’t have limits or requests bigger than this.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Later, if you create a new Pod with no requests or limits set, LimitRange will automatically set these values to all its containers:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;    Limits:
      cpu:  500m
    Requests:
      cpu:  500m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Now, imagine that you add a new Pod with a CPU limit of 1200m. You will receive the following error:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;Error from server (Forbidden): error when creating "pods/mypod.yaml": pods "mypod" is forbidden: maximum cpu usage per Container is 1, but limit is 1200m
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Note that, for scheduling purposes, kube-scheduler treats containers with no CPU request set as if they requested 100m of CPU, even with no LimitRanges set.&lt;/p&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Choosing the optimal requests and limits for our Kubernetes cluster is key to keeping both resource consumption and costs under control.&lt;/p&gt;

&lt;p&gt;Oversizing or dedicating too many resources for our Pods may lead to costs skyrocketing.&lt;/p&gt;

&lt;p&gt;Undersizing or dedicating very few CPU or Memory will lead to applications not performing correctly, or even Pods being evicted.&lt;/p&gt;

&lt;p&gt;As mentioned, Kubernetes limits shouldn’t be used, except in very specific situations, as they may cause more harm than good: a container that hits its memory limit is killed, and one that hits its CPU limit is throttled.&lt;/p&gt;

&lt;p&gt;For requests, use them when you need to ensure a process gets a guaranteed share of a resource.&lt;/p&gt;








&lt;h2&gt;Rightsize your Kubernetes resources with Sysdig Monitor&lt;/h2&gt;





&lt;p&gt;With Sysdig Monitor’s new feature, Cost Advisor, you can optimize your Kubernetes costs by rightsizing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Memory requests&lt;/li&gt;



&lt;li&gt;CPU requests&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-06-1170x1063.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FKubernetes-Limits-and-Request-06-1170x1063.png" alt="" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>cpu</category>
      <category>memory</category>
    </item>
    <item>
      <title>The four Golden Signals of Kubernetes monitoring</title>
      <dc:creator>Javier Martínez</dc:creator>
      <pubDate>Fri, 28 Oct 2022 09:20:10 +0000</pubDate>
      <link>https://dev.to/sysdig/the-four-golden-signals-of-kubernetes-monitoring-b7d</link>
      <guid>https://dev.to/sysdig/the-four-golden-signals-of-kubernetes-monitoring-b7d</guid>
      <description>&lt;p&gt;&lt;strong&gt;Golden Signals&lt;/strong&gt; are a reduced set of metrics that offer a wide view of a service from a user or consumer perspective: Latency, Traffic, Errors and Saturation. By focusing on these, you can be quicker at detecting potential problems that might be directly affecting the behavior of the application.&lt;/p&gt;

&lt;p&gt;Google introduced the term "Golden Signals" to refer to the essential metrics that you need to measure in your applications. They are the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
Errors - rate of requests that fail.&lt;/li&gt;



&lt;li&gt;
Saturation - consumption of your system resources.&lt;/li&gt;



&lt;li&gt;
Traffic - amount of use of your service per time unit.&lt;/li&gt;



&lt;li&gt;
Latency - the time it takes to serve a request.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image9-10.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage9-10-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/BlogImages-GHAminer-featured-1.png" rel="noopener noreferrer"&gt;&lt;br&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is just a set of essential signals to start monitoring in your system. In other words, if you’re wondering which signals to monitor, you will need to look at these four first.&lt;/p&gt;

&lt;p&gt;Enter: Goldilocks and the four Monitoring Signals&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Once upon a time, there was a little girl called Goldilocks, who lived at the other side of the wood and had been sent on an errand by her mother, passed by the house, and looked in at the window…&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2 id="errors"&gt;Errors&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Goldilocks then tried the little chair, which belonged to the Little Bear, and found it just right, but she sat in it so hard that she broke it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image4-19.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage4-19-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;The error rate for the chairs is ⅓&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Errors&lt;/strong&gt; golden signal measures the rate of requests that fail.&lt;/p&gt;

&lt;p&gt;Note that measuring the absolute number of errors might not be the best course of action. If your application has a sudden peak of requests, the number of failed requests will logically increase too.&lt;/p&gt;

&lt;p&gt;That’s why monitoring systems usually focus on the error rate, calculated as the percentage of failing calls out of the total.&lt;/p&gt;

&lt;p&gt;If you’re managing a web application, typically you will discriminate between those calls returning HTTP status in the 400-499 range (client errors) and 500-599 (server errors).&lt;/p&gt;

&lt;h3&gt;Measuring errors in Kubernetes&lt;/h3&gt;

&lt;p&gt;One thermometer for the errors happening in Kubernetes is the kubelet. You can use several kubelet metrics exposed to Prometheus to measure the amount of errors.&lt;/p&gt;

&lt;p&gt;The most important one is &lt;code&gt;kubelet_runtime_operations_errors_total&lt;/code&gt;, which indicates low level issues in the node, like problems with container runtime.&lt;/p&gt;

&lt;p&gt;If you want to visualize the error rate per operation, you can divide by &lt;code&gt;kubelet_runtime_operations_total&lt;/code&gt;.&lt;/p&gt;
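&lt;p&gt;As a sketch, the resulting error ratio per operation could be queried like this (same metric names as above):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(kubelet_runtime_operations_errors_total[5m])) by (operation_type)
/
sum(rate(kubelet_runtime_operations_total[5m])) by (operation_type)
&lt;/code&gt;&lt;/pre&gt;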

&lt;h3&gt;Errors example&lt;/h3&gt;

&lt;p&gt;Here's the Kubelet Prometheus metric for error rate in a Kubernetes cluster:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(kubelet_runtime_operations_errors_total{cluster="",
job="kubelet", metrics_path="/metrics"}[$__rate_interval])) 
by (instance, operation_type)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image16-2-1170x445.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage16-2-1170x445.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="saturation"&gt;Saturation&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Goldilocks tasted the porridge in the dear little bowl, and it was just right, and it tasted so good that she tasted and tasted, and tasted and tasted until she was full.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image17-2-1170x644.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage17-2-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;After eating one small bowl, Goldilocks is unable to eat more. That’s saturation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturation&lt;/strong&gt; measures the consumption of your system resources, usually as a percentage of the maximum capacity. Examples include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU usage&lt;/li&gt;



&lt;li&gt;Disk space&lt;/li&gt;



&lt;li&gt;Memory usage&lt;/li&gt;



&lt;li&gt;Network bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the end, cloud applications run on machines, which have a limited amount of these resources.&lt;/p&gt;

&lt;p&gt;In order to correctly measure, you should be aware of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are the consequences if the resource is depleted? It could be that your entire system becomes unusable because the resource has run out. Or maybe further requests are throttled until the system is less saturated.&lt;/li&gt;



&lt;li&gt;Saturation is not always about resources that are about to be depleted. It’s also about over-provisioning: allocating more resources than what is needed. This one is crucial for cost savings.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;Measuring saturation in Kubernetes&lt;/h3&gt;

&lt;p&gt;Since saturation depends on the resource being observed, you can use different metrics for Kubernetes entities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;node_cpu_seconds_total&lt;/code&gt; to measure machine CPU utilization.&lt;/li&gt;



&lt;li&gt;
&lt;code&gt;container_memory_usage_bytes&lt;/code&gt; to measure the memory utilization at container level (paired with &lt;code&gt;container_memory_max_usage_bytes&lt;/code&gt;).&lt;/li&gt;



&lt;li&gt;The amount of Pods that a &lt;a href="https://sysdig.com/learn-cloud-native/kubernetes-101/what-is-a-kubernetes-node/" rel="noopener noreferrer"&gt;Node&lt;/a&gt; can contain is also a Kubernetes resource.&lt;/li&gt;
&lt;/ul&gt;
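&lt;p&gt;For instance, memory saturation per Pod can be sketched as usage against the configured limits (assuming cAdvisor and kube-state-metrics metrics are available):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(container_memory_usage_bytes{container!=""}) by (pod)
/
sum(kube_pod_container_resource_limits{resource="memory"}) by (pod)
&lt;/code&gt;&lt;/pre&gt;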

&lt;h3&gt;Saturation example&lt;/h3&gt;

&lt;p&gt;Here’s a PromQL example of a Saturation signal, measuring CPU usage percent in a Kubernetes node.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image13-8-1170x410.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage13-8-1170x410.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="traffic"&gt;Traffic&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;And the Middle-sized Bear said:&lt;br&gt;“Somebody has been tumbling my bed!”&lt;br&gt;And the Little bear piped:&lt;br&gt;“Somebody has been tumbling my bed, and here she is!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image12-7-1170x644.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage12-7-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;One of the beds is in use, but none should be. That’s unusual traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traffic&lt;/strong&gt; measures the amount of use of your service per time unit.&lt;/p&gt;

&lt;p&gt;In essence, this will represent the usage of your current service. This is important not only for business reasons, but also to detect anomalies.&lt;/p&gt;

&lt;p&gt;Is the amount of requests too high? This could be due to a peak of users or because of a misconfiguration causing retries.&lt;/p&gt;

&lt;p&gt;Is the amount of requests too low? That may reflect that one of your systems is failing.&lt;/p&gt;

&lt;p&gt;Still, traffic signals should always be measured with a time reference. As an example, this blog receives more visits from Tuesday to Thursday.&lt;/p&gt;

&lt;p&gt;Depending on your application, you could be measuring traffic by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests per minute for a web application&lt;/li&gt;



&lt;li&gt;Queries per minute for a database application&lt;/li&gt;



&lt;li&gt;Endpoint requests per minute for an API&lt;/li&gt;
&lt;/ul&gt;
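&lt;p&gt;For a web application instrumented with a request counter (assuming a metric named &lt;code&gt;http_requests_total&lt;/code&gt;), requests per minute could be sketched as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(http_requests_total[5m])) * 60
&lt;/code&gt;&lt;/pre&gt;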

&lt;h3&gt;Traffic example&lt;/h3&gt;

&lt;p&gt;Here’s a Google Analytics chart displaying traffic distributed by hour:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image3-23.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage3-23.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="latency"&gt;Latency&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;At that, Goldilocks woke in a fright, and jumped out of the window and ran away as fast as her legs could carry her, and never went near the Three Bears’ snug little house again.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image11-6-1170x644.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage11-6-1170x644.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;Goldilocks ran down the stairs in just two seconds. That’s a very low latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency&lt;/strong&gt; is defined as the time it takes to serve a request.&lt;/p&gt;

&lt;h3&gt;Average latency&lt;/h3&gt;

&lt;p&gt;When working with latencies, your first impulse may be to measure average latency, but depending on your system that might not be the best idea. There may be very fast or very slow requests distorting the results.&lt;/p&gt;

&lt;p&gt;Instead, consider using a percentile, like p99, p95, or p50 (also known as the median), to measure how long the fastest 99%, 95%, or 50% of requests, respectively, took to complete.&lt;/p&gt;

&lt;h3&gt;Failed vs. successful&lt;/h3&gt;

&lt;p&gt;When measuring latency, it’s also important to discriminate between failed and successful requests, as failed ones might take noticeably less time than the successful ones.&lt;/p&gt;

&lt;h3&gt;Apdex Score&lt;/h3&gt;

&lt;p&gt;As described above, latency information may not be informative enough:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Some users might perceive applications as slower, depending on the action they are performing.&lt;/li&gt;



&lt;li&gt;Some users might perceive applications as slower, based on the default latencies of the industry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where the Apdex (Application Performance Index) comes in. It’s defined as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image1-37.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage1-37.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where t is the target latency that we consider as reasonable.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Satisfied will represent the amount of users with requests under the target latency.&lt;/li&gt;



&lt;li&gt;Tolerant will represent the amount of non-satisfied users with requests below four times the target latency.&lt;/li&gt;



&lt;li&gt;Frustrated will represent the amount of users with requests above the tolerant latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The output for the formula will be an index from 0 to 1, indicating how performant our system is in terms of latency.&lt;/p&gt;
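&lt;p&gt;The Apdex score can be approximated in PromQL from histogram buckets. A sketch, assuming a target latency t of 0.3s (so the tolerable threshold is 4t = 1.2s) and a histogram named &lt;code&gt;http_request_duration_seconds&lt;/code&gt; with matching bucket boundaries:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;(
  sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
+
  sum(rate(http_request_duration_seconds_bucket{le="1.2"}[5m]))
) / 2 / sum(rate(http_request_duration_seconds_count[5m]))
&lt;/code&gt;&lt;/pre&gt;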

&lt;h3&gt;Measuring latency in Kubernetes&lt;/h3&gt;

&lt;p&gt;In order to measure the latency in your Kubernetes cluster, you can use metrics like &lt;code&gt;http_request_duration_seconds_sum&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;You can also measure the latency for the api-server by using Prometheus metrics like &lt;code&gt;apiserver_request_duration_seconds&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;Latency example&lt;/h3&gt;

&lt;p&gt;Here’s an example of a Latency PromQL query for the 95% best performing HTTP requests in Prometheus:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;histogram_quantile(0.95, sum(rate(prometheus_http_request_duration_seconds_bucket[5m]))
by (le))
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image10-10-1170x408.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage10-10-1170x408.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="red-method"&gt;RED Method&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;RED Method&lt;/strong&gt; was created by Tom Wilkie, from Weaveworks. It is heavily inspired by the Golden Signals and it’s focused on microservices architectures.&lt;/p&gt;

&lt;p&gt;RED stands for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rate&lt;/li&gt;



&lt;li&gt;Error&lt;/li&gt;



&lt;li&gt;Duration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Rate&lt;/strong&gt; measures the number of requests per second (equivalent to Traffic in the Golden Signals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error&lt;/strong&gt; measures the number of failed requests (similar to the one in Golden Signals).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Duration&lt;/strong&gt; measures the amount of time to process a request (similar to Latency in Golden Signals).&lt;/p&gt;
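&lt;p&gt;For a service exposing a request-duration histogram (assuming a metric named &lt;code&gt;http_request_duration_seconds&lt;/code&gt; with a &lt;code&gt;code&lt;/code&gt; label), the three RED signals could be sketched as:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# Rate: requests per second
sum(rate(http_request_duration_seconds_count[5m]))

# Errors: failed requests per second
sum(rate(http_request_duration_seconds_count{code=~"5.."}[5m]))

# Duration: p95 latency in seconds
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
&lt;/code&gt;&lt;/pre&gt;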

&lt;h2 id="use-method"&gt;USE Method&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;USE Method&lt;/strong&gt; was created by Brendan Gregg and it’s used to measure infrastructure.&lt;/p&gt;

&lt;p&gt;USE stands for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Utilization&lt;/li&gt;



&lt;li&gt;Saturation&lt;/li&gt;



&lt;li&gt;Errors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means for every resource in your system (CPU, disk, etc.), you need to check the three elements above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Utilization&lt;/strong&gt; is defined as the percentage of usage for that resource.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Saturation&lt;/strong&gt; is defined as the queue for requests in the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Errors&lt;/strong&gt; is defined as the number of errors happening in the system.&lt;/p&gt;

&lt;p&gt;While it may not be intuitive, Saturation in the Golden Signals does not correspond to Saturation in USE, but rather to Utilization.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/image8-12-1170x439.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2Fimage8-12-1170x439.png" alt="The four Golden Signals of Kubernetes monitoring"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2 id="practical-example"&gt;A practical example of Golden signals in Kubernetes&lt;/h2&gt;

&lt;p&gt;As an example to illustrate the use of Golden Signals, here’s a simple Go application with Prometheus instrumentation. The application applies a random delay of up to 12 seconds in order to produce usable latency data. Traffic will be generated with curl, using several infinite loops.&lt;/p&gt;

&lt;p&gt;A &lt;a rel="noopener nofollow noreferrer" href="https://prometheus.io/docs/practices/histograms/"&gt;histogram&lt;/a&gt; was included to collect metrics related to latency and requests. These metrics will help us obtain the first three Golden Signals: latency, request rate, and error rate. To obtain saturation directly with Prometheus and node-exporter, we use the CPU usage percentage of the nodes.&lt;/p&gt;



&lt;pre&gt;&lt;code&gt;
File: main.go
-------------
package main

import (
    "fmt"
    "log"
    "math/rand"
    "net/http"
    "time"

    "github.com/gorilla/mux"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    //Prometheus: Histogram to collect required metrics
    histogram := prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "greeting_seconds",
        Help:    "Time taken to greet someone",
        Buckets: []float64{1, 2, 5, 6, 10}, //Buckets roughly covering the random delay applied by the app
    }, []string{"code"}) //This will be partitioned by the HTTP code.
    router := mux.NewRouter()
    router.Handle("/sayhello/{name}", Sayhello(histogram))
    router.Handle("/metrics", promhttp.Handler()) //Metrics endpoint for scraping
    router.Handle("/{anything}", Sayhello(histogram))
    router.Handle("/", Sayhello(histogram))
    //Registering the defined metric with Prometheus
    prometheus.MustRegister(histogram)
    log.Fatal(http.ListenAndServe(":8080", router))
}

func Sayhello(histogram *prometheus.HistogramVec) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        //Monitoring how long it takes to respond
        start := time.Now()
        defer r.Body.Close()
        code := 500
        defer func() {
            httpDuration := time.Since(start)
            histogram.WithLabelValues(fmt.Sprintf("%d", code)).Observe(httpDuration.Seconds())
        }()
        if r.Method == "GET" {
            vars := mux.Vars(r)
            code = http.StatusOK
            if _, ok := vars["anything"]; ok {
                //Sleep random seconds
                rand.Seed(time.Now().UnixNano())
                n := rand.Intn(2) //n will be 0 or 1
                time.Sleep(time.Duration(n) * time.Second)
                code = http.StatusNotFound
                w.WriteHeader(code)
            }
            //Sleep random seconds
            rand.Seed(time.Now().UnixNano())
            n := rand.Intn(12) //n will be between 0 and 11
            time.Sleep(time.Duration(n) * time.Second)
            name := vars["name"]
            greet := fmt.Sprintf("Hello %s \n", name)
            w.Write([]byte(greet))
        } else {
            code = http.StatusBadRequest
            w.WriteHeader(code)
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;The application was deployed in a Kubernetes cluster with Prometheus and Grafana, and generated a dashboard with Golden Signals. In order to obtain the data for the dashboards, these are the PromQL queries:&lt;/p&gt;

&lt;h3&gt;Latency:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;sum(greeting_seconds_sum)/sum(greeting_seconds_count)  //Average
histogram_quantile(0.95, sum(rate(greeting_seconds_bucket[5m])) by (le)) //Percentile p95&lt;/code&gt;&lt;/pre&gt;
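
&lt;p&gt;Note that the first query averages over the entire lifetime of the process, since it divides the raw counters. A common variant (a sketch; the 5m window is an assumption) computes the average latency over a recent window instead:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(greeting_seconds_sum[5m])) / sum(rate(greeting_seconds_count[5m]))  // Average over the last 5 minutes&lt;/code&gt;&lt;/pre&gt;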

&lt;h3&gt;Request rate:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(greeting_seconds_count{}[2m]))  //Including errors
rate(greeting_seconds_count{code="200"}[2m])  //Only 200 OK requests&lt;/code&gt;&lt;/pre&gt;

&lt;h3&gt;Errors per second:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(greeting_seconds_count{code!="200"}[2m]))&lt;/code&gt;&lt;/pre&gt;
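
&lt;p&gt;As a sketch building on the same counter, the error &lt;em&gt;ratio&lt;/em&gt; (errors as a fraction of total traffic, rather than an absolute rate) can also be derived, which is often easier to alert on:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;sum(rate(greeting_seconds_count{code!="200"}[2m])) / sum(rate(greeting_seconds_count[2m]))&lt;/code&gt;&lt;/pre&gt;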

&lt;h3&gt;Saturation:&lt;/h3&gt;

&lt;pre&gt;&lt;code&gt;100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)&lt;/code&gt;&lt;/pre&gt;

&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Golden Signals, RED, and USE are guidelines on what to focus on when observing your systems, but they are just the bare minimum of what to measure.&lt;/p&gt;

&lt;p&gt;Understand the &lt;strong&gt;errors&lt;/strong&gt; in your system. They act as a thermometer for all the other metrics, since they point to any unusual behavior. Remember to correctly mark requests as erroneous, but only those that are genuinely exceptional failures. Otherwise, your system will be prone to false positives or false negatives.&lt;/p&gt;

&lt;p&gt;Measure &lt;strong&gt;latency&lt;/strong&gt; of your requests. Try to understand your bottlenecks and what the negative experiences are when latency is higher than expected.&lt;/p&gt;

&lt;p&gt;Visualize &lt;strong&gt;saturation&lt;/strong&gt; and understand the resources involved in your solution. What are the consequences if a resource gets depleted?&lt;/p&gt;

&lt;p&gt;Measure &lt;strong&gt;traffic&lt;/strong&gt; to understand your usage curves. You will be able to find the best time to take down your system for an update, or you could be alerted when there’s an unexpected amount of users.&lt;/p&gt;

&lt;p&gt;Once metrics are in place, it’s important to set up alerts, which will notify you in case any of these metrics reach a certain threshold.&lt;/p&gt;
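
&lt;p&gt;As a minimal sketch of such an alert (the metric name comes from the example above; the 5% threshold and 5-minute window are arbitrary assumptions you should tune to your service), a Prometheus alerting rule for the error ratio could look like this:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;groups:
  - name: golden-signals
    rules:
      - alert: HighErrorRate
        # Fire when non-200 responses exceed 5% of traffic for 5 minutes
        expr: |
          sum(rate(greeting_seconds_count{code!="200"}[2m]))
            / sum(rate(greeting_seconds_count[2m])) &gt; 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Error ratio above 5% for the greeting service"&lt;/code&gt;&lt;/pre&gt;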

&lt;h2&gt;Track golden signals easily with Sysdig Monitor&lt;/h2&gt;
&lt;p&gt;With Sysdig Monitor, you can quickly review the golden signals in your system, out of the box.&lt;/p&gt;

&lt;p&gt;Easily review the Latency, Errors, Saturation, and Traffic for the Pods in your cluster. And thanks to its Container Observability with eBPF, you can do this without adding any app or code instrumentation.&lt;/p&gt;

&lt;p&gt;Sysdig Advisor accelerates mean time to resolution (MTTR) with live logs, performance data, and suggested remediation steps. It’s the easy button for Kubernetes troubleshooting!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/wp-content/uploads/GoldenSignals-11.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsysdig.com%2Fwp-content%2Fuploads%2FGoldenSignals-11.png" alt="Sysdig Monitor Golden Signals"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://sysdig.com/company/free-trial-monitor/" rel="noopener noreferrer"&gt;Try it free for 30 days!&lt;/a&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>monitoring</category>
      <category>prometheus</category>
    </item>
  </channel>
</rss>
