<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ana Cozma</title>
    <description>The latest articles on DEV Community by Ana Cozma (@the_cozma).</description>
    <link>https://dev.to/the_cozma</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F769233%2F0f14f6ea-58bf-4797-b2fe-b2677507caa6.jpeg</url>
      <title>DEV Community: Ana Cozma</title>
      <link>https://dev.to/the_cozma</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/the_cozma"/>
    <language>en</language>
    <item>
      <title>How to Check TLS Configuration of URLs with Curl and Bash Script</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Tue, 20 Aug 2024 14:22:06 +0000</pubDate>
      <link>https://dev.to/the_cozma/how-to-check-tls-configuration-of-urls-with-curl-and-bash-script-3ad1</link>
      <guid>https://dev.to/the_cozma/how-to-check-tls-configuration-of-urls-with-curl-and-bash-script-3ad1</guid>
      <description>&lt;p&gt;If you are working in an Azure environment and you are using Azure Availability Tests you might run into the following Health Advisory event:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;On 31 October 2024, in alignment with the Azure wide legacy TLS deprecation, TLS 1.0/1.1 protocol versions and the below listed TLS 1.2/1.3 legacy Cipher suites and Elliptical curves will be retired for Application Insights availability tests.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For a list of deprecated versions and remaining supported versions have a look over the official documentation &lt;a href="https://learn.microsoft.com/en-us/azure/azure-monitor/app/availability?tabs=standard#deprecating-tls-configuration" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But how do you quickly check which endpoint in your availability tests is impacted?&lt;/p&gt;

&lt;p&gt;This was the initial scenario that led me to create this script and ultimately write this blog post, but these checks apply to any case where you need to retrieve and verify the TLS configuration of URLs. Whether you're ensuring compliance with security standards, troubleshooting connection issues, or simply gathering information for audits, the script can help you get the information you need.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Curl for one URL
&lt;/h2&gt;

&lt;p&gt;You can use the &lt;code&gt;curl&lt;/code&gt; command with the &lt;code&gt;-v&lt;/code&gt; (verbose) option to see detailed information about the TLS handshake, including the TLS version, by running the command below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; https://example.com 2&amp;gt;&amp;amp;1 | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"SSL connection using"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation of the command and its parameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use &lt;code&gt;curl&lt;/code&gt; to make a request to the specified URL.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;-s&lt;/code&gt; option makes curl silent, except for errors.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;-v&lt;/code&gt; option outputs verbose information.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;2&amp;gt;&amp;amp;1&lt;/code&gt; redirects the standard error (where verbose output is written) to standard output, allowing grep to filter it.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;grep "SSL connection using"&lt;/code&gt; command filters out the line containing the TLS version.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running the command produces output similar to this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="k"&gt;*&lt;/span&gt; SSL connection using TLSv1.2 / ECDHE_RSA_AES_256_GCM_SHA384
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what if you have more than one URL to check? Running this command manually for each one is tedious and involves a lot of copy-pasting. So let's look into saving some time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using Curl and a Bash script to loop through a list of URLs
&lt;/h2&gt;

&lt;p&gt;We can take this a step further and create a script that will accept a list of URLs, loop through them and output the information we need. You can achieve this by following the steps below:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Create the file:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;First, let's create a file called &lt;code&gt;check_tls_version.sh&lt;/code&gt; (or name it whatever you like):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;touch &lt;/span&gt;check_tls_version.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;2. Make the Script Executable:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;chmod&lt;/span&gt; +x check_tls_details.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;3. Create the Script:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vi check_tls_version.sh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use vi, nano, or any editor of your choice, and paste the code below into the newly created file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="c"&gt;# Check if a file was provided as an argument&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-z&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
  &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Usage: &lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="s2"&gt; &amp;lt;file_with_urls&amp;gt;"&lt;/span&gt;
  &lt;span class="nb"&gt;exit &lt;/span&gt;1
&lt;span class="k"&gt;fi&lt;/span&gt;

&lt;span class="c"&gt;# Read the file line by line&lt;/span&gt;
&lt;span class="k"&gt;while &lt;/span&gt;&lt;span class="nv"&gt;IFS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;read&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; url&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="c"&gt;# Make sure the line is not empty&lt;/span&gt;
  &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$url&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Checking &lt;/span&gt;&lt;span class="nv"&gt;$url&lt;/span&gt;&lt;span class="s2"&gt;..."&lt;/span&gt;

    &lt;span class="c"&gt;# Use curl to fetch the TLS details&lt;/span&gt;
    &lt;span class="nv"&gt;output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$url&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; 2&amp;gt;&amp;amp;1&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Extract the TLS version&lt;/span&gt;
    &lt;span class="nv"&gt;tls_version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"SSL connection using"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $5}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Extract the cipher suite&lt;/span&gt;
    &lt;span class="nv"&gt;cipher_suite&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"SSL connection using"&lt;/span&gt; | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="nt"&gt;-F&lt;/span&gt;&lt;span class="s1"&gt;'/'&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt; | xargs&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Extract the elliptic curve (if available)&lt;/span&gt;
    &lt;span class="nv"&gt;elliptic_curve&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$output&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="s2"&gt;"SSL certificate verify"&lt;/span&gt; | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s1"&gt;'(?&amp;lt;=using ).*(?= curve)'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

    &lt;span class="c"&gt;# Output the results&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"URL: &lt;/span&gt;&lt;span class="nv"&gt;$url&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"TLS Version: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;tls_version&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;Could&lt;/span&gt;&lt;span class="p"&gt; not retrieve TLS version&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Cipher Suite: &lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;cipher_suite&lt;/span&gt;&lt;span class="k"&gt;:-&lt;/span&gt;&lt;span class="nv"&gt;Could&lt;/span&gt;&lt;span class="p"&gt; not retrieve cipher suite&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;[&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$elliptic_curve&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="o"&gt;]&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;then
      &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Elliptic Curve: &lt;/span&gt;&lt;span class="nv"&gt;$elliptic_curve&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;else
      &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Elliptic Curve: Could not retrieve elliptic curve or not applicable"&lt;/span&gt;
    &lt;span class="k"&gt;fi
    &lt;/span&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt;
  &lt;span class="k"&gt;fi
done&lt;/span&gt; &amp;lt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the sample above, I added comments on what each line does, but feel free to modify it to extract more or less of the information you are interested in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Prepare a File with URLs:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Create a text file with one URL per line, e.g., &lt;code&gt;urls.txt&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;https://example.com
https://anotherexample.com
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;5. Run the Script:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Having both the script (&lt;code&gt;check_tls_version.sh&lt;/code&gt;) and our list of URLs (&lt;code&gt;urls.txt&lt;/code&gt;), we can now run the script we created:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./check_tls_details.sh urls.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;6. Sample Output:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The script will produce output similar to this for each URL you provided:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;Checking https://example.com...&lt;/span&gt;
&lt;span class="na"&gt;URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://example.com&lt;/span&gt;
&lt;span class="na"&gt;TLS Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TLSv1.3&lt;/span&gt;
&lt;span class="na"&gt;Cipher Suite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AEAD-AES128-GCM-SHA256&lt;/span&gt;
&lt;span class="na"&gt;Elliptic Curve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;X25519&lt;/span&gt;

&lt;span class="s"&gt;Checking https://anotherexample.com...&lt;/span&gt;
&lt;span class="na"&gt;URL&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;https://anotherexample.com&lt;/span&gt;
&lt;span class="na"&gt;TLS Version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TLSv1.2&lt;/span&gt;
&lt;span class="na"&gt;Cipher Suite&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ECDHE-RSA-AES256-GCM-SHA384&lt;/span&gt;
&lt;span class="na"&gt;Elliptic Curve&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prime256v1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Explanation of the output parameters:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;URL:&lt;/em&gt; The URL being checked, as provided in &lt;code&gt;urls.txt&lt;/code&gt;.&lt;br&gt;
&lt;em&gt;TLS Version:&lt;/em&gt; The TLS version used by the URL.&lt;br&gt;
&lt;em&gt;Cipher Suite:&lt;/em&gt; The cipher suite used for the connection.&lt;br&gt;
&lt;em&gt;Elliptic Curve:&lt;/em&gt; The elliptic curve used, if applicable.&lt;/p&gt;

&lt;p&gt;Now that you have your data you can simply compare the TLS version, Cipher Suite or Elliptic Curve against the deprecated or supported versions and take appropriate actions to update them.&lt;/p&gt;
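&lt;p&gt;That comparison step can also be scripted. Below is a minimal sketch of a helper that flags anything below TLS 1.2; note that the version list here is purely illustrative, so check the official documentation linked above for the authoritative set of retired versions:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch: classify a TLS version string as reported by curl.
# NOTE: the "deprecated" list below is illustrative only; consult the
# official Azure documentation for the authoritative retired versions.
classify_tls() {
  case "$1" in
    TLSv1|TLSv1.0|TLSv1.1) echo "DEPRECATED" ;;
    TLSv1.2|TLSv1.3)       echo "OK" ;;
    *)                     echo "UNKNOWN" ;;
  esac
}

classify_tls "TLSv1.1"   # prints DEPRECATED
classify_tls "TLSv1.3"   # prints OK
```

&lt;p&gt;You could call this helper from the loop in the script above, passing it the extracted &lt;code&gt;tls_version&lt;/code&gt; value.&lt;/p&gt;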

&lt;p&gt;Lastly, I used curl on macOS, but you can do the same on Windows by installing it from &lt;a href="https://curl.se/windows/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for reading, and I hope this helps someone out there with their use case!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>bash</category>
      <category>devops</category>
    </item>
    <item>
      <title>Understanding and Mitigating the Latest OpenSSH Vulnerability (CVE-2024-6387) in AKS</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Wed, 17 Jul 2024 12:46:47 +0000</pubDate>
      <link>https://dev.to/the_cozma/understanding-and-mitigating-the-latest-openssh-vulnerability-cve-2024-6387-in-aks-44ak</link>
      <guid>https://dev.to/the_cozma/understanding-and-mitigating-the-latest-openssh-vulnerability-cve-2024-6387-in-aks-44ak</guid>
      <description>&lt;p&gt;Recently a new vulnerability in OpenSSH has been identified and the first question that popped into my mind was: &lt;em&gt;How do I make sure my nodes are not affected by _this vulnerability&lt;/em&gt;?&lt;/p&gt;

&lt;p&gt;In this blog post, I want to go over what the vulnerability is and how it can be exploited, explain how to check whether your Azure Kubernetes Service (AKS) cluster is vulnerable to CVE-2024-6387, and cover what you can do about it, including the different options for upgrading the VMSS image and how to choose between them.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understand the vulnerability
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CVE-2024-6387
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2024-6387" rel="noopener noreferrer"&gt;CVE-2024-6387&lt;/a&gt; is a critical unauthenticated RCE-as-root vulnerability that was identified in the OpenSSH server, &lt;code&gt;sshd&lt;/code&gt;, in glibc-based Linux systems. If exploited, this vulnerability grants full root access, affects the default configuration and does not require user interaction thus it is classified as a &lt;strong&gt;High Severity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This was identified on the 1st of July 2024.&lt;/p&gt;

&lt;p&gt;The researchers who discovered it also noted that in 2006 OpenSSH faced a similar vulnerability, known as &lt;a href="https://security-tracker.debian.org/tracker/CVE-2006-5051" rel="noopener noreferrer"&gt;CVE-2006-5051&lt;/a&gt;. While the 2006 issue was patched, later code changes reintroduced the bug. This is why CVE-2024-6387 is dubbed the "regreSSHion" bug: a regression of an issue that had already been fixed.&lt;/p&gt;

&lt;p&gt;CVE-2024-6387 vulnerability impacts the following OpenSSH server versions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenSSH versions from &lt;code&gt;8.5p1&lt;/code&gt; up to, but not including, &lt;code&gt;9.8p1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;OpenSSH versions earlier than &lt;code&gt;4.4p1&lt;/code&gt;, if they have not been backport-patched against CVE-2006-5051 or patched against CVE-2008-4109&lt;/li&gt;
&lt;/ul&gt;
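&lt;p&gt;To find out which version a machine is running, &lt;code&gt;ssh -V&lt;/code&gt; (or &lt;code&gt;dpkg -l openssh-server&lt;/code&gt; on Debian/Ubuntu) prints it. As a rough sketch, a helper like the one below can tell whether a version string falls inside the affected range; the parsing is deliberately simplified (major.minor only) and it ignores the legacy pre-4.4p1 case:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch: check whether an OpenSSH version string such as "9.6p1" falls in
# the range affected by CVE-2024-6387 (8.5p1 up to, but not including, 9.8p1).
# Simplified: looks at major.minor only and ignores the pre-4.4p1 legacy case.
vulnerable_to_regresshion() {
  major="${1%%.*}"     # text before the first dot
  rest="${1#*.}"       # text after the first dot
  minor="${rest%%p*}"  # strip the patch suffix, e.g. "6p1" becomes "6"
  case "$major.$minor" in
    8.5|8.6|8.7|8.8|8.9|9.0|9.1|9.2|9.3|9.4|9.5|9.6|9.7) echo "yes" ;;
    *) echo "no" ;;
  esac
}

vulnerable_to_regresshion "9.6p1"   # prints yes
vulnerable_to_regresshion "9.8p1"   # prints no
```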

&lt;h3&gt;
  
  
  CVE-2024-6409
&lt;/h3&gt;

&lt;p&gt;As of the 9th of July another vulnerability has been discovered: &lt;a href="https://nvd.nist.gov/vuln/detail/CVE-2024-6409" rel="noopener noreferrer"&gt;CVE-2024-6409&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a distinct vulnerability from the regreSSHion bug. The vulnerability allows an attacker to execute code within the &lt;em&gt;privsep&lt;/em&gt; child process. This child process is a part of OpenSSH that runs with restricted privileges to limit the damage that can be done if it is compromised.&lt;/p&gt;

&lt;p&gt;The vulnerability is caused by a race condition related to how signals are handled. This means that the &lt;em&gt;privsep&lt;/em&gt; child process can be exploited because the timing of signal handling operations can be manipulated, leading to unintended behavior that allows code execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact:&lt;/strong&gt; OpenSSH versions 8.7p1 and 8.8p1, as shipped with Red Hat Enterprise Linux 9.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Machines patched for CVE-2024-6387 will also be patched for CVE-2024-6409.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Suggested actions against the vulnerability
&lt;/h2&gt;

&lt;p&gt;To protect against this vulnerability, the main suggestion is to upgrade the package using a command such as &lt;code&gt;apt upgrade openssh-sftp-server&lt;/code&gt;. If you cannot do this and need a quick workaround, an option is to set the &lt;code&gt;LoginGraceTime&lt;/code&gt; SSH configuration parameter to 0, as recommended by &lt;a href="https://ubuntu.com/blog/ubuntu-regresshion-security-fix" rel="noopener noreferrer"&gt;Ubuntu&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's look into both recommendations and understand them a bit more, starting with the workaround:&lt;/p&gt;

&lt;h3&gt;
  
  
  Set LoginGraceTime to 0
&lt;/h3&gt;

&lt;p&gt;OpenSSH allows remote connections to server machines. The &lt;code&gt;LoginGraceTime&lt;/code&gt; SSH server configuration parameter specifies &lt;em&gt;the time allowed for successful authentication to the server&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This means that a longer grace period allows more unauthenticated connections to remain open, while a shorter one can protect against brute-force attacks in certain cases.&lt;/p&gt;

&lt;p&gt;In the context of the identified vulnerability, this matters because the vulnerable code is called only when the &lt;code&gt;LoginGraceTime&lt;/code&gt; timer fires. The reasoning is that by setting it to 0, meaning no timeout, the timer never fires, the vulnerable code is never called, and the vulnerability is eliminated.&lt;/p&gt;

&lt;p&gt;But there is a caveat here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;While you eliminate the risk of calling the vulnerable code, and you are protected against brute force attacks, by setting this to 0 you are making &lt;code&gt;sshd&lt;/code&gt; vulnerable to denial of service attacks. So it's good to consider your options carefully and the tradeoff when you are configuring these settings.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
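&lt;p&gt;If you decide to apply the workaround anyway, here is a sketch of what that could look like on an Ubuntu machine; the drop-in path and service name are assumptions on my part, so verify them for your distribution:&lt;/p&gt;

```shell
# Sketch: temporary mitigation by disabling the login grace timeout.
# WARNING: this trades the RCE risk for a potential DoS risk, as explained
# in the MaxStartups section below; prefer upgrading sshd when possible.
echo "LoginGraceTime 0" | sudo tee /etc/ssh/sshd_config.d/99-regresshion.conf
sudo sshd -t                   # validate the configuration before applying it
sudo systemctl restart ssh     # the service is named "sshd" on some distros
```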

&lt;p&gt;&lt;strong&gt;Denial of Service through MaxStartups Exhaustion Explained&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;MaxStartups&lt;/code&gt; is another &lt;code&gt;sshd&lt;/code&gt; configuration that limits the number of concurrent unauthenticated connections.&lt;/p&gt;

&lt;p&gt;If &lt;code&gt;LoginGraceTime&lt;/code&gt; is set to 0, attackers can open numerous connections without being timed out. Since these connections won't be closed due to timeout, they will remain open indefinitely.&lt;/p&gt;

&lt;p&gt;This can exhaust the allowed number of connections specified by &lt;code&gt;MaxStartups&lt;/code&gt;, preventing legitimate users from accessing the SSH service.&lt;/p&gt;

&lt;p&gt;Essentially, the server becomes overwhelmed with these open connections, leading to a denial of service for legitimate users.&lt;/p&gt;
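&lt;p&gt;For reference, this is roughly what the two settings look like in &lt;code&gt;/etc/ssh/sshd_config&lt;/code&gt;; the values shown are the OpenSSH defaults:&lt;/p&gt;

```shell
# Time (in seconds) a client gets to authenticate before the server
# drops the connection; 0 disables the timeout entirely
LoginGraceTime 120
# start:rate:full -- once 10 unauthenticated connections are open, drop
# new ones with 30% probability, and drop all new ones at 100
MaxStartups 10:30:100
```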

&lt;p&gt;This is why the main recommendation is to upgrade to a patched version of &lt;code&gt;sshd&lt;/code&gt; where the underlying vulnerability has been addressed. This ensures that &lt;code&gt;LoginGraceTime&lt;/code&gt; can be set to a reasonable value, and the server can handle connection attempts appropriately without being vulnerable to a DoS attack via &lt;code&gt;MaxStartups&lt;/code&gt; exhaustion.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upgrade to a patched version of &lt;code&gt;sshd&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Now onto the main fix and what this means for your virtual machine scale sets (VMSS) in the AKS context. When running AKS, modifying the VMSS yourself is generally not recommended due to the following reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Managed Service:&lt;/em&gt; AKS is a &lt;strong&gt;managed&lt;/strong&gt; Kubernetes service, meaning Microsoft handles most of the underlying infrastructure management for you. Directly modifying VMSS configurations can interfere with the automated management and updates provided by AKS.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Configuration Consistency:&lt;/em&gt; AKS maintains certain configurations to ensure the cluster operates correctly. Manual modifications to the VMSS could lead to a configuration drift, where the manually set configurations diverge from the managed state AKS expects and maintains.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Stability and Reliability:&lt;/em&gt; Direct modifications can lead to instability or unexpected behavior within your cluster. This includes potential issues during upgrades, scaling operations, or applying patches.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these reasons, handling the fix for this vulnerability means waiting for the Azure release team to provide a patched image.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check the AKS version
&lt;/h2&gt;

&lt;p&gt;When you upgrade Kubernetes, the node images are upgraded as well, so a good place to start is identifying the Kubernetes version your AKS clusters are running. You can do this through the Azure portal, the CLI, or the API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure Portal:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Navigate to your AKS cluster resource and check the version information in the &lt;em&gt;Overview&lt;/em&gt; section.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Azure CLI:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks show &lt;span class="nt"&gt;--resource-group&lt;/span&gt; &amp;lt;ResourceGroupName&amp;gt; &lt;span class="nt"&gt;--name&lt;/span&gt; &amp;lt;AKSClusterName&amp;gt; &lt;span class="nt"&gt;--query&lt;/span&gt; kubernetesVersion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: Replace &lt;code&gt;ResourceGroupName&lt;/code&gt; and &lt;code&gt;AKSClusterName&lt;/code&gt; with your actual resource group and AKS cluster names.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Then by making use of &lt;code&gt;kubectl&lt;/code&gt; command line, you can retrieve the exact version of the node images you are using:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; wide
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By running these commands you will know your Kubernetes version as well as the OS image version your nodes are running. You can then compare your node image version against the versions listed as vulnerable in the CVE details to determine whether your nodes are running an image with a vulnerable version of &lt;code&gt;sshd&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check and upgrade the AKS VMSS node image
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Identify the patched image version
&lt;/h3&gt;

&lt;p&gt;Azure Kubernetes Service regularly provides new node images, so it's good to upgrade your node images frequently to take advantage of the latest AKS features. Linux node images are updated weekly, and Windows node images are updated monthly.&lt;/p&gt;

&lt;p&gt;For Azure, and AKS more specifically, you should perform the following checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check for the node image with a patched &lt;code&gt;sshd&lt;/code&gt; version on &lt;a href="https://github.com/Azure/AKS/releases" rel="noopener noreferrer"&gt;GitHub Azure AKS Releases&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Check the rollout schedule of the patched node image in your region &lt;a href="https://releases.aks.azure.com/#tabeuro" rel="noopener noreferrer"&gt;AKS Release Status page&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Tip: It is also a good practice in general to check the release page for announcements on upcoming releases and the fixes they include, and to keep your node images up to date to protect against the latest vulnerabilities.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the time of writing, we'll be looking out for the rollout of the image version &lt;strong&gt;202407.08.0&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Generally, when you upgrade the Kubernetes version the images will be upgraded as well, but when you have a security patch you might want to upgrade only the image and not the Kubernetes version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Please consider carefully before upgrading a node image version because it's not possible to downgrade it afterward!&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the patched image version availability
&lt;/h3&gt;

&lt;p&gt;In order to &lt;strong&gt;check for available node image upgrades&lt;/strong&gt; for the nodes in your node pool simply run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool get-upgrades &lt;span class="nt"&gt;--nodepool-name&lt;/span&gt; mynodepool &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myAKSCluster &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the JSON output, check the &lt;code&gt;latestNodeImageVersion&lt;/code&gt; parameter which indicates the version of the latest image available that the nodes can be upgraded to.&lt;/p&gt;

&lt;p&gt;Then, check the node image you are actually running on (this can be done via the Azure portal or the CLI). Using the CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool show &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myAKSCluster &lt;span class="nt"&gt;--name&lt;/span&gt; mynodepool &lt;span class="nt"&gt;--query&lt;/span&gt; nodeImageVersion
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simply compare the two image versions. If they differ, an upgrade is available for your nodes. If not, you are already running the latest image and should watch the releases page for the rollout of the image you want to upgrade to.&lt;/p&gt;
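&lt;p&gt;The comparison itself is a simple string check. A sketch is shown below; in practice the two values would come from the &lt;code&gt;az aks nodepool show&lt;/code&gt; and &lt;code&gt;az aks nodepool get-upgrades&lt;/code&gt; commands above, and the image names here are made up for illustration:&lt;/p&gt;

```shell
#!/bin/bash
# Sketch: compare the running node image version with the latest available.
# In practice, populate these two variables from the az commands shown above.
current="AKSUbuntu-2204gen2containerd-202406.07.0"   # hypothetical value
latest="AKSUbuntu-2204gen2containerd-202407.08.0"    # hypothetical value

if [ "$current" != "$latest" ]; then
  echo "Upgrade available: $current -> $latest"
else
  echo "Already running the latest node image"
fi
```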

&lt;p&gt;Having the image version available in your region, the next step will be &lt;strong&gt;performing the actual node image upgrade&lt;/strong&gt;. There are several ways of handling this depending on your scenario which I will detail below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Upgrade all node images in all node pools
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;TL;DR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI Command:&lt;/strong&gt; &lt;code&gt;az aks upgrade --node-image-only&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Scope:&lt;/strong&gt; This command applies the upgrade to all node pools in the specified AKS cluster.&lt;br&gt;
&lt;strong&gt;Use Case:&lt;/strong&gt; Use this when you want to ensure that all nodes in your entire cluster are updated to the latest node image version.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How To&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;Use the &lt;code&gt;az aks upgrade&lt;/code&gt; command with the &lt;code&gt;--node-image-only&lt;/code&gt; flag to upgrade the node images across all node pools in the AKS cluster. This command ensures that only the node image is upgraded without altering the Kubernetes version.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks upgrade &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-image-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;After initiating the upgrade, you can verify the status of the node images using the &lt;code&gt;kubectl get nodes&lt;/code&gt; command with a specific JSONPath query to output the node names and their image versions.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.azure\.com\/node-image-version}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Once the upgrade is complete, you can retrieve the updated details of the node pools, including the current node image version, using the &lt;code&gt;az aks show&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks show &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; myAKSCluster
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ol&gt;

&lt;h3&gt;
  
  
  Upgrade a specific node pool
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;TL;DR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI Command:&lt;/strong&gt; &lt;code&gt;az aks nodepool upgrade --node-image-only&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Scope:&lt;/strong&gt; This command targets a specific node pool within the AKS cluster, identified by the &lt;code&gt;--name&lt;/code&gt; parameter.&lt;br&gt;
&lt;strong&gt;Use Case:&lt;/strong&gt; Use this when you need to upgrade the node image for only one particular node pool, perhaps for testing or staggered rollout purposes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How To&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;If you want to upgrade the node image of a specific node pool without affecting the entire cluster, use the &lt;code&gt;az aks nodepool upgrade&lt;/code&gt; command with the &lt;code&gt;--node-image-only&lt;/code&gt; flag.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool upgrade &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; mynodepool &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--node-image-only&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Similar to the cluster-wide upgrade, check the status of the node images with the &lt;code&gt;kubectl get nodes&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.azure\.com\/node-image-version}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Use the &lt;code&gt;az aks nodepool show&lt;/code&gt; command to get the details of the updated node pool.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool show &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; mynodepool
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ol&gt;

&lt;h3&gt;
  
  
  Use Node Surge to Speed Up Upgrades
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;TL;DR&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CLI Command&lt;/strong&gt;: &lt;code&gt;az aks nodepool update --max-surge&lt;/code&gt;&lt;br&gt;
&lt;strong&gt;Scope&lt;/strong&gt;: This command also targets a specific node pool, but includes the &lt;code&gt;--max-surge&lt;/code&gt; parameter to control the number of extra nodes that can be created to expedite the upgrade.&lt;br&gt;
&lt;strong&gt;Use Case&lt;/strong&gt;: Use this when you want to perform a faster upgrade of a node pool by temporarily increasing the number of nodes during the upgrade process, thereby reducing downtime or upgrade duration.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How To&lt;/em&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;To speed up the node image upgrade process, you can use the &lt;code&gt;az aks nodepool update&lt;/code&gt; command with the &lt;code&gt;--max-surge&lt;/code&gt; flag, which specifies the number of extra nodes used during the upgrade process. This allows more nodes to be upgraded simultaneously.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool update &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; mynodepool &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--max-surge&lt;/span&gt; 33% &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--no-wait&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Check the node image status as previously described.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get nodes &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;jsonpath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s1"&gt;'{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.kubernetes\.azure\.com\/node-image-version}{"\n"}{end}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;Retrieve the updated node pool details using the &lt;code&gt;az aks nodepool show&lt;/code&gt; command.&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight shell"&gt;&lt;code&gt;az aks nodepool show &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--resource-group&lt;/span&gt; myResourceGroup &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--cluster-name&lt;/span&gt; myAKSCluster &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--name&lt;/span&gt; mynodepool
&lt;/code&gt;&lt;/pre&gt;




&lt;/li&gt;

&lt;/ol&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The choice between the three approaches depends on your strategy and what you want to focus on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have a new security patch or critical update and want every node in your cluster to be updated as quickly as possible without specifying individual node pools, upgrade the entire cluster.&lt;/li&gt;
&lt;li&gt;If you are running different workloads on separate node pools and want to update the node image for only one specific pool, for example to test compatibility or performance, use a targeted node pool upgrade.&lt;/li&gt;
&lt;li&gt;If you need a faster upgrade for a specific node pool and can afford to temporarily add more nodes to handle the upgrade process, use node surge.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;I hope this article gives you an idea of this particular security vulnerability, how you can mitigate it, and how you can approach security patches in the context of AKS VMSS in the future. Thank you for reading!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>aks</category>
      <category>devops</category>
    </item>
    <item>
      <title>AWS: Handling 'Cannot delete entity, must remove tokens from principal first' error</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Thu, 08 Feb 2024 12:19:33 +0000</pubDate>
      <link>https://dev.to/the_cozma/aws-handling-cannot-delete-entity-must-remove-tokens-from-principal-first-error-57pl</link>
      <guid>https://dev.to/the_cozma/aws-handling-cannot-delete-entity-must-remove-tokens-from-principal-first-error-57pl</guid>
      <description>&lt;p&gt;This blog post will be a quick one focusing on troubleshooting a less clear error, &lt;em&gt;'Cannot delete entity, must remove tokens from principal first'&lt;/em&gt;, that Terraform can throw when you try to delete IAM users from AWS.&lt;/p&gt;

&lt;p&gt;Let's assume that in your Terraform configuration you manage IAM users and you want to delete one of them. You'd think that simply removing the Terraform code and running &lt;code&gt;terraform apply&lt;/code&gt; would delete the user. That's what I assumed. But as soon as I ran the command to destroy the resource, I ran into an issue:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  - destroy

Terraform will perform the following actions:

  # aws_iam_user.little_tester will be destroyed
  # (because aws_iam_user.little_tester is not in configuration)
  - resource "aws_iam_user" "little_tester" {
      - arn           = "arn:aws:iam::xxxxxxxxxx:user/little_tester" -&amp;gt; null
      - force_destroy = false -&amp;gt; null
      - id            = "little_tester" -&amp;gt; null
      - name          = "little_tester" -&amp;gt; null
      - path          = "/" -&amp;gt; null
      - tags          = {
          - "Company"  = "MyCompany"
          - "Location" = "Aruba"
          - "Unit"     = "Front Desk"
        } -&amp;gt; null
      - tags_all      = {
          - "Company"  = "MyCompany"
          - "Location" = "Aruba"
          - "Unit"     = "Front Desk"
        } -&amp;gt; null
      - unique_id     = "AAAAAAAAAAAAAAAAA" -&amp;gt; null
    }

Plan: 0 to add, 0 to change, 1 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_iam_user.little_tester: Destroying... [id=little_tester]
╷
│ Error: deleting IAM User (little_tester): DeleteConflict: Cannot delete entity, must remove tokens from principal first.
│     status code: 409, request id: ...
│
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  So what does this mean?
&lt;/h2&gt;

&lt;p&gt;The error &lt;em&gt;Cannot delete entity, must remove tokens from principal first&lt;/em&gt; says that the user has some tokens that need to be removed before the user itself can be deleted. The tokens it refers to can be active access keys or registered MFA devices.&lt;/p&gt;

&lt;p&gt;The decision to prevent the deletion of a user while any of these active tokens are associated with it makes sense from a security perspective, because it aims to prevent the accidental deletion of users that are still active.&lt;/p&gt;

&lt;p&gt;A way to confirm whether this is the case is to go to the AWS Console and check the user's Security credentials. There you should see any active access keys or registered MFA devices.&lt;/p&gt;

&lt;p&gt;Having checked that, I saw that the user had an Access key that was still active and had an active MFA device. I removed both manually and then ran &lt;code&gt;terraform apply&lt;/code&gt; again. And it worked! The user was deleted successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can this happen?
&lt;/h2&gt;

&lt;p&gt;The user's access key and the MFA device configured for their account were not managed by Terraform, meaning they were created manually. Terraform was not aware of them and could not delete them, and this was preventing the deletion of the user.&lt;/p&gt;

&lt;p&gt;This can happen if the user was created through Terraform code, but the rest of the configuration was done manually afterwards: adding an access key, adding an MFA device, etc. You then end up with a mix of Terraform-managed and non-Terraform-managed resources.&lt;/p&gt;

&lt;p&gt;Something to think about for future cases: this could also happen if you create a user group in Terraform and then add users to it manually later on. These users will be part of the group, but Terraform will not be aware of them and will not be able to manage them. The same applies to any other scenario where you mix non-Terraform-managed and Terraform-managed resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  What can you do?
&lt;/h2&gt;

&lt;p&gt;The first option is to add the access key and MFA device to the Terraform configuration, so that the creation and removal of users becomes part of a complete flow fully managed by Terraform.&lt;/p&gt;
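&lt;p&gt;As a minimal sketch of this option (reusing the &lt;code&gt;little_tester&lt;/code&gt; user from the plan output above; the output name is illustrative), managing the access key alongside the user could look like this. Note that an MFA device cannot be fully enrolled through Terraform alone, since activating it requires TOTP codes from the physical or virtual device:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_iam_user" "little_tester" {
  name = "little_tester"
}

# Managing the key in Terraform means destroying the user also destroys the key
resource "aws_iam_access_key" "little_tester" {
  user = aws_iam_user.little_tester.name
}

# The secret is only available at creation time; handle it carefully
output "little_tester_secret" {
  value     = aws_iam_access_key.little_tester.secret
  sensitive = true
}
&lt;/code&gt;&lt;/pre&gt;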

&lt;p&gt;The second option is to go to AWS Console &amp;gt; IAM and check the user's Security credentials and MFA devices. Deactivate and remove the active ones manually, then go back to your configuration and run &lt;code&gt;terraform apply&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;And lastly, you can add the &lt;a href="https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/iam_user#force_destroy" rel="noopener noreferrer"&gt;&lt;code&gt;force_destroy&lt;/code&gt; argument&lt;/a&gt; to the &lt;code&gt;aws_iam_user&lt;/code&gt; resource in your Terraform configuration. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;force_destroy - (Optional, default false) When destroying this user, destroy even if it has non-Terraform-managed IAM access keys, login profile or MFA devices. Without force_destroy a user with non-Terraform-managed access keys and login profile will fail to be destroyed.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Enabling it allows Terraform to delete the user even if it has non-Terraform-managed access keys and MFA devices.&lt;/p&gt;
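&lt;p&gt;As a sketch, enabling it on the user from the earlier plan output looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "aws_iam_user" "little_tester" {
  name          = "little_tester"
  # Allow Terraform to delete the user even with non-Terraform-managed
  # access keys, login profile or MFA devices
  force_destroy = true
}
&lt;/code&gt;&lt;/pre&gt;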

&lt;p&gt;&lt;strong&gt;Warning!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;While it does seem a convenient option, be very careful with this argument, as it can lead to the accidental deletion of users that are still active. I would advise using it only if you are sure that the user is not active (perhaps with a check that runs before the resources are destroyed), you are aware of the security implications, and you have reviewed which team members can run the Terraform code.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hope this helps someone out there!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>terraform</category>
      <category>iac</category>
    </item>
    <item>
      <title>Azure Application Gateway WAF config vs WAF policy</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Thu, 02 Nov 2023 13:49:04 +0000</pubDate>
      <link>https://dev.to/the_cozma/azure-application-gateway-waf-config-vs-waf-policy-i56</link>
      <guid>https://dev.to/the_cozma/azure-application-gateway-waf-config-vs-waf-policy-i56</guid>
      <description>&lt;p&gt;Recently, I had to enable WAF on our Azure Application Gateway. Because of our infrastructure setup, I wanted to have all the rules from OWASP 3.2 enabled, but I needed to be able to exclude some of our (valid) requests from being blocked as well. To achieve this, I could either try to configure the WAF Config section on our Gateway or create a WAF policy. &lt;/p&gt;

&lt;p&gt;Given that it was not entirely clear how you can use proper exclusions and filters based on what you need, I decided to write this post to explain the differences I found between the two and how you can use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is WAF?
&lt;/h2&gt;

&lt;p&gt;To recap what Web Application Firewall (WAF) is, here is a brief explanation from the official documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Azure Web Application Firewall (WAF) on Azure Application Gateway provides centralized protection of your web applications from common exploits and vulnerabilities. Web applications are increasingly targeted by malicious attacks that exploit commonly known vulnerabilities. SQL injection and cross-site scripting are among the most common attacks.&lt;/p&gt;

&lt;p&gt;WAF on Application Gateway is based on the Core Rule Set (CRS) from the Open Web Application Security Project (OWASP).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Before being able to enable and benefit from WAF capabilities, you will need to check the &lt;strong&gt;SKU&lt;/strong&gt; of the Application Gateway you have. WAF can only be enabled on the &lt;strong&gt;WAF_v2&lt;/strong&gt; SKU and not the Standard SKU. As this was my case as well, I first had to change the SKU of the Application Gateway. This can be done either from the Azure Portal or using Terraform (or any other IaC tool; in my case, I used Terraform). &lt;/p&gt;
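&lt;p&gt;For reference, in Terraform the SKU is set in the &lt;code&gt;sku&lt;/code&gt; block of the &lt;code&gt;azurerm_application_gateway&lt;/code&gt; resource; a minimal sketch (the capacity value is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "azurerm_application_gateway" "application_gateway" {
  (...)
  sku {
    name     = "WAF_v2"
    tier     = "WAF_v2"
    capacity = 2
  }
}
&lt;/code&gt;&lt;/pre&gt;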

&lt;p&gt;After this, you can proceed with configuring WAF. This can be done in two ways: either using the built-in &lt;strong&gt;WAF Config section&lt;/strong&gt; of the Application Gateway or creating a &lt;strong&gt;WAF policy&lt;/strong&gt; for the Azure Application Gateway.&lt;/p&gt;

&lt;p&gt;Let's look at what each one is and how you can use them.&lt;/p&gt;

&lt;h2&gt;
  
  
  WAF config
&lt;/h2&gt;

&lt;p&gt;The WAF config section is a built-in part of the Application Gateway configuration as can be seen in the image below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvswb0svgpid5wxry3lyh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvswb0svgpid5wxry3lyh.png" alt="Waf Config" width="800" height="282"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;WAF Config is the Application Gateway's built-in method to configure WAF and it is the section where you can add your configurations such as exclusions or custom rules.&lt;/p&gt;

&lt;p&gt;When using Terraform you can find the &lt;code&gt;waf_configuration&lt;/code&gt; block under the &lt;code&gt;azurerm_application_gateway&lt;/code&gt; &lt;a href="https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/application_gateway#waf_configuration" rel="noopener noreferrer"&gt;resource&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's look at an example of how you can configure it using Terraform. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Scenario:&lt;/em&gt; I would like to configure it to use the OWASP 3.2 rules, enable the WAF, and exclude some of our telemetry requests from being blocked, while also disabling some rules. This is how the basic configuration would look:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_application_gateway"&lt;/span&gt; &lt;span class="s2"&gt;"application_gateway"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="p"&gt;(...)&lt;/span&gt;
  &lt;span class="s2"&gt;"waf_configuration"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enabled&lt;/span&gt;                  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;firewall_mode&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Prevention"&lt;/span&gt;
    &lt;span class="nx"&gt;rule_set_type&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OWASP"&lt;/span&gt;
    &lt;span class="nx"&gt;rule_set_version&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"3.2"&lt;/span&gt;
    &lt;span class="nx"&gt;file_upload_limit_mb&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="nx"&gt;max_request_body_size_kb&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
    &lt;span class="nx"&gt;request_body_check&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;

    &lt;span class="nx"&gt;exclusion&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;match_variable&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RequestCookieNames"&lt;/span&gt;
        &lt;span class="nx"&gt;selector&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"telemetry"&lt;/span&gt;
        &lt;span class="nx"&gt;selector_match_operator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Contains"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;disabled_rule_group&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;rule_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REQUEST-920-PROTOCOL-ENFORCEMENT"&lt;/span&gt;
    &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
      &lt;span class="mi"&gt;920230&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="mi"&gt;920320&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;disabled_rule_group&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;rule_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REQUEST-921-PROTOCOL-ATTACK"&lt;/span&gt;
      &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="mi"&gt;921180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="mi"&gt;921170&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The scenario was simple and while the configuration itself is not hard to do, there are a few &lt;strong&gt;drawbacks&lt;/strong&gt; to using it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;it does not allow you to add &lt;em&gt;custom rules&lt;/em&gt; from the Azure Portal UI. This means that if you want to add a custom rule, you will have to do it using the Azure CLI (or PowerShell). Ideally, I would like to have all my configurations in one place and not have to use multiple tools to configure or maintain my resources.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;if you have multiple Application Gateways, you will have to configure each one of them separately. Because WAF Config is built into the Application Gateway, it is &lt;em&gt;managed locally to that specific Application Gateway&lt;/em&gt;, while its configuration applies to everything in that Application Gateway resource. This was my case as well, since I don't manage just one Application Gateway.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;if you are working with Azure Front Door, you cannot use WAF Config in that context, because &lt;em&gt;Azure Front Door does not support WAF Config&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  WAF policy
&lt;/h2&gt;

&lt;p&gt;As opposed to WAF Config, which is a built-in functionality in the Application Gateway, WAF policies are a &lt;strong&gt;standalone resource&lt;/strong&gt; that enables you to configure WAF. This means that you can create a WAF policy and then apply it to multiple Application Gateways or even Azure Front Door resources as well.&lt;/p&gt;

&lt;p&gt;WAF policy allows you to have a &lt;strong&gt;centralized configuration&lt;/strong&gt; for all your WAF resources. This means that you can have the same configuration for all your WAF resources and you can also have a &lt;strong&gt;single place&lt;/strong&gt; where you can manage your WAF configuration.&lt;/p&gt;

&lt;p&gt;Because it is a standalone resource, the first benefit is that you can find all the necessary configuration in the Azure Portal UI:&lt;/p&gt;

&lt;p&gt;You have your Managed rules:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtgbv10be5rlzddyptij.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbtgbv10be5rlzddyptij.png" alt="Alt text" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Your Custom rules:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fgwemmpv3iawgpzrdyk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fgwemmpv3iawgpzrdyk.png" alt="Alt text" width="800" height="477"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And your associated gateways:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu757091jzw7t33w1hias.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu757091jzw7t33w1hias.png" alt="Alt text" width="800" height="340"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can already guess from the screenshots, WAF Policy gives you a bit more control over your configuration as you can be more detailed in what you want to exclude or include in your rules.&lt;/p&gt;

&lt;p&gt;You have the flexibility to link a WAF (Web Application Firewall) policy in various ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;globally&lt;/strong&gt; by assigning it to an Azure Application Gateway resource&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;per-site&lt;/strong&gt; level by linking it to a listener&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;per URI&lt;/strong&gt; level by associating it with a particular route path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details (and examples) on how you can link a WAF policy to your resources, you can check the official documentation &lt;a href="https://learn.microsoft.com/en-us/azure/web-application-firewall/ag/policy-overview" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In Terraform this means you will need to create a new resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_web_application_firewall_policy"&lt;/span&gt; &lt;span class="s2"&gt;"waf_policy"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"wafpolicy"&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_resource_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_resource_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;

  &lt;span class="nx"&gt;policy_settings&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;enabled&lt;/span&gt;                     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="nx"&gt;mode&lt;/span&gt;                        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Prevention"&lt;/span&gt;
    &lt;span class="nx"&gt;request_body_check&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
    &lt;span class="nx"&gt;file_upload_limit_in_mb&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="nx"&gt;max_request_body_size_in_kb&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;## Example of managed rules&lt;/span&gt;
  &lt;span class="nx"&gt;managed_rules&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;managed_rule_set&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;type&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"OWASP"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"3.2"&lt;/span&gt;
      &lt;span class="nx"&gt;rule_group_override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;rule_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REQUEST-920-PROTOCOL-ENFORCEMENT"&lt;/span&gt;
        &lt;span class="nx"&gt;disabled_rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="mi"&gt;920200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="mi"&gt;920201&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="mi"&gt;920202&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;rule_group_override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;rule_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REQUEST-921-PROTOCOL-ATTACK"&lt;/span&gt;
        &lt;span class="nx"&gt;disabled_rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="mi"&gt;921170&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
          &lt;span class="mi"&gt;921180&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;rule_group_override&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;rule_group_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"REQUEST-942-APPLICATION-ATTACK-SQLI"&lt;/span&gt;
        &lt;span class="nx"&gt;disabled_rules&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="mi"&gt;942430&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;## Example of custom rules&lt;/span&gt;
  &lt;span class="nx"&gt;custom_rules&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"ExcludeServicesFromWAF"&lt;/span&gt;
    &lt;span class="nx"&gt;priority&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;14&lt;/span&gt;
    &lt;span class="nx"&gt;rule_type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MatchRule"&lt;/span&gt;

    &lt;span class="nx"&gt;match_conditions&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;match_variables&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;variable_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"RequestUri"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;

      &lt;span class="nx"&gt;operator&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Contains"&lt;/span&gt;
      &lt;span class="nx"&gt;negation_condition&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="nx"&gt;match_values&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="s2"&gt;"/service1/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="p"&gt;(...)&lt;/span&gt;
      &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;action&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Allow"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After creating the WAF Policy, you will need to associate it with the Application Gateway, which is done by adding the following parameters to the &lt;code&gt;azurerm_application_gateway&lt;/code&gt; resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;  &lt;span class="nx"&gt;firewall_policy_id&lt;/span&gt;                &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_web_application_firewall_policy&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;wafpolicy&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;force_firewall_policy_association&lt;/span&gt; &lt;span class="err"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first thing you will notice in the Azure Portal is that your Application Gateway resource no longer has the WAF Config section; instead, it links to the WAF Policy you just created:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flti0u6bnjow052xzf8xu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flti0u6bnjow052xzf8xu.png" alt="Alt text" width="575" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that any change to your WAF configuration now has to be made in the WAF Policy resource itself, not in the Application Gateway resource.&lt;br&gt;
This gave me the granularity I needed to exclude the requests I wanted, while keeping the same configuration for all my Application Gateways.&lt;/p&gt;

&lt;p&gt;In my case, WAF Config was not the right answer for what I needed: the same exclusions on all our gateways, the same custom rules regardless of environment, and the ability to exclude the requests coming from our services.&lt;br&gt;
This is why I decided to look into WAF Policies instead.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;WAF Config is a good option if you want to configure WAF settings at the Application Gateway level that apply to all the listeners and rules within it. It is well suited to the case where you have a single set of WAF settings to apply to all the web applications behind the Application Gateway.&lt;/p&gt;

&lt;p&gt;WAF Policy, on the other hand, is a good choice when you need more granular control over your WAF settings and want to define custom settings and rules on a per-application or per-path basis. One use case is having several applications behind the Application Gateway with different security concerns, each requiring different WAF settings.&lt;/p&gt;

&lt;p&gt;I did not dive into all the rules and settings you can configure for WAF; that will be the topic of a separate, more in-depth article. I hope this post helps you decide which option is best for you.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for reading and hope this helps somebody else!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>azure</category>
      <category>waf</category>
    </item>
    <item>
      <title>Ensuring Seamless Operations: Troubleshooting and Resolving Dapr Certificate Expiry</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Thu, 20 Jul 2023 10:50:49 +0000</pubDate>
      <link>https://dev.to/the_cozma/ensuring-seamless-operations-troubleshooting-and-resolving-dapr-certificate-expiry-4a74</link>
      <guid>https://dev.to/the_cozma/ensuring-seamless-operations-troubleshooting-and-resolving-dapr-certificate-expiry-4a74</guid>
      <description>&lt;p&gt;A CNCF project, the &lt;a href="https://dapr.io/" rel="noopener noreferrer"&gt;Distributed Application Runtime (Dapr)&lt;/a&gt; provides APIs that simplify microservice connectivity. Whether your communication pattern is service to service invocation or pub/sub messaging, Dapr helps you write resilient and secured microservices. Essentially, it provides a new way to build microservices by using the reusable blocks implemented as  sidecars.&lt;/p&gt;

&lt;p&gt;While Dapr is great because it is language agnostic and solves some of the challenges that come with microservices and distributed systems, such as message broker integration and encryption, troubleshooting Dapr issues can be challenging. Dapr logs, especially the error messages, can be quite generic and sometimes do not provide enough information to understand what is going on.&lt;/p&gt;

&lt;p&gt;In this blog post, I want to detail a problem I had with Dapr certificate expiration: the symptoms the application was showing, how I tracked down the root cause, and how I solved it.&lt;/p&gt;

&lt;p&gt;I also want to highlight how important it is to have proper monitoring in place, so I will touch on that as well by sharing some lessons learned and what I ended up setting up to avoid repeating the same mistakes in the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Symptoms
&lt;/h2&gt;

&lt;p&gt;Application deployment was failing because the Dapr sidecar could not be injected. The pod kept restarting until it reached the 5-minute default timeout and was rolled back. Checking the events on the pod, I noticed the &lt;code&gt;GET /healthz&lt;/code&gt; endpoints for the liveness and readiness probes were returning &lt;code&gt;connect: connection refused&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;There were no errors in the app logs or in the Dapr sidecar logs. The only thing I noticed was that the Dapr sidecar was in a &lt;code&gt;CrashLoopBackOff&lt;/code&gt; state.&lt;/p&gt;

&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Step 1&lt;/em&gt;: &lt;strong&gt;Dapr Operator&lt;/strong&gt; logs
&lt;/h3&gt;

&lt;p&gt;Since no logs were available on the pod or the Dapr sidecar, I started by checking the logs of the next best thing, the &lt;strong&gt;Dapr Operator&lt;/strong&gt;, and noticed the following errors:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"instance"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr-operator-0000000000-abcd"&lt;/span&gt;,&lt;span class="s2"&gt;"level"&lt;/span&gt;:&lt;span class="s2"&gt;"info"&lt;/span&gt;,&lt;span class="s2"&gt;"msg"&lt;/span&gt;:&lt;span class="s2"&gt;"starting webhooks"&lt;/span&gt;,&lt;span class="s2"&gt;"scope"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr.operator"&lt;/span&gt;,&lt;span class="s2"&gt;"time"&lt;/span&gt;:&lt;span class="s2"&gt;"2023-05-25T12:51:13.267369255Z"&lt;/span&gt;,&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"log"&lt;/span&gt;,&lt;span class="s2"&gt;"ver"&lt;/span&gt;:&lt;span class="s2"&gt;"1.10.4"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
I0525 12:51:13.269285       1 leaderelection.go:248] attempting to acquire leader lease dapr-system/webhooks.dapr.io...
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"instance"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr-operator-0000000000-abcd"&lt;/span&gt;,&lt;span class="s2"&gt;"level"&lt;/span&gt;:&lt;span class="s2"&gt;"info"&lt;/span&gt;,&lt;span class="s2"&gt;"msg"&lt;/span&gt;:&lt;span class="s2"&gt;"Conversion webhook for &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;subscriptions.dapr.io&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; is up to date"&lt;/span&gt;,&lt;span class="s2"&gt;"scope"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr.operator"&lt;/span&gt;,&lt;span class="s2"&gt;"time"&lt;/span&gt;:&lt;span class="s2"&gt;"2023-05-25T12:51:13.277615379Z"&lt;/span&gt;,&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"log"&lt;/span&gt;,&lt;span class="s2"&gt;"ver"&lt;/span&gt;:&lt;span class="s2"&gt;"1.10.4"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
W0601 02:52:46.530879       1 reflector.go:347] pkg/mod/k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of &lt;span class="k"&gt;*&lt;/span&gt;v1.Secret ended with: an error on the server &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"unable to decode an event from the watch stream: http2: client connection lost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; has prevented the request from succeeding
W0601 02:52:46.531001       1 reflector.go:347] pkg/mod/k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of &lt;span class="k"&gt;*&lt;/span&gt;v1alpha1.Configuration ended with: an error on the server &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"unable to decode an event from the watch stream: http2: client connection lost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; has prevented the request from succeeding
W0601 02:52:46.531061       1 reflector.go:347] pkg/mod/k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of &lt;span class="k"&gt;*&lt;/span&gt;v1.Service ended with: an error on the server &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"unable to decode an event from the watch stream: http2: client connection lost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; has prevented the request from succeeding
E0601 02:52:46.531050       1 leaderelection.go:330] error retrieving resource lock dapr-system/operator.dapr.io: Get &lt;span class="s2"&gt;"https://X.X.X.X:443/apis/coordination.k8s.io/v1/namespaces/dapr-system/leases/operator.dapr.io"&lt;/span&gt;: http2: client connection lost
W0601 02:52:46.531095       1 reflector.go:347] pkg/mod/k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of &lt;span class="k"&gt;*&lt;/span&gt;v1alpha1.Resiliency ended with: an error on the server &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"unable to decode an event from the watch stream: http2: client connection lost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; has prevented the request from succeeding
W0601 02:52:46.530891       1 reflector.go:347] pkg/mod/k8s.io/client-go@v0.26.1/tools/cache/reflector.go:169: watch of &lt;span class="k"&gt;*&lt;/span&gt;v1alpha1.Component ended with: an error on the server &lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"unable to decode an event from the watch stream: http2: client connection lost"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; has prevented the request from succeeding
E0601 02:52:46.531191       1 leaderelection.go:330] error retrieving resource lock dapr-system/webhooks.dapr.io: Get &lt;span class="s2"&gt;"https://X.X.X.X:443/apis/coordination.k8s.io/v1/namespaces/dapr-system/leases/webhooks.dapr.io"&lt;/span&gt;: http2: client connection lost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Dapr operator works by establishing an admission webhook, which enables Kubernetes (K8s) to interact with it when it intends to deploy a new pod. After a successful response, the daprd container is added to the pod. For more detailed information on how the operator works, check the &lt;a href="https://docs.dapr.io/concepts/dapr-services/operator/" rel="noopener noreferrer"&gt;Dapr Operator control plane service overview documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Step 2&lt;/em&gt;: Investigate the &lt;code&gt;http2: client connection lost&lt;/code&gt; error
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;http2: client connection lost&lt;/code&gt; error indicated to me that K8s could not successfully invoke the admission webhook, so I started to check one by one:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Network connectivity&lt;/strong&gt;: The error message mentioned a potential issue with the client connection being lost, so I verified that the machine running the Dapr process could establish a stable connection to the Kubernetes API server. I checked for any network connectivity issues or firewalls that might be interfering with the communication. Everything was fine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;API server issues&lt;/strong&gt;: I also checked for any issues with the Kubernetes API server itself, such as high load, resource constraints, or misconfiguration. No issues found.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Namespace or resource deletion&lt;/strong&gt;: I checked that nothing had been deleted in the dapr-system namespace and that the webhooks.dapr.io resource was still present. Everything was still there.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;em&gt;Step 3&lt;/em&gt;: &lt;strong&gt;AKS cluster&lt;/strong&gt; logs
&lt;/h3&gt;

&lt;p&gt;As a next step, I started looking into the &lt;strong&gt;AKS cluster logs&lt;/strong&gt; and noticed that all the services using Dapr reported the same &lt;code&gt;authentication handshake failed&lt;/code&gt; error. The full log is below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"app_id"&lt;/span&gt;:&lt;span class="s2"&gt;"app1"&lt;/span&gt;,&lt;span class="s2"&gt;"instance"&lt;/span&gt;:&lt;span class="s2"&gt;"app1-123456-abc7"&lt;/span&gt;,&lt;span class="s2"&gt;"level"&lt;/span&gt;:&lt;span class="s2"&gt;"info"&lt;/span&gt;,&lt;span class="s2"&gt;"msg"&lt;/span&gt;:&lt;span class="s2"&gt;"sending workload csr request to sentry"&lt;/span&gt;,&lt;span class="s2"&gt;"scope"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr.runtime.grpc.internal"&lt;/span&gt;,&lt;span class="s2"&gt;"time"&lt;/span&gt;:&lt;span class="s2"&gt;"2023-06-19T13:19:53.535345802Z"&lt;/span&gt;,&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"log"&lt;/span&gt;,&lt;span class="s2"&gt;"ver"&lt;/span&gt;:&lt;span class="s2"&gt;"1.10.4"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
2023-06-19 15:19:53.535 
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"app_id"&lt;/span&gt;:&lt;span class="s2"&gt;"app1"&lt;/span&gt;,&lt;span class="s2"&gt;"instance"&lt;/span&gt;:&lt;span class="s2"&gt;"app1-123456-abc7"&lt;/span&gt;,&lt;span class="s2"&gt;"level"&lt;/span&gt;:&lt;span class="s2"&gt;"info"&lt;/span&gt;,&lt;span class="s2"&gt;"msg"&lt;/span&gt;:&lt;span class="s2"&gt;"renewing certificate: requesting new cert and restarting gRPC server"&lt;/span&gt;,&lt;span class="s2"&gt;"scope"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr.runtime.grpc.internal"&lt;/span&gt;,&lt;span class="s2"&gt;"time"&lt;/span&gt;:&lt;span class="s2"&gt;"2023-06-19T13:19:53.535329702Z"&lt;/span&gt;,&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"log"&lt;/span&gt;,&lt;span class="s2"&gt;"ver"&lt;/span&gt;:&lt;span class="s2"&gt;"1.10.4"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
2023-06-19 15:19:53.535 
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"app_id"&lt;/span&gt;:&lt;span class="s2"&gt;"app1"&lt;/span&gt;,&lt;span class="s2"&gt;"instance"&lt;/span&gt;:&lt;span class="s2"&gt;"app1-123456-abc7"&lt;/span&gt;,&lt;span class="s2"&gt;"level"&lt;/span&gt;:&lt;span class="s2"&gt;"error"&lt;/span&gt;,&lt;span class="s2"&gt;"msg"&lt;/span&gt;:&lt;span class="s2"&gt;"error starting server: error from authenticator CreateSignedWorkloadCert: error from sentry SignCertificate: rpc error: code = Unavailable desc = connection error: desc = &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate has expired or is not yet valid: current time 2023-06-19T13:19:51Z is after 2023-06-16T12:31:17Z&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;,&lt;span class="s2"&gt;"scope"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr.runtime.grpc.internal"&lt;/span&gt;,&lt;span class="s2"&gt;"time"&lt;/span&gt;:&lt;span class="s2"&gt;"2023-06-19T13:19:53.535259601Z"&lt;/span&gt;,&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"log"&lt;/span&gt;,&lt;span class="s2"&gt;"ver"&lt;/span&gt;:&lt;span class="s2"&gt;"1.10.4"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The errors above confirmed that a connection could not be established because authentication failed during the TLS handshake: the certificate had expired.&lt;/p&gt;
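&lt;p&gt;As a sanity check, you can reproduce the comparison from the log message itself. A minimal sketch, using the two timestamps from the handshake error above:&lt;/p&gt;

```python
from datetime import datetime, timezone

def cert_expired(current: str, not_after: str) -> bool:
    """Return True if `current` is past the certificate's notAfter timestamp."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    now = datetime.strptime(current, fmt).replace(tzinfo=timezone.utc)
    expiry = datetime.strptime(not_after, fmt).replace(tzinfo=timezone.utc)
    return now > expiry

# Timestamps taken from the handshake error above:
print(cert_expired("2023-06-19T13:19:51Z", "2023-06-16T12:31:17Z"))  # True
```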

&lt;h3&gt;
  
  
  &lt;em&gt;Step 4&lt;/em&gt;: &lt;em&gt;Dapr Sentry&lt;/em&gt; logs
&lt;/h3&gt;

&lt;p&gt;To dig deeper, I researched how Dapr handles mTLS, which pointed me to the Dapr Sentry service.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Dapr Sentry service manages mTLS between services and acts as a certificate authority. It generates mTLS certificates and distributes them to any running sidecars. This allows sidecars to communicate with encrypted, mTLS traffic.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So I went to check the &lt;em&gt;Dapr Sentry&lt;/em&gt; logs and I finally found the issue: &lt;strong&gt;Dapr root certificate expired&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;2023-06-19 14:49:06.566 
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"instance"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr-sentry-123456-abc7"&lt;/span&gt;,&lt;span class="s2"&gt;"level"&lt;/span&gt;:&lt;span class="s2"&gt;"warning"&lt;/span&gt;,&lt;span class="s2"&gt;"msg"&lt;/span&gt;:&lt;span class="s2"&gt;"Dapr root certificate expiration warning: certificate has expired."&lt;/span&gt;,&lt;span class="s2"&gt;"scope"&lt;/span&gt;:&lt;span class="s2"&gt;"dapr.sentry"&lt;/span&gt;,&lt;span class="s2"&gt;"time"&lt;/span&gt;:&lt;span class="s2"&gt;"2023-06-19T12:49:06.566339341Z"&lt;/span&gt;,&lt;span class="s2"&gt;"type"&lt;/span&gt;:&lt;span class="s2"&gt;"log"&lt;/span&gt;,&lt;span class="s2"&gt;"ver"&lt;/span&gt;:&lt;span class="s2"&gt;"1.10.4"&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To view the logs of the Dapr Sentry service, you can run the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl logs &lt;span class="nt"&gt;--selector&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;app&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;dapr-sentry &lt;span class="nt"&gt;--namespace&lt;/span&gt; &amp;lt;NAMESPACE&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating a new root certificate
&lt;/h2&gt;

&lt;p&gt;By default, the certificate expires after 365 days. You can change the expiration time by setting the &lt;code&gt;--cert-chain-expiration&lt;/code&gt; flag when you start the Dapr Sentry service. The value is in days.&lt;/p&gt;

&lt;p&gt;Dapr encrypts communication between applications using a self-signed root certificate valid for one year, so it was time to renew it.&lt;/p&gt;
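&lt;p&gt;Before renewing, you can check the root certificate's expiry date yourself with &lt;code&gt;openssl&lt;/code&gt;. A minimal sketch, run here against a throwaway self-signed certificate for illustration (in a real cluster you would first extract the root &lt;code&gt;ca.crt&lt;/code&gt; from the secret Dapr keeps in the dapr-system namespace):&lt;/p&gt;

```shell
# For illustration only: generate a throwaway self-signed CA valid for 365 days.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=cluster.local" \
  -days 365 -keyout ca.key -out ca.crt 2>/dev/null

# Print the certificate's expiry date.
openssl x509 -in ca.crt -noout -enddate

# checkend exits 0 if the certificate is still valid N seconds from now.
if openssl x509 -in ca.crt -noout -checkend $((30 * 24 * 3600)); then
  echo "valid for at least 30 more days"
else
  echo "expires within 30 days"
fi
```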

&lt;p&gt;To renew the certificate, I followed the recommended steps to root and issuer certificate upgrade using CLI. You can find the steps &lt;a href="https://docs.dapr.io/operations/security/mtls/#root-and-issuer-certificate-upgrade-using-cli-recommended" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generated brand-new root and issuer certificates, signed by a newly generated private root key, by running the following command:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;dapr mtls renew-certificate &lt;span class="nt"&gt;-k&lt;/span&gt; &lt;span class="nt"&gt;--valid-until&lt;/span&gt; &amp;lt;days&amp;gt; &lt;span class="nt"&gt;--restart&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;⌛  Starting certificate rotation
ℹ️  generating fresh certificates
ℹ️  Updating certifcates &lt;span class="k"&gt;in &lt;/span&gt;your Kubernetes cluster
ℹ️  Dapr control plane version 1.10.4 detected &lt;span class="k"&gt;in &lt;/span&gt;namespace dapr-system
✅  Certificate rotation is successful! Your new certicate is valid through Wed, 18 Jun 2025 13:37:30 UTC
ℹ️  Restarting deploy/dapr-sentry..
ℹ️  Restarting deploy/dapr-operator..
ℹ️  Restarting statefulsets/dapr-placement-server..
✅  All control plane services have restarted successfully!
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Restarted one of the applications in Kubernetes to see if the changes worked. And they did!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Redeployed all applications that were using Dapr via our normal Github Actions pipelines.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There was no downtime and the process was quite smooth. Dapr does not renew certificates automatically, so depending on your setup you will need to renew them manually or build an intermediary service that does it for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Lesson learned no 1: Have an overview of your Dapr services
&lt;/h3&gt;

&lt;p&gt;I had no overview of the Dapr system, which cost me a lot of time in getting to the root cause. So the first thing I did was create a &lt;strong&gt;dashboard&lt;/strong&gt; giving an overview of our Dapr services and their certificates. I started from the official one from &lt;a href="https://github.com/dapr/dapr/tree/master/grafana" rel="noopener noreferrer"&gt;Grafana&lt;/a&gt;, but it is a bit outdated and I ran into some issues with the queries, so I made some changes; you can find the JSON of the dashboard below if it helps anyone.&lt;/p&gt;

&lt;p&gt;For the &lt;a href="https://ana-cozma.github.io/blog/posts/dapr-certificate-renewal/" rel="noopener noreferrer"&gt;full Grafana Dashboard JSON&lt;/a&gt;, go to Next steps &amp;gt; Lesson learned no 1: Have an overview of your Dapr services and click on Expand report.&lt;/p&gt;

&lt;p&gt;I added some variables for the Prometheus datasource name and the cluster name. You can change the refresh rate of the dashboard and the time range.&lt;/p&gt;

&lt;p&gt;The output is a dashboard that looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs6sq2hqz250dvlfctmt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs6sq2hqz250dvlfctmt.png" alt="Dapr Dashboard" width="800" height="366"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each section of the dashboard has a very nice info box that will tell you what the section shows and how to interpret the data.&lt;/p&gt;

&lt;p&gt;Ideally, a good practice is not to overload yourself by creating tons of dashboards that you will not look at, will not maintain, or will simply forget about.&lt;/p&gt;

&lt;p&gt;In this case, though, it is quite useful to have one: in the event of an incident, it will save you hours of troubleshooting and give you a good overview of the system and what is failing. If you look at the panels of the dashboard, you'll see that it tracks CSR failures, server TLS certificate issuance failures, and more.&lt;/p&gt;

&lt;h3&gt;
  
  
  Lesson learned no 2: Make sure you are aware before expiration
&lt;/h3&gt;

&lt;p&gt;Beginning &lt;strong&gt;30 days&lt;/strong&gt; prior to mTLS root certificate expiration, the Dapr Sentry service emits hourly warning-level logs indicating that the root certificate is about to expire. You can use these logs to set up alerts that notify you before the certificate expires.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Dapr root certificate expiration warning: certificate expires in 2 days and 15 hours"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
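&lt;p&gt;Since these warnings encode the remaining lifetime in prose, a small parser can turn them into a number you can alert on. A sketch, assuming the message format shown in the warning above:&lt;/p&gt;

```python
import re

def hours_remaining(msg: str):
    """Extract the remaining certificate lifetime, in hours, from a sentry warning."""
    m = re.search(r"expires in (\d+) days? and (\d+) hours?", msg)
    if m is None:
        return None
    days, hours = (int(g) for g in m.groups())
    return days * 24 + hours

warning = "Dapr root certificate expiration warning: certificate expires in 2 days and 15 hours"
print(hours_remaining(warning))  # 63
```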



&lt;p&gt;The first thing is to &lt;strong&gt;configure a Loki data source&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I already had this done, and setting it up might be the subject of another blog post. In a nutshell, Loki is a log aggregation system that integrates with Grafana and lets you ingest and query log data, so I just made sure my Loki data source was configured correctly.&lt;/p&gt;

&lt;p&gt;Next, I &lt;strong&gt;created a log query&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the Explore view of Grafana, with the Loki data source selected, I wrote a log query that retrieves the logs I want to use for the alert. The query you build might differ, but it should match the logs produced by the &lt;code&gt;kubectl logs&lt;/code&gt; command for dapr-sentry.&lt;/p&gt;

&lt;p&gt;For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{cluster="$cluster", namespace="dapr-system"} |= `Dapr root certificate expiration warning: certificate expires in`
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Adjust the query based on the specific log lines or patterns you want to target. I wanted all the logs with the certificate expiration warning, starting from the 30-day mark, but you can edit the query so it only matches x days before the expiration.&lt;/p&gt;
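&lt;p&gt;For instance, you can add a &lt;code&gt;regexp&lt;/code&gt; parser stage to pull the number of days out of the message as a label; a sketch (the stage names and label are an assumption about your Loki version, so adjust to your setup):&lt;/p&gt;

```logql
{cluster="$cluster", namespace="dapr-system"}
  |= `Dapr root certificate expiration warning`
  | regexp `expires in (?P<days>\d+) day`
```

&lt;p&gt;With the &lt;code&gt;days&lt;/code&gt; label extracted, a label-filter stage comparing it against your threshold lets the query match only once you are close enough to expiry.&lt;/p&gt;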

&lt;p&gt;A good rule of thumb is to &lt;strong&gt;test the log query&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;After executing the query, you should see the warnings in the log entries. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If no logs are returned, check that the query is correct, that the data source is set up correctly, and that the logs are being ingested by Loki.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since all was good in my case, I proceeded to &lt;strong&gt;add this query from the Explore page to my previously created dashboard&lt;/strong&gt; so I can see the logs there as well. I created a new panel with the logs and a description of what they mean for anyone reading it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7lyov367k9bi8lpb40h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb7lyov367k9bi8lpb40h.png" alt="Alt text" width="800" height="119"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And lastly, I &lt;strong&gt;created an alert rule&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In the Alerting section of Grafana, I went to "Create Rule" to define an alert rule based on the previous query, and I defined the conditions that trigger the alert from the log query results. For example, you can set a condition like "Count() is above 0" to trigger the alert when at least one log entry matches the query, or customize it based on your needs.&lt;/p&gt;

&lt;p&gt;The implementation of the alert will differ based on the tooling you use and the channel you want to be alerted on (Slack, email, etc.).&lt;/p&gt;

&lt;p&gt;I hope this gave you some insight into how you can troubleshoot and monitor Dapr in your environments.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for reading! And let me know if you have any questions or feedback.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>dapr</category>
      <category>kubernetes</category>
      <category>aks</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Troubleshooting and Resolving a Pod Stuck in 'CreateContainerConfigError' in Kubernetes</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Mon, 30 Jan 2023 21:58:25 +0000</pubDate>
      <link>https://dev.to/the_cozma/troubleshooting-and-resolving-a-pod-stuck-in-createcontainerconfigerror-in-kubernetes-56ij</link>
      <guid>https://dev.to/the_cozma/troubleshooting-and-resolving-a-pod-stuck-in-createcontainerconfigerror-in-kubernetes-56ij</guid>
      <description>&lt;p&gt;The other day I was making changes to my helm charts and, after deploying my application, I noticed that one of my pods was stuck in a &lt;code&gt;CreateContainerConfigError&lt;/code&gt; state. This is a pretty tricky error because it doesn't give you any details on what the underlying issue could be.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the CreateContainerConfigError?
&lt;/h2&gt;

&lt;p&gt;To understand this, let's look at what happens at deployment time to give you an idea of the flow and what could go wrong at each step. &lt;/p&gt;

&lt;p&gt;When you deploy a pod, the first step is to pull the image from the registry and then create the container. If the &lt;strong&gt;image is not found&lt;/strong&gt;, then Kubernetes will return an &lt;strong&gt;ErrImagePull&lt;/strong&gt; error. If the image is found, then it will proceed to create the container. &lt;/p&gt;

&lt;p&gt;If the &lt;strong&gt;container creation fails&lt;/strong&gt;, it will return a &lt;strong&gt;CreateContainerError&lt;/strong&gt;. If the container creation succeeds, Kubernetes starts the container.&lt;/p&gt;

&lt;p&gt;If Kubernetes &lt;strong&gt;cannot generate the container's configuration&lt;/strong&gt; (for example, because a referenced ConfigMap or Secret is missing), it will return a &lt;strong&gt;CreateContainerConfigError&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In other words, the error happens while the container is transitioning from a Pending to a Running state. It is at this point that the deployment configuration is validated to make sure the container can be started; if the configuration is invalid, Kubernetes returns a &lt;strong&gt;CreateContainerConfigError&lt;/strong&gt; error.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Troubleshoot the CreateContainerConfigError
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Disclaimer: There can be many reasons why the container configuration is invalid, and it will depend on your specific configuration. I will only cover the one I encountered. If you have run into a different cause, please leave a comment below.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because the error happens during the validation of the configuration, a good starting point is to double-check the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;is the ConfigMap missing? Is it properly configured?&lt;/li&gt;
&lt;li&gt;is a Secret missing? Is it properly configured?&lt;/li&gt;
&lt;li&gt;is the PersistentVolume missing? Is it properly configured?&lt;/li&gt;
&lt;li&gt;is the Pod being created correctly? Are there any empty or invalid fields?&lt;/li&gt;
&lt;/ul&gt;
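&lt;p&gt;The cross-check behind that list can be sketched in a few lines: given a pod spec and the names of the ConfigMaps and Secrets that actually exist in the namespace, list every dangling reference. The pod spec below is hypothetical data for illustration:&lt;/p&gt;

```python
def missing_references(pod_spec, configmaps, secrets):
    """Return (kind, name) pairs the pod references but that do not exist."""
    missing = []
    for container in pod_spec.get("containers", []):
        # Env vars sourced from individual ConfigMap/Secret keys.
        for env in container.get("env", []):
            ref = env.get("valueFrom", {})
            cm = ref.get("configMapKeyRef", {}).get("name")
            if cm and cm not in configmaps:
                missing.append(("ConfigMap", cm))
            sec = ref.get("secretKeyRef", {}).get("name")
            if sec and sec not in secrets:
                missing.append(("Secret", sec))
        # Whole-object envFrom references.
        for src in container.get("envFrom", []):
            cm = src.get("configMapRef", {}).get("name")
            if cm and cm not in configmaps:
                missing.append(("ConfigMap", cm))
            sec = src.get("secretRef", {}).get("name")
            if sec and sec not in secrets:
                missing.append(("Secret", sec))
    return missing

# Hypothetical pod spec referencing one Secret and one ConfigMap:
spec = {"containers": [{"name": "app",
                        "env": [{"name": "DB_PASS",
                                 "valueFrom": {"secretKeyRef": {"name": "db-creds", "key": "pass"}}}],
                        "envFrom": [{"configMapRef": {"name": "app-config"}}]}]}
print(missing_references(spec, configmaps=set(), secrets={"db-creds"}))
# [('ConfigMap', 'app-config')]
```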

&lt;p&gt;Now that we understand what the error is, and what we should be looking at, let's look at how to troubleshoot it and narrow down the problem.&lt;/p&gt;

&lt;h3&gt;
  
  
  Check the Pod Status
&lt;/h3&gt;

&lt;p&gt;The first thing I did was identify the pod with the error so I could drill into it.&lt;/p&gt;

&lt;p&gt;You can do this by running &lt;code&gt;kubectl get pods -n &amp;lt;namespace&amp;gt;&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; my-service                                                                   
NAME                                READY   STATUS                       RESTARTS       AGE
my-service-00000000078c9fff-dssbk   0/2     CreateContainerConfigError   1 &lt;span class="o"&gt;(&lt;/span&gt;10s ago&lt;span class="o"&gt;)&lt;/span&gt;    28s
my-service-00000000bcddf7d-xfsmk    2/2     Running                      25 &lt;span class="o"&gt;(&lt;/span&gt;42h ago&lt;span class="o"&gt;)&lt;/span&gt;   16d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
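&lt;p&gt;When the namespace is busy, it can help to filter the STATUS column down to just the failing pods. Here is a small sketch using awk; a saved sample of the output above stands in for a live cluster, but on a real one you could pipe &lt;code&gt;kubectl get pods -n &amp;lt;namespace&amp;gt; --no-headers&lt;/code&gt; into the same filter:&lt;/p&gt;

```shell
# Saved `kubectl get pods` output stands in for a live cluster here.
pods='my-service-00000000078c9fff-dssbk   0/2   CreateContainerConfigError   1   28s
my-service-00000000bcddf7d-xfsmk    2/2   Running                      25  16d'

# Keep only the names of pods whose STATUS column (field 3) shows the error.
failing=$(printf '%s\n' "$pods" | awk '$3 == "CreateContainerConfigError" {print $1}')
echo "$failing"
```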



&lt;h3&gt;
  
  
  Check the Events
&lt;/h3&gt;

&lt;p&gt;Next, we want to see all the events on the pod.&lt;/p&gt;

&lt;p&gt;You can do this by running &lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; -n &amp;lt;namespace&amp;gt;&lt;/code&gt; and looking at the &lt;strong&gt;Events&lt;/strong&gt; section at the bottom. This gives you a lot of information about the pod, including the events that have happened to it, similar to the following output (redacted to remove sensitive information).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;~ kubectl describe pod my-service-00000000078c9fff-dssbk &lt;span class="nt"&gt;-n&lt;/span&gt; my-service 

Name:             my-service-00000000078c9fff-dssbk
Namespace:        my-service
Priority:         0
Service Account:  default
Node:             &amp;lt;node-details&amp;gt;
Start Time:       Wed, 25 Jan 2023 15:36:14 +0100
Labels:           app.kubernetes.io/instance&lt;span class="o"&gt;=&lt;/span&gt;my-service
                  app.kubernetes.io/name&lt;span class="o"&gt;=&lt;/span&gt;my-service
                  pod-template-hash&lt;span class="o"&gt;=&lt;/span&gt;00000000
Annotations:      &amp;lt;annotations&amp;gt;
Status:           Pending
IP:               
IPs:
  IP:           
Controlled By:  ReplicaSet/
Containers:
  my-service:
    Container ID:
    Image:          &amp;lt;image-name&amp;gt;
    Image ID:
    Port:           80/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       CreateContainerConfigError
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     100m
      memory:  128Mi
    Requests:
      cpu:      100m
      memory:   128Mi
    Liveness:   http-get http://:http/ &lt;span class="nv"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15s &lt;span class="nb"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s &lt;span class="c"&gt;#success=1 #failure=3&lt;/span&gt;
    Readiness:  http-get http://:http/ &lt;span class="nv"&gt;delay&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;15s &lt;span class="nb"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s &lt;span class="nv"&gt;period&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60s &lt;span class="c"&gt;#success=1 #failure=3&lt;/span&gt;
    Environment:
    &lt;span class="o"&gt;(&lt;/span&gt;...&lt;span class="o"&gt;)&lt;/span&gt;
      AzureWebJobsStorage:                                                  &amp;lt;&lt;span class="nb"&gt;set &lt;/span&gt;to the key &lt;span class="s1"&gt;'AzureWebJobsStorage'&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;secret &lt;span class="s1"&gt;'my-service'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                                     Optional: &lt;span class="nb"&gt;false
      &lt;/span&gt;AzureAccessKey:                                                       &amp;lt;&lt;span class="nb"&gt;set &lt;/span&gt;to the key &lt;span class="s1"&gt;'AzureAccessKey'&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;secret &lt;span class="s1"&gt;'my-service'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                                          Optional: &lt;span class="nb"&gt;false
      &lt;/span&gt;AzureTopicEndpoint:                                                   &amp;lt;&lt;span class="nb"&gt;set &lt;/span&gt;to the key &lt;span class="s1"&gt;'AzureTopicEndpoint'&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;secret &lt;span class="s1"&gt;'my-service'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                                      Optional: &lt;span class="nb"&gt;false
      &lt;/span&gt;ClientId:                                                             &amp;lt;&lt;span class="nb"&gt;set &lt;/span&gt;to the key &lt;span class="s1"&gt;'ClientId'&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;secret &lt;span class="s1"&gt;'my-service'&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;                                                Optional: &lt;span class="nb"&gt;false&lt;/span&gt;
    &lt;span class="o"&gt;(&lt;/span&gt;...&lt;span class="o"&gt;)&lt;/span&gt;
    State:          Waiting
    &lt;span class="o"&gt;(&lt;/span&gt;...&lt;span class="o"&gt;)&lt;/span&gt;
Events:
  Type     Reason     Age                From               Message
  &lt;span class="nt"&gt;----&lt;/span&gt;     &lt;span class="nt"&gt;------&lt;/span&gt;     &lt;span class="nt"&gt;----&lt;/span&gt;               &lt;span class="nt"&gt;----&lt;/span&gt;               &lt;span class="nt"&gt;-------&lt;/span&gt;
  Normal   Scheduled  94s                default-scheduler  Successfully assigned my-service/my-service-00000000078c9fff-dssbk to &amp;lt;node-name&amp;gt;
  Normal   Pulled     94s                kubelet            Successfully pulled image &lt;span class="s2"&gt;"image"&lt;/span&gt; &lt;span class="k"&gt;in &lt;/span&gt;165.014261ms
  Warning  Failed     77s &lt;span class="o"&gt;(&lt;/span&gt;x4 over 94s&lt;span class="o"&gt;)&lt;/span&gt;  kubelet            Error: couldn&lt;span class="s1"&gt;'t find key ClientId in Secret my-service/my-service
(...)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Events section shows a list of all the events that have occurred in the process of creating the pod. &lt;/p&gt;

&lt;p&gt;And here we find the issue. The pod is actually missing a secret key (the ClientId, in my case) that it needs to start. And that is why the pod is in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;    State:          Waiting
      Reason:       CreateContainerConfigError
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
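&lt;p&gt;Since the describe output is long, it can also help to filter it down to just the Warning events, which usually point straight at the root cause. A minimal sketch with grep, where a few redacted sample lines stand in for live output (on a real cluster you could pipe &lt;code&gt;kubectl describe pod &amp;lt;pod-name&amp;gt; -n &amp;lt;namespace&amp;gt;&lt;/code&gt; into the same filter):&lt;/p&gt;

```shell
# Redacted lines from a `kubectl describe pod` Events section stand in for live output.
events='Normal   Scheduled  94s  default-scheduler  Successfully assigned my-service pod to node
Normal   Pulled     94s  kubelet            Successfully pulled image "image"
Warning  Failed     77s  kubelet            Error: could not find key ClientId in Secret my-service/my-service'

# Keep only the Warning events.
warnings=$(printf '%s\n' "$events" | grep 'Warning')
echo "$warnings"
```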



&lt;p&gt;If you want to double-check that the secret is missing, you can run &lt;code&gt;kubectl get secrets -n &amp;lt;namespace&amp;gt;&lt;/code&gt; and check whether the secret is there.&lt;/p&gt;

&lt;p&gt;Or you can output it in a JSON format and check that the key is missing by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get secret my-service &lt;span class="nt"&gt;-n&lt;/span&gt; my-service &lt;span class="nt"&gt;-o&lt;/span&gt; json | jq &lt;span class="s1"&gt;'.data | map_values(@base64d)'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
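&lt;p&gt;To see what the &lt;code&gt;@base64d&lt;/code&gt; filter in that command is doing, here is the decoding step on its own, with a made-up value standing in for real secret data:&lt;/p&gt;

```shell
# Secret values are stored base64-encoded under .data; jq's @base64d decodes them.
encoded='bXktY2xpZW50LWlk'   # hypothetical ClientId value as it would appear in .data
decoded=$(printf '%s' "$encoded" | base64 --decode)
echo "$decoded"
```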



&lt;h2&gt;
  
  
  How to Resolve the CreateContainerConfigError
&lt;/h2&gt;

&lt;p&gt;In my case, because I store my infrastructure and configuration (including the Kubernetes secrets) in Terraform, I just needed to add the secret to the Terraform configuration and apply it. Because the deployment had already timed out, I also had to re-run the deployment; the pod would have picked the secret up automatically if I had applied it a bit sooner.&lt;/p&gt;
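&lt;p&gt;For reference, here is a hypothetical sketch of what such a Terraform addition can look like, using the &lt;code&gt;kubernetes_secret&lt;/code&gt; resource from the Terraform Kubernetes provider. The names match the example above, and the variables are placeholders for however you inject the actual values:&lt;/p&gt;

```hcl
# Hypothetical sketch: the Secret managed in Terraform, with the missing key added.
resource "kubernetes_secret" "my_service" {
  metadata {
    name      = "my-service"
    namespace = "my-service"
  }

  data = {
    AzureWebJobsStorage = var.azure_webjobs_storage
    AzureAccessKey      = var.azure_access_key
    AzureTopicEndpoint  = var.azure_topic_endpoint
    ClientId            = var.client_id # the key the pod could not find
  }
}
```

&lt;p&gt;Note that the provider base64-encodes the &lt;code&gt;data&lt;/code&gt; values for you, so they are supplied in plain text here.&lt;/p&gt;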

&lt;p&gt;Now that the pod has its necessary, valid configuration, if we run &lt;code&gt;kubectl get pods -n &amp;lt;namespace&amp;gt;&lt;/code&gt; again we can see that the pod is in a Running state.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; my-service                                                                   
NAME                                READY   STATUS                       RESTARTS       AGE
my-service-00000000078c9fff-dssbk   2/2     Running                       1 &lt;span class="o"&gt;(&lt;/span&gt;10s ago&lt;span class="o"&gt;)&lt;/span&gt;    28s
my-service-00000000bcddf7d-xfsmk    2/2     Terminating                  25 &lt;span class="o"&gt;(&lt;/span&gt;42h ago&lt;span class="o"&gt;)&lt;/span&gt;   16d
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And there you have it. You have successfully resolved the CreateContainerConfigError.&lt;/p&gt;

&lt;p&gt;This was an easy one, let me know what you encountered in the comments below and how you fixed it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Happy Coding and I hope this helps someone!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>crypto</category>
      <category>blockchain</category>
      <category>web3</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Book Review: Observability Engineering: Achieving Production Excellence</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Wed, 25 Jan 2023 12:50:46 +0000</pubDate>
      <link>https://dev.to/the_cozma/book-review-observability-engineering-achieving-production-excellence-h9o</link>
      <guid>https://dev.to/the_cozma/book-review-observability-engineering-achieving-production-excellence-h9o</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fana-cozma.github.io%2Fblog%2Fcoffee%2Fbook-review-observability%2Fobservability.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fana-cozma.github.io%2Fblog%2Fcoffee%2Fbook-review-observability%2Fobservability.png" alt="Observability Engineering" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the first book review in what I will call my &lt;em&gt;coffee reads&lt;/em&gt; section of the blog I will be reviewing the book &lt;a href="https://www.oreilly.com/library/view/observability-engineering/9781492050046/" rel="noopener noreferrer"&gt;Observability Engineering: Achieving Production Excellence&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Observability Engineering: Achieving Production Excellence
&lt;/h2&gt;

&lt;p&gt;The book, in its own description, sets out to be an advocate for the adoption of observability practices in the software industry. Written by Charity Majors, Liz Fong-Jones, and George Miranda from &lt;a href="https://www.honeycomb.io/" rel="noopener noreferrer"&gt;Honeycomb.io&lt;/a&gt;, it aims to be a resource for anyone interested in learning more about what good observability is, how you can build it on top of your system today, and how to implement it in your organization.&lt;/p&gt;

&lt;p&gt;The book, consisting of around 400 pages, is split into 3 main parts: the first part is an introduction to observability, the second part is a deep dive into the different observability tools and practices, and the third part is a guide on how to implement observability in your organization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thoughts on the book
&lt;/h2&gt;

&lt;p&gt;I bought this book for Kindle and went into it with basic knowledge about observability and having used the Honeycomb product a bit for work. &lt;/p&gt;

&lt;p&gt;My expectations from the title and the summary of the book were to learn about the different tools and practices in the realm of observability in addition to getting a better understanding of the theory behind observability and how it can be implemented in an organization.&lt;/p&gt;

&lt;h3&gt;
  
  
  Highlights
&lt;/h3&gt;

&lt;p&gt;The introduction to observability section was a very high-level overview of what observability is and how it differs from monitoring.&lt;/p&gt;

&lt;p&gt;I appreciated the change in mentality the book triggered in me around the concept of observability versus the concept of monitoring. The book did a very good job of explaining how observability takes a more holistic approach than monitoring. I could relate to the examples given in the book about how monitoring is a reactive approach to problems while observability is a proactive one, and to all the frustrations that come from relying solely on monitoring: excessive alerting, false positives, lack of context, excessive use of dashboards, and a lack of understanding of the system.&lt;/p&gt;

&lt;p&gt;The book does not focus solely on the tools that Honeycomb uses; it tries to offer an overview of what the market has for this goal, and it does a good job of explaining the different tools and practices in terms of observability. The section on &lt;em&gt;tracing&lt;/em&gt; explains how tracing works and how it can be used to understand the flow of a request through a system, in great detail, with examples and code snippets. I found it easy to follow, and because of this I understood what benefits it brings and how it can be used. The section on &lt;em&gt;metrics&lt;/em&gt; was also very useful in explaining how you can understand the health of a system by using metrics and how you can also make use of them to detect any system anomalies.&lt;/p&gt;

&lt;p&gt;The key takeaway from this chapter was that by using observability concepts and tooling correctly you don't need to rely on software engineers to understand the system; you can use the data, which is available to everyone, to understand the system and make decisions based on it.&lt;/p&gt;

&lt;p&gt;The book also emphasizes the organizational and cultural changes that need to be made to implement observability in an organization. As an example, the book explains how the concept of blameless postmortems is a good way to encourage a culture of learning and how it can be used to improve the system, and how, if you rely on data, all the software engineers with a hero complex will have to adapt. I had a lot of respect for the authors for being honest about these points and for highlighting the day-to-day realities of implementing observability in an organization.&lt;/p&gt;

&lt;p&gt;The idea of adding case studies of companies that have implemented observability in their organization was a very nice touch, and I enjoyed reading about how different companies went about adopting observability practices and the challenges they needed to address.&lt;/p&gt;

&lt;h3&gt;
  
  
  Areas for Improvement
&lt;/h3&gt;

&lt;p&gt;Onto the things that I think could be revisited in the next versions of the book.&lt;/p&gt;

&lt;p&gt;There is a bit of repetition in the book, especially in the first part, where the same concepts are explained in different ways and the same idea (the difference between observability and monitoring) is reiterated several times. This could definitely be reduced in the next edition of the book.&lt;/p&gt;

&lt;p&gt;The book also does not go into implementation details on how observability can &lt;em&gt;actually&lt;/em&gt; be adopted. It reiterates the challenges but is a bit succinct on how to address them, and the same goes for implementing observability practices in your own system.&lt;/p&gt;

&lt;p&gt;Some sections had a lot of detailed code snippets that were hard to read, and I skimmed over them to get the gist, while in other sections I would have liked to see more code snippets to help me understand the concepts better.&lt;/p&gt;

&lt;p&gt;The case study section was too high-level for my liking. I would've loved to read more about the &lt;em&gt;challenges&lt;/em&gt; they faced and &lt;em&gt;how&lt;/em&gt; they addressed them, rather than team collaboration being mentioned as the success metric in adopting observability. I wanted the case studies to go into detail about the tools they used, how they used them, and what their lessons learned were.&lt;/p&gt;

&lt;p&gt;Because of this, the book felt a bit unbalanced at times, in my opinion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;To summarize, I would say the book has a lot of good key takeaways to set you on the path to adopting observability practices, whether in your day-to-day work, at team level, or maybe even at organizational level, but it did fall short in some areas. Nevertheless, it was a good read and I would recommend it to anyone interested in learning more about observability.&lt;/p&gt;

&lt;p&gt;If you've read the book let me know what you thought about it in the comments below. Or if you have any recommendations for other books on the topic of observability I would love to hear them.&lt;/p&gt;

&lt;p&gt;I will also keep reviewing tech books on my blog as part of the Coffee Reads series.&lt;/p&gt;

&lt;p&gt;Enjoy your coffee!&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>mcp</category>
      <category>community</category>
    </item>
    <item>
      <title>Kube-bench and Popeye: A Power Duo for AKS Security Compliance</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Mon, 23 Jan 2023 18:22:01 +0000</pubDate>
      <link>https://dev.to/the_cozma/kube-bench-and-popeye-a-power-duo-for-aks-security-compliance-38</link>
      <guid>https://dev.to/the_cozma/kube-bench-and-popeye-a-power-duo-for-aks-security-compliance-38</guid>
      <description>&lt;p&gt;In today's world, security is a top priority for any organization or at least it should be. With the rise of cloud computing, the number of security threats has increased exponentially.&lt;/p&gt;

&lt;p&gt;So how do we keep up? Where do we start?&lt;/p&gt;

&lt;p&gt;Microsoft has created a set of security benchmarks to give users a starting point for setting up their security configurations. The Microsoft cloud security benchmark (MCSB) is the successor of the Azure Security Benchmark (ASB), which was rebranded in October 2022 (currently in public preview).&lt;/p&gt;

&lt;p&gt;In this post, I would like to go over the Azure security baseline for Azure Kubernetes Service and give a shoutout to two tools that can aid you in the process of establishing your compliance with the baseline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Azure Security Baseline for AKS
&lt;/h2&gt;

&lt;p&gt;The Azure Security Baseline for &lt;a href="https://learn.microsoft.com/en-us/security/benchmark/azure/baselines/aks-security-baseline" rel="noopener noreferrer"&gt;Azure Kubernetes Service&lt;/a&gt; (AKS) is a set of recommendations for securing your AKS cluster.&lt;/p&gt;

&lt;p&gt;It is an exhaustive list covering various aspects of AKS security, and it also provides the corresponding actions to take in each case. From the documentation's overview:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You can monitor this security baseline and its recommendations using Microsoft Defender for Cloud. Azure Policy definitions will be listed in the Regulatory Compliance section of the Microsoft Defender for Cloud dashboard.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;When a section has relevant Azure Policy Definitions, they are listed in this baseline to help you measure compliance to the Azure Security Benchmark controls and recommendations. Some recommendations may require a paid Microsoft Defender plan to enable certain security scenarios.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is based on the CIS Kubernetes Benchmark and the Azure Security Benchmark v1.0.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;CIS Benchmarks are best practices for the secure configuration of a target system. Available for more than 100 CIS Benchmarks across 25+ vendor product families, CIS Benchmarks are developed through a unique consensus-based process comprised of cybersecurity professionals and subject matter experts around the world. CIS Benchmarks are the only consensus-based, best-practice security configuration guides both developed and accepted by government, business, industry, and academia.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more information on &lt;strong&gt;CIS Benchmark&lt;/strong&gt; please check &lt;a href="https://www.cisecurity.org/cis-benchmarks/cis-benchmarks-faq" rel="noopener noreferrer"&gt;CIS Benchmark FAQ&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For more information on the &lt;strong&gt;CIS Benchmark for Kubernetes&lt;/strong&gt; please check the &lt;a href="https://www.cisecurity.org/benchmark/kubernetes" rel="noopener noreferrer"&gt;kubernetes benchmark&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;In the CIS Benchmark for Kubernetes document, there are instructions for both Master nodes and Worker nodes. But when using AKS we don't have access to the master nodes. In this case, we can make use of the &lt;a href="https://www.cisecurity.org/insights/blog/new-release-cis-azure-kubernetes-service-aks-benchmark" rel="noopener noreferrer"&gt;CIS Benchmark document for AKS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What could we use to help us check our AKS setup against this benchmark?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We can start by looking at the Azure Portal and &lt;strong&gt;Microsoft Defender for Cloud&lt;/strong&gt;, checking out CIS compliance with &lt;strong&gt;Kube-bench&lt;/strong&gt; and any configuration mismatches with &lt;strong&gt;Popeye&lt;/strong&gt;. I will go into more detail on the last two tools. But first, let's see what Microsoft Defender for Cloud looks like and what you can get from it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Microsoft Defender for Cloud
&lt;/h2&gt;

&lt;p&gt;As suggested by Microsoft, we can start with Microsoft Defender for Cloud.&lt;br&gt;
If you go to the Azure Portal, search for Microsoft Defender for Cloud, filter by "Assessed Resources", and select your cluster, you will see all the cluster details along with the &lt;em&gt;Recommendations&lt;/em&gt; and &lt;em&gt;Alerts&lt;/em&gt; tabs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1oe5cfp5xn6kwojxqc7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa1oe5cfp5xn6kwojxqc7.png" alt="Microsoft Defender for Cloud" width="800" height="406"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take the first recommendation as an example:&lt;br&gt;
&lt;em&gt;Azure Kubernetes Service clusters should have Defender profile enabled&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If you click on it and expand it, you will see the following information:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50x27p24c7lj2ponmolp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50x27p24c7lj2ponmolp.png" alt="Detail" width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can choose to Exempt it (meaning you have either fixed the issue or don't want to fix it) or Enforce it (meaning you want to enforce this setting by adding it to an Azure Policy definition).&lt;/p&gt;

&lt;p&gt;There is also a nice description of the issue and suggested remediation steps to take.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsxkzii2vipbxbbdvoas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsxkzii2vipbxbbdvoas.png" alt="Kube-bench" width="411" height="410"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Kube-bench
&lt;/h2&gt;

&lt;p&gt;The official repository can be found &lt;a href="https://github.com/aquasecurity/kube-bench" rel="noopener noreferrer"&gt;here&lt;/a&gt; with detailed installation instructions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;kube-bench is a tool that checks whether Kubernetes is deployed securely by running the checks documented in the CIS Kubernetes Benchmark.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are multiple ways of running this tool that you can check &lt;a href="https://github.com/aquasecurity/kube-bench/blob/main/docs/running.md" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Setting it up
&lt;/h3&gt;

&lt;p&gt;To test out this tool, I decided to just apply it to my local cluster, so the first thing I did was start my &lt;a href="https://minikube.sigs.k8s.io/docs/start/" rel="noopener noreferrer"&gt;minikube&lt;/a&gt; instance and then run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; minikube start
😄  minikube v1.22.0 on Darwin 12.6.2
✨  Using the hyperkit driver based on existing profile
👍  Starting control plane node minikube &lt;span class="k"&gt;in &lt;/span&gt;cluster minikube
🏃  Updating the running hyperkit &lt;span class="s2"&gt;"minikube"&lt;/span&gt; VM ...
🎉  minikube 1.28.0 is available! Download it: https://github.com/kubernetes/minikube/releases/tag/v1.28.0
💡  To disable this notice, run: &lt;span class="s1"&gt;'minikube config set WantUpdateNotification false'&lt;/span&gt;

🐳  Preparing Kubernetes v1.21.2 on Docker 20.10.6 ...
🔎  Verifying Kubernetes components...
    ▪ Using image gcr.io/k8s-minikube/storage-provisioner:v5
🌟  Enabled addons: storage-provisioner, default-storageclass

❗  /usr/local/bin/kubectl is version 1.25.2, which may have incompatibilites with Kubernetes 1.21.2.
    ▪ Want kubectl v1.21.2? Try &lt;span class="s1"&gt;'minikube kubectl -- get pods -A'&lt;/span&gt;
🏄  Done! kubectl is now configured to use &lt;span class="s2"&gt;"minikube"&lt;/span&gt; cluster and &lt;span class="s2"&gt;"default"&lt;/span&gt; namespace by default

&lt;span class="c"&gt;# Download the job.yaml file&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; curl https://raw.githubusercontent.com/aquasecurity/kube-bench/main/job.yaml &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; job.yaml

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; job.yaml
job.batch/kube-bench created

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt;                                                                                                                                ✔  at minikube ⎈ 
NAMESPACE       NAME                                        READY   STATUS              RESTARTS   AGE
default         kube-bench-t2fgh                            0/1     ContainerCreating   0          5s

&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; kubectl get pods &lt;span class="nt"&gt;-A&lt;/span&gt;                                                                                                                                ✔  at minikube ⎈
NAMESPACE       NAME                                        READY   STATUS      RESTARTS   AGE
default         kube-bench-t2fgh                            0/1     Completed   0          32s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run Kube-bench inside a pod, but it will need access to the host's PID namespace to check the running processes, as well as access to some directories on the host where config files and other files are stored.&lt;/p&gt;

&lt;p&gt;The supplied &lt;code&gt;job.yaml&lt;/code&gt; file can be applied to run the tests as a job. This was enough for me to run locally to get a feel of what the tool does and how it generates the report.&lt;/p&gt;

&lt;p&gt;Next, after having run the tests, I wanted to get the report. The results of the tests can be found in the logs of the pod which you can get by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; kubectl logs kube-bench-t2fgh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
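&lt;p&gt;Since the full report is long, a quick tally of the result types can help you gauge where you stand. A small sketch, with sample result lines standing in for the real log (on a live cluster you could pipe &lt;code&gt;kubectl logs kube-bench-t2fgh&lt;/code&gt; into the same loop):&lt;/p&gt;

```shell
# Sample kube-bench result lines stand in for the real pod log here.
report='[PASS] 1.1.1 Ensure that the API server pod specification file permissions are set to 644 or more restrictive
[WARN] 1.1.9 Ensure that the Container Network Interface file permissions are set to 644 or more restrictive
[FAIL] 1.1.11 Ensure that the etcd data directory permissions are set to 700 or more restrictive
[PASS] 1.1.13 Ensure that the admin.conf file permissions are set to 644 or more restrictive'

# Count each result type; grep -c returns the number of matching lines.
for status in PASS FAIL WARN; do
  count=$(printf '%s\n' "$report" | grep -c "^\[$status\]")
  echo "$status: $count"
done
```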



&lt;p&gt;Kube-bench generates a report that looks like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO] 1 Master Node Security Configuration
[INFO] 1.1 Master Node Configuration Files
[PASS] 1.1.1 Ensure that the API server pod specification file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.2 Ensure that the API server pod specification file ownership is set to root:root (Automated)
[PASS] 1.1.3 Ensure that the controller manager pod specification file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.4 Ensure that the controller manager pod specification file ownership is set to root:root (Automated)
[PASS] 1.1.5 Ensure that the scheduler pod specification file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.6 Ensure that the scheduler pod specification file ownership is set to root:root (Automated)
[PASS] 1.1.7 Ensure that the etcd pod specification file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.8 Ensure that the etcd pod specification file ownership is set to root:root (Automated)
[WARN] 1.1.9 Ensure that the Container Network Interface file permissions are set to 644 or more restrictive (Manual)
[WARN] 1.1.10 Ensure that the Container Network Interface file ownership is set to root:root (Manual)
[FAIL] 1.1.11 Ensure that the etcd data directory permissions are set to 700 or more restrictive (Automated)
[FAIL] 1.1.12 Ensure that the etcd data directory ownership is set to etcd:etcd (Automated)
[PASS] 1.1.13 Ensure that the admin.conf file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.14 Ensure that the admin.conf file ownership is set to root:root (Automated)
[PASS] 1.1.15 Ensure that the scheduler.conf file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.16 Ensure that the scheduler.conf file ownership is set to root:root (Automated)
[PASS] 1.1.17 Ensure that the controller-manager.conf file permissions are set to 644 or more restrictive (Automated)
[PASS] 1.1.18 Ensure that the controller-manager.conf file ownership is set to root:root (Automated)
[FAIL] 1.1.19 Ensure that the Kubernetes PKI directory and file ownership is set to root:root (Automated)
[WARN] 1.1.20 Ensure that the Kubernetes PKI certificate file permissions are set to 644 or more restrictive (Manual)
[WARN] 1.1.21 Ensure that the Kubernetes PKI key file permissions are set to 600 (Manual)
[INFO] 1.2 API Server
[WARN] 1.2.1 Ensure that the --anonymous-auth argument is set to false (Manual)
[PASS] 1.2.2 Ensure that the --token-auth-file parameter is not set (Automated)
[PASS] 1.2.3 Ensure that the --kubelet-https argument is set to true (Automated)
[PASS] 1.2.4 Ensure that the --kubelet-client-certificate and --kubelet-client-key arguments are set as appropriate (Automated)
[FAIL] 1.2.5 Ensure that the --kubelet-certificate-authority argument is set as appropriate (Automated)
[PASS] 1.2.6 Ensure that the --authorization-mode argument is not set to AlwaysAllow (Automated)
[PASS] 1.2.7 Ensure that the --authorization-mode argument includes Node (Automated)
[PASS] 1.2.8 Ensure that the --authorization-mode argument includes RBAC (Automated)
[WARN] 1.2.9 Ensure that the admission control plugin EventRateLimit is set (Manual)
[PASS] 1.2.10 Ensure that the admission control plugin AlwaysAdmit is not set (Automated)
[WARN] 1.2.11 Ensure that the admission control plugin AlwaysPullImages is set (Manual)
[WARN] 1.2.12 Ensure that the admission control plugin SecurityContextDeny is set if PodSecurityPolicy is not used (Manual)
[PASS] 1.2.13 Ensure that the admission control plugin ServiceAccount is set (Automated)
[PASS] 1.2.14 Ensure that the admission control plugin NamespaceLifecycle is set (Automated)
[FAIL] 1.2.15 Ensure that the admission control plugin PodSecurityPolicy is set (Automated)
[PASS] 1.2.16 Ensure that the admission control plugin NodeRestriction is set (Automated)
[PASS] 1.2.17 Ensure that the --insecure-bind-address argument is not set (Automated)
[PASS] 1.2.18 Ensure that the --insecure-port argument is set to 0 (Automated)
[PASS] 1.2.19 Ensure that the --secure-port argument is not set to 0 (Automated)
[FAIL] 1.2.20 Ensure that the --profiling argument is set to false (Automated)
[FAIL] 1.2.21 Ensure that the --audit-log-path argument is set (Automated)
[FAIL] 1.2.22 Ensure that the --audit-log-maxage argument is set to 30 or as appropriate (Automated)
[FAIL] 1.2.23 Ensure that the --audit-log-maxbackup argument is set to 10 or as appropriate (Automated)
[FAIL] 1.2.24 Ensure that the --audit-log-maxsize argument is set to 100 or as appropriate (Automated)
[WARN] 1.2.25 Ensure that the --request-timeout argument is set as appropriate (Manual)
[PASS] 1.2.26 Ensure that the --service-account-lookup argument is set to true (Automated)
[PASS] 1.2.27 Ensure that the --service-account-key-file argument is set as appropriate (Automated)
[PASS] 1.2.28 Ensure that the --etcd-certfile and --etcd-keyfile arguments are set as appropriate (Automated)
[PASS] 1.2.29 Ensure that the --tls-cert-file and --tls-private-key-file arguments are set as appropriate (Automated)
[PASS] 1.2.30 Ensure that the --client-ca-file argument is set as appropriate (Automated)
[PASS] 1.2.31 Ensure that the --etcd-cafile argument is set as appropriate (Automated)
[WARN] 1.2.32 Ensure that the --encryption-provider-config argument is set as appropriate (Manual)
[WARN] 1.2.33 Ensure that encryption providers are appropriately configured (Manual)
[WARN] 1.2.34 Ensure that the API Server only makes use of Strong Cryptographic Ciphers (Manual)
[INFO] 1.3 Controller Manager
[WARN] 1.3.1 Ensure that the --terminated-pod-gc-threshold argument is set as appropriate (Manual)
[FAIL] 1.3.2 Ensure that the --profiling argument is set to false (Automated)
[PASS] 1.3.3 Ensure that the --use-service-account-credentials argument is set to true (Automated)
[PASS] 1.3.4 Ensure that the --service-account-private-key-file argument is set as appropriate (Automated)
[PASS] 1.3.5 Ensure that the --root-ca-file argument is set as appropriate (Automated)
[PASS] 1.3.6 Ensure that the RotateKubeletServerCertificate argument is set to true (Automated)
[PASS] 1.3.7 Ensure that the --bind-address argument is set to 127.0.0.1 (Automated)
[INFO] 1.4 Scheduler
[FAIL] 1.4.1 Ensure that the --profiling argument is set to false (Automated)
[PASS] 1.4.2 Ensure that the --bind-address argument is set to 127.0.0.1 (Automated)

== Remediations master ==
1.1.9 Run the below command (based on the file location on your system) on the master node.
For example,
chmod 644 &amp;lt;path/to/cni/files&amp;gt;

1.1.10 Run the below command (based on the file location on your system) on the master node.
For example,
chown root:root &amp;lt;path/to/cni/files&amp;gt;

1.1.11 On the etcd server node, get the etcd data directory, passed as an argument --data-dir,
from the below command:
ps -ef | grep etcd
Run the below command (based on the etcd data directory found above). For example,
chmod 700 /var/lib/etcd

1.1.12 On the etcd server node, get the etcd data directory, passed as an argument --data-dir,
from the below command:
ps -ef | grep etcd
Run the below command (based on the etcd data directory found above).
For example, chown etcd:etcd /var/lib/etcd

1.1.19 Run the below command (based on the file location on your system) on the master node.
For example,
chown -R root:root /etc/kubernetes/pki/

1.2.5 Follow the Kubernetes documentation and setup the TLS connection between
the apiserver and kubelets. Then, edit the API server pod specification file
/etc/kubernetes/manifests/kube-apiserver.yaml on the master node and set the
--kubelet-certificate-authority parameter to the path to the cert file for the certificate authority.
--kubelet-certificate-authority=&amp;lt;ca-string&amp;gt;

1.2.9 Follow the Kubernetes documentation and set the desired limits in a configuration file.
Then, edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
and set the below parameters.
--enable-admission-plugins=...,EventRateLimit,...
--admission-control-config-file=&amp;lt;path/to/configuration/file&amp;gt;

1.2.11 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --enable-admission-plugins parameter to include
AlwaysPullImages.
--enable-admission-plugins=...,AlwaysPullImages,...

1.2.12 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --enable-admission-plugins parameter to include
SecurityContextDeny, unless PodSecurityPolicy is already in place.
--enable-admission-plugins=...,SecurityContextDeny,...

1.2.15 Follow the documentation and create Pod Security Policy objects as per your environment.
Then, edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --enable-admission-plugins parameter to a
value that includes PodSecurityPolicy:
--enable-admission-plugins=...,PodSecurityPolicy,...
Then restart the API Server.

1.2.20 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the below parameter.
--profiling=false

1.2.21 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --audit-log-path parameter to a suitable path and
file where you would like audit logs to be written, for example:
--audit-log-path=/var/log/apiserver/audit.log

1.2.22 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --audit-log-maxage parameter to 30 or as an appropriate number of days:
--audit-log-maxage=30

1.2.23 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --audit-log-maxbackup parameter to 10 or to an appropriate
value.
--audit-log-maxbackup=10

1.2.24 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --audit-log-maxsize parameter to an appropriate size in MB.
For example, to set it as 100 MB:
--audit-log-maxsize=100

1.2.25 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
and set the below parameter as appropriate and if needed.
For example,
--request-timeout=300s

1.2.32 Follow the Kubernetes documentation and configure an EncryptionConfig file.
Then, edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the --encryption-provider-config parameter to the path of that file: --encryption-provider-config=&amp;lt;/path/to/EncryptionConfig/File&amp;gt;

1.2.33 Follow the Kubernetes documentation and configure an EncryptionConfig file.
In this file, choose aescbc, kms or secretbox as the encryption provider.

1.2.34 Edit the API server pod specification file /etc/kubernetes/manifests/kube-apiserver.yaml
on the master node and set the below parameter.
--tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM
_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM
_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM
_SHA384

1.3.1 Edit the Controller Manager pod specification file /etc/kubernetes/manifests/kube-controller-manager.yaml
on the master node and set the --terminated-pod-gc-threshold to an appropriate threshold,
for example:
--terminated-pod-gc-threshold=10

1.3.2 Edit the Controller Manager pod specification file /etc/kubernetes/manifests/kube-controller-manager.yaml
on the master node and set the below parameter.
--profiling=false

1.4.1 Edit the Scheduler pod specification file /etc/kubernetes/manifests/kube-scheduler.yaml file
on the master node and set the below parameter.
--profiling=false


== Summary master ==
39 checks PASS
12 checks FAIL
13 checks WARN
0 checks INFO

[INFO] 2 Etcd Node Configuration
[INFO] 2 Etcd Node Configuration Files
[PASS] 2.1 Ensure that the --cert-file and --key-file arguments are set as appropriate (Automated)
[PASS] 2.2 Ensure that the --client-cert-auth argument is set to true (Automated)
[PASS] 2.3 Ensure that the --auto-tls argument is not set to true (Automated)
[PASS] 2.4 Ensure that the --peer-cert-file and --peer-key-file arguments are set as appropriate (Automated)
[PASS] 2.5 Ensure that the --peer-client-cert-auth argument is set to true (Automated)
[PASS] 2.6 Ensure that the --peer-auto-tls argument is not set to true (Automated)
[PASS] 2.7 Ensure that a unique Certificate Authority is used for etcd (Manual)

== Summary etcd ==
7 checks PASS
0 checks FAIL
0 checks WARN
0 checks INFO

[INFO] 3 Control Plane Configuration
[INFO] 3.1 Authentication and Authorization
[WARN] 3.1.1 Client certificate authentication should not be used for users (Manual)
[INFO] 3.2 Logging
[WARN] 3.2.1 Ensure that a minimal audit policy is created (Manual)
[WARN] 3.2.2 Ensure that the audit policy covers key security concerns (Manual)

== Remediations controlplane ==
3.1.1 Alternative mechanisms provided by Kubernetes such as the use of OIDC should be
implemented in place of client certificates.

3.2.1 Create an audit policy file for your cluster.

3.2.2 Consider modification of the audit policy in use on the cluster to include these items, at a
minimum.


== Summary controlplane ==
0 checks PASS
0 checks FAIL
3 checks WARN
0 checks INFO

[INFO] 4 Worker Node Security Configuration
[INFO] 4.1 Worker Node Configuration Files
[PASS] 4.1.1 Ensure that the kubelet service file permissions are set to 644 or more restrictive (Automated)
[PASS] 4.1.2 Ensure that the kubelet service file ownership is set to root:root (Automated)
[PASS] 4.1.3 If proxy kubeconfig file exists ensure permissions are set to 644 or more restrictive (Manual)
[PASS] 4.1.4 If proxy kubeconfig file exists ensure ownership is set to root:root (Manual)
[PASS] 4.1.5 Ensure that the --kubeconfig kubelet.conf file permissions are set to 644 or more restrictive (Automated)
[PASS] 4.1.6 Ensure that the --kubeconfig kubelet.conf file ownership is set to root:root (Automated)
[WARN] 4.1.7 Ensure that the certificate authorities file permissions are set to 644 or more restrictive (Manual)
[WARN] 4.1.8 Ensure that the client certificate authorities file ownership is set to root:root (Manual)
[PASS] 4.1.9 Ensure that the kubelet --config configuration file has permissions set to 644 or more restrictive (Automated)
[PASS] 4.1.10 Ensure that the kubelet --config configuration file ownership is set to root:root (Automated)
[INFO] 4.2 Kubelet
[PASS] 4.2.1 Ensure that the anonymous-auth argument is set to false (Automated)
[PASS] 4.2.2 Ensure that the --authorization-mode argument is not set to AlwaysAllow (Automated)
[PASS] 4.2.3 Ensure that the --client-ca-file argument is set as appropriate (Automated)
[PASS] 4.2.4 Ensure that the --read-only-port argument is set to 0 (Manual)
[PASS] 4.2.5 Ensure that the --streaming-connection-idle-timeout argument is not set to 0 (Manual)
[FAIL] 4.2.6 Ensure that the --protect-kernel-defaults argument is set to true (Automated)
[PASS] 4.2.7 Ensure that the --make-iptables-util-chains argument is set to true (Automated)
[WARN] 4.2.8 Ensure that the --hostname-override argument is not set (Manual)
[WARN] 4.2.9 Ensure that the --event-qps argument is set to 0 or a level which ensures appropriate event capture (Manual)
[WARN] 4.2.10 Ensure that the --tls-cert-file and --tls-private-key-file arguments are set as appropriate (Manual)
[PASS] 4.2.11 Ensure that the --rotate-certificates argument is not set to false (Automated)
[PASS] 4.2.12 Verify that the RotateKubeletServerCertificate argument is set to true (Manual)
[WARN] 4.2.13 Ensure that the Kubelet only makes use of Strong Cryptographic Ciphers (Manual)

== Remediations node ==
4.1.7 Run the following command to modify the file permissions of the
--client-ca-file chmod 644 &amp;lt;filename&amp;gt;

4.1.8 Run the following command to modify the ownership of the --client-ca-file.
chown root:root &amp;lt;filename&amp;gt;

4.2.6 If using a Kubelet config file, edit the file to set protectKernelDefaults: true.
If using command line arguments, edit the kubelet service file
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf on each worker node and
set the below parameter in KUBELET_SYSTEM_PODS_ARGS variable.
--protect-kernel-defaults=true
Based on your system, restart the kubelet service. For example:
systemctl daemon-reload
systemctl restart kubelet.service

4.2.8 Edit the kubelet service file /etc/systemd/system/kubelet.service.d/10-kubeadm.conf
on each worker node and remove the --hostname-override argument from the
KUBELET_SYSTEM_PODS_ARGS variable.
Based on your system, restart the kubelet service. For example:
systemctl daemon-reload
systemctl restart kubelet.service

4.2.9 If using a Kubelet config file, edit the file to set eventRecordQPS: to an appropriate level.
If using command line arguments, edit the kubelet service file
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf on each worker node and
set the below parameter in KUBELET_SYSTEM_PODS_ARGS variable.
Based on your system, restart the kubelet service. For example:
systemctl daemon-reload
systemctl restart kubelet.service

4.2.10 If using a Kubelet config file, edit the file to set tlsCertFile to the location
of the certificate file to use to identify this Kubelet, and tlsPrivateKeyFile
to the location of the corresponding private key file.
If using command line arguments, edit the kubelet service file
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf on each worker node and
set the below parameters in KUBELET_CERTIFICATE_ARGS variable.
--tls-cert-file=&amp;lt;path/to/tls-certificate-file&amp;gt;
--tls-private-key-file=&amp;lt;path/to/tls-key-file&amp;gt;
Based on your system, restart the kubelet service. For example:
systemctl daemon-reload
systemctl restart kubelet.service

4.2.13 If using a Kubelet config file, edit the file to set TLSCipherSuites: to
TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
or to a subset of these values.
If using executable arguments, edit the kubelet service file
/etc/systemd/system/kubelet.service.d/10-kubeadm.conf on each worker node and
set the --tls-cipher-suites parameter as follows, or to a subset of these values.
--tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256
Based on your system, restart the kubelet service. For example:
systemctl daemon-reload
systemctl restart kubelet.service


== Summary node ==
16 checks PASS
1 checks FAIL
6 checks WARN
0 checks INFO

[INFO] 5 Kubernetes Policies
[INFO] 5.1 RBAC and Service Accounts
[WARN] 5.1.1 Ensure that the cluster-admin role is only used where required (Manual)
[WARN] 5.1.2 Minimize access to secrets (Manual)
[WARN] 5.1.3 Minimize wildcard use in Roles and ClusterRoles (Manual)
[WARN] 5.1.4 Minimize access to create pods (Manual)
[WARN] 5.1.5 Ensure that default service accounts are not actively used. (Manual)
[WARN] 5.1.6 Ensure that Service Account Tokens are only mounted where necessary (Manual)
[WARN] 5.1.7 Avoid use of system:masters group (Manual)
[WARN] 5.1.8 Limit use of the Bind, Impersonate and Escalate permissions in the Kubernetes cluster (Manual)
[INFO] 5.2 Pod Security Policies
[WARN] 5.2.1 Minimize the admission of privileged containers (Automated)
[WARN] 5.2.2 Minimize the admission of containers wishing to share the host process ID namespace (Automated)
[WARN] 5.2.3 Minimize the admission of containers wishing to share the host IPC namespace (Automated)
[WARN] 5.2.4 Minimize the admission of containers wishing to share the host network namespace (Automated)
[WARN] 5.2.5 Minimize the admission of containers with allowPrivilegeEscalation (Automated)
[WARN] 5.2.6 Minimize the admission of root containers (Automated)
[WARN] 5.2.7 Minimize the admission of containers with the NET_RAW capability (Automated)
[WARN] 5.2.8 Minimize the admission of containers with added capabilities (Automated)
[WARN] 5.2.9 Minimize the admission of containers with capabilities assigned (Manual)
[INFO] 5.3 Network Policies and CNI
[WARN] 5.3.1 Ensure that the CNI in use supports Network Policies (Manual)
[WARN] 5.3.2 Ensure that all Namespaces have Network Policies defined (Manual)
[INFO] 5.4 Secrets Management
[WARN] 5.4.1 Prefer using secrets as files over secrets as environment variables (Manual)
[WARN] 5.4.2 Consider external secret storage (Manual)
[INFO] 5.5 Extensible Admission Control
[WARN] 5.5.1 Configure Image Provenance using ImagePolicyWebhook admission controller (Manual)
[INFO] 5.7 General Policies
[WARN] 5.7.1 Create administrative boundaries between resources using namespaces (Manual)
[WARN] 5.7.2 Ensure that the seccomp profile is set to docker/default in your pod definitions (Manual)
[WARN] 5.7.3 Apply Security Context to Your Pods and Containers (Manual)
[WARN] 5.7.4 The default namespace should not be used (Manual)

== Remediations policies ==
5.1.1 Identify all clusterrolebindings to the cluster-admin role. Check if they are used and
if they need this role or if they could use a role with fewer privileges.
Where possible, first bind users to a lower privileged role and then remove the
clusterrolebinding to the cluster-admin role :
kubectl delete clusterrolebinding [name]

5.1.2 Where possible, remove get, list and watch access to secret objects in the cluster.

5.1.3 Where possible replace any use of wildcards in clusterroles and roles with specific
objects or actions.

5.1.4 Where possible, remove create access to pod objects in the cluster.

5.1.5 Create explicit service accounts wherever a Kubernetes workload requires specific access
to the Kubernetes API server.
Modify the configuration of each default service account to include this value
automountServiceAccountToken: false

5.1.6 Modify the definition of pods and service accounts which do not need to mount service
account tokens to disable it.

5.1.7 Remove the system:masters group from all users in the cluster.

5.1.8 Where possible, remove the impersonate, bind and escalate rights from subjects.

5.2.1 Create a PSP as described in the Kubernetes documentation, ensuring that
the .spec.privileged field is omitted or set to false.

5.2.2 Create a PSP as described in the Kubernetes documentation, ensuring that the
.spec.hostPID field is omitted or set to false.

5.2.3 Create a PSP as described in the Kubernetes documentation, ensuring that the
.spec.hostIPC field is omitted or set to false.

5.2.4 Create a PSP as described in the Kubernetes documentation, ensuring that the
.spec.hostNetwork field is omitted or set to false.

5.2.5 Create a PSP as described in the Kubernetes documentation, ensuring that the
.spec.allowPrivilegeEscalation field is omitted or set to false.

5.2.6 Create a PSP as described in the Kubernetes documentation, ensuring that the
.spec.runAsUser.rule is set to either MustRunAsNonRoot or MustRunAs with the range of
UIDs not including 0.

5.2.7 Create a PSP as described in the Kubernetes documentation, ensuring that the
.spec.requiredDropCapabilities is set to include either NET_RAW or ALL.

5.2.8 Ensure that allowedCapabilities is not present in PSPs for the cluster unless
it is set to an empty array.

5.2.9 Review the use of capabilities in applications running on your cluster. Where a namespace
contains applications which do not require any Linux capabilities to operate, consider adding
a PSP which forbids the admission of containers which do not drop all capabilities.

5.3.1 If the CNI plugin in use does not support network policies, consideration should be given to
making use of a different plugin, or finding an alternate mechanism for restricting traffic
in the Kubernetes cluster.

5.3.2 Follow the documentation and create NetworkPolicy objects as you need them.

5.4.1 If possible, rewrite application code to read secrets from mounted secret files, rather than
from environment variables.

5.4.2 Refer to the secrets management options offered by your cloud provider or a third-party
secrets management solution.

5.5.1 Follow the Kubernetes documentation and setup image provenance.

5.7.1 Follow the documentation and create namespaces for objects in your deployment as you need
them.

5.7.2 Use security context to enable the docker/default seccomp profile in your pod definitions.
An example is as below:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

5.7.3 Follow the Kubernetes documentation and apply security contexts to your pods. For a
suggested list of security contexts, you may refer to the CIS Security Benchmark for Docker
Containers.

5.7.4 Ensure that namespaces are created to allow for appropriate segregation of Kubernetes
resources and that all new resources are created in a specific namespace.


== Summary policies ==
0 checks PASS
0 checks FAIL
26 checks WARN
0 checks INFO

== Summary total ==
62 checks PASS
13 checks FAIL
48 checks WARN
0 checks INFO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This can also be run inside an AKS cluster by following the instructions &lt;a href="https://github.com/aquasecurity/kube-bench/blob/main/docs/running.md#running-in-an-aks-cluster" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
As a reminder, Kube-bench cannot be run on AKS master nodes, only on worker nodes; as mentioned before, this is a limitation of AKS, not of Kube-bench.&lt;/p&gt;
&lt;h3&gt;
  
  
  The report breakdown
&lt;/h3&gt;

&lt;p&gt;As the report above shows, Kube-bench benchmarks five areas of your configuration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Control Plane Components&lt;/li&gt;
&lt;li&gt;Etcd&lt;/li&gt;
&lt;li&gt;Control Plane Configurations&lt;/li&gt;
&lt;li&gt;Worker Nodes&lt;/li&gt;
&lt;li&gt;Policies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each section starts with lines tagged INFO that describe the area it targets. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO] 5 Kubernetes Policies
[INFO] 5.1 RBAC and Service Accounts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It then lists the checks performed for that section; each check receives a PASS, FAIL, or WARN status. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[WARN] 5.1.1 Ensure that the cluster-admin role is only used where required (Manual)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After the tests run, it also suggests remediations for the checks that received a WARN or FAIL status. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;== Remediations policies ==
5.1.1 Identify all clusterrolebindings to the cluster-admin role. Check if they are used and
if they need this role or if they could use a role with fewer privileges.
Where possible, first bind users to a lower privileged role and then remove the
clusterrolebinding to the cluster-admin role :
kubectl delete clusterrolebinding [name]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each section ends with a summary of its results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;== Summary policies ==
0 checks PASS
0 checks FAIL
26 checks WARN
0 checks INFO
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
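&lt;p&gt;If you save the report to a file, a few lines of shell make triaging it easier. This is a minimal sketch: the &lt;code&gt;kube-bench.txt&lt;/code&gt; filename and the embedded excerpt are illustrative stand-ins for a real report.&lt;/p&gt;

```shell
# Triage a saved kube-bench report from the shell.
# The file name and sample excerpt below are illustrative.
cat > kube-bench.txt <<'EOF'
[PASS] 1.2.2 Ensure that the --token-auth-file parameter is not set (Automated)
[FAIL] 1.2.5 Ensure that the --kubelet-certificate-authority argument is set as appropriate (Automated)
[WARN] 5.1.1 Ensure that the cluster-admin role is only used where required (Manual)
EOF

# Show only the failing checks so they can be prioritized
grep '^\[FAIL\]' kube-bench.txt

# Tally the results per status
for status in PASS FAIL WARN; do
  printf '%s: %s\n' "$status" "$(grep -c "^\[$status\]" kube-bench.txt)"
done
# → PASS: 1, FAIL: 1, WARN: 1
```

&lt;p&gt;In a real run you would capture the report instead, for example with &lt;code&gt;kube-bench run &amp;gt; kube-bench.txt&lt;/code&gt;.&lt;/p&gt;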



&lt;h3&gt;
  
  
  Potential use cases
&lt;/h3&gt;

&lt;p&gt;By running it as a CronJob in your cluster, Kube-bench can help you identify potential security issues on an ongoing basis. It is easy to use and a great tool to have in your toolbox.&lt;br&gt;
You can configure it to run on a schedule like every week or month and get a report on the security of your cluster, while also taking into account the specific CIS benchmark for your cloud provider. For example, you can set up and run the &lt;a href="https://github.com/aquasecurity/kube-bench/blob/main/job-aks.yaml" rel="noopener noreferrer"&gt;job-aks.yaml&lt;/a&gt; file to run the tests on an AKS cluster.&lt;/p&gt;
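&lt;p&gt;As a rough sketch of what such a scheduled run could look like, here is a minimal CronJob wrapping the Kube-bench job. The name, image tag, schedule, and mounts are illustrative, not taken from the official manifest; the linked job-aks.yaml is the authoritative starting point for AKS:&lt;/p&gt;

```yaml
# Illustrative sketch: run kube-bench on a weekly schedule
# (names, tag, and schedule are examples only)
apiVersion: batch/v1
kind: CronJob
metadata:
  name: kube-bench
spec:
  schedule: "0 6 * * 1"          # every Monday at 06:00
  jobTemplate:
    spec:
      template:
        spec:
          hostPID: true          # lets kube-bench inspect node processes
          restartPolicy: Never
          containers:
            - name: kube-bench
              image: docker.io/aquasec/kube-bench:latest
              command: ["kube-bench"]
              volumeMounts:
                - name: var-lib-kubelet
                  mountPath: /var/lib/kubelet
                  readOnly: true
          volumes:
            - name: var-lib-kubelet
              hostPath:
                path: /var/lib/kubelet
```

&lt;p&gt;Applying it with &lt;code&gt;kubectl apply -f&lt;/code&gt; schedules the scan, and each run's report is then available via &lt;code&gt;kubectl logs&lt;/code&gt; on the completed pod.&lt;/p&gt;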
&lt;h2&gt;
  
  
  Popeye - A Kubernetes Cluster Sanitizer
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy82msy4one7f55o1zvm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy82msy4one7f55o1zvm.png" alt="Popeye" width="425" height="425"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The repository for the tool can be found &lt;a href="https://github.com/derailed/popeye" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This is a read-only utility that scans live Kubernetes clusters and reports potential issues with deployed resources and configurations.&lt;br&gt;
What I liked about this tool is that it is easy to install and use, and it delivers on its promise: reducing the cognitive overload of operating a Kubernetes cluster in the wild.&lt;/p&gt;
&lt;h3&gt;
  
  
  Setting it up
&lt;/h3&gt;

&lt;p&gt;Popeye can be run standalone from the command line, configured with a &lt;a href="https://github.com/derailed/popeye/blob/master/spinach/spinach_aks.yml" rel="noopener noreferrer"&gt;spinach.yml&lt;/a&gt; file, or deployed directly in the cluster as a CronJob.&lt;/p&gt;
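&lt;p&gt;A spinach file lets you tune the scan, for example by excluding namespaces or silencing specific sanitizer codes. A minimal sketch follows; the schema has changed between Popeye versions, so treat the keys and names below as illustrative rather than authoritative:&lt;/p&gt;

```yaml
# Illustrative spinach.yml: trim noise from a Popeye scan
popeye:
  excludes:
    global:
      namespaces:
        - kube-system        # skip the AKS-managed system namespace
    linters:
      pods:
        codes:
          - "POP-301"        # example: mute one specific check code
```

&lt;p&gt;You would then point Popeye at it with something like &lt;code&gt;popeye -f spinach.yml&lt;/code&gt;.&lt;/p&gt;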

&lt;p&gt;In this post, I will be using the command-line option on a Mac. To install it, I ran:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Install popeye
&amp;gt; brew install derailed/popeye/popeye
# Check popeye version
&amp;gt; popeye version                                                                                                                   
 ___     ___ _____   _____                       K          .-'-.
| _ \___| _ \ __\ \ / / __|                       8     __|      `\
|  _/ _ \  _/ _| \ V /| _|                         s   `-,-`--._   `\
|_| \___/_| |___| |_| |___|                       []  .-&amp;gt;'  a     `|-'
  Biffs`em and Buffs`em!                            `=/ (__/_       /
                                                      \_,    `    _)
                                                         `----;  |
Version:   0.10.1
Commit:    ae19897a4b5d3738a3e98179207759e45a53a64c
Date:      2022-06-28T14:46:13Z
Logs:      /var/folders/vp/l8dlq0gn3x71f3vk82shmzlm0000gn/T/popeye.log

# Connected to my AKS cluster
# Check the context I am using
&amp;gt; kubectl config current-context
# Run popeye
&amp;gt; popeye
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The report breakdown
&lt;/h3&gt;

&lt;p&gt;The report is printed to the console and looks something like the snippet below. I have trimmed some of the output for brevity; it should still give you an idea of the format, the types of checks performed, and their results.&lt;/p&gt;

&lt;p&gt;The report is nicely split into sections, each with a summary of the checks performed and their results, and it ends by assigning the cluster a grade.&lt;/p&gt;

&lt;p&gt;The color coding is also very helpful to quickly identify the issues:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Level&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Icon&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Jurassic&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Color&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Ok&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;OK&lt;/td&gt;
&lt;td&gt;Green&lt;/td&gt;
&lt;td&gt;Happy!&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Info&lt;/td&gt;
&lt;td&gt;🔊&lt;/td&gt;
&lt;td&gt;I&lt;/td&gt;
&lt;td&gt;BlueGreen&lt;/td&gt;
&lt;td&gt;FYI&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warn&lt;/td&gt;
&lt;td&gt;😱&lt;/td&gt;
&lt;td&gt;W&lt;/td&gt;
&lt;td&gt;Yellow&lt;/td&gt;
&lt;td&gt;Potential Issue&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Error&lt;/td&gt;
&lt;td&gt;💥&lt;/td&gt;
&lt;td&gt;E&lt;/td&gt;
&lt;td&gt;Red&lt;/td&gt;
&lt;td&gt;Action required&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; popeye                                                                                                                                      

 ___     ___ _____   _____                                                      K          .-&lt;span class="s1"&gt;'-.
| _ \___| _ \ __\ \ / / __|                                                      8     __|      `\
|  _/ _ \  _/ _| \ V /| _|                                                        s   `-,-`--._   `\
|_| \___/_| |___| |_| |___|                                                      []  .-&amp;gt;'&lt;/span&gt;  a     &lt;span class="sb"&gt;`&lt;/span&gt;|-&lt;span class="s1"&gt;'
  Biffs`em and Buffs`em!                                                          `=/ (__/_       /
                                                                                    \_,    `    _)
                                                                                       `----;  |


GENERAL [AKS-STAGING]
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · Connectivity...................................................................................✅
  · MetricServer...................................................................................✅


CLUSTER (1 SCANNED)                                                          💥 0 😱 0 🔊 0 ✅ 1 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · Version........................................................................................✅
    ✅ [POP-406] K8s version OK.


CLUSTERROLES (13 SCANNED)                                                   💥 0 😱 0 🔊 0 ✅ 13 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · azure-policy-webhook-cluster-role..............................................................✅
  · dapr-operator-admin............................................................................✅
  · dashboard-reader...............................................................................✅
  · gatekeeper-manager-role........................................................................✅
  · grafana-agent..................................................................................✅
  · keda-operator..................................................................................✅
  · keda-operator-external-metrics-reader..........................................................✅
  · kong-kong......................................................................................✅
  · omsagent-reader................................................................................✅
  · policy-agent...................................................................................✅
  · system:coredns-autoscaler......................................................................✅
  · system:metrics-server..........................................................................✅
  · system:prometheus..............................................................................✅


CLUSTERROLEBINDINGS (19 SCANNED)                                             💥 0 😱 6 🔊 0 ✅ 13 68٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · azure-policy-webhook-cluster-rolebinding.......................................................✅
  · dapr-operator..................................................................................✅
  · dapr-role-tokenreview-binding..................................................................😱
    😱 [POP-1300] References a ClusterRole (system:auth-delegator) which does not exist.
  · dashboard-reader-global........................................................................✅
  · gatekeeper-manager-rolebinding.................................................................✅
  · grafana-agent..................................................................................✅
  · keda-operator..................................................................................✅
  · keda-operator-hpa-controller-external-metrics..................................................✅
  · keda-operator-system-auth-delegator............................................................😱
    😱 [POP-1300] References a ClusterRole (system:auth-delegator) which does not exist.
  · kong-kong......................................................................................✅
  · kubelet-api-admin..............................................................................😱
    😱 [POP-1300] References a ClusterRole (system:kubelet-api-admin) which does not exist.
  · metrics-server:system:auth-delegator...........................................................😱
    😱 [POP-1300] References a ClusterRole (system:auth-delegator) which does not exist.
  · omsagentclusterrolebinding.....................................................................✅
  · policy-agent...................................................................................✅
  · replicaset-controller..........................................................................😱
    😱 [POP-1300] References a ClusterRole (system:controller:replicaset-controller) which does not
       exist.
  · system:coredns-autoscaler......................................................................✅
  · system:discovery...............................................................................😱
    😱 [POP-1300] References a ClusterRole (system:discovery) which does not exist.
  · system:metrics-server..........................................................................✅
  · system:prometheus..............................................................................✅


CONFIGMAPS (43 SCANNED)                                                     💥 0 😱 0 🔊 37 ✅ 6 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/kube-root-ca.crt...................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · dapr-system/operator.dapr.io...................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · dapr-system/webhooks.dapr.io...................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · default/kube-root-ca.crt.......................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
(...)


DAEMONSETS (9 SCANNED)                                                        💥 0 😱 2 🔊 0 ✅ 7 77٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · kube-system/azure-ip-masq-agent................................................................✅
  · kube-system/cloud-node-manager.................................................................✅
  · kube-system/cloud-node-manager-windows.........................................................✅
  · kube-system/csi-azuredisk-node.................................................................✅
  · kube-system/csi-azuredisk-node-win.............................................................✅
  · kube-system/csi-azurefile-node.................................................................✅
  · kube-system/csi-azurefile-node-win.............................................................✅
  · kube-system/kube-proxy.........................................................................😱
    🐳 kube-proxy
      😱 [POP-107] No resource limits defined.
    🐳 kube-proxy-bootstrap
      😱 [POP-107] No resource limits defined.
  · prometheus-agent/grafana-agent.................................................................😱
    🐳 agent
      😱 [POP-106] No resources requests/limits defined.


DEPLOYMENTS (29 SCANNED)                                                     💥 0 😱 4 🔊 3 ✅ 22 86٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-dashboard.....................................................................🔊
    🐳 dapr-dashboard
      🔊 [POP-108] Unnamed port 8080.
  · dapr-system/dapr-operator......................................................................🔊
    🐳 dapr-operator
      🔊 [POP-108] Unnamed port 6500.
  · dapr-system/dapr-sentry........................................................................🔊
    🐳 dapr-sentry
      🔊 [POP-108] Unnamed port 50001.
  · dapr-system/dapr-sidecar-injector..............................................................✅
  · gatekeeper-system/gatekeeper-audit.............................................................✅
  · gatekeeper-system/gatekeeper-controller........................................................✅
  · keda-system/keda-operator......................................................................✅
  · keda-system/keda-operator-metrics-apiserver....................................................✅
  · kong/kong-kong.................................................................................😱
    🐳 ingress-controller
      😱 [POP-106] No resources requests/limits defined.
    🐳 proxy
      😱 [POP-106] No resources requests/limits defined.
  · kube-system/azure-policy.......................................................................✅
  · kube-system/azure-policy-webhook...............................................................✅
  · kube-system/coredns............................................................................✅
  · kube-system/coredns-autoscaler.................................................................✅
  · kube-system/konnectivity-agent.................................................................✅
  · kube-system/metrics-server.....................................................................✅
(...)


HORIZONTALPODAUTOSCALERS (4 SCANNED)                                          💥 0 😱 2 🔊 0 ✅ 2 50٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · HPA............................................................................................😱
    😱 [POP-604] If ALL HPAs triggered, 11500m will match/exceed cluster CPU(4931m) capacity by
       6570m.
    😱 [POP-605] If ALL HPAs triggered, 14720Mi will match/exceed cluster memory(1523Mi) capacity by
       13196Mi.
  · my-service/keda-hpa-my-service.................................................................✅
  · my-service-2/keda-hpa-my-service-2.............................................................😱
    😱 [POP-602] Replicas (1/100) at burst will match/exceed cluster CPU(4931m) capacity by 5070m.
    😱 [POP-603] Replicas (1/100) at burst will match/exceed cluster memory(1523Mi) capacity by
       11276Mi.
(...)


INGRESSES (13 SCANNED)                                                      💥 0 😱 0 🔊 0 ✅ 13 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · my-service/my-service..........................................................................✅
  · kong/kong-dapr.................................................................................✅
(...)


NAMESPACES (23 SCANNED)                                                     💥 0 😱 0 🔊 3 ✅ 20 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system....................................................................................✅
  · default........................................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · gatekeeper-system..............................................................................✅
  · keda-system....................................................................................✅
  · kong...........................................................................................✅
  · kube-node-lease................................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · kube-public....................................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · kube-system....................................................................................✅
  · logstash.......................................................................................✅
  · prometheus-agent...............................................................................✅
(...)


NETWORKPOLICIES (2 SCANNED)                                                  💥 0 😱 0 🔊 0 ✅ 2 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · kube-system/konnectivity-agent.................................................................✅
  · kube-system/tunnelfront........................................................................✅


NODES (3 SCANNED)                                                             💥 0 😱 2 🔊 0 ✅ 1 33٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · aks-default-***-vmss000000.....................................................................😱
    😱 [POP-710] Memory threshold (80%) reached 83%.
  · aks-default-***-vmss000001.....................................................................😱
    😱 [POP-710] Memory threshold (80%) reached 105%.
  · aks-default-***-vmss00000a.....................................................................✅


PERSISTENTVOLUMES (3 SCANNED)                                                💥 0 😱 0 🔊 0 ✅ 3 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · pvc-**.........................................................................................✅
  · pvc-**.........................................................................................✅
  · pvc-**.........................................................................................✅


PERSISTENTVOLUMECLAIMS (3 SCANNED)                                           💥 0 😱 0 🔊 0 ✅ 3 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/raft-log-dapr-placement-server-0...................................................✅
  · dapr-system/raft-log-dapr-placement-server-1...................................................✅
  · dapr-system/raft-log-dapr-placement-server-2...................................................✅


PODS (72 SCANNED)                                                             💥 1 😱 71 🔊 0 ✅ 0 0٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-dashboard-1-lq942.............................................................😱
    😱 [POP-301] Connects to API Server? ServiceAccount token is mounted.
    🐳 dapr-dashboard
      😱 [POP-102] No probes defined.
      🔊 [POP-108] Unnamed port 8080.
  · dapr-system/dapr-operator-1-chmmq..............................................................😱
    😱 [POP-301] Connects to API Server? ServiceAccount token is mounted.
    🐳 dapr-operator
      😱 [POP-205] Pod was restarted (17) times.
      🔊 [POP-105] Liveness probe uses a port#, prefer a named port.
      🔊 [POP-105] Readiness probe uses a port#, prefer a named port.
      🔊 [POP-108] Unnamed port 6500.
  · dapr-system/dapr-placement-server-0............................................................😱
    😱 [POP-301] Connects to API Server? ServiceAccount token is mounted.
    😱 [POP-302] Pod could be running as root user. Check SecurityContext/Image.
    🐳 dapr-placement-server
      🔊 [POP-105] Liveness probe uses a port#, prefer a named port.
      🔊 [POP-105] Readiness probe uses a port#, prefer a named port.
      😱 [POP-306] Container could be running as root user. Check SecurityContext/Image.
(...)


PODDISRUPTIONBUDGETS (8 SCANNED)                                              💥 0 😱 2 🔊 0 ✅ 6 75٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-dashboard-disruption-budget...................................................✅
  · dapr-system/dapr-operator-disruption-budget....................................................✅
  · dapr-system/dapr-placement-server-disruption-budget............................................✅
  · dapr-system/dapr-sentry-budget.................................................................✅
  · dapr-system/dapr-sidecar-injector-disruption-budget............................................✅
  · kube-system/coredns-pdb........................................................................😱
    😱 [POP-403] Deprecated PodDisruptionBudget API group "policy/v1beta1". Use "policy/v1" instead.
  · kube-system/konnectivity-agent.................................................................😱
    😱 [POP-403] Deprecated PodDisruptionBudget API group "policy/v1beta1". Use "policy/v1" instead.
  · logstash/logstash-logstash-pdb.................................................................✅


PODSECURITYPOLICIES (0 SCANNED)                                              💥 0 😱 0 🔊 0 ✅ 0 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · Nothing to report.


REPLICASETS (197 SCANNED)                                                  💥 0 😱 0 🔊 0 ✅ 197 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-dashboard-1...................................................................✅
  · dapr-system/dapr-operator-1....................................................................✅
  · dapr-system/dapr-sentry-1......................................................................✅
  · dapr-system/dapr-sidecar-injector-1............................................................✅
  · keda-system/keda-operator-1....................................................................✅
  · keda-system/keda-operator-metrics-apiserver-1..................................................✅
(...)


ROLES (5 SCANNED)                                                            💥 0 😱 0 🔊 0 ✅ 5 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · default/secret-reader..........................................................................✅
  · gatekeeper-system/gatekeeper-manager-role......................................................✅
  · kong/kong-kong.................................................................................✅
  · kube-system/azure-policy-webhook-role..........................................................✅
  · kube-system/policy-pod-agent...................................................................✅


ROLEBINDINGS (7 SCANNED)                                                      💥 0 😱 2 🔊 0 ✅ 5 71٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · default/dapr-secret-reader.....................................................................✅
  · gatekeeper-system/gatekeeper-manager-rolebinding...............................................✅
  · kong/kong-kong.................................................................................✅
  · kube-system/azure-policy-webhook-rolebinding...................................................✅
  · kube-system/keda-operator-auth-reader..........................................................😱
    😱 [POP-1300] References a Role (kube-system/extension-apiserver-authentication-reader) which
       does not exist.
  · kube-system/metrics-server-auth-reader.........................................................😱
    😱 [POP-1300] References a Role (kube-system/extension-apiserver-authentication-reader) which
       does not exist.
  · kube-system/policy-pod-agent...................................................................✅


SECRETS (277 SCANNED)                                                     💥 0 😱 0 🔊 254 ✅ 23 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-operator-token-tdnb4..........................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · dapr-system/dapr-sidecar-injector-cert.........................................................✅
  · dapr-system/dapr-trust-bundle..................................................................✅
(...)


SERVICES (35 SCANNED)                                                        💥 3 😱 17 🔊 7 ✅ 8 42٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-api...........................................................................🔊
    🔊 [POP-1102] Use of target port #6500 for service port TCP::80. Prefer named port.
  · dapr-system/dapr-dashboard.....................................................................😱
    🔊 [POP-1102] Use of target port #8080 for service port TCP::8080. Prefer named port.
    😱 [POP-1109] Only one Pod associated with this endpoint.
  · dapr-system/dapr-placement-server..............................................................🔊
    🔊 [POP-1102] Use of target port #50005 for service port TCP:api:50005. Prefer named port.
    🔊 [POP-1102] Use of target port #8201 for service port TCP:raft-node:8201. Prefer named port.
  · dapr-system/dapr-sentry........................................................................🔊
    🔊 [POP-1102] Use of target port #50001 for service port TCP::80. Prefer named port.
  · dapr-system/dapr-sidecar-injector..............................................................✅
  · dapr-system/dapr-webhook.......................................................................💥
    💥 [POP-1106] No target ports match service port TCP::443.
  · default/dapr-eventgrid-func-dapr...............................................................💥
    💥 [POP-1100] No pods match service selector.
    💥 [POP-1105] No associated endpoints.
  · default/kubernetes.............................................................................✅
(...)


SERVICEACCOUNTS (38 SCANNED)                                                 💥 0 😱 1 🔊 8 ✅ 29 97٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-operator......................................................................✅
  · dapr-system/dashboard-reader...................................................................✅
  · dapr-system/default............................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · default/default................................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · domain-event-emitter/default...................................................................✅
  · domain-start/default...........................................................................✅
  · gatekeeper-system/default......................................................................🔊
    🔊 [POP-400] Used? Unable to locate resource reference.
  · gatekeeper-system/gatekeeper-admin.............................................................✅
(...)


STATEFULSETS (2 SCANNED)                                                     💥 0 😱 0 🔊 0 ✅ 2 100٪
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
  · dapr-system/dapr-placement-server..............................................................✅
  · logstash/logstash-logstash.....................................................................✅


SUMMARY
┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅┅
Your cluster score: 82 -- B
                                                                                o          .-'&lt;/span&gt;-.
                                                                                 o     __| B    &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
                                                                                  o   &lt;span class="sb"&gt;`&lt;/span&gt;-,-&lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;--&lt;/span&gt;._   &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="se"&gt;\&lt;/span&gt;
                                                                                 &lt;span class="o"&gt;[]&lt;/span&gt;  .-&amp;gt;&lt;span class="s1"&gt;'  a     `|-'&lt;/span&gt;
                                                                                  &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/ &lt;span class="o"&gt;(&lt;/span&gt;__/_       /
                                                                                    &lt;span class="se"&gt;\_&lt;/span&gt;,    &lt;span class="sb"&gt;`&lt;/span&gt;    _&lt;span class="o"&gt;)&lt;/span&gt;
                                                                                       &lt;span class="sb"&gt;`&lt;/span&gt;&lt;span class="nt"&gt;----&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;  |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Popeye has a somewhat different scope than Kube-bench: it scans your cluster for adherence to best practices and for potential issues. It is not a security scanner like Kube-bench, but it can surface potential security issues and support your day-to-day cluster management.&lt;/p&gt;

&lt;p&gt;It targets the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;nodes&lt;/li&gt;
&lt;li&gt;namespaces&lt;/li&gt;
&lt;li&gt;pods&lt;/li&gt;
&lt;li&gt;services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Its main focus is finding misconfigurations such as port mismatches, dead or unused resources, missing resource requests/limits, and so on. You can find the full list of available sanitizers &lt;a href="https://github.com/derailed/popeye#sanitizers" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Potential use cases
&lt;/h3&gt;

&lt;p&gt;Should you choose to run it in a pipeline or as a scheduled job, you can build on it and derive action items from the errors in the generated report.&lt;/p&gt;

&lt;p&gt;Alternatively, it can simply be a handy little tool to run against your cluster to get a quick overview of its state, from which you can extract action items for yourself. In that case, you can save the report using the &lt;code&gt;--save&lt;/code&gt; flag and attach it to your JIRA ticket or PR.&lt;/p&gt;

&lt;p&gt;This will save the report and print out its location:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; popeye &lt;span class="nt"&gt;--save&lt;/span&gt;                                                                                                                    
/var/folders/vp/l8dlq0gn3x71f3vk82shmzlm0000gn/T/popeye/sanitizer_aks-staging_1673563387070145000.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
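If you run Popeye in a pipeline, you could go a step further and gate the build on the overall cluster score. The sketch below is illustrative, not from the original post: the JSON field names are assumptions about the &lt;code&gt;-o json&lt;/code&gt; report shape (verify them against your Popeye version), and a canned report stands in for a real scan so the gating logic is self-contained.

```shell
#!/bin/sh
# In a real pipeline you would generate the report first, e.g.:
#   popeye -o json --save
# Here a canned report stands in so the gating logic runs on its own.
# NOTE: the "score" field name and JSON shape are assumptions; verify them
# against the JSON your Popeye version actually emits.
printf '%s\n' '{"popeye": {"score": 82, "grade": "B"}}' > popeye-report.json

THRESHOLD=80

# Extract the numeric score without requiring jq.
score=$(sed -n 's/.*"score": *\([0-9][0-9]*\).*/\1/p' popeye-report.json)
echo "Cluster score: ${score}"

if [ "${score}" -lt "${THRESHOLD}" ]; then
  echo "Score ${score} is below threshold ${THRESHOLD}; failing the build."
  exit 1
fi
echo "Score ${score} meets threshold ${THRESHOLD}."
```

With a gate like this, a drop in the cluster score fails the pipeline and forces the team to look at the report before merging.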



&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;I quite enjoyed using both Popeye and Kube-bench. They are useful in different ways: Popeye is more of a cluster management tool, while Kube-bench is more of a security scanner, but both can be used to improve the security of your cluster.&lt;/p&gt;

&lt;p&gt;In the next posts in this series, we will look at the reports in more detail, focusing on the errors and seeing what we can do to improve our cluster score.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Until then, thank you for reading, and I hope you found this useful. Let me know what other tools you use!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>Terraform vs. Helm for managing K8s objects</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Wed, 10 Aug 2022 09:02:57 +0000</pubDate>
      <link>https://dev.to/the_cozma/terraform-vs-helm-for-managing-k8s-objects-3bgh</link>
      <guid>https://dev.to/the_cozma/terraform-vs-helm-for-managing-k8s-objects-3bgh</guid>
      <description>&lt;p&gt;When I started migrating to Kubernetes (K8s) I discovered that I can use Terraform for managing not only the infrastructure, but also I could define the K8s objects in it, but I also could use Helm to handle that. But what would be a good way to handle this?&lt;/p&gt;

&lt;p&gt;In this post, we will cover Terraform and Helm for managing Kubernetes objects, with some code snippets and an idea of how you can use them together to get started.&lt;/p&gt;

&lt;h3&gt;
  
  
  Structure of the post:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;What is Terraform?&lt;/li&gt;
&lt;li&gt;Manage Kubernetes Resources via Terraform&lt;/li&gt;
&lt;li&gt;What is Helm?&lt;/li&gt;
&lt;li&gt;Manage Kubernetes Resources via Helm&lt;/li&gt;
&lt;li&gt;Using Helm and Terraform Together&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What is Terraform?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.terraform.io/" rel="noopener noreferrer"&gt;HashiCorp Terraform&lt;/a&gt; is an infrastructure as code tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share. &lt;/p&gt;

&lt;p&gt;It can manage low-level components like compute, storage, and networking resources, as well as high-level components like DNS entries and SaaS features.&lt;/p&gt;

&lt;p&gt;Terraform treats infrastructure as code (IaC), meaning teams manage their infrastructure setup with configuration files instead of a graphical user interface (think of the Azure Portal and the like).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So why would you use Terraform?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Some of the &lt;em&gt;benefits&lt;/em&gt; include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;It allows teams to build, change, and manage the infrastructure in a &lt;strong&gt;safe, consistent, and repeatable way&lt;/strong&gt; by defining resource configurations that can be versioned, reused, and shared.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It supports all major cloud &lt;strong&gt;providers&lt;/strong&gt;: Azure, AWS, GCP, and many others, which you can find &lt;a href="https://registry.terraform.io/browse/providers" rel="noopener noreferrer"&gt;by browsing their registry&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Providers define individual units of infrastructure, for example compute instances or private networks, as resources. You can compose resources from different providers into reusable Terraform configurations called modules, and manage them with a consistent language and workflow.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;You define your providers in the Terraform code as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;required_providers&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;helm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2.5.1"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;azuread&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/azuread"&lt;/span&gt;
      &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2.23.0"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;azurerm&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;source&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"hashicorp/azurerm"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Its configuration language is &lt;strong&gt;declarative&lt;/strong&gt;: &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Meaning that it describes the desired &lt;strong&gt;end-state for your infrastructure&lt;/strong&gt;, in contrast to procedural programming languages that require step-by-step instructions to perform tasks. Terraform providers automatically calculate dependencies between resources to create or destroy them in the correct order.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means that any new team member joining will be able to understand the infrastructure setup you have just by going through the configuration files.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Its &lt;strong&gt;state&lt;/strong&gt; allows you to track resource changes throughout your deployments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;All configurations are subject to &lt;strong&gt;version control&lt;/strong&gt; to safely collaborate on infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can read more on the advantages it brings by looking over the &lt;a href="https://www.terraform.io/use-cases/infrastructure-as-code" rel="noopener noreferrer"&gt;official use cases&lt;/a&gt; from the Terraform documentation.&lt;/p&gt;
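To make the declarative model concrete, here is a small illustrative sketch (not from the original post): you declare only the desired end-state, and Terraform works out whether it needs to create, update, or do nothing. It assumes the &lt;code&gt;hashicorp/kubernetes&lt;/code&gt; provider is configured.

```hcl
# Desired end-state: a namespace named "staging" exists with this label.
# There are no imperative steps; Terraform computes the create/update/no-op
# plan by comparing this declaration against its state and the cluster.
resource "kubernetes_namespace" "staging" {
  metadata {
    name = "staging"

    labels = {
      environment = "staging"
    }
  }
}
```

Running &lt;code&gt;terraform plan&lt;/code&gt; against this shows exactly which operations Terraform would perform to reach that state.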

&lt;p&gt;&lt;em&gt;Terraform - How does it work?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Terraform follows a simple &lt;em&gt;workflow&lt;/em&gt; for managing your infrastructure. Once a new resource or a change to existing resources is desired, the team will:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Initialize&lt;/strong&gt; the backend by running the &lt;code&gt;terraform init&lt;/code&gt; command, which will install the plugins Terraform needs to manage the infrastructure.&lt;/p&gt;

&lt;p&gt;The output will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;(...)&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;module1&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="err"&gt;../../&lt;/span&gt;&lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;module1&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;module2&lt;/span&gt; &lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="err"&gt;../../&lt;/span&gt;&lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;module2&lt;/span&gt;

&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;Successfully&lt;/span&gt; &lt;span class="nx"&gt;configured&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;will&lt;/span&gt; &lt;span class="nx"&gt;automatically&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt; &lt;span class="nx"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="nx"&gt;unless&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="nx"&gt;configuration&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="nx"&gt;plugins&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;kubernetes&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;azuread&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;azurerm&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;helm&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installing&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;helm&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;5.1&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installed&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;helm&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;5.1&lt;/span&gt; &lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;HashiCorp&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installing&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;kubernetes&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installed&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;kubernetes&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt; &lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;HashiCorp&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installing&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;azuread&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;23.0&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installed&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;azuread&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;23.0&lt;/span&gt; &lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;HashiCorp&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installing&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;azurerm&lt;/span&gt; &lt;span class="nx"&gt;v3&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Installed&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;azurerm&lt;/span&gt; &lt;span class="nx"&gt;v3&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;10.0&lt;/span&gt; &lt;span class="err"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;HashiCorp&lt;/span&gt;&lt;span class="err"&gt;)&lt;/span&gt;

&lt;span class="nx"&gt;Partner&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;community&lt;/span&gt; &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;signed&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;their&lt;/span&gt; &lt;span class="nx"&gt;developers&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;
&lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt;&lt;span class="s1"&gt;'d like to know more about provider signing, you can read about it here:
https://www.terraform.io/docs/cli/plugins/signing.html

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, Terraform initializes the backend (in our case, the state file is stored in a container in Azure) and the providers (azurerm, azuread, helm, kubernetes), and reports successful completion.&lt;/p&gt;
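&lt;p&gt;As a rough sketch, storing the state file in an Azure storage container can be configured with the &lt;code&gt;azurerm&lt;/code&gt; backend. The resource group, storage account, and container names below are placeholders, not values from this project:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform {
  backend "azurerm" {
    # Placeholder names - replace with your own storage details
    resource_group_name  = "rg-tfstate"
    storage_account_name = "sttfstate"
    container_name       = "tfstate"
    key                  = "example.terraform.tfstate"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;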

&lt;p&gt;&lt;strong&gt;Plan&lt;/strong&gt; the changes to be made by running the &lt;code&gt;terraform plan&lt;/code&gt; command, which shows a preview of the 'planned' changes Terraform will make to match your configuration.&lt;/p&gt;

&lt;p&gt;During the plan, Terraform will mark which resources will be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;added, marked with a &lt;code&gt;+&lt;/code&gt; sign,&lt;/li&gt;
&lt;li&gt;updated, marked with a &lt;code&gt;~&lt;/code&gt; sign, or&lt;/li&gt;
&lt;li&gt;deleted, marked with a &lt;code&gt;-&lt;/code&gt; sign.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example, Terraform is creating a brand new resource; as you can see, all attributes are marked with the &lt;code&gt;+&lt;/code&gt; sign:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;  &lt;span class="c1"&gt;# module.monitoring.helm_release.prometheus_agent[0] will be created&lt;/span&gt;
  &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"helm_release"&lt;/span&gt; &lt;span class="s2"&gt;"prometheus_agent"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;atomic&lt;/span&gt;                     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;chart&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"../../charts/prometheus-agent"&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;cleanup_on_fail&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;create_namespace&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;dependency_update&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;disable_crd_hooks&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;disable_openapi_validation&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;disable_webhooks&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
      &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;force_update&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="p"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But in this one, the resource already exists, and Terraform is only updating it in place, which is marked with the &lt;code&gt;~&lt;/code&gt; sign:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;used&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;selected&lt;/span&gt; &lt;span class="nx"&gt;providers&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;generate&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;following&lt;/span&gt; &lt;span class="nx"&gt;execution&lt;/span&gt; &lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Resource&lt;/span&gt; &lt;span class="nx"&gt;actions&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;indicated&lt;/span&gt; &lt;span class="nx"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;following&lt;/span&gt; &lt;span class="nx"&gt;symbols&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;create&lt;/span&gt;
  &lt;span class="err"&gt;~&lt;/span&gt; &lt;span class="nx"&gt;update&lt;/span&gt; &lt;span class="nx"&gt;in-place&lt;/span&gt;

&lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;will&lt;/span&gt; &lt;span class="nx"&gt;perform&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;following&lt;/span&gt; &lt;span class="nx"&gt;actions&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# module.grafana.grafana_organization.org will be updated in-place&lt;/span&gt;
  &lt;span class="err"&gt;~&lt;/span&gt; &lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"grafana_organization"&lt;/span&gt; &lt;span class="s2"&gt;"org"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="err"&gt;~&lt;/span&gt; &lt;span class="nx"&gt;admins&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="err"&gt;+&lt;/span&gt; &lt;span class="s2"&gt;"new_admin@grafana.admin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="nx"&gt;id&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"1"&lt;/span&gt;
        &lt;span class="nx"&gt;name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"MyOrg"&lt;/span&gt;
        &lt;span class="c1"&gt;# (5 unchanged attributes hidden)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Terraform keeps track of your real infrastructure in a &lt;em&gt;state&lt;/em&gt; file, which acts as a source of truth for your environment: Terraform uses this file to determine the changes needed to make your infrastructure match your configuration.&lt;/p&gt;

&lt;p&gt;This is also helpful to detect infrastructure drift between the desired state and the current one.&lt;/p&gt;
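&lt;p&gt;One way to surface drift explicitly (assuming Terraform v0.15.4 or later) is a refresh-only plan, which compares the state file against the real infrastructure without proposing configuration changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform plan -refresh-only
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;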

&lt;p&gt;&lt;strong&gt;Apply&lt;/strong&gt; the desired changes by running the &lt;code&gt;terraform apply&lt;/code&gt; command.&lt;/p&gt;

&lt;p&gt;A view from the Terraform official documentation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsslgd6zgbsqvz0uwmin.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwsslgd6zgbsqvz0uwmin.png" alt=" " width="800" height="290"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So this is Terraform in a nutshell. &lt;/p&gt;
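&lt;p&gt;Condensed, the whole workflow comes down to three commands. Saving the plan with &lt;code&gt;-out&lt;/code&gt; is optional, but it guarantees that &lt;code&gt;apply&lt;/code&gt; executes exactly the plan you reviewed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;terraform init
terraform plan -out=tfplan
terraform apply tfplan
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;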

&lt;h2&gt;
  
  
  Manage Kubernetes Resources via Terraform
&lt;/h2&gt;

&lt;p&gt;Terraform’s Kubernetes (K8S) provider is used to interact with the resources supported by Kubernetes and offers many benefits, but it's important to note that the provider is still relatively new. This means some resources might not yet be available in the provider, or there might be open bugs.&lt;/p&gt;

&lt;p&gt;That said, what would this look like?&lt;br&gt;
Let's look at an example of an AKS cluster defined in Terraform:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"azurerm_kubernetes_cluster"&lt;/span&gt; &lt;span class="s2"&gt;"main"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;name&lt;/span&gt;                              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"aks-${var.prefix}-${var.env}"&lt;/span&gt;
  &lt;span class="nx"&gt;location&lt;/span&gt;                          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_resource_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;location&lt;/span&gt;
  &lt;span class="nx"&gt;resource_group_name&lt;/span&gt;               &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azurerm_resource_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;
  &lt;span class="nx"&gt;dns_prefix&lt;/span&gt;                        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"${var.prefix}-${var.env}"&lt;/span&gt;
  &lt;span class="nx"&gt;role_based_access_control_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ad_admin_group&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="nx"&gt;kubernetes_version&lt;/span&gt;                &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;kubernetes_version&lt;/span&gt;

  &lt;span class="nx"&gt;dynamic&lt;/span&gt; &lt;span class="s2"&gt;"azure_active_directory_role_based_access_control"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ad_admin_group&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;""&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;admin_group_object_ids&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ad_admin_group&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
      &lt;span class="nx"&gt;azure_rbac_enabled&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;dynamic&lt;/span&gt; &lt;span class="s2"&gt;"oms_agent"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;for_each&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;oms_agent_enabled&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;log_analytics_workspace_id&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;log_analytics_workspace_id&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;azure_policy_enabled&lt;/span&gt;             &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;azure_policy_enabled&lt;/span&gt;
  &lt;span class="nx"&gt;http_application_routing_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
  &lt;span class="nx"&gt;api_server_authorized_ip_ranges&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;api_server_authorized_ip_ranges&lt;/span&gt;

  &lt;span class="nx"&gt;default_node_pool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"default"&lt;/span&gt;
    &lt;span class="nx"&gt;enable_auto_scaling&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_auto_scaling&lt;/span&gt;
    &lt;span class="nx"&gt;max_count&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_auto_scaling&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;max_count&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="nx"&gt;min_count&lt;/span&gt;            &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;enable_auto_scaling&lt;/span&gt; &lt;span class="err"&gt;?&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;min_count&lt;/span&gt; &lt;span class="err"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;
    &lt;span class="nx"&gt;node_count&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_count&lt;/span&gt;
    &lt;span class="nx"&gt;type&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"VirtualMachineScaleSets"&lt;/span&gt;
    &lt;span class="nx"&gt;vm_size&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_size&lt;/span&gt;
    &lt;span class="nx"&gt;tags&lt;/span&gt;                 &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;
    &lt;span class="nx"&gt;orchestrator_version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;node_pool_orchestrator_version&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;service_principal&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;client_id&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azuread_application&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;application_id&lt;/span&gt;
    &lt;span class="nx"&gt;client_secret&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;azuread_service_principal_password&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;tags&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tags&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once you have defined the resources, run &lt;code&gt;terraform apply&lt;/code&gt; and Terraform will provision your infrastructure.&lt;/p&gt;
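&lt;p&gt;For the Kubernetes provider to talk to this cluster, it can be wired to the credentials exported by the AKS resource. A minimal sketch, assuming the &lt;code&gt;azurerm_kubernetes_cluster.main&lt;/code&gt; resource from above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;provider "kubernetes" {
  # Read the cluster credentials from the AKS resource's kube_config output
  host                   = azurerm_kubernetes_cluster.main.kube_config.0.host
  client_certificate     = base64decode(azurerm_kubernetes_cluster.main.kube_config.0.client_certificate)
  client_key             = base64decode(azurerm_kubernetes_cluster.main.kube_config.0.client_key)
  cluster_ca_certificate = base64decode(azurerm_kubernetes_cluster.main.kube_config.0.cluster_ca_certificate)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;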

&lt;p&gt;For the &lt;strong&gt;&lt;em&gt;deployment&lt;/em&gt;&lt;/strong&gt;, we create a separate Terraform file using the &lt;code&gt;kubernetes_deployment&lt;/code&gt; resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_deployment"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;
    &lt;span class="nx"&gt;labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Example"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;replicas&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
    &lt;span class="nx"&gt;selector&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;match_labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Example"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Example"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;container&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="nx"&gt;image&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"nginx:1.7.8"&lt;/span&gt;
          &lt;span class="nx"&gt;name&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;

          &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;container_port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;

          &lt;span class="nx"&gt;resources&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;limits&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="nx"&gt;cpu&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"0.5"&lt;/span&gt;
              &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"512Mi"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="nx"&gt;requests&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="nx"&gt;cpu&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"250m"&lt;/span&gt;
              &lt;span class="nx"&gt;memory&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"50Mi"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To create this, run &lt;code&gt;terraform apply&lt;/code&gt; again and confirm the changes.&lt;/p&gt;

&lt;p&gt;The same approach applies to creating a &lt;strong&gt;&lt;em&gt;service&lt;/em&gt;&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_service"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;selector&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;kubernetes_deployment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;example&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;spec&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;labels&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;App&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;port&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;port&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
      &lt;span class="nx"&gt;target_port&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;80&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="nx"&gt;type&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"LoadBalancer"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to scale this setup, the approach is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; Make the changes to the replica count
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;  &lt;span class="nx"&gt;spec&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;replicas&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="nx"&gt;selector&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;match_labels&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;App&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Example"&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Apply the Terraform code and confirm the changes&lt;/li&gt;
&lt;/ul&gt;
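
&lt;p&gt;That workflow from the command line could look like this (a minimal sketch, assuming the Terraform code lives in the current directory):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Preview the change: the plan should show the replica count being updated
terraform plan

# Apply the change after reviewing the plan
terraform apply
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;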

&lt;p&gt;Aside from this, we can also define the &lt;code&gt;kubernetes_namespace&lt;/code&gt; resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_namespace"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;annotations&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And any secrets you might need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"kubernetes_secret"&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;metadata&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;
    &lt;span class="nx"&gt;namespace&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"example"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"some_setting"&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"false"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Using this approach means you can take &lt;em&gt;advantage&lt;/em&gt; of Terraform's benefits, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you use one tool for managing your infrastructure resources as well as your cluster&lt;/li&gt;
&lt;li&gt;you use one language for all your infrastructure resources and the K8s objects&lt;/li&gt;
&lt;li&gt;you can review the plan of your changes before provisioning resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;em&gt;disadvantages&lt;/em&gt; are, of course, that you need to be familiar with the HCL language, and if the team is new to it, adoption takes a bit of time.&lt;/p&gt;

&lt;p&gt;The K8s Terraform provider might not fully support all the beta objects, so you might need to wait.&lt;/p&gt;

&lt;p&gt;If you are interested in provisioning a cluster and all the K8s objects via Terraform, please check the &lt;a href="https://learn.hashicorp.com/tutorials/terraform/kubernetes-provider" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt; for step-by-step instructions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Helm?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;Helm&lt;/a&gt; is a package manager tool that helps you manage Kubernetes applications. Helm makes use of &lt;strong&gt;Helm Charts&lt;/strong&gt; to define, install, and upgrade Kubernetes application.&lt;/p&gt;

&lt;p&gt;Let's look over some terminology when working with Helm: &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Helm:&lt;/em&gt; is the command-line interface that helps you define, install, and upgrade your Kubernetes application using charts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Charts:&lt;/em&gt; are the format for Helm’s application package. The &lt;strong&gt;chart&lt;/strong&gt; is a bundle of information necessary to create an instance of a Kubernetes application: essentially a package of files and templates that gets converted into Kubernetes objects at deployment time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chart repository:&lt;/em&gt; is the location where you can store and share packaged charts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The config:&lt;/em&gt; contains configuration information that can be merged into a packaged chart to create a releasable object.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The Release:&lt;/em&gt; is a running instance of a chart, combined with a specific config. It is created by Helm to track the installation of the charts you created/defined.&lt;/p&gt;

&lt;p&gt;For more details, see the &lt;a href="https://helm.sh/docs/topics/architecture/" rel="noopener noreferrer"&gt;Helm architecture&lt;/a&gt; documentation.&lt;/p&gt;

&lt;p&gt;Some of the &lt;em&gt;benefits&lt;/em&gt; Helm brings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Charts are in YAML format and are &lt;strong&gt;reusable&lt;/strong&gt; because they provide repeatable application installation. Because of this you can use them in multiple environments (think dev, staging, and prod all using the same chart).&lt;/li&gt;
&lt;li&gt;Because charts provide a repeatable process, deployments become easier.&lt;/li&gt;
&lt;li&gt;A lot of charts are &lt;strong&gt;already &lt;a href="https://artifacthub.io/" rel="noopener noreferrer"&gt;available&lt;/a&gt;&lt;/strong&gt;, but you can create your own as well - &lt;strong&gt;custom&lt;/strong&gt; charts.&lt;/li&gt;
&lt;li&gt;You can create &lt;strong&gt;dependencies&lt;/strong&gt; between the charts and can also use &lt;a href="https://helm.sh/docs/chart_template_guide/subcharts_and_globals/" rel="noopener noreferrer"&gt;&lt;strong&gt;sub-charts&lt;/strong&gt;&lt;/a&gt; to add more flexibility to your setup.&lt;/li&gt;
&lt;li&gt;Charts serve as a single point of authority.&lt;/li&gt;
&lt;li&gt;Releases are tracked.&lt;/li&gt;
&lt;li&gt;You can upgrade or rollback multiple K8s objects together.&lt;/li&gt;
&lt;li&gt;Charts can be easily installed/uninstalled.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Manage Kubernetes Resources via Helm
&lt;/h2&gt;

&lt;p&gt;We looked over the main terminology when using Helm, but let's see what this looks like in practice.&lt;/p&gt;

&lt;p&gt;The first thing we need to do, in the repository where we have our code, is run the &lt;code&gt;helm create &amp;lt;app_name&amp;gt;&lt;/code&gt; command. This command creates a chart directory along with the common files and directories used in a chart. &lt;a href="https://helm.sh/docs/helm/helm_create/" rel="noopener noreferrer"&gt;More information on the command&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;And this will create a structure as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.
└── example
    ├── Chart.yaml
    ├── charts
    ├── templates
    │   ├── NOTES.txt
    │   ├── _helpers.tpl
    │   ├── deployment.yaml
    │   ├── hpa.yaml
    │   ├── ingress.yaml
    │   ├── service.yaml
    │   ├── serviceaccount.yaml
    │   └── tests
    │       └── test-connection.yaml
    └── values.yaml

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We notice it created:&lt;br&gt;
A &lt;em&gt;Chart.yaml&lt;/em&gt; file, which contains the metadata about the chart.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v2&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;example&lt;/span&gt;
&lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;A Helm chart for Kubernetes&lt;/span&gt;

&lt;span class="c1"&gt;# A chart can be either an 'application' or a 'library' chart.&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# Application charts are a collection of templates that can be packaged into versioned archives&lt;/span&gt;
&lt;span class="c1"&gt;# to be deployed.&lt;/span&gt;
&lt;span class="c1"&gt;#&lt;/span&gt;
&lt;span class="c1"&gt;# Library charts provide useful utilities or functions for the chart developer. They're included as&lt;/span&gt;
&lt;span class="c1"&gt;# a dependency of application charts to inject those utilities and functions into the rendering&lt;/span&gt;
&lt;span class="c1"&gt;# pipeline. Library charts do not define any templates and therefore cannot be deployed.&lt;/span&gt;
&lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;application&lt;/span&gt;

&lt;span class="c1"&gt;# This is the chart version. This version number should be incremented each time you make changes&lt;/span&gt;
&lt;span class="c1"&gt;# to the chart and its templates, including the app version.&lt;/span&gt;
&lt;span class="c1"&gt;# Versions are expected to follow Semantic Versioning (https://semver.org/)&lt;/span&gt;
&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;0.1.0&lt;/span&gt;

&lt;span class="c1"&gt;# This is the version number of the application being deployed. This version number should be&lt;/span&gt;
&lt;span class="c1"&gt;# incremented each time you make changes to the application. Versions are not expected to&lt;/span&gt;
&lt;span class="c1"&gt;# follow Semantic Versioning. They should reflect the version the application is using.&lt;/span&gt;
&lt;span class="c1"&gt;# It is recommended to use it with quotes.&lt;/span&gt;
&lt;span class="na"&gt;appVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1.16.0"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;em&gt;charts&lt;/em&gt; directory where you can add any charts that your chart depends on.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;templates&lt;/em&gt; directory:&lt;br&gt;
A directory that holds the template files, along with partials and helpers. The file called &lt;strong&gt;_helpers.tpl&lt;/strong&gt; is the default location for template partials that the rest of the YAML files rely on, as we will see.&lt;/p&gt;

&lt;p&gt;How does this work?&lt;/p&gt;

&lt;p&gt;Let's take a small example: in that file we define the fullname:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;/*&lt;/span&gt;
&lt;span class="nv"&gt;Create a default fully qualified app name.&lt;/span&gt;
&lt;span class="nv"&gt;We truncate at 63 chars because some Kubernetes name fields are limited to this (by the DNS naming spec).&lt;/span&gt;
&lt;span class="nv"&gt;If release name contains chart name it will be used as a full name.&lt;/span&gt;
&lt;span class="nv"&gt;*/&lt;/span&gt;&lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- define "example.fullname" -&lt;/span&gt;&lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if .Values.fullnameOverride&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- .Values.fullnameOverride | trunc 63 | trimSuffix "-"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- else&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- $name&lt;/span&gt; &lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;= default .Chart.Name .Values.nameOverride&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- if contains $name .Release.Name&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- .Release.Name | trunc 63 | trimSuffix "-"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- else&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-"&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- end&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in our &lt;code&gt;service.yaml&lt;/code&gt; file we can include the defined fullname like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;include "example.fullname" .&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "example.labels" . | nindent 4&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.service.type&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
  &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{{&lt;/span&gt; &lt;span class="nv"&gt;.Values.service.port&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
      &lt;span class="na"&gt;targetPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
      &lt;span class="na"&gt;protocol&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;TCP&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;http&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;{{&lt;/span&gt;&lt;span class="nv"&gt;- include "example.selectorLabels" . | nindent 4&lt;/span&gt; &lt;span class="pi"&gt;}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And in the &lt;code&gt;values.yaml&lt;/code&gt; file we can also override it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Default values for example.&lt;/span&gt;
&lt;span class="c1"&gt;# This is a YAML-formatted file.&lt;/span&gt;
&lt;span class="c1"&gt;# Declare variables to be passed into your templates.&lt;/span&gt;

&lt;span class="na"&gt;replicaCount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;

&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;repository&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;pullPolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;IfNotPresent&lt;/span&gt;
  &lt;span class="c1"&gt;# Overrides the image tag whose default is the chart appVersion.&lt;/span&gt;
  &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;

&lt;span class="na"&gt;imagePullSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[]&lt;/span&gt;
&lt;span class="na"&gt;nameOverride&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
&lt;span class="na"&gt;fullnameOverride&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;

&lt;span class="na"&gt;serviceAccount&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="c1"&gt;# Specifies whether a service account should be created&lt;/span&gt;
  &lt;span class="na"&gt;create&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="c1"&gt;# Annotations to add to the service account&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="c1"&gt;# The name of the service account to use.&lt;/span&gt;
  &lt;span class="c1"&gt;# If not set and create is true, a name is generated using the fullname template&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;

&lt;span class="na"&gt;podAnnotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;

&lt;span class="na"&gt;podSecurityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="c1"&gt;# fsGroup: 2000&lt;/span&gt;

&lt;span class="na"&gt;securityContext&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{}&lt;/span&gt;
  &lt;span class="c1"&gt;# capabilities:&lt;/span&gt;
  &lt;span class="c1"&gt;#   drop:&lt;/span&gt;
  &lt;span class="c1"&gt;#   - ALL&lt;/span&gt;
  &lt;span class="c1"&gt;# readOnlyRootFilesystem: true&lt;/span&gt;
  &lt;span class="c1"&gt;# runAsNonRoot: true&lt;/span&gt;
  &lt;span class="c1"&gt;# runAsUser: 1000&lt;/span&gt;

&lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ClusterIP&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
&lt;span class="s"&gt;(...)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the example above, since we didn't set &lt;code&gt;fullnameOverride&lt;/code&gt; in the &lt;code&gt;values.yaml&lt;/code&gt; file, the fullname falls back to the chart name (combined with the release name), exactly as defined in the &lt;code&gt;_helpers.tpl&lt;/code&gt; file.&lt;/p&gt;
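
&lt;p&gt;If you want to check how the helper resolves, you can render the chart locally without installing anything (a small sketch; the release name &lt;code&gt;my-release&lt;/code&gt; and the chart path &lt;code&gt;./example&lt;/code&gt; are placeholders):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Render only the Service template and inspect the generated name
helm template my-release ./example --show-only templates/service.yaml

# Try the override without touching values.yaml
helm template my-release ./example --set fullnameOverride=custom-name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;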

&lt;p&gt;The &lt;em&gt;values.yaml&lt;/em&gt; file contains the default values for your templates. At this point you can split the values file per environment: one for staging, one for production, and so on.&lt;/p&gt;

&lt;p&gt;From here onwards you can start configuring based on what you need. If you check the YAML files you will notice the structure is very similar to what we defined in the Terraform code for the K8s objects.&lt;/p&gt;

&lt;p&gt;For the installation of the charts you can either run the &lt;code&gt;helm install/upgrade&lt;/code&gt; commands one by one OR add this step to your CI/CD pipelines.&lt;/p&gt;

&lt;p&gt;Going by the command line could look something like:&lt;br&gt;
&lt;code&gt;helm upgrade example infra/charts/example --install --wait --atomic --namespace=example --set=app.name=example --values=infra/charts/example/values.yaml&lt;/code&gt;&lt;br&gt;
where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;infra/charts/example - is the location of your &lt;code&gt;Chart.yaml&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;values=infra/charts/example/values.yaml - is the location of the values file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--wait&lt;/code&gt; - will wait until either the release is successful or the default timeout is reached (5m) if no timeout is specified&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--atomic&lt;/code&gt;- if set, upgrade process rolls back changes made in case of failed upgrade&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check the full synopsis of the command &lt;a href="https://helm.sh/docs/helm/helm_upgrade/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;helm install or helm upgrade --install?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;install&lt;/em&gt; sub-command always installs a brand new chart, while the &lt;em&gt;upgrade&lt;/em&gt; sub-command can upgrade an existing release or, with the &lt;code&gt;--install&lt;/code&gt; flag, install the chart if it hasn't been installed before.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For simplicity you can always use the upgrade sub-command.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Using Helm and Terraform Together
&lt;/h2&gt;

&lt;p&gt;Helm and Terraform are not mutually exclusive and can be used together in the same K8s setup. The actual setup really depends on your project's complexity, which benefits you want to make use of, and which drawbacks you can live with.&lt;/p&gt;

&lt;p&gt;In a potential setup where you would use both you could structure it something like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;use Terraform to &lt;em&gt;create and manage resources&lt;/em&gt;: the K8s cluster, the K8s namespace, and the K8s secrets (if any)&lt;/li&gt;
&lt;li&gt;use Helm charts to &lt;em&gt;deploy&lt;/em&gt; your applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is the setup we currently use and it has served us well so far.&lt;/p&gt;

&lt;p&gt;It is worth mentioning that you can also use Terraform to handle your Helm deploys using the &lt;code&gt;helm_release&lt;/code&gt; &lt;a href="https://registry.terraform.io/providers/hashicorp/helm/latest/docs/resources/release" rel="noopener noreferrer"&gt;resource&lt;/a&gt;. &lt;/p&gt;
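
&lt;p&gt;A minimal sketch of what that could look like (the chart path and namespace are placeholders; see the provider documentation for the full set of arguments):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;resource "helm_release" "example" {
  name      = "example"
  chart     = "./infra/charts/example"
  namespace = "example"

  values = [
    file("./infra/charts/example/values.yaml")
  ]

  # Equivalent of the --wait and --atomic CLI flags
  wait   = true
  atomic = true
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;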

&lt;p&gt;In this approach you would have both infrastructure and provisioning in one place: Terraform. I will not go into the differences between them in this post, but whether to go with this approach should depend on how &lt;em&gt;frequently&lt;/em&gt; you need to apply changes to your infrastructure, because the Helm release takes place during the &lt;code&gt;terraform apply&lt;/code&gt; operation.&lt;/p&gt;

&lt;p&gt;There is no one-size-fits-all approach; you should tailor the tooling and strategy to your needs.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Hope you find this helpful. Thank you for reading and feel free to comment on your experience and what you prefer to use and why.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>helm</category>
      <category>terraform</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Terraform: Alternative to the Template provider on Apple M1 MBP</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Fri, 05 Aug 2022 08:24:30 +0000</pubDate>
      <link>https://dev.to/the_cozma/terraform-alternative-to-the-template-provider-on-apple-m1-mbp-1f2l</link>
      <guid>https://dev.to/the_cozma/terraform-alternative-to-the-template-provider-on-apple-m1-mbp-1f2l</guid>
      <description>&lt;p&gt;We ran into an issue while applying our Terraform infrastructure on a M1 Mac where we were making use of the &lt;a href="https://github.com/hashicorp/terraform-provider-template" rel="noopener noreferrer"&gt;Terraform Provider Template&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When applying it, we were getting the following error:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;template v2.2.0 does not have a package available for your current platform, darwin_arm64
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since the provider is archived, we need to find an alternative.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What does archiving mean?&lt;/em&gt;&lt;br&gt;
Per Terraform's &lt;a href="https://www.terraform.io/internals/archiving" rel="noopener noreferrer"&gt;Archiving Providers&lt;/a&gt; documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ul&gt;
&lt;li&gt;The code repository and all commit, issue, and PR history will still be available.&lt;/li&gt;
&lt;li&gt;Existing released binaries will remain available on the releases site.&lt;/li&gt;
&lt;li&gt;Documentation for the provider will remain on the Terraform website.&lt;/li&gt;
&lt;li&gt;Issues and pull requests are not being monitored, merged, or added.&lt;/li&gt;
&lt;li&gt;No new releases will be published.&lt;/li&gt;
&lt;li&gt;Nightly acceptance tests may not be run.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;So what alternatives do we have instead of the deprecated provider?&lt;/em&gt; &lt;/p&gt;

&lt;p&gt;Let's look at an example.&lt;/p&gt;
&lt;h2&gt;
  
  
  Resource using the deprecated Template provider
&lt;/h2&gt;

&lt;p&gt;Let's say we have the following resource: a Grafana dashboard JSON file that we store in our Terraform code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="s2"&gt;"template_file"&lt;/span&gt; &lt;span class="s2"&gt;"grafana_json"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/grafana_dashboard.json"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="nx"&gt;vars&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;title&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;monitoring_title&lt;/span&gt;
    &lt;span class="nx"&gt;monitoring_datasource_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;monitoring_datasource_name&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the grafana dashboard Terraform resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"grafana_dashboard"&lt;/span&gt; &lt;span class="s2"&gt;"metrics"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_json&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;template_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grafana_json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rendered&lt;/span&gt;
  &lt;span class="nx"&gt;folder&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;monitoring_folder&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When you try to apply this piece of code, it will throw the aforementioned error.&lt;/p&gt;

&lt;h2&gt;
  
  
  Updating to the built-in &lt;code&gt;templatefile&lt;/code&gt; Terraform function
&lt;/h2&gt;

&lt;p&gt;We can make use of the built-in &lt;code&gt;templatefile&lt;/code&gt; &lt;a href="https://www.terraform.io/language/functions/templatefile" rel="noopener noreferrer"&gt;Terraform function&lt;/a&gt; that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;code&gt;templatefile&lt;/code&gt; reads the file at the given path and renders its content as a template using a supplied set of template variables.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The function uses the format:&lt;br&gt;
&lt;code&gt;templatefile(path, vars)&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;In our case the &lt;code&gt;path&lt;/code&gt; is the path to the Grafana JSON file.&lt;/p&gt;

&lt;p&gt;And &lt;code&gt;vars&lt;/code&gt; contains all the variables that we need to use for the Grafana dashboard JSON file.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;The "vars" argument must be a map. Within the template file, each of the keys in the map is available as a variable for interpolation. The template may also use any other function available in the Terraform language, except that recursive calls to templatefile are not permitted. Variable names must each start with a letter, followed by zero or more letters, digits, or underscores.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And our new code looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"grafana_dashboard"&lt;/span&gt; &lt;span class="s2"&gt;"metrics"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;config_json&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;templatefile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;"${path.module}/grafana_dashboard.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;title&lt;/span&gt;                      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;monitoring_title&lt;/span&gt;
    &lt;span class="nx"&gt;monitoring_datasource_name&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;monitoring_datasource_name&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="nx"&gt;folder&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;monitoring_folder&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thanks to the new &lt;code&gt;templatefile&lt;/code&gt; function, we can get rid of the &lt;code&gt;template_file&lt;/code&gt; data source.&lt;/p&gt;

&lt;p&gt;This means at this point we no longer rely on the hashicorp/template provider and we can apply our infrastructure changes.&lt;/p&gt;

&lt;p&gt;At this point you can apply the infrastructure changes, but it might still fail with the same error. This is because, if the infrastructure has been initialized and applied previously, a record of the deprecated provider is stored in the lock file.&lt;/p&gt;

&lt;h2&gt;
  
  
  The template provider is still in the &lt;code&gt;.terraform.lock.hcl&lt;/code&gt; file
&lt;/h2&gt;

&lt;p&gt;If you run &lt;code&gt;terraform init&lt;/code&gt; and still see that the template provider is being installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="nx"&gt;plugins&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;
&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;
&lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Using&lt;/span&gt; &lt;span class="nx"&gt;previously-installed&lt;/span&gt; &lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;template&lt;/span&gt; &lt;span class="nx"&gt;v2&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;2.0&lt;/span&gt;

&lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;has&lt;/span&gt; &lt;span class="nx"&gt;been&lt;/span&gt; &lt;span class="nx"&gt;successfully&lt;/span&gt; &lt;span class="nx"&gt;initialized&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;

&lt;span class="nx"&gt;You&lt;/span&gt; &lt;span class="nx"&gt;may&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="nx"&gt;begin&lt;/span&gt; &lt;span class="nx"&gt;working&lt;/span&gt; &lt;span class="nx"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Try&lt;/span&gt; &lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="s2"&gt;"terraform plan"&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;see&lt;/span&gt;
&lt;span class="nx"&gt;any&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;infrastructure&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;All&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;commands&lt;/span&gt;
&lt;span class="nx"&gt;should&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="nx"&gt;work&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;ever&lt;/span&gt; &lt;span class="nx"&gt;set&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;change&lt;/span&gt; &lt;span class="nx"&gt;modules&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="nx"&gt;configuration&lt;/span&gt; &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;
&lt;span class="nx"&gt;rerun&lt;/span&gt; &lt;span class="nx"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;reinitialize&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;working&lt;/span&gt; &lt;span class="nx"&gt;directory&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;forget&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;other&lt;/span&gt;
&lt;span class="nx"&gt;commands&lt;/span&gt; &lt;span class="nx"&gt;will&lt;/span&gt; &lt;span class="nx"&gt;detect&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;remind&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;do&lt;/span&gt; &lt;span class="nx"&gt;so&lt;/span&gt; &lt;span class="nx"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;necessary&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then check the dependency lock file, &lt;code&gt;.terraform.lock.hcl&lt;/code&gt;. If you still see the provider there:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="s2"&gt;"registry.terraform.io/hashicorp/template"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"2.2.0"&lt;/span&gt;
  &lt;span class="nx"&gt;hashes&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s2"&gt;"h1:0wlehNaxBX7GJQnPfQwTNvvAf38Jm0Nv7ssKGMaG6Og="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:01702196f0a0492ec07917db7aaa595843d8f171dc195f4c988d2ffca2a06386"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:09aae3da826ba3d7df69efeb25d146a1de0d03e951d35019a0f80e4f58c89b53"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:09ba83c0625b6fe0a954da6fbd0c355ac0b7f07f86c91a2a97849140fea49603"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:0e3a6c8e16f17f19010accd0844187d524580d9fdb0731f675ffcf4afba03d16"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:45f2c594b6f2f34ea663704cc72048b212fe7d16fb4cfd959365fa997228a776"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:77ea3e5a0446784d77114b5e851c970a3dde1e08fa6de38210b8385d7605d451"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:8a154388f3708e3df5a69122a23bdfaf760a523788a5081976b3d5616f7d30ae"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:992843002f2db5a11e626b3fc23dc0c87ad3729b3b3cff08e32ffb3df97edbde"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:ad906f4cebd3ec5e43d5cd6dc8f4c5c9cc3b33d2243c89c5fc18f97f7277b51d"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s2"&gt;"zh:c979425ddb256511137ecd093e23283234da0154b7fa8b21c2687182d9aea8b2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
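If you want a quick scripted check instead of opening the file, grepping the lock file works. A minimal, self-contained sketch (it writes a stand-in lock file so it can run anywhere; in a real project you would grep `.terraform.lock.hcl` directly):

```shell
# Stand-in for .terraform.lock.hcl, created only so this sketch is self-contained
lockfile="lock.example"
printf '%s\n' 'provider "registry.terraform.io/hashicorp/template" {' '  version = "2.2.0"' '}' > "$lockfile"

# The actual check: is the deprecated provider still recorded?
if grep -q 'hashicorp/template' "$lockfile"; then
  result="template provider still locked"
else
  result="lock file clean"
fi
echo "$result"
```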



&lt;p&gt;Check what is requiring the provider (it might still be used elsewhere in the code). This can be done by running the &lt;code&gt;terraform providers&lt;/code&gt; command, which:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The terraform providers command shows information about the provider requirements of the configuration in the current working directory, as an aid to understanding where each requirement was detected from.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;Providers&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;configuration&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;
&lt;span class="err"&gt;.&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;io&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="err"&gt;...&lt;/span&gt;
&lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;
&lt;span class="err"&gt;│  &lt;/span&gt; &lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;io&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="err"&gt;└──&lt;/span&gt; &lt;span class="nx"&gt;module&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;module&lt;/span&gt;
    &lt;span class="err"&gt;├──&lt;/span&gt; &lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;Providers&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="nx"&gt;by&lt;/span&gt; &lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="err"&gt;:&lt;/span&gt;

    &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;io&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;hashicorp&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;template&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;registry&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;io&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="p"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case we can see that the template provider is required by the state.&lt;/p&gt;

&lt;p&gt;In order to get rid of this dependency, make sure you update Terraform to &lt;strong&gt;v1.1.3&lt;/strong&gt; or later, since the following issue was fixed in version 1.1.3: &lt;a href="https://github.com/hashicorp/terraform/pull/30192" rel="noopener noreferrer"&gt;https://github.com/hashicorp/terraform/pull/30192&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To update the lock file and remove the entry for the deprecated template provider, we run &lt;code&gt;terraform init&lt;/code&gt; again.&lt;/p&gt;

&lt;p&gt;This is because Terraform relies on two sources for determining the truth: the &lt;em&gt;configuration&lt;/em&gt; itself and the &lt;em&gt;state&lt;/em&gt;. If you remove the dependency on a particular provider from &lt;em&gt;both&lt;/em&gt; your configuration and the state then running &lt;code&gt;terraform init&lt;/code&gt; will remove any existing lock file entry for that provider.&lt;/p&gt;

&lt;p&gt;And let's look at the output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;modules&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;Successfully&lt;/span&gt; &lt;span class="nx"&gt;configured&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="s2"&gt;"azurerm"&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;will&lt;/span&gt; &lt;span class="nx"&gt;automatically&lt;/span&gt;
&lt;span class="nx"&gt;use&lt;/span&gt; &lt;span class="nx"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="nx"&gt;unless&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="nx"&gt;configuration&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;Initializing&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="nx"&gt;plugins&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;

&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Reusing&lt;/span&gt; &lt;span class="nx"&gt;previous&lt;/span&gt; &lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt; &lt;span class="nx"&gt;from&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;lock&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;

&lt;span class="nx"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;Using&lt;/span&gt; &lt;span class="nx"&gt;previously-installed&lt;/span&gt; &lt;span class="nx"&gt;grafana&lt;/span&gt;&lt;span class="err"&gt;/&lt;/span&gt;&lt;span class="nx"&gt;grafana&lt;/span&gt; &lt;span class="nx"&gt;v1&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="mf"&gt;17.0&lt;/span&gt;

&lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;has&lt;/span&gt; &lt;span class="nx"&gt;made&lt;/span&gt; &lt;span class="nx"&gt;some&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt; &lt;span class="nx"&gt;dependency&lt;/span&gt; &lt;span class="nx"&gt;selections&lt;/span&gt; &lt;span class="nx"&gt;recorded&lt;/span&gt;
&lt;span class="nx"&gt;in&lt;/span&gt; &lt;span class="nx"&gt;the&lt;/span&gt; &lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;terraform&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lock&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hcl&lt;/span&gt; &lt;span class="nx"&gt;file&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Review&lt;/span&gt; &lt;span class="nx"&gt;those&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;commit&lt;/span&gt; &lt;span class="nx"&gt;them&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt;
&lt;span class="nx"&gt;version&lt;/span&gt; &lt;span class="nx"&gt;control&lt;/span&gt; &lt;span class="nx"&gt;system&lt;/span&gt; &lt;span class="nx"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;they&lt;/span&gt; &lt;span class="nx"&gt;represent&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;intended&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;make&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;has&lt;/span&gt; &lt;span class="nx"&gt;been&lt;/span&gt; &lt;span class="nx"&gt;successfully&lt;/span&gt; &lt;span class="nx"&gt;initialized&lt;/span&gt;&lt;span class="err"&gt;!&lt;/span&gt;

&lt;span class="nx"&gt;You&lt;/span&gt; &lt;span class="nx"&gt;may&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="nx"&gt;begin&lt;/span&gt; &lt;span class="nx"&gt;working&lt;/span&gt; &lt;span class="nx"&gt;with&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;Try&lt;/span&gt; &lt;span class="nx"&gt;running&lt;/span&gt; &lt;span class="s2"&gt;"terraform plan"&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;see&lt;/span&gt;
&lt;span class="nx"&gt;any&lt;/span&gt; &lt;span class="nx"&gt;changes&lt;/span&gt; &lt;span class="nx"&gt;that&lt;/span&gt; &lt;span class="nx"&gt;are&lt;/span&gt; &lt;span class="nx"&gt;required&lt;/span&gt; &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;infrastructure&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;All&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt; &lt;span class="nx"&gt;commands&lt;/span&gt;
&lt;span class="nx"&gt;should&lt;/span&gt; &lt;span class="nx"&gt;now&lt;/span&gt; &lt;span class="nx"&gt;work&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;

&lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;ever&lt;/span&gt; &lt;span class="nx"&gt;set&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;change&lt;/span&gt; &lt;span class="nx"&gt;modules&lt;/span&gt; &lt;span class="nx"&gt;or&lt;/span&gt; &lt;span class="nx"&gt;backend&lt;/span&gt; &lt;span class="nx"&gt;configuration&lt;/span&gt; &lt;span class="nx"&gt;for&lt;/span&gt; &lt;span class="nx"&gt;Terraform&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;
&lt;span class="nx"&gt;rerun&lt;/span&gt; &lt;span class="nx"&gt;this&lt;/span&gt; &lt;span class="nx"&gt;command&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;reinitialize&lt;/span&gt; &lt;span class="nx"&gt;your&lt;/span&gt; &lt;span class="nx"&gt;working&lt;/span&gt; &lt;span class="nx"&gt;directory&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt; &lt;span class="nx"&gt;If&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;forget&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;other&lt;/span&gt;
&lt;span class="nx"&gt;commands&lt;/span&gt; &lt;span class="nx"&gt;will&lt;/span&gt; &lt;span class="nx"&gt;detect&lt;/span&gt; &lt;span class="nx"&gt;it&lt;/span&gt; &lt;span class="nx"&gt;and&lt;/span&gt; &lt;span class="nx"&gt;remind&lt;/span&gt; &lt;span class="nx"&gt;you&lt;/span&gt; &lt;span class="nx"&gt;to&lt;/span&gt; &lt;span class="nx"&gt;do&lt;/span&gt; &lt;span class="nx"&gt;so&lt;/span&gt; &lt;span class="nx"&gt;if&lt;/span&gt; &lt;span class="nx"&gt;necessary&lt;/span&gt;&lt;span class="err"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And we see the template provider is no longer there. &lt;/p&gt;

&lt;p&gt;Now you can safely commit the freshly updated &lt;a href="https://www.terraform.io/language/files/dependency-lock" rel="noopener noreferrer"&gt;Dependency Lock File&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>terraform</category>
      <category>iac</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>[K8s] Fix Helm release failing with an upgrade still in progress</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Mon, 30 May 2022 19:49:10 +0000</pubDate>
      <link>https://dev.to/the_cozma/k8s-fix-helm-release-failing-with-an-upgrade-still-in-progress-52cd</link>
      <guid>https://dev.to/the_cozma/k8s-fix-helm-release-failing-with-an-upgrade-still-in-progress-52cd</guid>
      <description>&lt;p&gt;This article applies to: &lt;strong&gt;Helm v3.8.0&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Helm helps you manage Kubernetes applications — Helm Charts help you define, install, and upgrade even the most complex Kubernetes application. More details on Helm and the commands can be found in the &lt;a href="https://helm.sh/" rel="noopener noreferrer"&gt;official documentation&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Assuming you use Helm to handle your releases, you might end up in a situation where a release gets stuck in a pending state and all subsequent releases keep failing.&lt;/p&gt;

&lt;p&gt;This could happen if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;you run the upgrade command from the CLI and accidentally (or not) interrupt it, or&lt;/li&gt;
&lt;li&gt;you have two deploys running at the same time (in GitHub Actions, for example)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Basically, any interruption during the install/upgrade process &lt;strong&gt;could&lt;/strong&gt; leave you in a state where you can no longer install another release.&lt;/p&gt;

&lt;p&gt;In the release logs the failing upgrade will show an error similar to the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: UPGRADE FAILED: release &amp;lt;name&amp;gt; failed, and has been rolled back due to atomic being set: timed out waiting for the condition
Error: Error: The process '/usr/bin/helm3' failed with exit code 1
Error: The process '/usr/bin/helm3' failed with exit code 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the status will be stuck in: &lt;code&gt;PENDING_INSTALL&lt;/code&gt; or &lt;code&gt;PENDING_UPGRADE&lt;/code&gt; depending on the command you were running.&lt;/p&gt;

&lt;p&gt;Because of this pending state, when you run the command to list all releases it returns an empty result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; helm list --all                                                                               
NAME    NAMESPACE   REVISION    UPDATED STATUS  CHART   APP VERSION
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what can we do now? In this article we will look at the two options described below. Keep in mind that, depending on your setup, it could be a different issue, but I hope these two pointers give you a place to start.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Roll back to the previous working version using the &lt;code&gt;helm rollback&lt;/code&gt; command.&lt;/li&gt;
&lt;li&gt;Delete the helm secret associated with the release and re-run the upgrade command.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So let's look at each option in detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  First option: Roll back to the previous working version using the &lt;code&gt;helm rollback&lt;/code&gt; command
&lt;/h3&gt;

&lt;p&gt;From the official documentation:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This command rolls back a release to a previous revision.&lt;br&gt;
The first argument of the rollback command is the name of a release, and the second is a revision (version) number. If this argument is omitted, it will roll back to the previous release.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So, in this case, let's get the history of the releases:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm history &amp;lt;releasename&amp;gt; -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the output you will notice that the STATUS of your release is &lt;code&gt;PENDING_UPGRADE&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;REVISION UPDATED                  STATUS          CHART     APP VERSION DESCRIPTION
1        Wed May 25 11:45:40 2022 DEPLOYED        api-0.1.0 1.16.0      Upgrade complete
2        Mon May 30 14:32:46 2022 PENDING_UPGRADE api-0.1.0 1.16.0      Preparing upgrade
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's perform the rollback by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;helm rollback &amp;lt;release&amp;gt; &amp;lt;revision&amp;gt; --namespace &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So in our case we run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; helm rollback api 1 --namespace api
Rollback was a success.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After we get confirmation that the rollback was successful, we run the command to get the history again.&lt;/p&gt;

&lt;p&gt;We now see that a new revision was created and that our rollback was successful: the STATUS of the latest revision is &lt;code&gt;DEPLOYED&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt; helm history api -n api

REVISION UPDATED                  STATUS      CHART     APP VERSION DESCRIPTION
1        Wed May 25 11:45:40 2022 SUPERSEDED  api-0.1.0 1.16.0      Upgrade complete
2        Mon May 30 14:32:46 2022 SUPERSEDED  api-0.1.0 1.16.0      Preparing upgrade
3        Mon May 30 14:45:46 2022 DEPLOYED    api-0.1.0 1.16.0      Rollback to 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So what if the solution above didn't work?&lt;/p&gt;

&lt;h3&gt;
  
  
  Second option: Delete the helm secret associated with the release and re-run the upgrade command
&lt;/h3&gt;

&lt;p&gt;First we get all the secrets for the namespace by running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get secrets -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the output you will notice a list of secrets in the following format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NAME                        TYPE               DATA AGE
api                         Opaque             21   473d
sh.helm.release.v1.api.v648 helm.sh/release.v1 1    6d5h
sh.helm.release.v1.api.v649 helm.sh/release.v1 1    5d1h
sh.helm.release.v1.api.v650 helm.sh/release.v1 1    57m
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;So what's in a secret?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Helm 3 makes use of the Kubernetes &lt;a href="https://kubernetes.io/docs/concepts/configuration/secret/" rel="noopener noreferrer"&gt;Secrets&lt;/a&gt; object to store information about a release. These secrets are used by Helm to store and read its state every time we run &lt;code&gt;helm list&lt;/code&gt;, &lt;code&gt;helm history&lt;/code&gt; or, in our case, &lt;code&gt;helm upgrade&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The naming of the secrets is &lt;strong&gt;unique per namespace&lt;/strong&gt; and follows this convention:&lt;br&gt;
&lt;code&gt;sh.helm.release.v1.&amp;lt;release_name&amp;gt;.v&amp;lt;release_version&amp;gt;&lt;/code&gt;.&lt;/p&gt;
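Since the revision number is encoded at the end of each secret name, you can recover the latest revision with a bit of shell. A small sketch (the secret names are taken from the example output above):

```shell
# Example secret names, as returned by: kubectl get secrets -n <namespace>
secrets='sh.helm.release.v1.api.v648
sh.helm.release.v1.api.v649
sh.helm.release.v1.api.v650'

# Strip everything up to the last ".v" to keep only the revision number,
# then sort numerically and take the highest one
latest=$(printf '%s\n' "$secrets" | sed 's/.*\.v//' | sort -n | tail -n 1)
echo "$latest"
```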

&lt;p&gt;A maximum of 10 secrets is stored by default, but you can change this by setting the &lt;code&gt;--history-max&lt;/code&gt; flag on your &lt;a href="https://helm.sh/docs/helm/helm_upgrade/" rel="noopener noreferrer"&gt;helm upgrade&lt;/a&gt; command.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;--history-max int                            limit the maximum number of revisions saved per release. Use 0 for no limit (default 10)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that we know what these secrets are used for, let's delete the helm secret associated with the pending release by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete secret sh.helm.release.v1.&amp;lt;release_name&amp;gt;.v&amp;lt;release_version&amp;gt; -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, we re-run the helm upgrade command (either from command line or from your deployment workflow), which, if all was good so far, should succeed.&lt;/p&gt;

&lt;p&gt;There is an open issue with &lt;a href="https://github.com/helm/helm/issues/4558" rel="noopener noreferrer"&gt;Helm&lt;/a&gt;, so hopefully these workarounds won't be needed anymore at some point. It has been open since 2018, though. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Of course there could be other cases or issues, but I hope this is a good place to start. If you ran into something similar, I would love to read your input on what the issue was and how you solved it, especially since I didn't find the error message intuitive.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Thank you for reading!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>beginners</category>
      <category>devops</category>
      <category>helm</category>
    </item>
    <item>
      <title>[K8s] How to restart Kubernetes Pods</title>
      <dc:creator>Ana Cozma</dc:creator>
      <pubDate>Wed, 18 May 2022 07:04:25 +0000</pubDate>
      <link>https://dev.to/the_cozma/k8s-how-to-restart-kubernetes-pods-53mj</link>
      <guid>https://dev.to/the_cozma/k8s-how-to-restart-kubernetes-pods-53mj</guid>
      <description>&lt;p&gt;This article applies for &lt;strong&gt;Kubernetes v1.15&lt;/strong&gt; and above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/" rel="noopener noreferrer"&gt;Kubernetes&lt;/a&gt;, also known as K8s, is an open-source system for automating deployment, scaling, and management of containerized applications.&lt;/p&gt;

&lt;p&gt;It groups containers that make up an application into logical units for easy management and discovery. But what if something happens to the container? In this case, you might need a quick and easy way to restart it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/" rel="noopener noreferrer"&gt;Kubernetes Pods&lt;/a&gt; usually run until there is a new deployment that will replace them. Therefore, there is no straightforward way to restart a single pod.&lt;/p&gt;

&lt;p&gt;What happens when a pod fails is that, instead of being restarted, it is replaced. &lt;/p&gt;

&lt;h1&gt;
  
  
  Restarting Pods Options
&lt;/h1&gt;

&lt;p&gt;There are a few available options that we will cover in this article:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Scaling down the number of replicas indicating how many Pods it should be maintaining in the ReplicaSet, effectively removing pods, then scaling back up&lt;br&gt;
  &lt;strong&gt;Causes downtime&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deleting a single pod, forcing K8s to recreate it &lt;br&gt;
 &lt;strong&gt;Might cause downtime&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Starting a rollout (rolling restart method)&lt;br&gt;
 &lt;strong&gt;No downtime&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So let's look at each option in a bit more detail, keeping in mind that any of them could work for you depending on your needs. Some questions to ask: is it a live environment? Is it a new setup? Can you afford an outage on the app?&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Changing Replicas
&lt;/h2&gt;

&lt;p&gt;One option for restarting the pods is to effectively "shut them off" by first scaling the number of deployment replicas to zero:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment &amp;lt;name&amp;gt; --replicas=0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this case, K8s will remove all the replicas that are no longer required.&lt;/p&gt;

&lt;p&gt;Then, scale them back up to the desired number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl scale deployment &amp;lt;name&amp;gt; --replicas=&amp;lt;desired_number&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will stop and terminate all current pods and will schedule new pods in their place.&lt;/p&gt;

&lt;p&gt;Because we are "shutting down" pods, this option will cause downtime, since for a while there will be no container available. So if you are running a production system, the rolling restart method is the better approach.&lt;/p&gt;

&lt;p&gt;The names of the new scheduled pods &lt;em&gt;will&lt;/em&gt; be different from the previous ones. &lt;/p&gt;

&lt;p&gt;Run the following command to get the new names of the pods:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
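&lt;p&gt;If you want to follow the Pods being terminated and recreated as you scale, you can watch the Pod list instead of polling it (the &lt;code&gt;-w&lt;/code&gt; flag streams changes until you interrupt the command):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n &amp;lt;namespace&amp;gt; -w
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;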



&lt;h2&gt;
  
  
  2. Deleting a Pod
&lt;/h2&gt;

&lt;p&gt;First, we get all the pods in a namespace by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl get pods -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we delete a single pod by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete pod &amp;lt;pod_name&amp;gt; -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;K8s will notice the difference between the actual and the desired state and will schedule a new pod until the desired number of replicas is restored.&lt;/p&gt;
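&lt;p&gt;If you need to recreate several pods at once, deleting them one by one by name gets tedious. Assuming your pods carry a common label (here &lt;code&gt;app=myapp&lt;/code&gt; is just an example label, not something from your cluster), you can delete them by label selector instead:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl delete pods -l app=myapp -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;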

&lt;h2&gt;
  
  
  3. Rolling Restart
&lt;/h2&gt;

&lt;p&gt;Since version 1.15, K8s allows you to execute a rolling restart of your deployment. &lt;br&gt;
&lt;em&gt;Note: Not only does kubectl need to be updated; make sure the cluster itself is running this version as well.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A rolling restart restarts all the pods of a deployment in sequence. The following command does this for every deployment in the namespace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout restart deployment -n &amp;lt;yournamespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this command, K8s proceeds to shut down and restart each pod in the deployment one by one. &lt;/p&gt;

&lt;p&gt;Because this is done sequentially, there is always at least one pod running, meaning the application itself remains available, effectively allowing for &lt;strong&gt;zero downtime&lt;/strong&gt;.&lt;/p&gt;
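&lt;p&gt;You can also target a single deployment rather than the whole namespace, and follow the progress until all of its pods have been replaced (with &lt;code&gt;&amp;lt;deployment_name&amp;gt;&lt;/code&gt; as a placeholder for your deployment):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kubectl rollout restart deployment/&amp;lt;deployment_name&amp;gt; -n &amp;lt;namespace&amp;gt;
kubectl rollout status deployment/&amp;lt;deployment_name&amp;gt; -n &amp;lt;namespace&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;code&gt;kubectl rollout status&lt;/code&gt; blocks until the rollout finishes, which makes it handy in scripts and CI pipelines.&lt;/p&gt;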

&lt;p&gt;&lt;em&gt;Thank you for reading and I hope this helps someone!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>kubernetes</category>
      <category>k8s</category>
    </item>
  </channel>
</rss>
