<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Ole Markus With</title>
    <description>The latest articles on DEV Community by Ole Markus With (@olemarkus).</description>
    <link>https://dev.to/olemarkus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F251423%2Ff0cb7237-526a-466c-927a-88e77768bf66.jpg</url>
      <title>DEV Community: Ole Markus With</title>
      <link>https://dev.to/olemarkus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/olemarkus"/>
    <language>en</language>
    <item>
      <title>Zero-configuration IRSA on kOps</title>
      <dc:creator>Ole Markus With</dc:creator>
      <pubDate>Mon, 21 Mar 2022 15:48:12 +0000</pubDate>
      <link>https://dev.to/olemarkus/zero-configuration-irsa-on-kops-1po1</link>
      <guid>https://dev.to/olemarkus/zero-configuration-irsa-on-kops-1po1</guid>
      <description>&lt;p&gt;A while ago, I wrote about &lt;a href="https://dev.to/olemarkus/irsa-support-for-kops-1doe"&gt;using IAM Roles for ServiceAccounts on kOps&lt;/a&gt;.&lt;br&gt;
In short, this feature lets you define an AWS IAM Policy for a given ServiceAccount, and kOps will create the respective AWS IAM Role,&lt;br&gt;
assign the policy and establish a trust relationship allowing the ServiceAccount to assume the IAM Role.&lt;/p&gt;
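&lt;p&gt;Concretely, the trust relationship is an IAM trust policy on the role that federates to the cluster's OIDC provider. A hedged sketch of its usual shape (OIDC_PROVIDER_ARN, OIDC_ISSUER, NAMESPACE, and SA_NAME are placeholders, not values kOps emits verbatim):&lt;/p&gt;

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "OIDC_PROVIDER_ARN" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "OIDC_ISSUER:sub": "system:serviceaccount:NAMESPACE:SA_NAME"
        }
      }
    }
  ]
}
```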
&lt;h2&gt;
  
  
  Challenge of configuring workloads
&lt;/h2&gt;

&lt;p&gt;While kOps elegantly handles what happens on the AWS side, we had not implemented anything that configures Pods to actually make&lt;br&gt;
use of the IAM Role. Indeed, some of the more frequently asked support questions&lt;br&gt;
in the kOps Slack channels have been around how to configure applications to assume roles. &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://kops.sigs.k8s.io/cluster_spec/#service-account-issuer-discovery-and-aws-iam-roles-for-service-accounts-irsa"&gt;kOps documentation&lt;/a&gt;&lt;br&gt;
recommended directly adding the volumes and environment variables to the Pod spec,&lt;br&gt;
but it is not obvious exactly what needs to be added, and you have to manually fetch the actual role ARN that kOps creates from the AWS API or console.&lt;/p&gt;
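&lt;p&gt;For reference, the manual approach amounts to adding something like the following to the Pod spec. This is a sketch of the conventional IRSA projected-token setup, not kOps output; ROLE_ARN is the placeholder you would have to fetch yourself:&lt;/p&gt;

```yaml
# Pod spec fragment (sketch); ROLE_ARN is a placeholder.
env:
- name: AWS_ROLE_ARN
  value: ROLE_ARN
- name: AWS_WEB_IDENTITY_TOKEN_FILE
  value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
volumeMounts:
- name: aws-iam-token
  mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
  readOnly: true
volumes:
- name: aws-iam-token
  projected:
    sources:
    - serviceAccountToken:
        audience: amazonaws.com
        expirationSeconds: 86400
        path: token
```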
&lt;h2&gt;
  
  
  The pod identity webhook
&lt;/h2&gt;

&lt;p&gt;On EKS, the &lt;a href="https://github.com/aws/amazon-eks-pod-identity-webhook"&gt;pod identity webhook&lt;/a&gt; is commonly used as the mechanism for adding the necessary parts of the Pod spec.&lt;br&gt;
This webhook looks for ServiceAccounts with a specific set of annotations telling it what ARN it can assume and various other settings. When a Pod is created that uses one of&lt;br&gt;
these ServiceAccounts, the webhook mutates the Pod using information found in the ServiceAccount annotations.&lt;/p&gt;
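&lt;p&gt;The annotation-based flow looks roughly like this (a minimal sketch; ROLE_ARN and the names are placeholders):&lt;/p&gt;

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: default
  annotations:
    # Tells the webhook which IAM Role Pods using this ServiceAccount assume.
    eks.amazonaws.com/role-arn: ROLE_ARN
```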

&lt;p&gt;Configuring these annotations is a lot simpler than directly configuring the Pod spec. &lt;br&gt;
Typically, &lt;a href="https://eksctl.io/usage/iamserviceaccounts/"&gt;EKS-specific tooling "owns" the ServiceAccount&lt;/a&gt;, which makes linking the role/ServiceAccount pair simpler, but also means that&lt;br&gt;
ServiceAccounts cannot be managed together with the application using them.&lt;/p&gt;

&lt;p&gt;For various reasons, installing the webhook on kOps was not that straightforward. For example, one could not tell the webhook to use mounted TLS secrets; it could only use the &lt;a href="https://kubernetes.io/docs/reference/access-authn-authz/certificate-signing-requests/"&gt;CSR API&lt;/a&gt;.&lt;br&gt;
And even when the webhook was installed, you had to manually annotate ServiceAccounts with the role ARN that the Pods should try to assume.&lt;br&gt;
kOps could have "owned" the ServiceAccounts configured in the Cluster spec as well, but I feel the ownership of ServiceAccounts should be with the application and not the cluster.&lt;/p&gt;
&lt;h2&gt;
  
  
  Webhook the kOps way
&lt;/h2&gt;

&lt;p&gt;As mentioned towards the end of &lt;a href="https://dev.to/olemarkus/irsa-support-for-kops-1doe"&gt;my previous article&lt;/a&gt;,&lt;br&gt;
because kOps already knows the mapping between ServiceAccounts and IAM roles, there shouldn't be any need for&lt;br&gt;
users to copy the ARN from AWS into the ServiceAccount annotation. &lt;em&gt;Something&lt;/em&gt; should be able to just read the mapping in the Cluster spec&lt;br&gt;
and configure workloads accordingly. &lt;/p&gt;

&lt;p&gt;I wrote that this could be a webhook similar to the pod identity webhook. But why not just implement this as a feature &lt;em&gt;in&lt;/em&gt; the pod identity webhook?&lt;br&gt;
The EKS team was very open to the idea, and a &lt;a href="https://github.com/aws/amazon-eks-pod-identity-webhook/pull/142"&gt;PR later&lt;/a&gt;, the &lt;a href="https://github.com/aws/amazon-eks-pod-identity-webhook#pod-identity-webhook-configmap"&gt;webhook can be configured&lt;/a&gt; to look for additional Pods to mutate.&lt;/p&gt;

&lt;p&gt;After this PR, the webhook will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First look for annotations on the ServiceAccount as before.&lt;/li&gt;
&lt;li&gt;If no annotations are found on the ServiceAccount, the webhook will look for a mapping configured in the pod-identity-webhook ConfigMap.&lt;/li&gt;
&lt;/ul&gt;
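&lt;p&gt;A sketch of what such a ConfigMap entry can look like, mirroring the format kOps writes (ROLE_ARN and the names are illustrative):&lt;/p&gt;

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pod-identity-webhook
  namespace: kube-system
data:
  # Keys are the namespace/name of the ServiceAccounts to act on.
  config: |-
    {"default/pod-identity-webhook-test":{"RoleARN":"ROLE_ARN","Audience":"amazonaws.com","UseRegionalSTS":true,"TokenExpiration":0}}
```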
&lt;h2&gt;
  
  
  Using the pod identity webhook addon
&lt;/h2&gt;

&lt;p&gt;As of kOps 1.23, kOps supports the webhook as &lt;a href="https://kops.sigs.k8s.io/addons/#pod-identity-webhook"&gt;a managed addon&lt;/a&gt;. When installed, kOps will populate the webhook ConfigMap based on the &lt;code&gt;spec.iam.serviceAccountExternalPermissions&lt;/code&gt; struct.&lt;/p&gt;
&lt;h3&gt;
  
  
  Installing
&lt;/h3&gt;

&lt;p&gt;Before continuing, make sure you already have a kOps 1.23 cluster with an AWS OIDC provider enabled.&lt;br&gt;
See &lt;a href="https://dev.to/olemarkus/irsa-support-for-kops-1doe"&gt;my previous article&lt;/a&gt; on how to go about that.&lt;/p&gt;

&lt;p&gt;Once your cluster is running 1.23, you can enable the webhook by adding the following to your cluster spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;certManager&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;podIdentityWebhook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cert-manager addon is required to establish the trust between the webhook and the API server.&lt;/p&gt;

&lt;p&gt;Now run &lt;code&gt;kops update cluster --yes&lt;/code&gt; and wait a minute or so for the control plane to deploy the addon(s).&lt;/p&gt;

&lt;h3&gt;
  
  
  Adding a ServiceAccount mapping
&lt;/h3&gt;

&lt;p&gt;Start by granting a set of AWS privileges to a ServiceAccount:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;iam&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountExternalPermissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;policyARNs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-identity-webhook-test&lt;/span&gt;
      &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running &lt;code&gt;kops update cluster&lt;/code&gt;, you will see something like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  IAMRole/pod-identity-webhook-test.default.sa.&amp;lt;cluster&amp;gt;
        Tags                    {Name: pod-identity-webhook-test.default.sa.&amp;lt;cluster&amp;gt;, KubernetesCluster: &amp;lt;cluster&amp;gt;, kubernetes.io/cluster/&amp;lt;cluster&amp;gt;: owned}
        ExportWithID            default-pod-identity-webhook-test

  IAMRolePolicy/external-pod-identity-webhook-test.default.sa.test.&amp;lt;cluster&amp;gt;
        Role                    name:pod-identity-webhook-test.default.sa.test.&amp;lt;cluster&amp;gt;
        ExternalPolicies        [arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess]
        Managed                 true
...
  +   config: '{"default/pod-identity-webhook-test":{"RoleARN":"arn:aws:iam::&amp;lt;account&amp;gt;:role/pod-identity-webhook-test.default.sa.&amp;lt;cluster&amp;gt;","Audience":"amazonaws.com","UseRegionalSTS":true,"TokenExpiration":0}}'
  -   config: '{}'

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;kOps wants to create an IAM role for the ServiceAccount and assign it the &lt;code&gt;AmazonEC2ReadOnlyAccess&lt;/code&gt; policy.&lt;/p&gt;

&lt;p&gt;You can also see that it populates the mapping information into the pod-identity-webhook ConfigMap.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;kops update cluster --yes&lt;/code&gt; to apply the changes. Then run &lt;code&gt;kubectl logs -n kube-system -l app=pod-identity-webhook -f&lt;/code&gt; and observe the webhook picking up the mapping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I0319 07:10:28.312786       1 cache.go:186] Adding SA default/pod-identity-webhook-test to CM cache: &amp;amp;{RoleARN:arn:aws:iam::&amp;lt;account&amp;gt;:role/pod-identity-webhook-test.default.sa.&amp;lt;cluster&amp;gt; Audience:amazonaws.com UseRegionalSTS:true TokenExpiration:86400}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Deploying the workload
&lt;/h3&gt;

&lt;p&gt;Once the mapping is in place, we can deploy the ServiceAccount and a Pod using that ServiceAccount. It's important to remember that the webhook will only mutate Pods on creation, so it &lt;em&gt;must&lt;/em&gt; be aware of the mapping before the Pod is created.&lt;/p&gt;

&lt;p&gt;Deploy the following to the cluster:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ServiceAccount&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-identity-webhook-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Pod&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pod-identity-webhook-test&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-cli&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;amazon/aws-cli:latest&lt;/span&gt;
    &lt;span class="na"&gt;command&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;sleep&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;300"&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pod-identity-webhook-test"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should now see the following in the webhook logs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;I0319 07:39:33.373273       1 cache.go:80] Fetching sa default/pod-identity-webhook-test from cache
I0319 07:39:33.373346       1 handler.go:423] Pod was mutated. Pod=pod-identity-webhook-test, ServiceAccount=pod-identity-webhook-test, Namespace=default
I0319 07:39:33.373522       1 middleware.go:132] path=/mutate method=POST status=200 user_agent=kube-apiserver-admission body_bytes=1441
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And running &lt;code&gt;kubectl get pod pod-identity-webhook-test -o yaml&lt;/code&gt; you should see that the Pod has been mutated and now contains the expected volumes and environment variables.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing that it works
&lt;/h3&gt;

&lt;p&gt;To confirm everything is good, you can run the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; default pod-identity-webhook-test &lt;span class="nt"&gt;--&lt;/span&gt; aws sts get-caller-identity
&lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="s2"&gt;"UserId"&lt;/span&gt;: &lt;span class="s2"&gt;"AROAV6PNU2XQTMAZ64FBK:botocore-session-1647675906"&lt;/span&gt;,
    &lt;span class="s2"&gt;"Account"&lt;/span&gt;: &lt;span class="s2"&gt;"&amp;lt;account&amp;gt;"&lt;/span&gt;,
    &lt;span class="s2"&gt;"Arn"&lt;/span&gt;: &lt;span class="s2"&gt;"arn:aws:sts::409057154529:assumed-role/pod-identity-webhook-test.default.sa.&amp;lt;cluster&amp;gt;/botocore-session-1647675906"&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also check that the Pod is allowed to use the granted privileges by running something like the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;-it&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; default pod-identity-webhook-test &lt;span class="nt"&gt;--&lt;/span&gt; aws ec2 describe-instances &lt;span class="nt"&gt;--region&lt;/span&gt; eu-central-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Hopefully this makes using IRSA on kOps-based clusters much simpler, and I hope this post explains how things work under the hood.&lt;/p&gt;

&lt;p&gt;As always, I appreciate feedback on this feature and on whether it is useful for you.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>kops</category>
      <category>devops</category>
    </item>
    <item>
      <title>Kubernetes with IPv6 on AWS</title>
      <dc:creator>Ole Markus With</dc:creator>
      <pubDate>Wed, 13 Oct 2021 18:27:12 +0000</pubDate>
      <link>https://dev.to/olemarkus/kubernetes-with-ipv6-on-aws-290d</link>
      <guid>https://dev.to/olemarkus/kubernetes-with-ipv6-on-aws-290d</guid>
      <description>&lt;p&gt;The Kubernetes ecosystem has been working hard on supporting IPv6 the last few years, and kOps is no different.&lt;br&gt;
There are two ways we have been exploring:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Running with a private subnet with Pods IPs behind NAT.&lt;/li&gt;
&lt;li&gt;Running with a public subnet with fully routable Pod IPs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both of these modes sort of work on AWS, but not without caveats.&lt;/p&gt;
&lt;h1&gt;
  
  
  Configuring the cluster
&lt;/h1&gt;

&lt;p&gt;Regardless of what mode is used, the VPC needs IPv6 enabled, and each instance needs an allocated IPv6 address that is added to its respective Node object. This is all handled by kOps and the Cloud Controller Manager.&lt;/p&gt;
&lt;h2&gt;
  
  
  Private IPs
&lt;/h2&gt;

&lt;p&gt;A cluster with private IPv6 addresses is relatively simple to set up. As with IPv4, the cluster is configured with one flat IPv6 CIDR, and the CNI takes care of configuring routes and tunnelling between the instances, masquerading traffic destined for external IPs, and so on.&lt;/p&gt;

&lt;p&gt;You can configure the Cluster spec directly to use IPv6, but kOps also provides the &lt;code&gt;--ipv6&lt;/code&gt; flag to simplify the configuration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Public IPs
&lt;/h2&gt;

&lt;p&gt;Running with private IPv6 addresses is nice for testing how well K8s and K8s components work with IPv6, but the true advantages come when the IPs are publicly routable. Doing away with NAT, tunnelling, and overlay networking is in itself a performance boost, but you can also do things such as having cloud load balancers directly target Pods instead of going through NodePorts and bouncing off kube-proxy.&lt;/p&gt;

&lt;p&gt;kOps supports public IPs on AWS by assigning an IPv6 prefix to each Node's primary interface and using this prefix as the Node's Pod CIDR.&lt;/p&gt;

&lt;p&gt;This means any CNI that supports Kubernetes IPAM (and most do) can support publicly routable IPv6 addresses.&lt;/p&gt;

&lt;p&gt;In order to run in this mode, just add &lt;code&gt;spec.podCIDRFromCloud: true&lt;/code&gt; to the Cluster spec.&lt;br&gt;
&lt;/p&gt;
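&lt;p&gt;A minimal sketch of the relevant part of the Cluster spec:&lt;/p&gt;

```yaml
spec:
  # Assign each Node an IPv6 prefix from the cloud and use it as the Pod CIDR.
  podCIDRFromCloud: true
```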

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ kgp -o wide
NAME                                                                  READY   STATUS    RESTARTS   AGE   IP                                       NODE                                          NOMINATED NODE   READINESS GATES
aws-cloud-controller-manager-rm9bf                                    1/1     Running   0          16h   172.20.52.202                            ip-172-20-52-202.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cert-manager-58c7f89d46-5ttmx                                         1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::f4c8            ip-172-20-52-202.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cert-manager-cainjector-5998558479-lvvsr                              1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::6d33            ip-172-20-52-202.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cert-manager-webhook-756bb49f7d-f4pfh                                 1/1     Running   0          16h   2a05:d018:4ea:8101:ba62::2cdc            ip-172-20-52-202.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cilium-7mjbl                                                          1/1     Running   0          16h   2a05:d018:4ea:8103:6f5a:dc57:f7b7:b73a   ip-172-20-97-249.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cilium-operator-677b9469b7-8pndm                                      1/1     Running   0          16h   172.20.52.202                            ip-172-20-52-202.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cilium-psxfs                                                          1/1     Running   0          16h   2a05:d018:4ea:8101:2cc1:f30c:f885:6e6f   ip-172-20-54-232.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
cilium-wq6xg                                                          1/1     Running   0          16h   2a05:d018:4ea:8102:ccc:bcce:24de:4840    ip-172-20-81-228.eu-west-1.compute.internal   &amp;lt;none&amp;gt;           &amp;lt;none&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Yes, some Pods with &lt;code&gt;hostNetwork: true&lt;/code&gt; have IPv4 addresses here. Pods receive the IP the Node had at the time they were created, which for the control plane was IPv4, as those Nodes came up before the Cloud Controller Manager assigned them IPv6 addresses.)&lt;/p&gt;

&lt;h1&gt;
  
  
  Can I use this in production?
&lt;/h1&gt;

&lt;p&gt;So the big question is: how mature is running IPv6 clusters on AWS?&lt;/p&gt;

&lt;p&gt;Not very. Yet.&lt;/p&gt;

&lt;p&gt;Taking the simpler private IP mode first, we found various issues with how components decide which IP to use. For example, metrics-server will pick the first IP on the Node object regardless of what the Pod IP is, so the ordering of Node IPs matters. CNIs also show behavior suggesting IPv6 is not that well tested yet; for example, &lt;a href="https://github.com/cilium/cilium/issues/11263"&gt;Cilium has struggled with routing issues in this 18-month-old issue&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;For public IPs, there are some additional problems. On most Linux distros, the &lt;code&gt;accept_ra=2&lt;/code&gt; sysctl must be set on the correct interfaces, and since the interface name depends on the distro and instance type, this is a bit tricky. On Ubuntu, this is not needed because systemd has taken over a lot of the kernel responsibilities in this area. systemd is not without bugs though: when IPv6 single-address DHCPv6 is mixed with prefix delegation, &lt;a href="https://github.com/systemd/systemd/issues/20803"&gt;DHCPv6 breaks&lt;/a&gt;. Hopefully the fix will make it into Ubuntu soon. Cilium works around this issue, but with all other CNIs, Nodes lose connectivity about 5 minutes after kOps configuration has finished.&lt;/p&gt;

&lt;p&gt;Then there are various important apps that do not understand IPv6 well. Many will try to talk to the IPv4 metadata API, for example. If you are lucky, the application uses a new enough version of the AWS SDK that you can set &lt;code&gt;AWS_EC2_METADATA_SERVICE_ENDPOINT_MODE=IPv6&lt;/code&gt;.&lt;/p&gt;
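&lt;p&gt;Setting that variable on a container is straightforward (a container-spec fragment, assuming a recent enough SDK):&lt;/p&gt;

```yaml
# Container fragment: point the AWS SDK at the IPv6 metadata endpoint.
env:
- name: AWS_EC2_METADATA_SERVICE_ENDPOINT_MODE
  value: IPv6
```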

&lt;p&gt;One of the benefits I mentioned above was using Pods as targets for load balancers, a feature that the &lt;a href="https://kops.sigs.k8s.io/addons/#aws-load-balancer-controller"&gt;AWS Load Balancer Controller&lt;/a&gt; supports. But alas! AWS has two endpoints for the EC2 API: a single-stack IPv4 endpoint at &lt;code&gt;ec2.&amp;lt;region&amp;gt;.amazonaws.com&lt;/code&gt; and a dual-stack one at &lt;code&gt;api.ec2.eu-west-1.aws&lt;/code&gt;. The SDK will use the former unless configured in code to use something else, which is not currently possible. There is a &lt;a href="https://github.com/kubernetes-sigs/aws-load-balancer-controller/pull/2179"&gt;pull request&lt;/a&gt; for this, but that only brings you to the next component. And if you want to use &lt;a href="https://kops.sigs.k8s.io/addons/#cluster-autoscaler"&gt;Cluster Autoscaler&lt;/a&gt;, you are also out of luck, because AWS doesn't provide a dual-stack endpoint for the autoscaling API at all.&lt;/p&gt;

&lt;p&gt;Even if IPv6 worked perfectly at the cluster level and AWS provided dual-stack endpoints for all of their APIs, you would probably still need to talk to other resources that only provide IPv4 addresses. To reach those, AWS would have to provide DNS64/NAT64, which allows resources with single-stack IPv6 addresses to talk to resources with single-stack IPv4 addresses.&lt;/p&gt;

&lt;p&gt;Hopefully support for this will be available soon.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>ipv6</category>
      <category>kubernetes</category>
      <category>kops</category>
    </item>
    <item>
      <title>Using IAM Roles for ServiceAccounts on kOps</title>
      <dc:creator>Ole Markus With</dc:creator>
      <pubDate>Wed, 19 May 2021 06:00:37 +0000</pubDate>
      <link>https://dev.to/olemarkus/irsa-support-for-kops-1doe</link>
      <guid>https://dev.to/olemarkus/irsa-support-for-kops-1doe</guid>
      <description>&lt;p&gt;&lt;em&gt;This feature has now been implemented and available for some time. See &lt;a href="https://kops.sigs.k8s.io/cluster_spec/#service-account-issuer-discovery-and-aws-iam-roles-for-service-accounts-irsa"&gt;the official docs&lt;/a&gt;. Note that the feature flag mentioned below has been replaced with: &lt;code&gt;spec.iam.    useServiceAccountExternalPermissions: true&lt;/code&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;Until recently, the only way for a Pod to use the AWS API was to either provision static credentials or assign additional IAM Policies to the Nodes the Pods were running on. kOps addons rely on the latter, which has several issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All other Pods running on the same Node would have the same permissions.&lt;/li&gt;
&lt;li&gt;EC2 Instances cannot enforce IMDSv2 with &lt;code&gt;http-put-response-hop-limit: 1&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;kOps mitigates these concerns by letting addons run on the Control Plane (CP) Nodes. Unfortunately, out of the box, kOps only protects the CP Nodes with &lt;em&gt;Taints&lt;/em&gt;, and any cluster user can add &lt;em&gt;Tolerations&lt;/em&gt; to Pods and schedule them on the CP Nodes.&lt;/p&gt;

&lt;p&gt;The solution to this is to create dedicated IAM Roles for each of the addon Pods, and reduce the privileges given to the IAM Roles assigned to the EC2 instances.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://kops.sigs.k8s.io/releases/1.21-notes/"&gt;kOps 1.21&lt;/a&gt; introduces a set of features that in sum enables &lt;a href="https://kops.sigs.k8s.io/cluster_spec/#service-account-issuer-discovery-and-aws-iam-roles-for-service-accounts-irsa"&gt;IAM Roles for ServiceAccounts&lt;/a&gt; (IRSA).&lt;/p&gt;

&lt;p&gt;Let us have a look at how to enable support for IRSA.&lt;/p&gt;

&lt;h2&gt;
  
  
  ServiceAccount Issuer Discovery
&lt;/h2&gt;

&lt;p&gt;The first feature needed to support IRSA is what Kubernetes refers to as &lt;a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/#service-account-issuer-discovery"&gt;Service Account Issuer Discovery&lt;/a&gt;. Essentially it means publishing the OIDC issuer discovery metadata, which contains things like the public keys used to sign ServiceAccount tokens. By default, the Kubernetes API Server publishes this metadata itself, but this doesn't work out of the box on kOps clusters. AWS also requires the documents to be published in a globally readable location. It is technically possible to expose the API Server on a public IP and allow anonymous access to the OIDC Discovery metadata, but many would be uncomfortable doing so. When this feature is configured, kOps will instead publish these documents to a &lt;em&gt;VFS path&lt;/em&gt;.&lt;/p&gt;
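&lt;p&gt;For context, the discovery metadata is a small JSON document served at &lt;code&gt;/.well-known/openid-configuration&lt;/code&gt; under the issuer URL. Roughly (a sketch with ISSUER_URL as a placeholder; the exact &lt;code&gt;jwks_uri&lt;/code&gt; path depends on where the keys document is published):&lt;/p&gt;

```json
{
  "issuer": "ISSUER_URL",
  "jwks_uri": "ISSUER_URL/openid/v1/jwks",
  "response_types_supported": ["id_token"],
  "subject_types_supported": ["public"],
  "id_token_signing_alg_values_supported": ["RS256"]
}
```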

&lt;p&gt;&lt;em&gt;VFS path&lt;/em&gt; is a &lt;em&gt;Virtual File System&lt;/em&gt; path that kOps also uses for storing configurations, secrets, and keys, e.g. the path pointing to the kOps state store is a &lt;em&gt;VFS path&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Right now, only S3 is supported, as we need to implement support for converting a VFS path to the corresponding HTTPS endpoint, e.g. from &lt;code&gt;s3://&amp;lt;bucket&amp;gt;/&amp;lt;path&amp;gt;&lt;/code&gt; to &lt;code&gt;https://&amp;lt;bucket&amp;gt;.s3.&amp;lt;region&amp;gt;.amazonaws.com/&amp;lt;path&amp;gt;&lt;/code&gt;.&lt;/p&gt;
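&lt;p&gt;The conversion itself is mechanical. A minimal Python sketch (not kOps code, which is written in Go) of the mapping described above:&lt;/p&gt;

```python
def s3_vfs_to_https(vfs_path: str, region: str) -> str:
    """Map an s3://bucket/path VFS path to its regional HTTPS endpoint."""
    if not vfs_path.startswith("s3://"):
        raise ValueError("not an S3 VFS path: " + vfs_path)
    # Split "bucket/key" on the first slash.
    bucket, _, key = vfs_path[len("s3://"):].partition("/")
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"

# s3_vfs_to_https("s3://my-bucket/discovery/cluster", "eu-west-1")
# -> "https://my-bucket.s3.eu-west-1.amazonaws.com/discovery/cluster"
```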

&lt;p&gt;In order to enable this feature, you only need to add the following to the cluster spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountIssuerDiscovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;discoveryStore&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;s3://&amp;lt;my bucket&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you want to use this with AWS, take care that there is no policy preventing public access to the objects stored therein.&lt;/p&gt;

&lt;p&gt;Once you have OIDC discovery metadata published, you can configure any OIDC consumer that supports OIDC issuer discovery to establish trust with your service accounts. This is not limited to AWS, but can be used if you want your ServiceAccounts to authenticate natively to &lt;a href="https://www.vaultproject.io/docs/auth/jwt"&gt;Hashicorp Vault&lt;/a&gt; or any other OIDC consumer that supports OIDC issuer discovery.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS OIDC Provider
&lt;/h2&gt;

&lt;p&gt;The purpose of this feature is to make AWS trust the Kubernetes ServiceAccounts so that the ServiceAccounts can assume AWS IAM Roles. kOps will do this for you if you add the following to the spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;serviceAccountIssuerDiscovery&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enableAWSOIDCProvider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Using IAM Roles for ServiceAccounts belonging to kOps addons
&lt;/h1&gt;

&lt;p&gt;All addons that require access to the AWS API currently run on the Control Plane (CP) Nodes and assume the instance role in order to access AWS services. This is problematic because any other Pod running on CP Nodes can assume the instance role as well. And we cannot use &lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-service.html"&gt;IMDSv2&lt;/a&gt; with &lt;code&gt;http-put-response-hop-limit: 1&lt;/code&gt; as that would block addons, too.&lt;/p&gt;

&lt;p&gt;With the features above in place, each addon will be ported to use IRSA instead. Each addon gets a dedicated role it can assume that has exactly the privileges it needs, and kOps will automatically configure the Pods to use IRSA as well. Enabling IRSA for kOps addons is thus entirely transparent. The corresponding privileges are also removed from the CP Nodes.&lt;/p&gt;

&lt;p&gt;At the moment, using IRSA for kOps addons requires the &lt;code&gt;UseServiceAccountIAM&lt;/code&gt; feature flag to be enabled, as we feel we have not tested the functionality enough. We are also missing the ability to override/augment the IAM Policy that the ServiceAccount uses, which can be necessary, e.g. if you want to use &lt;a href="https://kops.sigs.k8s.io/addons/#cert-manager"&gt;cert-manager&lt;/a&gt; DNS validation for your own domains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating IAM Roles for your own workloads
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Provision the IAM Roles
&lt;/h3&gt;

&lt;p&gt;kOps can provision IAM Roles for your &lt;em&gt;workloads&lt;/em&gt; (Deployments, StatefulSets, Jobs, etc.), including the trust relationship that allows the workload's ServiceAccount to assume the IAM Role, and grant the role the privileges you want.&lt;/p&gt;

&lt;p&gt;You can attach existing policies to the role, or you can define the policy inline like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;iam&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;serviceAccountExternalPermissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;someServiceAccount&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;someNamespace&lt;/span&gt;
        &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;policyARNs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;arn:aws:iam::000000000000:policy/somePolicy&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anotherServiceAccount&lt;/span&gt;
        &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;anotherNamespace&lt;/span&gt;
        &lt;span class="na"&gt;aws&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;inlinePolicy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|-&lt;/span&gt;
            &lt;span class="s"&gt;[&lt;/span&gt;
              &lt;span class="s"&gt;{&lt;/span&gt;
                &lt;span class="s"&gt;"Effect": "Allow",&lt;/span&gt;
                &lt;span class="s"&gt;"Action": "s3:ListAllMyBuckets",&lt;/span&gt;
                &lt;span class="s"&gt;"Resource": "*"&lt;/span&gt;
              &lt;span class="s"&gt;}&lt;/span&gt;
            &lt;span class="s"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Configuring Pods to use IRSA
&lt;/h3&gt;

&lt;p&gt;One thing to bear in mind is that kOps will not "own" ServiceAccounts the way EKS does when using IRSA. So you have to modify your workloads yourself as appropriate.&lt;/p&gt;

&lt;p&gt;Typically, you will use environment variables to configure the AWS SDK to use IRSA. The following shows the changes you have to make to the Pod spec:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_DEFAULT_REGION&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;region&amp;gt;&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_REGION&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;&amp;lt;region&amp;gt;&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_ROLE_ARN&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::&amp;lt;account&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;number&amp;gt;:role/&amp;lt;role&amp;gt;"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_WEB_IDENTITY_TOKEN_FILE&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/run/secrets/amazonaws.com/serviceaccount/token"&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS_STS_REGIONAL_ENDPOINTS&lt;/span&gt;
      &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;regional"&lt;/span&gt;
    &lt;span class="na"&gt;volumeMounts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;mountPath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/run/secrets/amazonaws.com/serviceaccount/"&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
  &lt;span class="na"&gt;volumes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;aws-token&lt;/span&gt;
    &lt;span class="na"&gt;projected&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;sources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;serviceAccountToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazonaws.com"&lt;/span&gt;
          &lt;span class="na"&gt;expirationSeconds&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;86400&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;token&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you prefer, you could create ServiceAccounts with these details and use the &lt;a href="https://github.com/aws/amazon-eks-pod-identity-webhook"&gt;EKS identity webhook&lt;/a&gt;, but I don't see kOps supporting that webhook as a native addon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Zero-configuration IRSA
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;This feature is now available. Read more in &lt;a href="https://dev.to/olemarkus/zero-configuration-irsa-on-kops-1po1"&gt;this post&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;You don't have to do anything at all for kOps addons to use IRSA. I would really like this to be the case for your own workloads as well.&lt;/p&gt;

&lt;p&gt;Since you define the relationship between AWS IAM and ServiceAccount in the Cluster spec, and the changes you have to make to your Pod spec just mirror that relationship, &lt;em&gt;something&lt;/em&gt; could automatically read the Cluster spec and configure workloads for you.&lt;/p&gt;

&lt;p&gt;This would have to be an addon that either provides a webhook similar to the EKS identity webhook, or acts as a controller that watches all workloads in the cluster. It is debatable whether such an addon should be part of the kOps project or standalone. &lt;/p&gt;

&lt;p&gt;I would really love to hear how &lt;em&gt;you&lt;/em&gt; would want this to behave. If you have any ideas, comment here or reach out in #kops-users on the &lt;a href="https://slack.k8s.io/"&gt;Kubernetes Slack&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>kops</category>
      <category>aws</category>
    </item>
    <item>
      <title>Blazing fast Kubernetes scaling with ASG warm pools</title>
      <dc:creator>Ole Markus With</dc:creator>
      <pubDate>Mon, 19 Apr 2021 15:46:56 +0000</pubDate>
      <link>https://dev.to/olemarkus/blazing-fast-kubernetes-scaling-with-asg-warm-pools-53bd</link>
      <guid>https://dev.to/olemarkus/blazing-fast-kubernetes-scaling-with-asg-warm-pools-53bd</guid>
      <description>&lt;p&gt;Last week, AWS launched &lt;a href="https://aws.amazon.com/blogs/compute/scaling-your-applications-faster-with-ec2-auto-scaling-warm-pools/" rel="noopener noreferrer"&gt;warm pools for auto scaling groups&lt;/a&gt; (ASG). In short, this feature allows you to create a pool of pre-initialised EC2 instances. When the ASG needs to scale out, it will pull in Nodes from the warm pool if available. Since these are already pre-initialised, the scale-out time is reduced significantly.&lt;/p&gt;

&lt;p&gt;The warm pool is also virtually for free. You pay for the warm pool instances as you would for any other stopped instance. You also have to pay for the time the instances spend on initialising when entering the warm pool, but this is more or less cancelled out by the reduced time spent on initialising when entering the ASG itself.&lt;/p&gt;

&lt;p&gt;What does &lt;em&gt;pre-initialised&lt;/em&gt; mean, you wonder? It took me a while to understand, too. What happens is that the EC2 instance boots, runs for a while, and then shuts down. We'll get back to this in a bit.&lt;/p&gt;

&lt;p&gt;Now, what intrigued me is this bit from the &lt;a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-warm-pools.html" rel="noopener noreferrer"&gt;warm pool documentation&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o1fme1c8aa5tbvlrjh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4o1fme1c8aa5tbvlrjh3.png" alt="alt text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a very interesting claim. Why on earth would this feature not be possible for Kubernetes, regardless of what shape and form self-managed Kubernetes comes in?&lt;/p&gt;

&lt;p&gt;AWS may have made interesting choices with their &lt;em&gt;Elastic Kubernetes Service&lt;/em&gt; precluding them from taking advantage of warm pools, but there are alternatives!&lt;/p&gt;

&lt;p&gt;Over the last year or so I have regularly contributed to &lt;a href="https://kops.sigs.k8s.io/" rel="noopener noreferrer"&gt;kOps&lt;/a&gt;, which is my preferred way of deploying and maintaining &lt;em&gt;production-ready&lt;/em&gt; clusters on AWS. I know well how it boots a plain Ubuntu &lt;em&gt;instance&lt;/em&gt; and configures it to become a Kubernetes &lt;em&gt;node&lt;/em&gt;, and I could not imagine that implementing warm pool support would be much of a challenge. And it turns out it was not.&lt;/p&gt;

&lt;h1&gt;
  
  
  The results
&lt;/h1&gt;

&lt;p&gt;This post will describe in detail some of the inner workings of kOps, how ASG behaves, and how to observe various time spans between when a scale-out is triggered and the node is ready for Kubernetes workloads.&lt;/p&gt;

&lt;p&gt;If you are here only for the results, here is the TL;DR:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The time between a scale-out being triggered and Pods starting improved by &lt;em&gt;at least 50%&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;It should be possible to improve this even further.&lt;/li&gt;
&lt;li&gt;Most, if not all, of the functionality below &lt;em&gt;will be available in kOps 1.21&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This table shows the number of seconds between CAS being triggered and Pods starting.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Configuration&lt;/th&gt;
&lt;th&gt;First Pod started&lt;/th&gt;
&lt;th&gt;Last Pod started&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;No warm pool&lt;/td&gt;
&lt;td&gt;149&lt;/td&gt;
&lt;td&gt;190&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm pool&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;td&gt;149&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm pool + lifecycle hook&lt;/td&gt;
&lt;td&gt;76&lt;/td&gt;
&lt;td&gt;98&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm pool + lifecycle hook + pre-pulled images&lt;/td&gt;
&lt;td&gt;70&lt;/td&gt;
&lt;td&gt;79&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h1&gt;
  
  
  And now for the details!
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Initialising a Kubernetes node
&lt;/h2&gt;

&lt;p&gt;First, let us have a look at what the process of converting a brand new EC2 &lt;em&gt;instance&lt;/em&gt; to a Kubernetes &lt;em&gt;Node&lt;/em&gt; means.&lt;br&gt;
On a new EC2 instance, this happens on first boot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cloud-init installs a configuration service called &lt;code&gt;nodeup&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nodeup&lt;/code&gt; takes the cluster configuration and installs &lt;code&gt;containerd&lt;/code&gt;, &lt;code&gt;kubelet&lt;/code&gt;, and the necessary distro packages with their correct configurations.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nodeup&lt;/code&gt; establishes trust with the API server (control plane).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nodeup&lt;/code&gt; creates and installs a &lt;code&gt;systemd&lt;/code&gt; service for &lt;code&gt;kubelet&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;nodeup&lt;/code&gt; starts the &lt;code&gt;kubelet&lt;/code&gt; service, which is the process on each node that manages the Kubernetes workloads.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;kubelet&lt;/code&gt; pulls down all the images the control plane tells it to run and starts them as defined by Pod specs and similar.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a kOps-provisioned instance reboots, &lt;code&gt;nodeup&lt;/code&gt; runs through all of the above again to ensure the instance is in the expected state. &lt;code&gt;nodeup&lt;/code&gt; is smart enough not to redo already performed tasks though, so the second run is quite fast.&lt;/p&gt;
&lt;h2&gt;
  
  
  Doing nothing at all
&lt;/h2&gt;

&lt;p&gt;The most naïve way of implementing support for warm pools is to do nothing more than create the warm pool. Unfortunately, this would start &lt;code&gt;kubelet&lt;/code&gt;, which will register the Node with the cluster. Since the &lt;a href="https://github.com/kubernetes/cloud-provider-aws" rel="noopener noreferrer"&gt;AWS cloud provider&lt;/a&gt; does not remove instances in the &lt;code&gt;stopped&lt;/code&gt; state, the control plane marks the Node &lt;code&gt;NotReady&lt;/code&gt;, but keeps it around in case it comes back up.&lt;/p&gt;

&lt;p&gt;It is not a catastrophe to have a large number of &lt;code&gt;NotReady&lt;/code&gt; Nodes in the cluster, but any sane monitoring would not be too happy, and &lt;em&gt;detecting actual bad Nodes would be harder&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Making &lt;code&gt;nodeup&lt;/code&gt; aware of the warm pool
&lt;/h2&gt;

&lt;p&gt;The only thing I had to do to support warm pools gracefully was to make &lt;code&gt;nodeup&lt;/code&gt; conscious of its &lt;em&gt;ASG lifecycle state&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instances entering the warm pool have a &lt;a href="https://docs.aws.amazon.com/autoscaling/ec2/userguide/warm-pool-instance-lifecycle.html" rel="noopener noreferrer"&gt;slightly different lifecycle&lt;/a&gt; than instances that go directly into the ASG. What &lt;code&gt;nodeup&lt;/code&gt; needs to do is:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;check if the current instance ASG lifecycle state has a &lt;code&gt;warming:&lt;/code&gt; prefix.&lt;/li&gt;
&lt;li&gt;if it does &lt;em&gt;not&lt;/em&gt;, install and start the &lt;code&gt;kubelet&lt;/code&gt; service.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This way &lt;code&gt;kubelet&lt;/code&gt; does not start and join the cluster on first boot, but since we enabled the service, &lt;code&gt;systemd&lt;/code&gt; will start it on the second boot.&lt;/p&gt;
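&lt;p&gt;Sketched in shell (illustrative only; &lt;code&gt;nodeup&lt;/code&gt; itself implements this in Go, and the metadata path shown is an assumption about where the lifecycle state can be read from inside the instance):&lt;br&gt;&lt;/p&gt;

```shell
# Illustrative sketch of the check described above; names are hypothetical.
# The ASG lifecycle state is assumed to be readable from instance metadata.
lifecycle_state() {
  curl -s http://169.254.169.254/latest/meta-data/autoscaling/target-lifecycle-state
}

# Decide whether kubelet should be started for a given lifecycle state.
# A state with the warming: prefix means the instance is entering the warm
# pool, so the kubelet service is enabled but not started.
should_start_kubelet() {
  case "$1" in
    warming:*) return 1 ;;
    *)         return 0 ;;
  esac
}
```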
&lt;h1&gt;
  
  
  Comparing the difference
&lt;/h1&gt;

&lt;p&gt;In this section I will take you through comparing the time it takes to scale out a Kubernetes Deployment with and without a warm pool enabled. The &lt;em&gt;acid test&lt;/em&gt; is the interval between &lt;a href="https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler" rel="noopener noreferrer"&gt;Cluster Autoscaler&lt;/a&gt; (CAS) reacting to the scale-out demand and all the Pods starting. &lt;/p&gt;
&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;In the kOps Cluster spec I ensured I had the following snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;clusterAutoscaler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;balanceSimilarNodeGroups&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables the &lt;a href="https://kops.sigs.k8s.io/addons/#cluster-autoscaler" rel="noopener noreferrer"&gt;cluster autoscaler addon&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;The number of new instances per ASG influences the scale-out time of the ASG itself. Adding 9 instances to one ASG is significantly slower than adding 3 instances to 3 ASGs. So, to ensure fair comparisons, we tell CAS to balance the ASGs.&lt;/p&gt;

&lt;p&gt;On each of the InstanceGroups with the &lt;code&gt;Node&lt;/code&gt; role, I set the following capacity:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;machineType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;t3.medium&lt;/span&gt;
  &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The cluster will launch with the minimum capacity.&lt;/p&gt;

&lt;p&gt;I then created a Deployment that has &lt;code&gt;resource.requirement.cpu&lt;/code&gt; set:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;apps/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Deployment&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx-deployment&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;matchLabels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
  &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;app&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx&lt;/span&gt;
        &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nginx:stable-alpine&lt;/span&gt;
        &lt;span class="na"&gt;ports&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;containerPort&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;80&lt;/span&gt;
        &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
            &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I used the &lt;code&gt;nginx:stable-alpine&lt;/code&gt; image here as it is fairly small. I did not want image pull time to significantly impact the scale-out time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scaling out
&lt;/h2&gt;

&lt;p&gt;To scale the Deployment, I executed the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl scale deployment.v1.apps/nginx-deployment &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A &lt;code&gt;t3.medium&lt;/code&gt; instance only has 2 CPUs, some of which are already reserved by other Pods. So increasing the replicas as above causes CAS to scale out one instance per &lt;em&gt;replica&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Obtaining the test results
&lt;/h2&gt;

&lt;p&gt;A good way of getting the details is to list the events for a Pod:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kubectl get events &lt;span class="nt"&gt;-o&lt;/span&gt; custom-columns&lt;span class="o"&gt;=&lt;/span&gt;FirstSeen:.firstTimestamp,LastSeen:.lastTimestamp,Count:.count,From:.source.component,Type:.type,Reason:.reason,Message:.message &lt;span class="nt"&gt;--field-selector&lt;/span&gt; involvedObject.kind&lt;span class="o"&gt;=&lt;/span&gt;Pod,involvedObject.name&lt;span class="o"&gt;=&lt;/span&gt;nginx-deployment-123abc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each configuration in this post, I ran the command above for the first and last Pod to enter the &lt;code&gt;Running&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;For the first Pod, I got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FirstSeen              LastSeen               Count   From                 Type      Reason             Message
2021-04-17T08:56:30Z   2021-04-17T08:56:30Z   1       cluster-autoscaler   Normal    TriggeredScaleUp   Pod triggered scale-up: [{Nodes-eu-central-1a 1-&amp;gt;5 (max: 10)} {Nodes-eu-central-1b 1-&amp;gt;4 (max: 10)}]
&amp;lt;snip&amp;gt;
2021-04-17T08:58:59Z   2021-04-17T08:58:59Z   1       kubelet              Normal    Started            Started container nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the last Pod I got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FirstSeen              LastSeen               Count   From                 Type      Reason             Message
2021-04-17T08:56:30Z   2021-04-17T08:56:30Z   1       cluster-autoscaler   Normal    TriggeredScaleUp   Pod triggered scale-up: [{Nodes-eu-central-1a 1-&amp;gt;5 (max: 10)} {Nodes-eu-central-1b 1-&amp;gt;4 (max: 10)}]
&amp;lt;snip&amp;gt;
2021-04-17T08:59:40Z   2021-04-17T08:59:40Z   1       kubelet              Normal    Started            Started container nginx
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the first Pod started after 149s and the last Pod after 190s. These are the numbers I'll be comparing across all configurations. I also found it interesting to compare the &lt;em&gt;difference&lt;/em&gt; between the first and last Pod start time.&lt;/p&gt;
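&lt;p&gt;The elapsed times follow directly from the event timestamps; a quick way to compute them (assuming GNU &lt;code&gt;date&lt;/code&gt;):&lt;br&gt;&lt;/p&gt;

```shell
# Seconds between two RFC3339 timestamps (GNU date assumed).
elapsed() {
  echo $(( $(date -u -d "$2" +%s) - $(date -u -d "$1" +%s) ))
}

elapsed "2021-04-17T08:56:30Z" "2021-04-17T08:58:59Z"   # first Pod: 149
elapsed "2021-04-17T08:56:30Z" "2021-04-17T08:59:40Z"   # last Pod: 190
```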

&lt;h2&gt;
  
  
  Hunting for delays
&lt;/h2&gt;

&lt;p&gt;This part may not be all that interesting. Here I try to show which time spans are candidates for improving the warming process, including what may cause a 41-second difference between those two Pods.&lt;/p&gt;

&lt;h3&gt;
  
  
  ASG reaction time
&lt;/h3&gt;

&lt;p&gt;If I look at the ASG activity, I see the following message on all instances the ASG launched: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At 2021-04-17T08:56:30Z a user request explicitly set group desired capacity changing the desired capacity from 1 to 5. At 2021-04-17T08:56:32Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 1 to 5.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;However, if I look at the actual boot of the instances, I see these two lines for the first and last instance to boot respectively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Logs begin at Sat 2021-04-17 08:56:51 UTC, end at Sat 2021-04-17 09:10:23 UTC. --
-- Logs begin at Sat 2021-04-17 08:57:13 UTC, end at Sat 2021-04-17 09:09:19 UTC. --
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So even if the ASG &lt;em&gt;launched&lt;/em&gt; the instances at the same time, they do not actually &lt;em&gt;boot&lt;/em&gt; at the same time. &lt;/p&gt;

&lt;p&gt;It looks like we can generally assume 20-30 seconds response time on an ASG scale-out.&lt;/p&gt;

&lt;h3&gt;
  
  
  Nodeup run time
&lt;/h3&gt;

&lt;p&gt;We can see that &lt;code&gt;nodeup&lt;/code&gt; spends a fairly consistent amount of time initialising the node. The times below show when &lt;code&gt;nodeup&lt;/code&gt; finished on the two Nodes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apr 17 08:58:27 ip-172-20-42-5 `systemd`[1]: kops-configuration.service: Succeeded.
Apr 17 08:58:50 ip-172-20-90-140 `systemd`[1]: kops-configuration.service: Succeeded.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gap between the two Nodes stays almost the same. From boot until the instance has been configured takes about 95-100 seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kubelet becoming ready
&lt;/h3&gt;

&lt;p&gt;The last part is the time it takes before the Node enters the &lt;code&gt;Ready&lt;/code&gt; state.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;kubelet&lt;/code&gt; becomes ready once it has registered with the control plane and verified that storage, CPU, memory, and networking are working properly.&lt;/p&gt;

&lt;p&gt;Interestingly, this part further skewed the difference between our two Nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  Ready                True    Sat, 17 Apr 2021 11:19:08 +0200   Sat, 17 Apr 2021 10:58:53 +0200   KubeletReady                 `kubelet` is posting ready status. AppArmor enabled
  Ready                True    Sat, 17 Apr 2021 11:14:56 +0200   Sat, 17 Apr 2021 10:59:25 +0200   KubeletReady                 `kubelet` is posting ready status. AppArmor enabled
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This last leg took 26s and 35s for those two instances.&lt;/p&gt;

&lt;h2&gt;
  
  
  Enter the warm pool
&lt;/h2&gt;

&lt;p&gt;So how much does adding a warm pool improve the scale-out time?&lt;/p&gt;

&lt;p&gt;Wrap up this test by scaling the deployment back to 0 so CAS can scale down the ASGs again.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl scale deployment.v1.apps/nginx-deployment &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Adding a warm pool
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;This feature has not been released to a kOps beta at the time of the writing; the field names below may change.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Configuration
&lt;/h2&gt;

&lt;p&gt;The following will make kOps create warm pools for our InstanceGroups.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;warmPool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; 
  &lt;span class="na"&gt;minSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
  &lt;span class="na"&gt;maxSize&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Specifically, this will configure a warm pool with 10 Nodes. I only do this to ensure I have a known number of warm instances, to make the tests comparable. When using warm pools under normal operations, I would just use the AWS defaults.&lt;/p&gt;

&lt;p&gt;Apply the configuration and watch the warm pool instances appear:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;kops get instances
ID                      NODE-NAME                                       STATUS       ROLES   STATE           INTERNAL-IP     INSTANCE-GROUP               MACHINE-TYPE
i-01ade8dad4c7ce0cd     ip-172-20-114-104.eu-central-1.compute.internal UpToDate     node                    172.20.114.104  Nodes-eu-central-1c          t3.medium
i-01f169e730f88e016     ip-172-20-118-255.eu-central-1.compute.internal UpToDate     master                  172.20.118.255  master-eu-central-1c.masters t3.medium
i-069e09e5873042cd7     ip-172-20-93-29.eu-central-1.compute.internal   UpToDate     master                  172.20.93.29    master-eu-central-1b.masters t3.medium
i-09b42c88fbd3399ab     ip-172-20-58-68.eu-central-1.compute.internal   UpToDate     node                    172.20.58.68    Nodes-eu-central-1a          t3.medium
i-0a85aed8869a30432     ip-172-20-59-147.eu-central-1.compute.internal  UpToDate     master                  172.20.59.147   master-eu-central-1a.masters t3.medium
i-0b37f6a258d9c7775     ip-172-20-75-150.eu-central-1.compute.internal  UpToDate     node                    172.20.75.150   Nodes-eu-central-1b          t3.medium
i-0c16b3c668615f259                                                     UpToDate     node    WarmPool        172.20.50.226   Nodes-eu-central-1a          t3.medium
i-0cdf1c334d452c9a6                                                     UpToDate     node    WarmPool        172.20.126.45   Nodes-eu-central-1c          t3.medium
i-0d00f4debb586d17f                                                     UpToDate     node    WarmPool        172.20.116.222  Nodes-eu-central-1c          t3.medium
i-0d04d04cb7f4be2d7                                                     UpToDate     node    WarmPool        172.20.58.177   Nodes-eu-central-1a          t3.medium
i-0d49db4113702292c                                                     UpToDate     node    WarmPool        172.20.59.66    Nodes-eu-central-1a          t3.medium
i-0e07b6cc361d7a909                                                     UpToDate     node    WarmPool        172.20.89.193   Nodes-eu-central-1b          t3.medium
i-0f3451adef13005e7                                                     UpToDate     node    WarmPool        172.20.114.9    Nodes-eu-central-1c          t3.medium
&amp;lt;snip&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The kOps output only shows that the instance is in the warm pool, not whether it has finished pre-initialisation. But if you go into the &lt;em&gt;Instance Management&lt;/em&gt; part of the ASG in the AWS console, you can see something like this:&lt;/p&gt;
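&lt;p&gt;The same state can also be inspected from the AWS CLI with &lt;code&gt;aws autoscaling describe-warm-pool&lt;/code&gt;. The ASG name below is illustrative; use the ASG backing your InstanceGroup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws autoscaling describe-warm-pool &lt;span class="nt"&gt;--auto-scaling-group-name&lt;/span&gt; nodes-eu-central-1a.example.k8s.local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;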

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxl2dg4f26nyn6krdqfe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxl2dg4f26nyn6krdqfe.png" alt="ASG Instance Management view showing warm pool instances, some not yet in warmed:stopped state"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, not all instances have entered &lt;code&gt;warmed:stopped&lt;/code&gt; state yet. But after waiting a bit longer, they are all ready and I can scale out.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl scale deployment.v1.apps/nginx-deployment &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once all the Pods have entered the &lt;code&gt;Running&lt;/code&gt; state, let's see whether this has improved startup times.&lt;/p&gt;

&lt;p&gt;Again, find the first and last Pod that entered the &lt;code&gt;Running&lt;/code&gt; state and list their events. The method is the same as last time, and it shows that the first Pod starts after 79s and the last one starts after 149s. &lt;/p&gt;
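&lt;p&gt;One way to pull out those timestamps is to sort the events by creation time and filter on the Deployment from the earlier example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl get events &lt;span class="nt"&gt;--sort-by&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;.metadata.creationTimestamp | &lt;span class="nb"&gt;grep&lt;/span&gt; nginx-deployment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;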

&lt;p&gt;The first Pod starts 70 seconds faster than the first node without a warm pool. The last Pod starts 41 seconds faster than the last Pod without warm instances. That is a pretty decent improvement.&lt;/p&gt;

&lt;p&gt;Without the warm pool, the difference between the first Pod and last Pod was 41s. This time it was a whopping 70s; half a minute more. We cannot seem to blame this on ASG response time or the time between the &lt;code&gt;kubelet&lt;/code&gt; starting and becoming ready. &lt;/p&gt;

&lt;p&gt;So, what is happening here?&lt;/p&gt;

&lt;p&gt;Turns out the time an instance runs before it is shut down is &lt;em&gt;completely arbitrary&lt;/em&gt;. Some instances stay running for seconds, others for minutes. On the Node the first Pod is running on, &lt;code&gt;nodeup&lt;/code&gt; was allowed to run to completion, while on the Node of the last Pod, it was barely allowed to run at all. Luckily, &lt;code&gt;nodeup&lt;/code&gt; is fairly good at knowing the current state of the Node and can pick up from where it left off, regardless of when it was interrupted.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enter lifecycle hooks
&lt;/h1&gt;

&lt;p&gt;But we do not want &lt;code&gt;nodeup&lt;/code&gt; to be interrupted. Nor do we want the instance to stay running long after &lt;code&gt;nodeup&lt;/code&gt; has finished.&lt;/p&gt;

&lt;p&gt;What AWS did to solve this is to let warming instances run through the same lifecycle as other instances. So, if you have a lifecycle hook for &lt;code&gt;EC2_INSTANCE_LAUNCHING&lt;/code&gt; it will trigger on warming Nodes too.&lt;/p&gt;
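&lt;p&gt;Conceptually, once the instance has finished configuring itself, it completes the lifecycle action so the ASG can proceed with the transition. With the AWS CLI that signal looks roughly like this (hook and group names are illustrative, not the exact names kOps uses):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws autoscaling complete-lifecycle-action \
  &lt;span class="nt"&gt;--lifecycle-hook-name&lt;/span&gt; kops-warmpool \
  &lt;span class="nt"&gt;--auto-scaling-group-name&lt;/span&gt; nodes-eu-central-1a.example.k8s.local \
  &lt;span class="nt"&gt;--lifecycle-action-result&lt;/span&gt; CONTINUE \
  &lt;span class="nt"&gt;--instance-id&lt;/span&gt; i-0c16b3c668615f259
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;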

&lt;p&gt;Amend the InstanceGroup spec as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;warmPool&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;useLifecycleHook&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will make kOps provision a lifecycle hook that can be used by &lt;code&gt;nodeup&lt;/code&gt; to signal that it has completed its configuration.&lt;/p&gt;

&lt;p&gt;Do the scale down/scale up dance again and observe the first and last Pod creation time.&lt;/p&gt;
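&lt;p&gt;The dance is the same &lt;code&gt;kubectl scale&lt;/code&gt; command as before, first down to zero and then back up once the cluster autoscaler has removed the nodes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;kubectl scale deployment.v1.apps/nginx-deployment &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0
&lt;span class="c"&gt;# wait for scale-in to finish, then scale out again&lt;/span&gt;
kubectl scale deployment.v1.apps/nginx-deployment &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;10
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;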

&lt;p&gt;First Pod 76s; last Pod 98s. That's a 22s difference between the first and last Pod. Down from 41s without the warm pool, and certainly down from the 70s for the warm pool without the instance lifecycle hook.&lt;/p&gt;

&lt;h1&gt;
  
  
  The effect of warm pool alone
&lt;/h1&gt;

&lt;p&gt;With warm pool and some fairly easy changes to kOps, we did the impossible. We created warm pool support for self-managed Kubernetes.&lt;/p&gt;

&lt;p&gt;The result is significantly faster response times to Pod scale-out. From up to 190 seconds without a warm pool, to up to 98 seconds with a warm pool and lifecycle hooks. The best case went from 149 seconds to 76 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is 50% faster&lt;/strong&gt;. Not bad! &lt;/p&gt;

&lt;h1&gt;
  
  
  Exploiting warm pools further
&lt;/h1&gt;

&lt;p&gt;So far we have focused just on running a regular &lt;code&gt;nodeup&lt;/code&gt; run. But can we do better? Can we exploit warm pool to make Pod scale out even faster?&lt;/p&gt;

&lt;p&gt;Most likely we cannot do much about what happens before &lt;code&gt;nodeup&lt;/code&gt; runs. Nodeup already runs fairly fast; roughly 10 seconds. There could be some optimisations there, but we would not be able to shave off many seconds.&lt;/p&gt;

&lt;p&gt;But what about those 26-35 seconds between &lt;code&gt;nodeup&lt;/code&gt; completing and the Pods becoming ready?&lt;/p&gt;

&lt;p&gt;All Nodes are running additional system containers such as &lt;code&gt;kube-proxy&lt;/code&gt; and a CNI, in this case Cilium. All of these need to be present on the machine before much else can happen. And those images are not necessarily small. &lt;/p&gt;

&lt;p&gt;In my case, the time it took was the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FirstSeen              LastSeen               Count   From      Type     Reason    Message
2021-04-18T06:10:34Z   2021-04-18T06:10:34Z   1       kubelet   Normal   Pulled    Successfully pulled image "k8s.gcr.io/kube-proxy:v1.20.1" in 19.120641887s
2021-04-18T06:10:55Z   2021-04-18T06:10:55Z   1       kubelet   Normal   Pulled    Successfully pulled image "docker.io/cilium/cilium:v1.9.4" in 8.270050914s
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So not only did it take a while for each of these images to be pulled, but they are pulled in sequence, adding up to 25+ seconds.&lt;/p&gt;

&lt;p&gt;But during warming, &lt;code&gt;nodeup&lt;/code&gt; already knows about these images. What if it could just pull them so they were already present? Let's try this and see if this changes anything.&lt;/p&gt;
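&lt;p&gt;On a containerd-based Node, pre-pulling boils down to something like the following (a sketch of the idea, not the exact &lt;code&gt;nodeup&lt;/code&gt; implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# pull into the k8s.io namespace so kubelet finds the images already present&lt;/span&gt;
ctr &lt;span class="nt"&gt;-n&lt;/span&gt; k8s.io images pull k8s.gcr.io/kube-proxy:v1.20.1
ctr &lt;span class="nt"&gt;-n&lt;/span&gt; k8s.io images pull docker.io/cilium/cilium:v1.9.4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;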

&lt;p&gt;After another round of the scale-in/scale-out dance, describing the node shows that the images have indeed been pulled during warming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;2021-04-18T06:35:37Z   2021-04-18T06:35:37Z   1       `kubelet`   Normal   Pulled    Container image "k8s.gcr.io/kube-proxy:v1.20.1" already present on machine
2021-04-18T06:35:37Z   2021-04-18T06:35:37Z   1       `kubelet`   Normal   Started   Started container kube-proxy
2021-04-18T06:35:37Z   2021-04-18T06:35:37Z   1       `kubelet`             Normal   Pulled      Container image "docker.io/cilium/cilium:v1.9.4" already present on machine
2021-04-18T06:35:38Z   2021-04-18T06:35:38Z   1       `kubelet`             Normal   Started     Started container cilium-agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We also see that the containers have started at roughly the same time. &lt;/p&gt;

&lt;p&gt;Comparing this configuration with the previous one shows that it did not bring down the starting time of the first Pod by much; this time the first Pod started 70 seconds after CAS triggered. But the last Pod now consistently trails only a few seconds behind, at about 79 seconds, which makes the improvement worthwhile.&lt;/p&gt;

&lt;p&gt;Pre-pulling images during warming could also be done for any other containers that may run on a given Node. This is certainly valuable for any DaemonSet, but also for any other Pod that has a chance of being deployed to the Node. The worst case is that you waste a bit of disk space should those Pods never be scheduled on the Node.&lt;/p&gt;

&lt;h1&gt;
  
  
  Wrap up
&lt;/h1&gt;

&lt;p&gt;The feature is not even two weeks old, so I certainly have not had the time to explore all the ways it can be exploited. &lt;code&gt;nodeup&lt;/code&gt; typically does not expect instances to reboot, so there may be optimisations to be done there as well. For example, on the second boot, &lt;code&gt;kubelet&lt;/code&gt; is again triggered by &lt;code&gt;nodeup&lt;/code&gt;, which may not be necessary: if &lt;code&gt;nodeup&lt;/code&gt; successfully created the &lt;code&gt;kubelet&lt;/code&gt; service on the first run, there should be zero changes to the system on the second run. The only reason there would be a change is if the cluster configuration changed, and such changes should only be applied through an instance rotation.&lt;/p&gt;

&lt;p&gt;I hope you are as excited about this as I am. And if you wonder when all of this will be available to you, the answer is "shortly". Some of the functionality has already been merged into &lt;code&gt;kubernetes/kops&lt;/code&gt;, and I have the code for the rest more or less ready. I hope for all of this to be available in kOps 1.21, expected sometime in May.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>aws</category>
      <category>kops</category>
    </item>
  </channel>
</rss>
