Gajus Kuizinas

Mistake that cost thousands (Kubernetes, GKE)

No exaggeration, unfortunately. As a disclaimer, I will add that this is a really stupid mistake and shows my lack of experience managing auto-scaling deployments. However, it all started with a question that had no answer, and I feel obliged to share what I learned to help others avoid similar pitfalls.

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs. 1x n1-standard-96 (96 vCPU) VM, or 6x n1-standard-16 (16 vCPU) VMs?

I asked this question multiple times in the Kubernetes community. No one suggested an answer. If you are unsure about the answer, then there is something for you to learn from my experience (or skip to the Answer section if you are impatient). Here it goes:

Premise

I woke up in the middle of the night with a determination to reduce our infrastructure costs.

We are running a large Kubernetes cluster. "Large" is relative, of course. In our case that is 600 vCPUs during normal business hours. This number doubles during peak hours and drops to near zero during some hours of the night.

The invoice for the last month was USD 3,500.

(Chart: monthly costs)

This is already pretty darn good given the computing power that we get, and Google Kubernetes Engine (GKE) made cost management mostly easy:

Using exclusively preemptible VMs is what allows us to keep the costs low. To illustrate the savings: in the case of the n1-standard-1 machine type hosted in europe-west4, the difference between a dedicated and a preemptible VM is USD 26.73/month vs. USD 8.03/month. That is over 3x lower cost. Of course, preemptible VMs have their limitations that you need to familiarise yourself with and counteract, but that is a whole different topic.
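
For reference, preemptible nodes on GKE are just a flag on the node pool. A minimal sketch (the pool name, cluster name, zone, and autoscaling bounds below are placeholders, not our actual setup):

gcloud container node-pools create preemptible-pool \
  --cluster my-cluster \
  --zone europe-west4-a \
  --machine-type n1-standard-1 \
  --preemptible \
  --enable-autoscaling --min-nodes 0 --max-nodes 100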

With all of the above in place, it felt like we were doing all the right things to keep the costs low. However, I always had a nagging feeling that something was off.

Major red flag 🚩

About that nagging feeling:

Average CPU usage per Node was low (10%-20%). This didn't seem right.

My first thought was that I had misconfigured compute resources. What resources are required depends entirely on the program that you are running. Therefore, the best thing to do is to deploy your program without resource limits, observe how your program behaves during idle, regular, and peak loads, and set requested/limit resources based on the observed values.
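
On GKE, a simple way to make those observations is kubectl top, which reads from the cluster's metrics-server (available by default on GKE). The label selector below is an assumption about how the deployment's pods are labelled:

kubectl top nodes
kubectl top pods -l app=admdesl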

I will illustrate my mistake using a single deployment, "admdesl", as an example.

In our case, resource requirements are sporadic:

NAME                       CPU(cores)   MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc   3m           112Mi
admdesl-5fcfbb5544-mfsvf   3m           118Mi
admdesl-5fcfbb5544-nj49v   4m           107Mi
admdesl-5fcfbb5544-nkvk9   3m           103Mi
admdesl-5fcfbb5544-nxbrd   3m           117Mi
admdesl-5fcfbb5544-pb726   3m           98Mi
admdesl-5fcfbb5544-rhhgn   83m          119Mi
admdesl-5fcfbb5544-rhp76   2m           105Mi
admdesl-5fcfbb5544-scqgq   4m           117Mi
admdesl-5fcfbb5544-tn556   49m          101Mi
admdesl-5fcfbb5544-tngv4   2m           135Mi
admdesl-5fcfbb5544-vcmjm   22m          106Mi
admdesl-5fcfbb5544-w9dsv   180m         100Mi
admdesl-5fcfbb5544-whwtk   3m           103Mi
admdesl-5fcfbb5544-wjnnk   132m         110Mi
admdesl-5fcfbb5544-xrrvt   4m           124Mi
admdesl-5fcfbb5544-zhbqw   4m           112Mi
admdesl-5fcfbb5544-zs75s   144m         103Mi


Pods that average 5m are "idle": there is a task in the queue for them to process, but we are waiting for some (external) condition to clear before proceeding. In the case of this particular deployment, these pods change between idle and active states multiple times every minute and spend 70%+ of their time in the idle state.

A minute later the same set of pods will look different:

NAME                       CPU(cores)   MEMORY(bytes)
admdesl-5fcfbb5544-lq7wc   152m         107Mi
admdesl-5fcfbb5544-mfsvf   49m          102Mi
admdesl-5fcfbb5544-nj49v   151m         116Mi
admdesl-5fcfbb5544-nkvk9   105m         100Mi
admdesl-5fcfbb5544-nxbrd   160m         119Mi
admdesl-5fcfbb5544-pb726   6m           103Mi
admdesl-5fcfbb5544-rhhgn   20m          109Mi
admdesl-5fcfbb5544-rhp76   110m         103Mi
admdesl-5fcfbb5544-scqgq   13m          120Mi
admdesl-5fcfbb5544-tn556   131m         115Mi
admdesl-5fcfbb5544-tngv4   52m          113Mi
admdesl-5fcfbb5544-vcmjm   102m         104Mi
admdesl-5fcfbb5544-w9dsv   18m          125Mi
admdesl-5fcfbb5544-whwtk   173m         122Mi
admdesl-5fcfbb5544-wjnnk   31m          110Mi
admdesl-5fcfbb5544-xrrvt   91m          126Mi
admdesl-5fcfbb5544-zhbqw   49m          107Mi
admdesl-5fcfbb5544-zs75s   87m          148Mi


Looking at the above, I thought it made sense to have a configuration such as:

resources:
  requests:
    memory: '150Mi'
    cpu: '20m'
  limits:
    memory: '250Mi'
    cpu: '200m'


This translates to:

  • idle pods don't consume more than 20m
  • active (healthy) pods peak at 200m

However, when I used this configuration, the deployments became hectic:

admdesl-78fc6f5fc9-xftgr  0/1    Terminating                3         21m
admdesl-78fc6f5fc9-xgbcq  0/1    Init:CreateContainerError  0         10m
admdesl-78fc6f5fc9-xhfmh  0/1    Init:CreateContainerError  1         9m44s
admdesl-78fc6f5fc9-xjf4r  0/1    Init:CreateContainerError  0         10m
admdesl-78fc6f5fc9-xkcfw  0/1    Terminating                0         20m
admdesl-78fc6f5fc9-xksc9  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xktzq  1/1    Running                    0         10m
admdesl-78fc6f5fc9-xkwmw  0/1    Init:CreateContainerError  0         9m43s
admdesl-78fc6f5fc9-xm8pt  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xmhpn  0/1    CreateContainerError       0         8m56s
admdesl-78fc6f5fc9-xn25n  0/1    Init:0/1                   0         9m6s
admdesl-78fc6f5fc9-xnv4c  0/1    Terminating                0         20m
admdesl-78fc6f5fc9-xp8tf  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xpc2h  0/1    Init:0/1                   0         10m
admdesl-78fc6f5fc9-xpdhr  0/1    Terminating                0         131m
admdesl-78fc6f5fc9-xqflf  0/1    CreateContainerError       0         10m
admdesl-78fc6f5fc9-xrqjv  1/1    Running                    0         10m
admdesl-78fc6f5fc9-xrrwx  0/1    Terminating                0         21m
admdesl-78fc6f5fc9-xs79k  0/1    Terminating                0         21m


This would happen whenever a new Node was brought into or taken out of the cluster (which happens often due to auto-scaling).

As such, I kept increasing the requested pod resources until I ended up with the following configuration for this deployment:

resources:
  requests:
    memory: '150Mi'
    cpu: '100m'
  limits:
    memory: '250Mi'
    cpu: '500m'


With this configuration the cluster was running smoothly, but it meant that even idle Pods were pre-allocated more CPU time than they needed. This is the reason why the average CPU usage per Node was low. However, I didn't know what the solution was (reducing the requested resources resulted in a hectic cluster state/outages), and as such I rolled out a variation of this generous resource allocation for all the deployments.
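
To make the waste concrete, here is a rough back-of-the-envelope calculation; the ~940m figure is an approximation of the CPU GKE leaves allocatable on an n1-standard-1 node after system reservations:

~940m allocatable / 100m requested per pod ≈ 9 pods per node
9 pods idling at ~3-5m each ≈ 30-45m actually used ≈ 3-5% utilisation

With a portion of the pods active at any given moment, this lands in the observed 10%-20% range.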

Answer

Back to my question:

What is the difference between a Kubernetes cluster using 100x n1-standard-1 (1 vCPU) VMs vs. 1x n1-standard-96 (96 vCPU) VM, or 6x n1-standard-16 (16 vCPU) VMs?

For starters, there is no price-per-vCPU difference between n1-standard-1 and n1-standard-96. Therefore, I reasoned that using machines with fewer vCPUs would give me more granular control over the price.
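
Using the europe-west4 preemptible price quoted earlier, and given that n1 pricing scales roughly linearly with vCPU count, the comparison looks like:

100x n1-standard-1 ≈ 100 × USD 8.03 ≈ USD 803/month
1x n1-standard-96  ≈  96 × USD 8.03 ≈ USD 771/month

In other words, the bill is driven almost entirely by the total vCPU count, not by how it is split across machines; the machine size only changes the granularity with which the autoscaler can add and remove capacity.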

The other consideration I had was how fast the cluster would auto-scale, i.e. if there is a sudden surge, how fast the cluster autoscaler can provision new nodes for the unschedulable pods. This was not a concern, though: our resource requirements grow and shrink gradually.

And so I went with mostly 1 vCPU nodes, the consequences of which I have described in the Premise section.

Retrospectively, it was an obvious mistake: distributing pods across nodes with a single vCPU does not allow efficient resource utilisation as individual deployments change between idle and active states. Put another way, the more vCPUs you have on the same machine, the more tightly you can pack pods, because when a portion of the pods goes over their requested quota, there are readily available resources to absorb the burst.
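
A rough comparison of the headroom available to absorb such bursts (the allocatable figures are approximations after GKE's system reservations):

n1-standard-1  (≈940m allocatable):    5 pods bursting to 200m = 1,000m, i.e. the node is already saturated
n1-standard-16 (≈15,890m allocatable): 5 pods bursting to 200m = 1,000m, i.e. ~6% of the node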

What worked:

  • I switched to 16 vCPU machines because they provide a balance between fine-grained control when auto-scaling the cluster and sufficient resources per machine to enable tight scheduling of pods that cycle through idle/active states.
  • I used a resource configuration that requests only marginally more than what is needed in the idle state, but has generous limits (see the sketch after this list). This allows many pods to be scheduled on the same machine when the majority of them are idle, while still allowing resource-intensive bursts.
  • I switched to the n2 machine type: n2 machines are more expensive, but they have a 2.8 GHz base frequency (compared with ~2.2 GHz for n1-* machines). We take advantage of the higher clock frequency to process resource-intensive tasks as fast as possible and return pods to the earlier-described idle state as quickly as possible.
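
As a sketch, for the admdesl deployment above this means something along the following lines. The exact values vary per deployment and come from the observed idle/peak usage; as an illustration, combining the idle-level request from the first configuration with the generous limit from the second gives:

resources:
  requests:
    memory: '150Mi'
    cpu: '20m'    # marginally above the observed idle usage (~3-5m)
  limits:
    memory: '250Mi'
    cpu: '500m'   # generous headroom for active bursts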

Current average Node vCPU utilisation is up to 60%. This sounds about right. It will take some time to conclude what the savings are. However, today we are already using less than half the vCPUs that we used at the same time yesterday.

Latest comments (9)

remi bourgarel

Depending on the runtime, 1 vCPU might not be enough and could cause problems; for instance, not every garbage collector run is blocking, but with 1 vCPU they are.

001101

Mostly, you can optimize resources by writing software that uses less.

Gert van den Berg • Edited

Amazon EKS's default networking-related low pod limits likely mess with this. I can run 35 pods on a 2 vCPU node or 58 on a 4 vCPU node. (An 8 vCPU node also supports only 58 pods.) (That is on the t3* series; the m* ones are worse.)

Mohammad Oli Ahad

Thank you so much for taking the time to share it, @gajus! Very useful; we're running production loads where we're heading towards a scale-question scenario and it's great to know this in advance!

Ahmed Dodo

Also, running 100 (1 vCPU) nodes means 100 operating systems consuming CPU and memory, and more consumption by daemon pods, which run per node. Correct me if I'm wrong.

Mohamed Najiullah

Something else that I learned while trying to utilise resources effectively was to try to have a minimal number of daemon pods.

Having a larger number of smaller nodes also results in more resources being allocated to daemon pods, as they need to run on every node. We're now trying to reduce the number of nodes as much as other constraints allow and have seen fewer resources being spent on daemon pods.

Rahul Vashishth

What were the changes in the invoice after this? Were they significant?

This can be of use for many projects.

Julian Frank

@gajus Can't wait to see the result :)

Roy Larsen

It'll be interesting to hear what you saved.

One of the most important things I was told earlier in my career was that "idle resources are wasted resources".

It's even more true since we've moved to cloud platforms with wild auto-scaling like this.