
A Retrospective of Working with Bare Metal Kubernetes, or To There and Back

The Kubernetes Platform team at Quadcode implements, supports and maintains Kubernetes and all the processes around it. Over five and a half years, our clusters and approaches have changed and evolved. In this article we'll tell you how we started, where we ended up, and how we managed to make maintaining bare metal clusters comfortable.

Our Clusters and Team

There are now five people on our team, but over the entire existence of Kubernetes in the company, about 17 engineers have worked with it.

We have three clusters in three environments:

  • Prod—the largest cluster.
  • Preprod—together with prod this is used for most of the company's stateless applications.
  • Infra—mainly used for short-lived applications, for example, GitLab, Jenkins Runner, browser tests and the like.

Number of nodes, capacity per node, and total capacity

Prerequisites. Before 2017

The main prelude for the appearance of Kubernetes was the same as for many other companies: refactoring a monolith into microservices.

Microservices lived in LXC containers provisioned with Ansible. Servers were issued through the following process:

  1. The team requests a server through an internal form: they pick a configuration and submit the request.
  2. An email is generated and sent to the data center.
  3. The data center hands the server over to the intermediate Infra Support team, which performs the initial configuration; this initial setup should be the same on all company servers.
  4. The server is passed to the team that requested it.
  5. The server is then fine-tuned to the needs of the team.

At the same time, a test platform is created that lets you spin up microservices packaged in Docker on a temporary machine in DigitalOcean. Along with a microservice, everything it depends on is started as well. The developer gets the IP, goes to the temporary build and tests their microservice.

We really like how Docker works as a uniform unit of delivery, and we want Docker to become the single runtime for all our microservices: it seems convenient.

But moving everything to Docker and running bare Docker containers on hardware or in the cloud is a road into an abyss of manual work. Besides restarts, you also need to solve fault tolerance for the virtual machines running Docker, and you need to manage the containers themselves. So we start selecting an orchestration tool, and the choice falls on Kubernetes.

Choice of an Orchestration Tool. Q2 2017

Having chosen Kubernetes as the orchestration tool, we start deciding where to deploy it: on hardware or in the cloud. The catch is that we have no expertise in either. So we start thinking and weighing what would work best for us.

Self-hosted Kubernetes is understandable to us in that it can be touched. We know the servers we'll run it on, we know how they're issued, how they're configured, and what's inside them. Plus, these servers will be delivered directly to our data center, which satisfies the wishes of the business and security: everything should be inside the internal network perimeter and have low latency, which matters to us as a fintech company.

You can't touch the cloud. On the one hand, it's unclear what kind of hardware is there. On the other hand, it's clear that there'll be no need to rush around with red eyes when someone pulls the power from the rack. At least it won't be you running around.

There are also marketing rumors that the cloud is easier to manage. The community has little real experience with it on loaded projects, so it's hard to assess how true these rumors are. At the same time, the demands of the business and security don't go away: to ensure an internal network perimeter and low latency, you have to go to the NOC engineers, set them tasks and work with them. Not that they're evil, quite the opposite, but our task is to build an orchestrator, not a hybrid infrastructure. Plus, at this point we don't yet have any prod infrastructure or deep expertise in the cloud.

And the cost, even if estimated now:

Self-hosted Kubernetes: €55,800 per year, cloud: €78,132 per year

Surprise (not), it's cheaper to rent hardware. Plus, in the case of the cloud, in addition to computing resources, you will need:

  • VPC;
  • TGW;
  • Direct connect;
  • ALB, ELB, NLB, NLP;
  • VPC peering;
  • and so on.

And you need people with a background not only in K8s, but in all these abbreviations as well. As a result, the questions come down to the following:

  • Whose expertise are we willing to invest in: internal Kubernetes expertise, or expertise in, say, AWS?
  • When something goes wrong after an application lands in K8s (and something always goes wrong), who will be asked what exactly the problem is: people inside the company, or the technical support of, say, Amazon?

Regardless of whether we have a bare metal or a cloud cluster, we'll need to develop processes for deployment, application integration, application maintenance, authorization and access control. And again, we'll have to pay the people who do this, regardless of what kind of cluster we have.

Having considered all the reasons, we decide that it's more interesting for us to invest in an internal team and internal expertise. And we raise our bare metal cluster.

OOM meme

Next, we need to choose which bare metal cluster. We find four solutions:

  • kOps;
  • Kubernetes The Hard Way;
  • Rancher;
  • Kubespray.

We dismiss kOps immediately, since at that moment it only raised clusters in the cloud, on EC2. Kubernetes The Hard Way is a wonderful manual that we learn from: command by command, you raise a cluster and figure out what it needs and how it works. But supporting a cluster command by command is inconvenient, and besides, we're all fashionable here and want automation. And if we're going to automate, why not take a ready-made tool.

That leaves Rancher and Kubespray. Kubespray wins: for the most part, it automates The Hard Way. Our main arguments for Kubespray are the following:

  • It's written in Ansible. We've been working with Ansible for a long time; it's easy for us to read, and easier than the Rancher source code.
  • Even back then, Kubespray let you flexibly assemble the mix of components that make up Kubernetes.
  • Kubespray works transparently with the environment, just like Ansible.
  • Kubespray was developed by Mirantis, one of the pioneers of Kubernetes at the time. We trusted their expertise and relied on it.

Implementation. Q3 2017

We roll out our three clusters with Kubespray and deploy the first microservices. For us, a golden age of cluster maintenance begins, covering all the standard operations for Kubernetes and for bare metal:

  • Cluster rollout.
  • Cluster scaling.
  • Developing a process for onboarding applications into K8s, authorization, recommended deployment pipelines, etc.
  • Upgrading node software: Docker, kernel, etc.
  • Upgrading the cluster version.
  • Renewing K8s and CNI certificates, etc.

Prerequisites for Refactoring. 2017–2019

For the next two years, clusters grow from 5 to ~20 nodes. And the first problems begin.

One of the problems showed up when we moved to a new data center. The story goes like this. Engineers from our data center came to us and said: "The old premises are being closed; there'll be new ones." It would seem this shouldn't cause any difficulties: we add nodes in the new room and decommission them in the old one, add masters in the new room and decommission them in the old one. In theory, there shouldn't be any downtime either. In practice, everything worked for the nodes, but not for the masters, because at that time Kubespray didn't support scaling masters.

We thought we could scale the masters ourselves; we already had two years of experience with Kubernetes, which should have helped. In the test environment everything works. On Preprod we run into downtime and realize that we're not ready to take that risk with Prod. So we settle on the idea that the simplest solution, physically moving all the masters from one room to the other, will also be the most reliable.

We pull the servers out of one room, load them into a car and drive them to the new room. In the end, all nine masters were moved successfully, but a lingering unpleasant aftertaste remained: in this day and age, we had to physically carry servers from one location to another.

But this wasn't the only problem that reared its head. Kubernetes maintenance operations began to take much longer over time. Here are some examples:

Execution time in 2017 and 2019

Cluster scaling began to take 4 hours. This is because Kubespray can raise almost any combination of Kubernetes components, and a large number of its tasks are simply skipped. As the number of nodes grows, the playbook spends a long time just skipping those tasks.

Upgrading node software took several days, because it's an operation that sometimes requires interaction with the data center and with that same intermediate team. For example, to update the operating system, you need to redeploy the server. It's time-consuming, tedious and extremely unpleasant.

We had to upgrade from version 1.12 to 1.14 by rolling over from one cluster to another. It took a whole quarter, because we had to create redeploy tasks for every team in the company and wait until they completed them.

So as not to be unsubstantiated, let's look at the execution time of Kubespray on a sample configuration. For the measurements we took the latest version of Kubespray and ran it on different numbers of EC2 instances. Here is a graph of execution time versus the number of nodes:

A graph of the execution time depending on the number of nodes

Compared to 2019, the execution time has decreased from 4 hours to 2, but it still grows, by tens of minutes for every 3 nodes added. That's a lot. Imagine a 50-node cluster on which you need to roll out Kubespray. Running the Kubespray Ansible playbook for 4 hours is annoying. Plus, Ansible may return an error, and then you'll need to spend another 4 hours.

At this stage, we came to the following conclusions:

  • Kubespray spends a long time skipping tasks.
  • We don't need most of Kubespray's additional tasks.
  • Kubespray doesn't support all the operations we need, for example scaling masters.
  • The rich variability of cluster configurations does us more harm than good. We don't need anything but our own configured, understandable, working cluster with our variables.

By that time, we had already figured out the Kubespray variables and how CNI and Kubernetes itself work. That knowledge should be enough to write the operations ourselves, implement them and maintain them. So we decide to write our own playbook and drop all the Kubespray tasks our infrastructure doesn't need. And if some binary shows up that takes care of the remaining tasks around cluster initialization and node join, great.

Tool Selection. Q3 2019

Here we enter the second lap of our round trip and return to the choice of tool.


Next, we look for a binary that will take over the tasks associated with init and join, and almost immediately we find Kubeadm. It suits us because:

  • It was out of beta.
  • Our team had gained enough context to understand which processes we could delegate to it.
  • Kubeadm was becoming the recommended way to operate a K8s cluster, including in the official documentation.

Implementation. Q4 2019

We write our own playbook, replacing a large number of tasks with execs of Kubeadm. We implement it and migrate from the Kubespray cluster to the Kubeadm-based one we wrote ourselves, on version 1.16.

The playbook looked something like this:

# Create audit policy files on each master (for apiserver and falco)
- hosts: kube-master
  become: yes
  roles:
    - { role: kubernetes-audit-policy, tags: "kubernetes-audit-policy" }
  tags: ["setup-cluster", "k8s-audit-policy"]

# Setup first master (kubeadm init). It is executed only in the first play, on the first master.
- hosts: kube-master[0]
  become: yes
  roles:
    - { role: kubeadm-init, tags: "kubeadm-init" }
  tags: ["setup-cluster", "kubeadm-init"]

# Generate join tokens and join new masters/nodes in cluster.
- hosts: kube-master[0]
  become: yes
  roles:
    - { role: kubeadm-join, tags: "kubeadm-join"}
  tags: ["setup-cluster", "kubeadm-join"]

# Setup calico using helm
- hosts: kube-master[0]
  become: yes
  roles:
    - { role: kubernetes-networking, tags: "kubernetes-networking"}
  tags: ["setup-cluster", "kubernetes-networking"]

# Add labels and annotations on nodes to manage taints using helm:
# app.kubernetes.io/managed-by: Helm
# meta.helm.sh/release-name: node-taints-labels
# meta.helm.sh/release-namespace: kube-system
- hosts: kube-master[0]
  become: yes
  roles:
    - {role: kubernetes-annotations-labels, tags: "kubernetes-annotations-labels"}
  tags: ["setup-cluster", "kubernetes-annotations-labels"]

################################### Install and configure rsyslog ##################################
# Install and configure rsyslog binary
- hosts: all

Operations still consist of the same standard set. At the same time, our own playbook let us drastically reduce cluster scaling time: from 4 hours to 9–10 minutes for 15 nodes:

The cluster scaling time graph

At that moment we thought:

Wow, victory! It's not at all clear when we'll hit the next 4-hour mark. We've grown by 15 nodes in two years, and the current solution will probably be enough for a long time.

And we'd never been so wrong. Kubernetes earns a reputation within the company as a very stable, trusted platform. So a task comes from the business: now almost all stateless microservices, where possible, need to be moved into Kubernetes.

Another two years pass. The cluster grows from 20 to ~100 nodes. Operation times increase again, and cluster scaling begins to take 1.5 hours. If we need to upgrade kernels or Docker, it takes even longer: where we previously had to reboot 20 nodes or restart Docker on them, we now need to do the same on a hundred.

Execution time in Q4 2019 and in 2021

Our setup didn't yet support upgrading the cluster version from 1.16, so the move to 1.17 took six months, because the number of microservices in the cluster had become simply titanic. Starting with version 1.17 we wrote our own upgrade playbook, and since then, up to version 1.22, we've upgraded by pressing a button on a Jenkins job and periodically checking on the process.
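
An upgrade playbook for a kubeadm-based cluster essentially automates the standard kubeadm upgrade flow, which in shell form looks roughly like this (the version and node name are placeholders, not our actual values):

# On the first master: check and apply the control plane upgrade
kubeadm upgrade plan
kubeadm upgrade apply v1.17.x   # placeholder version

# On each remaining master and worker, one node at a time
kubectl drain <node-name> --ignore-daemonsets
kubeadm upgrade node
# ...update the kubelet and kubectl packages on the node, then:
systemctl restart kubelet
kubectl uncordon <node-name>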

It Seems Like We've Been Through All This. Q3 2021

An attentive reader will notice that the growth in execution time of standard operations was discussed literally a section ago. We knew perfectly well ourselves that we were back where we had started.


Here our team asks a logical question: does anything need to change? Kubernetes works stably. An hour and a half per operation is unpleasant, but it's still not four hours. Making changes for the sake of change is pointless; there has to be a good motivation.

Motivation came from the business:

You've done a great job over the last four years. Most of the microservices are in Docker, and we've achieved the goal of a single runtime. But there's a problem: we have Kubernetes in prod and Docker Swarm with Docker containers on the test platform. As a result, people deploy in different ways, and microservices behave differently on the test platform and in prod. We'd like everything to be the same.

There are two options: either remove Kubernetes from prod, or add Kubernetes to the test builds. We don't want to remove K8s, because it's performing very well. We decide that we need to bring K8s to the test builds instead, on a test platform that, through our own efforts, had by then sprawled into additional clouds, including Amazon. We decide to raise temporary Kubernetes clusters on EC2 instances as a prototype. But we need technical specifications: how long it should take to raise such clusters, how many of them, and with how many nodes.

The business gives the specifications:

  • Time to raise < 3 minutes on EC2.
  • Number of nodes—discretionary.
  • Number of clusters—discretionary.

The number of builds on the test platform is from 200 to 1000 per day. Here we realize that the current flow won't work: creating EC2 instances, filling in the inventory, going to Jenkins, pressing the playbook button and handing the cluster IP to the developer 200–1000 times a day. We could make a webhook for the Jenkins jobs and automate the Ansible inventory, achieving some kind of automation. But why automate rolling out jobs and an Ansible playbook when what really needs automating is creating a cluster?

We reject EKS/GKE right away: anyone who has created one knows that it doesn't come up in anything close to three minutes.

Our team's interest in the coming changes is two-fold:

  1. We can get constant, daily verification of our configuration on running test environments.
  2. Somewhere in here there must also be an answer to how not to spend three hours rolling out to hardware only to catch an Ansible error.

Ansible error dev/null

Decomposition of the Process


For four years we had essentially been pushing a playbook onto our hardware, and now we had to do something different. To figure out what exactly, we set aside all the accumulated tooling and went back to the source, namely Kubernetes The Hard Way.

The Hard Way breaks raising a K8s cluster into nine stages. Each stage has its own set of Bash commands, about a hundred in total for each master or node:

Stages to raise the Kubernetes cluster

We could automate all these hundreds of commands for each node with some kind of script, but we want something simpler. And the Kubernetes documentation offers a way to make it easier: Kubernetes provides Kubeadm for cluster management. If you look at what Kubeadm does, you can see that for the first master, the five steps from The Hard Way are wrapped into a single Kubeadm init command with a pre-generated config. For additional masters, all operations are wrapped into four commands:

Kubeadm commands
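
For reference, the general shape of those commands as the kubeadm documentation describes them (endpoints, tokens and file names here are placeholders rather than our actual config):

# First master: everything is wrapped into one init with a pre-generated config
kubeadm init --config /etc/kubernetes/kubeadm-config.yaml --upload-certs

# Additional master: join as a control-plane node
kubeadm join <api-server>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>

# Worker node: plain join
kubeadm join <api-server>:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>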

If even four commands seems like a lot, there's a life hack: you can generate the kubelet certificates in advance:

client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
client-key: /var/lib/kubelet/pki/kubelet-client-current.pem

Add them to the node and point the kubelet at the cluster CA in /etc/kubernetes/kubelet/kubelet-config.json:

"clientCAFile": "/etc/kubernetes/pki/ca.crt"

Then run kubelet with the --register-node flag. It will register itself in the cluster as a node, and you don't even have to run kubeadm join (although you will have to run the commands to generate and place the certificates, but for some that may be more transparent).
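
A minimal sketch of this approach, assuming the paths from the snippets above and that a kubeconfig for the kubelet is already in place (the flags are standard upstream kubelet flags):

# Pre-generated client cert/key referenced by the kubelet kubeconfig:
#   /var/lib/kubelet/pki/kubelet-client-current.pem
# Cluster CA referenced by kubelet-config.json:
#   /etc/kubernetes/pki/ca.crt

# Start kubelet; with --register-node it registers itself with the API server,
# so no kubeadm join is needed on this node
kubelet \
  --config=/etc/kubernetes/kubelet/kubelet-config.json \
  --kubeconfig=/var/lib/kubelet/kubeconfig \
  --register-node=true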

So, we have four commands that need to be executed correctly, at the right time, with the right config. In order to execute them correctly, we begin to make a prototype.

Prototyping

Step 1. First of all, we take our playbook, make a Dev environment, and replace the static values of the variables in group_vars with placeholders. We roll the configs out onto an EC2 instance without the Kubeadm execs. We get the following configs:

The configs on an EC2 instance
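
For example, one of the templated configs might look something like this (the file name, fields and placeholder names here are illustrative, not our real variables):

~ cat /etc/kubernetes/kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
controlPlaneEndpoint: "API_SERVER_DOMAIN_PLACEHOLDER:6443"
etcd:
  external:
    endpoints:
      - "https://ETCD_DOMAIN_PLACEHOLDER:2379"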

Step 2. We take a snapshot of this EC2 instance and use the test platform to launch new machines from it, supplying Amazon metadata for each. The metadata looks something like this:

~ cat ud-decoded
{
 "domainsEnv": "BASE_DOMAIN=* API_SERVER_DOMAIN=* ETCD_DOMAIN=*",
 "masterAddr": "master01.build126.*",
 "kubenodesAddr": "node01.build126.* node02.build126.* node03.build126.*",
 "myDomain": "node03.build126.*",
 "myRole": "node"
}

Amazon AWS metadata

The AWS metadata contains nothing more than the values for the placeholders and the node's role: master or node.

Step 3. We write a cloud-init script that goes to the Amazon metadata server, fetches the metadata and substitutes the values into the placeholders. Depending on the node's role (main master / additional master / worker node), the script runs the corresponding operations and raises the cluster on each node independently of the others:

Diagram: how each node bootstraps itself depending on its role
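
A minimal sketch of what such a script could look like, assuming the field names from the user data example above; the paths, placeholder names and the simplification to two roles (master/node) are ours for illustration:

#!/usr/bin/env bash
set -euo pipefail

# Fetch our user data from the EC2 instance metadata service
UD=$(curl -s http://169.254.169.254/latest/user-data)

MY_ROLE=$(echo "$UD"   | jq -r .myRole)
MY_DOMAIN=$(echo "$UD" | jq -r .myDomain)
MASTER=$(echo "$UD"    | jq -r .masterAddr)

# Substitute the placeholders left in the configs at Step 1
sed -i "s|MY_DOMAIN_PLACEHOLDER|$MY_DOMAIN|g; s|MASTER_ADDR_PLACEHOLDER|$MASTER|g" \
  /etc/kubernetes/kubeadm-config.yaml

case "$MY_ROLE" in
  master)
    # main master: initialize the control plane
    kubeadm init --config /etc/kubernetes/kubeadm-config.yaml
    ;;
  node)
    # worker: join using a pre-generated join config
    kubeadm join --config /etc/kubernetes/kubeadm-join.yaml
    ;;
esac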

We launch the cluster instances from the snapshot, with the configs and the script baked in and the metadata attached, and wait 180 seconds for the init and join of all the nodes—PROFIT.

We get the following prototyping results:

  1. The time to raise a cluster on EC2 really is < 3 minutes. We wait up to 180 seconds; if the master doesn't see all the nodes listed in the metadata within that time, the script fails with exit code 1 (a sketch of this wait is shown after the list).
  2. There can be as many clusters as you want, as long as there are instances available in the cloud.
  3. It's unclear how to scale and update everything on the fly, but for now that task doesn't arise, because these builds don't live longer than a week.
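
The wait itself can be as simple as polling the node count on the main master until everything listed in kubenodesAddr has joined. A sketch under the same assumptions as above:

# kubectl uses the admin kubeconfig produced by kubeadm init
export KUBECONFIG=/etc/kubernetes/admin.conf

# Expect every worker from the metadata plus this master itself
EXPECTED=$(( $(echo "$UD" | jq -r .kubenodesAddr | wc -w) + 1 ))

# Poll for up to 180 seconds, then fail the build with exit code 1
for i in $(seq 1 180); do
  READY=$(kubectl get nodes --no-headers 2>/dev/null | wc -l)
  [ "$READY" -ge "$EXPECTED" ] && exit 0
  sleep 1
done
exit 1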

We think it turned out well. Which is strange, because when does anything turn out well on the first try? So we decide to look at how the vendors have done it.

And How about the Vendors?

We take SSH in one hand, grep in the other, and start with the EKS nodes.

EKS. We grep the word eks in cloud-init-output.log and find the script /etc/eks/bootstrap.sh:

Screenshot: the main function of /etc/eks/bootstrap.sh

In the screenshot is the script's main function with all the if's removed. If you look closely, you can see that:

  1. Certificates and variables are fetched from metadata via the AWS CLI.
  2. sed substitutes these values into the placeholders in the configs.
  3. The kubelet service is started via systemctl.

Basically, this is an automation of The Hard Way.
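
This isn't the actual bootstrap.sh, but the shape of what it does can be sketched in a few lines (the cluster name variable, placeholder name and file paths are assumptions):

# 1. Fetch the cluster endpoint and CA via the AWS CLI
ENDPOINT=$(aws eks describe-cluster --name "$CLUSTER_NAME" \
  --query 'cluster.endpoint' --output text)
CA_B64=$(aws eks describe-cluster --name "$CLUSTER_NAME" \
  --query 'cluster.certificateAuthority.data' --output text)
echo "$CA_B64" | base64 -d > /etc/kubernetes/pki/ca.crt

# 2. sed swaps the placeholders in the pre-baked kubelet kubeconfig
sed -i "s|APISERVER_ENDPOINT_PLACEHOLDER|$ENDPOINT|g" /var/lib/kubelet/kubeconfig

# 3. Start the kubelet service
systemctl daemon-reload
systemctl restart kubelet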

DigitalOcean. There's nothing in the DO cloud-init logs, but if you look at the cloud-init scripts, you can find the 000-k8saas script:

Screenshots: the 000-k8saas cloud-init script

Inside it you can see the following:

  1. curl fetches the metadata and writes it to a file on the file system.
  2. From there, the metadata is exported into env variables.
  3. If the node has the master role, the bootstrap-master script is run; if the kubelet role, then bootstrap-kubelet is run.

You can't see the scripts themselves, but the 10-kubeadm.conf file and the Kubeadm binary on the node itself hint at how the control plane, and the node in particular, is raised.

The metadata looks like this. We were very surprised to see role logic similar to ours:

Screenshot: DigitalOcean node metadata

GKE. So far we've seen roughly the same architecture from two vendors. Then we went to GKE and, at first glance, saw something different. We grepped for kube in cloud-init-output.log and saw the following services:

Screenshot: kube-related services in cloud-init-output.log

We wondered what these interesting services were. We went to systemctl and found kubernetes.target, which includes the kube-node-configuration and kube-node-installation services:

Screenshot: kubernetes.target and its services

Here we were hoping for some special know-how, but in the end these are just oneshot services that point to scripts:

Screenshot: the oneshot service definitions

The functions in both scripts are about the same. As an example, let's look at the download-kube-masters-certs function:

Screenshot: the download-kube-masters-certs function

Here's what happens in it:

  1. curl fetches the metadata from the metadata.google.internal server.
  2. A one-line Python snippet takes the YAML-formatted metadata, in this case certificates, and exports it into an env variable (see the sketch after this list).
  3. Further along the script there are the same sed's and a systemctl start of kubelet.
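
The general pattern of that function looks roughly like this. This is not GKE's actual code; the metadata attribute name and YAML key are assumptions, and the one-liner assumes PyYAML is available:

# GCE metadata requires the Metadata-Flavor header
KUBE_ENV=$(curl -s -H "Metadata-Flavor: Google" \
  "http://metadata.google.internal/computeMetadata/v1/instance/attributes/kube-env")

# Pull one field out of the YAML-formatted metadata into an env variable
export KUBELET_CERT=$(echo "$KUBE_ENV" | python3 -c \
  'import sys, yaml; print(yaml.safe_load(sys.stdin)["KUBELET_CERT"])')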

In general, the building blocks for raising a cluster this way are more or less the same everywhere:

  • There's some kind of metadata server; in the clouds, these are their internal servers.
  • There's some kind of binary that fetches this metadata: some use the AWS CLI, some use curl.
  • There's some kind of binary that substitutes the metadata into the configs. Generally, this is sed.
  • There's something that raises the cluster, either in the form of Kubeadm or in The Hard Way style.
  • And there are a bunch of systems that generate this metadata, scattered across the cloud, which is why the time to raise a control plane there is about half an hour. In our case, everything is faster, because we don't have that many of these systems yet.

Implementation (but Different)


Our hardware clusters, with all their problems, are still in place. We want them to eventually operate in much the same way as the cloud clusters. For this we need:

What do we need for HW and cloud clusters

It's possible to write all this, but we need to decide how and with what tools. As always, that's the hardest question. We're moving in that direction, and it's already clear that when we get there, we'll run into new problems.

The first thing that suggests itself is that this will be a pull system with all its problems: someone makes a mistake, it rolls out without our knowledge, something breaks, and it's unclear where and how. As for fail-fast A/B testing of the configuration, we hope to get it out of the box, because in this scheme bare metal clusters become a special case of our cloud clusters raised on EC2 instances.

At the same time, the system itself will objectively become more complicated. Before, we added the inventory and went to the log to see what Jenkins and the Ansible playbook were doing. Now there are provisioner instances, metadata providers, a snapshot provider if we need one, and a script that goes to the metadata server. How they all interact with each other is a separate question.

On the one hand, these are all new services; on the other, nothing prevents us from building them on microservice principles. We know how to work with those: add monitoring and logging, build observability, and all of this turns from minuses into pluses. At this point, microservices reach infrastructure maintenance itself. That's probably a good thing.

Tasks for the Future

If we move from theory to the tasks we actually plan to accomplish, we really want:

  • To update kernels on 100 nodes within an hour.
  • To be confident in disaster recovery.
  • To have all environments roll out the same way.
  • To stop depending on the size or number of clusters.

If we talk about the order in which we'll go further, then it's approximately the following:

  1. Add scaling and upgrades to the EC2 clusters.
  2. Describe all the additional systems needed around bare metal.
  3. Coordinate the toolkit with other teams and the business.
  4. Implement.
  5. Restart the cluster nodes.

Retrospective Conclusions

Looking back at the past 5.5 years, we believe that, by and large, our path was right. The decision to invest in the team's internal expertise eventually let us build something similar to the vendors' solutions, but on bare metal, combining the ability to fully control the K8s configuration with the ability not to spend a lot of time supporting K8s itself.

But what would we do if we were implementing clusters from scratch now? The business would most likely say: "Let's go to EKS; there's expertise in the community, and everything seems to work." However, all the questions about who figures things out when something goes wrong, tech support or people inside the company, remain. And it seems that in the process of working through them, we would arrive at the choice of our own solution anyway.

Most likely, it would also have been a push configuration. Looking from scratch at all the systems needed to raise a cluster in three minutes can be scary. Perhaps we would have been put off by needing metadata, scripts, and something written around the cluster when all we wanted was a cluster. But the iteration between a push configuration and a pull configuration would go faster now: there's already community experience, and there are people who can share it.
