perber

Lessons Learned: Kubernetes Cluster Updates and Challenges

Introduction

As DevOps engineers or Kubernetes administrators, we are often responsible for managing multiple clusters, applications, and teams. One of our customers, for example, has around 100 developers who are constantly producing new features, so managing their cluster can be a challenging task. In this article, I would like to share our experience of updating a Kubernetes cluster and the challenges we faced in this particular case.

Why are Updates important?

Keeping Kubernetes up to date is crucial for ensuring the smooth functioning of applications and infrastructure. Every release ships new features, bug fixes, and security patches.
Outdated Kubernetes clusters may also be vulnerable to security threats that have since been patched in newer versions.
By keeping your Kubernetes version up to date, you ensure that you are taking advantage of the latest features and patches. It is important to prioritize updates and stay proactive in keeping your clusters healthy.

So, as good DevOps engineers, we want to keep our infrastructure up to date to avoid vulnerabilities and to benefit from the newest features and patches.

Preparing for the Update

We already have regular update dates, but since we knew that this update might break something, we wanted to inform the teams upfront. We updated our regular Outlook reminder for updates with additional information and announced that the cluster might go down and become unreachable.

Weeks before the planned update, we went through the Kubernetes changelog. During this preparation phase, we found out that the old ingress definitions used by the services are not compatible with the new nginx-ingress-controller, and that the new nginx-ingress-controller is not compatible with the old version of the Kubernetes cluster. Usually, you should upgrade version by version, but no worries, we had informed the customer.

Ingresses are resources that allow external traffic to reach the services in the cluster. An ingress acts as a layer between the external network and the internal services, and the routing decisions are made based on rules.

We use the nginx-ingress-controller in our clusters. The ingress controller watches the ingress resources and, based on their configuration, updates its own nginx configuration.
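
To give you an idea of what such a migration looks like: the most common breaking change in this area was the move from the old v1beta1 Ingress API to networking.k8s.io/v1, which is required by Kubernetes 1.22+ and the newer nginx-ingress-controller releases. The snippets below are only an illustration of that change; the service name and host are made up, and the exact versions in our case may differ.

# Old-style ingress (networking.k8s.io/v1beta1, removed in Kubernetes 1.22)
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: example-service          # hypothetical service name
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: example.customer.local
      http:
        paths:
          - path: /
            backend:
              serviceName: example-service
              servicePort: 80

# New-style ingress (networking.k8s.io/v1, expected by current nginx-ingress-controller versions)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-service
spec:
  ingressClassName: nginx
  rules:
    - host: example.customer.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-service
                port:
                  number: 80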

As we are not responsible for the applications themselves, we had to inform the customer that they have to patch the ingress definitions and deploy them to the cluster. As good DevOps engineers, we had already provided them with the PRs; they only needed to merge and deploy them. So, we created an announcement for the developers. We reminded them a few times, but still, not all services were switched to the new ingress configuration.

So, we decided to merge the missing ones and deploy them.

Repository structure

The Kubernetes manifest files for the services live in the same Git repository as the service itself.

We support multiple environments, so we use kustomize to separate the configuration per environment. There is a folder called "base", which contains the manifest files that every environment requires, and on top of that an overlay folder for each environment. Just to mention it: the ingress manifest files are located in every environment's overlay.
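
To make this a bit more concrete, here is a simplified sketch of such a layout; the individual file names are generic placeholders, not the customer's actual repositories:

# Simplified repository layout
# base/
#   deployment.yaml
#   service.yaml
#   kustomization.yaml
# overlays/
#   dev/
#     ingress.yaml
#     kustomization.yaml
#   prod/
#     ingress.yaml
#     kustomization.yaml

# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
  - ingress.yaml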

Creating the pull requests for roughly 135 repositories by hand would be quite complicated, but we have already developed an automation tool that supports us here.

To decouple the actual configuration from the manifest YAML definitions, proper templating could be an option. Then we would only need to adapt the template, but if you are interested in going this way, keep in mind that most of the services should be configured the same way; otherwise, you will end up in branching hell.
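
A very simple form of this would be a single ingress template that is rendered per environment, for example with envsubst; the placeholder names below are made up and only meant as an illustration:

# ingress.template.yaml - rendered per environment, e.g. envsubst < ingress.template.yaml > ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ${SERVICE_NAME}
spec:
  ingressClassName: nginx
  rules:
    - host: ${SERVICE_NAME}.${ENVIRONMENT_DOMAIN}
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ${SERVICE_NAME}
                port:
                  number: 80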

Another approach to overcome this obstacle is to move the configuration of the manifest files into a separate repository. This way, you could update the configuration easily, but it involves more communication on updates and restricts access for the developers; whether that is acceptable depends, in my opinion, on your goals, and it may increase the workload on the DevOps team.

Alternatively, you could follow the "you build it, you run it" principle, but then someone within each team needs to take care of these topics, which I guess would be the best "DevOps" solution. There are some other interesting things to consider here, especially once pipelining and governance come into play.

Kustomize is a tool that allows you to compose and patch manifest files without templating. We also use it in our pipelines to set the image tags. If you are interested in learning more about it, let me know.
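
As an example, setting the image tag in a pipeline only needs an images entry in the kustomization.yaml (which can also be written by a `kustomize edit set image` call in the pipeline); the image name and tag below are made up:

# overlays/dev/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: registry.example.com/example-service  # image name as referenced in the base manifests
    newTag: "1.4.2"                              # tag set by the pipeline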

Executing the Update

Our colleague, who is responsible for updating the Linux infrastructure and for updating Kubernetes to a higher version, checked the overall status of the cluster and took a backup of the system, so that we would be able to roll back.
Then he started to drain the master nodes, patched the system, and updated the Kubernetes core components (kubelet, …). After he had finished with the master nodes, he continued with the worker nodes. As soon as the first worker nodes were done, the application was no longer reachable. Usually, our updates run without any downtime, but because of the incompatibility issue with the nginx-ingress-controller, our cluster wasn't reachable. At the same time, a colleague wanted to give a customer presentation, so we decided to pause the update for now and contacted the program manager.

After the presentation, we continued to work on the update. There were around 13 nodes to update. When all nodes were updated and the newer version of the nginx-ingress-controller was rolled out, the cluster was still not reachable. The colleague who updated the nginx-ingress-controller had, for some reason, removed the nginx namespace, and some important changes were obviously not versioned. All we knew was that the namespace had been deleted, so we started to look at the nginx-ingress-controller logs and found the following warnings and errors:

admission webhook "validate.nginx.ingress.kubernetes.io" denied the request:
-------------------------------------------------------------------------------
Error: exit status 1
2023/02/16 15:34:47 [warn] 848#848: the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg846300610:152
nginx: [warn] the "http2_max_field_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg846300610:152
2023/02/16 15:34:47 [warn] 848#848: the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg846300610:153
nginx: [warn] the "http2_max_header_size" directive is obsolete, use the "large_client_header_buffers" directive instead in /tmp/nginx/nginx-cfg846300610:153
2023/02/16 15:34:47 [warn] 848#848: the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg846300610:154
nginx: [warn] the "http2_max_requests" directive is obsolete, use the "keepalive_requests" directive instead in /tmp/nginx/nginx-cfg846300610:154
2023/02/16 15:34:47 [error] 848#848: opentracing_propagate_context before tracer loaded
nginx: [error] opentracing_propagate_context before tracer loaded
nginx: configuration file /tmp/nginx/nginx-cfg846300610 test failed

Most of the messages are just warnings. If you take a closer look, you will find the actual error message: [error] 848#848: opentracing_propagate_context before tracer loaded.

As the namespace was deleted and the nginx configuration was not versioned, for whatever reason, we had no chance to check the previous configuration.
We checked the services, and some of them had opentracing enabled. This is the relevant part of the ingress resource configuration:

 annotations:
   nginx.ingress.kubernetes.io/proxy-body-size: 20m
   nginx.ingress.kubernetes.io/enable-opentracing: 'true'

As a quick fix, and to give the developers the opportunity to use the cluster again, we removed the line nginx.ingress.kubernetes.io/enable-opentracing: 'true' from the service configurations and deployed the changes.

Now we could take a short break and grab a coffee before digging deeper into the issue.
We started to read the documentation of the nginx-ingress-controller and found out that some configuration was missing in its config map. For whatever reason, nginx is not able to start if this setting is not set correctly; in my opinion, it could also be just a warning instead of an error.

As a side note, we use Jaeger for collecting spans from our applications. OpenTracing is a vendor-neutral API for distributed tracing, which is used to track and measure the execution of code across multiple components and services.

Now that we had found the issue, we updated the nginx-ingress-controller config map:

# extra settings
custom-http-errors: 404,403,500,503
enable-opentracing: 'true'
jaeger-collector-host: $HOST_IP
jaeger-sampler-host: $HOST_IP
# end
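
For context, these keys live in the data section of the controller's ConfigMap. Depending on how the controller is installed, the full object looks roughly like this; the name and namespace below are assumptions, not our actual values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # name depends on your installation / Helm release
  namespace: ingress-nginx         # namespace assumed
data:
  custom-http-errors: "404,403,500,503"
  enable-opentracing: "true"
  jaeger-collector-host: $HOST_IP
  jaeger-sampler-host: $HOST_IP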

We redeployed the services one by one. After that, we took a look at the metrics and at the services themselves to ensure that everything was working before going home.

One important lesson we can take from this example is that all code should be versioned. I also have to say that everyone was really nice, and together we resolved the issue. The production update went much smoother ;)

Best Practices

Here are some best practices that you can hopefully read between the lines:

  • Don't start without a rollback plan.
  • Keep your Kubernetes cluster up to date to ensure stability and security, and prioritize updates proactively.
  • Communicate clearly with all stakeholders before and during the update process.
  • Ensure that all code is versioned; use automation tools like ArgoCD and follow the GitOps approach to identify issues faster (see the sketch after this list).
  • Update the other components in your cluster regularly as well, such as the nginx-ingress-controller.
  • Stay flexible and adapt to unexpected challenges.
  • Remember that keeping your infrastructure up to date is an ongoing process that requires planning and communication.
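
To illustrate the GitOps point from the list above, a minimal ArgoCD Application could look like the sketch below; the repository URL, paths, and names are made up:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-service            # hypothetical application name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/team/example-service.git
    targetRevision: main
    path: deploy/overlays/prod     # kustomize overlay to sync
  destination:
    server: https://kubernetes.default.svc
    namespace: example-service
  syncPolicy:
    automated:
      prune: true
      selfHeal: true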

Conclusion

Updating a Kubernetes cluster can be a challenging task, but with planning, communication, and best practices in place, you can minimize the risks and ensure a smooth update process.

Now that you've read about our experience updating a Kubernetes cluster, what are your thoughts? Have you faced similar challenges, or do you have any advice to share?
I'd love to hear your insights and learn from your experiences.
