Every cluster-external request to an application running in a Kubernetes cluster is handled by a so-called ingress controller. If it is missing, nothing in the cluster is reachable from the outside anymore, which makes the ingress controller a very important part of a Kubernetes-powered infrastructure.
We needed to exchange the ingress controller of our Kubernetes clusters because the one in use could not satisfy a new requirement. We managed the switch with zero downtime, and we were able to develop and test the new controller in production before releasing it to customers.
TLDR
We did a zero-downtime swap of Traefik with Nginx by running both in parallel and making them externally available. Our strategy used the concepts of weight-based routing and header-based routing to implement blue-green deployment and canary release.
Why we needed to switch the ingress controller
We were using Traefik as our ingress controller and were pretty happy with it. At some point, the following requirement popped up: we needed to handle a list of around 4500 URLs, each of which had to be redirected to a different location.
At the time of writing (08-01-23), Traefik only supports one regex expression per redirect middleware. So we mapped each of these redirects to its own Traefik middleware and chained them all together into one chain middleware object. This chain middleware could then be referenced by Traefik via a Kubernetes ingress annotation:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    traefik.ingress.kubernetes.io/router.middlewares: default@redirect-chain
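For context, each individual redirect lived in its own Traefik middleware, and the chain middleware simply strings them together. A minimal sketch of what that looks like, with resource names and URLs that are purely illustrative (the real chain contained roughly 4500 entries):

```yaml
# Illustrative sketch only - names and URLs are made up.
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-0001
spec:
  redirectRegex:
    regex: ^https://example\.com/old-path-1$
    replacement: https://example.com/new-path-1
    permanent: true
---
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-chain
spec:
  chain:
    middlewares:
      - name: redirect-0001
      # ... one entry per redirect, ~4500 in total
```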
This setup worked, but we soon saw a huge increase in the CPU and memory consumption of the Traefik instances running in production. To handle regular customer traffic we needed multiple pods with 12 GiB of memory and 4 CPUs each, and during peak traffic hours even that was not sufficient. So, at this point, we decided to try out a different ingress controller.
Cluster layout before the migration
Nginx was a straightforward choice for a replacement, since some of our team members had prior experience with Nginx handling far more redirects without a problem. So we decided to try out the Nginx ingress controller.
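The exact Nginx-side redirect configuration is out of scope for this post, but to give an idea of why Nginx copes well with large redirect lists, here is one possible approach with ingress-nginx, sketched purely as an illustration (paths and resource names below are made up, not our actual configuration): an nginx map wired in through the controller ConfigMap and acted on in a server snippet.

```yaml
# Illustrative sketch only - not our actual configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # name/namespace depend on the installation
  namespace: ingress-nginx
data:
  # An nginx map keeps lookups cheap even with thousands of entries.
  http-snippet: |
    map $request_uri $redirect_target {
      /old-path-1  https://example.com/new-path-1;
      /old-path-2  https://example.com/new-path-2;
      # ... thousands more entries
    }
  # Added to every server block: redirect whenever the map has a hit.
  server-snippet: |
    if ($redirect_target) {
      return 301 $redirect_target;
    }
```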
Our infrastructure layout before the migration looked like this:
As illustrated in the graphic, we were using an AWS Application Load Balancer with one listener rule routing all incoming requests to a target group called traefik target group. All instances of our Traefik ingress controller were registered as targets in this target group, and Traefik routed every incoming request to the correct service in our cluster.
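Expressed as infrastructure code, that pre-migration routing boils down to a single forward action. A rough CloudFormation-style sketch with logical names of our own choosing, not the real templates:

```yaml
# Sketch only - resource names are hypothetical.
HttpsListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref ApplicationLoadBalancer
    Port: 443
    Protocol: HTTPS
    Certificates:
      - CertificateArn: !Ref Certificate
    DefaultActions:
      - Type: forward
        TargetGroupArn: !Ref TraefikTargetGroup   # everything goes to Traefik
```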
Cluster layout during migration
We wanted to migrate with zero downtime and with high confidence that nothing would break, and if something did break, we wanted to be able to roll back to the old working setup. To achieve these goals, we applied the concepts of blue-green deployment and canary release. Our setup for the migration looked like this:
Essentially, we deployed Traefik and Nginx side by side, so our cluster was now running two ingress controllers. We duplicated all of our ingress resources to provide Nginx with the same configuration we had for Traefik. Afterward, we created a second target group for our application load balancer, in which all Nginx ingress controller pods were registered as targets. Then we modified the listener rules of our application load balancer in the following manner:
- if the request contains the header 'use-nginx': route it to the nginx target group
- default:
  - route 100% of all requests to the traefik target group
  - route 0% of all requests to the nginx target group
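In CloudFormation-style terms (again a sketch with hypothetical names, not our real templates), those two rules look roughly like this: a header-based rule forwarding to the nginx target group, and a weighted default forward that initially keeps 100% of the traffic on Traefik.

```yaml
# Sketch only - resource names and priorities are hypothetical.
NginxCanaryRule:
  Type: AWS::ElasticLoadBalancingV2::ListenerRule
  Properties:
    ListenerArn: !Ref HttpsListener
    Priority: 1
    Conditions:
      - Field: http-header
        HttpHeaderConfig:
          HttpHeaderName: use-nginx
          Values: ["*"]              # any value sends the request to Nginx
    Actions:
      - Type: forward
        TargetGroupArn: !Ref NginxTargetGroup

# The listener's default action becomes a weighted forward.
HttpsListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref ApplicationLoadBalancer
    Port: 443
    Protocol: HTTPS
    DefaultActions:
      - Type: forward
        ForwardConfig:
          TargetGroups:
            - TargetGroupArn: !Ref TraefikTargetGroup
              Weight: 100            # all regular traffic stays on Traefik
            - TargetGroupArn: !Ref NginxTargetGroup
              Weight: 0              # flipped at release time
```

Releasing Nginx then amounts to flipping the two weights, and rolling back is just flipping them again.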
The first rule implements the canary release strategy, which enables us to test the behavior of our Nginx setup without interfering with regular customer traffic on Traefik. Header-based routing is a simple way to make features available to a reduced user group; in our case, this user group consisted only of members of the infrastructure and QA teams.
The second rule gives us manual control over when to release Nginx as the ingress controller, and if errors appear after the release, we can easily roll back to the Traefik ingress controller (blue-green deployment). Running two ingress controllers in parallel and making both externally available made our zero-downtime migration a breeze.
Cluster layout after migration
After a few days with Nginx live in production, we started to remove the now unused Traefik parts, such as the traefik target group, the traefik deployment, and all ingress resources for Traefik, from our infrastructure:
Final thoughts
In my opinion, this approach is pretty nice because it is simple and generally applicable. It is based on standard components rather than more complicated ones like service meshes, so every team working with Kubernetes can make use of it.