<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Florian Balling</title>
    <description>The latest articles on DEV Community by Florian Balling (@flobilosaurus).</description>
    <link>https://dev.to/flobilosaurus</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1036875%2Ff4952c19-e800-4299-ae9e-613ff054e62e.jpeg</url>
      <title>DEV Community: Florian Balling</title>
      <link>https://dev.to/flobilosaurus</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flobilosaurus"/>
    <language>en</language>
    <item>
      <title>Zero-downtime Kubernetes Ingress Controller swap</title>
      <dc:creator>Florian Balling</dc:creator>
      <pubDate>Thu, 02 Mar 2023 07:32:22 +0000</pubDate>
      <link>https://dev.to/flobilosaurus/zero-downtime-kubernetes-ingress-controller-swap-5coi</link>
      <guid>https://dev.to/flobilosaurus/zero-downtime-kubernetes-ingress-controller-swap-5coi</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjggnhe8rybq3fymzhfg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjggnhe8rybq3fymzhfg3.png" alt="zero downtime meme"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every request from outside the cluster to an application running in a Kubernetes cluster is handled by a so-called &lt;code&gt;ingress controller&lt;/code&gt;. If it is missing, nothing in the cluster is reachable anymore, which makes the ingress controller a critical part of a &lt;code&gt;Kubernetes&lt;/code&gt; powered infrastructure.&lt;br&gt;
We needed to exchange the ingress controller of our Kubernetes clusters because the one in use could not satisfy a new requirement. We managed the swap with zero downtime, and we were able to develop and test the new setup in production before releasing it to our customers.&lt;/p&gt;
&lt;h2&gt;
  
  
  TLDR
&lt;/h2&gt;

&lt;p&gt;We swapped &lt;code&gt;Traefik&lt;/code&gt; for &lt;code&gt;Nginx&lt;/code&gt; with zero downtime by running both in parallel and making both externally available. Our strategy combined &lt;code&gt;weight-based routing&lt;/code&gt; and &lt;code&gt;header-based routing&lt;/code&gt; to implement a &lt;code&gt;blue-green deployment&lt;/code&gt; and a &lt;code&gt;canary release&lt;/code&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Why we needed to switch the ingress controller
&lt;/h2&gt;

&lt;p&gt;We were using Traefik as our ingress controller and were pretty happy with it. At some point, the following requirement popped up: &lt;code&gt;We need to handle a list of around 4500 URLs that need to be redirected to different locations.&lt;/code&gt; &lt;br&gt;
At the time of writing (08-01-23), Traefik only supports one regular expression per &lt;a href="https://doc.traefik.io/traefik/middlewares/http/redirectregex/" rel="noopener noreferrer"&gt;redirect middleware&lt;/a&gt;. So we mapped each of these redirects to its own Traefik middleware and chained them all together into one &lt;a href="https://doc.traefik.io/traefik/middlewares/http/chain/" rel="noopener noreferrer"&gt;chain middleware object&lt;/a&gt;. This chain middleware could then be referenced by Traefik via a &lt;code&gt;Kubernetes ingress annotation&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;networking.k8s.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Ingress&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;traefik.ingress.kubernetes.io/router.middlewares&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;default@redirect-chain&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
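
&lt;p&gt;For illustration, a single redirect middleware and the chain tying all of them together could look roughly like the following sketch. The names, the regex, and the namespace are hypothetical, and the CRD &lt;code&gt;apiVersion&lt;/code&gt; may differ depending on your Traefik version:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical sketch: one of the ~4500 redirect middlewares
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-0001
  namespace: default
spec:
  redirectRegex:
    regex: ^https://example\.com/old-path$
    replacement: https://example.com/new-path
    permanent: true
---
# Chain middleware combining all redirect middlewares into one object
apiVersion: traefik.containo.us/v1alpha1
kind: Middleware
metadata:
  name: redirect-chain
  namespace: default
spec:
  chain:
    middlewares:
      - name: redirect-0001
      - name: redirect-0002
      # ... one entry per redirect
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;router.middlewares&lt;/code&gt; annotation shown above then points Traefik at this chain middleware.&lt;/p&gt;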



&lt;p&gt;This setup worked, but we soon saw a huge increase in the CPU and memory consumption of the Traefik instances running in production. To handle regular customer traffic we needed multiple pods with 12 GiB of memory and 4 CPUs each. During peak traffic hours even this was not sufficient. So, at this point, we decided to try out a different ingress controller.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster layout before the migration
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;Nginx&lt;/code&gt; was a straightforward choice as a replacement: some of our team members had prior experience with &lt;code&gt;Nginx&lt;/code&gt; handling far more redirects without a problem. So we decided to try out the &lt;code&gt;Nginx ingress controller&lt;/code&gt;. Our infrastructure layout before the migration looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f37rwmqpns7k2mt6qdc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6f37rwmqpns7k2mt6qdc.png" alt="Cluster layout before migration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As illustrated in the graphic, we were using an AWS &lt;code&gt;Application Load Balancer&lt;/code&gt; with one &lt;code&gt;listener rule&lt;/code&gt; routing all incoming requests to a &lt;code&gt;target group&lt;/code&gt; called &lt;code&gt;traefik target group&lt;/code&gt;. All instances of our &lt;code&gt;Traefik&lt;/code&gt; ingress controller were registered as targets in this target group, and Traefik then routed every incoming request to the correct service in our cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster layout during migration
&lt;/h2&gt;

&lt;p&gt;We wanted to migrate with zero downtime and with high confidence that nothing would break; and if something did break, we wanted to be able to roll back to the old, working setup. To achieve these goals, we applied the concepts of &lt;code&gt;blue-green deployment&lt;/code&gt; and &lt;code&gt;canary release&lt;/code&gt;. Our setup for the migration looked like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo928jtslwabvth66t5tg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo928jtslwabvth66t5tg.png" alt="Cluster layout during migration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Essentially, we deployed &lt;code&gt;Traefik&lt;/code&gt; and &lt;code&gt;Nginx&lt;/code&gt; side by side, so the cluster was temporarily running two ingress controllers. We duplicated all of our ingress resources to provide Nginx with a configuration equivalent to Traefik's. Afterward, we created a second &lt;code&gt;target group&lt;/code&gt; for our &lt;code&gt;application load balancer&lt;/code&gt; in which all Nginx ingress controller pods were registered as targets. Then we modified the &lt;code&gt;listener rules&lt;/code&gt; of our &lt;code&gt;application load balancer&lt;/code&gt; in the following manner:&lt;/p&gt;
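
&lt;p&gt;As an illustration, a duplicated ingress for Nginx might look like the following sketch. The resource name, host, service, and port are hypothetical; the redirect list itself has to be translated to Nginx's own mechanisms (e.g. annotations or a server snippet):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical copy of an existing Traefik ingress, retargeted at Nginx
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app-nginx
spec:
  ingressClassName: nginx   # routes this ingress to the Nginx controller
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Keeping the two ingress sets distinct via &lt;code&gt;ingressClassName&lt;/code&gt; lets both controllers serve the same hosts without stepping on each other.&lt;/p&gt;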

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;if the request contains header 'use-nginx'&lt;/code&gt; route it to target group &lt;code&gt;nginx target group&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;default&lt;/code&gt;:

&lt;ul&gt;
&lt;li&gt;route 100% of all requests to the &lt;code&gt;Traefik target group&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;route 0% of requests to the &lt;code&gt;nginx target group&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
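
&lt;p&gt;In CloudFormation terms, the two listener rules above could be sketched roughly as follows. All resource names and references are hypothetical, and the same configuration can of course be created via the AWS console or Terraform instead:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical sketch of the ALB routing during the migration
NginxHeaderRule:
  Type: AWS::ElasticLoadBalancingV2::ListenerRule
  Properties:
    ListenerArn: !Ref HttpsListener
    Priority: 1
    Conditions:
      # rule 1: requests carrying the 'use-nginx' header go to Nginx
      - Field: http-header
        HttpHeaderConfig:
          HttpHeaderName: use-nginx
          Values:
            - "*"
    Actions:
      - Type: forward
        TargetGroupArn: !Ref NginxTargetGroup

HttpsListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    LoadBalancerArn: !Ref ApplicationLoadBalancer
    Port: 443
    Protocol: HTTPS
    Certificates:
      - CertificateArn: !Ref Certificate
    DefaultActions:
      # rule 2: weighted default; flipping the weights releases Nginx
      - Type: forward
        ForwardConfig:
          TargetGroups:
            - TargetGroupArn: !Ref TraefikTargetGroup
              Weight: 100
            - TargetGroupArn: !Ref NginxTargetGroup
              Weight: 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this in place, a request such as &lt;code&gt;curl -H 'use-nginx: true' https://example.com&lt;/code&gt; lands on Nginx, while all regular traffic keeps hitting Traefik until the weights are flipped.&lt;/p&gt;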

&lt;p&gt;The first rule implements the canary release strategy, which enables us to test the behavior of our &lt;code&gt;Nginx&lt;/code&gt; setup without interfering with regular customer traffic on &lt;code&gt;Traefik&lt;/code&gt;. Header-based routing is a simple way to make features available to a reduced user group; in our case, this group consists only of members of the infrastructure and QA teams.&lt;br&gt;
The second rule gives us manual control over when to release Nginx as the ingress controller, and if errors appear after the release, we can easily roll back to the Traefik ingress controller (blue-green deployment). Running two ingress controllers in parallel and making both externally available made migrating with zero downtime a breeze.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cluster layout after migration
&lt;/h2&gt;

&lt;p&gt;After a few days with &lt;code&gt;Nginx&lt;/code&gt; live in production, we started to remove the now unused &lt;code&gt;Traefik&lt;/code&gt; parts, such as the &lt;code&gt;traefik target group&lt;/code&gt;, the &lt;code&gt;traefik deployment&lt;/code&gt;, and all Traefik ingress resources, from our infrastructure:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9ebolym18ouz7g9yw24.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm9ebolym18ouz7g9yw24.png" alt="Cluster layout after migration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;In my opinion, this approach is pretty nice because it is simple and generally applicable. It is based on standard building blocks rather than heavier machinery like a service mesh, so every team working with &lt;code&gt;Kubernetes&lt;/code&gt; can make use of it.&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>devops</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
