Damien Mathieu for Heroku

Dissecting Kubernetes Deployments

This post originally appeared on Heroku's Engineering Blog.

Kubernetes is a container orchestration system that originated at Google and is now maintained by the Cloud Native Computing Foundation. In this post, I am going to dissect some Kubernetes internals, specifically Deployments and how gradual rollouts of new containers are handled.

What Is a Deployment?

This is how the Kubernetes documentation describes Deployments:

A Deployment controller provides declarative updates for Pods and ReplicaSets.

A Pod is a group of one or more containers which can be started inside a cluster. A Pod started manually is not going to be very useful though, as it won't automatically be restarted if it crashes. A ReplicaSet ensures that a Pod specification is always running with a set number of replicas. It allows starting several instances of the same Pod and will restart them automatically if any of them crash. Deployments sit on top of ReplicaSets and allow seamlessly rolling out new versions of an application.
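To make that layering concrete, here is a minimal sketch of a Deployment object built with the same apps/v1beta1 types used later in this post. The name web, the app=web label, the image, and the replica count of 10 are made up for illustration, and newer client-go releases have since moved Deployments to apps/v1. The sketch only constructs the object; it doesn't create it in a cluster.

package main

import (
  "log"

  "k8s.io/api/apps/v1beta1"
  apiv1 "k8s.io/api/core/v1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// newDeployment builds a Deployment: a desired replica count plus a Pod
// template. The controller turns this into a ReplicaSet, which in turn
// starts and supervises the Pods.
func newDeployment() *v1beta1.Deployment {
  return &v1beta1.Deployment{
    ObjectMeta: metav1.ObjectMeta{Name: "web"},
    Spec: v1beta1.DeploymentSpec{
      Replicas: int32Ptr(10),
      Template: apiv1.PodTemplateSpec{
        ObjectMeta: metav1.ObjectMeta{Labels: map[string]string{"app": "web"}},
        Spec: apiv1.PodSpec{
          Containers: []apiv1.Container{
            {Name: "web", Image: "example/web:1.0.0"},
          },
        },
      },
    },
  }
}

func main() {
  log.Printf("deployment spec: %+v", newDeployment())
}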

Here is an example of a rolling deploy in a basic app:

[Animation: a rolling deploy of a 10-Pod Deployment, one Pod at a time]

What we can see in this animation is a 10-Pod Deployment being rolled out, one Pod at a time. When an update is triggered, the Deployment boots a new Pod and waits until that Pod is responding to requests. It then terminates one old Pod and boots another new one. This continues until all old Pods are stopped and we have 10 new ones running the updated Deployment.
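That one-Pod-at-a-time behaviour is governed by the rolling update strategy's maxSurge and maxUnavailable settings. Here is a hedged sketch of a strategy that boots at most one extra Pod and never takes an old Pod down before its replacement is ready; the values are an illustration, not necessarily what the app in the animation used.

package main

import (
  "fmt"

  "k8s.io/api/apps/v1beta1"
  "k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
  // maxSurge=1, maxUnavailable=0: boot one new Pod at a time, and only
  // terminate an old Pod once its replacement is responding
  one := intstr.FromInt(1)
  zero := intstr.FromInt(0)
  strategy := v1beta1.DeploymentStrategy{
    Type: v1beta1.RollingUpdateDeploymentStrategyType,
    RollingUpdate: &v1beta1.RollingUpdateDeployment{
      MaxSurge:       &one,
      MaxUnavailable: &zero,
    },
  }
  fmt.Printf("strategy: %+v\n", strategy)
}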

Let's see how that is handled under the covers.

A Trigger-Based System

Kubernetes is a trigger-based environment. When a Deployment is created or updated, its new state is stored in etcd. But without a controller to perform some action on the new object, nothing will happen.

Anyone with the proper authorization on a cluster can listen for these triggers and perform actions on them. Let's take the following example:

package main

import (
  "log"
  "os"
  "path/filepath"
  "reflect"
  "time"

  "k8s.io/api/apps/v1beta1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/apimachinery/pkg/runtime"
  "k8s.io/apimachinery/pkg/watch"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/tools/cache"
  "k8s.io/client-go/tools/clientcmd"
)

func main() {
  // doneCh will be used by the informer to allow a clean shutdown
  // If the channel is closed, it signals to the informer that it needs to shut down
  doneCh := make(chan struct{})
  // Authenticate against the cluster
  client, err := getClient()
  if err != nil {
    log.Fatal(err)
  }

  // Set up the informer that will start watching for deployment triggers
  informer := cache.NewSharedIndexInformer(&cache.ListWatch{
    // This method will be used by the informer to retrieve the existing list of objects
    // It is used during initialization to get the current state of things
    ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
      return client.AppsV1beta1().Deployments("default").List(options)
    },
    // This method is used to watch on the triggers we wish to receive
    WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
      return client.AppsV1beta1().Deployments("default").Watch(options)
    },
  }, &v1beta1.Deployment{}, time.Second*30, cache.Indexers{}) // We only want `Deployments`, resynced every 30 seconds with the most basic indexer

  // Set up the trigger handlers that will receive triggers
  informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    // This method is executed when a new deployment is created
    AddFunc: func(deployment interface{}) {
      log.Printf("Deployment created: %s", deployment.(*v1beta1.Deployment).ObjectMeta.Name)
    },
    // This method is executed when an existing deployment is updated
    UpdateFunc: func(old, cur interface{}) {
      if !reflect.DeepEqual(old, cur) {
        log.Printf("Deployment updated: %s", cur.(*v1beta1.Deployment).ObjectMeta.Name)
      }
    },
  })

  // Start the informer, until `doneCh` is closed
  informer.Run(doneCh)
}

// Create a client so we're allowed to perform requests
// Because of the use of `os.Getenv("HOME")`, this only works on unix environments
func getClient() (*kubernetes.Clientset, error) {
  config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(os.Getenv("HOME"), ".kube", "config"))
  if err != nil {
    return nil, err
  }
  return kubernetes.NewForConfig(config)
}

If you follow the comments in this code sample, you can see that we create an informer which listens for Deployment create and update triggers, and logs them to stdout.

Back to the Deployment controller. When it is initialized, it configures a few informers to listen for:

  • A Deployment creation
  • A Deployment update
  • A Deployment deletion
  • A ReplicaSet creation
  • A ReplicaSet update
  • A ReplicaSet deletion
  • A Pod deletion

All those triggers together allow the controller to handle an entire gradual rollout; a sketch of what watching the ReplicaSet and Pod triggers could look like follows below.
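This is not the controller's actual code: the sketch below only shows, with the same client-go primitives as the example above, how one could watch the ReplicaSet and Pod triggers from that list. It assumes it sits in the same package as the earlier example and reuses its client and doneCh; the function name watchRolloutEvents and the default namespace are illustrative choices.

package main

import (
  "log"
  "time"

  apiv1 "k8s.io/api/core/v1"
  extv1beta1 "k8s.io/api/extensions/v1beta1"
  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/apimachinery/pkg/runtime"
  "k8s.io/apimachinery/pkg/watch"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/tools/cache"
)

// watchRolloutEvents mirrors the triggers listed above: ReplicaSet
// creations, updates and deletions, plus Pod deletions.
// `client` and `doneCh` are assumed to be set up as in the earlier example.
func watchRolloutEvents(client *kubernetes.Clientset, doneCh chan struct{}) {
  // ReplicaSet informer: creations, updates and deletions
  rsInformer := cache.NewSharedIndexInformer(&cache.ListWatch{
    ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
      return client.ExtensionsV1beta1().ReplicaSets("default").List(options)
    },
    WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
      return client.ExtensionsV1beta1().ReplicaSets("default").Watch(options)
    },
  }, &extv1beta1.ReplicaSet{}, time.Second*30, cache.Indexers{})

  rsInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    AddFunc: func(obj interface{}) {
      log.Printf("ReplicaSet created: %s", obj.(*extv1beta1.ReplicaSet).ObjectMeta.Name)
    },
    UpdateFunc: func(old, cur interface{}) {
      log.Printf("ReplicaSet updated: %s", cur.(*extv1beta1.ReplicaSet).ObjectMeta.Name)
    },
    DeleteFunc: func(obj interface{}) {
      // Deletions can deliver a tombstone instead of the object itself,
      // so only log when we actually got a ReplicaSet back
      if rs, ok := obj.(*extv1beta1.ReplicaSet); ok {
        log.Printf("ReplicaSet deleted: %s", rs.ObjectMeta.Name)
      }
    },
  })

  // Pod informer: the Deployment controller only needs deletions
  podInformer := cache.NewSharedIndexInformer(&cache.ListWatch{
    ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
      return client.CoreV1().Pods("default").List(options)
    },
    WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
      return client.CoreV1().Pods("default").Watch(options)
    },
  }, &apiv1.Pod{}, time.Second*30, cache.Indexers{})

  podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    DeleteFunc: func(obj interface{}) {
      if pod, ok := obj.(*apiv1.Pod); ok {
        log.Printf("Pod deleted: %s", pod.ObjectMeta.Name)
      }
    },
  })

  // Run both informers until `doneCh` is closed
  go rsInformer.Run(doneCh)
  podInformer.Run(doneCh)
}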

Rolling Out

For any of the mentioned triggers, the Deployment controller will do a Deployment sync. That method will check the Deployment status and perform the required action based on that.

Let's take the example of a new Deployment.

A Deployment Is Created

The controller receives the creation trigger and performs a sync. After performing all of its checks, it looks for the Deployment strategy and triggers it. In our case, we're interested in a rolling update, as it's the one you should be using to prevent downtime.

The rolloutRolling method will then create a new ReplicaSet. We need a new ReplicaSet for every rollout, as we want to be able to update the Pods one at a time. If the Deployment kept the same ReplicaSet and just updated its Pod template, all Pods would be restarted at once and there would be a window during which we are unable to process requests.

At this point, we have at least 2 ReplicaSets. One of them is the one we just created. The other one (there can be more if we have several concurrent rollouts) is the old one. We will then scale up and down both of the ReplicaSets accordingly.

To scale up the new ReplicaSet, we start by looking at how many replicas the Deployment expects. If we have scaled up enough, we just stop there. If we need to keep scaling up, we check the max surge value and compare it with the number of running Pods. If too many are running, we don't scale up and instead wait until some old Pods have finished terminating. Otherwise, we boot the required number of new Pods.

To scale down, we look at how many Pods are running in total, subtract the minimum number of Pods that must stay available, then subtract any Pods that haven't fully booted yet. Based on that, we know how many Pods need to be terminated, and can pick that many arbitrarily and terminate them.
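Both limits come from the strategy's maxSurge and maxUnavailable values. Here is a back-of-the-envelope sketch of that arithmetic, assuming 10 replicas with both values set to 25%; it is a simplification of what the controller computes, not its actual code.

package main

import "fmt"

func main() {
  replicas := 10
  maxSurgePercent := 25       // maxSurge: 25%
  maxUnavailablePercent := 25 // maxUnavailable: 25%

  // Scaling up: old and new Pods combined may not exceed the desired
  // replica count plus the surge allowance (rounded up)
  maxTotalPods := replicas + (replicas*maxSurgePercent+99)/100

  // Scaling down: at least this many Pods must stay available (the
  // unavailable allowance is rounded down), so only Pods above this
  // floor can be terminated
  minAvailable := replicas - replicas*maxUnavailablePercent/100

  fmt.Printf("max total Pods during the rollout: %d\n", maxTotalPods) // 13
  fmt.Printf("minimum available Pods: %d\n", minAvailable)            // 8
}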

At this point, the controller has finished for the current trigger. The deployment itself is not over though.

A ReplicaSet Is Updated

Because the new ReplicaSet has just booted new Pods, we will receive new triggers. Specifically, when a Pod comes up or goes down, the ReplicaSet will send an update trigger. By listening for ReplicaSet updates, we can look for Pods that have finished booting or terminating.

When that happens, we do the sync dance all over again, looking for Pods to shut down and others to boot based on the configuration, then wait for the next update.

A ReplicaSet Is Deleted

The ReplicaSet deleted trigger is used as a way to make sure all Deployments are always properly running. If a ReplicaSet is deleted and the Deployment didn't expect this, we need to perform a sync again to create a new one and bring the Pods back up.

This means if you want to quickly restart your app (with downtime), you can delete a Deployment's ReplicaSet safely. A new one will be created right away.
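As a sketch of that quick-restart trick with client-go, assuming the Deployment's ReplicaSets carry a hypothetical app=web label in the default namespace, and using the same pre-context client-go call style as the rest of this post:

package main

import (
  "log"
  "os"
  "path/filepath"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/tools/clientcmd"
)

func main() {
  // Authenticate the same way as in the informer example
  config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(os.Getenv("HOME"), ".kube", "config"))
  if err != nil {
    log.Fatal(err)
  }
  client, err := kubernetes.NewForConfig(config)
  if err != nil {
    log.Fatal(err)
  }

  replicaSets := client.ExtensionsV1beta1().ReplicaSets("default")

  // "app=web" is a hypothetical label shared by the Deployment's ReplicaSets
  list, err := replicaSets.List(metav1.ListOptions{LabelSelector: "app=web"})
  if err != nil {
    log.Fatal(err)
  }

  // Deleting the ReplicaSets takes the Pods down; the Deployment controller
  // receives the deletion trigger and immediately creates a fresh ReplicaSet
  for _, rs := range list.Items {
    if err := replicaSets.Delete(rs.ObjectMeta.Name, &metav1.DeleteOptions{}); err != nil {
      log.Fatal(err)
    }
  }
}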

A Pod Is Deleted

Deployments allow setting a ProgressDeadlineSeconds option. If the Deployment hasn't progressed (no Pod has booted or stopped) within the set number of seconds, it will be marked as failed. This typically happens when Pods enter a crash loop. When that happens, we will never receive the ReplicaSet update, as the Pod never comes online.

However, we will receive Pod deletion updates—one for each crash loop retry. By syncing here, we can check how long it's been since the last update and reliably mark the Deployment as failed after a while.
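From the API side, the deadline is just a field on the Deployment spec, and the failure shows up as a condition on the Deployment's status. Here is a sketch with the apps/v1beta1 types used earlier; the 600-second value and the hasFailedProgressing helper are illustrative, and the controller typically records the failure with the reason "ProgressDeadlineExceeded".

package main

import (
  "fmt"

  "k8s.io/api/apps/v1beta1"
  apiv1 "k8s.io/api/core/v1"
)

func int32Ptr(i int32) *int32 { return &i }

// hasFailedProgressing reports whether the Deployment was marked as failed
// because no progress was observed before the deadline
func hasFailedProgressing(d *v1beta1.Deployment) bool {
  for _, cond := range d.Status.Conditions {
    if cond.Type == v1beta1.DeploymentProgressing &&
      cond.Status == apiv1.ConditionFalse &&
      cond.Reason == "ProgressDeadlineExceeded" {
      return true
    }
  }
  return false
}

func main() {
  // Mark the rollout as failed if no Pod has booted or stopped for 10 minutes
  spec := v1beta1.DeploymentSpec{ProgressDeadlineSeconds: int32Ptr(600)}
  fmt.Printf("progress deadline: %ds\n", *spec.ProgressDeadlineSeconds)
  fmt.Println(hasFailedProgressing(&v1beta1.Deployment{Spec: spec})) // false: no conditions yet
}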

The Deployment Is Finished

If we consider the Deployment to be complete, we then clean things up.

At cleanup, we will delete any ReplicaSet that has become too old. We keep a set number of old ReplicaSets (with no Pods running) so we can roll back a broken Deployment.
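How many old ReplicaSets are kept is itself configurable on the Deployment spec. A small sketch with the same apps/v1beta1 types; 5 is an arbitrary value, and leaving the field unset lets the API apply its default.

package main

import (
  "fmt"

  "k8s.io/api/apps/v1beta1"
)

func main() {
  // Keep the last 5 old ReplicaSets around (scaled down to zero Pods)
  // so the Deployment can still be rolled back to them
  limit := int32(5)
  spec := v1beta1.DeploymentSpec{RevisionHistoryLimit: &limit}
  fmt.Printf("revision history limit: %d\n", *spec.RevisionHistoryLimit)
}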

Note: ReplicaSets only hold a Pod template. So if you are always using the :latest tag for your Pod's containers (or using the default one), you won't actually be rolling back anything. In order to have proper rollbacks, you need to change the container tag every time the image is rebuilt. For example, you could tag the containers with the git commit SHA they were built from.
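For instance, a deploy script could retag the image per commit and update the Deployment through the API. The sketch below assumes a hypothetical Deployment named web with a single container in the default namespace, and uses the pre-context client-go call style used elsewhere in this post; newer versions take a context and typed options.

package main

import (
  "log"
  "os"
  "path/filepath"

  metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
  "k8s.io/client-go/kubernetes"
  "k8s.io/client-go/tools/clientcmd"
)

func main() {
  // Authenticate the same way as in the informer example
  config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(os.Getenv("HOME"), ".kube", "config"))
  if err != nil {
    log.Fatal(err)
  }
  client, err := kubernetes.NewForConfig(config)
  if err != nil {
    log.Fatal(err)
  }

  deployments := client.AppsV1beta1().Deployments("default")

  // "web" is a hypothetical Deployment with a single container
  deployment, err := deployments.Get("web", metav1.GetOptions{})
  if err != nil {
    log.Fatal(err)
  }

  // Point the template at an image tagged with the commit it was built
  // from, so every rollout references a distinct, immutable image that
  // an old ReplicaSet can be rolled back to
  deployment.Spec.Template.Spec.Containers[0].Image = "example/web:3f5c9a1"
  if _, err := deployments.Update(deployment); err != nil {
    log.Fatal(err)
  }
}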

You can roll back a Deployment with the kubectl rollout undo command.

To Infinity and Beyond

While Kubernetes is generally seen as a complex tool, it is not difficult to dissect its parts to understand how they work. Its generic design also pays off by making the system extremely modular.

For example, as we have seen in this post, it is very easy to listen for Deployment triggers and implement your own logic on top of them, or to entirely reimplement them in your own controller (which would probably be a bad idea). This trigger-based system also makes things more straightforward, as each controller doesn't need to regularly poll for updates on the objects it owns. It just needs to listen for the appropriate triggers and perform the appropriate action.

Top comments (4)

Ben Halpern

Thanks for a great post

Daniel Albuschat

Thanks Damien! I'm sure that this info will help me troubleshoot problems with deployments in the future. And it really shows how "simple" Kubernetes is. I got a visual experiment for rolling updates on github.com/daniel-kun/kube-alive, do you have any ideas for new, fancy experiments to make the behaviour of deployments (or other k8s stuff) "tangible"?

Andy Zhao (he/him)

Now I'm one step closer to understanding what Kubernetes is 😅

github112017

Thanks for great post Damien !.