lykins
First Look Nomad 1.11.x - System Job Deployments

Originally posted : https://blog.lykins.xyz/posts/system-job-deployments/


The last Nomad 1.11.x feature I will be covering is not new to Nomad itself, but it puts a new twist on how you update and manage system jobs.

What's New?

In Nomad 1.11.x, system jobs now support deployments, which allow operators to manage changes and rollbacks for system jobs more effectively. This enhancement gives you additional control over how system jobs are deployed and maintained across your infrastructure.

As always, I heavily recommend reading the release notes, linked above, and the documentation on the update block, which has been updated to call out different behaviors when used with system jobs, such as the following:

In system jobs, the canary setting indicates the percentage of feasible nodes to which Nomad makes destructive allocation updates. System jobs do not support more than one allocation per node, so effectively setting canary to a positive integer means this percentage of feasible nodes gets a new version of the job if the update is destructive. Non-destructive updates ignore the canary field. Setting canary to 100 updates the job on all nodes. Percentage of nodes is always rounded up to the nearest integer. If canary is set, nodes that register during a deployment do not receive placements until after the deployment is promoted.
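To make the "rounded up" behavior concrete, here is a quick back-of-the-envelope sketch of the math (my own illustration, not Nomad's code) using integer shell arithmetic:

```shell
# ceil(canary% * feasible_nodes / 100) via integer arithmetic
canary_pct=30
feasible_nodes=3
echo $(( (canary_pct * feasible_nodes + 99) / 100 ))   # 30% of 3 nodes -> 1 node

canary_pct=100
echo $(( (canary_pct * feasible_nodes + 99) / 100 ))   # 100% of 3 nodes -> all 3
```

So on a small cluster even a low canary percentage always covers at least one node.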

Since this also updates the Nomad UI, I will include screenshots as well.

Comparing 1.11.x to Previous Versions

So to start off, let's do a system job run comparison to see how system jobs behaved in previous versions of Nomad versus how they behave in 1.11.x.

I will have one cluster running 1.10.5 and another running 1.11.1. Both clusters are set up nearly identically: 1 server and 3 clients each.

Demo Job

I am running a very simple system job that deploys a busybox container on all clients in the cluster. Here is the jobspec:

job "system-deployment" {
  datacenters = ["dc1"]
  type        = "system"

  update {
    canary            = 30
    max_parallel      = 1
    min_healthy_time  = "30s"
    healthy_deadline  = "5m"
    auto_promote      = false
    auto_revert       = true
  }

  group "test-group" {
    task "test-task" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sh"
        args    = ["-c", "echo 'Running...'; sleep 3600"]
      }
    }
  }
}
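If you want to follow along from the CLI instead of the UI, something like the following works (the filename is my assumption; save the jobspec as whatever you like):

```shell
# Submit the system job (filename is hypothetical)
nomad job run system-deployment.nomad.hcl

# Check job status; on 1.11.x this now includes deployment info for system jobs
nomad job status system-deployment

# List deployments to see the one created for the system job
nomad deployment list
```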

Pay close attention to the update block and how it is set up for this demo:

  • canary is set to 30: when we do a destructive update, 30% of the nodes (rounded up, so 1 of 3 nodes in this case) are updated first, before the new version is promoted to the remaining nodes.
  • max_parallel is set to 1: only one node will be updated at a time.
  • min_healthy_time is set to 30s: the updated allocation must be healthy for at least 30 seconds before proceeding.
  • healthy_deadline is set to 5m: if the updated allocation does not become healthy within 5 minutes, the deployment is considered failed.
  • auto_promote is set to false: the deployment will not automatically proceed to update the remaining nodes after the canary node is healthy.
  • auto_revert is set to true: if the deployment fails, it automatically reverts to the previous version.

A canary is not necessary to trigger a deployment, but it provides a checkpoint to validate the new version on a subset of nodes before it rolls out everywhere.

UI

On the left side of the screen is my homelab running 1.11.1 and on the right side is a quick lab I spun up on multipass with 1.10.5. See if you can spot the subtle differences in the UI for system jobs between the two versions:

In the 1.11.1 UI on the left, you can see the "Deployments" tab exists now for the system job, whereas in 1.10.5 on the right, there is no such tab. With that, you will have additional deployment information and status on the overview page.

Update Behavior

Now let's look at the update behavior between the two versions.

Running an Updated Job

For this, I will update the container image version from 1.36 to 1.37 and run the update in the UI. Since this is a destructive update, you will see how each version handles it differently:
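The only change to the jobspec is the image tag in the config block; everything else, including the update block, stays exactly the same:

```hcl
      config {
        image   = "busybox:1.37" # bumped from 1.36 - a destructive update
        command = "sh"
        args    = ["-c", "echo 'Running...'; sleep 3600"]
      }
```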

Nothing different or major to see here, so we will carry along.

Deployment Behaviors

After running the update:

On the left side with 1.11.x, you can see the deployment is in progress, and it is updating one of the three nodes (30% rounded up to one node, per the canary setting).

On the right side with 1.10.5, a deployment is not created, and the job is being updated per the max_parallel setting.

Promotion

You can see now with 1.11.x that the deployment is in progress, and it will wait until the canary is healthy based on any health checks - in this case, simply a healthy allocation. Once healthy, it requires manual promotion to continue the deployment to the remaining nodes, since auto_promote is set to false, as shown below:

Here I have the option to promote or revert the deployment. If this were a real job, I could do some additional testing or validation, but for this demo I will go ahead and promote the deployment, and Nomad will handle updating the remaining allocations.
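The same promotion can be done from the CLI; the deployment ID below is a placeholder (grab the real one from `nomad deployment list`):

```shell
# Promote the canary so the remaining nodes get the new version
nomad deployment promote <deployment-id>

# Or mark the deployment failed, which triggers auto_revert
# back to the previous version
nomad deployment fail <deployment-id>
```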

Conclusion

System job deployments in Nomad 1.11.x bring a new level of control and reliability to managing critical system services across your infrastructure. By leveraging deployments, manual or controlled promotions, and automatic rollbacks, operators can ensure that updates to system jobs are handled safely and efficiently.
