Sireesharaju Kamparaju

Monthly Kernel Patching with Ansible — My DevOps Journey

This one picks up where the AWX series left off. We've covered getting playbooks running from AWX, syncing projects from GitLab, securing secrets with Vault, scheduling workflows, and getting notified when things go wrong. The natural next step is putting all of that to use on something that actually matters every month: kernel patching.

I've lost count of how many patching playbooks I've written and thrown away. Most of them worked fine on a single test VM and then fell apart the moment they hit a fleet of production servers with different uptimes, different held-back packages, and that one box nobody's rebooted since 2019. This post walks through the playbook I run as a scheduled job in AWX for monthly kernel patching on Ubuntu 22.04 — what each task does, why it's there, and a couple of scars I picked up along the way.

If you're patching three servers by hand every month, you probably don't need this. If you're patching thirty, or a hundred, and you'd rather not be the person who finds out a reboot hung at 2 AM, keep reading.

Why monthly patching needs more than apt upgrade

Kernel patching sounds simple until you've done it a few times in production. The actual work isn't running the upgrade command — apt handles that part fine. The hard part is everything around it: confirming the box is healthy before you touch it, capturing what's installed before and after so you have evidence if something breaks, handling packages that are held back for reasons nobody remembers, and making the reboot itself safe instead of a coin flip.

I ran into the held-back package problem more than once: a kernel meta-package that shows up in apt list --upgradable but that the upgrade simply refuses to touch because something put it on hold. The fix turned out to be --allow-change-held-packages passed through apt, but only after I understood why it was held, not as a reflex "just force it" move. That distinction matters more than it sounds like it should.

The structure of the playbook

I split this into a handful of logical blocks: pre-checks, evidence capture, the actual patch, a controlled reboot, and post-patch verification. Each one is its own task or small group of tasks, which makes the whole thing easier to debug when something goes sideways at task 14 out of 20.

Here's the full playbook, and then I'll go through it task by task.

---
- name: Monthly Ubuntu kernel patching
  hosts: ubuntu_servers
  become: true
  serial: 1
  max_fail_percentage: 0

  vars:
    patch_evidence_dir: /var/log/patching
    reboot_timeout: 600

  tasks:

    - name: Ensure evidence directory exists
      ansible.builtin.file:
        path: "{{ patch_evidence_dir }}"
        state: directory
        mode: '0750'

    - name: Capture running kernel before patching
      ansible.builtin.command: uname -r
      register: kernel_before
      changed_when: false

    - name: Record pre-patch package state
      ansible.builtin.shell: dpkg -l > {{ patch_evidence_dir }}/pre_patch_{{ ansible_date_time.date }}.log
      changed_when: false

    - name: Check disk space before patching
      ansible.builtin.shell: df -h / | awk 'NR==2{print $5}' | tr -d '%'
      register: disk_usage
      changed_when: false

    - name: Fail early if disk usage is dangerously high
      ansible.builtin.fail:
        msg: "Root disk usage is {{ disk_usage.stdout }}%, too high to patch safely"
      when: disk_usage.stdout | int > 85

    - name: Update apt cache
      ansible.builtin.apt:
        update_cache: true
        cache_valid_time: 3600

    - name: List held packages, if any
      ansible.builtin.command: apt-mark showhold
      register: held_packages
      changed_when: false

    - name: Unhold kernel meta-packages so they're eligible for upgrade
      ansible.builtin.command: apt-mark unhold {{ item }}
      loop: "{{ held_packages.stdout_lines }}"
      when: "'linux-' in item"

    - name: Upgrade all packages (only kernel holds were cleared above)
      ansible.builtin.apt:
        upgrade: full
        force_apt_get: true
      register: apt_upgrade_result
      environment:
        DEBIAN_FRONTEND: noninteractive

    - name: Record post-upgrade package state
      ansible.builtin.shell: dpkg -l > {{ patch_evidence_dir }}/post_patch_{{ ansible_date_time.date }}.log
      changed_when: false

    - name: Check whether a reboot is required
      ansible.builtin.stat:
        path: /var/run/reboot-required
      register: reboot_required_file

    - name: Reboot the server if required
      ansible.builtin.reboot:
        reboot_timeout: "{{ reboot_timeout }}"
        msg: "Rebooting for kernel update via Ansible"
        connect_timeout: 20
        pre_reboot_delay: 10
        post_reboot_delay: 30
        test_command: uptime
      when: reboot_required_file.stat.exists

    - name: Capture running kernel after patching
      ansible.builtin.command: uname -r
      register: kernel_after
      changed_when: false

    - name: Report kernel change
      ansible.builtin.debug:
        msg: "Kernel changed from {{ kernel_before.stdout }} to {{ kernel_after.stdout }}"

    - name: Verify critical services are active post-reboot
      ansible.builtin.systemd:
        name: "{{ item }}"
      loop: "{{ critical_services | default(['sshd', 'systemd-resolved']) }}"
      register: service_status
      failed_when: service_status.status.ActiveState != "active"

    - name: Clean up old kernel packages
      ansible.builtin.apt:
        autoremove: true
        purge: true

    - name: Write final patch summary to evidence log
      ansible.builtin.lineinfile:
        path: "{{ patch_evidence_dir }}/patch_summary.log"
        line: "{{ ansible_date_time.iso8601 }} | {{ inventory_hostname }} | {{ kernel_before.stdout }} -> {{ kernel_after.stdout }}"
        create: true
        mode: '0640'

Now, the part that actually matters — what each piece is doing and why it's shaped this way.

Pre-checks: don't patch a server that's already unhappy

The first few tasks aren't about patching at all. They're about making sure you're not about to make a bad day worse.

serial: 1 and max_fail_percentage: 0 at the play level are doing more work than they look like. serial: 1 patches hosts one at a time instead of in parallel, which feels slow until the first time it saves you from rebooting your entire fleet simultaneously because of a bad kernel build. max_fail_percentage: 0 means the run stops dead the moment one host fails, rather than ploughing ahead and patching twenty more servers with the same problem.

Capturing the running kernel before patching (uname -r) is just a baseline. It sounds trivial, but when someone asks "did this actually update the kernel?" three weeks later, you want a log line, not a guess.

Recording the pre-patch package state with dpkg -l is the evidence capture habit. I picked this up from doing CyberArk-related audit work, where "we patched it" isn't good enough — you need a before-and-after you can actually point to. Dumping the full package list to a timestamped file costs nothing and saves you when someone needs proof of what changed.
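
The two package lists are easier to use as evidence with a diff stored next to them. Here is a minimal sketch, assuming it runs after the post-upgrade capture and reuses the playbook's patch_evidence_dir. The pkg_diff variable and the package_diff_ file name are made up for illustration.

```yaml
# Hypothetical follow-up tasks; place them after the post-upgrade dpkg -l capture.
- name: Diff pre- and post-patch package lists
  ansible.builtin.command: >-
    diff
    {{ patch_evidence_dir }}/pre_patch_{{ ansible_date_time.date }}.log
    {{ patch_evidence_dir }}/post_patch_{{ ansible_date_time.date }}.log
  register: pkg_diff
  changed_when: false
  # diff exits 0 when the files match, 1 when they differ, 2 or more on error
  failed_when: pkg_diff.rc > 1

- name: Save the package diff next to the other evidence
  ansible.builtin.copy:
    content: "{{ pkg_diff.stdout }}\n"
    dest: "{{ patch_evidence_dir }}/package_diff_{{ ansible_date_time.date }}.log"
    mode: '0640'
```

Because ansible_date_time is a fact gathered once at the start of the play, both file names resolve to the same date even if the run crosses midnight.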

The disk space check is one of those tasks that exists because of a specific bad memory. A kernel upgrade can pull in several hundred megabytes of new packages and headers, and if /boot or / is already tight, the upgrade can fail halfway through in a way that's genuinely annoying to recover from. Checking df -h and failing early if usage is above 85% means you find out before you've half-installed a new kernel, not after. As written, the task only looks at /, so a separate /boot partition needs its own check.
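
One way to cover / and /boot in a single task is to read the mount facts Ansible already gathers instead of parsing df. This is a sketch, not a drop-in replacement: disk_usage_limit_pct is a hypothetical variable, and the calculation counts reserved blocks as used, so it can read slightly higher than df does.

```yaml
# Sketch: relies on default fact gathering (ansible_mounts).
# disk_usage_limit_pct is a hypothetical variable; 85 mirrors the playbook's threshold.
- name: Fail early if / or /boot is too full
  ansible.builtin.assert:
    that:
      - (100 * (item.size_total - item.size_available) / item.size_total) | int <= disk_usage_limit_pct | default(85)
    fail_msg: "{{ item.mount }} is above the usage limit, too full to patch safely"
  # Only mounts that actually exist are checked; a host without a separate /boot just checks /.
  loop: "{{ ansible_mounts | selectattr('mount', 'in', ['/', '/boot']) | list }}"
  loop_control:
    label: "{{ item.mount }}"
```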

Handling held packages without just brute-forcing everything

This is the bit I want to spend the most time on, because it's the part that bit me.

apt-mark showhold lists anything currently held back. Packages get held for real reasons sometimes — a previous admin pinned a version for compatibility, or a partial upgrade left something in a weird state. The instinct to just --allow-change-held-packages and move on is tempting, but it's worth actually looking at what's held first.

The unhold task only targets packages with linux- in the name. This is deliberate. I don't want to blanket-unhold everything apt has flagged — some of those holds might be load-bearing for application compatibility. I only want to clear holds on kernel-related meta-packages, since those are specifically what blocks a clean kernel upgrade. Anything else stays held, and if it needs attention, that's a separate, more careful conversation.
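
To keep the remaining holds visible instead of silently skipping them, a small reporting task can sit right after the unhold step. This is a sketch that reuses the held_packages result from the playbook; remaining_holds is a name introduced here for illustration.

```yaml
# Sketch only: reuses held_packages from the "List held packages" task.
- name: Report holds that are deliberately left in place
  ansible.builtin.debug:
    msg: "Still held on {{ inventory_hostname }}: {{ remaining_holds | join(', ') }}"
  vars:
    # Same substring rule as the unhold task: anything containing "linux-" counts as kernel-related.
    remaining_holds: "{{ held_packages.stdout_lines | reject('search', 'linux-') | list }}"
  when: remaining_holds | length > 0
```

The output lands in the AWX job log, which gives that "separate, more careful conversation" a starting point.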

The upgrade task itself uses upgrade: full (the Ansible equivalent of apt full-upgrade) rather than a plain upgrade, because kernel updates often need to install new packages and remove obsolete ones to land cleanly. force_apt_get: true makes the module call apt-get instead of aptitude, which it otherwise prefers for upgrades when aptitude is installed; in my experience apt-get handles held-package and meta-package edge cases more predictably. DEBIAN_FRONTEND: noninteractive stops any package post-install script from hanging the whole run waiting for a prompt that will never come, because nobody's sitting at a terminal at 3 AM watching this.
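
The playbook registers apt_upgrade_result but never uses it. One possible use, sketched here under the assumption that the apt module returned its stdout for this run, is to keep the upgrade output with the rest of the evidence. The apt_upgrade_ file name is made up for illustration.

```yaml
# Sketch: stdout may be absent on some runs, hence the default('').
- name: Save apt upgrade output to the evidence directory
  ansible.builtin.copy:
    content: "{{ apt_upgrade_result.stdout | default('') }}\n"
    dest: "{{ patch_evidence_dir }}/apt_upgrade_{{ ansible_date_time.date }}.log"
    mode: '0640'
```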

The reboot — the part everyone's actually nervous about

Checking /var/run/reboot-required rather than just always rebooting matters because not every patch run touches the kernel. Some months it's just userspace packages, and forcing a reboot anyway is just unnecessary downtime. Ubuntu already tracks this for you; you might as well use it.
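
It can also help to record why a reboot was requested. The sketch below assumes Ubuntu has written a companion file, /var/run/reboot-required.pkgs, listing the packages that asked for the reboot. That file is not guaranteed to exist, so a missing file is tolerated. The tasks need to run before the reboot, since the flag files do not survive it.

```yaml
# Sketch: reboot_pkgs is a name introduced here for illustration.
- name: Read which packages requested the reboot
  ansible.builtin.command: cat /var/run/reboot-required.pkgs
  register: reboot_pkgs
  changed_when: false
  failed_when: false   # tolerate a missing .pkgs file
  when: reboot_required_file.stat.exists

- name: Show packages that requested the reboot
  ansible.builtin.debug:
    msg: "Reboot requested by: {{ reboot_pkgs.stdout_lines | default([]) | join(', ') }}"
  when: reboot_required_file.stat.exists
```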

The ansible.builtin.reboot module is doing a lot of quiet work here. It issues the reboot, rides out the dropped connection, and then polls until the host is reachable over SSH again. It also confirms the machine really rebooted rather than just dropping SSH for some other reason: by default it compares a boot identifier taken before and after (the boot_time_command parameter). The test_command: uptime is a nice touch: rather than only checking that a port is open, it runs a real command on the host, so a box stuck at a GRUB menu or in some half-booted state won't pass. reboot_timeout: 600 gives it ten minutes, which has been enough for everything I've patched except one cursed RAID controller that needed longer. Adjust per your hardware reality, not mine.
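
Adjusting the timeout per host takes a little care, because variables set under the play's vars outrank group_vars and host_vars in Ansible's precedence order. A minimal sketch of one workaround, using a hypothetical host_reboot_timeout variable that falls back to the play-level value:

```yaml
# Sketch: host_reboot_timeout is a hypothetical per-host or per-group variable.
# Overriding reboot_timeout itself from inventory would have no effect while it
# is set under the play's vars, so a second variable carries the override.
- name: Reboot the server if required
  ansible.builtin.reboot:
    reboot_timeout: "{{ host_reboot_timeout | default(reboot_timeout) }}"
    msg: "Rebooting for kernel update via Ansible"
    test_command: uptime
  when: reboot_required_file.stat.exists
```

With this in place, a group_vars file for the slow hosts only needs a line such as host_reboot_timeout: 1200.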

Verifying you didn't break anything

This is the step people skip, and it's the one that actually justifies the whole exercise.

Comparing kernel before and after confirms the upgrade actually took. I've seen cases where the package upgraded fine but the reboot picked the old kernel from GRUB's default entry because of a misconfigured bootloader — without this check, you'd never know.
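
The debug task in the playbook prints both versions but does not fail on a mismatch. A stricter check could look like the sketch below, which assumes /boot/vmlinuz is a symlink to the newest installed kernel image, as it is on the stock Ubuntu installs I'm aware of. If that assumption doesn't hold on a host, the check is skipped rather than failed. boot_link is a name introduced here.

```yaml
# Sketch: place after the "Capture running kernel after patching" task.
- name: Look up the newest installed kernel image
  ansible.builtin.stat:
    path: /boot/vmlinuz
  register: boot_link

- name: Fail if the host is not running the newest installed kernel
  ansible.builtin.assert:
    that:
      - (boot_link.stat.lnk_source | basename) == ('vmlinuz-' ~ kernel_after.stdout)
    fail_msg: >-
      Running {{ kernel_after.stdout }}, but /boot/vmlinuz points at
      {{ boot_link.stat.lnk_source | basename }}
  # Skip on hosts where /boot/vmlinuz is missing or is not a symlink.
  when: boot_link.stat.exists and (boot_link.stat.islnk | default(false))
```

Comparing against the newest installed image, rather than requiring the version to change, avoids false alarms in months where the reboot was triggered by something other than a kernel package.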

Checking critical services are active post-reboot is a small thing that's saved me real pain. sshd and systemd-resolved are the defaults here, but in practice I override critical_services per host group: a database server gets its DB service in that list, an app server gets its app service. The check uses the systemd module's ActiveState, which is a more honest signal than "did the process start". A service can start and then immediately crash-loop, and you want to catch that before you move to the next server, not after you've patched the whole fleet on a broken assumption.
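
A per-group override can be as small as one group_vars file. The group name and the database service below are hypothetical examples, not taken from my inventory:

```yaml
# group_vars/db_servers.yml (hypothetical group and service names)
critical_services:
  - sshd
  - systemd-resolved
  - postgresql
```

This works because critical_services is not set under the play's vars, so the inventory value is the one the loop sees and the default list only applies to groups that define nothing.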

Cleanup and the paper trail

autoremove with purge: true clears out old kernel images so /boot doesn't slowly fill up over a year of monthly patching — another thing that's obvious in hindsight and easy to forget until /boot is at 98% and your next kernel upgrade can't even write its files.

The final summary line appended to patch_summary.log is deliberately boring: timestamp, hostname, old kernel, new kernel, one line per host per run. It's not fancy, but six months from now when someone asks "when did we last patch the kernel on this box," you've got a flat file you can grep instead of digging through Ansible run history or AWX job logs.

A few things I'd tell someone starting this from scratch

Test the unhold logic against a host that actually has held packages before you trust it — a playbook that's never seen a real hold can look correct and still fail the first time it meets one. Keep serial: 1 even when you're impatient; the time you save by parallelizing isn't worth the time you lose the one month a kernel update has a regression. And don't skip the evidence capture steps because they feel like overhead — they're cheap to run and the only thing standing between you and "I'm pretty sure it patched fine" when someone asks for proof.

None of this is exotic. It's mostly just being deliberate about the order of operations and refusing to skip the boring verification steps because the upgrade itself "usually just works." Usually is doing a lot of heavy lifting in that sentence, and monthly patching across a real fleet is exactly the kind of work where the exceptions show up eventually.

Drop your questions in the comments — happy to help.
— Sireesha

Part of the My DevOps Journey series:

  • Running Your First Playbook from Ansible AWX
  • Syncing GitLab Projects to AWX
  • Ansible Vault — Managing Secrets Securely in Ansible and AWX
  • AWX Workflow Templates & Schedules
  • Setting Up Email Notifications in AWX via SMTP and Postfix
  • Deploying Ansible AWX on Kubernetes Using Helm
  • Automating Linux & Windows Server Setup with Ansible
