Ansible state:latest Broke Payments for 47 Minutes — What Really Happened and How to Prevent It

#ansible #devops #infrastructure #sre

It was a Monday morning. A routine playbook. One task: state=latest.

Forty-seven minutes later the payments team had a P1 incident, 50 production web servers were running a version of Nginx nobody had approved, and the postmortem had a very uncomfortable finding: Ansible did exactly what it was told to do.

This article covers what happened, how idempotency actually works in production (not how tutorials describe it), and how to install and use Ansible without repeating this.

The Incident: What Actually Happened

The task looked like this:

- name: Ensure nginx is installed
  ansible.builtin.apt:
    name: nginx
    state: latest
    update_cache: yes

Symptom: Users started seeing SSL handshake failed errors immediately after the playbook completed. Payment gateway API calls timed out. Transaction success rate dropped below 2% within 90 seconds of the run completing.

Root cause: state: latest on a Monday morning after a weekend Ubuntu mirror sync pulled Nginx from 1.24 to 1.26 across the entire fleet simultaneously. Nginx 1.26 introduced a TLS configuration change that broke the handshake with the payment processor's aging intermediate certificate chain. Nobody tested it. Nobody saw it coming. The task logged changed — but didn't log what changed.

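The fix: Roll back by putting the version in the package name (name: nginx=1.24.* with state: present, plus allow_downgrade: yes so apt is allowed to go backwards) and pin the version in apt preferences. Forty-seven minutes of payment outage for one wrong word in a task definition.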
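As a hedged sketch of that rollback, assuming ansible-core 2.12 or newer (where the apt module gained allow_downgrade) and a hypothetical preferences file name, the pin and the downgrade might look like the tasks below. The apt module takes the version as part of name rather than as a separate parameter. The priority of 1001 is an assumption, chosen because apt only downgrades to a pinned version when the priority exceeds 1000. A real fleet would probably need matching pins for companion packages such as nginx-common.

```yaml
# Sketch only: file name, priority, and version series are illustrative assumptions.
- name: Pin nginx to the 1.24 series in apt preferences
  ansible.builtin.copy:
    dest: /etc/apt/preferences.d/nginx   # hypothetical file name
    owner: root
    group: root
    mode: "0644"
    content: |
      Package: nginx
      Pin: version 1.24.*
      Pin-Priority: 1001

- name: Roll nginx back to the pinned series
  ansible.builtin.apt:
    name: "nginx=1.24.*"
    state: present
    allow_downgrade: true   # needs ansible-core 2.12+
    update_cache: true
```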

The lesson that matters:

state: latest is not idempotency. It is an instruction to upgrade whenever the repository offers something newer. Idempotency means reaching a defined state, and latest is not a defined state. It's a moving target.
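The incident task reported changed without saying what changed. One possible way to close that gap, sketched here with hypothetical variable names, is to capture the installed version with ansible.builtin.package_facts before and after the install task and print both.

```yaml
# Sketch only: nginx_before and nginx_after are hypothetical variable names.
- name: Collect package versions before touching anything
  ansible.builtin.package_facts:

- name: Remember the nginx version that is installed now
  ansible.builtin.set_fact:
    nginx_before: "{{ ansible_facts.packages.nginx | default([{'version': 'absent'}]) | map(attribute='version') | first }}"

- name: Ensure nginx is installed at the pinned series
  ansible.builtin.apt:
    name: "nginx=1.24.*"
    state: present

- name: Collect package versions again
  ansible.builtin.package_facts:

- name: Report exactly what the run did to nginx
  ansible.builtin.debug:
    msg: "nginx: {{ nginx_before }} -> {{ nginx_after }}"
  vars:
    nginx_after: "{{ ansible_facts.packages.nginx | default([{'version': 'absent'}]) | map(attribute='version') | first }}"
```

With this in place, the run output carries the old and new version strings next to the changed flag, which is the detail the postmortem had to reconstruct by hand.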

Installing Ansible Properly (Ubuntu 22.04)

The official PPA gives you a more recent Ansible than the Ubuntu default repos, and it's the right way to set up a control node that won't surprise you six months later.

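# Add the Ansible PPA (add-apt-repository is provided by software-properties-common)
sudo apt update
sudo apt install software-properties-common -y
sudo add-apt-repository --yes ppa:ansible/ansible
sudo apt update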

# Install
sudo apt install ansible -y

# Verify
ansible --version

You should see output like:

ansible [core 2.17.x]
  config file = /etc/ansible/ansible.cfg
  python version = 3.10.x

Three installation methods compared:

Method	Ansible version	Use when
`apt` (default repo)	Older (2.10.x on Ubuntu 22.04)	You need OS-managed packages
PPA (`ppa:ansible/ansible`)	Recent stable	Most production control nodes
`pip install ansible`	Latest	Testing, dev environments, containers

For production control nodes, the PPA is the correct choice. Pip installs are fine for dev but create version drift when the Python environment changes.

Your Inventory File — The Part Everyone Configures Wrong

Before you run anything, Ansible needs to know what to connect to. The default inventory is /etc/ansible/hosts. For anything real, use a project-local inventory:

# inventory/production

[web]
web01.example.com
web02.example.com
web03.example.com

[db]
db01.example.com

[web:vars]
ansible_user=deploy
ansible_ssh_private_key_file=~/.ssh/id_ed25519
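The same inventory can also be written in YAML, which some teams find easier to review in diffs. This is a sketch of an equivalent file, and the file name inventory/production.yml is an assumption, not something the setup above requires.

```yaml
# inventory/production.yml (hypothetical file name; same hosts and vars as the INI version)
all:
  children:
    web:
      hosts:
        web01.example.com:
        web02.example.com:
        web03.example.com:
      vars:
        ansible_user: deploy
        ansible_ssh_private_key_file: "~/.ssh/id_ed25519"
    db:
      hosts:
        db01.example.com:
```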

Test connectivity before writing a single playbook:

ansible all -i inventory/production -m ping

Expected output (abbreviated; the real output is multi-line JSON that also includes "changed": false):

web01.example.com | SUCCESS => {"ping": "pong"}
web02.example.com | SUCCESS => {"ping": "pong"}

If you see UNREACHABLE — check SSH key permissions (chmod 600), verify the target's ~/.ssh/authorized_keys, and confirm the ansible_user exists on the target.

Ad-Hoc Commands — For When You Need Fast Answers

Ad-hoc commands are the fastest way to use Ansible without writing a playbook. Essential for operational work:

# Check uptime across all web servers
ansible web -i inventory/production -m shell -a "uptime"

# Copy a config file (--become because the deploy user cannot write to /etc)
ansible web -i inventory/production --become -m copy -a "src=./nginx.conf dest=/etc/nginx/nginx.conf"

# Install a package (note: state=present, not latest)
ansible web -i inventory/production --become -m apt -a "name=nginx state=present update_cache=yes"

# Restart a service
ansible web -i inventory/production --become -m service -a "name=nginx state=restarted"

# Check free disk space
ansible all -i inventory/production -m shell -a "df -h"

The key pattern: -m module_name -a "module_arguments".

Your First Playbook — Install Apache the Right Way

The hello-world of Ansible playbooks. Notice the version pinning:

---
- name: Configure web server cluster
  hosts: web
  become: yes

  vars:
    apache_version: "2.4.*"    # Pin to minor version, not latest

  tasks:
    - name: Install Apache (pinned version)
      ansible.builtin.apt:
        name: "apache2={{ apache_version }}"
        state: present          # Not latest — ever
        update_cache: yes

    - name: Deploy site configuration
      ansible.builtin.template:
        src: templates/site.conf.j2
        dest: /etc/apache2/sites-available/mysite.conf
        mode: '0644'
      notify: reload apache     # Handler — only runs if this task changes something

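    - name: Enable site
      ansible.builtin.command:
        cmd: a2ensite mysite
        creates: /etc/apache2/sites-enabled/mysite.conf   # Skip when already enabled, so "changed" stays honest
      notify: reload apache     # A newly enabled site needs a reload to take effect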

  handlers:
    - name: reload apache
      ansible.builtin.service:
        name: apache2
        state: reloaded

Run it:

ansible-playbook -i inventory/production site.yml

Before you run anything in production, always run with --check --diff first:

ansible-playbook -i inventory/production site.yml --check --diff

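This previews what Ansible predicts would change without changing anything. It is a prediction, not a guarantee: modules without check-mode support are skipped, so tasks that depend on their results can report differently in a real run. Make it a habit anyway.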
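Command tasks are among those skipped under --check. A minimal sketch of one way to keep a read-only validation step running during a dry run is shown below. It assumes apache2ctl is available on the targets, and the task is an illustration, not part of the playbook above.

```yaml
# Sketch only: assumes apache2ctl is on the target's PATH.
- name: Validate Apache configuration syntax
  ansible.builtin.command:
    cmd: apache2ctl configtest
  check_mode: false     # run for real even under --check; the command is read-only
  changed_when: false   # a syntax check never changes the host
```

Under --check the template task does not write the new file, so this validates the configuration currently on disk, not the one the playbook would deploy.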

The Production Rule That Prevents the State:latest Incident

Never use state: latest for packages that sit in front of external integrations.

Instead, use this pattern:

vars:
  package_versions:
    nginx: "1.24.*"
    openssl: "3.0.*"
    python3: "3.10.*"

tasks:
  - name: Install packages at pinned versions
    ansible.builtin.apt:
      name: "{{ item.key }}={{ item.value }}"
      state: present
    loop: "{{ package_versions | dict2items }}"

When you need to upgrade, the version bump is a deliberate code change that goes through your review process — not an automatic consequence of running a playbook on a Monday morning after a weekend mirror sync.
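A pin in the playbook only helps if the hosts actually match it. As a rough sketch, assuming every value in package_versions ends in ".*" as in the example above, a follow-up check could fail the run when an installed package has drifted off its series. Packages whose Debian version string carries an epoch prefix would need a different comparison.

```yaml
# Sketch only: assumes every value in package_versions ends in ".*".
- name: Collect installed package versions
  ansible.builtin.package_facts:

- name: Fail the run if a pinned package is off its series
  ansible.builtin.assert:
    that:
      - item.key in ansible_facts.packages
      - ansible_facts.packages[item.key][0].version.startswith(item.value[:-1])
    fail_msg: "{{ item.key }} is missing or not on the pinned {{ item.value }} series"
  loop: "{{ package_versions | dict2items }}"
```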

What the Full Article Covers

This covers the foundations. The full guide at TheCodeForge goes deeper into:

Ansible vs Terraform — when to use each and the honest production trade-offs
Handler deduplication — why handlers run once regardless of how many tasks notify them, and when that matters
Idempotency as a property you build, not something Ansible gives you for free
The changed_when and failed_when patterns that make playbooks actually trustworthy in CI/CD
Check mode and diff mode workflows for safe production changes
A complete production troubleshooting guide with runnable diagnostic commands

Read the full Ansible Basics guide on TheCodeForge →

Written by Naren — 20 years in enterprise IT, production Ansible deployments across banking, insurance, and fintech environments. Founder of TheCodeForge.io — programming tutorials that explain the why before the how.