# Ansible Playbook Failing? The 7 Root Causes I See Most Often
Fifteen years of production Ansible, distilled to the seven failures I see over and over. Each one with the actual error signature and the fix.
April 23, 2026 · 9 min read · Automation
When an Ansible playbook breaks in production, you do not have time for a blog post that starts "Ansible is a powerful automation tool developed by Red Hat." You have logs, you have a pager, and you have ten minutes to figure out whether this is a five-minute fix or a rollback.
This post is the triage guide I wish I had the first time a playbook failed on me. Every cause below is something I have personally diagnosed in production — some of them more than a dozen times across enterprise network-automation gigs, government deployments, and small-business DevOps work. Each one includes the error signature you will actually see in your terminal and the fix that stops it recurring.
If you are looking at a broken Ansible run right now and you need someone to fix it by tomorrow, the rapid-fix audit is $250 flat — written root-cause report in 48 hours, and if I cannot diagnose it you do not pay. If you have time to read, keep going.
## 1. Intermittent SSH Timeouts on a Subset of Hosts
You run the same playbook. Eight hosts succeed. Three hosts fail with an SSH connection error. Re-run it — now it is four hosts that fail, but two of the previous failures pass. It looks random.
```
TASK [network_config] **************************
fatal: [edge-02.atl]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host edge-02.atl port 22: Connection timed out", "unreachable": true}
```
It is almost never actually random. The usual culprits, in order of frequency: (a) ControlPersist sockets going stale — SSH's own connection-reuse cache is serving you a dead socket; (b) DNS returning different results per lookup because you have two mismatched name-server entries; (c) a firewall doing rate-limiting on new SSH connections and your playbook is fanning out faster than the rate limit allows.
Fix. Take SSH multiplexing out of the picture while you debug: set ssh_args = -o ControlMaster=no under [ssh_connection] in ansible.cfg, or clear the stale sockets under ~/.ansible/cp, so a dead cached connection can never be served back to you. (Note that ControlPersist=0 is not the off switch; OpenSSH treats 0 as "persist indefinitely.") Cap forks in ansible.cfg at around 20 to defeat the rate-limit case. Then run one failed host with -vvvv; the exact SSH command line Ansible executes will tell you whether the hang is on TCP connect or post-auth.
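A minimal ansible.cfg debugging profile along those lines. Treat it as a sketch: the ConnectTimeout value and the forks cap are illustrative numbers, not recommendations, and you would revert the multiplexing change once the culprit is isolated.

```ini
# ansible.cfg -- temporary debugging profile (illustrative values)
[defaults]
forks = 20    ; cap fan-out so a per-source SSH rate limit cannot trip

[ssh_connection]
# ControlMaster=no disables connection reuse entirely, so a stale
# ControlPersist socket can never be handed back to a new run.
ssh_args = -o ControlMaster=no -o ConnectTimeout=10
```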
## 2. Handler Ordering That Runs at the Wrong Moment
You notify a handler to restart a service. The playbook continues past the notify. A later task in the play fails. The handler never fires — because Ansible runs handlers only at the end of a play (or when meta: flush_handlers is explicitly called). Your service now has new config staged on disk but the daemon never reloaded. The next time something touches that box, production breaks.
```yaml
- name: deploy new nginx vhost
  template:
    src: vhost.j2
    dest: /etc/nginx/sites-available/app
  notify: reload nginx

- name: run smoke test     # fails, because nginx never got the reload
  uri:
    url: http://localhost/app/health
    status_code: 200
```
Fix. Put `- meta: flush_handlers` immediately after any notify whose result you depend on in subsequent tasks. Your smoke test has no business running before the reload completes.
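The corrected sequence, as a minimal sketch reusing the task names above:

```yaml
- name: deploy new nginx vhost
  template:
    src: vhost.j2
    dest: /etc/nginx/sites-available/app
  notify: reload nginx

# force all pending handlers to run NOW, not at the end of the play
- meta: flush_handlers

- name: run smoke test     # now hits the reloaded nginx, not the stale one
  uri:
    url: http://localhost/app/health
    status_code: 200
```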
## 3. Fact-Gathering Crashes on a Subset of Hosts

One host in a group has a weird kernel, a missing Python module, or a locked-down SELinux context. Fact-gathering throws an exception on that host. By default, Ansible terminates the entire play for that host — but the error message is buried under a mountain of default-facts JSON, and it looks like a network problem until you look closely.

```
fatal: [legacy-01]: FAILED! => {"ansible_facts": {}, "changed": false, "failed_modules": {"ansible.legacy.setup": {"failed": true, "module_stderr": "/usr/bin/python3: No module named ansible", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 1}}, "msg": "The following modules failed to execute: ansible.legacy.setup\n"}
```

Nine times out of ten this is Python path drift — the host has Python 3 somewhere non-standard, or the ansible_python_interpreter value Ansible auto-detected is wrong.

Fix. Set ansible_python_interpreter explicitly per host in inventory, or use gather_facts: no plus a targeted setup task with gather_subset: min to skip the costly facts you do not need. For fleet hygiene, add a lint rule that fails CI if any host is missing an explicit interpreter.
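Both halves of that fix, sketched together. The host names reuse this post's examples and the interpreter paths are placeholders:

```yaml
# inventory/hosts.yml -- pin the interpreter per host (paths are examples)
all:
  hosts:
    legacy-01:
      ansible_python_interpreter: /opt/python3.9/bin/python3
    edge-02.atl:
      ansible_python_interpreter: /usr/bin/python3
```

```yaml
# in the play: skip automatic fact gathering, then collect only the minimum
- hosts: all
  gather_facts: no
  tasks:
    - name: gather just the cheap facts we actually use
      ansible.builtin.setup:
        gather_subset:
          - min
```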
## 4. Non-Idempotent Tasks That Report "changed" Every Run

A well-written Ansible play is idempotent — running it twice in a row produces one change on the first run and zero on the second. When the same task flips to changed on every run, something is tricking the module into thinking there is work to do. The usual culprits: (a) shell or command tasks with no creates: / removes: guard — Ansible has no way to know those already ran successfully; (b) template tasks where the rendered file has a timestamp, a hostname, or a random ID baked in; (c) lineinfile matching on a regex that subtly does not match its own output after the first run.

```yaml
- name: install monitoring agent config
  template:
    src: monitor.conf.j2    # renders: generated_at: {{ ansible_date_time.iso8601 }}
    dest: /etc/monitor.conf
  notify: restart monitor
```
Every run regenerates the config with a new timestamp. Every run restarts your monitor. Every run takes the monitor down for two seconds. Two seconds times 365 days times 200 hosts is real downtime — roughly forty hours a year across the fleet.

Fix. Never embed volatile values in templates that feed a changed signal. If you need the timestamp for audit, put it in a sidecar file the playbook does not re-check. Add a creates: guard and an explicit changed_when: to any shell or command task, so you decide what counts as a change, not Ansible's default.
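Two sketches of those guards; the installer path, the refresh script, and the "updated" marker string are hypothetical:

```yaml
- name: bootstrap monitoring agent
  ansible.builtin.command: /opt/monitor/install.sh    # hypothetical installer
  args:
    creates: /etc/monitor.conf    # artifact already exists? skip, report ok

- name: refresh route cache
  ansible.builtin.command: /usr/local/bin/refresh-routes    # hypothetical script
  register: refresh
  changed_when: "'updated' in refresh.stdout"    # you define the changed signal
```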
url: "https://{{ inventory_hostname }}/ready"
## 6. Retries That Hide the Real Error

Someone wrapped a flaky task in retries: 10 and delay: 5. When it worked, nobody looked closely. Now it does not work, and what you see in the log is ten identical failures in a row followed by a final abort — but the underlying error is the one that fired on attempt one, nine attempts ago, and by the time it scrolled past, the useful context was gone. This one hurts because the code looks defensive. It is actually hiding data.
```yaml
- name: wait for service
  uri:
    url: "https://{{ inventory_hostname }}/ready"
  register: result
  until: result.status == 200
  retries: 30
  delay: 10

# the service returned 403 on attempt one — a real auth bug —
# but you retried 29 more times because `until` only checks 200
```
Fix. Distinguish expected retries (timeouts, transient DNS) from unexpected ones (4xx responses, specific exceptions). Gate retries on the kind of failure, not on the overall result. Log every retry attempt with its actual failure reason. And when a task eventually succeeds after retries, emit a warning so you can find these ticking time bombs later.
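One way to gate retries on the kind of failure, sketched under the assumption that a 503 is the only status worth waiting out and anything in the 4xx range is a real bug:

```yaml
- name: probe readiness endpoint, retrying only on transient failures
  ansible.builtin.uri:
    url: "https://{{ inventory_hostname }}/ready"
    status_code: [200, 403, 503]    # accept these so the module itself does not error
  register: probe
  until: probe.status != 503        # 503 is transient: keep waiting
  retries: 30
  delay: 10

- name: fail loudly on a real bug instead of retrying it away
  ansible.builtin.assert:
    that:
      - probe.status == 200
    fail_msg: "readiness returned {{ probe.status }}: not transient, not retried"
```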
## 7. Module or Collection Version Drift Between Laptop, CI, and Prod

The playbook works on your laptop. It works in CI. It fails in prod — or the other way around. The playbook source is identical. What is different is the version of Ansible, the version of a collection, or the version of a module dependency. Modules change their arguments, default behaviors, and return values across minor releases. A playbook written against community.network 3.x will fail in subtle ways on 4.x.

```
TASK [network_cli] *********************************
ERROR! couldn't resolve module/action 'cisco.ios.ios_interfaces'. This often indicates a misspelling, missing collection, or incorrect module path.
```

The module exists. The collection exists. They are just not installed on THIS host — or they are installed at a version that does not have this module yet.

Fix. Pin everything: a requirements.yml with exact collection versions, and a Dockerfile or poetry-locked venv for the Ansible runner itself. CI and production must run from the same pinned environment — if your CI is on Ansible 2.16 and prod is on 2.15, you will lose half a day to version-skew bugs every month. The five minutes it takes to lock versions is worth the two hours of rollback debugging you will avoid.
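A pinned requirements.yml, as a sketch; the collection names are ones this post mentions, and the version numbers are placeholders, not recommendations:

```yaml
# requirements.yml -- install with: ansible-galaxy collection install -r requirements.yml
collections:
  - name: cisco.ios
    version: "5.3.0"          # exact pin, not >=
  - name: community.network
    version: "4.0.1"          # exact pin; 3.x and 4.x behave differently
```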
## The Pattern Underneath All Seven

Every one of these has the same shape. Ansible is doing something reasonable given the inputs it has, but the inputs are not what the operator thought they were — stale SSH sockets, unflushed handlers, mis-detected interpreters, volatile templates, drifted inventories, suppressed exceptions, mismatched module versions. The playbook code looks fine. The gap between "what the code says" and "what the infrastructure actually is" is where the bug lives.

The fast way to debug broken automation is not to stare at the playbook. It is to interrogate the gap. Compare the facts Ansible gathered against what you know to be true. Check the versions of every moving piece. Strip the retries and log what actually failed. In fifteen years of fixing other people's Ansible, I have never found a bug that did not live somewhere in that gap.

### Broken Automation Right Now?

I run a flat-fee rapid-fix service: $250 root-cause audit in 48 hours. Ansible, Cisco NSO, CI/CD, Python. If I cannot diagnose it, you do not pay.

Submit Your Issue