Daniel Westgaard

Posted on Jun 29 • Edited on Jul 14 • Originally published at riftmap.dev

How to Find Every Consumer of Your Ansible Role

#ansible #ansibleroles #ansiblecollections #blastradius

You have got a one-line fix for the hardening role. A sysctl value that should have been set from the start, the kind of change nobody notices until an auditor does. You open the role's repository, make the edit, and reach for a tag.

Then the question lands, the way it always does right before you cut a release. Who gets this?

You look at the repository in front of you for an answer. There is a tasks/ directory, a defaults/main.yml, a meta/main.yml, and a README that three people have edited and nobody has read. None of it names a single playbook that uses the role. Nothing inside the role knows who depends on it. That information lives somewhere else, in repositories you are not currently looking at, and the role has no way to point you towards them.

This is not your team being sloppy. It is structural, and Ansible chose it on purpose.

Here is the proof. Back in 2016, Jeff Geerling, who maintains more widely used Ansible roles than almost anyone alive, filed a request for a lock file. Something like package-lock.json or poetry.lock, a file that records exactly which version of each role and collection actually resolved, so a run is reproducible and you can see what you are really shipping. The request sat. When the old issue tracker was archived years later, it was carried over and then quietly abandoned. A decade on, Ansible still has no native lock file.

Sit with what that means. Every other ecosystem in this series at least commits to telling you which version is installed. npm has package-lock.json. Go has go.mod and go.sum. Python has its various lock formats. Ansible looked at that primitive and decided not to build it. And if the ecosystem will not commit to telling you what version of a role actually ran, it is certainly not going to tell you who consumes that role across your organisation. The reverse question, the one you are asking right now with your finger over the tag, was never in scope.

So let us answer it properly. If you change a shared Ansible role, what breaks? And how do you find every consumer before you find out the hard way?

Where your role actually gets used

Start with the shape of the problem, because Ansible reuse spreads a role's dependents across more places than people expect.

The canonical layout, the one the official best practice points you towards, is one repository for your playbooks and a separate repository per role. The playbook repository pulls the roles it needs through a requirements.yml:

# platform-playbooks/roles/requirements.yml
roles:
  - name: hardening
    src: https://gitlab.example.com/platform/ansible-role-hardening.git
    scm: git
    version: v2.3.0

  - name: geerlingguy.nginx
    version: "3.1.4"

collections:
  - name: polaris.infrastructure
    source: https://automationhub.example.com/api/galaxy/
    version: ">=1.4.0,<2.0.0"

  - name: community.crypto

That is one site where your role's consumers live. There are several more, and they do not look alike.

A play can pull a role in directly at the play level:

- hosts: webservers
  roles:
    - hardening
    - role: geerlingguy.nginx

A task can pull one in statically or dynamically, anywhere in the tasks section:

tasks:
  # Static: processed at parse time.
  - ansible.builtin.import_role:
      name: hardening

  # Dynamic: evaluated at runtime.
  - ansible.builtin.include_role:
      name: hardening

And a role can pull in another role as a dependency, declared in its own meta/main.yml. The dependency then runs before the role that declared it, recursively:

# ansible-role-webserver/meta/main.yml
dependencies:
  - role: hardening

Read that last one again, because it is the one that bites. If webserver depends on hardening in its metadata, then every playbook that uses webserver is also a consumer of hardening, and not one of those playbooks mentions hardening anywhere. The dependency is real, it runs in production, and it is invisible to anyone searching the playbook repositories for the word "hardening".

So the consumers of your role are scattered across requirements.yml files, play-level roles: keys, import_role and include_role tasks, and the meta/main.yml of other roles. Every one of those lives in a repository other than the one you are about to tag. None of them is listed anywhere central. This is the same structural fact behind the whole Find Every Consumer series: the dependency lives in the relationship between repositories, and the artifact you are changing cannot see its own dependents. It is the exact shape we covered for Terraform modules pulled by git source, only Ansible has more entry points and weaker typing.
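To make the list of sites concrete, here is a minimal sketch of pulling literal role names out of each one. It assumes the YAML has already been loaded into plain dicts and lists by a parser of your choice; the function names and site labels are illustrative, not part of Ansible or any particular tool.

```python
# Sketch only: assumes each YAML file has already been loaded into plain
# dicts and lists by a YAML parser. Function names and site labels are
# illustrative; none of this is an Ansible API.

ROLE_TASK_KEYS = (
    "import_role",
    "include_role",
    "ansible.builtin.import_role",
    "ansible.builtin.include_role",
)


def _role_name(entry):
    """A role entry is either a bare string or a mapping with role/name/src."""
    if isinstance(entry, str):
        return entry
    if isinstance(entry, dict):
        return entry.get("role") or entry.get("name") or entry.get("src")
    return None


def roles_in_requirements(doc):
    """Role entries in a requirements.yml (bare list or a 'roles' key)."""
    entries = doc if isinstance(doc, list) else (doc or {}).get("roles") or []
    return [("requirements.yml", n) for n in map(_role_name, entries) if n]


def roles_in_meta(doc):
    """Role-to-role dependencies declared in a role's meta/main.yml."""
    entries = (doc or {}).get("dependencies") or []
    return [("meta/main.yml", n) for n in map(_role_name, entries) if n]


def roles_in_play(play):
    """Play-level roles plus import_role/include_role tasks.

    Does not descend into block/rescue/always, to keep the sketch short.
    """
    found = [("roles:", n) for n in map(_role_name, play.get("roles") or []) if n]
    for task in play.get("tasks") or []:
        for key in ROLE_TASK_KEYS:
            args = task.get(key)
            if isinstance(args, dict) and args.get("name"):
                found.append((key.rsplit(".", 1)[-1], args["name"]))
    return found


# Parsed forms of the snippets shown earlier in this section.
requirements = {"roles": [{"name": "hardening", "version": "v2.3.0"}]}
meta = {"dependencies": [{"role": "hardening"}]}
play = {
    "hosts": "webservers",
    "roles": ["hardening", {"role": "geerlingguy.nginx"}],
    "tasks": [{"ansible.builtin.include_role": {"name": "hardening"}}],
}

references = (
    roles_in_requirements(requirements) + roles_in_meta(meta) + roles_in_play(play)
)
sites = sorted({site for site, name in references if name == "hardening"})
```

Four different sites, four different shapes, one role. The sketch only sees files you hand it, which is the real difficulty: those files live in repositories you have to go and find first.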

The version you cannot see

Now layer the lock file problem back on top, because it turns a hard problem into a worse one.

Look again at the requirements.yml above. The hardening role is pinned to v2.3.0. The polaris.infrastructure collection is pinned to a range, >=1.4.0,<2.0.0. The community.crypto collection has no version at all. Each of these is a different relationship to the thing you are about to release, and you have probably got all three patterns live across your estate right now.

When a version is absent, ansible-galaxy install takes the latest it can find. When it is a range, you get the newest release inside the range at install time. When it is a branch such as main, you get whatever that branch points at the moment the pipeline runs. And here is the part with no safety net: nothing writes down what actually resolved. There is no lock file to read afterwards. With npm you can at least open package-lock.json and see the truth, which is why "the change you ship without shipping it" was the core of the npm post. In Ansible you cannot, because the file that would hold that truth was never built.

So tagging v2.4.0 of your role is not shipping a change to a known list of consumers at a known version. It is a fan-out to an unknown set of repositories, an unknown number of which are pinned, ranged, or floating, resolving on a schedule you do not control, with no record of what each one actually ran. You are pushing a change into a system that has deliberately chosen not to remember its own state.

The first thing a real answer has to do, then, is the thing the ecosystem refuses to. It has to tell you, for every consumer, whether they pinned you, ranged you, or left it floating. "Who is unpinned" is not a nice-to-have. In an ecosystem with no lock file, it is the single most important fact about your blast radius, because the unpinned consumers are the ones that will move the moment you tag, without anyone deciding to.
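As a rough illustration of that classification, the sketch below sorts the version field of a requirements entry into one of four states. The state names and the heuristics are assumptions made for this example: from the string alone you cannot reliably tell a tag from a branch, so anything that does not look like a version number or a full commit SHA is treated as a moving ref.

```python
import re

# Heuristics for this sketch, not rules taken from ansible-galaxy itself.
EXACT_VERSION = re.compile(r"^v?\d+(\.\d+)*$")   # e.g. 3.1.4 or v2.3.0
COMMIT_SHA = re.compile(r"^[0-9a-f]{40}$")       # a full git commit hash
RANGE_MARKERS = (">", "<", "!=", ",", "*")


def constraint_state(version):
    """Classify a declared version as absent, ranged, pinned or branch."""
    if version is None:
        return "absent"
    text = str(version).strip()
    if not text:
        return "absent"
    if text.startswith("=="):
        text = text[2:].strip()
    if any(marker in text for marker in RANGE_MARKERS):
        return "ranged"
    if EXACT_VERSION.match(text) or COMMIT_SHA.match(text):
        return "pinned"
    # Anything else is assumed to be a branch or other moving ref.
    # A tag with a non-numeric name would be misread here.
    return "branch"


def is_unpinned(version):
    """True when the consumer can move without anyone editing its manifest."""
    return constraint_state(version) != "pinned"


# The three entries from the requirements.yml above, plus a branch.
states = {
    "hardening": constraint_state("v2.3.0"),
    "polaris.infrastructure": constraint_state(">=1.4.0,<2.0.0"),
    "community.crypto": constraint_state(None),
    "riding main": constraint_state("main"),
}
```

Everything for which is_unpinned returns True belongs on the list of consumers that move when you tag.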

What Ansible's own tooling tells you, and where it stops

The Ansible toolchain is genuinely good, and it answers real questions. It just answers the other ones.

ansible-galaxy collection list and ansible-galaxy role list tell you what is installed in the environment you are standing in. That is forward, local, and per-environment. There is no ansible why. The CLI will happily tell you that a collection is present; it has no concept of which of your repositories declared it, let alone which tasks use it.

ansible-inventory --graph --vars and --host will reconstruct what a given host ends up with, variables and all. Again, forward, and per-host. It answers "what does this host get", never "who across my repositories consumes this".

ansible-lint and Molecule are excellent, and you should be running both. ansible-lint in particular will flag short module names, deprecated syntax, and a long list of anti-patterns. But linting is per-project quality enforcement and Molecule is per-role testing. Neither builds a cross-repository map of who depends on what. They make each repository better in isolation; they do not connect them.

It is worth noting how often engineers reach past all of this and build the reverse lookup themselves. There is a published tool, ansible-discover, whose entire purpose is to list the dependants of a role so you can trigger the right CI jobs when it changes. There is ansible-variables, which exists to trace where a host's variables actually come from. Both are small, both are local-path, both work only across the repositories you have already checked out together. They are proof that the question matters enough for people to keep writing the same tool, and they define the gap precisely: the moment the answer spans repositories that are not on one disk at one time, these stop.

Then there is the registry layer, and this is where people most often expect the answer to live, because it feels like it should. Ansible Galaxy, Automation Hub, and a private Automation Hub do distribution, curation, governance, and supply-chain control, and they do it well. What they record is publication and downloads. A download count is not a dependency edge. Knowing that polaris.infrastructure was pulled four hundred times this month tells you nothing about which of your repositories declared it, which tasks call into it, or what breaks if you change it. This is the same trap as the Go module proxy: the proxy logs fetches, and a fetch is not a use.

Automation Controller, AWX, and the wider Automation Platform sit one layer up and know about Projects, where each Project is a git repository the controller syncs and whose requirements.yml it installs at sync time. Automation Analytics will report on job runs, host counts, and ROI. All useful, all runtime. The controller knows which Projects exist and installs their requirements; it does not parse and present a cross-Project graph of which Projects depend on which role. It answers "what ran" and "what runs", not "what is declared, everywhere, right now". It is the same distinction we drew for GitLab's pipeline analytics, where usage events told you about executions and never about the static set of repositories that include a thing.

And the update bots, which are the closest any of these come to caring about consumers? They treat Ansible as a second-class citizen. Renovate does have an ansible-galaxy manager that will bump role and collection versions per consumer, which means it implicitly knows about the repositories it is switched on for. But its data source for collections ignores a custom source: and falls back to public Galaxy, so for an internal collection served from your own Automation Hub, the exact case where blast radius is your problem, Renovate is effectively looking at the wrong registry. Dependabot does not support Ansible Galaxy at all. The feature request has been open since 2021 with a steady trickle of thumbs-up and no support. Even the ecosystem built to watch your dependencies for you does not watch these.

Every tool above answers a forward question or a distribution question. What is installed here. What does this host get. What did Galaxy serve. What ran last night. Not one of them answers the only question that matters when your finger is on the tag: across all of my repositories, who consumes this role, at what version pin, and what breaks if I change it.

Why this is harder than it looks

You might reasonably think this is a grep problem. Search every repository for the role name, collect the hits, done. It is not, and the reasons are specific to how Ansible resolves things at runtime rather than at parse time.

A role's real interface is variables it never declares

This is the hardest thing in Ansible, and it is worth slowing down for, because it is the one that has no clean answer anywhere.

A role's true contract is not its name. It is the set of variables it reads. And in Ansible, those variables are supplied by the consumer, not declared by the role. Your hardening role reads hardening_sysctl_overrides. It does not, and cannot, list that as a typed input the way a function signature would. The value is set somewhere out in the consumer's world: in a group_vars/webservers.yml, in a host_vars file, in a vars: block on a play, in an --extra-vars flag in a pipeline, in a set_fact. Ansible resolves which one wins through a precedence order with more than twenty levels, and role defaults sit very near the bottom of it, which is the source of a thousand confused afternoons.

Now rename that variable. Ship hardening v2.4.0 where hardening_sysctl_overrides becomes hardening_sysctl_config. Nothing errors at parse time. There is no compile step to fail. The old variable simply stops being read, the new one falls back to its default, and a hardening setting silently reverts on every host whose consumer set the value under the old name. A search for the role name will not find it either, because the file that breaks is a group_vars or host_vars file that sets the old variable name and never mentions the role at all. The consumer set a variable; it never said which role the variable was for.

This is the inverted contract at the heart of Ansible reuse. The role depends on inputs it does not declare. The playbook declares inputs without saying which role they feed. The dependency edge has no home in either file. It is the reason this is genuinely harder than any of the manifest-based ecosystems in this series, and, as we will get to, it is also the honest boundary of what any tool that reads source can claim.
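The silent revert is easy to reproduce in miniature. The sketch below collapses Ansible's precedence order to two levels, role defaults and consumer-set values, which is a deliberate simplification. The sysctl key and the empty defaults are made up for the example.

```python
def effective_vars(role_defaults, consumer_vars):
    """Two-level stand-in for variable precedence: consumer values win.

    Real Ansible has more than twenty levels; two are enough to show the effect.
    """
    return {**role_defaults, **consumer_vars}


# What a consumer set in, say, group_vars/webservers.yml (hypothetical value).
consumer_vars = {"hardening_sysctl_overrides": {"net.ipv4.ip_forward": 0}}

# Defaults shipped by each release of the role (hypothetical contents).
v2_3_defaults = {"hardening_sysctl_overrides": {}}
v2_4_defaults = {"hardening_sysctl_config": {}}

# v2.3.0 reads the old name and sees the consumer's value.
seen_by_v2_3 = effective_vars(v2_3_defaults, consumer_vars)["hardening_sysctl_overrides"]

# v2.4.0 reads the new name, finds nothing set, and falls back to its default.
seen_by_v2_4 = effective_vars(v2_4_defaults, consumer_vars)["hardening_sysctl_config"]

# Relies on a naming convention (role-prefixed variables), which Ansible
# does not enforce: consumer variables the new release no longer reads.
orphaned = sorted(
    name
    for name in consumer_vars
    if name.startswith("hardening_") and name not in v2_4_defaults
)
```

No exception is raised anywhere. The orphaned check at the end only works if consumers follow a prefix convention and the role gives every variable a default, and neither is guaranteed.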

The same module call has two spellings

A collection ships modules, and the real blast radius of a collection is every task that calls one of its modules. Ansible lets you write that call two ways. Fully qualified, polaris.infrastructure.configure_firewall, or as a bare configure_firewall with a play-level collections: keyword setting up a search path.

The catch is that the short form is on the way out. The official documentation describes the collections: keyword as a temporary mechanism from the 2.9 transition and tells you to rewrite content that uses it and prefer the fully qualified name. ansible-lint actively flags it. So modern, well-written playbooks tend to use the fully qualified name directly in every task and carry no collections: block at all. Which means a grep for the short name misses the modern callers, a grep for the fully qualified name misses the legacy ones, and the bare short name collides with built-ins and with identically named modules in other collections. The usage edge for a collection is real, it is increasingly written in fully qualified form in task bodies, and it is precisely where naive search is weakest.
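For the fully qualified form, the usage edge can at least be recognised by shape. The sketch below assumes tasks have already been parsed into dicts and treats any task key of the form namespace.collection.module as a call into that collection. The ignore list is an illustrative default, and bare short names are deliberately left unattributed.

```python
import re

# namespace.collection.module, where the module part may itself contain dots.
FQCN = re.compile(r"^([a-z_][a-z0-9_]*)\.([a-z_][a-z0-9_]*)\.([a-z_][a-z0-9_.]*)$")


def collections_called(tasks, ignore_namespaces=("ansible",)):
    """Return the 'namespace.collection' names invoked by fully qualified keys."""
    found = set()
    for task in tasks:
        for key in task:
            match = FQCN.match(key) if isinstance(key, str) else None
            if match and match.group(1) not in ignore_namespaces:
                found.add(f"{match.group(1)}.{match.group(2)}")
    return found


tasks = [
    {
        "name": "Open the firewall",
        "polaris.infrastructure.configure_firewall": {"state": "present"},
    },
    {"ansible.builtin.copy": {"src": "a", "dest": "b"}},
    # Short form: cannot be attributed without the play's collections: list.
    {"configure_firewall": {"state": "present"}},
]

called = collections_called(tasks)
```

The third task is the legacy spelling, and the sketch silently misses it. Attributing it would mean reading the play-level collections: search path as well, which is why this kind of edge deserves less confidence than a manifest declaration.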

Static and dynamic inclusion do not grep the same

roles: and import_role are static. Ansible processes them at parse time, so the role name is right there in the file. include_role is dynamic, evaluated at runtime, and it can take a computed name:

- ansible.builtin.include_role:
    name: "{{ role_for_this_environment }}"

There is no string hardening to find here. The consumer relationship is a variable that resolves during the run. This is unresolvable from source by anyone, including us, and any tool that claims otherwise is guessing. The honest move is to find the literal cases and be straight about the computed ones.
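In code, being straight about the computed cases can be as simple as refusing to guess. This small sketch, with a hypothetical helper name, separates literal role names from templated ones so the second group can be reported as unresolved rather than dropped or invented.

```python
def split_role_names(names):
    """Separate literal role names from Jinja2-templated (computed) ones."""
    literal, computed = [], []
    for name in names:
        # A name containing a Jinja2 expression only resolves during the run.
        (computed if "{{" in name else literal).append(name)
    return literal, computed


literal, computed = split_role_names(
    ["hardening", "{{ role_for_this_environment }}", "geerlingguy.nginx"]
)
```

The computed list is not an error. It is a set of consumers you know exist and cannot name from source, and it is worth showing next to the resolved ones.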

Dependencies hide behind other dependencies

Back to meta/main.yml. Role dependencies are recursive and run before the role that declares them. If webserver depends on hardening, and app-tier depends on webserver, then a playbook using app-tier is a transitive consumer of hardening two hops away, and it names neither webserver nor hardening. Finding direct consumers is a string search. Finding transitive ones is a graph traversal, and you cannot fake a graph traversal with grep, which is the same wall we hit with nested GitLab CI includes.
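The traversal itself is small once the direct edges exist. Here is a minimal breadth-first sketch over a hand-written edge map; the playbook name site.yml is hypothetical, and in practice the map would be built from parsed meta/main.yml and playbook files across every repository.

```python
from collections import deque


def transitive_consumers(depends_on, target):
    """Map every direct or indirect consumer of target to its dependency chain.

    depends_on maps a consumer to the roles it references directly.
    """
    # Invert the edges so we can walk from a role to whoever uses it.
    reverse = {}
    for consumer, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, set()).add(consumer)

    chains = {target: [target]}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for consumer in sorted(reverse.get(node, ())):
            if consumer not in chains:  # visited check also guards against cycles
                chains[consumer] = [consumer] + chains[node]
                queue.append(consumer)

    del chains[target]
    return chains


depends_on = {
    "webserver": ["hardening"],   # webserver/meta/main.yml
    "app-tier": ["webserver"],    # app-tier/meta/main.yml
    "site.yml": ["app-tier"],     # hypothetical playbook; never names hardening
}

chains = transitive_consumers(depends_on, "hardening")
```

The chain recorded for site.yml runs through app-tier and webserver before it reaches hardening, and that path is exactly what a search for the word "hardening" in the playbook repository can never show.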

The declaration is not in one place

Even the manifest is not one file. A requirements.yml can pull in other requirements files with an include: directive. Collections get declared in requirements.yml, in a collection's own galaxy.yml dependencies, in role metadata, and in execution-environment definitions that bundle a requirements file into a container image. ansible.cfg can redirect where roles and collections resolve from with roles_path and collections_path. The string that represents your role appears in a different shape in each of these, and a tool that reads only the root requirements.yml of each repository will quietly undercount.

What the full answer requires

Put all of that together and the requirements for actually answering "who consumes this role" fall out:

Scan every repository in the organisation, with no opt-in and no per-repo config. The consumers live in repositories other than the one you are changing, so anything that needs each team to register or annotate will be incomplete the day someone forgets.
Parse every declaration site, not just the obvious one. requirements.yml roles and collections, galaxy.yml dependencies, and meta/main.yml role-to-role dependencies, at minimum.
Capture usage edges, not only install edges. Who declared the role in a manifest is one thing. Who actually invokes it through roles:, import_role, and include_role, and who calls into a collection through fully qualified module names in task bodies, is the thing that breaks.
Reconstruct transitive role-dependency chains, so a consumer two hops away through meta dependencies is counted, not missed.
Resolve names to the in-org repository that produces the artifact, including git-URL sources, with a guard against a public role or collection that happens to share a name with one of yours.
Record the declared version constraint, or its absence. Since there is no lock file, the queryable fact you can offer is who pinned you, who ranged you, and who left it floating.
Stay current by re-scanning, not by trusting run telemetry. What ran last week is not what is declared today.
Make it one query, because a real answer you have to assemble by hand across forty repositories is not an answer you will actually run before a release.
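To show what "one query" could look like once those edges have been collected, here is a deliberately tiny sketch. The Edge record, its field names, and the repository names are all invented for illustration and do not describe any real tool's data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One declaration or usage of an artifact by a consumer repository."""
    consumer_repo: str
    artifact: str
    site: str        # e.g. "requirements.yml", "meta/main.yml", "include_role"
    constraint: str  # "pinned", "ranged", "branch" or "absent"


def consumers_of(edges, artifact, unpinned_only=False):
    """Return the sorted repositories that consume artifact."""
    hits = [edge for edge in edges if edge.artifact == artifact]
    if unpinned_only:
        hits = [edge for edge in hits if edge.constraint != "pinned"]
    return sorted({edge.consumer_repo for edge in hits})


# Hypothetical edges collected from a scan of three repositories.
edges = [
    Edge("platform-playbooks", "hardening", "requirements.yml", "pinned"),
    Edge("payments-playbooks", "hardening", "requirements.yml", "branch"),
    Edge("ansible-role-webserver", "hardening", "meta/main.yml", "absent"),
    Edge("platform-playbooks", "geerlingguy.nginx", "requirements.yml", "pinned"),
]

everyone = consumers_of(edges, "hardening")
will_move_on_tag = consumers_of(edges, "hardening", unpinned_only=True)
```

The second call is the pre-release question in its shortest form: which repositories pick up the new tag without anyone editing a file.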

How Riftmap does this

This is the gap Riftmap was built to close, so here is honestly what it does and does not do for Ansible.

Riftmap scans a GitLab or GitHub organisation and parses every repository deterministically. There is no language model in the parse path; the dependency edges come from parsing the actual files, and each edge carries a rule-based confidence score. That last point matters for Ansible specifically, because some of these edges are genuinely more certain than others, and the tool says so rather than flattening everything to a confident-looking line.

On the consumer side, it reads roles from requirements.yml, from meta/main.yml dependencies, and from play-level roles:, import_role, and include_role, including the modern ansible.builtin.include_role form and includes nested inside block, rescue, and always. It reads collections from requirements.yml, from galaxy.yml dependencies, from the play-level collections: keyword, and, crucially, from fully qualified module calls in task bodies, which is the usage edge that the fading collections: keyword is being replaced by. Manifest declarations resolve at full confidence; a play-level roles: entry or an include_role name is scored a little lower, because it might point at a role bundled in the same repository rather than a shared one; a fully qualified module call is scored lower still and filtered against a denylist of public namespaces so the graph is about your collections, not ansible.builtin.

On the producer side, it detects which repository publishes a role, from its meta/main.yml galaxy metadata, and which publishes a collection, from its galaxy.yml. An edge forms when a consumer declaration resolves to a producer in the same organisation. A role or collection pulled by git URL resolves to the in-org repository that produces it, which sounds obvious and was, until recently, exactly where things slipped: a collection declared by a bare git source: with no name used to be parsed and then dropped on the floor, because the resolver had no git fallback for collections. It became a permanent external edge even when the producing repository was sitting right there in the org. That is fixed now, and it is a good illustration of how well infrastructure hides its own edges, that a tool built specifically to find them still had a blind spot for the case where the dependency was declared by URL.

Because the whole thing is a graph, the transitive case is handled. The impact traversal walks meta dependency chains, so when you ask what breaks if you change hardening, you get the playbooks two hops away through webserver and app-tier, which is the answer grep structurally cannot give you. Riftmap uses no graph database to do this, just PostgreSQL and a breadth-first walk in application code, which is a deliberate choice covered elsewhere, but the relevant point here is that the traversal exists.

And the version-state question, the one the missing lock file makes load-bearing, is answered directly. Every consumer of a role or collection carries its constraint state: pinned to an exact version, floating on a range, riding a branch, or absent entirely. Riftmap also flags the consumers already trailing the latest release. You can filter those consumers down to the unpinned ones and read off, in one number, how many repositories will move the instant you tag, with nobody deciding to. That is the "who is unpinned" answer the earlier section on the missing lock file said you needed, and it is a filter, not a research project.

The same view works for a role or a collection. Here it is for an internal collection, polaris.infrastructure, because a collection is where you can see both kinds of edge in one place: most of these consumers declared it in a requirements.yml, and one of them is in the list only because a playbook calls one of its modules by fully qualified name.

There is even a small, on-theme detail in how the detection works. The cheap pre-filter that decides whether a YAML file is worth parsing as a playbook used to key off the presence of roles: or collections: or include tokens, which meant a thoroughly modern playbook that used nothing but fully qualified module calls, no collections: block at all, would have been skipped before it was ever examined. The best-written playbook was the one that hid best. The detection now keys off hosts:, the actual marker of a play, so the modern repositories that do everything right are exactly the ones that no longer slip through.

What Riftmap does not claim

The honest limits are part of the product, so here they are plainly.

It does not trace variables, and that is on purpose. "Who consumes this variable" means modelling more than twenty precedence levels with per-repository runtime semantics, and the result would be a false-positive machine that quietly undermined every edge you actually trust. This is the inverted contract from earlier, and it is the boundary of what reading source can honestly tell you. Riftmap maps the artifact graph and is straight about the fact that the variable contract sits outside it, because the alternative is pretending to an answer nobody can derive from source without lying about it.

It does not resolve computed include_role names, because name: "{{ something }}" is unknowable until the run. It does not parse import_playbook references, which are intra-repository file paths rather than cross-repository dependencies. It does not, today, read execution-environment definitions or follow requirements.yml include: chains, both of which are on the roadmap rather than in the product. And it recognises your collections by your namespaces, so a public collection you merely consume shows up as an external dependency by design, not as a repository in your graph.

None of those gaps is hidden behind a confident interface. The point of a blast-radius tool is to be trusted on a Friday afternoon, and you do not earn that by overclaiming.

Distribution and blast radius are different questions

Step back, and the shape of it is clear. The very things that make Ansible feel light are the things that make its blast radius invisible. No lock file, so no record of what ran. No compile step, so no parse-time failure when a contract breaks. Variables resolved at runtime from wherever the consumer happened to set them. A role pulled from a repository nobody pinned. Ansible will cheerfully run a playbook it has never seen, against a host it has never met, resolving a role nobody locked and reading a variable nobody declared, and it will do it well. That flexibility is the entire point of the tool.

It is also why "what breaks if I change this role" has no answer in the box. The box was designed not to keep that answer.

Galaxy and Automation Hub tell you where a role came from and how often it was pulled. That is distribution, and it is a real and useful thing to know. It is a different question from what breaks when you change it. Distribution is about the artifact. Blast radius is about the relationships between repositories, which is the one place the artifact cannot see and the registry was never asked to look. You find every consumer by reading those relationships across the whole organisation and resolving them into a graph. Parsed, not inferred.

Your hardening fix is still sitting there, untagged. Now you can see who gets it, who pinned you and who did not, and which playbooks two hops away you were about to surprise. Tag it on purpose.

This post is part of the Find Every Consumer series, which works through the same question across Terraform modules, Docker base images, GitHub Actions, Helm charts, Go modules, GitLab CI templates, npm packages, and Python packages, one ecosystem at a time.

Riftmap maps cross-repository dependencies across a GitLab or GitHub organisation and answers the change-impact question directly: if I change this, what else breaks? It parses the relationships between your repositories rather than asking you to model them, with no per-repo config. Map your org, or book a walkthrough and we will map it with you.
