Ansible Dynamic Inventory for EC2 Fleets: 7 Tuning Tips

Originally published on kuryzhev.cloud

We were three minutes into a patch run against 600 EC2 instances when the play just... stopped. The error was botocore.exceptions.ClientError: RequestLimitExceeded, and it took us a while to realize the problem wasn't the patch playbook. It was Ansible dynamic inventory for EC2 rebuilding the host list from the API on every single invocation, with nothing cached in between. If you're running amazon.aws.aws_ec2 against anything larger than a handful of instances, these are the fixes that actually matter.

Cache Your Dynamic Inventory or Pay for It in API Throttling

Uncached inventory calls hammer the EC2 API on every single playbook run, and large fleets pay for it in throttling. The aws_ec2 plugin makes a fresh DescribeInstances call every time it builds inventory, and on fleets of 500+ instances that's slow and, eventually, rate-limited. Enable the jsonfile cache plugin with a sane cache_timeout so you're not re-fetching the same instance list for every CI job or ad-hoc command in a five-minute window.

The tradeoff is real: a longer cache timeout means you might target stale hosts during an active autoscaling event, but a too-short timeout brings back the throttling. We settled on 300 seconds for steady-state patching and drop it to 30-60 seconds during active scale-out windows.
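As a rough sketch of the short-timeout variant, one option is a second inventory file that you point at only during scale-out windows. The file name and the cache_prefix value here are made up for illustration; the cache options themselves are the standard inventory cache settings.

```yaml
# inventories/aws_ec2_scaleout.yml  (hypothetical file name)
plugin: amazon.aws.aws_ec2
regions:
  - us-east-1

cache: true
cache_plugin: jsonfile
cache_connection: /tmp/ansible_inventory_cache
cache_timeout: 30                 # seconds; short TTL while the ASG is actively scaling
cache_prefix: aws_ec2_scaleout    # illustrative; keeps these cache files easy to tell apart
```

If you would rather keep a single file, passing --flush-cache to ansible-inventory or ansible-playbook forces a fresh fetch for that one run without touching the timeout.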

Use keyed_groups Instead of Manual Group Definitions

Stop maintaining static host groups by hand — let tags build them for you. The keyed_groups option auto-creates groups from instance tags and state, so a new instance tagged Role: web shows up in role_web without anyone touching an inventory file. This is the single biggest maintenance win once you're managing infrastructure across multiple AWS accounts or regions.

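Watch out: tag keys are case-sensitive, and so are tag values. We had instances tagged Environment: Prod next to others tagged environment: prod from an older Terraform module, and half the fleet silently vanished from the env_prod group: no error, no warning, just missing hosts. A key like tags.Environment matches only that exact spelling, hosts without it are skipped quietly when strict is false, and a value of Prod produces env_Prod rather than env_prod. Standardize your tagging convention before you lean on keyed_groups.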
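Until the tags are cleaned up, a stopgap is to make the group key tolerant of both spellings. This is a sketch rather than something to keep long term: it relies on keyed_groups keys being Jinja2 expressions, and the fallback to a lowercase environment tag is an assumption about how the mismatched instances are tagged.

```yaml
keyed_groups:
  # Accept either tag spelling and lowercase the value, so that
  # "Environment: Prod" and "environment: prod" both end up in env_prod.
  - key: "tags.Environment | default(tags.environment) | lower"
    prefix: env
```

Instances that carry neither tag still get no env_ group, so a pre-flight look at the inventory graph remains worthwhile.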

Filter at the API Level, Not in Playbook Logic

Filtering with when: conditions after inventory is already built wastes API calls and slows every run down. If you fetch 3,000 instances across all regions and then filter down to 40 with a task-level condition, you've already paid the cost of describing every terminated, stopped, and unmanaged instance in the account. Push the filter into the inventory config itself with filters.

This is the mistake we see most often on teams migrating from static inventory files: they keep the old "loop and skip" logic instead of trusting the plugin's native filtering. Filtering by instance-state-name: running and a ManagedBy tag at the API level cuts inventory build time dramatically and keeps you from accidentally targeting a stopped or terminated instance.

Patch in Batches with serial and max_fail_percentage

Without serial, the whole fleet is a single batch: every host gets each task before the play moves on (forks only limits how many run at the same moment), so one bad patch or AMI can take down everything in the same pass. Set serial: "20%" combined with max_fail_percentage: 10 so that a batch with more than 10% failed hosts aborts the rollout before it reaches the rest of the fleet. This is the difference between "20% of web servers rebooted into a broken kernel" and "the entire tier is down."

Pair this with ASG lifecycle hooks (autoscaling:EC2_INSTANCE_TERMINATING) so instances aren't yanked out from under a play mid-patch during a scale-in event. I stopped running unbatched patch playbooks after a security update rebooted an entire ASG simultaneously and took our checkout flow down for four minutes — it's not a risk worth taking twice.
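For reference, a minimal play with those two keywords might look like the sketch below. The play name, host pattern, and the single task are illustrative, and the dnf module assumes a dnf-based distribution; swap in whatever package module your AMIs need.

```yaml
# patch_fleet.yml  (sketch; names and the task are illustrative)
- name: Patch web fleet
  hosts: "role_web:&env_prod"
  become: true
  serial: "20%"              # patch one fifth of the matched hosts per batch
  max_fail_percentage: 10    # abort the play if more than 10% of a batch fails
  tasks:
    - name: Apply security updates
      ansible.builtin.dnf:
        name: "*"
        state: latest
        security: true       # only updates flagged as security-related
```

Because serial and max_fail_percentage are play keywords, they live in the playbook itself; a failed batch stops the play before the next batch starts.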

Don't Hardcode Credentials — Use Instance Roles for Inventory Lookups

An IAM instance role beats a static access key every time, especially for something as routine as inventory lookups. Attach a least-privilege role to your bastion or control node scoped to ec2:DescribeInstances, ec2:DescribeTags, and ec2:DescribeInstanceStatus — nothing more. There's no legitimate reason for the inventory plugin to have write access to anything.

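Never commit aws_access_key or aws_secret_key into aws_ec2.yml, even in a "private" repo. If you need multiple profiles, use the AWS_PROFILE environment variable or role assumption (a role_arn entry in the profile, or the plugin's assume_role_arn option), not plaintext credentials sitting next to your playbooks. AWS_ROLE_ARN on its own is only honored together with a web identity token file, so it won't help on a plain control node. Check the AWS IAM roles documentation if you're still wiring up instance profiles for this.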
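In the inventory file this mostly comes down to what you leave out. The fragment below is a sketch: the profile name and the role ARN are placeholders, and it is worth checking the option names against the aws_ec2 docs for the collection version you have installed.

```yaml
# inventories/aws_ec2.yml  (credential-related options only)
plugin: amazon.aws.aws_ec2

# No access_key / secret_key here: on a control node with an instance
# profile attached, boto3 picks up the role credentials on its own.

# Optional: a named profile from ~/.aws/config ("patching" is a made-up name)
# profile: patching

# Optional: assume a read-only role in another account (placeholder ARN)
# assume_role_arn: arn:aws:iam::123456789012:role/ansible-inventory-readonly
```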

Verify Inventory Before You Patch — ansible-inventory --graph Is Your Sanity Check

Run ansible-inventory --graph before every patch job, not after something goes wrong. It's a thirty-second command that catches empty groups from a bad region filter or a typo'd tag key before you waste a play against zero hosts — or worse, against the wrong hosts because your filter was too loose.

A common assumption that trips people up: dynamic inventory does not refresh automatically when an ASG scales out. It's a snapshot taken at the start of the playbook run, not a live feed. If you scale from 10 to 40 instances mid-deploy, those new 30 won't appear until the next run pulls a fresh snapshot, and with caching enabled that means after cache_timeout expires or after a run with --flush-cache. Plan your patch windows accordingly, and re-run --graph if you suspect the fleet changed underneath you. See the amazon.aws aws_ec2 inventory plugin docs for the full option list.

Handle Fleet Churn: Compose ansible_host and Skip Terminating Instances

Public DNS names lag behind instance launch, so use compose to force ansible_host to the private IP instead. A freshly-launched instance from an ASG scale-out event often doesn't have a resolvable public DNS record yet, and Ansible will hang or fail trying to connect. Setting ansible_host: private_ip_address in compose sidesteps that entirely, assuming your control node has network access via VPC peering or a VPN.

During scale-in events, exclude instances that are mid-termination using a lifecycle tag set by your ASG termination hook, combined with the instance-state-name: running filter. Otherwise you'll get intermittent UNREACHABLE failures in your patch run that have nothing to do with the patch itself — they're just instances that disappeared between inventory build and task execution.

Here's the full inventory config we run in production, tying all of the above together:


# inventories/aws_ec2.yml
# Dynamic inventory config for aws_ec2 plugin (amazon.aws >= 7.6.0)
plugin: amazon.aws.aws_ec2

# Restrict regions explicitly — omitting this queries ALL AWS regions
regions:
  - us-east-1
  - eu-west-1

# Only include running instances tagged as ansible-managed
filters:
  instance-state-name: running
  "tag:ManagedBy": ansible

# Cache results to avoid RequestLimitExceeded on large fleets
cache: true
cache_plugin: jsonfile
cache_connection: /tmp/ansible_inventory_cache
cache_timeout: 300           # seconds; tune lower during active scaling events
cache_prefix: aws_ec2_prod

# Don't fail hard if some instances lack expected tags (mixed fleets)
strict: false

# Build groups automatically from tags instead of static host files
keyed_groups:
  - key: tags.Environment
    prefix: env
  - key: tags.Role
    prefix: role
  - key: state.name          # the plugin exposes the API's State field as the "state" hostvar
    prefix: state

# Use private IP instead of public DNS to avoid resolution lag
# on freshly-launched instances during ASG scale-out
compose:
  ansible_host: private_ip_address
  ansible_user: "'ec2-user'"

# Skip instances that are mid-termination (lifecycle tag set by ASG hook).
# Tag key and value are illustrative; match whatever your hook sets.
exclude_filters:
  - "tag:lifecycle":
      - terminating

Install the required collection version first — the aws_ec2 plugin behavior described here needs amazon.aws 7.x or later:


ansible-galaxy collection install amazon.aws:7.6.0

And here's the pre-flight check plus the actual batched patch run, output included so you know what a healthy run looks like:


# Verify inventory before running a patch playbook — catch empty groups early
$ ansible-inventory -i inventories/aws_ec2.yml --graph

@all:
  |--@ungrouped:
  |--@aws_ec2:
  |  |--ip-10-0-1-23
  |  |--ip-10-0-1-45
  |--@env_prod:
  |  |--ip-10-0-1-23
  |  |--ip-10-0-1-45
  |--@role_web:
  |  |--ip-10-0-1-23
  |  |--ip-10-0-1-45
  |--@state_running:
  |  |--ip-10-0-1-23
  |  |--ip-10-0-1-45

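# Patch playbook run with batched rollout to limit blast radius.
# serial and max_fail_percentage are play keywords, so these -e values only
# take effect if patch_fleet.yml templates them, e.g. serial: "{{ serial }}"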
$ ansible-playbook -i inventories/aws_ec2.yml patch_fleet.yml \
    --limit "role_web:&env_prod" \
    -e "serial='20%'" \
    -e "max_fail_percentage=10"

PLAY [Patch web fleet] ****************************************
TASK [Apply security updates] *********************************
changed: [ip-10-0-1-23]
changed: [ip-10-0-1-45]

PLAY RECAP *******************************************************
ip-10-0-1-23  : ok=3  changed=1  unreachable=0  failed=0
ip-10-0-1-45  : ok=3  changed=1  unreachable=0  failed=0

None of these fixes are exotic — cache your inventory, filter at the API level, batch your rollouts, and verify before you run. But together they're the difference between a dynamic inventory setup that quietly scales with your fleet and one that throttles, misses hosts, or takes down a whole tier during a routine patch cycle. If you're building out your broader AWS automation stack around this, our DevOps_DayS archive has more patterns for treating infrastructure as continuously reconciled state rather than a one-off script.