DEV Community

ひとし 田畑

`terraform plan` is blind to anything not in its state. So I scan AWS directly with boto3.

terraform plan is great at one thing: telling you how your managed resources
have drifted from your config. But it has a blind spot baked into its design — it
only knows about resources that are in your state file. A security group someone
spun up in the console, an RDS instance a teammate created with the CLI, an S3
bucket left behind by a deleted stack: plan will never mention any of them,
because as far as Terraform is concerned, they don't exist.

For a drift detector that's a problem. "Drift" isn't only "a managed resource
changed"; it's also "something exists in the account that nobody declared." To catch
that second kind, you have to stop asking Terraform and start asking AWS. Here's how I do
it with boto3, the AssumeRole pattern that makes it work across accounts, and the
part that was actually hard: making the two data sources comparable.

One tool, many accounts: STS AssumeRole

The tool runs as one process but needs to read resources in several AWS accounts.
Baking long-lived access keys for each account into config is exactly the thing you
don't want. The clean answer is STS AssumeRole: the tool has one identity, and
each account it scans grants a role that identity can assume. You get short-lived,
scoped credentials per scan.

In boto3 it's small:

import boto3


def get_session(role_arn=None, region='ap-northeast-1'):
    """Return a boto3 Session, optionally via STS AssumeRole."""
    if role_arn:
        sts = boto3.client('sts')
        resp = sts.assume_role(RoleArn=role_arn, RoleSessionName='syncvey-scan')
        creds = resp['Credentials']
        return boto3.Session(
            aws_access_key_id=creds['AccessKeyId'],
            aws_secret_access_key=creds['SecretAccessKey'],
            aws_session_token=creds['SessionToken'],
            region_name=region,
        )
    return boto3.Session(region_name=region)

A couple of things worth calling out:

  • RoleSessionName shows up in CloudTrail. I set it to something identifiable (syncvey-scan) so when someone audits the target account, the assumed-role sessions are obviously "the scanner," not a mystery principal.
  • The role_arn=None fallback is deliberate. No role configured → use whatever ambient credentials the process already has (instance profile, AWS_PROFILE, env vars). Single-account users don't have to set up cross-account roles to get started; multi-account users attach a role_arn per system and it just works.

One thing I'd add before calling this production-grade: an ExternalId on the
assume-role call. Without it you're exposed to the classic confused-deputy problem
across accounts. It's a one-parameter change here and a one-line condition in the
trust policy — cheap insurance.
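
A minimal sketch of that change, under some loud assumptions: the helper names `assume_role_kwargs` and `trust_policy` are hypothetical, and the ExternalId value is a placeholder that the scanner and the target account would agree on out of band. `ExternalId` is a real parameter of `sts.assume_role`, and `sts:ExternalId` is the condition key a trust policy checks.

```python
def assume_role_kwargs(role_arn, external_id=None, session_name="syncvey-scan"):
    """Build keyword arguments for sts.assume_role (hypothetical helper)."""
    kwargs = {"RoleArn": role_arn, "RoleSessionName": session_name}
    if external_id:
        # Only sent when configured, so setups without an ExternalId keep working.
        kwargs["ExternalId"] = external_id
    return kwargs


def trust_policy(scanner_principal_arn, external_id):
    """Trust policy for the role in the scanned account (illustrative only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": scanner_principal_arn},
            "Action": "sts:AssumeRole",
            # The one-line condition: callers must present the agreed ExternalId.
            "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
        }],
    }
```

Inside `get_session`, the call would then read `sts.assume_role(**assume_role_kwargs(role_arn, external_id))`, with `external_id` stored next to each system's `role_arn`.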

The actual hard part: two sources, two shapes

Here's the trap. You now have two descriptions of the same infrastructure:

  1. What Terraform thinks exists — imported from tfstate, in Terraform's attribute vocabulary (id, instance_type, tags as a flat map).
  2. What AWS says exists — straight from the boto3 API, in AWS's vocabulary (InstanceId, InstanceType, Tags as a list of {Key, Value} dicts).

If you try to diff those directly, everything looks like drift, because the
field names and value shapes don't line up. The boto3 scan is useless until you
normalize each AWS response into the same Terraform-compatible shape the rest
of the system already speaks.
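
To make the false drift concrete, here is a small sketch with made-up sample values. `naive_diff` is a hypothetical helper written for this illustration, not part of the tool; it reports every key whose value differs between two records.

```python
def naive_diff(a, b):
    """Return the keys whose values differ, counting missing keys as differences."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Terraform's view of one instance (attribute names as stored in tfstate).
from_state = {"id": "i-0abc", "instance_type": "t3.micro", "tags": {"env": "prod"}}

# The same instance as the EC2 API describes it.
from_api = {
    "InstanceId": "i-0abc",
    "InstanceType": "t3.micro",
    "Tags": [{"Key": "env", "Value": "prod"}],
}

# Rename the fields and flatten the tags into Terraform's shape.
normalized = {
    "id": from_api["InstanceId"],
    "instance_type": from_api["InstanceType"],
    "tags": {t["Key"]: t["Value"] for t in from_api["Tags"]},
}

print(naive_diff(from_state, from_api))    # every key on both sides is reported
print(naive_diff(from_state, normalized))  # nothing is reported
```

Nothing about the instance changed between the two comparisons. Only the vocabulary did.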

So every scanner is really a translator. EC2, for example:

def scan_ec2(session):
    ec2 = session.client('ec2')
    results = []
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate():
        for reservation in page['Reservations']:
            for i in reservation['Instances']:
                results.append({
                    'id':            i['InstanceId'],          # InstanceId → id
                    'instance_type': i.get('InstanceType', ''),
                    'vpc_id':        i.get('VpcId', ''),
                    'private_ip':    i.get('PrivateIpAddress', ''),
                    'tags':          _tags(i.get('Tags')),     # list → map
                    '_resource_type': 'aws_instance',
                    '_scan_source':   'boto3',
                })
    return results

The tag flattening is the smallest, most telling example of the mismatch:

def _tags(raw):
    """[{'Key': k, 'Value': v}]  →  {k: v}"""
    if not raw:
        return {}
    return {t['Key']: t['Value'] for t in raw}

AWS gives you [{'Key': 'env', 'Value': 'prod'}]. Terraform stores
{'env': 'prod'}. Same information, different shape, and a naive comparison flags it as a
difference on every single resource. Multiply that across every field of every
resource type and you see why the normalization layer — not the API calls — is
where the real work lives. I also stamp each record with _scan_source: 'boto3'
so I always know whether a given asset came from a state import or a live scan.
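
The same translator pattern applied to a second resource type might look like the sketch below. It is an illustration rather than the tool's actual scanner: `normalize_security_group` and `scan_security_groups` are hypothetical names, and the choice of output fields assumes the attribute names Terraform uses for `aws_security_group` (`id`, `name`, `description`, `vpc_id`, `tags`).

```python
def _tags(raw):
    """[{'Key': k, 'Value': v}] -> {k: v}"""
    return {t["Key"]: t["Value"] for t in raw or []}


def normalize_security_group(sg):
    """Translate one describe_security_groups entry into Terraform-style keys."""
    return {
        "id":          sg["GroupId"],              # GroupId -> id
        "name":        sg.get("GroupName", ""),    # GroupName -> name
        "description": sg.get("Description", ""),
        "vpc_id":      sg.get("VpcId", ""),
        "tags":        _tags(sg.get("Tags")),      # list -> map
        "_resource_type": "aws_security_group",
        "_scan_source":   "boto3",
    }


def scan_security_groups(session):
    """Page through every security group visible to the session and normalize each."""
    ec2 = session.client("ec2")
    paginator = ec2.get_paginator("describe_security_groups")
    results = []
    for page in paginator.paginate():
        results.extend(normalize_security_group(sg) for sg in page["SecurityGroups"])
    return results
```

Splitting the per-record translation out of the pagination loop keeps the part that is easy to get wrong testable without AWS credentials.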

Reconcile, don't overwrite

Once both sides speak the same shape, reconciliation is an upsert keyed on the
cloud resource id:

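from django.utils import timezone

asset, created = Asset.objects.get_or_create(
    cloud_id=cloud_id,
    defaults={...},
)
if not created:
    asset.raw_data_prev = asset.raw_data   # keep the previous snapshot
    asset.raw_data      = attrs
    # Set the timestamp explicitly (assumes a DateTimeField; redundant if it is
    # auto_now). Otherwise listing it in update_fields writes back the old value.
    asset.last_imported_at = timezone.now()
    asset.save(update_fields=['raw_data', 'raw_data_prev', 'last_imported_at'])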

Keeping raw_data_prev means a scan doesn't just record the current state — it
preserves the one before it, so the diff between scans is a first-class thing. And
a resource that shows up in the scan but never came from any tfstate import is
exactly the unmanaged resource terraform plan could never have told you about.
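
Both checks can be sketched with plain dicts standing in for the Asset rows. The helpers `find_unmanaged` and `changed_keys` are hypothetical, and skipping the `_`-prefixed bookkeeping keys when diffing snapshots is an assumption of this sketch, not something the tool is known to do.

```python
def find_unmanaged(scanned, state_ids):
    """Return scanned records whose id never appeared in any tfstate import."""
    known = set(state_ids)
    return [r for r in scanned if r["id"] not in known]


def changed_keys(prev, curr):
    """Keys whose values differ between two snapshots of the same asset."""
    return sorted(
        k for k in prev.keys() | curr.keys()
        if not k.startswith("_") and prev.get(k) != curr.get(k)
    )


# Made-up sample data: one instance known to tfstate, one group made in the console.
scanned = [
    {"id": "i-0abc", "instance_type": "t3.micro"},
    {"id": "sg-0def", "name": "console-made"},
]
unmanaged = find_unmanaged(scanned, state_ids=["i-0abc"])
print([r["id"] for r in unmanaged])
```

With this sample data only the console-made security group is reported, because its id is absent from the set of ids imported from tfstate.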

Takeaways

  • terraform plan only sees what's in state. To catch resources created outside Terraform you have to query the cloud provider directly.
  • Use STS AssumeRole for multi-account scanning: one identity, short-lived per-account credentials, an identifiable RoleSessionName, and a no-role fallback to ambient creds. Add an ExternalId before you call it done.
  • The hard part isn't the API calls; it's normalizing AWS responses into the same attribute shape as your tfstate data (InstanceId→id, Tags list→map), or every comparison drowns in false drift.

This is the live-scan half of a self-hosted tool that reconciles tfstate against
what's actually running in AWS to surface drift — including resources Terraform
never knew about. Open source (MIT), one docker compose up:
syncvey.com.
How do you hunt down unmanaged resources today — terraformer, AWS Config,
Driftctl, or a script you'd rather not admit to?
