The IAM Trust Policy Chicken-and-Egg (That Isn't)

#aws #iam #terraform #security

Originally published on graycloudarch.com.

The pipeline role needed to trust the deployment role. The deployment role needed to trust the pipeline role. When I wrote both in Terraform and ran plan, it stopped:

Error: Cycle: module.pipeline.aws_iam_role.exec → module.deploy.aws_iam_role.target → module.pipeline.aws_iam_role.exec

The instinct is to create one role first, then go back and edit the trust policy of the other after it exists. A manual bootstrap step. It works. It also means you can't terraform apply from a clean state and get a working result — someone has to remember the second pass. The IaC tells half the story.

There's a better answer. The cycle Terraform sees is real: it's a real edge in its dependency graph. The underlying constraint that dependency represents is not. The trust can be written so that IAM stores the other role's ARN as a plain string, without checking that the role exists yet.

ARNs are deterministic before creation

For a role with no path, in the standard aws partition, the ARN follows a fixed format:

arn:aws:iam::<account-id>:role/<role-name>

The account ID is fixed. The role name is chosen at definition time. Which means the full ARN is computable before terraform apply runs — before the resource exists — as long as the name is stable.
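As a minimal sketch of that computation, assuming a role with no path: the partition is read from a data source rather than hard-coded, so the same expression would hold outside the standard aws partition. The variable name role_name and the output name are placeholders for illustration, not part of the original setup.

```hcl
# Assumption: role_name is a hypothetical input, and the role has no IAM path.
variable "role_name" {
  type = string
}

data "aws_caller_identity" "current" {}
data "aws_partition" "current" {}

locals {
  # Nothing here refers to an aws_iam_role resource, so no graph edge is created.
  role_arn = "arn:${data.aws_partition.current.partition}:iam::${data.aws_caller_identity.current.account_id}:role/${var.role_name}"
}

output "computed_role_arn" {
  value = local.role_arn
}
```

A role created with a path would need that path between role/ and the name.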

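What IAM validates depends on where the ARN sits. A role ARN placed directly in the Principal element of a trust policy is resolved when the policy is saved: IAM swaps it for the role's unique ID, and if the role does not exist, the call fails with MalformedPolicyDocument: Invalid principal in policy. An ARN inside a Condition block, such as a value for aws:PrincipalArn, is not resolved. IAM stores the string and compares it at request time. So the form of trust that is independent of creation order is the account root as the Principal, plus a condition naming the role's ARN. The role becomes assumable once both sides exist, regardless of which one was created first.

This is the same distinction as referencing a nonexistent IAM role in an aws_iam_role_policy_attachment: that fails at apply time because Terraform calls the API and gets an error. A condition value is just a string stored against the role. If the ARN doesn't correspond to an existing entity yet, IAM doesn't complain. It just doesn't match anything. Yet.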

The cycle Terraform sees

The dependency graph problem is real. Here's the code that creates it:

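resource "aws_iam_role" "role_a" {
  name = var.role_a_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = aws_iam_role.role_b.arn } # depends on role_b
    }]
  })
}

resource "aws_iam_role" "role_b" {
  name = var.role_b_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = aws_iam_role.role_a.arn } # depends on role_a
    }]
  })
}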

Terraform resolves: role_a needs role_b's ARN before creation → role_b needs role_a's ARN before creation → cycle. It stops before creating either resource.

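The fix removes the dependency by computing what you already know, and by putting the computed ARN in a condition, where IAM stores it without resolving it:

data "aws_caller_identity" "current" {}

locals {
  account_id = data.aws_caller_identity.current.account_id

  role_a_arn = "arn:aws:iam::${local.account_id}:role/${var.role_a_name}"
  role_b_arn = "arn:aws:iam::${local.account_id}:role/${var.role_b_name}"
  root_arn   = "arn:aws:iam::${local.account_id}:root"
}

resource "aws_iam_role" "role_a" {
  name = var.role_a_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = local.root_arn }
      # string, no Terraform dependency, not resolved by IAM at save time
      Condition = { ArnEquals = { "aws:PrincipalArn" = local.role_b_arn } }
    }]
  })
}

resource "aws_iam_role" "role_b" {
  name = var.role_b_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = local.root_arn }
      # string, no Terraform dependency, not resolved by IAM at save time
      Condition = { ArnEquals = { "aws:PrincipalArn" = local.role_a_arn } }
    }]
  })
}

No cycle. Both roles are created in a single apply. The trust relationship is live as soon as both resources exist, which they will after the same apply. One consequence of trusting the account rather than a named role: each caller also needs an identity policy allowing sts:AssumeRole on the other role's ARN, which is the same computed string.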
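The trust policy is only one side of an assume-role call. Cross-account callers, and same-account callers whose target trusts the account rather than a specific role, also need identity-based permission for sts:AssumeRole. Here is a sketch of that side, reusing local.role_a_arn, local.role_b_arn and the two role resources from the block above. The resource labels and policy names are placeholders. The Resource field of an identity policy is stored as a string, so this does not bring the cycle back.

```hcl
# Lets role_a call sts:AssumeRole on role_b. Resource is the computed string,
# so this policy depends on role_a only.
resource "aws_iam_role_policy" "a_may_assume_b" {
  name = "assume-role-b" # hypothetical name
  role = aws_iam_role.role_a.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sts:AssumeRole"
      Resource = local.role_b_arn
    }]
  })
}

# The mirror image: depends on role_b only.
resource "aws_iam_role_policy" "b_may_assume_a" {
  name = "assume-role-a" # hypothetical name
  role = aws_iam_role.role_b.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "sts:AssumeRole"
      Resource = local.role_a_arn
    }]
  })
}
```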

Where this pattern appears in practice

Cross-account deployment pipeline

CodePipeline execution role in account A assumes a deployment role in account B. The deployment role's trust policy needs to reference the pipeline role's ARN. Each Terraform root manages its own account's roles. The ARN construction pattern resolves the cross-account dependency: each module constructs the other account's role ARN from var.pipeline_account_id and a known role name — values passed in at plan time from tfvars or remote state outputs.
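A sketch of the account B side, assuming the pipeline role's name is passed in alongside var.pipeline_account_id. The variable pipeline_role_name and the role name deploy-target are made up for illustration. The trust policy names account A's root and narrows to the pipeline role with a condition, so it can be applied whether or not the pipeline role exists yet.

```hcl
# Lives in the account B root. pipeline_role_name is an assumed input.
variable "pipeline_account_id" {
  type = string
}

variable "pipeline_role_name" {
  type = string
}

locals {
  pipeline_role_arn = "arn:aws:iam::${var.pipeline_account_id}:role/${var.pipeline_role_name}"
}

resource "aws_iam_role" "deploy" {
  name = "deploy-target" # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = "sts:AssumeRole"
      # Trust account A, then narrow to the one role by ARN string.
      Principal = { AWS = "arn:aws:iam::${var.pipeline_account_id}:root" }
      Condition = { ArnEquals = { "aws:PrincipalArn" = local.pipeline_role_arn } }
    }]
  })
}
```

The pipeline role in account A still needs its own identity policy allowing sts:AssumeRole on the deployment role's ARN, built the same way from account B's ID and the known role name.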

ECS task role and execution role

The principal that registers or runs the task, not the execution role itself, needs iam:PassRole to hand the task role and the execution role to ECS at launch. Some teams want the task role's trust policy to explicitly list a role ARN as an allowed principal. You don't need to: ecs-tasks.amazonaws.com as the service principal is what ECS uses to assume the role, and it removes the dependency entirely. But if your security posture requires explicit principal ARNs alongside the service principal, ARN construction handles it without a two-pass apply.
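A sketch of the service-principal form, where neither role refers to the other and no cycle can form. The role names are placeholders. The managed policy ARN is the AWS-provided execution role policy.

```hcl
locals {
  # Both roles trust the ECS tasks service, not each other.
  ecs_tasks_trust = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ecs-tasks.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role" "task" {
  name               = "app-task" # hypothetical name
  assume_role_policy = local.ecs_tasks_trust
}

resource "aws_iam_role" "execution" {
  name               = "app-task-execution" # hypothetical name
  assume_role_policy = local.ecs_tasks_trust
}

# Image pull and log permissions for the execution role.
resource "aws_iam_role_policy_attachment" "execution" {
  role       = aws_iam_role.execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```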

Permission boundary bootstrap with an SCP

An SCP requires that all new IAM roles include a specific permission boundary policy. The boundary is a managed policy that must exist before any roles referencing it can be created. This isn't a circular dependency — it's a sequential one. The boundary policy must be applied first, separately. Construct its ARN deterministically (arn:aws:iam::${var.account_id}:policy/${var.boundary_name}) and pass it in wherever roles are created. Document the bootstrap order with a Terraform precondition block or a clear README section. Different problem, different fix.
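One way to make the bootstrap order visible in code, sketched under the assumption that the boundary policy is managed in a separate root: look the policy up by its constructed ARN with a data source, so the plan fails early if the boundary has not been applied yet. The role name and the EC2 service principal are placeholders.

```hcl
variable "account_id" {
  type = string
}

variable "boundary_name" {
  type = string
}

locals {
  boundary_arn = "arn:aws:iam::${var.account_id}:policy/${var.boundary_name}"
}

# Fails if the boundary policy has not been bootstrapped in this account.
data "aws_iam_policy" "boundary" {
  arn = local.boundary_arn
}

resource "aws_iam_role" "app" {
  name                 = "app-role" # hypothetical name
  permissions_boundary = data.aws_iam_policy.boundary.arn
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" } # placeholder principal
    }]
  })
}
```

The lookup is a design choice for this sketch: it trades one extra read at plan time for an error that points at the missing bootstrap step instead of a denied CreateRole call.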

When the dependency is genuine

There's a scenario that looks identical to this but isn't: when a Terraform provisioner or data source needs to actually call a role — not just reference its ARN — during resource creation.

Example: a null_resource provisioner that runs aws sts assume-role and then operates in the target account. Here you need the role to exist and be assumable before the provisioner fires. ARN construction doesn't help — you need the resource active at execution time, not just its string value known at plan time. The correct fix is explicit depends_on, not local string construction.

The distinction: static JSON referencing an ARN string (solvable with ARN construction) vs. a runtime API call that needs the resource actually live (solvable with depends_on). If your code needs to assume the role during apply, you need ordering. If it just needs to name the role in a policy document, you don't.
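A sketch of the runtime case, assuming a null_resource with a local-exec provisioner and the AWS CLI on the machine running Terraform. The role name and session name are placeholders. The command uses a computed ARN string, so Terraform sees no reference to the role. depends_on is what supplies the ordering.

```hcl
data "aws_caller_identity" "current" {}

locals {
  account_id  = data.aws_caller_identity.current.account_id
  target_name = "deploy-target" # hypothetical name
  target_arn  = "arn:aws:iam::${local.account_id}:role/${local.target_name}"
}

resource "aws_iam_role" "target" {
  name = local.target_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { AWS = "arn:aws:iam::${local.account_id}:root" }
    }]
  })
}

resource "null_resource" "post_deploy" {
  # Without this, nothing forces the role to exist before the command runs.
  depends_on = [aws_iam_role.target]

  provisioner "local-exec" {
    command = "aws sts assume-role --role-arn ${local.target_arn} --role-session-name bootstrap"
  }
}
```

Even with the ordering in place, a freshly created role may take a few seconds to become assumable, so the command may need a retry.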

The trap in the fix

Once you've internalized "construct ARNs deterministically," the next failure mode is role names that include Terraform-generated suffixes.

resource "aws_iam_role" "role_a" {
  name = "${var.prefix}-role-${random_id.suffix.hex}"  # ARN not deterministic until random_id exists
  ...
}

If the role name includes random_id.suffix.hex, the ARN can't be computed until the random_id resource is created. That brings the dependency back — you're back to needing a resource output to construct the name, and the cycle re-forms if any of those names are referenced in another role's trust policy.

The fix is stable, predictable role names: "${var.prefix}-${var.env}-pipeline" rather than generated suffixes. IAM role names are unique per account, not globally. The habit of appending random suffixes comes from S3 bucket naming, where global uniqueness is required. IAM doesn't have that constraint. There's no reason to make the name unpredictable.
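A sketch of the stable-name approach, with a guard for the 64-character limit on IAM role names. The variables prefix and env follow the naming in the text. The CodePipeline service principal is a placeholder for whatever the role actually trusts.

```hcl
variable "prefix" {
  type = string
}

variable "env" {
  type = string
}

locals {
  # Built from inputs only, so the ARN derived from it is computable up front.
  pipeline_role_name = "${var.prefix}-${var.env}-pipeline"
}

resource "aws_iam_role" "pipeline" {
  name = local.pipeline_role_name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "codepipeline.amazonaws.com" } # placeholder principal
    }]
  })

  lifecycle {
    precondition {
      condition     = length(local.pipeline_role_name) <= 64
      error_message = "IAM role names are limited to 64 characters."
    }
  }
}
```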

If you have existing roles with generated names and need their ARNs, they're deterministic after the first apply — stored in state and readable via aws_iam_role.role_a.arn. The construction approach is for cases where you control the naming and are defining the role name yourself.

What generalizes

The IAM trust policy deadlock is the most common place engineers hit this pattern, but it's not the only one. Wherever you encounter a Terraform circular dependency involving a predictable string — ARNs, resource names, account IDs, region names — ask whether you actually need the resource output or whether you can compute the value from what you already know.

data.aws_caller_identity.current.account_id gives you the account without creating a dependency on any resource. A stable name gives you the ARN. The dependency graph edge exists only because you referenced the resource — remove the reference by computing the value directly, and the cycle disappears.

The broader principle: Terraform's graph is built from references. References that aren't necessary are constraints that aren't necessary.

Untangling IAM architecture across multiple accounts — trust policies, permission boundaries, SCPs, cross-account assume-role chains — is where subtle errors compound quietly and the blast radius is real. I work on this regularly.