DEV Community

Muhammad Hassaan Javed for Infraforge

Posted on • Originally published at infraforge.agency

Why terraform apply fails when plan passes: the map(any) trap

The on-call engineer pinged me at 4:42pm on a Friday, with the release window open until 5:30. terraform apply against the staging workspace had failed with Error: Unsupported argument deep inside a child module nobody on the team had touched in seven months. terraform plan against the same workspace ran clean. They had already re-run plan twice and gotten fresh no-op output both times. The shape of the failure was off. plan and apply rarely diverge in the way they were describing; when they do, it is usually on data sources that resolve at apply time, not on a static merge() call inside a module whose code had not changed in months.

Problem signals:

  • terraform plan succeeds locally but terraform apply fails on a specific environment
  • The error is Error: Unsupported argument or Inappropriate value deep inside a child module
  • The error's source location points at a merge() or lookup() call inside a module that has not been edited in months
  • Your root module input list has crossed 20 variables and several are typed any or map(any)
  • There is no CI job that runs terraform plan against every environment on every PR

Three hypotheses, three dead ends, twenty-two minutes left in the release window

What we ruled out in the first 18 minutes

The first thing the on-call lead suggested was state drift. Someone, somewhere, had terraform import-ed a resource by hand. We checked the audit log. No import events in the past 30 days. We checked the lock table in DynamoDB. The lock had been released cleanly by the previous successful apply at 2:11pm.

The second hypothesis was provider version drift. The team had recently bumped hashicorp/aws from 5.62 to 5.71 in versions.tf. A breaking change in a resource schema can absolutely cause an Unsupported argument error if apply pulls a newer provider than plan resolved against. We pinned both runs to 5.71 explicitly, deleted .terraform/, re-ran init, then plan, then apply. Same error, same module, same line.

The third hypothesis was a stale workspace. The selected workspace can silently differ from the intended one when an engineer exports TF_WORKSPACE, bypassing terraform workspace select, and then forgets to unset it. We ran terraform workspace show and verified it matched the intended target. The plan output even confirmed the right resource addresses.

Three explanations, three dead ends, twenty-eight minutes burned. The release window was now twenty-two minutes wide and shrinking. The on-call lead asked whether we should just roll back the deploy and figure it out Monday. I asked one more question first.

The 15th map(any) input that had been silently incubating for three weeks

Where the collision actually lived

I asked the on-call lead to walk me through what had merged into the workspace in the past two weeks. There were six commits. Five were obvious changes (image tags, a new IAM policy, a security group port). The sixth was a feature flag, added as a 15th map(any) input on the root module by an engineer who had joined six weeks earlier.

That was the lead we needed.

The root module had 28 input variables. 14 of them were any-typed or map(any) to absorb per-environment overrides accumulated over six years of feature additions. The new feature flag added a 15th map(any) input named feature_overrides. Its values flowed through a merge() chain down to the database child module, which did its own merge(var.feature_overrides, local.legacy_db_flags) inside modules/services/database/locals.tf.

The two maps had a key collision. Both contained a key named read_replica_routing. The new input's value was a string. The legacy local's value was a map(object({ host = string, weight = number })). merge() resolves collisions by taking the last argument's value, but the argument order in this case depended on which input was non-empty at apply time, and the new feature flag was only non-empty in staging.
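A minimal sketch of the collision, with hypothetical values standing in for the real maps. Only the key name and the two value shapes come from the incident; the host, weight, inner key, and the "weighted" string are assumptions for illustration.

```hcl
locals {
  # Legacy flags: the value under read_replica_routing is a map of objects.
  legacy_db_flags = {
    read_replica_routing = {
      primary_replica = { host = "replica-1.internal", weight = 100 }
    }
  }

  # New input as staging supplied it: same key, but a plain string.
  feature_overrides = {
    read_replica_routing = "weighted"
  }

  # merge() keeps the value from the later argument when keys collide,
  # so the string replaces the map of objects here.
  merged_flags = merge(local.legacy_db_flags, local.feature_overrides)
}

output "read_replica_routing" {
  value = local.merged_flags.read_replica_routing
}
```

Nothing in this snippet is an error on its own. The failure only appears once a consumer of merged_flags expects the object shape and receives the string instead.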

sequenceDiagram
  participant Op as Operator
  participant Plan as terraform plan
  participant Apply as terraform apply
  participant Child as child module
  Op->>Plan: feature_overrides (map(any))
  Plan->>Child: merge(map(any), map(any))
  Child-->>Plan: any (type-check deferred)
  Plan-->>Op: 0 to add, 0 to change (PASS)
  Op->>Apply: same input
  Apply->>Child: merge resolved to concrete value
  Child-->>Apply: Error: Unsupported argument
  Apply-->>Op: FAIL at 4:42pm

How map(any) defers type-checking past plan and surfaces it at apply

Diagram renders at the canonical version.

The collision had been latent for three weeks. plan succeeded because Terraform's planner walked the call graph with both maps' element types collapsed to any. The merged value passed type-checking as any, which is compatible with anything. apply, which actually constructs the resource, evaluated the merged value against the receiving attribute's concrete type signature and discovered the value was a string where an object was required.

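That is the part that hurts. An any-typed input gives Terraform nothing to check the value against at the module boundary, so a shape mismatch surfaces only where the value is finally consumed, which in our case was apply. Every map(any) input on a root module is a future apply-time failure waiting on a contributor who does not know the implicit shape.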

Three options, one open release window, seven minutes to pick

What we did before running apply again

We had three options and one open release window. I walked the on-call lead through them on the bridge call.

  • 1. Delete the legacy key. Fastest, and also the riskiest: the legacy read_replica_routing key was referenced by three modules-of-modules three layers down. Deleting it would have moved the failure from staging to production an hour later.
  • 2. Rename the new key. Safe-feeling, but it left the underlying any-typed contract intact. Two months later a different contributor would add another map(any) input and we would be back on a Friday afternoon with the same shape of failure.
  • 3. Rename plus add validation. Slower. Rename the new key to feature_routing_overrides AND add a validation block on the input that explicitly rejects the colliding shape at plan time going forward. This stops the immediate recurrence.

Option three carried the day. The rename took seven minutes. The validation block took twelve. apply succeeded at 5:14pm with sixteen minutes to spare on the release window. The release shipped on time.
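The rename-plus-validation option can be sketched roughly as follows. This is not the team's actual block: the condition and the error message are assumptions, and only the variable name feature_routing_overrides and the colliding key read_replica_routing come from the incident.

```hcl
variable "feature_routing_overrides" {
  # Still loosely typed at this point; the rename and the guard were the
  # quick fix, not the full type tightening.
  type        = map(any)
  default     = {}
  description = "Per-environment feature flag overrides"

  validation {
    # Reject the key that collides with the legacy database flags, so the
    # mistake is reported while variables are validated during plan.
    condition = !contains(
      keys(var.feature_routing_overrides),
      "read_replica_routing",
    )
    error_message = "The key read_replica_routing is reserved by the legacy database flags. Use a different key."
  }
}
```

A guard like this only blocks the one known collision. It does nothing for the next unknown key, which is why the type tightening still had to follow.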

The audit work behind option one (the one we did NOT take) is what stuck with me. The next morning, we grepped the entire terraform/ tree for read_replica_routing to map every consumer. Seven references across four modules. Three in modules/services/database/locals.tf itself. One in modules/monitoring/cloudwatch.tf. One in modules/services/cache/lookups.tf, which read the value to construct its own routing decision and would have broken silently if we had deleted the legacy key the night before. The remaining two were in a state-recovery helper module the team had forgotten existed. We had nearly shot ourselves in the foot a second time.

We left a tombstone comment on the legacy key and an open PR that would, the following week, replace its map(any) type with a proper object({ ... }) schema. That work landed five days later. The downstream consumers caught the change at plan time, and three of them needed minor patches before the type tightening could merge. None of those patches was related to the original collision. Each one fixed a real, pre-existing bug the any type had been hiding.
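As a rough illustration of that tightening, here is what a typed declaration for the legacy flags might look like. The variable name legacy_db_flags and the decision to make the attribute optional are assumptions; the inner shape (host as string, weight as number) is the one described for read_replica_routing above.

```hcl
variable "legacy_db_flags" {
  # Hypothetical typed replacement for a map(any) declaration.
  type = object({
    # Each routing entry must carry a host and a weight; a bare string
    # under this key no longer converts and is rejected at plan time.
    read_replica_routing = optional(map(object({
      host   = string
      weight = number
    })), {})
  })
  default     = {}
  description = "Legacy database flags with an explicit schema"
}
```

With a schema like this, consumers that read read_replica_routing can rely on host and weight existing, and callers that pass a different shape find out during plan.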

Two policy changes and one structural fix

What we changed afterwards

Two policy changes came out of that night, and one structural fix took longer.

The first policy: no new map(any) or any-typed inputs on root modules. The team's terraform/ directory has a pre-commit hook (8 lines of grep) that fails the commit if any new variable block contains type = any or type = map(any). Existing instances are grandfathered, with a TODO list tracked against each module. Three of the original 14 have been converted to typed objects so far. The hook has fired four times in the six weeks since.

The second policy: every PR runs terraform plan against every environment, not just the one the contributor cares about. A matrix job in CI runs plan -var-file=envs/<env>.tfvars across all four environments and fails the PR if any of them errors. This would not have caught the original collision (plan succeeded everywhere), but it catches a different class of failure, where one environment's tfvars exercises a code path the other environments never reach.

# Before: latent any-typed input
variable "feature_overrides" {
  type        = map(any)
  default     = {}
  description = "Per-environment feature flag overrides"
}

# In modules/services/database/locals.tf
locals {
  merged_flags = merge(
    local.legacy_db_flags,
    var.feature_overrides,
  )
}

# Above passes plan even when the two maps have a key
# whose value types disagree. The mismatch surfaces only
# at apply, when the receiving attribute is evaluated.

# After: typed, explicit, errors at plan time
variable "feature_overrides" {
  type = map(object({
    enabled     = bool
    rollout_pct = optional(number, 0)
    routing     = optional(string, "default")
  }))
  default     = {}
  description = "Per-environment feature flag overrides"

  validation {
    condition = alltrue([
      for k, v in var.feature_overrides :
      v.rollout_pct >= 0 && v.rollout_pct <= 100
    ])
    error_message = "rollout_pct must be between 0 and 100."
  }
}

The same variable, before and after. The typed form fails at plan, not at apply, when a contributor passes the wrong shape.
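To make the difference concrete, here is a hypothetical tfvars file fed to the typed variable. The flag name new_checkout and its values are assumptions; the envs/ path follows the per-environment var-file layout used by the CI matrix job.

```hcl
# envs/staging.tfvars (hypothetical contents)
feature_overrides = {
  # Valid: enabled is set, rollout_pct passes the 0-100 validation,
  # and routing falls back to its declared default of "default".
  new_checkout = {
    enabled     = true
    rollout_pct = 25
  }

  # Invalid if uncommented: a string where an object is required, so
  # Terraform rejects the input while evaluating variables during plan.
  # read_replica_routing = "weighted"
}
```

Under the map(any) declaration, the commented-out line would not be flagged at the variable boundary for being a string, because no element shape is declared there. Under the typed declaration it cannot get past plan.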

The structural fix took longer. A 28-input root module is not a configuration problem; it is a service-boundary problem. The team running the database stack should own a database/ root module with four inputs, not a 14-input subtree of a shared 28-input root. We split the original root into three roots along ownership boundaries (network, services, observability), using a thin Terragrunt overlay for the cross-cutting variables. The split took six weeks of careful terraform state mv work to land without downtime. We have written more on the structural fix in the Terraform and IaC debt playbook, which covers when a shared root module starts costing more than the consistency it buys.

What we tell every team now: strong types in Terraform are not bureaucracy, they are the documentation. The half-day cost to write object({ name = string, enabled = bool, ... }) instead of map(any) buys you a plan-time failure instead of an apply-time failure, and apply-time failures land at 4:42pm on Fridays. We have stopped accepting map(any) inputs in any client engagement that involves an IaC audit, and we have not had a single contributor push back once they saw the cost.

If you are looking at a 28-input root with map(any) sprinkled through it

When your own root module is past 20 inputs

If you are reading this and your terraform/ directory has a root module past 20 inputs with several map(any) types in the input list, the failure you are heading toward is not a surprise. It is a scheduled event. The trigger will be a new contributor who does not know the implicit contract, plus one bad-enough Friday. The hardest part of cleaning it up is not the typing work itself; it is the audit of downstream consumers that have been silently depending on the loose contract for years. Two layers of modules-of-modules can hide a reference that breaks the moment you tighten the type, and your CI will not warn you because plan will keep passing right up to the apply that surfaces it.

We run these recovery and audit engagements every week. The map(any) collision pattern is the third-most-common shape we see in seed-to-Series-B SaaS Terraform repos, right after stale state lock holders and provider-version-drift cascades. It is one variant of the broader terraform apply fear problem we engage on most weeks. On a typical engagement we map every any-typed input in your root modules within the first day, prioritize them by blast radius, and either convert them in-place or split the root if the input count is the real problem. If you are looking at a Terraform root with map(any) sprinkled through it and a release window that does not forgive a 4pm apply failure, book an infrastructure review with our team and we will start with a 30-minute diagnostic call this week.


Originally published at https://infraforge.agency/insights/terraform-apply-fails-map-any-trap/.

If your team is dealing with similar infrastructure debt, we offer infrastructure reviews and recovery engagements; see /review.
