DEV Community

Cover image for Terraform for Data Infrastructure: Warehouse, Lakehouse, Catalogs & IAM as Code
Gowtham Potureddi
Gowtham Potureddi

Posted on

Terraform for Data Infrastructure: Warehouse, Lakehouse, Catalogs & IAM as Code

terraform for data engineering is the lingua franca of senior data platform work in 2026 — and the one tool every interview loop, on-call rotation, and audit conversation eventually circles back to, because warehouses, lakehouses, catalogs, and IAM are too important to provision by clicking around a console. Terraform turns a Snowflake database, a Databricks Unity Catalog, an AWS Glue table, a Lake Formation grant, and the IAM role that ties them all together into one reviewable, diffable, peer-reviewed pull request. The cost of click-ops is measured in the audit findings nobody can explain three months later; the value of iac data infrastructure is measured in the speed with which a new env can be stood up identically on a Tuesday afternoon.

This guide is the senior-DE playbook for using terraform to ship production data infrastructure: the HCL mental model and the resource graph, the provider matrix for terraform snowflake, terraform databricks, terraform aws glue, and BigQuery, the catalog and grant tree shape of Unity Catalog and Lake Formation, terraform iam policy discipline (and why you should never write password = "…" in HCL), terraform modules composition vs Terragrunt DRY composition, the terraform state backend story with S3 + DynamoDB locking, drift detection in CI, and the terragrunt vs raw Terraform vs opentofu decision senior platform engineers are making in 2026. Every section ships a teaching block plus a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for Terraform for Data Infrastructure — bold white headline 'Terraform for Data Infra' over a stylised data-stack-as-blueprint composition with a warehouse cube, a lakehouse slab, a catalog ribbon, and an IAM shield, all wired by glowing HCL braces on a dark gradient with purple, green, orange, and blue accents and a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the ETL practice library → for the infra-provisioning surface, rehearse on medium-difficulty ETL problems → to lock in pipeline-grade Terraform muscle, and stack the optimization library → for the cost/perf trade-offs that show up in every IaC review.


On this page


1. Why Terraform is the lingua franca of data infra in 2026

Click-ops dies in audit — terraform is the only honest record of how your warehouse, lakehouse, and IAM actually got built

The one-sentence invariant: Terraform is the only tool that turns a Snowflake database, a Databricks Unity Catalog, an AWS Glue table, an IAM role, and the grants between them into a single peer-reviewable HCL graph with a deterministic plan, an atomic apply, and a state file that is the source of truth. Once you internalise "the state file is the contract; the cloud is the projection," every other terraform for data engineering interview probe — drift, modules, secrets, blast radius — collapses to a deduction from that one structural property.

Click-ops vs IaC — the audit and drift story interviewers always ask.

  • Click-ops is the world where someone opens the Snowflake UI, creates a database, grants a role, and Slacks the team. Three months later nobody can explain why the analytics database has read-write on raw_pii and the audit gets escalated. There is no diff, no review, no rollback.
  • IaC with Terraform is the world where the same change is a 6-line PR. The plan output shows exactly which resources change, the reviewer comments on the grant, and the apply log is the audit trail. The state file is the single source of truth; if the cloud diverges, terraform plan flags it as drift.
  • The interview punchline. A senior engineer is expected to answer "how do you audit cloud data infra changes?" with "every change is a Terraform PR; we run terraform plan in CI on every merge; drift is alerted on a nightly plan." Anything less is a junior signal.

Why Terraform won over CloudFormation, Pulumi, and CDK for data infra.

  • Provider breadth. Terraform ships native providers for Snowflake, Databricks, BigQuery, AWS, Azure, GCP, Confluent, MongoDB Atlas, Datadog, PagerDuty — the full data-platform stack. CloudFormation is AWS-only; Pulumi covers most of it but with a smaller community; CDK is AWS-centric and emits CloudFormation under the hood.
  • HCL-as-contract. HCL is intentionally restrictive — no loops in the imperative sense, no side effects, declarative blocks only. That restriction is the feature: a reviewer can read an HCL file and know exactly what infrastructure it produces. Pulumi's Python / TypeScript / Go lets you write if random.random() < 0.5: ... in your infra code; HCL forbids it.
  • The plan/apply ritual. Every change runs terraform plan first, producing a deterministic diff. The diff is reviewable; the apply is atomic. CloudFormation has change sets but the UX is heavier; CDK's cdk diff is plan-equivalent but tied to AWS.
  • Community and module ecosystem. The public Terraform Registry has 10,000+ modules; the Snowflake-Labs / databricks / hashicorp organisations alone ship hundreds of battle-tested patterns. Pulumi has fewer modules and a heavier dependency-graph problem; CDK Constructs are AWS-only.

The four "must-answer" axes for data infra Terraform.

  • Providers. Which cloud APIs Terraform talks to. For data infra: snowflake, databricks, aws (Glue, S3, IAM, Lake Formation), google (BigQuery, GCS), azurerm (Synapse, ADLS), and an auth + secrets provider (Vault, AWS Secrets Manager).
  • State. The terraform.tfstate file is the source of truth. Where it lives, how it is locked, how it is versioned, who can read it (secrets in plain text!) — these are the questions every senior interview probes.
  • Modules. Reusable units of HCL with inputs and outputs. Module design is where junior vs senior Terraform separates: senior engineers write modules with explicit variable validation, output contracts, and Git-tag versioning.
  • IAM. The grant tree across Snowflake roles, Databricks Unity Catalog, AWS IAM, and Lake Formation. The blast radius of an IAM change is the question that defines whether your Terraform deployment is safe or a Friday-afternoon outage waiting to happen.

Terraform vs OpenTofu in 2026 — the Linux Foundation fork landed in 2024.

  • HashiCorp re-licensed Terraform under the Business Source License (BSL) in August 2023, restricting commercial competitors. The community responded by forking the last MPL-licensed Terraform into OpenTofu, donated to the Linux Foundation.
  • OpenTofu has been at near-feature-parity with Terraform since 1.6 and ships its own roadmap (state encryption at-rest, early variable evaluation in import blocks, dynamic provider iteration).
  • The 2026 pragmatic answer: if you are an enterprise with HCP Terraform / Terraform Cloud paid features (Sentinel, run tasks, agent pools, drift detection-as-a-service), stay on HashiCorp. If you are open-source-first or want to dodge the BSL clause, OpenTofu is the drop-in replacement. The HCL syntax, the state format, and the provider protocol are identical.
  • For interview purposes, name both and explain the licence trade-off — that is the senior signal.

What interviewers actually probe.

  • Do you describe the state file as "the contract; the cloud is the projection"? — senior signal.
  • Do you instinctively reach for remote state with locking instead of a local file? — required answer.
  • Do you mention module versioning via Git tags unprompted? — senior signal.
  • Do you push back on "just use the AWS console for a one-off" with "then it's not in state and the next plan will try to delete it"? — senior signal.
  • Do you know the difference between terraform import, terraform state mv, and terraform state rm? — required for any infra-on-call interview.

Worked example — the same Snowflake database, click-ops vs Terraform

Detailed explanation. A junior engineer is asked to "create a Snowflake database called analytics with three schemas (raw, staging, marts) and grant analyst_role read access to marts." In click-ops, this is six clicks in the Snowflake UI. In Terraform, it is one HCL file. The diff between the two is where senior data engineers earn their salary.

Question. Express the click-ops sequence as Terraform HCL and explain why the HCL version is the only acceptable answer in a production-grade environment.

Input.

Step Click-ops action Terraform resource
1 Create database analytics snowflake_database.analytics
2 Create schema raw snowflake_schema.raw
3 Create schema staging snowflake_schema.staging
4 Create schema marts snowflake_schema.marts
5 Create role analyst_role snowflake_role.analyst
6 Grant USAGE + SELECT on marts to analyst_role snowflake_grant_privileges_to_role.analyst_marts_read

Code.

# providers.tf
terraform {
  required_version = ">= 1.6"
  required_providers {
    snowflake = {
      source  = "Snowflake-Labs/snowflake"
      version = "~> 0.95"
    }
  }
}

provider "snowflake" {
  account  = var.snowflake_account
  username = var.snowflake_admin_username
  # password sourced from a secret store, never inline
  password = data.aws_secretsmanager_secret_version.snowflake_admin.secret_string
  role     = "SYSADMIN"
}

# main.tf
resource "snowflake_database" "analytics" {
  name    = "ANALYTICS"
  comment = "Curated analytics database — managed by Terraform"
}

resource "snowflake_schema" "raw" {
  database = snowflake_database.analytics.name
  name     = "RAW"
}

resource "snowflake_schema" "staging" {
  database = snowflake_database.analytics.name
  name     = "STAGING"
}

resource "snowflake_schema" "marts" {
  database = snowflake_database.analytics.name
  name     = "MARTS"
}

resource "snowflake_role" "analyst" {
  name    = "ANALYST_ROLE"
  comment = "Read access to curated marts"
}

resource "snowflake_grant_privileges_to_role" "analyst_marts_read" {
  role_name  = snowflake_role.analyst.name
  privileges = ["USAGE"]

  on_schema {
    schema_name = "${snowflake_database.analytics.name}.${snowflake_schema.marts.name}"
  }
}

resource "snowflake_grant_privileges_to_role" "analyst_marts_select" {
  role_name  = snowflake_role.analyst.name
  privileges = ["SELECT"]

  on_schema_object {
    all {
      object_type_plural = "TABLES"
      in_schema          = "${snowflake_database.analytics.name}.${snowflake_schema.marts.name}"
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The terraform block pins the required Terraform version and the Snowflake provider to ~> 0.95 (a known-good range). This is the lock against silent provider drift.
  2. The provider "snowflake" block authenticates against the account; the password is sourced from AWS Secrets Manager (never inline) — data.aws_secretsmanager_secret_version.snowflake_admin.secret_string.
  3. Each resource block describes the desired state. Terraform builds a DAG from the resource references (snowflake_schema.raw.database = snowflake_database.analytics.name is a dependency edge).
  4. On terraform apply, the engine traverses the DAG, creating the database first, then the three schemas in parallel, then the role, then the grants. The state file records every resource id.
  5. If anyone later edits a schema name in the Snowflake UI, the next terraform plan flags it as drift and proposes to revert. That is the audit trail click-ops cannot give you.

Output (the audit value).

Concern Click-ops Terraform
Diff in PR review none full HCL diff
Reproducibility on a new env manual sequence terraform apply
Drift detection none nightly terraform plan
Atomic rollback manual undo git revert + apply
Audit trail UI logs only Git history + state versions

Rule of thumb. Every change to cloud data infra that touches a name, a grant, a warehouse size, or a network policy belongs in HCL. The 5-minute "I'll just click it in the UI" change is the one that explodes in audit three months later.

Worked example — the plan/apply ritual on a Glue catalog change

Detailed explanation. Terraform's core loop is init → plan → apply. The plan is the deterministic diff: it reads state, refreshes the cloud, and computes what would change. Senior engineers read every plan output before approving it. Junior engineers click apply and pray. The ritual is the muscle memory that makes Terraform safe.

Question. Walk through the plan/apply ritual for adding a new aws_glue_catalog_table to a lakehouse, including what the plan output looks like and what a reviewer should look for.

Input.

Stage What runs What is produced
terraform init provider download + backend init .terraform/ cache + lock file
terraform plan -out=plan.tfplan refresh + diff plan.tfplan (binary, signed)
terraform show plan.tfplan human-readable diff text output to review
terraform apply plan.tfplan atomic apply state file updated

Code.

# lakehouse.tf — adding a new Iceberg-format Glue table
resource "aws_glue_catalog_table" "orders_iceberg" {
  name          = "orders"
  database_name = aws_glue_catalog_database.lake.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "format-version"        = "2"
    "table_type"            = "ICEBERG"
    "metadata_location"     = "s3://${aws_s3_bucket.lake.bucket}/orders/metadata/00000-init.metadata.json"
    "write.format.default"  = "parquet"
  }

  storage_descriptor {
    location      = "s3://${aws_s3_bucket.lake.bucket}/orders/"
    input_format  = "org.apache.iceberg.mr.hive.HiveIcebergInputFormat"
    output_format = "org.apache.iceberg.mr.hive.HiveIcebergOutputFormat"

    ser_de_info {
      serialization_library = "org.apache.iceberg.mr.hive.HiveIcebergSerDe"
    }

    columns {
      name = "order_id"
      type = "bigint"
    }
    columns {
      name = "customer_id"
      type = "bigint"
    }
    columns {
      name = "amount"
      type = "decimal(18,2)"
    }
    columns {
      name = "order_ts"
      type = "timestamp"
    }
  }
}
Enter fullscreen mode Exit fullscreen mode
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.40"...
- Installing hashicorp/aws v5.42.0...

$ terraform plan -out=plan.tfplan
Terraform will perform the following actions:

  # aws_glue_catalog_table.orders_iceberg will be created
  + resource "aws_glue_catalog_table" "orders_iceberg" {
      + name          = "orders"
      + database_name = "lake"
      + table_type    = "EXTERNAL_TABLE"
      + parameters    = {
          + "format-version"    = "2"
          + "table_type"        = "ICEBERG"
          ...
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

$ terraform apply plan.tfplan
aws_glue_catalog_table.orders_iceberg: Creating...
aws_glue_catalog_table.orders_iceberg: Creation complete after 2s
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. terraform init downloads the AWS provider and initialises the backend (where the state file lives). The .terraform.lock.hcl file pins provider versions for everyone on the team.
  2. terraform plan -out=plan.tfplan produces a signed binary plan. It contains the exact set of API calls Terraform will make if you apply. Saving it to a file lets CI archive it for audit.
  3. terraform show plan.tfplan (or letting CI render the diff) displays the human-readable diff. A reviewer looks for: unexpected destroys, unexpected updates-in-place that secretly recreate (the ~/- lines), and any resource not mentioned in the PR description.
  4. terraform apply plan.tfplan consumes the saved plan and executes it atomically. Because the plan was already computed, no surprise drift is introduced between plan and apply.
  5. The state file (terraform.tfstate) is updated; the new resource's id is recorded. The next terraform plan sees zero changes.

Output (plan diff legend reviewers must know).

Symbol Meaning Reviewer action
+ resource will be created confirm name and config
- resource will be destroyed confirm explicitly in PR description
~ resource update in place check which attributes
-/+ resource destroy + recreate DANGER — data loss likely
<= read data.X data source refresh usually safe

Rule of thumb. Never approve a plan you have not read line by line. The -/+ symbol on a stateful resource (database, table, bucket) is a Friday-afternoon outage waiting to happen — flag it in PR review and force an explicit migration plan.

Worked example — the OpenTofu fork pragmatic switch

Detailed explanation. A team running open-source Terraform 1.5 wants to upgrade. The 1.6 release is the first BSL-licensed version. The team is open-source-first and wants to avoid licence ambiguity. The pragmatic answer is OpenTofu — drop-in replacement, identical state format, identical provider protocol. The switch is a 10-minute exercise per repo.

Question. Walk through the migration from terraform 1.5 to opentofu 1.7 and show what changes (binary, CLI, CI configuration) and what does not (HCL, state, providers).

Input.

Component Terraform 1.5 OpenTofu 1.7
CLI binary terraform tofu
State format tfstate v4 tfstate v4 (identical)
Provider protocol v5 / v6 v5 / v6 (identical)
HCL syntax HCL2 HCL2 (identical)
Module sources Terraform Registry OpenTofu Registry + Terraform Registry
Licence BSL (since 1.6) MPL 2.0

Code.

# 1. Install opentofu (replaces terraform binary in CI)
brew install opentofu
# Or, in a CI image:
# RUN curl -L https://github.com/opentofu/opentofu/releases/download/v1.7.0/tofu_1.7.0_linux_amd64.zip -o tofu.zip
# RUN unzip tofu.zip && mv tofu /usr/local/bin/

# 2. Verify state compatibility (read-only)
tofu version
# OpenTofu v1.7.0

tofu init      # same backend init
tofu plan      # same HCL, same plan output
tofu apply     # same atomic apply
Enter fullscreen mode Exit fullscreen mode
# .github/workflows/infra.yml — CI config swap
jobs:
  plan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: opentofu/setup-opentofu@v1   # was hashicorp/setup-terraform
        with:
          tofu_version: 1.7.0
      - run: tofu init
      - run: tofu plan -out=plan.tfplan
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Install the tofu binary alongside (or replacing) terraform. On macOS: brew install opentofu. The two binaries can coexist on the same machine.
  2. Run tofu init against an existing state file written by Terraform. The state format is identical — OpenTofu reads and writes the same terraform.tfstate v4 schema.
  3. Run tofu plan against the same HCL. Output is byte-identical for the same providers (the only difference is the header line: OpenTofu used the selected providers...).
  4. Swap the CI setup action from hashicorp/setup-terraform to opentofu/setup-opentofu. The rest of the pipeline (init / plan / apply) is unchanged.
  5. Module sources from the Terraform Registry continue to work; OpenTofu also ships its own Registry mirror at registry.opentofu.org for licence-pure consumption.

Output (the migration cost ledger).

Step Time Risk
Install tofu in CI image 5 min low
Swap setup action 2 min low
Test plan on dev workspace 10 min low
Apply on dev 5 min low
Roll out to prod after a week of dev runs gradual low
Total per repo ~25 min low

Rule of thumb. For new projects in 2026, default to OpenTofu unless you have a paid HashiCorp contract (HCP Terraform, Sentinel, run tasks). For existing repos, the migration is cheap; do it once your prod env is stable. The interview signal is naming both and articulating the licence trade-off.

Senior interview question on the case for Terraform in data infra

A senior interviewer often opens with: "You join a 30-person data team that provisions all their cloud infra via the AWS console, Snowflake UI, and Databricks workspace UI. The platform lead wants you to introduce Terraform. Walk me through the 90-day plan — what do you Terraform first, how do you handle existing resources, and what is the success criterion?"

Solution Using import-first, environments-second, modules-third progression

90-day Terraform rollout — 3 phases

Phase 1 (days 1–30) — Import existing prod, no new infra in HCL yet
-------------------------------------------------------------------
- Stand up remote state backend (S3 + DynamoDB lock).
- Pick the smallest stateful slice: the Snowflake roles + grant tree.
- Run `terraform import` for every existing resource.
- Validate `terraform plan` shows zero changes (state matches reality).
- Wire a read-only nightly CI plan that alerts on drift.

Phase 2 (days 31–60) — Add new envs / new resources via HCL
-----------------------------------------------------------
- Build a `staging` env from scratch using the imported prod HCL.
- Validate that staging is a faithful reproduction of prod.
- Every NEW prod resource (new schema, new role, new bucket)
  must be added via Terraform PR. UI changes blocked.

Phase 3 (days 61–90) — Modules + governance
-------------------------------------------
- Extract repeated patterns (Snowflake database, Glue table,
  IAM role-for-Glue) into modules with versioned Git tags.
- Add OPA / Conftest checks on the plan output.
- Roll plan/apply into the team's PR workflow with required approvals.
- Define the success metric: drift count over 7 days = 0.
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Day Action Coverage
1 Provision S3 + DynamoDB backend state infra live
5 Import 1 Snowflake DB + 10 schemas 100% of one slice
10 Import all roles + grants core IAM in state
20 Nightly CI plan green for 7 days drift baseline = 0
35 Provision staging from HCL env parity proven
50 Block UI changes on prod (Snowflake account policy) enforcement live
70 Publish first internal module (snowflake-database) DRY enabled
85 OPA policy: no 0.0.0.0/0 ingress on any new SG policy-as-code live
90 Drift rate over 7 days = 0; new infra via PR-only success criterion met

The 90-day plan front-loads risk: imports first (no new infra), then env parity (no prod-only HCL), then modules + policy. Each phase is independently shippable.

Output:

Phase Deliverable Risk profile
1 — Import terraform plan is no-op on prod low (read-only)
2 — New envs staging mirrors prod medium (writes to a fresh env)
3 — Modules + policy DRY HCL + OPA gates low (refactor + governance)

Why this works — concept by concept:

  • Import first, never greenfield-rewrite — rewriting prod infra from scratch in HCL is a disaster waiting to happen. terraform import brings existing resources into state without recreating them. The plan should be a no-op after import; that proves the HCL matches reality.
  • Backend before code — never start with a local state file. Stand up S3 + DynamoDB locking first; commit zero HCL until the backend is alive. This avoids the "I lost my state file" disaster.
  • Env parity proves the HCL — if staging stood up from the same HCL is byte-identical to prod, the HCL is provably faithful. Otherwise, prod has hidden state nobody captured.
  • Modules come last — premature abstraction kills Terraform projects. Wait until you have 3+ usages of the same pattern, then extract a module. Versioned via Git tag (?ref=v1.2.0) so updates are opt-in.
  • Cost — the rollout is 30-50% of one senior engineer's time for 90 days. Pays back the first time an audit asks "show me when this grant was added"; you point at a Git commit and a PR review.

ETL
Topic — etl
Infrastructure-as-code rollout problems

Practice →

ETL Topic — etl · medium Medium-difficulty ETL design problems

Practice →


2. The Terraform mental model for data infra

terraform builds a DAG from your HCL, then executes it through plan and apply — the state file is the contract

The mental model in one line: HCL describes resources, references between resources build a DAG, terraform plan diffs the DAG against the state, terraform apply walks the DAG in dependency order, and the state file is the persistent source of truth that survives between runs. Once you say "the DAG is the contract; plan diffs it; apply executes it; state remembers it," the entire iac data infrastructure interview surface becomes a deduction from those four stages.

Iconographic Terraform mental-model diagram — a central HCL-brace-as-engine on the left, a downstream resource-DAG of icons (warehouse, table, role, secret) on the right, with a state-file vault below the engine and a thin plan-apply arc connecting the engine to the DAG.

The five core HCL primitives.

  • provider — the plugin that talks to a cloud API. provider "aws" {}, provider "snowflake" {}, provider "databricks" {}. Authenticated per workspace; pinned via required_providers.
  • resource — a single cloud object. resource "aws_s3_bucket" "lake" {} is one S3 bucket. Each resource has a type (aws_s3_bucket) and a name (lake); the pair aws_s3_bucket.lake is the resource address.
  • data — a read-only lookup against a cloud API. data "aws_caller_identity" "current" {} reads the current AWS account id. Data sources are refreshed on every plan.
  • variable — typed input. variable "env" { type = string }. Set via terraform.tfvars, -var CLI flag, or TF_VAR_env env var.
  • output — exported value consumed by other modules or shown after apply. output "database_id" { value = snowflake_database.analytics.id }.

The provider ecosystem for data infra (2026).

  • Snowflake (Snowflake-Labs/snowflake) — snowflake_database, snowflake_schema, snowflake_warehouse, snowflake_role, snowflake_grant_privileges_to_role, snowflake_external_table, snowflake_stage. The community provider is now the canonical one.
  • Databricks (databricks/databricks) — databricks_workspace, databricks_cluster, databricks_job, databricks_catalog, databricks_schema, databricks_table, databricks_grant, databricks_external_location. Unity Catalog is fully Terraform-native.
  • AWS (hashicorp/aws) — aws_glue_catalog_database, aws_glue_catalog_table, aws_glue_job, aws_s3_bucket, aws_iam_role, aws_iam_policy, aws_lakeformation_permissions, aws_athena_workgroup.
  • Google (hashicorp/google) — google_bigquery_dataset, google_bigquery_table, google_bigquery_routine, google_dataproc_cluster, google_storage_bucket.
  • Azure (hashicorp/azurerm) — azurerm_synapse_workspace, azurerm_storage_account, azurerm_storage_data_lake_gen2_filesystem, azurerm_databricks_workspace.
  • Cross-cuttinghashicorp/vault for secrets, confluentinc/confluent for Kafka, mongodb/mongodbatlas for Mongo, datadog/datadog for observability resources.

The declarative graph — Terraform builds a DAG.

  • Every resource reference (x = resource.foo.bar.attr) is a dependency edge. Terraform parses every HCL file, builds a graph, then walks it in topological order.
  • Independent resources are created in parallel (default concurrency 10; tune via -parallelism=N).
  • The DAG also handles destroys in reverse dependency order — a child resource is destroyed before its parent, so the parent destroy sees no live children.
  • Cycles in the DAG (resource A depends on B which depends on A) are a build-time error. The most common cause is a circular module reference; the fix is usually an extra layer or a depends_on indirection.

The state file as source of truth — not the cloud.

  • terraform.tfstate is a JSON document holding the mapping (resource address) → (cloud id) + (attributes).
  • On plan, Terraform reads state, optionally refreshes (terraform plan -refresh-only or auto-refresh on plan), then diffs the desired HCL against the recorded attributes.
  • If the cloud changes outside Terraform (someone clicked in the UI), the next plan shows drift — proposed reverts to bring the cloud back to what state says.
  • The state file contains secrets in plain text (passwords, tokens written during creation). Never commit it; always remote-state it with at-rest encryption.

Lock files (.terraform.lock.hcl) — provider version pinning.

  • Generated by terraform init and committed to Git. Records the exact provider versions and a checksum for each platform (linux_amd64, darwin_arm64, etc).
  • Forces every team member and every CI run to use the same provider version. Without it, init could pull a new minor and break determinism.
  • Update with terraform init -upgrade when you actually want to bump versions. Review the diff in PR — provider upgrades can change resource behaviour.

Workspaces vs directories — the layout debate.

  • Workspaces (terraform workspace new staging) — multiple state files in a single backend, switched by env var or CLI. Light-weight but easy to footgun (terraform plan on the wrong workspace destroys prod).
  • Directories — one folder per env (envs/prod/, envs/staging/, envs/dev/) each with its own main.tf and its own backend config. Heavier but explicit; no chance of wrong-workspace apply.
  • The senior consensus in 2026 — directories per env, modules shared across envs, never rely on workspaces for prod separation. Workspaces are fine for ephemeral feature branches.

Common interview probes on the mental model.

  • "What is a Terraform DAG?" — the dependency graph built from resource references; walked in topological order on apply.
  • "What is in terraform.tfstate?" — JSON mapping of resource addresses to cloud ids and attributes; secrets in plain text; never commit.
  • "What does terraform refresh do, and why was it deprecated as a standalone command in 1.6?" — it queries the cloud for current state of every resource and updates tfstate. Deprecated as standalone because every modern workflow does it implicitly during plan.
  • "Workspaces vs directories?" — directories for env separation, workspaces for ephemeral isolation. Never workspace prod.

Worked example — building the DAG from HCL

Detailed explanation. A simple S3 + Glue + IAM stack illustrates how Terraform infers the DAG. You write resources in any order; Terraform reads references and builds the graph. The order in main.tf is irrelevant; the graph determines apply order.

Question. Given an HCL file with four resources written in arbitrary order, show the DAG Terraform builds and the apply order it produces.

Input (the HCL, written in non-dependency order).

Line Resource References
1 aws_glue_catalog_table.orders aws_glue_catalog_database.lake.name, aws_s3_bucket.raw.bucket
2 aws_s3_bucket.raw
3 aws_iam_role.glue
4 aws_glue_catalog_database.lake

Code.

# Order in the file is irrelevant; the DAG is what matters

resource "aws_glue_catalog_table" "orders" {
  database_name = aws_glue_catalog_database.lake.name
  name          = "orders"
  storage_descriptor {
    location = "s3://${aws_s3_bucket.raw.bucket}/orders/"
    columns { name = "id"  type = "bigint" }
    columns { name = "amt" type = "decimal(18,2)" }
  }
}

resource "aws_s3_bucket" "raw" {
  bucket = "lakehouse-raw-${var.env}"
}

resource "aws_iam_role" "glue" {
  name               = "glue-${var.env}"
  assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}

resource "aws_glue_catalog_database" "lake" {
  name = "lake_${var.env}"
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Terraform parses every .tf file in the directory and builds an internal graph. Each resource is a node; each reference (aws_glue_catalog_database.lake.name inside aws_glue_catalog_table.orders) is an edge from lake to orders.
  2. The graph builder also adds implicit edges from data sources to the resources that reference them.
  3. Topological sort produces the apply order. aws_s3_bucket.raw, aws_iam_role.glue, and aws_glue_catalog_database.lake have no dependencies — they are all created in parallel (level 0).
  4. aws_glue_catalog_table.orders depends on lake and raw. It is in level 1 — created after both level 0 resources finish.
  5. terraform graph | dot -Tpng > graph.png visualises this. Senior engineers use it during reviews of big modules.

Output (the resolved apply order).

Level Resources created in parallel
0 aws_s3_bucket.raw, aws_iam_role.glue, aws_glue_catalog_database.lake
1 aws_glue_catalog_table.orders

Rule of thumb. Write HCL in whatever logical order helps readers. Trust the DAG — never use depends_on to force a sequence unless you have a hidden dependency (e.g. AWS eventual-consistency races) that Terraform cannot infer from references.

Worked example — drift detection on a Snowflake warehouse

Detailed explanation. A team has a Snowflake warehouse LOAD_WH provisioned via Terraform with warehouse_size = "SMALL". Someone with SYSADMIN clicks "Resize" in the Snowflake UI and bumps it to MEDIUM to fix a slow load. Three days later, the nightly terraform plan runs in CI and flags the drift.

Question. Show the drift, the plan output, and the three legitimate ways to resolve it (revert, accept, or refactor).

Input.

Resource HCL says Cloud says
snowflake_warehouse.load SMALL MEDIUM

Code (the HCL — unchanged).

resource "snowflake_warehouse" "load" {
  name           = "LOAD_WH"
  warehouse_size = "SMALL"           # HCL says SMALL
  auto_suspend   = 60
  auto_resume    = true
}
Enter fullscreen mode Exit fullscreen mode
$ terraform plan
Refreshing state...
snowflake_warehouse.load: Refreshing state... [id=LOAD_WH]

Terraform detected the following changes made outside of Terraform since the
last "terraform apply" which may have affected this plan:

  # snowflake_warehouse.load has been changed
  ~ resource "snowflake_warehouse" "load" {
        id             = "LOAD_WH"
      ~ warehouse_size = "SMALL" -> "MEDIUM"
        # (4 unchanged attributes hidden)
    }

Unless you have made equivalent changes to your configuration, or ignored the
relevant attributes using ignore_changes, the following plan may include
actions to undo or respond to these changes.

Terraform will perform the following actions:

  # snowflake_warehouse.load will be updated in-place
  ~ resource "snowflake_warehouse" "load" {
      ~ warehouse_size = "MEDIUM" -> "SMALL"
    }

Plan: 0 to add, 1 to change, 0 to destroy.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. terraform plan refreshes state by calling snowflake_warehouse_show for every resource. It sees the cloud value MEDIUM, but the HCL still says SMALL.
  2. Terraform's response: propose to revert the cloud back to SMALL. That is correct behaviour — HCL is the contract; the cloud is the projection.
  3. Resolution option 1 (revert): apply the plan. The warehouse goes back to SMALL. Whoever resized it must come back through a PR if the change was intentional.
  4. Resolution option 2 (accept): update the HCL to warehouse_size = "MEDIUM" and re-plan. Plan becomes no-op; the team now owns the change in Git.
  5. Resolution option 3 (refactor): if the warehouse size needs to vary per env, lift it to a variable: warehouse_size = var.warehouse_size, set per env in terraform.tfvars.

Output (the three drift-resolution options).

Option When to choose
Revert (apply the plan) The click-ops change was a mistake; revert and re-do via PR
Accept (update HCL) The click-ops change was correct; codify it now
Refactor (variable-ise) The attribute should vary per env; extract to a variable

Rule of thumb. Every drift event is a teaching moment. Run drift detection nightly in CI; for each drift event, the on-call data engineer must pick revert, accept, or refactor — never "ignore". Drift left unhandled compounds.

Worked example — the lock file in action

Detailed explanation. A team has 10 engineers running Terraform locally and a CI pipeline running plans on every PR. Without a lock file, each terraform init could pull a slightly different provider version, leading to "works on my machine" plans. With .terraform.lock.hcl committed to Git, every init pulls the exact same version on every platform.

Question. Show the lock file format and walk through how a planned upgrade looks (the senior-DE-approved way).

Input.

Step Lock file action
1 Engineer runs terraform init first time → lock file generated
2 Commit .terraform.lock.hcl to Git
3 Other engineers / CI run init — pull the exact pinned versions
4 Months later: upgrade aws provider → terraform init -upgrade
5 Review the lock-file diff in PR

Code.

# .terraform.lock.hcl  (generated by terraform init)
provider "registry.terraform.io/hashicorp/aws" {
  version     = "5.42.0"
  constraints = "~> 5.40"
  hashes = [
    "h1:abc123...",
    "zh:def456...",
    "zh:789ghi...",
    # one hash per platform
  ]
}

provider "registry.terraform.io/snowflake-labs/snowflake" {
  version     = "0.95.0"
  constraints = "~> 0.95"
  hashes = [
    "h1:xyz890...",
  ]
}
Enter fullscreen mode Exit fullscreen mode
# Upgrade the AWS provider intentionally
$ terraform init -upgrade
Upgrading modules...
- registry.terraform.io/hashicorp/aws v5.42.0 -> v5.50.0

# Review the diff in PR
$ git diff .terraform.lock.hcl
- version     = "5.42.0"
+ version     = "5.50.0"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The first time terraform init runs against a config, it computes the latest provider version matching the required_providers version constraints and writes the lock file. The lock file has hashes for every supported platform (so a macOS engineer and a linux CI runner get matching versions).
  2. The lock file is committed to Git. Subsequent inits read the lock file first and pull only those exact versions.
  3. To upgrade, run terraform init -upgrade. The lock file is regenerated with the latest versions matching the constraints in required_providers.
  4. The diff in .terraform.lock.hcl is reviewed in the PR. The reviewer also re-runs terraform plan to catch any provider-behaviour changes.
  5. Without the lock file, two engineers could see different plans for the same HCL — a debugging nightmare. With it, plans are reproducible across machines.

Output.

Without lock file With lock file
Plans differ by machine Plans identical across machines
"Works on my machine" debugging Deterministic CI
Surprise version drift on init Explicit upgrade via -upgrade
Provider behaviour changes silently Lock-file diff visible in PR

Rule of thumb. Commit .terraform.lock.hcl to Git. Always. Treat lock-file diffs as significant in PR review; they signal a provider upgrade that may have behaviour changes.

Senior interview question on the Terraform DAG and refresh

A senior interviewer might ask: "Explain what happens inside terraform plan from the moment you press enter to the moment you see the diff. What does Terraform read, what API calls does it make, and what is the role of the state file in each stage?"

Solution Using the parse → graph → refresh → diff pipeline

What `terraform plan` actually does
===================================

1) Parse phase
   - Walk every *.tf file in the working dir + module sources.
   - Validate HCL syntax, resolve variables, expand for_each / count.
   - Result: an in-memory representation of the desired state.

2) Graph phase
   - Build the DAG from resource/data references.
   - Detect cycles → error out.
   - Topologically sort.

3) Refresh phase (default; -refresh=false to skip)
   - For each resource in state, call provider's Read API.
   - Update in-memory state with current cloud attributes.
   - Drift between state and cloud is captured here.

4) Diff phase
   - For each resource in the desired graph:
     - If absent from state → CREATE.
     - If present and attributes match → no-op.
     - If present and attributes differ → UPDATE or REPLACE
       (depending on the resource's ForceNew flags).
   - For each resource in state but not in HCL → DESTROY.
   - Output: the human-readable plan.

5) (optional) -out=plan.tfplan
   - Serialise the plan as a signed binary file.
   - Apply must consume this file (no fresh plan computed).
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage What runs Output
Parse HCL parser + variable resolver in-memory desired state
Graph DAG builder + cycle check sorted resource order
Refresh provider Read calls for every state resource refreshed in-memory state
Diff per-resource compare desired vs refreshed plan delta
Output textual diff (or binary plan if -out) reviewer reads this

The refresh phase is the most expensive in network terms — one API call per resource. For a 500-resource state, that is 500 reads. Skip it with -refresh=false when you know nothing has drifted (e.g. immediate re-plan after an apply).

Output:

Component Reads Writes
Parser .tf files in-memory AST
Graph builder AST DAG
Refresh provider APIs in-memory state
Diff refreshed state + DAG plan
-out plan signed binary plan.tfplan

Why this works — concept by concept:

  • Parse is local, refresh is network, diff is compute — each phase has a different cost shape. Parse is cheap; refresh dominates wall-clock for large states; diff is fast once refresh is done.
  • The DAG is the contract — graph order determines apply order. Cycles fail fast at graph-build time, not at apply time when half the resources are already created.
  • Refresh is the drift-detection mechanism — without refresh, plan would lie. With refresh, plan reflects current cloud reality vs desired state.
  • Signed binary plan-out=plan.tfplan is the audit primitive. It captures exactly what apply will do; the apply cannot deviate from it. CI/CD pipelines should always plan-then-apply with the binary.
  • Cost — plan is O(resources) for refresh + O(graph edges) for diff. For most data infra repos (< 1000 resources), plan completes in 10-60s; for very large states (5000+ resources), splitting into smaller modules / states becomes the right answer.

ETL
Topic — etl
Resource-graph design problems

Practice →

Optimization Topic — optimization Plan-cost optimization problems

Practice →


3. Provisioning a warehouse + lakehouse

terraform snowflake and terraform databricks give you a full warehouse + lakehouse in one HCL graph — same plan/apply ritual, two cloud-native stacks

The mental model in one line: a production warehouse + lakehouse is just a graph of provider-specific resources — snowflake_database and snowflake_warehouse on one side, aws_s3_bucket + aws_glue_catalog_database + aws_glue_catalog_table + aws_lakeformation_permissions on the other — wired by HCL into one plan, one apply, one state. Once you say "one HCL, two stacks, one apply ritual," the entire terraform aws glue and terraform databricks interview surface becomes a deduction from "which provider resource maps to which cloud object?"

Iconographic dual-stack diagram — left card shows a Snowflake warehouse stack (database, schema, warehouse, role) provisioned by HCL; right card shows an AWS lakehouse stack (S3 bucket, Glue database, Iceberg table, Lake Formation permission) provisioned by HCL; a central terraform engine icon wires both stacks.

The Snowflake provider — the warehouse stack.

  • snowflake_database — top-level container. Holds schemas, stages, and snowpipes. One per logical project (ANALYTICS, RAW_LANDING, MARTS).
  • snowflake_schema — namespace inside a database. Holds tables, views, sequences, functions, procedures, file formats, masking policies.
  • snowflake_warehouse — the compute. warehouse_size (XS / SMALL / MEDIUM / LARGE / XLARGE / 2XLARGE / 3XLARGE / 4XLARGE / 5XLARGE / 6XLARGE), auto_suspend (seconds before idle suspend), auto_resume (resume on query), min_cluster_count / max_cluster_count for multi-cluster warehouses.
  • snowflake_role — RBAC primitive. Roles are granted to users and to other roles (role hierarchy). Privileges are granted to roles, never to users directly.
  • snowflake_grant_privileges_to_role — the grant object. Pivots on (role, privilege, target) where target is a schema, table, warehouse, or account-level grant.

The Databricks provider — the lakehouse compute stack.

  • databricks_workspace (provisioned via the AWS / Azure / GCP provider) — the Databricks deployment itself. Usually provisioned once per account, then everything else hangs off it.
  • databricks_cluster — the compute cluster. spark_version, node_type_id, num_workers (or autoscale {}), data_security_mode = "USER_ISOLATION" for Unity Catalog.
  • databricks_job — a scheduled or triggered job. task blocks point at notebooks, JARs, dbt projects, or SQL files. Has its own job-level IAM via databricks_permissions.
  • databricks_catalog / databricks_schema / databricks_table — the Unity Catalog three-level namespace. catalog.schema.table instead of legacy hive_metastore.schema.table.
  • databricks_grant / databricks_grants — grant tree on catalogs, schemas, tables, external locations, storage credentials. The Unity Catalog answer to RBAC.

The BigQuery provider — Google's warehouse.

  • google_bigquery_dataset — top-level namespace (BigQuery's "database" equivalent). Includes location (multi-region or single region), default_table_expiration_ms, access {} blocks for dataset-level IAM.
  • google_bigquery_table — table or view (set view {} block) or materialised view. Schema via JSON or schema {} blocks. time_partitioning {} for partitioned tables, clustering {} for cluster keys.
  • google_bigquery_routine — UDFs (SQL or JavaScript) and stored procedures. Versioned in Git alongside the table HCL.

The AWS lakehouse stack — S3 + Glue + Lake Formation + Athena.

  • aws_s3_bucket — the storage layer. Production-grade buckets add aws_s3_bucket_versioning, aws_s3_bucket_server_side_encryption_configuration, aws_s3_bucket_lifecycle_configuration, aws_s3_bucket_public_access_block — six resources for one bucket is normal.
  • aws_glue_catalog_database — the Hive-style database in the Glue Data Catalog. The catalog itself is per-account; databases are namespaces inside it.
  • aws_glue_catalog_table — table metadata pointing at S3 location, with format params (table_type = "ICEBERG" for Iceberg tables, classification = "parquet" for Parquet/Spark tables).
  • aws_lakeformation_permissions — the grant primitive. Pivots on (principal, permission, resource) where permission is SELECT, DESCRIBE, DATA_LOCATION_ACCESS, ALL, etc.
  • aws_athena_workgroup — the Athena query layer. Workgroup defines result-bucket location, encryption, cost controls. Separate from compute (Athena is serverless).

The Snowflake + Databricks dual-stack pattern.

  • Many production teams run both: Snowflake for SQL analytics, Databricks for ML/Spark/streaming. Terraform composes both stacks into one apply.
  • The shared layer is the catalog: a single source of metadata that both engines can read. Patterns include: Iceberg tables in Glue (Databricks writes, Snowflake reads via external tables) or Unity Catalog with federated Snowflake catalogs.
  • The grant tree must be designed deliberately — same human identity (via SSO / SCIM) maps to a Snowflake role and a Databricks group with overlapping access. Terraform makes the symmetry explicit.

Common interview probes on the provider surface.

  • "How do you create a Snowflake database in Terraform?" — snowflake_database + snowflake_schema + the grant tree.
  • "How do you create a Databricks Unity Catalog table?" — databricks_catalogdatabricks_schemadatabricks_table (or external table via databricks_external_location).
  • "What is the minimal HCL for a lakehouse?" — aws_s3_bucket + aws_glue_catalog_database + aws_glue_catalog_table (Iceberg) + an IAM role for whoever reads/writes.
  • "How do you wire Snowflake to read Iceberg tables from S3?" — snowflake_storage_integration + snowflake_external_volume + snowflake_iceberg_table referencing the S3 path; the Glue catalog can be the external catalog source.

Worked example — lakehouse-in-100-lines end-to-end

Detailed explanation. A complete lakehouse from scratch: an S3 bucket for raw and curated data, a Glue database, an Iceberg-format Glue table, an Athena workgroup, and an IAM role for whatever pipeline writes the data. Each resource is small; together they are a usable lakehouse.

Question. Write the complete HCL for an end-to-end lakehouse (S3 + Glue + Iceberg + Athena + IAM) at minimum production quality. Annotate the dependencies.

Input.

Component Resource Depends on
Storage aws_s3_bucket.lake
Encryption aws_s3_bucket_server_side_encryption_configuration.lake bucket
Versioning aws_s3_bucket_versioning.lake bucket
Catalog DB aws_glue_catalog_database.lake
Iceberg table aws_glue_catalog_table.orders bucket + catalog DB
Athena workgroup aws_athena_workgroup.lake bucket
Pipeline IAM aws_iam_role.pipeline + policies bucket + catalog DB

Code.

locals {
  name = "lakehouse-${var.env}"
}

resource "aws_s3_bucket" "lake" {
  bucket = local.name
}

resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
  bucket = aws_s3_bucket.lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_versioning" "lake" {
  bucket = aws_s3_bucket.lake.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_public_access_block" "lake" {
  bucket                  = aws_s3_bucket.lake.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_glue_catalog_database" "lake" {
  name = replace(local.name, "-", "_")
}

resource "aws_glue_catalog_table" "orders" {
  name          = "orders"
  database_name = aws_glue_catalog_database.lake.name
  table_type    = "EXTERNAL_TABLE"

  parameters = {
    "table_type"        = "ICEBERG"
    "format-version"    = "2"
    "metadata_location" = "s3://${aws_s3_bucket.lake.bucket}/orders/metadata/00000-init.metadata.json"
  }

  storage_descriptor {
    location = "s3://${aws_s3_bucket.lake.bucket}/orders/"

    columns { name = "order_id"    type = "bigint" }
    columns { name = "customer_id" type = "bigint" }
    columns { name = "amount"      type = "decimal(18,2)" }
    columns { name = "order_ts"    type = "timestamp" }
  }
}

resource "aws_athena_workgroup" "lake" {
  name = local.name

  configuration {
    enforce_workgroup_configuration = true
    publish_cloudwatch_metrics_enabled = true

    result_configuration {
      output_location = "s3://${aws_s3_bucket.lake.bucket}/athena-results/"

      encryption_configuration {
        encryption_option = "SSE_S3"
      }
    }
  }
}

data "aws_iam_policy_document" "pipeline_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["glue.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "pipeline" {
  name               = "${local.name}-pipeline"
  assume_role_policy = data.aws_iam_policy_document.pipeline_assume.json
}

data "aws_iam_policy_document" "pipeline_perms" {
  statement {
    actions = [
      "s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"
    ]
    resources = [
      aws_s3_bucket.lake.arn,
      "${aws_s3_bucket.lake.arn}/*",
    ]
  }
  statement {
    actions = [
      "glue:GetDatabase", "glue:GetTable", "glue:GetPartitions",
      "glue:CreateTable", "glue:UpdateTable", "glue:BatchCreatePartition",
    ]
    resources = ["*"]   # narrow further with catalog ARN in prod
  }
}

resource "aws_iam_role_policy" "pipeline" {
  role   = aws_iam_role.pipeline.id
  policy = data.aws_iam_policy_document.pipeline_perms.json
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. locals.name gives every resource a consistent prefix (lakehouse-prod, lakehouse-staging). The replace(...) for the Glue DB name handles the dash → underscore rename (Glue forbids dashes).
  2. The S3 bucket is provisioned with four companion resources: encryption, versioning, public-access block, and lifecycle (omitted here for brevity). This is the production minimum; never ship a bucket without these four.
  3. aws_glue_catalog_database.lake is created with no explicit catalog id — defaults to the account's primary catalog. References work by name (aws_glue_catalog_database.lake.name).
  4. aws_glue_catalog_table.orders is the Iceberg-format table. The Iceberg-specific magic lives in parameters (table_type = "ICEBERG", metadata_location pointing at the bootstrap metadata file). Schema is declared inline; in production, you would point at an existing metadata file managed by the writer (Spark / Glue ETL).
  5. aws_athena_workgroup.lake configures the query layer. The enforce_workgroup_configuration = true flag prevents users from overriding the result location (the most common Athena security mistake).
  6. The IAM role grants Glue catalog read/write + S3 bucket read/write. Production grants would scope glue:* actions to the specific catalog ARN; the wildcard is a teaching simplification.

Output (the apply order Terraform derives).

Level Resources
0 aws_s3_bucket.lake, aws_glue_catalog_database.lake, aws_iam_role.pipeline
1 bucket-companion resources, aws_glue_catalog_table.orders, aws_athena_workgroup.lake, aws_iam_role_policy.pipeline

Rule of thumb. A "minimal" lakehouse is never just S3 + Glue. Production-grade always means S3 + 4 companion resources + Glue + table + Athena workgroup + IAM. Bake all 10 into a lakehouse module so the next env stands up in one variable change.

Worked example — Snowflake warehouse with auto-suspend per env

Detailed explanation. Same Snowflake warehouse, different settings per env. Dev should auto-suspend after 60 seconds; prod stays warm at 5 minutes for low-latency dashboards. Variable-ised HCL handles both from one module.

Question. Write a Snowflake warehouse resource parameterised by env, and show the per-env tfvars files.

Input.

env size auto_suspend (s) min_cluster max_cluster
dev XSMALL 60 1 1
staging SMALL 120 1 2
prod MEDIUM 300 2 5

Code.

# variables.tf
variable "env" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.env)
    error_message = "env must be one of dev, staging, prod."
  }
}

variable "warehouse_size" {
  type    = string
  default = "XSMALL"
}

variable "auto_suspend_s" {
  type    = number
  default = 60
}

variable "min_cluster_count" {
  type    = number
  default = 1
}

variable "max_cluster_count" {
  type    = number
  default = 1
}

# warehouse.tf
resource "snowflake_warehouse" "load" {
  name                = "LOAD_WH_${upper(var.env)}"
  warehouse_size      = var.warehouse_size
  auto_suspend        = var.auto_suspend_s
  auto_resume         = true
  min_cluster_count   = var.min_cluster_count
  max_cluster_count   = var.max_cluster_count
  scaling_policy      = "STANDARD"
  initially_suspended = true
}
Enter fullscreen mode Exit fullscreen mode
# envs/dev.tfvars
env               = "dev"
warehouse_size    = "XSMALL"
auto_suspend_s    = 60
min_cluster_count = 1
max_cluster_count = 1
Enter fullscreen mode Exit fullscreen mode
# envs/prod.tfvars
env               = "prod"
warehouse_size    = "MEDIUM"
auto_suspend_s    = 300
min_cluster_count = 2
max_cluster_count = 5
Enter fullscreen mode Exit fullscreen mode
$ terraform plan -var-file=envs/prod.tfvars
$ terraform apply -var-file=envs/prod.tfvars
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. variable "env" has a validation block — contains(...) ensures only the three known envs are accepted. Wrong env at apply time → fast error, not a half-built warehouse.
  2. Each tuning knob (warehouse_size, auto_suspend_s, min_cluster_count, max_cluster_count) is its own variable with a sensible dev-grade default. Production overrides via tfvars.
  3. The LOAD_WH_${upper(var.env)} naming makes the env explicit in Snowflake itself. Operators see the env in the warehouse list.
  4. initially_suspended = true is the senior-grade default — never auto-resume on first create; let an explicit query be the first thing that wakes the warehouse.
  5. Per-env tfvars files in envs/dev.tfvars, envs/staging.tfvars, envs/prod.tfvars are committed to Git. CI picks the right file based on the branch / workflow.

Output (per-env apply result).

env warehouse name size suspend (s) clusters
dev LOAD_WH_DEV XSMALL 60 1
staging LOAD_WH_STAGING SMALL 120 1–2
prod LOAD_WH_PROD MEDIUM 300 2–5

Rule of thumb. Every per-env tuning knob is a variable, never a hard-coded value. Tfvars files in envs/ map cleanly to CI workflows; the alternative (workspaces) is harder to enforce safely.

Worked example — Databricks Unity Catalog three-level provisioning

Detailed explanation. A team migrates from legacy hive_metastore.schema.table to Unity Catalog's catalog.schema.table. Terraform provisions the catalog, the schemas, the tables, and the grants in one HCL graph. The big change vs legacy: explicit catalog as the top namespace, and grants at every level.

Question. Write the HCL for a Unity Catalog with two schemas (raw, marts) and a single sample table in marts. Show the grant tree giving an analytics_group SELECT on marts.

Input.

Level Resource Name
1 databricks_catalog analytics
2 databricks_schema analytics.raw, analytics.marts
3 databricks_table analytics.marts.orders (external Iceberg)
4 databricks_grants analytics_group SELECT on marts

Code.

resource "databricks_catalog" "analytics" {
  name         = "analytics"
  comment      = "Curated analytics — managed by Terraform"
  storage_root = "s3://${aws_s3_bucket.lake.bucket}/uc/analytics/"
}

resource "databricks_schema" "raw" {
  catalog_name = databricks_catalog.analytics.name
  name         = "raw"
}

resource "databricks_schema" "marts" {
  catalog_name = databricks_catalog.analytics.name
  name         = "marts"
}

resource "databricks_external_location" "orders" {
  name = "orders"
  url  = "s3://${aws_s3_bucket.lake.bucket}/orders/"
  credential_name = databricks_storage_credential.lake.name
}

resource "databricks_table" "orders" {
  catalog_name = databricks_catalog.analytics.name
  schema_name  = databricks_schema.marts.name
  name         = "orders"
  table_type   = "EXTERNAL"
  data_source_format = "DELTA"
  storage_location = databricks_external_location.orders.url

  column { name = "order_id"    type = "BIGINT" }
  column { name = "customer_id" type = "BIGINT" }
  column { name = "amount"      type = "DECIMAL(18,2)" }
  column { name = "order_ts"    type = "TIMESTAMP" }
}

resource "databricks_grants" "marts_read" {
  schema = "${databricks_catalog.analytics.name}.${databricks_schema.marts.name}"

  grant {
    principal  = "analytics_group"
    privileges = ["USE_SCHEMA", "SELECT"]
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. databricks_catalog.analytics is the top-level namespace. storage_root pins the default S3 prefix for managed tables in the catalog. External tables can live anywhere with their own databricks_external_location.
  2. Two schemas (raw and marts) live inside the catalog. Each is a sub-namespace; tables and views go inside schemas.
  3. databricks_external_location.orders is the Unity Catalog primitive that links an S3 path to a storage credential (databricks_storage_credential.lake). Without it, Databricks cannot read/write the path; with it, every read/write checks the location's grant tree.
  4. databricks_table.orders is the external Delta table. Schema is declared inline; on apply, Databricks creates the metadata pointing at the S3 location.
  5. databricks_grants.marts_read grants USE_SCHEMA + SELECT to the analytics_group. Unity Catalog requires the parent-level USE_SCHEMA before any table-level grant counts — a common interview gotcha.

Output (apply order Terraform derives).

Level Resources
0 databricks_catalog.analytics, databricks_storage_credential.lake
1 schemas raw + marts, databricks_external_location.orders
2 databricks_table.orders, databricks_grants.marts_read

Rule of thumb. Always declare grants on the schema (USE_SCHEMA + SELECT/INSERT/...) — a table-level grant alone does not work without the parent USE on the schema. The same rule applies one level up at the catalog (USE_CATALOG).

Senior interview question on provisioning a dual-stack data platform

A senior interviewer might ask: "Your company runs Snowflake for SQL analytics and Databricks for ML/streaming. They share a Glue Iceberg catalog. Walk me through the Terraform layout — how do you organise the HCL across providers, what goes in which state file, and how do the two stacks reference each other?"

Solution Using one root module per env, shared state via remote-state data source

Repo layout for a dual-stack data platform
==========================================

infra/
├── modules/
│   ├── lakehouse/        # S3 + Glue + Iceberg + Athena + Lake Formation
│   ├── warehouse/        # Snowflake DB + schemas + warehouse + roles
│   ├── lakehouse-compute/# Databricks workspace + cluster + UC catalog
│   └── iam/              # Cross-stack IAM (the bridge)
└── envs/
    ├── dev/
    │   ├── backend.tf    # s3 backend, key = "dev/data-platform.tfstate"
    │   ├── main.tf       # module "lakehouse" "warehouse" "lakehouse-compute"
    │   ├── variables.tf
    │   └── terraform.tfvars
    ├── staging/  (same shape)
    └── prod/     (same shape)

State files
-----------
- One state per env (NOT one per stack within an env).
- Reasoning: snowflake → glue grant references span stacks;
  splitting state forces remote-state data lookups everywhere.

Cross-stack references
----------------------
Inside one env, modules reference each other directly:
  module.warehouse.snowflake_role_id
  module.lakehouse.glue_database_name

Across envs (rare):
  data "terraform_remote_state" "prod_lakehouse" {
    backend = "s3"
    config  = { bucket = "...", key = "prod/data-platform.tfstate" }
  }
  # then read data.terraform_remote_state.prod_lakehouse.outputs.bucket_arn
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Decision Choice Reason
State per env vs per stack per env tight cross-stack references
Module per stack yes reusable across envs
Backend per env yes (s3 key includes env) env isolation
Workspaces no directory-per-env is safer
Cross-env reads terraform_remote_state data source explicit + read-only
IAM bridge dedicated iam module grants are first-class

The one-state-per-env layout keeps cross-stack references trivial (direct module outputs) while preserving env isolation (separate backend keys). The alternative — one state per stack — multiplies remote-state lookups and creates ordering nightmares.

Output:

Per-env state contents Cross-stack reference pattern
Snowflake DB + warehouse + roles module.warehouse.outputs.x
AWS S3 + Glue + Iceberg + Athena module.lakehouse.outputs.x
Databricks workspace + UC + clusters module.lakehouse-compute.outputs.x
Cross-stack IAM bridge module.iam.outputs.x

Why this works — concept by concept:

  • One state per env, not per stack — cross-stack grants (Databricks reading a Glue table) are common. Putting them in one state makes references direct (module.x.y). Splitting state forces terraform_remote_state lookups, which are slower and harder to reason about.
  • Modules per stack — the lakehouse, warehouse, lakehouse-compute, and iam modules are reusable across envs. Each env's root composes them with env-specific variables.
  • Directory-per-env, never workspaces in prodenvs/prod/, envs/staging/, envs/dev/ make wrong-env applies impossible (you're physically in the wrong directory). Workspaces let you accidentally apply prod from a dev terminal.
  • The IAM module is the bridge — cross-cloud IAM (Glue role assumed by Snowflake's storage integration, Databricks instance profile reading Glue catalog) lives in its own module. Makes the bridge explicit.
  • Cost — one state per env means each plan refreshes every resource in the env (slow for large states). Mitigation: target-plan during iteration (-target=module.warehouse), full plan in CI.

ETL
Topic — etl
Warehouse + lakehouse design problems

Practice →

ETL Topic — etl · medium Medium pipeline-infra problems

Practice →


4. Catalogs, IAM & policy-as-code

terraform iam policy is the only audit trail you trust — Unity Catalog, Snowflake RBAC, and Lake Formation all collapse to the same grant pattern

The mental model in one line: every catalog (Unity Catalog, Snowflake's database/schema/table tree, AWS Glue catalog) is a hierarchical namespace where grants flow from parent to child, and Terraform expresses both the namespace and the grants as one HCL graph — the same principal, privilege, resource triple regardless of provider. Once you say "namespace + grant tree + Terraform-as-the-only-author," the entire terraform iam policy interview surface becomes a deduction from "who can do what on which resource, and where is that documented?"

Iconographic three-tier IAM diagram — top tier Unity Catalog ribbon with catalog/schema/table glyphs, middle tier Snowflake RBAC pyramid with role-hierarchy nodes, bottom tier AWS Lake Formation shield with policy chips; a vertical policy-as-code spine on the right with OPA/Checkov badges.

Unity Catalog via terraform-databricks — the three-level grant tree.

  • Catalog — top level. Grant USE_CATALOG to anyone who needs to enter the catalog at all.
  • Schema — sub-namespace. Grant USE_SCHEMA to anyone who needs to see tables in the schema.
  • Table — leaf. Grant SELECT, MODIFY, INSERT, etc.
  • The principal can be a user (alice@company.com), a group (analytics_group), or a service principal (databricks-pat-svc).
  • Grants are additive — there is no DENY in Unity Catalog. Removing access means removing the grant via Terraform.

Snowflake RBAC — role hierarchy + privilege grants.

  • Roles are the primary primitive (snowflake_role). Privileges are granted to roles, never directly to users.
  • Role hierarchy — roles can be granted to other roles (snowflake_grant_account_role). A user assumes a role; the role inherits all privileges of all roles granted to it (transitive).
  • Account-level grantsCREATE DATABASE, MONITOR USAGE, MANAGE GRANTS. Granted via snowflake_grant_account_role or snowflake_grant_privileges_to_role with on_account = true.
  • Database/schema/table-level grantsUSAGE, SELECT, INSERT, UPDATE, etc. Granted via snowflake_grant_privileges_to_role with on_schema {} or on_schema_object {} blocks.
  • Functional vs access roles — the senior-DE pattern: define access roles (ROLE_ANALYTICS_RAW_READ) granting privileges, then functional roles (ROLE_ANALYST_PROD) granted access roles. Users are granted functional roles only. Two layers of indirection, infinite flexibility.

AWS Lake Formation — data-lake permissions as Terraform resources.

  • aws_lakeformation_resource — registers an S3 location as a Lake Formation resource. Required before grants can reference the location.
  • aws_lakeformation_permissions — the grant primitive. principal = aws_iam_role.x.arn, permissions = ["SELECT", "DESCRIBE"], table {} or data_location {} block names the target.
  • The Lake Formation gotcha — by default, both IAM permissions and Lake Formation permissions are required. The "Lake Formation mode" (use_lf_tag_based_access_control) flips this so Lake Formation alone gates access — but it must be explicitly enabled.
  • Tag-based access (aws_lakeformation_lf_tag + aws_lakeformation_lf_tag_policy) — assign tags to databases / tables, grant permissions on the tags. The 2026 senior-DE answer to scaling Lake Formation across hundreds of tables.

Cross-account share patterns.

  • Snowflake reader account — provision via snowflake_share + snowflake_grant_privileges_to_share. A consumer account reads the share read-only, with no per-row copy of the data.
  • Iceberg REST share — register an external catalog via the Iceberg REST spec; the consumer's catalog client points at the producer's REST endpoint. Terraform can provision the credentials and the endpoint registration.
  • AWS RAM (Resource Access Manager) for Glue catalogaws_ram_resource_share + aws_ram_principal_association. Share a Glue catalog database across accounts; the consumer account's IAM grants apply on top.

Policy-as-code — combining Terraform with OPA / Conftest / Checkov.

  • OPA (Open Policy Agent) + Conftest — write policy rules in Rego that gate Terraform plans. Examples: "no S3 bucket without encryption", "no IAM policy with Action: * on Resource: *", "every Snowflake warehouse must have auto_suspend ≤ 300".
  • Checkov — pre-built rule library scanning HCL for misconfigurations. Ships hundreds of rules; runs in CI; output in SARIF / JSON / human-readable formats.
  • HashiCorp Sentinel — proprietary policy-as-code for HCP Terraform / Terraform Cloud. Same idea, different syntax, only inside the HashiCorp ecosystem.
  • The senior pattern — Checkov for general scans, custom OPA/Rego for company-specific rules, Sentinel only if you are paying for HCP Terraform. All three gate the PR before plan is approved.

Secrets handling — never password = "…" in HCL.

  • AWS Secrets Managerdata "aws_secretsmanager_secret_version" "x" { secret_id = "..." }, then secret.x.secret_string in the resource that needs it.
  • HashiCorp Vaultdata "vault_generic_secret" "x" { path = "secret/..." }, then data.vault_generic_secret.x.data["password"].
  • GCP Secret Managerdata "google_secret_manager_secret_version" "x" { secret = "..." }, then data.google_secret_manager_secret_version.x.secret_data.
  • Sensitive variable markervariable "x" { sensitive = true }. Hides the value from plan output. Does not encrypt it in state — the state file still has it in plain text. Always use remote state with at-rest encryption.

Common interview probes on catalogs + IAM.

  • "How do you grant SELECT on a Unity Catalog table?" — three grants: USE_CATALOG, USE_SCHEMA, SELECT. Parent before child.
  • "How do you handle secrets in Terraform?" — data source against Secrets Manager / Vault; never inline; never commit a tfvars with secrets.
  • "What is the difference between Snowflake account-level and database-level grants?" — account-level (e.g. CREATE DATABASE) is granted via on_account; database-level (USAGE) via on_schema {} or on_account_object {} blocks.
  • "How do you scope an IAM policy to least privilege?" — never Action: * or Resource: *; explicit action lists; resource ARNs with conditions where possible.

Worked example — full Snowflake RBAC tree with access + functional roles

Detailed explanation. A team has three logical access patterns: read raw, read marts, write marts. Two functional roles: analyst_prod (read marts only), engineer_prod (read raw + write marts). The senior-DE pattern uses access roles as the building blocks and functional roles as the user-facing layer.

Question. Write the HCL for the three access roles, the two functional roles, the role-hierarchy grants, and the privilege grants. Show the apply order Terraform derives.

Input.

Access role Privileges
ROLE_RAW_READ USAGE on db + USAGE on schema raw + SELECT on all tables in raw
ROLE_MARTS_READ USAGE on db + USAGE on schema marts + SELECT on all tables in marts
ROLE_MARTS_WRITE USAGE on db + USAGE on schema marts + INSERT/UPDATE/DELETE on all tables in marts
ROLE_ANALYST_PROD granted ROLE_MARTS_READ
ROLE_ENGINEER_PROD granted ROLE_RAW_READ + ROLE_MARTS_READ + ROLE_MARTS_WRITE

Code.

# Access roles
resource "snowflake_role" "raw_read"     { name = "ROLE_RAW_READ"     }
resource "snowflake_role" "marts_read"   { name = "ROLE_MARTS_READ"   }
resource "snowflake_role" "marts_write"  { name = "ROLE_MARTS_WRITE"  }

# Functional roles
resource "snowflake_role" "analyst_prod"  { name = "ROLE_ANALYST_PROD"  }
resource "snowflake_role" "engineer_prod" { name = "ROLE_ENGINEER_PROD" }

# Role hierarchy — functional roles inherit access roles
resource "snowflake_grant_account_role" "analyst_marts_read" {
  role_name        = snowflake_role.marts_read.name
  parent_role_name = snowflake_role.analyst_prod.name
}

resource "snowflake_grant_account_role" "engineer_raw_read" {
  role_name        = snowflake_role.raw_read.name
  parent_role_name = snowflake_role.engineer_prod.name
}

resource "snowflake_grant_account_role" "engineer_marts_read" {
  role_name        = snowflake_role.marts_read.name
  parent_role_name = snowflake_role.engineer_prod.name
}

resource "snowflake_grant_account_role" "engineer_marts_write" {
  role_name        = snowflake_role.marts_write.name
  parent_role_name = snowflake_role.engineer_prod.name
}

# Privilege grants — db USAGE for all access roles
resource "snowflake_grant_privileges_to_role" "raw_read_db_usage" {
  role_name = snowflake_role.raw_read.name
  privileges = ["USAGE"]
  on_account_object {
    object_type = "DATABASE"
    object_name = snowflake_database.analytics.name
  }
}
# (… similar blocks for marts_read, marts_write …)

# Schema-level grants
resource "snowflake_grant_privileges_to_role" "raw_read_schema_usage" {
  role_name = snowflake_role.raw_read.name
  privileges = ["USAGE"]
  on_schema {
    schema_name = "${snowflake_database.analytics.name}.${snowflake_schema.raw.name}"
  }
}

# Table-level grants — applies to all current + future tables in schema
resource "snowflake_grant_privileges_to_role" "raw_read_select" {
  role_name = snowflake_role.raw_read.name
  privileges = ["SELECT"]
  on_schema_object {
    future {
      object_type_plural = "TABLES"
      in_schema          = "${snowflake_database.analytics.name}.${snowflake_schema.raw.name}"
    }
  }
}

resource "snowflake_grant_privileges_to_role" "marts_write_dml" {
  role_name = snowflake_role.marts_write.name
  privileges = ["INSERT", "UPDATE", "DELETE"]
  on_schema_object {
    future {
      object_type_plural = "TABLES"
      in_schema          = "${snowflake_database.analytics.name}.${snowflake_schema.marts.name}"
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Five role resources: three access roles (the privilege bundles) and two functional roles (the user-facing layer). Functional roles inherit access roles via snowflake_grant_account_role.
  2. The role hierarchy is the indirection. Adding ROLE_ANALYST_STAGE only requires creating one new functional role and granting the relevant access roles — no privilege grants to write.
  3. Privilege grants use three levels: on_account_object { object_type = "DATABASE" } for db-level USAGE, on_schema {} for schema-level USAGE, on_schema_object { future {} } for "all current and future tables in this schema".
  4. The future grant is the senior-DE pattern — it future-proofs the grant tree. New tables get the grant automatically; no Terraform re-run needed when a new dbt model lands.
  5. Users are granted only functional roles (SCIM_ANALYST group → ROLE_ANALYST_PROD). Access roles are never user-facing; they are internal building blocks.

Output (the resolved grant tree).

Functional role Inherits Effective privileges
ROLE_ANALYST_PROD ROLE_MARTS_READ USAGE on db + USAGE schema marts + SELECT marts.*
ROLE_ENGINEER_PROD ROLE_RAW_READ + ROLE_MARTS_READ + ROLE_MARTS_WRITE USAGE db + USAGE raw + USAGE marts + SELECT raw.* + INSERT/UPDATE/DELETE marts.*

Rule of thumb. Always separate access roles from functional roles. Users get functional only; functional inherits access; access has the privileges. Adding a new env, a new dataset, or a new user is then a one-resource change.

Worked example — Lake Formation tag-based access at scale

Detailed explanation. A team has 200 tables in Glue. Granting permissions per table per role is unmanageable. Lake Formation tags let you tag tables (sensitivity = pii, domain = sales) and grant on the tag pattern. One grant covers every table that matches.

Question. Write HCL that creates two LF tags (sensitivity, domain), applies them to tables, and grants analytics_role SELECT on every table where domain = sales AND sensitivity = non_pii.

Input.

Resource Value
Tag sensitivity values: pii, non_pii
Tag domain values: sales, marketing, finance
Grant analytics_role SELECT on (domain=sales AND sensitivity=non_pii)

Code.

resource "aws_lakeformation_lf_tag" "sensitivity" {
  key    = "sensitivity"
  values = ["pii", "non_pii"]
}

resource "aws_lakeformation_lf_tag" "domain" {
  key    = "domain"
  values = ["sales", "marketing", "finance"]
}

# Tag a specific table
resource "aws_lakeformation_resource_lf_tags" "orders_tags" {
  table {
    database_name = aws_glue_catalog_database.lake.name
    name          = aws_glue_catalog_table.orders.name
  }
  lf_tag {
    key   = aws_lakeformation_lf_tag.sensitivity.key
    value = "non_pii"
  }
  lf_tag {
    key   = aws_lakeformation_lf_tag.domain.key
    value = "sales"
  }
}

# Tag-based grant
resource "aws_lakeformation_permissions" "analytics_sales_non_pii" {
  principal   = aws_iam_role.analytics.arn
  permissions = ["SELECT", "DESCRIBE"]

  lf_tag_policy {
    resource_type = "TABLE"

    expression {
      key    = "domain"
      values = ["sales"]
    }
    expression {
      key    = "sensitivity"
      values = ["non_pii"]
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. aws_lakeformation_lf_tag defines the tag and its allowed values. Snowflake-style enums; the tag is account-scoped.
  2. aws_lakeformation_resource_lf_tags attaches tag values to a specific table. In production, the tagging is often done by the table-owner module — each table resource gets a companion tagging resource.
  3. aws_lakeformation_permissions grants permissions to a principal via a tag policy. The lf_tag_policy block's expression blocks are ANDed — the grant applies only where every expression matches.
  4. The runtime effect: analytics_role can SELECT on any current or future table with domain = sales AND sensitivity = non_pii. Add a new table with those tags → grant applies automatically.
  5. The scaling win: a team with 200 tables can replace 200 per-table grants with 5-10 tag-based grants. Adding a table is one tagging resource; the grant fans out.

Output.

Tag combination Tables matching Role with SELECT
domain=sales AND sensitivity=non_pii orders, customers, line_items analytics_role
domain=sales AND sensitivity=pii sales_pii (no grant — fenced off)
domain=marketing AND sensitivity=non_pii (other grants) (other roles)

Rule of thumb. Above 20-30 tables, switch to LF tag-based access. The setup cost is small (define tags once, tag tables once); the maintenance win is massive (new tables auto-inherit grants).

Worked example — secrets handling via AWS Secrets Manager

Detailed explanation. A Snowflake provider needs a password. Hard-coding it in HCL is the most common Terraform security mistake. The senior-DE pattern: store the password in AWS Secrets Manager, read it via a data source, reference it in the provider block. The state file still contains it (Terraform limitation), so remote state with encryption is mandatory.

Question. Show the full pattern: secret creation (one-time, manual), data-source read, provider use, and the state-file-encryption gotcha.

Input.

Step What runs
1 (one-time) Create the secret manually or via separate Terraform run
2 (HCL) data "aws_secretsmanager_secret_version" reads it
3 (HCL) Provider block consumes the value
4 (backend) S3 backend with KMS encryption

Code.

# secret.tf — one-time, can live in a separate state if desired
resource "aws_secretsmanager_secret" "snowflake_admin" {
  name        = "snowflake/admin"
  description = "Snowflake admin password — rotated quarterly"
  # NEVER set the value in HCL; rotate via aws cli or Vault
}

# main.tf
data "aws_secretsmanager_secret_version" "snowflake_admin" {
  secret_id = "snowflake/admin"
}

provider "snowflake" {
  account  = var.snowflake_account
  username = var.snowflake_admin_username
  password = data.aws_secretsmanager_secret_version.snowflake_admin.secret_string
  role     = "SYSADMIN"
}

# variables.tf
variable "snowflake_admin_username" {
  type      = string
  sensitive = true
}

# backend.tf — REMOTE state with at-rest encryption
terraform {
  backend "s3" {
    bucket         = "company-tfstate"
    key            = "data-platform/prod.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "arn:aws:kms:us-east-1:123456789012:key/abc-def-..."
    dynamodb_table = "tf-locks"
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The secret is created once (outside Terraform or in a bootstrapping state). The HCL never sets the secret value — only the resource shell with name + description.
  2. The data source aws_secretsmanager_secret_version.snowflake_admin reads the current value at plan/apply time. Rotation is then orthogonal to Terraform — rotate the secret value, run apply, the new value flows through.
  3. The provider block references data...secret_string. Terraform marks this value sensitive (won't print in plan output).
  4. The state-file gotcha — Terraform copies the secret value into the state file when the data source is refreshed. Even though it is sensitive, it's there in plain text JSON. Always use remote state with backend-level encryption (encrypt = true + kms_key_id).
  5. The S3 backend with KMS encryption ensures the state file is encrypted at rest. The DynamoDB lock table prevents concurrent applies (the lock itself is harmless without the encrypted bucket).

Output (the secret-handling layer cake).

Layer Mechanism Where the secret lives
Source of truth Secrets Manager encrypted in AWS
HCL data "aws_secretsmanager_secret_version" reference only
Plan output (sensitive value) hidden
State file refreshed value in JSON plain text → encrypt with KMS
Backend S3 + KMS + DynamoDB encrypted bucket only

Rule of thumb. Never password = "…" in HCL. Always Secrets Manager / Vault / Secret Manager. Always remote state with at-rest encryption. Audit the state-file access list — anyone who can read the bucket can read every secret your Terraform touched.

Senior interview question on the cross-stack IAM grant design

A senior interviewer might ask: "Walk me through how you would design the IAM bridge between Snowflake and AWS for a Snowflake storage integration that lets Snowflake read Iceberg tables from S3 via the Glue catalog. What resources do you provision, in what order, and where can the design break?"

Solution Using a storage-integration + role + bucket-policy three-part bridge

Snowflake ↔ AWS bridge for external Iceberg

Step 1 — Snowflake side: create the storage integration
       snowflake_storage_integration
         type = EXTERNAL_STAGE
         storage_provider = S3
         storage_aws_role_arn = "arn:aws:iam::123:role/snowflake-bridge"
         storage_allowed_locations = ["s3://lake/orders/"]

Step 2 — Snowflake gives back AWS-side IAM details
       data.snowflake_storage_integration.x.outputs:
         storage_aws_external_id = "ABC123..."
         storage_aws_iam_user_arn = "arn:aws:iam::SF-ACCOUNT:user/..."

Step 3 — AWS side: create the role with TRUST to Snowflake's IAM user
       aws_iam_role.snowflake_bridge
         assume_role_policy = {
           Principal = "arn:aws:iam::SF-ACCOUNT:user/..."
           Condition = { sts:ExternalId = "ABC123..." }
         }

Step 4 — AWS side: attach a policy granting S3 + Glue read on the bucket
       aws_iam_role_policy.snowflake_bridge_perms
         Action: [s3:GetObject, s3:ListBucket, glue:GetTable, glue:GetPartitions]
         Resource: ["arn:aws:s3:::lake/*", "arn:aws:glue:...:catalog", ...]

Step 5 — Snowflake side: create the external volume + iceberg table
       snowflake_external_volume {
         storage_locations = [...]
       }
       snowflake_iceberg_table {
         external_volume = snowflake_external_volume.x.name
         catalog          = "glue"
       }
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Resource Direction
1 snowflake_storage_integration Snowflake side — declares the intent
2 (read outputs) Snowflake → AWS
3 aws_iam_role.snowflake_bridge AWS side — trusts Snowflake IAM user
4 aws_iam_role_policy.snowflake_bridge_perms AWS side — S3 + Glue read
5 snowflake_iceberg_table Snowflake side — points at Glue catalog

The order matters: Snowflake-side integration first (produces the external id), then AWS-side role (trusts the external id), then Snowflake-side table (uses the integration). The DAG handles it via references.

Output:

Resource Purpose Where state lives
snowflake_storage_integration declares the AWS role Snowflake will assume Snowflake-side state
aws_iam_role.snowflake_bridge the role with trust policy from Snowflake AWS-side state
aws_iam_role_policy.snowflake_bridge_perms S3 + Glue grants AWS-side state
snowflake_external_volume wraps the integration Snowflake-side
snowflake_iceberg_table the actual table Snowflake-side

Why this works — concept by concept:

  • External id as the binding — the external id Snowflake generates is checked by AWS during AssumeRole. Without it, anyone with the role ARN could impersonate Snowflake. With it, only Snowflake's IAM user (verified externally) can assume.
  • Trust policy on the AWS side — the bridge is one-way: AWS trusts Snowflake, not vice versa. Snowflake's storage integration says "here's the role I will use"; AWS says "yes, you may use it under these conditions".
  • Least-privilege on the policy — the role's policy is scoped to specific S3 prefixes (s3://lake/orders/*) and the Glue catalog + database, never Action: * or Resource: *. Senior reviewers will reject any wildcard here.
  • Iceberg catalog as the metadata source — Snowflake reads the Iceberg metadata files directly from S3, but uses the Glue catalog to find them. The bridge is two-protocol: S3 data access + Glue catalog access.
  • Cost — five resources for the bridge, plus the table definition. The pattern is reusable as a snowflake-aws-iceberg-bridge module. State-management complexity rises slightly (cross-stack references), but the alternative — manual UI configuration — is unauditable.

ETL
Topic — etl
Cross-cloud IAM design problems

Practice →

Optimization Topic — optimization Least-privilege optimization problems

Practice →


5. Modules, state & CI/CD ritual

terraform modules + remote state + plan/apply CI is the production triangle — every senior data platform team converges on it

The mental model in one line: modules abstract the repeated patterns, remote state survives the team turnover, and the plan/apply CI ritual is the only safe path to production — together they form the triangle that turns Terraform from "scripts run on a laptop" into "platform engineering with audit trails". Once you say "modules for DRY, remote state for durability, CI for safety," the entire terraform modules + terraform state backend + terragrunt interview surface becomes a deduction from those three pillars.

Iconographic CI/CD ritual diagram — left a reusable module-stack with input/output ports, centre a plan/apply pipeline with PR-review gate, right a remote state vault with S3 + DynamoDB lock icons, all wired by glowing GitHub Actions ribbons.

Module design — input variables, outputs, versioned.

  • Input variables with types and validation — every input is typed (string, number, bool, list(...), object({...})) and validated where it matters (validation { condition = contains(["dev", "staging", "prod"], var.env) }).
  • Outputs as contract — every cross-module reference is an explicit output. Modules without outputs are useless to compose; modules with too many outputs leak internal detail.
  • Versioned via Git tagssource = "git::https://github.com/company/tf-modules.git//snowflake-database?ref=v1.2.0". The ?ref=v1.2.0 pins the module version; updates are opt-in by bumping the ref.
  • Or via Terraform Registry — public modules live at registry.terraform.io; private modules can live at HCP Terraform's private registry. Same ?ref= semantics.
  • README + examples — every module ships a README with usage examples. The reviewer reads the README; the consumer copies the example.

Root module composition vs Terragrunt DRY composition.

  • Root module composition — each env directory has a main.tf that calls modules: module "lakehouse" { source = "..." env = "prod" }. Plain Terraform; no extra tools.
  • Terragrunt — a thin wrapper that adds DRY (don't repeat yourself) via include blocks: every env inherits a common terragrunt.hcl configuration. Reduces backend.tf and provider.tf duplication across envs.
  • The senior consensus in 2026 — for ≤ 5 envs, plain Terraform is simpler. For 10+ envs (multi-region, multi-cloud), Terragrunt's DRY wins. For OpenTofu users, native tofu 1.7+ adds early variable evaluation that eliminates 80% of the Terragrunt use case.

Remote state backends — S3 + DynamoDB lock, GCS, Azure Blob, HCP Terraform.

  • S3 + DynamoDB — the AWS-native choice. S3 holds the state file with versioning + KMS encryption; DynamoDB holds the lock (one row per state file). terraform { backend "s3" { bucket = "..." key = "env/x.tfstate" dynamodb_table = "tf-locks" } }.
  • GCS — Google Cloud Storage with object-level locking. Single resource type covers both state and lock.
  • Azure Blob — same idea, blob storage + lease-based locking.
  • HCP Terraform / Terraform Cloud — managed remote state with versioning, locking, encryption, and a web UI. Adds run-as-a-service (plan/apply happens in HashiCorp's cloud).
  • Local state — useful only for tutorials and one-off labs. Never for production; never for any state shared by more than one person.

Drift detection — terraform plan in CI, terraform refresh deprecated in 1.6+.

  • Nightly drift detection — a scheduled CI job runs terraform plan on every state. Any non-no-op plan is reported (Slack alert, GitHub issue, PagerDuty).
  • terraform refresh deprecation — the standalone command was deprecated in 1.6; the modern equivalent is terraform apply -refresh-only, which refreshes state without proposing changes. Useful when you need to import-aware updates.
  • The drift-resolution discipline — every drift event must be triaged: revert (apply the plan), accept (update HCL), or refactor (variable-ise). Ignoring drift compounds; week-old drift becomes month-old drift becomes "we don't know what's in our cloud".

Plan/apply CI pipeline — PR → plan → review → apply.

  • PR creates plan — pull-request CI runs terraform init + terraform plan -out=plan.tfplan. Plan output is posted to the PR as a comment.
  • Review — humans read the plan. Required approvers depend on the env (prod = senior+; dev = anyone).
  • Apply on merge — merge to main triggers terraform apply plan.tfplan (the saved plan). The apply is atomic; either all resources change or none.
  • ToolsAtlantis (open-source, runs in your cluster, posts to GitHub), HCP Terraform (paid, runs in HashiCorp's cloud), env0 / Spacelift (paid SaaS competitors), or homegrown GitHub Actions / GitLab CI.

The blast-radius question — per-env state, per-service state, per-team state.

  • Per-env state — minimum split. dev.tfstate, staging.tfstate, prod.tfstate. The default.
  • Per-service state within env — split state by data-product or by stack (prod/warehouse.tfstate, prod/lakehouse.tfstate). Useful when one stack's plan time is dominating CI.
  • Per-team state — each team owns their state. Tighter blast radius; cross-team references via terraform_remote_state data sources.
  • The trade-off — finer splits = smaller blast radius but more cross-stack references. The senior answer: split when one state file's plan exceeds 5-10 minutes or has > 500 resources.

Common interview probes on modules + state + CI.

  • "How do you version a Terraform module?" — Git tag + ?ref=v1.2.0 in source URL, or publish to a Registry with semver.
  • "How do you handle a terraform destroy blast radius?" — per-env state + targeted-destroy + protection on prod resources (prevent_destroy = true).
  • "What is the difference between terraform apply -refresh-only and terraform refresh?" — the latter is deprecated; the former is the modern equivalent that updates state without proposing changes.
  • "What does Atlantis do?" — runs Terraform plan/apply on PR comments in a controlled environment, posts results back to the PR.

Worked example — a reusable snowflake-database module

Detailed explanation. A team repeats the same pattern: create a Snowflake database with N schemas, a load warehouse, a read role, a write role, and the grant tree. Repeating this 10 times across envs / projects is wasteful. A module encapsulates it; envs invoke it with a few variables.

Question. Write the module shell — variables, resources, outputs — and show an env-level invocation.

Input.

Module input Type Example
name string "ANALYTICS"
schemas list(string) ["raw", "staging", "marts"]
warehouse_size string "SMALL"
auto_suspend_s number 60

Code.

# modules/snowflake-database/variables.tf
variable "name" {
  type = string
  validation {
    condition     = upper(var.name) == var.name
    error_message = "name must be UPPERCASE."
  }
}

variable "schemas" {
  type    = list(string)
  default = ["raw", "staging", "marts"]
}

variable "warehouse_size" {
  type    = string
  default = "XSMALL"
}

variable "auto_suspend_s" {
  type    = number
  default = 60
}

# modules/snowflake-database/main.tf
resource "snowflake_database" "this" {
  name = var.name
}

resource "snowflake_schema" "schemas" {
  for_each = toset(var.schemas)
  database = snowflake_database.this.name
  name     = upper(each.value)
}

resource "snowflake_warehouse" "load" {
  name           = "LOAD_${var.name}"
  warehouse_size = var.warehouse_size
  auto_suspend   = var.auto_suspend_s
  auto_resume    = true
}

resource "snowflake_role" "read" {
  name = "ROLE_${var.name}_READ"
}

resource "snowflake_role" "write" {
  name = "ROLE_${var.name}_WRITE"
}

resource "snowflake_grant_privileges_to_role" "read_db_usage" {
  role_name  = snowflake_role.read.name
  privileges = ["USAGE"]
  on_account_object {
    object_type = "DATABASE"
    object_name = snowflake_database.this.name
  }
}

# modules/snowflake-database/outputs.tf
output "database_name" {
  value = snowflake_database.this.name
}

output "warehouse_name" {
  value = snowflake_warehouse.load.name
}

output "read_role_name" {
  value = snowflake_role.read.name
}

output "write_role_name" {
  value = snowflake_role.write.name
}
Enter fullscreen mode Exit fullscreen mode
# envs/prod/main.tf — invoking the module
module "analytics_db" {
  source = "git::https://github.com/company/tf-modules.git//snowflake-database?ref=v1.2.0"

  name           = "ANALYTICS"
  schemas        = ["raw", "staging", "marts", "audit"]
  warehouse_size = "MEDIUM"
  auto_suspend_s = 300
}

# Cross-module reference
resource "snowflake_grant_account_role" "analyst_inherits_read" {
  role_name        = module.analytics_db.read_role_name
  parent_role_name = snowflake_role.analyst_prod.name
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The module's variables.tf defines its public interface — every input is typed and validated where useful. validation { condition = upper(var.name) == var.name } enforces uppercase naming convention.
  2. The module's main.tf uses for_each to expand schemas dynamically. With ["raw", "staging", "marts"], three schema resources are created; with [..., "audit"], four. No code change needed.
  3. The module's outputs.tf exposes the four critical names back to callers. The consumer treats these as the module's contract.
  4. The env-level main.tf invokes the module via source = "git::...". The ?ref=v1.2.0 pins the version; bumping to v1.3.0 is an explicit opt-in.
  5. Cross-module references use the output names: module.analytics_db.read_role_name. Terraform's DAG picks up the dependency automatically.

Output (the module's contract).

Module call Resources created Outputs exposed
module "analytics_db" with 4 schemas 1 db + 4 schemas + 1 warehouse + 2 roles + 1 grant 4 names

Rule of thumb. Extract a module after the third copy-paste. Earlier is premature abstraction; later is technical debt. The third repetition is when patterns become real.

Worked example — S3 + DynamoDB remote state with KMS encryption

Detailed explanation. The minimum production-grade remote state setup: an S3 bucket with versioning + KMS encryption + a DynamoDB table for locking. Pre-provisioned via a "bootstrap" Terraform run that uses local state (chicken-and-egg), then every other state file lives in this backend.

Question. Write the bootstrap HCL (the state-backend resources) and the consumer-side backend config that points at it.

Input.

Resource Purpose
aws_s3_bucket.tfstate state storage
aws_s3_bucket_versioning.tfstate undo history
aws_s3_bucket_server_side_encryption_configuration.tfstate at-rest encryption
aws_kms_key.tfstate encryption key
aws_dynamodb_table.tflocks apply locking

Code.

# bootstrap/main.tf — local-state Terraform that creates the remote backend
resource "aws_kms_key" "tfstate" {
  description             = "KMS key for Terraform state encryption"
  deletion_window_in_days = 30
  enable_key_rotation     = true
}

resource "aws_kms_alias" "tfstate" {
  name          = "alias/tfstate"
  target_key_id = aws_kms_key.tfstate.key_id
}

resource "aws_s3_bucket" "tfstate" {
  bucket = "company-tfstate"
}

resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  versioning_configuration { status = "Enabled" }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.tfstate.arn
    }
  }
}

resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_dynamodb_table" "tflocks" {
  name         = "tf-locks"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }

  point_in_time_recovery { enabled = true }
}
Enter fullscreen mode Exit fullscreen mode
# envs/prod/backend.tf — consumer-side backend config
terraform {
  required_version = ">= 1.6"
  backend "s3" {
    bucket         = "company-tfstate"
    key            = "envs/prod/data-platform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    kms_key_id     = "alias/tfstate"
    dynamodb_table = "tf-locks"
  }
}
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The bootstrap HCL creates the backend resources using local state (no backend block). This is the only acceptable use of local state in production — provisioning the backend itself. Once done, the bootstrap state can be terraform import-ed into the new remote backend for self-management.
  2. The KMS key + alias is created first. Rotation is enabled (enable_key_rotation = true) — non-negotiable for any encryption key.
  3. The S3 bucket has four companions: versioning (undo state history), encryption (KMS), public-access-block (the four "block public" toggles must all be true), and lifecycle (omitted, but production-grade keeps state versions ≥ 90 days).
  4. The DynamoDB table holds the lock rows. Each running apply writes a row with the state-file key; on completion, deletes the row. Concurrent applies block on the lock. PAY_PER_REQUEST keeps cost negligible for the typical low-volume access.
  5. The consumer-side backend.tf points at the bucket + key + lock table. kms_key_id = "alias/tfstate" ensures state writes use the KMS key for encryption.

Output.

Backend component Cost / month (typical) Purpose
S3 bucket (KMS) $1-5 (state versioning included) state storage
KMS key $1 encryption
DynamoDB lock table < $1 (low traffic) concurrent-apply locking
Total per env ~$5-10/month full remote backend

Rule of thumb. Never start a Terraform repo without remote state. The 30-minute bootstrap cost is repaid the first time someone runs terraform apply on the wrong workspace or pushes a state file to Git by accident.

Worked example — GitHub Actions plan/apply pipeline

Detailed explanation. The minimum production-grade CI pipeline for Terraform: PR opens → init + plan → plan posted as comment → reviewer approves → merge triggers apply with the saved plan. OIDC authentication to AWS (no long-lived secrets).

Question. Write the GitHub Actions workflow with two jobs (plan on PR, apply on merge) using OIDC for AWS auth and posting plan output to the PR.

Input.

Trigger Job Action
PR opened/synchronised plan init + plan + comment
Push to main apply init + apply with saved plan

Code.

# .github/workflows/terraform.yml
name: Terraform

on:
  pull_request:
    paths: ["envs/prod/**", "modules/**"]
  push:
    branches: [main]
    paths: ["envs/prod/**", "modules/**"]

permissions:
  id-token: write    # OIDC to AWS
  contents: read
  pull-requests: write

jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-tf-plan
          aws-region: us-east-1
      - uses: opentofu/setup-opentofu@v1
        with:
          tofu_version: 1.7.0
      - run: tofu init -input=false
      - run: tofu plan -out=plan.tfplan -no-color
        id: plan
        continue-on-error: false
      - name: Post plan to PR
        uses: actions/github-script@v7
        with:
          script: |
            const output = `### Terraform plan (envs/prod)\n\n\`\`\`\n${process.env.PLAN}\n\`\`\``;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
        env:
          PLAN: ${{ steps.plan.outputs.stdout }}

  apply:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production         # requires manual approval
    defaults:
      run:
        working-directory: envs/prod
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-tf-apply
          aws-region: us-east-1
      - uses: opentofu/setup-opentofu@v1
        with:
          tofu_version: 1.7.0
      - run: tofu init -input=false
      - run: tofu plan -out=plan.tfplan -no-color
      - run: tofu apply -auto-approve plan.tfplan
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The workflow triggers on PRs (for plan) and pushes to main (for apply). paths: filter limits runs to changes that actually touch infra files.
  2. permissions.id-token: write enables OIDC. The GitHub-issued token is exchanged for an AWS role via aws-actions/configure-aws-credentials@v4. No long-lived AWS keys in secrets.
  3. The plan job runs on PR; output is posted back to the PR as a comment so reviewers see the diff inline.
  4. The apply job runs on push to main, after the PR's plan was reviewed and merged. environment: production triggers GitHub's manual-approval step — a designated reviewer must click approve before apply starts.
  5. The apply re-runs tofu plan -out=plan.tfplan and then applies the saved plan. The re-plan is a safety net: if anything drifted between merge and apply, the diff is visible (the apply will fail if the diff is unexpected; you can configure stricter checks).

Output (the gate sequence per PR).

Gate What runs Required approval
PR opened plan job + comment reviewer reads diff
PR approved + merged (CI does not block here) branch protection settings
Apply on push to main apply job environment manual approval
Apply complete state updated post-apply notification

Rule of thumb. Every prod apply needs two human approvals: PR review and environment-gate approval. Two humans, two pairs of eyes, two distinct moments. The cost is small; the safety is large.

Senior interview question on Terraform state surgery

A senior interviewer might ask: "A production Terraform apply renamed a critical Snowflake role from ANALYST to ANALYST_PROD by accident (someone refactored). The role's old privileges are gone; users are locked out. Walk me through how you fix this without re-creating the role (which would lose its grant history). What Terraform state commands do you use, in what order?"

Solution Using terraform state mv + terraform import + targeted apply

State surgery — recover from a refactor that renamed a critical role
====================================================================

Diagnosis
---------
- The PR renamed `snowflake_role.analyst` → `snowflake_role.analyst_prod`.
- Terraform saw the rename as DESTROY + CREATE.
- On apply, the old role was dropped (grants gone with it).
- Now: state has `snowflake_role.analyst_prod`; Snowflake has `ANALYST_PROD` (newly created).

Step 1 — DON'T just re-apply. Stop the bleeding.
- Restore yesterday's state file from S3 versioning to a sandbox path.
- Diff the old vs new state to confirm what was lost.

Step 2 — Re-create the role manually in Snowflake with its OLD grants.
- Run the grant statements via SnowSQL from a backup script.
- This restores user access immediately; downtime ends.

Step 3 — Import the (now-restored) role into the NEW state address.
   terraform import 'snowflake_role.analyst_prod' 'ANALYST_PROD'
- This binds the existing role to the new HCL address without recreating.

Step 4 — Run `terraform plan` and verify NO changes.
- If plan shows changes, your HCL drifted from the manual restore;
  fix HCL until plan is no-op.

Step 5 — Postmortem.
- Add `lifecycle { prevent_destroy = true }` to critical roles.
- Add a CI gate that blocks any DESTROY of a `snowflake_role` resource.

State commands used
-------------------
- terraform state list                  → see current addresses
- terraform state show <addr>           → inspect attributes
- terraform state mv <old> <new>        → rename without destroy
- terraform state rm <addr>             → remove from state (no destroy)
- terraform import <addr> <cloud-id>    → add to state (no create)
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Command Effect
1 restore old tfstate from S3 versioning sandbox copy for diff
2 run grant statements manually (SnowSQL) users unblocked
3 terraform import 'snowflake_role.analyst_prod' 'ANALYST_PROD' state now points at the restored role
4 terraform plan should be no-op
5 add prevent_destroy = true to HCL + CI gate future-proof

The key insight: the refactor renamed an HCL address, but Terraform interprets address-rename as destroy-then-create. The fix is the state mv command (had it been used preemptively), or import after the fact, never blind re-apply.

Output:

Recovery state Action
Users locked out restore grants manually
HCL has new name, state has new name, cloud has both (old destroyed) sandbox diff to confirm
State and HCL agree on new name import the manually re-created role
Plan = no-op recovery complete
Postmortem add prevent_destroy + CI gate

Why this works — concept by concept:

  • terraform state mv is the right tool for renames — had the PR author used terraform state mv 'snowflake_role.analyst' 'snowflake_role.analyst_prod' before the apply, no destroy would have happened. The state surgery commands exist for exactly this reason.
  • terraform import recovers the missing binding — after manually restoring the role, import re-creates the (address → cloud id) binding in state. Plan should then be no-op.
  • S3 versioning is the rollback anchor — the old state file is one S3 version-restore away. Without versioning enabled, recovery is much harder; with it, you have at least a baseline to compare against.
  • prevent_destroy = true for critical resources — Terraform refuses to destroy any resource with this lifecycle directive. Roles, prod databases, prod S3 buckets should all have it; pure data resources (table contents) cannot benefit.
  • Cost — state surgery is O(1) per resource (constant time per command). The cost lives in the time to diagnose. Practising state surgery on a sandbox env in advance pays for itself the first real-world incident.

ETL
Topic — etl
Module composition + state problems

Practice →

Optimization
Topic — optimization
CI/CD ritual optimization problems

Practice →


Cheat sheet — Terraform for data infra recipes

  • The 5-line Snowflake DB pattern. snowflake_databasesnowflake_schema (x N via for_each) → snowflake_warehousesnowflake_rolesnowflake_grant_privileges_to_role. The minimum production-grade Snowflake provisioning unit; wrap as a module after the third copy.
  • Databricks Unity Catalog three-level provisioning. databricks_catalogdatabricks_schema (x N) → databricks_table (external Delta or Iceberg) → databricks_grants with USE_CATALOG + USE_SCHEMA + SELECT. Always grant the parent USE first; table-level SELECT alone never works.
  • Glue + S3 + Iceberg minimum lakehouse. aws_s3_bucket + 4 companion resources (versioning, encryption, public-access-block, lifecycle) → aws_glue_catalog_databaseaws_glue_catalog_table with table_type = "ICEBERG". Production starts here; never less.
  • IAM role with Glue + S3 + Athena + Lake Formation. aws_iam_role (with assume-role policy) → aws_iam_role_policy (scoped to specific bucket + glue catalog) → aws_lakeformation_permissions (SELECT/DESCRIBE on table). Three resources for one principal's access.
  • Module input/output template. variables.tf with type and validation per input → main.tf with for_each for collections → outputs.tf exposing names + ARNs only. README with usage example; Git tag for versioning (?ref=v1.2.0).
  • Remote state backend with locking (S3 + DynamoDB). aws_s3_bucket + versioning + KMS encryption + public-access-block → aws_dynamodb_table with LockID hash_key + PITR enabled → terraform { backend "s3" { ... kms_key_id ... dynamodb_table } } in every consumer config.
  • OIDC GitHub Actions for AWS. permissions.id-token: write + aws-actions/configure-aws-credentials@v4 with role-to-assume → zero long-lived AWS keys. Two separate roles: one for plan (read-only), one for apply (read-write); the apply role used only on push: main.
  • Plan/apply ritual. terraform plan -out=plan.tfplan always saves the binary plan; terraform apply plan.tfplan consumes it. Reviewers read the plan; apply is atomic. Never apply without a saved plan in CI.
  • Secrets discipline. data "aws_secretsmanager_secret_version" "x" (or Vault data "vault_generic_secret") — never password = "..." inline. variable { sensitive = true } hides plan output but does not encrypt state; always pair with remote state + KMS encryption.
  • Drift detection. Nightly scheduled CI job: terraform init && terraform plan -detailed-exitcode. Exit code 2 means drift; pipe to Slack / GitHub issue. Triage each drift: revert, accept, or refactor.
  • State surgery commands. terraform state list (inspect) → terraform state mv <old> <new> (rename without destroy) → terraform state rm <addr> (forget without destroy) → terraform import <addr> <cloud-id> (adopt without create). Practice on a sandbox before you need them in prod.
  • OpenTofu drop-in. Swap terraform binary for tofu, swap hashicorp/setup-terraform for opentofu/setup-opentofu. State, HCL, providers identical. MPL licence; community-governed under Linux Foundation.
  • Module versioning via Git tag. source = "git::https://github.com/x/y.git//modules/z?ref=v1.2.0". Updates are opt-in by bumping the ref; never auto-pull main.
  • Lake Formation tag-based access. aws_lakeformation_lf_tag (define) → aws_lakeformation_resource_lf_tags (apply) → aws_lakeformation_permissions with lf_tag_policy block. One grant covers 100s of tables matching the tag expression.
  • prevent_destroy on critical resources. Add lifecycle { prevent_destroy = true } to any resource whose destroy would cause data loss or audit failure. Terraform refuses; the override is removing the directive (which itself is a PR change).

Frequently asked questions

Why use Terraform for data infrastructure instead of click-ops?

Because every change has to be auditable, reproducible, and reviewable — and click-ops gives you none of those. Terraform turns a Snowflake database, a Databricks Unity Catalog, an AWS Glue table, and the IAM grants between them into one reviewable HCL pull request; the plan shows exactly what changes, the apply is atomic, and the state file is the source of truth. With terraform for data engineering, drift detection runs nightly and flags any change that bypassed Terraform; with click-ops, you find out three months later in an audit. Click-ops also fails the "stand up a new env identically on a Tuesday" test — a senior interviewer's favourite probe. The 30-minute up-front cost of Terraforming a slice pays back the first time a regulator asks "show me when this grant was created."

Terraform vs OpenTofu in 2026 — which should I pick?

For new open-source-first projects, default to OpenTofu — it's the Linux Foundation fork of the last MPL-licensed Terraform, ships at near-feature-parity with HashiCorp Terraform, and dodges the BSL licence ambiguity. The state format, HCL syntax, and provider protocol are identical; tofu is a drop-in replacement for terraform. Stay on HashiCorp Terraform if you are an enterprise paying for HCP Terraform / Terraform Cloud features (Sentinel policy-as-code, run tasks, agent pools, drift detection-as-a-service) — that paid surface has no OpenTofu equivalent yet. For interview purposes, the senior signal is naming both, articulating the August 2023 re-licensing and the 2024 LF fork, and stating your default. PipeCode practice problems on terragrunt and opentofu adoption decisions ship in the platform-engineering library and are worth drilling before your next infra-on-call interview.

How do you manage Terraform state for data infra?

Always use a remote state backend with locking — for AWS that's S3 + KMS encryption + DynamoDB lock table; for GCP it's GCS with object-level locking; for Azure it's Blob with lease-based locking; for HashiCorp paid users it's HCP Terraform. The state file contains secrets in plain text (passwords, tokens written during creation), so encryption at rest is non-negotiable. Use one state file per environment (envs/prod/data-platform.tfstate) rather than per-stack within an env, because cross-stack references (Databricks reads a Glue table, Snowflake assumes an AWS role) become trivial when they're in the same state. Enable S3 versioning on the state bucket — the first time you need to recover from a botched apply, you'll be grateful. The terraform state mv, terraform state rm, and terraform import commands are your state-surgery toolkit; practice them on a sandbox before you need them in prod.

Can Terraform create Snowflake databases and Unity Catalog?

Yes — both are fully Terraform-native via community-maintained providers. terraform snowflake ships via the Snowflake-Labs/snowflake provider with resources for snowflake_database, snowflake_schema, snowflake_warehouse, snowflake_role, snowflake_grant_privileges_to_role, snowflake_external_table, snowflake_storage_integration, and the full Iceberg-on-Snowflake surface. terraform databricks ships via the databricks/databricks provider with databricks_workspace, databricks_cluster, databricks_job, databricks_catalog, databricks_schema, databricks_table, databricks_grant, and databricks_external_location for Unity Catalog. The senior pattern is one HCL module per logical unit (a "Snowflake database" module, a "Unity Catalog" module), versioned via Git tag, invoked from per-env root modules. PipeCode drills on the Snowflake RBAC tree and Unity Catalog three-level grant pattern are the fastest way to internalise the provider surface.

What is the blast radius of terraform destroy?

Catastrophic unless scoped. terraform destroy walks the entire state in reverse dependency order and deletes every managed resource. On a multi-stack prod state, that is your warehouse, your lakehouse, your IAM roles, your S3 buckets — gone. The senior-DE controls are: (1) lifecycle { prevent_destroy = true } on any resource whose loss would cause data loss or audit failure — Terraform refuses to destroy it; (2) state separation per envenvs/dev/, envs/staging/, envs/prod/ each have their own backend key, so terraform destroy in dev cannot touch prod; (3) CI gates that block any plan with DESTROY actions on production resource types (snowflake_database, aws_s3_bucket); (4) target-destroyterraform destroy -target=... limits the blast radius to a single resource or module. The interview answer to "what is the blast radius of terraform destroy?" is "the entire state — which is why we never run it unguarded; we use -target or prevent_destroy or per-env state isolation."

How do you handle secrets in Terraform for data infra?

Never inline. Every secret — a Snowflake admin password, a Databricks PAT, an AWS access key for a federated identity, an OAuth client secret — must live in a dedicated secret store (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Azure Key Vault) and be referenced from Terraform via a data source: data "aws_secretsmanager_secret_version" "x" then data.x.secret_string inside the provider or resource. Variables that carry secrets are marked sensitive = true to hide them from plan output. The state-file gotcha: Terraform copies the refreshed value into state, so even with secrets data-sourced from Vault, the state file ends up with the value in plain text. The fix is remote state with at-rest encryption — S3 + KMS, GCS with CMEK, Azure Blob with customer-managed keys. Audit the read access on the state bucket; anyone who can read it can read every secret your Terraform has ever touched. Senior reviewers will reject any PR with password = "..." in HCL on sight.

Practice on PipeCode

Lock in Terraform for data infra muscle memory

Provider docs explain the resource. PipeCode drills explain the decision — when a module pays back, when state surgery is safer than `terraform import`, when a Snowflake grant tree explodes. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.

Practice ETL & infra problems →
Practice optimization problems →

Top comments (0)