terraform for data engineering is the lingua franca of senior data platform work in 2026 — and the one tool every interview loop, on-call rotation, and audit conversation eventually circles back to, because warehouses, lakehouses, catalogs, and IAM are too important to provision by clicking around a console. Terraform turns a Snowflake database, a Databricks Unity Catalog, an AWS Glue table, a Lake Formation grant, and the IAM role that ties them all together into one reviewable, diffable, peer-reviewed pull request. The cost of click-ops is measured in the audit findings nobody can explain three months later; the value of iac data infrastructure is measured in the speed with which a new env can be stood up identically on a Tuesday afternoon.
This guide is the senior-DE playbook for using terraform to ship production data infrastructure: the HCL mental model and the resource graph, the provider matrix for terraform snowflake, terraform databricks, terraform aws glue, and BigQuery, the catalog and grant tree shape of Unity Catalog and Lake Formation, terraform iam policy discipline (and why you should never write password = "…" in HCL), terraform modules composition vs Terragrunt DRY composition, the terraform state backend story with S3 + DynamoDB locking, drift detection in CI, and the terragrunt vs raw Terraform vs opentofu decision senior platform engineers are making in 2026. Every section ships a teaching block plus a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.
When you want hands-on reps immediately after reading, drill the ETL practice library → for the infra-provisioning surface, rehearse on medium-difficulty ETL problems → to lock in pipeline-grade Terraform muscle, and stack the optimization library → for the cost/perf trade-offs that show up in every IaC review.
On this page
- Why Terraform is the lingua franca of data infra in 2026
- The Terraform mental model for data infra
- Provisioning a warehouse + lakehouse
- Catalogs, IAM & policy-as-code
- Modules, state & CI/CD ritual
- Cheat sheet — Terraform for data infra recipes
- Frequently asked questions
- Practice on PipeCode
1. Why Terraform is the lingua franca of data infra in 2026
Click-ops dies in audit — terraform is the only honest record of how your warehouse, lakehouse, and IAM actually got built
The one-sentence invariant: Terraform is the only tool that turns a Snowflake database, a Databricks Unity Catalog, an AWS Glue table, an IAM role, and the grants between them into a single peer-reviewable HCL graph with a deterministic plan, an atomic apply, and a state file that is the source of truth. Once you internalise "the state file is the contract; the cloud is the projection," every other terraform for data engineering interview probe — drift, modules, secrets, blast radius — collapses to a deduction from that one structural property.
Click-ops vs IaC — the audit and drift story interviewers always ask.
-
Click-ops is the world where someone opens the Snowflake UI, creates a database, grants a role, and Slacks the team. Three months later nobody can explain why the
analyticsdatabase has read-write onraw_piiand the audit gets escalated. There is no diff, no review, no rollback. -
IaC with Terraform is the world where the same change is a 6-line PR. The plan output shows exactly which resources change, the reviewer comments on the grant, and the apply log is the audit trail. The state file is the single source of truth; if the cloud diverges,
terraform planflags it as drift. -
The interview punchline. A senior engineer is expected to answer "how do you audit cloud data infra changes?" with "every change is a Terraform PR; we run
terraform planin CI on every merge; drift is alerted on a nightly plan." Anything less is a junior signal.
Why Terraform won over CloudFormation, Pulumi, and CDK for data infra.
- Provider breadth. Terraform ships native providers for Snowflake, Databricks, BigQuery, AWS, Azure, GCP, Confluent, MongoDB Atlas, Datadog, PagerDuty — the full data-platform stack. CloudFormation is AWS-only; Pulumi covers most of it but with a smaller community; CDK is AWS-centric and emits CloudFormation under the hood.
-
HCL-as-contract. HCL is intentionally restrictive — no loops in the imperative sense, no side effects, declarative blocks only. That restriction is the feature: a reviewer can read an HCL file and know exactly what infrastructure it produces. Pulumi's Python / TypeScript / Go lets you write
if random.random() < 0.5: ...in your infra code; HCL forbids it. -
The plan/apply ritual. Every change runs
terraform planfirst, producing a deterministic diff. The diff is reviewable; the apply is atomic. CloudFormation has change sets but the UX is heavier; CDK'scdk diffis plan-equivalent but tied to AWS. - Community and module ecosystem. The public Terraform Registry has 10,000+ modules; the Snowflake-Labs / databricks / hashicorp organisations alone ship hundreds of battle-tested patterns. Pulumi has fewer modules and a heavier dependency-graph problem; CDK Constructs are AWS-only.
The four "must-answer" axes for data infra Terraform.
-
Providers. Which cloud APIs Terraform talks to. For data infra:
snowflake,databricks,aws(Glue, S3, IAM, Lake Formation),google(BigQuery, GCS),azurerm(Synapse, ADLS), and an auth + secrets provider (Vault, AWS Secrets Manager). -
State. The
terraform.tfstatefile is the source of truth. Where it lives, how it is locked, how it is versioned, who can read it (secrets in plain text!) — these are the questions every senior interview probes. -
Modules. Reusable units of HCL with inputs and outputs. Module design is where junior vs senior Terraform separates: senior engineers write modules with explicit
variablevalidation,outputcontracts, and Git-tag versioning. - IAM. The grant tree across Snowflake roles, Databricks Unity Catalog, AWS IAM, and Lake Formation. The blast radius of an IAM change is the question that defines whether your Terraform deployment is safe or a Friday-afternoon outage waiting to happen.
Terraform vs OpenTofu in 2026 — the Linux Foundation fork landed in 2024.
- HashiCorp re-licensed Terraform under the Business Source License (BSL) in August 2023, restricting commercial competitors. The community responded by forking the last MPL-licensed Terraform into OpenTofu, donated to the Linux Foundation.
- OpenTofu has been at near-feature-parity with Terraform since 1.6 and ships its own roadmap (state encryption at-rest, early variable evaluation in import blocks, dynamic provider iteration).
- The 2026 pragmatic answer: if you are an enterprise with HCP Terraform / Terraform Cloud paid features (Sentinel, run tasks, agent pools, drift detection-as-a-service), stay on HashiCorp. If you are open-source-first or want to dodge the BSL clause, OpenTofu is the drop-in replacement. The HCL syntax, the state format, and the provider protocol are identical.
- For interview purposes, name both and explain the licence trade-off — that is the senior signal.
What interviewers actually probe.
- Do you describe the state file as "the contract; the cloud is the projection"? — senior signal.
- Do you instinctively reach for remote state with locking instead of a local file? — required answer.
- Do you mention module versioning via Git tags unprompted? — senior signal.
- Do you push back on "just use the AWS console for a one-off" with "then it's not in state and the next plan will try to delete it"? — senior signal.
- Do you know the difference between
terraform import,terraform state mv, andterraform state rm? — required for any infra-on-call interview.
Worked example — the same Snowflake database, click-ops vs Terraform
Detailed explanation. A junior engineer is asked to "create a Snowflake database called analytics with three schemas (raw, staging, marts) and grant analyst_role read access to marts." In click-ops, this is six clicks in the Snowflake UI. In Terraform, it is one HCL file. The diff between the two is where senior data engineers earn their salary.
Question. Express the click-ops sequence as Terraform HCL and explain why the HCL version is the only acceptable answer in a production-grade environment.
Input.
| Step | Click-ops action | Terraform resource |
|---|---|---|
| 1 | Create database analytics
|
snowflake_database.analytics |
| 2 | Create schema raw
|
snowflake_schema.raw |
| 3 | Create schema staging
|
snowflake_schema.staging |
| 4 | Create schema marts
|
snowflake_schema.marts |
| 5 | Create role analyst_role
|
snowflake_role.analyst |
| 6 | Grant USAGE + SELECT on marts to analyst_role | snowflake_grant_privileges_to_role.analyst_marts_read |
Code.
# providers.tf
terraform {
required_version = ">= 1.6"
required_providers {
snowflake = {
source = "Snowflake-Labs/snowflake"
version = "~> 0.95"
}
}
}
provider "snowflake" {
account = var.snowflake_account
username = var.snowflake_admin_username
# password sourced from a secret store, never inline
password = data.aws_secretsmanager_secret_version.snowflake_admin.secret_string
role = "SYSADMIN"
}
# main.tf
resource "snowflake_database" "analytics" {
name = "ANALYTICS"
comment = "Curated analytics database — managed by Terraform"
}
resource "snowflake_schema" "raw" {
database = snowflake_database.analytics.name
name = "RAW"
}
resource "snowflake_schema" "staging" {
database = snowflake_database.analytics.name
name = "STAGING"
}
resource "snowflake_schema" "marts" {
database = snowflake_database.analytics.name
name = "MARTS"
}
resource "snowflake_role" "analyst" {
name = "ANALYST_ROLE"
comment = "Read access to curated marts"
}
resource "snowflake_grant_privileges_to_role" "analyst_marts_read" {
role_name = snowflake_role.analyst.name
privileges = ["USAGE"]
on_schema {
schema_name = "${snowflake_database.analytics.name}.${snowflake_schema.marts.name}"
}
}
resource "snowflake_grant_privileges_to_role" "analyst_marts_select" {
role_name = snowflake_role.analyst.name
privileges = ["SELECT"]
on_schema_object {
all {
object_type_plural = "TABLES"
in_schema = "${snowflake_database.analytics.name}.${snowflake_schema.marts.name}"
}
}
}
Step-by-step explanation.
- The
terraformblock pins the required Terraform version and the Snowflake provider to~> 0.95(a known-good range). This is the lock against silent provider drift. - The
provider "snowflake"block authenticates against the account; the password is sourced from AWS Secrets Manager (never inline) —data.aws_secretsmanager_secret_version.snowflake_admin.secret_string. - Each resource block describes the desired state. Terraform builds a DAG from the resource references (
snowflake_schema.raw.database = snowflake_database.analytics.nameis a dependency edge). - On
terraform apply, the engine traverses the DAG, creating the database first, then the three schemas in parallel, then the role, then the grants. The state file records every resource id. - If anyone later edits a schema name in the Snowflake UI, the next
terraform planflags it as drift and proposes to revert. That is the audit trail click-ops cannot give you.
Output (the audit value).
| Concern | Click-ops | Terraform |
|---|---|---|
| Diff in PR review | none | full HCL diff |
| Reproducibility on a new env | manual sequence | terraform apply |
| Drift detection | none | nightly terraform plan
|
| Atomic rollback | manual undo |
git revert + apply |
| Audit trail | UI logs only | Git history + state versions |
Rule of thumb. Every change to cloud data infra that touches a name, a grant, a warehouse size, or a network policy belongs in HCL. The 5-minute "I'll just click it in the UI" change is the one that explodes in audit three months later.
Worked example — the plan/apply ritual on a Glue catalog change
Detailed explanation. Terraform's core loop is init → plan → apply. The plan is the deterministic diff: it reads state, refreshes the cloud, and computes what would change. Senior engineers read every plan output before approving it. Junior engineers click apply and pray. The ritual is the muscle memory that makes Terraform safe.
Question. Walk through the plan/apply ritual for adding a new aws_glue_catalog_table to a lakehouse, including what the plan output looks like and what a reviewer should look for.
Input.
| Stage | What runs | What is produced |
|---|---|---|
terraform init |
provider download + backend init |
.terraform/ cache + lock file |
terraform plan -out=plan.tfplan |
refresh + diff |
plan.tfplan (binary, signed) |
terraform show plan.tfplan |
human-readable diff | text output to review |
terraform apply plan.tfplan |
atomic apply | state file updated |
Code.
# lakehouse.tf — adding a new Iceberg-format Glue table
resource "aws_glue_catalog_table" "orders_iceberg" {
name = "orders"
database_name = aws_glue_catalog_database.lake.name
table_type = "EXTERNAL_TABLE"
parameters = {
"format-version" = "2"
"table_type" = "ICEBERG"
"metadata_location" = "s3://${aws_s3_bucket.lake.bucket}/orders/metadata/00000-init.metadata.json"
"write.format.default" = "parquet"
}
storage_descriptor {
location = "s3://${aws_s3_bucket.lake.bucket}/orders/"
input_format = "org.apache.iceberg.mr.hive.HiveIcebergInputFormat"
output_format = "org.apache.iceberg.mr.hive.HiveIcebergOutputFormat"
ser_de_info {
serialization_library = "org.apache.iceberg.mr.hive.HiveIcebergSerDe"
}
columns {
name = "order_id"
type = "bigint"
}
columns {
name = "customer_id"
type = "bigint"
}
columns {
name = "amount"
type = "decimal(18,2)"
}
columns {
name = "order_ts"
type = "timestamp"
}
}
}
$ terraform init
Initializing the backend...
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.40"...
- Installing hashicorp/aws v5.42.0...
$ terraform plan -out=plan.tfplan
Terraform will perform the following actions:
# aws_glue_catalog_table.orders_iceberg will be created
+ resource "aws_glue_catalog_table" "orders_iceberg" {
+ name = "orders"
+ database_name = "lake"
+ table_type = "EXTERNAL_TABLE"
+ parameters = {
+ "format-version" = "2"
+ "table_type" = "ICEBERG"
...
}
}
Plan: 1 to add, 0 to change, 0 to destroy.
$ terraform apply plan.tfplan
aws_glue_catalog_table.orders_iceberg: Creating...
aws_glue_catalog_table.orders_iceberg: Creation complete after 2s
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
Step-by-step explanation.
-
terraform initdownloads the AWS provider and initialises the backend (where the state file lives). The.terraform.lock.hclfile pins provider versions for everyone on the team. -
terraform plan -out=plan.tfplanproduces a signed binary plan. It contains the exact set of API calls Terraform will make if you apply. Saving it to a file lets CI archive it for audit. -
terraform show plan.tfplan(or letting CI render the diff) displays the human-readable diff. A reviewer looks for: unexpected destroys, unexpected updates-in-place that secretly recreate (the~/-lines), and any resource not mentioned in the PR description. -
terraform apply plan.tfplanconsumes the saved plan and executes it atomically. Because the plan was already computed, no surprise drift is introduced between plan and apply. - The state file (
terraform.tfstate) is updated; the new resource's id is recorded. The nextterraform plansees zero changes.
Output (plan diff legend reviewers must know).
| Symbol | Meaning | Reviewer action |
|---|---|---|
+ resource |
will be created | confirm name and config |
- resource |
will be destroyed | confirm explicitly in PR description |
~ resource |
update in place | check which attributes |
-/+ resource |
destroy + recreate | DANGER — data loss likely |
<= read data.X |
data source refresh | usually safe |
Rule of thumb. Never approve a plan you have not read line by line. The -/+ symbol on a stateful resource (database, table, bucket) is a Friday-afternoon outage waiting to happen — flag it in PR review and force an explicit migration plan.
Worked example — the OpenTofu fork pragmatic switch
Detailed explanation. A team running open-source Terraform 1.5 wants to upgrade. The 1.6 release is the first BSL-licensed version. The team is open-source-first and wants to avoid licence ambiguity. The pragmatic answer is OpenTofu — drop-in replacement, identical state format, identical provider protocol. The switch is a 10-minute exercise per repo.
Question. Walk through the migration from terraform 1.5 to opentofu 1.7 and show what changes (binary, CLI, CI configuration) and what does not (HCL, state, providers).
Input.
| Component | Terraform 1.5 | OpenTofu 1.7 |
|---|---|---|
| CLI binary | terraform |
tofu |
| State format | tfstate v4 | tfstate v4 (identical) |
| Provider protocol | v5 / v6 | v5 / v6 (identical) |
| HCL syntax | HCL2 | HCL2 (identical) |
| Module sources | Terraform Registry | OpenTofu Registry + Terraform Registry |
| Licence | BSL (since 1.6) | MPL 2.0 |
Code.
# 1. Install opentofu (replaces terraform binary in CI)
brew install opentofu
# Or, in a CI image:
# RUN curl -L https://github.com/opentofu/opentofu/releases/download/v1.7.0/tofu_1.7.0_linux_amd64.zip -o tofu.zip
# RUN unzip tofu.zip && mv tofu /usr/local/bin/
# 2. Verify state compatibility (read-only)
tofu version
# OpenTofu v1.7.0
tofu init # same backend init
tofu plan # same HCL, same plan output
tofu apply # same atomic apply
# .github/workflows/infra.yml — CI config swap
jobs:
plan:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: opentofu/setup-opentofu@v1 # was hashicorp/setup-terraform
with:
tofu_version: 1.7.0
- run: tofu init
- run: tofu plan -out=plan.tfplan
Step-by-step explanation.
- Install the
tofubinary alongside (or replacing)terraform. On macOS:brew install opentofu. The two binaries can coexist on the same machine. - Run
tofu initagainst an existing state file written by Terraform. The state format is identical — OpenTofu reads and writes the sameterraform.tfstatev4 schema. - Run
tofu planagainst the same HCL. Output is byte-identical for the same providers (the only difference is the header line:OpenTofu used the selected providers...). - Swap the CI setup action from
hashicorp/setup-terraformtoopentofu/setup-opentofu. The rest of the pipeline (init / plan / apply) is unchanged. - Module sources from the Terraform Registry continue to work; OpenTofu also ships its own Registry mirror at
registry.opentofu.orgfor licence-pure consumption.
Output (the migration cost ledger).
| Step | Time | Risk |
|---|---|---|
Install tofu in CI image |
5 min | low |
| Swap setup action | 2 min | low |
| Test plan on dev workspace | 10 min | low |
| Apply on dev | 5 min | low |
| Roll out to prod after a week of dev runs | gradual | low |
| Total per repo | ~25 min | low |
Rule of thumb. For new projects in 2026, default to OpenTofu unless you have a paid HashiCorp contract (HCP Terraform, Sentinel, run tasks). For existing repos, the migration is cheap; do it once your prod env is stable. The interview signal is naming both and articulating the licence trade-off.
Senior interview question on the case for Terraform in data infra
A senior interviewer often opens with: "You join a 30-person data team that provisions all their cloud infra via the AWS console, Snowflake UI, and Databricks workspace UI. The platform lead wants you to introduce Terraform. Walk me through the 90-day plan — what do you Terraform first, how do you handle existing resources, and what is the success criterion?"
Solution Using import-first, environments-second, modules-third progression
90-day Terraform rollout — 3 phases
Phase 1 (days 1–30) — Import existing prod, no new infra in HCL yet
-------------------------------------------------------------------
- Stand up remote state backend (S3 + DynamoDB lock).
- Pick the smallest stateful slice: the Snowflake roles + grant tree.
- Run `terraform import` for every existing resource.
- Validate `terraform plan` shows zero changes (state matches reality).
- Wire a read-only nightly CI plan that alerts on drift.
Phase 2 (days 31–60) — Add new envs / new resources via HCL
-----------------------------------------------------------
- Build a `staging` env from scratch using the imported prod HCL.
- Validate that staging is a faithful reproduction of prod.
- Every NEW prod resource (new schema, new role, new bucket)
must be added via Terraform PR. UI changes blocked.
Phase 3 (days 61–90) — Modules + governance
-------------------------------------------
- Extract repeated patterns (Snowflake database, Glue table,
IAM role-for-Glue) into modules with versioned Git tags.
- Add OPA / Conftest checks on the plan output.
- Roll plan/apply into the team's PR workflow with required approvals.
- Define the success metric: drift count over 7 days = 0.
Step-by-step trace.
| Day | Action | Coverage |
|---|---|---|
| 1 | Provision S3 + DynamoDB backend | state infra live |
| 5 | Import 1 Snowflake DB + 10 schemas | 100% of one slice |
| 10 | Import all roles + grants | core IAM in state |
| 20 | Nightly CI plan green for 7 days | drift baseline = 0 |
| 35 | Provision staging from HCL |
env parity proven |
| 50 | Block UI changes on prod (Snowflake account policy) | enforcement live |
| 70 | Publish first internal module (snowflake-database) |
DRY enabled |
| 85 | OPA policy: no 0.0.0.0/0 ingress on any new SG |
policy-as-code live |
| 90 | Drift rate over 7 days = 0; new infra via PR-only | success criterion met |
The 90-day plan front-loads risk: imports first (no new infra), then env parity (no prod-only HCL), then modules + policy. Each phase is independently shippable.
Output:
| Phase | Deliverable | Risk profile |
|---|---|---|
| 1 — Import |
terraform plan is no-op on prod |
low (read-only) |
| 2 — New envs |
staging mirrors prod |
medium (writes to a fresh env) |
| 3 — Modules + policy | DRY HCL + OPA gates | low (refactor + governance) |
Why this works — concept by concept:
-
Import first, never greenfield-rewrite — rewriting prod infra from scratch in HCL is a disaster waiting to happen.
terraform importbrings existing resources into state without recreating them. The plan should be a no-op after import; that proves the HCL matches reality. - Backend before code — never start with a local state file. Stand up S3 + DynamoDB locking first; commit zero HCL until the backend is alive. This avoids the "I lost my state file" disaster.
- Env parity proves the HCL — if staging stood up from the same HCL is byte-identical to prod, the HCL is provably faithful. Otherwise, prod has hidden state nobody captured.
-
Modules come last — premature abstraction kills Terraform projects. Wait until you have 3+ usages of the same pattern, then extract a module. Versioned via Git tag (
?ref=v1.2.0) so updates are opt-in. - Cost — the rollout is 30-50% of one senior engineer's time for 90 days. Pays back the first time an audit asks "show me when this grant was added"; you point at a Git commit and a PR review.
ETL
Topic — etl
Infrastructure-as-code rollout problems
2. The Terraform mental model for data infra
terraform builds a DAG from your HCL, then executes it through plan and apply — the state file is the contract
The mental model in one line: HCL describes resources, references between resources build a DAG, terraform plan diffs the DAG against the state, terraform apply walks the DAG in dependency order, and the state file is the persistent source of truth that survives between runs. Once you say "the DAG is the contract; plan diffs it; apply executes it; state remembers it," the entire iac data infrastructure interview surface becomes a deduction from those four stages.
The five core HCL primitives.
-
provider— the plugin that talks to a cloud API.provider "aws" {},provider "snowflake" {},provider "databricks" {}. Authenticated per workspace; pinned viarequired_providers. -
resource— a single cloud object.resource "aws_s3_bucket" "lake" {}is one S3 bucket. Each resource has a type (aws_s3_bucket) and a name (lake); the pairaws_s3_bucket.lakeis the resource address. -
data— a read-only lookup against a cloud API.data "aws_caller_identity" "current" {}reads the current AWS account id. Data sources are refreshed on every plan. -
variable— typed input.variable "env" { type = string }. Set viaterraform.tfvars,-varCLI flag, orTF_VAR_envenv var. -
output— exported value consumed by other modules or shown after apply.output "database_id" { value = snowflake_database.analytics.id }.
The provider ecosystem for data infra (2026).
-
Snowflake (
Snowflake-Labs/snowflake) —snowflake_database,snowflake_schema,snowflake_warehouse,snowflake_role,snowflake_grant_privileges_to_role,snowflake_external_table,snowflake_stage. The community provider is now the canonical one. -
Databricks (
databricks/databricks) —databricks_workspace,databricks_cluster,databricks_job,databricks_catalog,databricks_schema,databricks_table,databricks_grant,databricks_external_location. Unity Catalog is fully Terraform-native. -
AWS (
hashicorp/aws) —aws_glue_catalog_database,aws_glue_catalog_table,aws_glue_job,aws_s3_bucket,aws_iam_role,aws_iam_policy,aws_lakeformation_permissions,aws_athena_workgroup. -
Google (
hashicorp/google) —google_bigquery_dataset,google_bigquery_table,google_bigquery_routine,google_dataproc_cluster,google_storage_bucket. -
Azure (
hashicorp/azurerm) —azurerm_synapse_workspace,azurerm_storage_account,azurerm_storage_data_lake_gen2_filesystem,azurerm_databricks_workspace. -
Cross-cutting —
hashicorp/vaultfor secrets,confluentinc/confluentfor Kafka,mongodb/mongodbatlasfor Mongo,datadog/datadogfor observability resources.
The declarative graph — Terraform builds a DAG.
- Every resource reference (
x = resource.foo.bar.attr) is a dependency edge. Terraform parses every HCL file, builds a graph, then walks it in topological order. - Independent resources are created in parallel (default concurrency 10; tune via
-parallelism=N). - The DAG also handles destroys in reverse dependency order — a child resource is destroyed before its parent, so the parent destroy sees no live children.
- Cycles in the DAG (resource A depends on B which depends on A) are a build-time error. The most common cause is a circular module reference; the fix is usually an extra layer or a
depends_onindirection.
The state file as source of truth — not the cloud.
-
terraform.tfstateis a JSON document holding the mapping(resource address) → (cloud id) + (attributes). - On
plan, Terraform reads state, optionally refreshes (terraform plan -refresh-onlyor auto-refresh on plan), then diffs the desired HCL against the recorded attributes. - If the cloud changes outside Terraform (someone clicked in the UI), the next plan shows drift — proposed reverts to bring the cloud back to what state says.
- The state file contains secrets in plain text (passwords, tokens written during creation). Never commit it; always remote-state it with at-rest encryption.
Lock files (.terraform.lock.hcl) — provider version pinning.
- Generated by
terraform initand committed to Git. Records the exact provider versions and a checksum for each platform (linux_amd64, darwin_arm64, etc). - Forces every team member and every CI run to use the same provider version. Without it,
initcould pull a new minor and break determinism. - Update with
terraform init -upgradewhen you actually want to bump versions. Review the diff in PR — provider upgrades can change resource behaviour.
Workspaces vs directories — the layout debate.
-
Workspaces (
terraform workspace new staging) — multiple state files in a single backend, switched by env var or CLI. Light-weight but easy to footgun (terraform planon the wrong workspace destroys prod). -
Directories — one folder per env (
envs/prod/,envs/staging/,envs/dev/) each with its ownmain.tfand its own backend config. Heavier but explicit; no chance of wrong-workspace apply. - The senior consensus in 2026 — directories per env, modules shared across envs, never rely on workspaces for prod separation. Workspaces are fine for ephemeral feature branches.
Common interview probes on the mental model.
- "What is a Terraform DAG?" — the dependency graph built from resource references; walked in topological order on apply.
- "What is in
terraform.tfstate?" — JSON mapping of resource addresses to cloud ids and attributes; secrets in plain text; never commit. - "What does
terraform refreshdo, and why was it deprecated as a standalone command in 1.6?" — it queries the cloud for current state of every resource and updatestfstate. Deprecated as standalone because every modern workflow does it implicitly duringplan. - "Workspaces vs directories?" — directories for env separation, workspaces for ephemeral isolation. Never workspace prod.
Worked example — building the DAG from HCL
Detailed explanation. A simple S3 + Glue + IAM stack illustrates how Terraform infers the DAG. You write resources in any order; Terraform reads references and builds the graph. The order in main.tf is irrelevant; the graph determines apply order.
Question. Given an HCL file with four resources written in arbitrary order, show the DAG Terraform builds and the apply order it produces.
Input (the HCL, written in non-dependency order).
| Line | Resource | References |
|---|---|---|
| 1 | aws_glue_catalog_table.orders |
aws_glue_catalog_database.lake.name, aws_s3_bucket.raw.bucket
|
| 2 | aws_s3_bucket.raw |
— |
| 3 | aws_iam_role.glue |
— |
| 4 | aws_glue_catalog_database.lake |
— |
Code.
# Order in the file is irrelevant; the DAG is what matters
resource "aws_glue_catalog_table" "orders" {
database_name = aws_glue_catalog_database.lake.name
name = "orders"
storage_descriptor {
location = "s3://${aws_s3_bucket.raw.bucket}/orders/"
columns { name = "id" type = "bigint" }
columns { name = "amt" type = "decimal(18,2)" }
}
}
resource "aws_s3_bucket" "raw" {
bucket = "lakehouse-raw-${var.env}"
}
resource "aws_iam_role" "glue" {
name = "glue-${var.env}"
assume_role_policy = data.aws_iam_policy_document.glue_assume.json
}
resource "aws_glue_catalog_database" "lake" {
name = "lake_${var.env}"
}
Step-by-step explanation.
- Terraform parses every
.tffile in the directory and builds an internal graph. Each resource is a node; each reference (aws_glue_catalog_database.lake.nameinsideaws_glue_catalog_table.orders) is an edge fromlaketoorders. - The graph builder also adds implicit edges from
datasources to the resources that reference them. - Topological sort produces the apply order.
aws_s3_bucket.raw,aws_iam_role.glue, andaws_glue_catalog_database.lakehave no dependencies — they are all created in parallel (level 0). -
aws_glue_catalog_table.ordersdepends onlakeandraw. It is in level 1 — created after both level 0 resources finish. -
terraform graph | dot -Tpng > graph.pngvisualises this. Senior engineers use it during reviews of big modules.
Output (the resolved apply order).
| Level | Resources created in parallel |
|---|---|
| 0 |
aws_s3_bucket.raw, aws_iam_role.glue, aws_glue_catalog_database.lake
|
| 1 | aws_glue_catalog_table.orders |
Rule of thumb. Write HCL in whatever logical order helps readers. Trust the DAG — never use depends_on to force a sequence unless you have a hidden dependency (e.g. AWS eventual-consistency races) that Terraform cannot infer from references.
Worked example — drift detection on a Snowflake warehouse
Detailed explanation. A team has a Snowflake warehouse LOAD_WH provisioned via Terraform with warehouse_size = "SMALL". Someone with SYSADMIN clicks "Resize" in the Snowflake UI and bumps it to MEDIUM to fix a slow load. Three days later, the nightly terraform plan runs in CI and flags the drift.
Question. Show the drift, the plan output, and the three legitimate ways to resolve it (revert, accept, or refactor).
Input.
| Resource | HCL says | Cloud says |
|---|---|---|
snowflake_warehouse.load |
SMALL | MEDIUM |
Code (the HCL — unchanged).
resource "snowflake_warehouse" "load" {
name = "LOAD_WH"
warehouse_size = "SMALL" # HCL says SMALL
auto_suspend = 60
auto_resume = true
}
$ terraform plan
Refreshing state...
snowflake_warehouse.load: Refreshing state... [id=LOAD_WH]
Terraform detected the following changes made outside of Terraform since the
last "terraform apply" which may have affected this plan:
# snowflake_warehouse.load has been changed
~ resource "snowflake_warehouse" "load" {
id = "LOAD_WH"
~ warehouse_size = "SMALL" -> "MEDIUM"
# (4 unchanged attributes hidden)
}
Unless you have made equivalent changes to your configuration, or ignored the
relevant attributes using ignore_changes, the following plan may include
actions to undo or respond to these changes.
Terraform will perform the following actions:
# snowflake_warehouse.load will be updated in-place
~ resource "snowflake_warehouse" "load" {
~ warehouse_size = "MEDIUM" -> "SMALL"
}
Plan: 0 to add, 1 to change, 0 to destroy.
Step-by-step explanation.
-
terraform planrefreshes state by callingsnowflake_warehouse_showfor every resource. It sees the cloud valueMEDIUM, but the HCL still saysSMALL. - Terraform's response: propose to revert the cloud back to
SMALL. That is correct behaviour — HCL is the contract; the cloud is the projection. - Resolution option 1 (revert): apply the plan. The warehouse goes back to
SMALL. Whoever resized it must come back through a PR if the change was intentional. - Resolution option 2 (accept): update the HCL to
warehouse_size = "MEDIUM"and re-plan. Plan becomes no-op; the team now owns the change in Git. - Resolution option 3 (refactor): if the warehouse size needs to vary per env, lift it to a variable:
warehouse_size = var.warehouse_size, set per env interraform.tfvars.
Output (the three drift-resolution options).
| Option | When to choose |
|---|---|
| Revert (apply the plan) | The click-ops change was a mistake; revert and re-do via PR |
| Accept (update HCL) | The click-ops change was correct; codify it now |
| Refactor (variable-ise) | The attribute should vary per env; extract to a variable |
Rule of thumb. Every drift event is a teaching moment. Run drift detection nightly in CI; for each drift event, the on-call data engineer must pick revert, accept, or refactor — never "ignore". Drift left unhandled compounds.
Worked example — the lock file in action
Detailed explanation. A team has 10 engineers running Terraform locally and a CI pipeline running plans on every PR. Without a lock file, each terraform init could pull a slightly different provider version, leading to "works on my machine" plans. With .terraform.lock.hcl committed to Git, every init pulls the exact same version on every platform.
Question. Show the lock file format and walk through how a planned upgrade looks (the senior-DE-approved way).
Input.
| Step | Lock file action |
|---|---|
| 1 | Engineer runs terraform init first time → lock file generated |
| 2 | Commit .terraform.lock.hcl to Git |
| 3 | Other engineers / CI run init — pull the exact pinned versions |
| 4 | Months later: upgrade aws provider → terraform init -upgrade
|
| 5 | Review the lock-file diff in PR |
Code.
# .terraform.lock.hcl (generated by terraform init)
provider "registry.terraform.io/hashicorp/aws" {
version = "5.42.0"
constraints = "~> 5.40"
hashes = [
"h1:abc123...",
"zh:def456...",
"zh:789ghi...",
# one hash per platform
]
}
provider "registry.terraform.io/snowflake-labs/snowflake" {
version = "0.95.0"
constraints = "~> 0.95"
hashes = [
"h1:xyz890...",
]
}
# Upgrade the AWS provider intentionally
$ terraform init -upgrade
Upgrading modules...
- registry.terraform.io/hashicorp/aws v5.42.0 -> v5.50.0
# Review the diff in PR
$ git diff .terraform.lock.hcl
- version = "5.42.0"
+ version = "5.50.0"
Step-by-step explanation.
- The first time
terraform initruns against a config, it computes the latest provider version matching therequired_providersversion constraints and writes the lock file. The lock file has hashes for every supported platform (so a macOS engineer and a linux CI runner get matching versions). - The lock file is committed to Git. Subsequent inits read the lock file first and pull only those exact versions.
- To upgrade, run
terraform init -upgrade. The lock file is regenerated with the latest versions matching the constraints inrequired_providers. - The diff in
.terraform.lock.hclis reviewed in the PR. The reviewer also re-runsterraform planto catch any provider-behaviour changes. - Without the lock file, two engineers could see different plans for the same HCL — a debugging nightmare. With it, plans are reproducible across machines.
Output.
| Without lock file | With lock file |
|---|---|
| Plans differ by machine | Plans identical across machines |
| "Works on my machine" debugging | Deterministic CI |
| Surprise version drift on init | Explicit upgrade via -upgrade
|
| Provider behaviour changes silently | Lock-file diff visible in PR |
Rule of thumb. Commit .terraform.lock.hcl to Git. Always. Treat lock-file diffs as significant in PR review; they signal a provider upgrade that may have behaviour changes.
Senior interview question on the Terraform DAG and refresh
A senior interviewer might ask: "Explain what happens inside terraform plan from the moment you press enter to the moment you see the diff. What does Terraform read, what API calls does it make, and what is the role of the state file in each stage?"
Solution Using the parse → graph → refresh → diff pipeline
What `terraform plan` actually does
===================================
1) Parse phase
- Walk every *.tf file in the working dir + module sources.
- Validate HCL syntax, resolve variables, expand for_each / count.
- Result: an in-memory representation of the desired state.
2) Graph phase
- Build the DAG from resource/data references.
- Detect cycles → error out.
- Topologically sort.
3) Refresh phase (default; -refresh=false to skip)
- For each resource in state, call provider's Read API.
- Update in-memory state with current cloud attributes.
- Drift between state and cloud is captured here.
4) Diff phase
- For each resource in the desired graph:
- If absent from state → CREATE.
- If present and attributes match → no-op.
- If present and attributes differ → UPDATE or REPLACE
(depending on the resource's ForceNew flags).
- For each resource in state but not in HCL → DESTROY.
- Output: the human-readable plan.
5) (optional) -out=plan.tfplan
- Serialise the plan as a signed binary file.
- Apply must consume this file (no fresh plan computed).
Step-by-step trace.
| Stage | What runs | Output |
|---|---|---|
| Parse | HCL parser + variable resolver | in-memory desired state |
| Graph | DAG builder + cycle check | sorted resource order |
| Refresh | provider Read calls for every state resource | refreshed in-memory state |
| Diff | per-resource compare desired vs refreshed | plan delta |
| Output | textual diff (or binary plan if -out) |
reviewer reads this |
The refresh phase is the most expensive in network terms — one API call per resource. For a 500-resource state, that is 500 reads. Skip it with -refresh=false when you know nothing has drifted (e.g. immediate re-plan after an apply).
Output:
| Component | Reads | Writes |
|---|---|---|
| Parser |
.tf files |
in-memory AST |
| Graph builder | AST | DAG |
| Refresh | provider APIs | in-memory state |
| Diff | refreshed state + DAG | plan |
-out |
plan | signed binary plan.tfplan
|
Why this works — concept by concept:
- Parse is local, refresh is network, diff is compute — each phase has a different cost shape. Parse is cheap; refresh dominates wall-clock for large states; diff is fast once refresh is done.
- The DAG is the contract — graph order determines apply order. Cycles fail fast at graph-build time, not at apply time when half the resources are already created.
- Refresh is the drift-detection mechanism — without refresh, plan would lie. With refresh, plan reflects current cloud reality vs desired state.
-
Signed binary plan —
-out=plan.tfplanis the audit primitive. It captures exactly what apply will do; the apply cannot deviate from it. CI/CD pipelines should always plan-then-apply with the binary. - Cost — plan is O(resources) for refresh + O(graph edges) for diff. For most data infra repos (< 1000 resources), plan completes in 10-60s; for very large states (5000+ resources), splitting into smaller modules / states becomes the right answer.
ETL
Topic — etl
Resource-graph design problems
3. Provisioning a warehouse + lakehouse
terraform snowflake and terraform databricks give you a full warehouse + lakehouse in one HCL graph — same plan/apply ritual, two cloud-native stacks
The mental model in one line: a production warehouse + lakehouse is just a graph of provider-specific resources — snowflake_database and snowflake_warehouse on one side, aws_s3_bucket + aws_glue_catalog_database + aws_glue_catalog_table + aws_lakeformation_permissions on the other — wired by HCL into one plan, one apply, one state. Once you say "one HCL, two stacks, one apply ritual," the entire terraform aws glue and terraform databricks interview surface becomes a deduction from "which provider resource maps to which cloud object?"
The Snowflake provider — the warehouse stack.
-
snowflake_database— top-level container. Holds schemas, stages, and snowpipes. One per logical project (ANALYTICS,RAW_LANDING,MARTS). -
snowflake_schema— namespace inside a database. Holds tables, views, sequences, functions, procedures, file formats, masking policies. -
snowflake_warehouse— the compute.warehouse_size(XS / SMALL / MEDIUM / LARGE / XLARGE / 2XLARGE / 3XLARGE / 4XLARGE / 5XLARGE / 6XLARGE),auto_suspend(seconds before idle suspend),auto_resume(resume on query),min_cluster_count/max_cluster_countfor multi-cluster warehouses. -
snowflake_role— RBAC primitive. Roles are granted to users and to other roles (role hierarchy). Privileges are granted to roles, never to users directly. -
snowflake_grant_privileges_to_role— the grant object. Pivots on (role, privilege, target) where target is a schema, table, warehouse, or account-level grant.
The Databricks provider — the lakehouse compute stack.
-
databricks_workspace(provisioned via the AWS / Azure / GCP provider) — the Databricks deployment itself. Usually provisioned once per account, then everything else hangs off it. -
databricks_cluster— the compute cluster.spark_version,node_type_id,num_workers(orautoscale {}),data_security_mode = "USER_ISOLATION"for Unity Catalog. -
databricks_job— a scheduled or triggered job.taskblocks point at notebooks, JARs, dbt projects, or SQL files. Has its own job-level IAM viadatabricks_permissions. -
databricks_catalog/databricks_schema/databricks_table— the Unity Catalog three-level namespace.catalog.schema.tableinstead of legacyhive_metastore.schema.table. -
databricks_grant/databricks_grants— grant tree on catalogs, schemas, tables, external locations, storage credentials. The Unity Catalog answer to RBAC.
The BigQuery provider — Google's warehouse.
-
google_bigquery_dataset— top-level namespace (BigQuery's "database" equivalent). Includeslocation(multi-region or single region),default_table_expiration_ms,access {}blocks for dataset-level IAM. -
google_bigquery_table— table or view (setview {}block) or materialised view. Schema via JSON orschema {}blocks.time_partitioning {}for partitioned tables,clustering {}for cluster keys. -
google_bigquery_routine— UDFs (SQL or JavaScript) and stored procedures. Versioned in Git alongside the table HCL.
The AWS lakehouse stack — S3 + Glue + Lake Formation + Athena.
-
aws_s3_bucket— the storage layer. Production-grade buckets addaws_s3_bucket_versioning,aws_s3_bucket_server_side_encryption_configuration,aws_s3_bucket_lifecycle_configuration,aws_s3_bucket_public_access_block— six resources for one bucket is normal. -
aws_glue_catalog_database— the Hive-style database in the Glue Data Catalog. The catalog itself is per-account; databases are namespaces inside it. -
aws_glue_catalog_table— table metadata pointing at S3 location, with format params (table_type = "ICEBERG"for Iceberg tables,classification = "parquet"for Parquet/Spark tables). -
aws_lakeformation_permissions— the grant primitive. Pivots on (principal, permission, resource) where permission isSELECT,DESCRIBE,DATA_LOCATION_ACCESS,ALL, etc. -
aws_athena_workgroup— the Athena query layer. Workgroup defines result-bucket location, encryption, cost controls. Separate from compute (Athena is serverless).
The Snowflake + Databricks dual-stack pattern.
- Many production teams run both: Snowflake for SQL analytics, Databricks for ML/Spark/streaming. Terraform composes both stacks into one apply.
- The shared layer is the catalog: a single source of metadata that both engines can read. Patterns include: Iceberg tables in Glue (Databricks writes, Snowflake reads via external tables) or Unity Catalog with federated Snowflake catalogs.
- The grant tree must be designed deliberately — same human identity (via SSO / SCIM) maps to a Snowflake role and a Databricks group with overlapping access. Terraform makes the symmetry explicit.
Common interview probes on the provider surface.
- "How do you create a Snowflake database in Terraform?" —
snowflake_database+snowflake_schema+ the grant tree. - "How do you create a Databricks Unity Catalog table?" —
databricks_catalog→databricks_schema→databricks_table(or external table viadatabricks_external_location). - "What is the minimal HCL for a lakehouse?" —
aws_s3_bucket+aws_glue_catalog_database+aws_glue_catalog_table(Iceberg) + an IAM role for whoever reads/writes. - "How do you wire Snowflake to read Iceberg tables from S3?" —
snowflake_storage_integration+snowflake_external_volume+snowflake_iceberg_tablereferencing the S3 path; the Glue catalog can be the external catalog source.
Worked example — lakehouse-in-100-lines end-to-end
Detailed explanation. A complete lakehouse from scratch: an S3 bucket for raw and curated data, a Glue database, an Iceberg-format Glue table, an Athena workgroup, and an IAM role for whatever pipeline writes the data. Each resource is small; together they are a usable lakehouse.
Question. Write the complete HCL for an end-to-end lakehouse (S3 + Glue + Iceberg + Athena + IAM) at minimum production quality. Annotate the dependencies.
Input.
| Component | Resource | Depends on |
|---|---|---|
| Storage | aws_s3_bucket.lake |
— |
| Encryption | aws_s3_bucket_server_side_encryption_configuration.lake |
bucket |
| Versioning | aws_s3_bucket_versioning.lake |
bucket |
| Catalog DB | aws_glue_catalog_database.lake |
— |
| Iceberg table | aws_glue_catalog_table.orders |
bucket + catalog DB |
| Athena workgroup | aws_athena_workgroup.lake |
bucket |
| Pipeline IAM |
aws_iam_role.pipeline + policies |
bucket + catalog DB |
Code.
locals {
name = "lakehouse-${var.env}"
}
resource "aws_s3_bucket" "lake" {
bucket = local.name
}
resource "aws_s3_bucket_server_side_encryption_configuration" "lake" {
bucket = aws_s3_bucket.lake.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "AES256"
}
}
}
resource "aws_s3_bucket_versioning" "lake" {
bucket = aws_s3_bucket.lake.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_public_access_block" "lake" {
bucket = aws_s3_bucket.lake.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_glue_catalog_database" "lake" {
name = replace(local.name, "-", "_")
}
resource "aws_glue_catalog_table" "orders" {
name = "orders"
database_name = aws_glue_catalog_database.lake.name
table_type = "EXTERNAL_TABLE"
parameters = {
"table_type" = "ICEBERG"
"format-version" = "2"
"metadata_location" = "s3://${aws_s3_bucket.lake.bucket}/orders/metadata/00000-init.metadata.json"
}
storage_descriptor {
location = "s3://${aws_s3_bucket.lake.bucket}/orders/"
columns { name = "order_id" type = "bigint" }
columns { name = "customer_id" type = "bigint" }
columns { name = "amount" type = "decimal(18,2)" }
columns { name = "order_ts" type = "timestamp" }
}
}
resource "aws_athena_workgroup" "lake" {
name = local.name
configuration {
enforce_workgroup_configuration = true
publish_cloudwatch_metrics_enabled = true
result_configuration {
output_location = "s3://${aws_s3_bucket.lake.bucket}/athena-results/"
encryption_configuration {
encryption_option = "SSE_S3"
}
}
}
}
data "aws_iam_policy_document" "pipeline_assume" {
statement {
actions = ["sts:AssumeRole"]
principals {
type = "Service"
identifiers = ["glue.amazonaws.com"]
}
}
}
resource "aws_iam_role" "pipeline" {
name = "${local.name}-pipeline"
assume_role_policy = data.aws_iam_policy_document.pipeline_assume.json
}
data "aws_iam_policy_document" "pipeline_perms" {
statement {
actions = [
"s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket"
]
resources = [
aws_s3_bucket.lake.arn,
"${aws_s3_bucket.lake.arn}/*",
]
}
statement {
actions = [
"glue:GetDatabase", "glue:GetTable", "glue:GetPartitions",
"glue:CreateTable", "glue:UpdateTable", "glue:BatchCreatePartition",
]
resources = ["*"] # narrow further with catalog ARN in prod
}
}
resource "aws_iam_role_policy" "pipeline" {
role = aws_iam_role.pipeline.id
policy = data.aws_iam_policy_document.pipeline_perms.json
}
Step-by-step explanation.
-
locals.namegives every resource a consistent prefix (lakehouse-prod,lakehouse-staging). Thereplace(...)for the Glue DB name handles the dash → underscore rename (Glue forbids dashes). - The S3 bucket is provisioned with four companion resources: encryption, versioning, public-access block, and lifecycle (omitted here for brevity). This is the production minimum; never ship a bucket without these four.
-
aws_glue_catalog_database.lakeis created with no explicit catalog id — defaults to the account's primary catalog. References work by name (aws_glue_catalog_database.lake.name). -
aws_glue_catalog_table.ordersis the Iceberg-format table. The Iceberg-specific magic lives inparameters(table_type = "ICEBERG",metadata_locationpointing at the bootstrap metadata file). Schema is declared inline; in production, you would point at an existing metadata file managed by the writer (Spark / Glue ETL). -
aws_athena_workgroup.lakeconfigures the query layer. Theenforce_workgroup_configuration = trueflag prevents users from overriding the result location (the most common Athena security mistake). - The IAM role grants Glue catalog read/write + S3 bucket read/write. Production grants would scope
glue:*actions to the specific catalog ARN; the wildcard is a teaching simplification.
Output (the apply order Terraform derives).
| Level | Resources |
|---|---|
| 0 |
aws_s3_bucket.lake, aws_glue_catalog_database.lake, aws_iam_role.pipeline
|
| 1 | bucket-companion resources, aws_glue_catalog_table.orders, aws_athena_workgroup.lake, aws_iam_role_policy.pipeline
|
Rule of thumb. A "minimal" lakehouse is never just S3 + Glue. Production-grade always means S3 + 4 companion resources + Glue + table + Athena workgroup + IAM. Bake all 10 into a lakehouse module so the next env stands up in one variable change.
Worked example — Snowflake warehouse with auto-suspend per env
Detailed explanation. Same Snowflake warehouse, different settings per env. Dev should auto-suspend after 60 seconds; prod stays warm at 5 minutes for low-latency dashboards. Variable-ised HCL handles both from one module.
Question. Write a Snowflake warehouse resource parameterised by env, and show the per-env tfvars files.
Input.
| env | size | auto_suspend (s) | min_cluster | max_cluster |
|---|---|---|---|---|
| dev | XSMALL | 60 | 1 | 1 |
| staging | SMALL | 120 | 1 | 2 |
| prod | MEDIUM | 300 | 2 | 5 |
Code.
# variables.tf
variable "env" {
type = string
validation {
condition = contains(["dev", "staging", "prod"], var.env)
error_message = "env must be one of dev, staging, prod."
}
}
variable "warehouse_size" {
type = string
default = "XSMALL"
}
variable "auto_suspend_s" {
type = number
default = 60
}
variable "min_cluster_count" {
type = number
default = 1
}
variable "max_cluster_count" {
type = number
default = 1
}
# warehouse.tf
resource "snowflake_warehouse" "load" {
name = "LOAD_WH_${upper(var.env)}"
warehouse_size = var.warehouse_size
auto_suspend = var.auto_suspend_s
auto_resume = true
min_cluster_count = var.min_cluster_count
max_cluster_count = var.max_cluster_count
scaling_policy = "STANDARD"
initially_suspended = true
}
# envs/dev.tfvars
env = "dev"
warehouse_size = "XSMALL"
auto_suspend_s = 60
min_cluster_count = 1
max_cluster_count = 1
# envs/prod.tfvars
env = "prod"
warehouse_size = "MEDIUM"
auto_suspend_s = 300
min_cluster_count = 2
max_cluster_count = 5
$ terraform plan -var-file=envs/prod.tfvars
$ terraform apply -var-file=envs/prod.tfvars
Step-by-step explanation.
-
variable "env"has avalidationblock —contains(...)ensures only the three known envs are accepted. Wrong env at apply time → fast error, not a half-built warehouse. - Each tuning knob (
warehouse_size,auto_suspend_s,min_cluster_count,max_cluster_count) is its own variable with a sensible dev-grade default. Production overrides via tfvars. - The
LOAD_WH_${upper(var.env)}naming makes the env explicit in Snowflake itself. Operators see the env in the warehouse list. -
initially_suspended = trueis the senior-grade default — never auto-resume on first create; let an explicit query be the first thing that wakes the warehouse. - Per-env tfvars files in
envs/dev.tfvars,envs/staging.tfvars,envs/prod.tfvarsare committed to Git. CI picks the right file based on the branch / workflow.
Output (per-env apply result).
| env | warehouse name | size | suspend (s) | clusters |
|---|---|---|---|---|
| dev | LOAD_WH_DEV | XSMALL | 60 | 1 |
| staging | LOAD_WH_STAGING | SMALL | 120 | 1–2 |
| prod | LOAD_WH_PROD | MEDIUM | 300 | 2–5 |
Rule of thumb. Every per-env tuning knob is a variable, never a hard-coded value. Tfvars files in envs/ map cleanly to CI workflows; the alternative (workspaces) is harder to enforce safely.
Worked example — Databricks Unity Catalog three-level provisioning
Detailed explanation. A team migrates from legacy hive_metastore.schema.table to Unity Catalog's catalog.schema.table. Terraform provisions the catalog, the schemas, the tables, and the grants in one HCL graph. The big change vs legacy: explicit catalog as the top namespace, and grants at every level.
Question. Write the HCL for a Unity Catalog with two schemas (raw, marts) and a single sample table in marts. Show the grant tree giving an analytics_group SELECT on marts.
Input.
| Level | Resource | Name |
|---|---|---|
| 1 | databricks_catalog |
analytics |
| 2 | databricks_schema |
analytics.raw, analytics.marts
|
| 3 | databricks_table |
analytics.marts.orders (external Iceberg) |
| 4 | databricks_grants |
analytics_group SELECT on marts |
Code.
resource "databricks_catalog" "analytics" {
name = "analytics"
comment = "Curated analytics — managed by Terraform"
storage_root = "s3://${aws_s3_bucket.lake.bucket}/uc/analytics/"
}
resource "databricks_schema" "raw" {
catalog_name = databricks_catalog.analytics.name
name = "raw"
}
resource "databricks_schema" "marts" {
catalog_name = databricks_catalog.analytics.name
name = "marts"
}
resource "databricks_external_location" "orders" {
name = "orders"
url = "s3://${aws_s3_bucket.lake.bucket}/orders/"
credential_name = databricks_storage_credential.lake.name
}
resource "databricks_table" "orders" {
catalog_name = databricks_catalog.analytics.name
schema_name = databricks_schema.marts.name
name = "orders"
table_type = "EXTERNAL"
data_source_format = "DELTA"
storage_location = databricks_external_location.orders.url
column { name = "order_id" type = "BIGINT" }
column { name = "customer_id" type = "BIGINT" }
column { name = "amount" type = "DECIMAL(18,2)" }
column { name = "order_ts" type = "TIMESTAMP" }
}
resource "databricks_grants" "marts_read" {
schema = "${databricks_catalog.analytics.name}.${databricks_schema.marts.name}"
grant {
principal = "analytics_group"
privileges = ["USE_SCHEMA", "SELECT"]
}
}
Step-by-step explanation.
-
databricks_catalog.analyticsis the top-level namespace.storage_rootpins the default S3 prefix for managed tables in the catalog. External tables can live anywhere with their owndatabricks_external_location. - Two schemas (
rawandmarts) live inside the catalog. Each is a sub-namespace; tables and views go inside schemas. -
databricks_external_location.ordersis the Unity Catalog primitive that links an S3 path to a storage credential (databricks_storage_credential.lake). Without it, Databricks cannot read/write the path; with it, every read/write checks the location's grant tree. -
databricks_table.ordersis the external Delta table. Schema is declared inline; on apply, Databricks creates the metadata pointing at the S3 location. -
databricks_grants.marts_readgrantsUSE_SCHEMA+SELECTto theanalytics_group. Unity Catalog requires the parent-levelUSE_SCHEMAbefore any table-level grant counts — a common interview gotcha.
Output (apply order Terraform derives).
| Level | Resources |
|---|---|
| 0 |
databricks_catalog.analytics, databricks_storage_credential.lake
|
| 1 | schemas raw + marts, databricks_external_location.orders
|
| 2 |
databricks_table.orders, databricks_grants.marts_read
|
Rule of thumb. Always declare grants on the schema (USE_SCHEMA + SELECT/INSERT/...) — a table-level grant alone does not work without the parent USE on the schema. The same rule applies one level up at the catalog (USE_CATALOG).
Senior interview question on provisioning a dual-stack data platform
A senior interviewer might ask: "Your company runs Snowflake for SQL analytics and Databricks for ML/streaming. They share a Glue Iceberg catalog. Walk me through the Terraform layout — how do you organise the HCL across providers, what goes in which state file, and how do the two stacks reference each other?"
Solution Using one root module per env, shared state via remote-state data source
Repo layout for a dual-stack data platform
==========================================
infra/
├── modules/
│ ├── lakehouse/ # S3 + Glue + Iceberg + Athena + Lake Formation
│ ├── warehouse/ # Snowflake DB + schemas + warehouse + roles
│ ├── lakehouse-compute/# Databricks workspace + cluster + UC catalog
│ └── iam/ # Cross-stack IAM (the bridge)
└── envs/
├── dev/
│ ├── backend.tf # s3 backend, key = "dev/data-platform.tfstate"
│ ├── main.tf # module "lakehouse" "warehouse" "lakehouse-compute"
│ ├── variables.tf
│ └── terraform.tfvars
├── staging/ (same shape)
└── prod/ (same shape)
State files
-----------
- One state per env (NOT one per stack within an env).
- Reasoning: snowflake → glue grant references span stacks;
splitting state forces remote-state data lookups everywhere.
Cross-stack references
----------------------
Inside one env, modules reference each other directly:
module.warehouse.snowflake_role_id
module.lakehouse.glue_database_name
Across envs (rare):
data "terraform_remote_state" "prod_lakehouse" {
backend = "s3"
config = { bucket = "...", key = "prod/data-platform.tfstate" }
}
# then read data.terraform_remote_state.prod_lakehouse.outputs.bucket_arn
Step-by-step trace.
| Decision | Choice | Reason |
|---|---|---|
| State per env vs per stack | per env | tight cross-stack references |
| Module per stack | yes | reusable across envs |
| Backend per env | yes (s3 key includes env) | env isolation |
| Workspaces | no | directory-per-env is safer |
| Cross-env reads |
terraform_remote_state data source |
explicit + read-only |
| IAM bridge | dedicated iam module |
grants are first-class |
The one-state-per-env layout keeps cross-stack references trivial (direct module outputs) while preserving env isolation (separate backend keys). The alternative — one state per stack — multiplies remote-state lookups and creates ordering nightmares.
Output:
| Per-env state contents | Cross-stack reference pattern |
|---|---|
| Snowflake DB + warehouse + roles | module.warehouse.outputs.x |
| AWS S3 + Glue + Iceberg + Athena | module.lakehouse.outputs.x |
| Databricks workspace + UC + clusters | module.lakehouse-compute.outputs.x |
| Cross-stack IAM bridge | module.iam.outputs.x |
Why this works — concept by concept:
-
One state per env, not per stack — cross-stack grants (Databricks reading a Glue table) are common. Putting them in one state makes references direct (
module.x.y). Splitting state forcesterraform_remote_statelookups, which are slower and harder to reason about. -
Modules per stack — the
lakehouse,warehouse,lakehouse-compute, andiammodules are reusable across envs. Each env's root composes them with env-specific variables. -
Directory-per-env, never workspaces in prod —
envs/prod/,envs/staging/,envs/dev/make wrong-env applies impossible (you're physically in the wrong directory). Workspaces let you accidentally apply prod from a dev terminal. - The IAM module is the bridge — cross-cloud IAM (Glue role assumed by Snowflake's storage integration, Databricks instance profile reading Glue catalog) lives in its own module. Makes the bridge explicit.
-
Cost — one state per env means each
planrefreshes every resource in the env (slow for large states). Mitigation: target-plan during iteration (-target=module.warehouse), full plan in CI.
ETL
Topic — etl
Warehouse + lakehouse design problems
4. Catalogs, IAM & policy-as-code
terraform iam policy is the only audit trail you trust — Unity Catalog, Snowflake RBAC, and Lake Formation all collapse to the same grant pattern
The mental model in one line: every catalog (Unity Catalog, Snowflake's database/schema/table tree, AWS Glue catalog) is a hierarchical namespace where grants flow from parent to child, and Terraform expresses both the namespace and the grants as one HCL graph — the same principal, privilege, resource triple regardless of provider. Once you say "namespace + grant tree + Terraform-as-the-only-author," the entire terraform iam policy interview surface becomes a deduction from "who can do what on which resource, and where is that documented?"
Unity Catalog via terraform-databricks — the three-level grant tree.
-
Catalog — top level. Grant
USE_CATALOGto anyone who needs to enter the catalog at all. -
Schema — sub-namespace. Grant
USE_SCHEMAto anyone who needs to see tables in the schema. -
Table — leaf. Grant
SELECT,MODIFY,INSERT, etc. - The principal can be a user (
alice@company.com), a group (analytics_group), or a service principal (databricks-pat-svc). - Grants are additive — there is no DENY in Unity Catalog. Removing access means removing the grant via Terraform.
Snowflake RBAC — role hierarchy + privilege grants.
-
Roles are the primary primitive (
snowflake_role). Privileges are granted to roles, never directly to users. -
Role hierarchy — roles can be granted to other roles (
snowflake_grant_account_role). A user assumes a role; the role inherits all privileges of all roles granted to it (transitive). -
Account-level grants —
CREATE DATABASE,MONITOR USAGE,MANAGE GRANTS. Granted viasnowflake_grant_account_roleorsnowflake_grant_privileges_to_rolewithon_account = true. -
Database/schema/table-level grants —
USAGE,SELECT,INSERT,UPDATE, etc. Granted viasnowflake_grant_privileges_to_rolewithon_schema {}oron_schema_object {}blocks. -
Functional vs access roles — the senior-DE pattern: define access roles (
ROLE_ANALYTICS_RAW_READ) granting privileges, then functional roles (ROLE_ANALYST_PROD) granted access roles. Users are granted functional roles only. Two layers of indirection, infinite flexibility.
AWS Lake Formation — data-lake permissions as Terraform resources.
-
aws_lakeformation_resource— registers an S3 location as a Lake Formation resource. Required before grants can reference the location. -
aws_lakeformation_permissions— the grant primitive.principal = aws_iam_role.x.arn,permissions = ["SELECT", "DESCRIBE"],table {}ordata_location {}block names the target. -
The Lake Formation gotcha — by default, both IAM permissions and Lake Formation permissions are required. The "Lake Formation mode" (
use_lf_tag_based_access_control) flips this so Lake Formation alone gates access — but it must be explicitly enabled. -
Tag-based access (
aws_lakeformation_lf_tag+aws_lakeformation_lf_tag_policy) — assign tags to databases / tables, grant permissions on the tags. The 2026 senior-DE answer to scaling Lake Formation across hundreds of tables.
Cross-account share patterns.
-
Snowflake reader account — provision via
snowflake_share+snowflake_grant_privileges_to_share. A consumer account reads the share read-only, with no per-row copy of the data. - Iceberg REST share — register an external catalog via the Iceberg REST spec; the consumer's catalog client points at the producer's REST endpoint. Terraform can provision the credentials and the endpoint registration.
-
AWS RAM (Resource Access Manager) for Glue catalog —
aws_ram_resource_share+aws_ram_principal_association. Share a Glue catalog database across accounts; the consumer account's IAM grants apply on top.
Policy-as-code — combining Terraform with OPA / Conftest / Checkov.
-
OPA (Open Policy Agent) + Conftest — write policy rules in Rego that gate Terraform plans. Examples: "no S3 bucket without encryption", "no IAM policy with
Action: *onResource: *", "every Snowflake warehouse must haveauto_suspend ≤ 300". - Checkov — pre-built rule library scanning HCL for misconfigurations. Ships hundreds of rules; runs in CI; output in SARIF / JSON / human-readable formats.
- HashiCorp Sentinel — proprietary policy-as-code for HCP Terraform / Terraform Cloud. Same idea, different syntax, only inside the HashiCorp ecosystem.
- The senior pattern — Checkov for general scans, custom OPA/Rego for company-specific rules, Sentinel only if you are paying for HCP Terraform. All three gate the PR before plan is approved.
Secrets handling — never password = "…" in HCL.
-
AWS Secrets Manager —
data "aws_secretsmanager_secret_version" "x" { secret_id = "..." }, thensecret.x.secret_stringin the resource that needs it. -
HashiCorp Vault —
data "vault_generic_secret" "x" { path = "secret/..." }, thendata.vault_generic_secret.x.data["password"]. -
GCP Secret Manager —
data "google_secret_manager_secret_version" "x" { secret = "..." }, thendata.google_secret_manager_secret_version.x.secret_data. -
Sensitive variable marker —
variable "x" { sensitive = true }. Hides the value from plan output. Does not encrypt it in state — the state file still has it in plain text. Always use remote state with at-rest encryption.
Common interview probes on catalogs + IAM.
- "How do you grant SELECT on a Unity Catalog table?" — three grants: USE_CATALOG, USE_SCHEMA, SELECT. Parent before child.
- "How do you handle secrets in Terraform?" — data source against Secrets Manager / Vault; never inline; never commit a tfvars with secrets.
- "What is the difference between Snowflake account-level and database-level grants?" — account-level (e.g.
CREATE DATABASE) is granted viaon_account; database-level (USAGE) viaon_schema {}oron_account_object {}blocks. - "How do you scope an IAM policy to least privilege?" — never
Action: *orResource: *; explicit action lists; resource ARNs with conditions where possible.
Worked example — full Snowflake RBAC tree with access + functional roles
Detailed explanation. A team has three logical access patterns: read raw, read marts, write marts. Two functional roles: analyst_prod (read marts only), engineer_prod (read raw + write marts). The senior-DE pattern uses access roles as the building blocks and functional roles as the user-facing layer.
Question. Write the HCL for the three access roles, the two functional roles, the role-hierarchy grants, and the privilege grants. Show the apply order Terraform derives.
Input.
| Access role | Privileges |
|---|---|
ROLE_RAW_READ |
USAGE on db + USAGE on schema raw + SELECT on all tables in raw |
ROLE_MARTS_READ |
USAGE on db + USAGE on schema marts + SELECT on all tables in marts |
ROLE_MARTS_WRITE |
USAGE on db + USAGE on schema marts + INSERT/UPDATE/DELETE on all tables in marts |
ROLE_ANALYST_PROD |
granted ROLE_MARTS_READ |
ROLE_ENGINEER_PROD |
granted ROLE_RAW_READ + ROLE_MARTS_READ + ROLE_MARTS_WRITE |
Code.
# Access roles
resource "snowflake_role" "raw_read" { name = "ROLE_RAW_READ" }
resource "snowflake_role" "marts_read" { name = "ROLE_MARTS_READ" }
resource "snowflake_role" "marts_write" { name = "ROLE_MARTS_WRITE" }
# Functional roles
resource "snowflake_role" "analyst_prod" { name = "ROLE_ANALYST_PROD" }
resource "snowflake_role" "engineer_prod" { name = "ROLE_ENGINEER_PROD" }
# Role hierarchy — functional roles inherit access roles
resource "snowflake_grant_account_role" "analyst_marts_read" {
role_name = snowflake_role.marts_read.name
parent_role_name = snowflake_role.analyst_prod.name
}
resource "snowflake_grant_account_role" "engineer_raw_read" {
role_name = snowflake_role.raw_read.name
parent_role_name = snowflake_role.engineer_prod.name
}
resource "snowflake_grant_account_role" "engineer_marts_read" {
role_name = snowflake_role.marts_read.name
parent_role_name = snowflake_role.engineer_prod.name
}
resource "snowflake_grant_account_role" "engineer_marts_write" {
role_name = snowflake_role.marts_write.name
parent_role_name = snowflake_role.engineer_prod.name
}
# Privilege grants — db USAGE for all access roles
resource "snowflake_grant_privileges_to_role" "raw_read_db_usage" {
role_name = snowflake_role.raw_read.name
privileges = ["USAGE"]
on_account_object {
object_type = "DATABASE"
object_name = snowflake_database.analytics.name
}
}
# (… similar blocks for marts_read, marts_write …)
# Schema-level grants
resource "snowflake_grant_privileges_to_role" "raw_read_schema_usage" {
role_name = snowflake_role.raw_read.name
privileges = ["USAGE"]
on_schema {
schema_name = "${snowflake_database.analytics.name}.${snowflake_schema.raw.name}"
}
}
# Table-level grants — applies to all current + future tables in schema
resource "snowflake_grant_privileges_to_role" "raw_read_select" {
role_name = snowflake_role.raw_read.name
privileges = ["SELECT"]
on_schema_object {
future {
object_type_plural = "TABLES"
in_schema = "${snowflake_database.analytics.name}.${snowflake_schema.raw.name}"
}
}
}
resource "snowflake_grant_privileges_to_role" "marts_write_dml" {
role_name = snowflake_role.marts_write.name
privileges = ["INSERT", "UPDATE", "DELETE"]
on_schema_object {
future {
object_type_plural = "TABLES"
in_schema = "${snowflake_database.analytics.name}.${snowflake_schema.marts.name}"
}
}
}
Step-by-step explanation.
- Five role resources: three access roles (the privilege bundles) and two functional roles (the user-facing layer). Functional roles inherit access roles via
snowflake_grant_account_role. - The role hierarchy is the indirection. Adding
ROLE_ANALYST_STAGEonly requires creating one new functional role and granting the relevant access roles — no privilege grants to write. - Privilege grants use three levels:
on_account_object { object_type = "DATABASE" }for db-level USAGE,on_schema {}for schema-level USAGE,on_schema_object { future {} }for "all current and future tables in this schema". - The
futuregrant is the senior-DE pattern — it future-proofs the grant tree. New tables get the grant automatically; no Terraform re-run needed when a new dbt model lands. - Users are granted only functional roles (
SCIM_ANALYSTgroup →ROLE_ANALYST_PROD). Access roles are never user-facing; they are internal building blocks.
Output (the resolved grant tree).
| Functional role | Inherits | Effective privileges |
|---|---|---|
ROLE_ANALYST_PROD |
ROLE_MARTS_READ |
USAGE on db + USAGE schema marts + SELECT marts.* |
ROLE_ENGINEER_PROD |
ROLE_RAW_READ + ROLE_MARTS_READ + ROLE_MARTS_WRITE
|
USAGE db + USAGE raw + USAGE marts + SELECT raw.* + INSERT/UPDATE/DELETE marts.* |
Rule of thumb. Always separate access roles from functional roles. Users get functional only; functional inherits access; access has the privileges. Adding a new env, a new dataset, or a new user is then a one-resource change.
Worked example — Lake Formation tag-based access at scale
Detailed explanation. A team has 200 tables in Glue. Granting permissions per table per role is unmanageable. Lake Formation tags let you tag tables (sensitivity = pii, domain = sales) and grant on the tag pattern. One grant covers every table that matches.
Question. Write HCL that creates two LF tags (sensitivity, domain), applies them to tables, and grants analytics_role SELECT on every table where domain = sales AND sensitivity = non_pii.
Input.
| Resource | Value |
|---|---|
Tag sensitivity
|
values: pii, non_pii
|
Tag domain
|
values: sales, marketing, finance
|
| Grant |
analytics_role SELECT on (domain=sales AND sensitivity=non_pii) |
Code.
resource "aws_lakeformation_lf_tag" "sensitivity" {
key = "sensitivity"
values = ["pii", "non_pii"]
}
resource "aws_lakeformation_lf_tag" "domain" {
key = "domain"
values = ["sales", "marketing", "finance"]
}
# Tag a specific table
resource "aws_lakeformation_resource_lf_tags" "orders_tags" {
table {
database_name = aws_glue_catalog_database.lake.name
name = aws_glue_catalog_table.orders.name
}
lf_tag {
key = aws_lakeformation_lf_tag.sensitivity.key
value = "non_pii"
}
lf_tag {
key = aws_lakeformation_lf_tag.domain.key
value = "sales"
}
}
# Tag-based grant
resource "aws_lakeformation_permissions" "analytics_sales_non_pii" {
principal = aws_iam_role.analytics.arn
permissions = ["SELECT", "DESCRIBE"]
lf_tag_policy {
resource_type = "TABLE"
expression {
key = "domain"
values = ["sales"]
}
expression {
key = "sensitivity"
values = ["non_pii"]
}
}
}
Step-by-step explanation.
-
aws_lakeformation_lf_tagdefines the tag and its allowed values. Snowflake-style enums; the tag is account-scoped. -
aws_lakeformation_resource_lf_tagsattaches tag values to a specific table. In production, the tagging is often done by the table-owner module — each table resource gets a companion tagging resource. -
aws_lakeformation_permissionsgrants permissions to a principal via a tag policy. Thelf_tag_policyblock'sexpressionblocks are ANDed — the grant applies only where every expression matches. - The runtime effect:
analytics_rolecan SELECT on any current or future table withdomain = sales AND sensitivity = non_pii. Add a new table with those tags → grant applies automatically. - The scaling win: a team with 200 tables can replace 200 per-table grants with 5-10 tag-based grants. Adding a table is one tagging resource; the grant fans out.
Output.
| Tag combination | Tables matching | Role with SELECT |
|---|---|---|
| domain=sales AND sensitivity=non_pii | orders, customers, line_items | analytics_role |
| domain=sales AND sensitivity=pii | sales_pii | (no grant — fenced off) |
| domain=marketing AND sensitivity=non_pii | (other grants) | (other roles) |
Rule of thumb. Above 20-30 tables, switch to LF tag-based access. The setup cost is small (define tags once, tag tables once); the maintenance win is massive (new tables auto-inherit grants).
Worked example — secrets handling via AWS Secrets Manager
Detailed explanation. A Snowflake provider needs a password. Hard-coding it in HCL is the most common Terraform security mistake. The senior-DE pattern: store the password in AWS Secrets Manager, read it via a data source, reference it in the provider block. The state file still contains it (Terraform limitation), so remote state with encryption is mandatory.
Question. Show the full pattern: secret creation (one-time, manual), data-source read, provider use, and the state-file-encryption gotcha.
Input.
| Step | What runs |
|---|---|
| 1 (one-time) | Create the secret manually or via separate Terraform run |
| 2 (HCL) |
data "aws_secretsmanager_secret_version" reads it |
| 3 (HCL) | Provider block consumes the value |
| 4 (backend) | S3 backend with KMS encryption |
Code.
# secret.tf — one-time, can live in a separate state if desired
resource "aws_secretsmanager_secret" "snowflake_admin" {
name = "snowflake/admin"
description = "Snowflake admin password — rotated quarterly"
# NEVER set the value in HCL; rotate via aws cli or Vault
}
# main.tf
data "aws_secretsmanager_secret_version" "snowflake_admin" {
secret_id = "snowflake/admin"
}
provider "snowflake" {
account = var.snowflake_account
username = var.snowflake_admin_username
password = data.aws_secretsmanager_secret_version.snowflake_admin.secret_string
role = "SYSADMIN"
}
# variables.tf
variable "snowflake_admin_username" {
type = string
sensitive = true
}
# backend.tf — REMOTE state with at-rest encryption
terraform {
backend "s3" {
bucket = "company-tfstate"
key = "data-platform/prod.tfstate"
region = "us-east-1"
encrypt = true
kms_key_id = "arn:aws:kms:us-east-1:123456789012:key/abc-def-..."
dynamodb_table = "tf-locks"
}
}
Step-by-step explanation.
- The secret is created once (outside Terraform or in a bootstrapping state). The HCL never sets the secret value — only the resource shell with name + description.
- The data source
aws_secretsmanager_secret_version.snowflake_adminreads the current value at plan/apply time. Rotation is then orthogonal to Terraform — rotate the secret value, run apply, the new value flows through. - The provider block references
data...secret_string. Terraform marks this value sensitive (won't print in plan output). -
The state-file gotcha — Terraform copies the secret value into the state file when the data source is refreshed. Even though it is sensitive, it's there in plain text JSON. Always use remote state with backend-level encryption (
encrypt = true+kms_key_id). - The S3 backend with KMS encryption ensures the state file is encrypted at rest. The DynamoDB lock table prevents concurrent applies (the lock itself is harmless without the encrypted bucket).
Output (the secret-handling layer cake).
| Layer | Mechanism | Where the secret lives |
|---|---|---|
| Source of truth | Secrets Manager | encrypted in AWS |
| HCL | data "aws_secretsmanager_secret_version" |
reference only |
| Plan output | (sensitive value) |
hidden |
| State file | refreshed value in JSON | plain text → encrypt with KMS |
| Backend | S3 + KMS + DynamoDB | encrypted bucket only |
Rule of thumb. Never password = "…" in HCL. Always Secrets Manager / Vault / Secret Manager. Always remote state with at-rest encryption. Audit the state-file access list — anyone who can read the bucket can read every secret your Terraform touched.
Senior interview question on the cross-stack IAM grant design
A senior interviewer might ask: "Walk me through how you would design the IAM bridge between Snowflake and AWS for a Snowflake storage integration that lets Snowflake read Iceberg tables from S3 via the Glue catalog. What resources do you provision, in what order, and where can the design break?"
Solution Using a storage-integration + role + bucket-policy three-part bridge
Snowflake ↔ AWS bridge for external Iceberg
Step 1 — Snowflake side: create the storage integration
snowflake_storage_integration
type = EXTERNAL_STAGE
storage_provider = S3
storage_aws_role_arn = "arn:aws:iam::123:role/snowflake-bridge"
storage_allowed_locations = ["s3://lake/orders/"]
Step 2 — Snowflake gives back AWS-side IAM details
data.snowflake_storage_integration.x.outputs:
storage_aws_external_id = "ABC123..."
storage_aws_iam_user_arn = "arn:aws:iam::SF-ACCOUNT:user/..."
Step 3 — AWS side: create the role with TRUST to Snowflake's IAM user
aws_iam_role.snowflake_bridge
assume_role_policy = {
Principal = "arn:aws:iam::SF-ACCOUNT:user/..."
Condition = { sts:ExternalId = "ABC123..." }
}
Step 4 — AWS side: attach a policy granting S3 + Glue read on the bucket
aws_iam_role_policy.snowflake_bridge_perms
Action: [s3:GetObject, s3:ListBucket, glue:GetTable, glue:GetPartitions]
Resource: ["arn:aws:s3:::lake/*", "arn:aws:glue:...:catalog", ...]
Step 5 — Snowflake side: create the external volume + iceberg table
snowflake_external_volume {
storage_locations = [...]
}
snowflake_iceberg_table {
external_volume = snowflake_external_volume.x.name
catalog = "glue"
}
Step-by-step trace.
| Step | Resource | Direction |
|---|---|---|
| 1 | snowflake_storage_integration |
Snowflake side — declares the intent |
| 2 | (read outputs) | Snowflake → AWS |
| 3 | aws_iam_role.snowflake_bridge |
AWS side — trusts Snowflake IAM user |
| 4 | aws_iam_role_policy.snowflake_bridge_perms |
AWS side — S3 + Glue read |
| 5 | snowflake_iceberg_table |
Snowflake side — points at Glue catalog |
The order matters: Snowflake-side integration first (produces the external id), then AWS-side role (trusts the external id), then Snowflake-side table (uses the integration). The DAG handles it via references.
Output:
| Resource | Purpose | Where state lives |
|---|---|---|
snowflake_storage_integration |
declares the AWS role Snowflake will assume | Snowflake-side state |
aws_iam_role.snowflake_bridge |
the role with trust policy from Snowflake | AWS-side state |
aws_iam_role_policy.snowflake_bridge_perms |
S3 + Glue grants | AWS-side state |
snowflake_external_volume |
wraps the integration | Snowflake-side |
snowflake_iceberg_table |
the actual table | Snowflake-side |
Why this works — concept by concept:
- External id as the binding — the external id Snowflake generates is checked by AWS during AssumeRole. Without it, anyone with the role ARN could impersonate Snowflake. With it, only Snowflake's IAM user (verified externally) can assume.
- Trust policy on the AWS side — the bridge is one-way: AWS trusts Snowflake, not vice versa. Snowflake's storage integration says "here's the role I will use"; AWS says "yes, you may use it under these conditions".
-
Least-privilege on the policy — the role's policy is scoped to specific S3 prefixes (
s3://lake/orders/*) and the Glue catalog + database, neverAction: *orResource: *. Senior reviewers will reject any wildcard here. - Iceberg catalog as the metadata source — Snowflake reads the Iceberg metadata files directly from S3, but uses the Glue catalog to find them. The bridge is two-protocol: S3 data access + Glue catalog access.
-
Cost — five resources for the bridge, plus the table definition. The pattern is reusable as a
snowflake-aws-iceberg-bridgemodule. State-management complexity rises slightly (cross-stack references), but the alternative — manual UI configuration — is unauditable.
ETL
Topic — etl
Cross-cloud IAM design problems
5. Modules, state & CI/CD ritual
terraform modules + remote state + plan/apply CI is the production triangle — every senior data platform team converges on it
The mental model in one line: modules abstract the repeated patterns, remote state survives the team turnover, and the plan/apply CI ritual is the only safe path to production — together they form the triangle that turns Terraform from "scripts run on a laptop" into "platform engineering with audit trails". Once you say "modules for DRY, remote state for durability, CI for safety," the entire terraform modules + terraform state backend + terragrunt interview surface becomes a deduction from those three pillars.
Module design — input variables, outputs, versioned.
-
Input variables with types and validation — every input is typed (
string,number,bool,list(...),object({...})) and validated where it matters (validation { condition = contains(["dev", "staging", "prod"], var.env) }). -
Outputs as contract — every cross-module reference is an explicit
output. Modules without outputs are useless to compose; modules with too many outputs leak internal detail. -
Versioned via Git tags —
source = "git::https://github.com/company/tf-modules.git//snowflake-database?ref=v1.2.0". The?ref=v1.2.0pins the module version; updates are opt-in by bumping the ref. -
Or via Terraform Registry — public modules live at
registry.terraform.io; private modules can live at HCP Terraform's private registry. Same?ref=semantics. - README + examples — every module ships a README with usage examples. The reviewer reads the README; the consumer copies the example.
Root module composition vs Terragrunt DRY composition.
-
Root module composition — each env directory has a
main.tfthat calls modules:module "lakehouse" { source = "..." env = "prod" }. Plain Terraform; no extra tools. -
Terragrunt — a thin wrapper that adds DRY (don't repeat yourself) via
includeblocks: every env inherits a commonterragrunt.hclconfiguration. Reduces backend.tf and provider.tf duplication across envs. -
The senior consensus in 2026 — for ≤ 5 envs, plain Terraform is simpler. For 10+ envs (multi-region, multi-cloud), Terragrunt's DRY wins. For OpenTofu users, native
tofu1.7+ addsearly variable evaluationthat eliminates 80% of the Terragrunt use case.
Remote state backends — S3 + DynamoDB lock, GCS, Azure Blob, HCP Terraform.
-
S3 + DynamoDB — the AWS-native choice. S3 holds the state file with versioning + KMS encryption; DynamoDB holds the lock (one row per state file).
terraform { backend "s3" { bucket = "..." key = "env/x.tfstate" dynamodb_table = "tf-locks" } }. - GCS — Google Cloud Storage with object-level locking. Single resource type covers both state and lock.
- Azure Blob — same idea, blob storage + lease-based locking.
- HCP Terraform / Terraform Cloud — managed remote state with versioning, locking, encryption, and a web UI. Adds run-as-a-service (plan/apply happens in HashiCorp's cloud).
- Local state — useful only for tutorials and one-off labs. Never for production; never for any state shared by more than one person.
Drift detection — terraform plan in CI, terraform refresh deprecated in 1.6+.
-
Nightly drift detection — a scheduled CI job runs
terraform planon every state. Any non-no-op plan is reported (Slack alert, GitHub issue, PagerDuty). -
terraform refreshdeprecation — the standalone command was deprecated in 1.6; the modern equivalent isterraform apply -refresh-only, which refreshes state without proposing changes. Useful when you need to import-aware updates. - The drift-resolution discipline — every drift event must be triaged: revert (apply the plan), accept (update HCL), or refactor (variable-ise). Ignoring drift compounds; week-old drift becomes month-old drift becomes "we don't know what's in our cloud".
Plan/apply CI pipeline — PR → plan → review → apply.
-
PR creates plan — pull-request CI runs
terraform init+terraform plan -out=plan.tfplan. Plan output is posted to the PR as a comment. - Review — humans read the plan. Required approvers depend on the env (prod = senior+; dev = anyone).
-
Apply on merge — merge to main triggers
terraform apply plan.tfplan(the saved plan). The apply is atomic; either all resources change or none. - Tools — Atlantis (open-source, runs in your cluster, posts to GitHub), HCP Terraform (paid, runs in HashiCorp's cloud), env0 / Spacelift (paid SaaS competitors), or homegrown GitHub Actions / GitLab CI.
The blast-radius question — per-env state, per-service state, per-team state.
-
Per-env state — minimum split.
dev.tfstate,staging.tfstate,prod.tfstate. The default. -
Per-service state within env — split state by data-product or by stack (
prod/warehouse.tfstate,prod/lakehouse.tfstate). Useful when one stack's plan time is dominating CI. -
Per-team state — each team owns their state. Tighter blast radius; cross-team references via
terraform_remote_statedata sources. - The trade-off — finer splits = smaller blast radius but more cross-stack references. The senior answer: split when one state file's plan exceeds 5-10 minutes or has > 500 resources.
Common interview probes on modules + state + CI.
- "How do you version a Terraform module?" — Git tag +
?ref=v1.2.0in source URL, or publish to a Registry with semver. - "How do you handle a
terraform destroyblast radius?" — per-env state + targeted-destroy + protection on prod resources (prevent_destroy = true). - "What is the difference between
terraform apply -refresh-onlyandterraform refresh?" — the latter is deprecated; the former is the modern equivalent that updates state without proposing changes. - "What does Atlantis do?" — runs Terraform plan/apply on PR comments in a controlled environment, posts results back to the PR.
Worked example — a reusable snowflake-database module
Detailed explanation. A team repeats the same pattern: create a Snowflake database with N schemas, a load warehouse, a read role, a write role, and the grant tree. Repeating this 10 times across envs / projects is wasteful. A module encapsulates it; envs invoke it with a few variables.
Question. Write the module shell — variables, resources, outputs — and show an env-level invocation.
Input.
| Module input | Type | Example |
|---|---|---|
| name | string | "ANALYTICS" |
| schemas | list(string) | ["raw", "staging", "marts"] |
| warehouse_size | string | "SMALL" |
| auto_suspend_s | number | 60 |
Code.
# modules/snowflake-database/variables.tf
variable "name" {
type = string
validation {
condition = upper(var.name) == var.name
error_message = "name must be UPPERCASE."
}
}
variable "schemas" {
type = list(string)
default = ["raw", "staging", "marts"]
}
variable "warehouse_size" {
type = string
default = "XSMALL"
}
variable "auto_suspend_s" {
type = number
default = 60
}
# modules/snowflake-database/main.tf
resource "snowflake_database" "this" {
name = var.name
}
resource "snowflake_schema" "schemas" {
for_each = toset(var.schemas)
database = snowflake_database.this.name
name = upper(each.value)
}
resource "snowflake_warehouse" "load" {
name = "LOAD_${var.name}"
warehouse_size = var.warehouse_size
auto_suspend = var.auto_suspend_s
auto_resume = true
}
resource "snowflake_role" "read" {
name = "ROLE_${var.name}_READ"
}
resource "snowflake_role" "write" {
name = "ROLE_${var.name}_WRITE"
}
resource "snowflake_grant_privileges_to_role" "read_db_usage" {
role_name = snowflake_role.read.name
privileges = ["USAGE"]
on_account_object {
object_type = "DATABASE"
object_name = snowflake_database.this.name
}
}
# modules/snowflake-database/outputs.tf
output "database_name" {
value = snowflake_database.this.name
}
output "warehouse_name" {
value = snowflake_warehouse.load.name
}
output "read_role_name" {
value = snowflake_role.read.name
}
output "write_role_name" {
value = snowflake_role.write.name
}
# envs/prod/main.tf — invoking the module
module "analytics_db" {
source = "git::https://github.com/company/tf-modules.git//snowflake-database?ref=v1.2.0"
name = "ANALYTICS"
schemas = ["raw", "staging", "marts", "audit"]
warehouse_size = "MEDIUM"
auto_suspend_s = 300
}
# Cross-module reference
resource "snowflake_grant_account_role" "analyst_inherits_read" {
role_name = module.analytics_db.read_role_name
parent_role_name = snowflake_role.analyst_prod.name
}
Step-by-step explanation.
- The module's
variables.tfdefines its public interface — every input is typed and validated where useful.validation { condition = upper(var.name) == var.name }enforces uppercase naming convention. - The module's
main.tfusesfor_eachto expand schemas dynamically. With["raw", "staging", "marts"], three schema resources are created; with[..., "audit"], four. No code change needed. - The module's
outputs.tfexposes the four critical names back to callers. The consumer treats these as the module's contract. - The env-level
main.tfinvokes the module viasource = "git::...". The?ref=v1.2.0pins the version; bumping tov1.3.0is an explicit opt-in. - Cross-module references use the output names:
module.analytics_db.read_role_name. Terraform's DAG picks up the dependency automatically.
Output (the module's contract).
| Module call | Resources created | Outputs exposed |
|---|---|---|
module "analytics_db" with 4 schemas |
1 db + 4 schemas + 1 warehouse + 2 roles + 1 grant | 4 names |
Rule of thumb. Extract a module after the third copy-paste. Earlier is premature abstraction; later is technical debt. The third repetition is when patterns become real.
Worked example — S3 + DynamoDB remote state with KMS encryption
Detailed explanation. The minimum production-grade remote state setup: an S3 bucket with versioning + KMS encryption + a DynamoDB table for locking. Pre-provisioned via a "bootstrap" Terraform run that uses local state (chicken-and-egg), then every other state file lives in this backend.
Question. Write the bootstrap HCL (the state-backend resources) and the consumer-side backend config that points at it.
Input.
| Resource | Purpose |
|---|---|
aws_s3_bucket.tfstate |
state storage |
aws_s3_bucket_versioning.tfstate |
undo history |
aws_s3_bucket_server_side_encryption_configuration.tfstate |
at-rest encryption |
aws_kms_key.tfstate |
encryption key |
aws_dynamodb_table.tflocks |
apply locking |
Code.
# bootstrap/main.tf — local-state Terraform that creates the remote backend
resource "aws_kms_key" "tfstate" {
description = "KMS key for Terraform state encryption"
deletion_window_in_days = 30
enable_key_rotation = true
}
resource "aws_kms_alias" "tfstate" {
name = "alias/tfstate"
target_key_id = aws_kms_key.tfstate.key_id
}
resource "aws_s3_bucket" "tfstate" {
bucket = "company-tfstate"
}
resource "aws_s3_bucket_versioning" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
versioning_configuration { status = "Enabled" }
}
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.tfstate.arn
}
}
}
resource "aws_s3_bucket_public_access_block" "tfstate" {
bucket = aws_s3_bucket.tfstate.id
block_public_acls = true
block_public_policy = true
ignore_public_acls = true
restrict_public_buckets = true
}
resource "aws_dynamodb_table" "tflocks" {
name = "tf-locks"
billing_mode = "PAY_PER_REQUEST"
hash_key = "LockID"
attribute {
name = "LockID"
type = "S"
}
point_in_time_recovery { enabled = true }
}
# envs/prod/backend.tf — consumer-side backend config
terraform {
required_version = ">= 1.6"
backend "s3" {
bucket = "company-tfstate"
key = "envs/prod/data-platform.tfstate"
region = "us-east-1"
encrypt = true
kms_key_id = "alias/tfstate"
dynamodb_table = "tf-locks"
}
}
Step-by-step explanation.
- The bootstrap HCL creates the backend resources using local state (no backend block). This is the only acceptable use of local state in production — provisioning the backend itself. Once done, the bootstrap state can be
terraform import-ed into the new remote backend for self-management. - The KMS key + alias is created first. Rotation is enabled (
enable_key_rotation = true) — non-negotiable for any encryption key. - The S3 bucket has four companions: versioning (undo state history), encryption (KMS), public-access-block (the four "block public" toggles must all be true), and lifecycle (omitted, but production-grade keeps state versions ≥ 90 days).
- The DynamoDB table holds the lock rows. Each running apply writes a row with the state-file key; on completion, deletes the row. Concurrent applies block on the lock.
PAY_PER_REQUESTkeeps cost negligible for the typical low-volume access. - The consumer-side
backend.tfpoints at the bucket + key + lock table.kms_key_id = "alias/tfstate"ensures state writes use the KMS key for encryption.
Output.
| Backend component | Cost / month (typical) | Purpose |
|---|---|---|
| S3 bucket (KMS) | $1-5 (state versioning included) | state storage |
| KMS key | $1 | encryption |
| DynamoDB lock table | < $1 (low traffic) | concurrent-apply locking |
| Total per env | ~$5-10/month | full remote backend |
Rule of thumb. Never start a Terraform repo without remote state. The 30-minute bootstrap cost is repaid the first time someone runs terraform apply on the wrong workspace or pushes a state file to Git by accident.
Worked example — GitHub Actions plan/apply pipeline
Detailed explanation. The minimum production-grade CI pipeline for Terraform: PR opens → init + plan → plan posted as comment → reviewer approves → merge triggers apply with the saved plan. OIDC authentication to AWS (no long-lived secrets).
Question. Write the GitHub Actions workflow with two jobs (plan on PR, apply on merge) using OIDC for AWS auth and posting plan output to the PR.
Input.
| Trigger | Job | Action |
|---|---|---|
| PR opened/synchronised | plan |
init + plan + comment |
| Push to main | apply |
init + apply with saved plan |
Code.
# .github/workflows/terraform.yml
name: Terraform
on:
pull_request:
paths: ["envs/prod/**", "modules/**"]
push:
branches: [main]
paths: ["envs/prod/**", "modules/**"]
permissions:
id-token: write # OIDC to AWS
contents: read
pull-requests: write
jobs:
plan:
if: github.event_name == 'pull_request'
runs-on: ubuntu-latest
defaults:
run:
working-directory: envs/prod
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-tf-plan
aws-region: us-east-1
- uses: opentofu/setup-opentofu@v1
with:
tofu_version: 1.7.0
- run: tofu init -input=false
- run: tofu plan -out=plan.tfplan -no-color
id: plan
continue-on-error: false
- name: Post plan to PR
uses: actions/github-script@v7
with:
script: |
const output = `### Terraform plan (envs/prod)\n\n\`\`\`\n${process.env.PLAN}\n\`\`\``;
github.rest.issues.createComment({
issue_number: context.issue.number,
owner: context.repo.owner,
repo: context.repo.repo,
body: output
});
env:
PLAN: ${{ steps.plan.outputs.stdout }}
apply:
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
environment: production # requires manual approval
defaults:
run:
working-directory: envs/prod
steps:
- uses: actions/checkout@v4
- uses: aws-actions/configure-aws-credentials@v4
with:
role-to-assume: arn:aws:iam::123456789012:role/github-tf-apply
aws-region: us-east-1
- uses: opentofu/setup-opentofu@v1
with:
tofu_version: 1.7.0
- run: tofu init -input=false
- run: tofu plan -out=plan.tfplan -no-color
- run: tofu apply -auto-approve plan.tfplan
Step-by-step explanation.
- The workflow triggers on PRs (for plan) and pushes to main (for apply).
paths:filter limits runs to changes that actually touch infra files. -
permissions.id-token: writeenables OIDC. The GitHub-issued token is exchanged for an AWS role viaaws-actions/configure-aws-credentials@v4. No long-lived AWS keys in secrets. - The
planjob runs on PR; output is posted back to the PR as a comment so reviewers see the diff inline. - The
applyjob runs on push to main, after the PR's plan was reviewed and merged.environment: productiontriggers GitHub's manual-approval step — a designated reviewer must click approve before apply starts. - The apply re-runs
tofu plan -out=plan.tfplanand then applies the saved plan. The re-plan is a safety net: if anything drifted between merge and apply, the diff is visible (the apply will fail if the diff is unexpected; you can configure stricter checks).
Output (the gate sequence per PR).
| Gate | What runs | Required approval |
|---|---|---|
| PR opened | plan job + comment | reviewer reads diff |
| PR approved + merged | (CI does not block here) | branch protection settings |
| Apply on push to main | apply job | environment manual approval |
| Apply complete | state updated | post-apply notification |
Rule of thumb. Every prod apply needs two human approvals: PR review and environment-gate approval. Two humans, two pairs of eyes, two distinct moments. The cost is small; the safety is large.
Senior interview question on Terraform state surgery
A senior interviewer might ask: "A production Terraform apply renamed a critical Snowflake role from ANALYST to ANALYST_PROD by accident (someone refactored). The role's old privileges are gone; users are locked out. Walk me through how you fix this without re-creating the role (which would lose its grant history). What Terraform state commands do you use, in what order?"
Solution Using terraform state mv + terraform import + targeted apply
State surgery — recover from a refactor that renamed a critical role
====================================================================
Diagnosis
---------
- The PR renamed `snowflake_role.analyst` → `snowflake_role.analyst_prod`.
- Terraform saw the rename as DESTROY + CREATE.
- On apply, the old role was dropped (grants gone with it).
- Now: state has `snowflake_role.analyst_prod`; Snowflake has `ANALYST_PROD` (newly created).
Step 1 — DON'T just re-apply. Stop the bleeding.
- Restore yesterday's state file from S3 versioning to a sandbox path.
- Diff the old vs new state to confirm what was lost.
Step 2 — Re-create the role manually in Snowflake with its OLD grants.
- Run the grant statements via SnowSQL from a backup script.
- This restores user access immediately; downtime ends.
Step 3 — Import the (now-restored) role into the NEW state address.
terraform import 'snowflake_role.analyst_prod' 'ANALYST_PROD'
- This binds the existing role to the new HCL address without recreating.
Step 4 — Run `terraform plan` and verify NO changes.
- If plan shows changes, your HCL drifted from the manual restore;
fix HCL until plan is no-op.
Step 5 — Postmortem.
- Add `lifecycle { prevent_destroy = true }` to critical roles.
- Add a CI gate that blocks any DESTROY of a `snowflake_role` resource.
State commands used
-------------------
- terraform state list → see current addresses
- terraform state show <addr> → inspect attributes
- terraform state mv <old> <new> → rename without destroy
- terraform state rm <addr> → remove from state (no destroy)
- terraform import <addr> <cloud-id> → add to state (no create)
Step-by-step trace.
| Step | Command | Effect |
|---|---|---|
| 1 | restore old tfstate from S3 versioning | sandbox copy for diff |
| 2 | run grant statements manually (SnowSQL) | users unblocked |
| 3 | terraform import 'snowflake_role.analyst_prod' 'ANALYST_PROD' |
state now points at the restored role |
| 4 | terraform plan |
should be no-op |
| 5 | add prevent_destroy = true to HCL + CI gate |
future-proof |
The key insight: the refactor renamed an HCL address, but Terraform interprets address-rename as destroy-then-create. The fix is the state mv command (had it been used preemptively), or import after the fact, never blind re-apply.
Output:
| Recovery state | Action |
|---|---|
| Users locked out | restore grants manually |
| HCL has new name, state has new name, cloud has both (old destroyed) | sandbox diff to confirm |
| State and HCL agree on new name | import the manually re-created role |
| Plan = no-op | recovery complete |
| Postmortem | add prevent_destroy + CI gate |
Why this works — concept by concept:
-
terraform state mvis the right tool for renames — had the PR author usedterraform state mv 'snowflake_role.analyst' 'snowflake_role.analyst_prod'before the apply, no destroy would have happened. The state surgery commands exist for exactly this reason. -
terraform importrecovers the missing binding — after manually restoring the role, import re-creates the (address → cloud id) binding in state. Plan should then be no-op. - S3 versioning is the rollback anchor — the old state file is one S3 version-restore away. Without versioning enabled, recovery is much harder; with it, you have at least a baseline to compare against.
-
prevent_destroy = truefor critical resources — Terraform refuses to destroy any resource with this lifecycle directive. Roles, prod databases, prod S3 buckets should all have it; pure data resources (table contents) cannot benefit. - Cost — state surgery is O(1) per resource (constant time per command). The cost lives in the time to diagnose. Practising state surgery on a sandbox env in advance pays for itself the first real-world incident.
ETL
Topic — etl
Module composition + state problems
Optimization
Topic — optimization
CI/CD ritual optimization problems
Cheat sheet — Terraform for data infra recipes
-
The 5-line Snowflake DB pattern.
snowflake_database→snowflake_schema(x N viafor_each) →snowflake_warehouse→snowflake_role→snowflake_grant_privileges_to_role. The minimum production-grade Snowflake provisioning unit; wrap as a module after the third copy. -
Databricks Unity Catalog three-level provisioning.
databricks_catalog→databricks_schema(x N) →databricks_table(external Delta or Iceberg) →databricks_grantswithUSE_CATALOG + USE_SCHEMA + SELECT. Always grant the parent USE first; table-level SELECT alone never works. -
Glue + S3 + Iceberg minimum lakehouse.
aws_s3_bucket+ 4 companion resources (versioning, encryption, public-access-block, lifecycle) →aws_glue_catalog_database→aws_glue_catalog_tablewithtable_type = "ICEBERG". Production starts here; never less. -
IAM role with Glue + S3 + Athena + Lake Formation.
aws_iam_role(with assume-role policy) →aws_iam_role_policy(scoped to specific bucket + glue catalog) →aws_lakeformation_permissions(SELECT/DESCRIBE on table). Three resources for one principal's access. -
Module input/output template.
variables.tfwithtypeandvalidationper input →main.tfwithfor_eachfor collections →outputs.tfexposing names + ARNs only. README with usage example; Git tag for versioning (?ref=v1.2.0). -
Remote state backend with locking (S3 + DynamoDB).
aws_s3_bucket+ versioning + KMS encryption + public-access-block →aws_dynamodb_tablewithLockIDhash_key + PITR enabled →terraform { backend "s3" { ... kms_key_id ... dynamodb_table } }in every consumer config. -
OIDC GitHub Actions for AWS.
permissions.id-token: write+aws-actions/configure-aws-credentials@v4withrole-to-assume→ zero long-lived AWS keys. Two separate roles: one for plan (read-only), one for apply (read-write); the apply role used only onpush: main. -
Plan/apply ritual.
terraform plan -out=plan.tfplanalways saves the binary plan;terraform apply plan.tfplanconsumes it. Reviewers read the plan; apply is atomic. Never apply without a saved plan in CI. -
Secrets discipline.
data "aws_secretsmanager_secret_version" "x"(or Vaultdata "vault_generic_secret") — neverpassword = "..."inline.variable { sensitive = true }hides plan output but does not encrypt state; always pair with remote state + KMS encryption. -
Drift detection. Nightly scheduled CI job:
terraform init && terraform plan -detailed-exitcode. Exit code 2 means drift; pipe to Slack / GitHub issue. Triage each drift: revert, accept, or refactor. -
State surgery commands.
terraform state list(inspect) →terraform state mv <old> <new>(rename without destroy) →terraform state rm <addr>(forget without destroy) →terraform import <addr> <cloud-id>(adopt without create). Practice on a sandbox before you need them in prod. -
OpenTofu drop-in. Swap
terraformbinary fortofu, swaphashicorp/setup-terraformforopentofu/setup-opentofu. State, HCL, providers identical. MPL licence; community-governed under Linux Foundation. -
Module versioning via Git tag.
source = "git::https://github.com/x/y.git//modules/z?ref=v1.2.0". Updates are opt-in by bumping the ref; never auto-pullmain. -
Lake Formation tag-based access.
aws_lakeformation_lf_tag(define) →aws_lakeformation_resource_lf_tags(apply) →aws_lakeformation_permissionswithlf_tag_policyblock. One grant covers 100s of tables matching the tag expression. -
prevent_destroyon critical resources. Addlifecycle { prevent_destroy = true }to any resource whose destroy would cause data loss or audit failure. Terraform refuses; the override is removing the directive (which itself is a PR change).
Frequently asked questions
Why use Terraform for data infrastructure instead of click-ops?
Because every change has to be auditable, reproducible, and reviewable — and click-ops gives you none of those. Terraform turns a Snowflake database, a Databricks Unity Catalog, an AWS Glue table, and the IAM grants between them into one reviewable HCL pull request; the plan shows exactly what changes, the apply is atomic, and the state file is the source of truth. With terraform for data engineering, drift detection runs nightly and flags any change that bypassed Terraform; with click-ops, you find out three months later in an audit. Click-ops also fails the "stand up a new env identically on a Tuesday" test — a senior interviewer's favourite probe. The 30-minute up-front cost of Terraforming a slice pays back the first time a regulator asks "show me when this grant was created."
Terraform vs OpenTofu in 2026 — which should I pick?
For new open-source-first projects, default to OpenTofu — it's the Linux Foundation fork of the last MPL-licensed Terraform, ships at near-feature-parity with HashiCorp Terraform, and dodges the BSL licence ambiguity. The state format, HCL syntax, and provider protocol are identical; tofu is a drop-in replacement for terraform. Stay on HashiCorp Terraform if you are an enterprise paying for HCP Terraform / Terraform Cloud features (Sentinel policy-as-code, run tasks, agent pools, drift detection-as-a-service) — that paid surface has no OpenTofu equivalent yet. For interview purposes, the senior signal is naming both, articulating the August 2023 re-licensing and the 2024 LF fork, and stating your default. PipeCode practice problems on terragrunt and opentofu adoption decisions ship in the platform-engineering library and are worth drilling before your next infra-on-call interview.
How do you manage Terraform state for data infra?
Always use a remote state backend with locking — for AWS that's S3 + KMS encryption + DynamoDB lock table; for GCP it's GCS with object-level locking; for Azure it's Blob with lease-based locking; for HashiCorp paid users it's HCP Terraform. The state file contains secrets in plain text (passwords, tokens written during creation), so encryption at rest is non-negotiable. Use one state file per environment (envs/prod/data-platform.tfstate) rather than per-stack within an env, because cross-stack references (Databricks reads a Glue table, Snowflake assumes an AWS role) become trivial when they're in the same state. Enable S3 versioning on the state bucket — the first time you need to recover from a botched apply, you'll be grateful. The terraform state mv, terraform state rm, and terraform import commands are your state-surgery toolkit; practice them on a sandbox before you need them in prod.
Can Terraform create Snowflake databases and Unity Catalog?
Yes — both are fully Terraform-native via community-maintained providers. terraform snowflake ships via the Snowflake-Labs/snowflake provider with resources for snowflake_database, snowflake_schema, snowflake_warehouse, snowflake_role, snowflake_grant_privileges_to_role, snowflake_external_table, snowflake_storage_integration, and the full Iceberg-on-Snowflake surface. terraform databricks ships via the databricks/databricks provider with databricks_workspace, databricks_cluster, databricks_job, databricks_catalog, databricks_schema, databricks_table, databricks_grant, and databricks_external_location for Unity Catalog. The senior pattern is one HCL module per logical unit (a "Snowflake database" module, a "Unity Catalog" module), versioned via Git tag, invoked from per-env root modules. PipeCode drills on the Snowflake RBAC tree and Unity Catalog three-level grant pattern are the fastest way to internalise the provider surface.
What is the blast radius of terraform destroy?
Catastrophic unless scoped. terraform destroy walks the entire state in reverse dependency order and deletes every managed resource. On a multi-stack prod state, that is your warehouse, your lakehouse, your IAM roles, your S3 buckets — gone. The senior-DE controls are: (1) lifecycle { prevent_destroy = true } on any resource whose loss would cause data loss or audit failure — Terraform refuses to destroy it; (2) state separation per env — envs/dev/, envs/staging/, envs/prod/ each have their own backend key, so terraform destroy in dev cannot touch prod; (3) CI gates that block any plan with DESTROY actions on production resource types (snowflake_database, aws_s3_bucket); (4) target-destroy — terraform destroy -target=... limits the blast radius to a single resource or module. The interview answer to "what is the blast radius of terraform destroy?" is "the entire state — which is why we never run it unguarded; we use -target or prevent_destroy or per-env state isolation."
How do you handle secrets in Terraform for data infra?
Never inline. Every secret — a Snowflake admin password, a Databricks PAT, an AWS access key for a federated identity, an OAuth client secret — must live in a dedicated secret store (AWS Secrets Manager, HashiCorp Vault, GCP Secret Manager, Azure Key Vault) and be referenced from Terraform via a data source: data "aws_secretsmanager_secret_version" "x" then data.x.secret_string inside the provider or resource. Variables that carry secrets are marked sensitive = true to hide them from plan output. The state-file gotcha: Terraform copies the refreshed value into state, so even with secrets data-sourced from Vault, the state file ends up with the value in plain text. The fix is remote state with at-rest encryption — S3 + KMS, GCS with CMEK, Azure Blob with customer-managed keys. Audit the read access on the state bucket; anyone who can read it can read every secret your Terraform has ever touched. Senior reviewers will reject any PR with password = "..." in HCL on sight.
Practice on PipeCode
- Drill the ETL practice library → for the warehouse, lakehouse, catalog, and IAM provisioning interview surface.
- Rehearse on medium-difficulty ETL problems → when the interviewer wants pipeline-grade IaC depth.
- Stack the optimization library → for the cost/perf trade-offs that show up in every Terraform PR review.
- Sharpen the SQL practice library → for the warehouse-grant + schema-design questions that pair with Terraform interviews.
- For the broader surface, work through the rest of PipeCode's 450+ data engineering practice problems on the PipeCode homepage →.
Lock in Terraform for data infra muscle memory
Provider docs explain the resource. PipeCode drills explain the decision — when a module pays back, when state surgery is safer than `terraform import`, when a Snowflake grant tree explodes. Pipecode.ai is Leetcode for Data Engineering — pattern-first practice tuned for the production trade-offs senior data engineers actually face.
Practice ETL & infra problems →
Practice optimization problems →





Top comments (0)