De' Clerke

Posted on Jun 2

Terraform for Data Engineers: Provisioning GCS, BigQuery, S3, and Lambda Without Clicking Through Consoles

#aws #dataengineering #devops #terraform

Every data pipeline eventually needs a bucket. Then a second bucket. Then a BigQuery dataset, a service account with the right permissions, and a Lambda function to handle alerts. If you set all of that up through the GCP and AWS consoles, you get something that works once, is impossible to reproduce exactly, and will be misconfigured in the next project because you forgot which checkboxes you ticked. Terraform solves this by treating infrastructure as code: version-controlled, reviewable, and repeatable.

This article covers the patterns a data engineer actually needs. Not VPCs and Kubernetes clusters. GCS buckets, BigQuery tables with partitioning, S3 data lakes with lifecycle rules, Lambda functions for lightweight processing, and the IAM wiring that makes service accounts work without over-permissioning.

All provider versions in this article are current as of June 2026: Terraform 1.15.5, Google provider 7.34.0, AWS provider 6.47.0.

The Mental Model: State, Plan, Apply

Terraform works by comparing three things: what you wrote in your .tf files, what it last recorded in the state file, and what actually exists in the cloud. The core workflow is three commands:

terraform init    # download providers and modules
terraform plan    # show what will change without touching anything
terraform apply   # make the changes

terraform plan is the command you run the most. It shows exactly what will be created, modified, or destroyed before anything happens. A plan that shows a resource being replaced (-/+) when you expected it to be modified (~) is a signal to stop and read the plan carefully. Replacement destroys and recreates the resource, which means downtime for anything depending on it.
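For resources where an unexpected replacement would be costly, Terraform's prevent_destroy lifecycle flag offers a guard. A minimal sketch, assuming a hypothetical bucket named "protected" with placeholder values:

```hcl
resource "google_storage_bucket" "protected" {
  name     = "example-protected-bucket" # placeholder name
  location = "US"

  lifecycle {
    # Any plan that would destroy this bucket, including a
    # destroy-and-recreate replacement, fails with an error.
    prevent_destroy = true
  }
}
```

The flag only blocks Terraform itself. It has to be removed deliberately before an intentional teardown.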

The state file (terraform.tfstate) is how Terraform knows what it manages. It contains the IDs, attributes, and dependencies of every resource it has created. Never edit it manually and never delete it. If the state file is lost, Terraform loses track of what it owns and will try to create everything from scratch.

File Structure

Split your Terraform config across five files. Every file has a specific responsibility:

project/
├── main.tf          # resource definitions
├── variables.tf     # input variable declarations
├── outputs.tf       # output values
├── providers.tf     # provider and version config
└── terraform.tfvars # actual variable values (gitignore this if it has secrets)

providers.tf is where you pin versions. This is not optional.

terraform {
  required_version = ">= 1.9.0"

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 7.0"
    }
    aws = {
      source  = "hashicorp/aws"
      version = "~> 6.0"
    }
  }
}

provider "google" {
  project = var.project_id
  region  = var.region
}

provider "aws" {
  region = "us-east-1"
}

The ~> 7.0 constraint allows patch and minor updates (7.1, 7.34) but blocks major version upgrades (8.0). Major version bumps in both providers have historically included breaking changes. Pinning to a major version means terraform init -upgrade can move you to a newer minor release but will never pull in a new major version behind your back.
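If minor releases feel too permissive for a production pipeline, the same operator can pin tighter. A sketch using the Google provider version quoted at the top of this article:

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "~> 7.34.0" # allows 7.34.x patch releases only, blocks 7.35
    }
  }
}
```

Either way, commit the .terraform.lock.hcl file that terraform init generates. It records the exact provider version that was selected, so teammates and CI resolve the same one.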

variables.tf declares inputs with types and validation:

variable "project_id" {
  description = "GCP project ID"
  type        = string
}

variable "region" {
  description = "Default region"
  type        = string
  default     = "africa-south1"   # Johannesburg; BigQuery and GCS available
}

variable "environment" {
  type = string
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Must be dev, staging, or prod."
  }
}

variable "labels" {
  type    = map(string)
  default = {}
}

terraform.tfvars provides the actual values. Add it to .gitignore if it contains credentials or account IDs you do not want public:

project_id  = "my-gcp-project-123"
region      = "africa-south1"
environment = "dev"
labels = {
  project    = "kenya-data-pipeline"
  managed_by = "terraform"
}

Remote State: Stop Storing State Locally

By default, Terraform writes terraform.tfstate to your local working directory. This works for solo projects and breaks the moment anyone else touches the infrastructure. Remote state keeps the file in a shared location with locking so two people cannot run terraform apply simultaneously and corrupt the state.

For GCP projects, use a GCS bucket as the backend:

# backend.tf
terraform {
  backend "gcs" {
    bucket = "my-project-terraform-state"
    prefix = "terraform/state/pipeline"
  }
}

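For AWS projects, use S3 with a DynamoDB table for locking. Terraform 1.10 and later can also lock natively in S3 with use_lockfile = true, and newer releases deprecate DynamoDB-based locking, so treat the table as the option that still works on older versions: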

terraform {
  backend "s3" {
    bucket         = "my-terraform-state-bucket"
    key            = "state/pipeline/terraform.tfstate"
    region         = "us-east-1"
    dynamodb_table = "terraform-lock"
    encrypt        = true
  }
}

The DynamoDB table needs a LockID string partition key. Create it manually once before initializing:

aws dynamodb create-table \
  --table-name terraform-lock \
  --attribute-definitions AttributeName=LockID,AttributeType=S \
  --key-schema AttributeName=LockID,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
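On Terraform 1.10 or later the DynamoDB table can be skipped, because the S3 backend can hold the lock itself. A sketch of that variant, assuming the same bucket and key as above; note that it needs a newer Terraform than the >= 1.9.0 floor set in providers.tf:

```hcl
terraform {
  backend "s3" {
    bucket       = "my-terraform-state-bucket"
    key          = "state/pipeline/terraform.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true # lock file is written next to the state object in S3
  }
}
```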

After adding a backend, run terraform init again. It will ask whether to migrate the existing local state to the remote backend.

GCS: The Data Lake Bucket

A data lake GCS bucket with versioning, lifecycle rules, and uniform access control:

resource "google_storage_bucket" "data_lake" {
  name          = "${var.project_id}-data-lake-${var.environment}"
  location      = "US"
  storage_class = "STANDARD"
  force_destroy = false

  versioning {
    enabled = true
  }

  lifecycle_rule {
    condition { age = 90 }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  lifecycle_rule {
    condition { age = 365 }
    action { type = "Delete" }
  }

  uniform_bucket_level_access = true
  labels = var.labels
}

force_destroy = false prevents Terraform from deleting the bucket if it contains objects. If terraform destroy encounters a non-empty bucket, it fails with an error instead of silently deleting your data. Leave this as false on anything that contains data you care about.

GCS bucket names are globally unique across all GCP accounts. Including the project ID in the name (${var.project_id}-data-lake) avoids the Error 409: The requested bucket name is not available error, which you will hit if you try to create a bucket with a generic name like data-lake.
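Because versioning is enabled on this bucket, every overwrite or delete leaves a noncurrent copy behind, and those copies keep incurring storage cost until a lifecycle rule removes them. A sketch of an extra rule that trims old versions; the threshold of three is an illustrative assumption:

```hcl
# Goes inside resource "google_storage_bucket" "data_lake" { ... }
lifecycle_rule {
  condition {
    with_state         = "ARCHIVED" # match noncurrent versions only
    num_newer_versions = 3          # match once at least 3 newer versions exist
  }
  action {
    type = "Delete"
  }
}
```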

BigQuery: Datasets and Partitioned Tables

resource "google_bigquery_dataset" "raw" {
  dataset_id    = "raw"
  friendly_name = "Raw Layer"
  location      = "US"
  labels        = var.labels

  delete_contents_on_destroy = false

  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }
}

resource "google_bigquery_table" "flights" {
  dataset_id          = google_bigquery_dataset.raw.dataset_id
  table_id            = "flights"
  project             = var.project_id
  deletion_protection = false

  time_partitioning {
    type  = "DAY"
    field = "departure_time"
  }

  clustering = ["airline", "origin"]

  schema = file("${path.module}/schemas/flights.json")
  labels = var.labels
}

Two things worth explaining here.

First, deletion_protection = false on the table resource. google_bigquery_table defaults deletion_protection to true, and Google provider 6.0 extended the same kind of default to more resources. While it is true, terraform destroy (and any plan that has to replace the table) errors out. For BigQuery tables in a data pipeline project you plan to rebuild frequently, set it to false explicitly. For production tables, leave it on.

Second, the combination of time_partitioning and clustering. Partitioning by day on departure_time means BigQuery scans only the relevant day partitions when you filter by date, reducing bytes processed and cost. Clustering by airline and origin within each partition further reduces scan size when you filter by those columns. For a table that receives daily appends and is queried by date and airline, this setup can reduce query cost by 80% or more compared to an unpartitioned table, though the actual saving depends on how narrow your date and airline filters are. A query with no filter on departure_time still scans every partition.

The schema file is a standard BigQuery JSON schema:

[
  {"name": "flight_id",       "type": "STRING",    "mode": "REQUIRED"},
  {"name": "airline",         "type": "STRING",    "mode": "NULLABLE"},
  {"name": "origin",          "type": "STRING",    "mode": "NULLABLE"},
  {"name": "departure_time",  "type": "TIMESTAMP", "mode": "NULLABLE"}
]
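Partitioning only saves money when queries actually filter on the partition column. A sketch of the same table with that enforced, assuming a provider version where require_partition_filter is a top-level argument of google_bigquery_table; the three-year partition expiration is an illustrative value, not a recommendation:

```hcl
resource "google_bigquery_table" "flights" {
  dataset_id          = google_bigquery_dataset.raw.dataset_id
  table_id            = "flights"
  deletion_protection = false

  # Reject queries that do not filter on departure_time.
  require_partition_filter = true

  time_partitioning {
    type          = "DAY"
    field         = "departure_time"
    expiration_ms = 94608000000 # 3 * 365 days in milliseconds (illustrative)
  }

  clustering = ["airline", "origin"]
  schema     = file("${path.module}/schemas/flights.json")
}
```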

GCP IAM: Service Accounts for Pipelines

Never run a pipeline with personal credentials or a broad role like roles/editor. Create a service account with exactly the permissions needed.

resource "google_service_account" "pipeline" {
  account_id   = "data-pipeline-sa"
  display_name = "Data Pipeline Service Account"
  project      = var.project_id
}

resource "google_project_iam_member" "pipeline_bq_editor" {
  project = var.project_id
  role    = "roles/bigquery.dataEditor"
  member  = "serviceAccount:${google_service_account.pipeline.email}"
}

resource "google_project_iam_member" "pipeline_bq_job" {
  project = var.project_id
  role    = "roles/bigquery.jobUser"
  member  = "serviceAccount:${google_service_account.pipeline.email}"
}

resource "google_storage_bucket_iam_member" "pipeline_gcs" {
  bucket = google_storage_bucket.data_lake.name
  role   = "roles/storage.objectAdmin"
  member = "serviceAccount:${google_service_account.pipeline.email}"
}

Note the two separate BigQuery roles. roles/bigquery.dataEditor lets the service account read and write table data. roles/bigquery.jobUser lets it run query jobs. You need both for a pipeline that reads from and writes to BigQuery. Without jobUser, queries fail with a 403 even though the service account has data access.
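The file layout earlier lists outputs.tf without showing one. A minimal sketch of what it might contain for the resources defined so far; the output names are assumptions:

```hcl
# outputs.tf
output "data_lake_bucket" {
  description = "Name of the data lake bucket"
  value       = google_storage_bucket.data_lake.name
}

output "pipeline_sa_email" {
  description = "Email of the pipeline service account"
  value       = google_service_account.pipeline.email
}

output "flights_table" {
  description = "Fully qualified table reference for use in queries"
  value       = "${var.project_id}.${google_bigquery_dataset.raw.dataset_id}.${google_bigquery_table.flights.table_id}"
}
```

After terraform apply, terraform output pipeline_sa_email prints the value, which is handy when wiring the service account into Airflow or dbt profiles.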

For local development, generate a key file and set the environment variable:

resource "google_service_account_key" "pipeline_key" {
  service_account_id = google_service_account.pipeline.name
}

output "sa_key" {
  value     = base64decode(google_service_account_key.pipeline_key.private_key)
  sensitive = true
}

terraform output -raw sa_key > sa-key.json
export GOOGLE_APPLICATION_CREDENTIALS=$(pwd)/sa-key.json

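Add sa-key.json to .gitignore immediately. The private key is also written to the Terraform state, so anyone who can read the state can read the key; restrict access to the state bucket accordingly. For production and CI/CD, use Workload Identity Federation instead of key files.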
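For CI/CD on GitHub Actions, keyless access can be sketched roughly as below. The pool ID, provider ID, and the my-org/my-repo repository name are hypothetical placeholders, and the attribute condition is an assumption about how narrowly you want to scope access:

```hcl
resource "google_iam_workload_identity_pool" "github" {
  workload_identity_pool_id = "github-pool" # hypothetical ID
}

resource "google_iam_workload_identity_pool_provider" "github" {
  workload_identity_pool_id          = google_iam_workload_identity_pool.github.workload_identity_pool_id
  workload_identity_pool_provider_id = "github-provider" # hypothetical ID

  attribute_mapping = {
    "google.subject"       = "assertion.sub"
    "attribute.repository" = "assertion.repository"
  }

  # Only accept tokens issued for this repository (placeholder name).
  attribute_condition = "assertion.repository == \"my-org/my-repo\""

  oidc {
    issuer_uri = "https://token.actions.githubusercontent.com"
  }
}

# Let workflows from that repository impersonate the pipeline service account.
resource "google_service_account_iam_member" "github_wif" {
  service_account_id = google_service_account.pipeline.name
  role               = "roles/iam.workloadIdentityUser"
  member             = "principalSet://iam.googleapis.com/${google_iam_workload_identity_pool.github.name}/attribute.repository/my-org/my-repo"
}
```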

S3: Data Lake with Encryption and Lifecycle

Since AWS provider 4.0, S3 bucket settings are split into separate resources, one per concern, unlike the older monolithic aws_s3_bucket with nested blocks. Provider 6.x keeps that layout, so each setting is its own resource:

resource "aws_s3_bucket" "data_lake" {
  bucket = "${var.project_name}-data-lake-${var.environment}"
  tags   = var.tags
}

resource "aws_s3_bucket_versioning" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

resource "aws_s3_bucket_public_access_block" "data_lake" {
  bucket                  = aws_s3_bucket.data_lake.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

resource "aws_s3_bucket_lifecycle_configuration" "data_lake" {
  bucket = aws_s3_bucket.data_lake.id

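  rule {
    id     = "archive-and-expire"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    # Recent provider versions warn when a rule has no filter or prefix.
    filter {}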

    transition {
      days          = 90
      storage_class = "STANDARD_IA"
    }

    expiration {
      days = 365
    }
  }
}
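The S3 resources above reference var.project_name and var.tags, which the variables.tf shown earlier does not declare. A sketch of the missing declarations; the descriptions and the empty default are assumptions:

```hcl
variable "project_name" {
  description = "Short name used as a prefix for AWS resources"
  type        = string
}

variable "tags" {
  description = "Tags applied to AWS resources"
  type        = map(string)
  default     = {}
}
```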

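If you have existing Terraform code using the old aws_s3_bucket_object resource, it was renamed to aws_s3_object in AWS provider 4.x. Changing the type in your config alone makes Terraform plan a destroy and a create. A moved block can update the state reference instead, but moves between two different resource types need Terraform 1.8 or later and only work when the provider implements the move. If terraform plan rejects the block below, fall back to terraform state rm followed by terraform import for the new address: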

moved {
  from = aws_s3_bucket_object.schema_file
  to   = aws_s3_object.schema_file
}
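If the moved block is rejected, the same result can be reached declaratively with a removed block plus an import block, used in place of the moved block. A sketch, assuming the old resource block has been deleted from the config and a matching aws_s3_object.schema_file block exists; the bucket and key in the import ID are hypothetical:

```hcl
# Stop tracking the old address without deleting the real object (Terraform 1.7+).
removed {
  from = aws_s3_bucket_object.schema_file

  lifecycle {
    destroy = false
  }
}

# Adopt the same object under the new resource type (Terraform 1.5+).
import {
  to = aws_s3_object.schema_file
  id = "my-bucket/schemas/flights.json" # hypothetical bucket/key
}
```

Run terraform plan afterwards and confirm it shows an import and no destroy before applying.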

Lambda: Lightweight Processing and Alerts

Lambda is useful in data pipelines for things that do not belong inside the main DAG: webhook receivers, lightweight event-driven transforms, and alert dispatchers. Here is the full pattern for a scheduled Python Lambda:

data "archive_file" "lambda_zip" {
  type        = "zip"
  source_dir  = "${path.module}/lambda"
  output_path = "${path.module}/lambda.zip"
}

resource "aws_lambda_function" "alert" {
  filename         = data.archive_file.lambda_zip.output_path
  function_name    = "pipeline-alert-${var.environment}"
  role             = aws_iam_role.lambda.arn
  handler          = "handler.lambda_handler"
  runtime          = "python3.12"
  source_code_hash = data.archive_file.lambda_zip.output_base64sha256
  timeout          = 30
  memory_size      = 256

  environment {
    variables = {
      SNS_TOPIC_ARN = aws_sns_topic.alerts.arn
      ENVIRONMENT   = var.environment
    }
  }

  tags = var.tags
}

source_code_hash is what tells Terraform the code changed. Without it, Terraform only updates the function when the .tf file changes, not when the Python code in /lambda changes. With output_base64sha256, a new zip hash triggers a redeployment automatically on terraform apply.

Current Lambda Python runtimes as of June 2026 include python3.12, python3.13, and python3.14; check the Lambda runtimes page for the full list before you pick one. python3.12 is a stable, widely tested choice for production. python3.9 reached Python EOL in October 2025 and Lambda deprecated it in early 2026. Do not use it for new functions.

The Lambda needs an IAM role:

resource "aws_iam_role" "lambda" {
  name = "pipeline-lambda-role-${var.environment}"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Action    = "sts:AssumeRole"
      Effect    = "Allow"
      Principal = { Service = "lambda.amazonaws.com" }
    }]
  })
}

resource "aws_iam_role_policy_attachment" "lambda_basic" {
  role       = aws_iam_role.lambda.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
}

resource "aws_iam_role_policy" "lambda_sns" {
  name = "lambda-sns-publish"
  role = aws_iam_role.lambda.id

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = ["sns:Publish"]
      Resource = aws_sns_topic.alerts.arn
    }]
  })
}
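Both the function's environment block and this policy reference aws_sns_topic.alerts, which is not defined anywhere above. A minimal sketch of the topic with one email subscriber; the topic name and the address are placeholders:

```hcl
resource "aws_sns_topic" "alerts" {
  name = "pipeline-alerts-${var.environment}"
  tags = var.tags
}

resource "aws_sns_topic_subscription" "email" {
  topic_arn = aws_sns_topic.alerts.arn
  protocol  = "email"
  endpoint  = "data-team@example.com" # placeholder address
}
```

Email subscriptions stay pending until the recipient clicks the confirmation link that SNS sends.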

AWSLambdaBasicExecutionRole grants CloudWatch Logs write access, which is the minimum a Lambda needs to emit logs. Everything else (SNS, S3, DynamoDB) needs explicit policy attachments.

To schedule the Lambda, use EventBridge:

resource "aws_cloudwatch_event_rule" "hourly" {
  name                = "hourly-pipeline-check"
  schedule_expression = "rate(1 hour)"
}

resource "aws_cloudwatch_event_target" "lambda" {
  rule      = aws_cloudwatch_event_rule.hourly.name
  target_id = "PipelineAlertLambda"
  arn       = aws_lambda_function.alert.arn
}

resource "aws_lambda_permission" "cloudwatch" {
  statement_id  = "AllowCloudWatchInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.alert.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.hourly.arn
}

The aws_lambda_permission resource is easy to miss. Without it, EventBridge will attempt to invoke the Lambda and get an access denied error, even though the EventBridge rule and target are configured correctly. Lambda requires explicit permission grants for each invoking service.
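rate() is fine for "every N hours", but checks that must run at a fixed time of day need a cron expression. A sketch with a hypothetical rule name:

```hcl
resource "aws_cloudwatch_event_rule" "daily_morning" {
  name = "daily-pipeline-check"

  # Six fields: minute hour day-of-month month day-of-week year, evaluated in UTC.
  schedule_expression = "cron(0 6 * * ? *)" # every day at 06:00 UTC
}
```

If you switch rules, point both the event target and the source_arn of the Lambda permission at the new rule.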

Modules: Reusing Patterns Across Projects

Once you have written a GCS bucket with lifecycle rules and IAM correctly, you do not want to rewrite it for every project. Extract it into a module:

modules/
└── gcs_data_lake/
    ├── main.tf
    ├── variables.tf
    └── outputs.tf

# modules/gcs_data_lake/main.tf
resource "google_storage_bucket" "this" {
  name                        = var.bucket_name
  location                    = var.location
  project                     = var.project_id
  storage_class               = "STANDARD"
  uniform_bucket_level_access = true
  force_destroy               = false
  labels                      = var.labels

  versioning { enabled = true }

  lifecycle_rule {
    condition { age = var.nearline_days }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }
}

# modules/gcs_data_lake/variables.tf
variable "bucket_name"   { type = string }
variable "project_id"    { type = string }
# HCL does not allow semicolons, so blocks with several arguments span multiple lines.
variable "location" {
  type    = string
  default = "US"
}

variable "labels" {
  type    = map(string)
  default = {}
}

variable "nearline_days" {
  type    = number
  default = 90
}

# modules/gcs_data_lake/outputs.tf
output "bucket_name" { value = google_storage_bucket.this.name }
output "bucket_url"  { value = google_storage_bucket.this.url }

Use it from the root module:

module "landing_zone" {
  source        = "./modules/gcs_data_lake"
  project_id    = var.project_id
  bucket_name   = "${var.project_id}-landing-${var.environment}"
  location      = "US"
  labels        = var.labels
  nearline_days = 60
}

output "landing_bucket" {
  value = module.landing_zone.bucket_name
}

Run terraform init after adding a module reference. Without it, terraform plan and terraform apply fail because the module has not been installed yet.
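When several zones share the same settings, one module block with for_each avoids copy-pasting the call. A sketch, assuming the module above; the zone names are illustrative:

```hcl
module "zones" {
  source   = "./modules/gcs_data_lake"
  for_each = toset(["landing", "curated"]) # illustrative zone names

  project_id  = var.project_id
  bucket_name = "${var.project_id}-${each.key}-${var.environment}"
  labels      = var.labels
}

# Map of zone name to bucket name, e.g. zone_buckets["landing"].
output "zone_buckets" {
  value = { for zone, m in module.zones : zone => m.bucket_name }
}
```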

Common Errors and Actual Fixes

State lock error after a crashed run:

Error: Error acquiring the state lock
Lock Info:
  ID: abc-123-def

A previous Terraform run exited without releasing the lock. First confirm that no other apply is still running (a teammate or a CI job), then release it with the lock ID from the error message:

terraform force-unlock abc-123-def

GCP 403 permission error:

Error: googleapi: Error 403: The caller does not have permission

The service account running Terraform is missing an IAM role. Fix it in the Terraform config with google_project_iam_member, or temporarily with gcloud, substituting the role the failing call actually needs:

gcloud projects add-iam-policy-binding MY_PROJECT \
  --member="serviceAccount:sa@project.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataEditor"

GCS bucket name conflict:

Error: Error creating Bucket: googleapi: Error 409: The requested bucket name is not available

GCS bucket names are globally unique. Another account (or a previous version of your own project) already has that name. Add var.project_id or a random suffix to the bucket name.
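A random suffix can be generated inside Terraform with the hashicorp/random provider. A sketch with a hypothetical scratch bucket; the suffix is stored in state, so it stays stable across applies:

```hcl
resource "random_id" "bucket_suffix" {
  byte_length = 4 # rendered as 8 hex characters
}

resource "google_storage_bucket" "scratch" {
  name                        = "${var.project_id}-scratch-${random_id.bucket_suffix.hex}"
  location                    = "US"
  uniform_bucket_level_access = true
}
```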

GCP credentials not found:

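Error: Attempted to load application default credentials since neither `credentials` nor `access_token` was set in the provider block. No credentials loaded.

Terraform cannot find GCP credentials. (The similar "No valid credential sources found" message comes from the AWS provider and means the same thing for AWS.) Fix the GCP case with one of: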

gcloud auth application-default login
# or
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/sa-key.json

AWS Lambda default timeout:

Lambda's default timeout is 3 seconds. Any function doing API calls, database writes, or anything with network latency will time out. Set it explicitly in the resource:

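resource "aws_lambda_function" "alert" {
  # ... other arguments as in the full example earlier ...
  timeout     = 30  # seconds; the default is 3
  memory_size = 256 # MB; the default is 128
}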

Maximum timeout is 900 seconds (15 minutes).

AWS provider 6.x boolean values:

If you have existing configs that use "0" or "1" for boolean attributes, provider 6.x rejects them. Update to true or false:

# Old (fails in provider 6.x)
versioning_enabled = "1"

# Correct
versioning_enabled = true

CI/CD: Running Terraform in GitHub Actions

# .github/workflows/terraform.yml
name: Terraform

on:
  push:
    branches: [main]
    paths: ['terraform/**']
  pull_request:
    branches: [main]
    paths: ['terraform/**']

jobs:
  terraform:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: "~> 1.9"

      - name: Terraform Init
        working-directory: terraform/
        run: terraform init
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}

      - name: Terraform Validate
        working-directory: terraform/
        run: terraform validate

      - name: Terraform Plan
        working-directory: terraform/
        run: terraform plan -out=tfplan
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}
          TF_VAR_project_id: ${{ secrets.GCP_PROJECT_ID }}
          TF_VAR_environment: prod

      - name: Terraform Apply
        if: github.ref == 'refs/heads/main' && github.event_name == 'push'
        working-directory: terraform/
        run: terraform apply -auto-approve tfplan
        env:
          GOOGLE_CREDENTIALS: ${{ secrets.GCP_SA_KEY }}

Pass sensitive variables through environment variables prefixed with TF_VAR_. Terraform picks them up automatically, mapping TF_VAR_project_id to var.project_id. This avoids putting credentials or project IDs in .tfvars files that might be committed.

The if: condition on Apply means the plan runs on every pull request but apply only runs when merged to main. Pull request authors see the plan output in the workflow logs before anything changes.

Terraform vs OpenTofu

In August 2023, HashiCorp changed Terraform's license from MPL 2.0 to the Business Source License (BSL, also written BUSL). The BSL restricts using Terraform in products that compete with HashiCorp's own offerings. OpenTofu is an open-source fork under the Linux Foundation that continued under MPL 2.0.

As of June 2026, both tools use the same HCL syntax and are largely compatible. OpenTofu 1.11 added ephemeral values (temporary credentials that never land in state), a feature Terraform has had since 1.10, while OpenTofu's state encryption from 1.7 has no direct Terraform equivalent. A January 2026 survey found 31% of platform engineering teams had migrated at least one environment to OpenTofu.

For a data engineer building pipelines, the practical difference is minimal today. If you are using Terraform Cloud or HCP Terraform for remote state and collaboration, stay on Terraform. If you want open-source-only tooling or are concerned about the license, OpenTofu is close to a drop-in replacement for configurations like the ones in this article: swap the terraform binary for tofu and the init, plan, and apply workflow stays the same.

The Three Files to Start With

Every new pipeline project gets three Terraform files from the start. They are the minimum needed to provision a data lake bucket and keep the state in a remote backend:

providers.tf: provider versions pinned to major version ranges, remote backend configured
variables.tf: project ID, region, environment, labels
main.tf: GCS bucket or S3 bucket with versioning, encryption, lifecycle, and public access block

Run terraform plan before every terraform apply. Read the plan. A plan that shows destruction where you expected modification is telling you something about how the resource handles updates. Trust the plan more than you trust your memory of what you configured.

The Terraform patterns in this article are drawn from multiple data engineering projects using GCP and AWS. Infrastructure code for the Kenya Economic Pulse and BizPulse Kenya pipelines is on my GitHub.

Follow me on dev.to for more on data engineering, dbt, and Airflow.
