Glenn Gray

Posted on • Originally published at graycloudarch.com

Building Apache Iceberg Lakehouse Storage with S3 Table Buckets



The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.

The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.
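If you do need that upgrade, the pin is a small change in your required_providers block (version constraint taken from the requirement above):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.70" # minimum for S3 Table Bucket resources
    }
  }
}
```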

We built the storage layer as a three-zone medallion architecture, fully managed with Terraform, with Intelligent-Tiering configured from day one. Here's how we did it.

The Medallion Architecture

Three zones, each with a clear contract about what data lives there and who owns it:

[Diagram: Medallion architecture — three-zone lakehouse. Source Systems flow into the Raw Zone (immutable landing), then ETL into the Clean Zone (normalized), then aggregation into the Curated Zone (analytics-ready), consumed by Athena, QuickSight, and Tableau.]

Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, owned by data engineering. Curated is the analytics layer that BI tools, Athena queries, and QuickSight dashboards read from.

The naming convention we landed on was {zone}_{domain} for Glue databases — raw_crm, clean_customer, curated_sales_metrics. It looks minor, but it matters. When you're looking at a table in Athena or debugging a failed Glue job, the database name tells you exactly what tier you're in and what domain you're touching. Namespace collisions become impossible because the zone prefix scopes every domain. Data lineage is readable from table names alone.

Why Two Modules Instead of One

The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.

The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue DataBrew for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.

The KMS module:

# kms-key/main.tf

# Needed for the account-root principal in the key policy below
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "this" {
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow Service Access"
        Effect = "Allow"
        Principal = { Service = var.service_principals }
        Action = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource = "*"
      }
    ]
  })
}

The service_principals variable takes a list of service principal strings — ["athena.amazonaws.com", "glue.amazonaws.com"] and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.
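For completeness, here's a sketch of the variable declaration and the matching Terragrunt input. The exact service principal strings below are my assumption for the five services named above — verify each one against the AWS documentation before relying on it:

```hcl
# kms-key/variables.tf
variable "service_principals" {
  description = "Service principals granted decrypt and data-key access"
  type        = list(string)
  default     = []
}

# kms-key/terragrunt.hcl
inputs = {
  service_principals = [
    "athena.amazonaws.com",
    "glue.amazonaws.com",
    "elasticmapreduce.amazonaws.com", # EMR (assumed principal string)
    "airflow.amazonaws.com",          # MWAA (assumed principal string)
    "kinesis.amazonaws.com",
    "databrew.amazonaws.com",
  ]
}
```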

The S3 Table Bucket Module

The table bucket itself is straightforward. The interesting part is Intelligent-Tiering:

# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name

  # Encrypt table data with the shared key passed in from the kms-key module
  encryption_configuration = {
    sse_algorithm = "aws:kms"
    kms_key_arn   = var.kms_key_arn
  }
}

resource "aws_s3_bucket_intelligent_tiering_configuration" "this" {
  count  = var.enable_intelligent_tiering ? 1 : 0
  bucket = aws_s3tables_table_bucket.this.name
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }
  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

We enable Intelligent-Tiering on the entire bucket from the start. The 90-day threshold for Archive Access and 180-day threshold for Deep Archive weren't arbitrary — they match the typical access patterns for a data lake: raw data is queried heavily during initial load and validation, then access drops off sharply once the clean layer is populated.

The reason Intelligent-Tiering beats manual lifecycle policies here is subtle but important. A manual lifecycle policy moves data based on age. Intelligent-Tiering moves data based on actual access patterns. If a dataset from eight months ago suddenly becomes relevant for a compliance audit, Intelligent-Tiering keeps it in a more accessible tier automatically. A manual policy would have moved it to Deep Archive on day 180 regardless. For a data lake, where access patterns are genuinely unpredictable, letting AWS monitor actual usage is worth the small monitoring fee.
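For contrast, here's what the age-based alternative looks like — a plain lifecycle rule on a general-purpose bucket. Resource and rule names are illustrative; the point is that it transitions on calendar age no matter how hot the data still is:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "age_based" {
  bucket = aws_s3_bucket.example.id # hypothetical bucket reference

  rule {
    id     = "archive-by-age"
    status = "Enabled"

    # Moves objects at day 90 / day 180 regardless of access patterns
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
```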

The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:

# lake-storage/terragrunt.hcl
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}
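The dependency.kms.outputs.key_arn lookup assumes the kms-key module exports the ARN — a short outputs.tf along these lines:

```hcl
# kms-key/outputs.tf
output "key_arn" {
  description = "ARN of the lake encryption key"
  value       = aws_kms_key.this.arn
}
```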

Glue Data Catalog

We provisioned 12 Glue databases across the three zones — four domains per zone (CRM, customer, sales, operations). The Terraform for each database includes the metadata parameters that mark the database for Iceberg tables:

resource "aws_glue_catalog_database" "this" {
  name = "raw_crm"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
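In practice we didn't declare twelve database resources by hand — the zone/domain grid lends itself to a for_each. A sketch (local and resource names are illustrative, not our exact code):

```hcl
locals {
  zones   = ["raw", "clean", "curated"]
  domains = ["crm", "customer", "sales", "operations"]

  # 3 zones x 4 domains = 12 names like "raw_crm", "curated_operations"
  database_names = [
    for pair in setproduct(local.zones, local.domains) : "${pair[0]}_${pair[1]}"
  ]
}

resource "aws_glue_catalog_database" "zone_domain" {
  for_each = toset(local.database_names)

  name = each.key
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
```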

Format version 2 is the current Iceberg spec. It unlocks row-level deletes, which is required for GDPR compliance — when a user requests deletion, you can execute a targeted delete on the Iceberg table rather than rewriting entire Parquet partitions.

One thing that's easy to miss: Glue databases with Iceberg parameters set don't automatically create Iceberg tables. The database parameters act as defaults and metadata; actual table creation still happens via your ETL tooling (Glue jobs, Spark, Flink). What you get from Terraform is the catalog structure and the governance layer — databases, permissions, encryption settings — so that when the data engineering team writes their first Glue job, the infrastructure is already in place.

The Cost Model

When I presented this to the platform team's tech lead, the cost projection was what turned a "nice to have" into a "let's do this now."

For a 100TB lake:

| Timeframe | Storage Tier | Monthly Cost |
| --- | --- | --- |
| First 90 days | Standard | ~$2,300 |
| After 90 days | Archive Access | ~$400 |
| After 180 days | Deep Archive | ~$100 |

That's roughly 80% savings once the bulk of the data ages past 90 days, and roughly 95% at 180 days. The Intelligent-Tiering monitoring fee is $0.0025 per 1,000 objects monitored per month — on a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. Negligible.

The S3 Table Bucket metadata performance improvement compounds this. Faster query planning means less Athena scan time, which means lower query costs and faster results for analysts. The platform pays for itself in reduced query costs as the data volume grows.

Deployment Sequence

The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before Glue (catalog databases reference the bucket location).

[Diagram: Deployment sequence — KMS Key first, then S3 Table Bucket (which uses the key ARN), then Glue Data Catalog.]

In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was Terragrunt apply time — the actual resource creation for each component is fast, but we ran plan, reviewed, applied, and verified before moving to the next environment.

One deployment note: the first time you run terragrunt plan on the Glue module in an account that hasn't had Glue configured before, you'll get an error about the Glue service-linked role not existing. Fix it by running aws iam create-service-linked-role --aws-service-name glue.amazonaws.com before the apply. It only needs to happen once per account.

What the Data Team Inherited

When we handed this over to the data engineering team, they had a fully provisioned catalog — three zones, twelve databases, Iceberg metadata configured, encryption enabled, Intelligent-Tiering active. They could start writing Glue jobs and creating tables immediately without worrying about storage configuration, access patterns, or cost optimization after the fact.

The Terraform modules are reusable. Adding a new domain (say, a finance domain across all three zones) is three database resource declarations and one pull request. The KMS key, bucket, and Intelligent-Tiering configuration don't change.

S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains and the cost trajectory make a strong case for starting there rather than retrofitting later.


Building out a data platform and figuring out the storage and catalog architecture? Get in touch — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.
