Building Apache Iceberg Lakehouse Storage with S3 Table Buckets

Originally published on graycloudarch.com.

The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.

The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy: 10x faster metadata queries and a 50% or greater improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: the aws_s3tables_* resources require AWS Provider 5.80 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.
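
A provider constraint along these lines keeps an older pin from failing later with an unknown-resource error. Treat it as a sketch: the 5.80.0 floor assumes that is the release where the aws_s3tables_* resources first appear in the provider changelog, so confirm it against the changelog for the version you target, and raise it if you rely on arguments added in later releases.

```hcl
# versions.tf (the file name is a convention, not a requirement)
terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # Assumed floor for the aws_s3tables_* resources; verify in the provider changelog.
      version = ">= 5.80.0"
    }
  }
}
```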

We built the storage layer as a three-zone medallion architecture, fully managed with Terraform. Here's how we did it — including a few things about Table Buckets that don't show up in most writeups.

The Medallion Architecture

One table bucket per environment. Zones are namespaces inside the bucket, not separate buckets and not separate Glue databases in the legacy sense.

Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, produced by Spark transforms. Curated is the analytics layer that Athena queries and BI dashboards read from.

The namespace naming convention we used was {zone}_{domain} — raw_crm, clean_customer, curated_sales_metrics. When you're looking at a table in Athena or debugging a failed transform job, the namespace name tells you exactly what tier you're in and what domain you're touching. Data lineage is readable from table names alone.
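
The convention can be generated rather than hand-typed. The following is a minimal sketch, not the original module: the zone_domains map and the table_bucket_arn variable are illustrative names, while aws_s3tables_namespace is the provider resource for namespaces.

```hcl
# Hypothetical input: the ARN of the environment's table bucket.
variable "table_bucket_arn" {
  type = string
}

locals {
  # Illustrative zone-to-domain mapping; extend the lists as domains are added.
  zone_domains = {
    raw     = ["crm"]
    clean   = ["customer"]
    curated = ["sales_metrics"]
  }

  # Produces {zone}_{domain} names such as raw_crm and curated_sales_metrics.
  namespaces = toset(flatten([
    for zone, domains in local.zone_domains : [
      for domain in domains : "${zone}_${domain}"
    ]
  ]))
}

resource "aws_s3tables_namespace" "zone" {
  for_each = local.namespaces

  namespace        = each.value
  table_bucket_arn = var.table_bucket_arn
}
```

Keying the resources by the namespace string means adding a domain creates one new namespace without touching the existing ones.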

Why Two Modules Instead of One

The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.

The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.

The KMS module:

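# kms-key/main.tf
# Looks up the current account ID used in the root-principal statement below.
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "this" {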
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow Service Access"
        Effect = "Allow"
        Principal = { Service = var.service_principals }
        Action = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource = "*"
      }
    ]
  })
}

The service_principals variable takes a list of service principal strings — ["athena.amazonaws.com", "glue.amazonaws.com", "emr-serverless.amazonaws.com"] and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.
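
For reference, the module's inputs and output might be declared roughly as follows. This is a sketch under assumptions: the variable names and the key_arn output name come from the snippets in this post, but the types, defaults, and descriptions are guesses rather than the original files.

```hcl
# kms-key/variables.tf
variable "description" {
  type        = string
  description = "Human-readable description of the key."
}

variable "enable_key_rotation" {
  type    = bool
  default = true # assumed default
}

variable "deletion_window_in_days" {
  type    = number
  default = 30 # assumed default; KMS accepts 7 to 30 days
}

variable "service_principals" {
  type        = list(string)
  description = "Service principals allowed to use the key, e.g. athena.amazonaws.com."
}

# kms-key/outputs.tf
output "key_arn" {
  description = "ARN consumed by downstream Terragrunt units."
  value       = aws_kms_key.this.arn
}
```

With this shape, granting a new service access means appending one string to service_principals in the Terragrunt inputs.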

The S3 Table Bucket Module

The table bucket itself is straightforward:

# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name
}

One important thing that trips people up: S3 Table Buckets are not standard S3 buckets. They use the S3 Tables API, not the standard S3 API. Several standard S3 resources will fail with NoSuchBucket (404) if you try to attach them to a Table Bucket:

aws_s3_bucket_versioning
aws_s3_bucket_server_side_encryption_configuration
aws_s3_bucket_public_access_block
aws_s3_bucket_intelligent_tiering_configuration

Encryption is managed internally — AES256 is applied on creation automatically. You'll want ignore_changes = [encryption_configuration] in your lifecycle block or Terraform will constantly detect drift.
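
A minimal sketch of that lifecycle block follows. It assumes your provider version exposes encryption_configuration on aws_s3tables_table_bucket; on releases that do not, the ignore_changes entry may be rejected at validation time, so check the resource documentation for the version you pin.

```hcl
variable "bucket_name" {
  type = string
}

resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name

  lifecycle {
    # Assumes the attribute exists in the pinned provider version.
    # Stops Terraform from reporting the service-managed encryption settings as drift.
    ignore_changes = [encryption_configuration]
  }
}
```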

The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:

# lake-storage/terragrunt.hcl
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}

Glue Is Not the Catalog

This is the part that most S3 Table Bucket writeups get wrong, and it matters for how you structure the rest of your Terraform.

S3 Tables is the metadata source of truth. Glue is the integration layer. When you enable the S3 Tables analytics integration, AWS creates a federated catalog named s3tablescatalog in your Glue Data Catalog. Table buckets, namespaces, and tables are surfaced through that catalog hierarchy — Athena and EMR see them through Glue, but Glue doesn't own them.

This means you should not be creating aws_glue_catalog_database resources with location_uri S3 paths and trying to wire Iceberg metadata parameters onto them. That's the legacy Glue-over-S3-prefixes model. For S3 Tables, the catalog structure comes from the table bucket integration, not from manual Glue database provisioning.

In Terraform, access control on the table bucket is handled by aws_s3tables_table_bucket_policy. The analytics integration is separate from that resource and is enabled at the account level. Once enabled, Athena queries S3 Tables through the s3tablescatalog catalog automatically.
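
A read-only table bucket policy could look something like the sketch below. The reader_role_arn and table_bucket_arn variables are hypothetical, and the action list is an illustrative read set rather than the policy we deployed; trim or extend it to match your access model.

```hcl
variable "table_bucket_arn" {
  type = string
}

# Hypothetical principal, e.g. the role your query engine assumes.
variable "reader_role_arn" {
  type = string
}

data "aws_iam_policy_document" "lake_read" {
  statement {
    sid    = "AllowLakeRead"
    effect = "Allow"

    actions = [
      "s3tables:GetTable",
      "s3tables:GetTableData",
      "s3tables:GetTableMetadataLocation",
      "s3tables:ListNamespaces",
      "s3tables:ListTables",
    ]

    principals {
      type        = "AWS"
      identifiers = [var.reader_role_arn]
    }

    # Covers the bucket itself and every table inside it.
    resources = [
      var.table_bucket_arn,
      "${var.table_bucket_arn}/table/*",
    ]
  }
}

resource "aws_s3tables_table_bucket_policy" "this" {
  table_bucket_arn = var.table_bucket_arn
  resource_policy  = data.aws_iam_policy_document.lake_read.json
}
```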

The namespace naming convention (raw, clean, curated with domain suffixes) is defined in the table bucket itself, not in Glue. Glue reflects it — it doesn't own it.

The Cost Model

For a 100TB lake, the comparison against standard S3 holds:

Storage Class	When	Monthly Cost
Standard	Active data	~$2,300
Standard-IA equivalent	Less-accessed data	~$400
Glacier equivalent	Archive	~$100

The metadata acceleration charge for Table Buckets is $0.00025 per 1,000 requests — on a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. The performance improvement compounds the cost picture: 10x faster query planning means less Athena scan time, which means lower query costs as data volume grows.
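
To put "a few dollars a month" in concrete terms, here is a back-of-the-envelope calculation you can paste into a scratch configuration. The per-request rate is the one quoted above; the request volume is a hypothetical round number, not a measurement from this lake.

```hcl
locals {
  price_per_1000_requests = 0.00025  # rate quoted above, in USD
  monthly_requests        = 10000000 # hypothetical volume, for illustration only
}

output "monthly_metadata_charge_usd" {
  # 10,000,000 / 1,000 * 0.00025 evaluates to 2.5
  value = local.monthly_requests / 1000 * local.price_per_1000_requests
}
```

At that rate, each dollar of charge corresponds to four million requests.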

One note: you cannot attach aws_s3_bucket_intelligent_tiering_configuration to a Table Bucket — it's a standard S3 resource and will fail. Storage cost optimization for Table Buckets happens through compaction and retention maintenance jobs (typically run on a schedule via MWAA or EMR), not through lifecycle policies.

Deployment Sequence

The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before the S3 Tables analytics integration (which creates the federated Glue catalog surface).
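
In Terragrunt, that ordering falls out of dependency blocks. The sketch below shows a hypothetical downstream unit that needs both the key and the bucket. The athena directory, the table_bucket_arn output name, and the mock ARNs are assumptions for illustration; only key_arn appears in the configs shown earlier.

```hcl
# athena/terragrunt.hcl (hypothetical downstream unit)
dependency "kms" {
  config_path = "../kms-key"

  # Mocks let plan and validate run before the real outputs exist.
  mock_outputs = {
    key_arn = "arn:aws:kms:us-east-1:111111111111:key/mock"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

dependency "lake_storage" {
  config_path = "../lake-storage"

  mock_outputs = {
    table_bucket_arn = "arn:aws:s3tables:us-east-1:111111111111:bucket/mock"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  kms_key_arn      = dependency.kms.outputs.key_arn
  table_bucket_arn = dependency.lake_storage.outputs.table_bucket_arn
}
```

Because each unit declares what it needs, Terragrunt applies kms-key first, then lake-storage, then anything that consumes both.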

In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was process rather than resource creation: each component applies quickly, but we ran plan, reviewed, applied, and verified before moving to the next environment.

One deployment note: if you're using Athena and haven't enabled S3 Tables analytics integration in the account before, do that before the apply. Athena can query S3 Tables only after the integration is enabled and the s3tablescatalog catalog is visible in the Glue Data Catalog.

What the Data Team Inherited

When we handed this over to the data engineering team, they had a fully provisioned storage foundation — one table bucket per environment, three namespaces per bucket, encryption enabled, and Athena wired to query through the s3tablescatalog integration. They could start writing Spark jobs and creating tables immediately without worrying about storage configuration or catalog wiring after the fact.

The Terraform modules are reusable. Adding a new environment is one Terragrunt leaf config. Adding a new domain namespace is a namespace declaration on the existing bucket. The KMS key and integration configuration don't change.
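
As an illustration, a leaf config for a new environment might look like the following. The directory layout, the root.hcl file name, the module source path, and the staging environment name are all assumptions; the inputs mirror the lake-storage config shown earlier.

```hcl
# staging/lake-storage/terragrunt.hcl (hypothetical path)
include "root" {
  # Assumes a shared root config named root.hcl somewhere above this directory.
  path = find_in_parent_folders("root.hcl")
}

terraform {
  # Hypothetical location of the table bucket module.
  source = "../../modules/s3-table-bucket"
}

locals {
  environment = "staging"
}

dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}
```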

S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains make a strong case for starting there rather than retrofitting later — just go in knowing they're a different API surface than standard S3, and structure your modules accordingly.
