This post was originally published on graycloudarch.com.
The data platform team had a deadline and a storage decision to make.
They'd committed to Apache Iceberg as the table format --- open standard,
time travel, schema evolution, the usual reasons. What they hadn't
locked down was where the data was actually going to live, and whether
the storage layer would hold up under the metadata-heavy access patterns
Iceberg requires.
The default answer is regular S3. It works. Most Iceberg deployments
run on it. But AWS launched S3 Table Buckets in late 2024, and they're
purpose-built for exactly this workload: Iceberg metadata operations.
The numbers made the decision easy --- up to 10x faster metadata
queries and 50% or more improvement in query planning time compared to
standard S3. The
gotcha worth knowing upfront: S3 Table Bucket support requires AWS
Provider 5.70 or later. If your Terraform modules are pinned to an older
provider version, that's your first upgrade.
We built the storage layer as a three-zone medallion architecture,
fully managed with Terraform, with Intelligent-Tiering configured from
day one. Here's how we did it.
The Medallion Architecture
Three zones, each with a clear contract about what data lives there
and who owns it:
Raw is immutable. Once data lands there, it doesn't change --- ETL
failures don't corrupt the source record because the source record is
untouched. Clean is normalized and domain-aligned, owned by data
engineering. Curated is the analytics layer that BI tools, Athena
queries, and QuickSight dashboards read from.
The naming convention we landed on was {zone}_{domain}
for Glue databases --- raw_crm, clean_customer,
curated_sales_metrics. It looks minor, but it matters. When
you're looking at a table in Athena or debugging a failed Glue job, the
database name tells you exactly what tier you're in and what domain
you're touching. Namespace collisions become impossible because the zone
prefix scopes every domain. Data lineage is readable from table names
alone.
Why Two Modules Instead of One
The first design question was whether to build a single composite
module that creates the KMS key and the S3 Table Bucket together, or
split them into separate modules. We split them.
The KMS key isn't just for the lake. It's used by five downstream
services: Athena for query results, EMR for cluster encryption, MWAA for
DAG storage, Kinesis for stream encryption, and Glue DataBrew for
transform outputs. If we bundled the key into the lake storage module,
every one of those services would need a dependency chain that
eventually resolves back through lake storage just to get a KMS key ARN.
Separate modules mean the key has one owner, and everything else
declares a dependency on it independently.
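To make that concrete, here's a sketch of what one downstream consumer's Terragrunt config might look like. The `athena-results` directory name and the `kms_key_arn` input are hypothetical, assuming each service stack lives in its own Terragrunt unit alongside `kms-key`:

```hcl
# athena-results/terragrunt.hcl (hypothetical downstream consumer)
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  # Athena reads the key ARN straight from the kms-key module's outputs,
  # with no dependency on the lake-storage module at all.
  kms_key_arn = dependency.kms.outputs.key_arn
}
```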
The KMS module:
# kms-key/main.tf
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "this" {
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "Enable IAM User Permissions"
        Effect    = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action    = "kms:*"
        Resource  = "*"
      },
      {
        Sid       = "Allow Service Access"
        Effect    = "Allow"
        Principal = { Service = var.service_principals }
        Action    = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource  = "*"
      }
    ]
  })
}
The service_principals variable takes a list of service
principal strings ---
["athena.amazonaws.com", "glue.amazonaws.com"] and so on.
Adding a new service that needs key access is one line in the Terragrunt
config, no module change required.
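As a sketch, the Terragrunt inputs for the key might look like this; the exact principal list is illustrative (matching the five services named above plus Glue), and the rotation and deletion-window values are assumptions:

```hcl
# kms-key/terragrunt.hcl (sketch; values are illustrative)
inputs = {
  description             = "Data lake encryption key"
  enable_key_rotation     = true
  deletion_window_in_days = 30

  service_principals = [
    "athena.amazonaws.com",
    "glue.amazonaws.com",
    "elasticmapreduce.amazonaws.com",
    "airflow.amazonaws.com",
    "kinesis.amazonaws.com",
    "databrew.amazonaws.com",
    # a new consumer is one line here, e.g. "firehose.amazonaws.com"
  ]
}
```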
Working through something similar? I advise platform teams on AWS infrastructure: multi-account architecture, Transit Gateway, EKS, and Terraform IaC. Let's talk.
The S3 Table Bucket Module
The table bucket itself is straightforward. The interesting part is
Intelligent-Tiering:
# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name
}

resource "aws_s3_bucket_intelligent_tiering_configuration" "this" {
  count  = var.enable_intelligent_tiering ? 1 : 0
  bucket = aws_s3tables_table_bucket.this.name
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
We enable Intelligent-Tiering on the entire bucket from the start.
The 90-day threshold for Archive Access and 180-day threshold for Deep
Archive weren't arbitrary --- they match the typical access patterns for a
data lake: raw data is queried heavily during initial load and
validation, then access drops off sharply once the clean layer is
populated.
The reason Intelligent-Tiering beats manual lifecycle policies here
is subtle but important. A manual lifecycle policy moves data based on
age. Intelligent-Tiering moves data based on actual access patterns. If
a dataset from eight months ago suddenly becomes relevant for a
compliance audit, Intelligent-Tiering keeps it in a more accessible tier
automatically. A manual policy would have moved it to Deep Archive on
day 180 regardless. For a data lake, where access patterns are genuinely
unpredictable, letting AWS monitor actual usage is worth the small
monitoring fee.
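For contrast, here's roughly what the age-based alternative looks like. This is a sketch for a general-purpose bucket (the bucket name is illustrative); the rule transitions objects purely on age, with no regard for whether they're still being read:

```hcl
# Age-based lifecycle rule: moves everything to Deep Archive at day 180,
# even data that is still actively queried.
resource "aws_s3_bucket_lifecycle_configuration" "age_based" {
  bucket = "example-standard-bucket"

  rule {
    id     = "archive-by-age"
    status = "Enabled"

    filter {} # applies to the whole bucket

    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
```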
The Terragrunt dependency chain wires the KMS key ARN into the table
bucket configuration:
# lake-storage/terragrunt.hcl
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}
Glue Data Catalog
We provisioned 12 Glue databases across the three zones --- four
domains per zone (CRM, customer, sales, operations). The Terraform for
each database includes the Iceberg metadata parameters that enable
Iceberg table format for all tables created in that database:
resource "aws_glue_catalog_database" "this" {
  name = "raw_crm"

  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
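Rather than declaring all 12 databases by hand, the zone/domain matrix can generate them. This is a sketch (the resource label is hypothetical) using Terraform's `setproduct()` to build every `{zone}_{domain}` combination:

```hcl
locals {
  zones   = ["raw", "clean", "curated"]
  domains = ["crm", "customer", "sales", "operations"]

  # Every {zone}_{domain} name: raw_crm, raw_customer, ... 12 in total.
  databases = toset([
    for pair in setproduct(local.zones, local.domains) :
    "${pair[0]}_${pair[1]}"
  ])
}

resource "aws_glue_catalog_database" "zone_domain" {
  for_each = local.databases
  name     = each.value

  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
```

Adding a domain later becomes a one-element change to `local.domains`.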
Format version 2 is the current Iceberg spec. It unlocks row-level
deletes, which is required for GDPR compliance --- when a user requests
deletion, you can execute a targeted delete on the Iceberg table rather
than rewriting entire Parquet partitions.
One thing that's easy to miss: Glue databases with Iceberg parameters
set don't automatically create Iceberg tables. The database parameters
act as defaults and metadata; actual table creation still happens via
your ETL tooling (Glue jobs, Spark, Flink). What you get from Terraform
is the catalog structure and the governance layer --- databases,
permissions, encryption settings --- so that when the data engineering
team writes their first Glue job, the infrastructure is already in
place.
The Cost Model
When I presented this to the platform team's tech lead, the cost
projection was what turned a "nice to have" into a "let's do this
now."
For a 100TB lake:
Timeframe         Storage Tier      Monthly Cost
First 90 days     Standard          ~$2,300
After 90 days     Archive Access    ~$400
After 180 days    Deep Archive      ~$100
That's roughly 80% savings once the bulk of the data ages past 90
days, and 95% savings at 180 days. The Intelligent-Tiering monitoring
cost is $0.0025 per 1,000 objects --- on a 100TB lake with typical Iceberg
file sizes, that's a few dollars a month. Negligible.
The S3 Table Bucket metadata performance improvement compounds this.
Faster query planning means less Athena scan time, which means lower
query costs and faster results for analysts. The platform pays for
itself in reduced query costs as the data volume grows.
Deployment Sequence
The deployment order is driven by dependencies: KMS must exist before
S3 (bucket encryption needs the key ARN), and both must exist before
Glue (catalog databases reference the bucket location).
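The Glue unit encodes that ordering declaratively. A sketch of its Terragrunt config, assuming the directory layout used above (the `mock_outputs` values are placeholders that let a first-run `plan` succeed before the upstream outputs exist):

```hcl
# glue-catalog/terragrunt.hcl (sketch; paths and output names assumed)
dependency "kms" {
  config_path = "../kms-key"
  mock_outputs = {
    key_arn = "arn:aws:kms:us-east-1:111111111111:key/mock"
  }
}

dependency "lake" {
  config_path = "../lake-storage"
  mock_outputs = {
    bucket_name = "mock-bucket"
  }
}

inputs = {
  kms_key_arn = dependency.kms.outputs.key_arn
  bucket_name = dependency.lake.outputs.bucket_name
}
```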
In practice, across three environments (dev, nonprod, prod), the full
deployment took about four hours. Most of that was Terragrunt apply time
--- the actual resource creation for each component is fast, but we ran
plan, reviewed, applied, and verified before moving to the next
environment.
One deployment note: the first time you run
terragrunt plan on the Glue module in an account that
hasn't had Glue configured before, you'll get an error about the Glue
service-linked role not existing. Fix it by running
aws iam create-service-linked-role --aws-service-name glue.amazonaws.com
before the apply. It only needs to happen once per account.
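Alternatively, the role can be managed in Terraform so new accounts get it automatically (it still exists only once per account):

```hcl
# Creates the Glue service-linked role if it doesn't already exist.
resource "aws_iam_service_linked_role" "glue" {
  aws_service_name = "glue.amazonaws.com"
}
```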
What the Data Team Inherited
When we handed this over to the data engineering team, they had a
fully provisioned catalog --- three zones, twelve databases, Iceberg
metadata configured, encryption enabled, Intelligent-Tiering active.
They could start writing Glue jobs and creating tables immediately
without worrying about storage configuration, access patterns, or cost
optimization after the fact.
The Terraform modules are reusable. Adding a new domain (say, a
finance domain across all three zones) is three database
resource declarations and one pull request. The KMS key, bucket, and
Intelligent-Tiering configuration don't change.
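For illustration, the three declarations for a hypothetical finance domain, one per zone:

```hcl
resource "aws_glue_catalog_database" "raw_finance" {
  name = "raw_finance"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}

resource "aws_glue_catalog_database" "clean_finance" {
  name = "clean_finance"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}

resource "aws_glue_catalog_database" "curated_finance" {
  name = "curated_finance"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
```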
S3 Table Buckets are still relatively new, and the Terraform provider
support came together in late 2024. If your team is planning an Iceberg
migration and hasn't evaluated Table Buckets yet, the metadata
performance gains and the cost trajectory make a strong case for
starting there rather than retrofitting later.
Building out a data platform and figuring out the storage and catalog
architecture? Get in touch --- this kind of
infrastructure design work is something I do regularly, whether you're
starting from scratch or migrating an existing lake.