Glenn Gray

Posted on • Originally published at graycloudarch.com

Building Apache Iceberg Lakehouse Storage with S3 Table Buckets



The data platform team had a deadline and a storage decision to make. They'd committed to Apache Iceberg as the table format — open standard, time travel, schema evolution, the usual reasons. What they hadn't locked down was where the data was actually going to live, and whether the storage layer would hold up under the metadata-heavy access patterns Iceberg requires.

The default answer is regular S3. It works. Most Iceberg deployments run on it. But AWS launched S3 Table Buckets in late 2024, and they're purpose-built for exactly this workload: Iceberg metadata operations. The numbers made the decision easy — 10x faster metadata queries, 50% or more improvement in query planning time compared to standard S3. The gotcha worth knowing upfront: S3 Table Bucket support requires AWS Provider 5.70 or later. If your Terraform modules are pinned to an older provider version, that's your first upgrade.
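If you do need that upgrade, the pin is a small change in your required_providers block (version constraint taken from the requirement above):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.70" # minimum for S3 Table Bucket resources
    }
  }
}
```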

We built the storage layer as a three-zone medallion architecture, fully managed with Terraform, with Intelligent-Tiering configured from day one. Here's how we did it.

The Medallion Architecture

Three zones, each with a clear contract about what data lives there and who owns it:

[Diagram: Medallion architecture — three-zone lakehouse. Source Systems flow into the Raw Zone (immutable landing), then ETL into the Clean Zone (normalized), then aggregation into the Curated Zone (analytics-ready), consumed by Athena, QuickSight, and Tableau.]

Raw is immutable. Once data lands there, it doesn't change — ETL failures don't corrupt the source record because the source record is untouched. Clean is normalized and domain-aligned, owned by data engineering. Curated is the analytics layer that BI tools, Athena queries, and QuickSight dashboards read from.

The naming convention we landed on was {zone}_{domain} for Glue databases — raw_crm, clean_customer, curated_sales_metrics. It looks minor, but it matters. When you're looking at a table in Athena or debugging a failed Glue job, the database name tells you exactly what tier you're in and what domain you're touching. Namespace collisions become impossible because the zone prefix scopes every domain. Data lineage is readable from table names alone.

Why Two Modules Instead of One

The first design question was whether to build a single composite module that creates the KMS key and the S3 Table Bucket together, or split them into separate modules. We split them.

The KMS key isn't just for the lake. It's used by five downstream services: Athena for query results, EMR for cluster encryption, MWAA for DAG storage, Kinesis for stream encryption, and Glue DataBrew for transform outputs. If we bundled the key into the lake storage module, every one of those services would need a dependency chain that eventually resolves back through lake storage just to get a KMS key ARN. Separate modules mean the key has one owner, and everything else declares a dependency on it independently.

The KMS module:

# kms-key/main.tf

# Needed for the account-root principal in the key policy below
data "aws_caller_identity" "current" {}

resource "aws_kms_key" "this" {
  description             = var.description
  enable_key_rotation     = var.enable_key_rotation
  deletion_window_in_days = var.deletion_window_in_days

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid    = "Enable IAM User Permissions"
        Effect = "Allow"
        Principal = { AWS = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root" }
        Action   = "kms:*"
        Resource = "*"
      },
      {
        Sid    = "Allow Service Access"
        Effect = "Allow"
        Principal = { Service = var.service_principals }
        Action = ["kms:Decrypt", "kms:GenerateDataKey", "kms:CreateGrant"]
        Resource = "*"
      }
    ]
  })
}

The service_principals variable takes a list of service principal strings — ["athena.amazonaws.com", "glue.amazonaws.com"] and so on. Adding a new service that needs key access is one line in the Terragrunt config, no module change required.
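For completeness, here's a sketch of the variable declaration and the matching Terragrunt input. The exact service principal strings below are my assumption for the five services named above — verify each one against the AWS documentation before relying on it:

```hcl
# kms-key/variables.tf
variable "service_principals" {
  description = "Service principals granted decrypt and data-key access"
  type        = list(string)
  default     = []
}

# kms-key/terragrunt.hcl
inputs = {
  service_principals = [
    "athena.amazonaws.com",
    "glue.amazonaws.com",
    "elasticmapreduce.amazonaws.com", # EMR (assumed principal string)
    "airflow.amazonaws.com",          # MWAA (assumed principal string)
    "kinesis.amazonaws.com",
    "databrew.amazonaws.com",
  ]
}
```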

The S3 Table Bucket Module

The table bucket itself is straightforward. The interesting part is Intelligent-Tiering:

# s3-table-bucket/main.tf
resource "aws_s3tables_table_bucket" "this" {
  name = var.bucket_name

  # Encrypt table data with the shared key passed in from the kms-key module
  encryption_configuration = {
    sse_algorithm = "aws:kms"
    kms_key_arn   = var.kms_key_arn
  }
}

resource "aws_s3_bucket_intelligent_tiering_configuration" "this" {
  count  = var.enable_intelligent_tiering ? 1 : 0
  bucket = aws_s3tables_table_bucket.this.name
  name   = "EntireBucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }
  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

We enable Intelligent-Tiering on the entire bucket from the start. The 90-day threshold for Archive Access and 180-day threshold for Deep Archive weren't arbitrary — they match the typical access patterns for a data lake: raw data is queried heavily during initial load and validation, then access drops off sharply once the clean layer is populated.

The reason Intelligent-Tiering beats manual lifecycle policies here is subtle but important. A manual lifecycle policy moves data based on age. Intelligent-Tiering moves data based on actual access patterns. If a dataset from eight months ago suddenly becomes relevant for a compliance audit, Intelligent-Tiering keeps it in a more accessible tier automatically. A manual policy would have moved it to Deep Archive on day 180 regardless. For a data lake, where access patterns are genuinely unpredictable, letting AWS monitor actual usage is worth the small monitoring fee.
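For contrast, here's what the age-based alternative looks like — a plain lifecycle rule on a general-purpose bucket. Resource and rule names are illustrative; the point is that it transitions on calendar age no matter how hot the data still is:

```hcl
resource "aws_s3_bucket_lifecycle_configuration" "age_based" {
  bucket = aws_s3_bucket.example.id # hypothetical bucket reference

  rule {
    id     = "archive-by-age"
    status = "Enabled"

    # Moves objects at day 90 / day 180 regardless of access patterns
    transition {
      days          = 90
      storage_class = "GLACIER"
    }
    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"
    }
  }
}
```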

The Terragrunt dependency chain wires the KMS key ARN into the table bucket configuration:

# lake-storage/terragrunt.hcl
dependency "kms" {
  config_path = "../kms-key"
}

inputs = {
  bucket_name = "company-lake-${local.environment}"
  kms_key_arn = dependency.kms.outputs.key_arn
}
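The dependency.kms.outputs.key_arn lookup assumes the kms-key module exports the ARN — a short outputs.tf along these lines:

```hcl
# kms-key/outputs.tf
output "key_arn" {
  description = "ARN of the lake encryption key"
  value       = aws_kms_key.this.arn
}
```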

Glue Data Catalog

We provisioned 12 Glue databases across the three zones — four domains per zone (CRM, customer, sales, operations). The Terraform for each database includes the metadata parameters that mark the database for Iceberg tables:

resource "aws_glue_catalog_database" "this" {
  name = "raw_crm"
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
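In practice we didn't declare twelve database resources by hand — the zone/domain grid lends itself to a for_each. A sketch (local and resource names are illustrative, not our exact code):

```hcl
locals {
  zones   = ["raw", "clean", "curated"]
  domains = ["crm", "customer", "sales", "operations"]

  # 3 zones x 4 domains = 12 names like "raw_crm", "curated_operations"
  database_names = [
    for pair in setproduct(local.zones, local.domains) : "${pair[0]}_${pair[1]}"
  ]
}

resource "aws_glue_catalog_database" "zone_domain" {
  for_each = toset(local.database_names)

  name = each.key
  parameters = {
    "iceberg_enabled" = "true"
    "table_type"      = "ICEBERG"
    "format-version"  = "2"
  }
}
```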

Format version 2 is the current Iceberg spec. It unlocks row-level deletes, which is required for GDPR compliance — when a user requests deletion, you can execute a targeted delete on the Iceberg table rather than rewriting entire Parquet partitions.

One thing that's easy to miss: Glue databases with Iceberg parameters set don't automatically create Iceberg tables. The database parameters act as defaults and metadata; actual table creation still happens via your ETL tooling (Glue jobs, Spark, Flink). What you get from Terraform is the catalog structure and the governance layer — databases, permissions, encryption settings — so that when the data engineering team writes their first Glue job, the infrastructure is already in place.

The Cost Model

When I presented this to the platform team's tech lead, the cost projection was what turned a "nice to have" into a "let's do this now."

For a 100TB lake:

| Timeframe | Storage Tier | Monthly Cost |
| --- | --- | --- |
| First 90 days | Standard | ~$2,300 |
| After 90 days | Archive Access | ~$400 |
| After 180 days | Deep Archive | ~$100 |

That's roughly 80% savings once the bulk of the data ages past 90 days, and roughly 95% at 180 days. The Intelligent-Tiering monitoring fee is $0.0025 per 1,000 objects monitored per month — on a 100TB lake with typical Iceberg file sizes, that's a few dollars a month. Negligible.

The S3 Table Bucket metadata performance improvement compounds this. Faster query planning means less Athena scan time, which means lower query costs and faster results for analysts. The platform pays for itself in reduced query costs as the data volume grows.

Deployment Sequence

The deployment order is driven by dependencies: KMS must exist before S3 (bucket encryption needs the key ARN), and both must exist before Glue (catalog databases reference the bucket location).

[Diagram: Deployment sequence — KMS Key first, then S3 Table Bucket (which uses the key ARN), then Glue Data Catalog.]

In practice, across three environments (dev, nonprod, prod), the full deployment took about four hours. Most of that was Terragrunt apply time — the actual resource creation for each component is fast, but we ran plan, reviewed, applied, and verified before moving to the next environment.

One deployment note: the first time you run terragrunt plan on the Glue module in an account that hasn't had Glue configured before, you'll get an error about the Glue service-linked role not existing. Fix it by running aws iam create-service-linked-role --aws-service-name glue.amazonaws.com before the apply. It only needs to happen once per account.

What the Data Team Inherited

When we handed this over to the data engineering team, they had a fully provisioned catalog — three zones, twelve databases, Iceberg metadata configured, encryption enabled, Intelligent-Tiering active. They could start writing Glue jobs and creating tables immediately without worrying about storage configuration, access patterns, or cost optimization after the fact.

The Terraform modules are reusable. Adding a new domain (say, a finance domain across all three zones) is three database resource declarations and one pull request. The KMS key, bucket, and Intelligent-Tiering configuration don't change.

S3 Table Buckets are still relatively new, and the Terraform provider support came together in late 2024. If your team is planning an Iceberg migration and hasn't evaluated Table Buckets yet, the metadata performance gains and the cost trajectory make a strong case for starting there rather than retrofitting later.


Building out a data platform and figuring out the storage and catalog architecture? Get in touch — this kind of infrastructure design work is something I do regularly, whether you're starting from scratch or migrating an existing lake.
