DEV Community

Du Tran

Migrating Apache Iceberg Tables Between AWS Accounts: What Nobody Tells You

When my company needed to migrate to a new AWS account, I took on this project solo — and ended up successfully migrating nearly 2,000 Iceberg tables while maintaining full data integrity across both accounts.

This wasn't a straightforward lift-and-shift. It required understanding Iceberg's metadata structure at a deep level, handling edge cases that aren't documented anywhere, and verifying data consistency at scale.

This post documents what I learned, so you don't have to figure it out the hard way.


Why Is This Hard? The S3 Path Problem

S3 bucket names are globally unique. When you create a new AWS account and set up a new bucket, it will have a different name than the source bucket — always.

Apache Iceberg hardcodes the full S3 URI at every layer of its metadata. This means simply running aws s3 sync to copy your files isn't enough. All the metadata still points to the old bucket, and every query on the new table will fail.
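To make this concrete, here is a hypothetical excerpt of a metadata.json (the bucket, database, and table names are invented for illustration):

```json
{
  "location": "s3://old-prod-bucket/warehouse/db/orders",
  "current-snapshot-id": 123456789,
  "snapshots": [
    {
      "snapshot-id": 123456789,
      "manifest-list": "s3://old-prod-bucket/warehouse/db/orders/metadata/snap-123456789.avro"
    }
  ]
}
```

After an aws s3 sync to a new bucket, every one of these URIs still points at the old bucket, and the same is true one layer down inside the binary Avro and Parquet metadata files.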


Why Standard Approaches Fall Short

Before building a custom solution, I evaluated the obvious options:

aws s3 sync
Copies all files to the new bucket quickly and cheaply, but metadata still references the old bucket name. Queries fail immediately.

CTAS / INSERT OVERWRITE
This works — Athena or Spark reads from the old table and writes a brand-new Iceberg table at the new location. But it rewrites every data file from scratch. For tables in the hundreds of gigabytes or terabytes, the compute cost and time are simply not acceptable when you have nearly 2,000 tables to migrate.

Spark snapshot procedure
Iceberg's built-in snapshot procedure is designed to convert Hive-format tables to Iceberg — not to migrate between buckets. More importantly, it doesn't handle Iceberg v2 delete files, which turned out to be the most complex part of this migration.


Understanding Iceberg's Metadata Structure

To understand why migration is complex, you need to understand how Iceberg organizes its metadata. Iceberg is not a file format — it's a table format built on multiple layers of metadata stacked on top of each other.

metadata.json
    └── snapshots
            └── manifest list  (snap-*.avro)
                        └── manifest files  (*.avro)
                                ├── data files  (*.parquet)
                                └── delete files  (*.parquet)  ← Iceberg v2

metadata.json is the entry point for every Iceberg table. It contains the table's root location, schema, partition spec, and a full history of snapshots. Critically, it stores the full S3 path to the manifest list for each snapshot.

Manifest list (snap-*.avro) is an Avro file that enumerates all manifest files belonging to a specific snapshot. Each record contains manifest_path — a full S3 URI — and manifest_length, the file size in bytes.

Manifest files (*.avro) are where Iceberg tracks individual data and delete files. Each record contains file_path pointing to an actual Parquet file on S3, along with statistics including row count, file size, and column-level min/max bounds.

Delete files are specific to Iceberg v2. Instead of rewriting a data file on every update or delete, Iceberg writes a separate Parquet file describing which rows were removed. Inside this Parquet file is a column called file_path — again, a full S3 URI — pointing to the corresponding data file.

This is the core problem: the bucket name appears at every layer, including inside binary files. A correct migration must touch all of them.


The Approach: Rewrite Metadata, Not Data

The key insight is that data files don't contain any path references. Only metadata files do. So instead of performing an expensive rewrite of every data file, we can rewrite just the metadata files to point to the new bucket — while simply copying the data files as-is with aws s3 sync.

Here's the high-level flow:

Step 1 — Copy data files
         aws s3 sync (old bucket → new bucket)

Step 2 — Download metadata files to local server

Step 3 — Rewrite each metadata layer (replace bucket name)

Step 4 — Upload rewritten metadata to new bucket

Step 5 — Register new table in Glue Catalog

Steps 1 and 5 are straightforward. The complexity lives entirely in Step 3 — and the order in which you process each layer matters.


Step-by-Step: Rewriting the Metadata Layers

Step 1 — metadata.json

This is the simplest layer since it's plain JSON. You need to replace the bucket name in:

  • location — the table root path
  • write.object-storage.path inside properties (if present)
  • manifest-list path inside each snapshot entry
  • All paths in snapshot-log and metadata-log

One important optimization: only process the current snapshot. You don't need to migrate the full snapshot history. This significantly simplifies the work — you only need to follow the manifest list of the active snapshot forward.

Note that this means time travel will not work on the migrated table. Make sure to communicate this limitation to stakeholders before migration.
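A minimal sketch of this step in Python, using only the standard library. The field names above follow the Iceberg table-metadata spec; the sketch below takes a shortcut and replaces the bucket prefix in every string value of the document, which covers all of the listed fields in one pass (a stricter version would target only those fields):

```python
import json

def rewrite_metadata_json(text: str, old_bucket: str, new_bucket: str) -> str:
    """Replace every s3://old_bucket/ prefix in a metadata.json document.

    Because the bucket name only appears inside S3 URIs, a recursive
    string replacement over the parsed JSON covers location,
    write.object-storage.path, manifest-list, snapshot-log, and
    metadata-log in a single pass.
    """
    old_prefix = f"s3://{old_bucket}/"
    new_prefix = f"s3://{new_bucket}/"

    def walk(node):
        if isinstance(node, str):
            return node.replace(old_prefix, new_prefix)
        if isinstance(node, list):
            return [walk(v) for v in node]
        if isinstance(node, dict):
            return {k: walk(v) for k, v in node.items()}
        return node  # numbers, booleans, null: no paths here

    return json.dumps(walk(json.loads(text)), indent=2)
```

Remember that by this point the manifest list has already been rewritten, so the manifest-list paths this step writes are the ones that actually exist in the new bucket.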

Step 2 — Manifest List

The manifest list is an Avro file. For each record, replace manifest_path with the new bucket name.

Here's the first critical gotcha: manifest_length must be accurate. This field stores the file size in bytes of the corresponding manifest file. After you rewrite manifest files in the next step, their sizes may change — and you must update manifest_length accordingly. Iceberg uses this field for integrity validation. Get it wrong and your table is corrupted.

This means you must process manifest files before updating the manifest list, so you have the correct new sizes available.
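The per-record transformation can be sketched as a pure function. Reading and writing the Avro container itself would typically be done with a library such as fastavro; `new_manifest_sizes` is an assumed dictionary mapping new manifest paths to the byte sizes recorded while rewriting the manifest files in the previous processing step:

```python
def rewrite_manifest_list_entry(entry: dict, old_bucket: str, new_bucket: str,
                                new_manifest_sizes: dict) -> dict:
    """Rewrite one decoded record of a manifest list (snap-*.avro).

    entry: an Avro record containing manifest_path and manifest_length
    new_manifest_sizes: new manifest path -> size in bytes, collected
                        while the manifest files were being rewritten
    """
    new_path = entry["manifest_path"].replace(
        f"s3://{old_bucket}/", f"s3://{new_bucket}/")
    updated = dict(entry)
    updated["manifest_path"] = new_path
    # manifest_length must match the rewritten file exactly: Iceberg
    # validates it, and a stale size corrupts the table.
    updated["manifest_length"] = new_manifest_sizes[new_path]
    return updated
```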

Step 3 — Manifest Files

This is the most complex layer. Beyond replacing file_path in each record, there are two additional concerns:

Delete file sizes: If a record references a delete file (content type = 1 in Iceberg v2), the file_size_in_bytes field must be updated to reflect the size of the rewritten delete file. Just like manifest_length, Iceberg validates this field. Get it wrong and queries will throw cryptic errors.

This means you must process delete files before manifest files, so you have the correct new sizes ready.

lower_bounds and upper_bounds: Iceberg stores column-level statistics in manifest records for query pruning. In most cases these are numeric or timestamp values and don't contain any path references. However, if your table has any string column whose values happen to contain S3 paths — for example, in certain CDC patterns — those bounds will contain the bucket name encoded as raw UTF-8 bytes. You need to detect and replace these as well.
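The record-level logic for this layer can be sketched as follows. Note this is a simplified view: in the actual manifest Avro schema these fields live inside a nested data_file struct, and the sketch assumes records decoded into flat dictionaries. `new_delete_sizes` is the assumed path-to-size map collected while rewriting delete files:

```python
def rewrite_bound(value: bytes, old_bucket: str, new_bucket: str) -> bytes:
    """lower/upper bounds are raw bytes; for string columns they may
    embed an S3 path, so decode, replace, and re-encode."""
    try:
        text = value.decode("utf-8")
    except UnicodeDecodeError:
        return value  # binary stats for numeric/timestamp columns: leave as-is
    return text.replace(f"s3://{old_bucket}/",
                        f"s3://{new_bucket}/").encode("utf-8")

def rewrite_data_file_entry(df: dict, old_bucket: str, new_bucket: str,
                            new_delete_sizes: dict) -> dict:
    old_prefix, new_prefix = f"s3://{old_bucket}/", f"s3://{new_bucket}/"
    updated = dict(df)
    updated["file_path"] = df["file_path"].replace(old_prefix, new_prefix)
    if df.get("content") == 1:  # positional delete file (Iceberg v2)
        # Size must reflect the rewritten delete file (recorded earlier);
        # Iceberg validates file_size_in_bytes for delete files.
        updated["file_size_in_bytes"] = new_delete_sizes[updated["file_path"]]
    for key in ("lower_bounds", "upper_bounds"):
        if df.get(key):
            updated[key] = {col: rewrite_bound(v, old_bucket, new_bucket)
                            for col, v in df[key].items()}
    return updated
```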

Step 4 — Delete Files

Delete files are Parquet files with a file_path column containing full S3 URIs pointing to the data files they affect. Read the file, replace the bucket name in the file_path column, and write it back.

For large delete files, process in batches rather than loading the entire file into memory.

Two important things to get right here:

  • Preserve the exact schema when writing back, including Parquet version and compression codec. Schema mismatches can make the file unreadable to Iceberg.
  • Record the new file size — you'll need it when updating manifest files in the next step.
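A sketch of this step, assuming pyarrow for the Parquet I/O and files already downloaded locally. Batched iteration keeps memory bounded, and the returned size feeds the manifest rewrite later (a production version should also carry over the source file's compression codec when constructing the writer):

```python
def rewrite_path(path: str, old_bucket: str, new_bucket: str) -> str:
    """Swap the bucket prefix in one S3 URI; other URIs pass through."""
    return path.replace(f"s3://{old_bucket}/", f"s3://{new_bucket}/")

def rewrite_delete_file(local_in: str, local_out: str,
                        old_bucket: str, new_bucket: str,
                        batch_size: int = 100_000) -> int:
    """Rewrite the file_path column of a positional delete file in batches.

    Returns the new file size in bytes, which the manifest rewrite
    needs for file_size_in_bytes.
    """
    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    src = pq.ParquetFile(local_in)
    writer = None
    try:
        for batch in src.iter_batches(batch_size=batch_size):
            table = pa.Table.from_batches([batch])
            paths = [rewrite_path(p, old_bucket, new_bucket)
                     for p in table.column("file_path").to_pylist()]
            table = table.set_column(
                table.schema.get_field_index("file_path"),
                "file_path", pa.array(paths, type=pa.string()))
            if writer is None:
                # Reuse the (rewritten) table schema so the output stays
                # readable to Iceberg.
                writer = pq.ParquetWriter(local_out, table.schema)
            writer.write_table(table)
    finally:
        if writer is not None:
            writer.close()
    return os.path.getsize(local_out)
```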

The Correct Processing Order

The dependency chain between layers determines the order you must follow:

Delete files  →  Manifest files  →  Manifest list  →  metadata.json
   (sizes)           (sizes)

Each layer needs the file sizes from the layer below before it can be correctly written. Processing in the wrong order means you won't have the information you need — and the sizes you write will be wrong.


Key Gotchas

These are the things that will cost you hours if you don't know about them upfront:

File sizes must be exact after every rewrite. Iceberg validates file_size_in_bytes for delete files and manifest_length for manifest files. These aren't advisory fields — they're used for integrity checks. A wrong value means a corrupted table.

Processing order is strict. Bottom-up: delete files → manifest files → manifest list → metadata.json. There's no flexibility here.

Only migrate the current snapshot. Don't try to migrate full snapshot history. It multiplies the work enormously and the migrated table will have a fresh history anyway.

lower/upper bounds are bytes, not strings. When scanning bounds in manifest records, the values are raw bytes. Decode to UTF-8, replace, and re-encode. Standard string replacement won't work.

Parallelism helps but adds complexity. With nearly 2,000 tables and potentially thousands of delete files per table, parallel processing is essential for reasonable throughput. But shared state — particularly the dictionaries tracking new file sizes — needs to be handled carefully to avoid race conditions.
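One way to structure this, as a sketch: parallelize across tables with a thread pool, and protect the shared size dictionary with a lock so that concurrent workers don't race. The `migrate_table` body here is a placeholder standing in for the full per-table pipeline:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

new_sizes = {}                 # new path -> size in bytes, shared across workers
new_sizes_lock = threading.Lock()

def record_size(path: str, size: int) -> None:
    # Later layers read these sizes while writing manifest_length and
    # file_size_in_bytes, so concurrent writers must not race.
    with new_sizes_lock:
        new_sizes[path] = size

def migrate_table(table_name: str) -> str:
    # Placeholder for the real per-table pipeline:
    # delete files -> manifest files -> manifest list -> metadata.json
    record_size(f"s3://new-bkt/{table_name}/meta.avro", len(table_name))
    return table_name

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(migrate_table, ["db.orders", "db.events"]))
```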


Limitations

This approach has some limitations worth being explicit about:

  • No time travel on the migrated table — only the current snapshot is migrated
  • The table must have no active writes during migration — otherwise you risk inconsistency
  • Equality delete files may behave differently — this approach was validated against positional delete files

Conclusion

Migrating Iceberg tables between AWS accounts is not a copy-paste operation. Because Iceberg hardcodes S3 paths at every metadata layer — from the top-level JSON down to binary Avro and Parquet files — you need to rewrite metadata systematically, in the right order, with correct file sizes at each step.

The metadata rewrite approach avoids rewriting data files entirely, making it dramatically more cost-effective than CTAS or INSERT OVERWRITE — especially at scale.

If you're facing the same challenge, I hope this gives you a clear mental model of what's actually involved and helps you build the right solution from the start.
