DEV Community

Aki for AWS Community Builders

Track Apache Iceberg Schema Changes in AWS Glue Data Catalog with aws glue get-table-versions

Original Japanese article: Iceberg × Glue Data Catalogのスキーマ変更履歴をaws glue get-table-versionsで確認する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

As Apache Iceberg adoption continues to grow in AWS-based lakehouse architectures, schema evolution has become one of its most valuable features.

At the same time, questions like the following inevitably arise:

  • When did the schema change?
  • Which columns were added or removed?
  • Who made the change?

Although you can view historical schema versions from the AWS Glue console, investigating these details can be cumbersome.

This is where aws glue get-table-versions becomes useful.

When your Apache Iceberg tables are managed through AWS Glue Data Catalog, this command allows you to retrieve schema change history over time.

In this article, I'll walk through the basics of get-table-versions, show how to extract column-level differences with jq, and explain how to identify the person who made a change by combining the results with CloudTrail.


What is get-table-versions?

aws glue get-table-versions is an AWS CLI command that retrieves historical versions of a table registered in AWS Glue Data Catalog.

Every time a Glue table definition is updated, a new VersionId is created. Each version stores the schema definition, partition information, table parameters, and other metadata.

In addition to schema tracking, Iceberg-specific parameters such as metadata_location are also recorded, making this command useful for Iceberg operational management.

Isn't the Console Enough?

You can actually view previous schema versions from the AWS Glue console by selecting an older version from the version dropdown in the upper-right corner of the table page.

However, the console only allows you to inspect one version at a time.

It does not show differences between versions, making it difficult to answer questions such as:

  • Which columns were added or removed?
  • In which version did the change occur?
  • When exactly was the schema updated?

If you need to compare multiple versions or quickly investigate schema-related incidents, using the CLI is far more efficient.

Basic Command Syntax

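aws glue get-table-versions \
  --database-name <database-name> \
  --table-name <table-name> \
  --region ap-northeast-1

By default, the AWS CLI paginates automatically and keeps calling the API until every version has been retrieved. Avoid adding --no-paginate here: it limits the CLI to a single API call, so only the first page of versions is returned, which can silently truncate the history of a table with many versions.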

The response contains a TableVersions array. Each element includes:

  • VersionId
  • Table (full schema definition)
  • Table.UpdateTime (timestamp when the Glue table definition was updated; nested inside Table)

Example Output

{
  "Table": {
    "Name": "flights_1m",
    "DatabaseName": "icebergdb",
    "UpdateTime": "2026-06-12T05:21:18+00:00",
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "fl_date",
          "Type": "date",
          "Parameters": {
            "iceberg.field.current": "true",
            "iceberg.field.id": "1",
            "iceberg.field.optional": "true"
          }
        }
      ]
    },
    "Parameters": {
      "metadata_location": "s3://<your-bucket>/warehouse/flights_1m/metadata/00002-....metadata.json",
      "previous_metadata_location": "s3://<your-bucket>/warehouse/flights_1m/metadata/00001-....metadata.json",
      "table_type": "ICEBERG"
    },
    "VersionId": "5"
  }
}

Characteristics of Iceberg Tables

Compared to standard Glue tables, Iceberg tables have several notable characteristics.

First, you may notice that VersionId values become surprisingly large.

Glue VersionId values are not the same as Iceberg Snapshot IDs or commit counts. They increment whenever the Glue table definition stored in the catalog is updated.

Because Iceberg frequently updates metadata_location, Glue table definitions are also updated regularly, causing VersionId to increase much more rapidly than expected.

In one of my test environments, a table had already reached VersionId 318.

However, most of those versions were created by metadata updates associated with data writes rather than actual schema changes.

Another notable characteristic is the presence of iceberg.field.id within each column's Parameters.

This field represents Iceberg's internal column ID, which enables schema evolution features such as column renames without breaking data mapping.
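Because the field ID stays the same when a column is renamed, it can help tell a rename apart from a drop followed by an add. The following is a minimal sketch rather than a tested recipe: detect_renames is a hypothetical helper name, it reads get-table-versions output on stdin, and it assumes every column carries iceberg.field.id as in the example output. Skipping columns flagged iceberg.field.current=false is also an assumption about how non-current fields are marked.

```shell
# Hypothetical helper: report columns whose name changed while their
# iceberg.field.id stayed the same between adjacent Glue table versions.
# Input: JSON from `aws glue get-table-versions` on stdin.
detect_renames() {
  jq -c '
    # Build a {field_id: column_name} map for one table version.
    # Columns without a field id are ignored; columns flagged
    # iceberg.field.current=false are skipped (assumption, see lead-in).
    def idmap:
      [ .Table.StorageDescriptor.Columns[]
        | select(.Parameters["iceberg.field.id"] != null)
        | select(.Parameters["iceberg.field.current"] != "false")
        | { key: .Parameters["iceberg.field.id"], value: .Name } ]
      | from_entries;

    [ .TableVersions[] | { v: .VersionId, ids: idmap } ]
    | sort_by(.v | tonumber)
    | . as $s
    | range(1; length) as $i
    | $s[$i - 1].ids as $old
    | $s[$i].ids
    | to_entries[]
    # Same field id in both versions, but a different name: a rename.
    | select($old[.key] != null and $old[.key] != .value)
    | { version: $s[$i].v, field_id: .key, old_name: $old[.key], new_name: .value }
  '
}

# Usage (requires AWS credentials):
#   aws glue get-table-versions --database-name icebergdb --table-name flights_1m \
#     --region ap-northeast-1 | detect_renames
```

Each output line names the version in which the new name first appears, so it can be cross-checked against the UpdateTime of that version.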

The table-level Parameters section also contains:

  • metadata_location
  • previous_metadata_location

These point to Iceberg metadata files stored in Amazon S3.

Because each Glue table version records the metadata_location that was current when it was written, you can trace these files for deeper historical analysis when necessary.
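As a starting point for that kind of tracing, the sketch below lists the metadata file recorded in each Glue version. list_metadata_locations is a hypothetical helper name; it reads get-table-versions output on stdin and assumes the table-level Parameters contain metadata_location as in the example output.

```shell
# Hypothetical helper: print VersionId, UpdateTime and metadata_location
# for every Glue table version, oldest first, as tab-separated values.
# Input: JSON from `aws glue get-table-versions` on stdin.
list_metadata_locations() {
  jq -r '
    .TableVersions
    | sort_by(.VersionId | tonumber)[]
    # "-" marks versions that have no metadata_location parameter.
    | [ .VersionId, .Table.UpdateTime, (.Table.Parameters.metadata_location // "-") ]
    | @tsv
  '
}

# Usage (requires AWS credentials):
#   aws glue get-table-versions --database-name icebergdb --table-name flights_1m \
#     --region ap-northeast-1 | list_metadata_locations
```

The S3 paths in the third column can then be opened directly to inspect the Iceberg metadata that was current for a given Glue version.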

Warning

AWS Glue Data Catalog has service quotas for the number of stored table versions.

In Iceberg environments, it's possible to hit these limits and encounter ResourceNumberLimitExceededException.

Consider periodically removing old versions or using the SkipArchive option of UpdateTable to reduce version growth.
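One way to approach the cleanup is to first list the version IDs that fall outside a retention window and review them before deleting anything. This is only a sketch: old_version_ids is a hypothetical helper name, the default of keeping 50 versions is arbitrary, and batching 100 IDs per batch-delete-table-version call is an assumption about the API limit that you should confirm in the Glue documentation.

```shell
# Hypothetical helper: print the VersionIds that are older than the newest
# KEEP versions, one per line. Input: get-table-versions JSON on stdin.
old_version_ids() {
  local keep="${1:-50}"   # number of most recent versions to retain
  jq -r --argjson keep "$keep" '
    [ .TableVersions[].VersionId | tonumber ]
    | sort
    | if length > $keep then .[0:(length - $keep)][] else empty end
  '
}

# Usage (requires AWS credentials). Review the list first:
#   aws glue get-table-versions --database-name icebergdb --table-name flights_1m \
#     --region ap-northeast-1 | old_version_ids 50
#
# Then, if the list looks right, delete in batches (assumed limit: 100 IDs):
#   ... | old_version_ids 50 \
#     | xargs -n 100 aws glue batch-delete-table-version \
#         --database-name icebergdb --table-name flights_1m \
#         --region ap-northeast-1 --version-ids
```

Deleting a Glue table version removes only the catalog history entry, so keep enough versions to cover the period you may need to investigate.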


Viewing Schema Change History

List Columns by Version

Let's start by displaying the update timestamp and column list for each version.

Using jq, we can extract VersionId, UpdateTime, and the column names:

aws glue get-table-versions \
  --database-name icebergdb \
  --table-name flights_1m \
  --region ap-northeast-1 \
| jq -r '
    .TableVersions[]
    | {
        version: .VersionId,
        updated: .Table.UpdateTime,
        columns: [.Table.StorageDescriptor.Columns[].Name]
      }
    | "\(.version)\t\(.updated)\t\(.columns | join(", "))"
  '

Example output:

5    2026-06-12T05:21:18+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time, arr_time
4    2026-06-12T05:20:48+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time, double
3    2026-06-12T05:20:11+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time
2    2025-09-01T21:27:05+00:00    fl_date, dep_delay, arr_delay, air_time, distance, dep_time, arr_time

Note that VersionId values are not necessarily consecutive.

In this example they happen to be 2 → 3 → 4 → 5, but in production environments they may reach hundreds or even thousands.

Since Glue VersionId values do not directly correspond to Iceberg commits, you should not use them alone to estimate the number of schema changes.
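To count actual schema changes, one option is to keep only the versions whose column list differs from the version before it. The sketch below does that with a hypothetical helper named schema_change_versions; it compares a "name:type" signature per version, so a change in column order also counts as a change, which may or may not be what you want.

```shell
# Hypothetical helper: print only the versions whose column signature
# (names and types, in order) differs from the previous version.
# Input: get-table-versions JSON on stdin. Output: TSV, oldest first.
schema_change_versions() {
  jq -r '
    [ .TableVersions[]
      | { v: .VersionId,
          t: .Table.UpdateTime,
          sig: ([ .Table.StorageDescriptor.Columns[] | "\(.Name):\(.Type)" ] | join(",")) } ]
    | sort_by(.v | tonumber)
    | . as $s
    | range(0; length) as $i
    # Always keep the oldest version as the baseline.
    | select($i == 0 or $s[$i].sig != $s[$i - 1].sig)
    | $s[$i]
    | [ .v, .t, .sig ]
    | @tsv
  '
}

# Usage (requires AWS credentials):
#   aws glue get-table-versions --database-name icebergdb --table-name flights_1m \
#     --region ap-northeast-1 | schema_change_versions
```

On a table with hundreds of versions created by data writes, this should reduce the list to the baseline plus the handful of versions where the schema really changed.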


Compare Differences Between Versions

To identify added and removed columns between adjacent versions, you can compare column arrays using the jq array difference (-) operator.

aws glue get-table-versions \
  --database-name icebergdb \
  --table-name flights_1m \
  --region ap-northeast-1 \
| jq '
    [ .TableVersions[] | {
        v: .VersionId,
        cols: [.Table.StorageDescriptor.Columns[].Name]
      }
    ]
    | sort_by(.v | tonumber)
    | . as $sorted
    | range(1; length)
    | {
        from: $sorted[.].v,
        added:   ($sorted[.].cols - $sorted[. - 1].cols),
        removed: ($sorted[. - 1].cols - $sorted[.].cols)
      }
  '

Example output:

{ "from": "3", "added": [],           "removed": ["arr_time"] }
{ "from": "4", "added": ["double"],   "removed": [] }
{ "from": "5", "added": ["arr_time"], "removed": ["double"] }

Each from value is the newer version of the pair being compared, so this shows that:

  • arr_time was removed between v2 and v3.
  • double was added between v3 and v4.
  • double was removed and arr_time was restored between v4 and v5.

For full transparency, double was not generated automatically by Glue.

It was actually a mistake I made during testing. I intended to add the arr_time column but accidentally entered the data type name double as the column name.

The issue was corrected in v5, but it serves as a useful demonstration that both mistakes and subsequent fixes are preserved in the version history.
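A name-based comparison like the one above does not notice a column whose type changed while its name stayed the same. As a complementary sketch, the hypothetical helper type_changes below compares the Type of each column between adjacent versions. It reads get-table-versions output on stdin and matches columns by name, so a renamed column is not reported here.

```shell
# Hypothetical helper: report columns whose Type changed between adjacent
# Glue table versions. Input: get-table-versions JSON on stdin.
type_changes() {
  jq -c '
    [ .TableVersions[]
      | { v: .VersionId,
          # {column_name: type} map for this version
          types: ([ .Table.StorageDescriptor.Columns[] | { key: .Name, value: .Type } ]
                  | from_entries) } ]
    | sort_by(.v | tonumber)
    | . as $s
    | range(1; length) as $i
    | $s[$i - 1].types as $old
    | $s[$i].types
    | to_entries[]
    # Column exists in both versions, but with a different type.
    | select($old[.key] != null and $old[.key] != .value)
    | { version: $s[$i].v, column: .key, old_type: $old[.key], new_type: .value }
  '
}

# Usage (requires AWS credentials):
#   aws glue get-table-versions --database-name icebergdb --table-name flights_1m \
#     --region ap-northeast-1 | type_changes
```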


Find Versions Containing a Specific Column

If you need to answer a question such as:

"Which versions contained this column?"

you can use select and any in jq. The comparison .Name == $col matches the column name exactly, whereas contains compares strings by substring (contains(["arr"]) would also match arr_time):

TARGET_COLUMN="COLUMN_NAME"

aws glue get-table-versions \
  --database-name icebergdb \
  --table-name flights_1m \
  --region ap-northeast-1 \
| jq --arg col "$TARGET_COLUMN" '
    .TableVersions[]
    | select(any(.Table.StorageDescriptor.Columns[]; .Name == $col))
    | {version: .VersionId, updated: .Table.UpdateTime}
  '

Example output:

{
  "version": "4",
  "updated": "2026-06-12T05:20:48+00:00"
}

With TARGET_COLUMN set to double, this confirms that the column existed only in version 4.
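When a column appears in many versions, a summary of where it was first and last seen is often easier to read than the full list. The sketch below uses a hypothetical helper named column_lifetime that reads get-table-versions output on stdin. Note that a column can disappear and come back, as arr_time did in this article, so first_seen and last_seen do not imply that it existed in every version in between.

```shell
# Hypothetical helper: summarise the first and last Glue table version
# that contain a given column (exact name match).
# Usage: <get-table-versions JSON on stdin> | column_lifetime COLUMN_NAME
column_lifetime() {
  local col="$1"
  jq -c --arg col "$col" '
    [ .TableVersions[]
      | select(any(.Table.StorageDescriptor.Columns[]; .Name == $col))
      | { v: (.VersionId | tonumber), t: .Table.UpdateTime } ]
    | sort_by(.v)
    | if length == 0 then empty   # column never appeared: print nothing
      else { column: $col, first_seen: .[0], last_seen: .[-1], versions: length }
      end
  '
}

# Usage (requires AWS credentials):
#   aws glue get-table-versions --database-name icebergdb --table-name flights_1m \
#     --region ap-northeast-1 | column_lifetime arr_time
```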


Identify Who Changed the Schema with CloudTrail

While get-table-versions tells you when a schema changed, it does not tell you who made the change.

To identify the responsible user or role, you can correlate the schema update time with CloudTrail UpdateTable events.

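Once you've identified the relevant timestamp, search CloudTrail in a short window around it. Note that lookup-events only covers management events from the last 90 days, so older changes require a trail that delivers logs to S3 or CloudTrail Lake: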

aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=UpdateTable \
  --start-time "2026-06-12T05:20:18+00:00" \
  --end-time "2026-06-12T05:21:18+00:00" \
  --region ap-northeast-1 \
| jq '.Events[] | {time: .EventTime, user: .Username, detail: .CloudTrailEvent | fromjson | .requestParameters}'

Example output:

{
  "time": "2026-06-12T05:21:18+00:00",
  "user": "XXXXX",
  "detail": {
    "catalogId": "123456789012",
    "databaseName": "icebergdb",
    "tableInput": {
      "name": "flights_1m"
    }
  }
}

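The user field shows the IAM user name or, for an assumed role, typically the role session name. The full identity, including the ARN, is in the userIdentity element of the CloudTrailEvent JSON.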

Additionally, detail.tableInput contains the updated table definition, allowing you to inspect the actual schema change directly from CloudTrail.

In many cases, reviewing UpdateTable events is sufficient.

However, depending on the tool or workflow being used, changes may also appear as:

  • CreateTable
  • BatchCreatePartition
  • Lake Formation-related events

If you cannot find the expected event, try expanding your search criteria.
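Since lookup-events filters on a single attribute, a search by event name returns events for every table in the account and region. The sketch below narrows the result to one table and can be reused across several event names. events_for_table is a hypothetical helper name, and it relies on the requestParameters layout shown in the example output above (databaseName and tableInput.name), which may differ for event types other than UpdateTable.

```shell
# Hypothetical helper: keep only CloudTrail events that target one Glue table.
# Input: JSON from `aws cloudtrail lookup-events` on stdin.
# Usage: ... | events_for_table DATABASE_NAME TABLE_NAME
events_for_table() {
  local db="$1" tbl="$2"
  jq -c --arg db "$db" --arg tbl "$tbl" '
    .Events[]
    # CloudTrailEvent is a JSON document encoded as a string.
    | (.CloudTrailEvent | fromjson) as $e
    | select($e.requestParameters.databaseName == $db
             and $e.requestParameters.tableInput.name == $tbl)
    | { time: .EventTime, event: .EventName, user: .Username, arn: $e.userIdentity.arn }
  '
}

# Usage (requires AWS credentials), trying more than one event name:
#   for name in UpdateTable CreateTable; do
#     aws cloudtrail lookup-events \
#       --lookup-attributes AttributeKey=EventName,AttributeValue="$name" \
#       --start-time "2026-06-12T05:20:18+00:00" \
#       --end-time "2026-06-12T05:21:18+00:00" \
#       --region ap-northeast-1 \
#     | events_for_table icebergdb flights_1m
#   done
```

If an event type stores the table name somewhere else in requestParameters, it is silently filtered out, so inspect one raw event of that type before relying on the filter.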


Conclusion

In this article, we explored how to use aws glue get-table-versions to track schema changes in AWS Glue Data Catalog.

With this approach, you can:

  • Review schema history chronologically
  • Compare column additions and removals between versions
  • Identify which versions contained specific columns
  • Determine who made a change by correlating with CloudTrail

Glue Data Catalog is often viewed simply as a metadata catalog for current table definitions.

However, by leveraging Table Versions, it can also serve as a lightweight audit mechanism.

Because schema evolution is a fundamental feature of Apache Iceberg, understanding how to answer questions such as:

  • When did the schema change?
  • What changed?
  • Who changed it?

can be extremely valuable during troubleshooting and day-to-day operations.

I hope this article helps anyone managing Apache Iceberg tables with AWS Glue Data Catalog.
