Original Japanese article: AWS Lambda × PyIceberg のカタログアクセスパターン比較 (Comparing Catalog Access Patterns for AWS Lambda × PyIceberg)
Introduction
Apache Iceberg is a reliable table format for data lakes, providing ACID transactions, schema evolution, and time travel capabilities. Previously, we demonstrated how to write to Iceberg tables directly from Lambda using GlueCatalog: https://zenn.dev/penginpenguin/articles/77d4a9b1e90e3a
https://dev.to/aws-builders/lightweight-etl-with-aws-lambda-duckdb-and-pyiceberg-1l5p
In this article, we will compare two approaches to catalog access from Lambda using PyIceberg:
- GlueCatalog: Access via AWS Glue Data Catalog
- REST Catalog: Access via the AWS Glue Iceberg REST endpoint (the standardized Iceberg REST interface)
Two Access Patterns Overview
When building ETL workloads on AWS Lambda that write to Apache Iceberg tables, two primary patterns stand out:
Pattern 1: GlueCatalog
Process data in Lambda using DuckDB and PyArrow, then use PyIceberg with GlueCatalog to write to Iceberg via the AWS SDK and Glue API.
Key characteristics:
- Uses the GlueCatalog type in PyIceberg
- Direct access via AWS SDK → Glue API calls
- AWS-native integration approach
Sample Code (Pattern 1: GlueCatalog)
import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog


def lambda_handler(event, context):
    try:
        # In-memory DuckDB; Lambda can only write under /tmp, so extensions go there.
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        # Locate the Parquet file that triggered this invocation.
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        # Filter the input with DuckDB and return the result as an Arrow table.
        query = f"SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1"
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        # Pattern 1: reach the catalog through the AWS SDK and the Glue API.
        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")
        namespace = "icebergdb"
        table_name = "yellow_tripdata"
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")
        iceberg_table.append(result_arrow_table)
        print("Successfully appended data to Iceberg table in S3.")
    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so Lambda records the invocation as failed instead of succeeded.
        raise
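Both handlers in this article build the input path directly from the event record. S3 event notifications deliver object keys URL-encoded, so a key containing spaces or special characters would not match the real object. The following is a minimal sketch of decoding the key first; the helper name `s3_path_from_event` is hypothetical and is not part of the original samples.

```python
from urllib.parse import unquote_plus


def s3_path_from_event(event: dict) -> str:
    """Build an s3:// URI from the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded (a space is sent as "+"), so decode them.
    key = unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"
```

Inside `lambda_handler`, the three lines that assemble `s3_input_path` could then be replaced with `s3_input_path = s3_path_from_event(event)`.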
Pattern 2: REST Catalog (via AWS Glue Iceberg REST endpoint)
This pattern uses the Iceberg REST Catalog, the Iceberg-standard REST interface exposed by AWS Glue, so PyIceberg performs table operations through REST.
Key characteristics:
- Uses the REST Catalog in PyIceberg
- Communicates via the Iceberg-standard REST API
- Aligns with open Iceberg specifications and offers greater cross-platform portability
Sample Code (Pattern 2: REST Catalog)
import boto3
import duckdb
import pyarrow as pa
from pyiceberg.catalog import load_catalog


def lambda_handler(event, context):
    try:
        # In-memory DuckDB; Lambda can only write under /tmp, so extensions go there.
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        # Locate the Parquet file that triggered this invocation.
        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        # Filter the input with DuckDB and return the result as an Arrow table.
        query = f"SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1"
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        # Read the region from the client's public metadata, not a private attribute.
        sts = boto3.client('sts')
        region = sts.meta.region_name

        # Pattern 2: reach the catalog through the Glue Iceberg REST endpoint,
        # signing each request with SigV4 for the "glue" service.
        catalog_properties = {
            "type": "rest",
            "uri": f"https://glue.{region}.amazonaws.com/iceberg",
            "s3.region": region,
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "glue",
            "rest.signing-region": region
        }
        catalog = load_catalog(**catalog_properties)
        database_name = "icebergdb"
        table_name = "yellow_tripdata"
        table = catalog.load_table(f"{database_name}.{table_name}")
        table.append(result_arrow_table)
        print("Successfully appended data to Iceberg table in S3 via REST Catalog.")
    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so Lambda records the invocation as failed instead of succeeded.
        raise
Commonalities
Both approaches share the following traits:
- Use PyIceberg for catalog operations
- Retrieve Iceberg tables via load_table() and append data with append()
- Output Iceberg-conformant artifacts (manifests, data files, snapshots) to S3 in an open table format (OTF)
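Because both patterns end in the same `load_table()` and `append()` calls, the write step can be kept independent of how the catalog was created. The sketch below illustrates that idea under stated assumptions: `SupportsLoadTable` and `append_to_iceberg` are hypothetical names introduced here, and the only behavior assumed of the catalog is the `load_table()` call used in both samples.

```python
from typing import Any, Protocol


class SupportsLoadTable(Protocol):
    """Minimal shape shared by the catalogs used in both patterns."""

    def load_table(self, identifier: str) -> Any: ...


def append_to_iceberg(
    catalog: SupportsLoadTable, namespace: str, table_name: str, arrow_table: Any
) -> None:
    """Load the table from whichever catalog was passed in and append the data."""
    table = catalog.load_table(f"{namespace}.{table_name}")
    table.append(arrow_table)
```

With this split, a handler only decides which catalog object to build; the call `append_to_iceberg(catalog, "icebergdb", "yellow_tripdata", result_arrow_table)` stays the same for Pattern 1 and Pattern 2.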
Key Differences
Feature | GlueCatalog (Pattern 1) | REST Catalog (Pattern 2) |
---|---|---|
Catalog Type | GlueCatalog | REST Catalog |
Access Method | AWS SDK → Glue API | Iceberg-standard REST endpoint via AWS Glue |
API Alignment | AWS-specific integration | Open standard conforming to Apache Iceberg REST spec |
Insight: Pattern 2 adheres to the standardized Iceberg REST API, while Pattern 1 sticks with AWS-specific Glue integration. AWS documentation often favors the REST endpoint pattern. In many Lambda workloads, the practical difference may be negligible.
When to Choose Which?
- Choose GlueCatalog if:
  - You prefer tight, native AWS integration and simplicity
- Choose REST Catalog if:
  - You value adherence to open standards and cross-platform compatibility
  - You want better alignment with the general Iceberg ecosystem and AWS documentation trends
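If the choice may change later, one option is to keep it in configuration rather than in the handler. The following is a rough sketch, not a definitive implementation: the function name `catalog_properties` and the environment variable `ICEBERG_CATALOG_TYPE` are hypothetical, and the `glue.region` property assumes a PyIceberg release that recognizes it (the Pattern 1 sample passes `region_name` instead). The REST properties are copied from the Pattern 2 sample.

```python
import os
from typing import Dict, Optional


def catalog_properties(region: str, catalog_type: Optional[str] = None) -> Dict[str, str]:
    """Return load_catalog() properties for the chosen access pattern."""
    # ICEBERG_CATALOG_TYPE is a hypothetical variable name, not a PyIceberg setting.
    kind = (catalog_type or os.environ.get("ICEBERG_CATALOG_TYPE", "glue")).lower()
    if kind == "glue":
        # Pattern 1: PyIceberg's Glue catalog, which calls the Glue API via the AWS SDK.
        return {"type": "glue", "glue.region": region}
    if kind == "rest":
        # Pattern 2: same properties as the REST Catalog sample in this article.
        return {
            "type": "rest",
            "uri": f"https://glue.{region}.amazonaws.com/iceberg",
            "s3.region": region,
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "glue",
            "rest.signing-region": region,
        }
    raise ValueError(f"Unsupported catalog type: {kind!r}")
```

A handler could then call `load_catalog("my_catalog", **catalog_properties(region))` and switch patterns by changing one environment variable, leaving the `load_table()` and `append()` code untouched.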
Summary
To recap, two main patterns exist for accessing Apache Iceberg tables from AWS Lambda using PyIceberg:
- GlueCatalog: AWS-native access via the AWS SDK and Glue API
- REST Catalog: Iceberg-standard REST API via AWS Glue Iceberg REST endpoint
Both patterns enable writing to Iceberg tables effectively. The right choice depends on whether you prioritize AWS integration or open-standard interoperability.
I hope this comparison helps you plan effective, lightweight ETL workflows with PyIceberg and Lambda!