Aki for AWS Community Builders

Posted on Aug 13 • Edited on Aug 19

PyIceberg on AWS Lambda: Comparing GlueCatalog and REST Catalog Access Methods

#aws #iceberg #dataengineering

Original Japanese article: AWS Lambda × PyIceberg のカタログアクセスパターン比較

Introduction

Apache Iceberg is a reliable table format for data lakes, providing ACID transactions, schema evolution, and time travel capabilities. Previously, we demonstrated how to write to Iceberg tables directly from Lambda using GlueCatalog:

https://zenn.dev/penginpenguin/articles/77d4a9b1e90e3a
https://dev.to/aws-builders/lightweight-etl-with-aws-lambda-duckdb-and-pyiceberg-1l5p

In this article, we will compare two approaches to catalog access from Lambda using PyIceberg:

GlueCatalog: Access via the AWS Glue Data Catalog
REST Catalog: Access via the AWS Glue Iceberg REST endpoint (the standardized Iceberg REST interface)

Two Access Patterns Overview

When building ETL workloads on AWS Lambda that write to Apache Iceberg tables, two primary patterns stand out:

Pattern 1: GlueCatalog

Process data in Lambda using DuckDB and PyArrow, then use PyIceberg with GlueCatalog to write to Iceberg via the AWS SDK and Glue API.

Key characteristics:

Uses GlueCatalog type in PyIceberg
Direct access via AWS SDK → Glue API calls
AWS-native integration approach

Sample Code (Pattern 1: GlueCatalog)

import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog

def lambda_handler(event, context):
    try:
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        query = f"SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1"
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        # GlueCatalog takes the catalog name first; the region is passed as the "glue.region" property
        catalog = GlueCatalog("my_catalog", **{"glue.region": "ap-northeast-1"})
        namespace = "icebergdb"
        table_name = "yellow_tripdata"
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")
        iceberg_table.append(result_arrow_table)

        print("Successfully appended data to Iceberg table in S3.")
    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so the Lambda invocation is reported as failed instead of succeeding silently
        raise
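One detail the handler above glosses over: S3 event notifications deliver the object key URL-encoded, so a key containing spaces or special characters will not match the real object unless it is decoded first. The following is a minimal sketch of that decoding step, using only the standard library. The helper name `s3_path_from_event` and the sample bucket and key are hypothetical and not part of the original code.

```python
from urllib.parse import unquote_plus


def s3_path_from_event(event: dict) -> str:
    """Build an s3:// URI from the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in the event (a space is sent as '+').
    key = unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"


# Hypothetical event payload, trimmed to the fields the handler reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-bucket"},
                "object": {"key": "raw/yellow+tripdata%3D2024.parquet"},
            }
        }
    ]
}

print(s3_path_from_event(sample_event))
# prints: s3://my-bucket/raw/yellow tripdata=2024.parquet
```

In the handler, the returned path could replace the hand-built `s3_input_path` before it is passed to `read_parquet`.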

Pattern 2: REST Catalog (via AWS Glue Iceberg REST endpoint)

This pattern uses the Iceberg REST Catalog interface exposed by AWS Glue, so PyIceberg performs table operations through the standard Iceberg REST API.

Key characteristics:

Uses REST Catalog in PyIceberg
Communicates via Iceberg-standard REST API
Aligns with open Iceberg specifications and offers greater cross-platform portability

Sample Code (Pattern 2: REST Catalog)

import boto3
import duckdb
import pyarrow as pa
from pyiceberg.catalog import load_catalog

def lambda_handler(event, context):
    try:
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        query = f"SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1"
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

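        # Resolve the region from the default boto3 session (Lambda sets AWS_REGION)
        # instead of reading a private attribute of an STS client
        region = boto3.session.Session().region_name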
        catalog_properties = {
            "type": "rest",
            "uri": f"https://glue.{region}.amazonaws.com/iceberg",
            "s3.region": region,
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "glue",
            "rest.signing-region": region
        }

        catalog = load_catalog(**catalog_properties)
        database_name = "icebergdb"
        table_name = "yellow_tripdata"
        table = catalog.load_table(f"{database_name}.{table_name}")
        table.append(result_arrow_table)

        print("Successfully appended data to Iceberg table in S3 via REST Catalog.")
    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so the Lambda invocation is reported as failed instead of succeeding silently
        raise
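The REST catalog settings can be pulled out of the handler into a small function, which makes them easier to reuse and to check without touching AWS. This is only a sketch: the function name `glue_rest_catalog_properties` is hypothetical, and the optional `warehouse` property is an assumption that does not appear in the sample above (some setups pass the AWS account ID there). The other property keys are the ones used in the sample code.

```python
import os
from typing import Optional


def glue_rest_catalog_properties(region: str, warehouse: Optional[str] = None) -> dict:
    """Return PyIceberg properties for the AWS Glue Iceberg REST endpoint."""
    props = {
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "s3.region": region,
        # Sign REST requests with SigV4 for the Glue service.
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }
    if warehouse is not None:
        # Assumption: only needed if your setup requires a warehouse value.
        props["warehouse"] = warehouse
    return props


# Lambda exposes the function's region as AWS_REGION; the fallback is for local runs.
region = os.environ.get("AWS_REGION", "ap-northeast-1")
props = glue_rest_catalog_properties(region)
print(props["uri"])
```

The resulting dictionary can be passed to PyIceberg in the same way as in the handler, for example `load_catalog("glue_rest", **props)`, where the catalog name is arbitrary.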

Commonalities

Both approaches share the following traits:

Use PyIceberg for catalog operations
Retrieve Iceberg tables via load_table() and append data with append()
Write Iceberg-conformant artifacts (manifests, data files, snapshots) to S3 in the open table format (OTF)

Key Differences

| Feature | GlueCatalog (Pattern 1) | REST Catalog (Pattern 2) |
| --- | --- | --- |
| Catalog Type | `GlueCatalog` | `REST Catalog` |
| Access Method | AWS SDK → Glue API | Iceberg-standard REST endpoint via AWS Glue |
| API Alignment | AWS-specific integration | Open standard conforming to Apache Iceberg REST spec |

Insight: Pattern 2 adheres to the standardized Iceberg REST API, while Pattern 1 sticks with AWS-specific Glue integration. AWS documentation often favors the REST endpoint pattern. In many Lambda workloads, the practical difference may be negligible.

When to Choose Which?

Choose GlueCatalog if:
- You prefer tight, native AWS integration and simplicity

Choose REST Catalog if:
- You value adherence to open standards and cross-platform compatibility
- You want better alignment with the general Iceberg ecosystem and AWS documentation trends
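If you would rather defer this decision, one option is to select the catalog type through configuration so the same Lambda code can run with either pattern. The sketch below is one possible way to do that, not something taken from the original samples: the function `catalog_properties` and the environment variable `ICEBERG_CATALOG_KIND` are hypothetical names. The `glue` branch assumes PyIceberg's `type: glue` catalog with the `glue.region` property, and the `rest` branch reuses the keys from the Pattern 2 sample.

```python
import os


def catalog_properties(kind: str, region: str) -> dict:
    """Return PyIceberg catalog properties for the chosen access pattern."""
    if kind == "glue":
        # Pattern 1: AWS SDK -> Glue API
        return {"type": "glue", "glue.region": region}
    if kind == "rest":
        # Pattern 2: AWS Glue Iceberg REST endpoint with SigV4 signing
        return {
            "type": "rest",
            "uri": f"https://glue.{region}.amazonaws.com/iceberg",
            "s3.region": region,
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "glue",
            "rest.signing-region": region,
        }
    raise ValueError(f"Unknown catalog kind: {kind!r} (expected 'glue' or 'rest')")


# Hypothetical environment variable; defaults to the GlueCatalog pattern.
kind = os.environ.get("ICEBERG_CATALOG_KIND", "glue")
region = os.environ.get("AWS_REGION", "ap-northeast-1")
print(catalog_properties(kind, region)["type"])
```

Either dictionary can then be handed to `load_catalog("my_catalog", **properties)`, leaving the `load_table()` and `append()` calls unchanged.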

Summary

To recap, two main patterns exist for accessing Apache Iceberg tables from AWS Lambda using PyIceberg:

GlueCatalog: AWS-native access via the AWS SDK and Glue API
REST Catalog: Iceberg-standard REST API via the AWS Glue Iceberg REST endpoint

Both patterns enable writing to Iceberg tables effectively. The right choice depends on whether you prioritize AWS integration or open-standard interoperability.

I hope this comparison helps you plan effective, lightweight ETL workflows with PyIceberg and Lambda!
