
Aki for AWS Community Builders


PyIceberg on AWS Lambda: Comparing GlueCatalog and REST Catalog Access Methods

Original Japanese article: AWS Lambda × PyIceberg のカタログアクセスパターン比較

Introduction

Apache Iceberg is a reliable table format for data lakes, providing ACID transactions, schema evolution, and time travel capabilities. Previously, we demonstrated how to write to Iceberg tables directly from Lambda using GlueCatalog:

https://zenn.dev/penginpenguin/articles/77d4a9b1e90e3a
https://dev.to/aws-builders/lightweight-etl-with-aws-lambda-duckdb-and-pyiceberg-1l5p

In this article, we will compare two approaches to catalog access from Lambda using PyIceberg:

  1. GlueCatalog: Access via AWS Glue Data Catalog
  2. REST Catalog: Access via the AWS Glue Iceberg REST endpoint (the standardized Iceberg REST interface)

Two Access Patterns Overview

When building ETL workloads on AWS Lambda that write to Apache Iceberg tables, two primary patterns stand out:

Pattern 1: GlueCatalog

Process data in Lambda using DuckDB and PyArrow, then use PyIceberg with GlueCatalog to write to Iceberg via the AWS SDK and Glue API.

Key characteristics:

  • Uses GlueCatalog type in PyIceberg
  • Direct access via AWS SDK → Glue API calls
  • AWS-native integration approach

Sample Code (Pattern 1: GlueCatalog)

import duckdb
import pyarrow as pa
from pyiceberg.catalog.glue import GlueCatalog

def lambda_handler(event, context):
    try:
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        query = f"SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1"
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        catalog = GlueCatalog(region_name="ap-northeast-1", database="icebergdb", name="my_catalog")
        namespace = "icebergdb"
        table_name = "yellow_tripdata"
        iceberg_table = catalog.load_table(f"{namespace}.{table_name}")
        iceberg_table.append(result_arrow_table)

        print("Successfully appended data to Iceberg table in S3.")
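    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so Lambda records the invocation as failed
        # instead of silently reporting success.
        raise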

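One detail the sample glosses over: object keys in S3 event notifications arrive URL-encoded, so a key containing spaces or "=" (common in partitioned paths) will not match the real object unless it is decoded first. The following is a minimal sketch of that step; the helper name s3_path_from_event and the sample event are hypothetical and exist only for illustration.

```python
from urllib.parse import unquote_plus


def s3_path_from_event(event: dict) -> str:
    """Build an s3:// URI from the first record of an S3 event notification."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Keys are URL-encoded in the event: a space arrives as "+", "=" as "%3D".
    key = unquote_plus(record["object"]["key"])
    return f"s3://{bucket}/{key}"


# Hypothetical event, trimmed to the fields the handler actually reads.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "input/year%3D2024/yellow+tripdata.parquet"},
            }
        }
    ]
}

print(s3_path_from_event(sample_event))
```

In the handler, the returned path could stand in for s3_input_path before the DuckDB query is built.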

Pattern 2: REST Catalog (via AWS Glue Iceberg REST endpoint)

This pattern uses the Iceberg REST Catalog, the Iceberg-standard REST interface exposed by AWS Glue, so PyIceberg performs table operations through REST calls.

Key characteristics:

  • Uses REST Catalog in PyIceberg
  • Communicates via Iceberg-standard REST API
  • Aligns with open Iceberg specifications and offers greater cross-platform portability

Sample Code (Pattern 2: REST Catalog)

import boto3
import duckdb
import pyarrow as pa
from pyiceberg.catalog import load_catalog

def lambda_handler(event, context):
    try:
        duckdb_connection = duckdb.connect(database=':memory:')
        duckdb_connection.execute("SET home_directory='/tmp'")
        duckdb_connection.execute("INSTALL httpfs;")
        duckdb_connection.execute("LOAD httpfs;")

        s3_bucket = event['Records'][0]['s3']['bucket']['name']
        s3_object_key = event['Records'][0]['s3']['object']['key']
        s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

        query = f"SELECT * FROM read_parquet('{s3_input_path}') WHERE VendorID = 1"
        result_arrow_table = duckdb_connection.execute(query).fetch_arrow_table()

        # Resolve the region through boto3's public session API
        # rather than a private attribute of an STS client.
        region = boto3.Session().region_name
        catalog_properties = {
            "type": "rest",
            "uri": f"https://glue.{region}.amazonaws.com/iceberg",
            "s3.region": region,
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "glue",
            "rest.signing-region": region
        }

        catalog = load_catalog(**catalog_properties)
        database_name = "icebergdb"
        table_name = "yellow_tripdata"
        table = catalog.load_table(f"{database_name}.{table_name}")
        table.append(result_arrow_table)

        print("Successfully appended data to Iceberg table in S3 via REST Catalog.")
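    except Exception as e:
        print(f"An error occurred: {e}")
        # Re-raise so Lambda records the invocation as failed
        # instead of silently reporting success.
        raise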


Commonalities

Both approaches share the following traits:

  • Use PyIceberg for catalog operations
  • Retrieve Iceberg tables via load_table() and append data with append()
  • Write Iceberg-conformant artifacts (manifests, data files, and snapshots) to S3 in an open table format (OTF)
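Because both patterns meet at load_table() and append(), the write step can be factored out so that only catalog construction differs. This is a rough sketch, not the author's code: append_arrow_table is a hypothetical helper, and it assumes only that the catalog object exposes load_table() and the returned table exposes append(), as in both samples.

```python
from typing import Any


def append_arrow_table(catalog: Any, namespace: str, table_name: str, arrow_table: Any) -> str:
    """Append an Arrow table to an Iceberg table and return the identifier used.

    `catalog` can be a GlueCatalog or a REST catalog from load_catalog();
    this function relies only on load_table() and append().
    """
    identifier = f"{namespace}.{table_name}"
    table = catalog.load_table(identifier)
    table.append(arrow_table)
    return identifier
```

Either handler could then call append_arrow_table(catalog, "icebergdb", "yellow_tripdata", result_arrow_table) once its catalog has been created.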

Key Differences

Feature        | GlueCatalog (Pattern 1)  | REST Catalog (Pattern 2)
---------------|--------------------------|----------------------------------------------------
Catalog Type   | GlueCatalog              | REST Catalog
Access Method  | AWS SDK → Glue API       | Iceberg-standard REST endpoint via AWS Glue
API Alignment  | AWS-specific integration | Open standard conforming to Apache Iceberg REST spec

Insight: Pattern 2 adheres to the standardized Iceberg REST API, while Pattern 1 sticks with AWS-specific Glue integration. AWS documentation often favors the REST endpoint pattern. In many Lambda workloads, the practical difference may be negligible.


When to Choose Which?

  • Choose GlueCatalog if:
    • You prefer tight, native AWS integration and simplicity
  • Choose REST Catalog if:
    • You value adherence to open standards and cross-platform compatibility
    • You want better alignment with the general Iceberg ecosystem and AWS documentation trends
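If the choice may change later, one option is to keep it in configuration rather than in code. The sketch below builds PyIceberg catalog properties for either pattern from an environment variable. ICEBERG_CATALOG_KIND is a hypothetical variable name, the fallback region is only an example, and the "glue.region" property is assumed to be supported by the installed PyIceberg version; the REST properties mirror the Pattern 2 sample.

```python
import os
from typing import Mapping


def catalog_properties(kind: str, region: str) -> dict:
    """Return PyIceberg catalog properties for the 'glue' or 'rest' pattern."""
    if kind == "glue":
        # Assumption: "glue.region" is honored by the installed PyIceberg version.
        return {"type": "glue", "glue.region": region}
    if kind == "rest":
        # Same properties as the Pattern 2 sample.
        return {
            "type": "rest",
            "uri": f"https://glue.{region}.amazonaws.com/iceberg",
            "s3.region": region,
            "rest.sigv4-enabled": "true",
            "rest.signing-name": "glue",
            "rest.signing-region": region,
        }
    raise ValueError(f"Unsupported catalog kind: {kind!r}")


def properties_from_env(env: Mapping[str, str] = os.environ) -> dict:
    """Pick the pattern from the environment, defaulting to the REST catalog."""
    # ICEBERG_CATALOG_KIND is a hypothetical name; AWS_REGION is set by the Lambda runtime.
    kind = env.get("ICEBERG_CATALOG_KIND", "rest")
    region = env.get("AWS_REGION", "ap-northeast-1")
    return catalog_properties(kind, region)
```

In a handler, the result would be passed along as load_catalog("my_catalog", **properties_from_env()), leaving the rest of the write path unchanged.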

Summary

To recap, two main patterns exist for accessing Apache Iceberg tables from AWS Lambda using PyIceberg:

  1. GlueCatalog: AWS-native access via the AWS SDK and Glue API
  2. REST Catalog: Iceberg-standard REST API via AWS Glue Iceberg REST endpoint

Both patterns enable writing to Iceberg tables effectively. The right choice depends on whether you prioritize AWS integration or open-standard interoperability.

I hope this comparison helps you plan effective, lightweight ETL workflows with PyIceberg and Lambda!

