Organizing Architecture Patterns for Triggering AWS Glue Python Shell with S3 Events

Original Japanese article: S3トリガーで起動するAWS Glue Python Shellの構成パターン整理

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In my previous article, I introduced a lightweight ETL approach using AWS Glue Python Shell.
In that article, I also briefly touched on how to trigger Glue Python Shell jobs.

Previous article:
Lightweight ETL with AWS Glue Python Shell, chDB, and PyIceberg (Compared with DuckDB)

In this article, I organize and explain three representative patterns for triggering Glue Python Shell jobs using S3 object creation events.

At first glance, these architectures may look similar. However, they differ significantly in terms of:

  • where control logic is placed,
  • how retries are handled, and
  • where workflow and orchestration are managed.

I’ll walk through these differences and typical use cases for each pattern.


Trigger Patterns

The three trigger patterns covered in this article are:

  1. EventBridge + Glue Workflow
  2. Lambda + start_job_run
  3. Lambda + start_workflow_run

All three patterns use S3 object creation as the trigger and eventually start a Glue Python Shell job.

There are, of course, many other possible architectures—for example:

S3 → EventBridge → Step Functions → Glue Python Shell

However, in this article, I’ll focus on these three relatively simple and commonly used patterns.


Pattern 1: EventBridge + Glue Workflow

In this pattern, S3 object creation events are captured by Amazon EventBridge, which then triggers a Glue Workflow.
Inside the workflow, a Glue Python Shell job is executed.

Key Characteristics

  • No programming required to filter events or map input parameters. By using EventBridge input paths and templates, you can pass the S3 bucket and key directly to Glue Workflow Run Properties.
  • Retry policies and dead-letter queues can be configured at the Glue Workflow level.
  • Workflow-based orchestration makes it easy to add pre-processing or post-processing steps later.

Glue Python Shell Sample Code

This sample is the same as in the previous article.
The key point is using getResolvedOptions to retrieve the input parameter s3_input (the S3 file path).

By passing parameters in this way, we can implement event-driven processing triggered by file uploads.

import boto3
import sys
import os
from awsglue.utils import getResolvedOptions


def get_job_parameters():
    try:
        required_args = ['s3_input']
        args = getResolvedOptions(sys.argv, required_args)

        s3_file_path = args['s3_input']
        print(f"s3_input: {s3_file_path}")

        return s3_file_path

    except Exception as e:
        print(f"parameters error: {e}")
        raise


def _to_pyarrow_table(result):
    """
    Compatibility helper to extract a pyarrow.Table from a chDB query_result.
    """
    import chdb

    if hasattr(chdb, "to_arrowTable"):
        return chdb.to_arrowTable(result)

    if hasattr(result, "to_pyarrow"):
        return result.to_pyarrow()
    if hasattr(result, "to_arrow"):
        return result.to_arrow()

    raise RuntimeError(
        "Cannot convert chdb query_result to pyarrow.Table. "
        f"Available attributes: {sorted(dir(result))[:200]}"
    )


def normalize_arrow_for_iceberg(table):
    """
    Normalize Arrow schema for Iceberg compatibility.
    - timestamptz -> timestamp
    - binary -> string
    """
    import pyarrow as pa

    new_fields = []
    new_columns = []

    for field, column in zip(table.schema, table.columns):
        # timestamp with timezone -> timestamp
        if pa.types.is_timestamp(field.type) and field.type.tz is not None:
            new_type = pa.timestamp(field.type.unit)
            new_fields.append(pa.field(field.name, new_type, field.nullable))
            new_columns.append(column.cast(new_type))
        else:
            new_fields.append(field)
            new_columns.append(column)

    new_schema = pa.schema(new_fields)
    return pa.Table.from_arrays(new_columns, schema=new_schema)


def read_parquet_with_chdb(s3_input):
    """
    Read Parquet file from S3 using chDB.
    """
    import chdb

    if s3_input.startswith("s3://"):
        bucket, key = s3_input.replace("s3://", "").split("/", 1)
        s3_url = f"https://{bucket}.s3.ap-northeast-1.amazonaws.com/{key}"
    else:
        s3_url = s3_input

    print(f"Reading data from S3: {s3_url}")

    query = f"""
        SELECT *
        FROM s3('{s3_url}', 'Parquet')
        WHERE VendorID = 1
    """

    result = chdb.query(query, "Arrow")
    arrow_table = _to_pyarrow_table(result)

    print("Original schema:")
    print(arrow_table.schema)

    # Normalize schema for Iceberg compatibility
    arrow_table = normalize_arrow_for_iceberg(arrow_table)

    print("Normalized schema:")
    print(arrow_table.schema)
    print(f"Rows: {arrow_table.num_rows:,}")

    return arrow_table


def write_iceberg_table(arrow_table):
    """
    Write Arrow table to Iceberg table using PyIceberg.
    """
    try:
        print("Writing started...")

        from pyiceberg.catalog import load_catalog

        catalog_config = {
            "type": "glue",
            "warehouse": "s3://your-bucket/your-warehouse/",  # Adjust to your environment.
            "region": "ap-northeast-1",
        }

        catalog = load_catalog("glue_catalog", **catalog_config)
        table_identifier = "icebergdb.yellow_tripdata"

        table = catalog.load_table(table_identifier)

        print(f"Target data to write: {arrow_table.num_rows:,} rows")
        table.append(arrow_table)

        return True

    except Exception as e:
        print(f"Writing error: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    try:
        import chdb
        import pyiceberg

        # Read input parameter
        s3_input = get_job_parameters()

        # Read data with chDB
        arrow_tbl = read_parquet_with_chdb(s3_input)
        print(f"Data read success: {arrow_tbl.num_rows:,} rows")

        # Write to Iceberg table
        if write_iceberg_table(arrow_tbl):
            print("\nWriting fully successful!")
        else:
            print("Writing failed")

    except Exception as e:
        print(f"Main error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()

Configuration Steps

1. Create a Glue Python Shell job

Add a job parameter to receive the S3 file path.
The job uses the sample code shown above.
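
For reference, here is a minimal boto3 sketch of creating the job with a --s3_input default argument. The job name, role, and script location are placeholders for your environment; the same setup can of course be done in the console.

import boto3

glue = boto3.client("glue")

# Hypothetical names and paths; adjust to your environment.
glue.create_job(
    Name="YOUR_JOB_NAME",
    Role="YOUR_GLUE_JOB_ROLE_ARN",
    Command={
        "Name": "pythonshell",                  # Python Shell job type
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://your-bucket/scripts/glue_python_shell_job.py",
    },
    DefaultArguments={
        "--s3_input": ""                        # placeholder, overridden per run
    },
    MaxCapacity=0.0625,                         # 1/16 DPU, the smallest Python Shell size
)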


2. Create a Glue Workflow


Configure an event trigger.


Assign the Glue Python Shell job.
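
If you prefer to script this part, the sketch below creates the workflow and an event trigger that runs the job when EventBridge notifies the workflow. The names are placeholders, and the console achieves the same result.

import boto3

glue = boto3.client("glue")

# Hypothetical names; adjust to your environment.
glue.create_workflow(Name="YOUR_WORKFLOW_NAME")

# An EVENT trigger starts the workflow when EventBridge delivers a matching event.
glue.create_trigger(
    Name="YOUR_EVENT_TRIGGER_NAME",
    WorkflowName="YOUR_WORKFLOW_NAME",
    Type="EVENT",
    Actions=[{"JobName": "YOUR_JOB_NAME"}],
    EventBatchingCondition={"BatchSize": 1},  # start the run after a single event
)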


Add Run Properties to receive the S3 file path.
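
If the S3 path ends up only in the Workflow Run Properties (rather than in the job arguments), the job can also read it through the Glue API. This is the documented pattern for reading run properties from inside a job started by a workflow; here I assume the property is named s3_input, matching the input transformer template shown later.

import sys
import boto3
from awsglue.utils import getResolvedOptions

# Glue passes these two arguments to every job started by a workflow.
args = getResolvedOptions(sys.argv, ["WORKFLOW_NAME", "WORKFLOW_RUN_ID"])

glue = boto3.client("glue")
run_properties = glue.get_workflow_run_properties(
    Name=args["WORKFLOW_NAME"],
    RunId=args["WORKFLOW_RUN_ID"],
)["RunProperties"]

s3_input = run_properties.get("s3_input")
print(f"s3_input from run properties: {s3_input}")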


3. Create an EventBridge rule

Set the event source to S3 Object Created, and use the following event pattern:

{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["YOUR_BUCKET_NAME"]
    }
  }
}


Set the target to the Glue Workflow.


Select Input transformer as the transformation method.

Input paths:

{
  "s3_bucket": "$.detail.bucket.name",
  "s3_key": "$.detail.object.key"
}

Template (pass parameters as s3_input):

{
  "s3_input": "s3://<s3_bucket>/<s3_key>",
  "s3_key": "<s3_key>"
}
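
For reference, here is the same rule and input transformer expressed with boto3. The rule name, workflow ARN, account ID, and EventBridge role ARN are placeholders; the role needs permission to notify the Glue workflow.

import json
import boto3

events = boto3.client("events")

# Hypothetical names and ARNs; adjust to your environment.
events.put_rule(
    Name="s3-to-glue-workflow",
    EventPattern=json.dumps({
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": ["YOUR_BUCKET_NAME"]}},
    }),
    State="ENABLED",
)

events.put_targets(
    Rule="s3-to-glue-workflow",
    Targets=[{
        "Id": "glue-workflow",
        "Arn": "arn:aws:glue:ap-northeast-1:123456789012:workflow/YOUR_WORKFLOW_NAME",
        "RoleArn": "YOUR_EVENTBRIDGE_ROLE_ARN",
        "InputTransformer": {
            "InputPathsMap": {
                "s3_bucket": "$.detail.bucket.name",
                "s3_key": "$.detail.object.key",
            },
            "InputTemplate": '{"s3_input": "s3://<s3_bucket>/<s3_key>", "s3_key": "<s3_key>"}',
        },
    }],
)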

4. Enable EventBridge notifications on the S3 bucket
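
This is a checkbox in the bucket's Properties tab, or a single API call. A minimal sketch, assuming the bucket has no other notification configuration you need to preserve (the call replaces the existing configuration):

import boto3

s3 = boto3.client("s3")

# Caution: this replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket="YOUR_BUCKET_NAME",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)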


Pattern 2: Lambda + start_job_run

In this pattern, S3 events trigger a Lambda function, and the Lambda function directly starts a Glue Python Shell job using start_job_run.

This was the architecture used in my previous article.

Key Characteristics

  • High flexibility, thanks to Lambda’s ability to integrate with many AWS services.
  • Retry logic must be carefully designed. While Glue Python Shell supports job retries, trigger-level retries depend on Lambda configuration (see the sketch after this list).
  • Workflow control can become complex. You may need additional services like Step Functions.
  • Raises the question: “Should this processing be done entirely in Lambda?” Choosing between Lambda and Glue Python Shell depends on execution time and workload characteristics.
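
As a sketch of what "trigger-level retries depend on Lambda configuration" can look like: S3 invokes Lambda asynchronously, so retries and failure destinations are controlled by the function's event invoke config. The function name and queue ARN below are placeholders.

import boto3

lambda_client = boto3.client("lambda")

# Hypothetical names and ARNs; adjust to your environment.
lambda_client.put_function_event_invoke_config(
    FunctionName="YOUR_TRIGGER_FUNCTION",
    MaximumRetryAttempts=2,          # retries for asynchronous (S3) invocations
    MaximumEventAgeInSeconds=3600,   # discard events older than one hour
    DestinationConfig={
        "OnFailure": {"Destination": "arn:aws:sqs:ap-northeast-1:123456789012:YOUR_DLQ"}
    },
)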

Lambda Sample Code

import boto3
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    glue = boto3.client("glue")

    # S3 event object keys are URL-encoded, so decode them before building the path.
    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    response = glue.start_job_run(
        JobName="YOUR_JOB_NAME",
        Arguments={
            "--s3_input": s3_input_path
        }
    )

    print(f"Glue Job started: {response['JobRunId']}")
    return response

Configuration Steps

  1. Create the Glue Python Shell job (same as Pattern 1).
  2. Create a Lambda function and deploy the above code.
  3. Configure an S3 PUT event trigger (a configuration sketch follows below).
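
A sketch of step 3 through the API, assuming a bucket-level notification to the function (names and ARNs are placeholders; the console's "Add trigger" button does the same thing):

import boto3

lambda_client = boto3.client("lambda")
s3 = boto3.client("s3")

# Allow S3 to invoke the function (hypothetical names and ARNs).
lambda_client.add_permission(
    FunctionName="YOUR_TRIGGER_FUNCTION",
    StatementId="AllowS3Invoke",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn="arn:aws:s3:::YOUR_BUCKET_NAME",
)

# Caution: this replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket="YOUR_BUCKET_NAME",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": "arn:aws:lambda:ap-northeast-1:123456789012:function:YOUR_TRIGGER_FUNCTION",
            "Events": ["s3:ObjectCreated:Put"],
        }]
    },
)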

Pattern 3: Lambda + start_workflow_run

In this pattern, S3 events trigger a Lambda function, which then starts a Glue Workflow using start_workflow_run.
The workflow then executes the Glue Python Shell job.

Key Characteristics

  • Combines Lambda’s flexibility with Glue Workflow’s orchestration capabilities.
  • Retry and error-handling responsibilities must be clearly divided between Lambda and the Workflow.
  • Workflow-based design makes it easy to extend processing steps later.

Lambda Sample Code

import boto3
from urllib.parse import unquote_plus

def lambda_handler(event, context):
    glue = boto3.client("glue")

    # S3 event object keys are URL-encoded, so decode them before building the path.
    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = unquote_plus(event['Records'][0]['s3']['object']['key'])

    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    response = glue.start_workflow_run(
        Name='YOUR_WORKFLOW_NAME',
        RunProperties={'--s3_input': s3_input_path}
    )

    print(f"Glue Workflow started: {response['RunId']}")
    return response

Configuration Steps

  1. Create the Glue Python Shell job (same as Pattern 1).
  2. Create the Glue Workflow (same as Pattern 1).
  3. Create a Lambda function and deploy the above code.
  4. Configure the S3 trigger.

When to Use Each Pattern

Here is how the three patterns compare on the main criteria:

  • Simplicity: EventBridge + Workflow requires no trigger code (the input transformer handles parameter mapping); Lambda + start_job_run needs only a small Lambda function; Lambda + start_workflow_run has the most moving parts.
  • Flexibility: the two Lambda-based patterns benefit from Lambda's broad service integrations; EventBridge + Workflow is limited to what rules and input transformers can express.
  • Retry control: EventBridge + Workflow keeps retries on the Glue side; Lambda + start_job_run requires careful design across Lambda and the Glue job; Lambda + start_workflow_run splits responsibility between Lambda and the Workflow.
  • Workflow management: EventBridge + Workflow and Lambda + start_workflow_run get orchestration from the Glue Workflow; Lambda + start_job_run may need additional services such as Step Functions.

Choosing the right pattern largely depends on where you want to centralize control logic.

  • Start small → Lambda + start_job_run
  • Glue-centric architecture → EventBridge + Workflow
  • Flexibility + orchestration → Lambda + start_workflow_run

Conclusion

In this article, I introduced three patterns for triggering Glue Python Shell jobs using S3 events.

Glue Python Shell often feels like a middle-ground service compared to Glue Spark jobs, EMR, or Lambda, which can make its use cases unclear.
However, when requirements align—low cost, long execution time, and no Spark dependency—it can be a very practical choice.

As shown in my previous article, combining Glue Python Shell with tools like chDB or DuckDB can significantly improve processing efficiency.
It also integrates well with Apache Iceberg via PyIceberg.

Although this article focused on S3-based triggers, Glue Python Shell is also useful as a general-purpose, long-running job execution environment, which is actually a more common use case.

I hope this article helps you when designing data platforms and ETL architectures.

