Original Japanese article: S3トリガーで起動するAWS Glue Python Shellの構成パターン整理 (Organizing configuration patterns for AWS Glue Python Shell jobs triggered from S3)
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
In my previous article, I introduced a lightweight ETL approach using AWS Glue Python Shell.
In that article, I also briefly touched on how to trigger Glue Python Shell jobs.
Previous article:
Lightweight ETL with AWS Glue Python Shell, chDB, and PyIceberg (Compared with DuckDB)
In this article, I organize and explain three representative patterns for triggering Glue Python Shell jobs using S3 object creation events.
At first glance, these architectures may look similar. However, they differ significantly in terms of:
- where control logic is placed,
- how retries are handled, and
- where workflow and orchestration are managed.
I’ll walk through these differences and typical use cases for each pattern.
Trigger Patterns
The three trigger patterns covered in this article are:
- EventBridge + Glue Workflow
- Lambda + start_job_run
- Lambda + start_workflow_run
All three patterns use S3 object creation as the trigger and eventually start a Glue Python Shell job.
There are, of course, many other possible architectures—for example:
S3 → EventBridge → Step Functions → Glue Python Shell
However, in this article, I’ll focus on these three relatively simple and commonly used patterns.
Pattern 1: EventBridge + Glue Workflow
In this pattern, S3 object creation events are captured by Amazon EventBridge, which then triggers a Glue Workflow.
Inside the workflow, a Glue Python Shell job is executed.
Key Characteristics
- No programming is required to filter events or map input parameters. Using EventBridge input paths and templates, you can pass the S3 bucket and key directly to the Glue Workflow Run Properties.
- Retry policies and dead-letter queues can be configured on the EventBridge target, and the number of job retries on the Glue job itself.
- Workflow-based orchestration makes it easy to add pre-processing or post-processing steps later.
Glue Python Shell Sample Code
This sample is the same as in the previous article.
The key point is using getResolvedOptions to retrieve the input parameter s3_input (the S3 file path).
By passing parameters in this way, we can implement event-driven processing triggered by file uploads.
import boto3
import sys
import os
from awsglue.utils import getResolvedOptions


def get_job_parameters():
    try:
        required_args = ['s3_input']
        args = getResolvedOptions(sys.argv, required_args)

        s3_file_path = args['s3_input']
        print(f"s3_input: {s3_file_path}")

        return s3_file_path
    except Exception as e:
        print(f"parameters error: {e}")
        raise


def _to_pyarrow_table(result):
    """
    Compatibility helper to extract a pyarrow.Table from a chDB query_result.
    """
    import chdb

    if hasattr(chdb, "to_arrowTable"):
        return chdb.to_arrowTable(result)
    if hasattr(result, "to_pyarrow"):
        return result.to_pyarrow()
    if hasattr(result, "to_arrow"):
        return result.to_arrow()
    raise RuntimeError(
        "Cannot convert chdb query_result to pyarrow.Table. "
        f"Available attributes: {sorted(dir(result))[:200]}"
    )


def normalize_arrow_for_iceberg(table):
    """
    Normalize Arrow schema for Iceberg compatibility.
    - timestamp with timezone -> timestamp (timezone dropped)
    """
    import pyarrow as pa

    new_fields = []
    new_columns = []

    for field, column in zip(table.schema, table.columns):
        # timestamp with timezone -> timestamp
        if pa.types.is_timestamp(field.type) and field.type.tz is not None:
            new_type = pa.timestamp(field.type.unit)
            new_fields.append(pa.field(field.name, new_type, field.nullable))
            new_columns.append(column.cast(new_type))
        else:
            new_fields.append(field)
            new_columns.append(column)

    new_schema = pa.schema(new_fields)
    return pa.Table.from_arrays(new_columns, schema=new_schema)


def read_parquet_with_chdb(s3_input):
    """
    Read Parquet file from S3 using chDB.
    """
    import chdb

    if s3_input.startswith("s3://"):
        bucket, key = s3_input.replace("s3://", "").split("/", 1)
        s3_url = f"https://{bucket}.s3.ap-northeast-1.amazonaws.com/{key}"
    else:
        s3_url = s3_input

    print(f"Reading data from S3: {s3_url}")

    query = f"""
        SELECT *
        FROM s3('{s3_url}', 'Parquet')
        WHERE VendorID = 1
    """
    result = chdb.query(query, "Arrow")
    arrow_table = _to_pyarrow_table(result)

    print("Original schema:")
    print(arrow_table.schema)

    # Normalize schema for Iceberg compatibility
    arrow_table = normalize_arrow_for_iceberg(arrow_table)

    print("Normalized schema:")
    print(arrow_table.schema)
    print(f"Rows: {arrow_table.num_rows:,}")

    return arrow_table


def write_iceberg_table(arrow_table):
    """
    Write Arrow table to Iceberg table using PyIceberg.
    """
    try:
        print("Writing started...")
        from pyiceberg.catalog import load_catalog

        catalog_config = {
            "type": "glue",
            "warehouse": "s3://your-bucket/your-warehouse/",  # Adjust to your environment.
            "region": "ap-northeast-1",
        }
        catalog = load_catalog("glue_catalog", **catalog_config)

        table_identifier = "icebergdb.yellow_tripdata"
        table = catalog.load_table(table_identifier)

        print(f"Target data to write: {arrow_table.num_rows:,} rows")
        table.append(arrow_table)
        return True
    except Exception as e:
        print(f"Writing error: {e}")
        import traceback
        traceback.print_exc()
        return False


def main():
    try:
        import chdb
        import pyiceberg

        # Read input parameter
        s3_input = get_job_parameters()

        # Read data with chDB
        arrow_tbl = read_parquet_with_chdb(s3_input)
        print(f"Data read success: {arrow_tbl.num_rows:,} rows")

        # Write to Iceberg table
        if write_iceberg_table(arrow_tbl):
            print("\nWriting fully successful!")
        else:
            print("Writing failed")
    except Exception as e:
        print(f"Main error: {e}")
        import traceback
        traceback.print_exc()


if __name__ == "__main__":
    main()
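A note on how the parameter reaches the job: when the job is started through a Glue Workflow (as in this pattern and in Pattern 3), the S3 path set by the trigger lands in the workflow's Run Properties, and depending on how the workflow trigger is configured it may not be visible to getResolvedOptions as a plain job argument. In that case the job can fetch it through the Glue API. Below is a minimal sketch, assuming the WORKFLOW_NAME and WORKFLOW_RUN_ID arguments that Glue passes to workflow-started jobs, and the s3_input Run Property key used in the EventBridge input transformer template shown in the configuration steps below:

import sys
import boto3
from awsglue.utils import getResolvedOptions

# WORKFLOW_NAME and WORKFLOW_RUN_ID are supplied automatically to jobs
# started by a Glue Workflow trigger.
args = getResolvedOptions(sys.argv, ['WORKFLOW_NAME', 'WORKFLOW_RUN_ID'])

glue = boto3.client("glue")
run_properties = glue.get_workflow_run_properties(
    Name=args['WORKFLOW_NAME'],
    RunId=args['WORKFLOW_RUN_ID'],
)["RunProperties"]

# "s3_input" is the Run Property key set by the EventBridge input transformer.
s3_input = run_properties.get("s3_input")
print(f"s3_input from workflow run properties: {s3_input}")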
Configuration Steps
1. Create a Glue Python Shell job
Add a job parameter to receive the S3 file path.
The job uses the sample code shown above.
2. Create a Glue Workflow
Assign the Glue Python Shell job.
Add Run Properties to receive the S3 file path.
3. Create an EventBridge rule
Set the event source to S3 Object Created, and use the following event pattern:
{
  "source": ["aws.s3"],
  "detail-type": ["Object Created"],
  "detail": {
    "bucket": {
      "name": ["YOUR_BUCKET_NAME"]
    }
  }
}

Set the target to the Glue Workflow.
Select Input transformer as the transformation method.
Input paths:
{
  "s3_bucket": "$.detail.bucket.name",
  "s3_key": "$.detail.object.key"
}
Template (pass parameters as s3_input):
{
  "s3_input": "s3://<s3_bucket>/<s3_key>",
  "s3_key": "<s3_key>"
}
4. Enable EventBridge notifications on the S3 bucket
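If you prefer to enable this programmatically rather than in the console, here is a minimal sketch using boto3 (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Turning on EventBridge delivery sends the bucket's S3 event notifications
# to EventBridge, where the rule above filters for Object Created events.
# Note: this call replaces any existing notification configuration on the bucket.
s3.put_bucket_notification_configuration(
    Bucket="YOUR_BUCKET_NAME",
    NotificationConfiguration={"EventBridgeConfiguration": {}},
)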
Pattern 2: Lambda + start_job_run
In this pattern, S3 events trigger a Lambda function, and the Lambda function directly starts a Glue Python Shell job using start_job_run.
This was the architecture used in my previous article.
Key Characteristics
- High flexibility, thanks to Lambda’s ability to integrate with many AWS services.
- Retry logic must be carefully designed. While Glue Python Shell supports job retries, trigger-level retries depend on Lambda configuration.
- Workflow control can become complex. You may need additional services like Step Functions.
- Raises the question: “Should this processing be done entirely in Lambda?” Choosing between Lambda and Glue Python Shell depends on execution time and workload characteristics.
Lambda Sample Code
import boto3

def lambda_handler(event, context):
    glue = boto3.client("glue")

    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = event['Records'][0]['s3']['object']['key']
    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    response = glue.start_job_run(
        JobName="YOUR_JOB_NAME",
        Arguments={
            "--s3_input": s3_input_path
        }
    )

    print(f"Glue Job started: {response['JobRunId']}")
    return response
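The sample above assumes start_job_run always succeeds. If many files arrive at once, the job's maximum concurrency can be exceeded, so in practice you may want to catch that case and rely on Lambda's asynchronous retry behavior (S3 invokes Lambda asynchronously, so its retry and failure-destination settings apply). A minimal sketch along those lines, with the job name as a placeholder:

import boto3

glue = boto3.client("glue")

def lambda_handler(event, context):
    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = event['Records'][0]['s3']['object']['key']
    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    try:
        response = glue.start_job_run(
            JobName="YOUR_JOB_NAME",
            Arguments={"--s3_input": s3_input_path},
        )
    except glue.exceptions.ConcurrentRunsExceededException:
        # Re-raise so the asynchronous invocation is retried by Lambda
        # (or routed to an on-failure destination / DLQ if configured).
        print(f"Concurrent run limit reached, retry requested for: {s3_input_path}")
        raise

    print(f"Glue Job started: {response['JobRunId']}")
    return {"JobRunId": response["JobRunId"]}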
Configuration Steps
- Create the Glue Python Shell job (same as Pattern 1).
- Create a Lambda function and deploy the above code.
- Configure an S3 PUT event trigger.
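One additional point about the Lambda function: its execution role needs permission to start the Glue job, in addition to the basic logging permissions. A minimal policy statement sketch, with region, account ID, and job name as placeholders (Pattern 3 below needs glue:StartWorkflowRun against the workflow ARN instead):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "glue:StartJobRun",
      "Resource": "arn:aws:glue:ap-northeast-1:123456789012:job/YOUR_JOB_NAME"
    }
  ]
}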
Pattern 3: Lambda + start_workflow_run
In this pattern, S3 events trigger a Lambda function, which then starts a Glue Workflow using start_workflow_run.
The workflow then executes the Glue Python Shell job.
Key Characteristics
- Combines Lambda’s flexibility with Glue Workflow’s orchestration capabilities.
- Retry and error-handling responsibilities must be clearly divided between Lambda and the Workflow.
- Workflow-based design makes it easy to extend processing steps later.
Lambda Sample Code
import boto3

def lambda_handler(event, context):
    glue = boto3.client("glue")

    s3_bucket = event['Records'][0]['s3']['bucket']['name']
    s3_object_key = event['Records'][0]['s3']['object']['key']
    s3_input_path = f"s3://{s3_bucket}/{s3_object_key}"

    response = glue.start_workflow_run(
        Name='YOUR_WORKFLOW_NAME',
        RunProperties={'--s3_input': s3_input_path}
    )

    print(f"Glue Workflow started: {response['RunId']}")
    return response
Configuration Steps
- Create the Glue Python Shell job (same as Pattern 1).
- Create the Glue Workflow (same as Pattern 1).
- Create a Lambda function and deploy the above code.
- Configure the S3 trigger.
When to Use Each Pattern
| Criterion | EventBridge + Workflow | Lambda + Job | Lambda + Workflow |
|---|---|---|---|
| Simplicity | △ | ◎ | ○ |
| Flexibility | △ | ◎ | ◎ |
| Retry control | ◎ | △ | ○ |
| Workflow management | ◎ | △ | ◎ |
(◎ = strong, ○ = moderate, △ = weak)

Choosing the right pattern largely depends on where you want to centralize control logic:
- Start small → Lambda + start_job_run
- Glue-centric architecture → EventBridge + Workflow
- Flexibility + orchestration → Lambda + start_workflow_run
Conclusion
In this article, I introduced three patterns for triggering Glue Python Shell jobs using S3 events.
Glue Python Shell often feels like a middle-ground service compared to Glue Spark jobs, EMR, or Lambda, which can make its use cases unclear.
However, when requirements align—low cost, long execution time, and no Spark dependency—it can be a very practical choice.
As shown in my previous article, combining Glue Python Shell with tools like chDB or DuckDB can significantly improve processing efficiency.
It also integrates well with Apache Iceberg via PyIceberg.
Although this article focused on S3-based triggers, Glue Python Shell is also useful as a general-purpose, long-running job execution environment, which is actually a more common use case.
I hope this article helps you when designing data platforms and ETL architectures.