Aki for AWS Community Builders

Posted on Jan 8 • Edited on Jan 16

AWS Lambda and AWS Glue Python Shell in the Context of Lightweight ETL

#aws #dataengineering #etl

Original Japanese article: 軽量ETLの文脈で考えるAWS LambdaとAWS Glue Python Shell

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In my previous articles, I introduced lightweight ETL patterns using AWS Lambda and AWS Glue Python Shell.

In this article, I compare when to use AWS Lambda and when to use AWS Glue Python Shell for lightweight ETL workloads, and clarify where each service fits best.

When building with native AWS services, it’s common to think “use Lambda first, and move to Glue when Lambda hits its limits.”
This approach is not wrong, but I believe adding AWS Glue Python Shell as another option makes the design space more flexible.

What is Lightweight ETL?

After publishing several articles on the topic, I should mention this upfront: there is no strict definition of “lightweight ETL.”

For the sake of discussion, I’ll use the following working definition:

Files or data up to a few hundred MB
One or a small number of files
No need for advanced processing such as distributed computation
Scheduled or event-driven execution
Mainly used for preprocessing or initial transformations

In other words, these workloads are not heavy enough to justify Glue (Spark) or EMR, but they are still too costly or complex to treat as an afterthought.
Designing this “middle zone” effectively is the key point.

Position of Lightweight ETL (Big Picture)

Using the Medallion Architecture as a reference, let’s consider where lightweight ETL fits.
Lightweight ETL is most effective when transforming data from the source or Landing layer into the Bronze layer.

There are many interpretations of Medallion Architecture layer definitions, but in this article, you can think of it mainly in the context of preprocessing or initial transformations.

Depending on the size and nature of the dataset, there are also cases where lightweight ETL can be applied to Bronze → Silver transformations.

How Data Changes from Bronze to Silver to Gold

This is a high-level explanation, but each layer generally looks like the following.

Bronze

This layer stores raw data (or lightly transformed data if a Landing layer exists).

In some cases, data is converted from JSON or CSV to Parquet, or stored in Iceberg format.
One advantage of using Parquet or Iceberg at this stage is improved efficiency for Bronze → Silver transformations.
It’s also useful if you want to run analytical queries directly on Bronze data.

Sometimes, Bronze and Silver tables are joined to produce Gold tables.
Using Parquet or Iceberg formats also helps significantly in these scenarios.

Deciding what should be considered “raw” data is an important design decision.
There are debates about whether formatted data can still be called raw, but thinking in terms of a separate Landing layer can help clarify this.

Typical responsibilities include:

Storing raw data
(Optionally) format conversion
(Optionally) type normalization
(Optionally) dropping unnecessary columns
(Optionally) filtering
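
As a rough illustration of the optional steps above, here is a minimal sketch using only the Python standard library. The CSV content and column names are made up for this example, and a real pipeline would more likely use DuckDB, chDB, or PyIceberg and write Parquet or Iceberg rather than return Python dictionaries.

```python
import csv
import io
from datetime import datetime

# Hypothetical landing-layer CSV; the columns are invented for illustration.
LANDING_CSV = """pickup_at,passenger_count,fare_amount,internal_note
2024-01-01 00:15:00,1,12.50,ok
2024-01-01 00:20:00,2,-3.00,refund
2024-01-01 00:30:00,,8.00,missing count
"""


def to_bronze(raw_csv: str) -> list[dict]:
    """Normalize types, drop an unneeded column, and filter invalid rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        fare = float(record["fare_amount"])
        if fare < 0:  # filtering: skip rows with a negative fare
            continue
        count = record["passenger_count"]
        rows.append(
            {
                # type normalization: strings become typed values
                "pickup_at": datetime.strptime(
                    record["pickup_at"], "%Y-%m-%d %H:%M:%S"
                ),
                "passenger_count": int(count) if count else None,
                "fare_amount": fare,
                # "internal_note" is intentionally not copied (column drop)
            }
        )
    return rows


bronze_rows = to_bronze(LANDING_CSV)
print(len(bronze_rows), "rows kept")
```

Keeping the transformation in one small function like this makes it easy to move the same logic between Lambda and Glue Python Shell later.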

If you plan to apply Data Contracts, this is typically the layer where they are enforced.

Silver

This layer stores data transformed into 3NF (Third Normal Form).
Normalization reduces data redundancy and prepares the data for reporting.

Data modeling choices often become a topic of discussion here.
Depending on the dataset, this layer may only perform cleansing, so once again, it depends on the use case.

Typical responsibilities include:

Removing duplicate data
Aggregation
Data cleansing
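
The three responsibilities above can be sketched in a few lines of plain Python. The field names (trip_id, vendor, fare) and the choice of business key are assumptions made for this example only.

```python
def to_silver(bronze_rows: list[dict]) -> list[dict]:
    """Cleanse rows and drop duplicates on a hypothetical business key."""
    seen = set()
    silver = []
    for row in bronze_rows:
        # Cleansing: trim whitespace and normalize the case of a text field.
        vendor = row["vendor"].strip().upper()
        key = (vendor, row["trip_id"])
        if key in seen:  # duplicate removal: keep the first occurrence
            continue
        seen.add(key)
        silver.append({**row, "vendor": vendor})
    return silver


sample_rows = [
    {"trip_id": 1, "vendor": " cmt ", "fare": 10.0},
    {"trip_id": 1, "vendor": "CMT", "fare": 10.0},  # duplicate after cleansing
    {"trip_id": 2, "vendor": "vts", "fare": 7.5},
]
silver_rows = to_silver(sample_rows)

# Aggregation: total fare per vendor.
totals = {}
for row in silver_rows:
    totals[row["vendor"]] = totals.get(row["vendor"], 0.0) + row["fare"]
print(totals)
```

Note that cleansing runs before the duplicate check here, so rows that differ only in whitespace or case are treated as the same record.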

Gold

This layer stores data optimized using dimensional modeling.
Whether to use a One Big Table or a more traditional dimensional model is another common discussion point.
The choice of data model depends on the use case.

Typical responsibilities include:

Optimizing business logic
Creating business metrics
Building data marts

An important point is that data volume and query complexity increase significantly from the Silver layer onward.
At that stage, workloads often exceed the scope of lightweight ETL and are better handled by distributed ETL solutions.

AWS Lambda for Lightweight ETL

As mentioned in previous articles, AWS Lambda works very well for lightweight ETL.

Fast startup
Easy integration with many AWS services
Simple architecture
Serverless, with low operational overhead

For preprocessing and small-scale transformations, Lambda is usually the first option to consider.

That said, Lambda has clear limitations:

Memory limit (up to 10240 MB; some accounts are capped at 3008 MB until a quota increase is granted)
Execution time limit (15 minutes)
vCPU allocation depends on memory size
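
One way to live with the 15-minute limit is to check the remaining time inside the handler and stop cleanly before the timeout. The sketch below assumes a hypothetical event shape ("chunks") and a placeholder transform function; get_remaining_time_in_millis() is the method the Lambda Python runtime provides on the context object.

```python
def transform(chunk):
    """Placeholder for the real per-chunk transformation (hypothetical)."""
    return [value * 2 for value in chunk]


def process_in_chunks(chunks, context, safety_margin_ms=30_000):
    """Process chunks until the Lambda time budget is nearly used up.

    Returns (processed_count, remaining_chunks) so the caller can hand the
    leftover work to another invocation or to a longer-running service.
    """
    processed = 0
    for index, chunk in enumerate(chunks):
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            return processed, chunks[index:]
        transform(chunk)
        processed += 1
    return processed, []


def lambda_handler(event, context):
    chunks = event["chunks"]  # hypothetical event shape
    processed, remaining = process_in_chunks(chunks, context)
    return {"processed": processed, "remaining": len(remaining)}


class FakeContext:
    """Local stand-in for the Lambda context, for trying the logic offline."""

    def __init__(self, remaining_ms_values):
        self._values = iter(remaining_ms_values)

    def get_remaining_time_in_millis(self):
        return next(self._values)


result = lambda_handler(
    {"chunks": [[1], [2], [3]]},
    FakeContext([900_000, 400_000, 20_000]),
)
print(result)
```

If the "remaining" count is regularly above zero, that is a practical signal that the workload has outgrown Lambda.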

Even with optimizations using DuckDB or chDB, these constraints eventually become bottlenecks.

AWS Glue Python Shell for Lightweight ETL

AWS Glue Python Shell gets little attention, but its costs are easier to control than those of Glue (Spark) or EMR, and it supports long-running jobs of up to 168 hours.

It can be seen as the next option after Lambda.

Processing units of either 1/16 DPU or 1 DPU (1 DPU = 4 vCPU / 16 GB memory)
Maximum execution time of up to 168 hours (default: 48 hours). This extended execution time is one of the biggest advantages of Glue Python Shell.
Serverless, with low operational overhead
Integration with Glue Workflows
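
The settings above map onto a Glue job definition roughly as follows. This is a sketch that only builds the arguments for boto3's create_job call. The job name, role ARN, and script path are placeholders, and the 168-hour ceiling is taken from this article.

```python
PYTHON_SHELL_CAPACITIES = (0.0625, 1.0)  # 1/16 DPU or 1 DPU
MAX_TIMEOUT_MINUTES = 168 * 60  # upper bound as stated in this article


def build_python_shell_job_args(
    name, role_arn, script_s3_path, dpu=0.0625, timeout_hours=48
):
    """Build keyword arguments for glue_client.create_job(**args)."""
    if dpu not in PYTHON_SHELL_CAPACITIES:
        raise ValueError("Python Shell jobs take 0.0625 or 1 DPU")
    timeout_minutes = int(timeout_hours * 60)  # Glue expects minutes
    if not 1 <= timeout_minutes <= MAX_TIMEOUT_MINUTES:
        raise ValueError("timeout is outside the supported range")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",  # selects the Python Shell job type
            "ScriptLocation": script_s3_path,
            "PythonVersion": "3.9",
        },
        "MaxCapacity": dpu,
        "Timeout": timeout_minutes,
    }


# Placeholder values; replace them with your own resources.
job_args = build_python_shell_job_args(
    name="lightweight-etl-job",
    role_arn="arn:aws:iam::123456789012:role/example-glue-role",
    script_s3_path="s3://example-bucket/scripts/etl.py",
    dpu=1.0,
    timeout_hours=4,
)
print(job_args["MaxCapacity"], job_args["Timeout"])

# With AWS credentials available, the job could then be created with:
#   import boto3
#   boto3.client("glue").create_job(**job_args)
```

Setting an explicit timeout well below the maximum is a simple guard against a stuck job running (and billing) for days.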

Compared to Lambda, Glue Python Shell offers more memory and much longer execution time, making it suitable for heavier preprocessing workloads.

However, it also has drawbacks:

Slower startup time than Lambda
Limited event-driven integration

Compared to Lambda’s simplicity and versatility, these can be disadvantages.

Glue Python Shell is useful when you need to process heavy workloads that require time rather than immediate responsiveness.
It also works well for nightly batch processing of multiple files.

Choosing Between Lambda and Glue Python Shell

As discussed, Lambda and Glue Python Shell have different characteristics.
A simple way to think about it is:

Lambda is for responsive, lightweight preprocessing
Glue Python Shell is for heavier preprocessing that requires more time and memory

Rough guidelines:

Small workloads that finish within 15 minutes (up to a few hundred MB): Lambda
Medium workloads that exceed 15 minutes or data that Lambda cannot handle (up to a few GB): Glue Python Shell
Large-scale or complex workloads: Glue (Spark) or EMR

Cost is also a critical factor.
Lambda includes a free tier of 1 million requests and 400,000 GB-seconds per month, while Glue Python Shell does not.
For the same execution duration, Glue Python Shell is generally more expensive.
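
To compare costs for your own settings, a back-of-the-envelope calculation like the following can help. The unit prices and the one-minute minimum for Glue Python Shell are assumptions for illustration (prices differ by region and change over time), and the sketch ignores Lambda's request charges and free tier, so please check the current pricing pages before relying on the numbers.

```python
# Assumed unit prices for illustration only; verify against current pricing.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
GLUE_PRICE_PER_DPU_HOUR = 0.44
GLUE_MIN_BILLED_SECONDS = 60  # assumed minimum billed duration per run


def lambda_compute_cost(memory_mb: float, duration_s: float) -> float:
    """Compute charge for one invocation (request fee and free tier ignored)."""
    return (memory_mb / 1024) * duration_s * LAMBDA_PRICE_PER_GB_SECOND


def glue_python_shell_cost(dpu: float, duration_s: float) -> float:
    """Charge for one Python Shell run, applying the assumed minimum."""
    billed_seconds = max(duration_s, GLUE_MIN_BILLED_SECONDS)
    return dpu * (billed_seconds / 3600) * GLUE_PRICE_PER_DPU_HOUR


duration_s = 300  # the same five-minute run on both services
lambda_cost = lambda_compute_cost(3008, duration_s)
glue_cost = glue_python_shell_cost(1.0, duration_s)
print(f"Lambda 3008 MB: ${lambda_cost:.4f}")
print(f"Glue Python Shell 1 DPU: ${glue_cost:.4f}")
```

The outcome depends heavily on the memory and DPU settings you plug in, so it is worth running this with your actual configuration rather than treating either service as always cheaper.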

These factors should all be considered when making a decision.

Configuration Comparison

Aspect	Lambda	Glue Python Shell
Memory	Up to 10 GB (quota required)	Up to 16 GB (at 1 DPU)
Execution time	Up to 15 minutes	Up to 168 hours
Startup speed	Very fast	Slower
Event integration	Very good	Good
Operational effort	Very low	Low

Insights from Benchmarks

Let’s look at benchmarks using the same dataset and processing logic described in the following articles:

Lightweight ETL with AWS Lambda, chDB, and PyIceberg (Compared with DuckDB)
Lightweight ETL with AWS Glue Python Shell, chDB, and PyIceberg (Compared with DuckDB)

The dataset used is NYC Taxi data:

https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

The test data includes:

January 2024 Yellow Taxi Trip Records (2,964,624 records, 48 MB)
A full year combined file (41,169,720 records, 807 MB)

Lambda was tested with memory sizes of 1024 MB, 2048 MB, and the default maximum of 3008 MB.
Glue Python Shell was tested with 1/16 DPU and 1 DPU.

Memory usage cannot be directly compared, so the focus is on processing time.

January file (48 MB)

Platform	Resource setting	chDB time (s)	DuckDB time (s)
Glue Python Shell	1/16 DPU	46.000	40.000
Glue Python Shell	1 DPU	39.000	34.000
AWS Lambda	1024 MB	5.092	5.163
AWS Lambda	2048 MB	3.873	4.265
AWS Lambda	3008 MB	3.370	4.061

Full year file (807 MB)

Platform	Resource setting	chDB time (s)	DuckDB time (s)
Glue Python Shell	1/16 DPU	OutOfMemory	OutOfMemory
Glue Python Shell	1 DPU	51.0	212.0
AWS Lambda	1024 MB	OutOfMemory	OutOfMemory
AWS Lambda	2048 MB	OutOfMemory	OutOfMemory
AWS Lambda	3008 MB	27.171	187.332

One important observation is that Glue Python Shell shows longer execution times than Lambda in every case where both completed.
With 1 DPU (4 vCPU / 16 GB), it should theoretically outperform Lambda.
The difference comes not from raw processing power but from Glue Python Shell's fixed startup overhead, which applies regardless of workload size.

This includes:

Job startup time
Container preparation
Dependency initialization

These factors add to the overall lead time and must be considered.
Evaluating this startup overhead is key when deciding whether to use Glue Python Shell.
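
A quick way to see this is to subtract the Lambda (3008 MB) times from the Glue Python Shell (1 DPU) times in the two tables above. The gap stays at roughly 24 to 36 seconds even though the input grows about 17 times, which is consistent with a mostly fixed overhead. These are single measurements on platforms with different CPU allocations, so treat this as a rough indication rather than proof.

```python
# Elapsed seconds copied from the two benchmark tables in this article.
BENCHMARKS = {
    ("January (48 MB)", "chDB"): {"glue_1dpu": 39.0, "lambda_3008mb": 3.370},
    ("January (48 MB)", "DuckDB"): {"glue_1dpu": 34.0, "lambda_3008mb": 4.061},
    ("Full year (807 MB)", "chDB"): {"glue_1dpu": 51.0, "lambda_3008mb": 27.171},
    ("Full year (807 MB)", "DuckDB"): {"glue_1dpu": 212.0, "lambda_3008mb": 187.332},
}

# Gap between the two platforms for each file and engine combination.
gaps = {
    case: round(times["glue_1dpu"] - times["lambda_3008mb"], 3)
    for case, times in BENCHMARKS.items()
}

for (dataset, engine), gap in gaps.items():
    print(f"{dataset:20s} {engine:7s} gap = {gap:.3f} s")
```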

Another point is that Lambda increases vCPU allocation as memory increases.
With 3008 MB, Lambda may achieve higher single-process performance than Glue Python Shell.
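
For a rough feel of how CPU scales with memory, the sketch below uses the figures from the Lambda documentation as I understand them: CPU is allocated in proportion to memory, 1769 MB corresponds to about one vCPU, and the allocation tops out at 6 vCPUs. The linear formula is a simplification for estimation, not an official calculation.

```python
MB_PER_VCPU = 1769  # memory size at which Lambda provides about one vCPU
MAX_VCPUS = 6       # upper bound on vCPUs per function


def estimated_vcpus(memory_mb: int) -> float:
    """Rough linear estimate of the vCPU share for a given memory size."""
    return round(min(memory_mb / MB_PER_VCPU, MAX_VCPUS), 2)


for memory_mb in (1024, 2048, 3008, 10240):
    print(f"{memory_mb:>5} MB -> about {estimated_vcpus(memory_mb)} vCPU")
```

This also explains why the benchmark times on Lambda improve as the memory setting goes up, even when the data easily fits in memory.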

That said, Glue Python Shell still offers two major strengths: long execution time (up to 168 hours) and 16 GB of memory.

While there are many other decision factors, a simple rule of thumb could look like this:

File size
 ├─ Up to a few hundred MB → Lambda
 │    └─ If processing exceeds 15 minutes → Glue Python Shell
 └─ Multiple GB or complex workloads → Glue (Spark) / EMR
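
The same rule of thumb can be written as a small function. The concrete thresholds (500 MB for "a few hundred MB" and 5000 MB for "a few GB") are my assumptions for this sketch, so adjust them to your own measurements.

```python
LAMBDA_MAX_MINUTES = 15  # Lambda execution time limit


def choose_platform(
    total_size_mb: float,
    estimated_minutes: float,
    needs_distributed: bool = False,
    lambda_size_limit_mb: float = 500,          # assumed "a few hundred MB"
    python_shell_size_limit_mb: float = 5_000,  # assumed "a few GB"
) -> str:
    """Return a suggested service following the rule of thumb above."""
    if needs_distributed or total_size_mb > python_shell_size_limit_mb:
        return "Glue (Spark) / EMR"
    if (
        total_size_mb <= lambda_size_limit_mb
        and estimated_minutes <= LAMBDA_MAX_MINUTES
    ):
        return "Lambda"
    return "Glue Python Shell"


print(choose_platform(48, 1))
print(choose_platform(300, 20))
print(choose_platform(50_000, 60))
```

A function like this is no substitute for benchmarking, but writing the thresholds down makes the reasoning behind a choice easier to explain and review.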

The most important thing is being able to clearly explain why you chose a particular approach.

Conclusion

In this article, I discussed how to choose between AWS Lambda and AWS Glue Python Shell in the context of lightweight ETL.

For preprocessing workloads up to a few hundred MB, AWS Lambda is often the best choice in terms of execution time, cost, and operational simplicity.
Glue Python Shell is a useful option for filling the gap between Lambda and distributed processing, handling workloads that are too heavy for Lambda but do not require Spark or EMR.

Rather than forcing all workloads into Lambda or Glue Python Shell, choosing the right tool for each situation is essential.

I hope this article helps anyone considering lightweight data processing on AWS.
