Original Japanese article: 軽量ETLの文脈で考えるAWS LambdaとAWS Glue Python Shell
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
In my previous articles, I introduced lightweight ETL patterns using AWS Lambda and AWS Glue Python Shell.
In this article, I would like to organize when to use AWS Lambda and when to use AWS Glue Python Shell for lightweight ETL workloads, and clarify where each service fits best.
When building with native AWS services, it’s common to think “use Lambda first, and move to Glue when Lambda hits its limits.”
This approach is not wrong, but I believe adding AWS Glue Python Shell as another option makes the design space more flexible.
What is Lightweight ETL?
After publishing several articles on the topic, I should mention this upfront: there is no strict definition of “lightweight ETL.”
For the sake of discussion, I’ll use the following working definition:
- Files or data up to a few hundred MB
- One or a small number of files
- No need for advanced processing such as distributed computation
- Scheduled or event-driven execution
- Mainly used for preprocessing or initial transformations
In other words, these workloads are not heavy enough to justify Glue (Spark) or EMR, yet they carry enough cost and complexity that their design still deserves attention.
Designing this “middle zone” effectively is the key point.
Position of Lightweight ETL (Big Picture)
Using the Medallion Architecture as a reference, let’s consider where lightweight ETL fits.
Lightweight ETL is most effective when transforming data from the source or Landing layer into the Bronze layer.
There are many interpretations of the Medallion Architecture layer definitions, but in this article, lightweight ETL is discussed mainly in the context of preprocessing or initial transformations.
Depending on the size and nature of the dataset, there are also cases where lightweight ETL can be applied to Bronze → Silver transformations.
How Data Changes from Bronze to Silver to Gold
This is a high-level explanation, but each layer generally looks like the following.
Bronze
This layer stores raw data (or lightly transformed data if a Landing layer exists).
In some cases, data is converted from JSON or CSV to Parquet, or stored in Iceberg format.
One advantage of using Parquet or Iceberg at this stage is improved efficiency for Bronze → Silver transformations.
It’s also useful if you want to run analytical queries directly on Bronze data.
Sometimes, Bronze and Silver tables are joined to produce Gold tables.
Using Parquet or Iceberg formats also helps significantly in these scenarios.
Deciding what should be considered “raw” data is an important design decision.
There are debates about whether formatted data can still be called raw, but thinking in terms of a separate Landing layer can help clarify this.
Typical responsibilities include:
- Storing raw data
- (Optionally) format conversion
- (Optionally) type normalization
- (Optionally) dropping unnecessary columns
- (Optionally) filtering
If you plan to apply Data Contracts, this is typically the layer where they are enforced.
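The optional Bronze responsibilities listed above (type normalization, dropping unnecessary columns, filtering) can be sketched with the standard library alone. This is only an illustration: the column names, the `SCHEMA` mapping, and the "negative fare" filter are hypothetical, and a real job would more likely write Parquet or Iceberg than return Python dicts.

```python
import csv
import io
from datetime import datetime

# Hypothetical schema: column name -> converter. Columns not listed here are dropped.
SCHEMA = {
    "pickup_at": datetime.fromisoformat,
    "passenger_count": int,
    "fare_amount": float,
}


def to_bronze(raw_csv: str) -> list[dict]:
    """Normalize types, drop unlisted columns, and filter out invalid rows."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_csv)):
        try:
            row = {name: convert(record[name]) for name, convert in SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue  # missing column or unparseable value: skip the row
        if row["fare_amount"] < 0:
            continue  # hypothetical filter rule
        rows.append(row)
    return rows


SAMPLE = """pickup_at,passenger_count,fare_amount,vendor_note
2024-01-01T00:57:55,1,17.70,ok
2024-01-01T01:03:00,two,10.00,bad count
2024-01-01T01:10:12,2,-5.00,negative fare
"""

bronze_rows = to_bronze(SAMPLE)
print(bronze_rows)
```

Keeping the expected columns and types in one mapping also gives a natural place to hang a simple Data Contract check, although a real contract would usually cover more than types.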
Silver
This layer stores data transformed into 3NF (Third Normal Form).
Normalization reduces data redundancy and prepares the data for reporting.
Data modeling choices often become a topic of discussion here.
Depending on the dataset, this layer may only perform cleansing, so once again, it depends on the use case.
Typical responsibilities include:
- Removing duplicate data
- Aggregation
- Data cleansing
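As a rough illustration of these Silver-layer responsibilities, the sketch below trims string values, removes duplicates by key, and aggregates a numeric column. The field names (`trip_id`, `zone`, `fare_amount`) and both helper functions are made up for this example; at real Silver-layer volumes this work would normally run in a query engine rather than in plain Python.

```python
from collections import defaultdict


def deduplicate(rows, key_fields):
    """Trim string values, then keep the first row seen for each key."""
    seen = set()
    unique = []
    for row in rows:
        cleaned = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        key = tuple(cleaned[f] for f in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique


def total_by(rows, group_field, value_field):
    """Sum value_field per distinct value of group_field."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[group_field]] += row[value_field]
    return dict(totals)


# Hypothetical input with one duplicate trip and one untrimmed zone name.
rows = [
    {"trip_id": "t1", "zone": " JFK ", "fare_amount": 52.0},
    {"trip_id": "t1", "zone": "JFK", "fare_amount": 52.0},
    {"trip_id": "t2", "zone": "JFK", "fare_amount": 70.0},
    {"trip_id": "t3", "zone": "Midtown", "fare_amount": 12.5},
]

silver_rows = deduplicate(rows, ("trip_id",))
fare_by_zone = total_by(silver_rows, "zone", "fare_amount")
print(fare_by_zone)
```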
Gold
This layer stores data optimized using dimensional modeling.
Whether to use a One Big Table or a more traditional dimensional model is another common discussion point.
The choice of data model depends on the use case.
Typical responsibilities include:
- Optimizing business logic
- Creating business metrics
- Building data marts
An important point is that data volume and query complexity increase significantly from the Silver layer onward.
At that stage, workloads often exceed the scope of lightweight ETL and are better handled by distributed ETL solutions.
AWS Lambda for Lightweight ETL
As mentioned in previous articles, AWS Lambda works very well for lightweight ETL.
- Fast startup
- Easy integration with many AWS services
- Simple architecture
- Serverless, with low operational overhead
For preprocessing and small-scale transformations, Lambda is usually the first option to consider.
That said, Lambda has clear limitations:
- Memory limit (configurable up to 10,240 MB; some accounts are capped at 3,008 MB until a quota increase is granted)
- Execution time limit (15 minutes)
- vCPU allocation depends on memory size
Even with optimizations using DuckDB or chDB, these constraints eventually become bottlenecks.
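One way to live with the 15-minute limit is to watch the remaining invocation time and stop cleanly before Lambda terminates the function. The sketch below uses `get_remaining_time_in_millis()`, which the Lambda Python runtime provides on the context object. The helper name, the 30-second safety margin, and the `FakeContext` class (a local stand-in so the sketch runs outside Lambda) are assumptions for illustration.

```python
def process_within_budget(items, handle, context, safety_margin_ms=30_000):
    """Process items until the remaining invocation time drops below the margin.

    Returns the items that were not processed so the caller can hand them off,
    for example to a follow-up invocation or a Glue Python Shell job.
    """
    pending = list(items)
    while pending:
        if context.get_remaining_time_in_millis() < safety_margin_ms:
            break
        handle(pending.pop(0))
    return pending


class FakeContext:
    """Local stand-in for the Lambda context; each check 'uses up' 20 seconds."""

    def __init__(self, remaining_ms):
        self.remaining_ms = remaining_ms

    def get_remaining_time_in_millis(self):
        current = self.remaining_ms
        self.remaining_ms -= 20_000
        return current


processed = []
leftover = process_within_budget(
    ["a.parquet", "b.parquet", "c.parquet"], processed.append, FakeContext(60_000)
)
print(processed, leftover)
```

This only softens the time limit. It does nothing for the memory ceiling, which is where a single large file usually fails first.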
AWS Glue Python Shell for Lightweight ETL
AWS Glue Python Shell does not get much attention, but compared to Glue (Spark) or EMR it makes costs easier to control, and it supports long-running jobs of up to 48 hours.
It can be seen as the next option after Lambda.
- Processing capacity of either 1/16 DPU or 1 DPU (1 DPU = 4 vCPU / 16 GB memory)
- Maximum execution time of 48 hours (this is the biggest advantage)
- Serverless, with low operational overhead
- Integration with Glue Workflows
Compared to Lambda, Glue Python Shell offers more memory and much longer execution time, making it suitable for heavier preprocessing workloads.
However, it also has drawbacks:
- Slower startup time than Lambda
- Limited event-driven integration
Compared to Lambda’s simplicity and versatility, these can be disadvantages.
Glue Python Shell is useful when you need to process heavy workloads that require time rather than immediate responsiveness.
It also works well for nightly batch processing of multiple files.
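A common way to work around the limited event-driven integration is to let a small S3-triggered Lambda function start the Glue job. The following is a minimal sketch under several assumptions: the job name and the `--source_bucket` / `--source_key` argument names are hypothetical, and the optional `glue_client` parameter exists only so the handler can be exercised without AWS. `start_job_run` is the boto3 Glue client call for starting a job run.

```python
from urllib.parse import unquote_plus

JOB_NAME = "lightweight-etl-python-shell"  # hypothetical Glue job name


def start_glue_job_for_object(glue_client, job_name, bucket, key):
    """Start a Glue job run, passing the S3 object location as job arguments."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments={"--source_bucket": bucket, "--source_key": key},
    )
    return response["JobRunId"]


def lambda_handler(event, context, glue_client=None):
    """S3-triggered entry point that hands each new object to Glue Python Shell."""
    if glue_client is None:
        import boto3  # imported lazily so the sketch can run without AWS access

        glue_client = boto3.client("glue")

    run_ids = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        run_ids.append(start_glue_job_for_object(glue_client, JOB_NAME, bucket, key))
    return {"job_run_ids": run_ids}
```

With this split, Lambda stays responsible for reacting to the event, while the heavy processing runs in Glue Python Shell with its longer time limit.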
Choosing Between Lambda and Glue Python Shell
As discussed, Lambda and Glue Python Shell have different characteristics.
A simple way to think about it is:
- Lambda is for responsive, lightweight preprocessing
- Glue Python Shell is for heavier preprocessing that requires more time and memory
Rough guidelines:
- Small workloads that finish within 15 minutes (up to a few hundred MB): Lambda
- Medium workloads that exceed 15 minutes or data that Lambda cannot handle (up to a few GB): Glue Python Shell
- Large-scale or complex workloads: Glue (Spark) or EMR
Cost is also a critical factor.
Lambda includes a free tier of 1 million requests and 400,000 GB-seconds per month, while Glue Python Shell does not.
For the same execution duration, Glue Python Shell is generally more expensive.
These factors should all be considered when making a decision.
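To get a feel for how far the Lambda free tier goes, the compute allowance can be turned into a number of runs. The sketch below uses the 400,000 GB-seconds figure quoted above and, as example input, the 3008 MB / 27.171 s chDB run from the benchmark later in this article. It assumes billed duration equals the measured processing time and ignores request charges, so treat the result as a rough estimate.

```python
FREE_TIER_GB_SECONDS = 400_000  # monthly Lambda free tier compute allowance


def gb_seconds(memory_mb: int, duration_s: float) -> float:
    """Compute usage of one invocation: memory in GB times duration in seconds."""
    return (memory_mb / 1024) * duration_s


def free_runs_per_month(memory_mb: int, duration_s: float) -> int:
    """How many such invocations fit into the free tier compute allowance."""
    return int(FREE_TIER_GB_SECONDS // gb_seconds(memory_mb, duration_s))


# Example input: the 3008 MB / 27.171 s run from the benchmark section.
per_run = gb_seconds(3008, 27.171)
runs = free_runs_per_month(3008, 27.171)
print(f"{per_run:.1f} GB-seconds per run, about {runs} free runs per month")
```

At roughly 80 GB-seconds per run, a job of this size could run several thousand times a month before leaving the free tier, which is one reason Lambda is attractive for small, frequent workloads.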
Configuration Comparison
| Aspect | Lambda | Glue Python Shell |
|---|---|---|
| Memory | Up to 10 GB (quota required) | 16 GB per DPU |
| Execution time | Up to 15 minutes | Up to 48 hours |
| Startup speed | Very fast | Slower |
| Event integration | Very good | Limited |
| Operational effort | Very low | Low |
Insights from Benchmarks
Let’s look at benchmarks using the same dataset and processing logic described in the following articles:
Lightweight ETL with AWS Lambda, chDB, and PyIceberg (Compared with DuckDB)
Lightweight ETL with AWS Glue Python Shell, chDB, and PyIceberg (Compared with DuckDB)
The dataset used is NYC Taxi data:
https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page
The test data includes:
- January 2024 Yellow Taxi Trip Records (2,964,624 records, 48 MB)
- A full year combined file (41,169,720 records, 807 MB)
Lambda was tested with memory sizes of 1024 MB, 2048 MB, and the default maximum of 3008 MB.
Glue Python Shell was tested with 1/16 DPU and 1 DPU.
Memory usage cannot be directly compared, so the focus is on processing time.
January file (48 MB)
| Platform | Resource setting | chDB time (s) | DuckDB time (s) |
|---|---|---|---|
| Glue Python Shell | 1/16 DPU | 46.000 | 40.000 |
| Glue Python Shell | 1 DPU | 39.000 | 34.000 |
| AWS Lambda | 1024 MB | 5.092 | 5.163 |
| AWS Lambda | 2048 MB | 3.873 | 4.265 |
| AWS Lambda | 3008 MB | 3.370 | 4.061 |
Full year file (807 MB)
| Platform | Resource setting | chDB time (s) | DuckDB time (s) |
|---|---|---|---|
| Glue Python Shell | 1/16 DPU | OutOfMemory | OutOfMemory |
| Glue Python Shell | 1 DPU | 51.0 | 212.0 |
| AWS Lambda | 1024 MB | OutOfMemory | OutOfMemory |
| AWS Lambda | 2048 MB | OutOfMemory | OutOfMemory |
| AWS Lambda | 3008 MB | 27.171 | 187.332 |
One important observation is that Glue Python Shell took longer than Lambda in every run that completed.
With 1 DPU (4 vCPU / 16 GB), it should theoretically outperform Lambda.
The difference comes not from raw processing power but from a fixed startup overhead that Glue Python Shell incurs regardless of workload size.
This includes:
- Job startup time
- Container preparation
- Dependency initialization
These factors add to the overall lead time and must be considered.
Evaluating this startup overhead is key when deciding whether to use Glue Python Shell.
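A quick way to check the fixed-overhead reading is to subtract the Lambda (3008 MB) time from the Glue Python Shell (1 DPU) time for each case in the tables above. The gap also contains any real difference in compute speed, so it is only a rough proxy for startup overhead.

```python
# (file, engine) -> (Glue Python Shell 1 DPU seconds, Lambda 3008 MB seconds),
# copied from the benchmark tables above.
RESULTS = {
    ("January", "chDB"): (39.0, 3.370),
    ("January", "DuckDB"): (34.0, 4.061),
    ("Full year", "chDB"): (51.0, 27.171),
    ("Full year", "DuckDB"): (212.0, 187.332),
}

gaps = {case: round(glue_s - lambda_s, 3) for case, (glue_s, lambda_s) in RESULTS.items()}

for (file_name, engine), gap in gaps.items():
    print(f"{file_name:9} {engine:6} gap: {gap:.1f} s")
```

The gap stays between roughly 24 and 36 seconds even though the full-year file is about 17 times larger than the January file, which is consistent with a mostly fixed overhead rather than slower processing.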
Another point is that Lambda increases vCPU allocation as memory increases.
With 3008 MB, Lambda may achieve higher single-process performance than Glue Python Shell.
That said, Glue Python Shell still offers two major strengths: long execution time (up to 48 hours) and 16 GB of memory.
While there are many other decision factors, a simple rule of thumb could look like this:
File size
├─ Up to a few hundred MB → Lambda
│ └─ If processing exceeds 15 minutes → Glue Python Shell
└─ Multiple GB or complex workloads → Glue (Spark) / EMR
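The same rule of thumb can be written as a small function that returns both the choice and the reason for it. The byte thresholds are my own placeholders for "a few hundred MB" and "a few GB", and the function name and parameters are hypothetical. Adjust them to your own measurements.

```python
from dataclasses import dataclass

MB = 1024 * 1024
LAMBDA_MAX_BYTES = 500 * MB              # assumption: "a few hundred MB"
PYTHON_SHELL_MAX_BYTES = 3 * 1024 * MB   # assumption: "a few GB"
LAMBDA_TIMEOUT_S = 15 * 60               # Lambda execution time limit


@dataclass
class Choice:
    service: str
    reason: str


def choose_service(size_bytes: int, expected_runtime_s: float,
                   needs_distributed: bool = False) -> Choice:
    """Map input size and expected runtime to a service, with a stated reason."""
    if needs_distributed or size_bytes > PYTHON_SHELL_MAX_BYTES:
        return Choice("Glue (Spark) / EMR", "large or complex workload")
    if size_bytes <= LAMBDA_MAX_BYTES and expected_runtime_s < LAMBDA_TIMEOUT_S:
        return Choice("Lambda", "small input that finishes within 15 minutes")
    return Choice("Glue Python Shell", "exceeds Lambda's time or size comfort zone")


print(choose_service(48 * MB, 5))
print(choose_service(48 * MB, 20 * 60))
print(choose_service(10 * 1024 * MB, 600))
```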
The most important thing is being able to clearly explain why you chose a particular approach.
Conclusion
In this article, I discussed how to choose between AWS Lambda and AWS Glue Python Shell in the context of lightweight ETL.
For preprocessing workloads up to a few hundred MB, AWS Lambda is often the best choice in terms of execution time, cost, and operational simplicity.
Glue Python Shell is a useful option for filling the gap between Lambda and distributed processing, handling workloads that are too heavy for Lambda but do not require Spark or EMR.
Rather than forcing all workloads into Lambda or Glue Python Shell, choosing the right tool for each situation is essential.
I hope this article helps anyone considering lightweight data processing on AWS.


