
Aki for AWS Community Builders


Exploring the Potential of AWS Glue Python Shell as a Long-Running Batch Execution Environment

Original Japanese article: AWS Glue Python Shellの長時間バッチ実行環境としての可能性を探る

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In my recent articles, I’ve been exploring AWS Glue Python Shell.
Have you ever struggled with the question: “How should we run heavy batch jobs that need to run for a long time, maybe once a month, in production?”

Typical examples include weekly or monthly aggregations, backfills, or integrations with external systems—jobs that can easily run for dozens of hours. For these kinds of long-running workloads, you often have to make several architectural and operational decisions.

A very common approach is to launch an EC2 instance and run the batch job there.

This approach is relatively straightforward, but it also comes with additional concerns beyond the batch logic itself, such as server management, start/stop operations, capacity shortages, and handling Spot interruptions.

On the other hand, AWS Glue Python Shell has the following characteristics, and depending on how you use it, it can be a very convenient option:

  • Serverless, no infrastructure management required
  • Jobs can run for up to 48 hours by default, and up to 168 hours maximum
  • Pay only for what you actually run (pay-as-you-go)

In this article, I’ll run an experiment where I keep a Glue Python Shell job running for 48 hours. Based on that experiment, I’ll examine whether it really runs stably for long periods, and compare it with EC2 from both cost and operational perspectives. The goal is to evaluate whether Glue Python Shell can realistically be considered a long-running batch execution platform.


AWS Glue Python Shell: Key Characteristics

I’ve covered this in previous articles, but let’s briefly recap the main characteristics of Glue Python Shell:

  • Job type: Python Shell
  • Maximum execution time: 48 hours by default, up to 168 hours
  • Resource configuration: DPU (Data Processing Unit), configurable as either 1/16 DPU or 1 DPU
  • Pricing: DPU × execution time
  • Instance management: Not required (fully managed by AWS)

Since it doesn’t use Spark, it works particularly well for the following use cases:

  • Relatively lightweight ETL jobs
  • API integrations or workflows that are I/O-heavy rather than CPU-heavy

I’ve also written an article comparing Glue Python Shell and Lambda in the context of lightweight ETL, so feel free to check it out:

AWS Lambda and AWS Glue Python Shell in the Context of Lightweight ETL


Does It Really Run for 48 Hours? Let’s Test It

While it’s documented that Glue Python Shell can run for up to 48 hours, I was curious whether it really runs continuously without stopping or failing midway.

To test this, I created a simple job that outputs logs every hour and let it run for a full 48 hours.

In this experiment, I limited the execution to 48 hours, but Glue Python Shell can be configured to run for up to 168 hours (about 7 days). So you can read this with weekly or even longer batch workloads in mind.
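
For reference, both the DPU size and the timeout are just job settings. Below is a minimal boto3 sketch of how such a job could be defined; the job name, IAM role, and script location are placeholders, and Timeout is given in minutes (10080 minutes = 168 hours).

import boto3

glue = boto3.client("glue")

# Hypothetical job definition; role ARN and script path are placeholders.
glue.create_job(
    Name="long-running-batch",
    Role="arn:aws:iam::123456789012:role/GlueJobRole",
    Command={
        "Name": "pythonshell",                                      # Python Shell job type
        "PythonVersion": "3.9",
        "ScriptLocation": "s3://my-bucket/scripts/long_batch.py",
    },
    MaxCapacity=1.0,   # 1 DPU (4 vCPU / 16 GiB); use 0.0625 for 1/16 DPU
    Timeout=10080,     # in minutes: 10080 = 168 hours, the maximum
)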


Sample Code

import time
import datetime

def process_data():
    time.sleep(5)  # Dummy processing

def main():
    start_time = datetime.datetime.now()
    max_hours = 48
    max_seconds = max_hours * 3600

    print(f"[INFO] Job started at {start_time}", flush=True)

    next_log_seconds = 3600  # Next log after 1 hour

    while True:
        process_data()

        elapsed = (datetime.datetime.now() - start_time).total_seconds()

        if elapsed >= next_log_seconds:
            elapsed_hours = int(elapsed // 3600)
            print(f"[INFO] Elapsed time: {elapsed_hours} hours", flush=True)
            next_log_seconds += 3600

        if elapsed >= max_seconds:
            break

    print(
        f"[INFO] Job completed at {datetime.datetime.now()} "
        f"(total: {elapsed/3600:.2f} hours)",
        flush=True
    )

if __name__ == "__main__":
    main()

By using print(..., flush=True), log output is immediately sent to CloudWatch Logs.
Without flush=True, output may be buffered and appear in CloudWatch in batches after some delay.
For long-running jobs, being able to confirm progress via logs is important, so explicitly flushing logs can be helpful.
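
If you prefer the standard logging module over print, a minimal setup like the one below behaves the same way, because StreamHandler flushes after every record. Writing to stdout is an assumption here; Glue forwards it to CloudWatch Logs just like print output.

import logging
import sys

# Send log records to stdout so Glue forwards them to CloudWatch Logs.
# StreamHandler flushes after each record, so no explicit flush is needed.
logging.basicConfig(
    stream=sys.stdout,
    level=logging.INFO,
    format="[%(levelname)s] %(asctime)s %(message)s",
)
logger = logging.getLogger(__name__)

logger.info("Elapsed time: %d hours", 12)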


Results

CloudWatch Logs

CloudWatch Logs showed continuous output like the following:

[INFO] Job started at 2026-01-14 09:01:45.644511
[INFO] Elapsed time: 1 hours
[INFO] Elapsed time: 2 hours
[INFO] Elapsed time: 3 hours
[INFO] Elapsed time: 4 hours
[INFO] Elapsed time: 5 hours
[INFO] Elapsed time: 6 hours
[INFO] Elapsed time: 7 hours
[INFO] Elapsed time: 8 hours
[INFO] Elapsed time: 9 hours
[INFO] Elapsed time: 10 hours
[INFO] Elapsed time: 11 hours
[INFO] Elapsed time: 12 hours
[INFO] Elapsed time: 13 hours
[INFO] Elapsed time: 14 hours
[INFO] Elapsed time: 15 hours
[INFO] Elapsed time: 16 hours
[INFO] Elapsed time: 17 hours
[INFO] Elapsed time: 18 hours
[INFO] Elapsed time: 19 hours
[INFO] Elapsed time: 20 hours
[INFO] Elapsed time: 21 hours
[INFO] Elapsed time: 22 hours
[INFO] Elapsed time: 23 hours
[INFO] Elapsed time: 24 hours
[INFO] Elapsed time: 25 hours
[INFO] Elapsed time: 26 hours
[INFO] Elapsed time: 27 hours
[INFO] Elapsed time: 28 hours
[INFO] Elapsed time: 29 hours
[INFO] Elapsed time: 30 hours
[INFO] Elapsed time: 31 hours
[INFO] Elapsed time: 32 hours
[INFO] Elapsed time: 33 hours
[INFO] Elapsed time: 34 hours
[INFO] Elapsed time: 35 hours
[INFO] Elapsed time: 36 hours
[INFO] Elapsed time: 37 hours
[INFO] Elapsed time: 38 hours
[INFO] Elapsed time: 39 hours
[INFO] Elapsed time: 40 hours
[INFO] Elapsed time: 41 hours
[INFO] Elapsed time: 42 hours
[INFO] Elapsed time: 43 hours
[INFO] Elapsed time: 44 hours
[INFO] Elapsed time: 45 hours
[INFO] Elapsed time: 46 hours
[INFO] Elapsed time: 47 hours
[INFO] Elapsed time: 48 hours
[INFO] Job completed at 2026-01-16 09:01:49.059613 (total: 48.00 hours)

From the logs, we can clearly see that the job ran for approximately 48 hours.

Glue Job Run Monitoring


Glue’s Job Run Monitoring also confirmed a 48-hour execution.

The Run Status is shown as Timeout, but this is expected behavior—it simply means the job reached its maximum execution time of 48 hours.

In this run, the job used 1 DPU for 48 hours, resulting in 48 DPU-hours, exactly as expected.
The total cost for this 48-hour execution was $21.12.


Observations

  • No manual intervention was required from start to finish
  • No interruptions or unexpected timeouts
  • Logging was fully handled by CloudWatch

While this may sound obvious, the experiment confirms—using actual logs and metrics—that Glue Python Shell runs stably for long durations, exactly as documented.

One thing to note is that in AWS Cost Explorer, the cost was not visible until the full 48-hour execution had completed.
This means that if a job runs longer than intended, costs can increase significantly before you notice. You should consider putting monitoring or guardrails in place to detect unexpected long executions.
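
One lightweight guardrail is a small scheduled check (for example, a Lambda function triggered by EventBridge) that looks for job runs exceeding an expected duration. The sketch below is a hypothetical example; the job name and threshold are placeholders, and the notification step is left as a comment.

from datetime import datetime, timezone

import boto3

glue = boto3.client("glue")

JOB_NAME = "long-running-batch"   # placeholder job name
MAX_HOURS = 50                    # alert threshold, slightly above the expected 48 hours


def find_overrunning_runs():
    """Return IDs of running job runs that have exceeded the threshold."""
    runs = glue.get_job_runs(JobName=JOB_NAME, MaxResults=10)["JobRuns"]
    now = datetime.now(timezone.utc)
    overruns = []
    for run in runs:
        if run["JobRunState"] != "RUNNING":
            continue
        elapsed = (now - run["StartedOn"]).total_seconds()
        if elapsed > MAX_HOURS * 3600:
            overruns.append(run["Id"])
    return overruns


if __name__ == "__main__":
    for run_id in find_overrunning_runs():
        # In practice, publish to SNS or send a chat notification here.
        print(f"[WARN] Job run {run_id} has exceeded {MAX_HOURS} hours")

Setting the job's Timeout close to the expected runtime is also an effective hard stop, since Glue terminates the run once the timeout is reached.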


Comparison with EC2 (m6i.xlarge)

Next, let’s compare this with running the same workload on an EC2 instance with roughly equivalent specs.

Glue Python Shell (1 DPU):

  • vCPU: 4
  • Memory: 16 GiB

EC2 m6i.xlarge:

  • vCPU: 4
  • Memory: 16 GiB

Cost Comparison (48-Hour Execution)

Assumptions:

  • Execution frequency: once per month
  • Execution time: 48 hours

| Aspect | Glue Python Shell | EC2 m6i.xlarge |
| --- | --- | --- |
| Execution cost | Charged only for runtime | Charged while instance is running |
| Cost while idle | None | None if stopped (but EBS and Elastic IP costs may still apply) |
| Setup | None | AMI and launch configuration required |

| Execution Platform | Unit Price | 48-Hour Cost |
| --- | --- | --- |
| EC2 Spot (m6i.xlarge) | $0.0828/hour | $3.97 |
| EC2 On-Demand (m6i.xlarge) | $0.248/hour | $11.90 |
| Glue Python Shell (1 DPU) | $0.44/hour | $21.12 |
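
The 48-hour figures above are simply the hourly unit price multiplied by the runtime; a quick sanity check (the prices are for reference and vary by region and over time):

HOURS = 48
prices_per_hour = {
    "EC2 Spot (m6i.xlarge)": 0.0828,
    "EC2 On-Demand (m6i.xlarge)": 0.248,
    "Glue Python Shell (1 DPU)": 0.44,
}

# Multiply each hourly price by the 48-hour runtime.
for platform, hourly in prices_per_hour.items():
    print(f"{platform}: ${hourly * HOURS:.2f}")
# -> $3.97, $11.90, $21.12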

Purely from a runtime cost perspective, EC2 is cheaper.
However, with EC2 you also need to account for EBS volume costs, Elastic IP usage, whether the instance is left running outside batch windows, and ongoing OS maintenance and patching. These factors add to the real operational cost over time.


Operational Overhead

While EC2 can reduce costs, it clearly increases operational complexity.
In the table below, “launch failure” refers to whether users must explicitly design for and handle this risk.

| Aspect | EC2 (especially Spot) | Glue Python Shell |
| --- | --- | --- |
| Start/stop control | User-managed | Managed |
| Risk of launch failure | Sometimes | Rare |
| Interruption handling | Required | Not required |
| OS management | Required | Not required |
| Long execution | Depends on design | Up to 168 hours |
| Setup | AMI & launch config needed | None |

Things to Consider When Using EC2

Even with On-Demand instances, there are scenarios where instances fail to launch. Typical examples include:

  • InsufficientInstanceCapacity
    Occurs when AWS doesn’t have enough On-Demand capacity. This can happen even when restarting a stopped instance.

  • InstanceLimitExceeded
    Happens when you hit your regional instance quota. This can be resolved by requesting a quota increase.

  • UnauthorizedOperation
    Occurs when required IAM permissions are missing.

  • The requested configuration is currently not supported
    Happens when a specific combination of AZ, instance type, or AMI is unsupported.

These issues may be rare, but for long-running batch jobs that are usually stopped and only run occasionally, such uncertainty can become a non-trivial operational risk.
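
If the batch does run on EC2, these launch failures need explicit handling in your automation. The following is a minimal, hypothetical sketch that retries run_instances when capacity is unavailable; the AMI ID and retry policy are placeholders.

import time

import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")


def launch_with_retry(max_attempts=5, wait_seconds=300):
    """Try to launch the batch instance, retrying on capacity errors."""
    for attempt in range(1, max_attempts + 1):
        try:
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",   # placeholder AMI
                InstanceType="m6i.xlarge",
                MinCount=1,
                MaxCount=1,
            )
            return resp["Instances"][0]["InstanceId"]
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code == "InsufficientInstanceCapacity" and attempt < max_attempts:
                time.sleep(wait_seconds)   # wait and retry; capacity may free up
            else:
                raise

Retries (or falling back to another AZ or instance type) help with transient capacity issues, but they cannot guarantee a launch.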

To mitigate launch failures, AWS provides On-Demand Capacity Reservations.


What Is an On-Demand Capacity Reservation?

An On-Demand Capacity Reservation lets you reserve EC2 capacity in a specific Availability Zone in advance, ensuring your instance can launch when needed.

The benefit is improved launch reliability.
The downside is that you are charged for the reserved capacity even while the instance is stopped.

Therefore, whether capacity reservations make sense depends on whether it’s acceptable to pay continuous costs for a batch job that runs infrequently.
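
For completeness, a reservation can be created ahead of the batch window with a call like the one below; the Availability Zone and instance count are placeholders, and billing starts as soon as the reservation becomes active.

import boto3

ec2 = boto3.client("ec2")

# Reserve capacity for one m6i.xlarge in a specific AZ (placeholder values).
ec2.create_capacity_reservation(
    InstanceType="m6i.xlarge",
    InstancePlatform="Linux/UNIX",
    AvailabilityZone="ap-northeast-1a",
    InstanceCount=1,
    EndDateType="unlimited",   # keep the reservation until explicitly cancelled
)

Cancelling the reservation (cancel_capacity_reservation) right after the batch finishes keeps the idle cost down, but the next run is then exposed to the same launch-failure risk again.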


Trade-offs When Choosing AWS Glue Python Shell

AWS Glue Python Shell also has clear limitations:

  • Maximum execution time (48 hours by default, up to 168 hours)
  • CPU and memory can’t be finely tuned due to DPU-based configuration
  • No OS-level customization

If your workload requires:

  • Always-on execution
  • Execution times exceeding 168 hours
  • OS-level control

then EC2 is a better fit.

However, if you understand and accept these constraints, Glue Python Shell becomes a very convenient and pragmatic option.


Conclusion

In this article, I ran an actual 48-hour experiment with AWS Glue Python Shell and compared it with EC2 as a long-running batch execution platform.

Choosing between Glue Python Shell and EC2 isn’t just about raw execution cost—it also requires evaluating operational overhead, failure modes, and overall uncertainty.

Glue Python Shell is a realistic choice for long-running batch jobs that are infrequent, require stable execution, and aim to minimize operational burden. For workloads where “finishing reliably” is the top priority, Glue Python Shell is a particularly good fit.

On the other hand, if your workload runs frequently, requires long execution times, or needs OS-level flexibility, EC2 is the better choice. In that case, it’s important to carefully evaluate EBS sizing, Elastic IP usage, and whether capacity reservations are necessary.

When selecting a platform for long-running batch jobs, the key is not just “how cheap it is,” but how much uncertainty and operational complexity you’re willing to accept.

I hope this article helps you make a better decision when designing long-running batch processing systems.
