
Aki for AWS Community Builders


Organizing the Use Cases of AWS Step Functions and Glue Workflow for ETL Processing with AWS Glue Jobs

Original Japanese article: Glue JobのETL処理におけるAWS Step FunctionsとGlue Workflowの使い分けを整理する

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

When building data pipelines on AWS, deciding which workflow orchestration tool to use is an important architectural decision.
Especially when designing ETL pipelines centered around Glue Jobs, the following two services are commonly compared:

  • AWS Step Functions
  • AWS Glue Workflow

Both services are workflow orchestration tools that manage job dependencies and execute processes sequentially.
However, since they are designed with different strengths and philosophies, choosing the right one based on your requirements is critical.

In this article, I’ll organize the characteristics, advantages, disadvantages, and use cases of these two services, and discuss how to decide which one to choose.

Personally, I really like Glue Workflow, but I’ve had fewer opportunities to use it recently, which is honestly a bit disappointing.
I also like Glue Python Shell, but compared to standard Glue Jobs, I rarely get to use it these days either...

Major Workflow Orchestration Tools

AWS provides multiple workflow orchestration services, but the two most commonly compared for Glue Job–based data pipelines are the following:

| Service | Overview |
|---|---|
| AWS Step Functions | AWS managed workflow orchestration service. Can flexibly orchestrate a wide range of AWS services such as Lambda, Glue, ECS, and more |
| AWS Glue Workflow | Glue-native workflow feature. Defines pipelines using Glue Jobs, Crawlers, and Triggers |

AWS also offers Amazon Managed Workflows for Apache Airflow (MWAA).
MWAA becomes a strong option when you need highly complex dependency management or cross-cloud orchestration. However, in this article, I’ll focus specifically on comparing Step Functions and Glue Workflow.

Recently, many teams have also adopted workflow tools such as Airflow, Dagster, and Prefect. As always, selecting the right tool depends heavily on your requirements and goals.

AWS Step Functions

Overview

(Figure above: Visual workflow definition using Workflow Studio)

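AWS Step Functions is a workflow orchestration service that defines workflows using Amazon States Language (ASL), a JSON-based language. Definitions can also be written in YAML when they are embedded in tooling such as CloudFormation templates.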

It integrates with over 200 AWS services including Lambda, Glue, ECS, SNS, and DynamoDB, and supports flexible workflow controls such as conditional branching, parallel execution, and retry handling.

Features

Step Functions provides two workflow types:

| Type | Characteristics |
|---|---|
| Standard Workflow | Long-running execution, audit logs, exactly-once execution guarantee |
| Express Workflow | High throughput, low cost, optimized for short-lived processing (at-least-once) |

In the context of data pipelines, Standard Workflow is generally the more common choice.

Step Functions also provides AWS Step Functions Workflow Studio, a visual editor that allows workflows to be built through a GUI.

Advantages

  • Multi-service integration: Easily integrates Glue with other AWS services such as Lambda, ECS, SNS, and more
  • Flexible control flow: Supports conditional branching (Choice), parallel execution (Parallel/Map), and advanced error handling (Catch/Retry)
  • High observability: Execution history and per-step states can be visually inspected in the console
  • Event-driven integration: Easily triggered through EventBridge using S3 uploads or schedules
  • Parallel execution support (Map): Well suited for large-scale processing such as file-level or partition-level parallel execution. Distributed Map is especially useful for high-scale parallel workloads
  • Infrastructure as Code support: Can be fully managed through CloudFormation, CDK, or Terraform
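To make the Map-based parallelism above a little more concrete, here is a minimal sketch that builds an ASL definition as a Python dictionary. It assumes an input of the form {"partitions": ["2024-01-01", ...]} where each item is a string, and the job name "partition-etl-job" and the "--partition" argument are hypothetical names chosen for illustration.

```python
import json


def build_partition_map_definition(job_name: str, max_concurrency: int = 5) -> dict:
    """Return an ASL definition that starts one Glue job run per input partition."""
    return {
        "Comment": "Run a Glue job once per partition (illustrative sketch)",
        "StartAt": "ProcessPartitions",
        "States": {
            "ProcessPartitions": {
                "Type": "Map",
                # Assumed input shape: {"partitions": ["2024-01-01", "2024-01-02"]}
                "ItemsPath": "$.partitions",
                # Caps how many Glue job runs are in flight at once
                "MaxConcurrency": max_concurrency,
                "ItemProcessor": {
                    "ProcessorConfig": {"Mode": "INLINE"},
                    "StartAt": "RunGlueJob",
                    "States": {
                        "RunGlueJob": {
                            "Type": "Task",
                            # .sync makes the state wait until the job run finishes
                            "Resource": "arn:aws:states:::glue:startJobRun.sync",
                            "Parameters": {
                                "JobName": job_name,
                                # Hypothetical job argument; each Map item is passed as-is
                                "Arguments": {"--partition.$": "$"},
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


definition = build_partition_map_definition("partition-etl-job")
print(json.dumps(definition, indent=2))
```

The printed JSON can be used as the state machine definition. MaxConcurrency is worth setting deliberately, because each iteration starts a separate Glue job run and the job's own concurrency limit still applies.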

Disadvantages

  • Learning curve: Requires understanding ASL, and complex workflows can become verbose
  • Cost: Standard Workflow pricing is based on state transitions, so costs can increase with larger workflows
  • Glue integration setup: IAM roles, parameter passing, and Glue Job integration must be configured manually
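To get a feel for how state transitions translate into cost, a back-of-the-envelope estimate can be sketched as follows. The default price of 0.025 USD per 1,000 state transitions is an assumption for illustration only, and the free-tier allowance is left as a parameter; please check the current pricing page for your region before relying on any number.

```python
def estimate_standard_cost(
    transitions_per_execution: int,
    executions_per_month: int,
    price_per_1000_transitions: float = 0.025,  # assumed price, verify for your region
    free_transitions: int = 0,  # set this if a free tier applies to your account
) -> float:
    """Estimate the monthly Standard Workflow cost in USD from state transitions."""
    total = transitions_per_execution * executions_per_month
    billable = max(total - free_transitions, 0)
    return billable / 1000 * price_per_1000_transitions


# Example: a 12-transition pipeline that runs hourly for 30 days
monthly_cost = estimate_standard_cost(12, 24 * 30)
print(f"{monthly_cost:.3f} USD")
```

Retries and polling loops each add transitions, so a workflow that waits in a loop can cost noticeably more than its state count suggests.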

Use Cases

Step Functions is a strong fit for scenarios such as:

  • Hybrid pipelines combining Glue with Lambda, ECS, or other AWS services
  • Complex workflows requiring branching, parallel processing, or dynamic parameter passing
  • Pipelines that require SNS notifications or compensating actions on failures
  • Teams managing infrastructure strictly through IaC
  • Large-scale data platforms involving multiple teams
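As one possible shape for the "notify on failure" scenario above, the sketch below builds an ASL definition that runs a Glue job with a retry policy and publishes to SNS when the job ultimately fails. The job name, topic ARN, and retry values are hypothetical and only there for illustration.

```python
import json


def build_job_with_failure_notification(job_name: str, topic_arn: str) -> dict:
    """Return an ASL definition: Glue job with Retry, then SNS notification on failure."""
    return {
        "StartAt": "RunGlueJob",
        "States": {
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": job_name},
                # Retry the task twice with exponential backoff (values are illustrative)
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed"],
                        "IntervalSeconds": 60,
                        "MaxAttempts": 2,
                        "BackoffRate": 2.0,
                    }
                ],
                # After retries are exhausted, keep the error details and notify
                "Catch": [
                    {
                        "ErrorEquals": ["States.ALL"],
                        "ResultPath": "$.error",
                        "Next": "NotifyFailure",
                    }
                ],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": "arn:aws:states:::sns:publish",
                "Parameters": {
                    "TopicArn": topic_arn,
                    "Message.$": "States.Format('Glue job failed: {}', $.error.Cause)",
                },
                "Next": "MarkFailed",
            },
            # End in a Fail state so the execution is still reported as failed
            "MarkFailed": {"Type": "Fail", "Error": "GlueJobFailed"},
        },
    }


definition = build_job_with_failure_notification(
    "daily-etl-job",  # hypothetical job name
    "arn:aws:sns:us-east-1:123456789012:etl-alerts",  # placeholder topic ARN
)
print(json.dumps(definition, indent=2))
```

Ending in a Fail state after the notification is a design choice: without it, the execution would show as succeeded in the console even though the job failed.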

AWS Glue Workflow

Overview

(Figure above: Visual workflow definition using Glue Workflow)

AWS Glue Workflow is Glue’s native workflow orchestration feature.

It defines pipelines using three primary components:

  • Glue Jobs
  • Glue Crawlers
  • Glue Triggers

Pipelines can be configured directly from the Glue console GUI, making it easy to build Glue-centric ETL workflows quickly.

Features

The major components of Glue Workflow are as follows:

| Component | Role |
|---|---|
| Glue Job | Actual ETL processing using Spark or Python Shell |
| Glue Crawler | Scans data sources such as S3 and registers table metadata into the Data Catalog |
| Glue Trigger | Defines execution conditions such as schedules, events, or conditional dependencies |

Glue Triggers support four types:

  • SCHEDULED
  • ON_DEMAND
  • CONDITIONAL
  • EVENT (started by an Amazon EventBridge event)

This allows chained execution patterns, such as triggering downstream jobs based on upstream job success or failure.
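The chained pattern can be sketched as a pair of trigger definitions. The dictionaries below follow the request shape of the Glue CreateTrigger API as I understand it, and are meant to be passed to boto3 as glue.create_trigger(**params). The workflow, job, and crawler names and the schedule are hypothetical, and the sketch only builds the requests without calling AWS.

```python
def scheduled_start_trigger(workflow: str, job_name: str, schedule: str) -> dict:
    """Build a SCHEDULED trigger request that starts the first job of a workflow."""
    return {
        "Name": f"{workflow}-start",
        "WorkflowName": workflow,
        "Type": "SCHEDULED",
        "Schedule": schedule,
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_success_trigger(workflow: str, upstream_job: str, crawler_name: str) -> dict:
    """Build a CONDITIONAL trigger request that runs a crawler after a job succeeds."""
    return {
        "Name": f"{workflow}-after-{upstream_job}",
        "WorkflowName": workflow,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"CrawlerName": crawler_name}],
        "StartOnCreation": True,
    }


# Hypothetical names: run the job at 02:00 UTC daily, then refresh the catalog
triggers = [
    scheduled_start_trigger("sales-etl", "sales-transform-job", "cron(0 2 * * ? *)"),
    on_success_trigger("sales-etl", "sales-transform-job", "sales-crawler"),
]
for params in triggers:
    print(params["Name"], params["Type"])
```

Changing "State" to "FAILED" in the condition gives the failure branch, which is how a simple error path is usually expressed in a Glue Workflow.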

Advantages

  • Native Glue integration: Seamlessly integrates with Glue Jobs and Crawlers with minimal additional setup
  • Simple configuration: DAG-style workflows can be built intuitively through the console GUI
  • Low cost: No additional charge for the workflow itself beyond Glue Job execution costs
  • Easy crawler integration: Natural fit for workflows that update the Data Catalog after ETL execution
  • Glue Data Catalog integration: Job execution metadata and lineage can be managed centrally within Glue

Disadvantages

  • Glue-only orchestration: Cannot directly orchestrate Lambda, ECS, or other non-Glue services
  • Limited event-driven capabilities: Primarily Trigger- and schedule-based, making advanced event integration less flexible than Step Functions
  • Limited control flow: Weak support for advanced branching and dynamic parameter handling
  • Observability limitations: Detailed execution logs often require separate CloudWatch investigation
  • More difficult IaC management: CloudFormation and Terraform management can become cumbersome compared to Step Functions
  • Limited parallel execution control: Not ideal for fine-grained parallelization or Map-style orchestration
  • Weaker retry/re-execution control: Re-running only failed portions of a workflow is less flexible than in Step Functions

Use Cases

Glue Workflow is well suited for:

  • Simple ETL pipelines composed entirely of Glue Jobs and Crawlers
  • Periodic ingestion pipelines into S3-based data lakes with automatic catalog updates
  • Small teams or early-stage projects that need fast implementation
  • Organizations that prefer operating primarily through the Glue console

Which One Should You Choose?

Based on the characteristics of both services, the selection criteria can generally be summarized as follows.

When to Choose Glue Workflow

  • Your ETL pipeline consists only of Glue Jobs and Crawlers
  • Simple sequential execution and conditional triggers are sufficient
  • You want to build quickly (prototypes or small-scale projects)
  • Your operations are centered around the Glue console
  • You want to minimize costs

When to Choose Step Functions

  • You need integration with Lambda, ECS, or other AWS services
  • You require branching, parallel processing, or advanced error handling
  • You are adopting an EventBridge-centric event-driven architecture
  • You want strict Infrastructure as Code management
  • Multiple teams are involved in operating the data platform
  • Observability and audit logging are important
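For the EventBridge-centric case, the rule that starts the state machine is driven by an event pattern. Here is a minimal sketch of a pattern that matches S3 "Object Created" events under a prefix. The bucket name and prefix are hypothetical, and it assumes EventBridge notifications are enabled on the bucket.

```python
import json


def s3_object_created_pattern(bucket: str, prefix: str) -> dict:
    """Build an EventBridge event pattern for objects created under a prefix."""
    return {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {
            "bucket": {"name": [bucket]},
            # Prefix matching limits the rule to one landing area of the bucket
            "object": {"key": [{"prefix": prefix}]},
        },
    }


pattern = s3_object_created_pattern("my-data-lake-raw", "incoming/sales/")
print(json.dumps(pattern, indent=2))
```

The serialized pattern would be set as the rule's event pattern, with the Step Functions state machine as the rule target.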

Summary of Decision Criteria

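| Perspective | Glue Workflow | Step Functions |
|---|---|---|
| Supported Services | Glue only | Broad AWS integration |
| Control Flow | Simple | Flexible and advanced |
| Observability | Limited (details often require CloudWatch) | High (per-step execution history) |
| Ease of Configuration | Simple (console GUI) | △ (learning curve) |
| Cost | Low | Depends on state transitions |
| IaC Management | Can become cumbersome | Well supported (CloudFormation, CDK, Terraform) |
| Crawler Integration | Native | △ (manual setup required) |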

Glue Workflow is fundamentally Trigger-oriented and is not designed for advanced event orchestration like Step Functions.

Unless Glue Workflow specifically satisfies your requirements better, choosing Step Functions is generally the safer long-term option.
Personally, when designing architectures, I often start by considering Step Functions first.

That said, Glue Workflow remains a strong choice when the requirement is simply:

“I want to quickly build a Glue-centric ETL pipeline.”

Combining Both Services

Instead of choosing one or the other exclusively, it is also possible to invoke Glue Workflow from Step Functions.

For example, Step Functions can handle preprocessing and postprocessing with Lambda, while delegating the ETL core to Glue Workflow.

However, this introduces additional complexity because workflow state coordination between the two services must be managed carefully.
If simplicity is important, standardizing on one orchestration tool is generally easier operationally.

Caveats When Invoking Glue Workflow from Step Functions

1. Completion Detection Requires Polling

Step Functions provides a convenient .sync integration pattern for glue:startJobRun, which waits for job completion automatically.

However, glue:startWorkflowRun does not support the .sync integration pattern.

While you can invoke Glue Workflow through SDK integration, Step Functions will immediately proceed to the next state without waiting for completion.
As a result, you must implement custom polling logic to repeatedly check the WorkflowRun status.

2. Polling Logic Becomes Complex

You typically need to implement a loop like:

  • Wait
  • GetWorkflowRun
  • Choice on the WorkflowRun status (RUNNING / COMPLETED / STOPPED / ERROR)

This increases both the number of states and the verbosity of the ASL definition.
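A minimal sketch of that loop, built as a Python dictionary, might look like the following. It assumes the AWS SDK integration resources for Glue (aws-sdk:glue:startWorkflowRun and aws-sdk:glue:getWorkflowRun) and a hypothetical workflow name. Any status other than RUNNING or COMPLETED is treated as a failure here, which is a simplification.

```python
import json


def build_workflow_polling_definition(workflow_name: str, wait_seconds: int = 60) -> dict:
    """Return an ASL definition that starts a Glue Workflow and polls until it ends."""
    status_path = "$.poll.Run.Status"
    return {
        "StartAt": "StartWorkflowRun",
        "States": {
            "StartWorkflowRun": {
                "Type": "Task",
                # SDK integration: returns immediately with a RunId
                "Resource": "arn:aws:states:::aws-sdk:glue:startWorkflowRun",
                "Parameters": {"Name": workflow_name},
                "ResultPath": "$.start",
                "Next": "WaitBeforePoll",
            },
            "WaitBeforePoll": {
                "Type": "Wait",
                "Seconds": wait_seconds,
                "Next": "GetWorkflowRun",
            },
            "GetWorkflowRun": {
                "Type": "Task",
                "Resource": "arn:aws:states:::aws-sdk:glue:getWorkflowRun",
                "Parameters": {"Name": workflow_name, "RunId.$": "$.start.RunId"},
                # Keep $.start so the RunId is still available on the next iteration
                "ResultPath": "$.poll",
                "Next": "CheckStatus",
            },
            "CheckStatus": {
                "Type": "Choice",
                "Choices": [
                    {"Variable": status_path, "StringEquals": "COMPLETED", "Next": "Done"},
                    {"Variable": status_path, "StringEquals": "RUNNING", "Next": "WaitBeforePoll"},
                ],
                # Simplification: every other status ends the execution as failed
                "Default": "WorkflowFailed",
            },
            "Done": {"Type": "Succeed"},
            "WorkflowFailed": {"Type": "Fail", "Error": "GlueWorkflowRunFailed"},
        },
    }


definition = build_workflow_polling_definition("nightly-etl-workflow", wait_seconds=30)
print(json.dumps(definition, indent=2))
```

Even this stripped-down version needs six states to do what a single .sync task does for a Glue Job, and every loop iteration adds billable state transitions.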

3. Error Handling Becomes More Complicated

Glue Workflow status is returned at the WorkflowRun level rather than the individual Job level.

As a result, identifying which specific Glue Job failed requires requesting the run graph (GetWorkflowRun with IncludeGraph enabled) and adding parsing logic for the job run states in the response.
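The parsing step could be sketched like this, for example inside a Lambda function called from the state machine. It assumes the response was requested with the run graph included and follows the Run.Graph.Nodes structure as I understand it. The sample response and job names are made up for illustration, and the set of states treated as failures is my own choice.

```python
# Job run states treated as failures in this sketch (an assumption, adjust as needed)
FAILED_STATES = {"FAILED", "ERROR", "TIMEOUT"}


def find_failed_jobs(response: dict) -> list:
    """Return the job nodes in a GetWorkflowRun response whose runs failed."""
    failed = []
    nodes = response.get("Run", {}).get("Graph", {}).get("Nodes", [])
    for node in nodes:
        # The graph also contains TRIGGER and CRAWLER nodes; only jobs are checked
        if node.get("Type") != "JOB":
            continue
        for run in node.get("JobDetails", {}).get("JobRuns", []):
            if run.get("JobRunState") in FAILED_STATES:
                failed.append(
                    {
                        "job": node.get("Name"),
                        "state": run["JobRunState"],
                        "error": run.get("ErrorMessage"),
                    }
                )
    return failed


# Made-up, trimmed-down response for illustration
sample_response = {
    "Run": {
        "Status": "COMPLETED",
        "Graph": {
            "Nodes": [
                {"Type": "TRIGGER", "Name": "start"},
                {"Type": "JOB", "Name": "extract-job",
                 "JobDetails": {"JobRuns": [{"JobRunState": "SUCCEEDED"}]}},
                {"Type": "JOB", "Name": "transform-job",
                 "JobDetails": {"JobRuns": [{"JobRunState": "FAILED",
                                             "ErrorMessage": "Column not found"}]}},
            ]
        },
    }
}
failed_jobs = find_failed_jobs(sample_response)
print(failed_jobs)
```

In the sample, the run-level status alone does not reveal the failed job, which is why the per-node check is needed.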

Because of these complications, although combining both services is technically possible, I generally recommend avoiding Step Functions → Glue Workflow orchestration unless there is a compelling reason.

One possible use case is when you need to extend an existing Glue Workflow–based system incrementally.
Even then, rebuilding the orchestration directly in Step Functions using existing Glue Jobs often feels cleaner.

Migration Considerations

Some teams initially adopt Glue Workflow during the early stages of a project and later migrate to Step Functions as the data platform grows.

When migrating, the following considerations become important:

  • Workflow definitions must be rewritten: Glue Trigger definitions need to be converted into ASL
  • IAM roles must be redesigned: Step Functions requires permissions to invoke Glue Jobs
  • Extensive testing is necessary: Existing jobs must be validated carefully after migration
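For the simplest case, a linear chain of conditional triggers, the first two points can be sketched together: a helper that turns an ordered list of Glue Job names into a sequential ASL definition, and a helper that builds an IAM policy for the state machine role. The job names and ARNs are placeholders, and the action list reflects what I understand the .sync integration for Glue Jobs to require, so please verify it against the current documentation.

```python
import json


def chain_to_asl(job_names: list) -> dict:
    """Convert an ordered list of Glue Job names into a sequential ASL definition."""
    if not job_names:
        raise ValueError("at least one job name is required")
    states = {}
    for index, job in enumerate(job_names):
        state = {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job},
        }
        if index + 1 < len(job_names):
            # Replaces the "on success" conditional trigger between two jobs
            state["Next"] = f"Run-{job_names[index + 1]}"
        else:
            state["End"] = True
        states[f"Run-{job}"] = state
    return {"StartAt": f"Run-{job_names[0]}", "States": states}


def glue_job_policy(job_arns: list) -> dict:
    """Build an IAM policy document that lets Step Functions run the given Glue Jobs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:StartJobRun",
                    "glue:GetJobRun",
                    "glue:GetJobRuns",
                    "glue:BatchStopJobRun",
                ],
                "Resource": job_arns,
            }
        ],
    }


# Placeholder job names and ARNs
jobs = ["extract-job", "transform-job", "load-job"]
definition = chain_to_asl(jobs)
policy = glue_job_policy(
    [f"arn:aws:glue:us-east-1:123456789012:job/{name}" for name in jobs]
)
print(json.dumps(definition, indent=2))
```

Real workflows with fan-out, crawlers, or failure branches need more than this linear conversion, which is part of why the rewrite has to be tested carefully.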

Migration is certainly possible, but it is not trivial.
This is why it is important to consider future extensibility and maintainability from the beginning when selecting your orchestration tool.

Conclusion

In this article, I organized the differences between AWS Step Functions and Glue Workflow for ETL orchestration.

To summarize:

  • Glue Workflow: Glue-native, simple, low-cost, and ideal for rapidly building straightforward ETL pipelines
  • Step Functions: Better suited for multi-service orchestration, advanced workflow control, observability, and large-scale pipelines

Neither service is universally “better.”
The right choice depends on your use cases, organizational structure, operational requirements, and future scalability needs.

For small-scale ETL pipelines, Glue Workflow is often sufficient. However, as data platforms evolve, requirements such as exception handling, notifications, conditional branching, and integrations with other services tend to grow over time. In many cases, architectures gradually move toward Step Functions as complexity increases.

A practical strategy can be to start simple with Glue Workflow during the early stages, and later migrate to Step Functions when requirements become more sophisticated.

That said, considering future migration costs, building on Step Functions from the beginning can also be a very reasonable approach.

I hope this article helps anyone currently evaluating workflow orchestration options on AWS.
