The Phantom Menace
I’ve been a heavy user of AWS Glue since its early days, starting with version 0.9. It’s been a bit of a love-hate relationship—especially back then, when Glue jobs took what felt like an eternity to run. Over the years, though, Glue has come a long way. From upgrading Apache Spark to supporting modern data lake formats like Hudi, Iceberg, and Delta Lake, to introducing Generative AI capabilities, Glue has evolved into a powerful tool for building scalable ETL solutions.
But even as Glue has improved, I’ve found myself grappling with a different challenge—managing Glue jobs manually. As I’ve built solutions for clients over the years, the lack of automation in provisioning and deploying Glue jobs has become a pain point.
Back in the early days, we didn’t have CI/CD pipelines or automation tools to provision Glue jobs. Everything was done manually—configuring jobs, managing dependencies, and deploying them. At first, this seemed manageable for small-scale solutions, but as the complexity of pipelines grew, so did the problems. I always felt that something was off with that workflow; it leaves plenty of room for error, such as:
- Inconsistencies across environments (Dev, QA, Prod)
- Scaling issues - adding or modifying jobs manually is time-consuming and error-prone
- No version control - need I say more?
- Deployment complexity - dependencies, configurations, and so on
These challenges didn’t just slow us down—they put the reliability of our data pipelines at risk, introducing errors and inefficiencies that could cascade into costly problems downstream.
A New Hope
Fast forward to today, and I’ve adopted a completely different approach. By using AWS CDK (Cloud Development Kit) in combination with GitHub (though GitLab or Bitbucket would work just as well), I’ve been able to solve many of these challenges. AWS CDK allows me to define Glue resources as Infrastructure-as-Code (IaC), ensuring consistency and scalability. Integrating this with GitHub and CI/CD workflows has made deploying Glue jobs faster, more reliable, and far less error-prone.
The Medallion Architecture as a Framework
One of the data design patterns popular today is the Medallion Architecture, and it’s the framework we use for building the Lakehouse for our clients.
Our pipeline contains Glue scripts for the Bronze layer, Silver layer, and Gold layer.
Development Workflow
To ensure consistency and collaboration across the team, we follow a structured development workflow, as outlined in the diagram below. It integrates Jira for task tracking with GitHub, so tickets map to Git branches. The workflow itself is pretty standard.
The Project Structure
project-root/
├── ingestion/ # For ingestion Glue jobs
│ ├── configs/
│ │ ├── jobs.csv
│ │ ├── custom_jobs.yaml
│ │ ├── default_configs.yaml
│ │ └── README.md
│ ├── scripts/ # Scripts for ingestion Glue jobs
│ │ ├── dev-ingestion-script.py
│ │ └── prd-ingestion-script.py
│ ├── ingestion_stack.py # CDK stack for ingestion jobs
│ └── README.md
│
├── standardization/ # For standardization Glue jobs
│ ├── configs/
│ │ ├── jobs.csv
│ │ ├── custom_jobs.yaml
│ │ ├── default_configs.yaml
│ │ └── README.md
│ ├── scripts/
│ │ ├── dev-standardization-script.py
│ │ └── prd-standardization-script.py
│ ├── standardization_stack.py # CDK stack for standardization jobs
│ └── README.md
│
├── transformation/ # For transformation Glue jobs
│ ├── configs/
│ │ ├── jobs.csv
│ │ ├── custom_jobs.yaml
│ │ ├── default_configs.yaml
│ │ └── README.md
│ ├── scripts/
│ │ ├── dev-transformation-script.py
│ │ └── prd-transformation-script.py
│ ├── transformation_stack.py # CDK stack for transformation jobs
│ └── README.md
│
├── loading/ # For loading Glue jobs
│ ├── configs/
│ │ ├── jobs.csv
│ │ ├── custom_jobs.yaml
│ │ ├── default_configs.yaml
│ │ └── README.md
│ ├── scripts/
│ │ ├── dev-loading-script.py
│ │ └── prd-loading-script.py
│ ├── loading_stack.py # CDK stack for loading jobs
│ └── README.md
│
├── upload_script.py # Script to upload files to S3
├── app.py # Root entry point for AWS CDK
├── requirements.txt # Python dependencies for CDK
└── README.md # High-level project documentation
Config Files: Defaults and Customizations
Each folder (ingestion/, standardization/, transformation/, loading/)
contains:
configs/ - Configuration files specific to that component. These can share a similar structure but should have unique data for each purpose.
- jobs.csv - Defines the Glue jobs and their classifications. This file acts as the source of truth for the jobs you want to deploy. The columns can be adjusted as needed. Take note of the Classification column as well; it determines whether a job is provisioned with the default or a custom configuration, as discussed below. The table below shows the columns we use, followed by the equivalent raw CSV.
| JobName | Classification | Category | ConnectionName |
| --- | --- | --- | --- |
| dim-products | default | Transformation | redshift-conn |
| dim-users | default | Transformation | redshift-conn |
| fact-sales | custom | Transformation | redshift-conn |
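For reference, the same definitions as they would appear in the raw jobs.csv file (values copied directly from the table above):
JobName,Classification,Category,ConnectionName
dim-products,default,Transformation,redshift-conn
dim-users,default,Transformation,redshift-conn
fact-sales,custom,Transformation,redshift-conn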
- default_configs.yaml - Separation of default and custom configurations gives me the flexibility to manage AWS Glue jobs efficiently. With default configurations, I can define a baseline setup—for example, provisioning a Glue job with a G.1X worker type and 2 DPUs. This ensures that most jobs follow a consistent and standardized configuration, reducing the need for repetitive definitions.
WorkerType: G.1X
NumberOfWorkers: 2
GlueVersion: "5.0"
ExecutionClass: STANDARD
IAMRole: "arn:aws:iam::123456789012:role/glue-role"
Command:
  Name: "glueetl"
  PythonVersion: "3"
DefaultArguments:
  "--enable-metrics": "true"
  "--TempDir": "s3://default-bucket/temp/"
  "--job-language": "python"
  "--enable-glue-datacatalog": "true"
  "--spark-event-logs-path": "s3://bucket/logs/sparkHistoryLogs/"
ScriptLocationBase: "s3://bucket/cdk/scripts/transformation/"
Tags:
  Project: "Sales"
  Environment: "dev"
- custom_jobs.yaml - Custom configurations, on the other hand, allow me to handle exceptions where jobs require more specialized settings, like higher memory or specific Spark arguments. By separating the two, I can keep the defaults simple and focused while tailoring individual jobs as needed. A short sketch of how a custom entry overrides the defaults follows the example below.
fact-sales:
  WorkerType: G.2X
  NumberOfWorkers: 4
  GlueVersion: "5.0"
  ExecutionClass: STANDARD
  IAMRole: "arn:aws:iam::123456789012:role/glue-role"
  DefaultArguments:
    "--enable-metrics": "true"
    "--TempDir": "s3://my-bucket/temp/"
    "--job-language": "python"
  Command:
    Name: "glueetl"
    PythonVersion: "3"
  Tags:
    Project: "Sales Dashboard"
    Environment: "dev"
transformation-job2:
  WorkerType: G.1X
  NumberOfWorkers: 2
  GlueVersion: "5.0"
  ExecutionClass: FLEX
  IAMRole: "arn:aws:iam::123456789012:role/glue-role"
  DefaultArguments:
    "--enable-continuous-cloudwatch-log": "true"
  Tags:
    Project: "Sales Dashboard"
    Environment: "dev"
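To make the override behavior concrete, here is a minimal Python sketch of the shallow merge that the CDK stack (shown later) performs for custom jobs. The values are taken from the two YAML examples above; note that nested mappings such as Tags are replaced rather than deep-merged, which is why the stack merges tags separately.
# Simplified default and custom entries, taken from the YAML examples above
default_config = {"WorkerType": "G.1X", "NumberOfWorkers": 2, "Tags": {"Project": "Sales", "Environment": "dev"}}
custom_config = {"WorkerType": "G.2X", "NumberOfWorkers": 4, "Tags": {"Project": "Sales Dashboard", "Environment": "dev"}}

# Shallow merge: custom keys win, untouched defaults carry over
job_config = {**default_config, **custom_config}
print(job_config["WorkerType"])        # G.2X
print(job_config["NumberOfWorkers"])   # 4
print(job_config["Tags"]["Project"])   # Sales Dashboard (the whole Tags dict was replaced)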
scripts/ - contains the actual Python-based Glue scripts specific to the component.
Example:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
# Get job arguments
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'input_path', 'output_path'])
# Initialize GlueContext and SparkContext
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
# Input and output paths (passed as parameters)
input_path = args['input_path'] # e.g., s3://your-bucket/raw-data/customers/
output_path = args['output_path'] # e.g., s3://your-bucket/processed-data/dim_customers/
# Load raw data into a DataFrame
raw_df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load(input_path)
# Register the DataFrame as a temporary SQL table
raw_df.createOrReplaceTempView("raw_customers")
# Use Spark SQL to create the dimension table
dimension_query = """
SELECT
CAST(customer_id AS STRING) AS customer_id,
first_name,
last_name,
email,
CAST(date_of_birth AS DATE) AS date_of_birth,
country,
ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS row_num
FROM
raw_customers
WHERE
country IS NOT NULL
"""
# Execute the SQL query
dimension_df = spark.sql(dimension_query)
# Filter to include only the latest record per customer
final_dimension_df = dimension_df.filter(dimension_df.row_num == 1).drop("row_num")
# Write the resulting DataFrame to S3 in Parquet format
final_dimension_df.write.mode("overwrite").parquet(output_path)
print(f"Dimension table created and saved to {output_path}")
*_stack.py - The AWS CDK stack defining the Glue jobs for that specific component.
Example:
from aws_cdk import aws_glue as glue
from aws_cdk import Stack, CfnOutput
from constructs import Construct
import csv
import yaml
class GlueTransformationStack(Stack):
    """
    A CDK stack for creating AWS Glue jobs based on configurations specified in CSV and YAML files.
    """

    def __init__(self, scope: Construct, id: str, job_csv_file: str, custom_config_file: str, default_config_file: str, **kwargs):
        """
        Initialize the GlueTransformationStack.

        :param scope: The scope in which to define this construct.
        :param id: The scoped construct ID.
        :param job_csv_file: Path to the CSV file containing job definitions.
        :param custom_config_file: Path to the YAML file containing custom configurations.
        :param default_config_file: Path to the YAML file containing default configurations.
        """
        super().__init__(scope, id, **kwargs)

        # Load configurations from YAML files
        with open(custom_config_file) as f:
            custom_configs = yaml.safe_load(f)
        with open(default_config_file) as f:
            default_config = yaml.safe_load(f)

        # Process each job in the CSV file
        with open(job_csv_file, mode='r', encoding="utf-8-sig") as file:
            csv_reader = csv.DictReader(file)
            for row in csv_reader:
                self.process_job_row(row, custom_configs, default_config)

    def process_job_row(self, row, custom_configs, default_config):
        """
        Process a single row from the CSV file and create a Glue job.

        :param row: A dictionary representing a row from the CSV file.
        :param custom_configs: A dictionary of custom job configurations.
        :param default_config: A dictionary of default configurations.
        """
        job_name = row['JobName']
        classification = row['Classification']
        connection_name = row['ConnectionName']

        # Determine job configuration: merge custom settings if available
        if classification == 'custom' and job_name in custom_configs:
            job_config = {**default_config, **custom_configs[job_name]}
        else:
            job_config = default_config

        # Ensure 'Command' is defined in the configuration
        if 'Command' not in job_config:
            job_config['Command'] = default_config.get('Command', {})

        # Set script location directly from the job name
        script_name = f"{job_name}.py"
        job_config['Command']['ScriptLocation'] = job_config['ScriptLocationBase'] + script_name
        job_config['ConnectionName'] = connection_name

        # Merge tags from default and custom configurations
        default_tags = default_config.get('Tags', {})
        custom_tags = job_config.get('Tags', {})
        combined_tags = {**default_tags, **custom_tags}

        self.create_glue_job(job_name, job_config, combined_tags)

    def create_glue_job(self, job_name, job_config, tags):
        """
        Create an AWS Glue job using the provided configuration.

        :param job_name: The name of the Glue job.
        :param job_config: A dictionary containing the job configuration.
        :param tags: Combined tags for the Glue job.
        """
        glue_job = glue.CfnJob(
            self, job_name,
            name=job_name,
            role=job_config['IAMRole'],
            glue_version=job_config['GlueVersion'],
            command=glue.CfnJob.JobCommandProperty(
                name=job_config['Command']['Name'],
                script_location=job_config['Command']['ScriptLocation'],
                python_version=job_config['Command']['PythonVersion']
            ),
            default_arguments=job_config['DefaultArguments'],
            execution_class=job_config['ExecutionClass'],
            connections=glue.CfnJob.ConnectionsListProperty(
                connections=[job_config['ConnectionName']]
            ),
            worker_type=job_config['WorkerType'],
            number_of_workers=job_config['NumberOfWorkers'],
            tags=tags
        )

        CfnOutput(
            self, f"{job_name}Output",
            value=glue_job.ref,
            description=f"Name of the Glue job: {job_name}"
        )
Automating Script Uploads
The upload_script.py file at the project root dynamically uploads the contents of each component’s scripts/ folder to an S3 bucket. Update the folders dictionary (and the s3_destination prefix) at the bottom of the script if you want to target different components or paths.
import boto3
import os
import logging

# Configure logging
logging.basicConfig(
    filename="upload_files_to_s3.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)


def upload_files_to_s3(local_directory, bucket_name, s3_destination):
    """
    Uploads files from a local directory to an S3 bucket.

    :param local_directory: The local directory to upload.
    :param bucket_name: The name of the S3 bucket.
    :param s3_destination: The destination path in the S3 bucket.
    """
    s3_client = boto3.client('s3')

    # Walk through the local directory
    for root, _, files in os.walk(local_directory):
        for file in files:
            local_path = os.path.join(root, file)
            relative_path = os.path.relpath(local_path, local_directory)
            s3_path = os.path.join(s3_destination, relative_path).replace("\\", "/")

            # Upload file to S3
            try:
                s3_client.upload_file(local_path, bucket_name, s3_path)
                print(f"Successfully uploaded {file} to s3://{bucket_name}/{s3_path}")
                logging.info(f"Uploaded {file} to s3://{bucket_name}/{s3_path}")
            except Exception as e:
                print(f"Failed to upload {file}: {e}")
                logging.error(f"Failed to upload {file} to s3://{bucket_name}/{s3_path} - Error: {e}")


if __name__ == "__main__":
    import sys

    # Check command-line arguments
    if len(sys.argv) != 2:
        print("Usage: python upload_script.py <bucket_name>")
        logging.error("Script called with insufficient arguments.")
        sys.exit(1)

    # Parse the bucket name
    bucket_name = sys.argv[1]

    # Define the folders to process
    folders = {
        "ingestion": "ingestion/scripts",
        "standardization": "standardization/scripts",
        "transformation": "transformation/scripts",
        "loading": "loading/scripts",
    }

    # Loop through each folder and upload files
    for component, local_directory in folders.items():
        s3_destination = f"glue-scripts/{component}"
        print(f"\nProcessing folder: {local_directory} -> s3://{bucket_name}/{s3_destination}")
        logging.info(f"Processing folder: {local_directory} -> s3://{bucket_name}/{s3_destination}")
        upload_files_to_s3(local_directory, bucket_name, s3_destination)
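If you want to run the upload locally before wiring it into CI/CD, the script takes the target bucket as its only argument:
python3 upload_script.py <your-bucket-name>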
Bringing It All Together with app.py
The root-level CDK application orchestrates the deployment of all component-specific stacks (Ingestion, Standardization, Transformation, and Loading), passing each stack the paths to its own jobs.csv and YAML configuration files.
Example:
from aws_cdk import App

from ingestion.ingestion_stack import GlueIngestionStack
from standardization.standardization_stack import GlueStandardizationStack
from transformation.transformation_stack import GlueTransformationStack
from loading.loading_stack import GlueLoadingStack

app = App()

# Instantiate each component stack, pointing it at that component's config files.
# Each stack class follows the same pattern as GlueTransformationStack above.
for component, stack_cls in [
    ("ingestion", GlueIngestionStack),
    ("standardization", GlueStandardizationStack),
    ("transformation", GlueTransformationStack),
    ("loading", GlueLoadingStack),
]:
    stack_cls(
        app, f"Glue{component.capitalize()}Stack",
        job_csv_file=f"{component}/configs/jobs.csv",
        custom_config_file=f"{component}/configs/custom_jobs.yaml",
        default_config_file=f"{component}/configs/default_configs.yaml",
    )

app.synth()
Local Testing
Once you have these components, you can test locally using the following commands.
Synthesize the CloudFormation Templates
cdk synth
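cdk synth writes the generated CloudFormation templates to the cdk.out/ directory. If you want to confirm which stacks the app defines before deploying, you can also list them:
cdk ls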
Deploy to a Test Environment
Make sure your local AWS credentials point to your development account before deploying.
cdk deploy GlueIngestionStack --require-approval never
cdk deploy GlueStandardizationStack --require-approval never
Verify Resources in AWS:
Confirm that Glue jobs were created with the expected configurations.
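For example, you can spot-check one of the jobs from the sample jobs.csv with the AWS CLI and compare the returned settings against your YAML configs:
aws glue get-job --job-name dim-products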
(OPTIONAL) Purge the resources:
cdk destroy
CI/CD Integration with GitHub
Once you have the components ready, the next step is to push the code to a code repository. In this section, we’ll look at how to push your code to a repository (like GitHub) and configure CI/CD to automatically deploy Glue jobs on each push to a specific branch.
Prerequisites:
- A GitHub account
- An IAM role in AWS configured for GitHub OIDC, so the workflow can assume it (see the Permissions notes below)
Steps:
- Create a repository
- Push the code to the repository
- Go into your repository and click the Actions tab
- Select New workflow
- When choosing a workflow, locate and choose “set up a workflow yourself”
- Paste the following YAML and adjust as necessary.
name: Deploy Data Lake Glue Jobs

on:
  push:
    branches:
      - main # Replace with your branch if different

permissions:
  id-token: write # Required for requesting the JWT
  contents: read # Required for actions/checkout

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      # Step 1: Check out the code repository
      - name: Checkout Repository
        uses: actions/checkout@v3

      # Step 2: Configure AWS credentials using the IAM role
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4.0.2
        with:
          role-to-assume: ${{ secrets.AWS_IAM_ROLE }}
          aws-region: ${{ secrets.AWS_REGION }}
          role-session-name: GitHubActionsDeployment

      # Step 3: Set up Python
      - name: Setup Python
        uses: actions/setup-python@v5.1.0
        with:
          python-version: '3.10'
          cache: 'pip'

      # Step 4: Set up Node.js for AWS CDK
      - name: Setup Node.js
        uses: actions/setup-node@v4.0.0
        with:
          node-version: '21.2.0'

      # Step 5: Install AWS CDK CLI globally
      - name: Install AWS CDK
        run: npm install -g aws-cdk

      # Step 6: Verify CDK installation
      - name: Verify CDK Installation
        run: cdk --version

      # Step 7: Install Python dependencies
      - name: Install Python Dependencies
        run: pip install -r requirements.txt

      # Step 8: Upload Glue scripts for each layer
      - name: Upload Glue Scripts to S3
        run: python3 upload_script.py <your-bucket-name>

      # Step 9: Deploy all Glue stacks using AWS CDK
      - name: Deploy Glue Stacks
        run: cdk deploy GlueIngestionStack GlueStandardizationStack GlueTransformationStack GlueLoadingStack --require-approval never
Workflow Structure:
- Trigger (on) - Specifies that the workflow runs when there is a push to the main branch. If your main branch is named differently (e.g., main-prod), replace it here. This ensures the workflow executes only on the primary branch where approved code resides; see the GitHub docs on workflow event triggers for other options.
- Permissions - Note that you must configure OIDC between GitHub and AWS (see the prerequisites).
  - id-token: write - Grants the workflow permission to request an OpenID Connect (OIDC) token for securely authenticating with AWS.
  - contents: read - Allows the workflow to read repository content during the actions/checkout step.
- Job Definition:
  - Runner - Specifies the operating system environment (ubuntu-latest) where the job executes. This ensures compatibility with Python and Node.js for AWS CDK and Glue.
Steps:
- Checkout Repository - Checks out the code from the repository so that subsequent steps have access to the scripts. This is similar to checking out the code locally; it pulls the contents of the repository into the runner’s environment.
- Configure AWS Credentials - Configures AWS credentials using an IAM role stored in GitHub Secrets (see the GitHub docs on defining repository secrets). Key parameters:
  - role-to-assume - The ARN of the IAM role the workflow assumes for permissions. You can also use access keys, but I highly recommend IAM roles for security purposes.
  - aws-region - The region where AWS resources are deployed (e.g., ap-southeast-1).
- Setup Python - Sets up Python 3.10 in the workflow environment.
- Setup Node.js - Installs Node.js, required for running AWS CDK.
- Install AWS CDK - Installs AWS CDK globally using npm. This is necessary to execute cdk commands.
- Verify CDK Installation - Checks that AWS CDK was installed correctly by outputting its version.
- Install Python Dependencies - Installs the Python dependencies specified in requirements.txt, such as aws-cdk-lib, constructs, and boto3.
- Upload Glue Scripts to S3 - Executes a Python script to upload Glue job scripts to the specified S3 bucket. This ensures the Glue jobs have access to the latest ETL scripts stored in S3.
- Deploy Glue Stacks - Deploys all defined Glue stacks (GlueIngestionStack, GlueStandardizationStack, etc.) using AWS CDK. The --require-approval never flag skips manual approval prompts, enabling fully automated deployments.
You can now push to your main branch and automatically deploy Glue Jobs using GitHub Actions.
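To illustrate what the day-to-day workflow looks like with this in place, here is a hypothetical example of adding a new default-classified job called dim-orders to the transformation layer. It assumes scripts are named after their JobName (which is what the stack expects when it builds ScriptLocation) and that your branch eventually gets merged into main, since that is what triggers the deployment:
# Register the new job in the transformation layer's jobs.csv (hypothetical row)
echo "dim-orders,default,Transformation,redshift-conn" >> transformation/configs/jobs.csv

# Add the matching Glue script; the stack expects <JobName>.py
touch transformation/scripts/dim-orders.py   # then write the actual ETL logic

# Commit on a feature branch and open a PR; merging to main triggers the deployment
git checkout -b feature/dim-orders
git add transformation/
git commit -m "Add dim-orders Glue job"
git push origin feature/dim-orders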
I Have The High Ground
Adopting a CI/CD-driven workflow for deploying AWS Glue jobs has been a transformative step for our team. By integrating AWS CDK, GitHub, and automated pipelines, we’ve significantly improved our deployment process. Manual errors, configuration inconsistencies, and deployment delays are challenges we’ve left behind, allowing us to focus on delivering reliable and scalable data solutions.
This approach ensures that every change is traceable, reviewable, and deployed consistently across environments.
This workflow has worked well for us based on our requirements and project goals. However, I know that there’s always room for improvement, and workflows often evolve over time to meet new challenges. If you have suggestions, insights, or ideas on how to further enhance this approach, I’d be happy to discuss them.
This blog is authored solely by me and reflects my personal opinions and experiences, not those of my employer. All references to products, including names, logos, and trademarks, belong to their respective owners and are used for identification purposes only.