Sushanta Paudel

ETL Made Easy: Integrating Multi-Source Data with AWS Glue

AWS Glue is a serverless data integration service that you can use to perform Extract, Transform, and Load (ETL) jobs. It is built to handle large datasets and works with a wide variety of data sources and formats, including data lakes on Amazon S3, data warehouses such as Amazon Redshift, DynamoDB, and data pipelines.

The ETL process in AWS Glue is usually as follows:

  • Extracting data from different data stores such as relational databases, NoSQL databases, or object stores like Amazon S3.
  • Transforming the data by converting it to the required formats or combining datasets to compute new values.
  • Loading the data into a new data store, as required by the use case.

In this guide, we will unify multi-source order and product data with AWS Glue for analytics and marketing insights.

Step 1: Log in to your AWS account and open the AWS DynamoDB service. Then, find and click on the Book_Orders table. Once inside, you’ll be able to view all the records stored in the table, which look like this:

Book Order Data stored in DynamoDB
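
If you prefer to inspect the table from code rather than the console, here is a minimal boto3 sketch; the table name is the one shown in this step, and the region is an assumption you should adjust for your account:

import boto3

# Scan the source table and print its items.
# The region is an assumption for illustration; use your own.
dynamodb = boto3.resource("dynamodb", region_name="us-east-1")
orders_table = dynamodb.Table("Book_Orders")

# A single scan returns up to 1 MB of data, enough for a small demo table.
response = orders_table.scan()
for item in response["Items"]:
    print(item)

# Note: the Glue script later reads a table named "book_orders"; DynamoDB table
# names are case-sensitive, so use whichever name your table actually has.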

Step 2: Go to the S3 service in AWS and locate the bucket that holds the product data. Open it to explore and review the files stored inside.

Product Data stored in AWS S3

The data in these files is stored in JSON format and looks like the following example:

{
  "book_id": 101,
  "title": "Learn Python",
  "author": "John Doe",
  "weight_kg": 0.5,
  "category": "Programming"
}
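
For a quick look at these files from code, here is a minimal boto3 sketch; the bucket name and data/ prefix match the values passed to the Glue job later in this guide, and it assumes one JSON object per file:

import boto3
import json

# List the product files under the data/ prefix and print their contents.
s3 = boto3.client("s3")
bucket = "my-glue-books-data"   # same bucket passed to the Glue job later
prefix = "data/"

listing = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)
for obj in listing.get("Contents", []):
    body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
    print(obj["Key"], json.loads(body))  # assumes one JSON object per file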

Step 3: Now let's create the destination DynamoDB Table

This is where the transformed data will go.

a. Create a new DynamoDB table: book_orders_transformed

b. Partition key: order_id (Number)

c. Leave the defaults → Create table.

Initially, it will have no items.

Destination DynamoDB table for transformed data
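
The same table can also be created programmatically. A minimal boto3 sketch, mirroring the values above and using on-demand capacity in place of the console defaults:

import boto3

# Create the destination table from Step 3: partition key order_id (Number).
dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="book_orders_transformed",
    AttributeDefinitions=[{"AttributeName": "order_id", "AttributeType": "N"}],
    KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",  # assumption: on-demand instead of provisioned capacity
)

# Wait until the table is ACTIVE before the Glue job writes to it.
dynamodb.get_waiter("table_exists").wait(TableName="book_orders_transformed")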

Step 4: Now let us create an AWS Glue Job

a. Open the AWS Glue console
In the AWS Console, go to the AWS Glue dashboard, then click ETL jobs and create a new ETL job.

b. Configure the job
Let us configure the job with the following parameters (a programmatic equivalent is sketched after this list):

  • Name: book_orders_etl
  • IAM role: select an existing role with S3 & DynamoDB permissions or create one.
  • Type: Spark (script language: Python)
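
If you prefer the Glue API over the console, here is a minimal boto3 sketch of the same configuration; the role name and script location are placeholders, since the console normally uploads the script to S3 for you:

import boto3

# Create the Spark (Python) ETL job programmatically.
glue = boto3.client("glue")
glue.create_job(
    Name="book_orders_etl",
    Role="MyGlueETLRole",  # placeholder: a role with S3 and DynamoDB permissions
    Command={
        "Name": "glueetl",  # Spark job type
        "ScriptLocation": "s3://my-glue-books-data/scripts/book_orders_etl.py",  # placeholder path
        "PythonVersion": "3",
    },
    GlueVersion="4.0",  # assumption: any recent Glue version works for this script
)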

Step 5: Implementing the Glue ETL Script
Here’s the Python Spark script for our data transformation:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame

# Accept parameters at runtime
args = getResolvedOptions(sys.argv, ['JOB_NAME', 's3_bucket'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Function to calculate total price and total weight
def calculate_totals(row):
    row["total_price"] = round(row["quantity"] * row["unit_price"], 2)
    row["total_weight_kg"] = round(row["quantity"] * row["weight_kg"], 2)
    return row

# Load product data from S3
input_path = f"s3://{args['s3_bucket']}/data/"
books_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": [input_path], "recurse": True},  # "recurse" is an S3 connection option
    format="json"
)

# Load order data from DynamoDB
orders_frame = glueContext.create_dynamic_frame.from_options(
    connection_type="dynamodb",
    connection_options={
        "dynamodb.input.tableName": "book_orders",
        "dynamodb.throughput.read.percent": "1.0"
    }
)

# Join orders with product info
joined_frame = Join.apply(
    frame1=orders_frame,
    frame2=books_frame,
    keys1=['book_id'],
    keys2=['book_id']
)

# Convert DynamicFrame to DataFrame to apply map
df = joined_frame.toDF()

# Apply transformation
transformed_df = df.rdd.map(lambda row: calculate_totals(row.asDict())).toDF()

# Convert back to DynamicFrame
transformed_frame = DynamicFrame.fromDF(transformed_df, glueContext, "transformed_frame")

# Write transformed data back to DynamoDB
glueContext.write_dynamic_frame.from_options(
    frame=transformed_frame,
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "book_orders_transformed",
        "dynamodb.throughput.write.percent": "1.0"
    }
)

job.commit()

AWS Glue Job ETL script

Step 6: Run the Glue Job

  • We will open our job and click on the Run with parameters option (a programmatic equivalent is sketched after this list)
  • Add a job parameter with Key --s3_bucket and Value my-glue-books-data (the console requires the -- prefix; getResolvedOptions reads it as s3_bucket)
  • Click Run job
  • We can then monitor the job in Run details and wait until the Status changes to Succeeded
  • Now we can open the book_orders_transformed table in DynamoDB and verify that it contains our transformed data fields.
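
Starting and monitoring the run can also be done with the Glue API. A minimal sketch, assuming the job and bucket names used above:

import boto3
import time

# Start the job with the s3_bucket parameter and poll until it finishes.
glue = boto3.client("glue")
run = glue.start_job_run(
    JobName="book_orders_etl",
    Arguments={"--s3_bucket": "my-glue-books-data"},  # job arguments use the -- prefix
)

while True:
    job_run = glue.get_job_run(JobName="book_orders_etl", RunId=run["JobRunId"])
    status = job_run["JobRun"]["JobRunState"]
    print("Job status:", status)
    if status in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)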

Step 7: Validate Your ETL Output

The DynamoDB table book_orders_transformed will now have our transformed data fields.

Example records in the table:

{
  "order_id": 1,
  "customer_name": "Alice",
  "book_id": 101,
  "quantity": 2,
  "unit_price": 12.5,
  "order_date": "2025-10-01",
  "title": "Learn Python",
  "author": "John Doe",
  "weight_kg": 0.5,
  "category": "Programming",
  "total_price": 25.0,
  "total_weight_kg": 1.0
}
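
To spot-check a record from code, here is a minimal boto3 sketch; order_id 1 matches the example above, so adjust it to an id that exists in your data:

import boto3

# Fetch one transformed record and confirm the computed fields are present.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("book_orders_transformed")

item = table.get_item(Key={"order_id": 1}).get("Item")
print(item)  # expect total_price and total_weight_kg alongside the joined fields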

In this guide, we have successfully unified and transformed multi-source order and product data using AWS Glue. The ETL workflow extracted data from both DynamoDB and S3, applied the transformations, including the total price and total weight calculations, and loaded the enriched dataset into a new DynamoDB table ready for analytics and marketing insights.
