A Practical Guide to MLOps on AWS: Transforming Raw Data into AI-Ready Datasets with AWS Glue (Phase 02)

In Phase 01, we built the ingestion layer of our Retail AI Insights system. We streamed historical product interaction data into Amazon S3 (Bronze zone) and stored key product metadata with inventory information in DynamoDB.

Now that we have raw data arriving reliably, it's time to clean, enrich, and organize it for downstream AI workflows.

Objective

We'll transform raw event data from the Bronze zone into:

  • Cleaned, analysis-ready Parquet files in the Silver zone
  • Forecast-specific feature sets in the Gold zone under /forecast_ready/
  • Recommendation-ready CSV files in the Gold zone under /recommendations_ready/

Architecture diagram: transforming raw data into AI-ready datasets with AWS Glue

This will power:

  • Demand forecasting via Amazon Bedrock
  • Personalized product recommendations using Amazon Personalize

What We'll Build in This Phase

  • AWS Glue Jobs: Python scripts to clean, transform, and write data to the appropriate S3 zone
  • AWS Glue Crawlers: Catalog metadata from S3 into tables for Athena & further processing
  • AWS CDK Stack: Provisions all jobs, buckets, and crawlers
  • Athena Queries: Run sanity checks on the transformed data

Directory & Bucket Layout

We'll now be working with the following S3 zones:

  • retail-ai-bronze-zone/ → Raw JSON from Firehose
  • retail-ai-silver-zone/cleaned_data/ → Cleaned Parquet
  • retail-ai-gold-zone/forecast_ready/ → Aggregated features for forecasting
  • retail-ai-gold-zone/recommendations_ready/ → CSV with item metadata for Personalize

You'll also notice a fourth bucket, retail-ai-zone-assets/, which stores the ETL scripts and the training dataset.
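
These buckets come from the Phase 01 stack. For reference, here's a minimal sketch of how they might be declared in CDK (the construct IDs are assumptions; note the versioning on the assets bucket, which we'll rely on later when re-uploading scripts):

import { Bucket } from "aws-cdk-lib/aws-s3";

const bronzeBucket = new Bucket(this, "BronzeBucket", {
  bucketName: "retail-ai-bronze-zone",
});
const silverBucket = new Bucket(this, "SilverBucket", {
  bucketName: "retail-ai-silver-zone",
});
const goldBucket = new Bucket(this, "GoldBucket", {
  bucketName: "retail-ai-gold-zone",
});
const dataAssetsBucket = new Bucket(this, "DataAssetsBucket", {
  bucketName: "retail-ai-zone-assets",
  versioned: true, // keeps version history when scripts are re-uploaded
});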

Step 1 - Creating Glue Resources via CDK

Now that we've set up our storage zones and uploaded the required ETL scripts and datasets, it's time to define the Glue resources with AWS CDK.

We'll create:

  • Three Glue jobs:
    • DataCleaningETLJob → Cleans raw JSON into structured Parquet for the Silver zone.
    • ForecastGoldETLJob → Derives forecasting features from the cleaned data for the Gold zone.
    • RecommendationGoldETLJob → Prepares an item-metadata CSV for Amazon Personalize.
  • Four crawlers, one per zone prefix, to catalog the outputs
  • Athena queries to validate everything

From the project root, generate the construct file:

mkdir -p lib/constructs/analytics && touch lib/constructs/analytics/glue-resources.ts

Make sure your local scripts/ and dataset/ directories are present, then upload them to your S3 assets bucket:

aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/forecast_gold_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/user_interaction_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./dataset/events_with_metadata.csv s3://retail-ai-zone-assets/dataset/
aws s3 cp ./scripts/inventory_forecaster.py s3://retail-ai-zone-assets/scripts/
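
To double-check that everything landed:

aws s3 ls s3://retail-ai-zone-assets/ --recursive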

Define Glue Jobs & Crawlers in CDK

Now, open the lib/constructs/analytics/glue-resources.ts file and define the full CDK logic to create:

  • A Glue job role with required permissions
  • The three ETL jobs with their respective scripts
  • Four crawlers with S3 targets pointing to Bronze, Silver, Forecast, and Recommendation zones

Add the following code:

import { Construct } from "constructs";
import * as cdk from "aws-cdk-lib";

import { Bucket } from "aws-cdk-lib/aws-s3";
import { CfnCrawler, CfnJob, CfnDatabase } from "aws-cdk-lib/aws-glue";
import {
  Role,
  ServicePrincipal,
  ManagedPolicy,
  PolicyStatement,
} from "aws-cdk-lib/aws-iam";

interface GlueProps {
  bronzeBucket: Bucket;
  silverBucket: Bucket;
  goldBucket: Bucket;
  dataAssetsBucket: Bucket;
}

export class GlueResources extends Construct {
  constructor(scope: Construct, id: string, props: GlueProps) {
    super(scope, id);

    const { bronzeBucket, silverBucket, goldBucket, dataAssetsBucket } = props;

    // Glue Database
    const glueDatabase = new CfnDatabase(this, "SalesDatabase", {
      catalogId: cdk.Stack.of(this).account,
      databaseInput: {
        name: "sales_data_db",
      },
    });

    // Create IAM Role for Glue
    const glueRole = new Role(this, "GlueServiceRole", {
      assumedBy: new ServicePrincipal("glue.amazonaws.com"),
    });

    bronzeBucket.grantRead(glueRole);
    silverBucket.grantReadWrite(glueRole);
    goldBucket.grantReadWrite(glueRole);

    glueRole.addToPolicy(
      new PolicyStatement({
        actions: ["s3:GetObject"],
        resources: [`${dataAssetsBucket.bucketArn}/*`],
      })
    );

    glueRole.addManagedPolicy(
      ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")
    );

    // Glue Crawler (for Bronze Bucket)
    new CfnCrawler(this, "DataCrawlerBronze", {
      name: "DataCrawlerBronze",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: bronzeBucket.s3UrlForObject() }],
      },
      tablePrefix: "bronze_",
    });

    // Glue ETL Job
    new CfnJob(this, "DataCleaningETLJob", {
      name: "DataCleaningETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/sales_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--bronze_bucket": bronzeBucket.bucketName,
        "--silver_bucket": silverBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });

    // Glue Crawler (for Silver Bucket)
    new CfnCrawler(this, "DataCrawlerSilver", {
      name: "DataCrawlerSilver",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: silverBucket.s3UrlForObject("cleaned_data/") }],
      },
      tablePrefix: "silver_",
    });

    // Glue Crawler (for Gold Bucket, forecast_ready)
    new CfnCrawler(this, "DataCrawlerForecast", {
      name: "DataCrawlerForecast",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: goldBucket.s3UrlForObject("forecast_ready/") }],
      },
      tablePrefix: "gold_",
    });

    // Glue Crawler (for Gold Bucket, recommendations_ready)
    new CfnCrawler(this, "DataCrawlerRecommendations", {
      name: "DataCrawlerRecommendations",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: goldBucket.s3UrlForObject("recommendations_ready/") }],
      },
      tablePrefix: "gold_",
    });

    // Glue ETL Job to output forecast ready dataset
    new CfnJob(this, "ForecastGoldETLJob", {
      name: "ForecastGoldETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/forecast_gold_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--silver_bucket": silverBucket.bucketName,
        "--gold_bucket": goldBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });

    // Glue ETL Job to output recommendation ready dataset
    new CfnJob(this, "RecommendationGoldETLJob", {
      name: "RecommendationGoldETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/user_interaction_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--silver_bucket": silverBucket.bucketName,
        "--gold_bucket": goldBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });
  }
}

Now wire it up in the retail-ai-insights-stack.ts file.
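
First, import the construct at the top of the stack file (the relative path assumes the stack file lives in lib/):

import { GlueResources } from "./constructs/analytics/glue-resources";

Then instantiate it alongside the existing resources: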

/**
 * Glue ETL Resources
 **/
new GlueResources(this, "GlueResources", {
  bronzeBucket,
  silverBucket,
  goldBucket,
  dataAssetsBucket,
});

Once deployed via cdk deploy:

  1. Navigate to AWS Glue > ETL Jobs - you should see the three jobs:

AWS Glue Studio

  2. Go to AWS Glue > Data Catalog > Crawlers – ensure four crawlers exist:

AWS Glue Crawlers
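
You can also confirm both from the CLI:

aws glue list-jobs
aws glue list-crawlers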

Step 2 - Run Glue Jobs to Transform Raw Data

Now that our Glue jobs and crawlers are deployed, let’s walk through how we run the ETL flow across the Bronze, Silver, and Gold zones.

Locate Raw Data in Bronze Bucket

  1. Go to the Amazon S3 Console and open the retail-ai-bronze-zone bucket.
  2. Drill down through the date-partitioned directories until you reach the raw data file, and note the tree structure; in my case it's dataset/2025/05/26/20.
  3. Copy this full prefix path.
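
If you'd rather find the prefix from the terminal, list the bucket recursively:

aws s3 ls s3://retail-ai-bronze-zone/ --recursive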

Update the ETL Script Input Path

Open sales_etl_script.py in VS Code.
On line 36, update the input_path variable to reflect the prefix you just copied:

input_path = f"s3://{bronze_bucket}/dataset/2025/05/26/20/"
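
For context, here's a minimal sketch of what a cleaning script like sales_etl_script.py might look like; the real script is in the repo, and the specific columns and cleaning steps below are assumptions:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Bucket names arrive via the defaultArguments we set in CDK
args = getResolvedOptions(sys.argv, ["JOB_NAME", "bronze_bucket", "silver_bucket"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# The Firehose prefix you copied above
input_path = f"s3://{args['bronze_bucket']}/dataset/2025/05/26/20/"

raw = spark.read.json(input_path)

# Drop incomplete events, normalize types, and deduplicate (assumed columns)
cleaned = (
    raw.dropna(subset=["user_id", "event_type", "timestamp"])
    .withColumn("price", F.col("price").cast("double"))
    .withColumn("timestamp", F.to_timestamp("timestamp"))
    .dropDuplicates()
)

cleaned.write.mode("overwrite").parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

job.commit()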

Re-upload the modified script to your S3 data-assets bucket:

aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/

Because versioning is enabled on the bucket, this will replace the previous file while preserving version history.

Run the ETL Jobs

Now let’s kick off the transformation pipeline:

Run DataCleaningETLJob

  • Go to AWS Glue Console > ETL Jobs.
  • Select the DataCleaningETLJob and click Run Job.
  • This job will:
    • Read raw JSON data from the Bronze bucket.
    • Clean, cast, and convert it to Parquet.
    • Store the results in the retail-ai-silver-zone bucket under cleaned_data/

Running AWS Glue Job
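
Alternatively, you can start the job from the CLI and poll its status:

RUN_ID=$(aws glue start-job-run --job-name DataCleaningETLJob --query JobRunId --output text)
aws glue get-job-run --job-name DataCleaningETLJob --run-id "$RUN_ID" --query "JobRun.JobRunState"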

Once the run succeeds, navigate to the retail-ai-silver-zone bucket and confirm the cleaned_data/ output:

S3 Bucket for Silver Zone

Run ForecastGoldETLJob

  • Go to AWS Glue Console > ETL Jobs.
  • Select the ForecastGoldETLJob and click Run Job.
  • This job will:
    • Read the cleaned data from retail-ai-silver-zone/cleaned_data/
    • Aggregate daily sales (see the sketch after this list)
    • Output the transformed data to retail-ai-gold-zone/forecast_ready/
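
Conceptually, the daily aggregation boils down to a date-bucketed groupBy. Here's a minimal sketch of the core transform, reusing the Glue boilerplate from the cleaning sketch above (with gold_bucket added to getResolvedOptions); the purchase filter and column names are assumptions:

cleaned = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

daily_sales = (
    cleaned.filter(F.col("event_type") == "purchase")  # assumption: purchases drive demand
    .withColumn("event_date", F.to_date("timestamp"))
    .groupBy("event_date", "product_name")
    .agg(
        F.count("*").alias("units_sold"),
        F.round(F.sum("price"), 2).alias("revenue"),
    )
)

daily_sales.write.mode("overwrite").parquet(f"s3://{args['gold_bucket']}/forecast_ready/")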

Running AWS Glue Job

Once completed, visit the Gold bucket and confirm the forecast files are present in that directory.

S3 Bucket for Gold Zone

Run RecommendationGoldETLJob

  • Go to AWS Glue Console > ETL Jobs.
  • Select the RecommendationGoldETLJob and click Run Job.
  • This job will:
    • Read cleaned product data from the Silver zone
    • Output only the required item metadata in CSV format (sketched below)
    • Save to retail-ai-gold-zone/recommendations_ready/
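
Amazon Personalize requires an ITEM_ID column in the items dataset, and the remaining columns must match the schema you define when importing. A sketch of the core transform, again reusing the Glue boilerplate above (source column names are assumptions):

cleaned = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

items = cleaned.select(
    F.col("product_id").alias("ITEM_ID"),  # assumption: Silver data carries a product_id
    F.col("product_name").alias("PRODUCT_NAME"),
    F.col("price").alias("PRICE"),
).dropDuplicates(["ITEM_ID"])

# Write a single CSV with a header so it can be imported as-is
items.coalesce(1).write.mode("overwrite").option("header", "true").csv(
    f"s3://{args['gold_bucket']}/recommendations_ready/"
)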

Running AWS Glue Job

After the job runs successfully, go to the Gold bucket and verify the structure and CSV file.

S3 Bucket for Gold Zone

Run All Glue Crawlers

Once the Glue crawlers are deployed, you’ll see four of them listed in the Glue Console > Data Catalog > Crawlers:

AWS Glue Crawlers

  1. Select all four crawlers.
  2. Click Run.
  3. Once completed, check the "Table changes on the last run" column; each crawler should say "1 created".
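
You can also kick them off from the CLI:

for crawler in DataCrawlerBronze DataCrawlerSilver DataCrawlerForecast DataCrawlerRecommendations; do
  aws glue start-crawler --name "$crawler"
done

aws glue get-crawler --name DataCrawlerBronze --query "Crawler.State"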

AWS Glue Crawlers

Validate Table Creation

Navigate to Glue Console > Data Catalog > Databases > Tables. You should now see four new tables, each corresponding to a specific zone:

AWS Glue Data Catalog Tables

Each table has an automatically inferred schema, including columns like user_id, event_type, timestamp, price, product_name, and more.
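
You can inspect an inferred schema from the CLI as well, for example:

aws glue get-table \
  --database-name sales_data_db \
  --name silver_cleaned_data \
  --query "Table.StorageDescriptor.Columns"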

Query with Amazon Athena

Now let’s run SQL queries against these tables:

Open the Amazon Athena Console.

If it's your first time using Athena, you'll see a pop-up asking you to configure a query result location:

AWS Athena, Output bucket configuration

Choose your retail-ai-zone-assets bucket.

AWS Athena, Output bucket configuration

Click Save.

Sample Athena Query

In the query editor, try running a simple query against each table:

SELECT * FROM sales_data_db.<TABLE_NAME>;

Try this query on the bronze_retail_ai_bronze_zone table:

SELECT * FROM sales_data_db.bronze_retail_ai_bronze_zone;

AWS Athena query result

Try this query on the silver_cleaned_data table:

SELECT * FROM sales_data_db.silver_cleaned_data;

AWS Athena query result

Try this query on the gold_forecast_ready table:

SELECT * FROM sales_data_db.gold_forecast_ready;

AWS Athena query result

Try this query on the gold_recommendations_ready table:

SELECT * FROM sales_data_db.gold_recommendations_ready;

AWS Athena query result
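
Once the smoke tests pass, you can go beyond SELECT *. For example, a quick distribution of event types on the Silver table (event_type is one of the inferred columns):

SELECT event_type, COUNT(*) AS events
FROM sales_data_db.silver_cleaned_data
GROUP BY event_type
ORDER BY events DESC;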

What You’ve Just Built

In this phase, you've gone beyond basic ETL. You’ve engineered a production-grade data lake with:

  • Multi-zone architecture (Bronze, Silver, Gold)
  • Automated ETL pipelines using AWS Glue
  • Schema discovery and validation through Crawlers
  • Interactive querying via Amazon Athena

All of this was built infrastructure-as-code first with AWS CDK, with a clean separation of storage, processing, and access layers: exactly how real-world cloud data platforms are designed.

But this isn’t just about organizing data. You’re now sitting on a foundation that’s:

  • AI-ready
  • Model-friendly
  • Cost-efficient
  • And built for scale

What’s Next?

In Phase 3, we’ll unlock this data’s real potential, using Amazon Bedrock to power AI-based demand forecasting, running nightly on an EC2 instance and storing predictions back into our pipeline.

You’ve built the rails, now it’s time to run intelligence through them.

Complete Code for the Second Phase

To view the full code for the second phase, check out the repository on GitHub.

🚀 Follow me on LinkedIn for more AWS content!
