A Practical Guide to MLOps on AWS: Transforming Raw Data into AI-Ready Datasets with AWS Glue (Phase 02)

In Phase 01, we built the ingestion layer of our Retail AI Insights system. We streamed historical product interaction data into Amazon S3 (Bronze zone) and stored key product metadata with inventory information in DynamoDB.

Now that we have raw data arriving reliably, it's time to clean, enrich, and organize it for downstream AI workflows.

Objective

We'll transform raw event data from the Bronze zone into:

  • Cleaned, analysis-ready Parquet files in the Silver zone
  • Forecast-specific feature sets in the Gold zone under /forecast_ready/
  • Recommendation-ready CSV files in the Gold zone under /recommendations_ready/

Architecture diagram: transforming raw data into AI-ready datasets with AWS Glue

This will power:

  • Demand forecasting via Amazon Bedrock
  • Personalized product recommendations using Amazon Personalize

What We'll Build in This Phase

  • AWS Glue Jobs: Python scripts to clean, transform, and write data to the appropriate S3 zone
  • AWS Glue Crawlers: Catalog metadata from S3 into tables for Athena & further processing
  • AWS CDK Stack: Provisions all jobs, buckets, and crawlers
  • Athena Queries: Run sanity checks on the transformed data

Directory & Bucket Layout

We'll now be working with the following S3 zones:

  • retail-ai-bronze-zone/ → Raw JSON from Firehose
  • retail-ai-silver-zone/cleaned_data/ → Cleaned Parquet
  • retail-ai-gold-zone/forecast_ready/ → Aggregated features for forecasting
  • retail-ai-gold-zone/recommendations_ready/ → CSV with item metadata for Personalize

You'll also notice a fourth bucket, retail-ai-zone-assets/, which stores the ETL scripts and the training dataset.
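
These buckets come from the Phase 01 stack. For reference, here's a minimal sketch of how they might be declared in CDK (the construct IDs are assumptions; note the versioning on the assets bucket, which we'll rely on later when re-uploading scripts):

import { Bucket } from "aws-cdk-lib/aws-s3";

const bronzeBucket = new Bucket(this, "BronzeBucket", {
  bucketName: "retail-ai-bronze-zone",
});
const silverBucket = new Bucket(this, "SilverBucket", {
  bucketName: "retail-ai-silver-zone",
});
const goldBucket = new Bucket(this, "GoldBucket", {
  bucketName: "retail-ai-gold-zone",
});
const dataAssetsBucket = new Bucket(this, "DataAssetsBucket", {
  bucketName: "retail-ai-zone-assets",
  versioned: true, // keeps version history when scripts are re-uploaded
});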

Step 1 - Creating Glue Resources via CDK

Now that we've set up our storage zones and uploaded the required ETL scripts and datasets, it's time to define the Glue resources with AWS CDK.

We'll create:

  • Three Glue jobs:
    • DataCleaningETLJob → Cleans raw JSON into structured Parquet for the Silver zone.
    • ForecastGoldETLJob → Derives forecasting features from the cleaned data for the Gold zone.
    • RecommendationGoldETLJob → Prepares an item-metadata CSV for Amazon Personalize.
  • Four crawlers, one per zone prefix, to catalog the outputs
  • Athena queries to validate everything

From the project root, generate the construct file:

mkdir -p lib/constructs/analytics && touch lib/constructs/analytics/glue-resources.ts

Make sure your local scripts/ and dataset/ directories are present, then upload them to your S3 assets bucket:

aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/forecast_gold_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/user_interaction_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./dataset/events_with_metadata.csv s3://retail-ai-zone-assets/dataset/
aws s3 cp ./scripts/inventory_forecaster.py s3://retail-ai-zone-assets/scripts/
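
To double-check that everything landed:

aws s3 ls s3://retail-ai-zone-assets/ --recursive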

Define Glue Jobs & Crawlers in CDK

Now, open the lib/constructs/analytics/glue-resources.ts file and define the full CDK logic to create:

  • A Glue job role with required permissions
  • The three ETL jobs with their respective scripts
  • Four crawlers with S3 targets pointing to Bronze, Silver, Forecast, and Recommendation zones

Add the following code:

import { Construct } from "constructs";
import * as cdk from "aws-cdk-lib";

import { Bucket } from "aws-cdk-lib/aws-s3";
import { CfnCrawler, CfnJob, CfnDatabase } from "aws-cdk-lib/aws-glue";
import {
  Role,
  ServicePrincipal,
  ManagedPolicy,
  PolicyStatement,
} from "aws-cdk-lib/aws-iam";

interface GlueProps {
  bronzeBucket: Bucket;
  silverBucket: Bucket;
  goldBucket: Bucket;
  dataAssetsBucket: Bucket;
}

export class GlueResources extends Construct {
  constructor(scope: Construct, id: string, props: GlueProps) {
    super(scope, id);

    const { bronzeBucket, silverBucket, goldBucket, dataAssetsBucket } = props;

    // Glue Database
    const glueDatabase = new CfnDatabase(this, "SalesDatabase", {
      catalogId: cdk.Stack.of(this).account,
      databaseInput: {
        name: "sales_data_db",
      },
    });

    // Create IAM Role for Glue
    const glueRole = new Role(this, "GlueServiceRole", {
      assumedBy: new ServicePrincipal("glue.amazonaws.com"),
    });

    bronzeBucket.grantRead(glueRole);
    silverBucket.grantReadWrite(glueRole);
    goldBucket.grantReadWrite(glueRole);

    glueRole.addToPolicy(
      new PolicyStatement({
        actions: ["s3:GetObject"],
        resources: [`${dataAssetsBucket.bucketArn}/*`],
      })
    );

    glueRole.addManagedPolicy(
      ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")
    );

    // Glue Crawler (for Bronze Bucket)
    new CfnCrawler(this, "DataCrawlerBronze", {
      name: "DataCrawlerBronze",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: bronzeBucket.s3UrlForObject() }],
      },
      tablePrefix: "bronze_",
    });

    // Glue ETL Job
    new CfnJob(this, "DataCleaningETLJob", {
      name: "DataCleaningETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/sales_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--bronze_bucket": bronzeBucket.bucketName,
        "--silver_bucket": silverBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });

    // Glue Crawler (for Silver Bucket)
    new CfnCrawler(this, "DataCrawlerSilver", {
      name: "DataCrawlerSilver",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: silverBucket.s3UrlForObject("cleaned_data/") }],
      },
      tablePrefix: "silver_",
    });

    // Glue Crawler (for Gold Bucket, forecast_ready)
    new CfnCrawler(this, "DataCrawlerForecast", {
      name: "DataCrawlerForecast",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: goldBucket.s3UrlForObject("forecast_ready/") }],
      },
      tablePrefix: "gold_",
    });

    // Glue Crawler (for Gold Bucket, recommendations_ready)
    new CfnCrawler(this, "DataCrawlerRecommendations", {
      name: "DataCrawlerRecommendations",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: goldBucket.s3UrlForObject("recommendations_ready/") }],
      },
      tablePrefix: "gold_",
    });

    // Glue ETL Job to output forecast ready dataset
    new CfnJob(this, "ForecastGoldETLJob", {
      name: "ForecastGoldETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/forecast_gold_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--silver_bucket": silverBucket.bucketName,
        "--gold_bucket": goldBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });

    // Glue ETL Job to output recommendation ready dataset
    new CfnJob(this, "RecommendationGoldETLJob", {
      name: "RecommendationGoldETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/user_interaction_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--silver_bucket": silverBucket.bucketName,
        "--gold_bucket": goldBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });
  }
}

Now wire it up in the retail-ai-insights-stack.ts file.
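
First, import the construct at the top of the stack file (the relative path assumes the stack file lives in lib/):

import { GlueResources } from "./constructs/analytics/glue-resources";

Then instantiate it alongside the existing resources: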

/**
 * Glue ETL Resources
 **/
new GlueResources(this, "GlueResources", {
  bronzeBucket,
  silverBucket,
  goldBucket,
  dataAssetsBucket,
});

Once deployed via cdk deploy:

  1. Navigate to AWS Glue > ETL Jobs - you should see the three jobs:

AWS Glue Studio

  2. Go to AWS Glue > Data Catalog > Crawlers – ensure four crawlers exist:

AWS Glue Crawlers
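
You can also confirm both from the CLI:

aws glue list-jobs
aws glue list-crawlers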

Step 2 - Run Glue Jobs to Transform Raw Data

Now that our Glue jobs and crawlers are deployed, let’s walk through how we run the ETL flow across the Bronze, Silver, and Gold zones.

Locate Raw Data in Bronze Bucket

  1. Go to the Amazon S3 Console and open the retail-ai-bronze-zone bucket.
  2. Drill down through the date-partitioned directories until you reach the raw data file, and note the tree structure; in my case it's dataset/2025/05/26/20.
  3. Copy this full prefix path.
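
If you'd rather find the prefix from the terminal, list the bucket recursively:

aws s3 ls s3://retail-ai-bronze-zone/ --recursive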

Update the ETL Script Input Path

Open sales_etl_script.py in VS Code.
On line 36, update the input_path variable to reflect the prefix you just copied:

input_path = f"s3://{bronze_bucket}/dataset/2025/05/26/20/"
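
For context, here's a minimal sketch of what a cleaning script like sales_etl_script.py might look like; the real script is in the repo, and the specific columns and cleaning steps below are assumptions:

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Bucket names arrive via the defaultArguments we set in CDK
args = getResolvedOptions(sys.argv, ["JOB_NAME", "bronze_bucket", "silver_bucket"])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# The Firehose prefix you copied above
input_path = f"s3://{args['bronze_bucket']}/dataset/2025/05/26/20/"

raw = spark.read.json(input_path)

# Drop incomplete events, normalize types, and deduplicate (assumed columns)
cleaned = (
    raw.dropna(subset=["user_id", "event_type", "timestamp"])
    .withColumn("price", F.col("price").cast("double"))
    .withColumn("timestamp", F.to_timestamp("timestamp"))
    .dropDuplicates()
)

cleaned.write.mode("overwrite").parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

job.commit()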

Re-upload the modified script to your S3 data-assets bucket:

aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/

Because versioning is enabled on the bucket, this will replace the previous file while preserving version history.

Run the ETL Jobs

Now let’s kick off the transformation pipeline:

Run DataCleaningETLJob

  • Go to AWS Glue Console > ETL Jobs.
  • Select the DataCleaningETLJob and click Run Job.
  • This job will:
    • Read raw JSON data from the Bronze bucket.
    • Clean, cast, and convert it to Parquet.
    • Store the results in the retail-ai-silver-zone bucket under cleaned_data/

Running AWS Glue Job
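
Alternatively, you can start the job from the CLI and poll its status:

RUN_ID=$(aws glue start-job-run --job-name DataCleaningETLJob --query JobRunId --output text)
aws glue get-job-run --job-name DataCleaningETLJob --run-id "$RUN_ID" --query "JobRun.JobRunState"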

Once the run succeeds, navigate to the retail-ai-silver-zone bucket and confirm the cleaned_data/ output:

S3 Bucket for Silver Zone

Run ForecastGoldETLJob

  • Go to AWS Glue Console > ETL Jobs.
  • Select the ForecastGoldETLJob and click Run Job.
  • This job will:
    • Read the cleaned data from retail-ai-silver-zone/cleaned_data/
    • Aggregate daily sales (see the sketch after this list)
    • Output the transformed data to retail-ai-gold-zone/forecast_ready/
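
Conceptually, the daily aggregation boils down to a date-bucketed groupBy. Here's a minimal sketch of the core transform, reusing the Glue boilerplate from the cleaning sketch above (with gold_bucket added to getResolvedOptions); the purchase filter and column names are assumptions:

cleaned = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

daily_sales = (
    cleaned.filter(F.col("event_type") == "purchase")  # assumption: purchases drive demand
    .withColumn("event_date", F.to_date("timestamp"))
    .groupBy("event_date", "product_name")
    .agg(
        F.count("*").alias("units_sold"),
        F.round(F.sum("price"), 2).alias("revenue"),
    )
)

daily_sales.write.mode("overwrite").parquet(f"s3://{args['gold_bucket']}/forecast_ready/")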

Running AWS Glue Job

Once completed, visit the Gold bucket and confirm the forecast files are present in that directory.

S3 Bucket for Gold Zone

Run RecommendationGoldETLJob

  • Go to AWS Glue Console > ETL Jobs.
  • Select the RecommendationGoldETLJob and click Run Job.
  • This job will:
    • Read cleaned product data from the Silver zone
    • Output only the required item metadata in CSV format (sketched below)
    • Save to retail-ai-gold-zone/recommendations_ready/
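
Amazon Personalize requires an ITEM_ID column in the items dataset, and the remaining columns must match the schema you define when importing. A sketch of the core transform, again reusing the Glue boilerplate above (source column names are assumptions):

cleaned = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

items = cleaned.select(
    F.col("product_id").alias("ITEM_ID"),  # assumption: Silver data carries a product_id
    F.col("product_name").alias("PRODUCT_NAME"),
    F.col("price").alias("PRICE"),
).dropDuplicates(["ITEM_ID"])

# Write a single CSV with a header so it can be imported as-is
items.coalesce(1).write.mode("overwrite").option("header", "true").csv(
    f"s3://{args['gold_bucket']}/recommendations_ready/"
)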

Running AWS Glue Job

After the job runs successfully, go to the Gold bucket and verify the structure and CSV file.

S3 Bucket for Gold Zone

Run All Glue Crawlers

Once the Glue crawlers are deployed, you’ll see four of them listed in the Glue Console > Data Catalog > Crawlers:

AWS Glue Crawlers

  1. Select all four crawlers.
  2. Click Run.
  3. Once completed, check the "Table changes on the last run" column; each crawler should say "1 created".
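
You can also kick them off from the CLI:

for crawler in DataCrawlerBronze DataCrawlerSilver DataCrawlerForecast DataCrawlerRecommendations; do
  aws glue start-crawler --name "$crawler"
done

aws glue get-crawler --name DataCrawlerBronze --query "Crawler.State"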

AWS Glue Crawlers

Validate Table Creation

Navigate to Glue Console > Data Catalog > Databases > Tables. You should now see four new tables, each corresponding to a specific zone:

AWS Glue Data Catalog Tables

Each table has an automatically inferred schema, including columns like user_id, event_type, timestamp, price, product_name, and more.
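
You can inspect an inferred schema from the CLI as well, for example:

aws glue get-table \
  --database-name sales_data_db \
  --name silver_cleaned_data \
  --query "Table.StorageDescriptor.Columns"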

Query with Amazon Athena

Now let’s run SQL queries against these tables:

Open the Amazon Athena Console.

If it's your first time using Athena, you'll see a pop-up asking you to configure a query result location:

AWS Athena, Output bucket configuration

Choose your retail-ai-zone-assets bucket.

AWS Athena, Output bucket configuration

Click Save.

Sample Athena Query

In the query editor, try running a simple query against each table:

SELECT * FROM sales_data_db.<TABLE_NAME>;

Try this query on the bronze_retail_ai_bronze_zone table:

SELECT * FROM sales_data_db.bronze_retail_ai_bronze_zone;

AWS Athena query result

Try this query on the silver_cleaned_data table:

SELECT * FROM sales_data_db.silver_cleaned_data;

AWS Athena query result

Try this query on the gold_forecast_ready table:

SELECT * FROM sales_data_db.gold_forecast_ready;

AWS Athena query result

Try this query on the gold_recommendations_ready table:

SELECT * FROM sales_data_db.gold_recommendations_ready;

AWS Athena query result
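
Once the smoke tests pass, you can go beyond SELECT *. For example, a quick distribution of event types on the Silver table (event_type is one of the inferred columns):

SELECT event_type, COUNT(*) AS events
FROM sales_data_db.silver_cleaned_data
GROUP BY event_type
ORDER BY events DESC;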

What You’ve Just Built

In this phase, you've gone beyond basic ETL. You’ve engineered a production-grade data lake with:

  • Multi-zone architecture (Bronze, Silver, Gold)
  • Automated ETL pipelines using AWS Glue
  • Schema discovery and validation through Crawlers
  • Interactive querying via Amazon Athena

All of this was built infrastructure-as-code first with AWS CDK, with a clean separation of storage, processing, and access layers: exactly how real-world cloud data platforms are designed.

But this isn’t just about organizing data. You’re now sitting on a foundation that’s:

  • AI-ready
  • Model-friendly
  • Cost-efficient
  • And built for scale

What’s Next?

In Phase 3, we’ll unlock this data’s real potential, using Amazon Bedrock to power AI-based demand forecasting, running nightly on an EC2 instance and storing predictions back into our pipeline.

You’ve built the rails, now it’s time to run intelligence through them.

Complete Code for the Second Phase

To view the full code for the second phase, check out the repository on GitHub.

🚀 Follow me on LinkedIn for more AWS content!
