In Phase 01, we built the ingestion layer of our Retail AI Insights system. We streamed historical product interaction data into Amazon S3 (Bronze zone) and stored key product metadata with inventory information in DynamoDB.
Now that we have raw data arriving reliably, it's time to clean, enrich, and organize it for downstream AI workflows.
Objective
- Transform raw event data from the Bronze zone into:
  - Cleaned, analysis-ready Parquet files in the Silver zone
  - Forecast-specific feature sets in the Gold zone under /forecast_ready/
  - Recommendation-ready CSV files under /recommendations_ready/
This will power:
- Demand forecasting via Amazon Bedrock
- Personalized product recommendations using Amazon Personalize
What We'll Build in This Phase
- AWS Glue Jobs: Python scripts to clean, transform, and write data to the appropriate S3 zone
- AWS Glue Crawlers: Catalog metadata from S3 into tables for Athena & further processing
- AWS CDK Stack: Provisions all jobs, buckets, and crawlers
- Athena Queries: Run sanity checks on the transformed data
Directory & Bucket Layout
We'll now be working with the following S3 zones:
- retail-ai-bronze-zone/ → Raw JSON from Firehose
- retail-ai-silver-zone/cleaned_data/ → Cleaned Parquet
- retail-ai-gold-zone/forecast_ready/ → Aggregated features for forecasting
- retail-ai-gold-zone/recommendations_ready/ → CSV with item metadata for Personalize
You'll also notice a fourth bucket, retail-ai-zone-assets/, which stores the ETL scripts and the training dataset.
Step 1 - Creating Glue Resources via CDK
Now that we've set up our storage zones and uploaded the required ETL scripts and datasets, it's time to define the Glue resources with AWS CDK.
We'll create:
- Three Glue jobs:
  - DataCleaningETLJob → Cleans raw JSON into structured Parquet for the Silver zone.
  - ForecastGoldETLJob → Transforms the cleaned data into features for demand prediction.
  - RecommendationGoldETLJob → Prepares an item metadata CSV for Amazon Personalize.
- Four crawlers
- Athena queries to validate everything
From the project root, generate the construct file:
mkdir -p lib/constructs/analytics && touch lib/constructs/analytics/glue-resources.ts
Make sure your local scripts/ and dataset/ directories are present, then upload them to your S3 assets bucket:
aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/forecast_gold_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./scripts/user_interaction_etl_script.py s3://retail-ai-zone-assets/scripts/
aws s3 cp ./dataset/events_with_metadata.csv s3://retail-ai-zone-assets/dataset/
aws s3 cp ./scripts/inventory_forecaster.py s3://retail-ai-zone-assets/scripts/
Define Glue Jobs & Crawlers in CDK
Open the lib/constructs/analytics/glue-resources.ts file and define the full CDK logic to create:
- A Glue job role with the required permissions
- The three ETL jobs with their respective scripts
- Four crawlers with S3 targets pointing to the Bronze, Silver, Forecast, and Recommendation zones
Add the following code:
import { Construct } from "constructs";
import * as cdk from "aws-cdk-lib";
import { Bucket } from "aws-cdk-lib/aws-s3";
import { CfnCrawler, CfnJob, CfnDatabase } from "aws-cdk-lib/aws-glue";
import {
  Role,
  ServicePrincipal,
  ManagedPolicy,
  PolicyStatement,
} from "aws-cdk-lib/aws-iam";

interface GlueProps {
  bronzeBucket: Bucket;
  silverBucket: Bucket;
  goldBucket: Bucket;
  dataAssetsBucket: Bucket;
}

export class GlueResources extends Construct {
  constructor(scope: Construct, id: string, props: GlueProps) {
    super(scope, id);

    const { bronzeBucket, silverBucket, goldBucket, dataAssetsBucket } = props;

    // Glue Database
    const glueDatabase = new CfnDatabase(this, "SalesDatabase", {
      catalogId: cdk.Stack.of(this).account,
      databaseInput: {
        name: "sales_data_db",
      },
    });

    // Create IAM Role for Glue
    const glueRole = new Role(this, "GlueServiceRole", {
      assumedBy: new ServicePrincipal("glue.amazonaws.com"),
    });
    bronzeBucket.grantRead(glueRole);
    silverBucket.grantReadWrite(glueRole);
    goldBucket.grantReadWrite(glueRole);
    glueRole.addToPolicy(
      new PolicyStatement({
        actions: ["s3:GetObject"],
        resources: [`${dataAssetsBucket.bucketArn}/*`],
      })
    );
    glueRole.addManagedPolicy(
      ManagedPolicy.fromAwsManagedPolicyName("service-role/AWSGlueServiceRole")
    );

    // Glue Crawler (for Bronze Bucket)
    new CfnCrawler(this, "DataCrawlerBronze", {
      name: "DataCrawlerBronze",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: bronzeBucket.s3UrlForObject() }],
      },
      tablePrefix: "bronze_",
    });

    // Glue ETL Job
    new CfnJob(this, "DataCleaningETLJob", {
      name: "DataCleaningETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/sales_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--bronze_bucket": bronzeBucket.bucketName,
        "--silver_bucket": silverBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });

    // Glue Crawler (for Silver Bucket)
    new CfnCrawler(this, "DataCrawlerSilver", {
      name: "DataCrawlerSilver",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [
          {
            path: `${silverBucket.s3UrlForObject()}/cleaned_data/`,
          },
        ],
      },
      tablePrefix: "silver_",
    });

    // Glue Crawler (for Gold Bucket)
    new CfnCrawler(this, "DataCrawlerForecast", {
      name: "DataCrawlerForecast",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [{ path: `${goldBucket.s3UrlForObject()}/forecast_ready/` }],
      },
      tablePrefix: "gold_",
    });

    // Glue Crawler (for Gold Bucket)
    new CfnCrawler(this, "DataCrawlerRecommendations", {
      name: "DataCrawlerRecommendations",
      role: glueRole.roleArn,
      databaseName: glueDatabase.ref,
      targets: {
        s3Targets: [
          { path: `${goldBucket.s3UrlForObject()}/recommendations_ready/` },
        ],
      },
      tablePrefix: "gold_",
    });

    // Glue ETL Job to output forecast ready dataset
    new CfnJob(this, "ForecastGoldETLJob", {
      name: "ForecastGoldETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/forecast_gold_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--silver_bucket": silverBucket.bucketName,
        "--gold_bucket": goldBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });

    // Glue ETL Job to output recommendation ready dataset
    new CfnJob(this, "RecommendationGoldETLJob", {
      name: "RecommendationGoldETLJob",
      role: glueRole.roleArn,
      command: {
        name: "glueetl",
        pythonVersion: "3",
        scriptLocation: dataAssetsBucket.s3UrlForObject(
          "scripts/user_interaction_etl_script.py"
        ),
      },
      defaultArguments: {
        "--TempDir": silverBucket.s3UrlForObject("temp/"),
        "--job-language": "python",
        "--silver_bucket": silverBucket.bucketName,
        "--gold_bucket": goldBucket.bucketName,
      },
      glueVersion: "3.0",
      maxRetries: 0,
      timeout: 10,
      workerType: "Standard",
      numberOfWorkers: 2,
    });
  }
}
Wire it up in the retail-ai-insights-stack.ts file (remember to import the construct at the top of the stack file; the relative path below assumes the default lib/ layout):

import { GlueResources } from "./constructs/analytics/glue-resources";

/**
 * Glue ETL Resources
 **/
new GlueResources(this, "GlueResources", {
  bronzeBucket,
  silverBucket,
  goldBucket,
  dataAssetsBucket,
});
Once deployed via cdk deploy:
- Navigate to AWS Glue > ETL Jobs - you should see the three ETL jobs listed.
- Go to AWS Glue > Data Catalog > Crawlers - ensure the four crawlers exist.
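If you'd rather verify from a script than click through the console, here's a minimal boto3 sketch (assumes boto3 is installed and your AWS credentials and region are configured):

import boto3

# Sanity check: list the Glue jobs and crawlers the stack should have created.
glue = boto3.client("glue")

expected_jobs = {"DataCleaningETLJob", "ForecastGoldETLJob", "RecommendationGoldETLJob"}
expected_crawlers = {
    "DataCrawlerBronze",
    "DataCrawlerSilver",
    "DataCrawlerForecast",
    "DataCrawlerRecommendations",
}

deployed_jobs = set(glue.list_jobs()["JobNames"])
deployed_crawlers = set(glue.list_crawlers()["CrawlerNames"])

print("Missing jobs:", expected_jobs - deployed_jobs or "none")
print("Missing crawlers:", expected_crawlers - deployed_crawlers or "none")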
Step 2 - Run Glue Jobs to Transform Raw Data
Now that our Glue jobs and crawlers are deployed, let’s walk through how we run the ETL flow across the Bronze, Silver, and Gold zones.
Locate Raw Data in the Bronze Bucket
- Go to the Amazon S3 Console and open the retail-ai-bronze-zone bucket.
- Drill down through the directories until you reach the raw data files and note the tree structure - in my case it's dataset/2025/05/26/20.
- Copy this full prefix path.
Update the ETL Script Input Path
Open sales_etl_script.py in VS Code.
On line 36, update the input_path variable to reflect the directory path you just copied:
input_path = f"s3://{bronze_bucket}/dataset/2025/05/26/20/"
Re-upload the modified script to your S3 data-assets bucket:
aws s3 cp ./scripts/sales_etl_script.py s3://retail-ai-zone-assets/scripts/
Because versioning is enabled on the bucket, this will replace the previous file while preserving version history.
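If you want to confirm that, here's a quick boto3 sketch that lists the stored versions (requires permission to list object versions on the bucket):

import boto3

s3 = boto3.client("s3")

# List the versions kept for the script we just re-uploaded.
response = s3.list_object_versions(
    Bucket="retail-ai-zone-assets",
    Prefix="scripts/sales_etl_script.py",
)

for version in response.get("Versions", []):
    label = "latest" if version["IsLatest"] else "previous"
    print(version["VersionId"], version["LastModified"], label)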
Run the ETL Jobs
Now let’s kick off the transformation pipeline:
Run DataCleaningETLJob
- Go to AWS Glue Console > ETL Jobs.
- Select the DataCleaningETLJob and click Run Job.
- This job will:
  - Read raw JSON data from the Bronze bucket.
  - Clean, cast, and convert it to Parquet.
  - Store the results in the retail-ai-silver-zone bucket under cleaned_data/ (a minimal sketch of this transformation follows below).
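To make the cleaning step concrete, here is a minimal sketch of what sales_etl_script.py might look like. It is a hypothetical outline, not the actual script - the cleaning rules and the product_id column are assumptions, so treat the script in your assets bucket as the source of truth:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

# Resolve the bucket names passed in by the CDK job definition (--bronze_bucket / --silver_bucket).
args = getResolvedOptions(sys.argv, ["bronze_bucket", "silver_bucket"])
bronze_bucket = args["bronze_bucket"]
silver_bucket = args["silver_bucket"]

spark = GlueContext(SparkContext.getOrCreate()).spark_session

# This is the input_path you updated earlier to match your Firehose prefix.
input_path = f"s3://{bronze_bucket}/dataset/2025/05/26/20/"
output_path = f"s3://{silver_bucket}/cleaned_data/"

raw_df = spark.read.json(input_path)

# Hypothetical cleaning rules: cast types, parse the timestamp, drop incomplete events, de-duplicate.
cleaned_df = (
    raw_df
    .withColumn("price", F.col("price").cast("double"))
    .withColumn("event_time", F.to_timestamp("timestamp"))
    .dropna(subset=["user_id", "product_id", "event_type"])
    .dropDuplicates()
)

cleaned_df.write.mode("overwrite").parquet(output_path)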
Once the job succeeds, navigate to the retail-ai-silver-zone bucket and confirm the Parquet files are present under cleaned_data/.
Run ForecastGoldETLJob
- Go to AWS Glue Console > ETL Jobs.
- Select the ForecastGoldETLJob and click Run Job.
- This job will:
  - Read the cleaned data from retail-ai-silver-zone/cleaned_data/
  - Aggregate daily sales
  - Output the transformed data to retail-ai-gold-zone/forecast_ready/ (see the sketch after this list)
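For intuition, the daily-sales aggregation in forecast_gold_etl_script.py could look roughly like the sketch below. Column names such as product_id, and the "purchase" event value, are assumptions; the real script may derive different features:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["silver_bucket", "gold_bucket"])
spark = GlueContext(SparkContext.getOrCreate()).spark_session

cleaned_df = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

# Hypothetical aggregation: one row per product per day with units sold and revenue.
daily_sales_df = (
    cleaned_df
    .filter(F.col("event_type") == "purchase")        # assumed event type value
    .withColumn("sale_date", F.to_date("event_time"))
    .groupBy("product_id", "sale_date")
    .agg(
        F.count("*").alias("units_sold"),
        F.sum("price").alias("revenue"),
    )
)

daily_sales_df.write.mode("overwrite").parquet(f"s3://{args['gold_bucket']}/forecast_ready/")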
Once completed, visit the Gold bucket and confirm the forecast files are present in that directory.
Run RecommendationGoldETLJob
- Go to AWS Glue Console > ETL Jobs.
- Select the RecommendationGoldETLJob and click Run Job.
- This job will:
  - Read cleaned product data from the Silver zone
  - Output only the required item metadata in CSV format
  - Save to retail-ai-gold-zone/recommendations_ready/ (see the sketch after this list)
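This job is mostly a projection. Here's a sketch of what user_interaction_etl_script.py's output step might look like - the CATEGORY and product_id columns are assumptions, while ITEM_ID is the column Amazon Personalize items datasets require:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["silver_bucket", "gold_bucket"])
spark = GlueContext(SparkContext.getOrCreate()).spark_session

cleaned_df = spark.read.parquet(f"s3://{args['silver_bucket']}/cleaned_data/")

# Hypothetical projection: keep one row per product with the metadata Personalize needs.
items_df = (
    cleaned_df
    .select(
        F.col("product_id").alias("ITEM_ID"),   # Personalize items datasets require an ITEM_ID column
        F.col("category").alias("CATEGORY"),    # assumed metadata column
        F.col("price").alias("PRICE"),
    )
    .dropDuplicates(["ITEM_ID"])
)

# coalesce(1) writes a single CSV part file under recommendations_ready/.
items_df.coalesce(1).write.mode("overwrite").option("header", True).csv(
    f"s3://{args['gold_bucket']}/recommendations_ready/"
)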
After the job runs successfully, go to the Gold bucket and verify the structure and CSV file.
Run All Glue Crawlers
Once the Glue crawlers are deployed, you’ll see four of them listed in the Glue Console > Data Catalog > Crawlers:
- Select all four crawlers.
- Click Run.
- Once completed, look at the "Table changes on the last run" column - each should say "1 created". (You can also run the crawlers from a script - see the sketch below.)
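If you'd rather trigger and monitor them programmatically, here's a small boto3 sketch (assumes your credentials and region are configured):

import time
import boto3

glue = boto3.client("glue")
crawlers = [
    "DataCrawlerBronze",
    "DataCrawlerSilver",
    "DataCrawlerForecast",
    "DataCrawlerRecommendations",
]

# Kick off all four crawlers.
for name in crawlers:
    glue.start_crawler(Name=name)

time.sleep(10)  # give them a moment to leave the READY state

# Poll until every crawler has finished and gone back to READY.
while True:
    states = {name: glue.get_crawler(Name=name)["Crawler"]["State"] for name in crawlers}
    print(states)
    if all(state == "READY" for state in states.values()):
        break
    time.sleep(30)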
Validate Table Creation
Navigate to Glue Console > Data Catalog > Databases > Tables. You should now see four new tables, each corresponding to a specific zone: bronze_retail_ai_bronze_zone, silver_cleaned_data, gold_forecast_ready, and gold_recommendations_ready.
Each table has an automatically inferred schema, including columns like user_id, event_type, timestamp, price, product_name, and more.
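If you prefer to inspect an inferred schema without clicking through the console, here's a short sketch using the Glue API:

import boto3

glue = boto3.client("glue")

# Print the columns the Silver crawler inferred for the cleaned data table.
table = glue.get_table(DatabaseName="sales_data_db", Name="silver_cleaned_data")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    print(f"{column['Name']}: {column['Type']}")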
Query with Amazon Athena
Now let’s run SQL queries against these tables:
Open the Amazon Athena Console.
If it's your first time, you'll be prompted to set a query result location in S3 before you can run queries.
Choose your retail-ai-zone-assets bucket.
Click Save.
Sample Athena Query
In the query editor, try running simple SQL queries:
SELECT * FROM sales_data_db.<TABLE_NAME>
Try this query on the bronze_retail_ai_bronze_zone table:
SELECT * FROM sales_data_db.bronze_retail_ai_bronze_zone
Try this query on the silver_cleaned_data table:
SELECT * FROM sales_data_db.silver_cleaned_data
Try this query on the gold_forecast_ready table:
SELECT * FROM sales_data_db.gold_forecast_ready
Try this query on the gold_recommendations_ready table:
SELECT * FROM sales_data_db.gold_recommendations_ready
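These checks can also be scripted. Here's a minimal boto3 sketch - the athena-results/ output prefix is an assumption, so point it at whatever query result location you configured above:

import time
import boto3

athena = boto3.client("athena")

# Assumed output location inside the assets bucket; Athena needs somewhere to write results.
OUTPUT_LOCATION = "s3://retail-ai-zone-assets/athena-results/"

def run_query(sql: str):
    execution = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": "sales_data_db"},
        ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
    )
    query_id = execution["QueryExecutionId"]
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state, query_id
        time.sleep(2)

tables = [
    "bronze_retail_ai_bronze_zone",
    "silver_cleaned_data",
    "gold_forecast_ready",
    "gold_recommendations_ready",
]

for table in tables:
    state, query_id = run_query(f"SELECT * FROM sales_data_db.{table} LIMIT 10")
    if state == "SUCCEEDED":
        rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
        print(table, f"{len(rows) - 1} rows returned")  # the first row is the header
    else:
        print(table, f"query ended in state {state}")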
What You’ve Just Built
In this phase, you've gone beyond basic ETL. You’ve engineered a production-grade data lake with:
- Multi-zone architecture (Bronze, Silver, Gold)
- Automated ETL pipelines using AWS Glue
- Schema discovery and validation through Crawlers
- Interactive querying via Amazon Athena
All of this was built infrastructure-as-code first using AWS CDK, with a clean separation of storage, processing, and access layers - exactly how real-world cloud data platforms are designed.
But this isn’t just about organizing data. You’re now sitting on a foundation that’s:
- AI-ready
- Model-friendly
- Cost-efficient
- And built for scale
What’s Next?
In Phase 3, we’ll unlock this data’s real potential, using Amazon Bedrock to power AI-based demand forecasting, running nightly on an EC2 instance and storing predictions back into our pipeline.
You’ve built the rails, now it’s time to run intelligence through them.
Complete Code for the Second Phase
To view the full code for the second phase, check out the repository on GitHub.
🚀 Follow me on LinkedIn for more AWS content!