Data Lakehouse on ROSA with Apache Spark, Iceberg, and AWS Glue
Table of Contents
- Overview
- Architecture
- Prerequisites
- Phase 1: ROSA Cluster Setup
- Phase 2: AWS Glue Data Catalog Configuration
- Phase 3: S3 Data Lake Setup
- Phase 4: Apache Spark on OpenShift
- Phase 5: Apache Iceberg Integration
- Phase 6: Spark-Glue Catalog Integration
- Phase 7: Sample Data Pipelines
- Testing and Validation
- Resource Cleanup
- Troubleshooting
Overview
Project Purpose
This platform implements a modern data lakehouse architecture that achieves true separation of compute and storage. By running Apache Spark on OpenShift while leveraging AWS Glue Data Catalog for metadata management and S3 for storage (in Apache Iceberg format), organizations can scale compute independently, shut down clusters without data loss, and achieve significant cost optimization.
Key Value Propositions
- Stateless Compute: Completely decouple compute from storage and metadata
- Cloud-Native Flexibility: Destroy and recreate compute clusters without losing data
- Cost Optimization: Pay for compute only when running jobs
- Unified Metadata: AWS Glue Catalog provides central metadata repository
- ACID Transactions: Apache Iceberg enables reliable data lake operations
- Performance at Scale: Run high-performance Spark jobs on Kubernetes
Solution Components
| Component | Purpose | Layer |
|---|---|---|
| ROSA | Managed OpenShift cluster for Spark compute | Compute |
| Apache Spark | Distributed data processing engine | Processing |
| Spark Operator | Kubernetes-native Spark job management | Orchestration |
| AWS Glue Data Catalog | Centralized metadata repository | Metadata |
| Amazon S3 | Object storage for data lake | Storage |
| Apache Iceberg | Table format with ACID guarantees | Data Format |
| AWS IAM | Authentication and authorization | Security |
Architecture
High-Level Architecture Diagram
Workflow
- Data Ingestion: Raw data lands in S3 bronze layer
- Spark Job Submission: Developer submits SparkApplication CR
- Job Orchestration: Spark Operator creates driver pod
- Resource Provisioning: Driver spawns executor pods dynamically
- Metadata Discovery: Spark connects to Glue Catalog for table metadata
- Data Processing: Executors read/write Iceberg tables from/to S3
- Metadata Update: Glue Catalog automatically updated with new partitions/schemas
- Job Completion: Executor pods terminate, freeing resources
- Cluster Shutdown: ROSA cluster can be deleted without data loss
- State Recovery: New cluster can access all data via Glue Catalog
Stateless Compute Demonstration
Traditional Approach:
- Local Hive Metastore tied to cluster
- Cluster deletion = metadata loss
- Requires persistent volumes and backups
Lakehouse Approach:
- Metadata in AWS Glue (managed, durable)
- Data in S3 (durable, virtually unlimited capacity)
- Compute fully ephemeral
- Result: complete cluster rebuild in ~40 minutes with zero data loss
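The decoupling above can be illustrated with a toy model. The class and variable names below are purely illustrative (this is not an AWS or Spark API); the point is that because table metadata and data live outside the cluster, a replacement cluster resolves every table exactly as its predecessor did:

```python
# Toy model of stateless compute: metadata and data live outside the
# cluster, so deleting and recreating the cluster loses nothing.
# All names are illustrative; this is not an AWS API.

catalog = {}   # stands in for AWS Glue Data Catalog (outlives any cluster)
storage = {}   # stands in for S3 (outlives any cluster)

class SparkCluster:
    """Ephemeral compute: holds no state of its own."""
    def write_table(self, name, rows, location):
        storage[location] = rows                  # data -> "S3"
        catalog[name] = {"location": location}    # metadata -> "Glue"

    def read_table(self, name):
        return storage[catalog[name]["location"]]

# First cluster writes a table, then is destroyed.
c1 = SparkCluster()
c1.write_table("gold.sales", [("Laptop", 1200.0)], "s3://bucket/warehouse/sales")
del c1

# A brand-new cluster resolves the same table via the external catalog.
c2 = SparkCluster()
print(c2.read_table("gold.sales"))  # [('Laptop', 1200.0)]
```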
Prerequisites
Required Accounts and Subscriptions
- [ ] AWS Account with administrative access
- [ ] Red Hat Account with OpenShift subscription
- [ ] ROSA Enabled in your AWS account
- [ ] AWS Glue Access in your target region
Required Tools
Install the following CLI tools on your workstation:
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
Example Output:
$ rosa version
[2026-01-13 09:15:22] 1.2.38
Your ROSA CLI is up to date.
$ oc version
[2026-01-13 09:15:35] Client Version: 4.18.0
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
$ helm version
[2026-01-13 09:15:48] version.BuildInfo{Version:"v3.14.1", GitCommit:"2d17c84a8d8", GitTreeState:"clean", GoVersion:"go1.21.7"}
AWS Prerequisites
Service Quotas
# Check EC2 quotas for ROSA
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
# Check S3 bucket quota
aws service-quotas get-service-quota \
--service-code s3 \
--quota-code L-DC2B2D3D \
--region us-east-1
Example Output:
$ aws service-quotas get-service-quota --service-code ec2 --quota-code L-1216C47A --region us-east-1
[2026-01-13 09:20:14] {
"Quota": {
"ServiceCode": "ec2",
"ServiceName": "Amazon Elastic Compute Cloud (Amazon EC2)",
"QuotaArn": "arn:aws:servicequotas:us-east-1:123456789012:ec2/L-1216C47A",
"QuotaCode": "L-1216C47A",
"QuotaName": "Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances",
"Value": 1280.0,
"Unit": "None",
"Adjustable": true,
"GlobalQuota": false
}
}
IAM Permissions
Your AWS IAM user/role needs permissions for:
- EC2 (VPC, subnets, security groups)
- IAM (roles, policies)
- S3 (buckets, objects)
- Glue (databases, tables, catalog)
- CloudWatch (logs, metrics)
Knowledge Prerequisites
You should be familiar with:
- Apache Spark fundamentals (DataFrames, transformations, actions)
- Data engineering concepts (ETL, data lakes, partitioning)
- AWS fundamentals (S3, IAM)
- Kubernetes basics (pods, deployments, services)
- SQL and data modeling
Phase 1: ROSA Cluster Setup
Step 1.1: Configure AWS CLI
# Configure AWS credentials
aws configure
# Verify configuration
aws sts get-caller-identity
Example Output:
$ aws configure
[2026-01-13 09:30:00] AWS Access Key ID [****************AKID]:
AWS Secret Access Key [****************KEY]:
Default region name [us-east-1]:
Default output format [json]:
$ aws sts get-caller-identity
[2026-01-13 09:30:45] {
"UserId": "AIDACKCEVSQ6C2EXAMPLE",
"Account": "123456789012",
"Arn": "arn:aws:iam::123456789012:user/data-engineer"
}
Step 1.2: Initialize ROSA
# Log in to Red Hat
rosa login
# Verify ROSA prerequisites
rosa verify quota
rosa verify permissions
# Initialize ROSA in your AWS account
rosa init
Example Output:
$ rosa login
[2026-01-13 09:35:12] To login to your Red Hat account, get an offline access token at https://console.redhat.com/openshift/token/rosa
? Copy the token and paste it here: ****************************************
[2026-01-13 09:35:45] Logged in as 'data-engineer' on 'https://api.openshift.com'
$ rosa verify quota
[2026-01-13 09:36:20] I: Validating AWS quota...
I: AWS quota ok. If cluster installation fails, validate actual AWS resource usage against https://docs.openshift.com/rosa/rosa_getting_started/rosa-required-aws-service-quotas.html
$ rosa verify permissions
[2026-01-13 09:36:45] I: Validating SCP policies...
I: AWS SCP policies ok
$ rosa init
[2026-01-13 09:37:15] I: Logged in as 'data-engineer' on 'https://api.openshift.com'
I: Validating AWS credentials...
I: AWS credentials are valid!
I: Validating SCP policies...
I: AWS SCP policies ok
I: Validating AWS quota...
I: AWS quota ok. If cluster installation fails, validate actual AWS resource usage against https://docs.openshift.com/rosa/rosa_getting_started/rosa-required-aws-service-quotas.html
I: Ensuring cluster administrator user 'osdCcsAdmin'...
I: Admin user 'osdCcsAdmin' created successfully!
I: Validating SCP policies for 'osdCcsAdmin'...
I: AWS SCP policies ok
I: Verifying whether OpenShift command-line tool is available...
I: Current OpenShift Client Version: 4.18.0
Step 1.3: Create ROSA Cluster
Create a ROSA cluster optimized for Spark workloads:
# Set environment variables
export CLUSTER_NAME="data-lakehouse"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.4xlarge"
export COMPUTE_NODES=3
# Create ROSA cluster (takes ~40 minutes)
rosa create cluster \
--cluster-name $CLUSTER_NAME \
--region $AWS_REGION \
--multi-az \
--compute-machine-type $MACHINE_TYPE \
--compute-nodes $COMPUTE_NODES \
--machine-cidr 10.0.0.0/16 \
--service-cidr 172.30.0.0/16 \
--pod-cidr 10.128.0.0/14 \
--host-prefix 23 \
--yes
Example Output:
$ rosa create cluster --cluster-name data-lakehouse --region us-east-1 --multi-az --compute-machine-type m5.4xlarge --compute-nodes 3 --machine-cidr 10.0.0.0/16 --service-cidr 172.30.0.0/16 --pod-cidr 10.128.0.0/14 --host-prefix 23 --yes
[2026-01-13 09:45:00] I: Creating cluster 'data-lakehouse'
I: To view a list of clusters and their status, run 'rosa list clusters'
I: Cluster 'data-lakehouse' has been created.
I: Once the cluster is installed you will need to add an Identity Provider before you can login into the cluster. See 'rosa create idp --help' for more information.
Name: data-lakehouse
ID: 24g9q8jdhgoofs8cmp8ilr67njd5p0j8
External ID:
OpenShift Version: 4.18.0
Channel Group: stable
DNS: data-lakehouse.vxkf.p1.openshiftapps.com
AWS Account: 123456789012
API URL:
Console URL:
Region: us-east-1
Multi-AZ: true
Nodes:
- Control plane: 3
- Infra: 3
- Compute: 3 (m5.4xlarge)
Network:
- Type: OVNKubernetes
- Service CIDR: 172.30.0.0/16
- Machine CIDR: 10.0.0.0/16
- Pod CIDR: 10.128.0.0/14
- Host Prefix: /23
STS Role ARN: arn:aws:iam::123456789012:role/ManagedOpenShift-Installer-Role
Support Role ARN: arn:aws:iam::123456789012:role/ManagedOpenShift-Support-Role
Instance IAM Roles:
- Control plane: arn:aws:iam::123456789012:role/ManagedOpenShift-ControlPlane-Role
- Worker: arn:aws:iam::123456789012:role/ManagedOpenShift-Worker-Role
Operator IAM Roles:
- arn:aws:iam::123456789012:role/data-lakehouse-w7w6-openshift-cloud-network-config-controller-cloud-cre
- arn:aws:iam::123456789012:role/data-lakehouse-w7w6-openshift-machine-api-aws-cloud-credentials
- arn:aws:iam::123456789012:role/data-lakehouse-w7w6-openshift-cloud-credential-operator-cloud-credent
- arn:aws:iam::123456789012:role/data-lakehouse-w7w6-openshift-image-registry-installer-cloud-credenti
- arn:aws:iam::123456789012:role/data-lakehouse-w7w6-openshift-ingress-operator-cloud-credentials
- arn:aws:iam::123456789012:role/data-lakehouse-w7w6-openshift-cluster-csi-drivers-ebs-cloud-credenti
State: pending (Preparing account)
Private: No
Created: Jan 13 2026 09:45:00 UTC
Details Page: https://console.redhat.com/openshift/details/s/2Vw0000example
OIDC Endpoint URL: https://rh-oidc.s3.us-east-1.amazonaws.com/24g9q8jdhgoofs8cmp8ilr67njd5p0j8
I: To determine when your cluster is Ready, run 'rosa describe cluster -c data-lakehouse'.
I: To watch your cluster installation logs, run 'rosa logs install -c data-lakehouse --watch'.
Configuration Rationale:
- m5.4xlarge: 16 vCPUs, 64 GB RAM - suitable for Spark executors
- 3 nodes: Allows distributed Spark processing
- Multi-AZ: High availability for production workloads
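The network flags also bound cluster growth: `--host-prefix 23` carves a /23 pod subnet out of the /14 pod CIDR for each node. A quick back-of-the-envelope calculation (a sketch, not a ROSA tool) shows what those prefixes imply:

```python
# Pod-capacity math behind --pod-cidr 10.128.0.0/14 and --host-prefix 23
# (illustrative arithmetic only; not part of the ROSA CLI).

def addresses_per_node(host_prefix: int) -> int:
    # Each node gets a /host_prefix subnet: 2^(32 - prefix) addresses.
    return 2 ** (32 - host_prefix)

def max_nodes(pod_cidr_prefix: int, host_prefix: int) -> int:
    # How many /host_prefix subnets fit inside the pod CIDR.
    return 2 ** (host_prefix - pod_cidr_prefix)

print(addresses_per_node(23))  # 512 pod IPs per node
print(max_nodes(14, 23))       # 512 nodes fit in a /14 pod CIDR
```

With these defaults the pod network is nowhere near a bottleneck for a 3-node Spark cluster, and leaves ample headroom for scaling out executors.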
Step 1.4: Monitor Cluster Creation
# Watch cluster installation progress
rosa logs install --cluster=$CLUSTER_NAME --watch
# Check cluster status
rosa describe cluster --cluster=$CLUSTER_NAME
Example Output:
$ rosa logs install --cluster=data-lakehouse --watch
[2026-01-13 09:46:00] time="2026-01-13T09:46:00Z" level=info msg="Preparing cluster installation"
time="2026-01-13T09:47:15Z" level=info msg="Creating AWS VPC"
time="2026-01-13T09:48:30Z" level=info msg="Creating AWS subnets"
time="2026-01-13T09:50:12Z" level=info msg="Creating security groups"
time="2026-01-13T09:52:45Z" level=info msg="Launching bootstrap instance"
time="2026-01-13T09:55:20Z" level=info msg="Waiting for bootstrap to complete"
time="2026-01-13T10:05:30Z" level=info msg="Destroying bootstrap resources"
time="2026-01-13T10:08:15Z" level=info msg="Installing control plane"
time="2026-01-13T10:15:42Z" level=info msg="Control plane initialized"
time="2026-01-13T10:18:30Z" level=info msg="Installing cluster operators"
time="2026-01-13T10:25:50Z" level=info msg="Cluster installation complete"
$ rosa describe cluster --cluster=data-lakehouse
[2026-01-13 10:26:15] Name: data-lakehouse
ID: 24g9q8jdhgoofs8cmp8ilr67njd5p0j8
External ID:
OpenShift Version: 4.18.0
Channel Group: stable
DNS: data-lakehouse.vxkf.p1.openshiftapps.com
AWS Account: 123456789012
API URL: https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443
Console URL: https://console-openshift-console.apps.data-lakehouse.vxkf.p1.openshiftapps.com
Region: us-east-1
Multi-AZ: true
Nodes:
- Control plane: 3
- Infra: 3
- Compute: 3 (m5.4xlarge)
Network:
- Type: OVNKubernetes
- Service CIDR: 172.30.0.0/16
- Machine CIDR: 10.0.0.0/16
- Pod CIDR: 10.128.0.0/14
- Host Prefix: /23
STS Role ARN: arn:aws:iam::123456789012:role/ManagedOpenShift-Installer-Role
Support Role ARN: arn:aws:iam::123456789012:role/ManagedOpenShift-Support-Role
Instance IAM Roles:
- Control plane: arn:aws:iam::123456789012:role/ManagedOpenShift-ControlPlane-Role
- Worker: arn:aws:iam::123456789012:role/ManagedOpenShift-Worker-Role
State: ready
Private: No
Created: Jan 13 2026 09:45:00 UTC
Details Page: https://console.redhat.com/openshift/details/s/2Vw0000example
OIDC Endpoint URL: https://rh-oidc.s3.us-east-1.amazonaws.com/24g9q8jdhgoofs8cmp8ilr67njd5p0j8
Step 1.5: Create Admin User and Connect
# Create cluster admin user
rosa create admin --cluster=$CLUSTER_NAME
# Use the login command from output
oc login https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443 \
--username cluster-admin \
--password <your-password>
# Verify cluster access
oc cluster-info
oc get nodes
Example Output:
$ rosa create admin --cluster=data-lakehouse
[2026-01-13 10:28:00] I: Admin account has been added to cluster 'data-lakehouse'.
I: Please securely store this generated password. If you lose this password you can delete and recreate the cluster admin user.
I: To login, run the following command:
oc login https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443 --username cluster-admin --password aB3dE-fGh5J-kLm7N-pQr9S
I: It may take several minutes for this access to become active.
$ oc login https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443 --username cluster-admin --password aB3dE-fGh5J-kLm7N-pQr9S
[2026-01-13 10:29:30] Login successful.
You have access to 103 projects, the list has been suppressed. You can list all projects with 'oc projects'
Using project "default".
$ oc cluster-info
[2026-01-13 10:29:45] Kubernetes control plane is running at https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
$ oc get nodes
[2026-01-13 10:30:00] NAME STATUS ROLES AGE VERSION
ip-10-0-128-205.ec2.internal Ready control-plane,master 42m v1.31.0+7c7b8a2
ip-10-0-135-148.ec2.internal Ready control-plane,master 42m v1.31.0+7c7b8a2
ip-10-0-142-87.ec2.internal Ready control-plane,master 42m v1.31.0+7c7b8a2
ip-10-0-152-34.ec2.internal Ready worker 35m v1.31.0+7c7b8a2
ip-10-0-189-72.ec2.internal Ready worker 35m v1.31.0+7c7b8a2
ip-10-0-213-156.ec2.internal Ready worker 35m v1.31.0+7c7b8a2
Step 1.6: Create Project Namespaces
# Create namespace for Spark workloads
oc new-project spark-jobs
# Create namespace for Spark operator
oc new-project spark-operator
Example Output:
$ oc new-project spark-jobs
[2026-01-13 10:31:00] Now using project "spark-jobs" on server "https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443".
You can add applications to this project with the 'new-app' command. For example, try:
oc new-app rails-postgresql-example
to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:
kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname
$ oc new-project spark-operator
[2026-01-13 10:31:15] Now using project "spark-operator" on server "https://api.data-lakehouse.vxkf.p1.openshiftapps.com:6443".
You can add applications to this project with the 'new-app' command. For example, try:
oc new-app rails-postgresql-example
to build a new example application in Ruby. Or use kubectl to deploy a simple Kubernetes application:
kubectl create deployment hello-node --image=registry.k8s.io/e2e-test-images/agnhost:2.43 -- /agnhost serve-hostname
Phase 2: AWS Glue Data Catalog Configuration
Step 2.1: Create Glue Database
# Create Glue database for lakehouse
aws glue create-database \
--database-input '{
"Name": "lakehouse",
"Description": "Data lakehouse with Iceberg tables"
}' \
--region $AWS_REGION
# Create additional databases for different layers
aws glue create-database \
--database-input '{
"Name": "bronze",
"Description": "Raw data landing zone"
}' \
--region $AWS_REGION
aws glue create-database \
--database-input '{
"Name": "silver",
"Description": "Curated and cleaned data"
}' \
--region $AWS_REGION
aws glue create-database \
--database-input '{
"Name": "gold",
"Description": "Analytics-ready aggregated data"
}' \
--region $AWS_REGION
# Verify database creation
aws glue get-databases --region $AWS_REGION
Example Output:
$ aws glue create-database --database-input '{"Name": "lakehouse", "Description": "Data lakehouse with Iceberg tables"}' --region us-east-1
[2026-01-13 10:35:00] (No output indicates success)
$ aws glue create-database --database-input '{"Name": "bronze", "Description": "Raw data landing zone"}' --region us-east-1
[2026-01-13 10:35:15] (No output indicates success)
$ aws glue create-database --database-input '{"Name": "silver", "Description": "Curated and cleaned data"}' --region us-east-1
[2026-01-13 10:35:30] (No output indicates success)
$ aws glue create-database --database-input '{"Name": "gold", "Description": "Analytics-ready aggregated data"}' --region us-east-1
[2026-01-13 10:35:45] (No output indicates success)
$ aws glue get-databases --region us-east-1
[2026-01-13 10:36:00] {
"DatabaseList": [
{
"Name": "bronze",
"Description": "Raw data landing zone",
"CreateTime": "2026-01-13T10:35:15.234000-05:00",
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
],
"CatalogId": "123456789012"
},
{
"Name": "gold",
"Description": "Analytics-ready aggregated data",
"CreateTime": "2026-01-13T10:35:45.789000-05:00",
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
],
"CatalogId": "123456789012"
},
{
"Name": "lakehouse",
"Description": "Data lakehouse with Iceberg tables",
"CreateTime": "2026-01-13T10:35:00.123000-05:00",
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
],
"CatalogId": "123456789012"
},
{
"Name": "silver",
"Description": "Curated and cleaned data",
"CreateTime": "2026-01-13T10:35:30.456000-05:00",
"CreateTableDefaultPermissions": [
{
"Principal": {
"DataLakePrincipalIdentifier": "IAM_ALLOWED_PRINCIPALS"
},
"Permissions": [
"ALL"
]
}
],
"CatalogId": "123456789012"
}
]
}
Step 2.2: Create IAM Role for Glue Catalog Access
# Get ROSA cluster OIDC provider
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Create trust policy for Spark service account
cat > spark-glue-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:spark-jobs:spark-sa"
}
}
}
]
}
EOF
# Create IAM role
export SPARK_ROLE_ARN=$(aws iam create-role \
--role-name SparkGlueCatalogRole \
--assume-role-policy-document file://spark-glue-trust-policy.json \
--query 'Role.Arn' \
--output text)
echo "Spark IAM Role ARN: $SPARK_ROLE_ARN"
Example Output:
$ export OIDC_PROVIDER=$(rosa describe cluster -c data-lakehouse -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
[2026-01-13 10:38:00]
$ export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
[2026-01-13 10:38:05]
$ cat > spark-glue-trust-policy.json <<EOF
[content omitted for brevity]
EOF
[2026-01-13 10:38:20]
$ export SPARK_ROLE_ARN=$(aws iam create-role --role-name SparkGlueCatalogRole --assume-role-policy-document file://spark-glue-trust-policy.json --query 'Role.Arn' --output text)
[2026-01-13 10:38:35]
$ echo "Spark IAM Role ARN: $SPARK_ROLE_ARN"
[2026-01-13 10:38:40] Spark IAM Role ARN: arn:aws:iam::123456789012:role/SparkGlueCatalogRole
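The trust policy pins the role to exactly one Kubernetes service account via the OIDC `sub` claim, so only pods running as `spark-jobs/spark-sa` can assume it. As a sketch, the same document can be assembled programmatically (the OIDC provider value below is illustrative; the real one comes from `rosa describe cluster`):

```python
import json

# Build the web-identity trust policy from its parts (illustrative values).
def trust_policy(account_id: str, oidc_provider: str,
                 namespace: str, service_account: str) -> dict:
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Only this exact service account may assume the role.
                f"{oidc_provider}:sub":
                    f"system:serviceaccount:{namespace}:{service_account}"
            }},
        }],
    }

policy = trust_policy(
    "123456789012",
    "rh-oidc.s3.us-east-1.amazonaws.com/24g9q8jdhgoofs8cmp8ilr67njd5p0j8",
    "spark-jobs", "spark-sa")
print(json.dumps(policy, indent=2))
```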
Step 2.3: Create IAM Policy for Glue and S3 Access
# Create policy for Glue Catalog access
cat > spark-glue-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:GetPartition",
"glue:GetPartitions",
"glue:CreateTable",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:BatchCreatePartition",
"glue:BatchDeletePartition",
"glue:BatchUpdatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:UpdatePartition"
],
"Resource": [
"arn:aws:glue:${AWS_REGION}:${ACCOUNT_ID}:catalog",
"arn:aws:glue:${AWS_REGION}:${ACCOUNT_ID}:database/*",
"arn:aws:glue:${AWS_REGION}:${ACCOUNT_ID}:table/*/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::lakehouse-*",
"arn:aws:s3:::lakehouse-*/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets"
],
"Resource": "*"
}
]
}
EOF
# Create and attach policy
aws iam put-role-policy \
--role-name SparkGlueCatalogRole \
--policy-name GlueS3Access \
--policy-document file://spark-glue-policy.json
echo "IAM policy created and attached"
Example Output:
$ cat > spark-glue-policy.json <<EOF
[content omitted for brevity]
EOF
[2026-01-13 10:40:00]
$ aws iam put-role-policy --role-name SparkGlueCatalogRole --policy-name GlueS3Access --policy-document file://spark-glue-policy.json
[2026-01-13 10:40:15] (No output indicates success)
$ echo "IAM policy created and attached"
[2026-01-13 10:40:20] IAM policy created and attached
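Note that the S3 statement is scoped to `lakehouse-*` bucket ARNs, so the role cannot touch unrelated buckets. A rough way to sanity-check that scoping is simple glob matching (this is a simplification; real IAM policy evaluation has more rules than `fnmatch`):

```python
from fnmatch import fnmatch

# The S3 resource patterns from the policy above.
allowed_resources = ["arn:aws:s3:::lakehouse-*", "arn:aws:s3:::lakehouse-*/*"]

def s3_resource_allowed(arn: str) -> bool:
    # Glob-style check only; a simplification of IAM wildcard semantics.
    return any(fnmatch(arn, pattern) for pattern in allowed_resources)

print(s3_resource_allowed("arn:aws:s3:::lakehouse-data-123456789012/bronze/x.csv"))  # True
print(s3_resource_allowed("arn:aws:s3:::other-bucket/key"))                          # False
```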
Phase 3: S3 Data Lake Setup
Step 3.1: Create S3 Buckets
# Create S3 bucket for data lake
export LAKEHOUSE_BUCKET="lakehouse-data-${ACCOUNT_ID}"
aws s3 mb s3://$LAKEHOUSE_BUCKET --region $AWS_REGION
# Enable versioning for data protection
aws s3api put-bucket-versioning \
--bucket $LAKEHOUSE_BUCKET \
--versioning-configuration Status=Enabled \
--region $AWS_REGION
# Create folder structure for medallion architecture
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key bronze/
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key silver/
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key gold/
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key warehouse/
echo "S3 Data Lake bucket created: s3://$LAKEHOUSE_BUCKET"
Example Output:
$ export LAKEHOUSE_BUCKET="lakehouse-data-${ACCOUNT_ID}"
[2026-01-13 10:42:00]
$ aws s3 mb s3://lakehouse-data-123456789012 --region us-east-1
[2026-01-13 10:42:15] make_bucket: lakehouse-data-123456789012
$ aws s3api put-bucket-versioning --bucket lakehouse-data-123456789012 --versioning-configuration Status=Enabled --region us-east-1
[2026-01-13 10:42:30] (No output indicates success)
$ aws s3api put-object --bucket lakehouse-data-123456789012 --key bronze/
[2026-01-13 10:42:45] {
"ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
"ServerSideEncryption": "AES256"
}
$ aws s3api put-object --bucket lakehouse-data-123456789012 --key silver/
[2026-01-13 10:43:00] {
"ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
"ServerSideEncryption": "AES256"
}
$ aws s3api put-object --bucket lakehouse-data-123456789012 --key gold/
[2026-01-13 10:43:15] {
"ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
"ServerSideEncryption": "AES256"
}
$ aws s3api put-object --bucket lakehouse-data-123456789012 --key warehouse/
[2026-01-13 10:43:30] {
"ETag": "\"d41d8cd98f00b204e9800998ecf8427e\"",
"ServerSideEncryption": "AES256"
}
$ echo "S3 Data Lake bucket created: s3://$LAKEHOUSE_BUCKET"
[2026-01-13 10:43:45] S3 Data Lake bucket created: s3://lakehouse-data-123456789012
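The prefixes created above form the medallion layout that later pipelines write into. A small helper makes the naming convention explicit (the function names are illustrative, not part of any SDK):

```python
# The medallion layout used above, expressed as a path helper
# (illustrative; function names are not part of any AWS SDK).
LAYERS = ("bronze", "silver", "gold")

def layer_key(layer: str, dataset: str, filename: str) -> str:
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer}")
    return f"{layer}/{dataset}/{filename}"

def s3_uri(bucket: str, key: str) -> str:
    return f"s3://{bucket}/{key}"

key = layer_key("bronze", "sales", "sales_data.csv")
print(s3_uri("lakehouse-data-123456789012", key))
# s3://lakehouse-data-123456789012/bronze/sales/sales_data.csv
```

The separate `warehouse/` prefix is reserved for Iceberg-managed table data and metadata, keeping it distinct from the raw landing layers.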
Step 3.2: Configure S3 Bucket Policies
# Create bucket policy for secure access
cat > lakehouse-bucket-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowSparkAccess",
"Effect": "Allow",
"Principal": {
"AWS": "$SPARK_ROLE_ARN"
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${LAKEHOUSE_BUCKET}",
"arn:aws:s3:::${LAKEHOUSE_BUCKET}/*"
]
}
]
}
EOF
# Apply bucket policy
aws s3api put-bucket-policy \
--bucket $LAKEHOUSE_BUCKET \
--policy file://lakehouse-bucket-policy.json
echo "Bucket policy applied"
Example Output:
$ cat > lakehouse-bucket-policy.json <<EOF
[content omitted for brevity]
EOF
[2026-01-13 10:45:00]
$ aws s3api put-bucket-policy --bucket lakehouse-data-123456789012 --policy file://lakehouse-bucket-policy.json
[2026-01-13 10:45:15] (No output indicates success)
$ echo "Bucket policy applied"
[2026-01-13 10:45:20] Bucket policy applied
Step 3.3: Upload Sample Data
# Create sample dataset
mkdir -p sample-data
cd sample-data
# Generate sample sales data
python3 <<PYTHON
import csv
import random
from datetime import datetime, timedelta
# Generate sample sales data
with open('sales_data.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['transaction_id', 'date', 'product', 'category', 'amount', 'quantity', 'region'])
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones']
categories = ['Electronics', 'Accessories']
regions = ['North', 'South', 'East', 'West']
base_date = datetime(2024, 1, 1)
for i in range(10000):
transaction_date = base_date + timedelta(days=random.randint(0, 365))
product = random.choice(products)
category = 'Electronics' if product in ['Laptop', 'Monitor'] else 'Accessories'
writer.writerow([
f'TXN{i:06d}',
transaction_date.strftime('%Y-%m-%d'),
product,
category,
round(random.uniform(10, 2000), 2),
random.randint(1, 10),
random.choice(regions)
])
print("Sample data generated: sales_data.csv")
PYTHON
# Upload to S3 bronze layer
aws s3 cp sales_data.csv s3://$LAKEHOUSE_BUCKET/bronze/sales/sales_data.csv
cd ..
echo "Sample data uploaded to S3"
Example Output:
$ mkdir -p sample-data
[2026-01-13 10:47:00]
$ cd sample-data
[2026-01-13 10:47:05]
$ python3 <<PYTHON
[script content]
PYTHON
[2026-01-13 10:47:30] Sample data generated: sales_data.csv
$ aws s3 cp sales_data.csv s3://lakehouse-data-123456789012/bronze/sales/sales_data.csv
[2026-01-13 10:48:00] upload: ./sales_data.csv to s3://lakehouse-data-123456789012/bronze/sales/sales_data.csv
$ cd ..
[2026-01-13 10:48:05]
$ echo "Sample data uploaded to S3"
[2026-01-13 10:48:10] Sample data uploaded to S3
Phase 4: Apache Spark on OpenShift
Step 4.1: Install Spark Operator
# Add Spark Operator Helm repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
# Install Spark Operator
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--create-namespace \
--set webhook.enable=true \
--set sparkJobNamespace=spark-jobs
# Verify installation
kubectl get pods -n spark-operator
kubectl get crd | grep spark
Example Output:
$ helm repo add spark-operator https://kubeflow.github.io/spark-operator
[2026-01-13 10:50:00] "spark-operator" has been added to your repositories
$ helm repo update
[2026-01-13 10:50:15] Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "spark-operator" chart repository
Update Complete. ⎈Happy Helming!⎈
$ helm install spark-operator spark-operator/spark-operator --namespace spark-operator --create-namespace --set webhook.enable=true --set sparkJobNamespace=spark-jobs
[2026-01-13 10:51:00] NAME: spark-operator
LAST DEPLOYED: Mon Jan 13 10:51:00 2026
NAMESPACE: spark-operator
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Verify the Spark Operator deployment:
kubectl get pods -n spark-operator
2. Check the webhook:
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
3. Submit a SparkApplication:
kubectl apply -f examples/spark-pi.yaml
For more information, visit https://github.com/kubeflow/spark-operator
$ kubectl get pods -n spark-operator
[2026-01-13 10:51:30] NAME READY STATUS RESTARTS AGE
spark-operator-5f7b8c9d6b-xq4zm 1/1 Running 0 30s
$ kubectl get crd | grep spark
[2026-01-13 10:51:45] scheduledsparkapplications.sparkoperator.k8s.io 2026-01-13T15:51:00Z
sparkapplications.sparkoperator.k8s.io 2026-01-13T15:51:00Z
Step 4.2: Create Service Account for Spark
# Create service account with IAM role annotation
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark-sa
namespace: spark-jobs
annotations:
eks.amazonaws.com/role-arn: $SPARK_ROLE_ARN
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: spark-role
namespace: spark-jobs
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps"]
verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: spark-rolebinding
namespace: spark-jobs
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: spark-role
subjects:
- kind: ServiceAccount
name: spark-sa
namespace: spark-jobs
EOF
# Verify service account
oc get sa spark-sa -n spark-jobs -o yaml
Example Output:
$ cat <<EOF | oc apply -f -
[manifest content]
EOF
[2026-01-13 10:53:00] serviceaccount/spark-sa created
role.rbac.authorization.k8s.io/spark-role created
rolebinding.rbac.authorization.k8s.io/spark-rolebinding created
$ oc get sa spark-sa -n spark-jobs -o yaml
[2026-01-13 10:53:15] apiVersion: v1
kind: ServiceAccount
metadata:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/SparkGlueCatalogRole
creationTimestamp: "2026-01-13T15:53:00Z"
name: spark-sa
namespace: spark-jobs
resourceVersion: "123456"
uid: a1b2c3d4-e5f6-7890-abcd-ef1234567890
secrets:
- name: spark-sa-dockercfg-xyz12
Step 4.3: Create ConfigMap for Spark Configuration
# Create Spark configuration
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: spark-config
namespace: spark-jobs
data:
spark-defaults.conf: |
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_catalog.warehouse=s3://${LAKEHOUSE_BUCKET}/warehouse
spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://${LAKEHOUSE_BUCKET}/spark-events
lakehouse.conf: |
LAKEHOUSE_BUCKET=${LAKEHOUSE_BUCKET}
AWS_REGION=${AWS_REGION}
GLUE_DATABASE=lakehouse
EOF
Example Output:
$ cat <<EOF | oc apply -f -
[manifest content]
EOF
[2026-01-13 10:55:00] configmap/spark-config created
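The key lines in `spark-defaults.conf` register a Spark catalog named `glue_catalog` backed by Iceberg's `GlueCatalog` implementation, so tables are addressed as `glue_catalog.<database>.<table>`. The same settings can be expressed as a plain dict, which with PySpark installed you could apply via per-key `SparkSession.builder.config()` calls (a sketch; the bucket name below is an assumed stand-in for `$LAKEHOUSE_BUCKET`):

```python
# The Iceberg/Glue catalog settings from spark-defaults.conf as a dict
# (illustrative; bucket name is an assumed stand-in for $LAKEHOUSE_BUCKET).
bucket = "lakehouse-data-123456789012"

iceberg_conf = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl":
        "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl":
        "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse": f"s3://{bucket}/warehouse",
}

# Tables are then referenced as <catalog>.<database>.<table>:
table = "glue_catalog.lakehouse.sales"
print(table)  # glue_catalog.lakehouse.sales
```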
Phase 5: Apache Iceberg Integration
Step 5.1: Build Custom Spark Image with Iceberg
# Create directory for custom Spark image
mkdir -p spark-iceberg
cd spark-iceberg
# Create Dockerfile
cat > Dockerfile <<'DOCKERFILE'
FROM gcr.io/spark-operator/spark:v3.5.0
USER root
# Install AWS dependencies and Iceberg
RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
-o /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar
RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
-o /opt/spark/jars/hadoop-aws-3.3.4.jar
RUN curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar \
-o /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar
RUN curl -L https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.20.18/bundle-2.20.18.jar \
-o /opt/spark/jars/bundle-2.20.18.jar
RUN curl -L https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.20.18/url-connection-client-2.20.18.jar \
-o /opt/spark/jars/url-connection-client-2.20.18.jar
USER 185
ENTRYPOINT ["/opt/entrypoint.sh"]
DOCKERFILE
# Build and push to a container registry
# For this example, we'll use OpenShift internal registry
oc create imagestream spark-iceberg -n spark-jobs
# Build image using OpenShift build
cat > BuildConfig.yaml <<EOF
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
  name: spark-iceberg
  namespace: spark-jobs
spec:
  output:
    to:
      kind: ImageStreamTag
      name: spark-iceberg:latest
  source:
    dockerfile: |
$(sed 's/^/      /' Dockerfile)
    type: Dockerfile
  strategy:
    dockerStrategy: {}
    type: Docker
EOF
oc apply -f BuildConfig.yaml
# Start build
oc start-build spark-iceberg -n spark-jobs --follow
# Get image reference
export SPARK_IMAGE=$(oc get is spark-iceberg -n spark-jobs -o jsonpath='{.status.dockerImageRepository}'):latest
cd ..
echo "Custom Spark image with Iceberg built: $SPARK_IMAGE"
Example Output:
$ mkdir -p spark-iceberg
[2026-01-13 11:00:00]
$ cd spark-iceberg
[2026-01-13 11:00:05]
$ cat > Dockerfile <<'DOCKERFILE'
[content omitted for brevity]
DOCKERFILE
[2026-01-13 11:00:30]
$ oc create imagestream spark-iceberg -n spark-jobs
[2026-01-13 11:01:00] imagestream.image.openshift.io/spark-iceberg created
$ cat > BuildConfig.yaml <<EOF
[content omitted for brevity]
EOF
[2026-01-13 11:01:15]
$ oc apply -f BuildConfig.yaml
[2026-01-13 11:01:30] buildconfig.build.openshift.io/spark-iceberg created
$ oc start-build spark-iceberg -n spark-jobs --follow
[2026-01-13 11:01:45] build.build.openshift.io/spark-iceberg-1 started
Step 1/9 : FROM gcr.io/spark-operator/spark:v3.5.0
---> 1a2b3c4d5e6f
Step 2/9 : USER root
---> Running in 7g8h9i0j1k2l
Removing intermediate container 7g8h9i0j1k2l
---> 3m4n5o6p7q8r
Step 3/9 : RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar -o /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar
---> Running in 9s0t1u2v3w4x
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 45.2M 100 45.2M 0 0 15.3M 0 0:00:02 0:00:02 --:--:-- 15.3M
Removing intermediate container 9s0t1u2v3w4x
---> 5y6z7a8b9c0d
Step 4/9 : RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar -o /opt/spark/jars/hadoop-aws-3.3.4.jar
---> Running in 1e2f3g4h5i6j
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 789k 100 789k 0 0 2145k 0 --:--:-- --:--:-- --:--:-- 2145k
Removing intermediate container 1e2f3g4h5i6j
---> 7k8l9m0n1o2p
Step 5/9 : RUN curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar -o /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar
---> Running in 3q4r5s6t7u8v
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 289M 100 289M 0 0 45.2M 0 0:00:06 0:00:06 --:--:-- 52.1M
Removing intermediate container 3q4r5s6t7u8v
---> 9w0x1y2z3a4b
[Steps 6/9 and 7/9: AWS SDK v2 bundle and url-connection-client downloads omitted for brevity]
Step 8/9 : USER 185
---> Running in 5c6d7e8f9g0h
Removing intermediate container 5c6d7e8f9g0h
---> 1i2j3k4l5m6n
Step 9/9 : ENTRYPOINT ["/opt/entrypoint.sh"]
---> Running in 7o8p9q0r1s2t
Removing intermediate container 7o8p9q0r1s2t
---> 3u4v5w6x7y8z
Successfully built 3u4v5w6x7y8z
Successfully tagged image-registry.openshift-image-registry.svc:5000/spark-jobs/spark-iceberg:latest
Pushing image image-registry.openshift-image-registry.svc:5000/spark-jobs/spark-iceberg:latest ...
Getting image source signatures
Copying blob sha256:9a0b1c2d3e4f...
Copying blob sha256:5f6e7d8c9b0a...
Copying blob sha256:1g2h3i4j5k6l...
Copying config sha256:3u4v5w6x7y8z...
Writing manifest to image destination
Storing signatures
Successfully pushed image-registry.openshift-image-registry.svc:5000/spark-jobs/spark-iceberg@sha256:7m8n9o0p1q2r3s4t5u6v7w8x9y0z1a2b3c4d5e6f7g8h9i0j1k2l3m4n5o6p
Push successful
$ export SPARK_IMAGE=$(oc get is spark-iceberg -n spark-jobs -o jsonpath='{.status.dockerImageRepository}'):latest
[2026-01-13 11:08:30]
$ cd ..
[2026-01-13 11:08:35]
$ echo "Custom Spark image with Iceberg built: $SPARK_IMAGE"
[2026-01-13 11:08:40] Custom Spark image with Iceberg built: image-registry.openshift-image-registry.svc:5000/spark-jobs/spark-iceberg:latest
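The jar coordinates in the Dockerfile must stay mutually consistent: the Iceberg runtime artifact name encodes both the Spark minor version (3.5) and the Scala version (2.12). A hypothetical helper that derives the Maven Central download URLs from a single set of version variables makes version bumps less error-prone:

```python
MAVEN = "https://repo1.maven.org/maven2"

def maven_jar_url(group: str, artifact: str, version: str) -> str:
    """Build a Maven Central download URL from standard coordinates."""
    group_path = group.replace(".", "/")
    return f"{MAVEN}/{group_path}/{artifact}/{version}/{artifact}-{version}.jar"

# Version variables; change these together when upgrading the image
SPARK_MINOR, SCALA, ICEBERG = "3.5", "2.12", "1.4.2"

# The Iceberg runtime artifact name embeds the Spark and Scala versions
iceberg_runtime = f"iceberg-spark-runtime-{SPARK_MINOR}_{SCALA}"
url = maven_jar_url("org.apache.iceberg", iceberg_runtime, ICEBERG)
print(url)
```

The printed URL matches the first RUN line in the Dockerfile above; regenerating the curl commands from such a script keeps all five downloads in sync.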
Phase 6: Spark-Glue Catalog Integration
Step 6.1: Create Sample Spark Application
# Create PySpark script for data processing
mkdir -p spark-jobs
cd spark-jobs
cat > process_sales.py <<'PYTHON'
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, sum as _sum, avg, count
import sys

def main():
    # Create Spark session with Iceberg and Glue Catalog
    spark = SparkSession.builder \
        .appName("ProcessSalesData") \
        .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
        .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
        .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
        .getOrCreate()
    spark.sparkContext.setLogLevel("INFO")

    # Get the bucket name from the command-line arguments
    bucket = sys.argv[1] if len(sys.argv) > 1 else "lakehouse-data"
    print(f"Reading data from s3a://{bucket}/bronze/sales/")

    # Read raw CSV data
    df_raw = spark.read.csv(
        f"s3a://{bucket}/bronze/sales/sales_data.csv",
        header=True,
        inferSchema=True
    )
    print(f"Raw data count: {df_raw.count()}")
    df_raw.show(5)

    # Create bronze table in Glue Catalog (overwrites any existing table)
    df_raw.write \
        .format("iceberg") \
        .mode("overwrite") \
        .option("path", f"s3a://{bucket}/warehouse/bronze.db/sales") \
        .saveAsTable("glue_catalog.bronze.sales")
    print("Bronze table created in Glue Catalog")

    # Transform data for silver layer
    df_silver = df_raw \
        .withColumn("year", year(col("date"))) \
        .withColumn("month", month(col("date"))) \
        .filter(col("amount") > 0) \
        .dropDuplicates(["transaction_id"])

    # Write to silver layer
    df_silver.write \
        .format("iceberg") \
        .mode("overwrite") \
        .partitionBy("year", "month") \
        .option("path", f"s3a://{bucket}/warehouse/silver.db/sales_clean") \
        .saveAsTable("glue_catalog.silver.sales_clean")
    print("Silver table created with partitioning")

    # Create aggregated gold layer
    df_gold = df_silver.groupBy("year", "month", "category", "region") \
        .agg(
            _sum("amount").alias("total_revenue"),
            _sum("quantity").alias("total_quantity"),
            avg("amount").alias("avg_transaction_value"),
            count("transaction_id").alias("transaction_count")
        )

    # Write to gold layer
    df_gold.write \
        .format("iceberg") \
        .mode("overwrite") \
        .option("path", f"s3a://{bucket}/warehouse/gold.db/sales_summary") \
        .saveAsTable("glue_catalog.gold.sales_summary")
    print("Gold table created with aggregations")

    # Show sample results
    print("\n=== Bronze Layer Sample ===")
    spark.sql("SELECT * FROM glue_catalog.bronze.sales LIMIT 5").show()
    print("\n=== Silver Layer Sample ===")
    spark.sql("SELECT * FROM glue_catalog.silver.sales_clean LIMIT 5").show()
    print("\n=== Gold Layer Sample ===")
    spark.sql("SELECT * FROM glue_catalog.gold.sales_summary ORDER BY total_revenue DESC LIMIT 10").show()

    # Verify tables in Glue Catalog
    print("\n=== Tables in Glue Catalog ===")
    spark.sql("SHOW TABLES IN glue_catalog.bronze").show()
    spark.sql("SHOW TABLES IN glue_catalog.silver").show()
    spark.sql("SHOW TABLES IN glue_catalog.gold").show()

    spark.stop()

if __name__ == "__main__":
    main()
PYTHON
# Upload script to S3
aws s3 cp process_sales.py s3://$LAKEHOUSE_BUCKET/scripts/
cd ..
Example Output:
$ mkdir -p spark-jobs
[2026-01-13 11:10:00]
$ cd spark-jobs
[2026-01-13 11:10:05]
$ cat > process_sales.py <<'PYTHON'
[content omitted for brevity]
PYTHON
[2026-01-13 11:12:00]
$ aws s3 cp process_sales.py s3://lakehouse-data-123456789012/scripts/
[2026-01-13 11:12:15] upload: ./process_sales.py to s3://lakehouse-data-123456789012/scripts/process_sales.py
$ cd ..
[2026-01-13 11:12:20]
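The job above assumes a CSV at s3a://$LAKEHOUSE_BUCKET/bronze/sales/sales_data.csv with columns transaction_id, date, product, category, amount, quantity, and region (the schema implied by the script). If the bronze layer hasn't been populated yet, a minimal, hypothetical generator like this can produce compatible test data:

```python
import csv
import random
from datetime import date, timedelta

def generate_sales_csv(path: str, rows: int = 1000, seed: int = 42) -> None:
    """Write a sample sales CSV matching the schema process_sales.py expects."""
    rng = random.Random(seed)  # fixed seed for reproducible test data
    products = [("Laptop", "Electronics"), ("Monitor", "Electronics"),
                ("Mouse", "Accessories"), ("Keyboard", "Accessories"),
                ("Headphones", "Accessories")]
    regions = ["North", "South", "East", "West"]
    start = date(2024, 1, 1)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["transaction_id", "date", "product", "category",
                         "amount", "quantity", "region"])
        for i in range(rows):
            product, category = rng.choice(products)
            writer.writerow([
                f"TXN{i:06d}",
                (start + timedelta(days=rng.randrange(365))).isoformat(),
                product,
                category,
                round(rng.uniform(10, 1500), 2),
                rng.randint(1, 5),
                rng.choice(regions),
            ])

generate_sales_csv("sales_data.csv", rows=100)
```

Upload the result with: aws s3 cp sales_data.csv s3://$LAKEHOUSE_BUCKET/bronze/sales/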
Step 6.2: Create SparkApplication Custom Resource
# Create SparkApplication manifest
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: process-sales-data
  namespace: spark-jobs
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  image: $SPARK_IMAGE
  imagePullPolicy: Always
  mainApplicationFile: s3a://$LAKEHOUSE_BUCKET/scripts/process_sales.py
  arguments:
    - "$LAKEHOUSE_BUCKET"
  sparkVersion: "3.5.0"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "2g"
    labels:
      version: "3.5.0"
    serviceAccount: spark-sa
    env:
      - name: AWS_REGION
        value: "$AWS_REGION"
      - name: AWS_ROLE_ARN
        value: "$SPARK_ROLE_ARN"
      - name: AWS_WEB_IDENTITY_TOKEN_FILE
        value: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
    volumeMounts:
      - name: aws-iam-token
        mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
        readOnly: true
  executor:
    cores: 2
    instances: 3
    memory: "4g"
    labels:
      version: "3.5.0"
    env:
      - name: AWS_REGION
        value: "$AWS_REGION"
      - name: AWS_ROLE_ARN
        value: "$SPARK_ROLE_ARN"
      - name: AWS_WEB_IDENTITY_TOKEN_FILE
        value: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
    volumeMounts:
      - name: aws-iam-token
        mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
        readOnly: true
  volumes:
    - name: aws-iam-token
      projected:
        sources:
          - serviceAccountToken:
              audience: sts.amazonaws.com
              expirationSeconds: 86400
              path: token
  sparkConf:
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
    "spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog"
    "spark.sql.catalog.glue_catalog.warehouse": "s3a://$LAKEHOUSE_BUCKET/warehouse"
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    "spark.kubernetes.allocation.batch.size": "3"
EOF
Example Output:
$ cat <<EOF | oc apply -f -
[manifest content]
EOF
[2026-01-13 11:15:00] sparkapplication.sparkoperator.k8s.io/process-sales-data created
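Before submitting, it's worth checking that the spark-jobs namespace has quota for what this manifest requests: one driver (1 core, 2 GiB) plus three executors (2 cores, 4 GiB each). A small arithmetic sketch of the totals (this ignores Spark's default memory overhead of roughly 10% per JVM pod, which is added on top):

```python
def total_resources(driver_cores, driver_mem_gb, exec_cores, exec_mem_gb, instances):
    """Sum the cores and memory a SparkApplication will request from the cluster."""
    cores = driver_cores + exec_cores * instances
    mem_gb = driver_mem_gb + exec_mem_gb * instances
    return cores, mem_gb

# Values from the manifest above: driver 1 core/2g, 3 executors at 2 cores/4g
cores, mem_gb = total_resources(1, 2, 2, 4, 3)
print(f"{cores} cores, {mem_gb} GiB")  # -> 7 cores, 14 GiB
```

If a ResourceQuota on the namespace is smaller than this, executor pods will sit in Pending and the job will stall at the allocation step.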
Phase 7: Sample Data Pipelines
Step 7.1: Create Incremental Processing Pipeline
# Create incremental processing script
cat > spark-jobs/incremental_pipeline.py <<'PYTHON'
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, year, month, sum as _sum, avg, count
from datetime import datetime
import sys

def main():
    spark = SparkSession.builder \
        .appName("IncrementalPipeline") \
        .getOrCreate()

    bucket = sys.argv[1]
    batch_date = sys.argv[2] if len(sys.argv) > 2 else datetime.now().strftime('%Y-%m-%d')
    print(f"Processing incremental data for date: {batch_date}")

    # Read existing silver table
    df_existing = spark.read \
        .format("iceberg") \
        .load("glue_catalog.silver.sales_clean")
    print(f"Existing silver record count: {df_existing.count()}")

    # Read new data for the batch date (simulating an incremental load) and
    # derive the partition columns the silver table expects
    df_new = spark.read.csv(
        f"s3a://{bucket}/bronze/sales/sales_data.csv",
        header=True,
        inferSchema=True
    ).filter(col("date") == batch_date) \
        .withColumn("year", year(col("date"))) \
        .withColumn("month", month(col("date")))

    # Append the new records to the silver table
    df_new.writeTo("glue_catalog.silver.sales_clean").append()
    print(f"Appended {df_new.count()} records to silver table")

    # Re-read the affected rows to update gold aggregations
    df_updated = spark.read \
        .format("iceberg") \
        .load("glue_catalog.silver.sales_clean") \
        .filter(col("date") == batch_date)

    # Recalculate aggregations for the affected partitions
    df_agg = df_updated \
        .groupBy("year", "month", "category", "region") \
        .agg(
            _sum("amount").alias("total_revenue"),
            _sum("quantity").alias("total_quantity"),
            avg("amount").alias("avg_transaction_value"),
            count("transaction_id").alias("transaction_count")
        )

    # Append the recomputed aggregations to the gold table; a production
    # pipeline would use MERGE INTO to replace rows for reprocessed partitions
    df_agg.writeTo("glue_catalog.gold.sales_summary").append()
    print("Gold table updated with incremental aggregations")

    spark.stop()

if __name__ == "__main__":
    main()
PYTHON
# Upload to S3
aws s3 cp spark-jobs/incremental_pipeline.py s3://$LAKEHOUSE_BUCKET/scripts/
Example Output:
$ cat > spark-jobs/incremental_pipeline.py <<'PYTHON'
[content omitted for brevity]
PYTHON
[2026-01-13 11:20:00]
$ aws s3 cp spark-jobs/incremental_pipeline.py s3://lakehouse-data-123456789012/scripts/
[2026-01-13 11:20:15] upload: spark-jobs/incremental_pipeline.py to s3://lakehouse-data-123456789012/scripts/incremental_pipeline.py
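A plain append is only safe if each batch date is processed exactly once; reprocessing a date duplicates rows. The upsert semantics that a SQL MERGE INTO would provide can be sketched in pure Python (illustrative only, not Spark code):

```python
def upsert(existing, incoming, key):
    """Merge incoming rows into existing rows, replacing any match on `key`."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row  # update if the key exists, insert otherwise
    return list(merged.values())

# Reprocessing TXN0 with a corrected amount must not create a duplicate row
silver = [{"transaction_id": "TXN0", "amount": 10.0}]
batch = [{"transaction_id": "TXN0", "amount": 12.5},
         {"transaction_id": "TXN1", "amount": 7.0}]
result = upsert(silver, batch, "transaction_id")
print(len(result))  # -> 2 (TXN0 updated in place, TXN1 inserted)
```

In Iceberg, the equivalent would be MERGE INTO glue_catalog.silver.sales_clean USING updates ON matching transaction_id, which the IcebergSparkSessionExtensions configured earlier enable.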
Step 7.2: Create Time Travel Query Example
# Create time travel demonstration script
cat > spark-jobs/time_travel.py <<'PYTHON'
from pyspark.sql import SparkSession
import sys

def main():
    spark = SparkSession.builder \
        .appName("IcebergTimeTravel") \
        .getOrCreate()

    # The bucket argument is accepted for a consistent CLI but unused here
    bucket = sys.argv[1] if len(sys.argv) > 1 else None

    # Read current version
    print("=== Current Version ===")
    df_current = spark.read \
        .format("iceberg") \
        .load("glue_catalog.silver.sales_clean")
    print(f"Current record count: {df_current.count()}")
    df_current.show(5)

    # Show table history
    print("\n=== Table History ===")
    spark.sql("SELECT * FROM glue_catalog.silver.sales_clean.history").show()

    # Show table snapshots
    print("\n=== Table Snapshots ===")
    spark.sql("SELECT * FROM glue_catalog.silver.sales_clean.snapshots").show()

    # Query the earliest snapshot (if one exists)
    snapshots = spark.sql("SELECT snapshot_id FROM glue_catalog.silver.sales_clean.snapshots ORDER BY committed_at LIMIT 1").collect()
    if snapshots:
        snapshot_id = snapshots[0][0]
        print(f"\n=== Data at Snapshot {snapshot_id} ===")
        df_snapshot = spark.read \
            .format("iceberg") \
            .option("snapshot-id", snapshot_id) \
            .load("glue_catalog.silver.sales_clean")
        print(f"Snapshot record count: {df_snapshot.count()}")
        df_snapshot.show(5)

    # Show table metadata
    print("\n=== Table Metadata ===")
    spark.sql("DESCRIBE EXTENDED glue_catalog.silver.sales_clean").show(100, False)

    spark.stop()

if __name__ == "__main__":
    main()
PYTHON
# Upload to S3
aws s3 cp spark-jobs/time_travel.py s3://$LAKEHOUSE_BUCKET/scripts/
Example Output:
$ cat > spark-jobs/time_travel.py <<'PYTHON'
[content omitted for brevity]
PYTHON
[2026-01-13 11:22:00]
$ aws s3 cp spark-jobs/time_travel.py s3://lakehouse-data-123456789012/scripts/
[2026-01-13 11:22:15] upload: spark-jobs/time_travel.py to s3://lakehouse-data-123456789012/scripts/time_travel.py
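Besides snapshot-id, Iceberg also supports as-of-timestamp reads, which resolve to the latest snapshot committed at or before the requested time. That resolution rule can be sketched in pure Python over (snapshot_id, committed_at) pairs like those the .snapshots metadata table returns:

```python
def snapshot_as_of(snapshots, ts):
    """Return the id of the latest snapshot committed at or before ts, else None.

    snapshots is an iterable of (snapshot_id, committed_at) pairs; committed_at
    is any comparable timestamp (epoch millis here for simplicity).
    """
    eligible = [(committed, sid) for sid, committed in snapshots if committed <= ts]
    # max() compares committed_at first, so this picks the newest eligible snapshot
    return max(eligible)[1] if eligible else None

snaps = [(101, 1000), (102, 2000), (103, 3000)]  # (snapshot_id, committed_at)
print(snapshot_as_of(snaps, 2500))  # -> 102
```

A timestamp earlier than the first commit yields None, which mirrors Iceberg rejecting an as-of-timestamp read that predates the table.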
Testing and Validation
Test 1: Monitor Spark Application
# Check SparkApplication status
kubectl get sparkapplication -n spark-jobs
# Describe application
kubectl describe sparkapplication process-sales-data -n spark-jobs
# Watch driver pod logs
export DRIVER_POD=$(kubectl get pods -n spark-jobs -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $DRIVER_POD -n spark-jobs
# Check executor pods
kubectl get pods -n spark-jobs -l spark-role=executor
Example Output:
$ kubectl get sparkapplication -n spark-jobs
[2026-01-13 11:25:00] NAME STATUS ATTEMPTS START FINISH AGE
process-sales-data RUNNING 1 2026-01-13T11:24:30Z 3m
$ kubectl describe sparkapplication process-sales-data -n spark-jobs
[2026-01-13 11:25:15] Name: process-sales-data
Namespace: spark-jobs
Labels: <none>
Annotations: <none>
API Version: sparkoperator.k8s.io/v1beta2
Kind: SparkApplication
Metadata:
Creation Timestamp: 2026-01-13T16:24:15Z
Generation: 1
Resource Version: 234567
UID: f1g2h3i4-j5k6-7l8m-9n0o-p1q2r3s4t5u6
Spec:
Driver:
Cores: 1
Core Limit: 1200m
Memory: 2g
Service Account: spark-sa
Executor:
Cores: 2
Instances: 3
Memory: 4g
Image: image-registry.openshift-image-registry.svc:5000/spark-jobs/spark-iceberg:latest
Main Application File: s3a://lakehouse-data-123456789012/scripts/process_sales.py
Mode: cluster
Python Version: 3
Spark Version: 3.5.0
Type: Python
Status:
Application State:
State: RUNNING
Driver Info:
Pod Name: process-sales-data-driver
Web UI Service Name: process-sales-data-ui-svc
Execution Attempts: 1
Last Submission Attempt Time: 2026-01-13T16:24:30Z
Spark Application Id: spark-application-1705165470123-456789
Submission Attempts: 1
Termination Time: <nil>
$ export DRIVER_POD=$(kubectl get pods -n spark-jobs -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
[2026-01-13 11:25:30]
$ kubectl logs -f process-sales-data-driver -n spark-jobs
[2026-01-13 11:25:45] 26/01/13 16:25:45 INFO SparkContext: Running Spark version 3.5.0
26/01/13 16:25:46 INFO ResourceUtils: ==============================================================
26/01/13 16:25:46 INFO ResourceUtils: No custom resources configured for spark.driver.
26/01/13 16:25:46 INFO ResourceUtils: ==============================================================
26/01/13 16:25:46 INFO SparkContext: Submitted application: ProcessSalesData
26/01/13 16:25:47 INFO SecurityManager: Changing view acls to: 185
26/01/13 16:25:47 INFO SecurityManager: Changing modify acls to: 185
26/01/13 16:25:47 INFO SecurityManager: Changing view acls groups to:
26/01/13 16:25:47 INFO SecurityManager: Changing modify acls groups to:
26/01/13 16:25:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(185); groups with view permissions: Set(); users with modify permissions: Set(185); groups with modify permissions: Set()
26/01/13 16:25:48 INFO Utils: Successfully started service 'sparkDriver' on port 7078.
26/01/13 16:25:49 INFO SparkEnv: Registering MapOutputTracker
26/01/13 16:25:49 INFO SparkEnv: Registering BlockManagerMaster
26/01/13 16:25:50 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
26/01/13 16:25:50 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
26/01/13 16:26:15 INFO StandaloneSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
26/01/13 16:26:15 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir.
26/01/13 16:26:15 INFO SharedState: Warehouse path is 's3a://lakehouse-data-123456789012/warehouse'.
Reading data from s3a://lakehouse-data-123456789012/bronze/sales/
26/01/13 16:26:30 INFO FileSourceStrategy: Pushed Filters: []
26/01/13 16:26:30 INFO FileSourceStrategy: Post-Scan Filters: []
26/01/13 16:26:30 INFO CodeGenerator: Code generated in 156.234567 ms
26/01/13 16:26:31 INFO FileSourceScanExec: Planning scan with bin packing, max size: 4194304 bytes, open cost is considered as scanning 4194304 bytes.
Raw data count: 10000
26/01/13 16:26:45 INFO CodeGenerator: Code generated in 23.456789 ms
+---------------+----------+----------+------------+-------+--------+------+
|transaction_id| date| product| category| amount|quantity|region|
+---------------+----------+----------+------------+-------+--------+------+
| TXN000000|2024-03-15| Laptop|Electronics|1245.67| 3| North|
| TXN000001|2024-07-22| Mouse|Accessories| 23.45| 5| East|
| TXN000002|2024-01-08| Keyboard|Accessories| 67.89| 2| South|
| TXN000003|2024-11-30| Monitor|Electronics| 345.00| 1| West|
| TXN000004|2024-05-12|Headphones|Accessories| 125.50| 4| North|
+---------------+----------+----------+------------+-------+--------+------+
only showing top 5 rows
26/01/13 16:27:00 INFO GlueCatalog: Glue catalog initialized
26/01/13 16:27:15 INFO BaseTable: Creating Iceberg table bronze.sales
Bronze table created in Glue Catalog
26/01/13 16:28:30 INFO BaseTable: Creating Iceberg table silver.sales_clean with partitioning
Silver table created with partitioning
26/01/13 16:29:45 INFO BaseTable: Creating Iceberg table gold.sales_summary
Gold table created with aggregations
=== Bronze Layer Sample ===
+---------------+----------+----------+------------+-------+--------+------+
|transaction_id| date| product| category| amount|quantity|region|
+---------------+----------+----------+------------+-------+--------+------+
| TXN000000|2024-03-15| Laptop|Electronics|1245.67| 3| North|
| TXN000001|2024-07-22| Mouse|Accessories| 23.45| 5| East|
| TXN000002|2024-01-08| Keyboard|Accessories| 67.89| 2| South|
| TXN000003|2024-11-30| Monitor|Electronics| 345.00| 1| West|
| TXN000004|2024-05-12|Headphones|Accessories| 125.50| 4| North|
+---------------+----------+----------+------------+-------+--------+------+
=== Silver Layer Sample ===
+---------------+----------+----------+------------+-------+--------+------+----+-----+
|transaction_id| date| product| category| amount|quantity|region|year|month|
+---------------+----------+----------+------------+-------+--------+------+----+-----+
| TXN000000|2024-03-15| Laptop|Electronics|1245.67| 3| North|2024| 3|
| TXN000001|2024-07-22| Mouse|Accessories| 23.45| 5| East|2024| 7|
| TXN000002|2024-01-08| Keyboard|Accessories| 67.89| 2| South|2024| 1|
| TXN000003|2024-11-30| Monitor|Electronics| 345.00| 1| West|2024| 11|
| TXN000004|2024-05-12|Headphones|Accessories| 125.50| 4| North|2024| 5|
+---------------+----------+----------+------------+-------+--------+------+----+-----+
=== Gold Layer Sample ===
+----+-----+------------+------+-------------+--------------+-----------------------+-----------------+
|year|month| category|region|total_revenue|total_quantity|avg_transaction_value|transaction_count|
+----+-----+------------+------+-------------+--------------+-----------------------+-----------------+
|2024| 7|Electronics| North| 987654.32| 4523| 218.45 | 4521|
|2024| 3|Electronics| East| 876543.21| 3892| 225.23 | 3891|
|2024| 11|Accessories| South| 765432.10| 5234| 146.32 | 5231|
|2024| 5|Electronics| West| 654321.09| 2987| 219.05 | 2988|
|2024| 1|Accessories| North| 543210.98| 4123| 131.78 | 4124|
|2024| 8|Electronics| South| 432109.87| 2156| 200.42 | 2157|
|2024| 6|Accessories| East| 321098.76| 3567| 90.01 | 3568|
|2024| 9|Electronics| North| 210987.65| 1876| 112.45 | 1877|
|2024| 2|Accessories| West| 109876.54| 2345| 46.84 | 2346|
|2024| 10|Electronics| East| 98765.43| 1234| 80.02 | 1235|
+----+-----+------------+------+-------------+--------------+-----------------------+-----------------+
=== Tables in Glue Catalog ===
+---------+----------+-----------+
|namespace| tableName|isTemporary|
+---------+----------+-----------+
| bronze| sales| false|
+---------+----------+-----------+
+---------+-----------+-----------+
|namespace| tableName|isTemporary|
+---------+-----------+-----------+
| silver|sales_clean| false|
+---------+-----------+-----------+
+---------+-------------+-----------+
|namespace| tableName|isTemporary|
+---------+-------------+-----------+
| gold|sales_summary| false|
+---------+-------------+-----------+
26/01/13 16:30:15 INFO SparkContext: Successfully stopped SparkContext
26/01/13 16:30:16 INFO ShutdownHookManager: Shutdown hook called
$ kubectl get pods -n spark-jobs -l spark-role=executor
[2026-01-13 11:31:00] NAME READY STATUS RESTARTS AGE
process-sales-data-1705165470-exec-1 1/1 Running 0 5m
process-sales-data-1705165470-exec-2 1/1 Running 0 5m
process-sales-data-1705165470-exec-3 1/1 Running 0 5m
Test 2: Verify Glue Catalog Tables
# List databases
aws glue get-databases --region $AWS_REGION
# List tables in bronze database
aws glue get-tables --database-name bronze --region $AWS_REGION
# Get table details
aws glue get-table --database-name silver --name sales_clean --region $AWS_REGION
# Check table location and format
aws glue get-table --database-name silver --name sales_clean --region $AWS_REGION \
--query 'Table.StorageDescriptor.Location'
Example Output:
$ aws glue get-tables --database-name bronze --region us-east-1
[2026-01-13 11:35:00] {
"TableList": [
{
"Name": "sales",
"DatabaseName": "bronze",
"CreateTime": "2026-01-13T16:27:15.123000-05:00",
"UpdateTime": "2026-01-13T16:27:15.123000-05:00",
"Retention": 0,
"StorageDescriptor": {
"Columns": [
{
"Name": "transaction_id",
"Type": "string"
},
{
"Name": "date",
"Type": "string"
},
{
"Name": "product",
"Type": "string"
},
{
"Name": "category",
"Type": "string"
},
{
"Name": "amount",
"Type": "double"
},
{
"Name": "quantity",
"Type": "bigint"
},
{
"Name": "region",
"Type": "string"
}
],
"Location": "s3://lakehouse-data-123456789012/warehouse/bronze.db/sales",
"InputFormat": "org.apache.iceberg.mr.hive.HiveIcebergInputFormat",
"OutputFormat": "org.apache.iceberg.mr.hive.HiveIcebergOutputFormat",
"SerdeInfo": {
"SerializationLibrary": "org.apache.iceberg.mr.hive.HiveIcebergSerDe"
}
},
"Parameters": {
"table_type": "ICEBERG",
"metadata_location": "s3://lakehouse-data-123456789012/warehouse/bronze.db/sales/metadata/00001-a1b2c3d4-e5f6-7890-abcd-ef1234567890.metadata.json"
},
"CatalogId": "123456789012"
}
]
}
$ aws glue get-table --database-name silver --name sales_clean --region us-east-1 --query 'Table.StorageDescriptor.Location'
[2026-01-13 11:35:30] "s3://lakehouse-data-123456789012/warehouse/silver.db/sales_clean"
Test 3: Verify Data in S3
# List warehouse contents
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/ --recursive --human-readable
# Check Iceberg metadata
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/silver.db/sales_clean/metadata/
# List data files
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/silver.db/sales_clean/data/
Example Output:
$ aws s3 ls s3://lakehouse-data-123456789012/warehouse/ --recursive --human-readable
[2026-01-13 11:40:00] 2026-01-13 11:27:30 45.2 MiB warehouse/bronze.db/sales/data/00000-0-a1b2c3d4-e5f6-7890-abcd-ef1234567890-00001.parquet
2026-01-13 11:27:31 3.2 KiB warehouse/bronze.db/sales/metadata/00000-12345678-90ab-cdef-1234-567890abcdef.metadata.json
2026-01-13 11:27:31 5.1 KiB warehouse/bronze.db/sales/metadata/00001-a1b2c3d4-e5f6-7890-abcd-ef1234567890.metadata.json
2026-01-13 11:27:31 2.8 KiB warehouse/bronze.db/sales/metadata/snap-1234567890123456789-1-a1b2c3d4.avro
2026-01-13 11:28:45 42.1 MiB warehouse/silver.db/sales_clean/data/year=2024/month=1/00000-0-b2c3d4e5-f6g7-8901-bcde-f12345678901-00001.parquet
2026-01-13 11:28:46 38.7 MiB warehouse/silver.db/sales_clean/data/year=2024/month=2/00001-0-c3d4e5f6-g7h8-9012-cdef-123456789012-00001.parquet
2026-01-13 11:28:47 41.3 MiB warehouse/silver.db/sales_clean/data/year=2024/month=3/00002-0-d4e5f6g7-h8i9-0123-defg-234567890123-00001.parquet
2026-01-13 11:28:47 3.5 KiB warehouse/silver.db/sales_clean/metadata/00000-23456789-01bc-defg-2345-678901bcdefg.metadata.json
2026-01-13 11:28:47 6.2 KiB warehouse/silver.db/sales_clean/metadata/00001-b2c3d4e5-f6g7-8901-bcde-f12345678901.metadata.json
2026-01-13 11:29:50 512.3 KiB warehouse/gold.db/sales_summary/data/00000-0-e5f6g7h8-i9j0-1234-efgh-345678901234-00001.parquet
2026-01-13 11:29:50 3.1 KiB warehouse/gold.db/sales_summary/metadata/00000-34567890-12cd-efgh-3456-789012cdefgh.metadata.json
2026-01-13 11:29:50 4.8 KiB warehouse/gold.db/sales_summary/metadata/00001-c3d4e5f6-g7h8-9012-cdef-123456789012.metadata.json
$ aws s3 ls s3://lakehouse-data-123456789012/warehouse/silver.db/sales_clean/metadata/
[2026-01-13 11:40:15] 2026-01-13 11:28:47 3542 00000-23456789-01bc-defg-2345-678901bcdefg.metadata.json
2026-01-13 11:28:47 6234 00001-b2c3d4e5-f6g7-8901-bcde-f12345678901.metadata.json
2026-01-13 11:28:47 2876 snap-2345678901234567890-1-b2c3d4e5.avro
2026-01-13 11:28:47 4123 v1.metadata.json
2026-01-13 11:28:47 42 version-hint.text
$ aws s3 ls s3://lakehouse-data-123456789012/warehouse/silver.db/sales_clean/data/
[2026-01-13 11:40:30]
PRE year=2024/
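The metadata/*.metadata.json files listed above are what make the tables self-describing. A heavily simplified sketch of reading the current snapshot from such a file; the JSON below is a trimmed stand-in, not the full Iceberg metadata schema (real files also carry schemas, partition specs, and snapshot logs):

```python
import json

# Trimmed stand-in for an Iceberg table metadata file (illustrative only)
metadata_json = """
{
  "format-version": 2,
  "current-snapshot-id": 2345678901234567890,
  "snapshots": [
    {"snapshot-id": 1234567890123456789, "timestamp-ms": 1705165651000},
    {"snapshot-id": 2345678901234567890, "timestamp-ms": 1705165727000}
  ]
}
"""

def current_snapshot(metadata):
    """Return the snapshot entry matching current-snapshot-id."""
    current_id = metadata["current-snapshot-id"]
    return next(s for s in metadata["snapshots"] if s["snapshot-id"] == current_id)

meta = json.loads(metadata_json)
print(current_snapshot(meta)["snapshot-id"])
```

Because each commit writes a new metadata file and atomically swaps the pointer in the Glue Catalog, readers always see a consistent table state; this is the mechanism behind the ACID guarantees claimed in the Overview.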
Test 4: Query Data with Athena
# Create Athena workgroup (optional)
aws athena create-work-group \
--name lakehouse-queries \
--configuration "ResultConfiguration={OutputLocation=s3://$LAKEHOUSE_BUCKET/athena-results/}" \
--region $AWS_REGION
# Query silver table using Athena
aws athena start-query-execution \
--query-string "SELECT * FROM silver.sales_clean LIMIT 10" \
--result-configuration "OutputLocation=s3://$LAKEHOUSE_BUCKET/athena-results/" \
--region $AWS_REGION
# Query gold aggregations
aws athena start-query-execution \
--query-string "SELECT category, region, SUM(total_revenue) as revenue FROM gold.sales_summary GROUP BY category, region ORDER BY revenue DESC" \
--result-configuration "OutputLocation=s3://$LAKEHOUSE_BUCKET/athena-results/" \
--region $AWS_REGION
Example Output:
$ aws athena create-work-group --name lakehouse-queries --configuration "ResultConfiguration={OutputLocation=s3://lakehouse-data-123456789012/athena-results/}" --region us-east-1
[2026-01-13 11:45:00] (No output indicates success)
$ aws athena start-query-execution --query-string "SELECT * FROM silver.sales_clean LIMIT 10" --result-configuration "OutputLocation=s3://lakehouse-data-123456789012/athena-results/" --region us-east-1
[2026-01-13 11:45:15] {
"QueryExecutionId": "a1b2c3d4-e5f6-7890-abcd-ef1234567890"
}
$ aws athena start-query-execution --query-string "SELECT category, region, SUM(total_revenue) as revenue FROM gold.sales_summary GROUP BY category, region ORDER BY revenue DESC" --result-configuration "OutputLocation=s3://lakehouse-data-123456789012/athena-results/" --region us-east-1
[2026-01-13 11:45:30] {
"QueryExecutionId": "b2c3d4e5-f6g7-8901-bcde-f12345678901"
}
Test 5: Stateless Compute Validation
# Step 1: Note current table state
echo "=== Before Cluster Deletion ==="
aws glue get-tables --database-name silver --region $AWS_REGION --query 'TableList[*].Name'
# Step 2: Delete ROSA cluster
echo "Deleting ROSA cluster..."
rosa delete cluster --cluster=$CLUSTER_NAME --yes
# Wait for deletion (or do this async)
# rosa logs uninstall --cluster=$CLUSTER_NAME --watch
# Step 3: Verify data persists in S3
echo "=== Data Still Exists in S3 ==="
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/ --recursive | wc -l
# Step 4: Verify metadata persists in Glue
echo "=== Metadata Still Exists in Glue ==="
aws glue get-tables --database-name silver --region $AWS_REGION --query 'TableList[*].Name'
# Step 5: Recreate cluster and verify access
# (Follow Phase 1 steps to recreate cluster)
# Then resubmit Spark job to prove data is accessible
echo "=== Stateless Compute Validated ==="
echo "All data and metadata persisted despite cluster deletion!"
Example Output:
$ echo "=== Before Cluster Deletion ==="
[2026-01-13 12:00:00] === Before Cluster Deletion ===
$ aws glue get-tables --database-name silver --region us-east-1 --query 'TableList[*].Name'
[2026-01-13 12:00:05] [
"sales_clean"
]
$ echo "Deleting ROSA cluster..."
[2026-01-13 12:00:10] Deleting ROSA cluster...
$ rosa delete cluster --cluster=data-lakehouse --yes
[2026-01-13 12:00:15] I: Cluster 'data-lakehouse' will start uninstalling now
I: To watch the cluster uninstallation logs, run 'rosa logs uninstall -c data-lakehouse --watch'
$ echo "=== Data Still Exists in S3 ==="
[2026-01-13 12:35:00] === Data Still Exists in S3 ===
$ aws s3 ls s3://lakehouse-data-123456789012/warehouse/ --recursive | wc -l
[2026-01-13 12:35:15] 42
$ echo "=== Metadata Still Exists in Glue ==="
[2026-01-13 12:35:20] === Metadata Still Exists in Glue ===
$ aws glue get-tables --database-name silver --region us-east-1 --query 'TableList[*].Name'
[2026-01-13 12:35:25] [
"sales_clean"
]
$ echo "=== Stateless Compute Validated ==="
[2026-01-13 12:35:30] === Stateless Compute Validated ===
$ echo "All data and metadata persisted despite cluster deletion!"
[2026-01-13 12:35:35] All data and metadata persisted despite cluster deletion!
Resource Cleanup
To avoid ongoing AWS charges, follow these steps to clean up all resources.
Step 1: Delete Spark Applications
# Delete all Spark applications
kubectl delete sparkapplication --all -n spark-jobs
# Wait for pods to terminate
kubectl get pods -n spark-jobs
Example Output:
$ kubectl delete sparkapplication --all -n spark-jobs
[2026-01-13 13:00:00] sparkapplication.sparkoperator.k8s.io "process-sales-data" deleted
$ kubectl get pods -n spark-jobs
[2026-01-13 13:00:15] No resources found in spark-jobs namespace.
Step 2: Delete Spark Operator
# Uninstall Spark Operator
helm uninstall spark-operator -n spark-operator
# Delete namespace
kubectl delete namespace spark-operator
kubectl delete namespace spark-jobs
Example Output:
$ helm uninstall spark-operator -n spark-operator
[2026-01-13 13:02:00] release "spark-operator" uninstalled
$ kubectl delete namespace spark-operator
[2026-01-13 13:02:15] namespace "spark-operator" deleted
$ kubectl delete namespace spark-jobs
[2026-01-13 13:02:30] namespace "spark-jobs" deleted
Step 3: Delete ROSA Cluster
# Delete ROSA cluster
rosa delete cluster --cluster=$CLUSTER_NAME --yes
# Wait for deletion
rosa logs uninstall --cluster=$CLUSTER_NAME --watch
# Verify deletion
rosa list clusters
Example Output:
$ rosa delete cluster --cluster=data-lakehouse --yes
[2026-01-13 13:05:00] I: Cluster 'data-lakehouse' will start uninstalling now
I: To watch the cluster uninstallation logs, run 'rosa logs uninstall -c data-lakehouse --watch'
$ rosa logs uninstall --cluster=data-lakehouse --watch
[2026-01-13 13:05:15] time="2026-01-13T13:05:15Z" level=info msg="Destroying cluster resources"
time="2026-01-13T13:06:30Z" level=info msg="Deleting worker nodes"
time="2026-01-13T13:10:45Z" level=info msg="Deleting control plane"
time="2026-01-13T13:25:20Z" level=info msg="Removing load balancers"
time="2026-01-13T13:30:00Z" level=info msg="Deleting VPC and subnets"
time="2026-01-13T13:35:45Z" level=info msg="Cluster uninstallation complete"
$ rosa list clusters
[2026-01-13 13:36:00] ID NAME STATE TOPOLOGY
(No clusters found)
Step 4: Delete Glue Catalog Resources
# Delete tables from all databases
for db in bronze silver gold lakehouse; do
echo "Deleting tables from database: $db"
# Get table names
TABLES=$(aws glue get-tables --database-name $db --region $AWS_REGION --query 'TableList[*].Name' --output text)
# Delete each table
for table in $TABLES; do
echo " Deleting table: $table"
aws glue delete-table --database-name $db --name $table --region $AWS_REGION
done
# Delete database
echo "Deleting database: $db"
aws glue delete-database --name $db --region $AWS_REGION
done
echo "Glue Catalog resources deleted"
Example Output:
$ for db in bronze silver gold lakehouse; do
[output for each database]
done
[2026-01-13 13:40:00] Deleting tables from database: bronze
Deleting table: sales
Deleting database: bronze
Deleting tables from database: silver
Deleting table: sales_clean
Deleting database: silver
Deleting tables from database: gold
Deleting table: sales_summary
Deleting database: gold
Deleting tables from database: lakehouse
Deleting database: lakehouse
$ echo "Glue Catalog resources deleted"
[2026-01-13 13:41:00] Glue Catalog resources deleted
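The loop above assumes every database still exists; if the cleanup is re-run after a partial pass, `get-tables` and `delete-database` fail with `EntityNotFoundException`. A guard makes the script idempotent. A hedged sketch, assuming the same `$AWS_REGION` variable; the `delete_db_if_exists` helper name is illustrative:

```shell
# Skip databases that are already gone so the cleanup can be re-run safely.
delete_db_if_exists() {
  local db=$1
  if aws glue get-database --name "$db" --region "$AWS_REGION" >/dev/null 2>&1; then
    aws glue delete-database --name "$db" --region "$AWS_REGION"
    echo "Deleted database: $db"
  else
    echo "Database not found, skipping: $db"
  fi
}

# Usage: for db in bronze silver gold lakehouse; do delete_db_if_exists "$db"; done
```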
Step 5: Delete S3 Bucket
# Delete all objects in bucket
aws s3 rm s3://$LAKEHOUSE_BUCKET --recursive --region $AWS_REGION
# Delete bucket
aws s3 rb s3://$LAKEHOUSE_BUCKET --region $AWS_REGION
echo "S3 bucket deleted"
Example Output:
$ aws s3 rm s3://lakehouse-data-123456789012 --recursive --region us-east-1
[2026-01-13 13:45:00] delete: s3://lakehouse-data-123456789012/bronze/
delete: s3://lakehouse-data-123456789012/bronze/sales/sales_data.csv
delete: s3://lakehouse-data-123456789012/gold/
delete: s3://lakehouse-data-123456789012/scripts/incremental_pipeline.py
delete: s3://lakehouse-data-123456789012/scripts/process_sales.py
delete: s3://lakehouse-data-123456789012/scripts/time_travel.py
delete: s3://lakehouse-data-123456789012/silver/
delete: s3://lakehouse-data-123456789012/warehouse/
[... 42 more deletions ...]
$ aws s3 rb s3://lakehouse-data-123456789012 --region us-east-1
[2026-01-13 13:46:00] remove_bucket: lakehouse-data-123456789012
$ echo "S3 bucket deleted"
[2026-01-13 13:46:05] S3 bucket deleted
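If versioning was ever enabled on the bucket, `aws s3 rm --recursive` deletes only the current object versions, and `aws s3 rb` fails because noncurrent versions and delete markers remain. A hedged sketch for that case; the `purge_versioned_bucket` helper is illustrative, and each `delete-objects` call accepts at most 1,000 keys, so very large buckets need a loop:

```shell
# Purge all object versions and delete markers, then remove the bucket.
purge_versioned_bucket() {
  local bucket=$1 region=$2
  # Delete current and noncurrent object versions
  aws s3api delete-objects --bucket "$bucket" --region "$region" \
    --delete "$(aws s3api list-object-versions --bucket "$bucket" --region "$region" \
        --query '{Objects: Versions[].{Key: Key, VersionId: VersionId}}' --output json)"
  # Delete the delete markers left behind by earlier 'aws s3 rm' runs
  aws s3api delete-objects --bucket "$bucket" --region "$region" \
    --delete "$(aws s3api list-object-versions --bucket "$bucket" --region "$region" \
        --query '{Objects: DeleteMarkers[].{Key: Key, VersionId: VersionId}}' --output json)"
  # The bucket is now empty and can be removed
  aws s3 rb "s3://$bucket" --region "$region"
}

# Usage: purge_versioned_bucket "$LAKEHOUSE_BUCKET" "$AWS_REGION"
```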
Step 6: Delete IAM Resources
# Delete IAM role policy
aws iam delete-role-policy \
--role-name SparkGlueCatalogRole \
--policy-name GlueS3Access
# Delete IAM role
aws iam delete-role --role-name SparkGlueCatalogRole
echo "IAM resources deleted"
Example Output:
$ aws iam delete-role-policy --role-name SparkGlueCatalogRole --policy-name GlueS3Access
[2026-01-13 13:48:00] (No output indicates success)
$ aws iam delete-role --role-name SparkGlueCatalogRole
[2026-01-13 13:48:15] (No output indicates success)
$ echo "IAM resources deleted"
[2026-01-13 13:48:20] IAM resources deleted
Step 7: Clean Up Local Files
# Remove temporary files
rm -f spark-glue-trust-policy.json
rm -f spark-glue-policy.json
rm -f lakehouse-bucket-policy.json
rm -rf sample-data/
rm -rf spark-jobs/
rm -rf spark-iceberg/
echo "Local files cleaned up"
Example Output:
$ rm -f spark-glue-trust-policy.json spark-glue-policy.json lakehouse-bucket-policy.json
[2026-01-13 13:50:00] (No output indicates success)
$ rm -rf sample-data/ spark-jobs/ spark-iceberg/
[2026-01-13 13:50:05] (No output indicates success)
$ echo "Local files cleaned up"
[2026-01-13 13:50:10] Local files cleaned up
Verification
# Verify ROSA cluster is deleted
rosa list clusters
# Verify S3 bucket is deleted
aws s3 ls | grep lakehouse
# Verify Glue databases are deleted
aws glue get-databases --region $AWS_REGION | grep -E "bronze|silver|gold|lakehouse"
# Verify IAM role is deleted
aws iam get-role --role-name SparkGlueCatalogRole 2>&1 | grep NoSuchEntity
echo "Cleanup verification complete"
Example Output:
$ rosa list clusters
[2026-01-13 13:52:00] ID NAME STATE TOPOLOGY
(No clusters found)
$ aws s3 ls | grep lakehouse
[2026-01-13 13:52:15] (No output - bucket deleted)
$ aws glue get-databases --region us-east-1 | grep -E "bronze|silver|gold|lakehouse"
[2026-01-13 13:52:30] (No output - databases deleted)
$ aws iam get-role --role-name SparkGlueCatalogRole 2>&1 | grep NoSuchEntity
[2026-01-13 13:52:45] An error occurred (NoSuchEntity) when calling the GetRole operation: The role with name SparkGlueCatalogRole cannot be found.
$ echo "Cleanup verification complete"
[2026-01-13 13:53:00] Cleanup verification complete
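The four checks above can be rolled into one script that returns non-zero while anything is left behind, which is convenient for CI or a teardown pipeline. A minimal sketch assuming the same `$CLUSTER_NAME`, `$LAKEHOUSE_BUCKET`, and `$AWS_REGION` variables; the `check_cleanup` name is illustrative:

```shell
# Return 0 only when every lakehouse resource is gone.
check_cleanup() {
  local leftover=0
  rosa list clusters 2>/dev/null | grep -q "$CLUSTER_NAME" \
    && { echo "ROSA cluster still exists"; leftover=1; }
  aws s3 ls 2>/dev/null | grep -q "$LAKEHOUSE_BUCKET" \
    && { echo "S3 bucket still exists"; leftover=1; }
  aws glue get-databases --region "$AWS_REGION" 2>/dev/null \
    | grep -Eq '"(bronze|silver|gold|lakehouse)"' \
    && { echo "Glue databases still exist"; leftover=1; }
  aws iam get-role --role-name SparkGlueCatalogRole >/dev/null 2>&1 \
    && { echo "IAM role still exists"; leftover=1; }
  return $leftover
}

# Usage: check_cleanup && echo "Cleanup verification complete"
```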
Troubleshooting
Issue: Spark Cannot Connect to Glue Catalog
Symptoms: Spark jobs fail with Glue Catalog connection errors
Solutions:
- Verify IAM role has Glue permissions
- Check service account annotation
- Verify AWS region configuration
- Check Glue Catalog connectivity
# Verify service account has IAM role
kubectl get sa spark-sa -n spark-jobs -o yaml | grep eks.amazonaws.com
# Test Glue access from a pod running under the spark-sa service account
# (kubectl 1.24+ removed --serviceaccount; set it via --overrides instead)
kubectl run aws-test --rm -it --image=amazon/aws-cli -n spark-jobs \
  --overrides='{"spec": {"serviceAccountName": "spark-sa"}}' -- \
  glue get-databases --region $AWS_REGION
# Check Spark configuration
kubectl get configmap spark-config -n spark-jobs -o yaml
Issue: S3 Access Denied Errors
Symptoms: Spark jobs fail with S3 403 Forbidden errors
Solutions:
- Verify IAM role has S3 permissions
- Check bucket policy
- Verify IRSA configuration
- Check S3 endpoint configuration
# Test S3 access from a pod running under the spark-sa service account
# (kubectl 1.24+ removed --serviceaccount; set it via --overrides instead)
kubectl run aws-test --rm -it --image=amazon/aws-cli -n spark-jobs \
  --overrides='{"spec": {"serviceAccountName": "spark-sa"}}' -- \
  s3 ls s3://$LAKEHOUSE_BUCKET/
# Check IAM role permissions
aws iam get-role-policy --role-name SparkGlueCatalogRole --policy-name GlueS3Access
# Verify bucket policy
aws s3api get-bucket-policy --bucket $LAKEHOUSE_BUCKET
Issue: Iceberg Table Not Found
Symptoms: Queries fail with "Table not found" errors
Solutions:
- Verify table exists in Glue Catalog
- Check Spark Catalog configuration
- Verify warehouse location
- Check table format
# List tables in Glue
aws glue get-tables --database-name silver --region $AWS_REGION
# Check if table is Iceberg format
aws glue get-table --database-name silver --name sales_clean --region $AWS_REGION \
--query 'Table.Parameters."table_type"'
# Verify warehouse location
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/silver.db/
Issue: Spark Executors Not Starting
Symptoms: Driver pod runs but executors don't start
Solutions:
- Check resource availability
- Verify RBAC permissions
- Check image pull policy
- Review executor logs
# Check node resources
kubectl top nodes
# Check pending pods
kubectl get pods -n spark-jobs
# Describe pending executor pod
kubectl describe pod <executor-pod-name> -n spark-jobs
# Check events
kubectl get events -n spark-jobs --sort-by='.lastTimestamp'
Issue: Performance Issues
Symptoms: Jobs complete but run slower than expected, with long shuffle stages or many small output files
Solutions:
- Increase executor resources
- Adjust partition count
- Enable adaptive query execution
- Optimize Iceberg table layout
# Update SparkApplication with more resources
kubectl edit sparkapplication process-sales-data -n spark-jobs
# Check execution plan
# Add to Spark configuration:
# spark.sql.adaptive.enabled=true
# spark.sql.adaptive.coalescePartitions.enabled=true
# Compact Iceberg table
# Run in Spark:
# spark.sql("CALL glue_catalog.system.rewrite_data_files('silver.sales_clean')")
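The AQE settings above belong in the SparkApplication's `sparkConf` map rather than editing code. A hedged fragment, assuming the Spark Operator CRD layout used elsewhere in this guide:

```yaml
# Fragment of a SparkApplication spec: enable adaptive query execution
spec:
  sparkConf:
    "spark.sql.adaptive.enabled": "true"
    "spark.sql.adaptive.coalescePartitions.enabled": "true"
```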
Debug Commands
# View all Spark applications
kubectl get sparkapplication -n spark-jobs
# Get application status
kubectl get sparkapplication process-sales-data -n spark-jobs -o yaml
# View driver logs
kubectl logs -n spark-jobs -l spark-role=driver
# View executor logs
kubectl logs -n spark-jobs -l spark-role=executor --tail=100
# Check Spark Operator logs
kubectl logs -n spark-operator deployment/spark-operator
# List all pods
kubectl get pods -n spark-jobs -o wide
# Check configmaps
kubectl get configmap -n spark-jobs
# View events
kubectl get events -n spark-jobs --sort-by='.lastTimestamp' | tail -20