Table of Contents
- Overview
- Architecture
- Prerequisites
- Phase 1: ROSA Cluster Setup
- Phase 2: AWS Glue Data Catalog Configuration
- Phase 3: S3 Data Lake Setup
- Phase 4: Apache Spark on OpenShift
- Phase 5: Apache Iceberg Integration
- Phase 6: Spark-Glue Catalog Integration
- Phase 7: Sample Data Pipelines
- Testing and Validation
- Resource Cleanup
- Troubleshooting
Overview
Project Purpose
This platform implements a modern data lakehouse architecture that achieves true separation of compute and storage. By running Apache Spark on OpenShift while leveraging AWS Glue Data Catalog for metadata management and S3 for storage (in Apache Iceberg format), organizations can scale compute independently, shut down clusters without data loss, and achieve significant cost optimization.
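As a quick preview of what this looks like in code, the sketch below builds a SparkSession whose only durable state is the Glue Catalog and an S3 warehouse path. It is illustrative only: the bucket name is a placeholder, and the required Iceberg/AWS jars and full configuration are set up in Phases 4-6.

from pyspark.sql import SparkSession

# Minimal sketch: all durable state lives in AWS Glue (metadata) and S3 (data),
# so this session can run on any freshly created cluster with the right jars.
spark = (
    SparkSession.builder
    .appName("lakehouse-preview")
    # Register an Iceberg catalog backed by the AWS Glue Data Catalog
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-lakehouse-bucket/warehouse")  # placeholder
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .getOrCreate()
)

# Any cluster configured this way sees the same tables and data.
spark.sql("SHOW TABLES IN glue_catalog.silver").show()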
Key Value Propositions
- Stateless Compute: Completely decouple compute from storage and metadata
- Cloud-Native Flexibility: Destroy and recreate compute clusters without losing data
- Cost Optimization: Pay for compute only when running jobs
- Unified Metadata: AWS Glue Catalog provides central metadata repository
- ACID Transactions: Apache Iceberg enables reliable data lake operations
- Performance at Scale: Run high-performance Spark jobs on Kubernetes
Solution Components
| Component | Purpose | Layer |
|---|---|---|
| ROSA | Managed OpenShift cluster for Spark compute | Compute |
| Apache Spark | Distributed data processing engine | Processing |
| Spark Operator | Kubernetes-native Spark job management | Orchestration |
| AWS Glue Data Catalog | Centralized metadata repository | Metadata |
| Amazon S3 | Object storage for data lake | Storage |
| Apache Iceberg | Table format with ACID guarantees | Data Format |
| AWS IAM | Authentication and authorization | Security |
Architecture
High-Level Architecture Diagram
┌──────────────────────────────────────────────────────────────────────┐
│ AWS Cloud │
│ ┌────────────────────────────────────────────────────────────────┐ │
│ │ ROSA Cluster (VPC) │ │
│ │ ┌──────────────────────────────────────────────────────────┐ │ │
│ │ │ Spark Operator │ │ │
│ │ │ ┌────────────────┐ ┌──────────────────────────┐ │ │ │
│ │ │ │ SparkApplication│ │ Driver Pod │ │ │ │
│ │ │ │ CRD Controller │─────►│ - Coordinates job │ │ │ │
│ │ │ └────────────────┘ │ - Connects to Glue │ │ │ │
│ │ │ └──────────┬───────────────┘ │ │ │
│ │ │ │ │ │ │
│ │ │ ┌──────────────────────────────────▼─────────────────┐ │ │ │
│ │ │ │ Executor Pods (Ephemeral) │ │ │ │
│ │ │ │ ┌──────────┐ ┌──────────┐ ┌──────────┐ │ │ │ │
│ │ │ │ │Executor 1│ │Executor 2│ │Executor N│ │ │ │ │
│ │ │ │ │Read/Write│ │Read/Write│ │Read/Write│ │ │ │ │
│ │ │ │ └────┬─────┘ └────┬─────┘ └────┬─────┘ │ │ │ │
│ │ │ └───────┼─────────────┼─────────────┼───────────────┘ │ │ │
│ │ └──────────┼─────────────┼─────────────┼─────────────────┘ │ │
│ └─────────────┼─────────────┼─────────────┼────────────────────┘ │
│ │ │ │ │
│ │ │ │ │
│ ┌─────────────▼─────────────▼─────────────▼────────────────────┐ │
│ │ Amazon S3 (Data Lake) │ │
│ │ ┌──────────────────────────────────────────────────────┐ │ │
│ │ │ Apache Iceberg Tables │ │ │
│ │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │
│ │ │ │ Raw Data │ │ Curated │ │ Analytics │ │ │ │
│ │ │ │ (Bronze) │ │ (Silver) │ │ (Gold) │ │ │ │
│ │ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │ │
│ │ │ │ │ │
│ │ │ Iceberg Metadata Files: │ │ │
│ │ │ - Snapshots │ │ │
│ │ │ - Manifests │ │ │
│ │ │ - Data Files (Parquet) │ │ │
│ │ └──────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ▲ │
│ │ │
│ ┌──────────────────────────────┴───────────────────────────────┐ │
│ │ AWS Glue Data Catalog (Metadata) │ │
│ │ ┌────────────────────────────────────────────────────────┐ │ │
│ │ │ Databases & Tables Metadata │ │ │
│ │ │ - Schema definitions │ │ │
│ │ │ - Partitions info │ │ │
│ │ │ - Table properties (Iceberg config) │ │ │
│ │ │ - Column statistics │ │ │
│ │ └────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ AWS IAM │ │
│ │ - IRSA (IAM Roles for Service Accounts) │ │
│ │ - S3 bucket policies │ │
│ │ - Glue Catalog permissions │ │
│ └───────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────┘
Workflow
1. Data Ingestion: Raw data lands in S3 bronze layer
2. Spark Job Submission: Developer submits SparkApplication CR
3. Job Orchestration: Spark Operator creates driver pod
4. Resource Provisioning: Driver spawns executor pods dynamically
5. Metadata Discovery: Spark connects to Glue Catalog for table metadata
6. Data Processing: Executors read/write Iceberg tables from/to S3
7. Metadata Update: Glue Catalog automatically updated with new partitions/schemas
8. Job Completion: Executor pods terminate, freeing resources
9. Cluster Shutdown: ROSA cluster can be deleted without data loss
10. State Recovery: New cluster can access all data via Glue Catalog
Stateless Compute Demonstration
Traditional Approach:
- Local Hive Metastore tied to cluster
- Cluster deletion = metadata loss
- Requires persistent volumes and backups
Lakehouse Approach:
- Metadata in AWS Glue (managed, durable)
- Data in S3 (infinitely scalable)
- Compute fully ephemeral
- Result: Complete cluster rebuild in roughly 40 minutes (the ROSA provisioning time) with zero data loss
Prerequisites
Required Accounts and Subscriptions
- [ ] AWS Account with administrative access
- [ ] Red Hat Account with OpenShift subscription
- [ ] ROSA Enabled in your AWS account
- [ ] AWS Glue Access in your target region
Required Tools
Install the following CLI tools on your workstation:
# AWS CLI (v2)
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
# ROSA CLI
wget https://mirror.openshift.com/pub/openshift-v4/clients/rosa/latest/rosa-linux.tar.gz
tar -xvf rosa-linux.tar.gz
sudo mv rosa /usr/local/bin/rosa
rosa version
# OpenShift CLI (oc)
wget https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable/openshift-client-linux.tar.gz
tar -xvf openshift-client-linux.tar.gz
sudo mv oc kubectl /usr/local/bin/
oc version
# Helm (v3)
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm version
AWS Prerequisites
Service Quotas
# Check EC2 quotas for ROSA
aws service-quotas get-service-quota \
--service-code ec2 \
--quota-code L-1216C47A \
--region us-east-1
# Check S3 bucket quota
aws service-quotas get-service-quota \
--service-code s3 \
--quota-code L-DC2B2D3D \
--region us-east-1
IAM Permissions
Your AWS IAM user/role needs permissions for:
- EC2 (VPC, subnets, security groups)
- IAM (roles, policies)
- S3 (buckets, objects)
- Glue (databases, tables, catalog)
- CloudWatch (logs, metrics)
Knowledge Prerequisites
You should be familiar with:
- Apache Spark fundamentals (DataFrames, transformations, actions)
- Data engineering concepts (ETL, data lakes, partitioning)
- AWS fundamentals (S3, IAM)
- Kubernetes basics (pods, deployments, services)
- SQL and data modeling
Phase 1: ROSA Cluster Setup
Step 1.1: Configure AWS CLI
# Configure AWS credentials
aws configure
# Verify configuration
aws sts get-caller-identity
Step 1.2: Initialize ROSA
# Log in to Red Hat
rosa login
# Verify ROSA prerequisites
rosa verify quota
rosa verify permissions
# Initialize ROSA in your AWS account
rosa init
Step 1.3: Create ROSA Cluster
Create a ROSA cluster optimized for Spark workloads:
# Set environment variables
export CLUSTER_NAME="data-lakehouse"
export AWS_REGION="us-east-1"
export MACHINE_TYPE="m5.4xlarge"
export COMPUTE_NODES=3
# Create ROSA cluster (takes ~40 minutes)
rosa create cluster \
--cluster-name $CLUSTER_NAME \
--region $AWS_REGION \
--multi-az \
--compute-machine-type $MACHINE_TYPE \
--compute-nodes $COMPUTE_NODES \
--machine-cidr 10.0.0.0/16 \
--service-cidr 172.30.0.0/16 \
--pod-cidr 10.128.0.0/14 \
--host-prefix 23 \
--yes
Configuration Rationale:
- m5.4xlarge: 16 vCPUs, 64 GB RAM - suitable for Spark executors
- 3 nodes: Allows distributed Spark processing
- Multi-AZ: High availability for production workloads
Step 1.4: Monitor Cluster Creation
# Watch cluster installation progress
rosa logs install --cluster=$CLUSTER_NAME --watch
# Check cluster status
rosa describe cluster --cluster=$CLUSTER_NAME
Step 1.5: Create Admin User and Connect
# Create cluster admin user
rosa create admin --cluster=$CLUSTER_NAME
# Use the login command from output
oc login https://api.data-lakehouse.xxxx.p1.openshiftapps.com:6443 \
--username cluster-admin \
--password <your-password>
# Verify cluster access
oc cluster-info
oc get nodes
Step 1.6: Create Project Namespaces
# Create namespace for Spark workloads
oc new-project spark-jobs
# Create namespace for Spark operator
oc new-project spark-operator
Phase 2: AWS Glue Data Catalog Configuration
Step 2.1: Create Glue Database
# Create Glue database for lakehouse
aws glue create-database \
--database-input '{
"Name": "lakehouse",
"Description": "Data lakehouse with Iceberg tables"
}' \
--region $AWS_REGION
# Create additional databases for different layers
aws glue create-database \
--database-input '{
"Name": "bronze",
"Description": "Raw data landing zone"
}' \
--region $AWS_REGION
aws glue create-database \
--database-input '{
"Name": "silver",
"Description": "Curated and cleaned data"
}' \
--region $AWS_REGION
aws glue create-database \
--database-input '{
"Name": "gold",
"Description": "Analytics-ready aggregated data"
}' \
--region $AWS_REGION
# Verify database creation
aws glue get-databases --region $AWS_REGION
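If you prefer to script this step, the same databases can be created idempotently with boto3; a minimal sketch, assuming default AWS credentials and the us-east-1 region used throughout this guide:

import boto3
from botocore.exceptions import ClientError

glue = boto3.client("glue", region_name="us-east-1")  # match $AWS_REGION

databases = {
    "lakehouse": "Data lakehouse with Iceberg tables",
    "bronze": "Raw data landing zone",
    "silver": "Curated and cleaned data",
    "gold": "Analytics-ready aggregated data",
}

for name, description in databases.items():
    try:
        glue.create_database(DatabaseInput={"Name": name, "Description": description})
        print(f"Created database: {name}")
    except ClientError as err:
        # Safe to re-run: skip databases that already exist
        if err.response["Error"]["Code"] == "AlreadyExistsException":
            print(f"Database already exists: {name}")
        else:
            raise

print([db["Name"] for db in glue.get_databases()["DatabaseList"]])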
Step 2.2: Create IAM Role for Glue Catalog Access
# Get ROSA cluster OIDC provider
export OIDC_PROVIDER=$(rosa describe cluster -c $CLUSTER_NAME -o json | jq -r .aws.sts.oidc_endpoint_url | sed 's|https://||')
export ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
# Create trust policy for Spark service account
cat > spark-glue-trust-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Federated": "arn:aws:iam::${ACCOUNT_ID}:oidc-provider/${OIDC_PROVIDER}"
},
"Action": "sts:AssumeRoleWithWebIdentity",
"Condition": {
"StringEquals": {
"${OIDC_PROVIDER}:sub": "system:serviceaccount:spark-jobs:spark-sa"
}
}
}
]
}
EOF
# Create IAM role
export SPARK_ROLE_ARN=$(aws iam create-role \
--role-name SparkGlueCatalogRole \
--assume-role-policy-document file://spark-glue-trust-policy.json \
--query 'Role.Arn' \
--output text)
echo "Spark IAM Role ARN: $SPARK_ROLE_ARN"
Step 2.3: Create IAM Policy for Glue and S3 Access
# Create policy for Glue Catalog access
cat > spark-glue-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"glue:GetDatabase",
"glue:GetDatabases",
"glue:GetTable",
"glue:GetTables",
"glue:GetPartition",
"glue:GetPartitions",
"glue:CreateTable",
"glue:UpdateTable",
"glue:DeleteTable",
"glue:BatchCreatePartition",
"glue:BatchDeletePartition",
"glue:BatchUpdatePartition",
"glue:CreatePartition",
"glue:DeletePartition",
"glue:UpdatePartition"
],
"Resource": [
"arn:aws:glue:${AWS_REGION}:${ACCOUNT_ID}:catalog",
"arn:aws:glue:${AWS_REGION}:${ACCOUNT_ID}:database/*",
"arn:aws:glue:${AWS_REGION}:${ACCOUNT_ID}:table/*/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::lakehouse-*",
"arn:aws:s3:::lakehouse-*/*"
]
},
{
"Effect": "Allow",
"Action": [
"s3:ListAllMyBuckets"
],
"Resource": "*"
}
]
}
EOF
# Create and attach policy
aws iam put-role-policy \
--role-name SparkGlueCatalogRole \
--policy-name GlueS3Access \
--policy-document file://spark-glue-policy.json
echo "IAM policy created and attached"
Phase 3: S3 Data Lake Setup
Step 3.1: Create S3 Buckets
# Create S3 bucket for data lake
export LAKEHOUSE_BUCKET="lakehouse-data-${ACCOUNT_ID}"
aws s3 mb s3://$LAKEHOUSE_BUCKET --region $AWS_REGION
# Enable versioning for data protection
aws s3api put-bucket-versioning \
--bucket $LAKEHOUSE_BUCKET \
--versioning-configuration Status=Enabled \
--region $AWS_REGION
# Create folder structure for medallion architecture
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key bronze/
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key silver/
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key gold/
aws s3api put-object --bucket $LAKEHOUSE_BUCKET --key warehouse/
echo "S3 Data Lake bucket created: s3://$LAKEHOUSE_BUCKET"
Step 3.2: Configure S3 Bucket Policies
# Create bucket policy for secure access
cat > lakehouse-bucket-policy.json <<EOF
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "AllowSparkAccess",
"Effect": "Allow",
"Principal": {
"AWS": "$SPARK_ROLE_ARN"
},
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::${LAKEHOUSE_BUCKET}",
"arn:aws:s3:::${LAKEHOUSE_BUCKET}/*"
]
}
]
}
EOF
# Apply bucket policy
aws s3api put-bucket-policy \
--bucket $LAKEHOUSE_BUCKET \
--policy file://lakehouse-bucket-policy.json
echo "Bucket policy applied"
Step 3.3: Upload Sample Data
# Create sample dataset
mkdir -p sample-data
cd sample-data
# Generate sample sales data
python3 <<PYTHON
import csv
import random
from datetime import datetime, timedelta
# Generate sample sales data
with open('sales_data.csv', 'w', newline='') as f:
writer = csv.writer(f)
writer.writerow(['transaction_id', 'date', 'product', 'category', 'amount', 'quantity', 'region'])
products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones']
categories = ['Electronics', 'Accessories']
regions = ['North', 'South', 'East', 'West']
base_date = datetime(2024, 1, 1)
for i in range(10000):
transaction_date = base_date + timedelta(days=random.randint(0, 365))
product = random.choice(products)
category = 'Electronics' if product in ['Laptop', 'Monitor'] else 'Accessories'
writer.writerow([
f'TXN{i:06d}',
transaction_date.strftime('%Y-%m-%d'),
product,
category,
round(random.uniform(10, 2000), 2),
random.randint(1, 10),
random.choice(regions)
])
print("Sample data generated: sales_data.csv")
PYTHON
# Upload to S3 bronze layer
aws s3 cp sales_data.csv s3://$LAKEHOUSE_BUCKET/bronze/sales/sales_data.csv
cd ..
echo "Sample data uploaded to S3"
Phase 4: Apache Spark on OpenShift
Step 4.1: Install Spark Operator
# Add Spark Operator Helm repository
helm repo add spark-operator https://kubeflow.github.io/spark-operator
helm repo update
# Install Spark Operator
helm install spark-operator spark-operator/spark-operator \
--namespace spark-operator \
--create-namespace \
--set webhook.enable=true \
--set sparkJobNamespace=spark-jobs
# Verify installation
kubectl get pods -n spark-operator
kubectl get crd | grep spark
Step 4.2: Create Service Account for Spark
# Create service account with IAM role annotation
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
name: spark-sa
namespace: spark-jobs
annotations:
eks.amazonaws.com/role-arn: $SPARK_ROLE_ARN
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: spark-role
namespace: spark-jobs
rules:
- apiGroups: [""]
resources: ["pods", "services", "configmaps"]
verbs: ["create", "get", "list", "watch", "delete"]
- apiGroups: [""]
resources: ["pods/log"]
verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: spark-rolebinding
namespace: spark-jobs
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: spark-role
subjects:
- kind: ServiceAccount
name: spark-sa
namespace: spark-jobs
EOF
# Verify service account
oc get sa spark-sa -n spark-jobs -o yaml
Step 4.3: Create ConfigMap for Spark Configuration
# Create Spark configuration
cat <<EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
name: spark-config
namespace: spark-jobs
data:
spark-defaults.conf: |
spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
spark.hadoop.fs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.glue_catalog.warehouse=s3://${LAKEHOUSE_BUCKET}/warehouse
spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.eventLog.enabled=true
spark.eventLog.dir=s3a://${LAKEHOUSE_BUCKET}/spark-events
lakehouse.conf: |
LAKEHOUSE_BUCKET=${LAKEHOUSE_BUCKET}
AWS_REGION=${AWS_REGION}
GLUE_DATABASE=lakehouse
EOF
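To confirm these catalog settings actually reach a running driver (whether from this ConfigMap or from sparkConf on the SparkApplication in Phase 6), here is a small PySpark check you could drop into any job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("config-check").getOrCreate()

# Keys defined above; each should resolve to a non-default value at runtime
keys = [
    "spark.sql.catalog.glue_catalog",
    "spark.sql.catalog.glue_catalog.warehouse",
    "spark.sql.catalog.glue_catalog.catalog-impl",
    "spark.sql.catalog.glue_catalog.io-impl",
    "spark.hadoop.fs.s3a.aws.credentials.provider",
]

for key in keys:
    # Returns "<unset>" instead of raising when a key is missing
    print(key, "=", spark.conf.get(key, "<unset>"))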
Phase 5: Apache Iceberg Integration
Step 5.1: Build Custom Spark Image with Iceberg
# Create directory for custom Spark image
mkdir -p spark-iceberg
cd spark-iceberg
# Create Dockerfile
cat > Dockerfile <<'DOCKERFILE'
# Official Apache Spark base image (provides /opt/entrypoint.sh and the spark user, UID 185)
FROM apache/spark:3.5.0
USER root
# Install AWS dependencies and Iceberg
RUN curl -L https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.4.2/iceberg-spark-runtime-3.5_2.12-1.4.2.jar \
-o /opt/spark/jars/iceberg-spark-runtime-3.5_2.12-1.4.2.jar
RUN curl -L https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar \
-o /opt/spark/jars/hadoop-aws-3.3.4.jar
RUN curl -L https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.262/aws-java-sdk-bundle-1.12.262.jar \
-o /opt/spark/jars/aws-java-sdk-bundle-1.12.262.jar
RUN curl -L https://repo1.maven.org/maven2/software/amazon/awssdk/bundle/2.20.18/bundle-2.20.18.jar \
-o /opt/spark/jars/bundle-2.20.18.jar
RUN curl -L https://repo1.maven.org/maven2/software/amazon/awssdk/url-connection-client/2.20.18/url-connection-client-2.20.18.jar \
-o /opt/spark/jars/url-connection-client-2.20.18.jar
USER 185
ENTRYPOINT ["/opt/entrypoint.sh"]
DOCKERFILE
# Build and push to a container registry
# For this example, we'll use the OpenShift internal registry
oc create imagestream spark-iceberg -n spark-jobs
# Build image using OpenShift build
cat > BuildConfig.yaml <<EOF
apiVersion: build.openshift.io/v1
kind: BuildConfig
metadata:
name: spark-iceberg
namespace: spark-jobs
spec:
output:
to:
kind: ImageStreamTag
name: spark-iceberg:latest
source:
dockerfile: |
$(cat Dockerfile | sed 's/^/ /')
type: Dockerfile
strategy:
dockerStrategy: {}
type: Docker
EOF
oc apply -f BuildConfig.yaml
# Start build
oc start-build spark-iceberg -n spark-jobs --follow
# Get image reference
export SPARK_IMAGE=$(oc get is spark-iceberg -n spark-jobs -o jsonpath='{.status.dockerImageRepository}'):latest
cd ..
echo "Custom Spark image with Iceberg built: $SPARK_IMAGE"
Phase 6: Spark-Glue Catalog Integration
Step 6.1: Create Sample Spark Application
# Create PySpark script for data processing
mkdir -p spark-jobs
cd spark-jobs
cat > process_sales.py <<'PYTHON'
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, sum as _sum, avg, count
import sys
def main():
# Create Spark session with Iceberg and Glue Catalog
spark = SparkSession.builder \
.appName("ProcessSalesData") \
.config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
.config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
.getOrCreate()
spark.sparkContext.setLogLevel("INFO")
    # Get the bucket name from the job arguments
bucket = sys.argv[1] if len(sys.argv) > 1 else "lakehouse-data"
print(f"Reading data from s3a://{bucket}/bronze/sales/")
# Read raw CSV data
df_raw = spark.read.csv(
f"s3a://{bucket}/bronze/sales/sales_data.csv",
header=True,
inferSchema=True
)
print(f"Raw data count: {df_raw.count()}")
df_raw.show(5)
# Create bronze table in Glue Catalog (if not exists)
df_raw.write \
.format("iceberg") \
.mode("overwrite") \
.option("path", f"s3a://{bucket}/warehouse/bronze.db/sales") \
.saveAsTable("glue_catalog.bronze.sales")
print("Bronze table created in Glue Catalog")
# Transform data for silver layer
df_silver = df_raw \
.withColumn("year", year(col("date"))) \
.withColumn("month", month(col("date"))) \
.filter(col("amount") > 0) \
.dropDuplicates(["transaction_id"])
# Write to silver layer
df_silver.write \
.format("iceberg") \
.mode("overwrite") \
.partitionBy("year", "month") \
.option("path", f"s3a://{bucket}/warehouse/silver.db/sales_clean") \
.saveAsTable("glue_catalog.silver.sales_clean")
print("Silver table created with partitioning")
# Create aggregated gold layer
df_gold = df_silver.groupBy("year", "month", "category", "region") \
.agg(
_sum("amount").alias("total_revenue"),
_sum("quantity").alias("total_quantity"),
avg("amount").alias("avg_transaction_value"),
count("transaction_id").alias("transaction_count")
)
# Write to gold layer
df_gold.write \
.format("iceberg") \
.mode("overwrite") \
.option("path", f"s3a://{bucket}/warehouse/gold.db/sales_summary") \
.saveAsTable("glue_catalog.gold.sales_summary")
print("Gold table created with aggregations")
# Show sample results
print("\n=== Bronze Layer Sample ===")
spark.sql("SELECT * FROM glue_catalog.bronze.sales LIMIT 5").show()
print("\n=== Silver Layer Sample ===")
spark.sql("SELECT * FROM glue_catalog.silver.sales_clean LIMIT 5").show()
print("\n=== Gold Layer Sample ===")
spark.sql("SELECT * FROM glue_catalog.gold.sales_summary ORDER BY total_revenue DESC LIMIT 10").show()
# Verify tables in Glue Catalog
print("\n=== Tables in Glue Catalog ===")
spark.sql("SHOW TABLES IN glue_catalog.bronze").show()
spark.sql("SHOW TABLES IN glue_catalog.silver").show()
spark.sql("SHOW TABLES IN glue_catalog.gold").show()
spark.stop()
if __name__ == "__main__":
main()
PYTHON
# Upload script to S3
aws s3 cp process_sales.py s3://$LAKEHOUSE_BUCKET/scripts/
cd ..
Step 6.2: Create SparkApplication Custom Resource
# Create SparkApplication manifest
cat <<EOF | oc apply -f -
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
name: process-sales-data
namespace: spark-jobs
spec:
type: Python
pythonVersion: "3"
mode: cluster
image: $SPARK_IMAGE
imagePullPolicy: Always
mainApplicationFile: s3a://$LAKEHOUSE_BUCKET/scripts/process_sales.py
arguments:
- "$LAKEHOUSE_BUCKET"
sparkVersion: "3.5.0"
restartPolicy:
type: Never
driver:
cores: 1
coreLimit: "1200m"
memory: "2g"
labels:
version: "3.5.0"
serviceAccount: spark-sa
env:
- name: AWS_REGION
value: "$AWS_REGION"
- name: AWS_ROLE_ARN
value: "$SPARK_ROLE_ARN"
- name: AWS_WEB_IDENTITY_TOKEN_FILE
value: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
volumeMounts:
- name: aws-iam-token
mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
readOnly: true
executor:
cores: 2
instances: 3
memory: "4g"
labels:
version: "3.5.0"
env:
- name: AWS_REGION
value: "$AWS_REGION"
- name: AWS_ROLE_ARN
value: "$SPARK_ROLE_ARN"
- name: AWS_WEB_IDENTITY_TOKEN_FILE
value: "/var/run/secrets/eks.amazonaws.com/serviceaccount/token"
volumeMounts:
- name: aws-iam-token
mountPath: /var/run/secrets/eks.amazonaws.com/serviceaccount
readOnly: true
volumes:
- name: aws-iam-token
projected:
sources:
- serviceAccountToken:
audience: sts.amazonaws.com
expirationSeconds: 86400
path: token
sparkConf:
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem"
"spark.hadoop.fs.s3a.aws.credentials.provider": "com.amazonaws.auth.WebIdentityTokenCredentialsProvider"
"spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog"
"spark.sql.catalog.glue_catalog.warehouse": "s3a://$LAKEHOUSE_BUCKET/warehouse"
"spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog"
"spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO"
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
"spark.kubernetes.allocation.batch.size": "3"
EOF
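Beyond kubectl, the application's status can be polled programmatically; a hedged sketch using the Kubernetes Python client (the kubernetes pip package, which is an assumption here), with group/version matching the sparkoperator.k8s.io/v1beta2 CR above:

import time

from kubernetes import client, config

config.load_kube_config()  # uses your current oc/kubectl context
api = client.CustomObjectsApi()

while True:
    app = api.get_namespaced_custom_object(
        group="sparkoperator.k8s.io",
        version="v1beta2",
        namespace="spark-jobs",
        plural="sparkapplications",
        name="process-sales-data",
    )
    # The operator reports progress under status.applicationState.state
    state = app.get("status", {}).get("applicationState", {}).get("state", "UNKNOWN")
    print("Application state:", state)
    if state in ("COMPLETED", "FAILED"):
        break
    time.sleep(15)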
Phase 7: Sample Data Pipelines
Step 7.1: Create Incremental Processing Pipeline
# Create incremental processing script
cat > spark-jobs/incremental_pipeline.py <<'PYTHON'
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_timestamp, lit
from datetime import datetime
import sys
def main():
spark = SparkSession.builder \
.appName("IncrementalPipeline") \
.getOrCreate()
bucket = sys.argv[1]
batch_date = sys.argv[2] if len(sys.argv) > 2 else datetime.now().strftime('%Y-%m-%d')
print(f"Processing incremental data for date: {batch_date}")
    # Read the existing silver table (gives a before/after record count)
    df_existing = spark.read \
        .format("iceberg") \
        .load("glue_catalog.silver.sales_clean")
    print(f"Existing silver record count: {df_existing.count()}")
# Read new data (simulate incremental load)
df_new = spark.read.csv(
f"s3a://{bucket}/bronze/sales/sales_data.csv",
header=True,
inferSchema=True
).filter(col("date") == batch_date) \
.withColumn("processed_timestamp", current_timestamp())
    # Append the new batch to the silver table
df_new.writeTo("glue_catalog.silver.sales_clean") \
.append()
print(f"Appended {df_new.count()} records to silver table")
# Update gold aggregations
df_updated = spark.read \
.format("iceberg") \
.load("glue_catalog.silver.sales_clean") \
.filter(col("date") == batch_date)
# Recalculate aggregations for affected partitions
from pyspark.sql.functions import year, month, sum as _sum, avg, count
df_agg = df_updated \
.withColumn("year", year(col("date"))) \
.withColumn("month", month(col("date"))) \
.groupBy("year", "month", "category", "region") \
.agg(
_sum("amount").alias("total_revenue"),
_sum("quantity").alias("total_quantity"),
avg("amount").alias("avg_transaction_value"),
count("transaction_id").alias("transaction_count")
)
    # Append the recalculated aggregations to the existing gold table.
    # Note: .using()/.tableProperty() only apply when creating a table, so they
    # are omitted here; for a true upsert, use Iceberg's MERGE INTO
    # (see the sketch after this step).
    df_agg.writeTo("glue_catalog.gold.sales_summary") \
        .append()
print("Gold table updated with incremental aggregations")
spark.stop()
if __name__ == "__main__":
main()
PYTHON
# Upload to S3
aws s3 cp spark-jobs/incremental_pipeline.py s3://$LAKEHOUSE_BUCKET/scripts/
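The pipeline above appends recalculated aggregates, which can leave duplicate rows for a reprocessed month. For a true upsert, Iceberg supports SQL MERGE INTO through the session extensions already configured; a hedged sketch that could replace the final append in incremental_pipeline.py (it assumes the df_agg DataFrame and glue_catalog names from that script):

# Sketch: upsert the recalculated aggregates instead of appending them.
df_agg.createOrReplaceTempView("agg_updates")

spark.sql("""
    MERGE INTO glue_catalog.gold.sales_summary AS target
    USING agg_updates AS source
    ON  target.year = source.year
    AND target.month = source.month
    AND target.category = source.category
    AND target.region = source.region
    WHEN MATCHED THEN UPDATE SET
        total_revenue = source.total_revenue,
        total_quantity = source.total_quantity,
        avg_transaction_value = source.avg_transaction_value,
        transaction_count = source.transaction_count
    WHEN NOT MATCHED THEN INSERT *
""")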
Step 7.2: Create Time Travel Query Example
# Create time travel demonstration script
cat > spark-jobs/time_travel.py <<'PYTHON'
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
import sys
def main():
spark = SparkSession.builder \
.appName("IcebergTimeTravel") \
.getOrCreate()
bucket = sys.argv[1]
# Read current version
print("=== Current Version ===")
df_current = spark.read \
.format("iceberg") \
.load("glue_catalog.silver.sales_clean")
print(f"Current record count: {df_current.count()}")
df_current.show(5)
# Show table history
print("\n=== Table History ===")
spark.sql("SELECT * FROM glue_catalog.silver.sales_clean.history").show()
# Show table snapshots
print("\n=== Table Snapshots ===")
spark.sql("SELECT * FROM glue_catalog.silver.sales_clean.snapshots").show()
# Query specific snapshot (if exists)
snapshots = spark.sql("SELECT snapshot_id FROM glue_catalog.silver.sales_clean.snapshots ORDER BY committed_at LIMIT 1").collect()
if snapshots:
snapshot_id = snapshots[0][0]
print(f"\n=== Data at Snapshot {snapshot_id} ===")
df_snapshot = spark.read \
.format("iceberg") \
.option("snapshot-id", snapshot_id) \
.load("glue_catalog.silver.sales_clean")
print(f"Snapshot record count: {df_snapshot.count()}")
df_snapshot.show(5)
# Show table metadata
print("\n=== Table Metadata ===")
spark.sql("DESCRIBE EXTENDED glue_catalog.silver.sales_clean").show(100, False)
spark.stop()
if __name__ == "__main__":
main()
PYTHON
# Upload to S3
aws s3 cp spark-jobs/time_travel.py s3://$LAKEHOUSE_BUCKET/scripts/
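In addition to snapshot IDs, Iceberg supports time travel by timestamp, both in SQL and through read options. A short hedged sketch that could be appended to time_travel.py (use timestamps that fall after your table's first snapshot, otherwise Iceberg raises an error):

# SQL time travel by timestamp (Spark 3.3+ syntax)
spark.sql(
    "SELECT COUNT(*) AS record_count FROM glue_catalog.silver.sales_clean "
    "TIMESTAMP AS OF '2024-06-01 00:00:00'"
).show()

# DataFrame time travel by timestamp (milliseconds since the epoch)
df_asof = (
    spark.read.format("iceberg")
    .option("as-of-timestamp", "1717200000000")  # 2024-06-01T00:00:00Z
    .load("glue_catalog.silver.sales_clean")
)
print("Records as of timestamp:", df_asof.count())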
Testing and Validation
Test 1: Monitor Spark Application
# Check SparkApplication status
kubectl get sparkapplication -n spark-jobs
# Describe application
kubectl describe sparkapplication process-sales-data -n spark-jobs
# Watch driver pod logs
export DRIVER_POD=$(kubectl get pods -n spark-jobs -l spark-role=driver -o jsonpath='{.items[0].metadata.name}')
kubectl logs -f $DRIVER_POD -n spark-jobs
# Check executor pods
kubectl get pods -n spark-jobs -l spark-role=executor
Test 2: Verify Glue Catalog Tables
# List databases
aws glue get-databases --region $AWS_REGION
# List tables in bronze database
aws glue get-tables --database-name bronze --region $AWS_REGION
# Get table details
aws glue get-table --database-name silver --name sales_clean --region $AWS_REGION
# Check table location and format
aws glue get-table --database-name silver --name sales_clean --region $AWS_REGION \
--query 'Table.StorageDescriptor.Location'
Test 3: Verify Data in S3
# List warehouse contents
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/ --recursive --human-readable
# Check Iceberg metadata
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/silver.db/sales_clean/metadata/
# List data files
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/silver.db/sales_clean/data/
Test 4: Query Data with Athena
# Create Athena workgroup (optional)
aws athena create-work-group \
--name lakehouse-queries \
--configuration "ResultConfigurationUpdates={OutputLocation=s3://$LAKEHOUSE_BUCKET/athena-results/}" \
--region $AWS_REGION
# Query silver table using Athena
aws athena start-query-execution \
--query-string "SELECT * FROM silver.sales_clean LIMIT 10" \
--result-configuration "OutputLocation=s3://$LAKEHOUSE_BUCKET/athena-results/" \
--region $AWS_REGION
# Query gold aggregations
aws athena start-query-execution \
--query-string "SELECT category, region, SUM(total_revenue) as revenue FROM gold.sales_summary GROUP BY category, region ORDER BY revenue DESC" \
--result-configuration "OutputLocation=s3://$LAKEHOUSE_BUCKET/athena-results/" \
--region $AWS_REGION
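start-query-execution only returns a query ID, so you still have to poll for completion and fetch the output; a small boto3 sketch (the bucket placeholder should match $LAKEHOUSE_BUCKET):

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")  # match $AWS_REGION
bucket = "lakehouse-data-<ACCOUNT_ID>"  # placeholder: match $LAKEHOUSE_BUCKET

query_id = athena.start_query_execution(
    QueryString="SELECT * FROM silver.sales_clean LIMIT 10",
    ResultConfiguration={"OutputLocation": f"s3://{bucket}/athena-results/"},
)["QueryExecutionId"]

# Poll until the query reaches a terminal state
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

print("Query state:", state)
if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[:5]:  # the first row is the column header
        print([col.get("VarCharValue") for col in row["Data"]])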
Test 5: Stateless Compute Validation
# Step 1: Note current table state
echo "=== Before Cluster Deletion ==="
aws glue get-tables --database-name silver --region $AWS_REGION --query 'TableList[*].Name'
# Step 2: Delete ROSA cluster
echo "Deleting ROSA cluster..."
rosa delete cluster --cluster=$CLUSTER_NAME --yes
# Wait for deletion (or do this async)
# rosa logs uninstall --cluster=$CLUSTER_NAME --watch
# Step 3: Verify data persists in S3
echo "=== Data Still Exists in S3 ==="
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/ --recursive | wc -l
# Step 4: Verify metadata persists in Glue
echo "=== Metadata Still Exists in Glue ==="
aws glue get-tables --database-name silver --region $AWS_REGION --query 'TableList[*].Name'
# Step 5: Recreate cluster and verify access
# (Follow Phase 1 steps to recreate cluster)
# Then resubmit Spark job to prove data is accessible
echo "=== Stateless Compute Validated ==="
echo "All data and metadata persisted despite cluster deletion!"
Resource Cleanup
To avoid ongoing AWS charges, follow these steps to clean up all resources.
Step 1: Delete Spark Applications
# Delete all Spark applications
kubectl delete sparkapplication --all -n spark-jobs
# Wait for pods to terminate
kubectl get pods -n spark-jobs
Step 2: Delete Spark Operator
# Uninstall Spark Operator
helm uninstall spark-operator -n spark-operator
# Delete namespace
kubectl delete namespace spark-operator
kubectl delete namespace spark-jobs
Step 3: Delete ROSA Cluster
# Delete ROSA cluster
rosa delete cluster --cluster=$CLUSTER_NAME --yes
# Wait for deletion
rosa logs uninstall --cluster=$CLUSTER_NAME --watch
# Verify deletion
rosa list clusters
Step 4: Delete Glue Catalog Resources
# Delete tables from all databases
for db in bronze silver gold lakehouse; do
echo "Deleting tables from database: $db"
# Get table names
TABLES=$(aws glue get-tables --database-name $db --region $AWS_REGION --query 'TableList[*].Name' --output text)
# Delete each table
for table in $TABLES; do
echo " Deleting table: $table"
aws glue delete-table --database-name $db --name $table --region $AWS_REGION
done
# Delete database
echo "Deleting database: $db"
aws glue delete-database --name $db --region $AWS_REGION
done
echo "Glue Catalog resources deleted"
Step 5: Delete S3 Bucket
# Delete all objects in bucket
aws s3 rm s3://$LAKEHOUSE_BUCKET --recursive --region $AWS_REGION
# Delete bucket
aws s3 rb s3://$LAKEHOUSE_BUCKET --region $AWS_REGION
echo "S3 bucket deleted"
Step 6: Delete IAM Resources
# Delete IAM role policy
aws iam delete-role-policy \
--role-name SparkGlueCatalogRole \
--policy-name GlueS3Access
# Delete IAM role
aws iam delete-role --role-name SparkGlueCatalogRole
echo "IAM resources deleted"
Step 7: Clean Up Local Files
# Remove temporary files
rm -f spark-glue-trust-policy.json
rm -f spark-glue-policy.json
rm -f lakehouse-bucket-policy.json
rm -rf sample-data/
rm -rf spark-jobs/
rm -rf spark-iceberg/
echo "Local files cleaned up"
Verification
# Verify ROSA cluster is deleted
rosa list clusters
# Verify S3 bucket is deleted
aws s3 ls | grep lakehouse
# Verify Glue databases are deleted
aws glue get-databases --region $AWS_REGION | grep -E "bronze|silver|gold|lakehouse"
# Verify IAM role is deleted
aws iam get-role --role-name SparkGlueCatalogRole 2>&1 | grep NoSuchEntity
echo "Cleanup verification complete"
Troubleshooting
Issue: Spark Cannot Connect to Glue Catalog
Symptoms: Spark jobs fail with Glue Catalog connection errors
Solutions:
- Verify IAM role has Glue permissions
- Check service account annotation
- Verify AWS region configuration
- Check Glue Catalog connectivity
# Verify service account has IAM role
kubectl get sa spark-sa -n spark-jobs -o yaml | grep eks.amazonaws.com
# Test Glue access from pod
kubectl run aws-test --rm -it --image=amazon/aws-cli -n spark-jobs \
  --overrides='{"spec":{"serviceAccountName":"spark-sa"}}' -- \
  glue get-databases --region $AWS_REGION
# Check Spark configuration
kubectl get configmap spark-config -n spark-jobs -o yaml
Issue: S3 Access Denied Errors
Symptoms: Spark jobs fail with S3 403 Forbidden errors
Solutions:
- Verify IAM role has S3 permissions
- Check bucket policy
- Verify IRSA configuration
- Check S3 endpoint configuration
# Test S3 access from pod
kubectl run aws-test --rm -it --image=amazon/aws-cli -n spark-jobs \
  --overrides='{"spec":{"serviceAccountName":"spark-sa"}}' -- \
  s3 ls s3://$LAKEHOUSE_BUCKET/
# Check IAM role permissions
aws iam get-role-policy --role-name SparkGlueCatalogRole --policy-name GlueS3Access
# Verify bucket policy
aws s3api get-bucket-policy --bucket $LAKEHOUSE_BUCKET
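If the role policy and bucket policy both look right but S3 still returns 403, confirm which identity the pod's credentials actually resolve to. A hedged sketch you could run from a Python shell inside the driver pod (it assumes boto3 is available and relies on the projected web-identity token configured in the SparkApplication):

import boto3

# With IRSA working, the ARN below should be an assumed-role ARN for SparkGlueCatalogRole
identity = boto3.client("sts").get_caller_identity()
print("Account:", identity["Account"])
print("Caller ARN:", identity["Arn"])

# Then issue the same kind of call Spark makes (adjust the bucket name to yours)
s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket="lakehouse-data-<ACCOUNT_ID>", Prefix="warehouse/", MaxKeys=5)
for obj in response.get("Contents", []):
    print(obj["Key"])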
Issue: Iceberg Table Not Found
Symptoms: Queries fail with "Table not found" errors
Solutions:
- Verify table exists in Glue Catalog
- Check Spark Catalog configuration
- Verify warehouse location
- Check table format
# List tables in Glue
aws glue get-tables --database-name silver --region $AWS_REGION
# Check if table is Iceberg format
aws glue get-table --database-name silver --name sales_clean --region $AWS_REGION \
--query 'Table.Parameters."table_type"'
# Verify warehouse location
aws s3 ls s3://$LAKEHOUSE_BUCKET/warehouse/silver.db/
Issue: Spark Executors Not Starting
Symptoms: Driver pod runs but executors don't start
Solutions:
- Check resource availability
- Verify RBAC permissions
- Check image pull policy
- Review executor logs
# Check node resources
kubectl top nodes
# Check pending pods
kubectl get pods -n spark-jobs
# Describe pending executor pod
kubectl describe pod <executor-pod-name> -n spark-jobs
# Check events
kubectl get events -n spark-jobs --sort-by='.lastTimestamp'
Issue: Performance Issues
Symptoms: Spark jobs are slow
Solutions:
- Increase executor resources
- Adjust partition count
- Enable adaptive query execution
- Optimize Iceberg table layout
# Update SparkApplication with more resources
kubectl edit sparkapplication process-sales-data -n spark-jobs
# Check execution plan
# Add to Spark configuration:
# spark.sql.adaptive.enabled=true
# spark.sql.adaptive.coalescePartitions.enabled=true
# Compact Iceberg table
# Run in Spark:
# spark.sql("CALL glue_catalog.system.rewrite_data_files('silver.sales_clean')")
Debug Commands
# View all Spark applications
kubectl get sparkapplication -n spark-jobs
# Get application status
kubectl get sparkapplication process-sales-data -n spark-jobs -o yaml
# View driver logs
kubectl logs -n spark-jobs -l spark-role=driver
# View executor logs
kubectl logs -n spark-jobs -l spark-role=executor --tail=100
# Check Spark Operator logs
kubectl logs -n spark-operator deployment/spark-operator
# List all pods
kubectl get pods -n spark-jobs -o wide
# Check configmaps
kubectl get configmap -n spark-jobs
# View events
kubectl get events -n spark-jobs --sort-by='.lastTimestamp' | tail -20